Inexplicable performance on big.LITTLE technology (on Android)

只看该作者 · 2018-9-8 20:30

I apologise for the long question, but I am trying to measure performance of different indexing techniques on various platforms, one of which is Adaptive Radix tree.
I have run tests where the basic steps look like this (c/c++):
Step 1: Generate or load data (few million key-value pairs)
Step 2: Insert into index and measure time taken (insert_time)
Step 3: Retrieve from index and measure time taken (retrieve_time)
I find that always insert_time > retrieve_time on most platforms such as Intel desktops (i386/amd64), iPad (Apple A9), Android (ARMv7) and Raspberry Pi 3 (ARMv8). This is expected, as insert complexity is higher than retrieve complexity.
But when I run the steps on big.LITTLE platforms, specifically Snapdragon 845 (Xiaomi POCO F1) and HiSilicon Kirin 659 (Honor 9 lite), I find insert_time < retrieve_time, except when data size is too low.
To diagnose what could be wrong, I went through the following steps:
Ensure that the thread is running at maximum speed by using following code:
void set_thread_priority() {
nice(-20);\nint policy = 0;\nstruct sched_param param;\npthread_getschedparam(pthread_self(), &policy, &param);\nparam.sched_priority = sched_get_priority_max(policy);\npthread_setschedparam(pthread_self(), policy, &param);}
I could see that the nice value is reflected against the process and the thread runs 100% CPU in most cases (it is basically single thread algorithm).
Set CPU affinity using following code:
void set_affinity() {
cpu_set_t mask;\nCPU_ZERO(&mask);\nCPU_SET(4, &mask);\nCPU_SET(5, &mask);\nCPU_SET(6, &mask);\nCPU_SET(7, &mask);\nsched_setaffinity(0, sizeof(mask), &mask);}
This code also reflects well on big.LITTLE because when I set CPUs as 0, 1, 2, 3, the code runs much slower than when I set CPUs as 4, 5, 6, 7. Even then insert_time < retrieve_time in both cases.
Ensure that sufficient free RAM is available for my dataset
To avoid the possibility that Step 3 might retrieve from virtual memory, I added Step 4, which is just repeating Step 3:
Step 4: Retrieve from index and measure time taken again (retrieve_time2)
To my surprise, retrieve_time2 > retrieve_time > insert_time (by 2 to 3 seconds for 10 million records).
As for my code, the insert code looks like this:
it1 = m.begin();\nstart = getTimeVal();\nfor (; it1 != m.end(); ++it1) {\n art_insert(&at, (unsigned char*) it1->first.c_str(),\n (int) it1->first.length() + 1, (void *) it1->second.c_str(),\n (int) it1->second.length());\n ctr++;\n}\nstop = getTimeVal();and retrieve code looks like this:
it1 = m.begin();\nstart = getTimeVal();\nfor (; it1 != m.end(); ++it1) {\n int len;\n char *value = (char *) art_search(&at,\n (unsigned char*) it1->first.c_str(), (int) it1->first.length() + 1, &len);\n ctr++;\n}\nstop = getTimeVal();Any pointers as to what I could do further? Or is there an explanation for this from the platform perspective?

Inexplicable performance on big.LITTLE technology (on Android)

相关下载

相关帖子