Contest: When Constant Factors Matter…

This is Part II of the competitive programming blog I started last time. You can read Part I here.

This blog is a bit of a departure from the previous ones. Previously I discussed some algorithms and data structures that I employed to solve various problems. This blog continues in that tradition, but it also talks about the “constant factors” hidden behind the Big-Oh notation. In particular, we will see that sometimes optimizing the hidden constant can be enough to squeeze under the desired time limit, even with a sub-optimal algorithmic solution. To be sure, it is not something I recommend as a routine practice. But it is an option when you can't think of any further algorithmic optimizations.

Previously, I discussed all but the last problem from the contest, so without further ado, here is the last problem:

Problem 5: White Falcon and XOR

Here is the problem statement and example from HackerRank:

There are two sequences, a and b, each of length N. We will create new sequences out of a and b and write them onto a paper one by one with the following rules:

Choose two integers L and R. (1 ≤ L ≤ R ≤ N)

Obtain a new sequence with size R−L+1 by choosing one of a_i or b_i for each i, L ≤ i ≤ R.

More formally, let the new sequence be w. Then:

w_1 ∈ {a_L, b_L}

w_2 ∈ {a_(L+1), b_(L+1)}

w_3 ∈ {a_(L+2), b_(L+2)}

⋮

w_(R−L+1) ∈ {a_R, b_R}.

Note that there are 2^(R−L+1) ways to choose w in total, given L and R.

We would like to know the number of sequences on the paper with xor sum equal to 0 for each length. Specifically, we want N numbers, where the ith number denotes the number of sequences of length i with xor sum equal to 0. We only want each number modulo M.

Constraints

1 ≤ N ≤ 8000

0 ≤ a_i < 64

0 ≤ b_i < 64

1 ≤ M ≤ 10^5

Example:

N = 4  M = 1000

A = 2 2 3 1

B = 2 3 2 2

Output:

0

5

2

4

Explanation

There are 5 possible ws with length 2:

For (L,R)=(1,2): (2,2) and (2,2)

For (L,R)=(2,3): (2,2) and (3,3)

For (L,R)=(3,4): (2,2)

There are 2 possible ws with length 3:

For (L,R)=(2,4): (2,3,1) and (3,2,1)

There are 4 possible ws with length 4:

For (L,R)=(1,4): (2,2,2,2), (2,2,2,2), (2,3,3,2) and (2,3,3,2)

A sub-optimal solution:

At first the problem appears pretty intractable. But then we notice that the values of a_i and b_i all lie in the range [0, 63]. This can be exploited to design a dynamic programming solution. Here was my first take on this solution.

In particular, consider maintaining, for every starting index L, a DP cache of shape [2][64]. The [2] dimension holds the counts for the “previous” and the “next” iterations of the bottom-up DP algorithm. The [64] dimension tracks, for each possible XOR value, the number of sequences that XOR to that value.

This cache is initialized with the actual a[i] and b[i] counts (i.e. all XOR sequences of size 1). This is our base case. We then proceed through increasing values of size from 2 to N, updating the counts in the DP cache as we go. Simultaneously, we update the xorCount[] array with the results after each iteration of size, so that we can print out the answers at the end.
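
To make this concrete, here is a minimal sketch of the DP just described. This is a reconstruction for illustration, with made-up names like cnt and ansLen; it is not the actual submission linked above.

#include <algorithm>
#include <cstdio>

// Rough sketch of the O(N^2 * 64) DP. cnt[p][L][v] holds the number of
// sequences of the current length, starting at index L, whose xor equals v.
static long long cnt[2][8001][64];
static int a[8001], b[8001], ansLen[8001];

int main() {
    int N, M;
    if (std::scanf("%d %d", &N, &M) != 2) return 1;
    for (int i = 1; i <= N; i++) std::scanf("%d", &a[i]);
    for (int i = 1; i <= N; i++) std::scanf("%d", &b[i]);

    int prev = 0, next = 1;
    // Base case: all sequences of length 1.
    for (int L = 1; L <= N; L++) {
        cnt[prev][L][a[L]]++;
        cnt[prev][L][b[L]]++;
        ansLen[1] = (ansLen[1] + cnt[prev][L][0]) % M;
    }
    // Extend every sequence one position to the right at a time.
    for (int len = 2; len <= N; len++) {
        for (int L = 1; L + len - 1 <= N; L++) {
            int R = L + len - 1;
            for (int v = 0; v < 64; v++) cnt[next][L][v] = 0;   // clear stale counts
            for (int v = 0; v < 64; v++) {
                long long c = cnt[prev][L][v];
                cnt[next][L][v ^ a[R]] = (cnt[next][L][v ^ a[R]] + c) % M;
                cnt[next][L][v ^ b[R]] = (cnt[next][L][v ^ b[R]] + c) % M;
            }
            ansLen[len] = (ansLen[len] + cnt[next][L][0]) % M;
        }
        std::swap(prev, next);   // "next" becomes "prev" for the following length
    }
    for (int len = 1; len <= N; len++) std::printf("%d\n", ansLen[len]);
    return 0;
}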

This solution is functionally correct. However, it is really slow. The algorithmic complexity is O(N^2 · K), or in this case O(N^2 · 64), where N is the length of the sequences and K = 64 is the number of possible XOR values (every element is below 64). In terms of actual runtime, this algorithm takes around 19 seconds on most of the inputs. The time limit cutoff was 2 seconds, so this is about 10 times too slow!

An Algorithmic Optimization Attempt

Of course, my first thought in the contest was to try and optimize this algorithmically. It seems that the problem should be amenable to a divide-and-conquer approach. Divide the sequences into two portions, compute the corresponding values for size/2, and then in the merge step get the corresponding values for everything from size/2+1 to size.

Seems simple in theory. There was only one problem… For the life of me, I could not figure out how to do the merge step in less than O(N^2) time. No matter what I tried, the merge step kept taking quadratic time. And if the merge takes quadratic time, then the overall complexity of the algorithm is dominated by this factor, as per the master theorem. Such a solution is clearly no better than my DP solution.

EDIT: I peeked at the editorial after the contest. Turns out D&C was indeed the way to go, but the merge step required knowledge of Fast Fourier Transform to do it faster than O(N^2). Sounds fascinating! I need to put that in my mental queue, so that I can go back and learn this technique.

So what do I do now? The clock is ticking on the problem, and I can’t seem to figure out a solution better than O(N^2). Well, since it was a weekend, and I had plenty of time, I literally started fooling around with the code. What optimization tricks could I pull off on my DP solution, in order to make it run faster?

When Constant Factors Matter…

I’m lucky to have had a really good compilers professor in grad school. I have also worked at Intel on the x86 simulator team for the last 5 years, so I’m pretty familiar with the low level workings of a CPU. Could I perhaps exploit this knowledge to improve the runtime of my code somehow? Of course, there’s no way I could improve the runtime from 19 seconds to 2 seconds needed for a successful submission… could I? I had no idea, but I knew I wanted to tweak the code as much as possible…

Let’s examine the bottlenecks in my code:

Optimizing away the modulus:

The very first thing that jumps out is the number of times I perform a modulus all over the code. This wouldn't be a problem if M were a power of two, since in that case the modulus reduces to a simple bit operation. The general integer division instruction, however, is far slower. (Integer division computes both the quotient and the remainder, so divide and modulus are the same instruction.) Performing the modulus all over the place is definitely slowing down the code by introducing long latencies into the CPU pipeline. These have got to go.
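
To see concretely why the power-of-two case is cheap, here is a tiny standalone illustration (the values here are made up, not taken from the problem):

#include <cstdio>

int main() {
    unsigned x = 123456789, M = 100000;  // arbitrary example values

    unsigned byMask = x % 64;  // power-of-two modulus: compiles to "x & 63",
                               // a single cheap AND instruction
    unsigned byDiv  = x % M;   // general modulus: requires an integer divide,
                               // one of the highest-latency ALU operations

    std::printf("%u %u\n", byMask, byDiv);
    return 0;
}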

But how do we do this? I mean, the problem statement asks for the answers “modulo M”. The key insight here is to understand that we don't need to perform the modulo on *every* iteration. Luckily for us, the maximum value of M is 10^5, while a U32 can hold about 4×10^9 and a U64 can go up to about 1.8×10^19. This gives us quite a bit of room to play with!

If we use a 32-bit number to hold the results, we may be able to get away with performing the modulo every X iterations. On the other hand, if we use a 64-bit number, we may be able to perform the modulo every Y iterations (Y > X). But keep in mind this increases the size of our DP cache (in bytes) because a U64 takes twice the space of a U32. This will affect CPU data cache performance somewhat.

So what do we do? Take the smaller or the larger data type? And how do we determine the values of X and Y? Luckily, both of these questions can be answered by repeated experiments. In fact, let's take a page out of our algorithms toolbox and binary search on these values in the experiments! So I ran a number of experiments with both U32 and U64, and determined the following:

//#define MODCNT (45)  // experimentally determined for U64, don't touch

//#define MODCNT (15)  // experimentally determined for U32, don't touch

i.e. X = 15 and Y = 45. Every 15 iterations I need to perform the mod M operation if I use a U32, and every 45 iterations if I use a U64. This ensures the modulus is always done *before* any data overflow, keeping the results correct. (It also squares with a rough bound: each step sums two previous counters, so after k unreduced steps the values stay below 2^k·M; with M ≤ 10^5, 2^15·M still fits in a U32 while 2^45·M fits in a U64.) In practice, the latter (U64, with a reduction every 45 iterations) performed better, and that is what I settled on.

Here's another interesting problem: how do I know (within the inner for loop) whether I have passed 45 iterations? The naïve way is to take the iteration counter modulo 45. But the whole point is to get the modulo out of the loop. 🙂 Therefore, I pre-computed Boolean flags for the relevant iteration counts *outside* the loops:

for (int mult = 0; (mult * MODCNT) < LIMIT; mult++) {
    int val = mult * MODCNT;
    multCheck[val] = true;
}
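
Inside the hot loop, the table is then consulted with a plain array lookup, something along these lines (reduceIfDue and the exact array layout are my illustrative names, not the code from the submission):

#include <cstdint>

typedef uint64_t U64;

// multCheck[size] is true exactly when "size" is a multiple of MODCNT, so the
// expensive fold-back-below-M pass runs only once every MODCNT iterations,
// and the per-iteration test is just an array lookup.
void reduceIfDue(int size, const bool *multCheck, U64 cache[][64], int N, U64 M) {
    if (!multCheck[size]) return;     // cheap flag check instead of "% MODCNT"
    for (int L = 1; L <= N; L++)
        for (int v = 0; v < 64; v++)
            cache[L][v] %= M;         // done before anything can overflow a U64
}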

Optimizing away unnecessary memory writes:

At the end of the for loop, the prev and next caches need to trade places so that the next iteration can start writing into the right buffer. Instead of copying the values over, I re-factored the code to perform a simple pointer swap. This saved writing 64 U64s, i.e. 512 bytes of memory, at the end of every inner iteration. While that may not seem like much, over the course of the entire program it saved (8000 * 8000 * 512) = ~31 GB of unnecessary memory writes! This was HUGE!
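
The refactor itself is tiny. Conceptually it is just the following (a sketch with illustrative names; XorCounts stands in for whatever struct holds the 64 per-value counters):

#include <cstdint>
#include <utility>

typedef uint64_t U64;

struct XorCounts { U64 results[64]; };  // 64 counters = 512 bytes per slot

// Before: the 512 bytes in "next" were copied back into "prev" after every
// inner iteration. After: the two buffers simply trade roles via a pointer
// swap, which costs no memory traffic at all.
void rotateBuffers(XorCounts *&prev, XorCounts *&next) {
    std::swap(prev, next);
}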

Manual loop unrolling, and re-factoring writes to the cache:

Here’s another place where I could optimize:

U8 xorval = i ^ as[R];                      // extend the sequence with a[R]
NumSeq incBy = cache[prev][L].results[i];
cache[next][L].results[xorval] += incBy;
cache[next][L].results[xorval] %= M;
                :
                :
U8 xorval = i ^ bs[R];                      // extend the sequence with b[R]
NumSeq incBy = cache[prev][L].results[i];
cache[next][L].results[xorval] += incBy;
cache[next][L].results[xorval] %= M;

This is a snippet of code in the inner for loop. The compiler really has no idea whether these various memory writes alias, and therefore it is very conservative in generating code for these writes.

Instead, with a bit of refactoring, I could re-write the code in the following fashion:

cacheNext[i] = cachePrev[asr ^ i] + cachePrev[bsr ^ i];

This loop goes from i=0 to 63. Not only is there no dependency across loop iterations, but it has very good L1-cache behavior as well! This is a definite win. I wasn’t sure whether the compiler would perform loop unrolling, so I manually unrolled all 64 iterations for i.
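
Putting it together, the refactored (pre-unrolling) inner update looks roughly like this. updateCounts is my illustrative name for it; cachePrev, cacheNext, asr and bsr are as in the snippet above, and the deferred modulo from earlier means there is no % anywhere in the hot path:

#include <cstdint>

typedef uint64_t U64;

// Per-(L, R) update: every output bucket is a pure function of the *previous*
// buffer, so there are no dependencies between iterations and no read/write
// aliasing for the compiler to be conservative about.
void updateCounts(const U64 *cachePrev, U64 *cacheNext, unsigned asr, unsigned bsr) {
    for (unsigned i = 0; i < 64; i++)
        cacheNext[i] = cachePrev[asr ^ i] + cachePrev[bsr ^ i];
    // In the actual submission this loop is unrolled by hand into 64
    // straight-line statements, since I wasn't sure the compiler would do it.
}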

Here is my resulting final submission.

Here’s the real shocker:

On gcc 4.8.1, with O3 optimization, my first DP submission ran for 19 seconds.

On gcc 4.8.1, with O3 optimization, my final DP submission ran for 1.92 seconds!

This is an astonishing 10x improvement *over O3 optimized gcc code*. Without any algorithmic complexity changes! Of course, the solution is still functionally correct.

I can honestly say that I have never performed such an optimization before. It was good enough to pass the 2 second time limit on all the test cases, and I succeeded with this problem in the contest!

It was also an eye-opener for me. Yes, algorithmic optimization is the preferred way to go. But when you can't improve the Big-Oh any further, perhaps optimizing away the hidden constants can help you hit your target runtime! 🙂
