Walks in a Hypercuboid!
This blog post is about an extremely interesting problem I solved last night (after it gave me much grief)! I started off with a naïve and horrible implementation, and over the course of 5 iterations, morphed the solution into an elegant, correct, and blazingly fast implementation in C++! Since it gave me great joy, I had to write a blog to share it. The final solution simultaneously employed 3 dynamic programming algorithms on 3 different subproblems, and 1 divide and conquer algorithm.
But first, just what the hell is a hypercuboid?
Well, we all know what a cube is, right? A cuboid is just like a cube, but its three lengths need not be equal. In other words, a cuboid is to a cube what a rectangle is to a square.
Ok, so what about the “hyper” part? Well, a hypercuboid is just a cuboid in higher-dimensional space. In other words, a hypercuboid is to a cuboid what a cuboid is to a rectangle: a rectangle is a hypercuboid in 2-D space, a cuboid is a hypercuboid in 3-D space, and a tesseract (yes, like from the Thor movies!) is a hypercube in 4-D space. In general, a hypercuboid is a generalized cuboid over some N-dimensional space.
Problem Statement (Grid Walking)
The problem I solved was as follows:
You are given a hypercuboid with N dimensions, and the length of each dimension is specified. You also start off somewhere in this hypercuboid, at a start position specified by an N-dimensional vector. At each step, you can move 1 unit along any single dimension, in the + or – direction. You are asked to find out how many paths of exactly M steps you can traverse through this structure, subject to the constraint that each path must lie entirely within the hypercuboid, i.e. you are not allowed to fall out of the hypercuboid.
So for example, if this were a 2-D hypercuboid, with the following constraints and starting point, and you were asked to take 2 steps, you could trace out 16 different paths as shown below. Note, I haven’t shown {N, S}, {S, N}, {E, W}, {W, E} arrows due to space.
Doesn’t sound too bad right? Sure, we need to make sure we don’t fall off the structure, but it sounds like a straightforward application of dynamic programming.
Solution 1: Straightforward dynamic programming
Here’s the recurrence equation. Let P(S, i, j, k, l, …) denote the number of ways to take S steps from the starting vector V(i, j, k, l, …). Then I can take 1 step forward or backward along any dimension, and recurse on S-1 steps.
P(S, i, j, k, l, …) = P(S-1, i-1, j, k, l, …) + P(S-1, i+1, j, k, l, …) + … + P(S-1, i, j-1, k, l, …) + …
Of course, we have to have base cases for S=0, and there is exactly one way to take no steps from anywhere, and that way is to just remain where we are.
Further, we need to make sure that we only do the -1 along any dimension when we are not at the lowest point, and only do the +1 when we are not at the highest point in that dimension.
We can maintain two caches, one for the previous S-1 step, and one for the current S step, and just iterate through all the steps, and compute the values.
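As a sketch of this first approach, here it is specialized to the 2-D case (the grid sizes and step counts are hypothetical example values, and the function name is mine; the real solution handles N dimensions):

```cpp
#include <vector>

// Naive DP sketch (solution 1), specialized to 2-D for clarity.
// Returns the number of `steps`-step paths from (si, sj) that stay
// inside a d0 x d1 grid (0-based coordinates).
long long countPaths2D(int d0, int d1, int si, int sj, int steps) {
    // prev[i][j] = number of ways to take s-1 steps starting at (i, j);
    // base case S = 0: exactly one way, i.e. stay put.
    std::vector<std::vector<long long>> prev(d0, std::vector<long long>(d1, 1));
    for (int s = 1; s <= steps; ++s) {
        std::vector<std::vector<long long>> cur(d0, std::vector<long long>(d1, 0));
        for (int i = 0; i < d0; ++i)
            for (int j = 0; j < d1; ++j) {
                // only take the -1/+1 step when it stays inside the grid
                if (i > 0)      cur[i][j] += prev[i - 1][j];
                if (i < d0 - 1) cur[i][j] += prev[i + 1][j];
                if (j > 0)      cur[i][j] += prev[i][j - 1];
                if (j < d1 - 1) cur[i][j] += prev[i][j + 1];
            }
        prev = cur;
    }
    return prev[si][sj];
}
```

With 2 steps from the centre of a 5×5 grid, this reproduces the 16 paths of the 2-D example above.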
This first solution in C++ is pretty simple. It is also a complete disaster.
It has at least 2 main problems.
- The algorithmic complexity is off the charts. I step over all M steps, and for each of the (D1 * D2 * … * Dn) indices, I sum over both candidate neighbours in each dimension. If we denote the maximum dimension size as D, the maximum number of dimensions as N, and the maximum number of steps as M, we get O(M * D^N * N).
- The second problem is that I am computing ALL of the (D^N) indices prior to starting any computations. Each index is N wide, so I’m using O(N*D^N) space just to store the indices! This doesn’t even account for the space used by the caches. In my defence, I’m very used to lazy programming in Haskell. In Haskell, if I wrote a function to give me ALL indices, it wouldn’t actually do that. It would just create a thunk and sit tight, and provide me indices on demand as I asked for them. I was so used to this way of thinking that I reproduced it in a strict language like C++, to disastrous ends! I ran out of memory! 🙂
Solution 2: Iterate over indices, instead of generating them
Ok, so after I found that I had way too many indices eating up my memory I decided to mimic Haskell and supply indices on a one-by-one basis. You can see the resulting changes in solution 2.
Basically I have hacked together a primitive iterator scheme. The function getStartIndex() gives me the starting index, and I can repeatedly keep calling getNextIndex() to get the next lexicographically bigger index. When this function rolls over to the start index, it signals a flag which can end iteration.
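The getNextIndex() idea can be sketched as an odometer-style increment (the function name is from the code; this body is my reconstruction of the idea):

```cpp
#include <vector>

// Advance `idx` to the next lexicographically larger index within `dims`,
// carrying like an odometer. Returns false when it rolls back over to the
// start index (all zeros), which signals the end of iteration.
bool getNextIndex(std::vector<int>& idx, const std::vector<int>& dims) {
    for (int d = (int)idx.size() - 1; d >= 0; --d) {
        if (++idx[d] < dims[d]) return true; // no carry needed; done
        idx[d] = 0;                          // roll this digit over, carry left
    }
    return false; // every digit rolled over: back at the start index
}
```

Only one N-wide index lives in memory at a time, instead of all D^N of them.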
This solved my memory issues, since I wasn’t creating so many indices. But the basic problem still remained. I had just way too many indices to go through and process. In fact, the bottleneck now became the initialization function! My software caches were so big that the CPU just got stuck writing the initial values into the caches (see the loop in allocateCaches()). And my algorithmic complexity was unchanged.
This was a dismal failure of gigantic proportions! I needed to go back to the drawing board, and start from scratch.
Solution 3: Exploiting independence!
The key insight came in the form of orthogonality of vector dimensions. In other words, taking x steps along dimension I is independent of taking y steps along dimension J.
Why is this so crucial?
Imagine, if you will, a hypercuboid of just 2 dimensions, I and J. Let’s say I need to take M steps from some starting position in this structure. I can break this down into:
- Taking x steps in dimension I,
- Taking M-x steps in dimension J
- Interleaving the x steps from I with the (M-x) steps from J in all possible combinations.
- Repeating the above for x <- [0..M] (and hence (M-x) <- [M..0])!
Taking s steps in any dimension I of size Di is a much simpler dynamic programming problem. I can in fact create N caches (called DimCache in the code) that solve this problem for each of the N dimensions independently. This gives me the values of the first two bullets.
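A sketch of this 1-D subproblem (the function name and 1-based indexing are mine; values are stored mod 10^9+7 as the problem requires):

```cpp
#include <vector>
#include <utility>

const long long MOD = 1000000007LL;

// ways[s] = number of s-step walks that start at `start` (1-based) and
// stay within a single dimension of size `dim`. This is the per-dimension
// table that the DimCache idea stores.
std::vector<long long> oneDimWalks(int dim, int start, int maxSteps) {
    // prev[p] = number of (s-1)-step walks from position p; base case s=0 is 1
    std::vector<long long> prev(dim + 2, 1), cur(dim + 2, 0);
    std::vector<long long> ways(maxSteps + 1, 1);
    for (int s = 1; s <= maxSteps; ++s) {
        for (int p = 1; p <= dim; ++p) {
            cur[p] = 0;
            if (p > 1)   cur[p] = (cur[p] + prev[p - 1]) % MOD; // step left
            if (p < dim) cur[p] = (cur[p] + prev[p + 1]) % MOD; // step right
        }
        std::swap(prev, cur);
        ways[s] = prev[start];
    }
    return ways;
}
```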
How do I interleave x steps from I with (M-x) steps from J? Simple. Imagine an array of size M representing these steps. The number of ways I can interleave the two is simply the number of ways I can choose x slots from this array of M (M `choose` x)! That gives me the value of bullet number 3.
The last bullet is simply a loop around the first three.
That’s it! Now I can generalize this from 2-dimensional space to N-dimensional spaces, to get solution 3.
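The interleaving step for two dimensions (or, more generally, two independent groups of dimensions) can be sketched like this; wi and wj are the per-dimension walk counts described above, and choose is a precomputed Pascal's triangle (both assumed here, and the function name is my own):

```cpp
#include <vector>

const long long MOD = 1000000007LL;

// Merge the walk counts of two independent (groups of) dimensions:
// out[m] = sum over x of C(m, x) * wi[x] * wj[m - x]   (all mod 10^9+7),
// i.e. choose which x of the m steps go to the first group, then multiply
// the two independent counts.
std::vector<long long> combine(const std::vector<long long>& wi,
                               const std::vector<long long>& wj,
                               const std::vector<std::vector<long long>>& choose) {
    int M = (int)wi.size() - 1;
    std::vector<long long> out(M + 1, 0);
    for (int m = 0; m <= M; ++m)
        for (int x = 0; x <= m; ++x)
            out[m] = (out[m] + choose[m][x] * (wi[x] * wj[m - x] % MOD)) % MOD;
    return out;
}
```

Feeding in the 1-D counts {1, 2, 4} for each dimension of the 5×5 example recovers the 16 two-step paths.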
Generating N `choose` k values: This itself is an interesting dynamic programming subproblem within the original problem. I generated Pascal’s triangle via dynamic programming to get these values.
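A bottom-up sketch of that subproblem (entries kept mod 10^9+7, for reasons covered in the pitfalls section):

```cpp
#include <vector>

// Pascal's triangle, bottom up: C[n][k] = C[n-1][k-1] + C[n-1][k],
// with every entry stored mod 10^9+7 so the table never overflows.
std::vector<std::vector<long long>> pascalsTriangle(int n) {
    const long long MOD = 1000000007LL;
    std::vector<std::vector<long long>> C(n + 1, std::vector<long long>(n + 1, 0));
    for (int i = 0; i <= n; ++i) {
        C[i][0] = 1; // i choose 0 = 1
        for (int j = 1; j <= i; ++j)
            C[i][j] = (C[i - 1][j - 1] + C[i - 1][j]) % MOD;
    }
    return C;
}
```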
computeWays is now a recursive function, where each level of recursion strips away 1 dimension. Each level also runs a loop from 0 to the remaining step count.
Unfortunately this was still too slow.
Solution 4: Divide and Conquer
To be honest, at this point I was pretty frustrated, because I had spent significant time optimizing the algorithm without much result. I decided to google, and check if there was something fundamental about the problem that I was missing.
Credit for D&C insight: I happened upon this post by Peter Shor, an MIT professor, who discussed the exact same problem that I was solving. I found out that I was on the right track, but in order to further reduce the complexity, I could divide and conquer across all the dimensions! D’oh!! In retrospect, this makes perfect sense. I was stripping away the dimensions one at a time, and taking O(N) calls to arrive at the final dimension. I could easily halve the number of dimensions at every step using divide and conquer.
The insight was that I could divide the first N dimensions into two sets of N/2 each. I could perform my walks in a similar fashion to solution 3, but using only the dimensions in that set. Finally I combine the two sets in the same manner as before (n `choose` k).
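Sketched end to end, the recursion looks like this (it reuses the 1-D DP and the binomial merge from solution 3; all names here are my own):

```cpp
#include <vector>
#include <utility>

const long long MOD = 1000000007LL;

// 1-D walk counts: ways[s] = s-step walks from `start` staying in [1, dim].
std::vector<long long> oneDimWalks(int dim, int start, int maxSteps) {
    std::vector<long long> prev(dim + 2, 1), cur(dim + 2, 0);
    std::vector<long long> ways(maxSteps + 1, 1);
    for (int s = 1; s <= maxSteps; ++s) {
        for (int p = 1; p <= dim; ++p)
            cur[p] = ((p > 1 ? prev[p - 1] : 0) + (p < dim ? prev[p + 1] : 0)) % MOD;
        std::swap(prev, cur);
        ways[s] = prev[start];
    }
    return ways;
}

// Merge two independent groups of dimensions via binomial coefficients.
std::vector<long long> combine(const std::vector<long long>& a,
                               const std::vector<long long>& b,
                               const std::vector<std::vector<long long>>& C) {
    int M = (int)a.size() - 1;
    std::vector<long long> out(M + 1, 0);
    for (int m = 0; m <= M; ++m)
        for (int x = 0; x <= m; ++x)
            out[m] = (out[m] + C[m][x] * (a[x] * b[m - x] % MOD)) % MOD;
    return out;
}

// Divide and conquer over dimensions [lo, hi): halve the set, solve each
// half, then merge the halves. Recursion depth is O(log N) instead of O(N).
std::vector<long long> walks(int lo, int hi, int M,
                             const std::vector<int>& dims,
                             const std::vector<int>& start,
                             const std::vector<std::vector<long long>>& C) {
    if (hi - lo == 1) return oneDimWalks(dims[lo], start[lo], M);
    int mid = lo + (hi - lo) / 2;
    return combine(walks(lo, mid, M, dims, start, C),
                   walks(mid, hi, M, dims, start, C), C);
}
```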
My fourth solution for this problem enhanced my prior solution to implement D&C. I was very optimistic that this would work, but alas, I still got time-limit-exceeded over the problem sets!
What could I possibly be missing now?
Solution 5: Memoization
While staring at the code some more, I realized that I was re-computing the various walks through the individual sets again and again, i.e. I ended up calling computeWays(lhsdim, rhsdim, steps) multiple times with the same 3 parameters (due to the way the problem was structured). In other words, I could exploit these overlapping subproblems by caching them, and further improve my runtime.
I added this enhancement to the code, and the relevant portion in the file is called dnccache (for divide and conquer cache).
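The caching layer itself can be sketched like this. Note the real body of computeWays is the divide-and-conquer merge from solution 4; a placeholder recursion stands in here so the pattern is self-contained, and the callCount instrumentation is mine:

```cpp
#include <map>
#include <tuple>

// dnccache memoizes computeWays on its (lhsdim, rhsdim, steps) triple.
std::map<std::tuple<int, int, int>, long long> dnccache;
long long callCount = 0; // counts the computations actually performed

long long computeWays(int lhsdim, int rhsdim, int steps) {
    auto key = std::make_tuple(lhsdim, rhsdim, steps);
    auto it = dnccache.find(key);
    if (it != dnccache.end()) return it->second; // overlapping subproblem: reuse
    ++callCount;
    // Placeholder body; the real one performs the D&C merge over dimensions.
    long long result = (steps == 0)
        ? 1
        : computeWays(lhsdim, rhsdim, steps - 1) +
          computeWays(rhsdim, lhsdim, steps - 1);
    return dnccache[key] = result;
}
```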
Finally, after all of these enhancements, the final solution ran successfully! In fact, it was lightning fast, and computed each problem set within 40 ms (far below the 2sec limit on runtime)!
Summary of design techniques
Here is a summary of the various techniques used:
- Bottom-up dynamic programming for n `choose` k. This is called PascalsTriangle in the code.
- Bottom-up dynamic programming to compute the dimension caches for each dimension and maxsteps. This is called DimCache in the code.
- Top-down dynamic programming to memoize various invocations of computeWays(). This is called dnccache in the code.
- Divide and Conquer over the various dimensions to efficiently attack the problem.
Caveats and pitfalls:
Here are some pitfalls that caused me a lot of grief:
Programmer Beware: For n `choose` k, note that if you use the naïve formula N! / (K! (N-K)!), the intermediate values can overflow even though the final result won’t, so be careful. It is better to compute n `choose` k using Pascal’s Triangle.
Programmer Beware: Also note that, even when you follow the suggestion above, for the problem constraints, your n `choose` k can still overflow. For example, for 300 steps, computing 300 `choose` 150 gives you an 89-digit number, and 300 and 150 are well within the constraints of the problem. Therefore make sure you store the values in Pascal’s Triangle “mod 10^9+7”, which is the way the final result needs to be presented as well. I spent a lot of time debugging this on my end, because it wasn’t obvious to me that I would overflow here as well! 🙂
Programmer Beware: Note that the multiplication of 2 32-bit ints can produce a 64-bit int. If you don’t plan for this, your results will be wrong. It is easy to overlook this.
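A minimal illustration of the widening multiply (mulmod is my name for it):

```cpp
#include <cstdint>

const int64_t MOD = 1000000007;

// Multiply two residues mod 10^9+7. The cast happens BEFORE the multiply:
// (int64_t)a * b is computed as a 64-bit product, whereas a * b on 32-bit
// ints would silently overflow for operands near 10^9.
int64_t mulmod(int32_t a, int32_t b) {
    return (int64_t)a * b % MOD;
}
```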
Table of runtimes:
The table below shows the speed of the five different versions normalized to the speed of the first try (which itself was a DP solution).
Note that the speedups are NOT percentages. In other words, V3 was 9 *times* faster than V1. The final version was 20525 times faster than the original version! This really showcases why algorithmic speedups will always beat out any speedups obtained from faster hardware. 🙂
[Note that in order to measure these speedups, I had to gradually increase the number of tests run. Had I run the entire test suite on V1, the results wouldn’t be available for a very long time! Of course, I ran the same number and types of tests, multiple times, on two versions, while comparing their speeds relative to each other.]
| | V1 | V2 | V3 | V4 | Final |
| --- | --- | --- | --- | --- | --- |
| Speed (normalized to V1) | 1x | 1x | 9x | 187.7x | 20525x |