Walks in a Hypercuboid!

This blog post is about an extremely interesting problem I solved last night (after it gave me much grief)! I started off with a naïve and horrible implementation and, over the course of 5 iterations, morphed it into an elegant, correct, and blazingly fast C++ implementation! Since it gave me great joy, I had to write this post to share it. The final solution simultaneously employs 3 dynamic programming algorithms on 3 different subproblems, plus 1 divide and conquer algorithm.

But first, just what the hell is a hypercuboid?

Well, we all know what a cube is, right? A cuboid is just like a cube, but its three lengths need not be equal. In other words, a cuboid is to a cube what a rectangle is to a square.

Ok, so what about the “hyper” part? Well, a hypercuboid is just a cuboid in higher dimensional space. In other words, a hypercuboid is to a cuboid what a cuboid is to a rectangle: a rectangle is a hypercuboid in 2-D space, a cuboid is a hypercuboid in 3-D space, and a tesseract (yes, like from the Thor movies!) is a hypercube in 4-D space. A hypercuboid is a generalized cuboid over some N-dimensional space.

Problem Statement (Grid Walking)

The problem I solved was as follows:

You are given a hypercuboid with N dimensions, and the length of each dimension specified. You also start off somewhere in this hypercuboid, at a start position specified by an N-dimensional vector. You can take 1 step along any dimension, in the + or – direction, at a time. You are asked to find out how many paths you can traverse through this structure in M steps, subject to the constraint that each path must lie entirely within the hypercuboid, i.e. you are not allowed to fall out of the hypercuboid.

So for example, if this were a 2-D hypercuboid, with the following constraints and starting point, and you were asked to take 2 steps, you could trace out 16 different paths as shown below. Note, I haven’t shown {N, S}, {S, N}, {E, W}, {W, E} arrows due to space.

[Figure: example paths in the 2-D hypercuboid]

Doesn’t sound too bad right? Sure, we need to make sure we don’t fall off the structure, but it sounds like a straightforward application of dynamic programming.

Solution 1: Straightforward dynamic programming

Here’s the recurrence equation. Let P(S, i, j, k, l, …) denote the number of ways to take S steps from the starting vector V(i, j, k, l, …). Then I can take 1 step forward or backward along any dimension, and recurse on S-1 steps.

P(S, i, j, k, l, …) = P(S-1, i-1, j, k, l, …) + P(S-1, i+1, j, k, l, …) + P(S-1, i, j-1, k, l, …) + P(S-1, i, j+1, k, l, …) + …

Of course, we have to have base cases for S=0, and there is exactly one way to take no steps from anywhere, and that way is to just remain where we are.

Further, we need to make sure that we only do the -1 along any dimension when we are not at the lowest point, and only do the +1 when we are not at the highest point in that dimension.

We can maintain two caches, one for the previous S-1 step, and one for the current S step, and just iterate through all the steps, and compute the values.
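The recurrence above can be sketched in C++ as follows. This is a hedged reconstruction, not the actual solution's code: the identifiers and the flat-index layout are mine.

```cpp
#include <vector>
#include <cstdint>
using namespace std;

const int64_t MOD = 1000000007;

// Naive DP: iterate the recurrence over every grid point, for every step.
// dims[d] is the size of dimension d, start[d] the 0-based start coordinate.
int64_t naivePathCount(const vector<int>& dims, const vector<int>& start, int steps) {
    int n = dims.size();
    int64_t total = 1;
    for (int d : dims) total *= d;               // D1 * D2 * ... * Dn grid points

    // stride[d] moves the flat index +/-1 along dimension d
    vector<int64_t> stride(n, 1);
    for (int d = n - 2; d >= 0; --d) stride[d] = stride[d + 1] * dims[d + 1];

    // prev[i]: ways to take s-1 steps from flat index i; base case s=0 is all 1s
    vector<int64_t> prev(total, 1), cur(total);
    for (int s = 1; s <= steps; ++s) {
        for (int64_t i = 0; i < total; ++i) {
            int64_t ways = 0, rem = i;
            for (int d = 0; d < n; ++d) {
                int64_t coord = rem / stride[d];
                rem %= stride[d];
                if (coord > 0)           ways += prev[i - stride[d]];  // step -1
                if (coord < dims[d] - 1) ways += prev[i + stride[d]];  // step +1
            }
            cur[i] = ways % MOD;
        }
        swap(prev, cur);
    }

    int64_t flatStart = 0;
    for (int d = 0; d < n; ++d) flatStart += start[d] * stride[d];
    return prev[flatStart];
}
```

Every step touches all D^N cells, which is exactly the O(M * D^N * N) behaviour discussed below.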

This first solution in C++ is pretty simple. It is also a complete disaster.

It has at least 2 main problems.

  • The algorithmic complexity is off the charts. I step over all S steps, and for each of the (D1 * D2 * … * Dn) indices, I do the sum over both the candidates in each dimension. If we denote max dimension size as D, maximum number of dimensions as N, maximum steps as M, we get O(M * (D^N) * N).
  • The second problem is that I am computing ALL of the (D^N) indices prior to starting any computations. Each index is N wide, so I’m using O(N*D^N) space just to store the indices! This doesn’t even account for the space used by the caches. In my defence, I’m very used to lazy programming in Haskell. In Haskell, if I wrote a function to give me ALL indices, it wouldn’t actually do that. It would just create a thunk and sit tight, and provide me indices on demand as I asked for them. I was so used to this way of thinking that I reproduced it in a strict language like C++, to disastrous ends! I ran out of memory! 🙂

Solution 2: Iterate over indices, instead of generating them

Ok, so after I found that I had way too many indices eating up my memory I decided to mimic Haskell and supply indices on a one-by-one basis. You can see the resulting changes in solution 2.

Basically I have hacked together a primitive iterator scheme. The function getStartIndex() gives me the starting index, and I can repeatedly keep calling getNextIndex() to get the next lexicographically bigger index. When this function rolls over to the start index, it signals a flag which can end iteration.
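In spirit, the iterator scheme looks something like this (a hedged sketch; the real getNextIndex works on the problem's own index representation):

```cpp
#include <vector>
using namespace std;

// The lexicographically smallest index: all zeros.
vector<int> getStartIndex(const vector<int>& dims) {
    return vector<int>(dims.size(), 0);
}

// Advance idx to its lexicographic successor, like an odometer: increment the
// last digit, carrying leftward on overflow. rolledOver is set when every
// digit wraps and we are back at the start index.
void getNextIndex(vector<int>& idx, const vector<int>& dims, bool& rolledOver) {
    rolledOver = true;
    for (int d = idx.size() - 1; d >= 0; --d) {
        if (++idx[d] < dims[d]) { rolledOver = false; return; }
        idx[d] = 0;   // carry into the next dimension
    }
}
```

Only one index is alive at a time, so the storage drops from O(N * D^N) to O(N).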

This solved my memory issues, since I wasn’t creating so many indices. But the basic problem still remained. I had just way too many indices to go through and process. In fact, the bottleneck now became the initialization function! My software caches were so big that the CPU just got stuck writing the initial values into the caches (see the loop in allocateCaches()). And my algorithmic complexity was unchanged.

This was a dismal failure of gigantic proportions! I needed to go back to the drawing board, and start from scratch.

Solution 3: Exploiting independence!

The key insight came in the form of orthogonality of vector dimensions. In other words, taking x steps along dimension I is independent of taking y steps along dimension J.

Why is this so crucial?

Imagine, if you will, a hypercuboid of just 2 dimensions, I and J. Let’s say I need to take M steps from some starting position in this structure. I can break this down into:

  • Taking x steps in dimension I,
  • Taking M-x steps in dimension J
  • Interleaving the x steps from I with the (M-x) steps from J in all possible combinations.
  • Repeating the above for x <- [0..M] (and hence (M-x) <- [M..0])!

Taking s steps in any dimension I of size Di is a much simpler dynamic programming problem. I can in fact create N caches (called DimCache in the code) that solve this problem for each of the N dimensions independently. This gives me the values of the first two bullets.

How do I interleave x steps from I with (M-x) steps from J? Simple. Imagine an array of size M representing these steps. The number of ways I can interleave the two, is simply the number of ways I can choose x from this array of M (M `choose` x)! That gives me the value of bullet number 3.

The last bullet is simply a loop around the first three.

That’s it! Now I can generalize this from 2-dimensional space to N-dimensional spaces, to get solution 3.

Generating N `choose` k values: This itself is an interesting dynamic programming subproblem within the original problem. I generated Pascal’s triangle via dynamic programming to get these values.

computeWays is now a recursive function, where each level of recursion strips away 1 dimension. Each level also runs a loop from 0 to the remaining step count.
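Putting the pieces of solution 3 together, here is a hedged sketch. Identifiers like lineWalks and gridWalks are mine; they stand in for the DimCache and PascalsTriangle structures described above.

```cpp
#include <vector>
#include <cstdint>
using namespace std;

const int64_t MOD = 1000000007;

// DimCache analogue: ways[s] = number of s-step walks along one dimension of
// size len, starting at cell `start`, never stepping outside [0, len).
vector<int64_t> lineWalks(int len, int start, int maxSteps) {
    vector<int64_t> prev(len, 1), cur(len);     // 0 steps: exactly one way
    vector<int64_t> ways(maxSteps + 1, 1);
    for (int s = 1; s <= maxSteps; ++s) {
        for (int p = 0; p < len; ++p) {
            int64_t w = 0;
            if (p > 0)       w += prev[p - 1];
            if (p < len - 1) w += prev[p + 1];
            cur[p] = w % MOD;
        }
        swap(prev, cur);
        ways[s] = prev[start];
    }
    return ways;
}

// PascalsTriangle analogue: c[n][k] = n `choose` k, mod 1e9+7.
vector<vector<int64_t>> pascalsTriangle(int maxN) {
    vector<vector<int64_t>> c(maxN + 1);
    for (int n = 0; n <= maxN; ++n) {
        c[n].assign(n + 1, 1);                  // C(n,0) = C(n,n) = 1
        for (int k = 1; k < n; ++k)
            c[n][k] = (c[n - 1][k - 1] + c[n - 1][k]) % MOD;
    }
    return c;
}

// Strip one dimension per level: walks using dimensions [d, last] in m steps,
// interleaving x steps in dimension d with m-x steps in the rest.
int64_t computeWays(int d, const vector<vector<int64_t>>& dimWays,
                    const vector<vector<int64_t>>& choose, int m) {
    if (d + 1 == (int)dimWays.size()) return dimWays[d][m];
    int64_t total = 0;
    for (int x = 0; x <= m; ++x) {
        int64_t term = dimWays[d][x] * computeWays(d + 1, dimWays, choose, m - x) % MOD;
        total = (total + term * choose[m][x]) % MOD;
    }
    return total;
}

int64_t gridWalks(const vector<int>& dims, const vector<int>& start, int m) {
    vector<vector<int64_t>> dimWays(dims.size());
    for (size_t d = 0; d < dims.size(); ++d)
        dimWays[d] = lineWalks(dims[d], start[d], m);
    return computeWays(0, dimWays, pascalsTriangle(m), m);
}
```

On the 2-D example above (a 5x5 grid, starting at the center, 2 steps) this reproduces the 16 paths.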

Unfortunately this was still too slow.

Solution 4: Divide and Conquer

To be honest, at this point I was pretty frustrated, because I had spent significant time optimizing the algorithm without much to show for it. I decided to google, and check if there was something fundamental about the problem that I was missing.

Credit for D&C insight: I happened upon this post by Peter Shor, an MIT professor, who discussed the exact same problem that I was solving. I found out that I was on the right track, but in order to further reduce the complexity, I could divide and conquer across all the dimensions! D’oh!! In retrospect, this makes perfect sense. I was stripping away the dimensions one at a time, and taking O(N) calls to arrive at the final dimension. I could easily halve the number of dimensions at every step using divide and conquer.

The insight was that I could divide the first N dimensions into two sets of N/2 each. I could perform my walks in a similar fashion to solution 3, but using only the dimensions in that set. Finally I combine the two sets in the same manner as before (n `choose` k).

My fourth solution for this problem enhanced my prior solution to implement D&C. I was very optimistic that this would work, but alas, I still got time-limit-exceeded over the problem sets!

What could I possibly be missing now?

Solution 5: Memoization

While staring at the code some more, I realized that I was re-computing the various walks through the individual sets again and again: I ended up calling computeWays(lhsdim, rhsdim, steps) multiple times with the same 3 parameters (due to the way the problem was structured). In other words, I could exploit overlapping subproblems by caching them, and further improve my runtime.

I added this enhancement to the code, and the relevant portion in the file is called dnccache (for divide and conquer cache).
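Combining solutions 4 and 5, the memoized divide-and-conquer recursion can be sketched as follows. The identifiers are mine; the 1-D walk DP and Pascal's triangle are the same per-dimension subproblems described in solution 3.

```cpp
#include <vector>
#include <map>
#include <array>
#include <cstdint>
using namespace std;

const int64_t MOD = 1000000007;

// Same 1-D walk DP as in solution 3: ways[s] = s-step walks along one
// dimension of size len, starting at `start`, staying inside [0, len).
vector<int64_t> lineWalks(int len, int start, int maxSteps) {
    vector<int64_t> prev(len, 1), cur(len);
    vector<int64_t> ways(maxSteps + 1, 1);
    for (int s = 1; s <= maxSteps; ++s) {
        for (int p = 0; p < len; ++p) {
            int64_t w = 0;
            if (p > 0)       w += prev[p - 1];
            if (p < len - 1) w += prev[p + 1];
            cur[p] = w % MOD;
        }
        swap(prev, cur);
        ways[s] = prev[start];
    }
    return ways;
}

vector<vector<int64_t>> pascalsTriangle(int maxN) {
    vector<vector<int64_t>> c(maxN + 1);
    for (int n = 0; n <= maxN; ++n) {
        c[n].assign(n + 1, 1);
        for (int k = 1; k < n; ++k)
            c[n][k] = (c[n - 1][k - 1] + c[n - 1][k]) % MOD;
    }
    return c;
}

// D&C over dimensions [lo, hi]: halve the range, solve each half for every
// step split x / m-x, and interleave with m `choose` x. A dnccache-style map
// keyed on (lo, hi, m) memoizes the repeated invocations.
int64_t computeWays(int lo, int hi, int m,
                    const vector<vector<int64_t>>& dimWays,
                    const vector<vector<int64_t>>& choose,
                    map<array<int, 3>, int64_t>& dnccache) {
    if (lo == hi) return dimWays[lo][m];
    array<int, 3> key{lo, hi, m};
    auto it = dnccache.find(key);
    if (it != dnccache.end()) return it->second;
    int mid = (lo + hi) / 2;
    int64_t total = 0;
    for (int x = 0; x <= m; ++x) {
        int64_t lhs = computeWays(lo, mid, x, dimWays, choose, dnccache);
        int64_t rhs = computeWays(mid + 1, hi, m - x, dimWays, choose, dnccache);
        total = (total + lhs * rhs % MOD * choose[m][x]) % MOD;
    }
    return dnccache[key] = total;
}

int64_t gridWalksDnc(const vector<int>& dims, const vector<int>& start, int m) {
    vector<vector<int64_t>> dimWays(dims.size());
    for (size_t d = 0; d < dims.size(); ++d)
        dimWays[d] = lineWalks(dims[d], start[d], m);
    map<array<int, 3>, int64_t> dnccache;
    return computeWays(0, (int)dims.size() - 1, m, dimWays, pascalsTriangle(m), dnccache);
}
```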

Finally, after all of these enhancements, the final solution ran successfully! In fact, it was lightning fast, and computed each problem set within 40 ms (far below the 2sec limit on runtime)!

Summary of design techniques

Here is a summary of the various techniques used:

  1. Bottom up dynamic programming for n `choose` k. This is called PascalsTriangle in the code.
  2. Bottom up dynamic programming to compute the dimension caches for each dimension and maxsteps. This is called DimCache in the code.
  3. Top-down dynamic programming to memoize various invocations of computeWays(). This is called dnccache in the code.
  4. Divide and Conquer over the various dimensions to efficiently attack the problem.

Caveats and pitfalls:

Here are some pitfalls that caused me a lot of grief:

Programmer Beware: For n `choose` k, note that if you use the naïve formula (N! / (K! * (N-K)!)), the intermediate values can overflow even though the final result won’t, so be careful. It is better to compute n `choose` k using Pascal’s Triangle.

Programmer Beware: Also note that, even when you follow the suggestion above, for the problem constraints your n `choose` k can still overflow. For example, for 300 steps, computing 300 `choose` 150 will give you an 89 digit number, and 300 and 150 are well within the constraints of the problem. Therefore make sure you store the values in Pascal’s Triangle “mod 10^9+7”, which is the way the final result needs to be presented as well. I spent a lot of time debugging this, because it wasn’t obvious to me that I would overflow here as well! 🙂

Programmer Beware: Note that the product of two 32-bit ints can need 64 bits to represent. If you don’t plan for this, your results will be silently wrong. It is easy to overlook this.
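A minimal illustration of the trap (mulmod is a hypothetical helper, not from the actual solution):

```cpp
// With 32-bit operands, `a * b` is evaluated in 32-bit arithmetic and wraps
// before any modulus is applied; widening to 64 bits first keeps it exact.
const long long MOD = 1000000007;

long long mulmod(int a, int b) {
    return 1LL * a * b % MOD;   // NOT (a * b) % MOD, which silently overflows
}
```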

Table of runtimes:

The table below shows the speed of the five different versions normalized to the speed of the first try (which itself was a DP solution).

Note that the speedups are NOT percentages. In other words, V3 was 9 *times* faster than V1. The final version was 20525 times faster than the original version! This really showcases why algorithmic speedups will always, always beat out any speedups obtained from faster hardware. 🙂

[Note that in order to measure these speedups, I had to gradually increase the number of tests run. Had I run the entire test suite on V1, the results wouldn’t be available for a very long time! Of course, I ran the same number and types of tests, multiple times, on two versions, while comparing their speeds relative to each other.]

                          V1    V2    V3    V4        Final
Speed (normalized to V1)  1x    1x    9x    187.7x    20525x

Dice Path

I solved another exciting problem last week, in the functional programming section of HackerRank. The problem is called “Dice Path”, and is a variation of grid-based problems. Here is a detailed description of the problem.

Basically we are given an MxN grid, with a die placed at the top left corner of the grid. Every step, we can choose to move either one step down, or one step to the right, from our current position. When we take a step, the die is *rolled* in that direction, so its orientation changes and a new face is on top. Also, when we take a step, we add the number on the top of the die to a running sum. Our goal is to trace a path in this manner to the bottom right corner of the grid, such that the path has the maximum possible total.

[Figure: the grid, with the die at the top left corner]

Intuition

If your first thought is that this problem just screams dynamic programming, you are right on the money! 🙂 However, it has an interesting twist to it, which is why I wanted to blog about it.

First, let’s review what dynamic programming is, and why it is applicable here. Dynamic programming is basically glorified brute force recursion, with memoization. Dynamic programming problems have two main characteristics:

  1. Optimal Substructure: This means that an optimal solution to our problem can be found from an optimal solution to smaller sub-problems that it uses. This is why programmers often talk about “Recurrence Equations” when it comes to these problems. Detour: This is closely related to the idea of mathematical induction as well, recursion is in fact mathematical induction in disguise. The base case of mathematical induction is the terminating condition for recursion. And to code the function for the problem, we call the function on a smaller variant of the problem, assuming that the smaller variant gives us the right results. This is the inductive step! Anyway… back on track…
  2. Overlapping Subproblems: This is the other defining feature of dynamic programming. Just having a recurrence relation is not enough. In dynamic programming, we find that in the course of computing various functions, we repeatedly compute the same subproblem again and again and again. Therefore we memoize the results of our computations, so that we can just lookup a previously computed result in our cache. Detour: It is said that there are two main revolutionary ideas in computer science (everything else is evolutionary). Those ideas are ILP (instruction level parallelism) and locality of reference (caching). Whether we believe that or not, caching certainly is beneficial and essential to DP!

Ok, so how does this problem exhibit these characteristics?

Consider the final cell (M, N). From the problem constraints, the only two ways to get to this cell are from the top, or the left of the cell (since we move only downwards or rightwards). So basically we figure out the optimal solution to get to the top, and also the optimal solution to get to the left (optimal substructure), and compute the optimal step to get to this cell using those results!

But wait. In order to compute (M, N-1) and (M-1, N), we need to compute their respective tops and lefts as well. You will notice that the top of the former and the left of the latter is the exact same thing (M-1, N-1) (overlapping subproblems). Thus we don’t need to compute this twice, we can compute it only once, and cache the results, and profit!

Note that at this point, it may not seem like much of a gain, after all we only reduced 2 function calls to 1 function call and a lookup. But do not be deceived. These functions will in turn call other smaller subproblems, which will call yet other smaller subproblems… Thus, you will find that we in fact end up making an exponential number of calls to the smallest subproblems (e.g. calling (1, 1)). And even though (1, 1) is the base case and does O(1) amount of work, if you call it an exponential number of times, you are doing O(2^n) amount of work!

Thus we find the following:

Exploiting Optimal Substructure helps us guarantee the correctness of the algorithm, which can be trivially proven by mathematical induction (remember the parallels between the recurrence equation and mathematical induction).

Exploiting Overlapping Subproblems helps us guarantee the efficiency of the algorithm. For example, we converted the above O(2^n) algorithm to O(n^2) by caching the results.
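A toy illustration of this speedup, on plain monotone grid-path counting (the names are mine): the naive recursion makes an exponential number of calls, while the cached version computes each (r, c) exactly once.

```cpp
#include <vector>
#include <cstdint>
using namespace std;

// Number of lattice paths taking r down-steps and c right-steps.
// The naive recursion revisits the same (r, c) an exponential number of times.
int64_t pathsNaive(int r, int c) {
    if (r == 0 || c == 0) return 1;
    return pathsNaive(r - 1, c) + pathsNaive(r, c - 1);
}

// Same recurrence, but each (r, c) is computed once and cached: O(r*c) work.
int64_t pathsMemo(int r, int c, vector<vector<int64_t>>& cache) {
    if (r == 0 || c == 0) return 1;
    int64_t& slot = cache[r][c];
    if (slot == 0)                      // 0 marks "not computed yet"
        slot = pathsMemo(r - 1, c, cache) + pathsMemo(r, c - 1, cache);
    return slot;
}
```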

Okay, so far so good. Now we know why this is a dynamic programming problem. So what’s the big deal here?

An Interesting Twist

Notice that to exploit overlapping subproblems, to get the maximum path to (M, N) for example, we basically want to compute the maximum paths to (M-1, N) and (M, N-1) and go from there. However, in this case, caching just the maximal paths isn’t enough!

Why is that?

Well, consider for example the various paths that can bring you to cell (M, N-1). You might have a path with sum 15, but it ended up with die face 2 on top. You might also have a path with sum 14, that ended up with die face 1 on top. If you saved only the max path sum, you would only end up saving this 15 in your cache. So what’s the problem?

Well, if we roll this die to the right, the first path (sum 15, with 2 on top) might end up showing a 3 as the new top face. But the second (suboptimal) path (sum 14, with 1 on top) might end up showing a 6 as the new top face! Thus the second path (14 + 6) was a better candidate than the first one (15 + 3).

Note that this does not imply that optimal substructure is absent. It is still very much present, but the optimal substructure is over ALL possible orientations of the die at any place on the grid (rather than just the maximum path value up to that point). See below for more…

Ok, this is kinda messed up. So just how many such candidates should we store in our cache for each location? We basically have 2 options; at each location (I, J) in the cache, we can store:

  1. Either the maxpath to (I, J) for every orientation of the die.
  2. Or the maxpaths to (I, J) within <=6 values of each other.

The first one is pretty obvious. If we store the best path for every orientation, we are obviously guaranteed to be able to pick the correct one.

For the second option, remember that every path from (1, 1) to (I, J) travels the same Manhattan distance (this is a fancy way of saying that it takes the same number of steps downward and the same number of steps rightward). Therefore, if the difference in the max paths is bigger than 6, which is the biggest value on the die, you can never hope to catch up with the leaders using that path (since you would need an extra roll to catch up), and we can throw those paths out!

Now we are finally in a position to write out a complete dynamic programming solution to this problem. I have the DP solution below, and I implemented option #2 above.

https://github.com/cbrghostrider/Hacking/blob/master/HackerRank/FunctionalProgramming/MemoizationAndDP/dicePath.hs
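For flavor, here is a hedged C++ sketch of option #1 instead (the linked solution implements option #2 in Haskell). The start orientation — 1 on top, 2 facing the down direction, 3 facing right — is an assumption, so adjust it to match the actual problem statement; per the description above, only the faces revealed after a step are added to the sum.

```cpp
#include <vector>
#include <map>
#include <algorithm>
using namespace std;

// A die orientation: the top face, the face pointing in the down-move
// direction (south), and the face pointing in the right-move direction (east).
// Opposite faces of a die sum to 7, which the roll functions exploit.
struct Die { int top, south, east; };
Die rollDown (Die d) { return {7 - d.south, d.top, d.east}; }  // tip toward south
Die rollRight(Die d) { return {7 - d.east, d.south, d.top}; }  // tip toward east
int enc(Die d) { return d.top * 100 + d.south * 10 + d.east; }

long long maxDicePath(int rows, int cols) {
    // best[r][c]: for each (encoded) orientation, the best running sum at (r, c)
    vector<vector<map<int, long long>>> best(rows, vector<map<int, long long>>(cols));
    Die start{1, 2, 3};                 // assumed start orientation
    best[0][0][enc(start)] = 0;         // only faces revealed after a step count
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            for (auto& [k, sum] : best[r][c]) {
                Die d{k / 100, k / 10 % 10, k % 10};
                if (r + 1 < rows) {
                    Die nd = rollDown(d);
                    long long& slot = best[r + 1][c][enc(nd)];
                    slot = max(slot, sum + nd.top);
                }
                if (c + 1 < cols) {
                    Die nd = rollRight(d);
                    long long& slot = best[r][c + 1][enc(nd)];
                    slot = max(slot, sum + nd.top);
                }
            }
    long long ans = 0;
    for (auto& [k, sum] : best[rows - 1][cols - 1]) ans = max(ans, sum);
    return ans;
}
```

Because every orientation at every cell keeps its own best sum, the suboptimal-but-better-die paths from the example above are never lost.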

Wonder no more!

Always wondered why dynamic programming is called…. dynamic programming? I will leave you with this humorous gem, from Richard Bellman:

CHOICE OF THE NAME DYNAMIC PROGRAMMING

“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes.

“An interesting question is, ‘Where did the name, dynamic programming, come from?’ The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, ‘programming.’ I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities”

— Richard Bellman

 

Probability!

I think it is fitting to start my first blog with a post about one of my favorite topics – Probability!

Problem Statement

I recently came across this interesting problem, while solving problems on HackerRank.

You are given a random number generator that, given an input M, can generate real valued random numbers with a uniform probability density function in the range [0, M]. You use this generator to generate two random numbers: x in the range [0, A] and y in the range [0, B]. What is the probability that (x+y) < C? Here A, B, and C are all non-negative integers.

The naive pitfall

The trivial case is when A+B < C, in this case, the probability is 1, since any numbers you generate will add up to less than C. The non-trivial case is when A+B>C. We know that since x <= A and y <= B, therefore (x+y) <= (A+B). Similarly, since x >= 0 and y >= 0, (x + y) >= 0.

The naïve and incorrect approach is to assume a uniform probability distribution over the range [0, A+B]. Since C < A+B, this approach yields the answer C/(A+B). The problem was of course deceptive, and the naïve answer is most certainly not the correct one. The flaw in the naïve approach is that the probability density function over [0, A+B] is NOT uniform, i.e. some values have a higher chance of showing up than others.

An intuition from the discrete world

In order to get an intuition for this, let us for a moment leave the world of real valued numbers, and transfer the problem into a discrete (integral) valued world. Here is a similar version of the problem now, one that we are all familiar with:

Given two unbiased dice, one 6-sided (cubic) and one 4-sided (pyramidal), producing values within [1, 6] and [1, 4] respectively, what is the probability that their sum (x+y) < C? We know of course that the sum of two dice does not yield a uniform distribution, even though the dice themselves yield uniformly distributed numbers. In particular, we get the following probabilities:

x+y | Combinations (6-sided, 4-sided) | Probability
 2  | (1, 1)                          | 1/24
 3  | (1, 2) (2, 1)                   | 2/24
 4  | (1, 3) (2, 2) (3, 1)            | 3/24
 5  | (1, 4) (2, 3) (3, 2) (4, 1)     | 4/24
 6  | (2, 4) (3, 3) (4, 2) (5, 1)     | 4/24
 7  | (3, 4) (4, 3) (5, 2) (6, 1)     | 4/24
 8  | (4, 4) (5, 3) (6, 2)            | 3/24
 9  | (5, 4) (6, 3)                   | 2/24
10  | (6, 4)                          | 1/24

If we were to graph this, we would find that the probability mass function is an isosceles trapezoid.

[Figure: probability mass function of the sum of the two dice]

Now if we were to take the same intervals ([1, 6], [1, 4]) but with step sizes of 0.5 instead of 1, we would get a graph of the same nature over [2, 10], but with more bars. As we reduce the step size, in the limit as the step size tends to 0, the figure becomes continuous in nature.

Back to the continuous domain

We can transfer these learnings back to our original problem now.

We have learnt that the density of the sum of two continuous uniform distributions is a symmetric isosceles trapezoid. We shall see in a moment why this is so useful, but first, here is another chart depicting the situation for the intervals [0, A] and [0, B]. I shall assume that A < B, and note that the other problem (B < A) can be trivially converted to this one by swapping the inputs A and B. Here is a sample graph for A = 5, B = 15.

[Figure: trapezoidal density of x+y for A = 5, B = 15]

How do we get the height of 1/B? We simply recognize that the total probability, i.e. the probability of the value falling anywhere in the sample space, is 1. This is nothing but the area under the curve of the isosceles trapezoid.

Area = LHS triangle + rectangle + RHS triangle

1 = ½ * A * h + (B-A)*h + ½ * A * h = B*h

Or h = 1/B

Also note that the slope of the trapezoid’s left edge is nothing but (y/x). This is because this line passes through the origin, and its equation is of the form y = mx, where m is the slope. Since we know that one of its points is (A, 1/B), the slope of the line is 1/(AB).

Note that this is not meant to be a “Proof by analogy”. Indeed, to arrive at the precise dimensions of the figure above (e.g. why it is A, B-A, A) we carry out a process called the “convolution of probability distributions”, that tells us this answer. The analogy was just to get some intuition for the problem.

Now we can finally solve the original problem

If (C > A+B), the probability is 1.

If (C < A), it is the area of the smaller sub-triangle shown above (the part to the left of the leftmost dotted line; assume that dotted line intersects the X-axis at C). This is ½ * C * C/(AB) = C²/(2AB).

If (A < C < B), then we compute the area to the left of the second dotted line. This is the area of the lhs triangle plus part of the rectangle: ½ * h * A + (C-A) * h = A/(2B) + (C-A)/B.

For B < C < A+B, we use symmetry: the region to the right of C is a triangle just like the (C < A) case, but with base (A+B-C), so we compute its area the same way and subtract the result from 1.
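The piecewise cases can be sketched in C++ as follows (probSumLessThan is a hypothetical name; it assumes A <= B, so swap the inputs first if needed):

```cpp
// Piecewise P(x + y < C) for x ~ U[0, A] and y ~ U[0, B], assuming 0 < A <= B,
// read off as areas under the trapezoidal density.
double probSumLessThan(double A, double B, double C) {
    if (C >= A + B) return 1.0;                        // whole trapezoid
    if (C <= A) return C * C / (2.0 * A * B);          // small left triangle
    if (C <= B) return A / (2.0 * B) + (C - A) / B;    // triangle + strip of rectangle
    double t = A + B - C;                              // mirror-image triangle on the right
    return 1.0 - t * t / (2.0 * A * B);
}
```

For A = 5, B = 15, C = 10 this gives 5/30 + 5/15 = 1/2, matching the middle case above.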

Here is the entire Haskell code that solves the problem now.