Interesting case study in dynamic programming

Introduction

Dynamic programming (DP) is just recursive programming with caching. The cache comes in for saving the results of smaller subproblems.

Because we ultimately have to solve smaller subproblems, we have to traverse our solution space. You can think of the solution space as a directed acyclic graph where each node represents a subproblem. Each outward edge from a node represents immediate dependence on a smaller subproblem. The solution space DAG (directed acyclic graph) will have leaf nodes which have no outward edges. These leaf nodes are base cases.

There are two general ways we might go about solving a dynamic programming problem and they are labeled top-down or bottom-up. Solving a DP problem top-down is like traversing the solution space in post-order depth first search (DFS) starting at the root node. Solving a DP problem bottom-up is like processing the nodes in breath first search traversal starting at the leaf node. That last statement is not quite true from a technical standpoint, but the general idea of computing sub-problems in order of “closeness” to the base case is accurate. In the bottom-up approach, we only compute a sub-problem when all smaller sub-problems have been computed. One implication of this is that the bottom-up approach doesn’t involve recursion.

From a logical standpoint, top-down and bottom-up are logically consistent and correct ways to solve a DP problem. Usually, implementation depends on which is easier to express in code that other developers can understand. However, programs have finite resources like memory and I have a good example which demonstrates some interesting limitations which I’ll dive into.

Our case study is a problem presented to me by David Hopper at Outco. This problem caught my attention when my top-down approach failed despite being logically correct. I had never run into a situation where the top-down approach failed so badly. The analysis of exactly why and my efforts to overcome those limitations made for some interesting insights.

In the rest of this post, We’ll explore the naive top-down and bottom-up approaches for a concrete problem. The original top-down approach will break badly, and I’ll explain exactly why. We’ll verify the limitations by showing how some theoretical approximations line up with real-world performance. Then we’ll explore a modified top-down approach which turns out to be unusually performant. We’ll analyze the performance of the modified top-down and bottom-up to narrow down what might be causing this run-time difference. Although the modified top-down comes out as the clear winner among all the approaches, we’ll see that performance isn’t necessarily everything because we also want code that is easy to reason about.

Problem Statement

Write a function which accepts a positive integers and returns the minimum number of steps to get to 1 assuming we can perform the following operations:

  1. Subtract 1 from input
  2. Divide by 2 if the number is divisible by 2
  3. Divide by 3 if the number is divisible by 3

For example, if our input is 10, our answer is 3. To see why this is, we see that if we subtract once and divide by three twice, we get to 1. That is, (10-1) = 9, and 9/3 = 3, and 3/3 = 1. There are alternative ways to get to 1, like dividing by 2, subtracting once, and dividing by 2 twice more. Those are four steps since 10/2 = 5, 5 – 1 = 4, 4/2 = 2, and 2/2 = 1.

Pseudo-code approaches

Here is the pseudocode for each of the approaches:

Top-down:

int f(int x)
   static map cache
   (1 <= x) return 0
   if(cache.contains(x)) return cache.at(x)
      current_min = 1 + f(x-1)
   if(x % 2 == 0)
      current_min = min(current_min, 1 + f(x/2));
   if(x % 3 == 0)
      current_min = min(current_min, 1 + f(x/3));
   cache.at(x) = current_min
   return current_min

Bottom-up:

int f(int n)
   vector a(n+1);
   a[1] = 0;
   for each i from 2 ... n
      a[i] = INT_MAX
      if i % 2 == 0:
         a[i] = min(a[i], 1 + a[i/2])
      if i % 3 == 0:
         a[i] = min(a[i], 1 + a[i/3])
      a[i] = min(a[i], 1 + a[i-1])
   return a[n]

Actual Implementation

I've also implemented both approaches in C++ and I'll put it below:


// careful, this will break for large x
unsigned top_down(unsigned x) {
   static unordered_map cache;
   if(x == 1) return 0;
   if(x == 2 || x == 3) return 1;
   if(cache.count(x) > 0) return cache[x];
   unsigned current_min = UINT_MAX;
   if(x % 3 == 0) {
      current_min = min(current_min, 
                        1 + top_down(x/3));
   }
   if(x % 2 == 0) {
      current_min = min(current_min, 
                        1 + top_down(x/2));
   }
   current_min = min(current_min, 
                     1 + top_down(x - 1));
   cache[x] = current_min;
   return current_min;
}

// works for any x
unsigned bottom_up(unsigned x) {
   vector cache(x+1);
   cache.at(1) = 0;
   for(unsigned i = 2; i <= x; i++) {
      cache[i] = 1 + cache[i-1];
      if(i % 2 == 0) {
         cache[i] = min(cache[i], 
                        1 + cache[i/2]);
      }
      if(i % 3 == 0) {
         cache[i] = min(cache[i], 
                        1 + cache[i/3]);
      }
   }
   return cache[x];
}

Analysis of Performance, Part 1

When running the top-down approach our program aborts for large enough input. Taking a look in GDB (debugger) we see that at the time our program breaks we have 175k stack frames. My first impression was that somehow we don't terminate the recursion, but that's actually not the case. If you look at the code for the top down solution, we see that each subproblem f(x) will end up calling f(x-1) at some point. This means f(x-1) will call f(x-2) and so on. Allocating a stack frame for a recursive call won't hurt us, but allocating 100k+ stack frames is problematic. We see that the memory requirements for our top-down approach is linear with the input to our function f. While the bottom-up approach also requires memory whose complexity is linear with respect to the input, the constant factor is much smaller. Each entry in an array of ints should be sizeof(int) which should be 4 bytes. However, according to GDB, the size of each stack frame is 48 bytes. That's more than an order of magnitude of difference. Furthermore, allocating memory for an array comes from the heap, and memory for the stack frame comes from the stack.

There's a little more to this analysis that we can dive into. I'll describe some details on the limitations of each of the approaches and then I'll create a top-down approach that is actually better and faster than the bottom-up approach.

On my operating system (Ubuntu 16.04), we can use the ulimit command to see our stack size. From this, we see that our stack size is 8192 KB. As I mentioned before, taking a look at the memory locations of the stack frames in GDB showed us that they were 48 bytes apart, suggesting that they are each 48 bytes in size. 8192 KB / (48 bytes per stack frame) is about 170k stack frames. Inspecting the depth of the stack frames in GDB shows us that we go 175k stack frames deep. This suggests that our estimates line up well with that is actually happening.

Right now the top-down approach breaks at input just above 320k. This is not that surprising because the by the way the actual implementation was written (computing f(x/3), then f(x/2), then f(x-1)), if we recurse down successive calls to f(x-1), eventually we'll call our recursive function with some input which should equal x/2, which will be cached. Likewise, when calling f(x/2), the most we can go down the path of subtracting one and recursing is x/2. So if our input is above 320k, then we can expect to have above 160k stack frames, which is very close to the total memory we have on the stack. We are almost out of stack space.

Furthermore, ulimit tells me that each program has unlimited virtual memory. That virtual memory is our heap space. Testing this out, I run the bottom-up approach with input of 1 billion. It takes a moment to run, but it computes the result correctly. For an input of 1 billion, our array will require allocating space for 1 billion ints, which is 1 billion sets of four bytes, or 4 GB.

A Modified Top-Down Approach

Right now, the bottom-up approach looks more scaleable than the top-down approach. If we're clever, we can see some patterns to this problem that allows us to exclude traversing certain paths in our solution space. For example, consider the operation of subtracing twice and dividing by two once. In other words, ((x - 1) - 1) / 2 = x/2 - 1. That was three steps. However, dividing by two and subtracting once gets us (x / 2) - 1 = x/2 - 1. That was two steps. Clearly, we should never subtract twice and then divide by 2.

Likewise, consider the case where we subtract three times and then divide by three. We have (((x-1) - 1) - 1) / 3 = x/3 - 1. That's four steps. But if we divide by three then subtract one we get (x / 3) - 1 = x/3 - 1. That's two steps.

So let's construct a smarter top-down solution and see how it performs. Here I've written one.


unsigned top_down(unsigned x, unsigned counter = 0){
   static unordered_map cache;
   if(x == 1) return 0;
   if(cache.count(x) > 0) return cache[x];
   // if we add 1 to UINT_MAX we roll over
   unsigned current_min = UINT_MAX / 2; 
   if(x % 3 == 0) {
      current_min = min(current_min, 
                        1 + top_down(x/3));
   }   
   if(x % 2 == 0) {
      current_min = min(current_min, 
                        1 + top_down(x/2));
   }   
   if(counter < 3) {
      current_min = min(current_min, 
                        1+top_down(x-1,counter+1));
      // this needs to be in here because we can't 
      // risk store data for incomplete results. 
      // Bug in case of 10M input. Get 23, but  
      // answer is 22.
      cache[x] = current_min; 
   }   
   return current_min;
}

Analysis of Performance, part 2

As far as performance goes, the modified top-down is way faster than the bottom up one. This may be because we don't need to traverse the same solution space as the bottom-up or it may be that allocating large blocks of memory is a very slow operation. For our input of 1 billion we have the following from using the time utility:

bottom-up:

real	0m17.561s
user	0m17.184s
sys	0m0.372s

modified top-down:

real	0m0.003s
user	0m0.000s
sys	0m0.000s

That's a huge difference. For a discussion on the differences between real, user, and sys time, see here. When allocating memory, we'll probably need to switch to kernel mode which will be reflected in the system time. For the bottom-up approach, this value is very small, so allocating memory does not have a large contribution to the time spent in the bottom-up approach. It seems that right now either the top-down approach needs to examine much less of the solution space, or perhaps that paging is expensive from the bottom-up approach needing so much memory.

Let's see if paging might be the issue. Running our bottom-up approach on inputs of 1M decreases the running time by three orders of magnitude, but we should be able to fit 4MB into memory without issue. Paging shouldn't be an issue with 4MB of memory required.

All we have left is that the top-down approach explores less of the solution space. This makes sense if we think about it the right way. For example, for a large x, we recurse possibly with x/3, x/2, and x-1. For our modified top-down, we won't consider recursing by subtracting more than three times in a row. So for input of 10^9, we should never need to compute x = 10^9 - 4, 10^9 - 5, ..., 10^9 / 2 + 1. That is, we can immediately exclude computing almost x/2 sub-problems. My intuition tells me that the time complexity of the modified top-down approach is O(log_2(n)) and the bottom-up approach's is O(n).

From this information and the drastic difference in the running times, we see that approaching this in a "smarter" way can get us a top-down solution which is many orders of magnitude faster than the bottom-up approach for most inputs. It's so fast even for inputs of 10^9 that the timing data registers as 0.000 seconds for user time.

A Word of Caution

There is an interesting caveat though for the top-down solution. There's a comment in the block with the test checking that the counter value is less than 3. If we were to update the cache outside of this block, we might be updating our cache in a scenario where we really haven't computed f(x-1). We can only update our cache when we've called each of f(x/2), f(x/3), and f(x-1) as they apply. If we were not to, it's possible we might be updating the cache with incorrect values.

In my testing, I identified a scenario where updating the cache in the wrong place gets us an incorrect value for an input of 10M. The answer is 22, but we would get 23. It's truely a one in 10M problem because I was unable to find an error case for any smaller input. (Though I didn't test exhaustively.)

Finding out exactly where things go wrong is difficult to track down but it's sort of unimportant. This has already become too complex for developers to easily reason about program behavior. For most purposes, the bottom-up scenario is fine, and easy to understand. The modified top-down is awesomely performant, but it's complex and much more difficult to understand and reason about.

Conclusions

Top-down and bottom-up DP solutions are both logically valid ways of solving a DP problem. However, if the top-down solution has a significant level of recursion depth, it'll be pretty fragile and unwieldy for larger inputs. Here I illustrated a problem that seems unamenable to a top-down approach.

As a twist, I also illustrated the biggest weakness of the bottom-up approach with the same problem: needing to compute way more sub-problems than logically necessary, because we can't always know which sub-problems are relevant to our original problem.

Then I introduced a modified version of the top-down solution which avoided unnecessary recursion due to some mathematical properties of our problem. With this modified solution, I compared its performance to the bottom-up solution. We saw that the modified top-down solution was logarithmic in time complexity while the bottom-up approach was linear, explaining the huge performance difference seen in practice. Even for extremely large input, the modified top-down solution computed the result so quickly the system clock didn't register one millisecond of user time.

Lastly, I expressed a word of caution about the modified top-down solution because it was complex and hard for developers to reason about compared to the bottom up solution.