Post-mortem of a strange bug: dependencies that aren’t what they appear to be

This is my favorite case study among bugs I encountered while working on a search engine at FactSet.

Some context is necessary: The front-end for this search engine is html-based, and it has a feature where charts are shown for some queries. For example, type in “price ibm” and a little chart might show the stock price over the last 5 years. It’s an attractive visual component and also a core part to this kind of feature (showing data in response to a query in addition to links). We have maybe 10 – 15 direct javascript dependencies, each of these having their own dependencies, etc. This chart is rendered by an internal charting library at FactSet. There’s a CDN for which we use a CDN resolver. This CDN resolver takes those script element source files and changes them to direct CDN links if possible. This CDN resolver is buggy so some dependencies are hard-coded to use the CDN directly. Managing these dependencies as they update versions is challenging. We promote devel to QA on Tuesdays. We have internal tools to quickly test the product, which is very close to a clone, but technically a similar imitation.

Now onto our problem.

On Tuesday we promote the latest version of our search engine for testing. Everything is going smoothly, up until a patch is made on Friday night during the testing phase. Once the snapshot was released to clients, it was discovered that charting was completely, utterly broken. This would be an uninteresting bug if this patch touched on any charting logic, except that wasn’t the case.

Oddity 1: A minor patch unrelated to charting code occurred coincidental to the total failure of charting logic.

This bug is very real and not some anomaly vanishing under a fresh system reboot or cache resetting. Despite my best efforts, I cannot replicate this bug using our internal tools, which are designed to be clones of the actual product. I check the dependencies downloaded by the live product and compare it with those used by the internal tool. They appear the same.

Oddity 2: An internal clone of the product fails to replicate the bug. The dependencies appear to be the same but the behavior is different.

This kind of oddity is vaguely ominous. Our internal clone is so close to the actual product that it’s unclear how they could by out of sync. Next I turn to a clean environment through JEdit to attempt to replicate. Going to this step is sort of like the nuclear option of debugging and requires some effort to set up. Since I was the guy who set up this internal charting engine within the search engine UI, I’m familiar enough with it to debug it directly. Examining the dependencies being used within the live environment (what gets downloaded by our dependency manager), I set up JEdit to replicate this environment.

Oddity 3: The JEdit environment I set up doesn’t replicate the bug, either.

I take another look at the error message in the live product’s debugging console. It’s complaining about a non-existent function in Lodash. But Lodash hasn’t been updated or changed in weeks. I examine the source code of Lodash and find that the function is indeed present.

Oddity 4: The live product is complaining about a function in one dependency not existing, but on inspection of the source code it does exist.

It’s unclear what direction to go next. The deeper I go I just find more questions without answers. These sorts of bugs almost challenge your grip on reality. The first clue was that Lodash deprecated some features across concrete versions (3 -> 4). I find that changing the Lodash version allows me to replicate the bug. I suspect that this may somehow be related to the latest minor version upgrade of our charting library. When I downgrade the charting library version, I find that I can’t replicate the bug with either version of Lodash.

The next part, sadly, is a little strange. It turned out that the libraries downloaded by the dependency manager weren’t necessarily the versions actually used. Remember that CDN resolver? Like a regex, it would convert references to libraries downloaded by the dependency manager directly to CDN links. So to see the dependencies used, you’d actually need to examine the script elements’ src attribute value of the html file acting as the UI.

Oddity 5: The dependency manager and CDN resolver seemingly created a catastrophic conflict. What happened?

Looking at the dependencies required by the actual charting library, I see it requires a specific version of Lodash (^3.9.0 or something). But we’re using 4.X, which is totally unnecessary. The truth, as it turned out, was that the Lodash CDN link was hard-coded directly into the html page’s script element’s src attribute value. The dependency manager was indeed downloading the compatible version of Lodash, but since the CDN link was hard-coded, there was no way of actually using the correct version. It was kind of strange; a dependency was being downloaded which had 0% chance of actually being used, almost acting as a decoy.

Taking this all into account, a plausible explanation for this unjust tragedy emerged: There was a dependency mis-match the entire time, and a minor version update to the charting library used a deprecated feature in the mis-matched dependency, breaking the entire system. The internal clone failed to replicate the bug because of some minor differences in how dependencies were specified in the html file’s script elements. This bug took so long to uncover because the downloaded dependencies misrepresented the actual dependencies as some were hard-coded in, silently overriding the cautious choices of the dependency manager.

But why did this all happen after the patch? It had happened that the charting library has a release schedule different from ours, on every other Friday. So when the a snapshot was taken on that Tuesday, the current version of our charting library was included. Unbeknown to us, a minor version update was made to the charting library after the snapshot was taken but before the patch was made. After the patch was made, the system had to be rebuilt. When the system rebuilt, the just-released version of the charting library was included. Since managing our dependencies is complicated and tedious, we allow our dependency manager some flexibility in choosing a suitable version. For safety, we ask that the latest minor version of a given major version be used. A minor version update, according to semantic versioning, is supposed to be backwards compatible. It had happened that some other logic was refactored, calling a function that was deprecated in newer versions of a Lodash but allowed in the specified version requested by the charting library. The result was that an innocuous update uncovered a silent bug and uncovered a sort of loophole in our testing process. In the event that patches are made, the system build before v. after is thought to be the same in all other aspects unrelated to the patch. Not so in this case.

The fix was quick but unsatisfying. Downgrading the hard-coded version used in Lodash used made us compliant and got things working again. But did we really need to hard-code this version? Couldn’t we use the CDN resolver to do it properly? Unfortunately, not easily. The CDN resolver was buggy. I had known this tool was deprecated and suggested we use the newer CDN tool that was actively maintained. But that would have meant switching to a new build system framework, enough work to be its own project. Using this deprecated tool was merely an extended trial before switching to a better tool. It was useful enough to keep but not painful enough to discard, yet. For now, that migration didn’t take priority over other projects with measurable results to the bottom line.

Fixing a bug isn’t enough though. We just learned the system didn’t work in the way we expected in a very unpleasant way whose solution was protracted and overly laborious to arrive at. A good question to ask is How can we modify the system to correct for this class of bugs? The proximate bug, if you will, was over-riding the decisions of the dependency manager due to a broken CDN resolver. Replacing the CDN resolver isn’t something that can be done yet, so at best we can make a mental note. But what we can do is demystify our dependencies, somehow. One colleague remarks “there must be a way to print the version of the library in code.” As it turns out, he’s right. Lodash does have a function which can print the version. This seems like a useful feature, and it turns out that most of the other core dependencies can do this as well. So now the plan is to print all of the library dependencies in the console log on rendering. Since we rarely add or remove new dependencies this is likely to serve us well without much maintenance. The next developer trying to make sense of an unusual bug won’t have to discover the dependency trickery even if we do risky things like using a deprecated tool which forces us to hack our way to a functional product.

I see a brighter, more sane future here. We flag the CDN resolver as “risky” and push for the upgrade as soon as politically feasible. All dependencies used are printed in the console abstracting the byzantine process of actually choosing which dependencies to use. Printing in the console is a good idea because developers will immediate check it for important log messages when the system isn’t behaving as expected. Now the next guy won’t need to learn things the hard way, because it’s already been done once. Learning things the hard way is an overly long, frustrating way to get things done, and engineering time is valuable. I imagine that this feature will be useful, and will save some developer several hours in the future. And, if I ever have a problem like this again, it will also help me.