Hidden Technical Debt in Machine Learning Systems

  • Category: Article
  • Created: January 17, 2022 3:28 PM
  • Status: Open
  • Updated: January 17, 2022 5:02 PM
  • url: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

New Metaphors

Technical debt: the long-term costs incurred by moving quickly in software engineering.

Smell: in software engineering, a design smell that may indicate an underlying problem in a component or system.


  1. This paper argues it is dangerous to think of machine learning’s quick wins as coming for free. Using the software engineering framework of technical debt, the authors find it is common to incur massive ongoing maintenance costs in real-world ML systems.
  2. Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.
  3. The authors argue that ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code plus an additional set of ML-specific issues. This debt may be difficult to detect because it exists at the system level rather than the code level.


  1. This paper does not offer novel ML algorithms, but instead seeks to increase the community’s awareness of the difficult tradeoffs that must be considered in practice over the long term.


Complex Models Erode Boundaries

Traditional software engineering practice has shown that strong abstraction boundaries using encapsulation and modular design help create maintainable code. However, it is difficult to enforce strict abstraction boundaries for machine learning systems by prescribing specific intended behavior.

ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation.


  1. Machine learning systems mix signals together, entangling them and making isolation of improvements impossible.
  2. The authors refer to this here as the CACE principle: Changing Anything Changes Everything.
  3. One possible mitigation strategy is to isolate models and serve ensembles. But relying on the combined outputs creates its own entanglement: improving an individual component model may actually make system accuracy worse if its remaining errors become more strongly correlated with the other components.
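A toy illustration of this entanglement (all numbers are invented, not from the paper): in a two-model averaging ensemble, one component can get strictly better on its own while the ensemble gets worse, because the old component's errors happened to compensate the other model's.

```python
# Invented numbers illustrating CACE in an averaging ensemble:
# "improving" one component model can reduce system accuracy.

def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples where the thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def ensemble(a, b):
    """Average the two component models' scores."""
    return [(x + y) / 2 for x, y in zip(a, b)]

labels  = [1, 1, 0, 0]
model_a = [0.9, 0.3, 0.1, 0.4]    # 3/4 correct on its own
b_old   = [0.2, 0.8, 0.3, 0.45]   # 3/4 correct; strongly compensates A's miss
b_new   = [0.6, 0.6, 0.4, 0.45]   # 4/4 correct on its own -- a clear "win"

print(accuracy(ensemble(model_a, b_old), labels))  # 1.0
print(accuracy(ensemble(model_a, b_new), labels))  # 0.75 -- system got worse
```

The new component no longer pushes hard against the other model's mistake, so the averaged score flips on that example even though both individual accuracies are at least as good as before.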

Correction Cascades

  1. A correction model learned on top of a base model \(m_a\) creates a new system dependency on \(m_a\), making it significantly more expensive to analyze improvements to that model in the future.
  2. A correction cascade can create an improvement deadlock, as improving the accuracy of any individual component actually leads to system-level detriments.
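A minimal sketch of the dependency a correction cascade creates (the functions and coefficients here are invented): once a correction model is learned on top of \(m_a\)'s output, any change to \(m_a\) silently changes the corrected model as well.

```python
# Invented toy models: m_a_prime is a correction model stacked on m_a's
# output rather than a model trained fresh for the new problem.

def m_a(x):
    """Base model, solving the original problem."""
    return 2.0 * x

def m_a_prime(x):
    """Correction model for a slightly different problem: it consumes
    m_a's output and adjusts it. Any change to m_a now shifts m_a_prime
    too, so m_a can no longer be improved or analyzed in isolation."""
    return m_a(x) + 0.5 * x

print(m_a_prime(3.0))  # 7.5
```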

Undeclared Consumers

  1. Without access controls, some of these consumers may be undeclared, silently using the output of a given model as an input to another system. In more classical software engineering, these issues are referred to as visibility debt.
  2. In practice, this tight coupling can radically increase the cost and difficulty of making any changes to \(m_a\) at all, even if they are improvements.

Data Dependencies Cost More than Code Dependencies

Unstable Data Dependencies

  1. Some input signals are unstable, meaning that they qualitatively or quantitatively change behavior over time.
  2. One common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal.
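A minimal sketch of that versioning strategy (the class and signal names are hypothetical): consumers pin an explicit frozen version of a signal instead of reading whatever is latest, so upstream changes cannot silently alter their inputs.

```python
# Hypothetical registry illustrating versioned signal copies as a
# mitigation for unstable data dependencies.

class SignalRegistry:
    def __init__(self):
        self._versions = {}  # signal name -> {version: transform}

    def publish(self, name, version, transform):
        self._versions.setdefault(name, {})[version] = transform

    def get(self, name, version):
        # Consumers must name a frozen version explicitly.
        return self._versions[name][version]

registry = SignalRegistry()
registry.publish("sentiment", "v1", lambda text: 1.0 if "good" in text else 0.0)
# Upstream later ships an improved mapping without breaking pinned consumers.
registry.publish("sentiment", "v2", lambda text: 0.5 if "ok" in text else 0.0)

pinned = registry.get("sentiment", "v1")
print(pinned("good day"))  # 1.0, regardless of later versions
```

The cost, as the paper notes for versioning in general, is maintaining multiple copies of the same signal over time.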

Underutilized Data Dependencies

  1. Underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system unnecessarily vulnerable to change, sometimes catastrophically so, even though they could be removed with no detriment.
  2. Underutilized data dependencies can creep into a model in several ways.
    • Legacy Features. The most common case is that a feature \(F\) is included in a model early in its development. Over time, \(F\) is made redundant by new features but this goes undetected.
    • Bundled Features. Sometimes, a group of features is evaluated and found to be beneficial. Because of deadline pressures or similar effects, all the features in the bundle are added to the model together, possibly including features that add little or no value.
    • \(\epsilon\)-Features. As machine learning researchers, it is tempting to improve model accuracy even when the accuracy gain is very small or when the complexity overhead might be high.
    • Correlated Features. Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. This results in brittleness if world behavior later changes the correlations.
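The paper suggests that such dependencies can be detected via exhaustive leave-one-feature-out evaluations. A rough sketch (the evaluator and feature names are invented):

```python
# Illustrative leave-one-feature-out audit: flag features whose removal
# barely moves the validation metric. The evaluator below is a toy stand-in
# for retraining and re-evaluating the model on a feature subset.

def leave_one_out_report(features, evaluate, epsilon=0.001):
    """evaluate(feature_subset) -> validation metric (higher is better).
    Returns the features whose incremental benefit is below epsilon."""
    baseline = evaluate(features)
    gains = {}
    for f in features:
        reduced = [g for g in features if g != f]
        gains[f] = baseline - evaluate(reduced)  # incremental benefit of f
    return {f: gain for f, gain in gains.items() if gain < epsilon}

def toy_evaluate(subset):
    # Made-up per-feature contributions; "legacy_f" adds nothing.
    contribution = {"recent_clicks": 0.05, "query_length": 0.02, "legacy_f": 0.0}
    return 0.7 + sum(contribution[f] for f in subset)

print(leave_one_out_report(["recent_clicks", "query_length", "legacy_f"],
                           toy_evaluate))
# {'legacy_f': 0.0} -- a candidate for removal
```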

Static Analysis of Data Dependencies

  1. Tools for static analysis of data dependencies are far less common, but are essential for error checking, tracking down consumers, and enforcing migration and updates.
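A sketch of the kind of annotation such tooling relies on (the graph contents are invented): if every consumer declares its direct data inputs, the transitive closure of its dependencies can be computed mechanically.

```python
# Illustrative dependency graph: consumer -> direct data dependencies.
from collections import deque

deps = {
    "ranking_model": ["click_signal", "spam_score"],
    "spam_score": ["raw_logs"],
    "click_signal": ["raw_logs"],
}

def transitive_deps(node):
    """All data sources a node ultimately depends on (BFS over the graph)."""
    seen, queue = set(), deque(deps.get(node, []))
    while queue:
        d = queue.popleft()
        if d not in seen:
            seen.add(d)
            queue.extend(deps.get(d, []))
    return seen

print(sorted(transitive_deps("ranking_model")))
# ['click_signal', 'raw_logs', 'spam_score']
```

With this graph in place, questions like "who is affected if `raw_logs` changes?" become mechanical queries rather than archaeology.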

Feedback Loops

One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released.

Direct Feedback Loops

A model may directly influence the selection of its own future training data.
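A toy simulation of such a loop (all numbers invented): the model picks which item to show, only shown items can generate clicks, and those clicks become the next round's training data, reinforcing the initial choice.

```python
# Invented toy loop: a model that selects its own training data can lock
# in an early, essentially arbitrary preference.

scores = {"a": 0.51, "b": 0.49}          # near-tie between two items
clicks = {"a": 0, "b": 0}

for _ in range(10):
    shown = max(scores, key=scores.get)  # the model selects what is shown
    clicks[shown] += 1                   # only the shown item can be clicked
    scores[shown] += 0.01                # "retraining" nudges it further up

print(clicks)  # {'a': 10, 'b': 0} -- item "b" never gets a chance
```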

Hidden Feedback Loops

A more difficult case is hidden feedback loops, in which two systems influence each other indirectly through the world.

ML-System Anti-Patterns

Glue Code

  1. ML researchers tend to develop general purpose solutions as self-contained packages.
  2. Glue code is costly in the long term because it tends to freeze a system to the peculiarities of a specific package.
  3. In this way, using a generic package can inhibit improvements, because it makes it harder to take advantage of domain-specific properties or to tweak the objective function to achieve a domain-specific goal.

Pipeline Jungles

  1. As a special case of glue code, pipeline jungles often appear in data preparation. These can evolve organically, as new signals are identified and new information sources added incrementally.
  2. Without care, the resulting system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output.

Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles. When ML packages are developed in an ivory tower setting, the result may appear like black boxes to the teams that employ them in practice. A hybrid research approach where engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly.

Dead Experimental Codepaths

  1. It becomes increasingly attractive in the short term to perform experiments with alternative methods by implementing experimental codepaths as conditional branches within the main production code.
  2. For any individual change, the cost of experimenting in this manner is relatively low—none of the surrounding infrastructure needs to be reworked. However, over time, these accumulated codepaths can create a growing debt due to the increasing difficulties of maintaining backward compatibility and an exponential increase in cyclomatic complexity.

Abstraction Debt

There is a distinct lack of strong abstractions to support ML systems.

Common Smells

  1. Plain-Old-Data Type Smell. Rich information used and produced by ML systems is often encoded with plain data types like raw floats and integers, losing meaning (e.g., whether a value is a log-odds multiplier or a decision threshold).
  2. Multiple-Language Smell. Writing a system in multiple languages increases the cost of effective testing and the difficulty of transferring ownership.
  3. Prototype Smell. Relying on a prototyping environment at scale is often a sign that the full-scale system is brittle and difficult to change.

Configuration Debt

  1. Another potentially surprising area where debt can accumulate is in the configuration of machine learning systems.
  2. The authors have observed that both researchers and engineers may treat configuration (and extension of configuration) as an afterthought.

Dealing with Changes in the External World

One of the things that makes ML systems so fascinating is that they often interact directly with the external world. Experience has shown that the external world is rarely stable.

Fixed Thresholds in Dynamic Systems

  1. It is often necessary to pick a decision threshold for a given model to perform some action.
  2. However, such thresholds are often manually set. Thus if a model updates on new data, the old manually set threshold may be invalid.
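The paper's suggested mitigation is to learn thresholds via simple evaluation on held-out validation data, rather than setting them by hand. A minimal sketch (the candidate grid and toy data are invented):

```python
# Sketch of re-learning a decision threshold from held-out validation data
# each time the model is retrained, instead of fixing it manually.

def learn_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the best validation accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    def acc(t):
        return sum((s > t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
    return max(candidates, key=acc)

# Toy held-out data: after an update, positives cluster above ~0.3.
scores = [0.9, 0.8, 0.45, 0.35, 0.2, 0.1]
labels = [1,   1,   1,    1,    0,   0]
print(learn_threshold(scores, labels))  # 0.2 -- adapts to the new score scale
```

Re-running this after every retrain keeps the threshold consistent with the model's current score distribution.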

Monitoring and Testing

  1. Prediction Bias: In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels.
  2. Action Limits: It can be useful to set and enforce action limits as a sanity check.
  3. Up-Stream Producers: These up-stream processes should be thoroughly monitored, tested, and routinely meet a service level objective that takes the downstream ML system needs into account.
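The prediction-bias check in item 1 above can be sketched as follows (the tolerance value is an arbitrary placeholder, not from the paper):

```python
# Sketch of a prediction-bias monitor: alert when the rate of predicted
# positive labels drifts away from the observed positive-label rate.

def prediction_bias_alert(predicted, observed, tolerance=0.05):
    """Compare positive-label rates; return True if they diverge too much."""
    pred_rate = sum(predicted) / len(predicted)
    obs_rate = sum(observed) / len(observed)
    return abs(pred_rate - obs_rate) > tolerance

# Healthy system: the rates roughly agree.
print(prediction_bias_alert([1, 0, 1, 0], [1, 1, 0, 0]))  # False
# Drifted system: the model predicts far more positives than it observes.
print(prediction_bias_alert([1, 1, 1, 1], [1, 0, 0, 0]))  # True
```

As the paper notes, such a drift often points at something changing in the world (e.g., stale training data), which makes this a cheap but effective alarm.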
Other Areas of ML-related Debt

  1. Data Testing Debt
  2. Reproducibility Debt
  3. Process Management Debt
  4. Cultural Debt


Technical debt is a useful metaphor, but unfortunately it does not provide a strict metric that can be tracked over time. That a team is still able to move quickly is not in itself evidence of low debt or good practices, since the full cost of debt becomes apparent only over time.

A few useful questions to consider are:

  1. How easily can an entirely new algorithmic approach be tested at full scale?
  2. What is the transitive closure of all data dependencies?
  3. How precisely can the impact of a new change to the system be measured?
  4. Does improving one model or signal degrade others?
  5. How quickly can new members of the team be brought up to speed?

Personal thoughts

  1. The reason a machine learning system carries more hidden debt is that it has not only the technical debt of traditional software, but also an additional layer of ML-specific technical debt.
  2. The hidden debts of machine learning systems include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.