To be a more effective data scientist, think in experiments


The fundamental unit of value in data science is the experiment.


Data defies comprehension. The datasets we work with contain far too many records to keep in our heads. Despite our best efforts to visualize and understand them, there will always be externalities and "weirdness" we couldn't correct for or aren't aware of. To convince yourself that this is true, read Randy Au's "Data Science Foundations: Know your data. Really, really, know it".

Machine learning algorithms, meanwhile, are non-deterministic. Google coined the acronym CACE to describe this phenomenon: "changing anything changes everything". Even small changes in data or model parameters can cause unexpected changes in model performance, and it's extremely difficult to predict what effect any given change will have. To convince yourself that this is true, read the seminal paper "Machine Learning: The High-Interest Credit Card of Technical Debt".

Together, these two facts present a problem for data scientists. If our goal is to build a great model or perform a great analysis, and we neither fully understand the first-order effects in our data, nor the second-order effects in our models, how do we actually "do" effective data science?

The answer: use the scientific method!

The field of science has been grappling for millennia with the problem of determining hard facts about complex systems that nobody fully understands. In science, advancing what we know starts with posing a hypothesis, then designing and performing experiments that attempt to validate or disprove it.

How does this translate to data science?

Every machine learning model you build starts off as something simple, well-understood, and not very performant. You then make a sequence of choices about how to improve it. Choices are hypotheses: adding this feature will improve the performance of my model; lowering the dropout will smooth out my loss curve. Running the experiment means training the updated model. Finally, validating or invalidating your hypothesis means evaluating the loss (or some other metric) of that model. If your hypothesis is correct, you've made progress on a metric you care about, and you now have a new, better model in your "model lineage". If your hypothesis is incorrect, you discard or set aside the new model and return to iterating on the old one.
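The loop above can be sketched in a few lines of code. This is a minimal illustration, not a real training pipeline: the configs, the `evaluate` function, and its scoring rule are all hypothetical stand-ins for a real training job and metric.

```python
# A minimal sketch of the hypothesize -> experiment -> evaluate loop.
# Configs and the scoring rule below are illustrative assumptions.
best_config = {"dropout": 0.5, "features": ["a", "b"]}

def evaluate(config):
    # Hypothetical metric: pretend more features and dropout near 0.3 help.
    return len(config["features"]) - abs(config["dropout"] - 0.3)

best_score = evaluate(best_config)
lineage = [(best_config, best_score)]  # our growing "model lineage"

hypotheses = [
    {"dropout": 0.3, "features": ["a", "b"]},       # lowering dropout helps?
    {"dropout": 0.3, "features": ["a", "b", "c"]},  # adding a feature helps?
    {"dropout": 0.9, "features": ["a"]},            # a dead end
]

for candidate in hypotheses:
    score = evaluate(candidate)        # "running the experiment"
    lineage.append((candidate, score)) # every attempt is recorded
    if score > best_score:
        # hypothesis validated: adopt the new model
        best_config, best_score = candidate, score
    # otherwise: hypothesis invalidated, keep iterating on the old one
```

Note that failed experiments still land in `lineage`: knowing what didn't work is part of the record.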

What this means for data scientists

"All models are bad, but some are useful". If we can agree that thinking about data science as experimentally driven is a useful framework, we can make some interesting observations about how data science works as a field (and how you can become better at it).

One observation has to do with costs. Classical science experiments require serious time and hardware: Petri dishes and survey groups on the low end, satellite telescopes and particle colliders on the high. This is why the artifact that comes at the end of a science experiment, the scientific paper, is considered so important. In data science, by contrast, each individual experiment is extremely cheap: you just run a job on your computer, or increasingly on the cloud, until it completes. And unless you're a fancy PhD machine learning researcher, the goal isn't adding to the sum of human knowledge; it's building a performant model.

As a result, the road to a satisfactory model is paved with many individual experiments.

Data science experimentation is non-linear. For example, suppose you spend a lot of time iterating on a particular idea, only to determine that it doesn't actually improve your performance. You shelve the idea and go back to iterating on the last "good" version of the model. The approach that failed is a "branch" that "forks" off an earlier version of the model. Nothing stops you from plucking ideas from that branch and trying them out again later. Over time, your "model lineage" becomes a complicated directed graph of things that worked, things that didn't, and everything in between.
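A model lineage like this is easy to picture as a directed graph. The sketch below is a toy illustration under assumed names and metrics (none of it comes from a real project): each node is an experiment, each node points at the experiment it forked from, and failed branches stay in the graph.

```python
# A "model lineage" as a directed graph. Each entry records which
# experiment it forked from, its metric, and whether it was kept.
# All names and AUC values here are made up for illustration.
lineage = {
    "baseline":      {"parent": None,            "auc": 0.71, "kept": True},
    "add-feature-x": {"parent": "baseline",      "auc": 0.74, "kept": True},
    "lower-dropout": {"parent": "baseline",      "auc": 0.69, "kept": False},  # failed branch
    "retry-dropout": {"parent": "add-feature-x", "auc": 0.76, "kept": True},   # idea revived later
}

def path_to_root(name):
    """Walk a model's ancestry back to the original baseline."""
    path = []
    while name is not None:
        path.append(name)
        name = lineage[name]["parent"]
    return path
```

Here `path_to_root("retry-dropout")` recovers the chain of experiments that produced the current best model, while the failed `"lower-dropout"` branch remains available to revisit.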

Experimentation also allows us to better understand what people mean when they talk about the mythical "art of data science".

The art of data science is the art of generating good hypotheses.

A "good" data scientist knows which avenues they can explore that are likeliest to actually have an impact on model performance. They are able to make hypotheses, and run experiments, which have the highest probability of actually improving their model performance. This "sense" of what the "right" thing to do is comes from experience first, from understanding the dataset second, and from technical expertise third. Being an good "artist" translates into being a good worker: you spent less time trying out dead ends, and more time making valuable improvements to your models.

This is an "art" because it’s not something that’s easily written down. You can’t learn it in a university program, or by doing complicated math in a research paper. The only way to get good at the "art" of data science is to build lots of models on lots of different problems, iterating on your sense of "what’s right" until it starts to point you in the right direction more often than the wrong one.

What this means for data science tools

Thinking of data science as experimentally driven has important implications for the tools we build.

Effective data science tools hold experimentation core to their design. That means doing things like:

- making individual experiments cheap to launch, run, and repeat
- automatically tracking the parameters, code, and metrics attached to every run
- versioning data and models alongside code, so any experiment can be reproduced or reverted
- making it easy to compare runs and fork from any earlier "good" model

And so on.

Parting thoughts

I’m not the first person to bring up this idea. Kaggle grandmasters like Owen Zhang ("How to win Kaggle") have been talking about it for years as a part of the "secret sauce" for winning a Kaggle competition. Roger Peng wrote a great piece presenting a broader framework for data science titled "Divergent and convergent phases of data analysis". And I think it’s telling that new tools like DVC and CometML are very deliberate about how they improve your experiment-making in their product descriptions.

— Aleksey