theme | title | revealOptions
---|---|---
black | Test your (data science) work! |

https://tinyurl.com/test-sdm

Eric J. Ma
- To share hard-won lessons borne out of failures in my past.
- To encourage you to embrace testing as part of your workflow.
- 📍 Principal Data Scientist, DSAI, Moderna
- 🎓 ScD, MIT Biological Engineering.
- 🧬 Inverse protein, mRNA, and molecule design.
- 🧠 Accelerate and enrich insights from data.
If you write automated tests for your work, then:
- ⬆️ Your work quality will go up.
- 🎓 Your work will become trustworthy.
- Tests apply to all software.
- Data science work is software work.
- Tests apply to data science.
- Testing in Software
- Testing in Data Science
- 🤔 Why do testing?
- 🧪 What does a test look like?
- ✍️ How do I make the test automated?
- 💰 What benefits do I get?
- 👆 What kinds of tests exist?
Tests help falsify the hypothesis that our code works.
Without testing, we will have untested assumptions about whether our code works.
import pandas as pd

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cleaned_columns = []
    for column in df.columns:
        column = (
            str(column)
            .lower()
            .replace(" ", "_")
            .strip("_")
        )
        cleaned_columns.append(column)
    df.columns = cleaned_columns
    return df
def test_clean_names():
    # Arrange
    df = pd.DataFrame(
        columns=["Apple", "banana", "Cauliflower Sunshine"]
    )
    # Act
    df_cleaned = clean_names(df)
    # Assert
    assert list(df_cleaned.columns) == \
        ["apple", "banana", "cauliflower_sunshine"]
    # Cleanup: nothing needed in this case
Read: Anatomy of a Test
Update your environment configuration:
name: project_env  # your project environment!
channels:
  - conda-forge
dependencies:
  - python>=3.9
  - ...
  - pytest>=7.1  # add an entry here!
Then run:
mamba env update -f environment.yml
With `pytest` installed, use it to run your tests:
cd /path/to/my_project
conda activate my_project
pytest .
Use GitHub Actions to check every commit. Don't merge unless all tests pass.
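A minimal workflow sketch of this idea (the file path, action versions, and Python version below are assumptions; adapt them to your project):

```yaml
# .github/workflows/tests.yml — a minimal sketch, not a definitive setup.
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest
      - run: pytest .
```

With branch protection enabled, a failing `pytest` run blocks the merge.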
from string import punctuation
import re

import pandas as pd

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cleaned_columns = []
    for column in df.columns:
        column = (
            str(column)
            .lower()
            .replace(" ", "_")
            .strip("_")
        )
        # 👀 CHANGE HAPPENS HERE!
        # Escape the punctuation characters and wrap them in a
        # character class, so the regex matches any single one of them.
        column = re.sub(f"[{re.escape(punctuation)}]", "_", column)
        cleaned_columns.append(column)
    df.columns = cleaned_columns
    return df
pytest
If the test fails, we falsify our assumption that the change does not break expected behaviour.
def test_clean_names():
    # Arrange
    df = pd.DataFrame(
        # 👀 change made here!
        columns=["Apple.Sauce", "banana", "Cauliflower Sunshine"]
    )
    # Act
    df_cleaned = clean_names(df)
    # Assert
    # 👀 change made here!
    assert list(df_cleaned.columns) == \
        ["apple_sauce", "banana", "cauliflower_sunshine"]
    # Cleanup: nothing needed here
We update the test to establish new expectations.
- ✅ Guarantees against breaking changes.
- 🤔 Example-based documentation for your code.
Testing is a contract between yourself (now) and yourself (in the future).
def func1(data):
    ...
    return stuff

def test_func1(data):
    stuff = func1(data)
    assert stuff == ...
A test that checks that an individual function works correctly. Strive to write this type of test!
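As a concrete, runnable illustration of the pattern (the function and its values here are invented for the example):

```python
def moving_average(values, window):
    """Average each consecutive `window`-sized slice of `values`."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def test_moving_average():
    # Arrange
    values = [1, 2, 3, 4]
    # Act
    result = moving_average(values, window=2)
    # Assert: check the actual output values, not just that the call ran
    assert result == [1.5, 2.5, 3.5]

test_moving_average()
```

The assertion pins down the exact expected output, which is what distinguishes a unit test from a mere execution test.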
def func1(data):
    ...
    return stuff

def test_func1(data):
    func1(data)
A test that only checks that a function executes without erroring. Use only in a pinch.
def func1(data):
    ...
    return stuff

def func2(data):
    ...
    return stuff

def pipeline(data):
    return func2(func1(data))

def test_pipeline(data):
    output = pipeline(data)
    assert output == ...
Checks that a system is working properly. Use this sparingly if the tests are long to execute!
<iframe width="560" height="315" src="https://www.youtube.com/embed/cpbtcsGE0OA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
You can't do data science in a GUI...
>>> code == software
True-ish
...implying that you'll be writing some kind of software to do data science work!
Testing your DS code will be good for you!
- Machine Learning Model Code
- Data
- Pipelines
from project.models import Model
from project.data import DataModule
from project.trainers import default_trainer
model = Model()
dm = DataModule()
trainer = default_trainer()
trainer.fit(model, dm)
model = Model()
dm = DataModule()
`dm` must serve up tensors of the shape that `model` accepts.
- Unit test: `dm` produces correctly-shaped outputs when executed.
- Unit test: Given random inputs, `model` produces correctly-shaped outputs.
- Integration test: Given `dm` outputs, `model` produces correctly-shaped outputs.
- Execution test: `model` does not fail in training loop with `trainer` and `dm`.
def test_datamodule_shapes():
    # Arrange
    batch_size = 3
    input_dims = 4
    dm = DataModule(batch_size=batch_size)
    # Act
    x, y = next(iter(dm.train_loader()))
    # Assert
    assert x.shape == (batch_size, input_dims)
    assert y.shape == (batch_size, 1)
from jax import random, vmap

def test_model_shapes():
    # Arrange
    key = random.PRNGKey(55)
    batch_size = 3
    input_dims = 4
    inputs = random.normal(key, shape=(batch_size, input_dims))
    model = Model(input_dims=input_dims)
    # Act
    outputs = vmap(model)(inputs)
    # Assert
    assert outputs.shape == (batch_size, 1)
def test_model_datamodule_compatibility():
    # Arrange
    dm = DataModule()
    model = Model()
    x, y = next(iter(dm.train_dataloader()))
    # Act
    pred = vmap(model)(x)
    # Assert
    assert pred.shape == y.shape
def test_model():
    # Arrange
    model = Model()
    dm = DataModule()
    trainer = default_trainer(epochs=2)
    # Act
    trainer.fit(model, dm)
a.k.a. Data Validation
def func(df):
    # The column we need is actually present
    assert "some_column" in df.columns
    # Correct dtype
    assert df["some_column"].dtype == int
    # No null values
    assert pd.isnull(df["some_column"]).sum() == 0
    # The rest of the logic
    ...
import pandera as pa

df_schema = pa.DataFrameSchema(
    columns={
        # Declare that `some_column` must exist,
        # that it must be integer type,
        # and that it cannot contain any nulls.
        "some_column": pa.Column(int, nullable=False)
    }
)

def func(df):
    df = df_schema.validate(df)
    # The rest of the logic
    ...
Runtime validation code is abstracted out.
Code is much more readable.
def pipeline(data):
    data = df_schema.validate(data)
    d1 = func1(data)
    d2 = func2(d1)
    d3 = func3(d1)
    d4 = func4(d2, d3)
    output = outfunc(d4)
    return output_schema.validate(output)
def test_func1(data):
    ...

def test_func2(data):
    ...

def test_func3(data):
    ...

def test_func4(data):
    ...
def test_pipeline():
    # Arrange
    data = pd.DataFrame(...)
    # Act
    output = pipeline(data)
    # Assert
    assert output == ...
We assume your pipeline is quick to run.
from hypothesis import given
import pandera as pa

schema = pa.DataFrameSchema(...)

@given(schema.strategy(size=3))
def test_func1(data):
    ...
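If you don't want the `hypothesis` dependency, the core idea — assert a property over many generated inputs rather than one hand-picked example — can be sketched with the standard library's `random` module (the `clean_name` helper is a simplified stand-in for illustration):

```python
import random

def clean_name(name: str) -> str:
    """Lowercase a column name and replace spaces with underscores."""
    return str(name).lower().replace(" ", "_").strip("_")

def test_clean_name_property():
    rng = random.Random(42)  # fixed seed keeps the test deterministic
    alphabet = "abcdefgh XYZ"
    for _ in range(100):
        name = "".join(rng.choice(alphabet) for _ in range(10))
        cleaned = clean_name(name)
        # Property: the output never contains spaces or uppercase letters.
        assert " " not in cleaned
        assert cleaned == cleaned.lower()

test_clean_name_property()
```

Hypothesis does this far better — it shrinks failing inputs to minimal counterexamples — but the property-over-examples mindset is the same.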
import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu")
    sigma = pm.Exponential("sigma", 1)  # Exponential needs a rate parameter
    pm.Normal("observed", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample()
    idata.extend(pm.sample_posterior_predictive(idata))

# idata.posterior_predictive now contains
# simulated data that looks like your original data!
Integrating testing into your work is one manifestation of defensive programming.
- Save headaches in the long-run.
- Improve code quality.
Others can:
- Feel confident about our code.
- Understand where their assumptions may be incorrect.
Do unto others what you would have others do unto you.
<iframe width="560" height="315" src="https://www.youtube.com/embed/7NEQApSLT1U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
How Software Skillsets Will Accelerate Your Data Science Work
<iframe width="560" height="315" src="https://www.youtube.com/embed/Dx2vG6qmtPs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Principled Data Science Workflows
<iframe width="560" height="315" src="https://www.youtube.com/embed/5RKuHvZERLY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Testing Data Science Code
- ✅ Write tests for your code.
- ✅ Write tests for your data.
- ✅ Write tests for your models.