---
theme: black
title: Test your (data science) work!
revealOptions:
  transition: fade
---

Software Testing in Open Source and Data Science!

https://tinyurl.com/test-sdm

Eric J. Ma


🤔 My goals today

  • To share hard-won lessons born of failures in my past.
  • To encourage you to embrace testing as part of your workflow.

🙋🏻‍♂️ whoami

  • 📍 Principal Data Scientist, DSAI, Moderna
  • 🎓 ScD, MIT Biological Engineering.
  • 🧬 Inverse protein, mRNA, and molecule design.
  • 🧠 Accelerate and enrich insights from data.

🕝 tl;dr

If you write automated tests for your work, then:

  • ⬆️ Your work quality will go up.
  • 🎓 Your work will become trustworthy.

👀 also...

  • Tests apply to all software.
  • Data science work is software work.
  • Tests apply to data science.

⭕️ Outline

  • Testing in Software
  • Testing in Data Science

💻 Testing in Software

  • 🤔 Why do testing?
  • 🧪 What does a test look like?
  • ✍️ How do I make the test automated?
  • 💰 What benefits do I get?
  • 👆 What kinds of tests exist?

🤔 Why do testing?

Tests help falsify the hypothesis that our code works.


Without tests, our belief that our code works remains an untested assumption.


🧪 What does a test look like?


➡️ Given a function

import pandas as pd

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cleaned_columns = []
    for column in df.columns:
        column = (
            str(column)
            .lower()
            .replace(" ", "_")
            .strip("_")
        )
        cleaned_columns.append(column)
    df.columns = cleaned_columns
    return df

➡️ We test for expected behaviour

def test_clean_names():
    # Arrange
    df = pd.DataFrame(
        columns=["Apple", "banana", "Cauliflower Sunshine"]
    )

    # Act
    df_cleaned = clean_names(df)

    # Assert
    assert list(df_cleaned.columns) == \
        ["apple", "banana", "cauliflower_sunshine"]

    # Cleanup: nothing needed in this case

Read: Anatomy of a Test


✍️ How do I make tests automated?


📦 Install pytest

Update your environment configuration:

name: project_env  # your project environment!
channels:
- conda-forge
dependencies:
- python>=3.9
- ...
- pytest>=7.1  # add an entry here!

Then run:

mamba env update -f environment.yml

🏃‍♂️ Run pytest

With pytest installed, use it to run your tests:

cd /path/to/my_project
conda activate my_project
pytest .

Run checks on every commit

Use GitHub Actions to run your tests on every commit. Don't merge unless all tests pass.


💰 What benefits do I get?


🚇 Changes happen

from string import punctuation
import re

import pandas as pd

def clean_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cleaned_columns = []
    for column in df.columns:
        column = (
            str(column)
            .lower()
            .replace(" ", "_")
            .strip("_")
        )
        # 👀 CHANGE HAPPENS HERE!
        column = re.sub(f"[{re.escape(punctuation)}]", "_", column)
        cleaned_columns.append(column)
    df.columns = cleaned_columns
    return df

✅ Guarantee expectations

pytest

If the test fails, we falsify our assumption that the change does not break expected behaviour.


💡 Update expectations

def test_clean_names():
    # Arrange
    df = pd.DataFrame(
        # 👀 change made here!
        columns=["Apple.Sauce", "banana", "Cauliflower Sunshine"]
    )

    # Act
    df_cleaned = clean_names(df)

    # Assert (👀 change made here!)
    assert list(df_cleaned.columns) == \
        ["apple_sauce", "banana", "cauliflower_sunshine"]

    # Cleanup: nothing needed here

We update the test to establish new expectations.


💰 Benefits of Testing

  1. ✅ Guarantees against breaking changes.
  2. 🤔 Example-based documentation for your code.

Testing is a contract between yourself (now) and yourself (in the future).


👆 What kind of tests exist?


1️⃣ Unit Test

def func1(data):
    ...
    return stuff

def test_func1(data):
    stuff = func1(data)
    assert stuff == ...

A test that checks that an individual function works correctly. Strive to write this type of test!


2️⃣ Execution Test

def func1(data):
    ...
    return stuff

def test_func1(data):
    func1(data)

A test that only checks that a function executes without erroring. Use only in a pinch.


3️⃣ Integration Test

def func1(data):
    ...
    return stuff

def func2(data):
    ...
    return stuff

def pipeline(data):
    return func2(func1(data))

def test_pipeline(data):
    output = pipeline(data)
    assert output == ...

A test that checks that a whole system works properly. Use sparingly if these tests take a long time to run!


🧔‍♂️ Hadley says...

<iframe width="560" height="315" src="https://www.youtube.com/embed/cpbtcsGE0OA" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

You can't do data science in a GUI...


💻 Data science needs code

>>> code == software
True-ish

...implying that you'll be writing some kind of software to do data science work!


👀 Test your code

Testing your DS code will be good for you!


😎 Testing in Data Science

  • Machine Learning Model Code
  • Data
  • Pipelines

🧠 Testing Machine Learning Model Code

from project.models import Model
from project.data import DataModule
from project.trainers import default_trainer


model = Model()
dm = DataModule()
trainer = default_trainer()
trainer.fit(model, dm)

👆 What do we need guarantees on?

model = Model()
dm = DataModule()

dm must serve up tensors of the shape that model accepts.


🤔 What can we test here?

  1. Unit test: dm produces correctly-shaped outputs when executed.
  2. Unit test: Given random inputs, model produces correctly-shaped outputs.
  3. Integration test: Given dm outputs, model produces correctly-shaped outputs.
  4. Execution test: model does not fail in training loop with trainer and dm.

🟩 DataModule output shapes

def test_datamodule_shapes():
    # Arrange
    batch_size = 3
    input_dims = 4
    dm = DataModule(batch_size=batch_size)

    # Act
    x, y = next(iter(dm.train_dataloader()))

    # Assert
    assert x.shape == (batch_size, input_dims)
    assert y.shape == (batch_size, 1)

🟦 Model input/output shapes

from jax import random, vmap, numpy as np

def test_model_shapes():
    # Arrange
    key = random.PRNGKey(55)
    batch_size = 3
    input_dims = 4
    inputs = random.normal(key, shape=(batch_size, input_dims))
    model = Model(input_dims=input_dims)

    # Act
    outputs = vmap(model)(inputs)

    # Assert
    assert outputs.shape == (batch_size, 1)

🤝 Model and DataModules work together

def test_model_datamodule_compatibility():
    # Arrange
    dm = DataModule()
    model = Model()
    x, y = next(iter(dm.train_dataloader()))

    # Act
    pred = vmap(model)(x)

    # Assert
    assert pred.shape == y.shape

⭕️ Ensure no errors in training loop

def test_model():
    # Arrange
    model = Model()
    dm = DataModule()
    trainer = default_trainer(epochs=2)

    # Act
    trainer.fit(model, dm)

📀 Testing Data

a.k.a. Data Validation


👆 What data guarantees do we need?

def func(df):
    # The column we need is actually present
    assert "some_column" in df.columns
    # Correct dtype
    assert df["some_column"].dtype == int
    # No null values
    assert pd.isnull(df["some_column"]).sum() == 0
    # The rest of the logic
    ...

📕 Schemas to declare expectations

import pandera as pa

df_schema = pa.DataFrameSchema(
    columns={
        # Declare that `some_column` must exist,
        # that it must be integer type,
        # and that it cannot contain any nulls.
        "some_column": pa.Column(int, nullable=False)
    }
)
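Schemas can declare richer expectations than existence, dtype, and nullability. A minimal sketch, assuming hypothetical columns and constraints (value ranges, allowed categories):

import pandera as pa

# Hypothetical columns and constraints, shown only to illustrate richer checks.
richer_schema = pa.DataFrameSchema(
    columns={
        "some_column": pa.Column(int, checks=pa.Check.ge(0), nullable=False),
        "group": pa.Column(str, checks=pa.Check.isin(["control", "treatment"])),
    }
)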

🏃‍♂️ Runtime dataframe validation

def func(df):
    df = df_schema.validate(df)
    # The rest of the logic
    ...

Runtime validation code is abstracted out.

Code is much more readable.
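If you prefer to keep validation out of the function body entirely, pandera also provides decorators; a minimal sketch, assuming df_schema from the previous slide:

import pandera as pa

@pa.check_input(df_schema)
def func(df):
    # df has already been validated against df_schema on the way in.
    # The rest of the logic
    ...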


🚇 Testing Pipeline Code


💡 Pipelines are functions

def pipeline(data):
    data = df_schema.validate(data)
    d1 = func1(data)
    d2 = func2(d1)
    d3 = func3(d1)
    d4 = func4(d2, d3)
    output = outfunc(d4)
    return output_schema.validate(output)

👆 Each unit function can be unit tested

def test_func1(data):
    ...

def test_func2(data):
    ...

def test_func3(data):
    ...

def test_func4(data):
    ...
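For example, if func1 were responsible for dropping null rows (a hypothetical behaviour, chosen only for illustration), its unit test could pin that down:

import pandas as pd

def test_func1():
    # Arrange: a hypothetical input frame with one null row.
    data = pd.DataFrame({"some_column": [1, None, 3]})

    # Act
    result = func1(data)

    # Assert: null rows are gone.
    assert result["some_column"].notnull().all()
    assert len(result) == 2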

🤝 The whole pipeline can be integration tested

def test_pipeline():
    # Arrange
    data = pd.DataFrame(...)

    # Act
    output = pipeline(data)

    # Assert
    assert output == ...

We assume your pipeline is quick to run.


🕓 One more thing


💰 Mock-up Realistic Fake Data


☁️ Schema Generators

import pandera as pa
from hypothesis import given

schema = pa.DataFrameSchema(...)

@given(schema.strategy(size=3))
def test_func1(data):
    ...
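Concretely, hypothesis generates dataframes that satisfy the schema and feeds them to your test. A minimal runnable sketch with a hypothetical single-column schema:

import pandera as pa
from hypothesis import given

example_schema = pa.DataFrameSchema(
    {"some_column": pa.Column(int, pa.Check.ge(0), nullable=False)}
)

@given(example_schema.strategy(size=3))
def test_with_generated_data(df):
    # Every generated dataframe already satisfies the schema.
    assert len(df) == 3
    assert (df["some_column"] >= 0).all()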

🎲 Probabilistic Modelling

import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu")
    sigma = pm.Exponential("sigma")
    pm.Normal("observed", mu=mu, sigma=sigma, observed=data)

    idata = pm.sample()
    idata.extend(pm.sample_posterior_predictive(idata))

# idata.posterior_predictive now contains
# simulated data that looks like your original data!
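One way (a sketch, not the only one) to pull a simulated dataset out of idata for use as fake test data:

# Take a single posterior-predictive draw as a realistic fake dataset.
# "observed" matches the variable name declared in the model above.
fake_data = (
    idata.posterior_predictive["observed"]
    .sel(chain=0, draw=0)
    .values
)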

☁️ Philosophy

Integrating testing into your work is one manifestation of defensive programming.


1️⃣ Testing raises quality

  • Save headaches in the long-run.
  • Improve code quality.

2️⃣ Testing is other-centric

Others can:

  • Feel confident about our code.
  • Understand where their assumptions may be incorrect.

Do unto others as you would have them do unto you.


💡 Resources

<iframe width="560" height="315" src="https://www.youtube.com/embed/7NEQApSLT1U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
How Software Skillsets Will Accelerate Your Data Science Work
<iframe width="560" height="315" src="https://www.youtube.com/embed/Dx2vG6qmtPs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Principled Data Science Workflows
<iframe width="560" height="315" src="https://www.youtube.com/embed/5RKuHvZERLY" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Testing Data Science Code

😎 Summary

  1. ✅ Write tests for your code.
  2. ✅ Write tests for your data.
  3. ✅ Write tests for your models.

Thank you! 😁