How to use Hamilton with Pandas in 5 minutes
Hamilton is an open source, declarative dataflow micro-framework for Python. In this post I’ll attempt to quickly explain what it is and how to use it with Pandas; reading this post shouldn’t take you more than five minutes. For the backstory and longer introduction, we invite you to read this TDS post.
What is Hamilton?
Hamilton is an opinionated way to write Python data transform functions. It forces two things:
1. You write declarative Python functions to encapsulate your data transform logic.
2. You decouple, i.e. separate, your transform logic from your runtime logic.
What does this mean exactly?
To point (1) above, instead of writing Pandas code that manipulates a dataframe like this:
df['age_mean'] = df['age'].mean()
df['age_zero_mean'] = df['age'] - df['age_mean']
df['age_std_dev'] = df['age'].std()
df['age_zero_mean_unit_variance'] = df['age_zero_mean'] / df['age_std_dev']
You instead write it like this:
# a_function_module.py
import pandas as pd

def age_mean(age: pd.Series) -> float:
    """Average of age"""
    return age.mean()

def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean

def age_std_dev(age: pd.Series) -> float:
    """Standard deviation of age."""
    # Series.std() returns a scalar, so the return type is float
    return age.std()

def age_zero_mean_unit_variance(age_zero_mean: pd.Series, age_std_dev: float) -> pd.Series:
    """Zero mean unit variance value of age"""
    return age_zero_mean / age_std_dev
There is no central dataframe object to manipulate. You instead think in “columns”, and only what you need to express to create a particular column.
More specifically, the functions declare what they output, using the function name, and declare what they require as input, using the function arguments. The logic of the computation is wholly contained within the function; that’s what we mean by encapsulation. The function name is important because it maps to what you can request as output.
A quick way to remember how a function maps to a Pandas dataframe is that function names map to columns, and function arguments map to either other columns or other inputs.
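To make the name-to-argument mapping concrete, here is a toy sketch of how outputs could be computed by matching argument names to function names. This is not how Hamilton is implemented; `toy_resolve` is a hypothetical helper written only for illustration, and the sample `age` values are made up.

```python
import inspect

import pandas as pd


def age_mean(age: pd.Series) -> float:
    return age.mean()


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    return age - age_mean


def age_std_dev(age: pd.Series) -> float:
    return age.std()


def age_zero_mean_unit_variance(age_zero_mean: pd.Series, age_std_dev: float) -> pd.Series:
    return age_zero_mean / age_std_dev


def toy_resolve(name, functions, inputs, cache=None):
    """Hypothetical helper: compute `name` by resolving each argument by its name."""
    cache = {} if cache is None else cache
    if name in inputs:  # provided at runtime, e.g. the raw `age` column
        return inputs[name]
    if name not in cache:
        fn = functions[name]
        # Each argument name is either a runtime input or another function's name.
        kwargs = {
            arg: toy_resolve(arg, functions, inputs, cache)
            for arg in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


functions = {
    f.__name__: f
    for f in (age_mean, age_zero_mean, age_std_dev, age_zero_mean_unit_variance)
}
inputs = {"age": pd.Series([20.0, 30.0, 40.0])}  # made-up sample data
result = toy_resolve("age_zero_mean_unit_variance", functions, inputs)
print(result)
```

Requesting `age_zero_mean_unit_variance` pulls in `age_zero_mean` and `age_std_dev`, which in turn pull in `age_mean` and `age`, purely because of how the arguments are named.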
To point (2), your transform functions live in one or more Python modules that are independent of the script you use to get a dataframe. That is, to actually compute the above, you’d need to write a file that looks like this:
import pandas as pd
from hamilton import driver

import a_function_module  # where your transforms live

dr = driver.Driver({'age': pd.Series([...])}, a_function_module)
# Here we're only requesting two outputs, but you could additionally
# request `age_mean`, `age_zero_mean`, `age_std_dev` ...
df = dr.execute(['age', 'age_zero_mean_unit_variance'])
This forces you to follow a software engineering principle known as decoupling. Decoupling is a great idea with data work, because it means your transform functions don’t care how you load or get the data; they only care about computation. This makes it very easy to reuse your Hamilton data transform functions in other contexts, all without stepping on anyone else’s toes.
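As a rough illustration of that decoupling, the sketch below feeds the same two transform functions from two different sources. It calls the functions directly rather than through Hamilton, and the in-memory values and CSV text are made-up stand-ins for whatever loading code you actually have.

```python
import io

import pandas as pd


def age_mean(age: pd.Series) -> float:
    """Average of age"""
    return age.mean()


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean


# Source 1: a series built in memory (made-up values).
age_from_memory = pd.Series([20.0, 30.0, 40.0], name="age")

# Source 2: CSV text standing in for a file on disk (made-up values).
csv_text = "age\n20\n30\n40\n"
age_from_csv = pd.read_csv(io.StringIO(csv_text))["age"]

# The transform functions are identical in both cases; only the loading differs.
centered_memory = age_zero_mean(age_from_memory, age_mean(age_from_memory))
centered_csv = age_zero_mean(age_from_csv, age_mean(age_from_csv))
print(centered_memory.tolist(), centered_csv.tolist())
```

Nothing in the transform functions had to change to support the second source, which is the point of keeping loading logic out of them.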
Benefits to using Hamilton
Without diving into detail, though it should become obvious once you think about it, here are some benefits the Hamilton way of doing things brings:
All Hamilton data transform functions are 100% unit testable. The declarative nature of the functions means it’s really easy to pass in test data!
All Hamilton data transform functions are 100% documentation friendly. Each function has a purposeful spot to place documentation. Imagine a data engineering/science code base with good documentation 😲?!
Reusable transform code. Because decoupling is forced from the beginning, you as a user curate Python modules with your transform functions. Since they reside in modules, it’s easy to reuse that code in other contexts, be it for different “driver” scripts, or simply as a library.
Code base uniformity. With the above points, a feature engineering code base written with Hamilton looks very standardized, and as a result it’s easy to maintain.
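To illustrate the unit-testing point, here is a minimal sketch of a test for one transform function. The test name and the sample values are assumptions made for this example; in practice you would import the function from your own module and run the test with your usual test runner.

```python
import pandas as pd


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean


def test_age_zero_mean():
    # Inputs are plain arguments, so test data is passed in directly.
    age = pd.Series([1.0, 2.0, 3.0])
    actual = age_zero_mean(age, age_mean=2.0)
    expected = pd.Series([-1.0, 0.0, 1.0])
    pd.testing.assert_series_equal(actual, expected)


test_age_zero_mean()
```

Because the function takes `age_mean` as an argument instead of computing it from a shared dataframe, the test can supply any value it likes without setting up other columns.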
How do I use Hamilton?
Step 1. Install Hamilton
To get started, install Hamilton:
pip install sf-hamilton
If you’re using a Notebook environment, see this post on how to use Hamilton with a Notebook.
Step 2. Write your python transforms
Create a Python file, e.g. my_module.py, and fill it with your transform functions. Remember that each function name maps to a column you could request as output, and that the function arguments map to other columns, or inputs that you provide at runtime.
import pandas as pd

def age_mean(age: pd.Series) -> float:
    """Average of age"""
    return age.mean()

def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean

def age_std_dev(age: pd.Series) -> float:
    """Standard deviation of age."""
    # Series.std() returns a scalar, so the return type is float
    return age.std()

def age_zero_mean_unit_variance(age_zero_mean: pd.Series, age_std_dev: float) -> pd.Series:
    """Zero mean unit variance value of age"""
    return age_zero_mean / age_std_dev
Step 3. Write a “driver” script to bring everything together
Create a python file, e.g. my_driver.py.
import pandas as pd
from hamilton import driver

import my_module

dr = driver.Driver({'age': pd.Series([...])}, my_module)
# execute() returns a dataframe by default
df = dr.execute(['age_zero_mean_unit_variance'])
# do something with the dataframe
Step 4. Run the “driver” script
python my_driver.py
That command is all that’s needed to run everything!
If you’re using an orchestration/scheduling system like Airflow, Metaflow, Kubeflow Pipelines, Prefect, etc., you would put the contents of the driver script into a “step” of your ETL to be executed.
Other things you can do with Hamilton
Because this is a quick post, we’re not going to dive into the details here. But, to pique your interest, with Hamilton you get the following features out of the box when using Pandas:
Scaling to “big data” sizes because Hamilton has support for Dask, Pandas on Spark, and Ray, practically for free. See the Hamilton examples folder for the code & here for some documentation.
Data lineage. You can visualize the steps of computation, and ask questions of your graph with Hamilton easily. Try using dr.visualize_execution(...).
Ability to decorate functions, e.g. conditionally include a function based on configuration. Rather than litter your code with if else statements, e.g. if region == 'US', you can instead capture that logic by decorating a function with @config.when. This is just one of several decorators that provide functionality and extra capabilities with Hamilton. See our docs for more information.
Hamilton’s use is not limited to Pandas. You can use it to create numpy matrices, scikit-learn models, any Python object, etc.! See our examples folder for ideas.
In Closing
Thanks for reading this post. We’d love to make you successful in using Hamilton. Feel free to leave issues/comments in our GitHub repository (we’d love a ⭐️ too!), or join us in our Slack server to ask for help, or offer suggestions/improvements.