Tidy Production Pandas with Hamilton
“Tidy”, “Pandas”, & “Production” are not words that one would associate together. Read how Hamilton enables their confluence.
Writing production-grade Pandas code with Hamilton.
“Tidy” & “Pandas” are not two words one would often associate together, let alone with the word “Production”. In this post I’ll argue that if you’re using Pandas in production, you should be using Hamilton, an open source micro-framework, as it enables you to write tidy and production grade code by default.
You might be thinking: is this post about an R Tidyverse equivalent for Pandas? Yes, in the spirit of tidy code, but no in terms of how it’s achieved.
Also, just to ground our terms before we get started:
Production: By production we mean that this code needs to be run in order for a business process to function. E.g. creating features for a machine learning model, or transforming data for ingestion into a dashboarding tool.
Tidy: by tidy we mean code that is readable, maintainable and testable. We over-index a little on the ability for your code to live a long time; your production code generally lives a longer life than you intended it to, so let’s make it easy to maintain. Another way to think about it is that we want your Pandas code to easily facilitate software engineering best practices.
Pandas Production Problems
There is a lot of disagreement in the industry about whether Pandas code should ever be run in production. While it’s clearly great for quick prototyping and research, Pandas heavy codebases often end up tripping over themselves; software engineering best practices are hard to follow.
This next section should hopefully be all head nods: common pain points that are felt when using Pandas in production.
Integration testing & unit testing is a challenge
Pandas code is commonly written as a linear python script. That’s most likely because it first started as code in various cells within a Jupyter notebook, which are themselves linearly executed. This approach makes it difficult to test over time. Sure, as you’re developing the script, you’re “testing” it as you go. But once it’s in production, time can pass and data & context can change. For example, if you need to adjust production Pandas code that’s running, how do you gain confidence that the change you’re making won’t have adverse effects? Unit testing is likely non-existent, or the test coverage is spotty; writing inline Pandas manipulations is easy to do and hard to programmatically test. Integration testing, meanwhile, generally involves running your entire script. If it’s doing a lot of computation, this means a slow iteration cycle to test everything, or perhaps skipping testing altogether.
Documentation is non-existent
Documentation is critical to collaboration and code maintenance. If you’re used to writing Pandas code that creates a bunch of columns inline (because it’s easy to do) you invariably sacrifice documentation. Documentation should be easily surface-able and in sync with the code. Once you start to place documentation outside the code, it’s very easy for it to get out of date…
The code is hard to reuse
It’s common to see a single python file that contains all your Pandas code: logic to pull the data, logic to perform transforms, and logic to save the output. This means that a transform function in this script isn’t easily accessible or reusable. Some people retort with “I’ll just refactor”, but how many times does that happen? People end up cutting & pasting code, or reimplementing the same logic over and over.
No one understands your Pandas code but you
As a former data scientist, I know firsthand the horror of inheriting someone else’s code. Pandas is a powerful tool, but everyone wields it differently. With the above concerns of testing, documentation, and reusability, taking ownership of someone else’s Pandas code is daunting, to say the least.
Hamilton
Hamilton was built to solve the exact problems described above: turning a messy Pandas codebase into something tidy. No, not exactly the same as the R tidyverse that you might know and love, but in the same spirit…
What is Hamilton?
Hamilton is a declarative paradigm for specifying dataflows. That’s just an academic way of saying:
You write python functions that declare what they output and what they depend on, by encoding it directly in the function definition.
You write these python functions to specify how data and computation should flow, i.e. a dataflow (AKA pipeline/workflow).
In code this means that, rather than writing:
df['age_mean'] = df['age'].mean()
df['age_zero_mean'] = df['age'] - df['age_mean']
You write:
# a_function_module.py
import pandas as pd

def age_mean(age: pd.Series) -> float:
    """Average of age"""
    return age.mean()

def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean
You then have to write a bit of “driver” code to actually perform computation. The “driver” code’s responsibility is to instantiate what functions can and should be computed, and how.
import pandas as pd
from hamilton import driver
import a_function_module # where your transforms live
config_and_inputs = {'age': pd.Series([...])}
dr = driver.Driver(config_and_inputs, a_function_module)
# here we're constructing a data frame with only two columns
df = dr.execute(['age', 'age_zero_mean'])
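To make the driver's role more concrete, here is a toy sketch of the underlying idea: match each function's parameter names to other functions or to provided inputs, and compute only what a requested output needs. This is not Hamilton's implementation; `resolve`, the `functions` dict, and the sample data are hypothetical and exist only for illustration.

```python
import inspect

import pandas as pd


def age_mean(age: pd.Series) -> float:
    """Average of age"""
    return age.mean()


def age_zero_mean(age: pd.Series, age_mean: float) -> pd.Series:
    """Zero mean of age"""
    return age - age_mean


def resolve(name, functions, inputs, cache=None):
    """Toy resolver (hypothetical, not Hamilton's code): compute `name` by
    recursively computing the parameters its function declares."""
    cache = {} if cache is None else cache
    if name in inputs:  # raw input supplied by the caller
        return inputs[name]
    if name not in cache:
        fn = functions[name]
        # Each parameter name is treated as the name of an upstream dependency.
        kwargs = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(fn).parameters
        }
        cache[name] = fn(**kwargs)
    return cache[name]


functions = {f.__name__: f for f in (age_mean, age_zero_mean)}
inputs = {"age": pd.Series([20, 30, 40])}  # made-up sample data

result = resolve("age_zero_mean", functions, inputs)
print(result.tolist())
```

The point of the sketch is that the function signatures alone carry enough information to work out what must run, which is why the transform functions never need to know where their inputs come from.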
We’ll skip an extensive introduction to Hamilton here in favor of links to prior introductions:
Otherwise, you just need pip install sf-hamilton to get started.
Why does using Hamilton result in tidy production pandas code?
Here are the four main reasons:
Testable code, always.
Documentation friendly code, always.
Reusable logic, always.
Runtime data quality checks, always.
Let’s use the following function to discuss these points:
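A minimal sketch of such a function, pieced together from the names and decorators referenced in the sections that follow. The docstring and the division in the body are assumptions, and the `try`/`except` fallback is an addition so the snippet still runs where Hamilton is not installed.

```python
import numpy as np
import pandas as pd

try:
    from hamilton.function_modifiers import check_output, tag
except ImportError:
    # Assumption for illustration only: no-op stand-ins so this sketch
    # runs without Hamilton. They return the function unchanged.
    def _noop(**_kwargs):
        return lambda fn: fn

    check_output = tag = _noop


@tag(owner='Data-Science', pii='False')
@check_output(data_type=np.float64, range=(-5.0, 5.0), allow_nans=False)
def height_zero_mean_unit_variance(height_zero_mean: pd.Series,
                                   height_std_dev: pd.Series) -> pd.Series:
    """Zero mean unit variance value of height"""
    # Assumed body: scale the centered values by the standard deviation.
    return height_zero_mean / height_std_dev
```

Note that the signature says nothing about where `height_zero_mean` and `height_std_dev` come from; that is left to other functions or to the driver.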
Testable code, always
Hamilton forces you to write functions decoupled from specifying how data gets to the function. This means that it is inherently straightforward to provide inputs for unit testing, always. In the function above, providing height_zero_mean and height_std_dev is just a matter of coming up with a representative Pandas series for each one to test this function; in defining the function, we did not specify how the inputs are going to be provided.
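As a rough illustration, a unit test for this function can be as small as the sketch below. The function is defined inline so the snippet runs on its own; its body (a plain division) and the sample values are assumptions.

```python
import pandas as pd


def height_zero_mean_unit_variance(height_zero_mean: pd.Series,
                                   height_std_dev: pd.Series) -> pd.Series:
    """Zero mean unit variance value of height (body is an assumption)."""
    return height_zero_mean / height_std_dev


def test_height_zero_mean_unit_variance():
    # Hand-made inputs; no data loading or driver code is involved.
    height_zero_mean = pd.Series([-10.0, 0.0, 10.0])
    height_std_dev = pd.Series([10.0, 10.0, 10.0])
    actual = height_zero_mean_unit_variance(height_zero_mean, height_std_dev)
    expected = pd.Series([-1.0, 0.0, 1.0])
    pd.testing.assert_series_equal(actual, expected)


test_height_zero_mean_unit_variance()
```

A test runner such as pytest would pick up the `test_` function automatically; the direct call at the end is only there so the snippet can be run as a script.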
Similarly, one can easily test the above function end to end with Hamilton. You only need to request height_zero_mean_unit_variance from the “driver” code, and it will execute only the functions required to produce it.
That way even integration testing cycles can be relatively quick. You do not need to run your entire script and compute everything to test a single change. I.e.:
df = dr.execute(['height_zero_mean_unit_variance'])
Documentation friendly code, always.
Hamilton has four features that help with documentation:
Functions. By using functions as the abstraction, one can naturally insert documentation through the function’s docstring. This can then connect with tooling such as sphinx to surface these more broadly.
Naming. Hamilton forces naming to be front and center in your mind. Because the name of a column requested through the driver's .execute() function corresponds to a function written by you (or a colleague), descriptive, concise names evolve to be the norm. Furthermore, the code reads naturally and intuitively, from function name to function arguments; it’s very hard to name anything important `foobar` when using Hamilton.
Visualization. Hamilton can produce a graphviz file & image that give a graphical representation of how functions tie together. This is a key tool to help someone grok the big picture. For example, see the example visualization below.
Tags. Hamilton enables one to assign tags (key value pairs) by annotating a function. As the above example demonstrates, @tag(owner='Data-Science', pii='False') provides extra metadata to help code readers understand, for example, who owns the code and whether it contains personally identifiable information.
Reusable logic, always.
To be useful, the function above needs to be curated in a python module. Because that module is not coupled to the “driver” code, it is very easy to refer to that function in a variety of contexts:
Multiple drivers can use that same function. Those drivers could construct the DAG differently, e.g. by loading data from different locations. So you have code reuse from day one.
If you have a Python REPL, it’s easy to import the function and run it.
You can publish the function’s module and version it for reuse.
In addition, as Hamilton forces all core logic to be in functions that are decoupled from “driver code”, it is easy for Hamilton to provide out of the box ways to scale computation. Frameworks like Ray and Dask are straightforward to switch on and integrate. All you need to do is change a few lines in your “driver” code to do so. See Hamilton’s Ray & Dask examples for more information.
Runtime data quality checks, always.
Unit testing is valuable, but it does not replace validating assumptions in production. Rather than employing separate tasks (or even a separate function) to check your data, Hamilton enables the use of a simple decorator to run validations over the output of a function at runtime:
@check_output(data_type=np.float64, range=(-5.0, 5.0), allow_nans=False)
This makes it easy for someone who does not understand the context of the code to grok a few basic properties of the output. As the decorator lives adjacent to the transform function definition, it is much simpler to maintain. There is no separate system to update — you can do it all in a single pull request!
When the function is executed and validation fails, current options are to log a warning or throw an exception. This is a very quick and easy way to ensure what’s running in production matches your expectations.
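To illustrate the warn-or-fail behavior, here is a hand-rolled decorator in the same spirit. It is not Hamilton's implementation: `check_range` and its `importance` parameter are hypothetical names, and the function body is an assumed plain division.

```python
import functools
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_range(lower: float, upper: float, allow_nans: bool = False,
                importance: str = "warn"):
    """Hypothetical validator decorator (not Hamilton's implementation)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            problems = []
            if not allow_nans and result.isna().any():
                problems.append("output contains NaNs")
            # NaNs are handled above, so only range-check the real values.
            if not result.dropna().between(lower, upper).all():
                problems.append(f"output outside [{lower}, {upper}]")
            if problems:
                message = f"{fn.__name__}: " + "; ".join(problems)
                if importance == "fail":
                    raise ValueError(message)
                logger.warning(message)
            return result
        return wrapper
    return decorator


@check_range(-5.0, 5.0, allow_nans=False, importance="fail")
def height_zero_mean_unit_variance(height_zero_mean: pd.Series,
                                   height_std_dev: pd.Series) -> pd.Series:
    return height_zero_mean / height_std_dev


# Values inside the range pass through untouched.
ok = height_zero_mean_unit_variance(pd.Series([-1.0, 1.0]), pd.Series([1.0, 1.0]))

# A value outside the range raises because importance="fail".
try:
    height_zero_mean_unit_variance(pd.Series([50.0]), pd.Series([1.0]))
except ValueError as err:
    print(err)
```

Keeping the check next to the transform means the expectation and the logic change together, which is the maintenance benefit described above.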
For those who prefer the power of Pandera, rejoice! Data validation in Hamilton comes out of the box with a full integration — you can pass a Pandera schema to the decorator.
Additional Benefits
Aside from making your Pandas code base tidy, Hamilton also helps in these more macro aspects of your development workflow with Pandas.
Faster iteration cycles.
Once you have Hamilton up and running, the flexibility of adding, changing, and adjusting what your code does is straightforward. You can:
develop in a test driven manner.
easily test your changes by requesting only what is required to compute them
debug methodically by tracing computational data lineage. You start with the function, debug that logic, and if the problem lies elsewhere, you can iteratively recurse through the inputs of the function.
create drivers for multiple contexts very easily, leveraging the power of your previously defined functions.
Faster onboarding.
As a consequence of writing functions with various hooks for documentation, ramping up new hires becomes a much simpler task. Exploring the code base can be done graphically, and running and testing code is straightforward to explain.
Less time spent on code maintenance and upkeep
By design, Hamilton makes it easy to follow software engineering best practices. This means maintaining, inheriting, or even handing off code is very manageable. It also means that it’s simple to make all your transform logic appear uniform and aesthetically pleasing (for example see links in the next section) and keep it that way.
A realistic example
I’ve touted the benefits, but what does the code actually look like? Here are some examples:
> Combining Hamilton with Metaflow:
See normalized_features.py and feature_logic.py. 🤔 Rhetorical question: how would you feel inheriting this code?
> An example in the Hamilton repository:
Data quality (based on the example above, but includes @check_output annotations). For a Pandera example see this example instead.
To conclude
Code lives for much longer than you generally anticipate. Making sure it is easy to write and maintain, and accessible to whoever comes after you, is a key ingredient in making Pandas work in a production environment. Hamilton with Pandas helps you do all that. Using Hamilton with Pandas results in tidy production code that, no matter the author, can be maintained and scaled, both computationally (e.g. Ray, Dask) and organizationally.
We’d love for you to:
💪 try Hamilton if you have not yet. Just pip install sf-hamilton to get started.
⭐️ us on github,
📝 leave us an issue if you find something,
📣 join our community on slack — we’re more than happy to help answer questions you might have or get you started.
Other Hamilton posts you might be interested in:
Developing scalable feature engineering DAGs (Hamilton with Metaflow)
The perks of creating dataflows with Hamilton (Organic user post on Hamilton!)