Hamilton + DBT in 5 minutes
A quick walkthrough of using these two open source projects together
In this post, we’re going to show you how easy it is to run Hamilton inside a DBT task. Making use of DBT’s exciting new python API, we can blend the two frameworks seamlessly.
Hamilton is a declarative micro-framework to describe dataflows in python. For example, it’s excellent for expressing the flow of feature transformations and their connection to fitting machine learning models.
DBT is a framework for describing dataflows composed of SQL, and now, with its latest feature, even python!
While the two frameworks might look incompatible or even competitive at first glance, DBT and Hamilton are actually quite complementary.
DBT is best at managing SQL logic and handling materialization, while Hamilton excels at managing python transforms; some might even say that Hamilton is the “DBT for python functions”.
DBT contains its own orchestration capabilities, whereas Hamilton relies on an external framework that runs python code to execute what you’ve defined.
DBT does not model transformations at the level of “columns”, but instead at the level of “tables”. Hamilton thrives at enabling a user to describe “columnar” transformations, as well as “tabular” ones, in a readable, maintainable, python first way.
DBT is focused on analytic/warehouse-level transformations, whereas Hamilton thrives at expressing ML-related transforms.
At a high-level, DBT can help you get the data/run large-scale operations in your warehouse, while Hamilton can help you make a machine learning model out of it. DBT can help with the testing and documentation story for SQL. Hamilton can help with the software engineering best practices, testing, and documentation story for your python code (e.g. Tidy Production Pandas with Hamilton)!
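To make the "columnar" point concrete, here is a minimal sketch of what Hamilton-style transforms look like. The column names (sibsp, parch) are assumptions borrowed from the Titanic dataset, not code from the example repository:

```python
import pandas as pd

# Illustrative sketch of Hamilton "columnar" transforms: each function
# declares one output column, and its parameter names are the upstream
# columns it depends on.
def family_size(sibsp: pd.Series, parch: pd.Series) -> pd.Series:
    """Family members aboard, including the passenger themselves."""
    return sibsp + parch + 1


def is_alone(family_size: pd.Series) -> pd.Series:
    """1 if the passenger travelled alone, else 0."""
    return (family_size == 1).astype(int)
```

Hamilton stitches functions like these into a DAG by matching function names to parameter names, so adding a new column is just adding a new function.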
To demonstrate this, we’ve taken inspiration from xLaszlo’s code quality for DS tutorial, and re-written it using a combination of DBT + Hamilton. It uses the Titanic dataset available via scikit-learn’s openml datasets feature.
While we only specify a single python task with Hamilton to be executed by DBT, it should be enough for you to get started on your own journey to combining Hamilton with DBT!
Not familiar with DBT or Hamilton?
To keep this post short, we assume familiarity with both. However, if you want to know more about either project, here are some links:
For DBT:
For Hamilton:
Introducing Hamilton (backstory and introduction)
Putting Hamilton in a DBT Task
Pre-requisite: have your Hamilton based code developed and ready to be used.
Overview of the integration:
You define your DBT project as normal.
Create a table that will serve as input to your python dbt model.
Write the dbt python model and put the Hamilton “driver” code within it.
Continue on as you would normally with dbt.
The python code for DBT to run Hamilton:
For all the code, with README and instructions, see the DBT example in the Hamilton repository’s example folder. Do note that we chose duckdb as the dbt backend here, since it’s easy to run locally; for other backends things might differ a little, so do let us know if what we have doesn’t end up working for you. For the rest of this post we summarize the salient points to understand the integration between the two.
With DBT you define a “dbt model”, described by a python function that takes in dbt and session objects (see the dbt docs for more details). These give you access to what you’ve defined with DBT.
A quick note before looking at the code: we’re using the dbt-fal adapter to help manage python dependencies. We expect DBT’s official support to change in the future as its python support evolves (e.g. see this discussion); we’ll update this example so it’s always correct.
The code to run Hamilton within DBT looks like the following:
import pandas as pd
# import our Hamilton related functions that will define a DAG.
from python_transforms import data_loader, feature_transforms, model_pipeline
# import Hamilton modules for making it all run.
from hamilton import base, driver
def model(dbt, session):
    """A DBT model that does a lot -- it's all delegated to the Hamilton framework though.
    The goal of this is to show how DBT can work for SQL/orchestration, while Hamilton can
    work for workflow modeling (in both the micro/macro sense) and help integrate python in.

    :param dbt: DBT object to get refs/whatnot
    :param session: duckdb session info (as needed)
    :return: A dataframe containing predictions corresponding to the input data
    """
    raw_passengers_df = dbt.ref("raw_passengers")
    # Instantiate a simple graph adapter to get the base result
    adapter = base.SimplePythonGraphAdapter(base.DictResult())
    # DAG for training/inferring on titanic data
    titanic_dag = driver.Driver(
        {
            "random_state": 5,
            "test_size": 0.2,
            "model_to_use": "create_new",
        },
        data_loader,
        feature_transforms,
        model_pipeline,
        adapter=adapter,
    )
    # gather results
    results = titanic_dag.execute(
        final_vars=["model_predict"], inputs={"raw_passengers_df": raw_passengers_df}
    )
    # Take the "predictions" result, which is an np array
    predictions = results["model_predict"]
    # Return a dataframe!
    return pd.DataFrame(predictions, columns=["prediction"])
The code within the function reads pretty much like standard Hamilton “driver” code.
We:
Import the right python modules (see comments for details).
Create a “Driver”, passing in the right configuration and modules, and adapter.
Execute the code, passing in the data that DBT provides as a pandas dataframe. The code featurizes, fits a machine learning model, and then creates some predictions.
Return a dataframe of “predictions” which is just what the model predicts over the entire dataset.
For you as a practitioner, what you need to think about to integrate Hamilton with DBT is:
With DBT you define the input data set(s).
With Hamilton you define transforms to featurize the inputs, create a model, and then use it to predict over, in this case, the same data set. A simple extension to this code would be to predict over a different data set provided by dbt and return that instead.
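To make that extension concrete, here is a minimal, hypothetical sketch: the only thing that changes when scoring a different dataset is what you pass via the driver’s inputs dict. The helper below and the "new_passengers" model name are illustrative assumptions, not code from the example repository:

```python
# Hypothetical helper: fetch a dbt model by name and package it as the
# Hamilton driver's inputs. Swapping "raw_passengers" for another dbt model
# (e.g. a hypothetical "new_passengers") is all it takes to score new data.
def build_inputs(dbt, ref_name: str) -> dict:
    return {"raw_passengers_df": dbt.ref(ref_name).df()}

# Inside the dbt python model you would then call something like:
#   results = titanic_dag.execute(
#       final_vars=["model_predict"],
#       inputs=build_inputs(dbt, "new_passengers"),
#   )
```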
Running DBT with Hamilton
To run this example, you’ll need to do three things:
(1) Checkout the Hamilton repository.
$ git clone git@github.com:stitchfix/hamilton.git
$ cd hamilton
(2) Change to the dbt example directory and install the dependencies (we encourage using a new python virtual environment for this).
$ cd examples/dbt
$ pip install -r requirements.txt
(3) Execute dbt!
# Currently this has to be run from within the directory
$ dbt run
00:53:20 Running with dbt=1.3.1
00:53:20 Found 2 models, 0 tests, 0 snapshots, 0 analyses, 292 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:53:20
00:53:20 Concurrency: 1 threads (target='dev')
00:53:20
00:53:20 1 of 2 START sql table model main.raw_passengers ............................... [RUN]
00:53:20 1 of 2 OK created sql table model main.raw_passengers .......................... [OK in 0.06s]
00:53:20 2 of 2 START python table model main.predict ................................... [RUN]
00:53:21 2 of 2 OK created python table model main.predict .............................. [OK in 0.73s]
00:53:21
00:53:21 Finished running 2 table models in 0 hours 0 minutes and 0.84 seconds (0.84s).
00:53:21
00:53:21 Completed successfully
00:53:21
00:53:21 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
This will modify a duckdb file representing our database. You can inspect the results using python or your favorite duckdb interface. Replace duckdb with your DB of choice in real life.
Congratulations! You’ve just run DBT together with Hamilton!
A few more details
To help you jump into the code of the DBT example and modify it for your needs, we’ve organized the code into two separate DBT models:
raw_passengers: This is a simple select and join using duckdb and DBT defined in SQL.
train_and_infer: The call dbt.ref("raw_passengers").df() is what links this dbt model with (1). With the data provided, the code does:
— feature engineering to extract a test/train set
— train a model using the train set
— run inference over an inference set
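As a rough mental model, those three steps boil down to something like the condensed sketch below. This is not the actual code in python_transforms, and the "survived" target column is an illustrative assumption from the Titanic dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical, condensed version of the feature-engineer/train/infer steps;
# the real logic is spread across the python_transforms modules.
def train_and_infer(df: pd.DataFrame, random_state: int = 5, test_size: float = 0.2) -> pd.DataFrame:
    X, y = df.drop(columns=["survived"]), df["survived"]
    # extract a test/train split
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # train a model on the train set
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # run inference over the inference set (here, the full dataset)
    return pd.DataFrame({"prediction": model.predict(X)})
```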
To see the details of this, look at the transforms defined in the python_transforms package.
Note (1): out of everything that could be computed in the train_and_infer step, Hamilton, as specified, only runs a subset of the transforms that are defined; we could easily request it to output metrics, return the fit model, etc. For this post we just wanted to keep it simple.
Note (2): again, to stress the earlier point, we’re using the dbt-fal adapter to help manage python dependencies for this example. DBT’s python support is still in beta, and we’ll be opening issues/contributing to get it more advanced, as well as updating this example as their python support evolves!
Note (3): if you wanted to use the output of the train_and_infer model, you would just reference it as you would any other downstream dbt model.
Future Directions
We think that Hamilton + DBT have a long/exciting future together. In particular, we could do more to make this experience better:
Compile Hamilton to DBT for orchestration: the new SQL adapter we’re working on would compile nicely to a dbt task.
Add more natural integration, including a dbt plugin for a Hamilton task.
Add more examples with different SQL dialects/different python dialects (hint: we’re looking for contributors…).
If you’re excited by any of this, drop on by! Some resources to get you help:
📣 join our community on slack — we’re more than happy to help answer questions you might have or get you started.
DBT support pages.
⭐️ us on github
📝 leave us an issue if you find something
Other Hamilton posts you might be interested in:
Developing scalable feature engineering DAGs (Hamilton with Metaflow)
The perks of creating dataflows with Hamilton (Organic user post on Hamilton!)