Declarative data orchestration: Dagster & Hamilton
Learn how Dagster and Hamilton are similar and different.
In this post, we’ll compare the open-source Python frameworks Dagster and Hamilton, covering the following points:
What a declarative approach is and how it compares to imperative frameworks like Airflow.
How Dagster is a macro-orchestrator and Hamilton is a micro-orchestrator.
Which do you need?
See the full comparison table at the end of the post!
See the code example on GitHub
Imperative vs. Declarative
Orchestration is the coordination and management of tasks to achieve a desired outcome. In the data world, it is used to interact with multiple systems (e.g., a database, a data warehouse, some compute cluster, or a machine learning platform) to streamline and automate tasks. There are two common approaches to describing what these tasks are and how they relate: imperative and declarative. Let’s explain the difference using Airflow, Hamilton, and Dagster.
Airflow (imperative)
Define
Airflow, the canonical orchestrator, is imperative and requires developers to tell it exactly “what to do”. First, you create tasks with the @task decorator:
from airflow.decorators import task
from airflow.operators.python import get_current_context

@task(params={"external_input": ...})
def A() -> int:
    """Modulo 3 of input value"""
    context = get_current_context()
    external_input = context["params"]["external_input"]
    return external_input % 3

@task
def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3

@task
def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A ** 2 * B
Assemble
Then, organize them in a recipe, which takes the shape of a directed acyclic graph (DAG), using the @dag decorator:
from airflow.decorators import dag

@dag(...)
def abc():
    a = A()
    b = B(a)
    C(A=a, B=b)
Execute
The Airflow system then manages a catalog of DAGs and is responsible for scheduling and executing them.
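With the TaskFlow API, the decorated function still has to be called once at module level to instantiate the DAG so the scheduler can discover it; a minimal sketch:
# instantiate the DAG object; Airflow discovers it when parsing the DAGs folder
abc_dag = abc()
From there, the usual tooling applies, e.g., triggering a run manually with airflow dags trigger abc.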
Learn more about Airflow and Prefect (another imperative macro-orchestrator)
Hamilton (declarative)
Define
Hamilton uses a declarative approach to expressing a DAG. Developers write regular Python functions whose names declare what can be computed; in Hamilton, these functions are called “nodes” (the equivalent of a “task”). Each function also declares its dependencies through its parameter names and types.
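For example, here is what the same A/B/C pipeline could look like as Hamilton node definitions (a sketch of the definitions module imported below; external_input is provided at execution time rather than being a node):
# definitions.py -- each function is a node;
# parameter names and types declare its dependencies
def A(external_input: int) -> int:
    """Modulo 3 of input value"""
    return external_input % 3

def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3

def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A ** 2 * B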
Assemble
Contrary to imperative approaches, Hamilton is responsible for loading definitions and automatically assembling the DAG. This is done through the Driver object:
from hamilton import driver
import definitions # contains node definitions
dr = driver.Builder().with_modules(definitions).build()
Execute
While an Airflow DAG is a recipe that has to be executed from start to finish, a Hamilton DAG is more like a recipe book. To execute code, users request nodes and Hamilton determines the recipe to compute them on the fly.
Compute all nodes and return only the value for “C”:
# request node named "C"; returns a dictionary of results
results = dr.execute(["C"], inputs={"external_input": 7})
Compute only nodes “A” & “B” and return the value for “B”:
# request node named "B"; returns a dictionary of results
results = dr.execute(["B"], inputs={"external_input": 7})
Compute all nodes and return their values:
# request nodes "A", "B", and "C"; returns a dictionary of results
results = dr.execute(["A", "B", "C"], inputs={"external_input": 7})
Dagster (declarative)
Define
Dagster was launched as an imperative framework (e.g., with the @op decorator) but introduced software-defined assets in 2022, a declarative API reminiscent of Hamilton. “Assets” (the equivalent of a “node” or “task”) are defined with the @asset decorator:
from dagster import asset, Config

class AssetConfig(Config):
    external_input: int

@asset
def A(config: AssetConfig) -> int:
    """Modulo 3 of input value"""
    return config.external_input % 3

@asset
def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3

@asset
def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A ** 2 * B
Assemble
Similar to Hamilton, asset definitions are first registered, then the orchestrator assembles the DAG:
from dagster import Definitions, load_assets_from_modules
from .assets import definitions  # contains asset definitions
defs = Definitions(assets=load_assets_from_modules([definitions]))
Execute
Code execution is done through “asset jobs”. To create one, you need to specify all the assets to compute and Dagster will structure the recipe:
from dagster import (
    Definitions,
    load_assets_from_modules,
    define_asset_job,
)
from .assets import definitions  # contains asset definitions

defs = Definitions(
    assets=load_assets_from_modules([definitions]),
    jobs=[
        define_asset_job(name="abc")  # includes all assets by default
    ],
)
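For a quick local run outside a deployed Dagster instance, the job can also be executed in-process; a sketch, assuming the AssetConfig shown earlier:
# execute the "abc" job in the current process;
# asset config is passed under the "ops" key of run_config
result = defs.get_job_def("abc").execute_in_process(
    run_config={"ops": {"A": {"config": {"external_input": 7}}}}
)
assert result.success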
A Dagster asset job assumes you need all intermediary results. This is less flexible than Hamilton, which allows querying only sub-paths of its DAG.
Imperative vs. Declarative Summary
DAG definition
Nowadays, Airflow, Hamilton, and Dagster share very similar APIs for defining tasks/nodes/assets — they can all be plain Python functions. One differentiator is that Hamilton relies primarily on standard Python constructs (e.g., the function signature) instead of framework-specific decorators. This makes the code easier to reuse outside Hamilton.
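For example, since Hamilton nodes are plain functions, they can be imported and exercised without any framework machinery; a small sketch using the definitions module from earlier:
import definitions  # no Hamilton import needed: nodes are plain functions

assert definitions.B(3) == 1.0           # 3 / 3
assert definitions.C(A=2, B=1.0) == 4.0  # 2 ** 2 * 1.0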
While Airflow is purely imperative and Hamilton is declarative, Dagster can be both and may require mixing the two approaches, using the @asset, @op, @graph_asset, and @graph_multi_asset decorators; this can get confusing.
DAG assembly
Being imperative, Airflow is the most restrictive and requires developers to specify how each task relates to the others. Hamilton and Dagster take very similar declarative approaches that involve importing and registering a Python module containing node/asset definitions.
DAG execution
For Airflow, you specify a DAG, and it will be executed from start to finish.
For Hamilton, once the DAG is assembled, you can request nodes and the Driver will determine the necessary nodes to execute and produce the results.
For Dagster, once the DAG is assembled, you also need to create asset jobs to execute code. This adds a layer to manage.
DAG complexity
Imperative orchestrators (Airflow) are efficient when there are only a few tasks and the recipe is linear. When the number of tasks grows, it becomes difficult to manually specify dependencies and maintain code.
Declarative orchestrators (Hamilton, Dagster) better manage large numbers of tasks and complex dependencies by automating DAG assembly. This favors writing smaller functions, resulting in code that’s easier to read, test, debug, and maintain.
Macro vs. Micro
Despite both adopting a declarative API, Dagster operates at the macro level while Hamilton operates at the micro level.
Macro-orchestration
Macro-orchestrators (Dagster, Airflow, Metaflow, Prefect) are platforms. A central instance is deployed on a server as a long-running process and manages the DAG definitions, executor, scheduler, metadata, and UI. Data resides outside of the macro-orchestrator in a database or data warehouse. When a DAG is executed, the orchestrator delegates the task to worker nodes, which load the data, transform it, and store results at the designated location.
Micro-orchestration
Micro-orchestrators are libraries. They are designed to operate in a single Python process without the need for a centralized server. Being a library, they are easy to install and consequently portable. For instance, you can use Hamilton in a script, notebook, dashboard, Streamlit app, FastAPI server, Pyodide browser kernel, and anywhere else Python runs; even in Dagster or Airflow! Hamilton is the only general-purpose micro-orchestrator that covers data, machine learning, and LLM use cases.
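To make the portability point concrete, here is a minimal sketch of serving a Hamilton DAG from a FastAPI endpoint (the app and route are hypothetical; definitions is the node module from earlier):
from fastapi import FastAPI
from hamilton import driver
import definitions  # the node definitions module from earlier

app = FastAPI()
dr = driver.Builder().with_modules(definitions).build()

@app.get("/compute/{value}")
def compute(value: int) -> dict:
    # Hamilton executes inside the request handler's Python process
    return dr.execute(["C"], inputs={"external_input": value})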
Considerations
Use case flexibility
Dagster provides a great platform for creating data artifacts and handling the scheduling of computation. It’s a reasonable choice for a macro-orchestrator.
Hamilton doesn’t provide a platform, but it covers a broader range of use cases. Once you adopt it, you can use it for your data APIs, web apps, and LLM applications. Migration effort is minimal, it doesn’t restrict which other tools you can use, and it pairs nicely with any macro-orchestrator.
Data size and computation
To be clear, macro vs. micro has nothing to do with the size of the data or the computation resources required. Both Hamilton and Dagster have integrations for Spark, Dask, Polars, etc. So both could be used for data of any size.
Commitment / Lock-in
Choosing a macro-orchestration framework is an important decision because it will be at the center of your data architecture. It is rare for people to migrate from them. Therefore it’s generally an “all or nothing” proposition.
Micro-orchestration on the other hand influences the way you write code, and represents a much simpler proposition to adopt. With Hamilton, you can get started with a small task, see if you like it, and then continue from there without having to fully commit.
Haven’t tried Hamilton yet? Try Hamilton in your browser on tryhamilton.dev. As we said, it’s just a lightweight library that can even run inside your browser!
What do you need?
With Dagster, you…
Get a platform that can manage infrastructure and scale cloud resources
Have access to a full orchestration toolbox: scheduler, sensors, refresh, backfill
Commit to a data engineering platform with lock-in
With Hamilton, you…
Get a library to standardize how data transformations are expressed
Can develop interactively in a notebook or with the VSCode extension
Adopt a pluggable framework that can expand to your platform needs
Finally, it’s possible to use Dagster and Hamilton together! Since Hamilton is a Python library, you simply have to import it and use it within a Dagster @asset or @op function. Pick either tool to get started depending on your current needs, and try introducing the other once you are familiar. A sketch of using them both could look like this — run Hamilton within a Dagster function (Model here is a placeholder return type):
import pandas as pd
from dagster import asset
from hamilton import driver

@asset
def ml_model(raw_data: pd.DataFrame) -> Model:
    # assemble a Hamilton DAG from your transform modules and run it
    dr = driver.Builder().with_modules(...).build()
    results = dr.execute(["ml_model"], inputs={"raw_data": raw_data})
    return results["ml_model"]
FAQ
Q: What is the main difference between Dagster and Hamilton?
A: Dagster is a macro-orchestrator, meaning it is a full-fledged platform for data orchestration, while Hamilton is a micro-orchestrator, which is a lightweight Python library for defining and executing data pipelines.
Q: When should I use Dagster versus Hamilton?
A: This is a false dilemma: you can use both together. Dagster is a good choice if you need a complete platform for managing data infrastructure, scheduling, and scaling cloud resources. Hamilton is a good choice if you want a flexible, lightweight library for defining and executing data pipelines within your existing Python applications, without the need for a dedicated orchestration platform. The two complement each other and work well together.
Q: Can I use Dagster and Hamilton together?
A: Yes, since Hamilton is a Python library, you can import and use it within Dagster's @asset or @op functions.
Q: How do Dagster and Hamilton differ in terms of defining and assembling data pipelines?
A: Dagster uses a declarative approach with @asset decorators to define data assets, while Hamilton relies more on standard Python constructs like function signatures. Dagster requires creating separate "asset jobs" to execute pipelines, while Hamilton allows dynamically requesting the execution of specific nodes or pipelines.
Q: What are the benefits of Hamilton's micro-orchestration approach?
A: Hamilton's micro-orchestration approach offers flexibility, portability, and ease of adoption. It can be used in various Python environments like scripts, notebooks, web apps, and even within other orchestrators. It also avoids lock-in to a specific platform.
Q: How do Dagster and Hamilton handle complex dependencies and large numbers of tasks?
A: Both Dagster and Hamilton leverage a declarative approach to automatically assemble and manage complex dependencies between tasks or assets. This approach is generally better suited for handling large numbers of tasks compared to imperative frameworks like Airflow.
Q: What types of use cases can Hamilton handle beyond data pipelines?
A: Hamilton can be used for data APIs, web applications, and large language model (LLM) applications, in addition to data pipelines. Just browse this blog.
Links
📣 join our community on Slack — we’re more than happy to help answer questions you might have or get you started.
⭐️ us on GitHub
📝 leave us an issue if you find something