We listen closely to our open-source community. One of the most frequent questions from current and potential users is “How can Hamilton work with X (a tool I’m pushing to integrate with or have already paid for)?”. Now that we’ve built out a new lifecycle plugin framework, our answer has generally moved from “maybe, with some caveats” to “yes, with less than 20 minutes of work!”
This is the first of a series of integration posts, in which we:
Show how to use Hamilton with some framework X, usually through a new plugin or a set of best practices
Discuss trade-offs of using X for a set of common use-cases
Talk about alternatives to X and how Hamilton makes it easy to swap between them and compare.
We gear these posts towards those who want to (or already) use X, those who want to (or already) use Hamilton, and those who want to better understand the problem space that X solves.
We’ll start off with one of the most common requests we’ve gotten, Datadog. For those who don’t know, Datadog is a widely-used infrastructure/service visibility vendor with a host of powerful capabilities. In this post we will be focusing on Datadog’s distributed tracing capability.
We will start with an overview of the open-source Hamilton library, dig into two specific Hamilton use-cases, show how you gain visibility into those use-cases by integrating Datadog’s distributed tracing feature, and talk about swapping/comparing with other tools that solve similar problems.
First, let’s introduce Hamilton; if you’re already familiar skip to the next section.
Hamilton
If you are more of a learn-by-doing type, we recommend spending a few minutes on tryhamilton.dev to grok the following concepts, then skipping ahead to the next section.
Hamilton is a standardized way to build dataflows (any code that processes data) in Python. The core concepts are simple – you write each transformation step as a single Python function, with the following rules:
The name of the function corresponds to the output variable it computes.
The parameter names (and types) correspond to inputs. These can be either passed-in parameters or the names of other upstream functions.
# simple_pipeline.py
import pandas as pd
from sklearn import datasets, svm


def digits_df() -> pd.DataFrame:
    """Load the digits dataset."""
    digits = datasets.load_digits()
    _digits_df = pd.DataFrame(digits.data)
    _digits_df["target"] = digits.target
    return _digits_df


def digits_model(digits_df: pd.DataFrame) -> svm.SVC:
    """Train a model on the digits dataset."""
    clf = svm.SVC(gamma=0.001, C=100.)
    _digits_model = clf.fit(
        digits_df.drop('target', axis=1),
        digits_df.target
    )
    return _digits_model


def predicted_digits(
        digits_model: svm.SVC,
        input_digits: pd.DataFrame) -> pd.Series:
    """Predict the digits."""
    return pd.Series(digits_model.predict(input_digits))
Hamilton then stitches these functions together into a Directed Acyclic Graph (DAG) for execution. To actually run the code, you instantiate a driver: you tell the driver what you need, and it will run only the required functions (plus their upstream dependencies).
import simple_pipeline
from hamilton import driver

# create a driver (this uses the builder API), passing in modules
dr = (
    driver
    .Builder()
    .with_modules(simple_pipeline)
    .build()
)
# run the desired (+ upstream) nodes
results = dr.execute(
    ['predicted_digits'],
    inputs={"input_digits": _load_data(...)}
)
In sum, Hamilton is simply a way of organizing your code into a graph of functions. You can use it anywhere that well-organized code is valuable (everywhere) and the logic can be modeled as a DAG (most applications that manage the movement of data/objects).
Let’s dig into two different use-cases and how Hamilton can help. While we’ve narrowed these down to illustrate, they allude to broader applications, so read on even if you’re not working on Machine Learning (ML)/ Large Language Models (LLMs).
ML Pipelines
While Hamilton was initially built for feature engineering, we quickly found that it was a great tool to manage entire ML pipelines, a.k.a. ETLs. See the code above for an example.
Note that Hamilton excels at representing most pipelines – if ML is not your bread and butter, feel free to s/ML/your_expertise/g for the rest of this post. The batch observability use-case is not one to which Datadog is typically applied; this is something we’ll dig into in a bit.
What do you care about when it comes to visibility for ML pipelines? It varies, but monitoring often attempts to answer questions like these (and more):
1. Is my pipeline getting less performant?
2. If so, what are the bottlenecks?
3. Is my data still good?
4. Has my model trained well (does the loss curve converge)?
And so on – note that you tend to care about both the data concerns (ML-specific) and the system concerns.
To answer these, you have a variety of tools:
Macro-orchestration systems often come with some visibility - e.g. did the task run & complete?
Offline profiling tools (either logged or inspecting tables in your warehouse).
Data quality/lineage solutions (vendor-provided).
While there are many approaches to the above, these often require adding hooks in your code with various callouts to external services.
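To make that concrete, here is a hedged sketch of what those hooks tend to look like when woven into pipeline code by hand – the metrics_client import and metric names are hypothetical stand-ins for whatever vendor SDK you use:

# Hypothetical example of hand-rolled instrumentation: telemetry callouts
# interleaved with transformation logic. metrics_client is a stand-in for a
# vendor SDK, not a real library.
import pandas as pd

from my_company.telemetry import metrics_client  # hypothetical internal client


def featurized_data(raw_data: pd.DataFrame) -> pd.DataFrame:
    metrics_client.gauge("pipeline.raw_rows", len(raw_data))  # data-volume check
    features = raw_data.dropna()
    metrics_client.gauge("pipeline.dropped_rows", len(raw_data) - len(features))
    return features

This kind of interleaving works, but it couples your transformation logic to a specific telemetry vendor – which is exactly what the lifecycle-adapter approach below avoids.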
LLM RAG Applications
Along with batch jobs, Hamilton is widely used to express request-level service logic. More recently, we’ve observed an influx of engineers leveraging Hamilton to run Retrieval Augmented Generation (RAG) for inference with LLMs. While this example does not do retrieval (just generation), you can easily imagine extending it to query results from a vector DB (a sketch of that extension follows the example).
def base_system_prompt() -> str:
    """The base system prompt for summarizing text."""
    return """You are a summarization expert that knows how to extract key points from text."""


def base_prompt_template(content_type: str = "a blog") -> str:
    """The base prompt template."""
    return f"""Summarize this text from {content_type}. Extract any key points with reasoning.
Content:
{{content}}"""


def summarize_user_prompt(
        base_prompt_template: str,
        raw_text: str) -> str:
    """Final prompt for summarizing a chunk of text."""
    return base_prompt_template.format(content=raw_text)


def summarized_text(
        base_system_prompt: str,
        summarize_user_prompt: str,
        llm_name: str
) -> str:
    """Summarize the text (LLM_LIBRARY is a stand-in for your LLM client)."""
    response = LLM_LIBRARY.ChatCompletion.completion(
        model=llm_name, messages=[
            {"role": "system", "content": base_system_prompt},
            {"role": "user", "content": summarize_user_prompt}
        ], temperature=0
    )
    return response["choices"][0]["message"]["content"]
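To make the retrieval extension mentioned above concrete, here is a hedged sketch: a vector_db_client input (a hypothetical client with a query method, not a specific library) feeds a new retrieved_context node, which the prompt node then consumes.

# Hypothetical sketch of adding retrieval -- vector_db_client and its query()
# method are illustrative placeholders, not a specific vector DB library.
from typing import Any


def retrieved_context(vector_db_client: Any, raw_text: str) -> str:
    """Fetch documents related to the input text from a vector store."""
    documents = vector_db_client.query(raw_text, top_k=3)  # hypothetical API
    return "\n\n".join(doc.text for doc in documents)


def summarize_user_prompt(
        base_prompt_template: str,
        retrieved_context: str,
        raw_text: str) -> str:
    """Final prompt, now grounded with retrieved context."""
    return base_prompt_template.format(
        content=f"{retrieved_context}\n\n{raw_text}"
    )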
To integrate Hamilton with a web service, you would typically store a driver as a global variable and have each endpoint pass it data from the request, compute the DAG, and return the output (often a final node that compiles data into a pydantic model).
from hamilton import driver

dr = driver.Driver(...)
app = App(...)


@app.get("/v0/api/summarize_text")
def summarize_text(input_text: str) -> str:
    return dr.execute(
        ['summarized_text'],
        inputs={
            'llm_name': 'gpt-3.5-turbo',
            'raw_text': input_text}
    )['summarized_text']
This use-case can be generalized to pretty much all online applications. So, if LLMs are not your day-to-day, feel free to s/RAG/microservice/g for the rest of this post. For more information on running Hamilton in an online setting, see the writeup on Hamilton + FastAPI.
When instrumenting LLM RAG applications (and microservices in general), you care about both the data and the performance:
1. Does the response make sense?
2. Is the latency acceptable?
3. Where is the bottleneck in your code?
4. How much are you spending on external APIs (OpenAI)?
To answer these, you would typically instrument your code. Datadog, for instance, can provide answers to (2) and (4) with minimal changes (see Datadog monitors and OpenAI integration). Answering (3) requires significant instrumentation/use of the profiling feature in Datadog.
Integrating Datadog
Both of the above cases can use the same Hamilton integration with Datadog. This leverages lifecycle adapters and can be added with a single line of code:
from hamilton.plugins import h_ddog
from hamilton import driver

datadog_hook = h_ddog.DDOGTracer(root_name="hamilton_dag_trace")

dr = (
    driver
    .Builder()
    .with_modules(...)
    .with_adapters(datadog_hook)
    .build()
)
dr.execute(...)
As you can see, all we’re doing is:
Instantiating a DDOGTracer with a grouping span hamilton_dag_trace
Integrating it with our driver and executing it
Behind the scenes, the DDOGTracer implements a set of hooks that Hamilton calls at various stages during execution. These occur before and after the graph runs, a new task is spawned, or an individual function is executed. It maintains state, allowing it to specify span links from nodes to their dependencies and accurately capture the parent-child relationship of runs -> tasks -> nodes.
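If you’re curious what such a hook looks like, here is a deliberately simplified sketch – not the actual DDOGTracer implementation; the lifecycle base class and method names reflect our reading of Hamilton’s API and may differ slightly in your installed version.

# Simplified sketch of a node-level lifecycle hook that opens a Datadog span
# per executed node. The real DDOGTracer handles tasks, graph-level spans, and
# span links; this just illustrates the mechanism.
from typing import Any, Dict, Optional

from ddtrace import tracer
from hamilton import lifecycle


class SimpleSpanPerNode(lifecycle.NodeExecutionHook):
    """Opens a Datadog span for every node that executes."""

    def __init__(self):
        # NOTE: keyed by node name for simplicity -- not safe for parallel tasks.
        self._spans: Dict[str, Any] = {}

    def run_before_node_execution(self, *, node_name: str, **kwargs):
        # Start a span named after the node; it becomes a child of whatever
        # span is currently active (e.g. a request-level trace).
        self._spans[node_name] = tracer.trace(node_name)

    def run_after_node_execution(
        self, *, node_name: str, error: Optional[Exception] = None, **kwargs
    ):
        span = self._spans.pop(node_name)
        if error is not None:
            span.set_tag("error.message", str(error))
        span.finish()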
Looking at these in Datadog, we have quite a few examples (read the captions for explanation/context).
As you can see, Hamilton with Datadog monitoring forms a clean way of profiling your dataflow! Each function corresponds to a span, so you get a fine-grained view into exactly what your code is doing and how long it takes. You can do all of this without polluting your code with with tracer.trace(...) statements, muddling logical concerns with telemetry concerns just to get deeper insight (as illustrated in the code below, which Hamilton + DDOG lets you simplify).
def some_artifact(some_input: pd.DataFrame) -> pd.DataFrame:
    with tracer.trace("do_something"):
        intermediate = do_something(some_input)
    with tracer.trace("do_something_else"):
        do_something_else(intermediate)
    return intermediate
Furthermore, should you want to (a) make the span part of a broader trace or (b) add more specific spans (within functions), you can opt to use the Datadog tracer API directly, and the implementation will automatically set parent/child spans (although, as shown above, watch out for readability/decoupling concerns).
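For example, here is a sketch of option (b), adding finer-grained spans inside the summarization node from earlier (reusing the LLM_LIBRARY placeholder; the span names are illustrative):

# Sketch: finer-grained spans inside one Hamilton node. The node-level span
# created by DDOGTracer is active here, so these spans nest under it.
from ddtrace import tracer


def summarized_text(
        base_system_prompt: str,
        summarize_user_prompt: str,
        llm_name: str
) -> str:
    with tracer.trace("build_messages"):
        messages = [
            {"role": "system", "content": base_system_prompt},
            {"role": "user", "content": summarize_user_prompt},
        ]
    with tracer.trace("llm_call"):  # time the external API call separately
        response = LLM_LIBRARY.ChatCompletion.completion(
            model=llm_name, messages=messages, temperature=0
        )
    return response["choices"][0]["message"]["content"]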
Back to our Use Cases
Let’s go back to the monitoring concerns we talked about above:
ML Pipelines
1. Is my pipeline getting less performant?
2. If so, what are the bottlenecks?
3. Is my data still good?
4. Has my model trained well (does the loss curve converge)? Does this hold up against the validation set?
Hamilton’s (current) Datadog integration will happily answer (1) and (2) – you can create metrics to track spans, determine bottlenecks, etc. The fact that it instruments each step makes it easy to identify/improve bottlenecks, which map directly to your code. It does not address (3) or (4), however.
LLM RAG Apps
1. Does the response make sense?
2. Is the latency acceptable?
3. Where is the bottleneck in your code?
4. How much are you spending on external APIs (OpenAI)?
The Hamilton/Datadog integration we showed will help with (2) and (3). Datadog also has capabilities for additional OpenAI monitoring that address (4) (which should be easy to integrate, although we have not done so yet), but (1) is not a Datadog concern.
Trade-offs
There are a few other benefits to this approach (Datadog + Hamilton) for the microservice/RAG setting (that we have not shown in this post):
You can hook this up with any other services by setting the trace context before calling Hamilton, allowing you to trace data across service boundaries, into your DAG, and back to the client (see the sketch after this list).
You can gather specific metrics on function-level performance, which follows naturally from the structure of Hamilton
Datadog stores and displays exception-level information
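A rough sketch of the first point, assuming a FastAPI app and the dr driver from the earlier snippet, and ddtrace’s HTTP propagation API (if you rely on ddtrace’s automatic web-framework instrumentation this happens for you; double-check the exact imports against your installed ddtrace version):

# Sketch: continue the caller's distributed trace before executing the DAG, so
# Hamilton's node spans nest under the upstream service's trace. Assumes the
# `app`/`dr` globals from the earlier snippet and ddtrace's HTTPPropagator.
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator
from fastapi import Request


@app.get("/v0/api/summarize_text")
def summarize_text(input_text: str, request: Request) -> str:
    ctx = HTTPPropagator.extract(dict(request.headers))  # upstream trace headers
    if ctx.trace_id:  # only activate if the caller actually sent a trace
        tracer.context_provider.activate(ctx)
    return dr.execute(
        ['summarized_text'],
        inputs={'llm_name': 'gpt-3.5-turbo', 'raw_text': input_text},
    )['summarized_text']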
There are, however, some drawbacks:
Trace retention is low. While this is fine for a high-volume LLM/RAG service, you might not be able to find the trace for the ML pipeline that ran, say, three weeks ago.
Traces tend to run longer in a pipeline setting – while we have not observed a maximum span duration in Datadog, it does not stream partial trace results, so you won’t see what’s happening until the DAG completes.
The dependency visualization capabilities (e.g. viewing your dataflow as a DAG) are rudimentary, although we expect this to improve given that the feature is in beta. This is not a replacement for a lineage system.
You can’t link the spans to the code that was actually run (although you probably could do something clever/hacky with tags…).
You have no view into the data – ETLs often break because data looks off, and this won’t help you debug.
Evaluation
For a microservice/LLM setting (aside from cost, which is yours alone to consider), this is a no-brainer. In fact, it’s a step up from most other Datadog integrations we’ve observed – very few of them naturally provide visibility into the logic of your code. This could significantly improve your experience working with LLM APIs, which can be notoriously slow and unreliable. Some ideas:
You can tag OpenAI calls with metadata to look at their latency/success rate over time (see the sketch after this list).
You can monitor the latency of your vector DB by setting up metrics on spans.
You can pinpoint the slowest component of your generative AI pipeline and add caching to reduce latency.
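As a sketch of the first idea (the helper and tag names are hypothetical, layered on top of the ddtrace API):

# Hypothetical helper: annotate the current node's span with LLM metadata so
# you can build latency/success-rate dashboards sliced by model. Tag names
# are illustrative, not a Datadog convention.
from ddtrace import tracer


def tag_llm_call(llm_name: str, prompt: str) -> None:
    """Call at the top of your LLM node to annotate its span."""
    span = tracer.current_span()  # the node-level span opened by DDOGTracer
    if span is not None:
        span.set_tag("llm.model", llm_name)
        span.set_tag("llm.prompt_chars", len(prompt))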
For ML Pipelines, this likely fits in the “can’t hurt” category (cost calculations aside) – having visibility + tracing is always better than not, and when combined with Hamilton, Datadog does give some unique insight.
Alternatives
One of the benefits of a one-line change is that you can easily swap in alternative providers. The unique value of Hamilton is that the abstraction (with the addition of lifecycle adapters) lets you do this with ease, running multiple integrations at once without touching the code itself. (Keep your code clean! Your infrastructure, and monitoring it, should be a separate concern.)
While we have implemented only Datadog’s tracing capabilities, integrating with a competitor such as Signoz or Middleware should be as easy as a one-line change (please contribute back if you build the integration! Or, better yet, build an OpenTelemetry integration!).
At DAGWorks, we’re creating an observability platform for dataflows that works well with Hamilton and can complement a distributed tracing system such as Datadog. It is meant for both development and production settings, and includes lineage metadata and a catalog (so you can keep track of how things connect and the artifacts produced). It also takes the form of a single-line addition (similar to Datadog). It’s free to sign up and try out!
import os

from dagworks.adapters import DAGWorksTracker

from hamilton import driver

tracker = DAGWorksTracker(
    project_id=project_id_from_the_ui,
    api_key=os.environ["DAGWORKS_API_KEY"],
    username="my_email@my_organization.com",
    dag_name="my_dag_name",
)
dr = (
    driver
    .Builder()
    .with_modules(...)
    .with_adapters(tracker)
    .build()
)
dr.execute(...)
It can help provide mid-execution visibility for pipelines as it streams results in, provides visualizations of your DAGs (with the associated code), and stores/displays data summaries for completed nodes (prompts, dataframes, etc.).
Looking back at the initial monitoring requirements above, we can see how DAGWorks can help round out some of the other use-cases. In particular, you can monitor the outputs/prompts to your LLM queries, gain insight into the data generated and used in your ML Pipelines, and visualize the results of intermediate run artifacts, including data and metrics.
Next Steps/References
So, if you’re asking yourself what tools to use, the answer is, most likely, why not both? Datadog provides invaluable insight into system performance, attaches it to a variety of other metrics, and gives you some incredibly clean and powerful UIs for managing distributed tracing. Hamilton allows you to gain more value out of Datadog, instrumenting your traces with fine-grained details that make the logic clear. Furthermore, it allows you to easily swap out/plug in tooling without touching the code for the logic itself.
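Concretely, running both side by side is just one more adapter in the builder – a sketch combining the two snippets above (assuming with_adapters accepts multiple adapters):

# Sketch: Datadog tracing and the DAGWorks tracker on the same driver.
# datadog_hook and tracker are configured exactly as in the earlier snippets.
dr = (
    driver
    .Builder()
    .with_modules(...)
    .with_adapters(datadog_hook, tracker)
    .build()
)
dr.execute(...)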
DAGWorks gives you information to really dig into the performance of your LLMs + ML models and understand exactly how the data materializes.
A mixture of tools that complement each other nicely is often the best approach, as you can never fully predict what production problem you’ll hit and what solutions will prove most beneficial. Hamilton can help you plug and play, and swap between/add new integrations.
In summation, we:
Discussed how to use Hamilton for two cases – batch ML Pipelines and online RAG services
Presented the new Datadog integration
Talked about trade-offs of using Datadog for both of those cases
Talked a little about Datadog alternatives/complements, how Hamilton can be used to swap between them/toggle them on and off, and how DAGWorks picks up the slack where Datadog leaves off.
We’re really excited about the future of data pipelines. As everything in Hamilton is open-source, you can play a role as well! We are looking for contributions – please share back any successful integrations you build.
To get started:
Read the post on lifecycle adapters
Look over the DDOG integration
Check out a few of the included hooks
Download a dataflow from the hub to test it with
We Want to Hear from You
If you’re excited by any of this, or have strong opinions, drop by our Slack channel or leave some comments here! Some resources for getting help:
📣 join our community on Slack — we’re more than happy to help answer questions you might have or get you started.
⭐️ us on GitHub
📝 leave us an issue if you find something
Other Hamilton posts you might be interested in: