Building a Better Feature Platform with Hamilton
A guest post by Ryan Whitten, Director of ML Data Engineering @ Best Egg
For the past three years, Best Egg has been on a journey to find the ideal machine learning feature engineering platform. As a fintech company offering lending products ranging from personal loans to flexible rent solutions, machine learning and data are at the heart of our business. Providing our customers with the resources they need to be money confident requires us to constantly refine our processes. Upon closer examination, we found that our data scientists spend a large part of any new model’s development time manually curating and analyzing training data. We also recognized the challenge that arises when training data doesn’t match what the model sees at inference time (the dreaded train-serve skew).
A typical development workflow for each model includes pulling features from our Snowflake data warehouse, engineering some features manually, and implementing a standard set of pandas dataframe transformations. This means a lot of time spent modifying Snowflake queries and writing repetitive code, resulting in isolated feature engineering pipelines that aren’t easily accessible to other data scientists. Feature reusability is low, and the universe of features considered for feature selection stays small because of the overhead of exploring new ones.
We knew we could increase efficiency for our data scientists, minimize train-serve skew, and unlock use cases that hadn’t been explored yet. Our goal was clear: we needed a centralized platform to support and promote the reuse of ML features across multiple models. This problem is not unique to our business, and feature stores and commercial feature platforms tout themselves as the solution to exactly these challenges.
We evaluated and completed proof-of-concepts with the top five commercial and open-source providers in the market. While each platform had its strengths, none were able to fully meet our needs. Ultimately, we chose to build our own in-house solution, leveraging existing infrastructure and integrating open-source tools including Hamilton.
Our Feature Platform Requirements
As we set out to find the right feature platform, we developed a list of requirements to help us focus on what we really needed and to evaluate each platform consistently. At a high level, the list included capabilities like:
Connects to our existing data sources (Snowflake, S3, Postgres, etc.), provides flexibility for any other source we may have, and allows joining across data sources
Enables calculating on-demand features for real time inference
Supports pre-computing features to both an online and offline feature store, in batch and streaming contexts
Generates accurate training data using point-in-time joins (see the sketch just after this list)
Supports windowed aggregations
Handles complex chained transformations
Is Python-native and makes it easy to express features as code
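To make the point-in-time join requirement concrete, here’s a minimal pandas sketch (the table and column names are illustrative, not our actual schema). merge_asof attaches, to each training label, the latest feature values computed at or before the label’s observation time, so no future information leaks into the training set:

```python
import pandas as pd

# Labels: one row per (customer, observation time) we want training data for.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "observed_at": pd.to_datetime(["2024-01-15", "2024-03-01", "2024-02-10"]),
    "defaulted": [0, 1, 0],
})

# Feature snapshots: each row is valid from computed_at onward.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "computed_at": pd.to_datetime(["2024-01-01", "2024-02-20", "2024-01-05"]),
    "avg_balance_90d": [1200.0, 900.0, 3400.0],
})

# For each label, merge_asof keeps the latest feature row with
# computed_at <= observed_at -- i.e., no peeking into the future.
training_data = pd.merge_asof(
    labels.sort_values("observed_at"),
    features.sort_values("computed_at"),
    left_on="observed_at",
    right_on="computed_at",
    by="customer_id",
)
```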
Why No Platform Checked All the Boxes
Our requirements are fairly standard and what you’d expect any feature platform to provide. However, the main lesson we learned was that a platform may check the boxes on paper yet still fall short during a proof-of-concept. Specifically, we had challenges with:
Flexibility: Some platforms supported only a limited set of data sources, or restricted us to a single data source at a time. Several platforms allowed only a narrow range of Python packages for transformation logic, with rigid support for custom packages and environments. As our business grows, we didn’t feel we’d be able to adapt quickly enough by waiting on a third party to add necessary functionality.
Scalability: Our data tends to be very wide (thousands of columns), and no platform was able to process it more efficiently than our existing infrastructure and pipelines. Our on-demand feature engineering requires very low latency, and there was often additional serving overhead compared to what we achieved with our DIY prototypes.
Complex Transformations: We currently manage tens of thousands of features, and calculating them often involves a complex directed acyclic graph (DAG) of operations to transform raw data into output features. Only one platform we tried really solved this problem, and we realized how key it was to the solution we needed.
Why We Chose to Build In-House
As a company with a strong foundation in data engineering, we believed we had the expertise to build a better, more tailored solution in-house. The key factors that drove this decision were:
Control and Flexibility: We needed to be able to design and evolve the system to meet the specific requirements of our models. By building it ourselves, we can iterate and optimize as needed without waiting on a vendor’s roadmap.
Integration with Existing Infrastructure: We already have a complete set of tools at our disposal. Building in-house allows for seamless integration with our Kubernetes clusters, Snowflake data warehouse, and enterprise event bus, without requiring us to refactor everything to fit into a rigid commercial solution.
No Vendor Lock-in: We were honestly hoping to find a perfect solution we’d feel comfortable locking into. Building a feature platform is no easy feat, and we recognize that every vendor is solving difficult problems, but for now, retaining flexibility is a key advantage.
Enter Hamilton: The Right Framework for the Job
During our evaluation, we discovered Hamilton, an open-source Python framework designed for creating flexible, modular, and portable data pipelines (think everything from data transforms, to ML, to GenAI). It has a growing community on Slack, with weekly releases and a meet-up group, and is driven by the DAGWorks Inc. team that built the internal ML platform at Stitch Fix. What caught our attention was how Hamilton’s approach to expressing transformations (i.e., features in our context) aligned perfectly with our goals of flexibility and scalability.
Modular and Declarative: Hamilton allows us to define extremely complex feature transformations in a modular way. We already have several pipelines that compute 500–2,000 features each, including multi-layered dependencies between features. Each feature is defined as a function, with clear dependencies and no hidden state. This makes our pipelines far more maintainable, testable, and faster to build.
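To illustrate the style, here’s a minimal sketch of what a Hamilton feature module looks like (the column names and thresholds are made up for illustration, not our actual features). Each function is a node in the DAG, and its parameter names declare its dependencies, so chained transformations fall out naturally:

```python
# features.py -- a sketch of Hamilton-style feature definitions.
import pandas as pd


def monthly_income(annual_income: pd.Series) -> pd.Series:
    """Annual income expressed as a monthly figure."""
    return annual_income / 12


def debt_to_income(monthly_debt: pd.Series, monthly_income: pd.Series) -> pd.Series:
    """Monthly debt payments relative to monthly income.
    Depends on the monthly_income node above -- a chained transformation."""
    return monthly_debt / monthly_income


def is_high_dti(debt_to_income: pd.Series) -> pd.Series:
    """Flag records above an illustrative 43% DTI threshold."""
    return debt_to_income > 0.43
```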
Efficient Feature Projection: While each of our pipelines may provide thousands of features, a typical model uses only 20–100 of them for inference. Hamilton lets us dynamically prune the DAG to run only the nodes needed to produce the features a model requests. This reduces our overall latency, especially when coupled with Hamilton’s async integration, which lets us fetch only from the data sources a request actually needs.
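Continuing the sketch above: when the driver is asked for a subset of outputs, Hamilton walks the DAG backward and executes only the nodes those outputs depend on, so a model requesting a handful of features doesn’t pay for the thousands of others defined in the module:

```python
import pandas as pd
from hamilton import driver

import features  # the module sketched above

dr = driver.Builder().with_modules(features).build()

# Only is_high_dti and its upstream nodes (debt_to_income, monthly_income)
# execute; every other node in the module is pruned away.
result = dr.execute(
    ["is_high_dti"],
    inputs={
        "annual_income": pd.Series([60_000.0, 85_000.0]),
        "monthly_debt": pd.Series([2_500.0, 1_200.0]),
    },
)
```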
Write Once, Run Anywhere: As mentioned, a key concern in feature engineering is eliminating train-serve skew. Hamilton lets us write transformations for on-demand inference, where we may process a single record at a time. We can then reuse the same transformation code, swap out how data is loaded (e.g., Snowflake instead of the Redis cache used on-demand), and run large-scale backfills to generate training data. We can orchestrate our Python code anywhere without needing a JVM environment or any other specialized infrastructure.
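In practice that looks something like the following sketch: the transformation module is identical in both contexts, and only the inputs change (load_from_snowflake is a hypothetical loader standing in for our actual warehouse ingestion):

```python
import pandas as pd
from hamilton import driver

import features  # identical transformation code in both contexts

dr = driver.Builder().with_modules(features).build()

# On-demand inference: a single record, e.g. values fetched from Redis.
online = dr.execute(
    ["is_high_dti"],
    inputs={
        "annual_income": pd.Series([72_000.0]),
        "monthly_debt": pd.Series([1_800.0]),
    },
)

# Batch backfill: the same DAG, fed whole columns from the warehouse.
# load_from_snowflake is hypothetical -- swap in your own loader.
raw = load_from_snowflake("SELECT annual_income, monthly_debt FROM applications")
backfill = dr.execute(
    ["is_high_dti"],
    inputs={
        "annual_income": raw["annual_income"],
        "monthly_debt": raw["monthly_debt"],
    },
)
```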
Compatibility with Our Existing Tools: Hamilton integrates well with our current machine learning ecosystem, which primarily includes pandas and Snowpark. This made adopting it a smoother process compared to other platforms that require complete infrastructure overhauls.
Building the Platform Around Hamilton
We built our entire platform around Hamilton, which is responsible for calculating features in four contexts:
On-Demand Feature Calculation: When models require features for individual records, the system can calculate these features in real time. We execute Hamilton DAGs as Ray tasks within a FastAPI server running on Kubernetes (a simplified sketch follows this list).
Batch Processing: Many of our features, especially lifetime or long-windowed aggregations (e.g., last two years), can be calculated in batch. We use Hamilton to help modularize pipelines for both pandas and Snowpark. These features are then loaded into our offline feature store (Snowflake) for training and batch inference, but they can also be cached in our online store (DynamoDB) for online inference.
Stream Processing: While we haven’t fully integrated stream processing yet, we plan to use Hamilton to help write dataframe transformations that could be used with a streaming database like RisingWave, thanks to Hamilton’s integration with Ibis. These streaming features would be available in both the online and offline stores for inference and training.
Backfills: We support backfills across the three main execution contexts, and spin up ephemeral Kubernetes jobs to handle the processing (ideally pushing down the execution to Snowflake). Hamilton lets us reuse the same code to backfill, but also helps structure our DAGs to perform temporally accurate backfills, computing features as of a specific point in time that can be unique for each entity in the selected population (i.e., millions of unique observation times to compute).
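As a rough illustration of the on-demand path, here’s a simplified sketch of a FastAPI endpoint executing a Hamilton DAG per request (it omits the Ray task layer, online store lookups, and error handling, and lookup_model_features is a hypothetical stand-in for a model registry):

```python
import pandas as pd
from fastapi import FastAPI
from hamilton import driver

import features  # the same transformation module used in batch

app = FastAPI()
dr = driver.Builder().with_modules(features).build()


@app.post("/features/{model_name}")
def compute_features(model_name: str, record: dict) -> dict:
    # Hypothetical registry lookup: which of the many defined features
    # does this model actually need?
    requested = lookup_model_features(model_name)

    # Hamilton prunes the DAG to just these outputs and their dependencies.
    result = dr.execute(
        requested,
        inputs={name: pd.Series([value]) for name, value in record.items()},
    )
    return result.iloc[0].to_dict()
```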
By using Hamilton as the core execution engine of our platform, we’ve been able to maintain a high degree of flexibility while supporting the complex requirements of feature engineering at scale. Pipelines that used to take months to build can now be onboarded in days or weeks. We’ve also reduced silos across departments and created a centralized place to engineer new features.
Looking Forward
The decision to build our own feature engineering platform was not made lightly, but it has proven to be the right choice for us so far. With Hamilton, we’ve created a system that:
Scales efficiently with our data
Is flexible enough to adapt to changing business needs
Ensures consistency between training and production environments
And most importantly, we’ve built a platform that gives us complete control, allowing us to continue innovating to meet the growing demands of our machine learning initiatives, all with the goal of better serving our customers.
Some of our future roadmap items include evaluating Hamilton’s companion self-hostable UI to cover:
Observability - catching and diagnosing data issues quickly will be key to the platform’s reliability. With Hamilton we get feature provenance by default and can inject data observability easily; the UI’s tracking features are something we’re looking to leverage (see the sketch after this list).
Catalog - a UI to search for, discover, and understand features. As our platform grows, we expect curation and understanding of the feature platform (e.g., data dependencies, governance, cost) to become more important, and we’re looking to invest in this area over time.
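For reference, attaching the UI’s tracking adapter to a driver is a small change (a sketch based on the Hamilton UI documentation; the project_id, username, and dag_name values are placeholders for a self-hosted instance):

```python
from hamilton import driver
from hamilton_sdk import adapters  # the Hamilton UI's tracking SDK

import features

# Placeholder values -- point these at a self-hosted Hamilton UI project.
tracker = adapters.HamiltonTracker(
    project_id=1,
    username="platform@example.com",
    dag_name="feature_pipeline",
)

# Executions are now logged to the UI: DAG structure, run history, and
# data summaries that support observability and cataloging.
dr = driver.Builder().with_modules(features).with_adapters(tracker).build()
```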