GrainBridge’s first feature from our data refinery.

Lessons from building an agricultural data refinery

Mark Johnson
7 min read · Jul 20, 2021


Farmers have a tough job. Not only do they spend time in their fields maximizing the output of their crops, but they also need to pay attention to commodities markets, figuring out when to sell their crop.

At GrainBridge, our job is to help farmers make better decisions about when to sell (or “market”) their grain. We believe that data is the answer, which is why we undertook the task of building a data refinery.

The idea behind a data refinery is to ingest multiple datasets, clean them up, and put them in an environment where the data is easy to access by everyone in the company, whether it’s a PM who needs user data or a data scientist building a predictive model. We also needed to build the tools to serve that refined data back to our farmers.

Our legacy with data goes back to our founding. As a joint venture between ADM and Cargill, the two largest grain buyers in America, GrainBridge has a huge dataset of grain transactions in the US going back several years. By combining that data with external datasets like historical futures prices and USDA market forecasts, plus adding in our own user data, we can build decision support that helps farmers make smart, profitable decisions.

Our first application, launched in September of last year, gave farmers digital access to their data. The architecture of our app is largely transactional, focused on scaling up to millions of farmers, storing hundreds of millions of records, and ingesting data very quickly.

For the data refinery, we needed to build a new architecture tuned for analysis, where we can scan through millions of records rather than just return the set of records associated with a single farmer. We also needed to be prepared for any kind of large dataset, structured or unstructured, from time series, to weather, to satellite data. We don’t yet know all the data we’ll need, so flexibility was important to us.

Our application architecture vs. our data refinery architecture.

Ultimately our goal is to use data to become a personalized AI consultant, giving the farmer suggestions to help maximize profit and reduce risk. However, we started off with a straightforward MVP.

GrainBridge released our first data-focused feature in Q2. Instead of rushing out with minimal infrastructure built on epoxy and duct tape, we decided to spend time building out the basic pieces of the underlying data refinery. That way, we could accelerate data experiments for the rest of 2021 and beyond, and transform GrainBridge into a data-animated organization.

I recently wrote about how GrainBridge became a product development machine. Now, I’d like to share some lessons about how we put that machine to work and built an agricultural data refinery.

Hire data expertise…

One of the first hires after I arrived at GrainBridge was a top data scientist with experience building models on top of large consumer behavior datasets, plus several years at a big data platform. Brian has been the data inspiration for the team, sharing his wisdom based on a career of successes and mistakes. He regularly scheduled presentations to the team, outlining his vision for building a data refinery and what we could do with it.

In addition to Brian, we added another data scientist, a data engineer, and have an open position for a data BA (product support for data).

We also leveraged our relationship with AWS. We leaned on them for architectural support, to make sure that we were making sound, long-term decisions.

…but, everyone on the team should understand data

If your company’s product is based on a data refinery, then everyone in the company should understand data.

Spearheading our internal effort was GrainBridge’s long-time architect, Fayez Barbari (who sadly has since left the company). He had spent months ahead of the project educating himself about big data technologies, building prototypes, and creating a prototype architecture. Together with AWS, he ran a data workshop to train the entire team on the AWS architecture we’d be using and the related programming paradigms. Even though not everyone was working on the data refinery, we felt it was important for everyone to understand it.

That’s helped tremendously both in building the data refinery and having product conversations about what’s possible with our new architecture.

Spend time understanding the context of the data

The data we received from ADM and Cargill included contracts (what kind of grain is to be sold, where, at what price, and on what delivery date), delivery tickets (when grain was delivered and at what quality), and settlements (payment after the fulfillment of a contract). However, each of these records has dozens of fields, and some of those fields, like the type of contract, have dozens of possible values.

Though it was time-consuming for our owners to explain the business and operational context of the data to us, this was a critical step. It isn’t enough to drop data at the feet of a skilled data scientist and expect them to stumble on gold. ADM and Cargill provided insights and contextual information that are simply impossible to glean from the data alone.

Create a model of your solution space

The worst direction you can give a data scientist is “see what you can do with this dataset!” A much better, user-focused question is: what data features would be valuable to an end-user and are those features possible with this dataset?

To focus our thinking, our data scientist came up with a model very early on:

  • Score — giving every farmer a score related to some aspect of their grain marketing
  • Benchmark — for any given score, how well does the farmer do with respect to others
  • Insights — how do farmers who consistently score high relative to various benchmarks market their grain compared to those who do not, and can we communicate those insights to others

That framework gave us a “solution space”: a shared way to discuss the types of features that we could engineer on top of our datasets, which ultimately led to…
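To make the Score → Benchmark → Insights layering concrete, here is a minimal sketch in Python. All names and data shapes are hypothetical illustrations, not GrainBridge's actual code; it assumes each farmer has already been reduced to a single numeric score.

```python
def benchmark(farmer_score: float, peer_scores: list[float]) -> float:
    """Benchmark: what fraction of peers this farmer's score beats (a percentile)."""
    return sum(s < farmer_score for s in peer_scores) / len(peer_scores)

def insight_groups(scores_by_farmer: dict[str, float], top_fraction: float = 0.25):
    """Insights start by splitting farmers into consistently high scorers and
    the rest, so their grain-marketing habits can be compared."""
    ranked = sorted(scores_by_farmer, key=scores_by_farmer.get, reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    return ranked[:cutoff], ranked[cutoff:]
```

The point of the layering is that each step only depends on the one below it: any new score immediately gets benchmarking and cohort comparison for free.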

Have a clear MVP goal

If I had to choose one key lesson, this was it.

For our MVP, we were not optimizing for user value; rather, we were looking for the simplest possible feature that allowed a round trip of data through the data refinery. Of course, we also wanted something useful to our farmer customers, but it didn’t have to be the most valuable feature for them. It was more important to get something out to test the infrastructure.

We chose the average sales price of a farmer’s contracts. It was easily calculable, given that we had contracts for each of our farmers, and it was a useful number that most of them had probably never calculated for themselves. Plus, given the data feature model above, you need a score before you can benchmark farmers against their peers or their breakeven price.

Build out infrastructure for the future

A key principle was that we didn’t just want to build our first data feature, we wanted it to be built on a working data refinery. This is a tough balance to maintain: on one hand, we didn’t want to write custom infrastructure for a single feature, on the other hand, we didn’t want to build out the complete data refinery so that any feature could be built. Plus, security and data privacy are critical to us to maintain the trust of our farmers, so we couldn’t just slap something together.

Our head of engineering went to the whiteboard and sketched out the complete data refinery. We used that model as our basis for building a “round trip” with our MVP feature: could data be ingested into the data refinery, analyzed, and then appear in the final product? In this way, we were still building out each piece of the infrastructure while staying focused on delivering customer value.

Our actual whiteboard.

What’s next?

Our first order of business is to finish the remaining components of the data refinery. There are more datasets to ingest and more tools to build. We’re busily working on that this quarter, with the hope of releasing at least one more public data-focused feature.

Even more exciting is that we can start asking questions of our data: what makes a farmer really good at selling their grain? What are the habits of farmers who aren’t as profitable? Based on your past habits and the market, do we recommend that you sell today or hold off? Are local prices historically really good or really bad? How does your average price compare to other farmers like you?

These kinds of questions will allow us to get closer to our goal of helping farmers to become more profitable.

What about your company? Have you built a data refinery? What have you learned?



Mark Johnson

CTO of Stand Together. Former CEO of GrainBridge, Co-founder of Descartes Labs, CEO of Zite. Love product, philosophy, data refineries, and models.