A few thoughts on the analytical value of Ag Data

Not long ago (November 2020, revised March 2021), Shane Thomas published the graphic below on his Upstream Ag Insights blog, comparing digitised acres across the major ag tech players. This particular graphic has been making the rounds across the industry and I have certainly seen it used in a semi-marketing/sales-y capacity. Rather excitingly, Proagrica features in second position, just after John Deere.

However, if we take a step back and think this through, do these numbers actually tell us anything meaningful? Sure, high values kind of tell you that a large portion of the industry trusted your solution at some point in time for some specific purpose - and that is a message that carries a certain weight in some contexts - but is that all? Shane himself accompanies the graphic with the following very important (and spot-on) disclaimer:

“There is of course a challenge in identifying how each organization defines what an ‘acre’ entails. Not all acres are created equal when talking layers data management, value to the organization, revenue, how engaged the farmer is etc. (or even how the acres are counted, depending on the organization) which makes the comparisons null in many respects. The point of this is to track how companies that do report acres publicly compare to others and helps to create a list for individuals interested in learning more about the organization themselves.”

From our perspective as data practitioners, when we are presented with something that implies an underlying data estate, the first question that pops into our heads is a simple “but what can you do with that data?”, followed by “data is irrelevant unless there is analytical value to be extracted from it”. So, the fundamental questions I am inviting the reader to ponder are:

  • How do you go about understanding the analytical value of an FMIS / Ag Dataset?
  • Does the total digitised acreage tell you anything about that value?

Before we continue, let us call out the fact that the analytical value of a dataset is defined by its ability to provide the kind of insights that solve real-life problems. And with that in mind, let's jump into a metaphor to illustrate our point...

Cue 'Scrapheap Challenge'

The greatest UK TV Show ever

Also known as Junkyard Wars, it was a brilliant TV show on British television in the early 2000s. Its premise was based on a bunch of engineers and scientists being unleashed in a junkyard with a predefined challenge:

  • Build a machine capable of racking up points on a giant ice-bound pinball game
  • Build a machine capable of launching lumber in a caber-tossing fashion
  • Build a human-powered aircraft
  • Build a device capable of chopping its way quickly through a range of extremely tough targets…

Steam Conversion of a Reliant Robin - Reddit

These few examples should illustrate the insanity of the show; nevertheless, it was a great example of a team of experts coming together, identifying a potential solution to a problem, finding the appropriate ingredients/components and getting it done.

Now, let’s assume that you are asked to evaluate the chances of a team successfully tackling a challenge. For the purposes of this metaphor, we shall assume that the team members themselves are all geniuses and the best they could be in their field, thus removing the human factor from the equation. Even then, what would a mind-map of areas to explore look like in our mission to evaluate the chances of success?

The starting point should always be “what is the challenge?”. The question of success cannot be speculated upon unless you know what you are trying to achieve - it’s as simple as that.

Once the “what” has been established, then it turns into an exercise of evaluating how likely it is to find the ideal components for your solution. At this point, the junkyard itself becomes important:

For example, if it’s one containing mainly domestic appliances, trying to build an amphibious vehicle might be a bit of a challenge. At the same time, even if the correct type of components can be found but they are all 25 years old and rusty, there’s a good chance you’d think twice before stepping onto a boat made out of such inferior material. But no matter which way we try to slice and dice it, the fact remains:

The value of the junkyard can only be quantified through the lens of a particular challenge/problem. Being valuable for one challenge does not automatically make it valuable for another.

What does any of this have to do with Ag Data?

Let’s take a minute and acknowledge the fact that “how big is the junkyard?” does not feature very high on the list of questions to ask when evaluating the chances of a Scrapheap Challenge team being successful. Sure, a very big junkyard might slightly increase the chances of us finding the right material - but this is subject to other, more important prerequisites, such as whether we are looking for a steam boiler in an aircraft graveyard.

Yet, when we are presented with the "digitised acres per system" figure above, in terms of analytical value, all we are being told is “look, my junkyard is bigger than yours”. However, these numbers...

...tell you nothing about the nature of the data being collected.

This is about knowing whether your junkyard contains rusty cars or domestic appliances. Similarly, Ag Tech specialises in different areas - some systems are excellent at scouting operations, whilst others are great at processing OEM data. Understanding the strengths of these systems would go a long way towards understanding what kind of data you are likely to find.

...tell you nothing about the temporal aspect of the data

Are we talking about a system that acquired the bulk of its acreage years ago and has now lost a big chunk of its users? Are we looking at crumbling, rusty parts or something that with a bit of TLC can be shaped into something useful?

...do not tell you whether the components fit in a way that makes up a coherent picture

Sure, we have managed to find the pieces required to build our homemade rocket launcher - but will these pieces align correctly, as the blueprint requires? Just finding piles and piles of what appears to be the right kind of data doesn’t necessarily mean you are sorted. We had better illustrate with an example below.


When the data stars don't quite align as you'd wish - an example from a very real Ag Retailer

Who is our Ag Retailer?

The following example refers to real data that Proagrica holds in our agX Database for a very real North American Ag Retailer who takes pride in their data capabilities (even employing a data manager). Let’s call them “Great Crop Company Inc.” (GCC Inc.).

What is the analytical use case?

Keeping in line with the principle of “tell me your analytical use case before I can tell you how valuable your data is”, we shall focus on the fairly common use case of generating hybrid performance insights based on precision planting and yield maps (i.e. Yield by Hybrid).

What are the ingredients required?

For the aforementioned insight to be generated, the following combination of data elements is required: for a field in a given season, you need both a precision planting map and a precision yield map.

In a nutshell, it’s the combination of field / season / planting / yield that is required to provide the insight - not the standalone layers by themselves.
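
To make the "combination, not standalone layers" point concrete, here is a minimal sketch in Python - with entirely hypothetical names and structures, not any Proagrica schema - of the join the insight actually requires:

```python
from dataclasses import dataclass

# Hypothetical identifiers, purely for illustration.
@dataclass(frozen=True)
class FieldSeason:
    field_id: str
    season: int

def yield_by_hybrid_candidates(planting_maps, yield_maps):
    """Keep only the Field-Seasons where BOTH a planting map (carrying the
    hybrid) and a yield map exist - the combination the insight needs."""
    usable = {}
    for fs, hybrid in planting_maps.items():
        if fs in yield_maps:  # the join condition: same field, same season
            usable[fs] = (hybrid, yield_maps[fs])
    return usable

# A Field-Season with only one of the two layers contributes nothing:
planting = {FieldSeason("F1", 2021): "Hybrid-A", FieldSeason("F2", 2021): "Hybrid-B"}
yields = {FieldSeason("F1", 2021): 11.2}   # e.g. t/ha
print(yield_by_hybrid_candidates(planting, yields))
# {FieldSeason(field_id='F1', season=2021): ('Hybrid-A', 11.2)}
```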

What does our data estate look like?

GCC Inc. manages a total of 9,000 Fields through our solutions (numbers rounded for ease) and has been active with us for the past 22 years. Assuming that each one of these fields has been an active part of GCC’s estate for the past 22 years, the maximum potential number of Field-Seasons would be 198,000 (22 x 9,000). Of course, that is a big assumption, since it implies that GCC has not acquired or lost any clients over the past 22 years. Nevertheless, halving or even doubling that number has very little effect on the story, so for ease of calculation we shall stick to our assumption.

This number (198,000) is an important one: it indicates the maximum number of Yield by Hybrid data points available to GCC Inc., had they perfectly captured Yield and Planting maps for every Field-Season. Against that maximum, we can see a total of 11,000 Yield Maps and 6,000 Planting Maps.

When the Data Manager of GCC Inc. looks at the very high-level stats, he tends to be proud - thousands of fields managed, hundreds of gigabytes of actual data, plenty of it high-precision. But is there much reason for celebration?

How much Yield by Hybrid insight can GCC Inc. get?

As mentioned above, the key to generating the Yield by Hybrid insight is to identify successful Field / Season / Yield / Planting combinations. Having a Yield map by itself for a Field doesn’t quite cut the mustard. So, let’s see what the (let me remind you, real) data looks like for GCC Inc:

[Sankey diagram: breakdown of GCC Inc.'s Field-Seasons]

  • Max Potential Field-Seasons: 198,000
  • With no Planting or Yield: 187,000
  • With Planting or Yield: 11,000, of which:
      • Planting & Yield: 3,000
      • Yield only: 7,000
      • Planting only: 1,000

The most important insight from the Sankey Diagram above is that out of a potential 198k Field-Seasons, only 3k had the right combination of data to give us a Yield by Hybrid Insight. That’s 1.5% of the maximum potential. Even if we were to revisit our original assumption of “22 Seasons of 9,000 active fields” and halve it, we would still be left with a single-digit percentage.
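
The arithmetic behind that 1.5% is worth replaying, if only to make it easy to test different assumptions (the figures below are the rounded numbers quoted above):

```python
# Rounded figures from the GCC Inc. example.
active_fields = 9_000
seasons = 22
max_field_seasons = active_fields * seasons          # 198,000

planting_and_yield = 3_000                           # Field-Seasons with both layers
print(f"Coverage: {planting_and_yield / max_field_seasons:.1%}")                  # ~1.5%

# Halving the assumed history still leaves a single-digit percentage:
print(f"Coverage (11 seasons): {planting_and_yield / (active_fields * 11):.1%}")  # ~3.0%
```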

Are the rest of the Field-Seasons (187k + 7k + 1k) “dead”? Well, as far as their ability to answer Yield by Hybrid goes, the answer is “yes, they are”. This doesn’t mean they don’t contain valuable data - it just so happens that that data is of no use to us in this case (hence our original statement that the analytical value of a dataset can only be evaluated through the lens of a use case).

What do the Ag Tech companies do to help with this?

Looking around, it is hard to find instances where an ag tech provider has consciously gone out of their way to support the type of user activities that result in more complete data estates - and thus, more elaborate insights (if Yield by Hybrid can even be classified as such).

Companies such as Proagrica have invested time and effort in ensuring that individual data assets are of the highest quality at a granular level. For example, when a Yield Map is imported into our Sirrus product (via the Transform API), we will ensure that incorrect heading values are fixed, observation overlaps are addressed, reference data are standardised, etc. But the scope of such activities is always atomic. So what could be done to assist with all of this?
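
Purely to illustrate the kind of atomic, per-layer clean-up being described (a sketch of the general idea, not the actual Transform API), such a step might look like:

```python
def clean_yield_observations(observations):
    """Hypothetical clean-up of raw yield points: normalise headings,
    drop overlapping observations, standardise crop reference data."""
    crop_reference = {"corn": "Maize", "maize": "Maize", "soybeans": "Soybean"}
    cleaned, seen_cells = [], set()
    for obs in observations:
        # 1. Fix incorrect heading values (wrap into the 0-360 degree range).
        obs["heading"] = obs.get("heading", 0.0) % 360.0
        # 2. Address overlaps: skip points falling on an already-seen grid cell.
        cell = (round(obs["lat"], 5), round(obs["lon"], 5))
        if cell in seen_cells:
            continue
        seen_cells.add(cell)
        # 3. Standardise reference data (e.g. crop names).
        obs["crop"] = crop_reference.get(obs["crop"].lower(), obs["crop"])
        cleaned.append(obs)
    return cleaned
```

Note how the scope never leaves the individual layer - nothing here asks whether a matching Planting Map exists for the same Field-Season.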

Seamless data integration with other systems

Gone are the days when Product Owners and CEOs believed they could build one-stop-shops that are great at everything, such that their users would never feel the need to jump to another system. Instead, products now consciously try to integrate with as many ecosystems as possible (even competitors') in an attempt to ensure that they at least have a more comprehensive data picture. Nevertheless, this does not go a long way towards addressing our problem - it just makes our junkyard larger.

If you build it, they will come

A fairly common mindset across the data product community has been one that dictates “don’t bother building something (an insight, in our case) until you have a critical mass of good data for it”. However, this easily leads to a situation where your users are never incentivised to collect that data - whilst you are waiting for the users to collect the data so that you can give them the incentive. In retrospect, a bit of a chicken-and-egg situation. Ultimately, it’s the products that should take that leap of faith - not the user. It’s up to the products to prove the value of their insight.

 

Support for Custom Data Collection Protocols

Data collection systems have long used compulsory and optional fields as a means of dictating the type of data to be collected. However, quite often these have been the by-product of a back-end database design, or of a narrow view of what good/complete data should look like. What constitutes good data for an agronomy outfit that is all about developing pest pressure models from Scouting records is not the same as for a more traditional company that focuses on fertility. These definitions vary greatly - yet the behaviours of data collection products do not. It is understandable that it is not straightforward to start building systems that modify their data collection protocols based on particular use cases and user profiles.

However, if we are serious about supporting the analytical ambitions of our clients, then flexible mechanisms will have to be devised to support data collection - and to evaluate data estates - according to what matters to each individual client. After all, agriculture is far too complex to settle for generic recipes.
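
As a purely illustrative sketch (the protocol names and layer labels below are assumptions, not any existing product's schema), such a mechanism could be as simple as expressing each analytical use case as the set of layers it requires and evaluating every Field-Season against the protocols that matter to that particular client:

```python
# Hypothetical, use-case-driven collection protocols: each analytical goal
# declares the data layers it needs.
PROTOCOLS = {
    "yield_by_hybrid":    {"planting_map", "yield_map"},
    "pest_pressure":      {"scouting_record", "weather"},
    "fertility_planning": {"soil_sample", "application_map"},
}

def missing_layers(collected, use_case):
    """Which layers still need collecting for this Field-Season to serve the use case?"""
    return PROTOCOLS[use_case] - collected

# A Field-Season with a yield map but no planting map:
print(missing_layers({"yield_map", "soil_sample"}, "yield_by_hybrid"))
# -> {'planting_map'}
```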


This article is part of a series of informal pieces written by the Proagrica Data Team, aiming to instigate honest and intelligent discussions on the state of agricultural data and associated systems. More often than not, these pieces are inspired by the team's deep-dives into Proagrica's data estate, but nevertheless, the opinions stated in these articles aim to remain impartial. If you have any comments, please contact the author of this article.