Most People Like Fruit: the importance of data disaggregation.

Sam Shannon
4 min read · Apr 27, 2020

I’m going to explain, using a fruit bowl, how we’re at risk of masking real problems — or worse, completely overlooking them — when we default to an aggregated view of data.

The fruit bowl

A bowl of fruit is put out in an office common area. At the end of the day, if there’s fruit left, it’s thrown out. The same fresh selection is put out the next morning. The fruit bowl programme is piloted for five days.

This is what the fruit bowl looks like at the start of each day:

This is what it looks like at the end of each day:

Depending on how closely people were keeping an eye on the bowl of fruit over the last five days, they may have varying opinions:

  1. There are only a few pieces of fruit left each day, therefore most people like fruit.
  2. There are only kiwis left each day, therefore people must not like kiwis.

Deciding the fruit bowl’s fate

Budget cuts are introduced around the time the pilot finishes and the fruit bowl programme is put under intense scrutiny. The finance department looks closely at the numbers to see if it should be kept or cut.

Scenario 1: Overall data recorded

Table 1

Only 75% of the fruit is eaten each day, which means 25% of the fruit is consistently going to waste. Looks like the fruit bowl isn’t as successful as initially planned and times are tough.

Graph 1

Decision: the fruit bowl gets cut from the budget.

Scenario 2: Data recorded by fruit-type

Table 2

Finance can see bananas, apples and oranges are consistently eaten day after day, but no one is eating the kiwis.

Graph 2

Decision: kiwis are removed from the fruit order. A survey is sent out to see what people want instead. The fruit bowl lives to see another day.

So what?

The overall data in Table 1 is an aggregated view while data by fruit-type in Table 2 is disaggregated.

  • Aggregated is when data is summarised or lumped together.
  • Disaggregated is when data is broken down into smaller units or sub-categories, so we can see unique differences that aren’t reflected in the aggregated view.
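To make the difference concrete, here is a rough Python sketch of the two views. The counts are made up (five of each fruit put out, no kiwis eaten) purely so the total matches the 75% figure from Scenario 1; they are assumptions, not the pilot's actual numbers, and the function names are only illustrative.

```python
# Hypothetical daily counts: assumed values chosen so the overall
# figure comes out at 75%, not the real data from the pilot.
put_out = {"banana": 5, "apple": 5, "orange": 5, "kiwi": 5}
eaten = {"banana": 5, "apple": 5, "orange": 5, "kiwi": 0}


def aggregated_rate(put_out, eaten):
    """Share of all fruit eaten, lumped together (the Scenario 1 view)."""
    return sum(eaten.values()) / sum(put_out.values())


def disaggregated_rates(put_out, eaten):
    """Share eaten for each fruit type separately (the Scenario 2 view)."""
    return {fruit: eaten[fruit] / put_out[fruit] for fruit in put_out}


overall = aggregated_rate(put_out, eaten)
by_fruit = disaggregated_rates(put_out, eaten)

print(f"Overall eaten: {overall:.0%}")
for fruit, rate in by_fruit.items():
    print(f"{fruit}: {rate:.0%}")
```

With these assumed counts the aggregated view reports a single 75%, while the per-fruit view shows three fruits fully eaten and kiwis untouched. Same data, very different story.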

As we saw with the fate of the fruit bowl, the level of detail in recorded data can influence an outcome and plays a pivotal role in decision-making. An aggregated view of data is useful in hiring or enrolment practices in order to eliminate bias, but when it comes to making decisions that affect people’s lives, the details matter.

Data disaggregation in real life

In 2007, the World Health Organization (WHO) reported that data was rarely sex-disaggregated when it came to epidemic-prone infectious diseases, limiting the understanding of gender dynamics, the identification of vulnerable groups, and the development of appropriate responses.

Thirteen years later, in the midst of a global pandemic, not all countries have learned this lesson. The New York Times reported that although the US has ramped up testing and is churning out reams of data by the minute, it isn’t breaking that data down by sex.

In early April, the American Medical Association (AMA) reported that fewer than 12 states had shared data on the racial and ethnic patterns of COVID-19. The AMA is pleading with the U.S. Department of Health and Human Services and all health-related entities to standardise and collect race and ethnicity data, and to make existing data publicly available, in order to effectively manage the spread of the pandemic.

The UN is also calling for a constant flow of new and disaggregated data to inform efforts to save lives.

When we’re treated as a homogeneous group, the solutions to problems are one-size-fits-all, and they do not fit most of us.

Why isn’t data always disaggregated?

Let’s look at the fruit bowl example again. In Scenario 1, the programme director collected a high-level view of what was going on with the fruit bowl. In Scenario 2, they collected a more detailed view, broken down by fruit type.

Scenario 2 required a bigger investment in time and effort and resulted in a lot more data. If this were a real-life example, with far more data in each scenario, the data from Table 2 would cost more to store because there’s simply more of it.
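As a back-of-the-envelope illustration of "more of it", here is a small Python sketch that counts the records each approach would need. It assumes one record per day for the overall view and one record per day per fruit type for the detailed view, with four fruit types (the four named in this story). Both the sizes and the `record_count` helper are hypothetical.

```python
# Hypothetical sizes: a five-day pilot and the four fruit types named
# in the story. Real programmes would have far more of both.
days = 5
fruit_types = 4


def record_count(days, categories=1):
    """Records needed when each day is logged once per category."""
    return days * categories


aggregated_rows = record_count(days)                    # one total per day
disaggregated_rows = record_count(days, fruit_types)    # one row per day per fruit

print(f"Aggregated: {aggregated_rows} records")
print(f"Disaggregated: {disaggregated_rows} records")
```

Under these assumptions the detailed view needs four times as many records, and every extra breakdown (by floor, by team, by time of day) multiplies the count again.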

Data collection is ultimately determined by who’s in charge and the resources available. But I’m sure you’ll agree Scenario 2 was worth the investment because it led to the root of the problem.

When we only look at the surface level of the data, we risk losing vital context, information and depth. If we keep defaulting to the aggregated view, we’ll continue to mistreat problems in our world because we don’t truly understand them. We’ll cut the fruit bowl from our budget. Bananas, kiwis, and all.

Sam Shannon

Using both sides of my brain. Research, data analysis, data visualization, and illustration. 📈✏️💭