The new coronavirus that has been ravaging countries and sending us all into lockdown has caused the most closely observed pandemic we’ve ever experienced. Data about the virus itself and, perhaps more appropriately, about the nations it is affecting has been shared from multiple sources. These include academic institutions such as Johns Hopkins University, national governments, and international organisations such as the World Health Organisation. The data has been made available in many formats, from programmatically accessible APIs to downloadable comma-delimited files to prepared data visualisations. We’ve never been more informed about the current status of anything.
Data Aggregation
What this newfound wealth of data has also brought to light is the true power of data aggregation. Only a limited number of conclusions can be drawn from the number of active and resolved cases per nation and region. Over time, these figures can show us a trend, and they also give a very real snapshot of where we stand today. However, if we layer on additional data, such as when actions were taken, we can see a clear picture of the impact of each strategy over time. With each nation taking a different approach based on its own perceived position, mixed with culture and other socio-economic factors, we end up with a good side-by-side comparison of the strategies and their effectiveness. This is helping organisations and governments make decisions going forward, but data scientists globally are urging caution. In fact, the data we are producing today by processing all of these feeds may turn out to be far more valuable for the next pandemic than it will be for this one. It will be the analysis that helps create the “new normal.”
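As a rough illustration of layering an action date onto case counts, the sketch below splits a daily series at the day a measure was introduced. The figures, the date and the helper name `split_by_event` are all made up for the example and are not taken from any real feed.

```python
from datetime import date

# Made-up daily new-case counts for one hypothetical region.
daily_cases = {
    date(2020, 3, 1): 10,
    date(2020, 3, 2): 20,
    date(2020, 3, 3): 30,
    date(2020, 3, 4): 40,
    date(2020, 3, 5): 36,
    date(2020, 3, 6): 30,
    date(2020, 3, 7): 24,
    date(2020, 3, 8): 18,
}

# Hypothetical day on which a measure (for example a lockdown) began.
intervention_day = date(2020, 3, 5)


def split_by_event(series, event_day):
    """Return (values before event_day, values on or after it), in date order."""
    ordered = sorted(series.items())
    before = [count for day, count in ordered if day < event_day]
    after = [count for day, count in ordered if day >= event_day]
    return before, after


before, after = split_by_event(daily_cases, intervention_day)
print("before:", before)
print("after: ", after)
```

A split like this only places the two data sets side by side. It does not show that the measure caused any change, which is part of why caution is being urged.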
So why the suggestions of caution? Surely the more data we have the better? The answer is an age-old problem relating to, well, age. Data fluctuates, particularly medical data. The data set we currently have is very recent, and it is small compared to the global population. The challenge at the moment is that the data is often presented to the public in a raw or semi-raw form, with little regard for how it might be interpreted.
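One common way to stop day-to-day fluctuation being mistaken for a trend is to average over a sliding window. The sketch below is a minimal version of that idea, using made-up counts and an assumed helper name, `rolling_mean`. It is not a description of how any particular dashboard processes its figures.

```python
def rolling_mean(values, window=7):
    """Average each run of `window` consecutive values.

    Returns one average per complete window, so the result is
    shorter than the input by window - 1 items.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        averages.append(sum(chunk) / window)
    return averages


# Made-up daily counts that jump around from one day to the next.
daily_counts = [100, 120, 90, 130, 110, 80, 140, 95]
smoothed = rolling_mean(daily_counts, window=7)
print(smoothed)
```

The raw series falls from 140 to 95 on the last day, while the two 7-day averages barely move. A single day's drop is not yet evidence of improvement.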
The Need for Context
Therefore, when we see a reduction in infections within a country, it could be interpreted as an improvement in the situation rather than the simple variation that is to be expected. Data has the most value when it is presented in context. On May 15, 2020, at the time of this writing, the total number of recorded deaths from the novel coronavirus stood at more than 300,000. This is a large number and is bound to increase, exponentially for a time, but it needs to be understood in context. It can be large or small depending on the time frame, the geographic scale, and the demographic composition of the population affected. For example, one statistic that has not been clearly shown in press coverage of the numbers is the count expressed as a percentage of the population. This is largely because it does not make such compelling reading: in the early stages, these numbers are thankfully very small. However, the context it provides is very important. If 5,000 people are sick in an area with a population density of 100,000 per square km, this is very different from 5,000 people being sick in an area with a density of 1,000 per square km. A figure like this describes the situation in the context of the population that is feeling the impact.
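The arithmetic behind that kind of context is simple. The sketch below expresses a raw count as a rate per 100,000 people. The two regions and their populations are hypothetical and chosen only to show how the same raw number can describe very different situations.

```python
def cases_per_100k(cases, population):
    """Express a raw case count as cases per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * 100_000 / population


# Hypothetical regions with the same raw count but different populations:
# (cases, population)
regions = {
    "Region A": (5_000, 10_000_000),
    "Region B": (5_000, 100_000),
}

rates = {
    name: cases_per_100k(cases, population)
    for name, (cases, population) in regions.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} cases per 100,000 people")
```

Both regions report 5,000 cases, yet the rate works out at 50 per 100,000 in Region A and 5,000 per 100,000 in Region B.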
This is information presented not in its raw format but normalised, cleaned and set alongside other influencing data. This is what data scientists have been doing and sharing with the community in order to draw valuable insights from the mass of data we have. Other work is going on to augment that data, providing new contexts and new insights that are proving ever more valuable in how societies react to, and even predict, the impact of Covid-19.
One such effort is the inclusion of age and prosperity data. This leads to an understanding of how actions such as shuttering businesses, which disproportionately affects lower-paid workers, will cause poverty or other socio-economic challenges. It shows how the distribution network will need to adapt to serve the new priorities of food and health items in the absence of luxury items. It also shows how the transport networks that provide vital links for our health professionals and key workers can be maintained so that services stay frequent enough to allow for social distancing, without restricting transport so much that overcrowding becomes the unintended consequence.
Data Lessons Learned
What all of this work around data has taught eager data scientists and engineers is that there are really two personas when it comes to data science: producers and consumers. There are also multiple levels of consumer: the intended audience, often professionals who bring their own implied context, and passive observers who are exposed to the data through news reporting, internet searches and general discovery.
The data scientists and engineers are the producers. They have a specific view of the data, a firm understanding of how to interpret it, and a sense of which parts can be safely disregarded. To them, a scatter plot, a hexbin map or a similar visualisation can be processed intuitively and provides an immediate understanding. The same can be said for the intended consumers. However, the general public does not have the experience and training required to make such judgements, so the way data is presented to them must inform whilst taking care of all the assumptions and context. The skills the population learns in parsing and interpreting the data, along with the fine-tuning of data professionals' skills in presenting information in consumable packages, will coalesce to bring a level of data literacy never before seen. This same skill set can then be used to present social, political, financial and many other kinds of data to a data-savvy populace.
The Tip of the Iceberg
Guidance and principles on how to get started with data assessment and make sense of the numbers are needed. They should be aimed at citizen data enthusiasts, at journalists who might need help interpreting data that is being presented rapidly, and at anyone consuming the data who would like to understand its context and provenance.
The work getting the focus in the news around Covid-19 data is the tip of a very large iceberg, and likely not the work that will have the most impact over the longer term. The ability to educate millions on the meaning of data when it is presented in context will drive new social conversations far into the future. It will allow us to understand how our societies and economies really work, and what our priorities should be, so that when the next pandemic hits the world, we are ready and informed. The learning we are doing now will be the best defence for the next event, whilst helping us make immediate decisions to inform our reaction to this one.