The SNIA Cloud Storage Technologies Initiative (CSTI) recently hosted a webcast, “Using Data Literacy to Drive Insight,” featuring Glyn Bowden from HPE and moderated by me. In a wide-ranging discussion of just over 45 minutes, we covered a variety of topics related to ensuring the accuracy of data in order to draw the right conclusions, using current examples from the COVID-19 pandemic as well as law enforcement. Several questions and comments arose during the dialog, and we’re collecting them in this blog.
Q. So who really needs Data Literacy skills?
A. Really, everyone does. We all make decisions in our daily lives, and it helps to understand the provenance of the information being presented. It’s also important to find your way back to the source material for the data when necessary in order to make the best decisions. Everyone can benefit from knowing more about data, since we all need to interpret the information offered to us by people, the press, journals, educators, colleagues, and friends.
Q. What’s an example of “everyone” who needs data literacy?
A. I offered an example of my work as a board member in my local police and fire district, where I took on the task of statistical analysis of federal, state, and local COVID-19 data in order to estimate cases in the district that would affect the policies and procedures of the service district personnel. Glyn also offered simple examples of the differences between raw counts and percentages, and how they should be compared and contrasted. We cited some of the regional variations in COVID data that arise from the methodologies of the people reporting it. There are many other examples of data literacy shared in the webcast, including some wonderful data around emergency service call personnel, weather, pubs, paydays, and lunar cycles. Why haven’t you started watching it yet? Remember, it’s available on-demand along with the presentation slides.
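To make the counts-versus-percentages point concrete, here’s a minimal sketch in Python (all numbers invented for illustration) showing how raw case counts and per-capita rates can tell very different stories:

```python
# Hypothetical case counts for two regions (numbers invented for illustration).
regions = {
    "Region A": {"cases": 5_000, "population": 100_000},
    "Region B": {"cases": 8_000, "population": 1_000_000},
}

for name, d in regions.items():
    per_100k = d["cases"] / d["population"] * 100_000  # cases per 100,000 residents
    print(f"{name}: {d['cases']:,} cases, {per_100k:,.0f} per 100k")

# Region B reports more cases in absolute terms (8,000 vs. 5,000),
# but Region A's per-capita rate is over six times higher (5,000 vs. 800 per 100k).
```

Which framing is “right” depends on the question being asked, and that is exactly the kind of judgment data literacy is meant to sharpen.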
Q. What’s the impact of bias in the “data chain”?
A. Bias can come from anywhere. Even the most “pure” providers of source data (in this case, doctors or hospital data scientists) can “pollute” the data. To qualify a report, you need to determine how much trust you have in the provider of the data. Glyn cited several examples of how the filter of the interpreter can introduce bias that must be understood by a viewer of the data. “Reality is an amplifier of bias” was the non-startling conclusion. Glyn made an interesting comment on bias: when you see a summary, the first questions you should ask are what has been left out, and why? What’s left out is usually what creates the bias. It’s also useful to look for any data that supports a counter-opinion, which might lead you to additional source material.
Q. On the concept of data modeling. At some point, you create a predictive model. First, how useful is it to review that model? And what does an incorrect model mean?
A. You MUST review a model; you can’t assume it will always hold true, since you’re acting on the data you have, and more will always come in to affect the model. You need to review it, and you should pick a regular cadence. If you see something wrong in the model, it could mean that you have incomplete data or have injected bias. Glyn offered a great example involving empty and full trash containers.
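As a rough illustration of a review cadence, here’s a sketch (the model, threshold, and observations are all hypothetical) that checks a simple predictive model against newly collected data and flags when it has drifted:

```python
from statistics import mean

def predict(day):
    """Hypothetical model: daily case estimates as a simple linear trend."""
    return 100 + 5 * day

def review_model(model, new_observations, tolerance):
    """Compare model predictions against freshly collected observations.

    Returns True if the model still holds within tolerance, False if it
    should be rebuilt with the new data folded in.
    """
    errors = [abs(model(x) - y) for x, y in new_observations]
    return mean(errors) <= tolerance

# On a regular cadence (say, weekly), test against the latest data.
latest = [(30, 280), (31, 260), (32, 255)]  # (day, observed cases); invented
if not review_model(predict, latest, tolerance=10):
    print("Model has drifted; retrain with the new data folded in.")
```

The specifics will vary, but the pattern is the point: reviews happen on a schedule, not only when something looks wrong.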
Q. So, the validity of the data model itself is actually data that you can use to adjust your assumptions?
A. Absolutely. More data of any kind should affect the development of the next model. Everything needs to be challenged.
Q. Would raw data therefore be the best data?
A. Raw data could have gaps that haven’t been filled yet, or it might contain sensor errors of some type. Data does need to be cleaned, but be aware that cleaning raw data has the potential to inject bias. It takes judgment and model creation to validate your methods for cleaning data.
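Here’s a minimal sketch (made-up numbers, deliberately simple cleaning rules) of how a single cleaning decision can inject bias: whether missing reports are dropped or treated as zero changes the average you would report.

```python
from statistics import mean

# Hypothetical daily case reports with gaps (None = day not reported).
raw = [12, None, 15, 40, None, 11, 13]

# Cleaning strategy 1: drop the missing values entirely.
dropped = [x for x in raw if x is not None]

# Cleaning strategy 2: treat "not reported" as zero cases.
zero_filled = [x if x is not None else 0 for x in raw]

print(f"drop-missing mean: {mean(dropped):.1f}")      # 18.2
print(f"zero-fill mean:    {mean(zero_filled):.1f}")  # 13.0

# Neither strategy is "wrong," but each encodes an assumption about
# why the data is missing, and that assumption flows into every
# conclusion drawn downstream.
```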
Q. Would it be worthwhile to run the models on both cleaned and raw data to see if the model holds up in a similar way?
A. Yes, and this is the way that many artificial intelligence systems are trained.
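As a hedged illustration of that idea, the following sketch (invented readings and a deliberately crude outlier rule) fits the same simple least-squares model to raw and cleaned versions of a series; if the two fits diverge sharply, the model is sensitive to the cleaning step:

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n

# Invented sensor readings; the raw series ends with one suspect spike.
raw = [(0, 10), (1, 12), (2, 14), (3, 16), (4, 55)]
cleaned = [(x, y) for x, y in raw if y < 50]  # crude outlier removal

for label, data in [("raw", raw), ("cleaned", cleaned)]:
    slope, intercept = fit_line(data)
    print(f"{label:8s} slope={slope:.2f}  intercept={intercept:.2f}")

# The raw fit has slope 9.40; the cleaned fit has slope 2.00. The model
# is highly sensitive to the cleaning step, which is worth understanding
# before trusting either result.
```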
Q. Another question concerns the flow of data as opposed to the data itself. Is the flow of the data something that can be insightful?
A. Yes. The flow of data, and the iteration of the data through its lifecycle, can affect its accuracy. You won’t really know how it’s skewed until you look at the model, so make a determination and test it in order to see.
Q. How does this affect data and data storage?
A. As more data is collected and analyzed, we’ll start to see different patterns emerge in our use of storage. So, analysis of your storage needs is another data model for you to consider!
Please feel free to view the webcast and comment, and we’d be happy to hear about topics for future webcasts that would interest you.