Healthcare Data’s Garbage-In, Garbage-Out Challenge

Part #1 - What Exactly is the GIGO Problem?

Introducing WCI’s new series detailing the issues behind, and providing solutions for, the key medical data challenge of the 2020s: avoiding Garbage In, Garbage Out.

Series Overview

Garbage In, Garbage Out (GIGO for short). We’ve all heard this phrase before. While most understand it describes something to avoid, what exactly is the problem, and why is it relevant to the medical community?

In this ongoing series, West Coast Informatics will examine the issue and its impact on the medical community, and provide approaches to minimize and ultimately avoid it. The questions we will answer in this series include:

  • Part #1: What exactly is the GIGO problem? (current article)

  • Part #2: Why is the GIGO problem so relevant to the healthcare community?

  • Part #3: What are the implications if this issue isn’t handled properly? (May 2023)

  • Part #4: What immediate steps can be taken to mitigate the problem?

  • Part #5: What are the challenges in addressing historical data?

  • Part #6: What can we learn from outside the medical community?

What is GIGO?

Garbage In, Garbage Out is the notion that if your data is flawed, any decisions based on that data will also be flawed. That is true in finance, in public policy, and in nearly any field where data is part of the decision-making process.

This issue is especially prevalent in computer programs and processes. Code and algorithms take data as input, process that data according to some instructions, and produce an output of some type. Because this processing is done by computers rather than humans, most applications will not recognize, never mind correct, input data that is garbage. As a result, one cannot have confidence in the output produced by such an application.
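
To make this concrete, here is a minimal Python sketch (the readings are hypothetical) of a routine that averages body temperature readings. It computes faithfully on whatever it receives and has no way of knowing when an input is garbage:

```python
def average(readings):
    """Return the mean of a list of numeric readings.

    Note: the function trusts its input completely; it has no way
    to tell a valid reading from a garbage one.
    """
    return sum(readings) / len(readings)

# Valid input produces a sensible output...
print(average([98.6, 98.4, 98.7]))  # -> 98.56...

# ...but a garbage entry (say, a Celsius value mixed into
# Fahrenheit readings) is processed just as happily, and the
# output is silently wrong: garbage in, garbage out.
print(average([98.6, 37.0, 98.7]))  # -> 78.1
```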

What is “Garbage Data”?

There are a number of ways that the data arriving at a particular application could be classified as “garbage”:

  • Incomplete data: This is the case when one incorrectly believes they have the full data set needed for an analysis. It is further complicated by the fact that it isn’t always self-evident to a human, never mind a computer, that data is missing. Alternatively, the data source might not have been in a position to capture what was expected.

    • Example: Consider the display of the score of a game. If the system receives only 75% of the scoring updates, the score posted might be correct or might be very wrong.

  • Corrupted Data: In this case, something about the delivery of the input data, or the reading of the data from the input source, is corrupted. Imagine pulling data from the cloud and having an internet disruption cancel the download mid-stream.

    • Example: A system reads a price from a store, but due to an error in transmitting the data, only the cents value is returned. The price of an expensive appliance may then show as “99 cents”.

  • Wrong Data: This occurs when the data provided isn’t correctly populated. It can happen due to problems in data processed upstream (before it arrives at the process in question). It can also be due to human data-entry error, whether by choosing the wrong entry or because the wrong data was provided to the user.

    • Example: A system sends a temperature reading, but instead of sending it in Fahrenheit, it sends the value in Celsius.

  • Sloppy Data: This is mainly a matter of misplaced confidence in the data provided. Sloppy data can arrive in many forms, such as values represented in fields other than those the receiver expects, or data drawn from multiple sources while ignoring the disparities between them.

    • Example: A blood pressure reading where the systolic and diastolic entries are mistakenly entered into a single field instead of being split into two. Thus, one might see “120/80” where two data entries would be expected: “120 systolic” and “80 diastolic”.

    • Example: Formatting differences between unaligned sources could involve a second source sending the same reading as “120S---80D”. It is then up to the consuming system to a) recognize that neither source is sending data in the expected format, and b) normalize each source’s data to the expected format before processing it (see the sketch following this list).

      Note: There is an assumption that this issue can be easily identified. Unfortunately, that is not always the case.
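
To make the normalization burden concrete, here is a minimal Python sketch. The two input formats come from the blood pressure examples above; the parsing rules themselves are illustrative assumptions, not a real interface:

```python
import re

# Hypothetical patterns for the two formats described above:
# "120/80" and "120S---80D". Real feeds may vary far more widely.
SLASH_FORMAT = re.compile(r"^(\d{2,3})/(\d{2,3})$")
TAGGED_FORMAT = re.compile(r"^(\d{2,3})S---(\d{2,3})D$")

def normalize_bp(raw: str) -> dict:
    """Normalize a blood pressure string into separate fields.

    Raises ValueError for anything it does not recognize, so that
    unexpected formats fail loudly instead of flowing downstream.
    """
    for pattern in (SLASH_FORMAT, TAGGED_FORMAT):
        match = pattern.match(raw.strip())
        if match:
            return {"systolic": int(match.group(1)),
                    "diastolic": int(match.group(2))}
    raise ValueError(f"Unrecognized blood pressure format: {raw!r}")

print(normalize_bp("120/80"))      # {'systolic': 120, 'diastolic': 80}
print(normalize_bp("120S---80D"))  # {'systolic': 120, 'diastolic': 80}
```

Even this tiny sketch embodies the two obligations noted above: recognizing that a source is not sending data in the expected format, and normalizing it before processing. Every new source means another pattern, and anything unrecognized should fail loudly rather than flow downstream.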

In each of these examples, there are challenges for the system receiving and processing the data for its own purposes. That system must either identify the error, or it will continue to process and use the garbage data, rendering its output inaccurate. This is the challenge of Garbage In, Garbage Out.

The first two cases are straightforward and can be handled with mature safeguards, both human and machine (a sketch of one such safeguard follows below). The far larger challenge lies in handling wrong and sloppy data, because it falls to the data consumer to identify such cases rather than having the system flag them. Identifying these issues requires a human to analyze the data and pinpoint its flaws. And even when humans can identify an issue, they may not be able to normalize or otherwise correct the sloppy data. This is most often the case in the medical industry, where medical information is represented so arbitrarily that it may not even align between two people sitting next to each other! We will be looking into this more in the next article.
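
As an illustration of what such a safeguard might look like for the first two cases, here is a minimal Python sketch. It assumes a hypothetical delivery manifest in which the sender supplies an expected record count and a SHA-256 checksum alongside the payload:

```python
import hashlib

def verify_delivery(payload: bytes, expected_count: int,
                    expected_sha256: str) -> None:
    """Reject incomplete or corrupted deliveries before processing.

    Assumes a hypothetical manifest in which the sender announces
    a record count and a SHA-256 digest for each delivery.
    """
    # Corrupted data: the bytes received are not the bytes sent.
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("Checksum mismatch: payload corrupted in transit")

    # Incomplete data: fewer records arrived than were promised.
    records = payload.split(b"\n")
    if records and records[-1] == b"":
        records.pop()  # ignore a trailing newline
    if len(records) != expected_count:
        raise ValueError(
            f"Expected {expected_count} records, got {len(records)}")
```

Notice that wrong and sloppy data sail straight through both checks: such a payload arrives complete and intact; it is simply not what it should be. That asymmetry is precisely why those two cases are the harder problem.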

Next Topic: Why is the GIGO problem so relevant to the healthcare community?
