Ugly side of healthcare data

This post describes some of the commonly seen types of errors in healthcare data.

Occasionally, memories flash across my mind, of the days I spent in maths lectures, of being amazed at the clean, efficient and elegant proofs that demonstrate the logical integrity of abstract maths theorems. Fast forward to today, I more often than not find myself marveling at the opposite end of the spectrum of cleanliness, efficiency and elegance…

Healthcare data is messy.

Healthcare is complex. Healthcare data is mostly generated by humans to reflect human conditions treated by human processes. To err is human.

Also, many of the workflows and IT systems that generate healthcare data have multiple legacy layers piled on over many years. These all introduce complexities and potential errors in the data.

A healthy dose of caution and skepticism is critical to analyzing healthcare data pragmatically and producing accurate, actionable, reliable insights. The adage “garbage in, garbage out” should be seared into the mind of every analyst.

Typically, I’d spend over 50% of any project on data scrubbing/cleaning and ETL (extracting, transforming, loading), around 15% liaising with stakeholders, 20%-30% on the actual analysis, and the rest of the time on communicating the results.

Types of errors in healthcare data


Fat finger errors, like forgetting the decimal place or entering the wrong gender for a patient, occur often. Errors can also be easily introduced when data is moved around, e.g. by forgetting about filters on the data or neglecting some data elements.
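A dropped decimal place often shows up as a value that is physically implausible, so a simple range check can catch some of these errors. The sketch below is only an illustration: the field names (`weight_kg`, `temperature_c`) and the limits in `PLAUSIBLE_RANGES` are made-up assumptions, and real limits would need clinical input.

```python
# Hypothetical plausibility limits, chosen only for illustration.
PLAUSIBLE_RANGES = {
    "weight_kg": (0.3, 400.0),
    "temperature_c": (30.0, 45.0),
}


def flag_implausible(records, ranges=PLAUSIBLE_RANGES):
    """Return (record_index, field, value) for each value outside its range."""
    flagged = []
    for i, record in enumerate(records):
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is None:
                continue  # missing values are a separate problem
            if not (low <= value <= high):
                flagged.append((i, field, value))
    return flagged


records = [
    {"patient_id": "A", "weight_kg": 72.5, "temperature_c": 36.8},
    {"patient_id": "B", "weight_kg": 725.0, "temperature_c": 37.1},  # decimal point dropped
    {"patient_id": "C", "weight_kg": 64.0, "temperature_c": 368.0},  # decimal point dropped
]

for index, field, value in flag_implausible(records):
    print(f"record {index}: {field}={value} looks implausible")
```

A check like this only flags candidates for review. It cannot tell whether 725.0 should have been 72.5, and it will not catch errors that stay within the plausible range, such as a wrong gender.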


Errors can be introduced into the data through carelessness or forgetfulness. E.g. a nurse may forget to code a clinical condition.


Transaction data, like insurance claims or drug claims, usually contains multiple lines per claim, e.g. initial claim submission, subsequent claim processing, final payment etc. Any one of those steps could have generated duplicates which, if not handled during the data extraction process, will lead to overcounting of the underlying patient experience. Merging data tables, e.g. drug claims with plan coverage data, carries a similar risk: if the join key is not unique on one side, rows are multiplied and the output is artificially inflated.
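Both problems can be sketched in a few lines. The example below keeps one line per claim and refuses to merge when the coverage table repeats a key. The field names (`claim_id`, `member_id`, `seq`, `paid`, `plan`) are hypothetical, and treating the highest sequence number as the final state of a claim is an assumption; real claims data may carry reversals and adjustments that need more careful handling.

```python
def latest_claim_lines(lines):
    """Keep one line per claim_id: the one with the highest sequence number."""
    latest = {}
    for line in lines:
        current = latest.get(line["claim_id"])
        if current is None or line["seq"] > current["seq"]:
            latest[line["claim_id"]] = line
    return list(latest.values())


def merge_checked(claims, coverage, key):
    """Attach coverage fields to each claim; fail if the key repeats in coverage."""
    lookup = {}
    for row in coverage:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key}={row[key]!r} in coverage; merge would inflate rows")
        lookup[row[key]] = row
    # Claims without a coverage match are kept unchanged.
    return [{**claim, **lookup.get(claim[key], {})} for claim in claims]


claim_lines = [
    {"claim_id": "C1", "member_id": "M1", "seq": 1, "status": "submitted", "paid": 0.0},
    {"claim_id": "C1", "member_id": "M1", "seq": 2, "status": "processed", "paid": 0.0},
    {"claim_id": "C1", "member_id": "M1", "seq": 3, "status": "paid", "paid": 120.0},
    {"claim_id": "C2", "member_id": "M2", "seq": 1, "status": "paid", "paid": 45.0},
]
coverage = [
    {"member_id": "M1", "plan": "gold"},
    {"member_id": "M2", "plan": "silver"},
]

claims = latest_claim_lines(claim_lines)
merged = merge_checked(claims, coverage, "member_id")
print(len(claim_lines), "lines ->", len(claims), "claims ->", len(merged), "merged rows")
```

Comparing row counts before and after each step, as the last line does, is a cheap way to notice when a merge has multiplied rows.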

Gaps in data

Despite efforts in recent years, interoperability (the ability to merge data from multiple organizations) in healthcare remains a challenge. Doctors, hospitals and insurers hold data in different formats in their own silos. E.g. you might see hospital claims from Hospital A but no drug claims or data on office visits that took place outside the hospital.

Typically, insurers would have the best 360 degree view of any patient. However, when people move jobs or insurers, their historical data often does not move with them. This hinders what is possible when you analyze the data.

When coding systems such as ICD get version updates, a whole series of changes needs to take place, which could introduce errors. Also, these version changes usually move towards more detailed codes, which makes comparison back in time less straightforward.


When financial incentives are involved, people and organizations may intentionally create incorrect or untruthful data. E.g. you may see some clinics undercoding complications.

Wrong interpretation

The complexity of health systems creates complex data sets that make interpretation of the data tricky. The analyst has to question the results actively.


I’ll describe ways to detect and correct data errors in future posts.
