In this post, I show you the core SQL commands that will address a lot of the data management work you will do as a health data analyst.
If you work with data in a large organization that has lots of it, you probably use SQL.
Put another way, knowing SQL will open the door to analytics jobs in many organizations.
I cannot stress to you enough how important knowing how to work with large databases is. Health data comes from different places and are of different format. They are often full of duplicate lines, errors. Knowing SQL will enable you to understand a lot of the data nuances and work with these tables effectively and efficiently.
For those uninitiated, SQL may seem complex. But I have found SQL to be one of the easiest programming languages to learn.
What is it?
SQL (Structured Query Language, called SEQUEL or S Q L) is a database management programming language. It is the backbone of databases (relational database). It enables you to view these large data tables, create new measures, combine two or more tables, filter to subsets of the data and summarize the data as well as tons of other commands.
There are different variants of SQL with slightly different syntax, e.g. MySQL, MS SQL, PL/SQL, PostgreSQL. These syntax variations though important to realize, are largely cosmetic in that the structure and utility do not differ.
In this illustration, I’m using PostgreSQL. I’m using Datagrip, a software interface that makes the formatting more user friendly (color schemes, indentations, auto checks etc). There are a lot of these around, like ATOM.
How it works?
Let’s say I want to summarize information in a data table called DX_COST_CLAIMS that looks like this.
ID | transaction_date | diagnosis | cost |
AA | 1/26/2017 | HEART_FAILURE | $770 |
AA | 1/27/2017 | DIABETES2 | $110 |
AA | 1/28/2017 | HEART_FAILURE | $90 |
AA | 1/28/2017 | HEART_FAILURE | $370 |
AA | 1/30/2017 | ALLERGY | $610 |
Let’s look at the 4 part example SQL query:
Part 1. Measure specification
This part specifies the measures you want the query to generate.
In words: I am
- selecting patient ID, all the diagnosis codes in the data per patient
- counting the number of days each patient claimed a specific diagnosis, a measure I labelled “dates”
- adding up the costs for each patient for each diagnosis, a measure I labelled “total_costs”
- pulling all this from a data table called DX_COST_CLAIMS
You can almost describe what the query is doing by simply reading out. Simple right!
To clarify:
- Count, sum are 2 of the many aggregation commands SQL uses.
- From the data table you can see that dates can have multiple lines of claims for the same diagnosis code, (see lines 3 and 4) so I added the “distinct” command to the date count so that I don’t count those dates multiple times.
Part 2. Filters
This part filters the data extracted from the data table to your defined criteria.
In words: I am aggregating
- Only transactions before January 1, 2018
- For diagnoses HEART_FAILURE or any other diagnoses that have the leading letters HEART_
To clarify:
- There are lots of functions like to_date in SQL that allow you to do predefined manipulation of the data. Here are some. To_date converts a text to date.
- The % in the HEART_% part is a wild character and tells SQL to match any entry with those leading characters.
- The computation ordering of AND / OR work the same way as multiplication and addition, so put braces accordingly
Part 3. Aggregation
This part conforms to Part 1, where you asked for dates and costs to be aggregated over patient IDs and diagnoses.
In words: I am aggregating over patient ID and diagnoses
To clarify:
- At this point, there will not be duplicates of patient ID and diagnosis combination.
- You would need to have all aggregation fields from Part 1 here, else the query will kick out an error.
Part 4. Display
This part tells the query how to display the result
In words, I want the results to be order by patient ID and diagnosis shown in descending order of dates claimed, with the highest count of the dates first.
To clarify: I want to see this because for each patient, I want to know the most frequently claimed diagnosis code. The total_costs column will not be part of the sorting.
Let’s look at this again.
- I’m aggregating number of claim dates and total costs for each patient for each of their diagnosis codes.
- I’m doing this for claims that took place before 2018, for heart failure (or heart related) claims.
- I want the result for each patient to be sorted in descending order, by the number of dates each diagnosis was claimed.
From my over a decade experience doing health data analytics, the 4 basic lines I just took you through will enable you to do a lot in databases. Obviously you will require more practice to get comfortable with these SQL commands. But armed with these you should feel confident to explore in the databases at your work, or at least start asking “educated” questions.
SQL is a fast, reliable work horse that I turn to when I need to explore the raw data, merge, summarize and extract from huge databases. I cannot stress to you enough how important knowing SQL is for your health analytics career. It is hugely empowering and allows you to really get to know the nuances of data.
There are a lot of other features of SQL that will be useful in your analytics work. I’ll be writing future posts on SQL and it’s many other useful features and use cases. Please subscribe to my mailing list so you don’t miss out.
3 thoughts on “An intro to SQL – your key to databases”