Feature engineering is an important step in analytics; some would say it is THE most important step. I typically spend 50-70% of the time in an analytic exercise on feature engineering.
What is it:
In the predictive modeling context, features are the individual elements of the data. E.g. a patient's age, and the diagnoses and procedures a patient has had, are all features of the data.
Feature engineering is the process of creating new features based on the raw data.
This process typically requires domain expertise to identify the aspects of the data that are most relevant to the analysis. E.g. working with oncologists to create profiles of metastatic cancer based on line-level claims and EHR data.
A fair amount of feature engineering will also be quantitatively driven, where you iteratively create and refine features based on predictive modeling output. E.g. splitting age into finer age groups after seeing age as a whole is a highly predictive feature in a generalized linear model.
Here are a few reasons for feature engineering:
- Health data can contain many errors and structural inconsistencies. You will do best to spend time up front cleaning the data and getting a good feel for what you're working with.
- Raw data can contain components that are numerous and highly specific. Furthermore, the accuracy of coding at such a specific level may be questionable.
- E.g. there are 70,000+ ICD-10 diagnosis codes: while E11 indicates type 2 diabetes, E11.621 indicates type 2 diabetes with foot ulcer, which is very specific. E11 contains just over 100 codes, and there is only one E11.621 code. So at the fully specific level, each ICD-10 code has fewer data points, which reduces predictive power dramatically.
- Adding labels that aggregate individual code components reduces such dispersion and improves the predictive power. E.g. in most analytic exercises, knowing whether a patient had type 2 diabetes is sufficient, thus using E11 is preferable to using E11.621.
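As a sketch, rolling an ICD-10 code up to its 3-character category is a one-line transform. The helper name here is my own, not a standard function, and I10 is added purely for illustration:

```python
def icd10_category(code: str) -> str:
    """Return the 3-character ICD-10 category, e.g. 'E11.621' -> 'E11'."""
    return code.replace(".", "")[:3]

# E11.x codes are the type 2 diabetes examples from the text above.
diagnoses = ["E11.621", "E11.9", "I10"]
print([icd10_category(c) for c in diagnoses])  # ['E11', 'E11', 'I10']
```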
Predictive model training:
- Features that aggregate over individual data elements remove noise, allowing predictive models to identify stronger effect sizes, train faster, and achieve better goodness of fit.
Add insight layers:
- Sometimes, codes in their raw form may be insufficient, so adding extra intelligence layers based on domain expertise may be useful.
- E.g. some NDC codes indicate combinations of ingredients: Exforge 10mg-320mg (NDC 00078049115) contains two active ingredients, amlodipine and valsartan, and the NDC alone would not identify both.
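One way to add such a layer is a simple NDC-to-ingredient lookup. The table below is a toy stand-in holding only the Exforge example from the text; in practice you would populate it from a drug reference database rather than hard-coding it:

```python
# Toy NDC -> active ingredient lookup (illustrative, not a real reference).
NDC_INGREDIENTS = {
    "00078049115": ["amlodipine", "valsartan"],  # Exforge 10mg-320mg
}

def active_ingredients(ndc: str) -> list[str]:
    """Return the active ingredients for an NDC, or [] if unknown."""
    return NDC_INGREDIENTS.get(ndc, [])

print(active_ingredients("00078049115"))  # ['amlodipine', 'valsartan']
```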
Actionable insight detection:
- When doing analyses, you typically have an idea in mind that you want to test, or at least some notion of where to find issues. If you can design features that mimic your hypotheses, you can then use analyses and predictive models to test whether those hypotheses are correct.
- E.g. if you know a priori what some patients experience preceding onset of opioid abuse, you may be able to build these features from data and identify optimal intervention points to prevent the onset of opioid abuse.
How to do it:
There are numerous types of data transforms that can be used in feature engineering. Here are a few examples:
- Temporal measures allow detection of changes in events over time, which are highly informative. E.g. whether someone had a stroke in the past year is highly informative of their risk of further strokes as well as their need for additional recuperative care in the community.
- As discussed above, adding aggregation layers reduces noise, while adding specific insight layers allows more targeted analyses. Most large insurers and EHR vendors have some sort of code categorization that you will come across and can use, but there are many situations in which you need to create your own intelligence layer for a specific use case.
- Setting boundaries can help you identify abnormal occurrences, e.g. a laboratory value being too high or blood pressure being too low for a given patient. Conversely, you can flag whether a value is within the normal range. In medicine, these normal/abnormal thresholds are often specified in clinical guidelines.
- You can thus convert a continuous set of data into binary flags indicating whether someone was above or below a given threshold, which is a lot more informative for the analysis than looking at all raw lab or blood pressure values.
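The temporal and threshold transforms above can be sketched in a few lines. This is a minimal illustration assuming a simple list-of-dicts claims layout; the field names and the 90/140 mmHg systolic cutoffs are placeholders, not guideline values:

```python
from datetime import date, timedelta

def had_event_within(claims, code_prefix, as_of, days=365):
    """Temporal flag: any claim with the code prefix in the lookback window."""
    window_start = as_of - timedelta(days=days)
    return any(
        c["code"].startswith(code_prefix)
        and window_start <= c["event_date"] <= as_of
        for c in claims
    )

def bp_flags(systolic, low=90, high=140):
    """Threshold flags: binarize a continuous blood pressure reading."""
    return {
        "bp_low": systolic < low,
        "bp_high": systolic > high,
        "bp_normal": low <= systolic <= high,
    }

claims = [{"code": "I63.9", "event_date": date(2024, 3, 1)}]  # I63 = cerebral infarction
print(had_event_within(claims, "I63", as_of=date(2024, 9, 1)))  # True
print(bp_flags(155))  # {'bp_low': False, 'bp_high': True, 'bp_normal': False}
```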
Dependence on predictive models:
- Different predictive models require different types of input. Logistic regressions perform well with binary features and outcomes; random forests tend to do better with continuous variables; k-means clustering works on numeric features, so nominal variables need to be encoded first (or handled with a variant such as k-modes). So know what types of models might work best for your analysis, and then build the types of features that best feed those models.
Ultimately, feature engineering is an iterative exercise: you build an initial set of features, do some analyses, learn which features are more useful, build additional subsets of features, and so on. E.g. you may find age is a useful variable, so you split age into >=65 and <65. If those above 65 turn out to be important, you might then split the >=65 group into 5-year age bands.
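The age refinement just described can be sketched as a banding function; the cut points are simply those from the example, not a recommendation:

```python
def age_band(age: int) -> str:
    """Coarse <65 bucket, then 5-year bands for 65 and over."""
    if age < 65:
        return "<65"
    lower = (age // 5) * 5          # e.g. 67 -> 65, 72 -> 70
    return f"{lower}-{lower + 4}"   # bands: 65-69, 70-74, ...

print([age_band(a) for a in (40, 67, 72)])  # ['<65', '65-69', '70-74']
```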
Thanks for reading! Please subscribe.