Advances in computing power and in machine learning techniques are rapidly changing how humans utilize data. Aside from the core statistical issues, what are the questions that an analyst needs to consider when doing predictive modeling in practice?
This post describes a few of these practical considerations.
I like to use the 5 Ws to remind myself to ask these questions when starting an analysis: Who / Why / What / When / hoW
To me, knowing your client is the most important consideration. Orienting the analysis as a service to a specific client, rather than as a purely analytic exercise, will ensure you engage the entire process with the end user in mind.
Sit down with the client and have a frank conversation. Asking difficult / sensitive questions upfront will save both of you lots of time down the line.
- For example, if your client is a medical doctor who has little stats knowledge, you want to scale back the complexity of the analysis so that the doctor understands and appreciates your work.
- If your client is the Chief Informatics Officer, you had better beef up your analysis with the latest techniques, executed correctly.
If your client is financially focused, do your analysis with more financial metrics.
If they are busy, keep your meetings and reports short.
If your client is a visual thinker, then use more graphs; if they are used to lengthy research reports, then use an academic report format.
Get to know why your client is facing the challenges, and what tools they have to address these challenges. Your analysis will be more actionable if you find insights that the client can implement, subject to their daily routines and tools available to them.
Bottom line, put yourself in their shoes and do the work with them in mind.
Your coworkers are the next most important consideration. Are they more or less experienced than you are? Can you lean on their experience/expertise? Are they team players or passive-aggressive competitors?
What are their strengths and weaknesses? Are they detail oriented? Are they fast or slow? Get to know these traits so that you can have a productive collaboration with the least amount of frustration.
Ask why the analysis is needed and why your clients believe they are facing these challenges. Their own interpretation can be revealing. E.g. they may blame external forces as the driver rather than looking more inwardly for issues. Either way, knowing their mental framework will help you direct your work.
What are the incentives that drive the factors underlying the analysis? For example, if doctors are financially incentivized to overprescribe a medication, you should include in your analysis how various drug companies reimburse doctors. If you don’t explicitly take these incentives into account, your analysis will likely fall flat, with no real world impact.
Ask what your client gains if the challenge they face is addressed. Their motivation to address this challenge will indicate how much support they will give you, how much scrutiny you will likely be under, and how fast they will likely need you to get the analysis done.
Ask what has been done in the past to address the challenges, and why those attempts succeeded or failed. Rediscovering solutions that have already failed will not be helpful to your client.
Now onto a few more technical considerations.
Understand how you gain access to the data. Difficulties here will make your life miserable and your analysis slow to complete.
Understand what data you have to work with: where it comes from, over how many years, from how many data warehouses. Understand how good and detailed the data is. This will inform what analysis you can do, how much data scrubbing you need to do, and how solid your findings might be.
As already mentioned, know what interventions your client can carry out and what outcomes those interventions hope to achieve. When you build your models, be sure you can concretely create feature sets that mimic these interventions (so you can identify points of intervention), and that you can model the outcomes using the data available.
E.g. let’s say your client wants to find patients most at risk of opioid overdoses to put on prevention programs, to reduce their ED visits. To do this, you will need drug and diagnosis data, as well as ED visit data, so you can identify features (using drug and diagnosis data) that predict overdose-related ED visits. You will probably ask for data on many patients, as ED visit rates tend to be low, offering a weak statistical signal (imbalanced data, in ML parlance).
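One common way to handle that imbalance is to rebalance the training set before modeling. Below is a minimal pure-Python sketch of random undersampling of the majority (non-overdose) class; the record layout and the `overdose_ed_visit` field name are hypothetical, and in practice you might prefer class weights or oversampling instead.

```python
import random

def undersample(records, label_key="overdose_ed_visit", seed=42):
    """Balance a rare-outcome dataset by randomly undersampling the
    majority (negative) class down to the size of the minority class.
    The field name is a hypothetical placeholder."""
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    random.seed(seed)  # fixed seed so the sample is reproducible
    sampled_negatives = random.sample(negatives, len(positives))
    balanced = positives + sampled_negatives
    random.shuffle(balanced)
    return balanced

# Toy example: 2 positive cases among 10 patients (low ED visit rate).
patients = [{"id": i, "overdose_ed_visit": 1 if i < 2 else 0} for i in range(10)]
balanced = undersample(patients)
print(len(balanced))  # 4 records: 2 positives + 2 sampled negatives
```

Undersampling throws data away, which is why it is only one of several options; the point is simply that the imbalance has to be dealt with explicitly.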
When is the client expecting the analysis to be done? How long do you think the project will take, based on the goal of the analysis and the data you have to work with? Sufficient thought put into time considerations ensures the project is done on time. Add ample buffers – always aim to under promise and over deliver.
Identify dependencies early on so you can manage risks proactively. E.g. if you depend on the IT team to get you the data, then you cannot start your work until they deliver it.
Will findings from your predictive model be deployed daily, weekly, monthly or live? Do these have to align with your client’s existing workflow?
How often will you update your analysis? To what extent will your predictive model self-calibrate (machine learning…)? And how do you design trigger points for human intervention, to stop the machines running amok…
Ask how your client expects your analysis to help them face their challenges. Make sure you can deliver what they expect based on the data and time available.
Given the goal and the data available for the analysis, consider how you should work with the data and what tools/software to use. I always err on the side of simplicity and speed, but sometimes you do need to pull out the big guns and get the technical execution perfect. E.g. if you want to publish your findings in a well respected journal, or the analysis informs a very important decision with lots of $ at stake.
The majority of the project time (>65% for me typically) will be spent on data scrubbing and feature engineering. Budget ample time for that, including a few iterations on feature engineering, as you will discover additional features as you build the models.
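As a concrete illustration of the feature-engineering step, here is a minimal sketch that rolls raw claims-style rows up into one feature vector per patient. All field names (`patient_id`, `drug_class`, `ed_visit`) are hypothetical placeholders for whatever your data warehouse actually provides.

```python
from collections import defaultdict

def build_features(rows):
    """Aggregate raw per-event rows into one feature dict per patient.
    Field names are hypothetical placeholders."""
    features = defaultdict(lambda: {"n_claims": 0, "n_opioid_rx": 0, "any_ed_visit": 0})
    for row in rows:
        f = features[row["patient_id"]]
        f["n_claims"] += 1                      # total event count
        if row.get("drug_class") == "opioid":
            f["n_opioid_rx"] += 1               # opioid prescription count
        if row.get("ed_visit"):
            f["any_ed_visit"] = 1               # binary outcome flag
    return dict(features)

rows = [
    {"patient_id": "A", "drug_class": "opioid"},
    {"patient_id": "A", "drug_class": "statin"},
    {"patient_id": "A", "ed_visit": True},
    {"patient_id": "B", "drug_class": "opioid"},
]
feats = build_features(rows)
print(feats["A"])  # {'n_claims': 3, 'n_opioid_rx': 1, 'any_ed_visit': 1}
```

Each iteration of this step typically adds or refines aggregations like these as the models reveal which signals matter.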
Consider what statistical model to use, given your data set size and characteristics.
- Some statistical models fit specific data types, e.g. logistic regression suits data with binary outcomes.
- How much user input is needed. Some models require a lot of hand tuning and judgment calls.
- How long each approach will take, and how computationally intensive each approach is. E.g. neural nets and random forests tend to perform well but usually take longer to build on larger data sets. Decision trees are far quicker to build but have their own weaknesses.
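To make the logistic-regression bullet concrete, here is a toy one-feature logistic regression fit by batch gradient descent in pure Python. This is a teaching sketch only, not how you would fit a real model; the learning rate and epoch count are arbitrary choices, and in practice you would use a statistics package.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a one-feature logistic regression by batch gradient descent.
    Teaching sketch only -- real analyses use a stats package."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Toy binary outcome: larger x -> outcome 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predict = lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))
print(predict(0.5) < 0.5, predict(4.5) > 0.5)  # True True
```

The point of the bullet stands: because the model outputs a probability between 0 and 1, it naturally suits binary outcomes like "overdose-related ED visit or not."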
How easy will it be to deploy your findings? E.g. giving a list of patients most at risk of opioid overdose is a lot easier than having to integrate your algorithm into the electronic medical record.
For an analysis to grow, mature and increase in value to the client, you need to consider how to get feedback and incorporate that feedback into the models. Such feedback could be qualitative, e.g. surveys from end users, or quantitative, e.g. continuously checking your algorithm for prediction errors.
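The quantitative feedback loop can be as simple as a rolling error-rate check that flags the model for human review, which doubles as the kind of trigger point for human intervention mentioned earlier. A minimal sketch, where the window size and threshold are illustrative assumptions:

```python
from collections import deque

class ErrorMonitor:
    """Track recent prediction errors and flag when the rolling error
    rate drifts past a threshold -- a simple trigger for human review.
    Window size and threshold are illustrative choices."""
    def __init__(self, window=100, threshold=0.2):
        self.errors = deque(maxlen=window)  # keeps only the last `window` results
        self.threshold = threshold

    def record(self, predicted, actual):
        self.errors.append(0 if predicted == actual else 1)

    def needs_review(self):
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.threshold

monitor = ErrorMonitor(window=10, threshold=0.2)
for predicted, actual in [(1, 1)] * 7 + [(1, 0)] * 3:  # 30% recent error rate
    monitor.record(predicted, actual)
print(monitor.needs_review())  # True
```

In production you would feed `record` with each prediction once the true outcome becomes known, and route `needs_review` alerts to whoever owns the model.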
How transferable will the analysis results be? If you expect that you will need to extend the analysis to other settings, code the data ingest and the predictive model building processes to be more flexible.
Exactly how to make your models adaptive and transferable will require more explanation. Suffice it to say for now that you will need to spend more time thinking about the data structure and model parameter calibration, at the least.
Lastly, I’m not saying that you need to bend over backwards for your client. You should stick to what is technically and morally correct. Don’t cook the numbers to make your client happy. That said, within those boundaries, there is a lot of room for personal creativity to fit the situation. Hope the thoughts in this post help.