When building a machine learning classifier, especially for a customer, it’s important to think like a product manager. Your goal isn’t always to produce the most technically accurate classifier with a high F1-score or AUC—your goal should reflect whatever your customer’s goal is.
Measuring Success
The aim of almost every machine learning classification algorithm is to correctly identify the class an observation falls in. Medical diagnostics, for example, aim not only to correctly identify that sick people are sick, but also to confirm that healthy people are healthy. The equivalent can be said of almost any field: spam detection in emails, facial recognition, and, in the case of Apixio’s HCC Identifier product, Hierarchical Condition Category (HCC) identification in a patient’s medical charts.
There are many different ways to measure the performance of a classifier, such as:
- Precision (positive predictive value)
- Recall (sensitivity, true positive rate)
- Specificity (true negative rate)
- False Positive Rate (fallout)
- Accuracy
- F1 score
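All of these metrics can be computed from the four cells of a binary confusion matrix. As a minimal sketch (the counts below are made up purely for illustration, not real results):

```python
# Minimal sketch: computing the metrics above from a binary confusion matrix.
# All counts are made up purely for illustration.
tp, fp, tn, fn = 80, 20, 150, 10  # true/false positives, true/false negatives

precision   = tp / (tp + fp)                   # positive predictive value
recall      = tp / (tp + fn)                   # sensitivity, true positive rate
specificity = tn / (tn + fp)                   # true negative rate
fallout     = fp / (fp + tn)                   # false positive rate
npv         = tn / (tn + fn)                   # negative predictive value (discussed below)
accuracy    = (tp + tn) / (tp + fp + tn + fn)
f1          = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}  recall={recall:.2f}  "
      f"specificity={specificity:.2f}  F1={f1:.2f}")
```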
The measures you choose depend on the goals you’re trying to achieve, because it is often impossible to get excellent scores on all of them at the same time. For example, I once worked on a project where we were developing a new medical diagnostic to keep healthy people out of a long, expensive process that determined if they had a heart condition, and if so, which one. Although it would have been nice to achieve high precision (i.e. almost everyone we say is sick is actually sick), we were more interested in having high negative predictive value (i.e. almost everyone we say is healthy is actually healthy) and high specificity (i.e. almost everyone who is healthy is screened out as being healthy).
In this case, the objective wasn’t to determine if someone had a heart condition. The objective was to definitively label healthy people as healthy, for as many healthy people as possible. This was quantifiable by negative predictive value and specificity, so we optimized those values.
These definitions can be cumbersome for customers who aren’t interested in data science; in this case, hospital administrators wanted to understand the financial impact of our classifiers. So, what do negative predictive value and specificity mean for a hospital’s finances? Let’s say the long, expensive process costs $10,000 per patient (a made-up number). If the classifier has 90% specificity, then 90% of the patients who don’t need the expensive process are screened out of it. If the hospital currently puts 1,000 healthy patients through the process each year, our product would save it 900 patients/year x $10,000/patient = $9,000,000 each year. Even at a price of $2,000,000 per year, the product is a no-brainer.
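The same back-of-the-envelope math, as a small Python sketch using the made-up numbers above:

```python
# Back-of-the-envelope savings estimate, using the made-up numbers above.
specificity      = 0.90       # fraction of healthy patients screened out
healthy_per_year = 1_000      # healthy patients currently entering the process
cost_per_patient = 10_000     # dollars per run of the expensive process
product_cost     = 2_000_000  # dollars per year for the classifier product

gross_savings = specificity * healthy_per_year * cost_per_patient
net_savings   = gross_savings - product_cost
print(f"gross: ${gross_savings:,.0f}/year, net: ${net_savings:,.0f}/year")
# gross: $9,000,000/year, net: $7,000,000/year
```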
This example illustrates why thinking about classifier performance in terms of real-world outcomes for your customer is extremely important. It helps you price your product, demonstrate value to the customer, and tune your classifiers to be the best at what your customer actually needs.
What Matters to Apixio’s Customers?
Apixio’s HCC Identifier solution mines patient medical charts and identifies evidence of potential HCC codes, which are key for private insurers to get reimbursed for the care they provide to Medicare Advantage plan members. These codes are also used for plans offered under the Affordable Care Act to determine whether there is adverse selection bias in a population’s health risk. Our HCC Identifier product finds HCCs that are documented well enough to meet the reimbursement standards of the Centers for Medicare and Medicaid Services (CMS). Once we identify this set of codes, the customer reviews and edits the list before submitting it to the government.
We tune the classifiers behind the HCC Identifier product to optimize precision and recall, but that tuning depends on two very important “hyperparameters” for each customer: their data and their goals.
Tuning to a Customer’s Data
The data we receive from customers consists of medical charts for members covered by their plans. These come to us either as Electronic Health Record (EHR) documents (e.g. from Cerner or Epic) or as PDFs.
The way that the data is recorded in the documents can vary dramatically between organizations. Clinicians can write in paragraphs, bullet points, or forms, in shorthand or in full sentences. Organizations also have their own unique requirements for how charts are written, which can lead to tangible differences in what kinds of evidence support the presence of an HCC. As a result of these differences and many more, a classifier that performs well on one customer’s data may not be optimized for another customer’s data. It’s important for us to tune a classifier for each customer in order to give them the best results possible.
Tuning to a Customer’s Goals
Hierarchical Condition Category (HCC) codes represent health conditions, such as diabetes or liver disease. Insurance companies that offer Medicare Advantage plans report these codes to the government, along with demographics about their patients, to represent the risk associated with their patient population. The Centers for Medicare and Medicaid Services (CMS) then use these risk scores to make payments to the insurance companies according to the risk they carry in the patient population they serve.
At Apixio, our HCC Identifier product identifies potential evidence for an HCC in a patient’s medical charts. If that evidence is confirmed by human review, that HCC can then be submitted as part of the risk adjustment process. We assess the efficacy of the core algorithms of the HCC ID product for each customer according to the metrics that most represent quantities that matter to that customer. The metrics that we most often use are precision and recall.
Think of recall as the percentage of valuable codes that are found: if our classifier has 90% recall (to use a concrete number), then we capture 90% of all the codes the customer could possibly submit, leaving 10% undiscovered. Think of precision as the percentage of the customer’s review work that is valuable: if our classifier has 80% precision, then we expect 80% of the HCCs we provide to the customer to be confirmed by human inspection. Usually, high recall comes at the expense of lower precision, and vice versa.
Some customers want to emphasize the recovery of HCCs that have been overlooked in an initial review of a set of charts. Tuning our classifiers for higher recall results in the discovery of more HCCs, but it also decreases precision, the fraction of the HCCs we surface that will pass human inspection. Conversely, customers may be constrained by impending deadlines, and thus have limited “coding hours” (i.e. person-hours to review our algorithms’ output). In this case, we can tune our classifiers for higher precision, making each coding hour more valuable because there will be a higher agreement rate between humans and our algorithms, but this also limits recall, the fraction of codes in the charts that are identified and presented.
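In practice, much of this tuning comes down to where you set a decision threshold on a classifier’s scores. The sketch below is purely hypothetical (the scores, labels, and thresholds are made up, not Apixio’s actual model output); it shows how moving a single threshold trades recall against precision:

```python
import numpy as np

# Hypothetical example: the same scored predictions can be tuned toward recall
# or toward precision simply by moving the decision threshold.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)                          # 1 = HCC truly supported
scores = np.clip(labels * 0.35 + rng.normal(0.4, 0.2, 1000), 0, 1)  # fake model scores

def precision_recall(threshold):
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):   # lower threshold -> more recall, less precision
    p, r = precision_recall(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```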
In the same way that HCC Identifier emphasizes identifying the presence of HCCs, our HCC Auditor product looks to confirm that an HCC is not present. If our algorithms don’t find any strong evidence for an HCC, this indicates that the HCC should not be submitted to CMS based on the documentation that we were given.
How to Generalize This Concept
Whatever project you’re working on (for school, work, Kaggle competitions, or beyond), think about what your goals are. If your goals aren’t purely technical, but are instead focused on outcomes like saving money or diagnosing disease, make sure you set the constraints of your classifiers in a way that helps you achieve those outcomes. In the heart disease project I mentioned, if we had achieved high sensitivity but low specificity, no hospital would have considered our product because it wouldn’t have saved them any money. Likewise, at Apixio, if we optimized our classifiers for precision instead of recall for a client whose priority is recovering as many overlooked HCCs as possible, we would fall short of their goals.
What this usually looks like is maximizing one value while constraining another. For example, you could say you want the highest precision possible, but the minimum acceptable recall is 80%. In this scenario, if your recall is 82%, you have a little room to trade: if you can increase precision by 7 points by reducing recall to 80%, that’s worth it, but increasing precision by 10 points while letting recall fall to 76% is not, because it violates the constraint. Setting these goals (maximize X) and constraints (while keeping Y above Z%) is an important part of requirements gathering based on your customers’ problems and objectives.
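One simple way to operationalize “maximize X while keeping Y above Z%” is to sweep candidate decision thresholds on a labeled validation set and keep the best one that satisfies the constraint. The sketch below assumes your classifier produces scores and you have labeled data; the function name and inputs are hypothetical:

```python
import numpy as np

def pick_threshold(scores, labels, min_recall=0.80):
    """Return (threshold, precision, recall) with the best precision among
    thresholds whose recall stays at or above min_recall, or None if the
    recall floor is unreachable."""
    best = None
    for t in np.linspace(0.0, 1.0, 101):
        predicted = scores >= t
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        if tp == 0:
            continue  # no true positives at this threshold; skip it
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (t, precision, recall)
    return best
```

In a real project you would run this sweep on a held-out validation set, freeze the chosen threshold, and only then evaluate it on fresh data before deployment.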