AI Explainability & Fairness (Consumer Credit) — Insights from Stanford & FinRegLab
An insightful piece of research from FinRegLab and Stanford. The white paper and the workshop panel talks are available here.
The white paper is ~120 pages, my notes are ~20 pages, and this blog is probably 4–5 pages. So it is very condensed — I recommend reading the full paper — but, of course, after reading my blog !
For the attention-challenged, I have annotated Best Practices like so, so that you can scan and pick up the 13 Best Practices while ignoring the rest!
Best Practice #0 : Scan & find all the 13 Best Practices in this blog !! Who knows, maybe there are more … or less …
The white paper is titled “Machine Learning Explainability & Fairness: Insights from Consumer Lending” (April 2022).
Very detailed work, good insights, and they have plans to cover a lot more. It is a solid beginning — congratulations to the researchers on their fine work. It is an empirical white paper, so it presents observational and experimental results rather than theory.
I do have a few quibbles about the finer methodologies, what is in and what is out, as well as the conclusions. But they are very minor compared to the value and depth of the work.
I have some thoughts in “Section 8. Future Suggestions”, below.
As a detour, during the same time I was reading Brandon Sanderson’s The Way of Kings Prime and The Way of Kings. Each book is ~1,000 pages long ! I finished the books and this white paper (119 pages, but it warrants more attention) almost at the same time — the context switch was interesting, to say the least ! BTW, if you are new to the Stormlight Archive series, read Prime first — you’ll enjoy the books a lot better. “Life before death, strength before weakness, journey before destination” !
In this blog I won’t go into the methodologies but will cover the broader insights. Beware — at first glance, some of the observations might seem counter-intuitive and surprising — but on deeper thought they will make more sense.
I thought of multiple approaches to summarizing the work — a linear walk-through or a best-practices focus seemed the best options. While a narrative focused on best practices would be better for a short blog, it needs the backdrop — so I am going for a ground-up narrative with annotated best practices. Please bear with me …
1. Empirical Structure
The work and the paper are empirical in the sense that the researchers selected already-available datasets and a set of algorithms. They then approached vendors in this space and worked with them to evaluate their offerings and understand their pragmatic utility, i.e., the usefulness of the information generated by the tools for day-to-day model pipeline activities.
2. Background
In the context of consumer lending, model explainability serves to further widely shared goals regarding anti-discrimination, consumer empowerment, and responsible risk taking.
- For lenders: model explainability is a key instrument to help them evaluate whether a model can be responsibly used in an intended application, to enable the day in, day out work of managing relevant prudential and consumer protection risks, and to document efforts to comply with law and regulation.
- For consumers: model explainability helps ensure that they receive basic information about how certain kinds of adverse credit decisions are made and enable effective recourse.
- For regulators and policy makers: model explainability is an instrument to enable oversight and detect shortcomings in adherence to laws and regulations.
3. Focus & Approach
In this working paper, they focused on two consumer protection regulations that require lenders to:
- Adverse Action Notice (AAN) — Provide loan applicants who are denied credit or charged higher prices with the principal reasons for those decisions. To produce adverse action notices, lenders must be able to identify drivers of the model’s prediction for individual applicants who are subject to adverse decisions and map those drivers to descriptions or reason codes that will be given to the consumer.
- Fair Lending, Disparate Impact and Least Discriminatory Alternative (FL, DI & LDA) — Investigate whether the underwriting models have disproportionately adverse effects based on protected characteristics, and if so to search for alternative models.
4. Viewpoints
The researchers evaluated the tools from 3 perspectives:
4.1 Fidelity
- For AAN, it is the ability to reliably identify features that can help describe how models make adverse credit decisions.
- For DI, it is the ability to reliably identify features that are in fact related to a model’s adverse impact.
4.2 Consistency
- Whether the drivers identified by the same tool across different models, or by different tools on the same model, vary — i.e., consistency across diagnostic tools and consistency across models.
4.3 Usability
- For AAN, it is the ability of a model diagnostic tool to provide actionable information that helps an applicant subject to an adverse credit decision satisfy the criteria for approval within one year.
- For DI, it is the ability to identify information that enables lenders to comply with the goals and purposes of consumer protection regulation.
Best Practice #1 : When evaluating tools or systems for fairness assessment, compare them from three dimensions viz. fidelity, consistency and usability.
5. Models
They used a portfolio of models — from a simple logistic regression with ~45 features to GLM, XGBoost and deep learning models with ~650 features.
Best Practice #2 : Don’t assess bias and fairness with a single model. Try a spectrum of models — from the very simple to the most esoteric, from a few variables to the largest set of model features.
This gives a deep insight into the domain, data, algorithms, and the model variables. Also, it makes it easier to have effective conversations with stakeholders of different levels of technical inclination.
Moreover, with that level of insight, you can fine tune the models — and make decisions on actions to increase the metrics the business wants while reducing the unwanted side effects.
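As an aside, here is a minimal sketch of what this spectrum-of-models comparison can look like in Python. The synthetic data and the feature counts below are placeholders that merely mirror the paper’s 45-feature vs. ~650-feature setups; the paper’s own pipeline is far richer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for bureau data; the "simple" model only sees the first 45 columns.
X, y = make_classification(n_samples=2000, n_features=650, n_informative=40, random_state=0)

models = {
    "logit_45_features": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)), X[:, :45]),
    "gbm_650_features": (GradientBoostingClassifier(random_state=0), X),
}

for name, (model, features) in models.items():
    # Every model in the spectrum is scored on the same task with the same metric
    auc = cross_val_score(model, features, y, scoring="roc_auc", cv=3).mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

The same loop extends to GLMs, XGBoost and neural networks; the key point is that every model in the spectrum is evaluated on identical data and splits, so fairness and performance comparisons stay apples-to-apples.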
6. Data
They procured credit bureau data for a representative sample of 50 million individuals from across the US, covering the period 2009–2017, and used non-judging features, i.e., no credit score, geography, or income estimates. The data was appropriately masked for research.
Best Practice #3 : To increase model fidelity and handle “not-well-formed” data, create missing-value indicators and outlier indicators for numeric features, along with transformations to account for skewness.
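A minimal pandas sketch of this practice, assuming a numeric feature frame df (the z-score outlier rule and the skew cutoff are illustrative choices, not from the paper):

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame, outlier_z: float = 3.0, skew_cutoff: float = 1.0) -> pd.DataFrame:
    """Add missing-value flags, outlier flags, and a log transform for skewed numeric columns."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        out[f"{col}_missing"] = df[col].isna().astype(int)                 # missing-value indicator
        z = (df[col] - df[col].mean()) / df[col].std()
        out[f"{col}_outlier"] = (z.abs() > outlier_z).astype(int)          # outlier indicator
        if df[col].skew() > skew_cutoff:                                   # tame right-skewed features
            out[f"{col}_log1p"] = np.log1p(df[col].clip(lower=0))
    return out
```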
Best Practice #4 : For credit underwriting, one-hot encoding of categorical data is usually sufficient, except for special categories of models like LightGBM, where integer/ordinal encoding works well.
But for features with a huge number of categories (say > 10), an n-gram subdomain split or hash encoding might be a better choice.
Embeddings are a choice for very large category spaces, like products — for example, Instacart converts their 10 million products into a 10-dimensional embedding [Here]
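A quick illustration of these encoding choices (the columns loan_purpose and merchant_code are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({
    "loan_purpose": ["auto", "card", "mortgage", "card"],
    "merchant_code": ["M10293", "M88412", "M10293", "M55021"],  # imagine thousands of distinct codes
})

# Low-cardinality categorical: one-hot encoding
one_hot = pd.get_dummies(df["loan_purpose"], prefix="purpose")

# Tree models such as LightGBM can work directly with integer/ordinal codes
ordinal = df["loan_purpose"].astype("category").cat.codes

# High-cardinality categorical: hash into a fixed-width sparse representation
hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[code] for code in df["merchant_code"]])
```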
An interesting technique employed by the researchers is the use of over-sampled data which they call “deployment data”. I think the term could be renamed to make it clearer.
To understand how the fairness properties of the models generalize to a context with a different composition of applicants, they built a second data set (the ‘deployment’ data) that over-sampled credit card applicants from geographies that have a higher proportion of minority applicants. You can extend this to any over-sampling technique to compare the DI of different schemes !
Best Practice #5 : Create a synthetic/over-sampled dataset with different distributions of the protected classes (i.e. a data set that was purposely designed to represent a different composition of applicants) to evaluate between different bias mitigation schemes. It will be very informative to simulate model behavior across hypothetical distributions of applicants and compare the model behaviors between different schemes.
But, be careful not to generalize the results — the results are valid only in the context of comparisons and should not be published as a metric to characterize and generalize the model performance.
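A minimal sketch of building such a “deployment”-style evaluation set by over-sampling, assuming a hypothetical flag column high_minority_geo that marks applicants from high-minority-share geographies:

```python
import pandas as pd

def build_deployment_sample(df: pd.DataFrame, boost: float = 3.0, seed: int = 0) -> pd.DataFrame:
    """Resample the evaluation set so that rows from high-minority-share
    geographies are `boost` times more likely to be drawn."""
    weights = df["high_minority_geo"].map({1: boost, 0: 1.0})
    return df.sample(n=len(df), replace=True, weights=weights, random_state=seed)

# Compare the same model's disparity metrics on df vs. build_deployment_sample(df);
# as cautioned above, the comparison is informative, the absolute numbers are not.
```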
7. Explorations & Results
7.1. Overview :: Cautious Optimism !
The results are a combination of good news and bad news:
1. There are diagnostic tools that can help lenders address transparency challenges associated with machine learning underwriting models — they can generate relevant information about a model’s behavior to help lenders comply with the two specific consumer protection regulations considered in the report.
2. But there are no universal or “one size fits all” model diagnostic tools that lenders can use to help them explain, understand, and manage all aspects of machine learning underwriting models.
3. In short, lenders’ choices about which diagnostic tools to use and how to deploy them are important to achieving specific consumer protection goals, particularly for more complex models.
Best Practice #6 : Carefully select the right tools and approaches for addressing specific transparency needs of machine learning underwriting models. At this stage of maturity, multiple tools might be needed to address specific areas. The responsible use of model diagnostic tools is part of governance decisions that an organization has to make.
Choosing the tools matters; interpreting the analysis from those tools matters just as much.
7.2. AAN (Adverse Action Notice) :: Use wisely !
- The two tasks were to generate four drivers of adverse credit decisions for a set of 3,000 rejected applicants, and then to identify a feasible path towards acceptance within 12 months for each of them.
- The path to acceptance is interesting because it is both a computational challenge and a matter of finding the right set of feasible features that are practical for a consumer to change. [Note : Counterfactuals are an important area for this — I have marked counterfactuals as a future addition below]
1. They found substantial variation in the fidelity of diagnostic tools that provide four drivers for an adverse credit decision.
2. While the best tools identify features that indeed relate to the adverse credit decision, changing these features is often not sufficient to overturn the adverse credit decision.
3. Consistency across tools and models is … well … not that consistent; even 50% would be a high number to look for. In one way this is good, because we can get different viewpoints from different tools.
4. Consistency for simple models is higher than for complex models, but not by as much as we would expect — meaning it is the approach taken by the tools that matters more than the model algorithms themselves.
5. From a usability perspective, the current tools do not necessarily do well in deriving actionable paths to acceptance — changing only a few features in isolation is unlikely to overcome a rejection. As I mentioned earlier, counterfactuals might be an answer — still in research mode, but we might derive value if we pair counterfactuals with a human-in-the-loop.
Best Practice #7 : In suggesting a feasible path towards acceptance, changing the features flagged by the tools is, by itself, not sufficient to overturn the adverse credit decision. Instead, these features should be understood in the context of their feature correlations: only moving them together with correlated features shows the full effect on credit approvals.
Also consider features that can’t be changed, features that can move only in one direction, and correlated features, e.g. if we suggest more education, age also increases. While this is common sense, machines do not understand these constraints unless explicitly told.
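To make this concrete, here is a rough sketch of re-scoring a rejected applicant after nudging a flagged feature together with its strongly correlated neighbors. The model, training frame, step size and the assumption that “up” is the favorable direction are all placeholders; real constraints such as immutable or one-directional features would be layered on top.

```python
import pandas as pd

def rescore_with_correlated_moves(applicant: pd.Series, flagged: list[str],
                                  X_train: pd.DataFrame, model,
                                  step: float = 1.0, corr_cutoff: float = 0.5) -> float:
    """Move each flagged feature by `step` training-set standard deviations, shift
    strongly correlated features proportionally, then re-score the applicant."""
    corr, std = X_train.corr(), X_train.std()
    x = applicant.copy()
    for f in flagged:
        x[f] += step * std[f]
        # Drag strongly correlated partners along, scaled by their correlation
        partners = corr[f][(corr[f].abs() > corr_cutoff) & (corr.index != f)]
        for partner, rho in partners.items():
            x[partner] += step * rho * std[partner]
    return float(model.predict_proba(x.to_frame().T)[0, 1])
```

Comparing this score against the one obtained by moving the flagged features in isolation shows how much of the lift comes from the correlated moves.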
7.3. FL, DI & LDA (Fair Lending, Disparate Impact & Least Discriminatory Alternative) :: Practical and improving !
- Two fair lending doctrines reflect these requirements: Disparate Treatment and Disparate Impact.
- Disparate treatment focuses on whether lenders have treated applicants differently based on protected characteristics like race, gender, and so on.
- Disparate impact prohibits lenders’ use of facially neutral practices that have a disproportionately negative effect on protected classes, unless those practices meet a legitimate business need that cannot reasonably be achieved through alternative means with a smaller discriminatory effect. This is where the LDA comes in. As you will see below, tools are able to find alternatives using automated search.
- Financial institutions rely on statistical analyses to help them comply with both legal fair lending doctrines.
Best Practice #8 : To keep in mind — With the advent of advanced prediction tools, there has been heightened interest from the regulators, especially for complex models. And, where machine learning models rely on data from more varied sources or on more complex features, there are open questions concerning whether lenders and regulators may need new tools and face new limitations in efforts to diagnose disparate impact.
1. There is a set of diagnostic tools that exhibit high fidelity across both simple and complex models, i.e., they are able to reliably identify features that are related to the model’s disparities.
Best Practice #9 : When identifying model features that contribute to disparity, select tools that combine information about how a feature correlates with protected class status and how important the feature is for the model’s prediction.
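One simple way to read this practice as code: rank features by combining their association with protected-class membership and their importance to the model. This is a rough sketch, not the study’s actual tooling; model, X and protected are placeholders, and feature_importances_ assumes a tree-based model such as XGBoost.

```python
import numpy as np
import pandas as pd

def disparity_driver_ranking(model, X: pd.DataFrame, protected: pd.Series) -> pd.Series:
    """Rank features by |correlation with protected-class membership| x model importance."""
    corr_with_class = X.apply(lambda col: np.corrcoef(col, protected)[0, 1]).abs()
    importance = pd.Series(model.feature_importances_, index=X.columns)  # tree-based models
    return (corr_with_class * importance).sort_values(ascending=False)
```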
2. When the distribution of these features is equalized across groups, or these features are sizably perturbed in a favorable direction, disparities across protected classes are reduced. But the re-weighting needs to be done in the context of their feature correlations — manipulating the features in isolation yields smaller reductions in disparities.
3. No single model performs best across a range of possible fairness metrics, but complex models consistently outperform simpler models that rely on relatively few features, both in terms of fairness and predictive performance.
4. The relative patterns of predictive performance and adverse impact are preserved when evaluating underwriting models on a held-out data set with a different applicant composition.
Best Practice #10 : To keep in mind — more complex models exhibit higher predictive performance and smaller disparities across all metrics relative to most simple models.
5. The ability to describe features that drive disparities with respect to a protected class does not automatically lead to models that are less discriminatory alternatives (LDA) when this information is used mechanically. Automated tools perform significantly better than strategies based on dropping features that were identified as drivers of disparities in the model — even for the never-before-seen data set with a different applicant composition.
Best Practice #11 : When searching for alternative models that have less discriminatory properties (i.e. Less Discriminatory Alternatives), use automation with approaches like dual-objective optimization and adversarial de-biasing. Complex models, in combination with tools that rely on some degree of automation, can produce a menu of model specifications that efficiently trade off fairness and predictive performance.
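Here is a minimal sketch of the dual-objective idea (not any vendor’s actual method): minimize log-loss plus a fairness penalty on the score gap between groups, and sweep the penalty weight to produce a menu of fairness/performance trade-offs. The synthetic data, the protected-class indicator and the penalty form are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
# Synthetic protected-class indicator, correlated with one predictive feature
group = (X[:, 0] + rng.normal(scale=0.5, size=len(y)) > 0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -35, 35)))

def dual_objective(w, X, y, group, lam):
    """Log-loss plus lam times the squared gap in mean predicted score between groups."""
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    gap = p[group == 1].mean() - p[group == 0].mean()
    return log_loss + lam * gap ** 2

# Sweep the fairness weight to build a menu of (performance, disparity) trade-offs
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = minimize(dual_objective, np.zeros(X.shape[1]),
                 args=(X, y, group, lam), method="L-BFGS-B").x
    scores = sigmoid(X @ w)
    auc = roc_auc_score(y, scores)
    gap = scores[group == 1].mean() - scores[group == 0].mean()
    print(f"lambda={lam:>6}: AUC={auc:.3f}, score gap={gap:+.4f}")
```

Adversarial de-biasing follows the same spirit, except that a second model trying to predict the protected class from the scores plays the role of the fixed penalty term.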
8. Future Suggestions
Counterfactuals
- Except for a brief mention, this topic is not covered in the white paper. A deep dive into counterfactual methodologies, the vendor offerings, and their usefulness would definitely be a great track for the next version of the paper.
- Counterfactuals not only aid the AAN part of the story but also serve as an educational tool for consumers when used in a human-in-the-loop fashion.
Best Practice #12: Use counterfactuals to educate customers — it could be a great avenue to help potential customers, but use it in a human-in-the-loop fashion. A time might come, in the future, to expose counterfactuals directly to consumers, but that needs careful consideration from multiple perspectives.
Non-traditional, extended datasets and an inclusive ecosystem
- The question is not whether we can be more inclusive and serve underrepresented populations, but how …
- How can we be more inclusive, advancing the causes of the underrepresented, viz. the under-banked and the non-banked? This is socially beneficial and good for business, but it requires non-traditional datasets and even new business policies.
- Future versions of this report could help us understand how to use the tooling to build and evaluate models, adding the extra dimension of non-traditional, extended datasets.
More Model Types and white-box (less opaque) models
- I suspect most of the work assumes black-box models. It would be interesting to extend it to more transparent modeling, where one can peer into the different stages inside an algorithm — things like integrated gradients for deep learning models.
- This will give us a glimpse into the effective use of Deep Learning (with all its “deepness”) for underwriting, especially for the non-traditional, extended datasets. They have used complex neural network models, but there are no insights specific to that class of models — i.e., how they perform vis-à-vis non-DL models.
- We know we can develop models, but we also need to understand how to explain opaque models as well as apply fairness assessment effectively. If we can do that, I think we will make more progress in the inclusivity dimension viz. larger varied data and models that can extract correct valuable insights — the “responsible risk-taking”.
Intersectionality
- Another related line of inquiry to be included in the next version. The paper explores only one protected class — which might be relatively easier for the tools. A methodology for evaluating along intersectional dimensions is very important for practitioners.
Fairness mitigation
- This is another topic that is peripherally covered. A separate section covering mitigation strategies in the light of the tooling available would be very interesting.
Editorial nitpick
- The white paper could use some more editing ! I am saying this as humbly as possible, realizing the amount of material it covers. Interestingly, that is exactly when more editing is needed. There is some redundancy and repetition — many times I felt that I had read the same thing somewhere else, probably due to occasional loss of context that is not apparent.
- I fear that many who would have benefitted from this work might not stay with it till the end !
- Organizing the paper to lean more towards industry practitioners, rather than academics, might help.
Having said that …
- The main goal of this blog was for me to extract the insights and best practices for a couple of documents (internal and external) I am working on. This I was able to do very well — most of this blog is directly from the paper ! So the depth and the details are there.
Kudos to the team for developing this white paper — as I said earlier, insightful, thoughtful and detailed !!
Addendum. A note about fairness metrics
The paper has an excellent section on fairness metrics — which I thought would be very useful as a reference. There are three types — threshold-based metrics, non-threshold-based metrics and, of course, hybrid metrics.
Threshold-based metrics
- This type has a cutoff based on prior experience — it is often very intuitive and corresponds to a realistic use case.
- These metrics focus on relevant outcomes by considering the approval threshold used in practice. Disparities in extreme tails of the model might not matter much for observed disparities in outcomes.
- These metrics are thus closer to the meaning of fairness intended by disparate impact requirements.
Disadvantages
- The downside of threshold-based metrics is that they are specific to a decision threshold. If lenders change the decision threshold, the measured value of disparities also changes.
- These metrics can also be sensitive to changes in applicant distribution, as well as strategic considerations related to the relevant product, business line, or loan portfolio.
- A model can appear to have low disparities when faced with an applicant pool that contains many minority applicants who are assigned low risk scores by the model.
- That same model can have high disparities when faced with an applicant pool that contains many minority applicants who are assigned high risk scores by the model and consequently rejected.
Best Practice #13: When using any unconditional metric like the threshold-based metric for fairness assessment, make sure to document and then monitor the class composition. If the class composition changes from the expected one assumed in the metric, the metric has to be evaluated for revision.
Examples include :
- Adverse Impact Ratio (AIR) — the ratio of the acceptance rate for the minority group to the acceptance rate of the majority group. AIR values closer to 1 correspond to more parity.
- Differences in True Positive Rates (“TPR”) or False Positive Rates (“FPR”) — the TPR is the fraction of defaults that are correctly predicted, while the FPR is the fraction of non-defaults that are incorrectly predicted as defaults. Unlike the AIR, these measures also take outcome labels (here, defaults) into account, not only decisions (here, approvals). Values closer to zero correspond to more parity.
Best Practice #14: Use the threshold-based metrics jointly. For example, considering AIR in the context of TPR and FPR allows practitioners to determine whether greater approval-rate parity (AIR) is gained at the expense of approving people who have insufficient ability to repay the loan, which is reflected in a decreased TPR.
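For reference, the threshold-based metrics above take only a few lines of NumPy once decisions and outcomes are coded as 0/1 arrays (approved, predicted_default, defaulted and minority are hypothetical arrays, following the definitions above):

```python
import numpy as np

def adverse_impact_ratio(approved: np.ndarray, minority: np.ndarray) -> float:
    """AIR: minority approval rate divided by majority approval rate (closer to 1 = more parity)."""
    return approved[minority == 1].mean() / approved[minority == 0].mean()

def tpr_fpr_gaps(defaulted: np.ndarray, predicted_default: np.ndarray, minority: np.ndarray):
    """Group differences in TPR (defaults correctly flagged) and FPR (non-defaults wrongly flagged)."""
    def rates(y, pred):
        return pred[y == 1].mean(), pred[y == 0].mean()   # (TPR, FPR)
    tpr_min, fpr_min = rates(defaulted[minority == 1], predicted_default[minority == 1])
    tpr_maj, fpr_maj = rates(defaulted[minority == 0], predicted_default[minority == 0])
    return tpr_min - tpr_maj, fpr_min - fpr_maj
```

Reading the AIR alongside the TPR/FPR gaps is exactly the joint use this practice recommends.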
Non-threshold-based metrics
- Statistical or demographic parity — The difference in the average predicted probabilities by protected classes. The closer to zero, the more parity
- Conditional statistical parity follows the same idea as statistical parity but ‘controls’ for the impact of key features that might skew the probability distribution across protected class
- Standardized mean difference (“SMD”) is a scaled version of statistical parity: the average difference in predictions between protected classes, divided by the standard deviation of the model predictions. The closer to zero, the more parity.
Hybrid metrics
- Metrics that combine model predictions and decisions but are not threshold-based. A key example of such a hybrid metric is AUC parity.
- AUC parity — the difference in predictive performance, as measured by AUC, across protected classes.
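And a matching sketch for the non-threshold-based and hybrid metrics, computed over raw model scores (scores are predicted default probabilities; minority and defaulted are hypothetical 0/1 arrays):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def statistical_parity(scores: np.ndarray, minority: np.ndarray) -> float:
    """Difference in mean predicted probability between protected classes (closer to 0 = more parity)."""
    return scores[minority == 1].mean() - scores[minority == 0].mean()

def standardized_mean_difference(scores: np.ndarray, minority: np.ndarray) -> float:
    """SMD: the statistical-parity gap scaled by the standard deviation of all model predictions."""
    return statistical_parity(scores, minority) / scores.std()

def auc_parity(defaulted: np.ndarray, scores: np.ndarray, minority: np.ndarray) -> float:
    """Hybrid metric: difference in AUC between protected classes."""
    return (roc_auc_score(defaulted[minority == 1], scores[minority == 1])
            - roc_auc_score(defaulted[minority == 0], scores[minority == 0]))
```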