When the stakes are high: new study finds people distrust single AI models and want human oversight when algorithms disagree

A new study by computer scientists at the University of California San Diego and the University of Wisconsin–Madison warns that relying on a single “best” machine learning (ML) model for high‑stakes decisions — from loan approvals to hiring — can undermine perceived fairness, and that ordinary people prefer human arbitration when equally good models disagree. The research, presented at the 2025 ACM CHI conference, explored how lay stakeholders react when multiple high‑accuracy models reach different conclusions for the same applicant and found strong resistance both to single‑model arbitrariness and to solutions that simply randomize outcomes; instead, participants favored wider model searches, transparency and human decision‑making to resolve disagreements (UC San Diego report). The authors’ paper, Perceptions of the Fairness Impacts of Multiplicity in Machine Learning (CHI 2025), presents the detailed results and recommendations.

This matters for Thai readers because ML systems are already embedded in many high‑stakes domains domestically — digital lending, automated credit scoring, recruitment platforms, and public services — and Thailand’s financial and regulatory authorities are actively working on guidance for AI risk management. The Bank of Thailand opened public consultation on draft AI risk guidelines for financial‑service providers in mid‑2025, signalling an appetite for rules that address exactly the kinds of harms the study highlights (Tilleke & Gibbins summary of the BOT draft guidelines). If policy and industry ignore multiplicity — the existence of many equally well‑performing but outcome‑diverse models — Thai applicants could face arbitrary differences in life‑changing decisions depending on which model was chosen behind the scenes.

The research builds on a growing literature about the “Rashomon” effect in machine learning — the fact that many different models can achieve comparable accuracy while producing different predictions for individual cases. Earlier work warned that predictive multiplicity can mask arbitrariness and risky fairness outcomes; the CHI paper moves beyond theoretical analyses and asks the public how they want organizations to handle multiplicity when decisions matter (see related multiplicity research). The UC San Diego news release describes the framing: a bank or employer running several “equally good” models that disagree about a single application raises the question of which model’s decision should stand, and how that process should be judged by decision subjects and the broader public (UC San Diego report).
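To make the Rashomon effect concrete, here is a minimal sketch, assuming synthetic data and off-the-shelf scikit-learn classifiers (none of the models, data or numbers come from the study itself): three classifiers reach near-identical accuracy, yet still disagree on some individual cases.

```python
# Minimal illustration of the "Rashomon" effect: several models with
# comparable accuracy can still disagree on individual applicants.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.3f}")

# Cases whose outcome depends on which "equally good" model was deployed.
stacked = np.vstack(list(predictions.values()))
flipped = np.where(stacked.min(axis=0) != stacked.max(axis=0))[0]
print(f"{len(flipped)} of {len(y_test)} test cases change outcome with the choice of model")
```

Every one of those flipped cases corresponds to an applicant whose result would depend purely on which model the organization happened to pick.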

Key findings from the CHI study are striking and directly relevant to Thai policy and industry practice. Drawing on multiple experimental studies with thousands of participants across scenarios, the authors found: (1) a clear public preference against the status quo practice of selecting a single model without explanation when multiple models disagree; (2) a strong rejection of simple randomization (flipping a coin among models) as an acceptable tie‑breaking mechanism in high‑stakes contexts; and (3) support for remedies such as searching across a broader set of models to identify those that better align with fairness objectives, and involving human decision‑makers to adjudicate disagreements rather than leaving outcomes to opaque algorithmic choice. The UC San Diego account quotes the study lead: “ML researchers posit that current practices pose a fairness risk. Our research dug deeper into this problem. We asked lay stakeholders, or regular people, how they think decisions should be made when multiple highly accurate models give different predictions for a given input” (UC San Diego report).

These results challenge some academic defaults. In machine learning research and in many operational pipelines, teams often pick one “best” model according to cross‑validation metrics and deploy it. Philosophy and some fairness work have previously suggested randomization can be a neutral tie‑breaker when models are equally performant. But the CHI study shows that ordinary people do not view randomization as legitimate in contexts like hiring or lending; they expect organizations to take responsibility for the process and prefer human oversight when outcomes differ (CHI paper abstract and discussion). One of the paper’s first authors is quoted explaining that participants’ preferences “contrast with standard practice in ML development and philosophy research on fair practices,” underscoring a gap between academic proposals and public expectations (UC San Diego report).

Expert reaction in the field has been converging on caution about multiplicity for some time. Research on the Rashomon set shows that the “many roads to similar accuracy” phenomenon can either be an opportunity to find fairer models or a liability if model choice is arbitrary; recent papers propose methods to measure multiplicity, evaluate allocation consequences, and intentionally select models according to social objectives rather than solely predictive metrics (see theoretical and empirical multiplicity studies and NeurIPS analyses of the Rashomon effect). The CHI study complements that work by adding a normative layer: it asks what stakeholders consider fair when multiplicity arises.
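For readers who want a sense of what “measuring multiplicity” can look like in code, the sketch below computes two quantities often discussed in this literature, ambiguity (how many cases any near-optimal model disputes) and discrepancy (the largest pairwise disagreement rate); the epsilon tolerance, function names and use of validation accuracy are illustrative assumptions rather than the definitions used by any specific paper.

```python
# Hedged sketch of two multiplicity-style metrics computed over a set of
# near-optimal ("Rashomon set") models; thresholds and names are illustrative.
import numpy as np

def rashomon_set(models, X_val, y_val, epsilon=0.01):
    """Keep models whose validation accuracy is within epsilon of the best."""
    accuracies = {name: m.score(X_val, y_val) for name, m in models.items()}
    best = max(accuracies.values())
    return {name: m for name, m in models.items() if accuracies[name] >= best - epsilon}

def ambiguity(models, X):
    """Fraction of cases where at least two near-optimal models disagree."""
    preds = np.vstack([m.predict(X) for m in models.values()])
    return float(np.mean(preds.min(axis=0) != preds.max(axis=0)))

def discrepancy(models, X):
    """Largest disagreement rate between any pair of near-optimal models."""
    preds = [m.predict(X) for m in models.values()]
    pair_rates = [
        float(np.mean(a != b))
        for i, a in enumerate(preds)
        for b in preds[i + 1:]
    ]
    return max(pair_rates) if pair_rates else 0.0
```

Applied to candidate models like those in the earlier sketch, non-trivial ambiguity or discrepancy is exactly the signal that the choice among them is not neutral for decision subjects.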

For Thailand, the implications are concrete. The country’s financial sector is rapidly adopting automated scoring and digital lending platforms that use ML to speed decisions, expand access and cut costs. But if different vendors or internal teams could have chosen alternative models that would have yielded opposite outcomes for a loan applicant, Thai consumers may be subject to arbitrary differences that undermine trust in digital finance. The Bank of Thailand’s draft risk management guidelines emphasise governance, transparency and human oversight for high‑risk AI use — elements that align with the CHI paper’s recommendations (Bank of Thailand consultation summary). Regulators, banks and fintech firms in Thailand now face a choice: incorporate multiplicity audits and human adjudication into their AI governance, or risk public backlash and regulatory correction.

There are cultural and historical angles that make the CHI findings especially salient in Thailand. Thai society places high value on perceived fairness, community harmony and relational accountability; institutional arbitrariness is often met with loss of trust and informal reputational penalties. In public services, citizens expect clear reasons for decisions (for example, in university admissions or social welfare allocation); opaque automated rejections can amplify perceptions of unfairness, while “kreng jai” (hesitance to confront authorities) can inhibit appeals. Anecdotally, Thai customers who feel unfairly treated by a bank or government office may turn to social networks and media to voice grievances, putting reputational pressure on the institution. The CHI study’s finding that people prefer human review and transparent adjudication fits these cultural expectations: humans can be held accountable and can explain decisions in a way a sealed single model cannot (UC San Diego report).

What might change in practice? The researchers recommend several practical measures that Thai institutions can adopt immediately. First, expand model search procedures: instead of training and deploying a single “best” model, teams should explore a broader Rashomon set and assess whether different models systematically disadvantage particular groups. Second, introduce multiplicity audits into development pipelines that measure outcome variability and identify cases where model choice flips decisions. Third, require human adjudication — particularly for high‑stakes or borderline decisions — so disputed cases are settled with a transparent, accountable process rather than by hidden algorithmic selection. Fourth, document and disclose the decision‑making process to decision subjects, including whether multiple models were considered and how ties are broken. These proposals align with the CHI paper’s recommendations and with international best practice emerging in the AI governance literature (CHI paper and multiplicity literature: https://arxiv.org/abs/2409.12332; https://arxiv.org/abs/2501.15634).
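As a rough illustration of how the second and third recommendations could work together, the hedged sketch below runs a simple multiplicity audit: it auto-accepts only decisions on which every candidate model agrees and routes decision-flipping cases to a human adjudicator. The data structure, routing rule and function names are assumptions made for illustration, not the workflow prescribed by the paper or by any regulator.

```python
# Illustrative multiplicity audit: unanimous decisions are automated,
# disputed cases are queued for accountable human adjudication.
from dataclasses import dataclass
import numpy as np

@dataclass
class AuditResult:
    auto_decisions: dict       # applicant index -> decision all models agree on
    needs_human_review: list   # applicant indices where model choice flips the outcome

def multiplicity_audit(models, X_applicants):
    """Compare every candidate model's decision for each applicant."""
    preds = np.vstack([m.predict(X_applicants) for m in models.values()])
    unanimous = preds.min(axis=0) == preds.max(axis=0)
    auto = {int(i): int(preds[0, i]) for i in np.where(unanimous)[0]}
    review = [int(i) for i in np.where(~unanimous)[0]]
    return AuditResult(auto_decisions=auto, needs_human_review=review)
```

In a real deployment, the flagged cases would travel to the adjudicator together with each model's output and the documentation needed to explain the final call to the applicant.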

For Thai consumers and applicants, the study suggests several concrete actions. If you apply for a loan, job, university place or government benefit and receive an automated denial, ask the provider: Was an algorithm used? If so, which process decides outcomes when models disagree? Request a human review and a clear explanation of the reasons for the decision. Under the Bank of Thailand’s draft guidance, financial institutions should be prepared to explain their AI risk management practices and to provide human oversight for high‑risk decisions; citizens and consumer advocates should use the public consultation processes to push for explicit protections against multiplicity‑based arbitrariness (Bank of Thailand draft guidelines summary). Civil society groups and the media can also pressure banks and platforms to publish multiplicity audit results and to avoid black‑box deployment in sensitive domains.

There are important caveats and open questions. The CHI study captures public perceptions and preferences rather than prescribing algorithmic rules that always guarantee fair outcomes; some technical tradeoffs remain. Searching widely across models can improve fairness prospects but also increases complexity for developers and may expose organisations to other risks like overfitting or operational fragility. Human adjudication reduces algorithmic arbitrariness but introduces potential for bias, inconsistency and administrative burden unless adjudicators receive clear guidance and accountability mechanisms. Researchers in ML fairness are actively developing methods to measure multiplicity, select models using fairness‑aware objectives, and design adjudication workflows that combine human judgment with algorithmic explanations (see recent theoretical work on the Rashomon set and multiplicity mitigation: https://arxiv.org/abs/2501.15634; https://proceedings.neurips.cc/paper_files/paper/2024/file/dbd07478c4aac41c0ce411e12f2e5a28-Paper-Conference.pdf).
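One commonly sketched way to do fairness-aware selection within a Rashomon set, shown below under the assumption that a demographic-parity gap is the chosen criterion, is to restrict attention to near-optimal models and then prefer the one with the smallest disparity; the metric, threshold and group encoding here are illustrative choices, not recommendations from the CHI study.

```python
# Hedged sketch: among near-optimal models, pick the one with the smallest
# demographic-parity gap; any other fairness metric could be substituted.
import numpy as np

def parity_gap(model, X, group):
    """Absolute difference in positive-decision rates between two groups (0/1)."""
    pred = model.predict(X)
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

def select_fair_model(models, X_val, y_val, group, epsilon=0.01):
    """Return the name of the fairest model whose accuracy is near the best."""
    accuracies = {name: m.score(X_val, y_val) for name, m in models.items()}
    best = max(accuracies.values())
    near_optimal = {n: m for n, m in models.items() if accuracies[n] >= best - epsilon}
    return min(near_optimal, key=lambda n: parity_gap(near_optimal[n], X_val, group))
```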

Looking ahead, the interaction of public expectations, technical research and Thai regulation will shape how multiplicity is handled in Thailand. The Bank of Thailand’s draft guidance is a welcome opening; if final rules emphasise transparency, governance and human oversight for high‑risk uses — and if they require documentation of model selection and multiplicity audits — Thai banks could set a regional example in responsible AI deployment (Bank of Thailand consultation). Technology vendors and data scientists will need to adapt development practices to include multiplicity metrics and provide interfaces for human adjudicators to interpret and resolve algorithmic disagreements. Consumer education campaigns can also help applicants understand their rights and the limits of algorithmic decision‑making.

In conclusion, the CHI 2025 study led by the UC San Diego and University of Wisconsin teams highlights a key social dimension of modern AI: people expect accountability, not randomness, when algorithms disagree. For Thailand — where digital finance, e‑government and platform work are reshaping everyday life — the message is clear. Regulators, banks, employers and platform designers should widen model searches, audit for multiplicity, keep humans in the loop for disputed cases, and be transparent with affected individuals. Thai consumers should be empowered to demand explanations and human review for important decisions. Taken together, these steps can help transform ML from a source of opaque arbitrariness into one more tool that supports fair, explained and culturally‑sensitive decision‑making.

Sources: UC San Diego news release on the CHI paper (today.ucsd.edu); preprint and CHI proceedings listing for Perceptions of the Fairness Impacts of Multiplicity in Machine Learning (arXiv abstract, ACM DL entry); broader literature on Rashomon/multiplicity and fairness (arXiv multiplicity paper, NeurIPS analysis of Rashomon effect); and reporting on Thailand’s draft AI risk guidelines for finance (Tilleke & Gibbins summary of Bank of Thailand draft guidelines).
