AI Metrology Working Group
In the summer of 2024, Humane Intelligence, under the leadership of Dr. Rumman Chowdhury, launched an AI Metrology Working Group (AI MWG) to foster a dedicated community for improving the scientific measurement of AI technology and to build evaluation capacity external to traditional big tech institutions. The AI MWG currently draws members from around the globe across private industry, government agencies, civil society, and academia. The group maintains an interdisciplinary focus at the intersection of AI and machine learning, metrology, social and behavioral science, and the humanities. As an innovative hub for AI evaluation, the group is managed and led by the Center for Responsible AI.
A primary objective of the AI MWG is to establish the domain of “real-world AI evaluation,” along with its associated concepts, tools, data, and methods, to improve AI’s real-world outcomes. A recent paper authored by many of the WG members sets forth a framework for this field, emphasizing methods that are “real-world” in two respects: they account for what actually happens when people use AI and, even more significantly, for the context in which that use unfolds. Such approaches can yield valuable data and insights about AI’s opportunities and risks, ultimately informing AI design, development, deployment, governance, and policy-making.
Real-world AI evaluation makes it possible to answer the most pressing questions that policymakers and publics worldwide are asking about AI’s long-term societal, cultural, and economic effects. The current AI assurance discourse skews heavily toward computational concepts and metrics and, more recently, toward the potential for “artificial general intelligence” (AGI) and the various hypothetical ramifications that could arise from advanced AI capabilities. This development-centric focus has left significant gaps in evaluating and understanding AI’s impacts on people and society. For example, status quo evaluation methods cannot effectively account for phenomena such as human emotional entanglement with chatbots, assess the effects of such entanglement, or estimate the benefit of policy solutions and risk mitigations to address it. Continued reliance on methods that cannot assess these secondary and tertiary effects of AI can reduce market opportunities, stifle innovation, impede intelligence and economic forecasting, and exacerbate inequality of opportunity.
The Broader Ecosystem
The WG aims to establish a real-world AI evaluation ecosystem that can empower the much larger group of stakeholders outside of development to collaboratively engage in the evaluation process, create solutions, and navigate the challenges posed by AI. This ecosystem can:
- enable AI users, the public, and other non-development stakeholders to specify their questions and requirements within their own contexts,
- identify new sensors for collecting contextually rich data in accordance with user consent and human subjects protections,
- adapt and establish evaluation methods for assessing AI's real-world secondary and tertiary impacts,
- equip interdisciplinary evaluators with the resources they need to assess AI outcomes independently of the technology's development, and
- foster a dedicated community that can further expand and leverage a real-world AI evaluation toolbox.
Subgroups
To foster broader engagement and expand the ecosystem surrounding real-world AI evaluation, the WG features spin-off subgroups based on specific areas of interest, bringing in new stakeholders and allowing all members to dive deeper into specialized topics. The subgroups conduct collaborative research projects in their focus areas through community building and the creation of stakeholder networks interested in particular aspects of AI evaluation. Subgroups facilitate idea and resource sharing and produce deliverables such as best practices, guidelines, tools, and frameworks for their areas, adding depth to the overall ecosystem. A few of the initial interdisciplinary subgroup topics are listed below.

1. Assessing AI’s Return on Investment (ROI)
As organizations adopt generative AI technologies, they need data and metrics to assess AI’s financial implications in their own contexts and justify investments. Without clear ROI metrics, companies may struggle to evaluate whether AI’s benefits outweigh its costs, leading to poor decision-making and misallocated resources. Demonstrating tangible ROI can drive innovation and competitive advantage, support compliance and risk management activities, and bolster shareholder confidence. This subgroup will evaluate AI's financial and functional benefits across sectors, aiming to develop methodologies that quantitatively and qualitatively measure economic impact, focusing on cost savings, productivity, and utility (a simple ROI calculation is sketched after the list below). By engaging diverse stakeholders, the subgroup will:
- develop methods for assessing the economic impact of AI on organizational performance,
- establish metrics to measure the efficacy and efficiency of AI systems in delivering value,
- analyze long-term cost savings, productivity improvements, and qualitative benefits arising from AI integration across sectors.
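To make the ROI framing concrete, here is a minimal sketch of the kind of quantitative baseline the subgroup's methodologies would extend. The cost and benefit categories, the figures, and the textbook ROI formula are illustrative assumptions, not subgroup-endorsed metrics.

```python
# Minimal, illustrative ROI calculation. All category names and figures
# below are hypothetical placeholders, not subgroup-endorsed metrics.

def roi(total_benefit: float, total_cost: float) -> float:
    """Textbook ROI: net benefit relative to cost."""
    return (total_benefit - total_cost) / total_cost

# Hypothetical first-year figures for a generative AI deployment (USD).
costs = {
    "licenses_and_compute": 250_000,
    "integration_and_training": 120_000,
    "ongoing_oversight": 80_000,
}
benefits = {
    "support_ticket_deflection": 310_000,  # cost savings
    "analyst_hours_reclaimed": 190_000,    # productivity gains, monetized
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())
print(f"Total cost:    ${total_cost:,}")
print(f"Total benefit: ${total_benefit:,}")
print(f"ROI: {roi(total_benefit, total_cost):.1%}")  # 11.1% in this example
```

Note that the qualitative benefits the subgroup also targets, such as utility and decision quality, resist this kind of straightforward monetization; closing that measurement gap is part of the subgroup's aim.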

2. Multi-Lingual Factors
Generative AI technologies face significant limitations in various languages and cultural contexts, particularly for neurodiverse and disabled audiences. Chatbots often fail to recognize idiomatic phrases and contextual nuances such as formality and sarcasm, and to produce appropriate outputs. These shortcomings can adversely affect users outside of Standard American English environments. Research shows that more dynamic approaches are needed to address these challenges. This interdisciplinary subgroup will involve experts in linguistics, human-language technology, and natural language processing to improve current methodologies for:
- assessing the communicative and cultural competence of AI models in real-world contexts (a toy probe is sketched after this list),
- modeling communication gaps between AI models and people, and
- identifying potential thresholds and policy strategies for structuring model responses based on linguistic context.
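As a toy illustration of the first item above, the sketch below probes a model's interpretation of idiomatic phrases against human-provided glosses. The `interpret` stub, the idiom pairs, and the exact-match scoring are all hypothetical placeholders; a real assessment would draw on consented, contextually rich data and far richer scoring than exact match.

```python
# Toy probe of idiomatic competence. The model stub, idiom pairs, and
# exact-match scoring are hypothetical placeholders.

def interpret(phrase: str) -> str:
    """Stand-in for a real model call; mixes idiomatic and literal readings."""
    canned = {
        "break a leg": "good luck",            # correct idiomatic reading
        "spill the beans": "knock over food",  # literal misreading
    }
    return canned.get(phrase, "unknown")

# Idioms paired with human-provided glosses (the expected meaning).
test_set = [
    ("break a leg", "good luck"),
    ("spill the beans", "reveal a secret"),
]

correct = sum(interpret(phrase) == gloss for phrase, gloss in test_set)
print(f"Idiomatic competence: {correct}/{len(test_set)} glosses matched")
```

Even this crude probe surfaces the literal-versus-idiomatic gap the subgroup describes; scaling it across languages, dialects, and registers is where the methodological work lies.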

3. Human-AI Configuration Challenges
Individuals' perceptions, expectations, and purposes shape their interactions with generative AI, leading to varied system performance and outcomes. This can result in misinterpretations of AI output, misuse of systems, and fluctuating trust levels. Users may form emotional attachments to chatbots, potentially causing harmful consequences. Few studies currently examine whether certain chatbot behaviors pose greater risks to specific user groups. Additionally, generative AI may produce sycophantic or overly persuasive responses, complicating the development of appropriate behavioral boundaries and increasing the risk of human over-reliance. This interdisciplinary subgroup will include experts in human-AI interaction, sociology, computational social science, ethics, psychology, and related disciplines to focus on:
- disentangling the factors that drive human-AI configuration challenges,
- defining and formalizing the concepts that underlie these phenomena, and
- building methods to assess these factors under real-world conditions (one candidate measure is sketched below).
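As one hedged example of such a method, the sketch below computes over- and under-reliance rates from logged human-AI interactions, a common way of operationalizing appropriate reliance in the human-AI interaction literature. The interaction log is a hypothetical placeholder, and real-world use would require consented logging as noted earlier.

```python
# Over-/under-reliance from logged interactions: how often users follow
# AI advice when it is wrong, and ignore it when it is right.
# The log below is a hypothetical placeholder.

# Each record: (ai_was_correct, user_followed_ai)
interactions = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True), (False, False),
]

followed_when_wrong = [followed for correct, followed in interactions if not correct]
followed_when_right = [followed for correct, followed in interactions if correct]

over_reliance = sum(followed_when_wrong) / len(followed_when_wrong)
under_reliance = 1 - sum(followed_when_right) / len(followed_when_right)
print(f"Over-reliance rate:  {over_reliance:.2f}")   # 0.67: followed bad advice
print(f"Under-reliance rate: {under_reliance:.2f}")  # 0.33: ignored good advice
```

Disentangling what drives these rates, such as user expectations or sycophantic model behavior, is exactly the kind of factor analysis the subgroup targets.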

4. Accessibility in AI
As AI integrates into sectors like healthcare, education, and employment, its accessibility plays a vital role in determining whether it will widen or close gaps for people with disabilities. Attention to this issue promotes equitable systems that allow everyone to benefit from AI. Traditional evaluations often overlook user diversity and marginalized experiences, which can lead to AI reinforcing existing barriers. Focusing on widely used applications such as speech recognition and image recognition, the subgroup will assess accessibility performance, enhance the metrology of accessibility in AI, build associated tooling, and create actionable insights to:
- develop standardized accessibility metrics for evaluating AI applications (one candidate metric is sketched after this list),
- analyze how AI can both bridge and reinforce existing gaps for marginalized communities, and
- pilot metrics across widely used AI applications to improve accessibility and effectiveness.
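As one concrete candidate for a standardized accessibility metric, consider a disparity-style measure for speech recognition: word error rate (WER) computed separately for each user group, with the gap between groups serving as the accessibility indicator. The sketch below is a minimal illustration; the group labels and transcript pairs are hypothetical placeholders.

```python
# Per-group word error rate (WER) and the gap between groups, as a
# disparity-style accessibility metric. Data below is hypothetical.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via word-level edit distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# (reference transcript, system transcript) pairs, keyed by user group.
samples = {
    "typical_speech": [
        ("turn on the kitchen lights", "turn on the kitchen lights"),
    ],
    "dysarthric_speech": [
        ("turn on the kitchen lights", "turn on the chicken light"),
    ],
}

wer_by_group = {
    group: sum(word_error_rate(ref, hyp) for ref, hyp in pairs) / len(pairs)
    for group, pairs in samples.items()
}
for group, wer in wer_by_group.items():
    print(f"{group}: WER = {wer:.2f}")

# A large gap flags an accessibility failure even when pooled WER looks fine.
gap = max(wer_by_group.values()) - min(wer_by_group.values())
print(f"WER gap between groups: {gap:.2f}")
```

Piloting a metric like this at scale would require representative, consented data from the affected communities, which is where the subgroup's metrology and tooling work comes in.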

5. Cross-Sectoral Dependencies
Regulated industries like healthcare, insurance, and financial services face numerous challenges in adopting generative AI, making many organizations hesitant to implement it. Various working groups have emerged to address issues such as accountability, fairness, risk data collection, and AI assurance. Instead of focusing on a single sector, this subgroup will tackle common industry challenges to reduce overhead, promote knowledge sharing, and identify effective real-world AI evaluation approaches. This subgroup will draw on the combined expertise of stakeholders from the healthcare, insurance, and financial sectors to:
- develop scoring rubrics for AI quality assessments (a minimal rubric sketch follows this list),
- identify sector-wide challenges and associated collaborative solutions, and
- promote reproducibility and best practices in AI deployment across multiple sectors.
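To illustrate the scoring-rubric deliverable named in the first item, here is a minimal sketch of a weighted rubric represented as a simple data structure. The criteria, weights, and scores are hypothetical placeholders rather than a subgroup-endorsed rubric.

```python
# Minimal weighted scoring rubric for AI quality assessment. Criteria,
# weights, and scores below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance; weights should sum to 1.0
    score: float   # assessor rating on a 0-5 scale

rubric = [
    Criterion("accuracy_of_outputs", weight=0.35, score=4.0),
    Criterion("fairness_across_groups", weight=0.25, score=3.0),
    Criterion("auditability_of_decisions", weight=0.20, score=2.5),
    Criterion("documentation_quality", weight=0.20, score=4.5),
]

# Guard against malformed rubrics before scoring.
assert abs(sum(c.weight for c in rubric) - 1.0) < 1e-9, "weights must sum to 1"

composite = sum(c.weight * c.score for c in rubric)
print(f"Composite quality score: {composite:.2f} / 5.00")  # 3.55 here
for c in rubric:
    print(f"  {c.name}: {c.score:.1f} (weight {c.weight:.0%})")
```

A shared structure like this is what makes assessments comparable across healthcare, insurance, and financial services, directly supporting the reproducibility goal above.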

Governance
To effectively govern the AI MWG and its subgroups, and to provide opportunities for external partners to engage, a clear governance framework can be established to ensure transparent and inclusive decision-making processes, accountability measures, and engagement strategies. Key governance components include:
- Steering Committee: Composed of representatives from private industry, civil society, and academia; responsible for setting strategic direction, overseeing subgroup formation, and evaluating overall progress.