AI and Behavioral Science

LLM Agents Grounded in Self-Reports Enable General-Purpose Simulation of Individuals.

Park, J. S., Zou, C. Q., Kamphorst, J., Egan, N., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., Liang, P., Willer, R., & Bernstein, M. S.

Working Paper

The promise of human behavioral simulation—general-purpose computational agents that replicate human behavior across domains—could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals—applying large language models to qualitative interviews about their lives, then measuring how well these agents replicate the attitudes and behaviors of the individuals that they represent. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions. This work provides a foundation for new tools that can help investigate individual and collective behavior.

Policy Brief (05/2025), X Thread (11/2024)

Media Coverage: ScienceDirect (11/2024)

AI Behavioral Science

Jackson, M. O., Mei, Q., Wang, S., Xie, Y., Yuan, W., Benzell, S. G., Brynjolfsson, E., Camerer, C. F., Evans, J. A., Jabarian, B., Kleinberg, J., Meng, J., Mullainathan, S., Ozdaglar, A. E., Pfeiffer, T., Tennenholtz, M., Willer, R., Yang, D., & Ye, T.

Working Paper

We discuss the three main areas comprising the new and emerging field of "AI Behavioral Science." This includes not only how AI can enhance research in the behavioral sciences, but also how the behavioral sciences can be used to study and better design AI and to understand how the world will change as AI and humans interact in increasingly layered and complex ways.

Large Language Models Can Predict the Results of Social Science Experiments.

Hewitt, L.*, Ashokkumar, A.*, Ghezae, I., & Willer, R.

Nature

To evaluate whether large language models (LLMs) can be leveraged to predict the results of social science experiments, we built an archive of 70 pre-registered, nationally representative survey experiments conducted in the United States, involving 476 experimental treatment effects and 105,165 participants. We prompted an advanced, publicly available LLM (GPT-4) to simulate how representative samples of Americans would respond to the stimuli from these experiments. Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters. Accuracy remained high for unpublished studies that could not appear in the model’s training data (r = 0.90). We further assessed predictive accuracy across demographic subgroups, various disciplines, and in nine recent megastudies featuring an additional 346 treatment effects. Together, our results suggest LLMs can augment experimental methods in science and practice, but also highlight important limitations and risks of misuse.

X Thread (08/2024), Talk at Stanford Tech Impact and Policy Center (02/2025), Talk at HAI (09/2023)

Media Coverage: HAI (07/2025)

A Reporting Checklist for Large Language Models in Behavioural Science

Feuerriegel, S., Christopher Barrie, M. J. Crockett… [72 authors] … Robb Willer, Dirk U. Wulff, Renwen Zhang, Simone Zhang, Steve Rathje, Manoel Horta Ribeiro.

Nature Human Behaviour

Large language models (LLMs) are deep neural network architectures (typically transformers) trained on a large body of textual data that can generate human-like text. Many researchers across the behavioural and social sciences are enthusiastic about the potential of LLMs to open new avenues for studying human behaviour [1]. For example, researchers have proposed using LLMs to conduct in silico experiments that simulate human judgments and decisions in response to interventions, and to facilitate large-scale data annotation and analysis [2]. Other works have deployed LLMs as interventions to foster creativity, persuade, teach, or reduce misinformation beliefs [3,4]. However, the rapidly evolving role of LLMs in shaping empirical evidence, theoretical frameworks, policy decisions, and public discourse poses challenges for research rigour.

Here, we present a consensus-based checklist (‘GUIDE-LLM’) for research that involves LLMs in behavioural and social science, with the aim of strengthening transparency, reproducibility and ethical accountability. GUIDE-LLM stands for ‘guidelines for the use of LLMs in behavioural and social science’. The GUIDE-LLM checklist provides researchers with concrete guidance on improving transparency, reproducibility and ethical use, to strengthen rigour and documentation throughout the entire research workflow (including when LLMs function as research tools, as well as studies in which LLMs themselves are the object of empirical investigation).