
Predicting Results of Social Science Experiments Using Large Language Models
Hewitt, L.*, Ashokkumar, A.*, Ghezae, I., & Willer, R.
Working Paper
To evaluate whether large language models (LLMs) can be leveraged to predict the results of social science experiments, we built an archive of 70 pre-registered, nationally representative survey experiments conducted in the United States, involving 476 experimental treatment effects and 105,165 participants. We prompted an advanced, publicly available LLM (GPT-4) to simulate how representative samples of Americans would respond to the stimuli from these experiments. Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters. Accuracy remained high for unpublished studies that could not appear in the model’s training data (r = 0.90). We further assessed predictive accuracy across demographic subgroups, various disciplines, and in nine recent megastudies featuring an additional 346 treatment effects. Together, our results suggest LLMs can augment experimental methods in science and practice, but also highlight important limitations and risks of misuse.
Web demo, X Thread (08/2024), Talk at Stanford Tech Impact and Policy Center (02/2025), Talk at HAI (09/2023)
Media Coverage: HAI (07/2025)
Experimental Results Forecaster Demo >>
This demo accompanies the paper Predicting Results of Social Science Experiments Using Large Language Models and can be used to predict experimental treatment effects on U.S. adults. To keep the cost of hosting this demo publicly manageable, it uses GPT-4o-mini rather than GPT-4.
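The core idea behind the simulation step is to prompt an LLM to role-play survey respondents with specified demographic profiles and record their answers to the experimental stimuli. A minimal sketch of constructing such a prompt; the wording, profile fields, and response scale below are hypothetical illustrations, not the paper's actual prompts:

```python
# Build a hypothetical prompt asking an LLM to answer a survey item as a
# respondent with a given demographic profile. In the study's design, many
# simulated responses per condition would be aggregated, and conditions
# compared, to produce a predicted treatment effect.

def build_respondent_prompt(profile: dict, stimulus: str, question: str) -> str:
    persona = ", ".join(f"{key}: {value}" for key, value in profile.items())
    return (
        f"You are answering a survey as a U.S. adult with this profile: {persona}.\n"
        f"You just read the following text:\n\"{stimulus}\"\n"
        f"{question}\n"
        "Answer with a single number from 1 (strongly disagree) to 7 (strongly agree)."
    )

prompt = build_respondent_prompt(
    {"age": 45, "gender": "female", "party affiliation": "independent"},
    "Example persuasive message shown in the treatment condition.",
    "To what extent do you agree with the message above?",
)
```

Sending this prompt to a model such as GPT-4o-mini (as the demo does) and parsing the numeric reply would yield one simulated response; contrasting the average simulated response across treatment and control stimuli approximates a predicted treatment effect.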