The study was approved by the Office of Research Ethics of the University of Waterloo under protocol no. 42142.
Pre-registration and deviations
The forecasts of all participating teams along with their rationales were pre-registered on the Open Science Framework (https://osf.io/6wgbj/registrations). Additionally, in an a priori specified document shared with the journal in April 2020, we outlined the operationalization of the key dependent variable (MASE), the operationalization of the covariates and benchmarks (that is, the use of naive forecasting methods), and the key analytic procedures (linear mixed models and contrasts between different forecasting approaches; https://osf.io/7ekfm). We did not pre-register the use of a Prolific sample from the general public as an additional benchmark before their forecasting data were collected, though we did pre-register this benchmark in early September 2020, prior to data pre-processing or analyses. Deviating from the pre-registration, we performed a single analysis with all covariates in the same model rather than performing separate analyses for each set of covariates, to protect against inflating P values. Furthermore, due to scale differences between domains, we chose not to feature analyses concerning absolute percentage errors of each time point in the main paper (but see the corresponding analyses on the GitHub site for the project, https://github.com/grossmania/Forecasting-Tournament, which replicate the key effects presented in the main manuscript).
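For reference, MASE scales the mean absolute error of a forecast over the evaluation horizon by the mean absolute error of a one-step naive (random walk) forecast computed on the historical series. The standard form is shown below for orientation; the exact scaling convention used in the tournament is an assumption of this sketch:

$$\mathrm{MASE} = \frac{\tfrac{1}{h}\sum_{t=n+1}^{n+h}\lvert y_t - \hat{y}_t\rvert}{\tfrac{1}{n-1}\sum_{t=2}^{n}\lvert y_t - y_{t-1}\rvert},$$

where $y_t$ denotes the observed monthly values, $\hat{y}_t$ the forecasts, $n$ the number of historical months and $h$ the forecast horizon; values below 1 indicate forecasts that outperform the in-sample naive benchmark.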
Participants and recruitment
We initially aimed for a minimum sample of 40 forecasting teams in our tournament; prescreening ensured that the participants possessed at minimum a bachelor’s degree in the behavioural, social or computer sciences. To ensure a sufficient sample for comparing groups of scientists employing different forecasting strategies (for example, data-free versus data-inclusive methods), we subsequently tripled the target sample size (N = 120) and reached this target by the November phase of the tournament (see Supplementary Table 1 for the demographics).
The Forecasting Collaborative website that we used for recruitment (https://predictions.uwaterloo.ca/faq) outlined the guidelines for eligibility and the approach for prospective participants. We incentivized the participating teams in two ways. First, the prospective participants had an opportunity for co-authorship in a large-scale citizen science publication. Second, we incentivized accuracy by emphasizing throughout the recruitment that we would be announcing the winners and would share the rankings of scientific teams in terms of performance in each tournament (per domain and in total).
As outlined in the recruitment materials, we considered data-driven (for example, model-based) or expertise-based (for example, general intuition or theory-based) forecasts from any field. As part of the survey, the participants selected which method(s) they used to generate their forecasts. Next, they elaborated on how they generated their forecasts in an open-ended question. There were no restrictions, though all teams were encouraged to report their education as well as areas of knowledge or expertise. The participants were recruited via large-scale advertising on social media; mailing lists in the behavioural and social sciences, the decision sciences, and data science; advertisement on academic social networks including ResearchGate; and word of mouth. To ensure broad representation across the academic spectrum of relevant disciplines, we targeted groups of scientists working on computational modelling, social psychology, judgement and decision-making, and data science to join the Forecasting Collaborative.
The Forecasting Collaborative began at the end of April 2020, when the US Institute for Health Metrics and Evaluation projected the initial peak of the COVID-19 pandemic in the United States. The recruitment phase continued until mid-June 2020, to ensure that at least 40 teams joined the initial tournament. We recruited 86 teams for the initial 12-month tournament (mean age, 38.18; s.d. = 8.37; 73% of the forecasts were made by scientists with a doctorate), each of which provided forecasts for at least one domain (mean = 4.17; s.d. = 3.78). At the six-month mark, shortly after the 2020 US presidential election, we provided the initial participants with an opportunity to update their forecasts (44% provided updates), while simultaneously opening the tournament to new participants. This strategy allowed us to compare new forecasts against the updated predictions of the original participants, resulting in 120 teams for this follow-up six-month tournament (mean age, 36.82; s.d. = 8.30; 67% of the forecasts were made by scientists with a doctorate; mean number of forecasted domains, 4.55; s.d. = 3.88). Supplementary analyses showed that the likelihood of updating did not significantly differ between data-free and data-inclusive models (z = 0.50, P = 0.618).
Procedure
Information for this project was available on the designated website (https://predictions.uwaterloo.ca), which included objectives, instructions and prior monthly data for each of the 12 domains that the participants could use for modelling. Researchers who decided to partake in the tournament signed up via a Qualtrics survey. The survey asked them to upload their estimates for the forecasting domains of their choice in a pre-programmed Excel sheet, which presented the historical trend and automatically juxtaposed their point-estimate forecasts against that trend on a plot (Supplementary Appendix 1), and to answer a set of questions about their rationale and forecasting team composition. Once all data were received, the de-identified responses were used to pre-register the forecasted values and models on the Open Science Framework (https://osf.io/6wgbj/).
At the halfway point (that is, at six months), the participants were provided with a comparison summary of their initial point estimate forecasts versus actual data for the initial six months. Subsequently, they were provided with an option to update their forecasts, provide a detailed description of the updates and answer an identical set of questions about their data model and rationale for their forecasts, as well as the consideration of possible exogenous variables and counterfactuals.
Materials
Forecasting domains and data pre-processing
Computational forecasting models require enough prior time series data for reliable modelling. On the basis of prior recommendations46, in the first tournament we provided each team with 39 monthly estimates—from January 2017 to March 2020—for each of the domains that the participating teams chose to forecast. This approach enabled the teams to perform data-driven forecasting (should the teams choose to do so) and to establish a baseline estimate prior to the US peak of the pandemic. In the second tournament, conducted six months later, we provided the forecasting teams with 45 monthly time points—from January 2017 to September 2020.
Because of the requirement for rich standardized data for computational approaches to forecasting9, we limited the forecasting domains to issues of broad societal importance. Our domain selection was guided by the discussion of broad social consequences associated with these issues at the beginning of the pandemic47,48, along with general theorizing about psychological and social effects of threats of infectious disease49,50. An additional pragmatic consideration concerned the availability of large-scale longitudinal monthly time series data for a given issue. The resulting domains include affective well-being and life satisfaction, political ideology and polarization, bias in explicit and implicit attitudes towards Asian Americans and African Americans, and stereotypes regarding gender and career versus family. To establish the common task framework—a necessary step for the evaluation of predictions in data science9,17—we standardized methods for obtaining relevant prior data for each of these domains, made the data publicly available, recruited competitor teams for a common task of inferring predictions from the data and a priori announced how the project leaders would evaluate accuracy at the end of the tournament.
Furthermore, each team had to (1) download and inspect the historical trends (visualized on an Excel plot; an example is in the Supplementary Information); (2) add their forecasts in the same document, which automatically visualized their forecasts against the historical trends; (3) confirm their forecasts; and (4) answer prompts concerning their forecasting rationale, theoretical assumptions, models, conditionals and consideration of additional parameters in the model. This procedure ensured that all teams, at the minimum, considered historical trends, juxtaposed them against their forecasted time series and deliberated on their forecasting assumptions.
Affective well-being and life satisfaction
We used monthly Twitter data to estimate markers of affective well-being (positive and negative affect) and life satisfaction over time. We relied on Twitter because no polling data on monthly well-being existed for the required time period, and because prior work suggests that national estimates obtained via social media language can reliably track subjective well-being51. For each month, we used previously validated predictive models of well-being, as measured by affective well-being and life satisfaction scales52. Affective well-being was calculated by applying a custom lexicon53 to message unigrams. Life satisfaction was estimated using a ridge regression model trained to predict Cantril ladder life satisfaction scores from latent Dirichlet allocation topics, which were selected using univariate feature selection and dimensionally reduced using randomized principal component analysis. Such Twitter-based estimates closely follow nationally representative polls54. We applied the respective models to Twitter data from January 2017 to March 2020 to obtain estimates of affective well-being and life satisfaction via language on social media.
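A minimal sketch of this kind of topic-based pipeline is shown below. It is illustrative only: the inputs (a region-month document-term matrix and survey-based Cantril ladder scores), variable names and hyperparameters are hypothetical, and it is not the validated model from refs. 52–54.

```python
# Illustrative sketch of a topic-based life-satisfaction model (not the authors' code).
# Inputs, variable names and hyperparameters are hypothetical placeholders.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_counts = rng.poisson(1.0, size=(400, 1000))   # unigram counts per region-month
y_cantril = rng.uniform(5.0, 8.0, size=400)     # Cantril ladder scores (training target)

model = Pipeline([
    ("lda", LatentDirichletAllocation(n_components=50, random_state=0)),    # LDA topics
    ("select", SelectKBest(f_regression, k=25)),                            # univariate selection
    ("pca", PCA(n_components=10, svd_solver="randomized", random_state=0)), # randomized PCA
    ("ridge", Ridge(alpha=1.0)),                                            # ridge regression
])
model.fit(X_counts, y_cantril)
monthly_estimate = model.predict(X_counts[:1])   # applied to new monthly language data
```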
Ideological preferences
We approximated monthly ideological preferences via aggregated weighted data from the Congressional Generic Ballot polls conducted between January 2017 and March 2020 (https://projects.fivethirtyeight.com/polls/generic-ballot/), which ask representative samples of Americans to indicate which party they would support in an election. We weighted the polls on the basis of FiveThirtyEight pollster ratings, poll sample size and poll frequency. FiveThirtyEight pollster ratings are determined by each pollster’s historical accuracy in forecasting elections since 1998, participation in professional initiatives that seek to increase disclosure and enforce industry best practices, and inclusion of live-caller surveys to cell phones and landlines. On the basis of these data, we then estimated monthly averages for support of the Democratic and Republican parties across pollsters (for example, Marist College, NBC/Wall Street Journal, CNN and YouGov/Economist).
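The sketch below illustrates the generic form of such a weighted monthly aggregation; the input file, column names and the way the three weighting factors are combined are hypothetical assumptions, not FiveThirtyEight’s or the project’s exact scheme.

```python
# Illustrative weighted aggregation of generic-ballot polls (hypothetical columns).
import pandas as pd

polls = pd.read_csv("generic_ballot_polls.csv")            # hypothetical input file
polls["month"] = pd.to_datetime(polls["end_date"]).dt.to_period("M")

# Combine pollster rating, sample size and poll frequency into a single weight.
polls["weight"] = polls["rating_weight"] * polls["sample_size"] * polls["frequency_weight"]

def weighted_mean(group, col):
    return (group[col] * group["weight"]).sum() / group["weight"].sum()

monthly = polls.groupby("month").apply(
    lambda g: pd.Series({
        "dem_support": weighted_mean(g, "dem_pct"),
        "rep_support": weighted_mean(g, "rep_pct"),
    })
)
```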
Political polarization
We assessed political polarization by examining differences in presidential approval ratings by party identification from Gallup polls (https://news.gallup.com/poll/203198/presidential-approval-ratings-donald-trump.aspx). We computed a difference score between the percentages of Republicans and Democrats approving of the president and estimated monthly averages for the period of interest. Taking the absolute value of this difference score ensures that possible changes following the 2020 presidential election do not change the direction of the estimate.
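Expressed as a formula (with polls indexed by $k$ within month $m$; whether the absolute value is applied per poll or to the monthly average is an implementation detail not specified above):

$$\mathrm{Polarization}_m = \frac{1}{K_m}\sum_{k=1}^{K_m}\bigl\lvert\,\%\mathrm{Approve}^{\mathrm{Rep}}_{k} - \%\mathrm{Approve}^{\mathrm{Dem}}_{k}\,\bigr\rvert.$$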
Explicit and implicit bias
Given the natural history of the COVID-19 pandemic, we sought to examine forecasted bias in attitudes towards Asian Americans (versus European Americans). To further probe racial bias, we sought to examine forecasted racial bias in attitudes towards African American (versus European American) people. Finally, we sought to examine gender bias in associations of the female (versus male) gender with family versus career. For each domain, we sought to obtain both estimates of explicit attitudes55 and estimates of implicit attitudes56. To this end, we obtained data from the Project Implicit website (http://implicit.harvard.edu), which has collected continuous data concerning explicit stereotypes and implicit associations from a heterogeneous pool of volunteers (50,000–60,000 unique tests on each of these categories per month). Further details about the website and test materials are publicly available at https://osf.io/t4bnj. Recent work suggests that Project Implicit data can provide reliable societal estimates of consequential outcomes57,58 and can be used to study cross-temporal societal shifts in US attitudes59. Despite the non-representative nature of the Project Implicit data, recent analyses suggest that the bias scores captured by Project Implicit are highly correlated with nationally representative estimates of explicit bias (r = 0.75), indicating that group aggregates of the bias data from Project Implicit can reliably approximate group-level estimates58. To further correct for possible non-representativeness, we applied stratified weighting to the estimates, as described below.
Implicit attitude scores were computed using the revised scoring algorithm of the IAT60. The IAT is a computerized task comparing reaction times to categorize paired concepts (in this case, social groups—for example, Asian American versus European American) and attributes (in this case, valence categories—for example, good versus bad). Average response latencies in correct categorizations were compared across two paired blocks in which the participants categorized concepts and attributes with the same response keys. Faster responses in the paired blocks are assumed to reflect a stronger association between those paired concepts and attributes. Implicit gender–career bias was measured using the IAT with category labels of ‘male’ and ‘female’ and attributes of ‘career’ and ‘family’. In all tests, positive IAT D scores indicate a relative preference for the typically preferred group (European Americans) or association (men–career).
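A simplified sketch of the core D-score logic appears below; it uses simulated latencies and omits the additional trial- and respondent-level screening steps of the full revised algorithm60, so it is an illustration of the central computation rather than the scoring procedure itself.

```python
# Simplified illustration of the core IAT D-score computation (not the full revised
# algorithm in ref. 60, which adds further trial- and respondent-level screening).
import numpy as np

rng = np.random.default_rng(0)
compatible = rng.normal(750, 150, size=60)     # correct-response latencies (ms), paired block A
incompatible = rng.normal(850, 160, size=60)   # correct-response latencies (ms), paired block B

pooled_sd = np.std(np.concatenate([compatible, incompatible]), ddof=1)
d_score = (incompatible.mean() - compatible.mean()) / pooled_sd
# A positive d_score indicates faster responding in the 'compatible' pairing
# (for example, European American + good), that is, a relative implicit preference.
```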
Respondents whose scores fell outside of the conditions specified in the scoring algorithm did not have a complete IAT D score and were therefore excluded from the analyses. Restricting the analyses to complete IAT D scores resulted in an average retention of 92% of the complete sessions across tests. The sample was further restricted to respondents from the United States, to increase shared cultural understanding of the attitude categories, and to respondents with complete demographic information on age, gender, race/ethnicity and political ideology.
For explicit attitude scores, the participants provided ratings on feeling thermometers towards Asian Americans and European Americans (to assess Asian American bias) and towards white and Black Americans (to assess racial bias), on a seven-point scale ranging from −3 to +3. Explicit gender–career bias was measured using seven-point Likert-type scales assessing the degree to which an attribute was female or male, from strongly female (−3) to strongly male (+3). Two questions assessed explicit stereotypes, one for each attribute (the association of career with female versus male and, separately, the association of family with female versus male). To match the relative nature of the IAT, relative explicit stereotype scores were created by subtracting the ‘incongruent’ association from the ‘congruent’ association (for example, (male–career versus female–career) − (male–family versus female–family)). Thus, for racial bias, −6 reflects a strong explicit preference for the minority over the majority (European American) group, and +6 reflects a strong explicit preference for the majority over the minority (Asian American or African American) group. Similarly, for gender–career bias, −6 reflects a strong counter-stereotypic association (for example, female–career/male–family), and +6 reflects a strong stereotypic association (for example, male–career/female–family). In both cases, the midpoint of 0 represents no relative preference or stereotype.
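Expressed as a worked illustration of the scale arithmetic for the gender–career items:

$$\text{relative stereotype score} = \mathrm{rating}(\text{career}) - \mathrm{rating}(\text{family}),\qquad \mathrm{rating}\in[-3,+3],$$

so the score spans −6 to +6; for example, rating career as strongly male (+3) and family as strongly female (−3) yields the maximal stereotypic score of +6.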
We used explicit and implicit bias data for January 2017–March 2020 and created monthly estimates for each of the explicit and implicit bias domains. Because of possible selection bias among the Project Implicit participants, we adjusted the population estimates by weighting the monthly scores on the basis of their representativeness of the demographic frequencies in the US population (age, race, gender and education, estimated biannually by the US Census Bureau; https://www.census.gov/data/tables/time-series/demo/popest/2010s-national-detail.html). Furthermore, we adjusted the weights on the basis of political orientation (1, ‘strongly conservative’; 2, ‘moderately conservative’; 3, ‘slightly conservative’; 4, ‘neutral’; 5, ‘slightly liberal’; 6, ‘moderately liberal’; 7, ‘strongly liberal’), using corresponding annual estimates from the General Social Survey. With the weighted values for each participant, we computed weighted monthly means for each attitude test. These procedures ensured that the weighted monthly averages approximated the demographics of the US population. We cross-validated this procedure by comparing the weighted annual scores to nationally representative estimates of feeling thermometers towards African Americans and Asian Americans from the American National Election Studies in 2017 and 2018.
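The sketch below shows the generic logic of such cell-based post-stratification weighting: each demographic cell receives a weight equal to its population share divided by its sample share within a month. The file names, demographic cells and column names are hypothetical, and the project’s actual weighting code may differ in detail.

```python
# Illustrative post-stratification weighting of monthly Project Implicit sessions
# (hypothetical file and column names; not the project's exact weighting code).
import pandas as pd

sessions = pd.read_csv("implicit_sessions.csv")   # one row per respondent session
census = pd.read_csv("census_cells.csv")          # 'pop_share' per demographic cell

cells = ["age_group", "race", "gender", "education", "ideology"]

# Share of each demographic cell within each month of the sample.
counts = sessions.groupby(["month"] + cells).size().rename("n")
totals = sessions.groupby("month").size().rename("n_month")
sample_share = counts.div(totals, level="month").rename("sample_share").reset_index()

# Weight = population share / sample share, so under-represented cells count more.
weights = sample_share.merge(census, on=cells)
weights["weight"] = weights["pop_share"] / weights["sample_share"]

sessions = sessions.merge(weights[["month"] + cells + ["weight"]], on=["month"] + cells)
monthly_bias = sessions.groupby("month").apply(
    lambda g: (g["iat_d"] * g["weight"]).sum() / g["weight"].sum()
)
```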
An initial procedure was developed for computing post-stratification weights for African American, Asian American and gender–career bias (implicit and explicit) to ensure that the sample was as representative of the US population at large as possible. Originally, we computed weights for the entire year, which were then applied to each month in that year. After receiving feedback from co-authors, we adopted an improved approach in which weights were computed on a monthly rather than yearly basis. This was necessary because demographic characteristics varied from month to month, so yearly weights had the potential to amplify bias instead of reducing it; the revised procedure maximized sample representativeness. This change affected forecasts from seven teams who had provided them before the change. The teams were informed, and four teams chose to provide updated estimates using the newly weighted historical data.
For each of these domains, the forecasters were provided with 39 monthly estimates in the initial tournament (45 estimates in the follow-up tournament), as well as detailed explanations of the origin and calculation of the respective indices. We thereby aimed to standardize the data source for the purpose of the forecasting competition9. See Supplementary Appendix 1 for example worksheets provided to the participants for submissions of their forecasts.
Forecasting justifications
For each forecasting model submitted to the tournament, the participants provided detailed descriptions. They described the type of model they had computed (for example, time series, game theoretic models or other algorithms), the model parameters, additional variables they had included in their predictions (for example, the COVID-19 trajectory or the presidential election outcome) and the underlying assumptions.
Confidence in forecasts
The participants rated their confidence in their forecasted points for each forecast model they submitted. These ratings were on a seven-point scale from 1 (not at all) to 7 (extremely).
Confidence in expertise
The participants provided ratings of their teams’ expertise for a particular domain by indicating their extent of agreement with the statement “My team has strong expertise on the research topic of [field].” These ratings were on a seven-point scale from 1 (strongly disagree) to 7 (strongly agree).
COVID-19 conditional
We considered the COVID-19 pandemic as a conditional of interest given links between infectious disease and the target social issues selected for this tournament. In Tournament 1, the participants reported whether they had used the past or predicted trajectory of the COVID-19 pandemic (as measured by the number of deaths or the prevalence of cases or new infections) as a conditional in their model, and if so, they provided their forecasted estimates for the COVID-19 variable included in their model.
Counterfactuals
Counterfactuals are hypothetical alternative past events that, had they occurred, would be expected to affect the forecasted outcomes. The participants described the key counterfactual events between December 2019 and April 2020 that they theorized would have led to different forecasts (for example, US-wide implementation of social distancing practices in February). Two independent coders evaluated the distinctiveness of the counterfactuals (interrater κ = 0.80). When discrepancies arose, the coders discussed individual cases with other members of the Forecasting Collaborative to make the final evaluation. In the primary analyses, we focus on the presence of counterfactuals (yes/no).
Team expertise
Because expertise can mean many things2,61, we used a telescopic approach and operationalized expertise in four ways of varying granularity. First, we examined broad, domain-general expertise in the social sciences by comparing social scientists’ forecasts with forecasts provided by the general public, who lack the same training in social science theory and methods. Second, we operationalized the prevalence of graduate training on a team as a more specific marker of domain-general expertise in the social sciences; to this end, we asked each participating team to report how many team members had a doctorate in the social sciences and calculated the percentage of doctorates on each team. Third, moving to domain-specific expertise, we asked the participating teams to report whether any of their members had previously researched or published on the topic of their forecasted variable. Finally, moving to the most subjective level, we asked each participating team to report their subjective confidence in their team’s expertise in a given domain (Supplementary Information).
General public benchmark
In parallel to the tournament with 86 teams, on 2–3 June 2020, we recruited a regionally, gender- and socio-economically stratified sample of US residents via the Prolific crowdworker platform (targeted N = 1,050 completed responses) and randomly assigned them to forecast societal change for a subset of domains used in the tournaments (well-being (life satisfaction and positive and negative sentiment on social media), politics (political polarization and ideological support for Democrats and Republicans), Asian American bias (explicit and implicit trends), African American bias (explicit and implicit trends) and gender–career bias (explicit and implicit trends)). During recruitment, the participants were informed that, in exchange for 3.65 GBP, they would need to be able to open an Excel worksheet, enter their forecasts and upload the completed file.
We retained responses if they provided forecasts for all 12 months in at least one domain and if the predictions did not exceed the possible range for a given domain (for example, polarization above 100%). Moreover, three coders (intercoder κ = 0.70 unweighted, κ = 0.77 weighted) reviewed all submitted rationales from lay people and excluded any submissions in which the participants either misunderstood the task or wrote bogus bot-like responses. Coder disagreements were resolved via discussion. Finally, we excluded responses if the participants spent under 50 seconds making their forecasts, which included reading instructions, downloading the files, providing forecasts and re-uploading their forecasts (final N = 802, 1,467 forecasts; mean age, 30.39; s.d. = 10.56; 46.36% female; education: 8.57% high school/GED, 28.80% some college, 62.63% college or above; ethnicity: 59.52% white, 17.10% Asian American, 9.45% African American/Black, 7.43% Latinx, 6.50% mixed/other; median annual income, $50,000–$75,000; residential area: 32.37% urban, 57.03% suburban, 10.60% rural).
Exclusions of the general public sample
Supplementary Table 7 outlines exclusions by category. In the initial step, we considered all submissions via the Qualtrics platform, including partial submissions without any forecasting data (N = 1,891). After removing incomplete responses without forecasting data and duplicate submissions from the same Prolific IDs, we removed 59 outliers whose data exceeded the range of possible values in a given domain. Subsequently, we removed responses that the independent coders flagged as either misunderstandings of the task (n = 6) or bot-like bogus responses (n = 26). See Supplementary Appendix 2 for verbatim examples of each screening category and the exact coding instructions. Next, we removed responses where the participants took less than 50 seconds to provide their forecasts (including reading instructions, downloading the Excel file, filling it out, re-uploading the Excel worksheet and completing additional information on their reasoning about the forecast). Finally, one response was removed on the basis of open-ended information indicating that the participant had made forecasts for a country other than the United States.
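For concreteness, the sequence of exclusions can be expressed as a series of filters; the sketch below uses hypothetical column names and flags and is not the project’s screening code (the authoritative counts are in Supplementary Table 7).

```python
# Illustrative filtering of general-public submissions (hypothetical columns).
import pandas as pd

raw = pd.read_csv("prolific_submissions.csv")                    # all Qualtrics submissions

df = raw.dropna(subset=["forecast_values"])                      # no forecasting data
df = df.drop_duplicates(subset="prolific_id", keep="first")      # duplicate Prolific IDs
df = df[df["within_domain_range"]]                               # out-of-range outliers
df = df[~df["flag_misunderstood"] & ~df["flag_bogus"]]           # coder-flagged responses
df = df[df["duration_seconds"] >= 50]                            # too-fast responses
df = df[df["forecast_country"] == "United States"]               # non-US forecasts
```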
Naive statistical benchmarks
There is evidence from data science forecasting competitions that the dominant statistical benchmarks are the Theta method, ARIMA and ETS7. Given the socio-cultural context of our study and to avoid loss of generality, we decided to employ more traditional benchmarks: the naive random walk, the historical average and the basic linear regression model—that is, the methods used most often in practice and in science. In short, we selected three benchmarks on the basis of their common application in the forecasting literature (historical mean and random walk are the most basic forecasting benchmarks) or the behavioural/social science literature (linear regression is the most basic statistical approach to testing inferences in the sciences). Furthermore, these benchmarks target distinct features of performance: the historical mean speaks to base rate sensitivity, linear regression speaks to sensitivity to the overall trend, and the random walk captures random fluctuations and sensitivity to dependencies across consecutive time points. Each of these benchmarks may perform better in some circumstances but not in others. Consequently, to test the limits of scientists’ performance, we examined whether social scientists’ performance was better than each of the three benchmarks. To obtain metrics of uncertainty around the naive statistical estimates, we simulated these three naive approaches to making forecasts: (1) random resampling of historical data, (2) a naive out-of-sample random walk based on random resampling of historical change and (3) extrapolation from a naive regression based on a randomly selected interval of historical data. We describe each approach in the Supplementary Information.
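The following sketch illustrates how the three simulated benchmarks can be generated from a historical series. The series values, number of simulation draws and interval rules are placeholders under stated assumptions rather than the exact procedure, which is described in the Supplementary Information.

```python
# Illustrative simulation of the three naive benchmarks (not the project's exact code).
import numpy as np

rng = np.random.default_rng(0)
history = rng.normal(50, 5, size=39)       # placeholder for 39 historical monthly values
horizon, n_draws = 12, 1000

# (1) Historical mean: resample past values with replacement for each future month.
resampled_history = rng.choice(history, size=(n_draws, horizon), replace=True)

# (2) Naive random walk: accumulate resampled month-to-month changes
#     starting from the last observed value.
changes = np.diff(history)
random_walk = history[-1] + np.cumsum(
    rng.choice(changes, size=(n_draws, horizon), replace=True), axis=1
)

# (3) Naive regression: fit a line to a randomly chosen interval of the history
#     and extrapolate it over the forecast horizon.
def one_regression_draw():
    start = rng.integers(0, len(history) - 12)        # random interval of >= 12 months
    end = rng.integers(start + 12, len(history) + 1)
    x = np.arange(start, end)
    slope, intercept = np.polyfit(x, history[start:end], 1)
    return intercept + slope * np.arange(len(history), len(history) + horizon)

regression_draws = np.vstack([one_regression_draw() for _ in range(n_draws)])

# Uncertainty around each benchmark can then be summarized from the simulated draws,
# for example np.percentile(random_walk, [2.5, 50, 97.5], axis=0).
```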
Analytic plan
Categorization of forecasts
We categorized the forecasts on the basis of modelling approaches. Two independent research associates categorized the forecasts for each domain, on the basis of the teams’ justifications, into the following categories: (1) theoretical models only, (2) data-driven models only or (3) a combination of theoretical and data-driven models—that is, computational models that rely on specific theoretical assumptions. See Supplementary Appendix 3 for the exact coding instructions and a description of the classification (interrater κ = 0.81 unweighted, κ = 0.90 weighted). We further coded the modelling complexity of approaches, specifically whether they relied on the extrapolation of the time series we provided (for example, ARIMA or moving average with lags; yes/no; see Supplementary Appendix 4 for the exact coding instructions). Disagreements between coders here (interrater κ = 0.80 unweighted, κ = 0.87 weighted) and on each coding task below were resolved through joint discussion with the leading author of the project.
Categorization of additional variables
We tested how the presence and number of additional variables as parameters in the model impacted forecasting accuracy. To this end, we ensured that additional variables were distinct from one another. Two independent coders evaluated the distinctiveness of each reported parameter (interrater κ = 0.56 unweighted, κ = 0.83 weighted).
Categorization of teams
We next categorized the teams on the basis of their composition. First, we counted the number of members per team. We also sorted the teams on the basis of disciplinary orientation, comparing teams of behavioural and social scientists with teams from the computer and data sciences. Finally, we used the information that the teams provided concerning their objective and subjective expertise levels for a given subject domain.
Forecasting update justifications
Given that the participants received both new data and a summary of diverse theoretical positions that they could use as a basis for their updates, two independent research associates scored the participants’ justifications for forecasting updates on three dummy categories: (1) the new six months of data that we provided, (2) new theoretical insights and (3) consideration of other external events (interrater κ = 0.63 unweighted/weighted). See Supplementary Appendix 5 for the exact coding instructions.
Statistical analyses
A priori (https://osf.io/6wgbj/), we specified a linear mixed model as a key analytical procedure, with MASE scores for different domains nested in participating teams as repeated measures. Prior to the analyses, we inspected the MASE scores to determine violations of linearity, which we corrected via log-transformation before performing the analyses. All P values refer to two-sided t-tests. For simple effects by domain, we applied Benjamini–Hochberg false discovery rate corrections. For 95% CIs by domain, we simulated a multivariate t distribution20 to adjust the scores for simultaneous inference of estimates for 12 domains in each tournament.
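A minimal sketch of this analysis, using hypothetical column names and placeholder P values (the project’s actual analysis scripts are available in the GitHub repository listed above), might look as follows:

```python
# Illustrative mixed-model analysis of log-transformed MASE scores
# (hypothetical column names; not the project's actual analysis scripts).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

df = pd.read_csv("forecast_scores.csv")          # one row per team x domain
df["log_mase"] = np.log(df["mase"])              # correct skew via log-transformation

# MASE scores for different domains nested within teams (random intercept per team).
model = smf.mixedlm("log_mase ~ approach + team_size + expertise",
                    df, groups=df["team_id"]).fit()
print(model.summary())

# Benjamini-Hochberg false discovery rate correction for simple effects by domain.
pvals_by_domain = [0.001, 0.03, 0.20, 0.04]      # placeholder P values
reject, p_adjusted, _, _ = multipletests(pvals_by_domain, method="fdr_bh")
```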
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.