Pros and cons of crowdsourcing for systematic review

Jul 1, 2024

Written by Ciara Thomas, Vicky Crowe, & David Pritchett

Introduction

To achieve optimal patient outcomes, systematic literature reviews (SLRs) must synthesise high-quality, contemporary evidence to inform health policy and practice (1, 2). SLRs can be time- and labour-intensive because of the resources required to manage potentially thousands of citations (1, 3, 4). With an exponential increase in research output and with many SLRs requiring regular updates, new and innovative approaches are required to reduce the time and resource burden of performing SLRs without compromising their quality (1, 4-6). In this context, two approaches are gaining increased attention: machine learning and crowdsourcing. This blog focuses on the latter.

Crowdsourcing is defined as “the practice of obtaining participants, services, ideas, or content by soliciting contributions from a large group of people, especially via the internet” (7-9). It can be either paid or unpaid. In theory, anyone can contribute to crowdsourcing, regardless of their educational background, training, or prior experience. Cochrane Crowd is currently the leading crowdsourcing platform in the field of SLRs (11), but crowd contributions have also been coordinated via email, promotional materials, interest groups, scientific institutions, and other online platforms such as MTurk and CrowdScreenSR (1, 8). Thus far, the ‘microtask’ most often delegated to crowd contributors has been citation screening (8).

Advantages and disadvantages of crowdsourcing in SLRs

The main benefit of crowdsourcing is that spreading the workload allows SLRs to be completed more quickly (12), which in turn reduces the chance that an SLR will need to be updated prior to publication (10). Noel-Storr 2021 (10) found that a group of 78 crowd contributors required a 33-hour period to screen 9,546 records at title/abstract stage, while a team of three SLR analysts needed a 410-hour period to screen the same records, despite the crowd contributors taking longer on average to screen a single record than the SLR team (6.06 seconds versus 3.97 seconds). McDonald 2017 (13) concluded that the use of crowdsourcing alongside a randomised controlled trial (RCT) machine classifier could reduce the screening workload of SLR analysts by up to 80%.
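The figures quoted from Noel-Storr 2021 can be put side by side with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the 33-hour and 410-hour values are treated as elapsed periods, as described above, rather than as person-hours.

```python
# Figures as quoted in the text from Noel-Storr 2021 (10).
RECORDS = 9_546
crowd_hours, team_hours = 33, 410                        # elapsed screening periods
crowd_sec_per_record, team_sec_per_record = 6.06, 3.97   # average time per record

# How much sooner the crowd finished, in elapsed time.
speedup = team_hours / crowd_hours

# Records cleared per elapsed hour by each group.
crowd_rate = RECORDS / crowd_hours
team_rate = RECORDS / team_hours

# How much slower an individual crowd contributor was per record.
per_record_ratio = crowd_sec_per_record / team_sec_per_record

print(f"Elapsed-time speed-up: {speedup:.1f}x")                      # 12.4x
print(f"Crowd: {crowd_rate:.0f} records/h, team: {team_rate:.0f}")   # 289 vs 23
print(f"Per-record time ratio: {per_record_ratio:.2f}")              # 1.53
```

In other words, each crowd contributor was roughly 1.5 times slower per record, yet the crowd as a whole finished about 12 times sooner because the work was spread across many more people.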

A frequently voiced concern regarding crowdsourcing is that screening accuracy may be lower amongst a group of amateur crowd contributors than a team of trained SLR professionals. Nonetheless, some impressive results have already been documented; in one pilot validation study, crowd contributors achieved a sensitivity and specificity of over 99% in an SLR update to identify all high dose vitamin D trials in children (14). However, it is important to note that the task of identifying clinical trials is much less complex than applying most SLR screening criteria. Additionally, the pilot study only included four crowd contributors, all of whom had a medical background.

Agreement algorithm and quality control measures

At the beginning of a crowdsourcing project, an ‘agreement algorithm’ must be determined. This is a predefined algorithm specifying the degree of crowd consensus required to include or exclude a publication. The impact of different agreement algorithms can be compared by assessing speed, sensitivity (% of relevant publications included), and specificity (% of irrelevant publications excluded) (8, 15). Nama 2017 (14) found that with an agreement algorithm requiring a unanimous decision among four crowd contributors to exclude a study, sensitivity was 100% and the workload saving for the investigative team was 66% (defined as citations sorted by the crowd without involvement of the principal investigator). With an alternative algorithm that excluded a publication if at least one contributor voted to exclude it, sensitivity decreased to 85%, but the workload saving increased to 92% (14).
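To make the idea of an agreement algorithm concrete, the two rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the cited studies: the function names, the record identifiers, and the votes are all invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Votes = List[bool]  # one entry per contributor; True = "include", False = "exclude"


def exclude_only_if_unanimous(votes: Votes) -> bool:
    """Return True (include) unless every contributor voted to exclude."""
    return any(votes)


def exclude_if_any_vote(votes: Votes) -> bool:
    """Return True (include) only if no contributor voted to exclude."""
    return all(votes)


def evaluate(
    rule: Callable[[Votes], bool],
    crowd_votes: Dict[str, Votes],
    reference: Dict[str, bool],
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a rule against reference decisions.

    Assumes the reference set contains at least one relevant and one
    irrelevant record, so neither denominator is zero.
    """
    tp = fn = tn = fp = 0
    for record_id, votes in crowd_votes.items():
        included = rule(votes)
        if reference[record_id]:      # record is truly relevant
            if included:
                tp += 1
            else:
                fn += 1
        else:                         # record is truly irrelevant
            if included:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical votes from four contributors on four records.
crowd_votes = {
    "rec1": [True, True, True, True],
    "rec2": [True, False, True, True],
    "rec3": [False, False, False, False],
    "rec4": [False, True, False, False],
}
# Hypothetical reference decisions (True = relevant).
reference = {"rec1": True, "rec2": True, "rec3": False, "rec4": False}

print(evaluate(exclude_only_if_unanimous, crowd_votes, reference))  # (1.0, 0.5)
print(evaluate(exclude_if_any_vote, crowd_votes, reference))        # (0.5, 1.0)
```

On this toy data, the unanimous rule keeps every relevant record but passes more irrelevant ones on to the review team, while the single-vote rule does the opposite. The direction of the trade-off matches that reported by Nama 2017 (14), although the numbers here are purely illustrative.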

To eliminate underperforming crowd contributors and promote higher quality screening, Mortensen 2017 (15) used ‘qualification tests’, where participants had to correctly screen four citations in order to proceed with further screening. They also experimented with hidden control tests, or ‘honey pots’, which involved contributors screening a test citation without knowledge that it was a test. If the contributor screened this citation correctly, they were permitted to continue screening, but if not, they received feedback and a warning that failing further tests could result in expulsion (15).
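The logic of these two quality control measures can be sketched as follows. This is a hedged illustration rather than the procedure used by Mortensen 2017: the class and function names are invented, the qualification test is assumed to require every test citation to be screened correctly, and the number of failed hidden checks that triggers expulsion (`MAX_FAILURES`) is an assumption, as the text above does not state one.

```python
from dataclasses import dataclass
from typing import List

MAX_FAILURES = 2  # assumed threshold; not taken from the cited study


@dataclass
class Contributor:
    name: str
    failed_checks: int = 0
    active: bool = True


def passes_qualification(answers: List[bool], answer_key: List[bool]) -> bool:
    """Qualification test: every test citation must be screened correctly."""
    return len(answers) == len(answer_key) and all(
        given == expected for given, expected in zip(answers, answer_key)
    )


def record_hidden_check(contributor: Contributor, answer: bool, expected: bool) -> str:
    """Score one 'honey pot' citation and return the resulting status."""
    if not contributor.active:
        return "expelled"
    if answer == expected:
        return "continue"
    contributor.failed_checks += 1
    if contributor.failed_checks >= MAX_FAILURES:
        contributor.active = False
        return "expelled"
    return "warned"  # feedback plus a warning, as described above


# Hypothetical usage: a contributor qualifies, then fails two hidden checks.
key = [True, False, True, True]
alex = Contributor("alex")
print(passes_qualification([True, False, True, True], key))  # True
print(record_hidden_check(alex, True, True))                 # continue
print(record_hidden_check(alex, False, True))                # warned
print(record_hidden_check(alex, False, True))                # expelled
```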

Improving the effectiveness of crowdsourcing

To ensure accurate citation screening, it is vital that crowd contributors receive comprehensive training, such as the opportunity to practise screening and receive feedback at the beginning of a project (8, 15-17). It is also important to make crowdsourcing tasks as engaging, clear, and simple as possible, with user-friendly interfaces, regular feedback, and encouragement. In addition, clearly defined tasks, achievable goals, and short project timelines have all been demonstrated to prevent crowd contributors from withdrawing from projects (8, 16). Noel-Storr 2021 attributed low accuracy in some previous crowdsourcing studies to task complexity, and suggested that instead of using standard SLR eligibility criteria to guide screening decisions, crowd contributors should be asked simple yes/no questions, such as “is this record an RCT?” (17). The same research group suggested changing the overarching question from “does the record look potentially relevant?” to “is the record obviously not relevant?” during title/abstract screening, encouraging the crowd to focus on identifying citations that are clearly off-topic (10).

Crowdsourcing and machine learning

Machine learning may become an extremely powerful tool for SLR screening, but accurate screening depends on a large ‘training set’, a collection of publications already categorised by humans from which a machine learning algorithm is derived (16). Producing these training sets can be extremely time- and labour-intensive, which is where crowdsourcing may prove useful. One good example is Cochrane’s RCT classifier, which outputs a likelihood score that a novel publication is an RCT. The RCT classifier uses a machine learning model, which was trained using a huge repository of publications categorised as RCTs or non-RCTs by a group of crowd contributors (16).
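As a rough illustration of how crowd decisions might be turned into a training set, the sketch below collapses several contributors' labels for each record into a single label by majority vote. Majority voting is our assumption for the purposes of the example; the text above does not describe how Cochrane aggregates crowd classifications, and the record identifiers and labels are invented.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the most common label, or None if there are no votes or a tie."""
    top_two = Counter(votes).most_common(2)
    if not top_two:
        return None
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return None  # tie: leave the record unlabelled for expert resolution
    return top_two[0][0]


def build_training_set(crowd_labels: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only records on which the crowd reached a majority decision."""
    training_set = {}
    for record_id, votes in crowd_labels.items():
        label = majority_label(votes)
        if label is not None:
            training_set[record_id] = label
    return training_set


# Hypothetical crowd classifications.
crowd_labels = {
    "rec1": ["RCT", "RCT", "non-RCT"],
    "rec2": ["RCT", "non-RCT"],   # tie, so excluded from the training set
    "rec3": ["non-RCT"],
}
print(build_training_set(crowd_labels))  # {'rec1': 'RCT', 'rec3': 'non-RCT'}
```

The resulting labelled records are the kind of human-categorised input from which a classifier could then be trained.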

Conclusions

As crowdsourcing is often voluntary, it is unlikely to be employed in a consultancy environment, with the possible exception of pro bono projects. Instead, it is more suited to academia or not-for-profit organisations such as Evidence Aid, which performs SLRs to inform decision-making during humanitarian crises (18). However, crowdsourcing research does have implications for SLR work in a consultancy setting. The impressive accuracy and efficiency of novice crowd contributors suggest that with appropriate training and close supervision, new SLR analysts should quickly become proficient at citation screening. Moreover, measures that have proved effective in improving accuracy and engagement amongst amateur crowd contributors should also be applicable to new recruits, such as the use of clearly defined tasks, achievable goals, user-friendly interfaces, and regular feedback. Nonetheless, while crowdsourcing has demonstrated its efficacy in aiding simple reviews, its application to more complex SLR questions remains largely unexplored. Similarly, existing pilot studies centre on crowdsourcing for title/abstract screening, but its use at more intricate stages of the systematic review process, such as data extraction, remains an avenue to be investigated.

If you would like to learn more about systematic literature reviews, please contact Source Health Economics, an HEOR consultancy specialising in evidence generation, health economics, and communication.

References

1. Nama N, Sampson M, Barrowman N, Sandarage R, Menon K, Macartney G, et al. Crowdsourcing the Citation Screening Process for Systematic Reviews: Validation Study. J Med Internet Res. 2019;21(4):e12953.
2. Shojania KG, Sampson M, Ansari MT, Ji J, Doucette S, Moher D. How quickly do systematic reviews go out of date? A survival analysis. Ann Intern Med. 2007;147(4):224-33.
3. Allen IE, Olkin I. Estimating time to conduct a meta-analysis from number of citations retrieved. JAMA. 1999;282(7):634-5.
4. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545.
5. Créquit P, Trinquart L, Yavchitz A, Ravaud P. Wasted research when systematic reviews fail to provide a complete and up-to-date evidence synthesis: the example of lung cancer. BMC Med. 2016;14:8.
6. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9):e1000326.
7. Ranard BL, Ha YP, Meisel ZF, Asch DA, Hill SS, Becker LB, et al. Crowdsourcing–harnessing the masses to advance health and medicine, a systematic review. J Gen Intern Med. 2014;29(1):187-203.
8. String L, Simmons RK. Citizen science: crowdsourcing for systematic reviews. The Healthcare Improvement Studies Institute, RAND Europe; 2018.
9. Thomas J, Noel-Storr A, Marshall I, Wallace B, McDonald S, Mavergames C, et al. Living systematic reviews: 2. Combining human and machine effort. J Clin Epidemiol. 2017;91:31-7.
10. Noel-Storr AH, Redmond P, Lamé G, Liberati E, Kelly S, Miller L, et al. Crowdsourcing citation-screening in a mixed-studies systematic review: a feasibility study. BMC Med Res Methodol. 2021;21(1):88.
11. Cochrane Crowd. Available from: https://crowd.cochrane.org.
12. Lee YJ, Arida JA, Donovan HS. The application of crowdsourcing approaches to cancer research: a systematic review. Cancer Med. 2017;6(11):2595-605.
13. McDonald S, Noel-Storr A, Thomas J, editors. Harnessing the efficiencies of machine learning and Cochrane Crowd to identify randomised trials for individual Cochrane Reviews. Abstracts of the Global Evidence Summit; 2017; Cape Town, South Africa: Cochrane Database of Systematic Reviews.
14. Nama N, Iliriani K, Xia MY, Chen BP, Zhou LL, Pojsupap S, et al. A pilot validation study of crowdsourcing systematic reviews: update of a searchable database of pediatric clinical trials of high-dose vitamin D. Transl Pediatr. 2017;6(1):18-26.
15. Mortensen ML, Adam GP, Trikalinos TA, Kraska T, Wallace BC. An exploration of crowdsourcing citation screening for systematic reviews. Res Synth Methods. 2017;8(3):366-86.
16. Noel-Storr A, Dooley G, Affengruber L, Gartlehner G. Citation screening using crowdsourcing and machine learning produced accurate results: Evaluation of Cochrane’s modified Screen4Me service. J Clin Epidemiol. 2021;130:23-31.
17. Noel-Storr A, Dooley G, Elliott J, Steele E, Shemilt I, Mavergames C, et al. An evaluation of Cochrane Crowd found that crowdsourcing produced accurate results in identifying randomized trials. J Clin Epidemiol. 2021;133:130-9.
18. Evidence Aid. About us. Available from: https://evidenceaid.org/who-we-are/.