Read-Agree-Predict: A Crowdsourced Approach to Discovering Relevant Primary Sources for Historians

Historians spend significant time looking for relevant, high-quality primary sources in digitized archives and through web searches. One reason this task is time-consuming is that historians’ research interests are often highly abstract and specialized. These topics are unlikely to be manually indexed and are difficult to identify with automated text analysis techniques. In this article, we investigate the potential of a new crowdsourcing model in which the historian delegates to a novice crowd the task of labeling the relevance of primary sources with respect to her unique research interests. The model employs a novel crowd workflow, Read-Agree-Predict (RAP), that allows novice crowd workers to label relevance as well as expert historians. As a useful byproduct, RAP also reveals and prioritizes crowd confusions as targeted learning opportunities. We demonstrate the value of our model through two experiments with paid crowd workers (n=170), with the future goal of extending our work to classroom students and public history interventions. We also discuss broader implications for historical research and education.


INTRODUCTION
Historians are often researchers as well as educators, and both roles involve significant interaction with primary sources. Primary sources are artifacts such as documents, manuscripts, diary entries, and newspaper articles created at the time under study. These sources are not only direct evidence for historical arguments (Rutner & Schonfeld, 2012) but also important materials for teaching. Yet discovering sources relevant to a historian's specialized, abstract research topics remains time-consuming: such topics are rarely indexed and are difficult to identify with automated text analysis. Crowdsourcing could provide an alternative approach to overcoming these challenges. Crowdsourcing has been shown to be effective for many types of text analysis, from transcription (Little, Chilton, Goldman, & Miller, 2010) to word processing (Bernstein et al., 2010) to qualitative analysis and clustering (André, Kittur, & Dow, 2014). However, little research has sought to use crowdsourcing to perform in-depth analysis of historical documents for the purposes of labeling relevance. We suggest one key problem is that novice workers employed on popular crowdsourcing platforms like Amazon Mechanical Turk (MTurk) typically lack the expertise in history that is presumably necessary for such judgements.
In this article, we present a crowdsourcing approach that enables novice crowds to label the relevance of digitized primary sources as accurately as expert historians. To develop this approach, we first conducted a preliminary experiment with 120 MTurk crowd workers and a real-world online archive of digitized American Civil War-era documents, in collaboration with professional historians. This study investigated which of three interface designs, based on theories from educational psychology, would best support crowdsourced relevance labels. Informed by these results, we developed our crowdsourcing approach, which we call Read-Agree-Predict (RAP).
With RAP, a historian first provides her specialized topic of interest and two example documents. Then, crowd workers use a novel interface to read historical documents and label their relevance to the historian's topic. Finally, the results are aggregated based on worker agreement to produce a final relevance label to the historian for each document. Our analysis of the preliminary study data found that RAP enabled perfect precision and recall for labeling relevance, outperforming both a majority vote aggregation and individual worker performance. RAP is simple enough to be easily adapted for most online archives. As a useful byproduct, RAP identifies areas of crowd confusion that could help historians prioritize teaching opportunities in classrooms or public history projects.
To validate RAP, we conducted a second experiment with 50 additional MTurk workers, new historical documents, a new topic, and a new expert historian. We again found that RAP enabled the novice crowds to label relevance as well as an expert. Additional validations include a simulation study of RAP with different crowd sizes, a comparison of relevance agreement between two historians, and a comparison to automated text analysis approaches for labeling relevance.
Our contributions in this article include the technical contribution of the RAP crowdsourcing approach and the empirical contributions of preliminary and validation studies demonstrating the effectiveness of RAP on real-world digitized historical documents. As a result, historians can spend more time analyzing and interpreting primary sources, rather than searching for them, and consider a larger set of relevant documents than they would have been able to locate on their own. The results may have implications for historical scholarship and history education.

Challenges in Historical Research
Studies of the practices of historians and research support professionals show that interacting with primary sources, including gathering, discovering, and organizing historical documents, remains central to historical research (Dalton & Charnigo, 2004;Nawrotzki, 2013;Wineburg, 2010). When gathering sources, historians identify and assess relevant sources to address their research questions and support arguments, and organize the sources primarily based on their own specialized topics of interest. Later, they can use these topics to find the sources again (Rutner & Schonfeld, 2012). Historians spend a large part of their daily work gathering and discovering sources relevant to their specialized research topics (Dalton & Charnigo, 2004;Rutner & Schonfeld, 2012).
Even with modern search engines, historians often cannot directly search by topics of interest (Rutner & Schonfeld, 2012). This is because these topics are often not keywords found verbatim in the raw texts, and a set of keywords might have different relevance to topics depending on the context. Existing archives may have rich annotations created by metadata librarians or other professionals, but because these require time and expertise, they tend to focus on topics of broader interest to maximize their utility.
Historians may try to overcome these limitations by using many different search terms to find resources relevant to their unique topics of interest, then filtering out the many irrelevant results that imprecise terms return. This foraging is highly individualized, because the topics historians pursue and the approaches they take may differ greatly from one another. The individualized foraging and curation process is time-consuming and sometimes tedious, but it is seen as an inescapable part of the historical research process. Some historians even regret the time spent organizing instead of doing other research activities. As one historian interviewee said, "Once it's organized, it's up to me to think about it and write. But I do resent the time that's spent organizing and managing everything" (Rutner & Schonfeld, 2012). In this article, we consider how novice crowds could support individual historians by improving the breadth and efficiency of their organizing efforts.

Crowdsourcing in the Classroom
Crowdsourcing research has started to explore the use of crowdsourcing in classroom-related settings. These studies often aim to address the issue of low ratios of instructors to students, especially in massive open online courses (MOOCs), by leveraging peer learners or other (sometimes paid) crowds to provide feedback and improve learning through collaboration. Some of this work seeks to enhance learning with collective learner activity. For example, while watching a teaching video, students may pause at different places to digest the content. The aggregated pause positions may hint at important or confusing parts of the video. Other work seeks to structure the learning process or provide learners with useful feedback from other (e.g., paid) crowds. These efforts include creating crowdsourced sub-goals in how-to videos (Kim, Miller, & Gajos, 2013;Kim & others, 2015;Weir, Kim, Gajos, & Miller, 2015), crowdsourced assessments or exercises (Mitros, 2015;Šimko, Šimko, Bieliková, Ševcech, & Burger, 2013), personalized hints for problem-solving (Glassman, Lin, Cai, & Miller, 2016), receiving design critiques (Xu, Rao, Dow, & Bailey, 2015), and identifying students' confusions (Glassman, Kim, Monroy-Hernández, & Morris, 2015).
Studies of historians' current practices in classrooms show that while historians may have very different ways of teaching history, such as chronological narration or topic-based class activities, lecturing continues to be the established practice (Grant, 2001, 2018; Grant & Gradwell, 2010; McDaniel, 2010). Historians rarely use crowdsourcing to integrate classroom teaching with their research. Other research has found that scholars may resist crowdsourcing for research due to knowledge and role uncertainties (Law, Gajos, Wiggins, Gray, & Williams, 2017). For example, one historian expressed a need to know the person and have some human link in order to trust the quality of the data they produce. In addition, there may be a moral dilemma in asking students to do unattractive tasks, which could be perceived as exploitative without promoting learning (Law, Gajos, Wiggins, Gray, & Williams, 2017). These concerns are consistent with issues raised in discussions with our historian collaborators.
Like some of this prior work, we seek to leverage crowdsourcing to support classroom learning, but we extend this thread of research in several ways. First, we explore this goal in the domain of history, which has seen little attention among crowdsourcing researchers to date. Second, our approach is designed to work in a research context, where the answers are not known a priori and the crowd contributes authentically to the historian's scholarship. While our long-term vision is that historians could deploy our approach with students in their classrooms, this article focuses on a proof-of-concept with paid crowd workers. The results of our study may help mitigate the uncertainties some historians have about adopting this technique in the classroom.

Crowdsourced Text Classification and Labeling
Many automated techniques have been developed to handle the task of both single-label and multi-label text classification (Aggarwal & Zhai, 2012; Sebastiani, 2002; M.-L. Zhang & Zhou, 2014). However, recent studies show that state-of-the-art automated techniques may still be far from perfect (Venkatesan, Er, Dave, Pratama, & Wu, 2016; X. Zhang, Zhao, & LeCun, 2015). In addition, in order to perform well, these techniques often require many examples with high-quality labels as training data (e.g., Banko & Brill, 2001; Kavzoglu & Colkesen, 2012). Therefore, a large body of research focuses on how to generate high-quality labels, e.g., using experts or crowdsourcing.
While experts can produce high quality labels, experts are often rare and expensive, as in the case of historical research. On the other hand, while crowds can produce a much larger number of labels, they often lack required domain expertise. Therefore, much crowdsourcing research explores and develops aggregation techniques to increase the quality of crowdsourced labels from non-expert crowds in various tasks, such as affective text analysis, word similarity, and word sense disambiguation (Snow, O'Connor, Jurafsky, & Ng, 2008). Majority vote aggregation is shown to be effective when the crowd's responses are imperfect but better than chance (Sheng, Provost, & Ipeirotis, 2008;Snow et al., 2008). With work history, expectation maximization (EM) (Dawid & Skene, 1979) can generally be used to further improve quality (Hosseini, Cox, Milić-Frayling, Kazai, & Vinay, 2012;Ipeirotis, Provost, & Wang, 2010;McDonnell, Lease, Elsayad, & Kutlu, 2016;Snow et al., 2008), although it may only converge on local maxima instead of achieving global optima (Drapeau, Chilton, Bragg, & Weld, 2016). With work history, it is also possible to improve the quality of results by considering individual workers' systematic biases (Giancola, Paffenroth, & Whitehill, 2018). For example, using only majority vote, accuracy ranges from 0.58 to 0.80, whereas using EM, the accuracy increases to a range of 0.64 to 0.82 for the INEX dataset (Hosseini et al., 2012). Identifying workers with domain expertise can also provide better quality results (Drapeau et al., 2016;Prelec, Seung, & McCoy, 2017). Studies also show that appropriate hierarchical schemes and task assignments can also improve quality of the results for multiclass classification (Duan & Tajima, 2019).
We contribute to this literature by exploring how crowdsourced labeling and classification can be used to contribute directly to historical research. Unlike many automated approaches, our model requires minimal training data (two example documents) and works as well as experts for abstract, long-tail search topics. Different from the aforementioned crowdsourcing research, which requires work history or special ways of identifying expertise, our model works well with typical paid crowds that lack specialized knowledge and have short time commitments. However, crowds with more training and motivation (e.g., students or enthusiasts) might perform more efficiently.

Educational Psychology and Semantic Tasks
Making connections between topics and relevant documents in order to label relevance requires readers to have a good understanding of both. Reading comprehension has been modeled as a complex cognitive process involving different levels of lexical and semantic processing (Kintsch & van Dijk, 1978). Research on levels of processing suggests that deeper elaboration leads to better recall and understanding (e.g. (Craik & Lockhart, 1972;Craik & Tulving, 1975)). Underlining and summarizing are semantic tasks that may trigger deeper levels of processing with more elaboration and thus increase reading comprehension (Bobrow & Bower, 1969;Doctorow, C, & Marks, 1978;Linden & Wittrock, 1981;Schnell & Rocchio, 1978;Smart & Bruning, 1973; M. C. Wittrock & Alesandrini, 1990). Improved reading comprehension may help novices apply better relevance labels. In our preliminary study, we explore how different types of semantic tasks can affect task performance in a crowdsourced history context.

Research Questions and Hypotheses
Drawing on the above literature review, we sought to understand how to design an interface for crowds to effectively label the relevance of historical primary sources to topics of interest. Therefore, we conducted a preliminary experiment to establish a quality baseline for crowdsourced contributions in the domain of history. This study compared two popular reading comprehension techniques, underlining and summarizing, with a reading-only (baseline) interface. We refer to these three techniques as semantic tasks. We hypothesize that summarizing will have a stronger effect on performance than underlining or reading because summarizing has been shown to require the deepest level of processing in writing tasks (Cai, Iqbal, & Teevan, 2016). Specifically, our preliminary study explored the following research questions and hypotheses:
RQ1: How does the semantic task (reading, underlining, or summarizing) affect the quality of crowd-generated relevance labels?
H1: The quality of crowd labels will be highest in the summary task and lowest in the reading task.
RQ2: How does the semantic task affect the agreement of crowd-generated relevance labels?
H2: The agreement of crowd labels will be highest in the summary task and lowest in the reading task.
RQ3: How does the semantic task affect the efficiency of applying crowd-generated relevance labels?
H3: The efficiency of applying crowd labels will be lowest in the summary task and highest in the reading task.

Dataset and Historian
The documents used in this study come from a digital archive of 189 digitized historical primary sources (personal diaries and letters, newspaper articles, and public speeches) from the American Civil War era (ca. 1840-1870). This archive was assembled by a tenured professor of Civil War history at our institution (Historian A, also the third co-author of this article) for a prior research project. Historian A generated a list of six topics of interest, related to Independence Day celebrations, that he used to build the archive. We used a subset of these documents and topics for this study, as detailed in the Experimental Design (Section 3.5).

Apparatus and Procedure
The experiment was conducted entirely online. After completing an online IRB-approved consent form, each participant (an MTurk worker) was randomly assigned to one of three conditions corresponding to one of the three semantic tasks: reading, keyword (underlining), or summary (summarizing). While prior studies on underlining were often conducted in pen-and-paper settings, our study takes place online, so we instead asked participants in the keyword (underlining) condition to type in a set of keywords. While both activities similarly ask participants to identify and highlight important words and phrases, selecting keywords is a more common (and arguably, more natural and familiar) task on MTurk than underlining. Each participant was also assigned a topic and a document. The participant then used the web interface we developed, based on a few alternative designs in pilots, to complete a three-step process as shown in Figure 1.
First, the participant filled out a short quiz in which they matched their assigned topic to its correct definition. If participants did not get the answer right, they could not proceed to the next step, and had to end the task themselves. This step ensured all participants in the study understood the topic's meaning. We did not observe any cases where the participant proceeded without providing the correct answer.
Next, the participant viewed two example documents for their topic with relevance labels provided by Historian A. Our pilots and recent work on crowd innovation (Yu, Kittur, & Kraut, 2014) both suggest that by viewing good examples, people can better understand abstract concepts and analogies. The participant also practiced their assigned semantic task on these examples. The reading task involved simply reading the example documents. The keyword task involved reading the documents and selecting 4-8 important keywords or phrases for both. The summary task involved reading the documents and writing a 1-2 sentence summary for both.
Third, the participant completed the semantic task on a new document. After completing the task, the participant decided whether it was relevant to the assigned topic by clicking "Yes" or "No" and typing in a brief justification of their decision.

Participants
We used Amazon Mechanical Turk to recruit novice crowd workers. We restricted participation to US-based workers to increase the likelihood of English language fluency, and required a minimum HIT (human intelligence task) acceptance rate of 95% and at least 50 completed HITs. We recruited 120 workers and randomly assigned 40 to each of the three conditions. Each worker was unique and assigned to only one HIT to ensure that the required expertise was learned within that HIT. Thus, there were five unique workers for each combination of condition (semantic task), document, and topic. We paid participants $7.25/hour based on average task times in pilots. We also paid a 20% bonus if the participant provided a reasonable justification for their decision, even if the decision was wrong. The total amount paid to workers for the preliminary study was about $192, including $32 in worker bonuses but excluding MTurk platform fees.
Although learning is often assessed with students in classrooms, a risk is that new teaching methods may hinder students' learning if the methods are not effective (Brown, 1992). Our studies use paid crowds on MTurk in order to validate our approach in a controlled lab setting where participants are compensated regardless of how much they learned. In Section 6.2.1, we discuss how our findings from MTurk studies could be adapted for students in the classroom.

Experimental Design
This was a between-subjects design with one independent variable (semantic task), two covariates (topic and document), and three dependent variables (quality, agreement, and efficiency).

Independent Variable
The independent variable, semantic task type, had three levels: reading, keyword, or summary. Therefore, the experiment had three conditions.

Covariates
We controlled for two covariates: topic and document. The complexity of the topic is likely to affect crowd performance, so we selected four diverse topics (Revolutionary History and Ideals, American Nationalism, American Hypocrisy, and Anxiety) from the list generated by Historian A. Table 1 shows the number of relevant documents for each of the four topics across the entire dataset of 189 documents. Each document was labeled for all four topics; a document may be relevant to multiple topics. Document complexity could also affect crowd performance. Therefore, we randomly selected documents that were similar in terms of length (mean=265 words) and readability (college-level, according to Flesch-Kincaid readability tests). We randomly selected two documents for each topic, one highly relevant and one irrelevant (as judged by Historian A), for a total of eight documents. None of the documents contained the topic name verbatim.

Table 1. Number of relevant documents per topic across the full dataset of 189 documents.

Historical Topic                    Relevant Documents
Revolutionary History and Ideals    21
American Nationalism                38
American Hypocrisy                  10
Anxiety                             28

Dependent Variables
There were three dependent variables: quality, agreement, and efficiency. To measure quality, we compared each worker's responses with gold standard responses provided prior to the study by Historian A. Specifically, we measured the accuracy, precision, and recall of the labels applied by the crowd, i.e., whether they indicated a document was relevant or irrelevant to their assigned topic. These metrics are widely used in the field of Information Retrieval to measure the performance of an information retrieval system. We measured accuracy as the ratio of correct labels (both relevant and irrelevant) to the total number of labels applied by the crowd. This gave us an overall idea of how close the crowd's labels were to the expert's, considering both relevant and irrelevant labels. We measured precision as the ratio of the number of correct relevant labels to the total number of relevant labels applied by the crowd. Precision can be seen as an indicator of how credible the crowd's relevant labels were. We measured recall as the ratio of the number of correct relevant labels to the number of Historian A's relevant labels. This measure told us whether the crowd missed any relevant documents.
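Equivalently, counting the crowd's labels against Historian A's gold standard as true/false positives and negatives on the "relevant" class, these measures reduce to the standard formulas, sketched here in Python with purely hypothetical counts:

    # Hypothetical counts for illustration only (not study data):
    tp, fp, tn, fn = 3, 1, 4, 2

    accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct labels over all labels
    precision = tp / (tp + fp)                   # how credible the "relevant" labels are
    recall = tp / (tp + fn)                      # how many relevant documents were found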
We also measured agreement among the five workers assigned to each combination of condition, document, and topic. This metric provides an indicator of reliability for crowd workers and identifies areas of confusion as potential teaching opportunities. We used two measures of agreement, Fleiss' κ (Fleiss, Levin, & Paik, 2013) and Raw Agreement Indices (RAI) (Fleiss et al., 2013; John Uebersax, 2009). Fleiss' κ provides an overall agreement value with established interpretive benchmarks. In addition to overall agreement, RAI also allows finer-grained calculations, such as an agreement value for a particular document in a condition. Both measures are bounded above by 1 (perfect agreement); RAI ranges from 0 (no agreeing pairs) to 1, while κ values at or below 0 indicate agreement no better than chance.
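As an illustrative sketch (not necessarily the exact implementation used in the study), the raw agreement for a single document can be computed as the proportion of worker pairs that assigned the same label:

    from itertools import combinations

    def raw_agreement(labels):
        """Proportion of worker pairs giving the same label for one document.

        labels: one binary relevance label per worker, e.g. [1, 1, 1, 1, 0].
        """
        pairs = list(combinations(labels, 2))
        return sum(1 for a, b in pairs if a == b) / len(pairs)

    # With five workers: a 4-1 split gives 0.6, a 3-2 split gives 0.4,
    # and a unanimous crowd gives 1.0.
    print(raw_agreement([1, 1, 1, 1, 0]))  # 0.6
    print(raw_agreement([1, 1, 1, 0, 0]))  # 0.4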
We also measured the crowd's efficiency in analyzing documents in terms of both time and attempts. Time describes how long it takes for a task to be completed and is a measure of how much effort the task requires. Attempts (attrition) describes how many workers accept and return a HIT before it is completed and is a measure of the perceived difficulty of the task.

Individual Quality Similar Across Conditions
There was no significant difference in individual quality across the three conditions in terms of accuracy, precision, or recall.

Majority Vote Improves Quality
Since we had five unique worker results for each combination of condition, document, and topic, we also considered how an aggregated (majority vote) decision affected quality. When we used a majority vote strategy, there was only one false positive for the keyword condition and two false positives for each of the other two conditions, giving overall accuracy values of 0.88 and 0.75, respectively. The precision values are 0.80 for keyword and 0.67 for both reading and summary. The recall value is 1.0 for all three conditions.
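For reference, a minimal sketch of this aggregation step (binary labels from five workers per document-topic pair; ties cannot occur with an odd number of workers):

    from collections import Counter

    def majority_vote(labels):
        """Return the relevance label chosen by most workers for one document."""
        return Counter(labels).most_common(1)[0][0]

    print(majority_vote([1, 1, 0, 1, 1]))  # four of five workers said relevant -> 1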

Summarizing Leads to Higher Agreement
We found that the summary condition led to higher average agreement. For Fleiss' κ, average agreement in the summary condition was 0.80, interpreted as between "substantial agreement" and "almost perfect agreement." The κ values for the reading and keyword conditions were similar, 0.56 and 0.54 respectively, indicating "moderate agreement."


Reading Is Fastest
Overall, the average time to complete a task was about 11 minutes (SD = 5.3, min=1.5, max=29). Broken down by condition, the averages were reading: 7.8 min (SD = 3.7), keyword: 11 min (SD = 4.5), and summary: 13 min (SD = 6.0). A one-way ANOVA showed condition had a significant effect on time (F(2, 117)=12.66, p<0.01). Post-hoc Tukey tests showed that the reading condition was significantly faster than both the keyword and summary conditions. There was no difference between the keyword and summary conditions.

Keywords Require Most Attempts
Overall, on average, it required about 2.8 attempts (SD = 2.2, min=1, max=11) to complete a task. Average attempts per condition were reading: 2.15 (SD = 1.7), keywords: 3.80 (SD = 2.7), and summary: 2.55 (SD = 1.9). A one-way ANOVA showed that condition had a significant effect on attempts (F(2, 117)=6.65, p<0.01). Post-hoc Tukey tests showed that it took significantly more attempts to complete the keyword condition than the reading and summary conditions. There was no difference between the reading and summary conditions.

Quality
The results of the quality analysis showed that semantic task did not affect quality for individual workers, so H1 was not supported. Across all three conditions, individuals in the crowd did better than chance but might not perform well enough for scholarly work. This result supports the general assumption that novices may not be able to produce high-quality results due to a lack of expertise. The quality scores indicate how often crowd confusions occurred, while the reasoning provided by the crowd elaborates on what those confusions were.
By investigating the keywords in the keyword condition, we found that some participants applied wrong labels based on just a few keywords. For example, when some participants saw the keyword "Fourth of July", they directly connected the document to the topic Revolutionary History and Ideals regardless of the context for how "Fourth of July" was used. This outcome may be evidence of the von Restorff effect, i.e., when multiple similar stimuli are present, the one that differs from the rest is most likely to be remembered. Previous studies have identified this phenomenon in underlining or highlighting because participants tend to remember what has been highlighted (E. H. Chi, Hong, Heiser, & Card, 2006;Ed H. Chi, Hong, Gumbrecht, & Card, 2005;Nist & Hogrebe, 1987;Peterson, 1991).
However, measuring aggregated crowd results using a majority vote technique showed better results, in line with prior work (e.g., (Sheng et al., 2008)). The higher recall and precision values suggest real-world potential for crowds supporting historians because workers were able to find all relevant sources while filtering out some irrelevant ones. For example, in the preliminary study dataset, crowds could have reduced the size of search pool for the historian by 37.5% in the keyword condition (correctly eliminating 3/4 irrelevant sources, leaving 1 false positive) or 25% in the reading or summary conditions (each correctly eliminating 2/4 irrelevant sources, leaving 2 false positives). Thus, 75% of time the historian spent reviewing irrelevant documents could have been saved in the keyword condition, and 50% of time the historian spent reviewing irrelevant documents could have been saved in the reading and summary conditions.

Agreement
The results for intra-crowd agreement shown in Table partially support H2. While there was no significant difference among the conditions based on the fine-grained RAI agreement value for each document, the overall agreement was higher in the summary condition (0.83) than in the reading (0.60) and keyword (0.58) conditions. This result is in line with previous studies (Bretzing & Kulhavy, 1979; Cai et al., 2016; Doctorow et al., 1978; M. C. Wittrock & Alesandrini, 1990; Merlin C. Wittrock, 1989) showing that summarizing demands a deep level of semantic processing.
When multiple crowd workers make the same incorrect labels, it often means there is some shared confusion or common misunderstanding. This situation suggests an opportunity for historians to help the crowd better understand the material. Like most experts, historians' time is limited, so it is important to prioritize these misconceptions to help as many workers as possible.
Our measure of intra-crowd agreement can be a good indicator for this.
The results from Table showed that there were two high-impact confusions in the summary condition, one for Topic 1 and one for Topic 2. In both situations, the crowd majority thought an irrelevant document was relevant. For example, for one document under Topic 1 (Revolutionary History and Ideals), Historian A did not consider the document relevant, but crowd workers in the summary condition all thought it was. This example is shown in Table along with two of the crowd workers' responses and Historian A's feedback.

Document Excerpt
"There seems to be preparations going on in all the principal cities of the Union to celebrate the Fourth of July in the old-fashioned style of military, oratorical and patriotic jubilation. There is a good deal of American feeling still left in the country, and it makes itself manifest on all suitable occasions. It is pleasing to observe that all the political parties emphatically announce their loyalty to the Union, which is a strong proof that sectionalism is not popular. Far distant be the day when the Fourth of July shall awaken no patriotic associations, sentiments and hopes in the breasts of American citizens!"

Participants' Reasons (Summary Condition)
"It is related to the topic of Revolutionary History and Ideals through the language used within the document (words, such as Union, sectionalism and political parties), the mention of old fashioned military, and the general timeless sense of patriotism and national pride." -P21
Historian A's feedback: Sectionalism does not fit the topic as well as some of the other words.

"It speaks to the unity that Americans feel. It's a matter of pride. And it's always going to be. July 4th is always going to be central in the hearts of all Americans. It's a day to celebrate because this country has done so much good for so many people" -P24
Historian A's feedback: This knee-jerk reaction to the July 4 reference introduces emotions not present in the document.

Efficiency
The results partly supported H3 in that the reading condition was significantly faster than the other two. However, with respect to number of attempts, there was no difference between reading and summary, while keywords required significantly more attempts than the others. This latter result surprised us, as summarizing has previously been shown to be more cognitively demanding. One possible explanation is that our instructions were phrased in a way that made the keyword condition seem more laborious than it actually was. In the keyword condition, participants were asked to provide "4-8 keywords/keyphrases," while in the summary condition, participants were asked to provide "1-2 sentences." Glancing at the numbers in the task instructions may have made the keyword condition appear to involve more work than the summary condition.

READ-AGREE-PREDICT (RAP)
The preliminary study showed mixed results for the three semantic tasks: reading, underlining, and summarizing. Going beyond the original research questions, we made several observations in our follow-up data analysis that suggested an approach that could yield better results than any one task, and better than other common aggregation techniques like majority vote. We call this combined approach Read-Agree-Predict (RAP).

Observations from Preliminary Study
In the preliminary study, there were three possible levels of intra-crowd agreement: zero workers vs. five, one vs. four, and two vs. three, corresponding to RAI scores of 1.0, 0.6, and 0.4, respectively (see Table ). While the first two levels were considered high agreement because the crowd had a clear majority choice, the third was considered low agreement because workers were nearly equally split. We could therefore choose 0.6 as a threshold to distinguish high (≥0.6) and low (<0.6) agreement.
We made two observations with respect to this agreement threshold that held for only the reading condition. First, we observed that if crowd agreement was low (RAI=0.4), the document was always irrelevant. In other words, confusion or disagreement among workers suggests the document is not relevant to the topic. These situations may reflect ambiguity or a lack of information in the source material.
Second, we observed that if crowd agreement was high (RAI≥0.6), the crowd's majority-vote decision was highly accurate. In other words, when crowds converge on a single decision (relevant or irrelevant), that decision could usually be trusted. These situations may occur when there is sufficient evidence for the crowd to make a clear yes-or-no decision.
Taken together, these observations about the reading condition suggest a robust pattern that can be used to predict highly accurate relevance labels for document-topic pairs. If a crowd reading a document reaches low agreement about its relevance to a given topic, i.e., a nearly split vote over whether the document is or is not relevant, then we can predict the document is irrelevant. However, if a crowd has high agreement about a document's relevance, its majority-vote decision (relevant or irrelevant) can be trusted. We call this pattern Read-Agree-Predict (RAP).
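A minimal sketch of the RAP rule, reusing the raw_agreement and majority_vote helpers sketched earlier and assuming five binary labels per document-topic pair and the 0.6 agreement threshold described above:

    def rap_label(labels, threshold=0.6):
        """Read-Agree-Predict: predict a document's relevance from one crowd's labels.

        A nearly split vote (low agreement) predicts "irrelevant" (0);
        a high-agreement crowd is trusted to its majority vote.
        """
        if raw_agreement(labels) < threshold:
            return 0  # confused crowd: predict irrelevant
        return majority_vote(labels)  # confident crowd: trust the majority

    print(rap_label([1, 1, 1, 0, 0]))  # 3-2 split (agreement 0.4) -> 0 (irrelevant)
    print(rap_label([1, 1, 1, 1, 0]))  # 4-1 split (agreement 0.6) -> 1 (relevant)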

RAP vs. Majority Vote
RAP can be viewed as an improvement upon majority vote for crowdsourced adjudication. This improvement is two-fold. One, it tells when majority vote can be used reliably: only when crowd agreement is high (RAI ≥ 0.6). Two, it offers a judgement when majority vote is not reliable: the document is irrelevant to the topic. While much crowdsourcing research uses simple majority vote for adjudication or relevance assessment, RAP pushes the concept a step further by 1) demonstrating how a threshold value of majority agreement can have a strong impact on output, and 2) providing a clear binary relevance judgement in all possible situations.
Overall, in the preliminary study, majority vote allowed the crowd to achieve quality scores up to 0.8 (precision) and 1.0 (recall) for certain topics and documents. These results could have helped reduce the size of a historian's search pool by up to 37% and saved up to 75% of time spent on irrelevant documents in the archive.
For comparison, we applied RAP post-hoc to the preliminary study's dataset. The results in Table  show that RAP is a substantial improvement over majority vote, yielding perfect accuracy relative to Historian A's gold standard judgements. RAP achieved scores of 1.0 (precision) and 1.0 (recall) across all documents and topics. These results suggest a historian would not even have to search the digital archive herself, because crowd workers using RAP would have correctly labeled all relevant documents.

Crowd Confusion as Teaching Opportunities
Beyond producing high quality labels from noisy ones, RAP not only detects where students' confusions may occur but also prioritizes these confusions as a useful byproduct. If agreement in the reading condition is low and the majority thinks the unknown document is relevant, then RAP predicts this source-topic pair will be a high-impact confusion for teaching. We further discuss educational opportunities in Section 6.2, with a usage scenario in 6.2.1.
In the next section, we simulate this usage scenario with a new study to validate RAP.

VALIDATION STUDY
We conducted this study to validate RAP, so the experimental design was almost identical to the preliminary study. We summarize the differences below.

Dataset and Historian
We again used the same online archive of American Civil War primary sources as in the preliminary study, but it had since been expanded to about 1200 primary sources. For the validation study, we randomly sampled 10 new documents of similar length and readability level that were not among the set of eight used in the preliminary study. Next, we recruited a new expert historian from our institution, Historian B. We asked Historian B, without seeing the 10 documents, to provide a topic of interest, a definition, and two historical documents that were good examples of that topic. Drawing on his research interests, he chose the topic "Racial Equality." Finally, we asked Historian B to generate gold standard answers by reading each of the 10 documents and deciding whether it was relevant or irrelevant to his topic.
We used this selection mechanism because 1) it avoided biasing our expert, and 2) it reflected how RAP would be used in a real-world situation. That is, a historian locates an unfamiliar digital archive, provides a topic, definition, and two example documents, and the crowd analyzes each document from the archive to decide if it is relevant to that topic. After that, the historian comes back to check the sources labeled by the crowd.

Apparatus and Procedure
We used a very similar web-based interface and procedure as the preliminary study. We kept the reading condition the same as before. We removed the keyword and summary conditions, which were significantly slower than reading yet comparable in terms of quality.

Participants
We recruited 50 participants on Amazon Mechanical Turk using the same criteria and pay rate as the preliminary study. The total amount paid to workers for the validation study was about $56, including $19 in worker bonuses but excluding MTurk platform fees.

Experimental Design
The experimental design mirrors that of the preliminary study, with the exception of document selection procedure described in Section 5.1.1.

Results and Discussion
After collecting the crowd data from 50 workers, we ran the data through the RAP crowd algorithm as well as a standard majority vote aggregation to generate predictions of relevance for each of the 10 documents. Table shows that majority vote yielded perfect recall but 2 false positives, similar to the majority vote performance for the reading condition in the preliminary study. In contrast, the RAP predictions exactly matched the gold standard answers provided by Historian B. Thus, in this validation study, RAP again achieved perfect accuracy for a new historian, new topic of interest, and new random sample of documents within the same digital archive as in the preliminary study. RAP also automatically prioritized documents with crowd confusions based on the number of wrong votes for the historian's reference.
We also noticed that the ratio of relevant documents in the validation study was much higher than that of the preliminary study. We speculate this is partly due to the topic chosen by Historian B. Since Historian B's topic was "Racial Equality," and African-American slavery was a main cause of the American Civil War, this topic may have a higher relevance ratio than more specialized topics, such as those used in the preliminary study.

Simulating Different Crowd Sizes
To further investigate the effectiveness of RAP, we ran a simulation to understand how RAP would compare to majority vote with different hypothetical crowd sizes. For each crowd size n, we resampled (with replacement) the existing crowd data to create the desired crowd size. We then calculated the average F-1 score for 1,000 resampled data points. We used the F-1 score (harmonic mean of precision and recall) because it is a widely-used measure of search performance in Information Retrieval research.
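A sketch of how such a resampling simulation could be run, assuming hypothetical variables: worker_labels holds, for each document, the observed worker labels; gold holds the historian's labels; and rap_label is the helper sketched earlier:

    import random
    from statistics import mean

    def f1(predicted, gold):
        """F-1 score for binary relevance labels (1 = relevant)."""
        tp = sum(p == 1 and g == 1 for p, g in zip(predicted, gold))
        fp = sum(p == 1 and g == 0 for p, g in zip(predicted, gold))
        fn = sum(p == 0 and g == 1 for p, g in zip(predicted, gold))
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def average_f1(worker_labels, gold, crowd_size, threshold=0.6, runs=1000):
        """Average F-1 over `runs` crowds resampled (with replacement) per document."""
        scores = []
        for _ in range(runs):
            predictions = [rap_label(random.choices(labels, k=crowd_size), threshold)
                           for labels in worker_labels]
            scores.append(f1(predictions, gold))
        return mean(scores)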
In Figure 2, "Ideal Average F-1 Score" is the best average F-1 score that RAP achieves for a given crowd size. "Cut-off Agreement for Ideal Average F-1 Score" is the recommended agreement threshold to achieve the ideal average F-1 score. "Average F-1 Score with Cut-off Agreement = 0.6" is the average F-1 score using a threshold of 0.6. "Majority Vote (Reading)" is an F-1 score using the majority vote from the reading condition. Crowd size is on the x-axis, and both agreement threshold and F-1 score are shown on the y-axis.
The simulation results suggest three key takeaways. First, RAP's average F-1 score is very close to the ideal average F-1 score (correlation coefficient=0.99). This suggests that the agreement threshold we used for both the preliminary and validation studies, 0.6, was an effective choice.
Second, RAP with either 0.6 or the ideal agreement threshold outperforms simple majority vote for all crowd sizes. The highest majority vote score is still worse than the lowest possible RAP score, which occurs at crowd size=3.
Third, the benefits of RAP increase with larger crowd sizes, approaching perfect accuracy. At crowd size=5, used in the preliminary study, the average performance of RAP is already close to Historian B, with F-1=0.84. At crowd size=11, the average performance is equal to Historian B, with F-1=0.89. In contrast, the F-1 score for majority vote quickly saturates at around 0.8.
Finally, we note that if the worker accuracies change a lot, the threshold will also change, but the process to find the threshold would be similar. We would also expect the threshold to be slightly different in other domains, depending on how similar those domains are to history.

Historian Accuracy and Agreement
To complement this validation, we also sought to create a baseline of historian performance by comparing the two historians in our studies, Historian A and Historian B, to each other. We asked both historians to judge relevance for the document set they had not seen before. Historian B judged Documents 1-8 (preliminary study), and Historian A judged Documents 9-18 (validation study). Across both document sets, there was substantial agreement between the two historians (Cohen's κ = 0.72). The average F-1 score across both historians and document sets was 0.89. This could be interpreted as a general measure of historians' performance in finding relevant sources for other historians.
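For reference, a sketch of this agreement computation using scikit-learn, with hypothetical label lists (not the study data) standing in for two historians judging the same ten documents:

    from sklearn.metrics import cohen_kappa_score, f1_score

    historian_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # hypothetical relevance labels
    historian_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    print(cohen_kappa_score(historian_a, historian_b))      # inter-historian agreement
    print(f1_score(historian_a, historian_b, pos_label=1))  # one historian scored against the other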
These results support the intuition that historians can have slightly different interpretations of documents based on their research context. RAP was able to follow each historian's interpretation within that historian's own research context, achieving perfect accuracy for both historians and datasets.

Comparison to Automated Techniques
As a point of comparison, we also used purely automated techniques to classify the same dataset from the preliminary study. This comparison was made possible because Historian A had labeled all 189 documents used in the preliminary study with his research topics (see Table 5). Note that the comparison conditions are not fully equivalent, as the crowd required far less data. The crowd had only two relevant examples, whereas the following machine learning models need both relevant and irrelevant examples. Also, to maximize the power of automated techniques, we used all 189 documents with the four topics, compared to only the two example documents needed for RAP.
Since all the primary sources are digitized in an image format, our first step was to use an optical character recognition (OCR) system, Tesseract 4.00.00a (with LSTM) (R. Smith, 2007; R. W. Smith, 2009; Ray Smith, Antonova, & Lee, 2009), to automatically transcribe them. Next, we preprocessed these textual documents by removing stopwords and stemming words using the Snowball algorithm. We then transformed the preprocessed documents into TF-IDF space. Next, we chose five techniques representing five different categories of algorithms for binary text classification: 1) logistic regression, 2) kNN (k=9 to maximize available class samples), 3) SVM, 4) decision tree (CART), and 5) random forest (Sebastiani, 2002). Finally, we ran stratified 10-fold cross-validation for all five techniques for each of the historian's four topics.
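A simplified sketch of this pipeline using scikit-learn and NLTK; here, documents and labels are assumed to hold the OCR'd texts and Historian A's binary labels for one topic, and the NLTK stopword corpus is assumed to be installed:

    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        """Lowercase, drop stopwords, and stem the remaining tokens."""
        return " ".join(stemmer.stem(t) for t in text.lower().split() if t not in stop_words)

    X = TfidfVectorizer().fit_transform([preprocess(d) for d in documents])

    classifiers = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "kNN (k=9)": KNeighborsClassifier(n_neighbors=9),
        "SVM": SVC(),
        "decision tree (CART)": DecisionTreeClassifier(),
        "random forest": RandomForestClassifier(),
    }
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, labels, cv=StratifiedKFold(n_splits=10),
                                scoring=("accuracy", "recall"))
        print(name, scores["test_accuracy"].mean(), scores["test_recall"].mean())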
The results show that all techniques have high accuracy (0.75-0.95) but very low recall (0-0.3) due to the highly imbalanced numbers of class samples. For example, there were only 10 out of 189 documents relevant to American Hypocrisy for which accuracy is 0.91-0.95, but recall is 0 across all techniques. This means all relevant documents were missed for that topic.
To deal with the class imbalance issue, we applied three common techniques: adjusting class weights, random over-sampling, and random under-sampling. Under-sampling showed the best improvement, with accuracy 0.55-0.8 and recall 0.2-0.7. For example, with under-sampling, SVM had highest recall (0.7) and 0.55 accuracy for American Hypocrisy. Although this was a substantial improvement in recall, it may still not be practical, because there were very few relevant examples for this topic, and 30% of the relevant documents were mistakenly excluded by the automated technique.
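Two of these adjustments, sketched with the same assumed X and labels as above (class weighting via scikit-learn's built-in option, and a simple random under-sampling of the majority class):

    import numpy as np
    from sklearn.svm import SVC

    # 1) Class weighting: penalize errors on the rare "relevant" class more heavily.
    weighted_svm = SVC(class_weight="balanced")

    # 2) Random under-sampling: keep every relevant document and an equal number
    #    of randomly chosen irrelevant ones before training.
    y = np.asarray(labels)
    relevant = np.flatnonzero(y == 1)
    irrelevant = np.random.choice(np.flatnonzero(y == 0), size=len(relevant), replace=False)
    keep = np.concatenate([relevant, irrelevant])
    X_balanced, y_balanced = X[keep], y[keep]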
Although future advancements may make automated techniques more powerful, the above results show RAP and crowdsourcing may offer a compelling alternative in our demonstrated context of history.

Quality Labels for Historical Scholarship
The Read-Agree-Predict (RAP) algorithm allows novice transient crowds to find relevant primary sources in a digital archive as effectively as expert historians and, as a byproduct, reveals and prioritizes crowds' confusions. We demonstrated the effectiveness of this approach with an authentic historical dataset and two studies with different historians, topics of interest, and documents. RAP also offers clear advantages over majority vote. Our empirical results and simulations show that RAP consistently outperforms majority vote, and larger crowd sizes increase RAP's accuracy to be on par with experts. We also propose that a major strength of RAP is that its design is simple and elegant enough to be easily implemented for a variety of systems.
The ability to produce quality labels may give more confidence to historians in trusting data collected via crowdsourcing and in adopting this new crowdsourcing model for their research and classes.

Opportunities for History Education
Historical primary sources are important for both scholarly research and education in the history domain (Stearns et al., 2000; Tally & Goldenberg, 2005), and teaching students to "think like a historian" is a primary goal of history education (Hynd, Holschuh, & Hubbard, 2004; Mandell, 2008; Wineburg, 2010). While the experiments in this article focus on paid crowd workers as a proof-of-concept, future work may explore how RAP extends to students in classroom settings. Within this context, RAP's crowdsourcing model may create a win-win situation for both historian-educators and students. On the one hand, this model could help historians do research by organizing related primary sources into research topics, and teach by identifying and prioritizing students' confusions. On the other hand, students could get opportunities to participate in authentic historical research, to practice historical thinking and knowledge with primary sources, and to receive feedback accordingly. As prior research shows, comparing students' and domain experts' output on the same task is an effective way to identify students' confusions (Anderson, Boyle, Corbett, & Lewis, 1990; Anderson, Boyle, & Reiser, 1985; Merrill, Reiser, Ranney, & Trafton, 1992).
By adopting the new crowdsourcing model and RAP in classroom settings, historians could easily organize unprocessed primary sources and collect prioritized confusions that may be pervasive among students, and direct their time and expertise to the ones with higher potential impact. Demystified materials may help motivate and engage students, as research shows that people are often interested in surprising materials that challenge their existing assumptions (Brands, 2008;Davis, 1971).
While other research from non-historical domains shows that it is possible for the crowd to learn a few microtasks in a short amount of time (e.g., < 30 minutes in total) (Dow, Kulkarni, Klemmer, & Hartmann, 2012;Lee, Lo, Kim, & Paulos, 2016;Zhu, Dow, Kraut, & Kittur, 2014), we did not observe this in our study of reading comprehension techniques. The wide adoption of long-term apprenticeship in historical research may help explain why we have different results (Law et al., 2017).

Classroom Usage Scenario
To make concrete the potential costs and benefits of deploying RAP in a classroom setting, we propose the following usage scenario. A historian could begin with a list of topics of interest and a collection of unprocessed primary sources. In her class, she picks topics of interest relevant to the class and asks students to apply relevance labels for these topics to the unprocessed primary sources. As the class progresses, RAP automatically reports relevant primary sources for the topics and prioritizes students' confusions about sources and topics by aggregating students' labels. The historian then uses the relevant sources for research and addresses students' confusions, starting with the prioritized errors.
Taking data from the preliminary study as an example, with 189 primary sources and 4 topics relevant to the class, it would take 9,828 person-minutes to complete all possible labels (189 sources × 4 topics × 5 students per source-topic pair × 2.6 minutes of average reading time per source).
Historian A generally has about 35 students in his course on the American Civil War, and there are about 16 weeks in a semester, so completing all of the labels would require only about 17 minutes per week from each student. In practice, students should be able to analyze more sources as their skills improve throughout the process, and five students are not always needed when there is already high agreement.
Assuming the historian's processing time is roughly the same, it would require the historian to spend 1965.6 minutes (32.8 hours) to analyze the 189 sources alone (189 sources × 4 topics × 1 historian per source-topic pair × 2.6 minutes of average processing time per source). Thus, the historian could save nearly 33 hours of analysis time by leveraging RAP within a classroom context, excluding time spent on developing initial examples and monitoring students, which to some extent may overlap with existing teaching responsibilities.
Alternatively, if the historian were to use RAP with paid crowd workers, the total cost for labeling 189 documents for 4 topics would be about $1187.60 (9828 minutes / 60 minutes × $7.25/hour) using the current US Federal minimum wage, or about $1.57 per document-topic analysis. While using paid crowds requires financial resources, this approach has the advantage of being much faster (completable within a day) than the semester-long classroom-based scenario described above. RAP's high-quality results in our crowd-based studies suggest a variety of options to historians based on their available time, funding, and teaching flexibility.
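The back-of-the-envelope arithmetic behind these estimates, restated for clarity (all figures taken from the scenario above):

    sources, topics, workers_per_pair, minutes_per_source = 189, 4, 5, 2.6

    crowd_minutes = sources * topics * workers_per_pair * minutes_per_source  # 9828 minutes
    per_student_per_week = crowd_minutes / 35 / 16                            # ~17.6 minutes
    historian_minutes = sources * topics * 1 * minutes_per_source             # 1965.6 minutes (~32.8 hours)
    paid_crowd_cost = crowd_minutes / 60 * 7.25                               # ~$1,188
    cost_per_label = paid_crowd_cost / (sources * topics)                     # ~$1.57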

Limitations and Future Work
This article focused on understanding baseline use of crowdsourcing in historical scholarship and reported several studies with real-world digital archives. Our findings are based on a set of five historical topics and 18 primary sources, averaging 250 words in length and written at a college reading level, from the American Civil War era. Additional studies, drawing on larger datasets of topics and documents, exploring other historical periods and document formats, and adapting the techniques for classroom settings, are needed to show how these findings replicate and generalize. Experts in many other domains also serve as both researchers and educators, so we believe this new crowdsourcing model may also apply to other domains. Furthermore, the notion of labeling raw textual documents with high-level concepts is prevalent in sensemaking tasks such as intelligence analysis (Pirolli & Card, 2005) and software or product design (Russell, Stefik, Pirolli, & Card, 1993), so we believe RAP will generalize to other related domains.
To understand baseline performance with minimal constraints, our studies used novice paid crowds. With these positive initial results, we are more confident about deploying the new crowdsourcing model in a real classroom setting as a next step. In a classroom setting, we might want to incorporate work history, so we can apply more automated techniques such as EM to further increase the robustness of RAP, decrease the required number of participants per document, or both. In addition, work history can double as a learning history that reflects how well students learn throughout the process.

CONCLUSION
With digitized historical and scholarly materials made available online, it is often difficult for researchers to find documents of interest because the topics and themes they are investigating are specialized and abstract. In this article, we investigated the possibility of a new crowdsourcing model to label the relevance of digitized primary sources to high-level topics, and to reveal and prioritize crowd confusions. In our preliminary study, focusing on the effect of different semantic tasks on comprehension, we found promising results supporting the new crowdsourcing model. We also found that a robust pattern emerged enabling highly accurate predictions of document relevance based on crowd performance. Based on these results, we developed Read-Agree-Predict (RAP), a crowdsourcing approach which allows crowds to label relevance of primary sources to an abstract theme with high accuracy. As a useful byproduct, RAP also reveals crowd confusions that suggest opportunities for learning interventions. We successfully validated RAP with a new historian and set of primary sources, and conducted follow-up analyses with a simulation study and a comparison of agreement among experts. While this research used paid crowd workers in a historical domain, it has implications for applications in classroom settings and in other domains.