The data quality on Amazon Mechanical Turk (mTurk) has suffered for years now (Byrd, 2023; Chandler & Paolacci, 2017; Moss & Litman, 2018; Chmielewski & Kucker, 2019; Ahler et al., 2020; Kennedy et al., 2020; MacInnis et al., 2020). There are a few ways to protect online survey data quality. In this post, I will briefly cover five strategies for weeding out junk data in online research (not just via mTurk), from easiest to hardest.
1. Quality-Controlled Crowd Work Platforms
Even if you have already put all of your research funds into mTurk, you can still spend them via CloudResearch by linking a CloudResearch account to an AWS account; linking the two took me about 5-10 minutes.
CloudResearch will allow you to sample from a quality-controlled subset of mTurk, thereby screening out the bots, dishonest users (e.g., those who lie about their location, language proficiency, etc.), and low-effort responders (who tend to fail attention checks) without any custom code, data cleaning, or post-collection filtering (Litman, Rosenzweig, & Moss, 2020). This is ideal for large samples.
Prolific is also good at this (Palan & Schitter, 2018; Peer, Brandimarte, Samat, & Acquisti, 2017). However, I do not yet know whether Prolific can be used with mTurk funds or whether it samples from mTurk. (I do know that some of my mTurk participants also work on Prolific, but that is probably unsurprising.) Nonetheless, I have used Prolific for an online think-aloud study with about 98% high-quality data (Byrd et al., 2021). If you sign up for a Prolific account with the following link, then you will get $150 toward your first study when you top up the account with at least $250: prolific.co/?ref=IPP5AC0D. (I’m not aware of any benefit to me when you sign up with that link besides an enjoyable email telling me that I “gifted” you $150.)
Of course, you can employ the same methods that platforms like CloudResearch and Prolific use to weed out junk crowd worker accounts (e.g., CloudResearch, 2020). Alas, if you don’t have time to develop and validate these methods for yourself, then you must either (a) accept the possibility that junk data can undermine your results or (b) use already-validated methods. In my experience, I only pay 3-6% more to use these third-party quality-controlled crowd work participant pools. That seems like a small price to pay to protect data quality.
2. Qualtrics’s Data Protection Features
Qualtrics offers various response-quality features, depending on your (or your employer’s) Qualtrics package.
Features like “prevent ballot box stuffing”, “bot detection”, “email scan roadblock”, and RelevantID can block and/or flag respondents based on various signs that they are a bot, a duplicate participant, a fraudulent participant, etc. This can speed up the process of identifying or corroborating junk data.
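If you export data with these features enabled, the results typically appear as extra embedded data columns, which you can then turn into exclusion flags. Here is a minimal sketch of that step; the field names (e.g., Q_RecaptchaScore, Q_RelevantIDDuplicate, Q_RelevantIDFraudScore) and the cutoffs are my assumptions based on my reading of Qualtrics’s documentation, so check them against the columns in your own export before excluding anyone.

```javascript
// A sketch for screening an exported Qualtrics dataset using the embedded data
// columns that these features add. The field names and cutoffs are assumptions;
// verify them against the column names in your own export.
function flagRespondent(row) {
  const flags = [];

  // Bot detection adds a reCAPTCHA-style score between 0 and 1 (lower = more bot-like).
  if (row.Q_RecaptchaScore !== undefined && Number(row.Q_RecaptchaScore) < 0.5) {
    flags.push('possible bot');
  }

  // RelevantID marks likely duplicate respondents and assigns a fraud score.
  if (String(row.Q_RelevantIDDuplicate).toLowerCase() === 'true') {
    flags.push('likely duplicate');
  }
  if (row.Q_RelevantIDFraudScore !== undefined && Number(row.Q_RelevantIDFraudScore) >= 30) {
    flags.push('possible fraud');
  }

  return flags; // empty array = no automatic flags for this respondent
}

// Usage, given an array of row objects parsed from the CSV export:
// const flagged = rows.filter(row => flagRespondent(row).length > 0);
```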
3. Advanced Image Parsing (Easier Than It Sounds)
Add the following item to your survey: (a) an image of an easily identifiable facial expression (e.g., surprise, anger, etc.) with (b) an instruction like, “In one complete sentence, describe what may have happened immediately before this picture was taken.”
In recent research with colleagues, this has proved surprisingly effective at identifying junk responders because
- many bots will produce a nonsensical response.
- more clever bots tend to describe the emotion (often with a dictionary definition) rather than its cause.
- people who are not actually fluent in the language of the survey will often reveal this in their sentence.
- low-effort responders tend not to write a complete sentence.
Obviously, you can adapt this kind of image parsing task in many ways. The key is to ensure that it requires not just basic image recognition ability (that a bot can pass), but also social inference, imagination, and language proficiency.
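If you collect many of these responses, a rough first-pass screen can shrink the pile that needs human review. Below is a minimal sketch of that idea, assuming the surprise example above; the keyword list and word-count threshold are illustrative guesses rather than validated cutoffs, and nothing here replaces reading the flagged responses yourself.

```javascript
// A first-pass screen for free-text responses to the image-parsing item.
// It only narrows down what needs human review; the emotion label, keyword
// list, and 5-word threshold are illustrative assumptions, not validated cutoffs.
function screenImageResponse(text, emotionLabel) {
  const flags = [];
  const trimmed = (text || '').trim();
  const words = trimmed.split(/\s+/).filter(Boolean);

  // Low-effort responders tend not to write a complete sentence.
  if (words.length < 5) {
    flags.push('too short to be a complete sentence');
  }

  // Cleverer bots tend to describe or define the emotion rather than its cause.
  const mentionsEmotion = new RegExp(emotionLabel, 'i').test(trimmed);
  const mentionsCause = /\b(because|after|before|when|saw|heard|found|told|won|lost)\b/i.test(trimmed);
  if (mentionsEmotion && !mentionsCause) {
    flags.push('names or defines the emotion without a plausible cause');
  }

  return flags; // empty array = nothing suspicious on this first pass
}

// Usage: screenImageResponse('Surprise is a feeling of being startled.', 'surprise')
//   -> ['names or defines the emotion without a plausible cause']
```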
4. Instructional Manipulation Checks
If you are still reading this blog post, then share it with your colleagues using the share buttons on your screen.
Statements like this—which require readers to do something specific that they would probably not have done otherwise—can be a good test of participants’ attention. Instructional manipulation checks come in various forms and have been used for many years (see Oppenheimer et al., 2009).
5. Scripts
Do you need to find out how much time your participants spend off task during your Qualtrics survey? Try TaskMaster (Permut, 2019). By placing some JavaScript in a few locations in your survey and uploading the resulting dataset to a Shiny app, you can get variables like the following automatically added to your dataset:
- Page_N: an ordered array of on- and off-task behavior at the page level. Negative numbers indicate the duration of intervals with the cursor outside the task window, and positive numbers indicate the duration of intervals with the cursor inside the task window.
- Page_N_ClickAways: the number of times the subject clicked away from the page. This variable corresponds to the count of negative values in the Page_N array.
- Page_N_TimeOffPage: the total amount of time spent off a given page (the absolute value of the sum of the negative values in the Page_N array).
- Page_N_TimeOnPage: the total amount of time spent on a given page (the sum of the positive values in the Page_N array).
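To make those definitions concrete, here is a small sketch that recomputes the three summary variables from a Page_N-style array (negative entries for off-task intervals, positive entries for on-task intervals, as described above). The real TaskMaster/Shiny workflow produces these columns for you, so treat this only as an illustration of what the numbers mean.

```javascript
// Recomputing the TaskMaster-style summaries from a Page_N array, following
// the definitions above: negative entries are off-task interval durations and
// positive entries are on-task interval durations. The real TaskMaster/Shiny
// pipeline generates these columns for you; this is only an illustration.
function summarizePage(pageArray) {
  const offIntervals = pageArray.filter(x => x < 0);
  const onIntervals = pageArray.filter(x => x > 0);
  return {
    clickAways: offIntervals.length,                                    // Page_N_ClickAways
    timeOffPage: Math.abs(offIntervals.reduce((sum, x) => sum + x, 0)), // Page_N_TimeOffPage
    timeOnPage: onIntervals.reduce((sum, x) => sum + x, 0),             // Page_N_TimeOnPage
  };
}

// Usage: summarizePage([12, -3, 20, -7, 8])
//   -> { clickAways: 2, timeOffPage: 10, timeOnPage: 40 }
```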
Other scripts have been developed as well, e.g., for identifying suspicious responses from virtual private servers (Francis, 2019) or for detecting whether users change tabs (Kristian’s StackOverflow recommendation). Data produced by these scripts can be used to flag (and/or exclude) participants according to your own a priori criteria.
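For the tab-change idea, the general pattern looks something like the following. This is not the exact snippet from that StackOverflow recommendation; it is a minimal illustration using Qualtrics’s question-level JavaScript editor and the browser’s Page Visibility API, and the embedded data field name tabSwitches is my own invention (it must be declared in the survey flow for the count to be saved).

```javascript
// A minimal illustration of counting how often the respondent hides the
// survey tab. Not the exact StackOverflow snippet referenced above.
// The embedded data field 'tabSwitches' is an assumed name and must be
// declared in the survey flow before it will appear in the export.
Qualtrics.SurveyEngine.addOnload(function () {
  // Carry the running count across pages via embedded data.
  let switches = Number(Qualtrics.SurveyEngine.getEmbeddedData('tabSwitches')) || 0;

  document.addEventListener('visibilitychange', function () {
    if (document.hidden) {
      switches += 1;
      Qualtrics.SurveyEngine.setEmbeddedData('tabSwitches', switches);
    }
  });
});
```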
Caveats
Previewing & Testing
Hopefully this is obvious, but testing your survey’s functionality will be key for any of these strategies. Every time you implement a change—especially to background scripts and code—you should retest the survey to make sure that it looks as expected and produces the requisite data.
In a recent project in which we—among other things—tested four identical surveys on multiple platforms (see the image at the top of the post), some scripts failed to produce data in one of the surveys. If we had tested every survey (rather than assuming that each one would work the same, given that they were identical), then we might have been able to collect the missing data. To this day, neither we nor Qualtrics can explain what went wrong.
Fair Pay
I assume researchers agree that participants should be paid fairly. Alas, good pay is insufficient for quality data. Take it from this mTurk worker: “I’ve been rejected for going too fast [so] I’ll milk the timer on a new requester who is over paying for hits, hoping that it will make them less likely to drop the pay” (Wessling et al., 2017). In this well-paid case, reaction timers were not measuring just reaction time, but something else: timer milking. That can drastically change the interpretation of results for many online surveys.
What is the source for the diagram at the top of your post?
Hi Jim!
From the post: “In a recent project in which we—among other things—tested four identical surveys on multiple platforms (see the image at the top of the post)….” We’ll post the preprint as soon as it’s ready. Here’s the link to the preprint (Byrd, 2023), with an improved visualization of the same difference(s) in data quality (one of the earliest figures in the short paper): https://doi.org/10.31234/osf.io/y8sdm