Along with most of the rest of the world, in-person testing for me stopped abruptly in March 2020. I was about to launch piloting for the final project of my PhD, an electroencephalography experiment studying timing and speech perception – a logistical impossibility in locked-down London, both then and now. It was a quick and painful pivot to online behavioural work via Gorilla, but I have to say that it’s been a lot of fun. It’s hard to imagine going back to onerous recruitment, booking people in, and taking out cash advances that take the university months to reimburse. Online testing is fast, convenient, and generally easy to implement. But there have also been some difficulties, one of which I would like to bring to light, especially for those of us who research speech or music, or who run multisensorial and/or temporally dynamic tasks.
In my speech experiments, I test my participants’ non-verbal sense of rhythm using a binary forced-choice discrimination task. In a nutshell, participants listen to two drum rhythms and report whether they are identical or different. In my lab-based experiments, I have found that performance on this perceptual task tracks well with rhythm production tasks, and I have also seen a nice range of performance that corresponds closely, though not exactly, with self-reports about musical background. One immediate result of moving my testing online was that average scores on my rhythm task jumped from about 75% to 85% accuracy, despite screening out musicians as well as multilingual participants, who also tend to do well on rhythm tasks.
Since I’m focused on timing, I really want a good range of rhythmic abilities in my sample, so I tried to understand why these participants were scoring so highly. I looked a little more closely at my sample demographics; my participants are recruited through Prolific. Assuming they responded truthfully during the screening process, it wasn’t music, dance, or language-learning experience that was driving my rhythm task data upwards. But then I remembered that gaming is associated with enhanced attention, working memory, multisensory processing, and other skills or traits that probably come into play in a lot of speech or music experiments (Boot et al., 2008; see Palaus et al., 2017 for a review). Plus, I don’t want to oversell my tasks here, but they certainly share some elements with more traditionally gamified tasks – if ever so slightly less entertaining.
I decided to screen against video gaming in my sample, and this was an eye-opener. The initial, non-screened sample of active participants Prolific offers totals over 150,000 people.
Sounds good! Let’s make use of Prolific’s extensive screening capabilities and recruit from the participants who have answered questions about gaming – a smaller but still massive pool of 40,159 people.
I select only participants who report playing 0-3 hours of video games per week. This single screening variable slashes my pool of available participants to just 4,549 people – about 11% of those who provided data about their video gaming.
For reference, if you instead select for 13 or more hours of video games per week, you get a pool nearly three times the size of the non-gamer one: 12,915 people.
Of course, I also need to screen for other factors, such as hearing function and first language. Let’s ask for non-gaming (“0-3 hours per week”) English monolinguals between the ages of 18 and 35 without a diagnosis of hearing loss – a fairly unexceptional sample in mainstream Western speech science, right?
That’s 1,435 participants. When I asked for this same group but without any musical experience, I was down to about 350 people. And if you are concerned about “professional participants” and only want people who have completed fewer than 20 studies, the sample all but dries up entirely.
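If you want to see how brutal this screening funnel is at a glance, here is a quick back-of-the-envelope sketch. It is plain Python with nothing Prolific-specific; the figures and labels are simply the ones quoted above, and every percentage is taken relative to the 40,159 people who answered the gaming questions:

```python
# Pool sizes quoted in the text, from broadest to narrowest screening.
pools = {
    "answered gaming questions": 40_159,
    "0-3 h/week of video games": 4_549,
    "13+ h/week of video games (for comparison)": 12_915,
    "non-gamer English monolinguals, 18-35, no hearing loss": 1_435,
    "... and no musical experience (approx.)": 350,
}

base = pools["answered gaming questions"]
for label, n in pools.items():
    # Each pool as a share of everyone who answered the gaming questions.
    print(f"{label}: {n:,} people ({100 * n / base:.1f}%)")
```

Running this makes the asymmetry obvious: the heavy-gamer pool alone is nearly three times the size of the non-gamer pool, before any other screener is applied.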
I am now in the process of recruiting this demographic of younger, non-gaming English monolinguals for another speech rhythm perception experiment. Unlike in my other studies, where I didn’t screen for gaming, data collection has been slow: the better part of a week for thirty participants, rather than 150 people or more in a matter of hours. I wonder how many of the people in my earlier studies spend a significant amount of their time gaming, just waiting for email links from Prolific to come in, and what that could mean for my data.
The average score on the rhythm task so far (I’ve got 26 of 30 participants, five days and counting) is 78% – still a little high, but much closer to what I was getting with my in-person cohorts, and substantially lower than in my previous online samples. Can I prove that this is because I’ve started screening for video game time? Nope – but that probably isn’t the study you wanted to run, either!
I would like to finish by emphasising that this isn’t an issue unique to Prolific, but I think it’s something important to take into consideration if your research involves dynamic, multisensorial tasks that probe attention, measure reaction times, or are otherwise associated with the family of aptitudes and habits refined by serious gamers. Now that we’re all online, it might be worth checking your sample, too.