Tuesday, January 19, 2016

Who are Turkopticon's Top Contributors?

In my most recent post, "Turkopticon: Defender of Amazon's Anonymous Workforce," I introduced Turkopticon, the social art project designed to give Amazon's massive Mechanical Turk workforce basic tools for sharing information about employers (requesters).

Turkopticon has been a runaway success, with nearly 285 thousand reviews submitted by over 17 thousand reviewers since its inception in 2009. Collectively these reviews make up 53 million characters, which maps to about 7.6 million words assuming 7 characters per word (5 letters for the average word plus two spaces). At 100 words every 7 minutes, this represents approximately 371 days collectively spent just writing reviews. It is probably safe to consider this an underestimate.
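The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sketch using the rounded figures quoted above; the variable names are made up for illustration. With rounded inputs it gives about 368 days, slightly below the 371 quoted, presumably because the original calculation used the exact character count.

```python
# Back-of-the-envelope estimate of the time spent writing reviews.
# All inputs are the rounded figures quoted in the text.
total_chars = 53_000_000   # characters across all reviews
chars_per_word = 7         # 5 letters plus 2 spaces per word
words_per_block = 100      # assumed writing speed: 100 words ...
minutes_per_block = 7      # ... every 7 minutes

total_words = total_chars / chars_per_word
total_minutes = total_words / words_per_block * minutes_per_block
total_days = total_minutes / (60 * 24)

print(f"{total_words / 1e6:.1f} million words")
print(f"{total_days:.0f} days of continuous writing")
```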

So given this massive investment of individuals in writing these reviews, I find myself wanting to ask, "who is investing this kind of energy producing this public good?"

In general, while there are many contributors, the top 500 contributors account for 54% of the reviews written. The top 100 reviewers make up 30% of the reviews written, and the top 15 account for 11%.
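Shares like these can be computed from a list of per-reviewer review counts. The sketch below is illustrative only: the function name `top_share` and the toy counts are assumptions, not the Turkopticon data.

```python
def top_share(review_counts, n):
    """Fraction of all reviews written by the n most prolific reviewers."""
    ranked = sorted(review_counts, reverse=True)
    return sum(ranked[:n]) / sum(ranked)

# Toy data (not the Turkopticon counts): reviews written per reviewer.
counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
print(top_share(counts, 1))  # 0.5
print(top_share(counts, 3))  # 0.8
```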

Figure 1: Using this graph we can find the Gini coefficient for the number of submissions, which is around 82%, indicating that a very small number of individuals are doing nearly all of the work.

Within Turkopticon there is no ranking system for reviewer quality, so it is not obvious who the top contributors are or what their reviewing patterns look like. In this article we will examine some general features of the top contributors.
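For readers unfamiliar with the Gini coefficient mentioned in the Figure 1 caption, here is a minimal sketch of one standard way to compute it from per-reviewer review counts. It is not necessarily how the figure above was produced (that was read off a graph), and the function name and toy inputs are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

equal = gini([1, 1, 1, 1])          # everyone writes the same number of reviews
concentrated = gini([0, 0, 0, 10])  # one reviewer writes everything
```

With four reviewers the maximum possible value is 0.75, which is what the concentrated example returns; the equal example returns 0.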

Table 1: A list of the top 15 Turkopticon review contributors. Rank is the reviewer's rank by number of reviews written. Name is the reviewer's name. Nrev is the number of reviews written. DaysTO is the number of days between the reviewer's oldest and most recent reviews. Nchar is the average number of characters written in each review. FAIR, FAST, PAY, and COMM are the quantitative scales on which Turkopticon asks reviewers to rate requesters. Fair indicates how fair the requester was in rejecting or not rejecting work. Fast indicates how quickly the requester approved or rejected work. Pay indicates how the reviewer perceived the payment for the work. Comm refers to communication: if the worker attempted to communicate with the requester, how well the requester addressed the worker's concerns.

Find the full list as a Google document here (First Tab).

From Table 1 we can see that all of the top 15 reviewers have contributed over 1,200 reviews, with bigbytes the most prolific, contributing over 5,200. NurseRachet (a forum moderator) has been active on Turkopticon the longest, followed by worry and Rosey. As for the most long-winded, kimadagem has the longest average review at 490 characters, or approximately 70 words, while CaliBboy has the shortest at only 75 characters, or around 10 words.

Across the averages of the four rating scales there is a fair bit of diversity among the top reviewers: jaso...@h... has the highest average score across the four scales at 4.8, and jmbus...@h... has the lowest at around 2.7, followed by ptosis with an average a tiny bit higher than 3.

So now we have a pretty good idea of what in general the top contributors to Turkopticon look like.

But what of the quality of the contributions?

In order to understand what a quality contribution in Turkopticon looks like we must consider the standards that the community has come up with after years of trial and error.
1. The four different scales should be distinct categories. That is, a high pay rate should not cause someone to automatically give a high Fairness rating, or vice versa.
2. To this end, what are referred to as 1-Bombs, attempts to artificially drop a requester's score by rating all scales 1, should be avoided. Similarly, 5-Bombs should also be avoided.
3. Within Turkopticon there is also the ability to flag reviews as problematic. If one of your reviews is flagged, it means someone has a problem with it.
4. In general we would like reviews to be approached with a level head, so that reviewers write independent reviews rather than ones based on their current mood.
5. Finally, in general we would like reviewers to review as many categories as they can when writing reviews.

From these 5 guidelines, I will attempt to generate variables that measure each of these targets.
1. For different scales I will focus on the relationship between pay and the other three scales for individual requesters (FairPay, FastPay, and CommPay for the correlations between Fair, Fast, and Comm with pay respectively). The reason I focus on Pay is that it seems to be the scale often times that concerns Mturk workers the most.

Table 2: For reviewers, the average correlation between Pay and the other scales (groups compared: all reviewers, Top 100, Top 15).

From Table 2 we can see that the average reviewer has a very strong positive correlation between Pay and the other scales with FAIR, FAST, and COMM in the .73-.81 range. In contrast the Top 100 and especially the Top 15 all have much lower correlations. We should not necessarily hope for a zero correlation between these factors since one might expect a requester who pays too low might also act unfairly, not respond quickly to submissions, or have poor communication habits.
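To make FairPay, FastPay, and CommPay concrete, here is a minimal sketch of how such a correlation could be computed for a single reviewer. It assumes the correlation is taken across that reviewer's reviews, using only reviews where both scales were rated. The data layout, function names, and toy ratings are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one of the scales
    return sxy / sqrt(sxx * syy)

def scale_pay_correlation(reviews, scale):
    """Correlation between `scale` and pay over reviews that rate both."""
    pairs = [(r[scale], r["pay"]) for r in reviews
             if r[scale] is not None and r["pay"] is not None]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])

# Toy reviews for one hypothetical reviewer (None = scale left blank).
reviews = [
    {"fair": 5, "fast": 4, "pay": 5, "comm": None},
    {"fair": 1, "fast": 2, "pay": 1, "comm": 1},
    {"fair": 3, "fast": 5, "pay": 3, "comm": None},
    {"fair": 4, "fast": 1, "pay": 4, "comm": 5},
]
fair_pay = scale_pay_correlation(reviews, "fair")  # Fair always equals Pay here
fast_pay = scale_pay_correlation(reviews, "fast")  # weak positive relationship
```

In this toy example Fair tracks Pay perfectly, so FairPay comes out at 1 (up to rounding), while FastPay is only about 0.21.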

2. 1-Bombs and 5-Bombs are easy to observe in the data as reviews with all 1s or all 5s. However, it is worth noting that all 1s or all 5s might actually be a valid review given the circumstances. The variables 1Bomb and 5Bomb measure the likelihood that an individual's review falls into each of the two categories.

3. Flags are also directly observable. A single review can carry multiple flags; the most-flagged review in my data has 17. The variable FLAG is the average (expected) number of flags on an individual reviewer's reviews.

Table 3: The prevalence rates of 1-Bombs, 5-Bombs, and Flags (groups compared: all reviewers, Top 100, Top 15).

From Table 3 we can see that the prevalence rates of 1-Bombs, 5-Bombs, and Flags are much higher among general reviewers than among the Top 100, and especially the Top 15.
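As a rough illustration of how the 1Bomb, 5Bomb, and Flag variables could be derived for one reviewer, consider the sketch below. It assumes bomb rates are taken over reviews in which all four scales were completed, and that the flag rate is averaged over all of the reviewer's reviews. The field names and toy reviews are hypothetical.

```python
SCALES = ("fair", "fast", "pay", "comm")

def reviewer_rates(reviews):
    """Return 1Bomb, 5Bomb, and Flag rates for one reviewer's reviews."""
    # Only reviews that rate every scale can count as a bomb (assumption).
    complete = [r for r in reviews if all(r[s] is not None for s in SCALES)]

    def share(value):
        hits = sum(all(r[s] == value for s in SCALES) for r in complete)
        return hits / len(complete) if complete else 0.0

    return {
        "1Bomb": share(1),
        "5Bomb": share(5),
        # Average number of flags per review, over all reviews.
        "Flag": sum(r["flags"] for r in reviews) / len(reviews),
    }

reviews = [
    {"fair": 1, "fast": 1, "pay": 1, "comm": 1, "flags": 2},     # 1-Bomb, flagged twice
    {"fair": 5, "fast": 5, "pay": 5, "comm": 5, "flags": 0},     # 5-Bomb
    {"fair": 4, "fast": 5, "pay": 2, "comm": 3, "flags": 0},     # mixed ratings
    {"fair": 1, "fast": 1, "pay": 1, "comm": None, "flags": 0},  # incomplete, not a bomb
]
rates = reviewer_rates(reviews)
```

Here one of the three complete reviews is a 1-Bomb and one is a 5-Bomb, and the four reviews carry two flags in total, so the Flag rate is 0.5.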

4. In order to attempt to measure "level-headedness" I will just look at how reviews trend from a rating perspective. That is, is the value of the current review correlated (either positively or negatively) with the value of the next review?

Table 4: The auto-regressive one-step correlation between review levels (groups compared: all reviewers, Top 100, Top 15). In this case the "ALL" category only includes the 3,700 reviewers who have written more than 10 reviews.

From Table 4 we can see that inter-review correlation is pretty small, especially when compared with the correlation between Pay and the other scales within the same review (Table 2). Interestingly, for the average reviewer there is almost no correlation across reviews. This might be a result of these reviewers writing fewer reviews in general, thus spacing them more widely, which makes them less likely to be sequentially influenced by personal psychological trends.
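A one-step correlation of this kind can be sketched as the Pearson correlation between each review's level and the level of the next review by the same reviewer. The sketch assumes each review has been reduced to a single number (for example the mean of its rated scales) and that the list is in chronological order; both choices are assumptions rather than details given above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def lag1_correlation(levels):
    """Correlation between each review level and the following one."""
    return pearson(levels[:-1], levels[1:])

drifting = lag1_correlation([1, 2, 3, 4, 5])        # ratings trend steadily upward
alternating = lag1_correlation([1, 5, 1, 5, 1, 5])  # ratings flip back and forth
```

The drifting series gives a correlation of 1 and the alternating series gives -1 (up to rounding). A level-headed reviewer, in the sense used here, would sit near 0.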

5. Finally in terms of completeness we can easily measure completeness in terms of how frequently reviews of individual scales were not completed.

Table 5: The completion rates of individual scales (groups compared: all reviewers, Top 100, Top 15).

From Table 5 we can see that the completion rates of all scales are more or less equivalent between that of the general reviewers and that of the Top 100 and Top 15 except in the case of COMM. In this case we can see that the top reviewers are much less likely to rate communication.
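Completion rates are simple to compute once unrated scales are marked as missing. The sketch below uses a hypothetical layout in which a skipped scale is stored as `None`; the toy reviews are made up.

```python
SCALES = ("fair", "fast", "pay", "comm")

def completion_rates(reviews):
    """Share of reviews in which each scale was actually rated."""
    n = len(reviews)
    return {s: sum(r[s] is not None for r in reviews) / n for s in SCALES}

# Toy reviews: communication is rated in only one of four reviews.
reviews = [
    {"fair": 5, "fast": 4, "pay": 3, "comm": None},
    {"fair": 4, "fast": 4, "pay": 2, "comm": None},
    {"fair": 1, "fast": None, "pay": 1, "comm": 1},
    {"fair": 5, "fast": 5, "pay": 4, "comm": None},
]
rates = completion_rates(reviews)
```

In this toy data Fair and Pay are always rated, Fast is rated in three of four reviews, and Comm in only one of four.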

Constructing A Quality Scale

In order to construct the best scale given our data, we will choose those variables and values that seem typical of the top 15 most prolific reviewers. From Tables 2 and 3 we can see very distinct differences between the average reviewer and top reviewers. However, for our auto-correlation and completeness rates we see very little difference in general, except that the top reviewers are much less likely to rate communication. I can't know exactly why this is the case, but I suspect it is a combination of top reviewers avoiding 1-Bombs and 5-Bombs and top reviewers not typically finding it worth their time to communicate directly with requesters.

So here is my proposed index using standardized variables (x/sd(x)):

ReviewerProblemIndex = 3*Flag + 3*1Bomb + 1/2*5Bomb
                       + 1*FairPay + 1*FastPay + 1*CommPay

Because the variables are standardized, we can read the scalars in front as directly representing the weight of each variable. I weight Flags the strongest, as they are an indicator that someone in the community has a problem with the review. Weighted equally heavily are 1Bombs, which are widely regarded as a serious problem and frequently discussed on the Turkopticon forum.

5Bombs, FairPay, FastPay, and CommPay are also discussed but not considered as important (Turkopticon Discuss). I have made 5Bombs half as important as the FairPay, FastPay, and CommPay variables, as it seems cruel to penalize someone for being generous with reviews.
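A minimal sketch of the index calculation follows. It assumes each variable is divided by its sample standard deviation across whatever set of reviewers is passed in; which population the standard deviations in this post were taken over is not stated. The three input rows are copied from Table 6, but because the standard deviations here come from just those three rows, the resulting values will not match the Index column there.

```python
from statistics import stdev

WEIGHTS = {"Flag": 3, "1Bomb": 3, "5Bomb": 0.5,
           "FairPay": 1, "FastPay": 1, "CommPay": 1}

def reviewer_problem_index(rows):
    """Weighted sum of x / sd(x) for each reviewer; rows maps name -> variables."""
    # Sample standard deviation of each variable across the supplied reviewers.
    # A variable with no variation would give sd = 0 and needs special handling.
    sds = {v: stdev([r[v] for r in rows.values()]) for v in WEIGHTS}
    return {name: sum(w * r[v] / sds[v] for v, w in WEIGHTS.items())
            for name, r in rows.items()}

rows = {
    "jessema": {"Flag": 0.001, "1Bomb": 0.001, "5Bomb": 0.016,
                "FairPay": 0.12, "FastPay": 0.09, "CommPay": 0.20},
    "Thom Burr": {"Flag": 0.002, "1Bomb": 0.013, "5Bomb": 0.030,
                  "FairPay": 0.87, "FastPay": 0.84, "CommPay": 0.92},
    "jmbus": {"Flag": 0.003, "1Bomb": 0.170, "5Bomb": 0.020,
              "FairPay": 0.99, "FastPay": 0.98, "CommPay": 0.92},
}
rpi = reviewer_problem_index(rows)
ranking = sorted(rpi, key=rpi.get)  # lowest (best) index first
```

Even with this tiny sample the ordering agrees with Table 6: jessema scores best, then Thom Burr, then jmbus.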

So let's apply our index and see how our top 15 reviewers score!

Table 6: The top 15 most prolific contributors ranked by the ReviewerProblemIndex (Index, RPI). IRank is the reviewer's ranking in terms of the RPI. Name is the reviewer's name. Nrev is the number of reviews written. Rank is the reviewer's rank by number of reviews written. The other variables are described above.

IRank  Index  Name                     Nrev  Rank  Flag   1Bomb  5Bomb  FairPay  FastPay  CommPay
1      1.9    jessema...@g...          1539  9     0.001  0.001  0.016  0.12     0.09     0.20
2      2.1    kimadagem                3732  2     0.002  0.000  0.014  0.05     -0.01    0.27
3      3.2    worry                    2637  3     0.000  0.003  0.006  0.11     0.11     0.53
4      3.5    absin...@y...            1320  10    0.000  0.000  0.007  0.24     0.13     0.55
5      3.5    bigbytes                 5236  1     0.001  0.000  0.007  0.20     0.04     0.54
6      4.0    surve...@h...            2488  5     0.001  0.001  0.008  0.32     0.29     0.34
7      6.4    shiver                   1721  7     0.001  0.005  0.015  0.50     0.33     0.76
8      6.6    jaso...@h...             2100  6     0.001  0.004  0.070  0.41     0.27     0.83
9      10.9   Thom Burr                1594  8     0.002  0.013  0.030  0.87     0.84     0.92
10     11.0   Rosey                    1313  11    0.004  0.009  0.022  0.81     0.81     0.85
11     12.4   NurseRachet (moderator)  1274  14    0.016  0.022  0.078  0.39     0.32     0.46
12     12.7   CaliBboy                 1281  12    0.022  0.004  0.005  0.20     0.21     0.47
13     13.1   TdgEsaka                 1234  15    0.015  0.016  0.029  0.57     0.40     0.73
14     13.4   ptosis                   1278  13    0.009  0.039  0.034  0.80     0.78     0.73
15     17.2   jmbus...@h...            2539  4     0.003  0.170  0.020  0.99     0.98     0.92

From Table 6 we can see that in general the more prolific reviewers also tend to be higher ranked on the RPI, with a few exceptions. One exception is "jmbus": despite being the fourth most prolific contributor, he/she is ranked at the bottom of the top 15 list. This is likely due to having the highest 1-Bomb rate in the top 15, with 17% of reviews being 1Bombs. His/her ratings on the other scales also seem to be almost entirely correlated with Pay, as FairPay, FastPay, and CommPay are all above 0.9.

Similarly, "jessema," though only the 9th most prolific reviewer, seems to have the highest quality of reviews (slightly ahead of "kimadagem"), with very low Flag, 1Bomb, and 5Bomb rates as well as very low correlation between Pay and the Fair, Fast, and Comm scales. Interestingly, though both "Thom Burr" and "Rosey" have very high correlations between Pay and the other scales, because they have relatively low Flag, 1Bomb, and 5Bomb rates they are ranked near the middle.

Overall, with a few exceptions, I am very impressed that the top contributors seem to score so well on the RPI.

Table 7: The Top 100 most prolific contributors ranked based on the Reviewer Problem Index (RPI).
Rank  Index  Name                     Nrev  Rrank  Flag   1Bomb  5Bomb  FairPay  FastPay  CommPay
1     -0.13  seri...@g...             488   64     0.000  0.000  0.006  0.00     -0.05    0.00
2     1.67   james...@y...            365   98     0.000  0.000  0.000  0.29     0.00     0.18
3     1.72   donn...@o...             1064  23     0.001  0.000  0.006  0.04     0.04     0.27
4     1.85   jessema...@g...          1539  9      0.001  0.001  0.016  0.12     0.09     0.20
5     1.94   iwashere                 689   44     0.003  0.000  0.017  0.00     0.05     0.12
6     2.03   kimadagem                3732  2      0.002  0.000  0.014  0.05     -0.01    0.27
7     2.06   mmhb...@y...             422   79     0.005  0.000  0.009  0.00     0.00     0.00
8     2.21   aristotle...@g...        579   51     0.002  0.000  0.010  0.10     0.11     0.19
9     2.90   Kafei                    561   55     0.002  0.000  0.027  0.16     0.13     0.27
10    2.93   turtledove               1188  19     0.001  0.000  0.012  0.32     0.04     0.34
90    15.28  Anthony99                571   53     0.005  0.014  0.391  1.00     1.00     1.00
91    15.83  cwwi...@g...             543   57     0.011  0.070  0.026  0.84     0.85     0.84
92    16.25  rand...@g...             490   63     0.002  0.157  0.051  0.97     0.97     0.99
93    16.76  trudyh...@c...           378   95     0.008  0.140  0.056  0.87     0.84     0.80
94    16.79  jmbus...@h...            2539  4      0.003  0.170  0.020  0.99     0.98     0.92
95    17.30  hs                       945   28     0.010  0.115  0.098  0.87     0.86     0.89
96    17.94  ChiefSweetums            691   43     0.010  0.185  0.054  0.68     0.68     0.81
97    21.49  Playa                    414   85     0.010  0.239  0.014  0.93     0.90     1.00
98    31.56  Tribune                  360   99     0.053  0.011  0.108  0.76     0.61     0.97
99    35.74  taintturk. (moderator)   1176  21     0.027  0.499  0.014  0.89     0.87     0.73
100   40.53  Taskmistress             698   42     0.017  0.755  0.020  0.91     0.91     0.96

Find the full list of Top 100 ranked here (Second Tab).

In Table 7 we can see how reviewers score on the RPI across the Top 100 reviewers. The top 10 have great scores, with "seri" having the top-ranked score: 488 reviews written, no Flags or 1Bombs, and only three 5Bombs. For "seri" there is also no correlation between Pay and either Fair or Comm, and, amazingly, a negative correlation between Pay and Fast.

The worst 10 reviewers are much more interesting, mostly because taintturk, a Turkopticon moderator, and Tribune, a former moderator, are on the list. Everybody on the worst 10 list suffers from very high correlations between Pay and the other scales. Taintturk also has 50% of his/her reviews being 1Bombs (among those reviews in which all of the scales were completed). This is not the worst, as Taskmistress has 75% 1Bombs, but it was surprising. Looking back at the early reviews, I see that 1Bombs seem to have been common earlier in Turkopticon and were intended to reflect an Amazon Terms of Service violation, something that has since been implemented.

Similarly, Tribune has one of the highest flag rates in the entire list, with an expected 0.053 flags per review (about 5%). However, as Tribune was invited to be a moderator despite this spotty history, we can only assume that my rating system has some serious flaws.

Overall, I would therefore take the RPI ranking with a grain of salt. Perhaps some of the longer-time contributors to Turkopticon are suffering from standards that have changed over time. If I have time I will revisit the rating system looking only at reviews from the last year or two.

Saturday, January 16, 2016

Turkopticon: Defender of Amazon's Anonymous Workforce

Labor crowdsourcing is the system by which large crowds of workers contribute to a project, allowing complex and tedious tasks to be completed rapidly and efficiently. The largest labor crowdsourcing platform in the world, Amazon Mechanical Turk (Mturk), is estimated to have revenue on the order of 10 to 150 million dollars annually. Despite this, there is no built-in system by which workers can identify which employers (requesters) are cheaters and which are legitimate. And in a system powered by anonymity and numerous micro-transactions, the inability to provide feedback warning other workers about requester quality is a big deal!

Social activists and artists Six Silberman and Lilly Irani at UCSD have designed a solution to help mitigate this problem. Turkopticon provides a mechanism by which workers can rate their experience of working with requesters. Turkopticon reviews can be read on the UCSD-hosted website, and workers can quickly access average ratings through the browser extension, which pops up next to the requester's information on mouse-over while they search for requesters.

Figure 1: An example MTurk HIT listing with Turkopticon review information provided.

Established in May of 2009, Turkopticon, with over 284 thousand reviews written by more than 17 thousand reviewers, has been an unmitigated success at creating a tool by which the community of Mturk workers share information about their experience with requesters.

Figure 2: Activity on Turkopticon measured in terms of number of reviews written daily and the unique number of reviewers.

From Figure 2 we can see that both the number of reviews and the number of daily participating reviewers increased dramatically from 2009 until mid-2015, since when both have been in decline.

This might not actually be a problem for the Mturk system. Perhaps as information is shared about a particular requester, workers find that their individualized experiences are sufficiently summarized by the quantity of information already available.
Figure 3: Mean and median number of reviews for individual requesters at each date.

Figure 3 seems to support the idea that as time has gone on some requesters have accumulated a large collection of reviews. This is not particularly surprising as one would expect that the longer requesters use Mturk the more reviews they accumulate except in the case when requesters prefer not to maintain their reviews (if for instance they are cheaters). Requesters have the power at any time to start a new Amazon requester account. Reviews do not transfer between accounts. This might be what is driving the median number of reviews to be so low (around 10 at this time).

In order to get a better perspective on what is happening, we might want to ask: how long are requesters generally active? We cannot observe directly how long requesters are active, as we do not have the Mturk activity data, but we can look at when reviews are posted, assuming that requesters must have been active at least on each day for which a review was written.

Figure 4: Mean and median of the number of days requester accounts have been active, calculated as Current Date of a Review less the First Date of a Review.

From Figure 4 we can see that on average requesters have been active for nearly two years, though the median activity level is much lower at less than a year. These numbers are likely inflated, as requesters who get reviewed early, drop out of Mturk for a period of time, and then use their accounts once again are considered just as active as requesters who have been active continuously over the same period. One way of avoiding this would be to count only the days on which requesters were demonstrably active, that is, days on which they were reviewed. However, this figure ends up being almost identical to Figure 3, so I have omitted it.
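Both activity measures described above can be sketched for a single requester from the dates of that requester's reviews. The function names and toy dates are assumptions made for illustration.

```python
from datetime import date

def days_active(review_dates):
    """For each review, days elapsed since the requester's first review."""
    first = min(review_dates)
    return [(d - first).days for d in sorted(review_dates)]

def distinct_review_days(review_dates):
    """Alternative measure: number of distinct days with at least one review."""
    return len(set(review_dates))

# Toy requester: reviewed twice on day one, then a gap, then much later.
dates = [date(2015, 1, 1), date(2015, 1, 1), date(2015, 1, 11), date(2016, 1, 1)]
print(days_active(dates))           # [0, 0, 10, 365]
print(distinct_review_days(dates))  # 3
```

The toy requester looks active for a full year under the first measure but was reviewed on only three distinct days, which is the inflation described above.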

Seeing all of these reviews we might ask ourselves how many reviews are being contributed by an elite group of very active reviewers and how much by a wider group?

Figure 5: The number of reviews written by reviewers at the time of writing a review.

From Figure 5 we can see that the median number of reviews written on any given date is around 100. This implies large-scale community involvement, with many reviewers contributing a significant number of reviews. We can also see that the mean is significantly higher than the median, and that the gap has grown more recently, implying that the distribution is skewed, with a few reviewers contributing a significant portion of reviews written.
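One way to build a series like the one in Figure 5 is to record, for every review, how many reviews its author had written up to and including that one, and then summarize those counts by date. The sketch below follows that reading of the caption, which is an assumption; the function name and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def experience_at_review(reviews):
    """Map each date to the running review counts of that day's reviewers."""
    written = defaultdict(int)   # reviews written so far, per reviewer
    by_date = defaultdict(list)  # running counts observed on each date
    for day, reviewer in sorted(reviews):
        written[reviewer] += 1
        by_date[day].append(written[reviewer])
    return by_date

# Toy reviews as (ISO date, reviewer) pairs.
reviews = [
    ("2015-01-01", "a"), ("2015-01-01", "b"),
    ("2015-01-02", "a"), ("2015-01-02", "a"), ("2015-01-02", "b"),
]
by_date = experience_at_review(reviews)
summary = {day: (median(c), sum(c) / len(c)) for day, c in by_date.items()}
```

When a few reviewers have very large running counts, the mean for a date is pulled well above the median, which is the skew described above.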

From the available evidence we can therefore confidently claim that Turkopticon has been successful at fulfilling its mission of providing a mechanism for workers to exchange information regarding requester quality. This, however, is not the only objective of Turkopticon and the workers who contribute to its database.

One of the major objectives of Turkopticon, at least as many workers see it, is to provide a platform by which workers can exchange information about requesters and use that information to gain leverage over them, ideally driving the pay rate upwards.

Workers seem to be targeting an ideal pay rate in the range of $11-$15 per hour, though in practice the effective pay rate seems to be much less than this. Turkopticon reviewers would also like to see the general quality of their working environment improve. In practice they would like their work to be reviewed quickly and rarely rejected and, when a problem arises, for requesters to communicate effectively and respectfully with them.

From our data we cannot directly observe pay rate. However, what we can observe is the four rating categories defined by Turkopticon: Pay, Fast (rapidity of accepting or rejecting submitted HITs), Fairness, and Communication. These categories allow for ratings between 1 and 5.

From the trends in these rating systems we can ideally infer how the Mturk workplace has changed for workers over time.
Figure 6: Mean reviews over time.

From Figure 6 we can see that mean ratings have changed significantly over time. In particular, the ratings for Fast and Fair have generally improved, while the ratings for Pay and Communication, except for a brief bump for Pay in 2013, have fallen ominously.

So what is driving these trends and does Turkopticon have anything to do with it?

There was a time when Mturk placed heavy restrictions on non-US workers, I suspect as a result of attempting to comply with federal taxation laws. This I suspect significantly restricted the supply of workers causing a short term rise in the system wage.

From our previous analysis we can see that Turkopticon is widely and actively used by many Mturk workers. Thus we consider how Turkopticon might be driving the trends that we are seeing.

My personal experience of interacting with workers through Turkopticon was quite unpleasant. I don't know if this is typical of other requesters using Turkopticon, but from talking with other academics who have used Turkopticon I suspect it is not unusual.

If this is the case, then Turkopticon provides an easy explanation for the falling wage: requesters drop out of the system after being targeted by a disgruntled, organized, and anonymous workforce. At the same time, many workers have claimed that the only reason they have continued to be active on Mturk is Turkopticon. Thus Turkopticon might be suppressing the going wage by driving requesters away while simultaneously retaining workers.

Ironic though predictable.

Increased attention by requesters to scales like Fairness and Fastness is also predictable as requesters grow more attentive to the non-monetary concerns of the workforce. Likewise, due to the growing boldness and hostility of workers organized through the Turkopticon platform, it should come as no surprise to anybody that requesters have opted out of investing in direct communication with workers.

I personally found the experience painful in the extreme as literally everything I said was turned against me in a kind of sadistic group think in which anonymous workers would take turns at antagonizing, humiliating, and threatening me.

My personal experience aside, I do not know what is driving the apparent fall in Mturk pay.

One might argue that pay and other factors have not meaningfully changed; instead, as reviewers have gained experience, they have had more to base their reviews on. This might be particularly true for the Pay scale, with some outspoken workers asserting that a Pay rating of 5 is only warranted when the effective hourly wage is over $12 per hour. This is a plausible explanation, as workers seem to learn about the rating scales from the input of other workers.

Figure 7: Completion rates of scales and bomb rates. An incomplete scale submission is a rating in which a reviewer does not provide a rating for a requester on that scale.

In Figure 7 we can see that 1-bombs and 5-bombs (all 1s or all 5s) peaked in previous years and have dropped off over time. This indicates that perhaps reviewers are learning the expectations of the system and conforming their reviews to match these standards. This supports the hypothesis that expectations for pay have changed.

Overall, we conclude that Turkopticon is an amazing success. It has brought workers together to exchange information about employers on an immense scale. The end results of collective worker action, however, have not been what was hoped for, though they were predictable: wages have decreased as requesters drop out of the system in response to collective antagonism organized through Turkopticon.
