Tell HN: Mechanical Turk is twenty years old today
MTurk was built by two two-pizza teams at AWS over the course of a year and launched on Nov 2, 2005. It took a few days for people to find it and catch on, but then things got busy.
At the time, AWS was about 100 people (when you were on call, you were on call for all of AWS), Amazon had just hit 10,000, S3 was still in private beta, and EC2 was a whitepaper.
What did you create with MTurk and the incredibly patient hard-working workforce behind it?
If there's any justice, a good number of comments will focus on the ethical nightmare MTurk turned out to be. Apologies to the people who worked on it, but it's fair and appropriate for observers to point out when someone has spent their time and energy creating something that is a net negative for the state of society. That's probably the case here.
If mturk workers had better opportunities, they'd take them. mturk is competing with local economies in low opportunity locales. It is rational to work in a cybercafe doing rote web tasks for 8 hours if you'd receive the same amount of money performing manual labor.
Happily, I can then state that we created nothing based on MTurk, as it had this negative ethical side to it from day one.
What do you see as net negative about it? I’m familiar with the product but not that aware of how it’s been used.
It's basically a way for people to externalize tasks that require a human but pay fractions of what it would cost to actually employ those humans.
Mechanical Turk was one of the early entrants into "how can we rebrand outsourcing low skill labor to impoverished people and pay them the absolute bare minimum as the gig economy".
Much of the low-skill labor was things like writing transcripts and turning receipts into plaintext, at a point when OCR wasn't reliable. There were also a few specialist tasks.
The gig economy was very much a net positive here. Some people used it to quit factory work and make twice the income; some used it as negotiation terms against the more tyrannical factories. Factories were sometimes a closed ecosystem here - factory workers would live in hostels, eat the free factory food or the cheap street food that cropped up near the area. They'd meet and marry other factory workers, have kids, who'd also work there. They were a modern little serfdom. Same goes for plantations.
Things like gig work and mturk were an exit from that. Not always leaving an unhappy or dangerous life, but making their own life.
If it paid badly, just don't work there. These things push wages down for this kind of work, but this work probably shouldn't be done in service economies anyway.
It's not a fraction of what it would cost to actually employ those humans, since there were humans who clearly chose to do that work when presented with the opportunity.
I think this is a very first-world-oriented take. It efficiently distributed low-value workloads to people who were willing to do them for the pay provided. The market was efficient, and the wages were clearly at a level the people doing the work found economical, considering they did (and still do) the work for the wages provided.
Yes, and "use the output of MTurk workers to make themselves redundant."
"probably". Care to provide reasoning or is this just a knee jerk reaction? Are you familiar with the service and how it works?
These are extraordinary claims (yea?). I'm sure there are great stories of opportunity creation and destruction - how could we even measure the net effect?
I used MTurk heavily in its hey-day for data annotation - it was an invaluable tool for collecting training data for large-scale research projects, I honestly have to credit it with enabling most of my early career triumphs. We labeled and classified hundreds of thousands of tweets, Facebook posts, news articles, YouTube videos - you name it. Sure, there were bad actors who gave us fake data, but with the right qualifications and timing checks, and if you assigned multiple Turkers (3-5) to each task, you could get very reliable results with high inter-rater reliability that matched that of experts. Wisdom of the crowd, or the law of averages, I suppose. Paying a living wage also helped - the community always got extremely excited when our HITs dropped and was very engaged, I loved getting thank yous and insightful clarifying questions in our inbox. For most of this kind of work, I now use AI and get comparable results, but back in the day, MTurk was pure magic if you knew how to use it to its full potential. Truthfully I really miss it - hitting a button to launch 50k HITs and seeing the results slowly pour in overnight (and frantically spot-checking it to make sure you weren't setting $20k on fire) was about as much of a rush as you can get in the social science research world.
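For anyone curious what "assign 3-5 Turkers per task and check inter-rater reliability" can look like in practice, here's a rough Python sketch, not the parent's actual pipeline: take the majority label per item and compute Fleiss' kappa as a sanity check on agreement. The data layout (item id mapped to a list of labels) is just an assumption for illustration.

    from collections import Counter

    def majority_and_kappa(labels_per_item):
        """labels_per_item: {item_id: [label, ...]} with the same number of raters per item."""
        categories = {l for labels in labels_per_item.values() for l in labels}
        n_items = len(labels_per_item)
        n_raters = len(next(iter(labels_per_item.values())))

        majorities = {}
        per_item_agreement = 0.0
        category_totals = Counter()
        for item, labels in labels_per_item.items():
            counts = Counter(labels)
            majorities[item] = counts.most_common(1)[0][0]
            category_totals.update(counts)
            # observed agreement for this item
            per_item_agreement += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))

        p_bar = per_item_agreement / n_items
        p_e = sum((category_totals[c] / (n_items * n_raters)) ** 2 for c in categories)
        kappa = (p_bar - p_e) / (1 - p_e) if p_e < 1 else 1.0
        return majorities, kappa

    labels = {
        "tweet_1": ["positive", "positive", "neutral"],
        "tweet_2": ["negative", "negative", "negative"],
        "tweet_3": ["neutral", "positive", "neutral"],
    }
    print(majority_and_kappa(labels))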
My wife had dozens, well probably over 100, handwritten recipes from a dead relative. They were pretty difficult to read. I scanned them and used mturk to have them transcribed.
Most of the work was done by one person - I think she was a woman in the Midwest; it's been like 15 years so the details are hazy. A few recipes were transcribed by people overseas but they didn't stick at it. I had to reject only one transcription.
I used mturk in some work projects too but those were boring and maybe also a little unethical (basically paying people $0.50 to give us all of their Facebook graph data, for example).
Do you think ChatGPT could do the same work now? It would be interesting to try it.
Almost 2 years ago I did this with ChatGPT. It was soon after you could feed it images as input IIRC. It worked very well. I settled on AWS Textract + ChatGPT to save money and was able to get it to well under 1 cent to take an image and turn it into a recipe you could export to Paprika (and others). I never pursued it further but it was a fun little side project.
At this point I don’t think I’d do the Textract step since LLMs have gotten way better and cheaper. Also you lose some info/context when the model only gets the post-OCR data.
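If anyone wants to try the "skip the OCR step, just send the image straight to the model" version today, a minimal sketch with the OpenAI Python SDK might look like the below. The model name, prompt, plain-text output, and filename are my assumptions, not what the parent actually used, and it needs OPENAI_API_KEY set in the environment.

    import base64
    from openai import OpenAI

    client = OpenAI()

    def transcribe_recipe(image_path):
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe this handwritten recipe. Return the title, "
                             "ingredients, and steps as plain text."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    print(transcribe_recipe("scanned_recipe.jpg"))  # hypothetical filename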
I used Gemini to decode and transcribe an old (and well-known) handwritten cursive letter. I couldn't read it at all. It managed to do this in a few seconds. I am not sure if it used an already available transcription or not. However, if not, it was amazing work.
I’ve run millions of jobs on MTurk.
For a major mall operator in the USA, we had an issue with tenants keeping their store hours in sync between the mall site and their own site. So we deployed MTurk workers in redundant multiples for each retail listing… 22k stores at the time, checked weekly from October through mid-January.
Another use case: figuring out whether a restaurant had OpenTable as an option. This also changes from time to time, so we’d check weekly via MTurk. 52 weeks a year across over 100 malls. Far fewer in quantity, think 200-300. But it’s still more work than you’d want to staff.
A fun, more nuanced use case: In retail mall listings, there’s typically a link to the retailer’s website. For GAP, no problem… it’s stable. But for random retailers (think kiosk operators), sometimes they’d lose their domain, which would then get forwarded to an adult site. The risk here is extremely high. So daily we would hit all retailer website links to determine if they contained adult or objectionable content. If flagged, we’d first send to MTurk for confirmation, then to client management for final determination. In the age of AI this would be very different, but the number of false positives was comical. Take a typical lingerie retailer and send it to a skin detection algorithm… you’d maybe be surprised how many common retailers have NSFW homepages.
Now, some pro tips I’ll leave you with:
- Any job worth doing on MTurk is worth paying a decent amount of money for.
- Never run a job once. Run it 3-5 times and then build a consensus algo on the results to get confidence (a sketch follows this list).
- Assume they will automate things you would not have assumed automated, and be ready to get some junk results at scale.
- Think deeply about the flow and reduce the steps as much as possible.
- Similar to how I manage AI now: consider how you can prove they did the work if you needed a real human and not an automation.
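The consensus tip, sketched out (my interpretation, not the parent's actual code): run each unit of work past several workers, accept the majority answer if agreement clears a threshold, and otherwise re-queue it for more assignments or manual review. The 2/3 threshold is an arbitrary assumption and the answers are assumed to be categorical.

    from collections import Counter

    def consensus(answers, min_agreement=2/3):
        """answers: the 3-5 worker responses for one unit of work (categorical)."""
        counts = Counter(answers)
        best, votes = counts.most_common(1)[0]
        confidence = votes / len(answers)
        if confidence >= min_agreement:
            return best, confidence    # accept this answer
        return None, confidence        # low confidence: re-run or review by hand

    print(consensus(["open", "open", "closed"]))   # ('open', 0.666...) -> accept
    print(consensus(["open", "closed", "same"]))   # (None, 0.333...)   -> re-queue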
The automation one is so true! When I first deployed a huge job to MTurk, with so much money on the line I wanted to be careful, and I wrote some heuristics to auto-ban Turkers who worked their way through the HITs suspiciously quickly (2 standard deviations above the norm, iirc) - and damn did I wake up to a BUNCH of angry (but kind) emails. Turns out, there was a popular hotkey programming tool that Turk Masters made use of to work through the more prized HITs more efficiently, and on one of their forums someone shared a script for ours. I checked their work and it was quality, they were just hyper-optimizing. It was reassuring to see how much they cared about doing a good job.
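For what it's worth, the "2 standard deviations above the norm" heuristic fits in a few lines; this is my reconstruction rather than the parent's code, and the worker IDs and timings are invented. It flags instead of auto-banning, and as the parent found out, "too fast" often just means hotkeys.

    import statistics

    def flag_fast_workers(median_secs_per_hit, n_sigma=2):
        """median_secs_per_hit: {worker_id: typical seconds spent per HIT}."""
        mean = statistics.mean(median_secs_per_hit.values())
        sd = statistics.stdev(median_secs_per_hit.values())
        cutoff = mean - n_sigma * sd
        return [w for w, secs in median_secs_per_hit.items() if secs < cutoff]

    pace = {"W1": 42, "W2": 45, "W3": 38, "W4": 50, "W5": 44,
            "W6": 41, "W7": 47, "W8": 39, "W9": 3}   # W9: hotkey power user, or a bot?
    print(flag_fast_workers(pace))                   # ['W9']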
We asked users to evaluate 300x300 pixel maps. Users were shown two images and had to decide which better matched the title we chose. Answers were something like "left", "right", "both same", "I don't know". Due to a misconfiguration the images didn't load for users (they only loaded on our internal network). Still, we got plenty of "left" and "right" answers. Random and unusable. Our own fault, of course.
During the beta, the only consistent HITs were to identify an album based on a picture and 4 or 5 choices (if I'm remembering correctly). These paid pretty well since the workforce volume was very low and the service was brand new. Well, I noticed that the image link contained an ASIN. So, I wrote a greasemonkey script that would look it up on Amazon and highlight the most likely correct answer. I then turned around and shared it with the forum I frequented. It became extremely popular and spread to other forums before we moved it to a private forum. The damage was already done though.
People kept asking me to automate it, but I felt it was against the spirit of mTurk. So, another member would take my updates and add an auto-clicker. That lasted for a couple of weeks at most before the HIT volume dried up and very few would be released. I guess Amazon caught on to what was happening. But before that, several forum members made enough to get some high dollar items: laptops, speakers, etc. Eventually, I relented and created a wishlist. That's how I ended up with the box sets for the first run of Futurama seasons.
I have looked at MTurk many times throughout my career. In particular my previous company had a lot of data cleaning, scraping, product tagging, image description, and machine learning built on these. This was all pre-LLM. MTurk always felt like it would be a great solution.
But every time I looked at it, I talked myself out of it. The docs really downplayed the level of critical thinking we could expect; they made it clear that you couldn't trust any result to even human-error levels, and that you needed to run each task 3-5 times and "vote". You couldn't really get good results for unstructured outputs; instead it was designed around classification across a small number of options. The bidding also made pricing hard to estimate.
In the end we hired a company that sat somewhere between MTurk and fully skilled outsourcing. We trained the team in our specific needs and they would work through data processing when available, asking clarifying questions on Slack, and referencing a huge Google doc we had with various disambiguations and edge cases documented. They were excellent. More expensive than MTurk on the surface, but likely cheaper in the long run because the results were essentially as correct as anyone could get them and we didn't need to check their work much.
In this way I wonder if MTurk never found great product market fit. It languished in AWS's portfolio for most of 20 years. Maybe it was just too limited?
Using the Prosper.com data set (a peer-to-peer lending market), I used MTurk to analyze the images of people applying for a loan. This was used in a finance research project with 3 University of Washington finance professors.
The idea was that the Prosper data set contained all of the information that a lending officer would have, but they also had user-submitted pictures. We wanted to see if there was value in the information conveyed in the pictures. For example, if they had a puppy or a child in the picture, did this increase the probability that the loan would get funded? That sort of thing. It was a very fun project!
Paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1343275
Yikes. Have you ever considered that you were hurting people?
Yeah, whenever there are human subjects there is an IRB which is necessary. But, beyond that, we didn't participate in the market in any way. We wanted to see if there was bias there, and how much of it. I think I may have used the word 'value' in a bad way in my description. Not 'value' as in 'can we exploit people?' but value as in statistical significance. E.g. if you applied for a loan and your profile picture contained yourself with a child, did that help you, hurt you, or was it neutral?
How so? Read the paper. The methodology was entirely observational. They did not intervene in the prosper.com loan market or interact with the borrowers. If anything, the paper identified a form of bias that exists in the real world, namely that people commonly "perceived" as less trustworthy are penalized despite their actual creditworthiness.
The paper is a study of an existing market. They looked at data about people who had requested loans and data about which of those loans were funded, with the intent of seeing whether or not lenders were being biased by requester photos. They found that they were.
Say more about how studying that bias is hurting people?
Several times, I had MTurk workers transcribe a yearly printed pricing catalog that was a boon to our small business. Inconsistently-structured tabular data intermixed with pictures that OCR of the day did a terrible job with.
Later, we needed to choose the best-looking product picture from a series of possible pictures (collected online) for every SKU, for use in our website's inventory browser. MTurk to the rescue--their human taste was perfect, and it was effortless on my part.
Neither of these were earthshattering from a tech perspective, and I'm sure these days AI could do it, but back then MTurk was the perfect solution. Humans make both random and consistent errors and it was kinda fun to learn how to deal with both kinds of error. I learned lots of little tricks to lower the error rate. As a rule, I always paid out erroneous submissions (you can choose to reject them but it's easier to just pay for all submissions) and just worked to improve my prompts. I never had anyone maliciously or intentionally try to submit incomplete or wrong work, but lots of "junk" happens with the best of intentions.
Used it to capture respondent data for a unique research tool we run. Got good results. Had to custom-code all the server/client interactions to handle MTurk's requirements. Went well. Still use the content from MTurk users as a demo of "...how to get unique insights from your consumers". As things progressed, stopped using it. However, all our project setup/server/client code still has variables/functions that start with mturk_. Not causing any issues, so there they sit. I feel guilty every time I think about not having cleaned up the code. BTW: just added new custom code for Prolific. Hoping to test their respondents this week. However, Prolific's effect on the code was nothing compared to interacting with MTurk's servers.
I'm a software developer, but I took a brief career break in 2011 to try B2B sales for an ISP. I was the only sales rep with experience as a developer, so I was always looking for ways to use my software skills to get an edge as a sales rep.
The most valuable prospects were businesses in buildings where we had a direct fiber connection. There were sites online that purported to list the buildings and leads that the company bought from somewhere, but the sources were all really noisy. Like 98% of the time, the phone number was disconnected or didn't match the address the source said, so basically nobody used these sources.
I thought MTurk would be my secret weapon. If I could pay someone like $0.10/call to call business and confirm the business name and address, then I'd turn these garbage data sources into something where 100% of the prospects were valid, and none of the sales reps competing with me would have time to burn through these low-probability phone numbers.
The first day, I was so excited to call all the numbers that the MTurk workers had confirmed, and...
The work was all fake. They hadn't called anyone at all. And they were doing the jobs at like 4 AM local time when certainly nobody was answering phones at these businesses.
I tried a few times to increase the qualifications and increase pay, but everyone who took the job just blatantly lied about making the calls and gave me useless data.
Still, I thought MTurk was a neat idea and wish I'd found a better application for it.
I was working for a marine biologist applying computer vision to the task of counting migrating river herring. The data set was lousy, with background movement, poor foreground/background separation, inconsistent lighting, and slow exposures that created blurry streaks. Moreover, the researchers had posted the frames to some citizen science platform and asked volunteers to tag each fish with a single point—sufficient for counting by hand, but basically useless for object detection.
In desperation I turned to Mechanical Turk to have the labels redone as bounding boxes, with some amount of agreement between labelers. Even then, results were so-so, with many workers making essentially random labels. I had to take another pass, flipping rapidly through thousands of frames with low confidence scores, which gave me nausea not unlike seasickness.
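A tiny illustration of the "some amount of agreement between labelers" step, not what was actually run on the herring frames: score the overlap between two workers' boxes with intersection-over-union and flag low-agreement frames for a second pass. Pairing boxes by index and the 0.5 threshold are simplifying assumptions (real box matching is fussier than this).

    def iou(a, b):
        """a, b: boxes as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    # Two workers labeled the same frame; flag the frame if their boxes disagree.
    worker1 = [(10, 10, 50, 30), (60, 40, 90, 60)]
    worker2 = [(12, 11, 52, 31), (200, 10, 230, 30)]
    scores = [iou(a, b) for a, b in zip(worker1, worker2)]
    needs_review = any(s < 0.5 for s in scores)
    print(scores, needs_review)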
I'm not a participant nor creator, just remembering: "Bicycle Built for Two Thousand" recreated IBM's "Daisy Bell" by asking each person to take a short snippet and sing the part: https://youtu.be/Gz4OTFeE5JY
Delightful.
Hand drawn pictures of mushrooms (to later compare to the 118 carvings on Stonehenge)
Say more!
When we first created Domainr (then domai.nr, now domainr.com) back in 2008, we needed a list of “zones under which domain registrations were somehow possible.” E.g. not just the root zone list from IANA, but all the .co., .edu., .net., etc. variants. We found what we could from Wikipedia, and used Mturk to find the rest from registry websites, etc.
It wasn’t perfect, but it didn’t need to be. We essentially needed a “good enough to start with” dataset that we could refine going forward. It got the job done.
I used it to run UX surveys
https://web.archive.org/web/20170809155252id_/http://kittur....
I used it when Dropbox came out to get the max 16GB storage. Only cost me a few bucks too.
Nothing of actual utility:
https://github.com/cole-k/turksort
I worked at two crowdsourcing companies built on MTurk. Humanoid and CloudCrowd. Had millions of tasks go through from transcriptions to labeling.
I used it to have our daycare reports transcribed: 2 kids x 5 years x 180 days a year worth of extensive, wonderful daily write-ups.
I assisted many professors with data collection for their research in grad school. Later I also collected data for a couple of my own papers. mTurk was very popular in the beginning, in large part due to its low cost. Then one day they jacked up their commission so much that it was no longer attractive to me. Also, the response/task quality went down significantly. My last time using it was in 2018 for a large-scale image labeling task. After doing a pilot run, I concluded I was getting garbage. I went to some other vendor and never returned to mTurk after that.
Never used it since it was only available in the US. Looks like additional countries were not added till Oct 2016, 11 years after it was first launched[0]
None of the companies I've worked for have used it AFAIK, despite them all using AWS. I think I've mostly ignored it as one of the niche AWS products that isn't relevant.
[0] https://blog.mturk.com/weve-made-it-easier-for-more-requeste...
I remember participating in the workforce early on transcribing really bad audio recordings along with the cheap survey type stuff. It was pretty neat back in the day.
I was in school and automated some MTurk HITs to make a small amount of money.
I supported data curation on it in the beginning, but it became a popular way to exploit labor. I really love the idea, but the main value comes specifically from taking advantage of wealth inequality. I really support MTurk and the hard workers on it, but I also cannot ignore the negatives.
Tried it once or twice but never was worth the hassle and the feeling of exploiting other people.
This sounds so much like a PR piece from Amazon.
I did a project where I had them write poems and draw pictures about and of robots.
I got about 100 of each.
Has anyone used an LLM to run MTurk HITs, and make more money on MTurk than paid out to an LLM vendor?
Scale AI was a wrapper around MTurk /s
With LLMs, I think we will finally have the missing piece needed to make something like MTurk work at scale.
Bad data or false work was a big problem on MTurk, but now LLMs should be able to act as reasonable quality assurance for each and every piece of work a worker commits. The workers can be ranked and graded based on the quality of their work, instantly, instead of requiring human review.
You can also flip the model and have LLMs do the unit of work, and have humans as a verification layer, and the human review sanity checked again by an LLM to ensure people aren’t just slacking off and rubber stamping everything. You can easily do this by inserting blatantly bad data at some points and seeing if the workers pick up on it. Fail the people who are letting bad data pass through.
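A toy sketch of that honeypot check, purely illustrative: mix known-answer "gold" items into the batch, then fail any reviewer whose accuracy on the gold items falls below a threshold. The task structure, 10% gold rate, and 0.9 cutoff are all assumptions.

    import random

    def build_batch(real_items, gold_items, gold_rate=0.1):
        """Mix known-answer 'gold' items into the real work, shuffled."""
        n_gold = max(1, int(len(real_items) * gold_rate))
        batch = real_items + random.sample(gold_items, n_gold)
        random.shuffle(batch)
        return batch

    def grade_reviewer(responses, min_accuracy=0.9):
        """responses: list of (item, verdict); gold items carry an 'expected' verdict."""
        gold = [(item, verdict) for item, verdict in responses if "expected" in item]
        if not gold:
            return True  # no honeypots seen yet
        accuracy = sum(v == i["expected"] for i, v in gold) / len(gold)
        return accuracy >= min_accuracy

    real = [{"id": i} for i in range(20)]
    gold = [{"id": "g1", "expected": "reject"}, {"id": "g2", "expected": "approve"}]
    batch = build_batch(real, gold)
    responses = [(item, "approve") for item in batch]  # a reviewer who rubber-stamps everything
    print(grade_reviewer(responses))                   # False: the 'reject' honeypot got approved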
For a lot of people, I think this will be the future of work. People will go to schools to get rounded educations and get degrees in “Human Cognitive Tasks” which makes them well suited for doing all kinds of random stuff that fills in gaps for AI. Perhaps they will also minor in some niche fields for specific industries. Best of all, they can work their own hours and at home.
I, uh, you do understand the causality issues here? I'm reminded of the Onion headline "Tab of LSD feeling a lot of pressure from tech worker to come up with new ideas"