Google Acquires reCAPTCHA
Google announced this morning that it would be acquiring reCAPTCHA. This exciting acquisition makes a lot of sense. One of the original reasons (if not the original reason) for developing CAPTCHA was to ensure the person on your website or application was human. BOTS would crawl through the internet and create posts, forum topics, etc., to distribute SPAM and BOTS across sites. CAPTCHA would step in by displaying an image which (usually) only a human could decipher, and then asking them to type in a word or several words to prove they were HUMAN.
This is where the brilliance of this acquisition by Google comes in. As people are typing out these CAPTCHA images to register on Facebook or create a GMAIL account, etc., reCAPTCHA can display words scanned from old newspapers and books that need to be converted to text. This process gives the company that hires reCAPTCHA a free workforce to do easy (but labor-intensive) work!
Since Google has been in the business of INDEXING the world, including old Books and Magazines, it will now have access to a FREE workforce to help translate and perfect its OCR technology. Like I said, brilliant.
bookwormJR
September 17, 2009 at 8:56 am
Wow that really is brilliant, for two reasons.
1. That they were able to convert scanned images from old literature into captcha text.
2. They’ll get people to convert it to type for free by doing something nearly every site out there requires now.
MrGroove
September 17, 2009 at 9:04 am
@bookwormJR,
Exactly – the reCAPTCHA site has some good info on it regarding the mass scale of CAPTCHA today – http://recaptcha.net/learnmore.html
Here’s an excerpt which details the sheer scale of CAPTCHA and the possible workforce available for these puzzles / translation:
shockersh
September 18, 2009 at 8:50 am
Aye… pretty smart. I think I want some cash out of the deal if I’m gonna be doing OCR for them! :) :)
rengger
May 24, 2012 at 6:08 am
Well, some people around the internetz started an funny little operation they call reNigger. Basically, its pretty obvious which world is the control world, so they type that right and then just type ‘nigger’ on the word to be digitalized.
I thought that was rather fun thing to do, so I tested it out a bit. Works.
I wonder how long it will take before, by chance, enough people type “nigger” instead of the word to be digitalized, so that reCaptcha translates a word incorrectly…
Stupid
May 3, 2010 at 11:22 pm
It seems like the only reason they’d need you to translate it, is if they didn’t know what it said. And if they didn’t know what it said, how would they be sure you typed in the right thing? You could type in whatever you wanted and the only consequence would be a mistranslated online book somewhere.
Is there something I’m missing here?
MrGroove
May 3, 2010 at 11:34 pm
Yes, however, when reCaptcha shows you a “puzzle” to solve (or translate), two words are presented to you. One of the words is a control word for which they already have the answer. The second word is the word they need you to figure out. Now, since you don’t know which is the control word, you must type both words.
If you get the control word wrong, they make you try again. If you get the control word right then they assume you tried your best on the 2nd word as well which is what they need. My guess is they do this several times for each word or “text” they need deciphered or translated and if several people all type the same thing, they add that into the database and move on. FREE LABOR! :)
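The control-word idea described above can be sketched in a few lines of Python. This is only an illustration of the guess in this comment, not reCAPTCHA's actual code: the names `UnknownWord` and `check_answer` and the agreement threshold of 3 are all made up for the example.

```python
from collections import Counter

# Hypothetical threshold: how many matching guesses before a word is accepted.
AGREEMENT_THRESHOLD = 3


def normalize(word):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


class UnknownWord:
    """A scanned word OCR could not read, plus the guesses collected so far."""

    def __init__(self):
        self.votes = Counter()
        self.accepted = None

    def record(self, guess):
        """Count one guess; accept the word once enough people agree."""
        guess = normalize(guess)
        self.votes[guess] += 1
        if self.accepted is None and self.votes[guess] >= AGREEMENT_THRESHOLD:
            self.accepted = guess


def check_answer(control_answer, control_typed, unknown_typed, unknown_word):
    """Pass or fail the user based on the control word only.

    The guess for the unknown word is recorded only when the control
    word was typed correctly.
    """
    if normalize(control_typed) != normalize(control_answer):
        return False  # user must try again; the guess is discarded
    unknown_word.record(unknown_typed)
    return True


# Usage: three users get the control word right and agree on the unknown word;
# one user misses the control word, so that guess is ignored.
word = UnknownWord()
results = [
    check_answer("morning", "morning", "steamboat", word),
    check_answer("morning", "mourning", "gibberish", word),
    check_answer("morning", "Morning", "steamboat", word),
    check_answer("morning", "morning", "Steamboat ", word),
]
print(results)        # [True, False, True, True]
print(word.accepted)  # steamboat
```

Note that a single user typing gibberish for the unknown word only adds one stray vote, so it takes several people agreeing on the same wrong answer to get it accepted.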
hifhgfn
May 4, 2010 at 1:18 pm
Thank you for that comment, it was exactly what I was looking for.
Pure
May 7, 2010 at 3:00 pm
not 2 minutes later, i got a Captcha myself.. tested this theory. apparently “stadddddddd” was the correct first word and typed the second normally. now that i know this, going to make it personal and make my free labor worth my time (and possibly mistranslate a few newspapers along the way)
MrGroove
May 9, 2010 at 9:44 pm
@pure – Exactly! You get what you pay for right!
Here’s a good one for ya – some sweat shops overseas have started a new trend… Spammers are having a hard time solving captchas to post their crap on forums and other websites, i.e., they have to solve the captcha before they can register or post a comment on the blog etc…
So what they do is have low wage employees just sit all day and solve captchas so that they can continue to spam the world.
Another thing they do is take a legit site, hook into it and have legit users solve Captchas. Now the problem here is the Captchas are actually from websites which the spammers are trying to post on. Wild stuff….
Jack
May 9, 2010 at 12:04 pm
“One of the words is a control word which they already have the info. The second word is the word they need you to figure out.”
This would work if the user couldn’t distinguish between the two. 90% of the time you can.
Every time I see one of these things I always write gibberish for the OCR word.
MrGroove
May 9, 2010 at 9:36 pm
Can’t say I disagree with ya. Don’t use me as a slave and expect great results. Now reward me in some way and you can expect high quality.
Perhaps that’s why they probably require a few dozen??? people to solve the captcha with the same word?
Matty
May 4, 2010 at 11:10 am
I just wanna thank Stupid for asking the above question, and MrGroove for answering it. I was wondering the exact same thing. (And I would hope that they do this several times for each word…)
So who are the grunts that have to scan in all these documents in the first place?
MrGroove
May 4, 2010 at 1:28 pm
@Matty – ROBOTS! I saw a youtube video somewhere a long time ago that showed how Google was using an automated system to scan the books and newspapers from libraries and universities etc… I can’t find it so I think it might have been a 60 minutes??? Anyway, here’s a link I had in my Delicious account – http://is.gd/bU6qX
It explains some of the details. If you find a better link, would really appreciate it if you could drop another comment with it! Thanks!
Ryan
May 10, 2010 at 1:18 pm
What’s confusing though is, if they have the tech to pull individual words from scanned images and convert them into individual images… why do they not have the tech to convert it into text automatically? I thought that technology had been around for a while.
MrGroove
May 10, 2010 at 1:53 pm
@Ryan – From what I’ve read, OCR (Optical Character Recognition) technology is getting better and better with each new iteration. However, when the OCR scanner cannot recognize the image/word, that’s when it goes into the database for confirmation / recognition.
Using humans, you not only get a free workforce but you also get more accurate translation from paper to digital. According to the ReCaptcha website, over 200 million captchas are solved every day, so my guess is the OCR is going to get better and better very quickly as the OCR database grows and grows…
Estey
May 12, 2010 at 2:43 pm
Why would you want to type in rubbish for one of the words? Taking old newspapers and magazines and books and making them searchable is a desirable thing for researchers. Why would you want to screw that up? It takes more of your “free labor” time to type it in incorrectly (you have to think about it and make a decision which is the “right” word) than to type it in correctly (you don’t have to take any time to think about it, just type what it says). Google might make money off of it? Yeah, so what? The process also serves a valid purpose by helping foil robots and spammers.
Matty
May 12, 2010 at 8:51 pm
It’s nice to see that not everyone here is an anarchist…
MrGroove
May 12, 2010 at 9:04 pm
I appreciate the comment and this side of the viewpoint. It reminds me of the thousands of photos sitting in albums from my parents which are almost impossible to look through and index in any great way. If I could find a way to 1: get them scanned for a low cost and 2: benefit the public at the same time, that’s a win win situation.
Many old books and newspapers out there have a lot of value but they are stored up in old libraries and “locked away” basically. By having them scanned and available through a search tool like Google, they will once again be available for the benefit of us all (and possibly the detriment..??). Either way, I appreciate your comment to the discussion.
Kie
May 14, 2010 at 10:48 pm
Wait, so we don’t need to get those captcha words right? So now that everyone knows this, we can just mess with google and misspell each word (since the program doesn’t know what the words are) :D
MrGroove
May 14, 2010 at 10:59 pm
@Kie – That’s one way of looking at it sure. But another is that in a positive way, it is keeping the boards free from SPAM so it’s not all that bad…
Octavian
July 28, 2010 at 5:43 pm
Hey I found what the supposed book was. It’s this
http://www.nytimes.com/1865/08/10/news/serious-railroad-accident-norwich-steamboat-train-runs-off-track-one-passenger.html
!!! I just googled the keywords I saw. Ta-da. No hands!
sanjana sanjana
November 2, 2010 at 12:48 am
because it very bad
khashayar
April 10, 2011 at 8:14 am
got a question here:
how does the captcha application know whether the text entered by the user is correct or wrong when they actually don’t know what the text is and they just scanned it like a graphic file?
H. Winterman
September 12, 2011 at 6:47 am
People on the internet are so astoundingly stupid it makes my mind explode. The 3rd comment asked, and got an accurate answer to, the question that khashayar posted, 11 months before. Gah
Peter
November 3, 2011 at 11:43 pm
This is what I don’t understand – how does reCAPTCHA know if the word(s) you entered are correct if they don’t have it in converted, electronic form? I mean, if they have to check it against the source word (i.e. the one you have to type), then they already have it in electronic form, right?
Tim
December 30, 2011 at 9:07 pm
Tard. This has already been asked (and answered) in the comments above.