The lords of search over at Google recently announced an interesting new feature for PDFs created from scanned pages.
Searchable PDF files are nothing new - and neither are searchable PDF files produced from scanned pages. Simply run OCR and voila - your scanned PDFs are now searchable.
But let's say you didn't OCR your files. Maybe you didn't want to take the time, maybe it's impractical, or maybe you didn't even WANT your files to be searchable (my legal friends should take note here).
Too bad!
Post those PDFs on a publicly accessible site and now Google will OCR and index them for you, no extra charge.
I'm sure there are some limits here. Google isn't saying, but I'm guessing it won't download a 500 MB PDF just to discover that there's no text to index.
I'm also unsure about the quality of the OCR. I'd have to believe that it's super-quick and therefore less than super-accurate, but then again, Google has computing resources that defy my paltry imagination, so no bets there either.
I'll be running some tests before long, but I'm curious to know what you think.
Do you WANT your scanned PDFs indexed by Google? Are you tempted to post oceans of scanned content online? Or is this a big yawn, something you thought Google was doing all along, so what's the big deal?
Comments
Rowan,
I've been toying with the question of whether or not this development will eventually make 'local' OCR obsolete. After all, with server-side OCR, the more accurate it gets, the better search results will get, and no one (except Google) has to do a thing except feed scans onto the web.
Hmm.
That's a good point about Google Webmaster. Google should make clearer exactly what its policy is on processing PDFs: size limitations (if any), text-volume limitations (if any), the extent to which structure is used, and so on. At the same time, it should tell users how to "block" their PDFs from indexing, explain its policy on PDF metadata, and so on. There are many ways in which PDFs could really be handled "right" - I'd like to see it.
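On the "block" question, one option that already exists is the `X-Robots-Tag: noindex` HTTP response header, which Google honors for non-HTML files such as PDFs. Below is a minimal sketch of the idea, not a recipe for any particular web server: the function name `extra_headers_for` is hypothetical, and it assumes your server or framework lets you attach extra headers to each response.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def extra_headers_for(url_path):
    """Return extra response headers for a requested URL path.

    Hypothetical helper: PDFs get "X-Robots-Tag: noindex" so that
    crawlers which honor the header leave them out of the index.
    Every other file type gets no extra headers.
    """
    # Drop any query string or fragment before checking the extension.
    path = urlsplit(url_path).path
    if PurePosixPath(path).suffix.lower() == ".pdf":
        return {"X-Robots-Tag": "noindex"}
    return {}
```

The header only helps if the crawler is allowed to fetch the file and see it, so it should not be combined with a robots.txt rule that blocks the same PDFs from being crawled.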
In the past week I have seen at least one or two PDFs showing up in my search results. Generally I have found the experience to be pretty good -- the PDFs are usually white papers, academic papers, data sheets, etc., and so contain useful information.
It should be a little bit concerning for less tech-savvy people who maybe aren't sure how Google's indexing works. It would be nice if Google were to add something to Google Webmaster that told you if any PDFs (or other file types) from your site were being indexed.