Archive for September, 2006

A PDF Perspective on Google Book Search

Friday, September 29th, 2006

A lot of people sat up and took notice when Google announced their book-scanning initiative.  And not for nothing; when a company as powerful and innovative as Google says they are going to do something, it’s usually worth watching.

Per my earlier promise, I’ve been sniffing around this new Google site.  From the PDF Perspective, then, a brief review of Google Book Search.

Background

The end-product of a massive scanning project, Google Book Search is intended to eventually span millions of books.  For many works in the public domain, Google makes complete cover-to-cover scans of the book available to users as images in an online viewer and also… you guessed it, as a PDF.

The Imaging Work

Overall, the scanning quality is average, perhaps very slightly above average.  The black and white pages from each book are captured with JBIG2 compression, and are overlaid by a clever grayscale “screen” to produce the “patina” of an old document.  Nice touch - it keeps the file-size very low indeed while preserving at least some of the “atmospherics” of an old book.  Google managed to suppress edge-artifacts for the most part, but I’ve certainly noticed errors which should have been caught during imaging… about 1 in 300 pages or so has a boo-boo of some sort.  Not too bad, but not too good either.   For the price they are doubtless paying (and charging) for the service, I’m sure Google thinks it’s just fine the way it is.

Google’s Book Viewer

This gadget displays an image of each page in your browser window, complete with buttons to move forward or backwards through pages, or to go to a specific page. If you’re looking at the page as the result of a text-search, your search-term is highlighted, although this works less well than it should - the highlight is usually “off”.

The book’s own Table of Contents is provided via adjacent links, as is information about the publisher and current editions available in print.

The downloadable PDF files

The first thing to say about the files I’ve downloaded from Google Book Search is that they are very “lightweight” - from 8 to 20 kb per page in size for “black and white” pages. Very nice… but in their zeal to produce the SMALLEST possible PDF files, the Googlistas left something important (actually, several somethings) OUT.

  1. There’s no searchable text!  Users who want to locate a word or phrase are out of luck. OK, they want you to do your searching online, not offline… fair enough.  But if you were thinking about doing something offline that involves text search or extraction, you better reconsider.
  2. The OCR engine used to generate the text needed to support the full-text search feature online is so-so at best.  I suspect it was selected for speed and robustness rather than quality.  In fact, I’ll go further, and guess that Google wrote their own OCR engine.  Either way, they could have done better.
  3. There aren’t any bookmarks!  Users who might prefer to actually NAVIGATE a 300 page book rather than simply turn pages are also… you guessed it… out of luck.
  4. Since they don’t include text, the files aren’t (can’t be) tagged, and are completely inaccessible to disabled users.
  5. File properties are left at Acrobat defaults.  Clearly the presentation of the PDF (i.e., the end-user experience) doesn’t overly concern the Googlistas.
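
For readers who want to check a downloaded file against the list above, here is a rough sketch in Python. It is a heuristic only: it scans the raw bytes for a few dictionary keys defined by the PDF format (/Font, /Outlines, /StructTreeRoot) and will miss anything stored inside compressed object streams. The function names and marker labels are my own, not part of any library.

```python
import re

# Keys defined by the PDF format; treating their mere presence in the raw
# bytes as evidence of a feature is a rough assumption, not a real parse.
MARKERS = {
    "text layer (fonts)": rb"/Font\b",       # image-only pages need no fonts
    "bookmarks": rb"/Outlines\b",            # document outline, i.e. bookmarks
    "tagging": rb"/StructTreeRoot\b",        # structure tree used by tagged PDF
}


def inspect_pdf_bytes(data: bytes) -> dict:
    """Return {label: True/False} for each marker found in the raw bytes."""
    return {
        label: re.search(pattern, data) is not None
        for label, pattern in MARKERS.items()
    }


def inspect_pdf_file(path: str) -> dict:
    """Read a file from disk and run the same byte-level scan."""
    with open(path, "rb") as handle:
        return inspect_pdf_bytes(handle.read())
```

Calling inspect_pdf_file on a downloaded book (the path is whatever you saved it as) returns a small dictionary of True/False flags. A False for "text layer (fonts)" is a strong hint, not proof, that offline search will come up empty.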

Overall, the service is, of course, free, so whining about it most likely won’t change anything.  It’s a good thing too… I recently found a fascinating “Glossary of Words Pertaining to the Dialect of Mid-Yorkshire” from the 1870s.

If I could ask them to change ONE thing, it would be this: It’s clear that Google is capturing the necessary metadata when they scan the book (how else do they create links for a table of contents on their site?), so it’s really mysterious why they don’t go ahead and slap that data into each PDF in the form of Bookmarks. Who knows?  If Google Googles this blog post, maybe they’ll fix it!
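
To give a sense of how little data bookmarks require, here is a minimal sketch in Python of turning a flat table of contents into PDF outline dictionaries. It is not Google's process, and it is not a complete PDF writer: the function names, the sample titles and the object numbers are all hypothetical, titles are assumed to be plain ASCII, and the resulting objects would still have to be written into the file and referenced from the catalog's /Outlines entry.

```python
def pdf_string(text: str) -> str:
    """Escape an ASCII title as a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def build_outline_objects(toc, page_refs, first_obj_num):
    """Build a flat (one-level) outline.

    toc           -- list of (title, zero-based page index) pairs
    page_refs     -- object number of each page object, in page order
    first_obj_num -- object number to use for the outline root (assumed free)

    Returns {object number: dictionary text}.
    """
    if not toc:
        return {}
    root = first_obj_num
    item_nums = [root + 1 + i for i in range(len(toc))]
    objects = {
        root: (
            f"<< /Type /Outlines /First {item_nums[0]} 0 R "
            f"/Last {item_nums[-1]} 0 R /Count {len(toc)} >>"
        )
    }
    for i, (title, page_index) in enumerate(toc):
        parts = [f"/Title {pdf_string(title)}", f"/Parent {root} 0 R"]
        if i > 0:  # every item but the first links back to its predecessor
            parts.append(f"/Prev {item_nums[i - 1]} 0 R")
        if i < len(toc) - 1:  # every item but the last links forward
            parts.append(f"/Next {item_nums[i + 1]} 0 R")
        # Destination: jump to the page and fit it in the window.
        parts.append(f"/Dest [{page_refs[page_index]} 0 R /Fit]")
        objects[item_nums[i]] = "<< " + " ".join(parts) + " >>"
    return objects
```

Each bookmark is just a title, a page reference and links to its neighbours, which is the same information a table-of-contents link on a web page already carries.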

Reader can Save: A New Day Dawns for PDF

Monday, September 18th, 2006

With Acrobat 8, everything changes

Reader Save!
A PDF form enabled for Reader Save in Acrobat 8 Professional may now be completed and SAVED using the free Adobe Reader!

From the introduction of forms technology to PDF nine years ago until today, users with the free Adobe Reader could certainly fill out and print a PDF form (if, in fact, it included form-fields), or submit the form to a server, but that was the limit. Could they save their work along the way? No. Could they fill out part of a form, and pass it to a coworker to check over and complete? Nope.

PDF forms offer an easy yet sophisticated way to move existing business-processes from paper to the computer without losing the connection with paper workflows. This capacity was intentionally hobbled in the free Adobe Reader, sending most users to the printer once they’d filled-out a form. Quite apart from end-user frustration, the limitation effectively precluded implementation of PDF forms in many distributed applications where end-users could not be expected to own Adobe’s $300 Acrobat Standard software.

PDF forms exploded nonetheless. From the IRS to the smallest non-profit, organizations everywhere found myriad ways to use PDF forms, Reader Save or no. The ability to add typed text to a form that would faithfully reproduce itself when printed was an obvious winner.

Naturally, almost as soon as the forms capability was introduced to PDF in Acrobat, users and third-party developers alike began asking for (nay, demanding!) the ability to save completed forms to the user’s own computer using the free Reader. The absence of the feature was (rightly) regarded as the single biggest barrier to wholesale implementation of PDF forms. Adobe Systems understood this, but also understood that Reader Save had major revenue potential, and thus were in no hurry to give it away for free.

After an abortive attempt at a low-cost “Reader + Save” product called Acrobat Approval (junked to howls of protest from 3rd party PDF developers), Adobe Systems faced the demand for “Reader Save” capability with the development of the Adobe LiveCycle Reader Extensions Server (ARES), the basic purpose of which is to “bless” PDF files with various “extended rights” - including the ability to be saved with Adobe Reader.

Acrobat 8

ARES remains very, very expensive, and the typical customer is a large corporation or government agency with a major forms headache and a server software budget in the high five figures. The lack of an affordable Reader Save solution helped foster the so-called “Acrobat Alternatives”, including ARTS Nitro PDF, Nuance’s PDF Converter and Global Graphics’ JAWS PDF Editor. Besides replicating many of the most popular functions in Adobe Acrobat Standard and Professional, these lower-priced products allow users to fill and SAVE a form right there on their own computer.

And then, late last year came word of Microsoft’s foray into PDF creation. Ouch. So what does Adobe do? It was time for the heavy artillery.

Adobe’s Response

The Acrobat Alternatives and Microsoft’s PDF software exist only because Adobe Systems elected to publish the PDF Reference. This move made it possible for any sufficiently competent software developer to create and edit PDF files without any Adobe software. This was, in a sense, a calculated risk. The move could spawn competitors to Acrobat, but on the other hand, a world awash in PDF (from whatever source) could only be a good thing.

What Adobe did NOT give away, of course, is the code for the free Adobe Reader. This ubiquitous software, installed on hundreds of millions of computers worldwide, is Adobe’s “special sauce”, for only they can build features into Reader that PDF files can unlock.

Even with all of the advanced capabilities in Acrobat 7, most people still buy the software because it can make PDFs, period. The “higher” capabilities of the PDF format barely register for most developers and decision-makers, and are rarely utilized.

Adobe had to change that, or risk increasing peril to the Acrobat franchise. With the announcement of Acrobat 8, Adobe can (and I believe, will) move beyond the perception that “Acrobat is for making PDFs, Reader is for Viewing PDFs”. The ability to add Reader Save capabilities to PDF files creates a compelling reason to purchase Adobe’s own desktop software for creating and managing PDF files - Adobe Acrobat and Acrobat Professional - before any others. Awareness of, interest in, and adoption of PDF as an electronic document in its own right, not merely as a conveyance for a consistent printout, are about to take off.

Four more articles

Monday, September 11th, 2006

Over the past couple of months, the editors of acrobatusers.com have posted another four articles by yours truly.  You’ve probably read them already, but for search-engines and others who appreciate the author’s crude attempts at self-promotion, here’s a very brief summary:

Digital Signatures in Acrobat:  The good, the bad and the ugly.  This one was somewhat contentious; not everyone at Adobe felt that this article reflected Acrobat 7.0 in the, ahh, best possible light.  Too true!  Let the marketing haze around “digital signatures” dissipate, and what’s left is the software equivalent of a Mondrian painting - theoretically plausible, but otherwise utterly inscrutable.  Read the piece if you want to know more, but you’ve been warned - it does have a significant soporific effect.

Maximizing PDF usability is a short piece devoted to the idea that PDFs are more than the paper on which you (might) eventually print them.

Acrobat Bookmarks:  Why and how.  Who knew PDFs could bestow comparative advantage!  This “docudrama” presents the fortunes of Tightship Associates against their rival, Inefficiency Systems.  Guess who makes better use of bookmarks!

Understanding Acrobat’s Optimizer.  PDFs don’t have to be large and unpolished.  Savvy PDF creators may already know that Acrobat’s PDF Optimizer can radically shrink PDF files - but there’s more to the Optimizer than just shrinking.