OOXML & ODF vs. PDF/A

Tuesday, February 27th, 2007

A colleague sent me the following statement today and asked me to respond to it. I first wrote only to him, but then thought that it was worth posting here.

He wrote:

Policymakers seem to cherish the ‘perception’ that the advent of ODF and/or OOXML will make PDF/A-1 a ‘redundant’ standard for long-term document preservation. Their ‘case’: it’s based on XML and XML-related open standards, so long-term accessibility is granted. They may reason as follows: just wait, implement a few Office plug-ins for conversion of the useful Office legacy files to ODF and/or OOXML and ‘we’ have taken care for the long-term accessibility of these legacy files, adequately. This will keep ‘us’ away from ‘obscure’, PDF-derived, open standards. If it’s not XML, the format has no right to exist, at all. Even Adobe admits this by developing MARS. Please, let ‘us’ avoid the costs involved in developing and maintenance of PDF-infrastructure, PDF-training and PDF-knowledge.

Here is what I have to say.

Before one can argue over file formats, one needs to determine what is being archived and why. For example, if I were archiving an address book, I would probably focus on the data and not on any single presentation of that data. However, if I were archiving the Declaration of Independence, I would focus on the presentation of the content in addition to the content itself. I would also want to ensure that any content is maintained in its original “format”, so that vector diagrams from a CAD-generated floor plan remain rich vectors and are not converted to something like raster data. Finally, in all cases, I would want to ensure that any relevant marginalia (metadata, comments/markup, etc.) could be incorporated.

This is why the archival community approached Adobe about using PDF for long-term archival storage of content containing text, vector graphics and raster data. PDF is the only format that covers ALL of the above needs: content, presentation and metadata for all standard content elements (text, vector, raster). Added to that is a technical design that makes it easy to create an unambiguous “reference implementation” at some point in the future, thus ensuring that the content and its presentation will survive.

Neither OOXML nor ODF addresses all of these needs. In fact, they focus primarily on the textual content and (limited) metadata, and in no way help preserve the presentation of that content. As such, they aren’t even acceptable for archiving simple Office documents. They also do nothing for those wishing to archive scans, CAD drawings, print publications and many other types of documents. Add to those limitations the fact that neither was designed to make a reference implementation easy to create (it’s IMPOSSIBLE to write a fully compliant OOXML viewer), and both fall short as archival standards.