Archive for the 'Technology' Category

OpenDocument and OOXML – all is not well

Sometimes I get the impression that much of the current talk about sustainability of data resources is just a broad way of refocusing on a complex of problems which were somewhat overlooked, probably because many of us failed to grasp the full extent of the commitment to portable data. Unlike flimsier promises such as interoperable web services, portability in the sense of platform-independent data was, and still is, an attainable goal, provided, of course, “we can use the cleanly documented, well-understood, easy-to-parse, text-based formats that XML provides.” And to continue along the same lines: “XML lets documents and data be moved from one system to another with reasonable hope that the receiving system will be able to make sense out of it.” (from Elliotte Rusty Harold and W. Scott Means, XML in a Nutshell)

“Reasonable hope” – yes indeed. It’s very much implied that there’s more to portability than data being re-usable across different software and hardware platforms. If data are to be re-usable across different communities and for different purposes as well, there are some further questions that cannot be left unanswered. This is all well argued by Steven Bird and Gary Simons in their seminal Seven Dimensions of Portability for Language Documentation and Description (2003).

I’m bringing it up because, with word processors like OpenOffice.org and MS Word now using XML as a storage format, people could get the impression that such reasonably well-documented formats ship with a sustainability guarantee. XML formats are a step in the right direction, but like HTML they are “only” presentational. They are also arguably much harder to understand than HTML, and consequently difficult to manipulate and repurpose. Consider an excerpt of the present document in ODT:

<office:body>
 ...
 <text:p text:style-name="P1">
 <text:bookmark-start text:name="h.pcbwh9-j8rixd"/>
 <text:span text:style-name="Default_20_Paragraph_20_Font">
 <text:span text:style-name="T1">OpenDocument</text:span>
 </text:span>
 <text:bookmark-end text:name="h.pcbwh9-j8rixd"/>
 <text:span text:style-name="Default_20_Paragraph_20_Font">
 <text:span text:style-name="T1"> and OOXML </text:span>
 </text:span>
 </text:p>
 <text:p text:style-name="P2">
 <text:bookmark-start text:name="h.slg1ig-i51sxm"/>
 <text:span text:style-name="Default_20_Paragraph_20_Font">
 <text:span text:style-name="T2">all</text:span>
 </text:span>
 <text:bookmark-end text:name="h.slg1ig-i51sxm"/>
 <text:span text:style-name="Default_20_Paragraph_20_Font">
 <text:span text:style-name="T2"> is not well</text:span>
 </text:span>
 </text:p>
 <text:p text:style-name="P3">
 <text:span text:style-name="Default_20_Paragraph_20_Font">
 <text:span text:style-name="T3">Sometimes I get the impression that much of the
 current talk about sustainability of data resources is just a broad way of
 refocusing ...
 </text:span>
 </text:span>
 </text:p> ...

Basically, it consists of paragraph (text:p) and span (text:span) child elements. Mind you, these are consistently used, but in terms of format the markup doesn’t really provide any information except how an application should render it. Notice how a heading is just another paragraph with different typography.

In TEI we are able to distinguish between headings, captured by the head (heading) element, and p (paragraph) elements, which should only be used to reflect a real prose paragraph. Further, headings and paragraphs are contained by a div (division) element.

<text>
...
<body>
<div>
 <head>OpenDocument and OOXML</head>
 <head>all is not well</head>
 <p>Sometimes I get the impression that much of the current talk about
 sustainability of data resources is just a broad way of refocusing ...</p>
 ...
</div>
...
</body>
</text>
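
To make the difference concrete, here is a minimal Python sketch (standard library only) of how a program would go looking for headings in each format. The two fragments are trimmed versions of the excerpts above, and the function names odt_blocks and tei_blocks are made up for illustration. The TEI fragment is left without a namespace, as in the excerpt, although real TEI P5 files declare one.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument; a real content.xml declares them on its root.
ODF_NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Trimmed version of the ODT excerpt (bookmarks and nested spans removed).
ODT_FRAGMENT = """
<office:body xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:p text:style-name="P1"><text:span text:style-name="T1">OpenDocument and OOXML</text:span></text:p>
  <text:p text:style-name="P2"><text:span text:style-name="T2">all is not well</text:span></text:p>
  <text:p text:style-name="P3"><text:span text:style-name="T3">Sometimes I get the impression ...</text:span></text:p>
</office:body>
"""

# Trimmed version of the TEI excerpt.
TEI_FRAGMENT = """
<text>
  <body>
    <div>
      <head>OpenDocument and OOXML</head>
      <head>all is not well</head>
      <p>Sometimes I get the impression ...</p>
    </div>
  </body>
</text>
"""


def odt_blocks(xml_text):
    """Return (element name, style name, text) for every ODT paragraph."""
    root = ET.fromstring(xml_text)
    style_attr = "{%s}style-name" % ODF_NS["text"]
    return [
        ("p", p.get(style_attr), "".join(p.itertext()).strip())
        for p in root.iterfind(".//text:p", ODF_NS)
    ]


def tei_blocks(xml_text):
    """Return (element name, text) for every heading or paragraph in a TEI div."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, "".join(child.itertext()).strip())
        for div in root.iter("div")
        for child in div
        if child.tag in ("head", "p")
    ]


if __name__ == "__main__":
    # ODT: every block is a paragraph; only an opaque style name tells them apart.
    for block in odt_blocks(ODT_FRAGMENT):
        print(block)
    # TEI: the element name itself says whether a block is a heading.
    for block in tei_blocks(TEI_FRAGMENT):
        print(block)
```

In the ODT case the only hint is a style name such as P1, which means nothing outside the document that defines it. In the TEI case the heading can be picked out by element name alone.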

In terms of content, TEI markup adds another dimension. Because we apply the TEI terminology, readers can use the TEI Guidelines to check whether we use it correctly and consistently. Also, by enriching the markup with more elements we could get a broader coverage of the different aspects of the content (quotations, emphasized passages, etc.), thereby making the content relevant to more people.

So, for long-term preservation purposes, OpenDocument and OOXML don’t quite cut it. Besides the lock-in with notoriously short-lived word processor applications, they aren’t rich enough to capture relevant aspects of your content.

From Topic Maps to MediaWiki – Quick and Dirty

Recently, I needed to make some fairly large bodies of XML available for editing by a group of people. In this case the data was stored in the Topic Maps format (XTM), and – as long as I was the only one editing the files – this had been working just fine.

But with more people about to join in, it was clear that editing the files in a simple text editor wasn’t such a good idea. So, to avoid the risk of ending up with different versions (and people endlessly complaining about editing XML), I decided to turn the whole thing into a wiki.

Now, MediaWiki has the Special:Export tool for migrating wikis (‘transwikiing’). It exports pages in a simple XML format, so that you can import them into another wiki. This means you can populate a wiki simply by emulating the MediaWiki XML export format.

How to

If you want to try it, the MediaWiki output has to look a little something like this:

<?xml version="1.0" encoding="utf-8"?>
<mediawiki xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns="http://www.mediawiki.org/xml/export-0.3/"
  xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/
  http://www.mediawiki.org/xml/export-0.3.xsd"
  version="0.3" xml:lang="da">
<page>
 <title>Google</title>
 <id>1</id>
 <revision>
  <id>1</id>
  <timestamp/>
  <contributor>
   <username>yourUserName</username>
   <id>1</id>
  </contributor>
  <text xml:space="preserve">
<!-- Wikitext goes here; it must start in the first column -->
==Links==
[http://www.google.com]

</text>
  </revision>
</page>
<page>
 <title>Microsoft</title>
 <id>2</id>
 ...
</page>
</mediawiki>

If your data is XTM, your starting point might be something like this made-up Topic Map with names and links of three companies:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topicMap SYSTEM "xtm1.dtd">
<topicMap id="companies-tm.xtm"
  xmlns="http://www.topicmaps.org/xtm/1.0/"
  xmlns:xlink="http://www.w3.org/1999/xlink">
 <topic id="001">
  <baseName>
   <baseNameString>Google</baseNameString>
  </baseName>
  <occurrence>
   <resourceRef xlink:href="http://www.google.com"/>
  </occurrence>
 </topic>
 <topic id="002">
  <baseName>
   <baseNameString>Microsoft</baseNameString>
  </baseName>
  <occurrence>
   <resourceRef xlink:href="http://www.microsoft.com"/>
  </occurrence>
 </topic>
 <topic id="003">
  <baseName>
   <baseNameString>Oracle</baseNameString>
  </baseName>
  <occurrence>
   <resourceRef xlink:href="http://www.oracle.com"/>
  </occurrence>
 </topic>
</topicMap>

In this case the following XSLT stylesheet will do the job:

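<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns="http://www.mediawiki.org/xml/export-0.3/"
   xmlns:tm="http://www.topicmaps.org/xtm/1.0/"
   xmlns:tmlink="http://www.w3.org/1999/xlink"
   exclude-result-prefixes="tm tmlink" version="2.0">
 <!--The default namespace above puts page, title, etc. in the MediaWiki export namespace-->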
 <xsl:output method="xml" encoding="utf-8" indent="yes"/>
 <xsl:template match="/">
  <xsl:apply-templates select="tm:topicMap"/>
 </xsl:template>
 <xsl:template match="tm:topicMap">
  <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/
  http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="da">
   <xsl:apply-templates select="tm:topic"/>
 </mediawiki>
</xsl:template>
<xsl:template match="tm:topic">
<page>
 <title>
  <xsl:apply-templates select="tm:baseName/tm:baseNameString"/>
 </title>
 <id><!--To give each page a unique number, use the xsl:number instruction--><xsl:number/></id>
 <revision>
  <id>1</id>
  <timestamp/>
  <contributor>
   <username>yourUserName</username>
   <id>2</id>
  </contributor>
  <!--Since whitespace is crucial to the layout of your wikipage,
you should add the xml:space attribute and set the value to 'preserve'-->
  <text xml:space="preserve">
  <!--Now start building your wikipage -->

==Links==
[<xsl:value-of select="tm:occurrence/tm:resourceRef/@tmlink:href"/>]

</text>
</revision>
</page>
</xsl:template>
</xsl:stylesheet>

To sum up:

  • Make sure that your wiki is installed, AND that you have admin rights
  • Create a stylesheet, somewhat like the one provided above
  • Run the stylesheet on your XML file, for instance from your command line with saxon:
    $ saxon topics.xtm topicMaps2Mediawiki.xsl > mediawikiTopics.xml
  • Go to the Special:Import page on your wiki
  • Browse for the file, and
  • Upload! Do remember, however, that the maximum file size defaults to around 1.4 MB. To change it, edit php.ini and raise the upload limits there (the PHP directives are upload_max_filesize and, for the request as a whole, post_max_size).

After uploading the file, you’ll receive a list of links to the pages you just made.
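
If you would rather not use XSLT, the same mapping can be sketched in Python with the standard library. Treat it as a sketch under assumptions, not a tested import tool: the function name xtm_to_mediawiki, the helper mw and the two-topic sample are made up for illustration, the timestamp is left empty as in the example above, and the output has not been validated against the export schema.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"
MW_NS = "http://www.mediawiki.org/xml/export-0.3/"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize MediaWiki elements with a default namespace instead of an ns0: prefix.
# Note that this registration is global to the ElementTree module.
ET.register_namespace("", MW_NS)


def mw(tag):
    """Qualify a tag name with the MediaWiki export namespace."""
    return "{%s}%s" % (MW_NS, tag)


def xtm_to_mediawiki(xtm_text, username="yourUserName", lang="da"):
    """Turn each XTM topic into one wiki page listing its occurrence links."""
    ns = {"tm": XTM_NS}
    topic_map = ET.fromstring(xtm_text)
    wiki = ET.Element(mw("mediawiki"), {"version": "0.3", "{%s}lang" % XML_NS: lang})

    for number, topic in enumerate(topic_map.findall("tm:topic", ns), start=1):
        name = topic.findtext("tm:baseName/tm:baseNameString", default="", namespaces=ns)
        hrefs = [
            ref.get("{%s}href" % XLINK_NS)
            for ref in topic.findall("tm:occurrence/tm:resourceRef", ns)
        ]
        links = [href for href in hrefs if href]  # skip references without a link

        page = ET.SubElement(wiki, mw("page"))
        ET.SubElement(page, mw("title")).text = name.strip()
        ET.SubElement(page, mw("id")).text = str(number)  # same role as xsl:number

        revision = ET.SubElement(page, mw("revision"))
        ET.SubElement(revision, mw("id")).text = "1"
        ET.SubElement(revision, mw("timestamp"))  # left empty, as in the example
        contributor = ET.SubElement(revision, mw("contributor"))
        ET.SubElement(contributor, mw("username")).text = username
        ET.SubElement(contributor, mw("id")).text = "1"

        # Wikitext must start in the first column, so it is built as a plain string.
        wikitext = "\n==Links==\n" + "\n".join("[%s]" % link for link in links) + "\n"
        text = ET.SubElement(revision, mw("text"), {"{%s}space" % XML_NS: "preserve"})
        text.text = wikitext

    return ET.tostring(wiki, encoding="unicode")


# Made-up sample with two of the companies from the Topic Map above.
SAMPLE_XTM = """
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="001">
    <baseName><baseNameString>Google</baseNameString></baseName>
    <occurrence><resourceRef xlink:href="http://www.google.com"/></occurrence>
  </topic>
  <topic id="002">
    <baseName><baseNameString>Microsoft</baseNameString></baseName>
    <occurrence><resourceRef xlink:href="http://www.microsoft.com"/></occurrence>
  </topic>
</topicMap>
"""

if __name__ == "__main__":
    print(xtm_to_mediawiki(SAMPLE_XTM))
```

Building the wikitext as an ordinary string sidesteps the whitespace worries of the stylesheet: nothing gets indented unless you indent it yourself.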

Introducing EPUB

With digital books finding their way to more and more people, we read everywhere and on a variety of different devices. A lot of these have small displays, and this is a problem if the text you’re reading is in PDF.

EPUB is an XML publishing format for reflowable digital books and publications standardized by the International Digital Publishing Forum (IDPF), a trade and standards association for the digital publishing industry. For the record, this organization was formerly known as Open eBook Forum. “Reflowable” means that it scales to fit different screen sizes.

Since its official adoption by IDPF in 2007, EPUB has become popular among major publishers such as Hachette, O’Reilly and Penguin. The format allows publishers to produce and send a single digital publication file through distribution, and it can be read using a variety of open source and commercial software. You can use O’Reilly’s Bookworm online for free, or you can get Adobe’s Digital Editions (ADE). EPUB works on all major operating systems, on e-book devices (like Kindle and Sony PRS), and on other small devices such as the Apple iPhone.

Collectively referred to as EPUB, the format is made up of three open standards:

  • Open eBook Publication Structure Container Format (OCF): Describes the directory tree structure and file format (zip) of an EPUB archive
  • Open Publication Structure (OPS): Specifies the common vocabularies for the eBook, especially the formats allowed to be used for book content (for example XHTML and CSS)
  • Open Packaging Format (OPF): Defines the required and optional metadata, reading order, and table of contents in an EPUB
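
To see how the three parts fit together, here is a minimal Python sketch that zips up a one-chapter book. It is an illustration only: the function name build_epub, the file names under OEBPS/ and the placeholder identifier are my own choices, and the result is not a complete EPUB (a real one also needs a table of contents file, which is left out here), so don’t expect it to pass a validator.

```python
import html
import io
import zipfile

# OCF: META-INF/container.xml tells the reading system where the OPF file lives.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# OPF: metadata, manifest (the files in the book) and spine (the reading order).
# The identifier is a placeholder, not a real book ID.
OPF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{title}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="BookId">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
  <manifest>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter1"/>
  </spine>
</package>
"""

# OPS: the book content itself is plain XHTML.
CHAPTER_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>{title}</title></head>
  <body><h1>{title}</h1><p>{body}</p></body>
</html>
"""


def build_epub(title, body):
    """Return the bytes of a one-chapter EPUB-style zip archive."""
    safe_title = html.escape(title)
    chapter = CHAPTER_TEMPLATE.format(title=safe_title, body=html.escape(body))
    package = OPF_TEMPLATE.format(title=safe_title)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as epub:
        # OCF: the mimetype entry comes first and is stored uncompressed.
        epub.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        epub.writestr("META-INF/container.xml", CONTAINER_XML)
        epub.writestr("OEBPS/content.opf", package)
        epub.writestr("OEBPS/chapter1.xhtml", chapter)
    return buffer.getvalue()


if __name__ == "__main__":
    with open("example.epub", "wb") as handle:
        handle.write(build_epub("Introducing EPUB", "Reflowable text goes here."))
```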

To learn more, see Liza Daly of Threepress’s nice tutorial, Build a digital book with EPUB, available at IBM developerWorks. To really get to know EPUB, you’ll need to read the specifications: OCF, OPS, and OPF.

Out with the new, in with the old

For the moment it certainly seems as though the public uproar against Facebook’s recent changes to the Terms of Service has had an effect. On his blog Mark Zuckerberg says that:

A couple of weeks ago, we revised our terms of use hoping to clarify some parts for our users. Over the past couple of days, we received a lot of questions and comments about the changes and what they mean for people and their information. Based on this feedback, we have decided to return to our previous terms of use while we resolve the issues that people have raised.

Very well – at least for now. It’ll be interesting to see exactly how (and if) these issues will be resolved.

I remain sceptical, because Zuckerberg appears to be talking about the change of ToS as an attempt to get rid of what he regards as “overly formal and protective … [language]” in the old ToS.

But this is simply downplaying a genuine disagreement between Facebook and its users. The new ToS are suspended, not abolished, and so the question remains: Exactly who owns the content you create on Facebook? You do, for now; but for how long?

Facebook Owns You

If you’re just a teeny weeny bit like me, you don’t like the idea of Facebook owning your content AFTER you’ve closed your account. That’s why I’ve joined this group on… er… Facebook, and you should too.

What’s the fuss? Well, we’ve grown accustomed to owning our content ourselves; that was the deal according to the old Terms Of Service (TOS). But with the new TOS, it’s a different story:

“You hereby grant Facebook an irrevocable, perpetual, non-exclusive, transferable, fully paid, worldwide license (with the right to sublicense) to (a) use, copy, publish, stream, store, retain, publicly perform or display, transmit, scan, reformat, modify, edit, frame, translate, excerpt, adapt, create derivative works and distribute (through multiple tiers), any User Content you (i) Post on or in connection with the Facebook Service or the promotion thereof subject only to your privacy settings or (ii) enable a user to Post, including by offering a Share Link on your website and (b) to use your name, likeness and image for any purpose, including commercial or advertising, each of (a) and (b) on or in connection with the Facebook Service or the promotion thereof.”

Learn more from Erick Schonfeld on TechCrunch.

Update (February 18th, 2009): Over at Slashdot Ian Lamont has news about Facebook’s measures to contain the ToS fallout.