HTML use in ONIX

The following is amalgamated from two posts that Graham Bell, the Executive Director of EDItEUR, made to the ONIX Implementation Group.  Anyone can join this Implementation Listserv and it's strongly recommended for anyone creating ONIX metadata from a database.  It may be overkill for someone using another company's ONIX creation software but it has tips useful to anyone using the standard.



In ONIX – whether version 2.1 or 3.0 – there are many common issues that arise when data providers embed HTML within the various textual data elements. Data providers deliver HTML in a variety of different ways – some which match the standard, many which don’t. And this means that for data recipients, the complexity of receiving so many unpredictable variations forces them to choose to just ignore all HTML – whether it matches the standard or not – or to treat each data file as unique (which adds unnecessary cost and time). This isn’t good for either senders or recipients.

Can you use just any HTML, or only a subset?

When ONIX initially introduced HTML markup of textual data (way back in mid-2000), there were no documented limitations on the range of HTML tags that could be included. When XHTML was introduced (in 2003) some limitations were documented. But these limitations were generous – probably too generous – and meant that most HTML tags other than those used for interactivity (things like forms and scripts) were technically okay. It meant that publishers could add complex HTML or XHTML tagging in an attempt to control the look and layout of their product descriptions on the retailer’s web page, using HTML headings, tables, style attributes, even images and hypertext links. Understandably, this led many data recipients to simply reject all HTML and XHTML tags.

The technical limitations on XHTML remain, but are also still very generous. In practice, there is a strong recommendation for data suppliers to limit themselves to using a very restricted range of simple HTML or XHTML tags – and the hope is that if data suppliers do this, data recipients will not reject all HTML out of hand.

Things to Remember: Embedding HTML (escaping vs CDATA and the best option, using XHTML with tags).

  • Remember that you're embedding your HTML in another company's webpage.  Anything that could affect their CSS or formatting and any use of a URL will likely result in your work being stripped or not loaded.  Header tags (h1, h2 etc) should NOT be used as they would interfere with the retailer's webpage.

  • Simple, clean, noninvasive XHTML is what belongs in the appropriate ONIX elements.

Quick reference

  • EDItEUR's Application Note: HTML markup in ONIX. Also available as a visual application note (like a webinar).
  • The full list of appropriate elements is below, but in general text blocks (descriptions and biographical information) are appropriate. All other data elements should NOT include any HTML, page returns or formatting of any kind. This includes Title and Contributor elements. Think of it this way: data in ONIX is used to fill database fields and for indexing. Anything that doesn't belong in a database field can create problems, data loss and indexing issues. (ONIX 3.0 solves some of the need for formatting with specific "display" elements that are XHTML-enabled, including one to display the Title and another for the Contributor Statement.)

  • If you don't know anything about HTML, there is more information in EDItEUR's Best Practices. http://www.w3schools.com/html/ is a good place to start, and http://www.w3schools.com/html/html_xhtml.asp explains the X in XHTML. Someone in your company is likely creating ebooks and would understand the terms being used here.

These general best practices are worth following, as they are part of XHTML:

  • Enclose all copy in HTML tags: normally use a paragraph tag, <p>copy</p>, but <ul>, <ol> and <dl> are other options (these are called "block-level" tags below)

  • Use lower-case tags: <p> is preferred to <P>

  • Close all tags appropriately: <p> is closed by </p>. Some tags are "self-closed", so you should use <br /> instead of <br>

  • Nest tags appropriately: <p><i>copy</i> copy</p> but NOT <i><p>copy</i> copy</p>

The following tags are recommended (others are problematic or prohibited). This applies to both HTML (using CDATA or escaping of the < character) and XHTML (using HTML tags within text blocks and allowing XML processing rules to apply to your tags). The XHTML option is the recommended one, but it is best supported by regular use of XML schema validation to ensure its accuracy.

<p> and <br /> – paragraphs and line breaks
<i> and <em>, <b> and <strong> – italic and bold emphasis
<cite> – for book titles
<ul>, <ol> and <li> – bulleted and numbered lists
<sub> and <sup> – sub- and superscript
<dl>, <dt> and <dd> – definition lists

ONIX implementers

The primary reason to use HTML is to include multiple paragraphs, or italic or bold text, in a contributor biography or long description for example. Numbered or bullet point lists are also common requirements. There are three ways to include markup within a data element:

  1. use XHTML without any other special treatment

  2. use HTML with CDATA

  3. use HTML, but escape all < characters to &lt;

Option 1 is preferred in all cases. If you cannot use XHTML, then option 2 is preferred over option 3 in ONIX 3.0; in ONIX 2.1, options 2 and 3 are equally acceptable.
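
As a rough illustration of the three options, the following Python sketch wraps the same marked-up paragraph in a <Text> element in each way. The function names are hypothetical and the code is plain string handling, not part of any ONIX library; it assumes the input is already valid XHTML or HTML.

```python
def as_xhtml(fragment: str) -> str:
    """Option 1: XHTML placed directly in <Text>; the tags are part of the XML."""
    return f'<Text textformat="05">{fragment}</Text>'


def as_cdata(html: str) -> str:
    """Option 2: HTML carried unchanged inside a CDATA section."""
    # Assumes the HTML does not itself contain the sequence ]]>
    return f'<Text textformat="02"><![CDATA[{html}]]></Text>'


def as_escaped(html: str) -> str:
    """Option 3: HTML with every < changed to &lt; and nothing else touched."""
    return f'<Text textformat="02">{html.replace("<", "&lt;")}</Text>'


paragraph = "<p>This is some text that mentions Marks &amp; Spencer.</p>"
for build in (as_xhtml, as_cdata, as_escaped):
    print(build(paragraph))
```

Note that the textformat value changes with the option: "05" for XHTML, "02" for HTML carried by CDATA or escaping.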

Common errors

The first common error is simply omitting the necessary textformat attribute from List 34. When embedding HTML, you must always include the attribute textformat="02", as the default when you omit it is ‘plain text’. If you’re using XHTML, then you must use textformat="05".

Then there are common issues with the escaping of the HTML markup:

<Text textformat="02">&lt;p&gt;This is some text marked up with HTML.&lt;/p&gt;</Text>

<Text textformat="02">&lt;p>This is some text marked up with HTML.&lt;/p></Text>

Only the second version is correct according to the standard. The rule is – if you switch [only] &lt; back to a < character, the data should be correct HTML. In theory, if the text is processed using an XML-aware processor, the two versions would be considered identical, but in practice, many data recipients do not use end-to-end XML-aware processing.

<Text textformat="02">&lt;p>This is some text that mentions Marks &amp;amp; Spencer.&amp;lt;/p></Text>

<Text textformat="02">&lt;p>This is some text that mentions Marks &amp; Spencer.&lt;/p></Text>

Again, only the second version is correct. Don’t ‘double-escape’ any characters.
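
The rule above can be sketched from the recipient's side in Python. This is only a heuristic illustration: the function names are hypothetical, and the pattern simply looks for an escaped ampersand followed by what appears to be the rest of an entity, which is the usual sign of double-escaping.

```python
import re

# An &amp; immediately followed by the remainder of an entity or character
# reference (for example &amp;amp; or &amp;lt; or &amp;#8211;).
DOUBLE_ESCAPED = re.compile(r"&amp;(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);")


def unescape_tags(raw: str) -> str:
    """Switch only &lt; back to <; the result should be correct HTML."""
    return raw.replace("&lt;", "<")


def looks_double_escaped(raw: str) -> bool:
    """Heuristic: report text that appears to have been escaped twice."""
    return DOUBLE_ESCAPED.search(raw) is not None


good = "&lt;p>This is some text that mentions Marks &amp; Spencer.&lt;/p>"
bad = "&lt;p>This is some text that mentions Marks &amp;amp; Spencer.&amp;lt;/p>"
print(unescape_tags(good))
print(looks_double_escaped(good), looks_double_escaped(bad))
```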

ONIX 3.0 Special Considerations

Because of the common issues with ‘double-escaping’, the CDATA method is preferred over the escaping method in ONIX 3.0:

<Text textformat="02"><![CDATA[<p>This is some text that mentions Marks &amp; Spencer.</p>]]></Text>

With CDATA, the data content should be valid HTML without any changes.

Note that CDATA should not be used on any data except HTML – don’t use it unnecessarily on XHTML, or on ordinary text without markup.

Ideally, the whole text should be enclosed in <p> or maybe <ul> tags (or similar ‘block level’ HTML or XHTML tags):

<Text textformat="02">This is some &lt;em>text&lt;/em> marked up with HTML. &lt;p>Here is a second paragraph.&lt;/p></Text>

<Text textformat="02">&lt;p>This is some &lt;em>text&lt;/em> marked up with HTML.&lt;/p><p>Here is a second paragraph.&lt;/p></Text>

Once again, only the second version is really correct. There should be block-level tags enclosing the whole of the text – no free-floating text that is outside of any tags at all.
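
One way to check for free-floating text is sketched below using Python's standard html.parser module. It works on the HTML itself (after any &lt; has been switched back to <), the class and function names are hypothetical, and it assumes every block-level tag is explicitly closed.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "dl"}


class FloatingTextFinder(HTMLParser):
    """Collects text and inline tags that sit outside any block-level tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many block-level tags are currently open
        self.floating = []    # anything found while depth is zero

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
        elif self.depth == 0:
            self.floating.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.floating.append(data.strip())


def floating_text(html: str) -> list:
    finder = FloatingTextFinder()
    finder.feed(html)
    finder.close()  # flush any text still buffered by the parser
    return finder.floating
```

An empty list means all of the text is enclosed in block-level tags.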


Using XHTML instead of HTML is the best method of all.

<Text textformat="05"><p>This is some text that mentions Marks &amp; Spencer.</p></Text>

What’s the difference? In many cases, not much. XHTML uses textformat="05" instead of "02", and you must ensure that all the markup tags are properly matched and nested. In HTML, the </p> tag is actually optional, but in XHTML it is mandatory. XHTML tags must also always be lower case, whereas in HTML upper case tags can be acceptable.
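
Because XHTML is XML, a fragment can be sanity-checked with any XML parser before it goes into an ONIX file. The sketch below uses Python's standard xml.etree.ElementTree; the function name and the wrapper element are assumptions made for illustration, and it is no substitute for validating the whole ONIX message against the schema.

```python
import xml.etree.ElementTree as ET


def xhtml_problems(fragment: str) -> list:
    """Return a list of problems found in an XHTML fragment (empty if none)."""
    try:
        # A temporary wrapper lets the fragment hold several paragraphs.
        root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
    except ET.ParseError as err:
        # Unclosed or badly nested tags (and undefined entities) end up here.
        return [f"not well-formed: {err}"]
    return [
        f"tag <{el.tag}> is not lower case"
        for el in root.iter()
        if el is not root and el.tag != el.tag.lower()
    ]


print(xhtml_problems("<p>This is some text that mentions Marks &amp; Spencer.</p>"))
print(xhtml_problems("<i><p>copy</i> copy</p>"))
print(xhtml_problems("<P>Upper case</P>"))
```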

It is strongly recommended that ONIX data suppliers use only the following tags:

<p> and <br /> – paragraphs and line breaks
<i> and <em>, <b> and <strong> – italic and bold emphasis
<cite> – for book titles
<ul>, <ol> and <li> – bulleted and numbered lists
<sub> and <sup> – sub- and superscript
<dl>, <dt> and <dd> – definition lists
<ruby>, <rb>, <rp> and <rt> – for simple glosses in Chinese, Japanese and other text

Of these, only <p>, <ul>, <ol> and <dl> are ‘block-level’. Any attributes (e.g. the style attribute) should be avoided. And it should be emphasised that these recommendations apply to both HTML (using CDATA or escaping of the < character) and XHTML.
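
A data supplier could audit a description against this recommended list before export. The following Python sketch, again using the standard html.parser module, reports tags outside the list and tags that carry attributes. The names are hypothetical, and because html.parser lower-cases tag names, it does not detect upper-case tags.

```python
from html.parser import HTMLParser

RECOMMENDED_TAGS = {
    "p", "br", "i", "em", "b", "strong", "cite",
    "ul", "ol", "li", "sub", "sup", "dl", "dt", "dd",
    "ruby", "rb", "rp", "rt",
}


class MarkupAudit(HTMLParser):
    """Records tags outside the recommended list and tags with attributes."""

    def __init__(self):
        super().__init__()
        self.other_tags = set()
        self.tags_with_attributes = set()

    def handle_starttag(self, tag, attrs):
        if tag not in RECOMMENDED_TAGS:
            self.other_tags.add(tag)
        if attrs:
            self.tags_with_attributes.add(tag)


def audit_markup(html: str) -> tuple:
    audit = MarkupAudit()
    audit.feed(html)
    audit.close()
    return sorted(audit.other_tags), sorted(audit.tags_with_attributes)


print(audit_markup('<h1>Title</h1><p style="color:red">Text</p>'))
```

Two empty lists mean the markup stays within the recommendation.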

There is a complete list of XHTML and HTML tags – allowed and disallowed – within the ONIX 3.0 Implementation and Best Practice Guide.

There is only a small list of ONIX data elements in which HTML or XHTML markup is acceptable.

For example, in ONIX 3.0, <BiographicalNote> can contain markup, but <ProductFormDescription> cannot. Here’s the complete list of data elements where markup can be used:

  • <AncillaryContentDescription>

  • <AudienceDescription>

  • <BiographicalNote>

  • <BookClubAdoption>

  • <CitationNote>

  • <CopiesSold>

  • <ConferenceTheme> (deprecated)

  • <ContributorDescription>

  • <ContributorStatement>

  • <EditionStatement>

  • <FeatureNote>

  • <IllustrationsNote>

  • <InitialPrintRun>

  • <MarketPublishingStatusNote>

  • <PrizeJury>

  • <PromotionCampaign>

  • <PromotionContact> (deprecated)

  • <PublishingStatusNote>

  • <ReissueDescription> (deprecated)

  • <ReligiousTextFeatureDescription>

  • <ReprintDetail>

  • <SalesRestrictionNote>

  • <Text>

  • <TitleStatement>

  • <WebsiteDescription>


In ONIX 2.1, HTML and XHTML should only be used in data elements which take the textformat attribute, or have an associated <TextFormat> element:

  • <Annotation>

  • <BiographicalNote>

  • <DownloadCaption>

  • <DownloadCopyrightNotice>

  • <DownloadCredit>

  • <DownloadTerms>

  • <MainDescription>

  • <PrizeJury>

  • <ProductWebsiteDescription>

  • <ReviewQuote>

  • <Text>

  • <TextWithDownload>

  • <WebsiteDescription>
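
For anyone generating ONIX from a database, the two lists above can be kept as a simple lookup so that markup is only ever written into elements that accept it. This is a minimal sketch: the element names are copied from the lists above, while the dictionary and function names are hypothetical.

```python
MARKUP_ELEMENTS = {
    "3.0": {
        "AncillaryContentDescription", "AudienceDescription", "BiographicalNote",
        "BookClubAdoption", "CitationNote", "CopiesSold", "ConferenceTheme",
        "ContributorDescription", "ContributorStatement", "EditionStatement",
        "FeatureNote", "IllustrationsNote", "InitialPrintRun",
        "MarketPublishingStatusNote", "PrizeJury", "PromotionCampaign",
        "PromotionContact", "PublishingStatusNote", "ReissueDescription",
        "ReligiousTextFeatureDescription", "ReprintDetail",
        "SalesRestrictionNote", "Text", "TitleStatement", "WebsiteDescription",
    },
    "2.1": {
        "Annotation", "BiographicalNote", "DownloadCaption",
        "DownloadCopyrightNotice", "DownloadCredit", "DownloadTerms",
        "MainDescription", "PrizeJury", "ProductWebsiteDescription",
        "ReviewQuote", "Text", "TextWithDownload", "WebsiteDescription",
    },
}


def markup_allowed(element: str, version: str = "3.0") -> bool:
    """True if the named element may carry HTML or XHTML in that ONIX version."""
    return element in MARKUP_ELEMENTS[version]


print(markup_allowed("BiographicalNote"))        # listed for ONIX 3.0
print(markup_allowed("ProductFormDescription"))  # not listed
```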

FAQ

Q. If the “<” character occurs in the text of a short or long description, and I also want to include HTML or XHTML markup, how should I do that?

Because the “<” character is used to mark the beginning of a tag, it can be confusing if you also want to include it as part of the text. Generally, if you want a < to appear in the text of a web page (not as part of the markup), you would use &lt; and equally, if you want & to appear on a web page, you use &amp; instead.

But when you’re embedding HTML in ONIX with the escaping method, < characters at the beginning of tags are changed to &lt;. The recipient will turn every &lt; back into a < character. And while that would be correct for < at the beginning of a tag, it would not be correct for a < that is meant to appear as part of the text.

The simplest way to deal with this is this: if you’re using the escaping method and you want a < to appear as part of your marked-up text, use &#60; instead of &lt;

Alternatively, use the CDATA method instead. With CDATA, < at the beginning of a tag stays as <, and a < intended to appear as part of the text is changed to &lt; or &#60;.

And of course with XHTML, < at the beginning of a tag stays as <, and a < intended to appear as part of the text is changed to &lt; or &#60;.
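
The difference between the methods can be sketched in Python for a single paragraph built from plain text. The function names are hypothetical; the point is only the order of the replacements.

```python
def paragraph_for_escaping(text: str) -> str:
    """Escaping method: a literal < becomes &#60;, a tag's < becomes &lt;."""
    # First make the text safe as HTML content, using &#60; for a literal <.
    body = text.replace("&", "&amp;").replace("<", "&#60;")
    # Then change the < characters that begin tags, and only those, to &lt;
    return f"<p>{body}</p>".replace("<", "&lt;")


def paragraph_for_cdata_or_xhtml(text: str) -> str:
    """CDATA or XHTML: tags keep their <, a literal < becomes &lt;."""
    body = text.replace("&", "&amp;").replace("<", "&lt;")
    return f"<p>{body}</p>"


print(paragraph_for_escaping("Use x < 10 & y"))
print(paragraph_for_cdata_or_xhtml("Use x < 10 & y"))
```

With the escaping method, a recipient who switches every &lt; back to < ends up with the tags restored and the &#60; still in place, which is what the HTML needs.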

Q. Can I include named character entities in text with markup in ONIX 3.0?

If the markup is HTML, this is pretty confusing, because named character entities – things like &hellip; for an ellipsis or &ndash; for a dash – are valid in HTML and XHTML. But they are not valid in ONIX 3.0. A strict answer to the question depends on which method you’re using to embed the HTML:

<Text textformat="02">&lt;p>This is some text &ndash; with a dash!&lt;/p></Text>
– not correct, because the named character entity isn’t valid

<Text textformat="02">&lt;p>This is some text &amp;ndash; with a dash!&lt;/p></Text>
– not correct, because the character entity is ‘double-escaped’

<Text textformat="02">&lt;p>This is some text &#8211; with a dash!&lt;/p></Text>
– correct, uses the numerical character reference for the dash

<Text textformat="02">&lt;p>This is some text with a dash!&lt;/p></Text>
– correct, uses the native character for the dash, in whatever character encoding the entire message uses

<Text textformat="02"><![CDATA[<p>This is some text &ndash; with a dash!</p>]]></Text>
– this is correct – and will probably work (because it's within CDATA)

<Text textformat="02"><![CDATA[<p>This is some text &#8211; with a dash!</p>]]></Text>
– correct

<Text textformat="02"><![CDATA[<p>This is some text – with a dash!</p>]]></Text>
– correct

If you’re embedding XHTML – with textformat="05", and without CDATA or escaping – then the answer is much simpler. Either the native character or the numerical reference is fine, but the named entity (&ndash;) is not.
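
If your source text already contains named entities, one possible approach is to convert them to numerical character references before building the ONIX file. The sketch below uses the name2codepoint table from Python's standard html.entities module. The function name is hypothetical, the five entities predefined by XML are left untouched, and names not in the table are left as they are for manual review.

```python
import re
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Replace HTML named entities such as &ndash; with numeric references."""

    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unchanged
        return f"&#{name2codepoint[name]};"

    return NAMED_ENTITY.sub(swap, text)


print(named_to_numeric("<p>This is some text &ndash; with a dash!</p>"))
```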