The following is amalgamated from two posts that Graham Bell, the Executive Director of EDItEUR, made to the ONIX Implementation Group.  Anyone can join this Implementation Listserv and it's strongly recommended for anyone creating ONIX metadata from a database.  It may be overkill for someone using another company's ONIX creation software but it has tips useful to anyone using the standard.



In ONIX – whether version 2.1 or 3.0 – there are many common issues that arise when data providers embed HTML within the various textual data elements. Data providers deliver HTML in a variety of different ways – some which match the standard, many which don’t. And this means that for data recipients, the complexity of receiving so many unpredictable variations forces them to choose to just ignore all HTML – whether it matches the standard or not – or to treat each data file as unique (which adds unnecessary cost and time). This isn’t good for either senders or recipients.

Can you use just any HTML, or only a subset?

When ONIX initially introduced HTML markup of textual data (way back in mid-2000), there were no documented limitations on the range of HTML tags that could be included. When XHTML was introduced (in 2003) some limitations were documented. But these limitations were generous – probably too generous – and meant that most HTML tags other than those used for interactivity (things like forms and scripts) were technically okay. It meant that publishers could add complex HTML or XHTML tagging in an attempt to control the look and layout of their product descriptions on the retailer’s web page, using HTML headings, tables, style attributes, even images and hypertext links. Understandably, this led many data recipients to simply reject all HTML and XHTML tags.

The technical limitations on XHTML remain, but are also still very generous. In practice, there is a strong recommendation for data suppliers to limit themselves to using a very restricted range of simple HTML or XHTML tags – and the hope is that if data suppliers do this, data recipients will not reject all HTML out of hand.

Things to Remember: Embedding HTML (escaping vs CDATA and the best option, using XHTML with tags).

Quick reference

These general best practices are worth following, as they are part of XHTML:

The following tags are recommended (others are problematic or prohibited). This applies both to HTML (using CDATA or escaping of the < character) and to XHTML (using HTML tags within text blocks and allowing XML processing rules to apply to your tags). The XHTML option is the recommended one, but it is best supported by regular use of XML schema validation to ensure its accuracy.

<p> and <br /> – paragraphs and line breaks
<i> and <em>, <b> and <strong> – italic and bold emphasis
<cite> – for book titles
<ul>, <ol> and <li> – bulleted and numbered lists

<sub> and <sup> – sub- and superscript
<dl>, <dt> and <dd> – definition lists

ONIX implementers

The primary reason to use HTML is to include multiple paragraphs, or italic or bold text, in a contributor biography or long description for example. Numbered or bullet point lists are also common requirements. There are three ways to include markup within a data element:

  1. use XHTML without any other special treatment

  2. use HTML with CDATA

  3. use HTML, but escape all < characters to &lt;

Option 1 is preferred in all cases. If you cannot use XHTML, then option 2 is preferred over option 3 in ONIX 3.0. In ONIX 2.1, options 2 and 3 are equally acceptable.
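As a rough illustration, the three options can be sketched in Python. The helper names (as_xhtml, as_cdata, as_escaped) are invented for this example and are not part of ONIX or of any library. The sketch assumes the fragment is already valid markup in which any ampersand in the text is written as &amp;.

```python
import xml.etree.ElementTree as ET


def as_xhtml(fragment):
    """Option 1: XHTML tags sit directly inside the element."""
    return f'<Text textformat="05">{fragment}</Text>'


def as_cdata(fragment):
    """Option 2: HTML wrapped in a CDATA section, otherwise unchanged."""
    return f'<Text textformat="02"><![CDATA[{fragment}]]></Text>'


def as_escaped(fragment):
    """Option 3: HTML with every < changed to &lt; and nothing else."""
    return f'<Text textformat="02">{fragment.replace("<", "&lt;")}</Text>'


fragment = "<p>This is some text that mentions Marks &amp; Spencer.</p>"
for build in (as_xhtml, as_cdata, as_escaped):
    element_xml = build(fragment)
    ET.fromstring(element_xml)  # raises ParseError if the result is not well-formed XML
    print(element_xml)
```

All three results are well-formed XML. They differ only in how much work the recipient has to do to get back to the original markup.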

Common errors

The first common error is simply omitting the necessary textformat attribute from List 34. When embedding HTML, you must always include the attribute textformat="02", as the default when you omit it is ‘plain text’. If you’re using XHTML, then you must use textformat="05".
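A sender could catch this first error before delivery with a simple check. The sketch below is only a heuristic, and the function name textformat_problem is made up for this example. It treats child elements as a sign of XHTML and a < inside the parsed text as a sign of embedded HTML, so plain text that happens to contain a < would be flagged too.

```python
import xml.etree.ElementTree as ET


def textformat_problem(element_xml):
    """Return a description of the likely problem, or None if none is found."""
    element = ET.fromstring(element_xml)
    declared = element.get("textformat")
    # Unescaped XHTML tags are parsed as child elements of the data element.
    has_child_tags = len(element) > 0
    # Escaped or CDATA-wrapped HTML is parsed as text that contains "<".
    has_embedded_html = "<" in (element.text or "")
    if has_child_tags and declared != "05":
        return 'XHTML tags found: expected textformat="05"'
    if has_embedded_html and declared != "02":
        return 'embedded HTML found: expected textformat="02"'
    return None


print(textformat_problem("<Text>&lt;p>Hello&lt;/p></Text>"))
print(textformat_problem('<Text textformat="05"><p>Hello</p></Text>'))
```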

Then there are common issues with the escaping of the HTML markup:

<Text textformat="02">&lt;p&gt;This is some text marked up with HTML.&lt;/p&gt;</Text>

<Text textformat="02">&lt;p>This is some text marked up with HTML.&lt;/p></Text>

Only the second version is correct according to the standard. The rule is – if you switch [only] &lt; back to a < character, the data should be correct HTML. In theory, if the text is processed using an XML-aware processor, the two versions would be considered identical, but in practice, many data recipients do not use end-to-end XML-aware processing.

<Text textformat="02">&lt;p>This is some text that mentions Marks &amp;amp; Spencer.&amp;lt;/p></Text>

<Text textformat="02">&lt;p>This is some text that mentions Marks &amp; Spencer.&lt;/p></Text>

Again, only the second version is correct. Don’t ‘double-escape’ any characters.
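Both escaping mistakes can be spotted by scanning the raw, unparsed content of the element. The following is a minimal sketch, not a validator. The function name escaping_issues and the two patterns are assumptions made for this example, and they must be run on the text exactly as it appears in the file, before any XML parser has decoded it.

```python
import re

# An escaped ampersand followed by what looks like the rest of an entity,
# e.g. &amp;amp; or &amp;lt; or &amp;#8211;
DOUBLE_ESCAPED = re.compile(r"&amp;(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
# A tag whose closing > has been escaped as well, e.g. &lt;p&gt;
OVER_ESCAPED_TAG = re.compile(r"&lt;/?[A-Za-z][A-Za-z0-9]*[^<>&]*&gt;")


def escaping_issues(raw_content):
    """List suspicious escaping found in the raw content of a data element."""
    issues = [f"double-escaped: {m.group(0)}" for m in DOUBLE_ESCAPED.finditer(raw_content)]
    issues += [f"> escaped in tag: {m.group(0)}" for m in OVER_ESCAPED_TAG.finditer(raw_content)]
    return issues


bad = "&lt;p>This is some text that mentions Marks &amp;amp; Spencer.&amp;lt;/p>"
good = "&lt;p>This is some text that mentions Marks &amp; Spencer.&lt;/p>"
print(escaping_issues(bad))
print(escaping_issues(good))
```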

ONIX 3.0 Special Considerations

Because of the common issues with ‘double-escaping’, the CDATA method is preferred over the escaping method in ONIX 3.0:

<Text textformat="02"><![CDATA[<p>This is some text that mentions Marks &amp; Spencer.</p>]]></Text>

With CDATA, the data content should be valid HTML without any changes.

Note that CDATA should not be used on any data except HTML – don’t use it unnecessarily on XHTML, or on ordinary text without markup.
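If you generate the CDATA wrapper in code, something like the sketch below may help. The function name wrap_in_cdata is invented for this example. The one complication is the sequence ]]>, which would end the CDATA section early if it appeared in the HTML, so the sketch uses the usual XML workaround of splitting it across two sections.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(html_fragment):
    """Wrap HTML in a CDATA section without changing the HTML itself."""
    # "]]>" cannot appear inside CDATA, so split it across two adjacent sections.
    safe = html_fragment.replace("]]>", "]]]]><![CDATA[>")
    return f"<![CDATA[{safe}]]>"


html_fragment = "<p>This is some text that mentions Marks &amp; Spencer.</p>"
element_xml = f'<Text textformat="02">{wrap_in_cdata(html_fragment)}</Text>'

# An XML parser hands the recipient the HTML exactly as it was written.
recovered = ET.fromstring(element_xml).text
print(recovered == html_fragment)
```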

Ideally, the whole text should be enclosed in <p> or maybe <ul> tags (or similar ‘block level’ HTML or XHTML tags):

<Text textformat="02">This is some &lt;em>text&lt;/em> marked up with HTML. &lt;p>Here is a second paragraph.&lt;/p></Text>

<Text textformat="02">&lt;p>This is some &lt;em>text&lt;/em> marked up with HTML.&lt;/p><p>Here is a second paragraph.&lt;/p></Text>

Once again, only the second version is really correct. There should be block-level tags enclosing the whole of the text – no free-floating text that is outside of any tags at all.
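Free-floating text is easy to find automatically. Here is a small sketch using Python's standard html.parser module. The class and function names are made up for this example, and the set of block-level tags is limited to the four recommended in this article. It expects the unescaped HTML, not the escaped form stored in the ONIX file.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "ul", "ol", "dl"}


class FreeTextFinder(HTMLParser):
    """Collects any text that is not inside a block-level tag."""

    def __init__(self):
        super().__init__()
        self.block_depth = 0
        self.free_text = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.block_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.block_depth > 0:
            self.block_depth -= 1

    def handle_data(self, data):
        # Whitespace between blocks is harmless, so only report real text.
        if self.block_depth == 0 and data.strip():
            self.free_text.append(data.strip())


def free_floating_text(html_fragment):
    finder = FreeTextFinder()
    finder.feed(html_fragment)
    finder.close()
    return finder.free_text


print(free_floating_text("This is some <em>text</em> marked up with HTML. <p>Here is a second paragraph.</p>"))
print(free_floating_text("<p>This is some <em>text</em> marked up with HTML.</p><p>Here is a second paragraph.</p>"))
```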


Using XHTML instead of HTML is the best method of all.

<Text textformat="05"><p>This is some text that mentions Marks &amp; Spencer.</p></Text>

What’s the difference? In many cases, not much. XHTML uses textformat="05" instead of "02", and you must ensure that all the markup tags are properly matched and nested. In HTML, the </p> tag is actually optional, but in XHTML it is mandatory. XHTML tags must also always be lower case, whereas in HTML upper case tags can be acceptable.
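Because XHTML has to be well-formed XML, an ordinary XML parser can check the matching and nesting rules for you. The sketch below wraps the fragment in a throwaway element (called wrapper here, purely for the example) and also reports upper case tags. The function name xhtml_problems is an assumption of this example. Schema validation of the whole ONIX file is a stronger check than this.

```python
import xml.etree.ElementTree as ET


def xhtml_problems(fragment):
    """Return a list of problems that stop the fragment being usable as XHTML."""
    try:
        # A temporary root element lets us parse a fragment with several blocks.
        root = ET.fromstring(f"<wrapper>{fragment}</wrapper>")
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    return [
        f"tag is not lower case: <{element.tag}>"
        for element in root.iter()
        if element.tag != element.tag.lower()
    ]


print(xhtml_problems("<p>This is some text that mentions Marks &amp; Spencer.</p>"))
print(xhtml_problems("<p>The closing tag is missing"))
print(xhtml_problems("<P>Upper case tags</P>"))
```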

It is strongly recommended that ONIX data suppliers use only the following tags:

<p> and <br /> – paragraphs and line breaks

<i> and <em>, <b> and <strong> – italic and bold emphasis
<cite> – for book titles
<ul>, <ol> and <li> – bulleted and numbered lists
<sub> and <sup> – sub- and superscript
<dl>, <dt> and <dd> – definition lists
<ruby>, <rb>, <rp> and <rt> – for simple glosses in Chinese, Japanese and other text

Of these, only <p>, <ul>, <ol> and <dl> are ‘block-level’. Any attributes (e.g. the style attribute) should be avoided. And it should be emphasised that these recommendations apply to both HTML (using CDATA or escaping of the < character) and XHTML.
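These recommendations are simple enough to check in code. The sketch below audits a piece of unescaped markup against the tag list above and reports any attributes. The names RECOMMENDED_TAGS, TagAuditor and audit_markup are invented for this example. Note that html.parser lower-cases tag names, so this check does not catch upper case tags.

```python
from html.parser import HTMLParser

RECOMMENDED_TAGS = {
    "p", "br", "i", "em", "b", "strong", "cite",
    "ul", "ol", "li", "sub", "sup", "dl", "dt", "dd",
    "ruby", "rb", "rp", "rt",
}


class TagAuditor(HTMLParser):
    """Records tags outside the recommended list and tags that carry attributes."""

    def __init__(self):
        super().__init__()
        self.unexpected_tags = set()
        self.tags_with_attributes = set()

    def handle_starttag(self, tag, attrs):
        if tag not in RECOMMENDED_TAGS:
            self.unexpected_tags.add(tag)
        if attrs:
            self.tags_with_attributes.add(tag)


def audit_markup(fragment):
    """Return (unexpected tags, tags with attributes), each as a sorted list."""
    auditor = TagAuditor()
    auditor.feed(fragment)
    auditor.close()
    return sorted(auditor.unexpected_tags), sorted(auditor.tags_with_attributes)


print(audit_markup('<p style="color:red">See <a href="x">this</a> <em>now</em></p>'))
print(audit_markup("<ul><li>One</li><li>Two</li></ul>"))
```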

There is a complete list of XHTML and HTML tags – allowed and disallowed – within the ONIX 3.0 Implementation and Best Practice Guide.

There is only a small list of ONIX data elements in which HTML or XHTML markup is acceptable.

For example, in ONIX 3.0, <BiographicalNote> can contain markup, but <ProductFormDescription> cannot. Here’s the complete list of data elements where markup can be used:


In ONIX 2.1, HTML and XHTML should only be used in data elements which take the textformat attribute, or have an associated <TextFormat> element:

FAQ

Q. If the “<” character occurs in the text of a short or long description, and I also want to include HTML or XHTML markup, how should I do that?

Because the “<” character is used to mark the beginning of a tag, it can be confusing if you also want to include it as part of the text. Generally, if you want a < to appear in the text of a web page (not as part of the markup), you would use &lt; and equally, if you want & to appear on a web page, you use &amp; instead.

But when you’re embedding HTML in ONIX with the escaping method, < characters at the beginning of tags are changed to &lt;. The recipient will turn every &lt; back into a < character. And while that would be correct for < at the beginning of a tag, it would not be correct for a < that is meant to appear as part of the text.

The simplest way to deal with this is this: if you’re using the escaping method and you want a < to appear as part of your marked-up text, use &#60; instead of &lt;

Alternatively, use the CDATA method instead. With CDATA, < at the beginning of a tag stays as <, and a < intended to appear as part of the text is changed to &lt; or &#60;.

And of course with XHTML, < at the beginning of a tag stays as <, and a < intended to appear as part of the text is changed to &lt; or &#60;.
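For the escaping method, the advice above comes down to two replacements done in the right order. This is a minimal sketch with a made-up function name, and it assumes the input is valid HTML in which a literal < in the text is already written as &lt;.

```python
def escape_html_for_onix(html_fragment):
    """Prepare valid HTML for the escaping method (textformat="02")."""
    # Step 1: a literal "<" in the text (written &lt; in HTML) becomes &#60;,
    # so the recipient cannot mistake it for the start of a tag.
    protected = html_fragment.replace("&lt;", "&#60;")
    # Step 2: the "<" that opens each tag becomes &lt;.
    return protected.replace("<", "&lt;")


html_fragment = "<p>Prices &lt; 10 &amp; falling</p>"
escaped = escape_html_for_onix(html_fragment)
print(escaped)

# A recipient who simply turns every &lt; back into < still gets valid HTML.
print(escaped.replace("&lt;", "<"))
```

The order matters. If the tags were escaped first, the second replacement would turn the tag-opening &lt; into &#60; as well.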

Q. Can I include named character entities in text with markup in ONIX 3.0?

If the markup is HTML, this is pretty confusing, because named character entities – things like &hellip; for an ellipsis or &ndash; for a dash – are valid in HTML and XHTML. But they are not valid in ONIX 3.0. A strict answer to the question depends on which method you’re using to embed the HTML:

<Text textformat="02">&lt;p>This is some text &ndash; with a dash!&lt;/p></Text>
– not correct, because the named character entity isn’t valid

<Text textformat="02">&lt;p>This is some text &amp;ndash; with a dash!&lt;/p></Text>
– not correct, because the character entity is ‘double-escaped’

<Text textformat="02">&lt;p>This is some text &#8211; with a dash!&lt;/p></Text>
– correct, uses the numerical character reference for the dash

<Text textformat="02">&lt;p>This is some text with a dash!&lt;/p></Text>
– correct, uses the native character for the dash, in whatever character encoding the entire message uses

<Text textformat="02"><![CDATA[<p>This is some text &ndash; with a dash!</p>]]></Text>
– this is correct – and will probably work (because it's within CDATA)

<Text textformat="02"><![CDATA[<p>This is some text &#8211; with a dash!</p>]]></Text>
– correct

<Text textformat="02"><![CDATA[<p>This is some text – with a dash!</p>]]></Text>
– correct

If you’re embedding XHTML – with textformat="05", and without CDATA or escaping – then the answer is much simpler. Either the native character or the numerical reference is fine, but the named entity (&ndash;) is not.
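If your source text is full of named entities, one way to stay on the safe side with every method is to convert them to numerical references before building the ONIX. The sketch below uses the name2codepoint table from Python's standard html.entities module. The function name to_numeric_references is an assumption of this example. It leaves the five entities that XML itself defines untouched, and ignores any name it does not recognise.

```python
import re
from html.entities import name2codepoint

# These five are predefined in XML, so they are valid in ONIX as they stand.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(text):
    """Replace HTML named entities such as &ndash; with numerical references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it exactly as it was
        return f"&#{name2codepoint[name]};"

    return NAMED_ENTITY.sub(replace, text)


print(to_numeric_references("<p>This is some text &ndash; with a dash!</p>"))
print(to_numeric_references("<p>Marks &amp; Spencer&hellip;</p>"))
```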