Introduction to XML

XML is a universal format for data exchange on the web. In essence, XML is a markup language like HTML, but with a stricter syntax. In this post I'll outline the main features of XML and how it differs from HTML.

Element names

XML has no predefined set of elements and, unlike HTML, no predefined DTD. You define the element names yourself. The main restrictions are:

  • names beginning with xml (in any letter case) are reserved, so you cannot use xml as an element name
  • element names cannot start with a digit, a hyphen or a period, and cannot contain spaces.
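
A quick way to try out a candidate element name is to hand a tiny document to a parser and see whether it is accepted. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name parses and the sample names are only illustrative.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element names, each wrapped in a one-element document.
name_results = {
    "invoice": parses("<invoice/>"),     # ordinary name
    "_note": parses("<_note/>"),         # a leading underscore is allowed
    "1invoice": parses("<1invoice/>"),   # a leading digit is rejected
}

for name, ok in name_results.items():
    print(name, "accepted" if ok else "rejected")
```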

XML prolog

The XML prolog (formally, the XML declaration), when present, must be the very first thing in an XML document, before the root element and with nothing in front of it, not even whitespace. It defines three aspects of an XML document:

  1. XML version (version attribute)
  2. document encoding (encoding attribute)
  3. whether the document is self-contained or relies on declarations in an external DTD (standalone attribute, with the value yes or no).

Example:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<document></document>
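
To see what a parser makes of these three values, the example can be loaded with Python's standard xml.dom.minidom module, which records them on the document object. This is only a sketch: the variable names are illustrative, and other parsers may expose the declaration differently or not at all.

```python
from xml.dom import minidom

# The example document from above, as bytes so the parser reads the
# encoding from the declaration itself.
SOURCE = (
    b'<?xml version="1.0" encoding="utf-8" standalone="yes"?>\n'
    b"<document></document>"
)

doc = minidom.parseString(SOURCE)

# minidom stores the values of the XML declaration on the Document node.
declaration = {
    "version": doc.version,
    "encoding": doc.encoding,
    "standalone": doc.standalone,
}
print(declaration)
```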

Root element

Every XML document must have exactly one root element. All other elements are nested inside it, so it is the top of the document tree that a DOM parser builds.

<?xml version="1.0" encoding="utf-8"?>
<document>

  <section id="section-a">
  
    <title level="1">Title</title>
    <para>...</para>
  
  </section>

</document>

In this case, document is the root element.
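
As a rough illustration of how a parser sees this structure, the following sketch loads the same document with Python's standard xml.etree.ElementTree module and walks the tree from the root. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SOURCE = b"""<?xml version="1.0" encoding="utf-8"?>
<document>
  <section id="section-a">
    <title level="1">Title</title>
    <para>...</para>
  </section>
</document>"""

# fromstring() returns the root element of the parsed document.
root = ET.fromstring(SOURCE)

# Direct children of the root, then every element in document order.
child_tags = [child.tag for child in root]
all_tags = [element.tag for element in root.iter()]

print(root.tag)
print(child_tags)
print(all_tags)
```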

Encoding and content type

The preferred encoding for XML is UTF-8. You can also use UTF-16; every XML parser is required to support both. The document should be served with a content type of either text/xml or application/xml. The former degrades more gracefully: user agents without explicit XML support can fall back to displaying it as plain text.
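
The sketch below serializes the same small document as UTF-8 and as UTF-16 and parses both with Python's standard xml.etree.ElementTree module, to show that the choice of encoding does not change the parsed content. The sample text and variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Template with a placeholder for the declared encoding; the content
# contains one non-ASCII character (e with acute accent).
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><document>caf\u00e9</document>'

utf8_bytes = TEMPLATE.format(enc="utf-8").encode("utf-8")
# Python's "utf-16" codec prepends a byte order mark, which the parser
# uses to detect the byte order.
utf16_bytes = TEMPLATE.format(enc="utf-16").encode("utf-16")

utf8_root = ET.fromstring(utf8_bytes)
utf16_root = ET.fromstring(utf16_bytes)

# The byte streams differ, the decoded text does not.
print(len(utf8_bytes), len(utf16_bytes))
print(utf8_root.text == utf16_root.text)
```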

Syntax rules

  1. XML is case-sensitive. This rule applies to element names, attribute names and attribute values.
  2. Every XML document should start with an XML prolog. It is optional in XML 1.0 and required in XML 1.1.
  3. Every XML document must have a root element (and only one root element).
  4. Elements must be correctly nested. Thus <element> <para> </element> </para> will cause a fatal XML parsing error.
  5. Attribute values must be enclosed within quotes. Thus <element attr=value></element> will cause a fatal XML parsing error.
  6. Empty elements must be closed like any other element. Thus <break> on its own will cause a fatal XML parsing error. You must write empty elements as <element /> or <element></element>.
  7. Characters with a special meaning in markup must be escaped: < and & always, and >, " and ' wherever they could be mistaken for markup. XML predefines only five named entities: &lt;, &gt;, &amp;, &quot; and &apos;. For any other character that you cannot type directly, use the hexadecimal notation, that is &#x, followed by the Unicode value, followed by a semicolon (for example, &#xA9; for ©). For Unicode values, see Alan Wood's site.
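
Several of these rules can be checked by feeding small fragments to a parser. The sketch below does so with Python's standard library; the helper is_well_formed and the sample fragments are only illustrative, and the escape() function from xml.sax.saxutils is shown as one possible way to escape text before placing it in a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses without a fatal error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "misnested elements": "<element><para></element></para>",
    "unquoted attribute": "<element attr=value></element>",
    "unclosed empty element": "<document><break></document>",
    "two root elements": "<a/><b/>",
    "mismatched case": "<Document></document>",
    "self-closing element": "<document><break /></document>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}

# Escape markup characters in plain text before embedding it.
escaped = escape("a < b & c")

# A hexadecimal character reference is resolved by the parser.
copyright_sign = ET.fromstring("<p>&#xA9;</p>").text

for label, ok in results.items():
    print(label, "ok" if ok else "fatal error")
print(escaped)
```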

XML style sheets

To associate a CSS style sheet with an XML document, insert an xml-stylesheet processing instruction just after the XML prolog and before the root element:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="style.css" type="text/css"?>

This processing instruction (PI) works much like a normal HTML link element.
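
If the document is generated by a program, the processing instruction can be added through the DOM rather than written by hand. Here is a minimal sketch with Python's standard xml.dom.minidom module; the root element document and the file name style.css are taken from the examples in this post, everything else is illustrative.

```python
from xml.dom import minidom

# Start from a document that contains only a root element.
doc = minidom.parseString("<document/>")

# Create the xml-stylesheet PI and place it before the root element.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'href="style.css" type="text/css"'
)
doc.insertBefore(pi, doc.documentElement)

# Serialize; the PI ends up between the XML prolog and the root element.
xml_text = doc.toxml()
print(xml_text)
```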

Class and ID attributes in XML

Since XML has no predefined DTD that associates attributes with elements, class and ID attributes don't necessarily work as in HTML: unless a DTD or schema says otherwise, they are ordinary attributes with no special meaning. If you try to use DOM methods such as getElementById() or getElementsByClassName(), you may get empty or null results, depending on the parser or browser. In the same way, CSS ID and class selectors may fail to match any element.

For JavaScript and the DOM, you can fall back on more generic methods, such as getElementsByTagName(). For CSS, you can use attribute selectors, such as [id="section-a"].
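
The same idea, selecting by attribute value instead of relying on a built-in notion of ID, can be sketched outside the browser as well. The example below uses Python's standard xml.etree.ElementTree module and its limited XPath support on the sample document from the root element section; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SOURCE = b"""<?xml version="1.0" encoding="utf-8"?>
<document>
  <section id="section-a">
    <title level="1">Title</title>
    <para>...</para>
  </section>
</document>"""

root = ET.fromstring(SOURCE)

# Look up an element by the value of its id attribute, treating id as
# an ordinary attribute (comparable to the CSS selector [id="section-a"]).
section = root.find(".//*[@id='section-a']")

# The same pattern works for any attribute, here level="1".
top_level_titles = root.findall(".//*[@level='1']")

# Generic lookup by element name, comparable to getElementsByTagName().
paragraphs = list(root.iter("para"))

print(section.tag, [t.text for t in top_level_titles], len(paragraphs))
```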

Posted by Gabriele Romanato.
