Technical Introduction to XML
It is somewhat remarkable to think that this article, which appeared initially in the Winter 1997 edition of the World Wide Web Journal, was out of date by the time the final XML Recommendation was approved in February. And even as this update brings the article back into line with the final spec, a new series of recommendations is under development. When finished, these will bring namespaces, linking, schemas, stylesheets, and more to the table.
This introduction to XML presents the Extensible Markup Language at a reasonably technical level for anyone interested in learning more about structured documents. In addition to covering the XML 1.0 Specification, this article outlines related XML specifications, which are evolving. The article is organized in four main sections plus an appendix.
What Do XML Documents Look Like?
If you are conversant with HTML or SGML, XML documents will look familiar. A simple XML document is presented in Example 1.
Example 1. A Simple XML Document
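<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>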
A few things may stand out to you:
- The document begins with a processing instruction: <?xml version="1.0"?>. This is the XML declaration [Section 2.8]. While it is not required, its presence explicitly identifies the document as an XML document and indicates the version of XML to which it was authored.
- There's no document type declaration. Unlike SGML, XML does not require a document type declaration. However, a document type declaration can be supplied, and some documents will require one in order to be understood unambiguously.
- Empty elements (<applause/> in this example) have a modified syntax [Section 3.1]. While most elements in a document are wrappers around some content, empty elements are simply markers where something occurs (a horizontal rule for HTML's <hr> tag, for example, or a cross reference for DocBook's <xref> tag). The trailing /> in the modified syntax indicates to a program processing the XML document that the element is empty and no matching end-tag should be sought. Since XML documents do not require a document type declaration, without this clue it could be impossible for an XML parser to determine which tags were intentionally empty and which had been left empty by mistake.
XML has softened the distinction between elements which are declared as EMPTY and elements which merely have no content. In XML, it is legal to use the empty-element tag syntax in either case. It's also legal to use a start-tag/end-tag pair for empty elements: <applause></applause>. If interoperability is of any concern, it's best to reserve the empty-element tag syntax for elements which are declared as EMPTY, and to use only that form for those elements.
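The equivalence of the two spellings of an empty element can be checked with a parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor; the describe helper is a name introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same empty element.
empty_tag = ET.fromstring("<applause/>")
tag_pair = ET.fromstring("<applause></applause>")


def describe(element):
    """Return (name, text, number of children) so two elements can be compared."""
    return (element.tag, element.text, len(element))


# Both forms produce an element with no text and no children.
same_result = describe(empty_tag) == describe(tag_pair)
```

A parser that works without a document type declaration has no way to know which form the author "should" have used, which is why the advice above is about interoperability rather than legality.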
XML documents are composed of markup and content. There are six kinds of markup that can occur in an XML document: elements, entity references, comments, processing instructions, marked sections, and document type declarations. The following sections introduce each of these markup concepts.
Elements
Elements are the most common form of markup. Delimited by angle brackets, most elements identify the nature of the content they surround. Some elements may be empty, as seen above, in which case they have no content. If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.
Attributes
Attributes are name-value pairs that occur inside start-tags after the element name. For example, <div class="preface"> is a div element with the attribute class having the value preface. In XML, all attribute values must be quoted.
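The quoting rule is easy to observe in practice. This sketch, again assuming Python's standard xml.etree.ElementTree parser, reads the class attribute from a div element and shows that an unquoted value is rejected as not well-formed. The parse_or_none helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_or_none(xml_text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


# Double or single quotes are both acceptable delimiters.
double_quoted = parse_or_none('<div class="preface"/>')
single_quoted = parse_or_none("<div class='preface'/>")

# An unquoted attribute value is an error in XML (unlike HTML).
unquoted = parse_or_none("<div class=preface/>")

class_value = double_quoted.get("class")
```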
Entity References
In order to introduce markup into a document, some characters have been reserved to identify the start of markup. The left angle bracket, <, for instance, identifies the beginning of an element start- or end-tag. In order to insert these characters into your document as content, there must be an alternative way to represent them. In XML, entities are used to represent these special characters. Entities are also used to refer to often repeated or varying text and to include the content of external files.
Every entity must have a unique name. Defining your own entity names is discussed in the section on entity declarations. In order to use an entity, you simply reference it by name. Entity references begin with the ampersand and end with a semicolon.
For example, the lt entity inserts a literal < into a document. So the string <element> can be represented in an XML document as &lt;element>.
A special form of entity reference, called a character reference [Section 4.1], can be used to insert arbitrary Unicode characters into your document. This is a mechanism for inserting characters that cannot be typed directly on your keyboard.
Character references take one of two forms: decimal references, &#8478;, and hexadecimal references, &#x211E;. Both of these refer to character number U+211E from Unicode (which is the standard Rx prescription symbol, in case you were wondering).
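To see how a parser resolves these references, the sketch below (assuming Python's standard xml.etree.ElementTree module) parses one predefined entity reference and both forms of character reference for U+211E. The element name p is arbitrary.

```python
import xml.etree.ElementTree as ET

# The predefined lt entity stands in for a literal left angle bracket.
escaped = ET.fromstring("<p>&lt;element></p>").text

# Decimal and hexadecimal character references for the same code point.
decimal = ET.fromstring("<p>&#8478;</p>").text
hexadecimal = ET.fromstring("<p>&#x211E;</p>").text

# 0x211E and 8478 are the same number, so both yield the same character.
same_character = decimal == hexadecimal
```

By the time the application sees the text, the references are gone: it receives the characters themselves.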
Comments
Comments begin with <!-- and end with -->. Comments can contain any data except the literal string --. You can place comments between markup anywhere in your document.
Comments are not part of the textual content of an XML document. An XML processor is not required to pass them along to an application.
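Whether comments reach the application depends on the processor. As a rough illustration, assuming Python's standard library, the default xml.etree.ElementTree parser drops comments while xml.dom.minidom exposes them as nodes. The sample document and the is_well_formed helper are made up for this sketch.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

SOURCE = "<joke><!-- cue laughter --><applause/></joke>"

# ElementTree's default parser does not keep comments in the tree.
tree_children = [child.tag for child in ET.fromstring(SOURCE)]

# minidom passes comments along as comment nodes.
dom_children = minidom.parseString(SOURCE).documentElement.childNodes
comment_texts = [n.data for n in dom_children if n.nodeType == n.COMMENT_NODE]


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The string -- may not appear inside a comment.
double_hyphen_ok = is_well_formed("<joke><!-- a -- b --></joke>")
```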
Processing Instructions
Processing instructions (PIs) are an escape hatch to provide information to an application. Like comments, they are not textually part of the XML document, but the XML processor is required to pass them to an application.
Processing instructions have the form: <?name pidata?>. The name, called the PI target, identifies the PI to the application. Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them.
PI names beginning with xml are reserved for XML standardization.
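The "process what you recognize, ignore the rest" rule might look like this in an application. This is only a sketch, assuming Python's standard xml.dom.minidom module; the targets render and other, and the collect_instructions helper, are hypothetical.

```python
from xml.dom import minidom

SOURCE = '<doc><?render mode="draft"?><?other ignored?></doc>'


def collect_instructions(xml_text, known_targets):
    """Return {target: data} for the PIs whose target the application recognizes."""
    document = minidom.parseString(xml_text)
    found = {}
    for node in document.documentElement.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target in known_targets:
            found[node.target] = node.data
    return found


# This application only understands the "render" target.
instructions = collect_instructions(SOURCE, {"render"})
```

The data after the target is handed over as an opaque string; any structure inside it (such as the attribute-like syntax used here) is the application's own convention.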
CDATA Sections
In a document, a CDATA section instructs the parser to ignore most markup characters.
Consider a source code listing in an XML document. It might contain characters that the XML parser would ordinarily recognize as markup (< and &, for example). In order to prevent this, a CDATA section can be used.
Between the start of the section, <![CDATA[, and the end of the section, ]]>, all character data is passed directly to the application, without interpretation. Elements, entity references, comments, and processing instructions are all unrecognized and the characters that comprise them are passed literally to the application.
The only string that cannot occur in a CDATA section is ]]>.
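A short experiment shows the effect on a source code listing. The sketch assumes Python's standard xml.etree.ElementTree parser; the listing element and the code fragment inside it are invented for the example.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, < and & need no escaping.
LISTING = "<listing><![CDATA[if (a < b && b < c) { swap(&a, &c); }]]></listing>"
code_text = ET.fromstring(LISTING).text

# The same content without CDATA requires every < and & to be escaped.
ESCAPED = (
    "<listing>if (a &lt; b &amp;&amp; b &lt; c) "
    "{ swap(&amp;a, &amp;c); }</listing>"
)
same_text = ET.fromstring(ESCAPED).text == code_text
```

Both spellings deliver identical character data to the application; the CDATA section is a convenience for the author, not a different kind of content.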
Document Type Declarations
A large percentage of the XML specification deals with various sorts of declarations that are allowed in XML. If you have experience with SGML, you will recognize these declarations from SGML DTDs (Document Type Definitions). If you have never seen them before, their significance may not be immediately obvious.
One of the greatest strengths of XML is that it allows you to create your own tag names. But for any given application, it is probably not meaningful for tags to occur in a completely arbitrary order. Consider the old joke example introduced earlier. Would this be meaningful?
Goodnight, Gracie Say Gracie.goodnight,
It's so far outside the bounds of what we normally expect that it's nonsensical. It just doesn't mean anything.
However, from a strictly syntactic point of view, there's nothing wrong with that XML document. So, if the document is to have meaning, and certainly if you're writing a stylesheet or application to process it, there must be some constraint on the sequence and nesting of tags. Declarations are where these constraints can be expressed.
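The gap between "syntactically fine" and "meaningful" can be demonstrated with a parser that checks well-formedness only. In this sketch, which assumes Python's standard xml.etree.ElementTree module, SCRAMBLED is a made-up rearrangement of the joke's tags (not the article's own example) and is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical scramble: every tag is balanced, but the nesting is nonsense.
SCRAMBLED = (
    "<quote><oldjoke>Goodnight, <applause/>Gracie</oldjoke>"
    "<burns>Say</burns></quote>"
)

# Tags that overlap instead of nesting are a genuine syntax error.
OVERLAPPING = "<oldjoke><burns>Say goodnight</oldjoke></burns>"

scrambled_ok = is_well_formed(SCRAMBLED)
overlapping_ok = is_well_formed(OVERLAPPING)
```

The parser accepts the scrambled document without complaint. Rejecting it requires declarations that say which elements may appear where.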
More generally, declarations allow a document to communicate meta-information to the parser about its content. Meta-information includes the allowed sequence and nesting of tags, attribute values and their types and defaults, the names of external files that may be referenced and whether or not they contain XML, the formats of some external (non-XML) data that may be referenced, and the entities that may be encountered.
There are four kinds of declarations in XML: element type declarations, attribute list declarations, entity declarations, and notation declarations.
Element Type Declarations
Element type declarations [Section 3.2] identify the names of elements and the nature of their content. A typical element type declaration looks like this:
<!ELEMENT oldjoke (burns+, allen, applause?)>
This declaration identifies the element named oldjoke. Its content model follows the element name. The content model defines what an element may contain. In this case, an oldjoke must contain burns and allen and may contain applause. The commas between element names indicate that they must occur in succession. The plus after burns indicates that it may be repeated but must occur at least once. The question mark after applause indicates that it is optional (it may be absent, or it may occur exactly once). A name with no punctuation, such as allen, must occur exactly once.
Declarations for burns, allen, applause and all other elements used in any content model must also be present for an XML processor to check the validity of a document.
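Content models behave much like regular expressions over the sequence of child elements. The sketch below is not a validating parser (Python's standard library does not include one); it hand-translates the content model described above (one or more burns, then one allen, then an optional applause) into a regular expression and applies it to the children found by xml.etree.ElementTree. OLDJOKE_MODEL and matches_oldjoke_model are hypothetical names, and character data between the children is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model: "+" means one or more,
# "?" means optional, and sequence order is significant.
# Each child name is followed by one space so that names cannot run together.
OLDJOKE_MODEL = re.compile(r"(burns )+(allen )(applause )?")


def matches_oldjoke_model(xml_text):
    """Check the child sequence of an oldjoke element against the model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "oldjoke" and OLDJOKE_MODEL.fullmatch(children) is not None


valid = matches_oldjoke_model("<oldjoke><burns/><burns/><allen/></oldjoke>")
wrong_order = matches_oldjoke_model("<oldjoke><allen/><burns/></oldjoke>")
```

A real validating processor does the same kind of matching for every element in the document, using the declarations themselves rather than a hand-written pattern.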
In addition to element names, the special symbol #PCDATA is reserved to indicate character data. The moniker PCDATA stands for parsed character data.
Elements that contain only other elements are said to have element content [Section 3.2.1]. Elements that contain both other elements and #PCDATA are said to have mixed content [Section 3.2.2].
For example, the definition for burns might be
<!ELEMENT burns (#PCDATA | quote)*>
The vertical bar indicates an or relationship, and the asterisk indicates that the content is optional (may occur zero or more times); therefore, by this definition, burns may contain zero or more characters and quote tags, mixed in any order. All mixed content models must have this form: #PCDATA must come first, all of the elements must be separated by vertical bars, and the entire group must be optional.
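Mixed content shows up in a parsed tree as character data interleaved with child elements. As a small illustration, assuming Python's standard xml.etree.ElementTree module (which stores the text before the first child in text and the text after a child in that child's tail), a burns element with an embedded quote comes apart like this:

```python
import xml.etree.ElementTree as ET

burns = ET.fromstring("<burns>Say <quote>goodnight</quote>, Gracie.</burns>")

leading_text = burns.text        # character data before the first child
quoted_text = burns[0].text      # character data inside the quote element
trailing_text = burns[0].tail    # character data following the quote element

# itertext() walks all of the character data in document order.
full_text = "".join(burns.itertext())
```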
Two other content models are possible: EMPTY indicates that the element has no content (and consequently no end-tag), and ANY indicates that any content is allowed. The ANY content model is sometimes useful during document conversion, but should be avoided at almost any cost in a production environment because it disables all content checking in that element.

