XML Basics, Part II: The Key Concepts
P.G. Daly
1/13/2004
In Part I of this series I answered the question, "What is XML?" Here in Part II of XML Basics, I will define, discuss, and illustrate some of the key concepts crucial to understanding and working with XML documents.
To briefly recap, XML is a metalanguage for describing markup languages. It provides a facility for defining tags and the structural relationships between them.
In order to view XML documents hierarchically or view their output, you need an XML parser and processor. While a number of these tools are available (Perfect XML has one such listing), for simplicity I will use Internet Explorer (5.x and later) to view the XML documents in my examples. Internet Explorer has a built-in XML parser and processor and is readily available.
The basic flow of XML processing consists of creating an XML document (and, optionally, corresponding XSL stylesheets) and passing it through an XML parser and processor to produce the desired output(s). One of the benefits of XML is the ability to create multiple outputs from a single XML document, as the following visual representation of the process shows:
The XML Document
XML documents are composed of markup and content. Six kinds of markup can occur in XML documents:
- Document Type Declarations
- Elements
- Comments
- Entity References
- Processing Instructions
- Conditional Sections
I will discuss the first three items in this article, since entity references, processing instructions, and conditional sections are more advanced topics that are best set aside for now.
An XML document normally begins with an XML declaration that identifies it as XML and states which version of XML it uses. While the XML declaration is not mandatory, it is good practice to include it anyway. When present, it must be the very first thing in the document. It can be as simple as:
<?xml version='1.0'?>
or it can include the optional encoding and standalone attributes and look something like:
<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
The encoding attribute tells the XML parser which character encoding the document's text is stored in, so that the parser can decode the bytes of the file into Unicode characters. When the attribute is omitted, the parser assumes UTF-8 or UTF-16.
The standalone attribute specifies whether the XML document depends on markup declarations held in other, external files, such as an external DTD. A value of 'yes' means it does not.
Most of the time, it will be sufficient to accept the defaults and not include these two attributes.
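To make the role of the encoding attribute concrete, here is a minimal sketch in Python, using the xml.etree.ElementTree module from the standard library rather than Internet Explorer. The Title element and its text are hypothetical, chosen only because the text contains a non-ASCII character, and ISO-8859-1 stands in for UTF-16 simply to keep the example small.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document; "Title" and its text are illustrative only.
TEXT = "Amélie"

# With an encoding attribute: the bytes are ISO-8859-1, and the declaration says so.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<Title>" + TEXT + "</Title>"
).encode("iso-8859-1")

# Without a declaration: the parser falls back to the default, UTF-8.
utf8_doc = ("<Title>" + TEXT + "</Title>").encode("utf-8")

# The two byte sequences differ, but both decode to the same Unicode text.
latin1_title = ET.fromstring(latin1_doc).text
utf8_title = ET.fromstring(utf8_doc).text
print(latin1_title == utf8_title == TEXT)
```

The point of the sketch is only that the declaration describes how the file is stored; the text the application receives is the same either way.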
Document Type Declarations
Typically referred to as "the DOCTYPE declaration", the Document Type Declaration links an XML document to a Document Type Definition (DTD). As luck would have it, the similar names and abbreviations are easy to confuse, but there is a distinction between the two. While the DOCTYPE does the linking, the DTD sets the rules of the road. A DTD states which tags and attributes are allowed in a given document, where they can be placed, and whether or not they can be nested.
Since one of XML's benefits is its strong adherence to common standards while still being extensible, the DTD, coupled with the XML specification, is the key to making this work in the real world.
An example of what a DOCTYPE looks like is:
<!DOCTYPE MovieCatalog SYSTEM "movie_catalog.dtd">
In this example, MovieCatalog is the name of the root document element (we'll discuss this in a moment under Elements). The name is required and links the DTD to the entire element tree. The keyword SYSTEM and the URL that follows it allow the document to locate the corresponding DTD file on the same or an external filesystem.
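As a rough illustration of the two pieces of information the DOCTYPE carries, the following Python sketch reads the declaration above with the standard library's xml.dom.minidom module. The empty MovieCatalog root element is an assumption added so that the snippet is a complete document. The parser here only records the declaration; it does not open movie_catalog.dtd or check the document against it.

```python
from xml.dom import minidom

# The DOCTYPE from the article, followed by an empty root element.
SOURCE = (
    '<!DOCTYPE MovieCatalog SYSTEM "movie_catalog.dtd">'
    "<MovieCatalog/>"
)

document = minidom.parseString(SOURCE)
doctype = document.doctype

# The name in the DOCTYPE is meant to match the root element's name.
print(doctype.name)                      # name given after <!DOCTYPE
print(doctype.systemId)                  # location given after SYSTEM
print(document.documentElement.tagName)  # the actual root element
```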
DTDs can get a little confusing and don't really make much sense until you've worked with some XML documents. So, for now, it is sufficient just to understand what they are and why they are important.
Elements
When I think of an XML document, I think primarily of the elements that make it up. To me, they are the heart of the document, the true content. An element consists of everything from the beginning of a start-tag to the end of the matching end-tag, including everything in between.
To draw an example from HTML, all of the following would be the equivalent of one element, named h1:
<h1>This is my big heading.</h1>
Here, <h1> is the start-tag, </h1> is the end-tag, and the content is everything in between.
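For readers who want to see how a parser views that same element, here is a small Python sketch using the standard library's xml.etree.ElementTree module. It is only one way of inspecting an element, not the tool used elsewhere in this series.

```python
import xml.etree.ElementTree as ET

heading = ET.fromstring("<h1>This is my big heading.</h1>")

print(heading.tag)   # the element name shared by the start-tag and end-tag
print(heading.text)  # the content between the two tags
```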
Each XML document has a root element within which all other elements are nested. So, if we were creating an XML document representing a movie catalog, as in the example of the DOCTYPE statement above, the root element might look as follows:
<MovieCatalog>
Many other nested elements...
</MovieCatalog>
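The nesting can be sketched in Python with the standard library's xml.etree.ElementTree module. Note that Movie and Title are hypothetical element names invented for this illustration; the actual structure of the catalog is not defined in this article.

```python
import xml.etree.ElementTree as ET

# Movie and Title are hypothetical element names used only for illustration.
CATALOG = """
<MovieCatalog>
  <Movie><Title>First Movie</Title></Movie>
  <Movie><Title>Second Movie</Title></Movie>
</MovieCatalog>
"""

root = ET.fromstring(CATALOG)

# Every other element is reached by walking down from the single root element.
titles = [movie.findtext("Title") for movie in root.findall("Movie")]

print(root.tag)
print(titles)
```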
Fundamental to the understanding of XML are the rules all elements must follow. I will list and describe them briefly:
- Every start-tag must have a matching end-tag. (An element with no content may instead be written as a single empty-element tag, such as <br/>.)
- Tags cannot overlap. Proper nesting is required.
- XML documents can only have one root element.
- Element names must obey the following XML naming conventions:
  - Names must start with a letter or the "_" character. Names cannot start with numbers or other punctuation characters.
  - After the first character, numbers, hyphens, periods, and underscores are also allowed.
  - Names cannot contain spaces.
  - Names should not contain the ":" character, as it is reserved for use with namespaces.
  - Names should not start with the letters "xml" in any combination of case; such names are reserved.
- The element name must come directly after the "<" without any spaces between them.
- XML is case sensitive.
- XML preserves white space within text.
- Elements may contain attributes. If an attribute is present, it must have a value, even if it is an empty string "". (I'll discuss attributes further in a complete MovieCatalog example.)
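Several of the rules above can be checked mechanically by handing small documents to a parser and seeing whether it accepts them. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree module; the helper name is_well_formed and every snippet in it are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each hypothetical snippet breaks one of the rules listed above.
BROKEN = {
    "missing end-tag": "<a><b></a>",
    "overlapping tags": "<a><b><i>text</b></i></a>",
    "two root elements": "<a/><b/>",
    "name starts with a digit": "<1a/>",
    "space after <": "< a/>",
    "case mismatch": "<Movie></movie>",
    "attribute without a value": "<a flag/>",
}

for rule, snippet in BROKEN.items():
    print(rule, "->", is_well_formed(snippet))

# A document that follows the rules, including an empty-string attribute value.
print(is_well_formed('<_a-1.b flag=""/>'))

# White space inside text is handed to the application unchanged.
print(repr(ET.fromstring("<a>  two  spaces  </a>").text))
```

The sketch deliberately leaves out the ":" and "xml" naming rules, because a parser will not necessarily reject names that break them.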
Comments
Comments in XML work the same way as comments in any other language with which you are familiar. They are simply lines of code whose sole purpose is to give the developer, and anyone reading the code in the future, information about the code. Comments are the "notes in the margin" of the programming world. The syntax for an XML comment is:
<!-- all the comments go in here -->
Comments can contain any combination of letters, numbers, or punctuation except for the literal string "--".
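A short Python sketch, again using the standard library's xml.etree.ElementTree module, shows both points: the comment is not part of the element's content, and a "--" inside a comment is rejected. The Note element, its text, and the helper name parses are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# "Note" is a hypothetical element used only for illustration.
GOOD = "<Note><!-- all the comments go in here -->visible text</Note>"
BAD = "<Note><!-- two hyphens -- inside --></Note>"

note = ET.fromstring(GOOD)

print(note.text)    # the comment is not part of the element's text
print(parses(BAD))  # "--" inside a comment is not allowed
```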
Where Does all this Lead?
An understanding of these topics is crucial to building a foundation of XML knowledge. Once you understand these basics, a real XML example will help you pull the pieces together. In Part III of my XML Basics series, I will build on the movie catalog concept with a full example. In addition, I will introduce the very important topics of "well-formed" and "valid" XML and explain the distinction between them.