This lesson explains the importance of XML as a data-interchange language and shows how XML documents are constructed using tags, hierarchy, and grammar. It also introduces the Document Type Definition (DTD) language for defining the grammar (structure) of an XML document.
XML in a Nutshell
XML is a markup language that uses elements and attributes to create a text file of structured data. XML is a common data format used by many organizations, systems, and the web to exchange data. Often, data needed for analytics resides in an XML document and must be imported. Consequently, data analytics professionals should know how to read (parse) XML and prepare the data for import into a data mining effort or a data warehouse.
One hallmark of XML is that the tags that create elements are created by the person or system that creates the file. As such, there is no specific definition or explanation of the tags and the elements they create. This increases flexibility in how the files can be used but can make it more difficult to exchange data between systems effectively. For this reason, we use a DTD to describe the tags that are used and the order and quantity in which they must appear in the XML.
A DTD, or Document Type Definition, defines the structure of an XML document (or file). It describes the elements and attributes used in the file and helps us to understand and agree on the contents. Validating parsers can also read the DTD to ensure that the structure of an XML file is valid according to the DTD. Since we are working with large data sets, we cannot check them ourselves; a DTD, along with the restrictions in the XML language, helps to ensure that the data is structured correctly and thus can be imported.
A DTD is text-based. It can be internal, part of your XML file, or external, in a stand-alone text file. Either way, the DTD specifies what we should expect in the XML. The DTD contains the following elements:
Elements: appear as tags, determine what is being structured
Attributes: provide extra information about the element
Entities: define shortcuts that have special meaning
PCDATA: parsed character data, data that should be parsed
CDATA: character data that should not be parsed
The DTD
The DTD (Document Type Definition) defines the structure of an XML document. It is the grammar for the XML tag language. A DTD can be part of the XML (internal) or in a separate file (external), which is easier for sharing and standards.
Internal DTD
An internal DTD appears at the top of your XML file, directly below the XML declaration. It is not shown when the XML is rendered for display (for example, in a browser). If you view the source code in a text editor, you will see the DTD, followed by the XML elements.
<?xml version="1.0"?>
<!DOCTYPE writing [
<!ELEMENT writing (title,author+,pubdate?)>
<!ATTLIST writing type CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (authorfirstname, authorlastname)>
<!ELEMENT authorfirstname (#PCDATA)>
<!ELEMENT authorlastname (#PCDATA)>
<!ELEMENT pubdate (month,day,year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT year (#PCDATA)>
]>
<writing type='book'>
  <title> The Joy of Data </title>
  <author>
    <authorfirstname>J.</authorfirstname>
    <authorlastname>Datasmith</authorlastname>
  </author>
  <pubdate>
    <month>12</month>
    <day>18</day>
    <year>2020</year>
  </pubdate>
</writing>
In the above example, the DTD tells us that the document stores a writing, with space to capture its type (here, a book), a title, one or more authors (each broken into first and last name, which are called child nodes), and an optional publication date (with the child nodes month, day, and year). #PCDATA tells us that elements such as title contain parsed character data, i.e., text. The type of writing is an attribute of the writing element.
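As a minimal sketch of how such a document might be read during data preparation, the Python snippet below parses the example with the standard library module xml.etree.ElementTree. The use of Python and the variable names are assumptions made for illustration. This parser reads past the internal DTD but does not validate the document against it.

```python
import xml.etree.ElementTree as ET

# The lesson's example document, including its internal DTD.
DOC = """<?xml version="1.0"?>
<!DOCTYPE writing [
<!ELEMENT writing (title,author+,pubdate?)>
<!ATTLIST writing type CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (authorfirstname, authorlastname)>
<!ELEMENT authorfirstname (#PCDATA)>
<!ELEMENT authorlastname (#PCDATA)>
<!ELEMENT pubdate (month,day,year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT year (#PCDATA)>
]>
<writing type='book'>
  <title> The Joy of Data </title>
  <author>
    <authorfirstname>J.</authorfirstname>
    <authorlastname>Datasmith</authorlastname>
  </author>
  <pubdate>
    <month>12</month>
    <day>18</day>
    <year>2020</year>
  </pubdate>
</writing>"""

root = ET.fromstring(DOC)                 # non-validating parse
kind = root.get("type")                   # attribute of <writing>
title = root.findtext("title").strip()    # text of the <title> child
authors = [
    (a.findtext("authorfirstname"), a.findtext("authorlastname"))
    for a in root.findall("author")       # author+ means there may be several
]
print(kind, title, authors)
```

Because author may repeat, it is collected into a list rather than read as a single value.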
External DTD
If the DTD is external, it is located in a separate file. You will see that the XML code references the DTD in the opening lines of the XML. For example, a DTD with the name “writing.dtd” would be referenced as: <!DOCTYPE writing SYSTEM "writing.dtd">
An external DTD file will contain similar text to the internal, but will only specify what is necessary for the definitions.
<!ELEMENT writing (title,author+,pubdate?)>
<!ATTLIST writing type CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (authorfirstname, authorlastname)>
<!ELEMENT authorfirstname (#PCDATA)>
<!ELEMENT authorlastname (#PCDATA)>
<!ELEMENT pubdate (month,day,year)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT year (#PCDATA)>
DTD vs XML Schema Definition (XSD)
You may find that some XML documents do not contain references to a DTD and instead reference a schema. An XML schema is similar to a DTD in that it describes the structure of the XML document.
A DTD is written as a grammar (in fact, it is a BNF, or Backus-Naur Form, grammar) and tends to be brief in nature. XML schemas are written in XML and are more powerful than a DTD because they allow for more control over the elements. An XML schema supports data types and namespaces. This allows for more granular descriptions, restrictions, and validation.
Both a DTD and a schema allow you to declare your elements and attributes, describe how these elements are grouped, nested, and used, and restrict, to differing degrees, the type or format of elements. A DTD may be less complex to write because it uses its own compact syntax with fewer constructs. It is useful where a brief structural description is enough, whereas an XML schema is often more complex because it is itself written in XML.
A way to think about this:
A DTD will define and describe the structure of your document.
An XML schema will be more descriptive and allow you to have more control of data types.
In data and analytics work, when we access files to prepare them for inclusion in a data warehouse or an OLAP database, we may see either one of these as part of the XML files.
Data Sources
Not all data in an organization is stored in a database managed by a database management system (DBMS) such as Oracle or MySQL. In fact, a whole lot more data is stored in ad-hoc data files, spreadsheets, and on the web. Most web-based data is contained in unstructured HTML files served by web servers while some data is available in structured or semi-structured XML documents.
Structured vs Unstructured Data
Data in relations (tables) stored in a relational database is structured data and the structure is defined through the database schema while data integrity is defined via constraints and enforced by the DBMS. Every row in a table has the same form and each attribute conforms to defined data types.
The same is not necessarily true for data in files, spreadsheets, and HTML, although some may be semi-structured by virtue of the tabular presentation of the data. Unstructured data is much more difficult to work with but is not at all uncommon.
Web Data & Scraping
A significant source of data is the web where data is often displayed on web pages. Such data is unstructured, and interpretation is done through inspection and retrieval via “scraping”. Web scraping is a technique for acquiring (retrieving) the data contained in unstructured HTML and converting it to a structured format that is usable for processing or inclusion in a structured data store. Most programming languages offer web scraping packages and functions. There are also numerous platforms available for scraping, e.g., http://x-tract.io. While the data is often publicly available, many information providers either explicitly forbid web scraping or they implicitly obfuscate data or hinder web scrapers. Some data is copyrighted (e.g., most sports data) and may not be used for commercial purposes or stored in any retrieval system.
Data Interchange
Data must often be interchanged and exported from systems and databases. An “export format” is required:
CSV (Comma Separated Values)
XML (domain language or ad-hoc)
Other interchange formats include SOAP and JSON.
CSV files represent either full tables or combinations (joins) of tables but are inherently tabular. XML files have a hierarchical structure and, through a DTD or schema, enforcement of structural constraints. XML files are queried through XPath and XQuery or via custom programming.
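To make the contrast concrete, the sketch below queries a small hierarchical XML document with path expressions and then flattens it into tabular rows of the kind a CSV file would hold. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath. The second publisher record is hypothetical data added for illustration.

```python
import xml.etree.ElementTree as ET

# Small XML store; the second publisher is made-up sample data.
DOC = """<publishers>
  <publisher pID="1001">
    <name>Murray</name>
    <address aType="US"><city>Wilmington</city><state>DE</state></address>
  </publisher>
  <publisher pID="1002">
    <name>Example Press</name>
    <address aType="CA"><city>Toronto</city><state>ON</state></address>
  </publisher>
</publishers>"""

root = ET.fromstring(DOC)

# Path query with an attribute predicate: cities of all US addresses.
us_cities = [c.text for c in root.findall(".//address[@aType='US']/city")]

# Flatten the hierarchy into rows (pID, name, city), as a CSV export would.
rows = [
    (p.get("pID"), p.findtext("name"), p.findtext("address/city"))
    for p in root.iter("publisher")
]
print(us_cities)
print(rows)
```

Flattening loses the nesting: each row repeats the publisher fields once per exported record, which is why CSV is described as inherently tabular.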
XML formats (elements, tags, attributes, structure) can be converted from one format to another without programming by specifying transformation rules in XSL. XSL itself is an XML language for specifying transformation rules.
Structure of an XML Document
While XML data is often contained in documents (files), that is not always the case. There are databases that are based on XML, such as Apache Axiom, Qizx, BaseX, Sonic XML Server, eXistdb, Sedna, and webMethod’s Tamino. So, rather than using the term XML document, we often prefer the term XML store.
An XML document uses two main constructs: elements and attributes. An element is a start tag and a matching end tag and contents enclosed between them, e.g., <PHONE>617-373-9000</PHONE>. XML is case sensitive, so <PHONE> and <phone> are not the same element tag. Elements can be nested, and an XML store forms a tree or hierarchy.
An XML document starts with the XML declaration <?xml version="1.0"?>, followed by an optional DTD or XML Schema reference, followed by the root element with its child elements. There is only one root element, so all XML stores form a tree.
<?xml version="1.0" encoding="UTF-8"?>
<root>
  ...<!-- child elements -->
</root>
An XML document is well-formed if
it starts with an XML declaration in the form <?xml version="1.0" encoding="UTF-8" ?> (strictly, the XML 1.0 specification makes the declaration optional, but including it is good practice),
it has a single root element, which contains all other elements in the document as child elements, and, finally,
all elements are properly nested in the XML document.
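The rules above can be checked mechanically. As a minimal sketch, the Python function below treats a document as well-formed if the standard library parser accepts it. The function name is_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = '<?xml version="1.0"?><root><a><b/></a></root>'
bad_nesting = '<?xml version="1.0"?><root><a><b></a></b></root>'  # tags overlap
two_roots = '<?xml version="1.0"?><root/><root/>'                 # second root element

for sample in (good, bad_nesting, two_roots):
    print(is_well_formed(sample))
```

This checks well-formedness only; it says nothing about whether the document conforms to a DTD.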
Elements vs Attributes
Elements may have attributes associated with them. An XML element attribute provides information about the element and is in the form of a name/value pair. Elements may have multiple attributes. Attribute names must be unique only within an element, not across the file.
In the XML fragment below, the element <publisher> has an attribute pID and the element <address> has the attribute aType.
Attribute values must be enclosed in quotes, even when they are numbers; a parser rejects unquoted values as not well-formed. All attributes have the data type “character” or text/string even if they consist only of digits. Single and double quotes are equivalent, but the opening and closing quote must match, i.e., 'att-val' is equivalent to "att-val". This is useful when we need to enclose one type of quote within another, e.g., "Karl's Course".
Comments
XML allows inserting comments anywhere in the document (except within tags). Comments do not become a part of the processed or displayed content. Comments are enclosed within <!-- and -->.
<?xml version="1.0" encoding="UTF-8"?>
<!-- List of publishers -->
<publishers>
  <publisher pID="1001">
    <name>Murray</name>
    <address aType="US">
      <city>Wilmington</city>
      <state>DE</state>
    </address>
    <phone>647.331.5555</phone>
  </publisher>
  ...
</publishers>
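The sketch below reads the publisher fragment with Python's xml.etree.ElementTree to show two points from this section: attribute values arrive as strings, and comments are not part of the processed content. The elided part of the fragment is left out, one attribute uses single quotes, and the comment inside the root element is added for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!-- List of publishers -->
<publishers>
  <!-- first publisher (comment added for illustration) -->
  <publisher pID="1001">
    <name>Murray</name>
    <address aType='US'>
      <city>Wilmington</city>
      <state>DE</state>
    </address>
    <phone>647.331.5555</phone>
  </publisher>
</publishers>"""

root = ET.fromstring(DOC)
publisher = root.find("publisher")

pid = publisher.get("pID")                       # always a string
atype = publisher.find("address").get("aType")   # single quotes work the same
pid_number = int(pid)                            # convert explicitly if a number is needed

# The default parser drops comments, so only the element remains as a child.
child_tags = [child.tag for child in root]
print(repr(pid), atype, pid_number, child_tags)
```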
Special Characters in XML
As some characters have special meaning in XML, e.g., < or >, they are a bit more difficult to insert. The table below lists some of the more common characters and their XML representation.
Character | XML Representation | Example
< | &lt; | <tag>y &lt; 0</tag>
> | &gt; | <tag>use &lt;X&gt;</tag>
& | &amp; | <tag>y &amp; w</tag>
" | &quot; | <tag>y&quot;</tag>
' | &apos; | <tag>Billy&apos;s Boat</tag>
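When XML is generated programmatically, these replacements are usually applied by a helper rather than by hand. As a minimal sketch in Python, xml.sax.saxutils.escape replaces &, < and > by default and accepts an optional mapping for the quote characters. The sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "y < 0 & w > 1"
escaped = escape(raw)                 # &, <, > are replaced by default

# Quotes are only replaced when an extra mapping is supplied.
with_quotes = escape("Billy's \"Boat\"", {"'": "&apos;", '"': "&quot;"})

# Round trip: a parser turns the representations back into the characters.
round_trip = ET.fromstring("<tag>" + escaped + "</tag>").text

print(escaped)
print(with_quotes)
print(round_trip)
```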
Defining Structure with a DTD
An important issue is the lack of a definition of the structure of the XML elements. How do we know – or how does a receiving information system know – that the XML file is properly structured? Solution: Define the grammar of the XML. Method: Use the Document Type Definition (DTD) meta language.
A DTD is a set of rules that defines the elements and attributes that can appear within an XML document and their sequence and nesting. An XML document is considered a valid document if it conforms to the associated DTD. A DTD can appear within an XML document (internal DTD) or be defined in an external file (external DTD), which allows for sharing. An internal DTD is only visible within the file in which it resides, meaning that other XML files cannot use the DTD. A DTD that applies to several XML documents or is used to communicate an XML standard should be placed into an external file rather than embedded.
Assume that the DTD shown below is stored in the file note.dtd.
Then the XML document below can reference that external DTD and ensure (during parsing of the XML) that it conforms to the rules of the DTD. Note the syntax of the reference. It starts with <!DOCTYPE followed by the name of the root element (<note> in this example), the keyword SYSTEM, and the path to the file, which can be a local file or a URL.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Chris</to>
  <from>Topher</from>
  <heading>Reminder</heading>
  <body> Don't forget the meeting! </body>
</note>
To summarize, the Document Type Definition (DTD) can be in a separate file or defined inline at the beginning of the XML document. It is used to validate the structure of the XML document and defines the “grammar” of our XML “language”. The DTD language is used to define the grammar of the XML using Extended Backus-Naur Form (EBNF). A DTD is optional (but recommended), although it is required if the parser is to validate the document.
DTD: Cardinality
A cardinality adornment may follow an element. If no cardinality adornment is present, then the implied cardinality is exactly 1, i.e., the element must appear and cannot appear more than once. There are three adornments that modify cardinality:
* indicating a cardinality of zero or more, i.e., the element can appear as often as desired and possibly not at all (optional)
+ indicating a cardinality of one or more, i.e., the element can appear as often as desired but must appear at least once (mandatory)
? indicating a cardinality of zero or one, i.e., the element may not appear at all or at most once
These adornments will become clearer in the sections that follow when we look at examples.
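The adornments behave like the quantifiers of a regular expression applied to the ordered list of child elements. The sketch below is not a DTD validator; it hand-translates two content models that appear in this lesson, (title,author+,pubdate?) and (Author*), into Python regular expressions to show what each adornment accepts. The helper name fits and the encoding of each child tag as its name followed by a space are assumptions made for illustration.

```python
import re

# Each child tag is written as "name " so that adjacent names cannot run together.
WRITING_MODEL = re.compile(r"title (author )+(pubdate )?")   # (title,author+,pubdate?)
ROOT_MODEL = re.compile(r"(Author )*")                       # (Author*)

def fits(model, child_tags):
    """Return True if the ordered child tag names match the whole model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model.fullmatch(sequence) is not None

print(fits(WRITING_MODEL, ["title", "author"]))                       # pubdate? may be absent
print(fits(WRITING_MODEL, ["title", "author", "author", "pubdate"]))  # author+ may repeat
print(fits(WRITING_MODEL, ["title"]))                                 # author+ needs at least one
print(fits(WRITING_MODEL, ["title", "author", "pubdate", "pubdate"])) # pubdate? at most once
print(fits(ROOT_MODEL, []))                                           # Author* allows none
```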
DTD: Content
An element that has a value is indicated with <!ELEMENT foo (#PCDATA)>. The value is what is between the opening and closing tag, e.g., for the element <foo>content</foo>, the value is “content”.
DTD: Sequence
A sequence in XML is an element followed by another element. The annotated DTD below shows the different parts of the DTD and what they specify. The commas in the element definition for Author indicate a sequence of four child elements, which must appear in that order.
The DTD specifies that the “root” element is <root> and that it contains zero or more <Author> elements as indicated by the specification <!ELEMENT root (Author*)>. The parentheses enclose the child elements, so the root element has zero or more (indicated by the * after Author) Author elements.
Each Author element consists of a sequence of child elements, all of which must appear exactly once (no cardinality adornment) in that sequence (the comma indicates a sequence).
Figure: Sequence of Elements in a DTD
An XML document that conforms to the above DTD would look like this:
The child elements directly underneath <Author> must appear in exactly that order and cannot be omitted, i.e., they are mandatory or required. Leaving one of the child elements out or changing the sequence would result in an error when the XML document is validated (perhaps during parsing when it is loaded into an application for processing).
The XML document below is also valid even though it contains no child elements under <root>. That is valid as the cardinality adornment for Author is *, which indicates zero or more elements of that type.
Naturally, most of these errors would not occur in practice if the XML is generated programmatically.
DTD: Mixed Content
The comma (,) operator specifies sequence in a DTD. On the other hand, the or (|) operator specifies a choice. If the DTD contains (foo|bar) it means that either the element foo or the element bar must appear. So, the DTD below specifies that an Author element has a mandatory child element authorID followed by another mandatory child element name followed by either an org or an email element. Actually, since it states (org|email)* it means zero or more of (org|email).
Figure: Choice of Elements in a DTD
An XML document that conforms to the above DTD would look like this:
If we wanted at least one org and one email we would have specified the DTD as follows, where it would be one each of authorID, name, org, and email in that sequence, followed by zero or more occurrences of org or email in any combination.
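The difference between the two content models can be explored without a validating parser. The sketch below converts a simplified content model into a regular expression and tests ordered lists of child tags against it. The two model strings are one plausible reading of the DTDs described above, and the function names are hypothetical. The converter assumes models built only from element names, parentheses, and the operators , | ? + * (so no #PCDATA).

```python
import re

def model_to_regex(model):
    """Translate a simplified DTD content model into a compiled regex."""
    body = model.replace(" ", "")
    # Wrap every element name as a group that matches "name ".
    body = re.sub(r"[A-Za-z_][\w.-]*", lambda m: "(?:" + m.group(0) + " )", body)
    # A comma means "followed by", which is plain concatenation in a regex.
    body = body.replace(",", "")
    return re.compile(body)

def conforms(model, child_tags):
    """Return True if the ordered child tag names match the whole model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model_to_regex(model).fullmatch(sequence) is not None

LOOSE = "(authorID,name,(org|email)*)"             # org/email optional, any order
STRICT = "(authorID,name,org,email,(org|email)*)"  # at least one org, then one email

print(conforms(LOOSE, ["authorID", "name"]))
print(conforms(LOOSE, ["authorID", "name", "email", "org", "org"]))
print(conforms(STRICT, ["authorID", "name"]))
print(conforms(STRICT, ["authorID", "name", "org", "email", "email"]))
```

The choice operator | and the adornments keep their meaning in the regular expression, so only the names and commas need translating.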
Elements may (optionally) have attributes, which are key/value pairs set in the start tag of the element. Attributes are character strings and must be enclosed in quotes, even if they are numbers.
The example below uses an attribute for authorID rather than representing it as a child element. This is common for “key values”. An attribute can have a data type of CDATA (character data), ID (unique value), IDREF (reference to an ID), or an enumerated token (see below), among others. An attribute is specified with <!ATTLIST element attribute type default>, so <!ATTLIST Author authorID CDATA #REQUIRED> specifies the (mandatory) attribute authorID for the element Author.
Element attributes can have the following constraints:
#REQUIRED – a value for the attribute must be provided
#FIXED – a fixed value that cannot be changed and all attributes for all element instances have that value; a value must be specified after this keyword
#IMPLIED – an optional attribute that may or may not be present
Note that these keywords must be in upper case and only one can be specified.
DTD: Enumerated Tokens
XML recognizes characters as the default type of any value. In addition, XML supports value sets or categorical attributes which draw their values from a predefined set of values. To enforce a value set (enumerated tokens) it must be defined as an attribute as no validation can occur for parsed character data (#PCDATA).
In the XML example below, the LearningAsset element has the attribute difficulty. It must be one of (easy|medium|difficult) and, if not present, defaults to the value “easy”. Setting the attribute to any value other than the ones defined results in a validation error.
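A declaration with this behavior could be written as <!ATTLIST LearningAsset difficulty (easy|medium|difficult) "easy">. Python's standard library parser does not validate against a DTD, so the sketch below imitates in plain code what a validating parser would do for this one attribute: apply the default when the attribute is absent and reject values outside the set. The function name difficulty_of is hypothetical.

```python
import xml.etree.ElementTree as ET

# Mirrors: <!ATTLIST LearningAsset difficulty (easy|medium|difficult) "easy">
ALLOWED = {"easy", "medium", "difficult"}
DEFAULT = "easy"

def difficulty_of(xml_text):
    """Return the difficulty of a LearningAsset, enforcing the value set."""
    asset = ET.fromstring(xml_text)
    value = asset.get("difficulty", DEFAULT)   # default applies when absent
    if value not in ALLOWED:
        raise ValueError("difficulty must be one of " + ", ".join(sorted(ALLOWED)))
    return value

print(difficulty_of('<LearningAsset difficulty="medium"/>'))
print(difficulty_of('<LearningAsset/>'))
```

Passing a value such as "hard" raises a ValueError, which corresponds to the validation error described above.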
An XML document must generally be valid in order to be easily parsable by a program, i.e., conform to rules (“grammar”) prescribed by the DTD. An XML document can be well-formed without being valid, i.e., each element has a properly nested start and end tag, but the document either has no DTD or does not conform to it. In practice, many XML files do not have a DTD, which can make parsing (reading) them much more difficult.
Approaches to Parsing
There are two general approaches to “parsing” an XML document: DOM and SAX parsing. In DOM (Document Object Model) parsing, the entire XML document is converted to an in-memory tree data structure and thus resides fully in memory where it can be traversed node-by-node or via XPath. In SAX (Simple API for XML) parsing, only one element is parsed and loaded into memory at a time and a callback function registered by the processing program is called for each element as it is encountered. SAX parsing is preferable for very large XML documents, but it is more difficult to work with as context is lost and XPath cannot be used.
With SAX parsing, events are triggered when the XML is being parsed. When the parser encounters a start tag (e.g., <something>), it triggers an event and calls a processing function that is registered for that tag, so different tags may have different event handling functions. Similarly, when the end of the tag is met while parsing (</something>), it triggers another event and calls a registered callback function. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event, but you do not have “context” unless you track it.
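The sketch below shows this event-driven style with Python's xml.sax module. The handler class NameCollector and the sample data (including the second publisher) are assumptions made for illustration. In this API one handler object receives all start and end events, so the handler keeps its own stack of open tags to recover the context the parser does not provide.

```python
import xml.sax

class NameCollector(xml.sax.ContentHandler):
    """Collects the text of every publishers/publisher/name element."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open tags: our own "context"
        self.buffer = []    # text pieces of the element being read
        self.names = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []

    def characters(self, content):
        # May be called several times for one text node.
        self.buffer.append(content)

    def endElement(self, name):
        if self.path == ["publishers", "publisher", "name"]:
            self.names.append("".join(self.buffer).strip())
        self.path.pop()
        self.buffer = []

DOC = b"""<publishers>
  <publisher pID="1001"><name>Murray</name></publisher>
  <publisher pID="1002"><name>Example Press</name></publisher>
</publishers>"""

handler = NameCollector()
xml.sax.parseString(DOC, handler)   # events fire while the input is read
print(handler.names)
```

Only the current path and a small buffer are held in memory, which is what makes this approach suitable for very large inputs.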
On the other hand, in DOM parsing, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML. In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it and keeping it in memory.
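As a minimal DOM sketch, the Python snippet below uses xml.dom.minidom to load a document fully into memory, navigate it in both directions, and modify the tree. The sample data and the inserted second publisher are hypothetical.

```python
from xml.dom import minidom

DOC = """<publishers>
  <publisher pID="1001"><name>Murray</name></publisher>
</publishers>"""

dom = minidom.parseString(DOC)      # the whole document becomes a tree in memory
root = dom.documentElement

# Navigate downward and upward from any node.
first = root.getElementsByTagName("publisher")[0]
name = first.getElementsByTagName("name")[0].firstChild.data
parent_tag = first.parentNode.tagName

# The tree is writable: insert a new publisher element.
new_pub = dom.createElement("publisher")
new_pub.setAttribute("pID", "1002")
new_name = dom.createElement("name")
new_name.appendChild(dom.createTextNode("Example Press"))
new_pub.appendChild(new_name)
root.appendChild(new_pub)

ids = [p.getAttribute("pID") for p in root.getElementsByTagName("publisher")]
print(name, parent_tag, ids)
```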
DOM
tree model parser converting XML to an internal tree
loads the document fully into memory and then parses the document
memory constrained as it loads the entire XML before parsing
entire object model is stored as a tree in memory, thus memory intensive
object model can be read and written, i.e., can insert or delete elements
preferable if the XML content fits into memory
can be queried via XPath
easy to navigate in any direction
slow to process
SAX
event based parser using callback functions
parses the document as it is read, i.e., parses one element at a time
no memory constraints as it does not store the XML in memory
read only, i.e., can’t insert or delete elements
preferable when XML is large or memory is limited
reads the XML document from beginning to end in one direction
fast processing
Inspecting XML Files
XML files are text files and must be viewed (and edited) using a text editor and not a word processor. So, use TextEdit on macOS, or Notepad on Windows. Don’t use Word or Pages. A programming environment (IDE) such as Visual Studio, Eclipse, or R Studio can also be used.
For example, to inspect the pubmed_sample.xml file, you can use R Studio – it handles XML well. Download the pubmed_sample.xml first, then launch R Studio, and then click File/Open File… and choose pubmed_sample.xml from the folder into which you downloaded the file (likely the Downloads folder). Voila…
The short demonstration below shows how to open an XML file in R Studio for inspection.
In many cases you can open up a file by just double-clicking on it. That launches the app that’s associated with the extension (.xml, .dtd, etc). You can change the association in your preference settings. But often you want to open a file with a different app or there’s no association. You will need to open the app compatible with the file type first, then open the file from within the app.
A note on downloading XML files from a link: this doesn’t generally work, as, by default, XML files are opened in your browser which will attempt to render the XML. So, to download and save an XML file, open the context menu by (generally) right-clicking on the link and selecting “Save As” or a similar choice depending on your operating system.
Summary
XML is an important mechanism for transferring data between parties and information systems.
Tutorial I: Overview of XML and DTD
In this narrated chalk-talk, Khoury Boston’s Prof. Schedlbauer provides an overview of XML, common uses, and the role of DTD as a defining grammar of an XML schema.
Tutorial II: Essential XML with Examples
In these two video tutorials, Khoury Boston’s Prof. Schedlbauer explains first how to build an XML document with an associated DTD followed by a code walk-through in https://www.xmlvalidation.com.