After describing a domain ontology using a conceptual data model, commonly expressed as a UML Class Diagram, that model must be mapped to an actual representation. This lesson shows patterns and practices for mapping a UML Class Diagram to an XML representation.
The presented patterns provide a framework for expressing ontologies in a DTD for the purpose of defining an XML data store. They are best practices, but much of the integrity and constraint enforcement is possible through neither the DTD nor XML itself and must thus be relegated to application logic.
The use of attributes versus elements is suggestive rather than prescriptive. Other representations are possible and generally equally valid. The representational patterns shown in this lesson are based on experience and their demonstrated usefulness rather than being based on standards.
Prerequisites
This lesson presumes that you are familiar with the structure of XML and how to construct a DTD and that you can read a UML Class Diagram. The lessons below provide the necessary background:
Entities might have attributes that are optional and may not be valid for some instances, although one might argue that this is a modeling mistake and that the optionality is better represented with a subclass.
Class representing an entity
The DTD to represent an entity would look as follows for the above class. Note the use of ? to indicate that org may occur zero or one times.
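As a minimal sketch, assuming an Author entity with a name, an optional org, and an email (the same element names that appear in the Author DTD later in this lesson), such a DTD might read:

```xml
<!-- Sketch only: element names are taken from the Author DTD used later in this lesson. -->
<!DOCTYPE root [
<!ELEMENT root (Author*)>
<!-- org? permits zero or one org element; name and email are mandatory -->
<!ELEMENT Author (name, org?, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
```

Replacing org? with org* would instead allow any number of org elements, which is why the occurrence indicator needs to match the multiplicity in the class diagram.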
XML instances based on the above DTD would look as follows; the second does not have a value for org. Note that at most one value for org is allowed.
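Two such instances might look like the sketch below. The names and email addresses are borrowed from later examples in this lesson, and the org value is purely illustrative.

```xml
<root>
  <!-- First instance: all three child elements present ("Learning Guild" is an assumed value) -->
  <Author>
    <name>Cohen, W.</name>
    <org>Learning Guild</org>
    <email>cohen.w@learningguild.com</email>
  </Author>
  <!-- Second instance: org omitted, which an optional (?) element permits -->
  <Author>
    <name>Patel, P.</name>
    <email>patelp7@ufsp.edu.br</email>
  </Author>
</root>
```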
Every entity should have a key attribute that uniquely identifies each instance. Key attributes should be represented as element attributes rather than child elements.
The key attribute is represented as an XML attribute on an element. The terminology can be confusing as an XML attribute is not the same as an entity attribute in UML or an ontology. In this example, the entity attribute authorID is represented as an XML element attribute, while the entity attribute name is represented as a child node of Author.
The keyword #REQUIRED ensures that a value is provided for the attribute authorID, although, since its type is CDATA, there is no assurance that the value is unique.
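The following sketch illustrates the point, assuming authorID is declared as CDATA #REQUIRED on Author. The duplicated key value is deliberate and the data values are illustrative.

```xml
<!DOCTYPE root [
<!ELEMENT root (Author*)>
<!ELEMENT Author (name, org?, email)>
<!-- CDATA accepts any string; #REQUIRED only guarantees that a value is present -->
<!ATTLIST Author authorID CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
<root>
  <Author authorID="1001">
    <name>Cohen, W.</name>
    <email>cohen.w@learningguild.com</email>
  </Author>
  <!-- Same authorID as above: still valid, because CDATA values are not checked for uniqueness -->
  <Author authorID="1001">
    <name>Patel, P.</name>
    <email>patelp7@ufsp.edu.br</email>
  </Author>
</root>
```

With this declaration, catching the duplicate would be left to application logic.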
Key attributes should be mandatory (required). In XML, element attributes can have the following constraints:
#REQUIRED – a value for the attribute must be provided
#FIXED – a constant value that cannot be changed, so every instance of the element carries that value; the value must be specified after this keyword
#IMPLIED – an optional attribute that may or may not be present
Note that these keywords must be in upper case and only one can be specified.
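The three constraints can be seen side by side in a short sketch. The attributes nickname and schemaVersion are hypothetical and exist only to illustrate #IMPLIED and #FIXED.

```xml
<!DOCTYPE root [
<!ELEMENT root (Author*)>
<!ELEMENT Author (name)>
<!ELEMENT name (#PCDATA)>
<!-- #REQUIRED: every Author must carry an authorID -->
<!ATTLIST Author authorID CDATA #REQUIRED>
<!-- #IMPLIED: nickname (hypothetical attribute) may be omitted -->
<!ATTLIST Author nickname CDATA #IMPLIED>
<!-- #FIXED: schemaVersion (hypothetical attribute) always has the value 1.0 -->
<!ATTLIST Author schemaVersion CDATA #FIXED "1.0">
]>
<root>
  <!-- nickname supplied; schemaVersion omitted, so the fixed value applies -->
  <Author authorID="a1" nickname="wc">
    <name>Cohen, W.</name>
  </Author>
  <!-- nickname omitted; schemaVersion may be written out, but only with the fixed value -->
  <Author authorID="a3" schemaVersion="1.0">
    <name>Patel, P.</name>
  </Author>
</root>
```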
The value of an attribute can be, among other types, either CDATA (any character string) or ID (a unique value that starts with a letter). Key attributes should be unique, so it is often best to use ID rather than CDATA. However, unlike in databases and UML, an ID value in XML must be unique within the entire document or data store and not simply unique for each entity type. So, while it may often be desirable for the parser to ensure that key attributes are unique, this may not be feasible if data is exported from a database where the same key value can occur in more than one table.
For example, the table Author can have a primary key value of 1001, while the table Lesson might also have a row with a primary key value of 1001. This is not an issue in the database, as primary key values only need to be unique within each table, but it is definitely an issue for XML, aside from the issue that an ID value must start with a letter. So, to export, we would either not use XML element attributes at all, use CDATA instead of ID, or use ID but prefix each primary key value with the table name, e.g., 1001 for an author would become the ID attribute value “author1001”.
So, to summarize, the DTD can be used to enforce uniqueness when the key attribute is represented as an XML element attribute of type ID. ID, IDREF, and ENTITY are among the tokenized attribute types, which restrict the values an attribute may take; ID additionally enforces uniqueness.
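The prefixing approach might look like the sketch below, assuming an Author row and a Lesson row that both carry the primary key 1001 in the database. The reduced content models are a simplification for illustration.

```xml
<!DOCTYPE root [
<!ELEMENT root (Author*, Lesson*)>
<!ELEMENT Author (name)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ELEMENT Lesson (title)>
<!ATTLIST Lesson lessonID ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT title (#PCDATA)>
]>
<root>
  <!-- Database key 1001, prefixed with the table name so that it starts with a letter -->
  <Author authorID="author1001">
    <name>Cohen, W.</name>
  </Author>
  <!-- Also key 1001 in the database; the prefix keeps it unique across the whole document -->
  <Lesson lessonID="lesson1001">
    <title>File System Architecture</title>
  </Lesson>
</root>
```

An importing application would strip the prefix again to recover the original key value.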
The value of an ID attribute type must start with a letter.
<!DOCTYPE root [
<!ELEMENT root (Author*)>
<!ELEMENT Author (name,org?,email)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
Some elements may be empty, in which case there is only a beginning and an end tag but no content. These are often used to represent Boolean (true/false) values. Empty elements can be written using the shortcut <tag/>.
Boolean Flags
<!DOCTYPE root [
<!ELEMENT root (LearningAsset*)>
<!ELEMENT LearningAsset (title,difficulty,isDeployed?)>
<!ATTLIST LearningAsset assetID ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT difficulty (#PCDATA)>
<!ELEMENT isDeployed EMPTY>
]>
Note the use of a self-closing tag to represent a Boolean attribute. The mere presence of isDeployed in the XML indicates TRUE, while its absence indicates FALSE.
<root>
  <LearningAsset assetID="la101">
    <title>Introduction to XML</title>
    <difficulty>easy</difficulty>
  </LearningAsset>
  <LearningAsset assetID="la102">
    <title>Unions in C</title>
    <difficulty>intermediate</difficulty>
    <isDeployed />
  </LearningAsset>
  …
</root>
Learning Checkpoint I
Create a DTD for the XML below assuming the following “rules”:
There is at least one section per course but no limit
time and location may appear in any order and location is optional
courseNumber and crn are required key attributes that must be unique
each course may or may not run and whether it’s running should be signaled with a Boolean attribute
Pattern 8: One-to-One Embedded
This pattern is one of two patterns for representing a one-to-one relationship between two entities. The pattern below embeds the dependent entity within the independent entity and is preferred when one entity is “part-of” another (a “partonomy” in ontology speak).
Naturally, sharing of elements is not possible, so this pattern is most appropriate for composition relationships and “tight” associations.
In this representation, an XML element is embedded within another.
<root>
  <Lesson>
    <timeToComplete>2.5</timeToComplete>
    <title>File System Architecture</title>
    <Memo>
      <contents>contents of memo</contents>
      <attachment>http://foo.bar.com/slides.pptx</attachment>
      <attachment>http://foo.bar.com/demo.c</attachment>
    </Memo>
  </Lesson>
  <!-- ... -->
</root>
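A DTD consistent with the instance above might look like the following sketch. The cardinalities (exactly one Memo per Lesson, any number of attachments) are assumptions read off the example rather than given by the class diagram.

```xml
<!DOCTYPE root [
<!ELEMENT root (Lesson*)>
<!-- Memo is embedded in Lesson; listing it once without ? makes the relationship exactly one-to-one -->
<!ELEMENT Lesson (timeToComplete, title, Memo)>
<!ELEMENT timeToComplete (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!-- attachment* is an assumption: zero or more attachments per memo -->
<!ELEMENT Memo (contents, attachment*)>
<!ELEMENT contents (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
]>
```

Because Memo only ever appears inside Lesson, neither element needs a key attribute to express the relationship.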
Pattern 9: One-to-One Linked
This pattern also implements a one-to-one relationship, such as an association, an aggregation, or a composition, but does so via linking. This is more appropriate when the linked objects are shared or both are independent entities.
Note the use of the IDREF mechanism to essentially implement a foreign key. A validating parser checks the values used in IDREF attributes to ensure they point to a valid ID in the document at the time the XML document is read (parsed). Recall that ID values must be unique in the document rather than for a type of entity.
An IDREF attribute must be declared with the constraint #REQUIRED, or with #IMPLIED if the attribute is not mandatory, which would imply that the multiplicity is 0..1 rather than 1.
<root>
  <Lesson lessonID="lesson334">
    <timeToComplete>2.5</timeToComplete>
    <title>File System Architecture</title>
  </Lesson>
  <Lesson lessonID="lesson138">
    <timeToComplete>4.5</timeToComplete>
    <title>Ontology Design and Representation</title>
  </Lesson>
  <!-- ... -->
  <Memo lessonIDFK="lesson334">
    <contents>
      Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    </contents>
    <attachment>http://foo.bar.com/slides293.ppts</attachment>
    <attachment>http://foo.bar.com/allfiles293.zip</attachment>
  </Memo>
  <Memo lessonIDFK="lesson138">
    <contents>
      Nam vitae ligula vehicula, imperdiet ex nec, condimentum turpis.
    </contents>
  </Memo>
</root>
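A DTD matching this instance might be sketched as follows. The content models are assumptions inferred from the example.

```xml
<!DOCTYPE root [
<!ELEMENT root (Lesson*, Memo*)>
<!ELEMENT Lesson (timeToComplete, title)>
<!-- lessonID acts as the primary key -->
<!ATTLIST Lesson lessonID ID #REQUIRED>
<!ELEMENT timeToComplete (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT Memo (contents, attachment*)>
<!-- lessonIDFK acts as the foreign key; #REQUIRED means every Memo must reference something -->
<!ATTLIST Memo lessonIDFK IDREF #REQUIRED>
<!ELEMENT contents (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
]>
```

Note that the DTD can neither prevent two Memo elements from referencing the same Lesson nor require that lessonIDFK point at a Lesson rather than some other element with an ID, so the one-to-one multiplicity still has to be enforced by application logic.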
Pattern 10: One-to-Many Back-Linked
This pattern is one of two used to represent a one-to-many relationship for an association, aggregation, or composition.
<!DOCTYPE root [
<!ELEMENT root (LearningAsset*, Author*)>
<!ELEMENT LearningAsset (title,difficulty,isDeployed)>
<!ATTLIST LearningAsset assetID ID #REQUIRED>
<!ATTLIST LearningAsset authorIDFK IDREF #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT difficulty (#PCDATA)>
<!ELEMENT isDeployed (#PCDATA)>
<!ELEMENT Author (name,org?,email)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
<root>
  <LearningAsset assetID="i100" authorIDFK="a1">
    <title>Ontology Design</title>
    <difficulty>medium</difficulty>
    <isDeployed>true</isDeployed>
  </LearningAsset>
  <LearningAsset assetID="i200" authorIDFK="a1">
    <title>Relational Calculus</title>
    <difficulty>advanced</difficulty>
    <isDeployed>false</isDeployed>
  </LearningAsset>
  <!-- more LearningAsset elements -->
  <Author authorID="a1">
    <name>Cohen, W.</name>
    <email>cohen.w@learningguild.com</email>
  </Author>
  <Author authorID="a3">
    <name>Patel, P.</name>
    <email>patelp7@ufsp.edu.br</email>
  </Author>
  <!-- more Author elements -->
</root>
Pattern 11: One-to-Many Forward-Linked
This is the second of the two patterns for representing one-to-many relationships in XML. Pattern 10 is similar to how a relational database would represent a one-to-many relationship: as a foreign key. This pattern uses an approach closer to what might be used in Java or other object-oriented programming languages that support lists. Here, one entity has a list of references to the other. So, each author has a list of IDs for the learning assets they authored, whereas in Pattern 10 each learning asset held the ID of its author.
<!DOCTYPE root [
<!ELEMENT root (LearningAsset*, Author*)>
<!ELEMENT LearningAsset (title,difficulty,isDeployed)>
<!ATTLIST LearningAsset assetID ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT difficulty (#PCDATA)>
<!ELEMENT isDeployed (#PCDATA)>
<!ELEMENT Author (name,org?,email)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ATTLIST Author assetIDs IDREFS #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
<root>
  <LearningAsset assetID="i100">
    <title>Ontology Design</title>
    <difficulty>medium</difficulty>
    <isDeployed>true</isDeployed>
  </LearningAsset>
  <LearningAsset assetID="i200">
    <title>Relational Calculus</title>
    <difficulty>advanced</difficulty>
    <isDeployed>false</isDeployed>
  </LearningAsset>
  <!-- more LearningAsset elements -->
  <Author authorID="a1" assetIDs="i100 i200">
    <name>Cohen, W.</name>
    <email>cohen.w@learningguild.com</email>
  </Author>
  <Author authorID="a3">
    <name>Patel, P.</name>
    <email>patelp7@ufsp.edu.br</email>
  </Author>
  <!-- more Author elements -->
</root>
Note that the IDREFS attribute is #IMPLIED so that there can be authors that did not author any learning assets, i.e., they do not have a reference to a learning asset.
Pattern 12: Many-to-Many Junction
This pattern implements many-to-many relationships using a junction entity similar to a junction table used in the relational model.
Note the use of a new “junction element” (Content in the example below) that maps one LearningUnit to one Lesson, similar to a junction table in a relational database. It carries two key attributes, one referencing each entity.
<root>
  <LearningUnit unitID="i100">
    <timeToCover>3.5</timeToCover>
    <title>File System Access</title>
    <overview>Lorem ipsum orem tacitum</overview>
  </LearningUnit>
  <LearningUnit unitID="i200">
    <timeToCover>2</timeToCover>
    <title>Dynamic Memory Allocation</title>
    <overview>Lorem ipsum orem tacitum</overview>
  </LearningUnit>
  <!-- more LearningUnit elements -->
  <Lesson lessonID="i1">
    <timeToComplete>30</timeToComplete>
    <title>Directories and Files</title>
  </Lesson>
  <Lesson lessonID="i2">
    <timeToComplete>20</timeToComplete>
    <title>Allocating Dynamic Buffers</title>
  </Lesson>
  <!-- more Lesson elements -->
  <Content lessonID="i1" unitID="i200" />
  <Content lessonID="i2" unitID="i200" />
  <Content lessonID="i2" unitID="i100" />
  <!-- ... -->
</root>
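One way to declare this structure is sketched below, assuming the two attributes on Content are IDREF values that point at the ID attributes of the same name on Lesson and LearningUnit.

```xml
<!DOCTYPE root [
<!ELEMENT root (LearningUnit*, Lesson*, Content*)>
<!ELEMENT LearningUnit (timeToCover, title, overview)>
<!ATTLIST LearningUnit unitID ID #REQUIRED>
<!ELEMENT Lesson (timeToComplete, title)>
<!ATTLIST Lesson lessonID ID #REQUIRED>
<!-- Junction element: no content of its own, only the two references -->
<!ELEMENT Content EMPTY>
<!ATTLIST Content lessonID IDREF #REQUIRED>
<!ATTLIST Content unitID IDREF #REQUIRED>
<!ELEMENT timeToCover (#PCDATA)>
<!ELEMENT timeToComplete (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT overview (#PCDATA)>
]>
```

Attribute names are scoped to their element, so lessonID can be an ID on Lesson and an IDREF on Content. The DTD cannot stop the same pair of references from appearing twice, so duplicate junction entries remain a matter for application logic.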
Pattern 13: Taxonomy / Repeated Attributes
This pattern represents taxonomy/generalization (inheritance) relationships. This is the first of two patterns: it assumes that the superclass is abstract and only represents the subclasses by duplicating common attributes.
LearningAsset is an abstract class which means that there are no instances of that class and only instances of the subclasses. So, there is no LearningAsset element needed in the XML. The relationship to Author is implemented in each subclass element.
<root>
  <Video assetID="v100" authorIDFK="a387">
    <title>Loading XML into R</title>
    <difficulty>medium</difficulty>
    <isDeployed>true</isDeployed>
    <url>youtu.be/Hg44Xjh3</url>
    <runTime>16:23</runTime>
    <type>Tutorial</type>
    <platform>YouTube</platform>
  </Video>
  <!-- more Video elements -->
  <SlideDeck assetID="s100" authorIDFK="a387">
    <title>Information Discovery</title>
    <difficulty>easy</difficulty>
    <isDeployed>true</isDeployed>
    <url>drv.co/Hddj87za</url>
    <fileType>pptx</fileType>
  </SlideDeck>
  <!-- more SlideDeck elements -->
  <Author authorID="a387">
    <name>Kerner, I.</name>
  </Author>
  <!-- more Author elements -->
</root>
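A DTD for this instance might look like the sketch below. The content models are inferred from the example, and email is made optional here as an assumption so that the Author element shown, which has only a name, would validate.

```xml
<!DOCTYPE root [
<!ELEMENT root (Video*, SlideDeck*, Author*)>
<!-- title, difficulty, and isDeployed are the inherited attributes, repeated in each subclass -->
<!ELEMENT Video (title, difficulty, isDeployed, url, runTime, type, platform)>
<!ATTLIST Video assetID ID #REQUIRED>
<!ATTLIST Video authorIDFK IDREF #REQUIRED>
<!ELEMENT SlideDeck (title, difficulty, isDeployed, url, fileType)>
<!ATTLIST SlideDeck assetID ID #REQUIRED>
<!ATTLIST SlideDeck authorIDFK IDREF #REQUIRED>
<!-- Assumption: email is optional in this sketch -->
<!ELEMENT Author (name, org?, email?)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT difficulty (#PCDATA)>
<!ELEMENT isDeployed (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT runTime (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT platform (#PCDATA)>
<!ELEMENT fileType (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
```

The shared child elements are declared once and reused in both content models; only the attribute lists have to be repeated per subclass.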
Pattern 14: Taxonomy / Connected Classes
The second approach to representing generalization hierarchies (also known as taxonomies or inheritance hierarchies) is to create one instance for the superclass and one for the subclass and then use a foreign key link to connect them. This is similar to the approach used for relational databases and works particularly well if the superclass is concrete rather than abstract.
Note that the superclass and subclass elements have the same ID (primary key) so they can be connected, but in the subclass it is an IDREF that connects the subclass to the superclass.
<root>
  <LearningAsset assetID="la100" authorIDFK="a387">
    <title>Loading XML into R</title>
    <difficulty>medium</difficulty>
    <isDeployed>true</isDeployed>
  </LearningAsset>
  <Video assetID="la100">
    <url>youtu.be/Hg44Xjh3</url>
    <runTime>16:23</runTime>
    <type>Tutorial</type>
    <platform>YouTube</platform>
  </Video>
  <!-- more Video elements -->
  <LearningAsset assetID="la102" authorIDFK="a387">
    <title>Information Discovery</title>
    <difficulty>easy</difficulty>
    <isDeployed>true</isDeployed>
  </LearningAsset>
  <SlideDeck assetID="la102">
    <url>drv.co/Hddj87za</url>
    <fileType>pptx</fileType>
  </SlideDeck>
  <!-- more SlideDeck elements -->
  <Author authorID="a387">
    <name>Kerner, I.</name>
  </Author>
  <!-- more Author elements -->
</root>
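The connected-classes structure could be declared as in the following sketch. The content models are inferred from the instance, the root model allows superclass and subclass elements to interleave as they do in the example, and email is made optional as an assumption.

```xml
<!DOCTYPE root [
<!ELEMENT root ((LearningAsset | Video | SlideDeck)*, Author*)>
<!-- Superclass element: holds the common attributes and the link to Author -->
<!ELEMENT LearningAsset (title, difficulty, isDeployed)>
<!ATTLIST LearningAsset assetID ID #REQUIRED>
<!ATTLIST LearningAsset authorIDFK IDREF #REQUIRED>
<!-- Subclass elements: assetID is an IDREF pointing back at the superclass element -->
<!ELEMENT Video (url, runTime, type, platform)>
<!ATTLIST Video assetID IDREF #REQUIRED>
<!ELEMENT SlideDeck (url, fileType)>
<!ATTLIST SlideDeck assetID IDREF #REQUIRED>
<!-- Assumption: email is optional in this sketch -->
<!ELEMENT Author (name, org?, email?)>
<!ATTLIST Author authorID ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT difficulty (#PCDATA)>
<!ELEMENT isDeployed (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT runTime (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT platform (#PCDATA)>
<!ELEMENT fileType (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT org (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
```

Since an IDREF may point at any ID in the document, the DTD does not guarantee that a Video references a LearningAsset rather than, say, an Author; that check belongs in application logic.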
Pattern 15: Value Set with Default Value
This final pattern implements value sets or categorical attributes which draw their values from a predefined set of values. To enforce a value set (enumerated type) it must be defined as an attribute as no validation can occur for parsed character data (#PCDATA).
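This pattern declares a default value for the enumerated type; a validating parser supplies that value whenever the attribute is omitted. (A DTD would also accept #REQUIRED or #IMPLIED in place of a default.)
<root>
  <LearningAsset assetID="la101" difficulty="medium">
    <title>Introduction to XML</title>
  </LearningAsset>
  <LearningAsset assetID="la102">
    <title>Unions in C</title>
    <isDeployed />
  </LearningAsset>
  <!-- ... -->
</root>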
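A DTD for this instance might be sketched as follows. The set of allowed values collects the difficulty levels used in this lesson's examples, and the choice of "easy" as the default is an assumption.

```xml
<!DOCTYPE root [
<!ELEMENT root (LearningAsset*)>
<!ELEMENT LearningAsset (title, isDeployed?)>
<!ATTLIST LearningAsset assetID ID #REQUIRED>
<!-- Enumerated type: only the listed values are valid; "easy" (assumed default) applies when omitted -->
<!ATTLIST LearningAsset difficulty (easy | medium | intermediate | advanced) "easy">
<!ELEMENT title (#PCDATA)>
<!ELEMENT isDeployed EMPTY>
]>
```

Under this declaration the asset la102, which has no difficulty attribute, would be reported by a validating parser as having difficulty "easy", while a value such as "hard" would be rejected.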
Best Practices
The use of XML element attributes should be restricted to primary keys, foreign keys, and value sets (categorical attributes). All other entity attributes should be represented as child elements.
The use of these representational patterns is suggestive rather than prescriptive, so deviate and adjust as necessary for specific use cases.
Summary
This lesson provided numerous patterns for mapping different conceptual modeling elements to an XML representation. Many of the patterns are similar to those used in mapping conceptual data models to relational models.
Tutorial I: Mapping to XML
In this video tutorial, Khoury Boston’s Prof. Schedlbauer explains how to use a set of representational patterns to map a conceptual data model into an XML datastore and define its structure in a DTD.
Note that the video presents an older version of the slide deck that has ID values that start with a digit, which is not correct. An ID value must start with a letter. Furthermore, an IDREF attribute must be declared with a constraint such as #REQUIRED (or #IMPLIED if it is optional); the slides in the video incorrectly show an IDREF without one.
While many parsers will gloss over these errors, others will not. Consequently, the XML file would not validate because the DTD is incorrect. Use tools such as http://xmlvalidation.com to help validate your XML documents and their DTD.
Tutorial II: Primary vs Foreign Keys in UML and XML
In this narrated chalk-talk, Khoury Boston’s Prof. Schedlbauer explains the difference between primary and foreign keys in UML, XML, and in general.