Introduction

Structure of XML

<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <PubmedArticle PMID="1">
    <Article>
      <Journal>
        <ISSN IssnType="Print">0006-2944</ISSN>
        <JournalIssue CitedMedium="Print">
          <Volume>13</Volume>
          <Issue>2</Issue>
          <PubDate>
            <Year>1975</Year>
            <Month>Jun</Month>
          </PubDate>
        </JournalIssue>
        <Title>Biochemical medicine</Title>
        <ISOAbbreviation>Biochem Med</ISOAbbreviation>
      </Journal>
      <Language>eng</Language>
      <ArticleTitle>Formate assay in body fluids</ArticleTitle>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Makar</LastName>
          <ForeName>A B</ForeName>
          <Initials>AB</Initials>
        </Author>
        <Author ValidYN="Y">
          <LastName>McMartin</LastName>
          <ForeName>K E</ForeName>
          <Initials>KE</Initials>
        </Author>
        ...
      </AuthorList>
    </Article>
  </PubmedArticle>
  ...
</PubmedArticleSet>
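
The structure above can be explored interactively before any large file is processed. The following is a minimal sketch, assuming the XML package is installed; the inline document is a trimmed-down copy of the sample (one article, two authors), and the variable names are chosen for illustration only.

```r
library(XML)

# Trimmed-down copy of the sample document shown above
sampleXML <- '<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <PubmedArticle PMID="1">
    <Article>
      <Journal>
        <ISSN IssnType="Print">0006-2944</ISSN>
        <Title>Biochemical medicine</Title>
      </Journal>
      <Language>eng</Language>
      <ArticleTitle>Formate assay in body fluids</ArticleTitle>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y"><LastName>Makar</LastName></Author>
        <Author ValidYN="Y"><LastName>McMartin</LastName></Author>
      </AuthorList>
    </Article>
  </PubmedArticle>
</PubmedArticleSet>'

# asText = TRUE tells xmlParse() that the argument is XML content, not a file name
sampleDOM <- xmlParse(sampleXML, asText = TRUE)
sampleRoot <- xmlRoot(sampleDOM)

rootName <- xmlName(sampleRoot)                    # name of the root element
firstArticle <- sampleRoot[["PubmedArticle"]]      # first child with that name
pmid <- xmlGetAttr(firstArticle, "PMID")           # attribute value as a string
issn <- xmlValue(firstArticle[["Article"]][["Journal"]][["ISSN"]])

# Count the <Author> elements below this article with a relative XPath
numAuthors <- length(getNodeSet(firstArticle, "./Article/AuthorList/Author"))
```

Note that the PMID is stored as an attribute of `PubmedArticle`, while most other values are element content, so the two need different accessor functions (`xmlGetAttr()` versus `xmlValue()`).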

Load XML

library(XML)

# Parse straight from the URL; the elapsed time includes the network download
xmlURL <- "http://artificium.us/lessons/06.r/l-6-183-extractxml-data-in-r/pubmed-xml-tfm/pubmed22n0001-tf.xml"

system.time(
  xmlDOM <- xmlParse(xmlURL, validate = FALSE)
)
##    user  system elapsed 
##   0.454   0.191  52.589
# Parse a local XML file instead
xmlFile <- "pubmed-v3.xml"

system.time(
  xmlDOM <- xmlParse(xmlFile, validate = FALSE)
)
##    user  system elapsed 
##   0.432   0.083   0.524
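
In the two timings above, the CPU time is similar but the elapsed time differs by roughly a factor of 100, which suggests that parsing from the URL is dominated by the download. One way to pay that cost only once is to keep a local copy. The sketch below assumes this approach; `parseCached()` is a hypothetical helper written for illustration and is not part of the XML package.

```r
library(XML)

# Hypothetical helper: download the XML file only if no local copy exists,
# then always parse the local copy.
parseCached <- function(url, localFile) {
  if (!file.exists(localFile)) {
    # mode = "wb" avoids line-ending translation on Windows
    download.file(url, destfile = localFile, mode = "wb", quiet = TRUE)
  }
  xmlParse(localFile, validate = FALSE)
}

# Example call (the local file name is an assumption):
# xmlDOM <- parseCached(xmlURL, "pubmed-cache.xml")
```

The first call still takes as long as the download does; only later calls benefit from the local copy.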
root <- xmlRoot(xmlDOM)
numArticles <- xmlSize(root)

cat("Articles to process: ", numArticles)
## Articles to process:  30000
system.time(
  # seq_len() also behaves correctly when there are no articles
  for (a in seq_len(numArticles)) {
    anArticle <- root[[a]]
  }
)
##    user  system elapsed 
##   9.780   0.026   9.819
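
The loop above needs close to ten seconds just to visit each of the 30000 article nodes by position. A commonly used alternative is to collect the nodes once with a single XPath query and then iterate over the resulting R list. The sketch below shows that pattern on a tiny inline document; it makes no claim about the speed-up on the full file, which should be measured with `system.time()` as above.

```r
library(XML)

# Tiny stand-in document (three empty articles) used only for this sketch
smallDOM <- xmlParse('<PubmedArticleSet>
  <PubmedArticle PMID="1"/>
  <PubmedArticle PMID="2"/>
  <PubmedArticle PMID="3"/>
</PubmedArticleSet>', asText = TRUE)

# Collect all <PubmedArticle> nodes with one XPath query ...
articles <- getNodeSet(smallDOM, "/PubmedArticleSet/PubmedArticle")

# ... then walk the resulting list instead of indexing the root each time
pmids <- character(length(articles))
for (a in seq_along(articles)) {
  anArticle <- articles[[a]]
  pmids[a] <- xmlGetAttr(anArticle, "PMID")
}
```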

Get an article’s title

system.time(
  # seq_len() also behaves correctly when there are no articles
  for (a in seq_len(numArticles)) {
    anArticle <- root[[a]]
  }
)
##    user  system elapsed 
##  10.352   0.032  10.472
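
To actually read a title, each article node can be queried with an XPath expression relative to that node. The following is a sketch under stated assumptions: it uses a small inline document, the second title is made up for illustration, and the path `./Article/ArticleTitle` follows the sample structure shown earlier.

```r
library(XML)

# Small stand-in document; the second title is invented for this sketch
titleDOM <- xmlParse('<PubmedArticleSet>
  <PubmedArticle PMID="1">
    <Article><ArticleTitle>Formate assay in body fluids</ArticleTitle></Article>
  </PubmedArticle>
  <PubmedArticle PMID="2">
    <Article><ArticleTitle>A second title</ArticleTitle></Article>
  </PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

articles <- getNodeSet(titleDOM, "/PubmedArticleSet/PubmedArticle")

# Per-article extraction: evaluate a relative XPath from each article node
titles <- vapply(articles, function(anArticle) {
  found <- xpathSApply(anArticle, "./Article/ArticleTitle", xmlValue)
  # An article without a title yields NA instead of an error
  if (length(found) == 0) NA_character_ else found[[1]]
}, character(1))

# Whole-document extraction: one XPath query for all titles at once
allTitles <- xpathSApply(titleDOM, "//PubmedArticle/Article/ArticleTitle", xmlValue)
```

The per-article form keeps each title aligned with its article even when a title is missing, which matters when the values are later loaded into rows of a table. The whole-document form is shorter but silently skips articles that have no title.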

Summary

This lesson provided several specific recommendations and programming practices for extracting data from a large XML file and transforming it into a relational schema.


Files & Resources

All Files for Lesson 80.802

References

No references.

Errata

None collected yet. Let us know.
