Introduction
This lesson investigates special performance considerations when extracting relational data from large XML files and loading them into a database.
Structure of XML
<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <PubmedArticle PMID="1">
    <Article>
      <Journal>
        <ISSN IssnType="Print">0006-2944</ISSN>
        <JournalIssue CitedMedium="Print">
          <Volume>13</Volume>
          <Issue>2</Issue>
          <PubDate>
            <Year>1975</Year>
            <Month>Jun</Month>
          </PubDate>
        </JournalIssue>
        <Title>Biochemical medicine</Title>
        <ISOAbbreviation>Biochem Med</ISOAbbreviation>
      </Journal>
      <Language>eng</Language>
      <ArticleTitle>Formate assay in body fluids</ArticleTitle>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y">
          <LastName>Makar</LastName>
          <ForeName>A B</ForeName>
          <Initials>AB</Initials>
        </Author>
        <Author ValidYN="Y">
          <LastName>McMartin</LastName>
          <ForeName>K E</ForeName>
          <Initials>KE</Initials>
        </Author>
        ...
      </AuthorList>
    </Article>
  </PubmedArticle>
  ...
</PubmedArticleSet>
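To make the structure above concrete, the following sketch parses a cut-down copy of the sample record from a string and reads one attribute and a few element values from it. The variable names (`sampleXML`, `doc`, `article`) are illustrative assumptions, and the inline XML is a shortened stand-in for the real file, not the lesson's data set.

```r
library(XML)

# Shortened stand-in for the sample record above (illustrative only)
sampleXML <- '<PubmedArticleSet>
  <PubmedArticle PMID="1">
    <Article>
      <Journal>
        <ISSN IssnType="Print">0006-2944</ISSN>
        <Title>Biochemical medicine</Title>
      </Journal>
      <ArticleTitle>Formate assay in body fluids</ArticleTitle>
      <AuthorList CompleteYN="Y">
        <Author ValidYN="Y"><LastName>Makar</LastName></Author>
        <Author ValidYN="Y"><LastName>McMartin</LastName></Author>
      </AuthorList>
    </Article>
  </PubmedArticle>
</PubmedArticleSet>'

# Parse from a string rather than from a file or URL
doc <- xmlParse(sampleXML, asText = TRUE)

# The root element wraps all <PubmedArticle> elements
rootName <- xmlName(xmlRoot(doc))

# First article node, selected with an absolute XPath
article <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")[[1]]

# PMID is an attribute of <PubmedArticle>; IssnType is an attribute of <ISSN>
pmid <- xmlGetAttr(article, "PMID")
issnType <- xpathSApply(article, "./Article/Journal/ISSN", xmlGetAttr, "IssnType")

# Authors repeat, so this returns one value per <Author>
lastNames <- xpathSApply(article, "./Article/AuthorList/Author/LastName", xmlValue)
```

Attributes such as `PMID` are read with `xmlGetAttr()`, while element text is read with `xmlValue()`; repeating elements such as `Author` come back as a vector.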
Load XML
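library(XML)
# Remote location of the PubMed XML file
xmlURL <- "http://artificium.us/lessons/06.r/l-6-183-extractxml-data-in-r/pubmed-xml-tfm/pubmed22n0001-tf.xml"
# Time parsing directly from the URL (elapsed time includes the network transfer)
system.time(
  xmlDOM <- xmlParse(xmlURL, validate = FALSE)
)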
## user system elapsed
## 0.454 0.191 52.589
# Time parsing an XML file from local disk
xmlFile <- "pubmed-v3.xml"
system.time(
  xmlDOM <- xmlParse(xmlFile, validate = FALSE)
)
## user system elapsed
## 0.432 0.083 0.524
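In the two timings above, user time is nearly the same (0.454 versus 0.432 seconds) while elapsed time differs by two orders of magnitude (52.589 versus 0.524 seconds). This suggests, assuming the two files are of comparable size, that most of the cost of the URL parse is the network transfer. One way to pay that cost only once is to download the file on first use and parse the local copy afterwards. The helper name `parseCached` and the local file name in the usage comment are assumptions made for this sketch.

```r
library(XML)

# Hypothetical helper: download the XML only if no local copy exists,
# then always parse from local disk.
parseCached <- function(url, localFile) {
  if (!file.exists(localFile)) {
    # mode = "wb" avoids line-ending translation on Windows
    download.file(url, destfile = localFile, mode = "wb", quiet = TRUE)
  }
  xmlParse(localFile, validate = FALSE)
}

# Usage (requires network access on the first call only):
# xmlDOM <- parseCached(xmlURL, "pubmed22n0001-tf.xml")
```

The second and later calls skip the download entirely, so their cost should be close to that of the local-file parse.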
# The root node is <PubmedArticleSet>; its children are the articles
root <- xmlRoot(xmlDOM)
numArticles <- xmlSize(root)
cat("Articles to process:", numArticles, "\n")
## Articles to process: 30000
# Time a loop that only fetches each article node by position
system.time(
  for (a in seq_len(numArticles)) {
    anArticle <- root[[a]]
  }
)
## user system elapsed
## 9.780 0.026 9.819
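The loop above spends almost 10 seconds just fetching each of the 30000 nodes by position, before any data is extracted. One alternative worth timing, although it was not measured in this lesson, is to retrieve all article nodes once with a single XPath query and then iterate over the resulting list. The sketch below uses a three-article inline document as a stand-in for the real file; the names `doc`, `articles`, and `pmids` are assumptions.

```r
library(XML)

# Tiny stand-in document (illustrative only)
doc <- xmlParse("<PubmedArticleSet>
  <PubmedArticle PMID='1'/>
  <PubmedArticle PMID='2'/>
  <PubmedArticle PMID='3'/>
</PubmedArticleSet>", asText = TRUE)

# One XPath query returns every <PubmedArticle> node as a list
articles <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")

# Pre-allocate the result instead of growing it inside the loop
pmids <- character(length(articles))
for (a in seq_along(articles)) {
  anArticle <- articles[[a]]
  pmids[a] <- xmlGetAttr(anArticle, "PMID")
}
```

Wrapping both variants in `system.time()` on the real file would show whether this changes the cost for this data set.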
Get an article’s title
# Visit each article node (no title lookup is included in this timing)
system.time(
  for (a in seq_len(numArticles)) {
    anArticle <- root[[a]]
  }
)
## user system elapsed
## 10.352 0.032 10.472
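A title lookup could be sketched as follows, assuming the element layout shown under Structure of XML. The first query reads the title of a single article node with a relative XPath; the second collects all titles in one call without an explicit loop. The second article in the inline sample is hypothetical and exists only to show that the query returns one value per article.

```r
library(XML)

# Stand-in document; the second record is hypothetical
doc <- xmlParse("<PubmedArticleSet>
  <PubmedArticle PMID='1'>
    <Article><ArticleTitle>Formate assay in body fluids</ArticleTitle></Article>
  </PubmedArticle>
  <PubmedArticle PMID='2'>
    <Article><ArticleTitle>Hypothetical second title</ArticleTitle></Article>
  </PubmedArticle>
</PubmedArticleSet>", asText = TRUE)

# Title of one article: relative XPath evaluated from that article's node
firstArticle <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle[1]")[[1]]
oneTitle <- xpathSApply(firstArticle, "./Article/ArticleTitle", xmlValue)

# Titles of all articles: a single absolute XPath over the whole document
allTitles <- xpathSApply(doc,
                         "/PubmedArticleSet/PubmedArticle/Article/ArticleTitle",
                         xmlValue)
```

The relative form fits inside a per-article loop, while the whole-document form avoids the loop when only one field is needed.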
Summary
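This lesson compared the time needed to parse a large XML document from a URL with the time needed to parse a local file, and timed a loop over all of its article nodes. These measurements inform the programming practices used when extracting data from a large XML document and transforming it to a relational schema.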
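As a small illustration of the transformation to a relational schema, the sketch below builds two data frames from a stand-in document: one row per article, and one row per author keyed by the article's PMID. The table and column names (`articleDF`, `authorDF`, `pmid`, `title`, `last_name`) are assumptions, not a schema defined in this lesson, and the second record is hypothetical.

```r
library(XML)

# Stand-in document; the second record is hypothetical
doc <- xmlParse("<PubmedArticleSet>
  <PubmedArticle PMID='1'>
    <Article>
      <ArticleTitle>Formate assay in body fluids</ArticleTitle>
      <AuthorList>
        <Author><LastName>Makar</LastName></Author>
        <Author><LastName>McMartin</LastName></Author>
      </AuthorList>
    </Article>
  </PubmedArticle>
  <PubmedArticle PMID='2'>
    <Article>
      <ArticleTitle>Hypothetical second title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Doe</LastName></Author>
      </AuthorList>
    </Article>
  </PubmedArticle>
</PubmedArticleSet>", asText = TRUE)

articles <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")

# Parent table: one row per article
articleDF <- data.frame(
  pmid  = vapply(articles, function(n) xmlGetAttr(n, "PMID"), character(1)),
  title = vapply(articles,
                 function(n) xpathSApply(n, "./Article/ArticleTitle", xmlValue),
                 character(1)),
  stringsAsFactors = FALSE
)

# Child table: one row per author, with pmid as the foreign key
authorRows <- lapply(articles, function(n) {
  lastNames <- as.character(unlist(
    xpathSApply(n, "./Article/AuthorList/Author/LastName", xmlValue)))
  data.frame(pmid = rep(xmlGetAttr(n, "PMID"), length(lastNames)),
             last_name = lastNames,
             stringsAsFactors = FALSE)
})
authorDF <- do.call(rbind, authorRows)
```

Building one small data frame per article and binding them once at the end avoids growing a data frame row by row inside the loop.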