Mining scientific articles

with XML, Python, and allofplos

Elizabeth Seiver
plos.org, @tweetotaler
PLOS and The Hacker Within
Wed November 29, 2017

Why Mine Scientific Articles?¶

Science articles represent scientific knowledge
XML is version of articles for machines, while PDF is for humans
Tool for meta-research and meta-science
Quickly identify sets of articles of interest
Identify research literature trends over time (study findings, jargon usage, citation networks)

PLOS corpus of articles¶

220,000+ scientific articles from a wide array of research fields, focusing on the medical and life sciences
Since 2003
Open Access: free to read, free to re-use
Creative Commons license (CC-BY, CC0)
Many scientific articles are behind a paywall

Tutorial plan¶

Goal: to enable research questions about science articles
Will use JupyterHub and a sample corpus of 10,000 randomly selected PLOS articles
Won't be discussing research techniques (data analysis, natural language processing)
Assumes basic Python knowledge (lists, dictionaries, loops, conditionals, datetime)

Tutorial structure¶

How to parse XML using allofplos and lxml
Basic structure of XML documents and JATS standard
Example projects with the PLOS test corpus
Hacking session: parse articles or contribute to the allofplos codebase

Exercises based on tutorial: https://github.com/eseiver/xml_tutorial

allofplos¶

Python package for both downloading and parsing PLOS XML articles
Turns PLOS XML articles into Python data structures
Doesn't require knowledge of XML to use
Focuses on article metadata (e.g., title, authors, date of publication)
Work in progress, so aspects of it may change

allofplos basics¶

Initialize an article object w/DOI or XML filename
- DOI (Digital Object Identifier) is a unique identifier for an online document/article
- All PLOS DOIs start with "10.1371/journal.", like "10.1371/journal.pone.0185809"
allofplos XML files are named with last part of DOI, e.g. "journal.pone.0185809.xml"

from allofplos import Article # if have run `pip install allofplos`
# from article_class import Article  # if inside cloned GitHub directory

# first instantiation of Article class by DOI
article = Article('10.1371/journal.pone.0178690')
article.title

'Physician assessments of drug seeking behavior: A mixed methods study'

# first instantiation of Article class by filename
article = Article.from_filename('allofplos_xml/journal.pone.0181748.xml')
article.title

'THPdb: Database of FDA-approved peptide and protein therapeutics'

# new article
article.doi = '10.1371/journal.pone.0183591'
article.title

'A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review'

Notable properties¶

Try printing or returning some of these values

Basic metadata¶

article.doi

'10.1371/journal.pone.0183591'

article.journal

'PLOS ONE'

article.pubdate

datetime.datetime(2017, 9, 13, 0, 0)

article.title

'A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review'

article.counts

{'fig-count': '3', 'page-count': '14', 'table-count': '2'}

article.word_count

print(article.abstract[:310])

Irreproducibility of preclinical biomedical research has gained recent attention. It is suggested that requiring authors to complete a checklist at the time of manuscript submission would improve the quality and transparency of scientific reporting, and ultimately enhance reproducibility. Whether a checklist

People¶

contributor = article.contributors[0]
contributor.keys()

dict_keys(['contrib_initials', 'given_names', 'surname', 'group_name', 'ids', 'rid_dict', 'contrib_type', 'author_type', 'editor_type', 'email', 'affiliations', 'author_roles', 'footnotes'])

article.authors[0]

{'affiliations': ['Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America'],
 'author_roles': {'CASRAI CREDiT taxonomy': ['Conceptualization',
   'Data curation',
   'Formal analysis',
   'Funding acquisition',
   'Investigation',
   'Writing – original draft',
   'Writing – review & editing']},
 'author_type': 'corresponding',
 'contrib_initials': 'SH',
 'contrib_type': 'author',
 'editor_type': None,
 'email': ['shan.workmd@gmail.com'],
 'footnotes': ['Current address: Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University, Chicago, Illinois, United States of America'],
 'given_names': 'SeungHye',
 'group_name': None,
 'ids': [{'authenticated': 'true',
   'id': 'http://orcid.org/0000-0001-5625-6337',
   'id_type': 'orcid'}],
 'rid_dict': {'aff': ['aff001'],
  'corresp': ['cor001'],
  'fn': ['currentaff001']},
 'surname': 'Han'}

article.corr_author

[{'affiliations': ['Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America'],
  'author_roles': {'CASRAI CREDiT taxonomy': ['Conceptualization',
    'Data curation',
    'Formal analysis',
    'Funding acquisition',
    'Investigation',
    'Writing – original draft',
    'Writing – review & editing']},
  'author_type': 'corresponding',
  'contrib_initials': 'SH',
  'contrib_type': 'author',
  'editor_type': None,
  'email': ['shan.workmd@gmail.com'],
  'footnotes': ['Current address: Division of Pulmonary and Critical Care, Department of Medicine, Northwestern University, Chicago, Illinois, United States of America'],
  'given_names': 'SeungHye',
  'group_name': None,
  'ids': [{'authenticated': 'true',
    'id': 'http://orcid.org/0000-0001-5625-6337',
    'id_type': 'orcid'}],
  'rid_dict': {'aff': ['aff001'],
   'corresp': ['cor001'],
   'fn': ['currentaff001']},
  'surname': 'Han'}]

article.editor[0]

{'affiliations': ['Fraunhofer Research Institution of Marine Biotechnology, GERMANY'],
 'author_roles': {None: ['Editor']},
 'author_type': None,
 'contrib_initials': 'JB',
 'contrib_type': 'editor',
 'editor_type': None,
 'email': None,
 'footnotes': [],
 'given_names': 'Johannes',
 'group_name': None,
 'ids': [],
 'rid_dict': {'aff': ['edit1']},
 'surname': 'Boltze'}

Article type¶

article.type_  # JATS

'research-article'

article.plostype

'Research Article'

article.proof  # whether an uncorrected proof/early version or not

Local article file (more on this later)¶

article.filename

'/Users/Elizabeth/PLOS_Corpus_Project/allofplos/allofplos/allofplos_xml/journal.pone.0183591.xml'

article.local

True

article.tree

<lxml.etree._ElementTree at 0x10efa9dc8>

article.root

<Element article at 0x1118cf348>

article.xml

Other notable methods¶

article.get_dates()

{'accepted': datetime.datetime(2017, 8, 7, 0, 0),
 'collection': datetime.datetime(2017, 1, 1, 0, 0),
 'epub': datetime.datetime(2017, 9, 13, 0, 0),
 'received': datetime.datetime(2017, 3, 19, 0, 0)}

article.check_if_doi_resolves()

'works'

article

DOI: 10.1371/journal.pone.0183591
Title: A checklist is associated with increased quality of reporting preclinical biomedical research: A systematic review

print(article)

Primer on XML elements¶

Why use lxml.etree for parsing XML documents?¶

JATS standard for scientific article XML is consistent
BeautifulSoup better for unreliable web documents
lxml.etree and BeautifulSoup each have a module of the other
The following XML element examples are derived from actual PLOS articles

Primer on XML elements¶

XML elements have four key properties in the lxml.etree library¶

element.tag
element.text
element.attrib
element.tail

Example basic element¶

<article-title>Why Most Published Research Findings Are False</article-title>

element.tag

'article-title'

element.text

'Why Most Published Research Findings Are False'

Basic element with attribute¶

<alt-title alt-title-type="running-head">Essay</alt-title>

element.tag

'alt-title'

element.text

'Essay'

element.attrib

{'alt-title-type': 'running-head'}

# for any text that comes directly after closing tag and before another tag
element.tail

An element attribute is a dictionary¶

Appears inside the element tag¶

XML(element)

<alt-title alt-title-type="running-head">Essay</alt-title>

element.attrib

{'alt-title-type': 'running-head'}

element.attrib['alt-title-type']

'running-head'

element.attrib.get('alt-title-type')

'running-head'

XML elements can have sub-elements¶

<title-group>
  <article-title>Why Most Published Research Findings Are False</article-title>
  <alt-title alt-title-type="running-head">Essay</alt-title>
</title-group>

# to find direct descendants; don't need to know their tags
element.getchildren()

[<Element article-title at 0x10ab16888>, <Element alt-title at 0x1087d06c8>]

new_element = element.getchildren()[0]
new_element.tag

'article-title'

# to find direct ancestor; don't need to know its tag
new_element.getparent()

<Element title-group at 0x10ab16588>

Finding sub-elements by name with xpath¶

Example element: Creative Commons License¶

<license xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xlink:href="http://creativecommons.org/licenses/by/4.0/" xlink:type="simple">
  <license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/" xlink:type="simple">Creative Commons Attribution License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>

Xpath returns a list of search results¶

# search direct descendants by name
license.xpath('./license-p')

[<Element license-p at 0x10ab47488>]

# search descendants of direct descendants
license.xpath('./license-p/ext-link')

[<Element ext-link at 0x10ab445c8>]

# search ALL descendants
license.xpath('.//ext-link')

[<Element ext-link at 0x10ab445c8>]

Warning 1: Multiple elements can have the same xpath location¶

Remember that it always returns a list¶

<contrib-group>
  <contrib contrib-type="author">
    <name name-style="western">
      <surname>dos Santos</surname>
      <given-names>Renato Vieira</given-names>
    </name>
  </contrib>
  <contrib contrib-type="author">
    <name name-style="western">
      <surname>da Silva</surname>
      <given-names>Linaena Mericy</given-names>
    </name>
  </contrib>
</contrib-group>

element.xpath('./contrib')

[<Element contrib at 0x10ab44608>, <Element contrib at 0x10ab4d508>]

Warning 2: element.text doesn't always work¶

When in doubt, use `lxml.etree.tostring()`¶

<license xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xlink:href="http://creativecommons.org/licenses/by/4.0/" xlink:type="simple">
  <license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/" xlink:type="simple">Creative Commons Attribution License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>

print(license.text)

None

import lxml.etree as et
license_text = et.tostring(license, method='text', encoding='unicode')
print(license_text)

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Quick quiz before we move on!¶

<contrib contrib-type="author" equal-contrib="yes">
    <name name-style="western">
        <surname>Chen</surname>
        <given-names>Ximing</given-names>
    </name>plossy
</contrib>

element.tag?
element.attrib?

element.tag: contrib
element.attrib: {'contrib-type': 'author', 'equal-contrib': 'yes'}

Quick quiz before we move on!¶

<contrib contrib-type="author" equal-contrib="yes">
    <name name-style="western">
        <surname>Chen</surname>
        <given-names>Ximing</given-names>
    </name>plossy
</contrib>

new_element = element.xpath('./name')[0]

new_element.tag?
new_element.attrib?
new_element.tail?

new_element.tag: name
new_element.attrib: {'name-style': 'western'}
new_element.tail: plossy

Using lxml.etree and allofplos Article class to parse XML files¶

from allofplos import Article
doi = '10.1371/journal.pone.0183591'
article = Article(doi)
article.filename

'/Users/Elizabeth/PLOS_Corpus_Project/allofplos/allofplos/allofplos_xml/journal.pone.0183591.xml'

# Is the article XML file locally stored?
article.local

True

tree = article.tree
tree.xpath('./body')

[<Element body at 0x10b176a48>]

xml_root = article.root
xml_root.xpath('.//license')

[<Element license at 0x10ab58488>]

license = xml_root.xpath('.//license')[0]
license.attrib

{'{http://www.w3.org/1999/xlink}href': 'http://creativecommons.org/licenses/by/4.0/', '{http://www.w3.org/1999/xlink}type': 'simple'}

Tying it all together!¶

Project example: which articles use PCR in their Methods section?¶

First, on finding elements with xpath searching tag AND attribute¶

Body of article is divided into sections ('sec')
Method section attribute of note: {'sec-type': 'materials|methods'} or {'sec-type': 'methods'}

methods_sections = xml_root.xpath("//sec[@sec-type='materials|methods']")
print(methods_sections)

[<Element sec at 0x10b176b48>]

from allofplos.samples.corpus_analysis import get_random_list_of_dois
from allofplos.article_class import Article
import lxml.etree as et

# First get list of articles/DOIs
dois = get_random_list_of_dois(count=50)
pcr_list = []
# Initialize first article object
article = Article(dois[0])

for doi in dois:
    # Step 1: create new article object
    article.doi = doi
    xml_root = article.root
    # Step 2: find Method sections
    methods_sections = xml_root.xpath("//sec[@sec-type='materials|methods']")
    if not methods_sections:
        methods_sections = xml_root.xpath("//sec[@sec-type='methods']")
    for sec in methods_sections:
        # Step 3: turn the method sections into strings
        method_string = et.tostring(sec, method='text', encoding='unicode')
        # Step 4: add DOI if 'PCR' in string
        if 'PCR' in method_string:
            pcr_list.append(article.doi)
            break
        else:
            pass

print(pcr_list[0:5])

['10.1371/journal.pone.0128195', '10.1371/journal.pone.0165464', '10.1371/journal.pone.0136574', '10.1371/journal.pone.0072749', '10.1371/journal.pone.0060101']

Mining scientific articles

with XML, Python, and allofplos

Elizabeth Seiver plos.org, @tweetotaler PLOS and The Hacker Within Wed November 29, 2017

Why Mine Scientific Articles?¶

PLOS corpus of articles¶

Tutorial plan¶

Tutorial structure¶

allofplos¶

allofplos basics¶

Notable properties¶

Basic metadata¶

People¶

Article type¶

Local article file (more on this later)¶

Other notable methods¶

Primer on XML elements¶

Why use lxml.etree for parsing XML documents?¶

Primer on XML elements¶

XML elements have four key properties in the lxml.etree library¶

Example basic element¶

Basic element with attribute¶

An element attribute is a dictionary¶

Appears inside the element tag¶

XML elements can have sub-elements¶

Finding sub-elements by name with xpath¶

Example element: Creative Commons License¶

Xpath returns a list of search results¶

Warning 1: Multiple elements can have the same xpath location¶

Remember that it always returns a list¶

Warning 2: element.text doesn't always work¶

When in doubt, use lxml.etree.tostring()¶

Quick quiz before we move on!¶

Quick quiz before we move on!¶

Using lxml.etree and allofplos Article class to parse XML files¶

Tying it all together!¶

Project example: which articles use PCR in their Methods section?¶

First, on finding elements with xpath searching tag AND attribute¶

Elizabeth Seiver
plos.org, @tweetotaler
PLOS and The Hacker Within
Wed November 29, 2017

When in doubt, use `lxml.etree.tostring()`¶