DeepDive Open Datasets

Over the last few years, our collaborators have used DeepDive to build over a dozen different applications. Often, these applications are interesting because of the sheer volume of documents that one can read: millions of documents or billions of Web pages. However, many researchers can't get started on these problems because NLP preprocessing is simply too expensive and painful. With the generous support of Condor CHTC, we have been fortunate to have a ready supply of CPU hours for our own projects. The datasets below have taken over 5 million CPU hours to produce. It’s time that we gave back to the NLP community that has given us so many good ideas for our own systems.

Our work would not be possible without open data. As a result, our group decided to enhance as many Creative Commons datasets as we can find. On this page, we describe the data format and provide a small list of datasets. We plan to process more datasets and to provide richer data as well (extracted entities and relationships). Feel free to contact us to suggest more open datasets to process.

Acknowledgement
           
We would like to thank the HTCondor research group and the Center for High Throughput Computing (CHTC) at the University of Wisconsin-Madison, who have provided millions of machine hours to our group. Thank you, Miron Livny. We would also like to thank the Stanford Natural Language Processing Group, whose tools we use in many of our applications. DeepDive is also generously supported by our sponsors.

Data Format (NLP Markups)

The datasets we provide are in two formats, and for most datasets, we provide data in both formats.

  • DeepDive-ready DB Dump. In this format, the data is a database table that can be loaded directly into PostgreSQL or Greenplum. The schema of this table is the same as the one used in our tutorial example, so you can start building your own DeepDive applications immediately after download.

  • CoNLL-format Markups. Using DeepDive gives us the opportunity to better support your application and answer related technical questions; however, you do not need to be tied to DeepDive to use our datasets. We also provide a format similar to the one used in the CoNLL-X shared task. The columns of the TSV file are arranged as follows: ID FORM POSTAG NERTAG LEMMA DEPREL HEAD SENTID PROV. The meaning of most of these names can be found in the CoNLL specification. PROV is a set of bounding boxes in the original document that correspond to the word; it is described in detail below.
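
As a minimal sketch of reading the CoNLL-format rows, the Python below splits each row into the nine columns listed above and groups consecutive rows by SENTID. The names `Token`, `parse_row`, and `sentences` are hypothetical, and the code assumes that fields are separated by single tab characters, that ID and HEAD are always integers, and that rows of one sentence are contiguous; check these assumptions against the files you download.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of the CoNLL-style TSV, in the documented column order."""
    id: int
    form: str
    postag: str
    nertag: str
    lemma: str
    deprel: str
    head: int
    sentid: str
    prov: str  # kept as raw text; its format depends on the source document


def parse_row(line: str) -> Token:
    """Split one tab-separated row into a Token (assumes exactly 9 columns)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
    id_, form, postag, nertag, lemma, deprel, head, sentid, prov = fields
    return Token(int(id_), form, postag, nertag, lemma, deprel,
                 int(head), sentid, prov)


def sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group consecutive rows that share a SENTID into one sentence."""
    tokens = (parse_row(line) for line in lines if line.strip())
    for _, group in groupby(tokens, key=lambda t: t.sentid):
        yield list(group)
```

Passing an open text file to `sentences` yields one list of `Token` values per sentence, so a whole file never has to be held in memory.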

Provenance. For each word in our dataset, we provide its provenance back to the original document. Depending on the format of the original document, the provenance is provided in one of the following two formats.

  • PDF provenance: If the original document is a PDF or a (scanned) image, the provenance for a word is a set of bounding boxes. For example, [p8l1329t1388r1453b1405, p8l1313t1861r1440b1892] means that the corresponding word appears on page 8 of the original document. It has two bounding boxes because it crosses two lines. The first bounding box has left margin 1329, top margin 1388, right margin 1453, and bottom margin 1405. All numbers are in pixels, measured after the page is converted to an image at 300 dpi.

  • Pure-text provenance: If the original document is HTML or pure text, the provenance for a word is a set of intervals of character offsets. For example, 14/15 means that the corresponding word contains a single character that starts at character offset 14 (inclusive) and ends at character offset 15 (exclusive).
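
The two provenance formats above can be decoded with a pair of regular expressions. The sketch below is illustrative only: `Box`, `Span`, `parse_pdf_provenance`, and `parse_text_provenance` are hypothetical names, and the way multiple character intervals are delimited is not specified on this page, so the code simply collects every start/end pair it finds. It accepts both "/" and ":" between offsets as an assumption, because the example above uses a slash while the sample WARC record on this page uses a colon.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """A bounding box in pixels on a page rendered at 300 dpi."""
    page: int
    left: int
    top: int
    right: int
    bottom: int


class Span(NamedTuple):
    """A character interval: start is inclusive, end is exclusive."""
    start: int
    end: int


# Matches one box such as p8l1329t1388r1453b1405.
_BOX = re.compile(r"p(\d+)l(\d+)t(\d+)r(\d+)b(\d+)")
# Matches one interval such as 14/15 or 0:3 (separator is an assumption).
_SPAN = re.compile(r"(\d+)[/:](\d+)")


def parse_pdf_provenance(prov: str) -> List[Box]:
    """Return every bounding box found in a PDF provenance string."""
    return [Box(*map(int, m.groups())) for m in _BOX.finditer(prov)]


def parse_text_provenance(prov: str) -> List[Span]:
    """Return every character interval found in a text provenance string."""
    return [Span(int(a), int(b)) for a, b in _SPAN.findall(prov)]
```

Because the end offset is exclusive, `text[span.start:span.end]` should recover the word from the stripped source text, provided the offsets index the same text you have on hand.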

Each dataset is versioned by date, and its file name includes the MD5 checksum.
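
Since the checksum travels with the file name, a download can be verified locally. The following is a sketch under stated assumptions: the exact file-naming scheme is not given on this page, so `checksum_from_name` (a hypothetical helper) just looks for a standalone run of 32 hexadecimal digits in the name, and `verify_download` is likewise an illustrative name rather than part of any DeepDive tool.

```python
import hashlib
import re
from pathlib import Path

# Assumption: the MD5 appears in the file name as 32 hexadecimal digits.
_MD5_IN_NAME = re.compile(r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_from_name(path):
    """Return the 32-hex-digit string embedded in the file name, or None."""
    match = _MD5_IN_NAME.search(Path(path).name)
    return match.group(0).lower() if match else None


def verify_download(path):
    """Return True if the file's MD5 matches the one in its name."""
    expected = checksum_from_name(path)
    if expected is None:
        raise ValueError(f"no MD5 checksum found in file name: {path}")
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for dumps that run to tens or hundreds of gigabytes.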

PMC-OA (PubMed Central Open Access Subset)

Quick Statistics & Downloads
Pipeline         HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4)
Size             70 GB          Document Type    Journal Articles
# Documents      359,324        # Machine Hours  100 K
# Words          2.7 Billion    # Sentences      110 Million
Downloads

PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). DeepDive's PMC-OA corpus contains a full snapshot that we downloaded in March 2014 from the PubMed Central Open Access Subset.

PMC applies a variety of Creative Commons licenses. Information obtained on Jan 27, 2015.

BMC (BioMed Central)

Quick Statistics & Downloads
Pipeline         HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4)
Size             21 GB          Document Type    Journal Articles
# Documents      70,043         # Machine Hours  20 K
# Sentences      15 Million     # Words          400 Million
Downloads

BioMed Central is an STM (Science, Technology and Medicine) publisher of 274 peer-reviewed open access journals. We plan for DeepDive's BMC corpus to contain a full snapshot of BioMed Central from Jan 2015.

BioMed Central applies the CC BY 4.0 license. Information obtained on Jan 27, 2015.

PLOS (Public Library of Science)

Quick Statistics & Downloads
Pipeline         PDF > OCR (Tesseract) > NLP (Stanford CoreNLP 1.3.4)
Size             70 GB          Document Type    Journal Articles
# Documents      125,378        # Machine Hours  370 K
# Words          1.3 Billion    # Sentences      73 Million
Downloads

PLOS is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license. DeepDive's PLOS corpus contains a full snapshot that we downloaded in Aug 2014 of the following PLOS journals: (1) PLOS Biology, (2) PLOS Medicine, (3) PLOS Computational Biology, (4) PLOS Genetics, (5) PLOS Pathogens, (6) PLOS Clinical Trials, (7) PLOS ONE, (8) PLOS Neglected Tropical Diseases, and (9) PLOS Currents.

PLOS applies the CC BY 3.0 license. Information obtained on Jan 26, 2015.

BHL (Biodiversity Heritage Library)

Quick Statistics & Downloads
Pipeline         OCR'ed Text > NLP (Stanford CoreNLP 1.3.4)
Size             229 GB         Document Type    Books
# Documents      98,099         # Machine Hours  500 K
# Sentences      1 Billion      # Words          8.7 Billion
Downloads

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global "biodiversity commons." DeepDive's BHL corpus contains a full snapshot that we downloaded in Jan 2014 from the Biodiversity Heritage Library.

BHL applies the CC BY-NC-SA 4.0 license. Information obtained on Jan 26, 2015.

PATENT (Google Patents)

Quick Statistics & Downloads
Pipeline         OCR'ed Text > NLP (Stanford CoreNLP 3.5.1)
Size             428 GB         Document Type    Government Document
# Documents      2,437,000      # Machine Hours  100 K
# Sentences      248 Million    # Words          7.7 Billion
Downloads

We plan for DeepDive's PATENT corpus to contain a full snapshot of patent grants since 1920 from the United States Patent and Trademark Office (USPTO), European Patent Office (EPO), and World Intellectual Property Organization (WIPO), as indexed by Google Patents in Feb 2015.

The patent applications we processed belong to the public domain. Information obtained on Jan 27, 2015.

WIKI (Wikipedia (English Edition))

Quick Statistics & Downloads
Pipeline         WIKI PAGE > WikiExtractor > NLP (Stanford CoreNLP 3.5.1)
Size             97 GB          Document Type    Web page
# Documents      4,776,093      # Machine Hours  24 K
# Sentences      85 Million     # Words          2 Billion
Downloads

Wikipedia is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. We plan for DeepDive's WIKI corpus to contain a full snapshot of the English-language edition of Wikipedia from Feb 2015.

Note. For Web-based datasets that contain millions of Web pages, e.g., Wikipedia, we follow the standard WARC format to deliver the CoNLL-based NLP results. Each chunk is a .gz file that contains a single .warc file. Each .warc file contains multiple Web pages, e.g.,

WARC/1.0
WARC-Type: conversion
Content-Length: 30098
WARC-Date: 2015-03-03T15:11:08Z
WARC-Payload-Digest: sha1:e7e0459dce73775510147726156fb74f30aa07c3
WARC-Target-URI: http://en.wikipedia.org/wiki?curid=2504364
Content-Type: application/octet-stream
WARC-Record-ID: <urn:uuid:855b1a64-c1b7-11e4-886d-842b2b4a49e6>

1 Mr. NNP O Mr. nn  4 SENT_1  0:3
2 Capone-E  NNP O Capone-E  nn  4 SENT_1  4:12
3 Mr. NNP O Mr. nn  4 SENT_1  14:17
4 Capone-E  NNP O Capone-E  nsubj 11  SENT_1  18:26
5 or  CC  O or  null  0 SENT_1  27:29
6 Fahd  NNP O Fahd  nn  7 SENT_1  30:34
7 Azam  NNP O Azam  conj_or 4 SENT_1  35:39
8 is  VBZ O be  cop 11  SENT_1  40:42
9 an  DT  O a det 11  SENT_1  43:45
10  Pakistani JJ  MISC  pakistani amod  11  SENT_1  46:55
11  rapper  NN  O rapper  null  0 SENT_1  56:62
12  . . O . null  0 SENT_1  62:63
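
A chunk laid out like the record above can be read with the Python standard library alone. This is a minimal sketch rather than a full WARC implementation: `iter_warc_records` and `iter_chunk` are hypothetical names, and the code assumes each record is a version line, a block of `Name: value` headers, a blank line, and then exactly Content-Length bytes of payload. A dedicated WARC library is likely a better choice for production use.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from a binary stream of WARC records."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            # Split on the first colon only, so URIs and dates stay intact.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length counts bytes, so the payload is read as bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def iter_chunk(path):
    """Open one gzip-compressed chunk and yield its records."""
    with gzip.open(path, "rb") as stream:
        yield from iter_warc_records(stream)
```

Each payload can then be decoded with `payload.decode("utf-8").splitlines()` to obtain the CoNLL-style rows for one Web page, with `headers["WARC-Target-URI"]` identifying the page.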

Wikipedia applies the CC BY-SA 3.0 Unported license. Information obtained on Jan 27, 2015.

To cite DeepDive open datasets, you can use the following BibTeX citation:

@misc{DeepDive:2015:OpenData,
  author       = { Christopher R\'{e} and Ce Zhang },
  title        = { {DeepDive} open datasets },
  howpublished = { \url{http://deepdive.stanford.edu/opendata} },
  year         = { 2015 }
}