DeepDive Open Datasets
Over the last few years, our collaborators have used DeepDive to build over a dozen different applications. Often, these applications are interesting because of the sheer volume of documents that one can read--millions of documents or billions of Web pages. However, many researchers can't get started on these problems because NLP preprocessing is simply too expensive and painful. With the generous support of Condor CHTC, we have been fortunate to have a ready supply of CPU hours for our own projects. The datasets below have taken more than 5 million CPU hours to produce. It’s time that we gave back to the NLP community that has given us so many good ideas for our own systems.
Our work would not be possible without open data. As a result, our group decided to enhance as many Creative Commons datasets as we can find. Below, we describe the data format and provide a small list of data sets on this page. We plan to process more data sets and to provide richer data as well (extracted entities and relationships). Feel free to contact us to suggest more open data sets to process.
Data Format (NLP Markups)
The datasets we provide are in two formats, and for most datasets, we provide data in both formats.
DeepDive-ready DB Dump. In this format, the data is a database table that can be loaded directly into PostgreSQL or Greenplum. The schema of this table is the same as the one used in our tutorial example, so you can start building your own DeepDive applications immediately after downloading.
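For reference, here is a minimal loading sketch. It assumes the dump is a plain tab-separated export of a single table and that a table matching the tutorial's sentence schema has already been created; the database name, table name, and file name below are placeholders rather than names taken from the release.

```python
import psycopg2  # PostgreSQL driver; Greenplum accepts the same COPY syntax

# Placeholders: adjust the connection string, table name, and dump file name
# to match your setup and the dataset you downloaded.
conn = psycopg2.connect(dbname="deepdive_open")
with conn, conn.cursor() as cur, open("pmc_oa_sentences.tsv") as dump:
    # Stream the dump into the pre-created table using PostgreSQL's COPY.
    cur.copy_expert("COPY sentences FROM STDIN", dump)
conn.close()
```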
CoNLL-format Markups. Using DeepDive gives us the opportunity to better support your application and related technical questions; however, you do not need to be tied to DeepDive to use our datasets. We also provide a format similar to the one used in the CoNLL-X shared task. The columns of the TSV file are arranged as follows:
ID
FORM
POSTAG
NERTAG
LEMMA
DEPREL
HEAD
SENTID
PROV
The meaning of most of these columns can be found in the CoNLL specification. PROV is the word's provenance in the original document (a set of bounding boxes or character offsets), which we describe in detail below.
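As an illustration, the sketch below reads one of these files into simple per-token records; it assumes tab-separated fields in the column order listed above, and the file name and helper names are placeholders.

```python
from collections import namedtuple

# One record per token, in the column order listed above.
Token = namedtuple(
    "Token",
    ["id", "form", "postag", "nertag", "lemma", "deprel", "head", "sentid", "prov"],
)

def read_tokens(path):
    """Yield a Token for every non-empty, nine-column line of the TSV file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                yield Token(*fields)

# Example: group surface forms by sentence (file name is a placeholder).
sentences = {}
for tok in read_tokens("markup.tsv"):
    sentences.setdefault(tok.sentid, []).append(tok.form)
```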
Provenance. For each word in our dataset, we provide its provenance back to the original document. Depending on the format of the original document, the provenance is provided in one of the following two formats.
PDF provenance: If the original document is a PDF or a (scanned) image, the provenance for a word is a set of bounding boxes. For example,
[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892],
means that the corresponding word appears on page 8 of the original document. It has two bounding boxes because it crosses two lines. The first bounding box has left margin 1329, top margin 1388, right margin 1453, and bottom margin 1405. All numbers are in pixels when the image is rendered at 300 dpi.
Pure-text provenance: If the original document is HTML or pure text, the provenance for a word is a set of intervals of character offsets. For example,
14/15
means that the corresponding word consists of a single character that starts at character offset 14 (inclusive) and ends at character offset 15 (exclusive).
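To make both encodings concrete, here is a small parsing sketch; the regular expression simply mirrors the page/left/top/right/bottom pattern from the example above, and the function names are illustrative.

```python
import re

_BOX = re.compile(r"p(\d+)l(\d+)t(\d+)r(\d+)b(\d+)")

def parse_pdf_prov(prov):
    """Parse '[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892]' into a list of
    (page, left, top, right, bottom) tuples, all in pixels at 300 dpi."""
    return [tuple(int(g) for g in m.groups()) for m in _BOX.finditer(prov)]

def parse_text_prov(prov):
    """Parse character-offset provenance such as '14/15' (the WIKI sample
    later on this page uses ':' as the separator) into a (start, end) pair,
    start inclusive and end exclusive."""
    start, end = re.split(r"[/:]", prov)
    return int(start), int(end)

print(parse_pdf_prov("[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892]"))
# -> [(8, 1329, 1388, 1453, 1405), (8, 1313, 1861, 1440, 1892)]
print(parse_text_prov("14/15"))  # -> (14, 15)
```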
Each dataset is versioned by date, with the MD5 checksum included in the file name.
PMC-OA (PubMed Central Open Access Subset)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 70 GB | Document Type | Journal Articles |
# Documents | 359,324 | # Machine Hours | 100 K |
# Words | 2.7 Billion | # Sentences | 110 Million |
Downloads | | | |
PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). DeepDive's PMC-OA corpus contains a full snapshot that we downloaded in March 2014 from the PubMed Central Open Access Subset.
PMC articles are released under a variety of Creative Commons licenses. Information obtained on Jan 27, 2015.
BMC (BioMed Central)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 21 GB | Document Type | Journal Articles |
# Documents | 70,043 | # Machine Hours | 20 K |
# Sentences | 15 Million | # Words | 400 Million |
Downloads | | | |
BioMed Central is an STM (Science, Technology and Medicine) publisher of 274 peer-reviewed open access journals. We plan for DeepDive's BMC corpus to contain a full snapshot of BioMed Central as of Jan 2015.
BioMed Central applies the CC BY 4.0 license. Information obtained on Jan 27, 2015.
PLOS (Public Library of Science)
Pipeline | PDF > OCR (Tesseract) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 70 GB | Document Type | Journal Articles |
# Documents | 125,378 | # Machine Hours | 370 K |
# Words | 1.3 Billion | # Sentences | 73 Million |
Downloads | | | |
PLOS is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license. DeepDive's PLOS corpus contains a full snapshot, downloaded in Aug 2014, of the following PLOS journals: (1) PLOS Biology, (2) PLOS Medicine, (3) PLOS Computational Biology, (4) PLOS Genetics, (5) PLOS Pathogens, (6) PLOS Clinical Trials, (7) PLOS ONE, (8) PLOS Neglected Tropical Diseases, and (9) PLOS Currents.
PLOS applies the CC BY 3.0 license. Information obtained on Jan 26, 2015.
BHL (Biodiversity Heritage Library)
Pipeline | OCR'ed Text > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 229 GB | Document Type | Books |
# Documents | 98,099 | # Machine Hours | 500 K |
# Sentences | 1 Billion | # Words | 8.7 Billion |
Downloads | | | |
The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global ''biodiversity commons.'' DeepDive's BHL corpus contains a full snapshot that we downloaded in Jan 2014 from the Biodiversity Heritage Library.
BHL applies the CC BY-NC-SA 4.0 license. Information obtained on Jan 26, 2015.
PATENT (Google Patents)
Pipeline | OCR'ed Text > NLP (Stanford CoreNLP 3.5.1) | | |
---|---|---|---|
Size | 428 GB | Document Type | Government Documents |
# Documents | 2,437,000 | # Machine Hours | 100 K |
# Sentences | 248 Million | # Words | 7.7 Billion |
Downloads | | | |
We plan for DeepDive's PATENT Corpus to contain a full snapshot of patent grants since 1920 from the United States Patent and Trademark Office (USPTO), the European Patent Office (EPO), and the World Intellectual Property Organization (WIPO), as indexed by Google Patents in Feb 2015.
The patent documents we processed are in the public domain. Information obtained on Jan 27, 2015.
WIKI (Wikipedia (English Edition))
Pipeline | WIKI PAGE > WikiExtractor (link) > NLP (Stanford CoreNLP 3.5.1) | | |
---|---|---|---|
Size | 97 GB | Document Type | Web page |
# Documents | 4,776,093 | # Machine Hours | 24 K |
# Sentences | 85 Million | # Words | 2 Billion |
Downloads | | | |
Wikipedia is a free-access, free-content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. We plan for DeepDive's WIKI Corpus to contain a full snapshot of the English-language edition of Wikipedia as of Feb 2015.
Note. For Web-based data sets that contain millions of Web pages, e.g., Wikipedia, CommonCrawl, and ClueWeb, we follow the standard WARC format to deliver the CoNLL-based NLP results. Each chunk is a .gz file that contains a single .warc file, and each .warc file contains multiple Web pages, as in the example below (a minimal reading sketch follows the example):
WARC/1.0
WARC-Type: conversion
Content-Length: 30098
WARC-Date: 2015-03-03T15:11:08Z
WARC-Payload-Digest: sha1:e7e0459dce73775510147726156fb74f30aa07c3
WARC-Target-URI: http://en.wikipedia.org/wiki?curid=2504364
Content-Type: application/octet-stream
WARC-Record-ID: <urn:uuid:855b1a64-c1b7-11e4-886d-842b2b4a49e6>
1 Mr. NNP O Mr. nn 4 SENT_1 0:3
2 Capone-E NNP O Capone-E nn 4 SENT_1 4:12
3 Mr. NNP O Mr. nn 4 SENT_1 14:17
4 Capone-E NNP O Capone-E nsubj 11 SENT_1 18:26
5 or CC O or null 0 SENT_1 27:29
6 Fahd NNP O Fahd nn 7 SENT_1 30:34
7 Azam NNP O Azam conj_or 4 SENT_1 35:39
8 is VBZ O be cop 11 SENT_1 40:42
9 an DT O a det 11 SENT_1 43:45
10 Pakistani JJ MISC pakistani amod 11 SENT_1 46:55
11 rapper NN O rapper null 0 SENT_1 56:62
12 . . O . null 0 SENT_1 62:63
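The sketch below shows one way to read such a chunk; it splits records on the WARC/1.0 marker instead of honoring Content-Length, which is enough for skimming the data but is not a substitute for a proper WARC parser, and the file name is a placeholder.

```python
import gzip

def iter_warc_records(path):
    """Yield (header, payload_lines) pairs from a .warc.gz chunk.
    Records are split on the 'WARC/1.0' marker; Content-Length is ignored."""
    header, payload, in_header = {}, [], False
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "WARC/1.0":
                if payload:
                    yield header, payload
                header, payload, in_header = {}, [], True
            elif in_header:
                if not line:                 # blank line ends the header block
                    in_header = False
                else:
                    key, _, value = line.partition(": ")
                    header[key] = value
            elif line:
                payload.append(line)         # CoNLL-format token rows
    if payload:
        yield header, payload

# Example: count token rows per page (file name is a placeholder).
for header, rows in iter_warc_records("wiki-chunk-000.warc.gz"):
    print(header.get("WARC-Target-URI"), len(rows))
```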
Wikipedia applies the CC BY-SA 3.0 Unported license. Information obtained on Jan 27, 2015.
CCRAWL (CommonCrawl)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP) | | |
---|---|---|---|
Size | - | Document Type | Web page |
# Documents | - | # Machine Hours | - |
# Sentences | - | # Words | - |
Downloads | | | |
We plan for DeepDive's CCRAWL Corpus to cover a full snapshot of the Common Crawl Corpus, which is a corpus of web crawl data composed of over 5 billion web pages.
This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
CLUEWEB (ClueWeb)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP) | | |
---|---|---|---|
Size | - | Document Type | Web page |
# Documents | - | # Machine Hours | - |
# Sentences | - | # Words | - |
Downloads | | | |
We plan for DeepDive's CLUEWEB Corpus to cover a full snapshot of the ClueWeb 2012 corpus, which is a corpus of web crawl data composed of over 733 million web pages.
More Datasets Are Coming -- Stay Tuned!
We are currently working hard to make more (10+!) datasets available in the next couple of months. In the meantime, we'd love to hear about your applications or interesting datasets that you have in mind. Just let us know!
To cite DeepDive open datasets, you can use the following BibTeX citation:
@misc{DeepDive:2015:OpenData,
author = { Christopher R\'{e} and Ce Zhang },
title = { {DeepDive} open datasets },
howpublished = { \url{http://deepdive.stanford.edu/opendata} },
year = { 2015 }
}