DeepDive Open Datasets
Over the last few years, our collaborators have used DeepDive to build over a dozen different applications. Often, these applications are interesting because of the sheer volume of documents that one can read--millions of documents or billions of Web pages. However, many researchers can't get started on these problems because NLP preprocessing is simply too expensive and painful. With the generous support of Condor CHTC, we have been fortunate to have a ready supply of CPU hours for our own projects. The datasets below have taken more than 5 million CPU hours to produce. It’s time that we gave back to the NLP community that has given us so many good ideas for our own systems.
Our work would not be possible without open data. As a result, our group decided to enhance as many Creative Commons datasets as we can find. Below, we describe the data format and provide a small list of data sets on this page. We plan to process more data sets and to provide richer data as well (extracted entities and relationships). Feel free to contact us to suggest more open data sets to process.
Data Format (NLP Markups)
The datasets we provide are in two formats, and for most datasets, we provide data in both formats.
DeepDive-ready DB Dump. In this format, the data is a database table that can be loaded directly into PostgreSQL or Greenplum. The schema of this table is the same as the one used in our tutorial example, so you can start building your own DeepDive applications immediately after downloading.
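For reference, here is a minimal loading sketch. It assumes the dump is a plain tab-separated export of a single table and that a table matching the tutorial's sentence schema has already been created; the database name, table name, and file name below are placeholders rather than names taken from the release.

```python
import psycopg2  # PostgreSQL driver; Greenplum accepts the same COPY syntax

# Placeholders: adjust the connection string, table name, and dump file name
# to match your setup and the dataset you downloaded.
conn = psycopg2.connect(dbname="deepdive_open")
with conn, conn.cursor() as cur, open("pmc_oa_sentences.tsv") as dump:
    # Stream the dump into the pre-created table using PostgreSQL's COPY.
    cur.copy_expert("COPY sentences FROM STDIN", dump)
conn.close()
```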
CoNLL-format Markups. Using DeepDive gives us the opportunity to better support your application and related technical questions; however, you do not need to be tied to DeepDive to use our datasets. We also provide a format similar to the one used in the CoNLL-X shared task. The columns of the TSV file are arranged as follows:
ID
FORM
POSTAG
NERTAG
LEMMA
DEPREL
HEAD
SENTID
PROV
The meaning of most of these columns can be found in the CoNLL specification. PROV is the word's provenance in the original document (a set of bounding boxes or character offsets), which we describe in detail below.
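As an illustration, the sketch below reads one of these files into simple per-token records; it assumes tab-separated fields in the column order listed above, and the file name and helper names are placeholders.

```python
from collections import namedtuple

# One record per token, in the column order listed above.
Token = namedtuple(
    "Token",
    ["id", "form", "postag", "nertag", "lemma", "deprel", "head", "sentid", "prov"],
)

def read_tokens(path):
    """Yield a Token for every non-empty, nine-column line of the TSV file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                yield Token(*fields)

# Example: group surface forms by sentence (file name is a placeholder).
sentences = {}
for tok in read_tokens("markup.tsv"):
    sentences.setdefault(tok.sentid, []).append(tok.form)
```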
Provenance. For each word in our dataset, we provide its provenance back to the original document. Depending on the format of the original document, the provenance is provided in one of the following two formats.
PDF provenance: If the original document is a PDF or a (scanned) image, the provenance for a word is a set of bounding boxes. For example,
[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892],
means that the corresponding word appears on page 8 of the original document. It has two bounding boxes because it crosses two lines. The first bounding box has left margin 1329, top margin 1388, right margin 1453, and bottom margin 1405. All numbers are in pixels when the image is rendered at 300 dpi.
Pure-text provenance: If the original document is HTML or pure text, the provenance for a word is a set of intervals of character offsets. For example,
14/15
means that the corresponding word consists of a single character that starts at character offset 14 (inclusive) and ends at character offset 15 (exclusive).
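To make both encodings concrete, here is a small parsing sketch; the regular expression simply mirrors the page/left/top/right/bottom pattern from the example above, and the function names are illustrative.

```python
import re

_BOX = re.compile(r"p(\d+)l(\d+)t(\d+)r(\d+)b(\d+)")

def parse_pdf_prov(prov):
    """Parse '[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892]' into a list of
    (page, left, top, right, bottom) tuples, all in pixels at 300 dpi."""
    return [tuple(int(g) for g in m.groups()) for m in _BOX.finditer(prov)]

def parse_text_prov(prov):
    """Parse character-offset provenance such as '14/15' (the WIKI sample
    later on this page uses ':' as the separator) into a (start, end) pair,
    start inclusive and end exclusive."""
    start, end = re.split(r"[/:]", prov)
    return int(start), int(end)

print(parse_pdf_prov("[p8l1329t1388r1453b1405, p8l1313t1861r1440b1892]"))
# -> [(8, 1329, 1388, 1453, 1405), (8, 1313, 1861, 1440, 1892)]
print(parse_text_prov("14/15"))  # -> (14, 15)
```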
Each dataset is versioned by date, with the MD5 checksum included in the file name.
PMC-OA (PubMed Central Open Access Subset)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 70 GB | Document Type | Journal Articles |
# Documents | 359,324 | # Machine Hours | 100 K |
# Words | 2.7 Billion | # Sentences | 110 Million |
Downloads | | | |
PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). DeepDive's PMC-OA corpus contains a full snapshot that we downloaded in March 2014 from the PubMed Central Open Access Subset.
PMC articles are released under a variety of Creative Commons licenses. Information obtained on Jan 27, 2015.
BMC (BioMed Central)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 21 GB | Document Type | Journal Articles |
# Documents | 70,043 | # Machine Hours | 20 K |
# Sentences | 15 Million | # Words | 400 Million |
Downloads | | | |
BioMed Central is an STM (Science, Technology and Medicine) publisher of 274 peer-reviewed open access journals. We plan for DeepDive's BMC corpus to contain a full snapshot of BioMed Central as of Jan 2015.
BioMed Central applies the CC BY 4.0 license. Information obtained on Jan 27, 2015.
PLOS (Public Library of Science)
Pipeline | PDF > OCR (Tesseract) > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 70 GB | Document Type | Journal Articles |
# Documents | 125,378 | # Machine Hours | 370 K |
# Words | 1.3 Billion | # Sentences | 73 Million |
Downloads | | | |
PLOS is a nonprofit open access scientific publishing project aimed at creating a library of open access journals and other scientific literature under an open content license. DeepDive's PLOS corpus contains a full snapshot, downloaded in Aug 2014, of the following PLOS journals: (1) PLOS Biology, (2) PLOS Medicine, (3) PLOS Computational Biology, (4) PLOS Genetics, (5) PLOS Pathogens, (6) PLOS Clinical Trials, (7) PLOS ONE, (8) PLOS Neglected Tropical Diseases, and (9) PLOS Currents.
PLOS applies the CC BY 3.0 license. Information obtained on Jan 26, 2015.
BHL (Biodiversity Heritage Library)
Pipeline | OCR'ed Text > NLP (Stanford CoreNLP 1.3.4) | | |
---|---|---|---|
Size | 229 GB | Document Type | Books |
# Documents | 98,099 | # Machine Hours | 500 K |
# Sentences | 1 Billion | # Words | 8.7 Billion |
Downloads | | | |
The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global ''biodiversity commons.'' DeepDive's BHL corpus contains a full snapshot that we downloaded in Jan 2014 from the Biodiversity Heritage Library.
BHL applies the CC BY-NC-SA 4.0 license. Information obtained on Jan 26, 2015.
PATENT (Google Patents)
Pipeline | OCR'ed Text > NLP (Stanford CoreNLP 3.5.1) | | |
---|---|---|---|
Size | 428 GB | Document Type | Government Documents |
# Documents | 2,437,000 | # Machine Hours | 100 K |
# Sentences | 248 Million | # Words | 7.7 Billion |
Downloads | | | |
We plan for DeepDive's PATENT Corpus to contain a full snapshot of patent grants since 1920 from the United States Patent and Trademark Office (USPTO), the European Patent Office (EPO), and the World Intellectual Property Organization (WIPO), as indexed by Google Patents in Feb 2015.
The patent documents we processed are in the public domain. Information obtained on Jan 27, 2015.
WIKI (Wikipedia (English Edition))
Pipeline | WIKI PAGE > WikiExtractor (link) > NLP (Stanford CoreNLP 3.5.1) | | |
---|---|---|---|
Size | 97 GB | Document Type | Web page |
# Documents | 4,776,093 | # Machine Hours | 24 K |
# Sentences | 85 Million | # Words | 2 Billion |
Downloads | | | |
Wikipedia is a free-access, free-content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. We plan for DeepDive's WIKI Corpus to contain a full snapshot of the English-language edition of Wikipedia as of Feb 2015.
Note. For Web-based data sets that contain millions of Web pages, e.g., Wikipedia, CommonCrawl, and ClueWeb, we follow the standard WARC format to deliver the CoNLL-based NLP results. Each chunk is a .gz file that contains a single .warc file, and each .warc file contains multiple Web pages, as in the example below (a minimal reading sketch follows the example):
WARC/1.0
WARC-Type: conversion
Content-Length: 30098
WARC-Date: 2015-03-03T15:11:08Z
WARC-Payload-Digest: sha1:e7e0459dce73775510147726156fb74f30aa07c3
WARC-Target-URI: http://en.wikipedia.org/wiki?curid=2504364
Content-Type: application/octet-stream
WARC-Record-ID: <urn:uuid:855b1a64-c1b7-11e4-886d-842b2b4a49e6>
1 Mr. NNP O Mr. nn 4 SENT_1 0:3
2 Capone-E NNP O Capone-E nn 4 SENT_1 4:12
3 Mr. NNP O Mr. nn 4 SENT_1 14:17
4 Capone-E NNP O Capone-E nsubj 11 SENT_1 18:26
5 or CC O or null 0 SENT_1 27:29
6 Fahd NNP O Fahd nn 7 SENT_1 30:34
7 Azam NNP O Azam conj_or 4 SENT_1 35:39
8 is VBZ O be cop 11 SENT_1 40:42
9 an DT O a det 11 SENT_1 43:45
10 Pakistani JJ MISC pakistani amod 11 SENT_1 46:55
11 rapper NN O rapper null 0 SENT_1 56:62
12 . . O . null 0 SENT_1 62:63
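The sketch below shows one way to read such a chunk; it splits records on the WARC/1.0 marker instead of honoring Content-Length, which is enough for skimming the data but is not a substitute for a proper WARC parser, and the file name is a placeholder.

```python
import gzip

def iter_warc_records(path):
    """Yield (header, payload_lines) pairs from a .warc.gz chunk.
    Records are split on the 'WARC/1.0' marker; Content-Length is ignored."""
    header, payload, in_header = {}, [], False
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "WARC/1.0":
                if payload:
                    yield header, payload
                header, payload, in_header = {}, [], True
            elif in_header:
                if not line:                 # blank line ends the header block
                    in_header = False
                else:
                    key, _, value = line.partition(": ")
                    header[key] = value
            elif line:
                payload.append(line)         # CoNLL-format token rows
    if payload:
        yield header, payload

# Example: count token rows per page (file name is a placeholder).
for header, rows in iter_warc_records("wiki-chunk-000.warc.gz"):
    print(header.get("WARC-Target-URI"), len(rows))
```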
Wikipedia applies the CC BY-SA 3.0 Unported license. Information obtained on Jan 27, 2015.
CCRAWL (CommonCrawl)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP) | | |
---|---|---|---|
Size | - | Document Type | Web page |
# Documents | - | # Machine Hours | - |
# Sentences | - | # Words | - |
Downloads | | | |
We plan for DeepDive's CCRAWL Corpus to cover a full snapshot of the Common Crawl Corpus, which is a corpus of web crawl data composed of over 5 billion web pages.
This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
CLUEWEB (ClueWeb)
Pipeline | HTML > STRIP (html2text) > NLP (Stanford CoreNLP) | | |
---|---|---|---|
Size | - | Document Type | Web page |
# Documents | - | # Machine Hours | - |
# Sentences | - | # Words | - |
Downloads | | | |
We plan for DeepDive's CLUEWEB Corpus to cover a full snapshot of the ClueWeb 2012 corpus, which is a corpus of web crawl data composed of over 733 million web pages.
More Datasets Are Coming -- Stay Tuned!
We are currently working hard to make more (10+!) datasets available in the next couple of months. In the meantime, we'd love to hear about your applications or interesting datasets that you have in mind. Just let us know!
To cite DeepDive open datasets, you can use the following BibTeX citation:
@misc{DeepDive:2015:OpenData,
author = { Christopher R\'{e} and Ce Zhang },
title = { {DeepDive} open datasets },
howpublished = { \url{http://deepdive.stanford.edu/opendata} },
year = { 2015 }
}