Text chunking example
This document describes an example application of text chunking with DeepDive to demonstrate how to use categorical factors with categorical variables. The example assumes a working installation of DeepDive and basic knowledge of how to build a DeepDive application. Please go through the tutorial with the spouse example application before proceeding.
Text chunking consists of dividing a text into syntactically correlated groups of words. For example, the following sentence:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
Text chunking is an intermediate step towards full parsing. It was the shared task for CoNLL-2000. Training and test data for this task are derived from the Wall Street Journal (WSJ) corpus and include words, part-of-speech tags, and chunk tags.
In this example, we predict the chunk label for each word. We include three inference rules, corresponding to logistic regression, a linear-chain conditional random field (CRF), and a skip-chain conditional random field. The features and rules are deliberately simple; they only illustrate how to use categorical variables and categorical factors to build applications in DeepDive.
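The bracketed form above corresponds to per-word chunk tags: a B- prefix opens a chunk, an I- prefix continues it, and O marks a word outside any chunk. As a minimal Python sketch, the hypothetical helper `to_brackets` below (not part of DeepDive or of the example code) renders tagged words back into the bracketed notation. The tags in `sentence` are an assumption read off the bracketing of the example sentence.

```python
def to_brackets(tagged):
    """Render (word, chunk tag) pairs as a bracketed chunk string."""
    out, chunk, label = [], [], None
    # A sentinel "O" entry at the end closes the last open chunk.
    for word, tag in list(tagged) + [("", "O")]:
        starts_new = tag == "O" or tag.startswith("B-") or tag[2:] != label
        if chunk and starts_new:
            out.append("[%s %s ]" % (label, " ".join(chunk)))
            chunk, label = [], None
        if tag != "O":
            if not chunk:
                label = tag[2:]      # chunk type, e.g. "NP" from "B-NP"
            chunk.append(word)
        elif word:
            out.append(word)         # words outside any chunk stay bare
    return " ".join(out)


# Tags assumed from the bracketing shown in the example sentence.
sentence = [
    ("He", "B-NP"), ("reckons", "B-VP"),
    ("the", "B-NP"), ("current", "I-NP"), ("account", "I-NP"), ("deficit", "I-NP"),
    ("will", "B-VP"), ("narrow", "I-VP"), ("to", "B-PP"),
    ("only", "B-NP"), ("#", "I-NP"), ("1.8", "I-NP"), ("billion", "I-NP"),
    ("in", "B-PP"), ("September", "B-NP"), (".", "O"),
]

print(to_brackets(sentence))
```

Predicting one tag per word, as this example does, is therefore enough to recover the chunk boundaries.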
Running the example
The complete example is under the examples/chunking directory:
cd examples/chunking/
The structure of this directory is as follows:
- input/ contains training and testing data.
- udf/ contains the extractor for extracting training data and features.
- result/ contains evaluation scripts and sample results.
To run this example, use the following command:
deepdive compile && deepdive run
Then run the following to evaluate the results:
Example walkthrough
The application performs the following high-level steps:
- Data preprocessing: load the training and test data into the database.
- Feature extraction: extract surrounding words and their part-of-speech tags as features.
- Statistical inference and learning.
- Evaluation of the results.
1. Data preprocessing
The training and test data consist of words, their part-of-speech tags, and the chunk tags derived from the WSJ corpus.
The raw data is first copied into the words_raw table by input/
It is then processed to convert the chunk labels to integer indexes, based on predefined mappings in the tags table.
This process is defined in app.ddlog using the following code:
words(sent_id, word_id, word, pos, true_tag, tag_id) :-
  words_raw(sent_id, word_id, word, pos, true_tag),
  tags(tag, tag_id),
  if true_tag = "B-UCP" then ""
  else if true_tag = "I-UCP" then ""
  else if strpos(true_tag, "-") > 0 then
    split_part(true_tag, "-", 2)
  else if true_tag = "O" then "O"
  else ""
  end = tag.
The input table words_raw looks like:
 word_id | word       | pos | tag  | id
---------+------------+-----+------+----
       1 | Confidence | NN  | B-NP |
The output table words looks like:
 sent_id | word_id | word       | pos | true_tag | tag | id
---------+---------+------------+-----+----------+-----+----
       1 |       1 | Confidence | NN  | B-NP     |   0 |  0
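To make the label normalization in the DDlog rule above easier to follow, here is a minimal Python sketch of the same if/else chain. The function name `chunk_type` is hypothetical and is not part of the example code; the sketch only mirrors the conditions in the rule, and the subsequent lookup of the integer index in the tags table is not shown.

```python
def chunk_type(true_tag):
    """Mirror the DDlog conditional that maps a raw chunk tag to its type."""
    if true_tag in ("B-UCP", "I-UCP"):
        return ""                        # UCP chunks map to the empty label
    if "-" in true_tag:                  # same test as strpos(true_tag, "-") > 0
        return true_tag.split("-")[1]    # same as split_part(true_tag, "-", 2)
    if true_tag == "O":
        return "O"
    return ""


for raw in ["B-NP", "I-VP", "O", "B-UCP"]:
    print(repr(raw), "->", repr(chunk_type(raw)))
```

The B- and I- prefixes are dropped, so the model predicts the chunk type of each word rather than its position within the chunk.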
2. Feature extraction
To predict chunk labels, we need features.
We use three simple features: the word itself, its part-of-speech tag, and the part-of-speech tag of the previous word.
We add an extractor in app.ddlog:
function ext_features
  over (word_id1 bigint, word1 text, pos1 text, word2 text, pos2 text)
  returns rows like word_features
  implementation "udf/" handles tsv lines.
word_features +=
  ext_features(word_id1, word1, pos1, word2, pos2) :-
  words(sent_id, word_id1, word1, pos1, _, _),
  words(sent_id, word_id2, word2, pos2, _, _),
  [word_id1 = word_id2 + 1],
  word1 IS NOT NULL.
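The input to the extractor consists of 2-grams generated from the words table (each word paired with its previous word), which look like:
 w1.word_id | w1.word | w1.pos | w2.word | w2.pos
------------+---------+--------+---------+--------
         15 | figures | NNS    | trade   | NN
The output will look like:
 word_id | feature      | id
---------+--------------+----
      15 | word=figures |
      15 | pos=NNS      |
      15 | prev_pos=NN  |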
The user-defined function can be found in udf/
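As an illustration of what such an extractor does, the following Python sketch turns one tab-separated input row into the three feature rows shown above. It is a hypothetical reconstruction based only on the input and output tables in this section, not the script shipped with the example; the function names `features` and `process_line` are assumptions.

```python
def features(word_id1, word1, pos1, word2, pos2):
    """Yield (word_id, feature) rows for one 2-gram.

    word1/pos1 describe the current word; word2/pos2 describe the
    previous word, following the join condition word_id1 = word_id2 + 1.
    """
    yield word_id1, "word=" + word1
    yield word_id1, "pos=" + pos1
    yield word_id1, "prev_pos=" + pos2


def process_line(line):
    """Convert one TSV input line into a list of TSV output lines."""
    word_id1, word1, pos1, word2, pos2 = line.rstrip("\n").split("\t")
    return ["\t".join(row) for row in features(word_id1, word1, pos1, word2, pos2)]


for out in process_line("15\tfigures\tNNS\ttrade\tNN\n"):
    print(out)
```

A real TSV extractor would apply `process_line` to every line read from standard input and write the resulting lines to standard output.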
3. Statistical learning and inference
We will predict the chunk tag for each word, which corresponds to the tag column of the words table.
The variables are declared in app.ddlog:
tag?(word_id bigint) Categorical(13).
Here, we have 13 types of chunk tags (NP, VP, PP, ADJP, ADVP, SBAR, O, PRT, CONJP, INTJ, LST, B, null) according to the CoNLL-2000 task description.
We have three rules: logistic regression, linear-chain CRF, and skip-chain CRF.
The logistic regression rule is:
tag(word_id) :- word_features(word_id, f).
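Conceptually, a logistic regression rule over a categorical variable scores every candidate tag by summing the weights of the word's features and then normalizes the scores. The Python sketch below illustrates that idea under the assumption of one learned weight per (feature, tag) pair; the function name, the reduced tag list, and the weight values are all hypothetical and do not come from DeepDive or from a trained model.

```python
import math


def tag_probabilities(features, weights, tags):
    """Softmax over per-tag scores; weights maps (feature, tag) -> float."""
    scores = [sum(weights.get((f, t), 0.0) for f in features) for t in tags]
    top = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {t: e / total for t, e in zip(tags, exps)}


# Hypothetical weights, for illustration only.
weights = {
    ("word=figures", "NP"): 1.2,
    ("pos=NNS", "NP"): 0.8,
    ("prev_pos=NN", "NP"): 0.5,
    ("pos=NNS", "VP"): -0.3,
}
tags = ["NP", "VP", "O"]
probs = tag_probabilities(["word=figures", "pos=NNS", "prev_pos=NN"], weights, tags)
print(max(probs, key=probs.get), probs)
```

Each word is scored independently here; the CRF rules described next add dependencies between the tags of different words.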
To express a conditional random field, use the Multinomial factor to link variables that can interact with each other.
For more information about CRFs, see this tutorial on CRF.
The following rule links the labels of neighboring words:
Multinomial(tag(word_id_1), tag(word_id_2)) :-
  words(_, word_id_1, _, _, _, _),
  words(_, word_id_2, _, _, _, _),
  word_id_2 = word_id_1 + 1.
The skip-chain CRF rule is similar: it adds skip edges that link the labels of identical words in the same sentence.
Multinomial(tag(word_id_1), tag(word_id_2)) :-
words(sent_id, word_id_1, word, _, _, tag),
words(sent_id, word_id_2, word, _, _, _),
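The two CRF rules differ only in which pairs of tag variables they connect. The Python sketch below enumerates those pairs for a toy sentence. It assumes that neighboring words have consecutive word ids and that each pair of identical words is linked once, with the smaller id first; these assumptions, the function names, and the sample data are illustrative and are not taken from the example code.

```python
from collections import defaultdict


def linear_chain_edges(words):
    """Pairs of word ids for neighboring words (consecutive ids assumed)."""
    ids = sorted(word_id for _, word_id, _ in words)
    known = set(ids)
    return [(i, i + 1) for i in ids if i + 1 in known]


def skip_chain_edges(words):
    """Pairs of word ids for identical words within the same sentence."""
    groups = defaultdict(list)
    for sent_id, word_id, word in words:
        groups[(sent_id, word)].append(word_id)
    edges = []
    for ids in groups.values():
        ids.sort()
        # Link every earlier occurrence to every later one.
        edges.extend((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return sorted(edges)


# Hypothetical rows of (sent_id, word_id, word).
words = [(1, 1, "the"), (1, 2, "cat"), (1, 3, "saw"), (1, 4, "the"), (1, 5, "dog")]
print(linear_chain_edges(words))
print(skip_chain_edges(words))
```

In this toy sentence the linear chain links each word to its successor, while the skip chain adds a single long-range edge between the two occurrences of "the".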
We also specify the holdout variables in deepdive.conf, following the task description's split into training and test data:
# Specify the holdout set with a query
deepdive.calibration.holdout_query: """
    INSERT INTO dd_graph_variables_holdout(variable_id)
    SELECT dd_id
    FROM dd_variables_chunk
    WHERE word_id > 220663
"""
#deepdive.sampler.sampler_cmd: "numbskull"
deepdive.sampler.sampler_args: "-l 100 -i 100 --sample_evidence"
4. Evaluation results
Running the following script will give the evaluation results.
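Below are the results for the different rule sets. Adding the CRF rules improves both precision and recall: overall precision rises from 81.03% with logistic regression alone to 85.87% with the linear-chain CRF and 86.25% with the skip-chain CRF added, while recall rises from 80.31% to 82.79% and 82.99%.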
Logistic regression
processed 47377 tokens with 23852 phrases; found: 23642 phrases; correct: 19156.
accuracy: 89.56%; precision: 81.03%; recall: 80.31%; FB1: 80.67
ADJP: precision: 50.40%; recall: 42.92%; FB1: 46.36 373
ADVP: precision: 69.21%; recall: 71.13%; FB1: 70.16 890
CONJP: precision: 0.00%; recall: 0.00%; FB1: 0.00 13
INTJ: precision: 100.00%; recall: 50.00%; FB1: 66.67 1
LST: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
NP: precision: 79.88%; recall: 77.52%; FB1: 78.68 12055
PP: precision: 90.51%; recall: 89.59%; FB1: 90.04 4762
PRT: precision: 66.39%; recall: 76.42%; FB1: 71.05 122
SBAR: precision: 83.51%; recall: 71.96%; FB1: 77.31 461
VP: precision: 79.48%; recall: 84.71%; FB1: 82.01 4965
LR + linear-chain CRF
processed 47377 tokens with 23852 phrases; found: 22996 phrases; correct: 19746.
accuracy: 91.58%; precision: 85.87%; recall: 82.79%; FB1: 84.30
: precision: 0.00%; recall: 0.00%; FB1: 0.00 1
ADJP: precision: 75.74%; recall: 69.86%; FB1: 72.68 404
ADVP: precision: 76.47%; recall: 73.56%; FB1: 74.99 833
CONJP: precision: 25.00%; recall: 22.22%; FB1: 23.53 8
INTJ: precision: 50.00%; recall: 50.00%; FB1: 50.00 2
LST: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
NP: precision: 82.22%; recall: 77.19%; FB1: 79.63 11662
PP: precision: 93.43%; recall: 94.26%; FB1: 93.84 4854
PRT: precision: 66.67%; recall: 69.81%; FB1: 68.20 111
SBAR: precision: 84.93%; recall: 74.77%; FB1: 79.52 471
VP: precision: 90.37%; recall: 90.21%; FB1: 90.29 4650
LR + linear-chain CRF + skip-chain CRF
processed 47377 tokens with 23852 phrases; found: 22950 phrases; correct: 19794.
accuracy: 91.79%; precision: 86.25%; recall: 82.99%; FB1: 84.59
: precision: 0.00%; recall: 0.00%; FB1: 0.00 1
ADJP: precision: 75.25%; recall: 68.72%; FB1: 71.84 400
ADVP: precision: 76.29%; recall: 73.56%; FB1: 74.90 835
CONJP: precision: 30.00%; recall: 33.33%; FB1: 31.58 10
INTJ: precision: 100.00%; recall: 50.00%; FB1: 66.67 1
LST: precision: 0.00%; recall: 0.00%; FB1: 0.00 0
NP: precision: 82.96%; recall: 77.54%; FB1: 80.16 11611
PP: precision: 93.70%; recall: 94.30%; FB1: 94.00 4842
PRT: precision: 66.67%; recall: 69.81%; FB1: 68.20 111
SBAR: precision: 83.37%; recall: 74.95%; FB1: 78.94 481
VP: precision: 90.34%; recall: 90.34%; FB1: 90.34 4658
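The overall precision, recall, and FB1 figures in the reports above follow directly from the phrase counts on the first line of each report. The Python sketch below recomputes them with the standard definitions (precision = correct / found, recall = correct / gold, FB1 = their harmonic mean); the function name `prf` is hypothetical and the sketch is not the evaluation script itself.

```python
def prf(correct, found, gold):
    """Return (precision, recall, FB1) in percent, rounded to two decimals."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    fb1 = 2 * precision * recall / total if total else 0.0
    return round(precision, 2), round(recall, 2), round(fb1, 2)


# (correct, found, gold) phrase counts taken from the three reports above.
runs = {
    "Logistic regression": (19156, 23642, 23852),
    "LR + linear-chain CRF": (19746, 22996, 23852),
    "LR + linear-chain CRF + skip-chain CRF": (19794, 22950, 23852),
}
for name, counts in runs.items():
    print(name, prf(*counts))
```

The recomputed values match the reported overall scores, for example 81.03% precision, 80.31% recall, and an FB1 of 80.67 for logistic regression.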