Specifying a statistical model in DDlog
Every DeepDive application can be viewed as defining a statistical inference problem using input data and data derived by a series of data processing steps. This document describes (a) how to declare random variables for a DeepDive application's statistical model, (b) how to define their scope as well as supervision labels, and (c) how to write inference rules for specifying features and correlations.
Variable declarations
DeepDive requires the user to specify the name and type of the variable relations that hold random variables used during probabilistic inference.
Currently DeepDive supports Boolean (i.e., Bernoulli) variables and Categorical variables.
Variable relations are declared in app.ddlog
with a small twist to the syntax used for declaring normal relations.
Boolean variables
A question mark after the relation name indicates that it is a variable relation containing random variables rather than a normal relation used for loading or processing data to be later used by the model. The columns of the variable relation serve as a key. The following is an example declaration of a relation of Boolean variables.
has_spouse?(p1_id text, p2_id text).
This declares a variable relation named has_spouse
where each unique pair of (p1_id, p2_id)
represents a different random variable in the model.
Categorical variables
DeepDive supports categorical variables, which take integer values ranging from 0 to a user-specified upper bound.
The variable relation is declared similarly to a Boolean variable relation, except that the declaration is followed by Categorical(N), where N is the number of categories the variables can take, i.e., the size of the domain.
Each variable can take a value from 0, 1, ..., N-1.
For instance, in the chunking example, a categorical variable of 13 possible categories is declared as follows:
tag?(word_id bigint) Categorical(13).
Scoping and supervision rules
After declaring a variable relation, its scope needs to be defined along with the supervision labels.
That is, (a) all possible values for the variable relation's columns must be defined by deriving them from other relations, and (b) whether a random variable in the relation is true or false (Boolean), or which value it takes from its domain of categories (Categorical), must be defined using a special syntax.
For these scoping and supervision rules, a syntax very similar to normal derivation rules is used for defining the variable relation, except that the head is followed by an =
sign and an expression that corresponds to the random variable's label.
For instance, in the spouse example, we scope the has_spouse
variable by the following rule:
has_spouse(p1_id, p2_id) = NULL :-
spouse_candidate(p1_id, _, p2_id, _).
This means all distinct (p1_id, p2_id) pairs found in the spouse_candidate relation define the scope of the has_spouse random variables in the model.
By using a NULL expression on the right-hand side, they are treated as unsupervised variables.
On the other hand, the following similar-looking rule provides the supervision labels using a more sophisticated expression.
has_spouse(p1_id, p2_id) = if l > 0 then TRUE
else if l < 0 then FALSE
else NULL end :- spouse_label_resolved(p1_id, p2_id, l).
This rule essentially performs a majority vote, turning aggregate numbers computed from the spouse_label_resolved relation into Boolean labels.
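For context, the aggregate relation itself can be derived with a normal derivation rule. A minimal sketch, assuming a hypothetical spouse_label relation that records one numeric vote per candidate pair and labeling rule:

spouse_label_resolved(p1_id, p2_id, SUM(vote)) :-
    spouse_label(p1_id, p2_id, vote, rule_id).

A positive sum then maps to TRUE, a negative sum to FALSE, and a tie leaves the variable unsupervised.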
Inference rules
Inference rules specify features for a variable and/or the correlations between variables.
They are essentially templates for the factors in the factor graph, telling DeepDive how to ground the factors from the input and derived data.
Again, these rules extend the syntax of normal derivation rules and allow the type of the factor to be specified in the rule's head, preceded by a @weight
declaration as shown below.
@weight(...)
FACTOR_HEAD :- RULE_BODY.
Here, RULE_BODY denotes a typical conjunctive query, as also used in normal derivation rules in DDlog. FACTOR_HEAD denotes the part where one or more variable relations can appear with a special syntax. Let's first look at the simplest case of describing one variable in the head of an inference rule.
Specifying features
In common cases, one wants to model the probability of a Boolean variable being true using a set of features.
Expressing this kind of binary classification problem is very simple in DDlog.
By writing a rule with just one variable relation in the head, DeepDive creates a unary factor in the model connected to each of that relation's variables. The weight of this unary factor is determined by a user-defined feature.
For instance, in the spouse example, there is an inference rule specifying features for the has_spouse
variables written as:
@weight(f)
has_spouse(p1_id, p2_id) :-
spouse_candidate(p1_id, _, p2_id, _),
spouse_feature(p1_id, p2_id, f).
This rule means that:

- A factor should be created for each pair of person mentions found in the spouse_candidate relation and each of the corresponding features found in the spouse_feature relation.
- Each of those factors connects to a single has_spouse variable identified by a pair of mentions (p1_id, p2_id) originating from the spouse_candidate relation.
- The feature f for a factor determines a weight (to be learned for this particular rule) that translates into the factor's potential, which in turn influences the probability of the connected has_spouse variable being true or not.
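Unary factors need not be parameterized by a feature; a fixed weight can encode a prior instead. A hypothetical sketch, assuming a fixed negative weight (see Specifying weights below for fixed weights):

# A negative prior on has_spouse: by default, a candidate pair is not a spouse,
# unless the feature-based factors above outweigh this bias.
@weight(-1.0)
has_spouse(p1_id, p2_id) :-
    spouse_candidate(p1_id, _, p2_id, _).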
Specifying correlations
In almost every problem, the variables are correlated with each other in domain-specific ways, and it is desirable to enrich the model with this domain knowledge. Such correlations can be modeled by creating certain types of factors that connect multiple correlated variables together. This is where a richer syntax in the FACTOR_HEAD comes into play. DDlog borrows much of this syntax from Markov logic networks and, hence, first-order logic.
For example, the following rule in the smoke example correlates two variable relations.
@weight(3)
smoke(x) => cancer(x) :-
person(x).
This rule expresses that if a person smokes, it implies that he/she will have cancer.
Here, a constant 3
is used in the @weight
to express some level of confidence in this rule, instead of learning the weight from the data (explained later).
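If the strength of this correlation should instead be learned from the supervision labels, the constant can be replaced with "?" (described under Specifying weights below):

@weight("?")
smoke(x) => cancer(x) :-
    person(x).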
Implication
Logical implication or consequence of two or more variables can be expressed using the following syntax.
@weight(...) P(x) => Q(y) :- RULE_BODY.
@weight(...) P(x), Q(y) => R(z) :- RULE_BODY.
@weight(...) P(x), Q(y), R(z) => S(k) :- RULE_BODY.
Disjunction
Logical disjunction of two or more variables can be expressed using the following syntax.
@weight(...) P(x) v Q(y) :- RULE_BODY.
@weight(...) P(x) v Q(y) v R(z) :- RULE_BODY.
Conjunction
Logical conjunction of two or more variables can be expressed using the following syntax.
@weight(...) P(x) ^ Q(y) :- RULE_BODY.
@weight(...) P(x) ^ Q(y) ^ R(z) :- RULE_BODY.
Equality
Logical equality of two variables can be expressed using the following syntax.
@weight(...) P(x) = Q(y) :- RULE_BODY.
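For instance, extending the smoke example, one could posit that two friends tend to share the same smoking habit. This is a sketch assuming a hypothetical friend relation:

@weight(3)
smoke(x) = smoke(y) :-
    friend(x, y).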
Negation
Whenever a rule has to refer to the case where a Boolean variable is false (also referred to as a negative literal), the variable can be negated using a preceding ! as shown below.
@weight(...) P(x) => ! Q(x) :- RULE_BODY.
@weight(...) ! P(x) v ! Q(x) :- RULE_BODY.
Multinomial factors (STALE. NEEDS REVAMP.)
DeepDive has limited support for expressing correlations of categorical variables. The syntax introduced above can be used only for expressing correlations between Boolean variables. For categorical variables, DeepDive only allows a conjunction of variables, each taking a certain category value, to be expressed using the special syntax shown below:
@weight(...) Multinomial( P(x), Q(y) ) :- RULE_BODY.
@weight(...) Multinomial( P(x), Q(y), R(z) ) :- RULE_BODY.
Multinomial takes only categorical variables as arguments*, and it can be thought of as a compact representation of an equivalent model with Boolean variables corresponding to each category, connected by a conjunction factor for every combination of category assignments.
For example, suppose a
is a variable taking values 0, 1, 2, and b
is a variable taking values 0, 1.
Then, Multinomial(a, b)
is equivalent to having factors between a
and b
that correspond to the following indicator functions.
- I{a = 0, b = 0}
- I{a = 0, b = 1}
- I{a = 1, b = 0}
- I{a = 1, b = 1}
- I{a = 2, b = 0}
- I{a = 2, b = 1}
Note that each of the factors above has a distinct weight, i.e., one weight for each possible assignment of variables in the Multinomial
factor.
For more detail on how to specify Conditional Random Fields and perform Multi-class Logistic Regression using categorical factors, see the chunking example.
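As a rough sketch of that pattern, a linear-chain CRF over the tag variables declared earlier could combine a unary categorical factor driven by per-word features with a pairwise Multinomial factor over adjacent words. Here word_feature and adjacent_words are hypothetical relations standing in for the chunking example's actual data:

# Unary categorical factor: per-word features drive each tag, with one
# weight learned per (feature, category) pair (multi-class logistic regression).
@weight(f)
Multinomial(tag(word_id)) :- word_feature(word_id, f).

# Pairwise categorical factor: correlate the tags of adjacent words, with one
# weight learned per pair of category assignments (the chain part of the CRF).
@weight("?")
Multinomial(tag(word_id), tag(next_word_id)) :-
    adjacent_words(word_id, next_word_id).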
* Because of this limitation, categorical variables and categorical factor support is likely to go away in a near-future release, in favor of a more flexible way to express multi-class predictions and mutual exclusions, e.g., declaring tag?(@key word_id BIGINT, pos TEXT) rather than tag?(word_id BIGINT) Categorical(N), and replacing the Multinomial factor with a conjunction of such Boolean variables whose columns have functional dependencies.
Specifying weights
Each factor is assigned a weight, which represents the confidence in the correlation it expresses in the model. During statistical inference, these weights translate into potentials that influence the probabilities of the connected variables. Factor weights are real numbers, and only their relative magnitudes matter: factors with larger weights have a greater impact on the connected variables than factors with smaller weights. Weights can be fixed to a constant manually, or they can be learned by DeepDive from the supervision labels at different granularities. Weights can also be parameterized by values computed from the data (most often referred to as features), in which case factors with different parameter values use different weights. In order to learn weights automatically, there must be enough training data available.
DDlog syntax for specifying weights in three different cases is shown below.
Q?(x TEXT).
data(x TEXT, y TEXT).
# Fixed weight (10 can be treated as positive infinity)
@weight(10) Q(x) :- data(x, y).
# Unknown weight, to be learned from the data, but not depending on any variable.
# All factors created by this rule will have the same weight.
@weight("?") Q(x) :- data(x, y).
# Unknown weight, each to be learned from the data per different values of y.
@weight(y) Q(x) :- data(x, y).