Specifying a statistical model in DDlog

Every DeepDive application can be viewed as defining a statistical inference problem using input data and data derived by a series of data processing steps. This document describes (a) how to declare random variables for a DeepDive application's statistical model, (b) how to define their scope as well as supervision labels, and (c) how to write inference rules for specifying features and correlations.

Variable declarations

DeepDive requires the user to specify the name and type of the variable relations that hold random variables used during probabilistic inference. Currently DeepDive supports Boolean (i.e., Bernoulli) variables and Categorical variables. Variable relations are declared in app.ddlog with a small twist to the syntax used for declaring normal relations.

Boolean variables

A question mark after the relation name indicates that it is a variable relation containing random variables rather than a normal relation used for loading or processing data to be later used by the model. The columns of the variable relation serve as a key. The following is an example declaration of a relation of Boolean variables.

has_spouse?(p1_id text, p2_id text).

This declares a variable relation named has_spouse where each unique pair of (p1_id, p2_id) represents a different random variable in the model.

Categorical variables

DeepDive supports categorical variables, which take integer values ranging from 0 to a user-specified upper bound. The variable relation is declared similarly as a Boolean variable except that the declaration is followed by a Categorical(N) where N is the number of categories the variables can take, defining the size of the domain. Each variable can take values from 0, 1, ..., N-1. For instance, in the chunking example, a categorical variable of 13 possible categories is declared as follows:

tag?(word_id bigint) Categorical(13).

Scoping and supervision rules

After declaring a variable relation, its scope needs to be defined along with the supervision labels. That means, (a) all possible values for the variable relation's columns must be defined by deriving them from other relations, and (b) whether a random variable in the relation is true or false (Boolean), or which value it takes from its domain of categories (Categorical) must be defined using a special syntax. For these scoping and supervision rules, a syntax very similar to normal derivation rules is used for defining the variable relation, except that the head is followed by an = sign and an expression that corresponds to the random variable's label. For instance, in the spouse example, we scope the has_spouse variable by the following rule:

has_spouse(p1_id, p2_id) = NULL :-
    spouse_candidate(p1_id, _, p2_id, _).

This means all distinct p1_id and p2_id pairs found in the spouse_candidate relation is considered the scope of all has_spouse random variables in the model. By using a NULL expression on the right hand side, they are considered as unsupervised variables.

On the other hand, the following similar looking rule provides the supervision labels using a sophisticated expression.

has_spouse(p1_id, p2_id) = if l > 0 then TRUE
                      else if l < 0 then FALSE
                      else NULL end :- spouse_label_resolved(p1_id, p2_id, l).

This rule is basically doing a majority vote, turning aggregate numbers computed from spouse_label_resolved relation into Boolean labels.

Inference rules

Inference rules specify features for a variable and/or the correlations between variables. They are basically the templates for the factors in the factor graph, telling DeepDive how to ground them based on what input and derived data. Again, these rules extend the syntax of normal derivation rules and allow the type of the factor to be specified in the rule's head, preceded by a @weight declaration as shown below.

@weight(...)
FACTOR_HEAD :- RULE_BODY.

Here, RULE_BODY denotes a typical conjunctive query also used for normal derivation rules in DDlog. FACTOR_HEAD denotes the part where more than one variable relations can appear with a special syntax. Let's first look at the simplest case of describing one variable in the head of an inference rule.

Specifying features

In common cases, one wants to model the probability of a Boolean variable being true using a set of features. Expressing this kind of binary classification problem is very simple in DDlog. By writing a rule with just one variable relation in the head, DeepDive creates in the model a unary factor that connects to it. The weight of this unary factor is determined by a user-defined feature. For instance, in the spouse example, there is an inference rule specifying features for the has_spouse variables written as:

@weight(f)
has_spouse(p1_id, p2_id) :-
    spouse_candidate(p1_id, _, p2_id, _),
    spouse_feature(p1_id, p2_id, f).

This rule means that:

A factor should be created for each pair of person mentions found in the spouse_candidate relation and each of the corresponding features found in spouse_feature relation.
Each of those factors connects to a single has_spouse variable identified by a pair of mentions (p1_id, p2_id) originating from the spouse_candidate relation.
The feature f for a factor determines a weight (to be learned for this particular rule) that translates into the factor's potential, which in turn influences the probability of the connected has_spouse variable being true or not.

Specifying correlations

Now, in almost every problem, the variables are correlated with each other in a special way, and it is desirable to enrich the model with this domain knowledge. Such correlations can be modeled by creating certain types of factors that connect multiple correlated variables together. This is where a richer syntax in the FACTOR_HEAD comes into play. DDlog borrows a lot of syntax from Markov Logic Networks, and hence, first-order logic.

For example, the following rule in the smoke example correlates two variable relations.

@weight(3)
smoke(x) => cancer(x) :-
    person(x).

This rule expresses that if a person smokes, there is an implication that he/she will have cancer. Here, a constant 3 is used in the @weight to express some level of confidence in this rule, instead of learning the weight from the data (explained later).

Implication

Logical implication or consequence of two or more variables can be expressed using the following syntax.

@weight(...)  P(x) => Q(y)              :- RULE_BODY.
@weight(...)  P(x), Q(y) => R(z)        :- RULE_BODY.
@weight(...)  P(x), Q(y), R(z) => S(k)  :- RULE_BODY.

Disjunction

Logical disjunction of two or more variables can be expressed using the following syntax.

@weight(...)  P(x) v Q(y)         :- RULE_BODY.
@weight(...)  P(x) v Q(y) v R(z)  :- RULE_BODY.

Conjunction

Logical conjunction of two or more variables can be expressed using the following syntax.

@weight(...)  P(x) ^ Q(y)         :- RULE_BODY.
@weight(...)  P(x) ^ Q(y) ^ R(z)  :- RULE_BODY.

Equality

Logical equality of two variables can be expressed using the following syntax.

@weight(...)  P(x) = Q(y)  :- RULE_BODY.

Negation

Whenever a rule has to refer to a case when the Boolean variable is false (also referred to as a negative literal), then it can be negated using a preceding ! as shown below.

@weight(...)    P(x) => ! Q(x)  :- RULE_BODY.
@weight(...)  ! P(x) v  ! Q(x)  :- RULE_BODY.

Multinomial factors (STALE. NEEDS REVAMP.)

DeepDive has limited support for expressing correlations of categorical variables. The introduced syntax above can be used only for expressing correlations between Boolean variables. For categorical variables, DeepDive only allows the conjunction of the variables each taking a certain category value being true to be expressed using a special syntax shown below:

@weight(...)  Multinomial( P(x), Q(y) )        :- RULE_BODY.
@weight(...)  Multinomial( P(x), Q(y), R(z) )  :- RULE_BODY.

Multinomial takes only categorical variables as arguments^*, and it can be thought as a compact representation of an equivalent model with Boolean variables corresponding to each category connected by a conjunction factor for every combination of category assignments.

For example, suppose a is a variable taking values 0, 1, 2, and b is a variable taking values 0, 1. Then, Multinomial(a, b) is equivalent to having factors between a and b that correspond to the following indicator functions.

I{a = 0, b = 0}
I{a = 0, b = 1}
I{a = 1, b = 0}
I{a = 1, b = 1}
I{a = 2, b = 0}
I{a = 2, b = 1}

Note that each of the factors above has a distinct weight, i.e., one weight for each possible assignment of variables in the Multinomial factor. For more detail on how to specify Conditional Random Fields and perform Multi-class Logistic Regression using categorical factor, see the chunking example.

^* Because of this limitation, categorical variables and categorical factor support is likely to go away in a near future release, in favor of a more flexible way to express multi-class predictions and mutual exclusions. Emulate categorical variables with functional dependencies in the Boolean variable columns, i.e., tag?(@key word_id BIGINT, pos TEXT) rather than tag?(word_id BIGINT) Categorical(N), and replace Multinomial factor with conjunction of those Boolean variables with columns having functional dependencies.

Specifying weights

Each factor is assigned a weight, which represents the confidence in the correlation it expresses in the model. During statistical inference, these weights translate into their potentials that influence the probabilities of the connected variables. Factor weights are real numbers and only the relative magnitude to each other matters. Factors with larger weights have a greater impact on the connected variables than factors with smaller weights. Weights can be fixed to a constant manually, or they can be learned by DeepDive from the supervision labels at different granularity. Weights can be parameterized by some data originating from the data (in most time referred to as features), in which case factors with different parameter values will use different weights. In order to learn weights automatically, there must be enough training data available.

DDlog syntax for specifying weights for three different cases are shown below.

Q?(x TEXT).
data(x TEXT, y TEXT).

# Fixed weight (10 can be treated as positive infinite)
@weight(10)   Q(x) :- data(x, y).

# Unknown weight, to be learned from the data, but not depending on any variable.
# All factors created by this rule will have the same weight.
@weight("?")  Q(x) :- data(x, y).

# Unknown weight, each to be learned from the data per different values of y.
@weight(y)    Q(x) :- data(x, y).