# Learning and inference with the statistical model

For every DeepDive application, executing any data processing it defines is ultimately to supply with necessary bits in the construction of the statistical model declared in DDlog for *joint inference*.
DeepDive provides several commands to streamline operations on the statistical model, including its creation (*grounding*), parameter estimation (*learning*), and computation of probabilities (*inference*) as well as keeping and reusing the parameters of the model (*weights*).

## Getting the inference result

To simply get the inference results, i.e., the marginal probabilities of the random variables defined in DDlog, use the following command:

```
deepdive do probabilities
```

This takes care of executing all necessary data processing, then creates a statistical to perform learning and inference, and loads all probabilities of every variable into the database.

### Inspecting the inference result

For viewing the inference result, DeepDive creates a database view that corresponds to each variable relation (using a `_inference`

suffix).
For example, the following SQL query can be used for inspecting the probabilities of the variables in relation `has_spouse`

:

```
deepdive sql "SELECT * FROM has_spouse_inference"
```

It shows a table that looks like below where the `expectation`

column holds the inferred marginal probability for each variable:

```
p1_id | p2_id | expectation
--------------------------------------------------+--------------------------------------------------+-------------
7b29861d-746b-450e-b9e5-52db4d17b15e_4_5_5 | 7b29861d-746b-450e-b9e5-52db4d17b15e_4_0_0 | 0.988
ca1debc9-1685-4555-8eaf-1a74e8d10fcc_7_25_25 | ca1debc9-1685-4555-8eaf-1a74e8d10fcc_7_30_31 | 0.972
34fdb082-a6ef-4b54-bd17-6f8f68acb4a4_15_28_28 | 34fdb082-a6ef-4b54-bd17-6f8f68acb4a4_15_23_23 | 0.968
7b29861d-746b-450e-b9e5-52db4d17b15e_4_0_0 | 7b29861d-746b-450e-b9e5-52db4d17b15e_4_5_5 | 0.957
a482785f-7930-427a-931f-851936cd9bb1_2_34_35 | a482785f-7930-427a-931f-851936cd9bb1_2_18_19 | 0.955
a482785f-7930-427a-931f-851936cd9bb1_2_18_19 | a482785f-7930-427a-931f-851936cd9bb1_2_34_35 | 0.955
93d8795b-3dc6-43b9-b728-a1d27bd577af_5_7_7 | 93d8795b-3dc6-43b9-b728-a1d27bd577af_5_11_13 | 0.949
e6530c2c-4a58-4076-93bd-71b64169dad1_2_11_11 | e6530c2c-4a58-4076-93bd-71b64169dad1_2_5_6 | 0.946
5beb863f-26b1-4c2f-ba64-0c3e93e72162_17_35_35 | 5beb863f-26b1-4c2f-ba64-0c3e93e72162_17_29_30 | 0.944
93d8795b-3dc6-43b9-b728-a1d27bd577af_3_5_5 | 93d8795b-3dc6-43b9-b728-a1d27bd577af_3_0_0 | 0.94
216c89a9-2088-4a78-903d-6daa32b1bf41_13_42_43 | 216c89a9-2088-4a78-903d-6daa32b1bf41_13_59_59 | 0.939
c3eafd8d-76fd-4083-be47-ef5d893aeb9c_2_13_14 | c3eafd8d-76fd-4083-be47-ef5d893aeb9c_2_22_22 | 0.938
70584b94-57f1-4c8c-8dd7-6ed2afb83031_20_6_6 | 70584b94-57f1-4c8c-8dd7-6ed2afb83031_20_1_2 | 0.938
ac937bee-ab90-415b-b917-0442b88a9b87_5_7_7 | ac937bee-ab90-415b-b917-0442b88a9b87_5_10_10 | 0.934
942c1581-bbc0-48ac-bbef-3f0318b95d28_2_35_36 | 942c1581-bbc0-48ac-bbef-3f0318b95d28_2_18_19 | 0.934
ec0dfe82-30b0-4017-8c33-258e2b2d7e35_36_29_29 | ec0dfe82-30b0-4017-8c33-258e2b2d7e35_36_33_34 | 0.933
74586dd9-55af-4bb4-9a95-485d5cef20d7_34_8_8 | 74586dd9-55af-4bb4-9a95-485d5cef20d7_34_3_4 | 0.933
70bebfae-c258-4e9b-8271-90e373cc317e_4_14_14 | 70bebfae-c258-4e9b-8271-90e373cc317e_4_5_5 | 0.933
ca1debc9-1685-4555-8eaf-1a74e8d10fcc_7_30_31 | ca1debc9-1685-4555-8eaf-1a74e8d10fcc_7_25_25 | 0.928
ec0dfe82-30b0-4017-8c33-258e2b2d7e35_36_15_15 | ec0dfe82-30b0-4017-8c33-258e2b2d7e35_36_33_34 | 0.927
f49af9ca-609a-4bdf-baf8-d8ddd6dd4628_4_20_21 | f49af9ca-609a-4bdf-baf8-d8ddd6dd4628_4_15_16 | 0.923
ec0dfe82-30b0-4017-8c33-258e2b2d7e35_16_9_9 | ec0dfe82-30b0-4017-8c33-258e2b2d7e35_16_4_5 | 0.923
93d8795b-3dc6-43b9-b728-a1d27bd577af_3_23_23 | 93d8795b-3dc6-43b9-b728-a1d27bd577af_3_0_0 | 0.921
5530e6a9-2f90-4f5b-bd1b-2d921ef694ef_2_18_18 | 5530e6a9-2f90-4f5b-bd1b-2d921ef694ef_2_10_11 | 0.918
[...]
```

To better understand the inference result for debugging, please refer to the pages about calibration, Dashboard, labeling, and browsing data.

The next several sections describe further detail about the different operations on the statistical model supported by DeepDive.

## Grounding the factor graph

The inference rules written in DDlog give rise to a data structure called *factor graph* DeepDive uses to perform statistical inference.
*Grounding* is the process of materializing the factor graph as a set of files by laying down all of its variables and factors in a particular format.
This process can be performed using the following command:

```
deepdive model ground
```

The above can be viewed as a shorthand for executing the following built-in processes:

```
deepdive redo process/grounding/variable_assign_id process/grounding/combine_factorgraph
```

Grounding generates a set of files for each variable and factor under `run/model/grounding/`

.
They are then combined into a unified factor graph under `run/model/factorgraph/`

to be easily consumed by the DimmWitted inference engine for learning and inference.
For example, below shows a typical list of files holding a grounded factor graph:

```
find run/model/grounding -type f
```

```
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/factors.part-1.bin.bz2
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/nedges.part-1
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/nfactors.part-1
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/weights.part-1.bin.bz2
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/weights_count
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/weights_id_begin
run/model/grounding/factor/inf_imply_has_spouse_has_spouse/weights_id_exclude_end
run/model/grounding/factor/inf_istrue_has_spouse/factors.part-1.bin.bz2
run/model/grounding/factor/inf_istrue_has_spouse/nedges.part-1
run/model/grounding/factor/inf_istrue_has_spouse/nfactors.part-1
run/model/grounding/factor/inf_istrue_has_spouse/weights.part-1.bin.bz2
run/model/grounding/factor/inf_istrue_has_spouse/weights_count
run/model/grounding/factor/inf_istrue_has_spouse/weights_id_begin
run/model/grounding/factor/inf_istrue_has_spouse/weights_id_exclude_end
run/model/grounding/factor/weights_count
run/model/grounding/variable/has_spouse/count
run/model/grounding/variable/has_spouse/id_begin
run/model/grounding/variable/has_spouse/id_exclude_end
run/model/grounding/variable/has_spouse/variables.part-1.bin.bz2
run/model/grounding/variable_count
```

## Learning the weights

DeepDive learns the weights of the grounded factor graph, i.e., estimates the maximum likelihood parameters of the statistical model from the variables that were assigned labels via distant supervision rules written in DDlog.
DimmWitted inference engine uses *Gibbs sampling* with *stochastic gradient descent* to learn the weights.

The following command performs learning using the grounded factor graph (or grounds a new factor graph if needed):

```
deepdive model learn
```

This is equivalent to executing the following targets:

```
deepdive redo process/model/learning data/model/weights
```

DimmWitted outputs the learned weights as a text file under `run/model/weights/`

.
For convenience, DeepDive loads the learned weights into the database and creates several views for the following target:

```
deepdive do data/model/weights
```

This will create a comprehensive view of the weights named `dd_inference_result_weights_mapping`

.
The weights corresponding to each inference rule and by their parameter value can be easily accessed using it.
Below shows a few example of learned weights:

```
deepdive sql "SELECT * FROM dd_inference_result_weights_mapping"
```

```
weight | description
--------------+---------------------------------------------------------------
1.80754 | inf_istrue_has_spouse--INV_NGRAM_1_[wife]
1.45959 | inf_istrue_has_spouse--NGRAM_1_[wife]
-1.33618 | inf_istrue_has_spouse--STARTS_WITH_CAPITAL_[True_True]
1.30884 | inf_istrue_has_spouse--INV_NGRAM_1_[husband]
1.22097 | inf_istrue_has_spouse--NGRAM_1_[husband]
-1.00449 | inf_istrue_has_spouse--W_NER_L_1_R_1_[O]_[O]
-1.00062 | inf_istrue_has_spouse--NGRAM_1_[,]
-1 | inf_imply_has_spouse_has_spouse-
-0.94185 | inf_istrue_has_spouse--IS_INVERTED
-0.91561 | inf_istrue_has_spouse--INV_STARTS_WITH_CAPITAL_[True_True]
0.896492 | inf_istrue_has_spouse--NGRAM_2_[he wife]
0.835013 | inf_istrue_has_spouse--INV_NGRAM_1_[he]
-0.825314 | inf_istrue_has_spouse--NGRAM_1_[and]
0.805815 | inf_istrue_has_spouse--INV_NGRAM_2_[he wife]
-0.781846 | inf_istrue_has_spouse--INV_W_NER_L_1_R_1_[O]_[O]
0.75984 | inf_istrue_has_spouse--NGRAM_1_[he]
-0.74405 | inf_istrue_has_spouse--INV_NGRAM_1_[and]
0.701149 | inf_istrue_has_spouse--INV_NGRAM_1_[she]
-0.645765 | inf_istrue_has_spouse--INV_NGRAM_1_[,]
0.6105 | inf_istrue_has_spouse--INV_NGRAM_2_[husband ,]
0.585621 | inf_istrue_has_spouse--INV_NGRAM_2_[she husband]
0.583075 | inf_istrue_has_spouse--INV_NGRAM_2_[and he]
0.581042 | inf_istrue_has_spouse--NGRAM_1_[she]
0.540534 | inf_istrue_has_spouse--NGRAM_2_[husband ,]
[...]
```

## Inference

After learning the weights, DeepDive uses them with the grounded factor graph to compute the marginal probability of every variable. DimmWitted's high-speed implementation of Gibbs sampling is used for performing a marginal inference by approximately computing the probablities of different values each variable can take over all possible worlds.

```
deepdive model infer
```

This is equivalent to executing the following nodes in the data flow:

```
deepdive redo process/model/inference data/model/probabilities
```

In fact, because performing inference as a separate process from learning incurs unnecessary overhead of reloading the factor graph into memory again, DimmWitted also performs inference immediately after learning the weights. Therefore unless previously learned weights are being reused, hence skipping the learning part, the following command that performs just the inference has no effect:

DimmWitted outputs the inferred probabilities as a text file under `run/model/probabilities/`

.
As shown in the first section, DeepDive loads the computed probabilities into the database and creates views for convenience.

### Reusing weights

A common use case is to learn the weights from one dataset then performing inference on another, i.e., train model on one dataset and test it on new datasets.

- Learn the weights from a small dataset.
- Keep the learned weights.
- Reuse the kept weights for inference on a larger dataset.

DeepDive provides several commands to support the management and reuse of such learned weights.

#### Keeping learned weights

To keep the currently learned weights for future reuse, say under a name `FOO`

, use the following command:

```
deepdive model weights keep FOO
```

This dumps the weights from the database into files at `snapshot/model/weights/FOO/`

so they can be reused later.
The name `FOO`

is optional, and a generated timestamp is used instead when no name is specified.

#### Reusing learned weights

To reuse a previously kept weights, under a name `FOO`

, use the following command:

```
deepdive model weights reuse FOO
```

This loads the weights at `snapshot/model/weights/FOO/`

back to the database, then repeats necessary grounding processes for including the weights into the grounded factor graph.
The name `FOO`

is optional, and the most recently kept weights are used when no name is specified.

A subsequent command for performing inference reuses these weights without learning.

```
deepdive model infer
```

#### Managing kept weights

DeepDive provides several more commands to manage the kept weights.

To list the names of kept weights, use:

```
deepdive model weights list
```

To drop a particular weights, use:

```
deepdive model weights drop FOO
```

To clear any previously loaded weights to learn new ones, use:

```
deepdive model weights init
```