Application configuration file reference
This document contains the description of each configuration directive that can be specified in an application configuration file.
As a remark, from version 0.8 the user application is described in the app.ddlog file, so the deepdive.conf file described here is no longer required. However, when deepdive compile is invoked, the app.ddlog and deepdive.conf files are combined and compiled together. It is therefore possible to specify parameters, arguments, or tasks in deepdive.conf, in addition to the main structure of the application written in DDlog, and both will be taken into account by DeepDive.
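For example, an application whose entire structure lives in app.ddlog might keep just a few tuning parameters in deepdive.conf; a minimal sketch using directives described later in this document:

deepdive {
  # parameters picked up by `deepdive compile` together with app.ddlog
  calibration.holdout_fraction: 0.25
  sampler.sampler_args: "-l 300 -i 500 --alpha 0.1"
}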
Overview of configuration structure
Global section: all application configuration directives described in the rest of this document must appear inside a global deepdive section:
deepdive {
  # All configuration directives go here
}
A starter template of deepdive.conf is shown below. You can find it in examples/template/ in your DEEPDIVE_HOME installation directory:
deepdive {

  # Put your variables here
  schema.variables {
  }

  # Put your extractors here
  extraction.extractors {
  }

  # Put your inference rules here
  inference.factors {
  }

  # Specify a holdout fraction
  calibration.holdout_fraction: 0.00

}
In this template, the global section deepdive contains the following major sections: db, schema, extraction, inference, and calibration. Other optional sections are sampler and execution.
Links to these sections:
- extraction: extraction tasks
- inference: inference rules
- schema: variable schema
- calibration: calibration parameters
- sampler: sampler arguments
Notation format
The DeepDive configuration file uses the HOCON format, an extension of JSON. For a detailed specification, see the HOCON README.
Below are some highlights of the notation format.
Blocks
Blocks are specified by {} rather than indentation. Blocks can be nested.
Note that the following nested block definitions are equivalent:
schema {
  variables {
    ...
  }
}
and
schema.variables {
  ...
}
This is often useful in making the code more compact.
Comments
Any text appearing after a # or // and before the next new line is considered a comment, unless the # or // is inside a quoted string.
Key-value separators
Both : and = are valid key-value separators.
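For instance, the following two assignments are equivalent, and both comment styles are shown:

holdout_fraction: 0.25   # `:` as separator
holdout_fraction = 0.25  // `=` works equally well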
Extraction and extractors
Configuration directives for executing extractors go in the extraction section, while extractor definitions go in the extraction.extractors section:
deepdive {
  extraction {
    # extraction directives
  }
  extraction.extractors {
    # extractor definitions
  }
  # ...
}
Extractors definition
Each extractor definition is a section named with the name of the extractor:
deepdive {
  # ...
  extraction.extractors {
    extractor1 {
      # definition of extractor1
    }
    extractor2 {
      # definition of extractor2
    }
    # More extractors ...
  }
  # ...
}
Different styles of extractors are defined using different sets of directives. There is nevertheless a subset of directives that are common to all styles:
- style: specifies the style of the extractor. It can take the values tsv_extractor, sql_extractor, or cmd_extractor. This is a mandatory directive.

- before: specifies a shell command to run before executing the extractor. This is an optional directive:

  myExtractor {
    # ...
    style: "tsv_extractor"
    # ...
    before: """echo starting myExtractor"""
    # ...
  }

- after: specifies a shell command to run after the extractor has completed:

  myExtractor {
    # ...
    style: "sql_extractor"
    # ...
    after: """python cleanup_after_myExtractor.py"""
    # ...
  }

- dependencies: takes an array of extractor names that this extractor depends on. The system resolves the dependency graph and executes the extractors in the required order. E.g.:

  extractor1 {
    # ...
  }
  extractor2 {
    # ...
  }
  myExtractor {
    # ...
    style: "cmd_extractor"
    # ...
    dependencies: [ "extractor1", "extractor2" ]
    # ...
  }

- input_relations: takes an array of relation names that this extractor depends on. Similar to dependencies, all extractors whose output_relation appears in this array will be executed before this extractor. A combined sketch using these common directives follows this list.
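To tie the common directives together, here is a minimal sketch of an extractor definition; the extractor names and the shell commands are illustrative placeholders:

extraction.extractors {
  extractor1 {
    # ...
  }
  myExtractor {
    style: "sql_extractor"                    # mandatory: tsv_extractor, sql_extractor, or cmd_extractor
    dependencies: [ "extractor1" ]            # run only after extractor1 completes
    before: """echo starting myExtractor"""   # optional shell command run beforehand
    after: """echo finished myExtractor"""    # optional shell command run afterwards
    # style-specific directives (e.g. sql for a sql_extractor) go here
  }
}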
The following directives apply only to the tsv_extractor style, for which they are mandatory.
- input: specifies the input to the extractor. For all the extractor styles above it can be a SQL query to run on the database, e.g.:

  myExtractor {
    # ...
    style: "tsv_extractor"
    # ...
    input: """SELECT * FROM titles"""
    # ...
  }

- output_relation: specifies the name of the relation the extractor output should be written to. It must be an existing relation in the database. E.g.:

  myExtractor {
    # ...
    style: "tsv_extractor"
    # ...
    output_relation: words
    # ...
  }

- udf: specifies the extractor User Defined Function (UDF). This is a shell command or a script that is executed.

Depending on the extractor style, additional directives may be necessary, such as sql, cmd, input_batch_size, and output_batch_size.
Inference
Note: this section presents the configuration directives for the inference step. Refer to the appropriate section for the directives to define inference rules.
Configuration directives to control the inference steps go in the global deepdive section. The available directives are:
- inference.batch_size: the batch size for inserting variables, factors, and weights into the database during factor graph creation:

  inference.batch_size = 1000000

  The default value depends on the datastore used (50000 for PostgreSQL).

- inference.parallel_grounding: if set to true and you are using Greenplum with DeepDive, the graph is grounded in parallel. Default is false.

  inference.parallel_grounding: true

- inference.skip_learning: if true, DeepDive skips the weight learning step and reuses the weights learned in the last execution. It generates a table dd_graph_last_weights containing all the weights. Weights are matched by their "text description" (which is composed of [name of inference rule]-[specified value of "weight" in inference rule], e.g. myRule-male), and no learning is performed. To get meaningful results, a DeepDive run must already have been performed on the database, and the view dd_inference_result_weights_mapping must be present.

  inference.skip_learning: true

  By default this directive is false.

- inference.weight_table: used in combination with inference.skip_learning, it allows skipping the weight learning step and using the weights specified in a custom table. The table tuples must contain the factor descriptions and weights. Note that it is important that this table is constructed with the same syntax as described above. This table can be the result of one execution of DeepDive (for example, the view dd_inference_result_weights_mapping, or dd_graph_last_weights used when inference.skip_learning is true), be manually assigned, or be a combination of the two. If the weight for a specific factor is not in the weight table, that weight is treated as 0. For example, if f_has_spouse_features-SOME_NEW_FEATURE is not found in the specified weight table but this factor is found during the inference step, its weight is treated as 0. If inference.skip_learning is false (the default), this directive is ignored.

  inference.skip_learning: true
  inference.weight_table: [weight table name]
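Putting the last two directives together, here is a sketch of reusing previously learned weights from a custom table; the table name my_weights is a hypothetical placeholder:

deepdive {
  # skip learning and read weights from my_weights,
  # matching factors by their text description
  inference.skip_learning: true
  inference.weight_table: "my_weights"
}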
Inference schema
Inference schema directives define the variables used in the factor graph and their types. These directives go in the schema.variables section:
deepdive {
  # ...
  schema.variables {
    # Variable definitions
  }
  # ...
}
A variable in DeepDive is defined by its name (table.column) and its type:
person_smokes.smokes: Boolean
person_has_cancer.has_cancer: Boolean
A table can have at most one column declared as a DeepDive variable. This restriction keeps the semantics clear: each tuple in the database corresponds to one random variable. DeepDive currently supports Boolean and Categorical variables.
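For instance, a schema mixing the two supported types might look as follows; the word_tags table with 13 tag categories is a hypothetical example, and the Categorical(N) notation (declaring the number of categories) follows DeepDive's examples:

schema.variables {
  # Boolean variable: each row of person_smokes is one random variable
  person_smokes.smokes: Boolean
  # Categorical variable taking one of 13 values (hypothetical table)
  word_tags.tag: Categorical(13)
}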
Inference rules
Note: refer to the 'Writing inference rules' document for an in-depth discussion of writing inference rules.
The definitions of inference rules for the factor graph go in the inference.factors section:
deepdive {
  inference.factors {
    rule1 {
      # definition of rule1
    }
    rule2 {
      # definition of rule2
    }
    # more rules...
  }
}
The mandatory definition directives for each rule are:

- input_query: specifies the variables to create. It is a SQL query that usually combines relations created by the extractors. For each row in the query result, the factor graph will have variables for a subset of the columns in that row, one variable per column, all connected by a factor. The output of the input_query must include the reserved id column for each variable.

- function: specifies the factor function and the variables connected by the factor. Refer to the source code for details about the available functions. An example appears in the rule below.

- weight: specifies whether the weight of the factor should be a specified constant or learned (and, if so, whether it should be a function of some columns in the input query). Possible values for this directive are:
  - a real number: the weight is the given number and is not learned.
  - "?": DeepDive learns a weight for all factors defined by this rule. All the factors will share the same weight.
  - "?(column_name)": DeepDive learns multiple weights, one for each different value in the column column_name in the result of input_query.
An example inference rule is the following:
smokes_cancer {
  input_query: """
      SELECT person_has_cancer.id as "person_has_cancer.id",
             person_smokes.id as "person_smokes.id",
             person_smokes.smokes as "person_smokes.smokes",
             person_has_cancer.has_cancer as "person_has_cancer.has_cancer"
      FROM person_has_cancer, person_smokes
      WHERE person_has_cancer.person_id = person_smokes.person_id
      """
  function: "Imply(person_smokes.smokes, person_has_cancer.has_cancer)"
  weight: 0.5
}
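To learn the weight instead of fixing it to 0.5, the weight directive can be changed as described above; for instance, a variant of the same rule with a single learned weight:

smokes_cancer_learned {
  # same input_query as in the rule above
  input_query: """
      SELECT person_has_cancer.id as "person_has_cancer.id",
             person_smokes.id as "person_smokes.id",
             person_smokes.smokes as "person_smokes.smokes",
             person_has_cancer.has_cancer as "person_has_cancer.has_cancer"
      FROM person_has_cancer, person_smokes
      WHERE person_has_cancer.person_id = person_smokes.person_id
      """
  function: "Imply(person_smokes.smokes, person_has_cancer.has_cancer)"
  weight: "?"   # one weight, learned by DeepDive and shared by all factors of this rule
}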
Calibration / Holdout
Directives for calibration go in the calibration section. The available directives are:
- holdout_fraction: specifies the fraction of training data to use for holdout. E.g.:

  calibration {
    holdout_fraction: 0.25
  }

- holdout_query: specifies a custom query to be used to define the holdout set. This must insert all variable IDs that are to be held out into the dd_graph_variables_holdout table through arbitrary SQL. E.g.:

  calibration {
    holdout_query: "INSERT INTO dd_graph_variables_holdout(variable_id) SELECT dd_id FROM dd_variables_mytable WHERE predicate"
  }
When a custom holdout query is defined in holdout_query, the holdout_fraction setting is ignored.
- observation_query: specifies a custom query to be used to define observation-only evidence. Observation-only evidence is not fitted during weight learning, so there are three kinds of variables during learning: evidence that is fitted, evidence that is not fitted, and non-evidence variables. This query must insert all variable IDs that are observation-only evidence into the dd_graph_variables_observation table through arbitrary SQL. E.g.:

  calibration {
    observation_query: "INSERT INTO dd_graph_variables_observation SELECT id FROM mytable WHERE predicate"
  }
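Putting this together, here is a sketch of a calibration section that holds out a random quarter of the evidence through holdout_query; dd_variables_mytable and dd_id follow the naming used in the holdout_query example above, and random() assumes PostgreSQL:

calibration {
  holdout_query: """
    INSERT INTO dd_graph_variables_holdout(variable_id)
    SELECT dd_id FROM dd_variables_mytable
    WHERE random() < 0.25
  """
}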
Sampler
Configuration directives for the sampler go in the global deepdive section. The available directives are:
- (Optional) sampler.sampler_cmd: the path to the sampler executable:

  sampler.sampler_cmd: "util/sampler-dw-mac gibbs"

  Since version 0.03, DeepDive automatically chooses the correct executable based on your operating system (between "util/sampler-dw-linux gibbs" and "util/sampler-dw-mac gibbs"), so we recommend omitting the sampler_cmd directive.

- sampler.sampler_args: the arguments to the sampler executable:

  deepdive {
    sampler.sampler_args: "-l 1000 -i 1000 --alpha 0.01"
  }

  The default sampler_args are "-l 300 -i 500 --alpha 0.1". For a list of the arguments and their meanings, please refer to the documentation of our DimmWitted sampler.
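As a sketch of how these arguments are typically adjusted (the flag meanings in the comments below are our reading of the DimmWitted options; consult the sampler documentation for the authoritative reference):

deepdive {
  # assumed meanings: -l = number of learning epochs, -i = number of inference epochs,
  # --alpha = initial learning rate; see the DimmWitted sampler documentation
  sampler.sampler_args: "-l 500 -i 1000 --alpha 0.05"
}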