DeepDive developer's guide
This document is for those who want to modify the DeepDive infrastructure itself and contribute new code. Most of its content is irrelevant to DeepDive users who just want to build a DeepDive application.
DeepDive project at GitHub
Nearly all DeepDive development activities happen on GitHub.
Branches and releases of DeepDive
- The master branch points to the latest code.
- We use Semantic Versioning.
- Every MAJOR.MINOR version has a maintenance branch that starts with v and ends with
- Every release is pointed to by a tag, e.g., 0.05-RELEASE. Since 0.6, release tag names start with v, followed by the MAJOR.MINOR.PATCH version. Tags usually point to a commit in the release maintenance branch.
- Any other branch points to someone's work in progress.
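The tag naming scheme used since 0.6 can be illustrated with a small shell sketch. The function name parse_release_tag and the sample tag v0.6.0 are hypothetical and chosen only for illustration; nothing here is part of DeepDive itself.

```shell
# Hypothetical helper (not part of DeepDive): split a release tag of the
# form vMAJOR.MINOR.PATCH into its three numeric components.
# POSIX sh has no local variables, so the names below are global.
parse_release_tag() {
    tag=$1
    # Quick shape check: "v" followed by three dot-separated parts.
    case $tag in
        v[0-9]*.[0-9]*.[0-9]*) ;;
        *) return 1 ;;
    esac
    version=${tag#v}          # drop the leading "v"
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}
    patch=${rest#*.}
    # Reject anything that is not purely numeric, e.g. v1.2.3.4 or v1.2.x.
    case "$major$minor$patch" in
        *[!0-9]*) return 1 ;;
    esac
    echo "$major $minor $patch"
}

parse_release_tag v0.6.0        # prints "0 6 0"
# Older tags such as 0.05-RELEASE do not follow the scheme.
parse_release_tag 0.05-RELEASE || echo "not a vMAJOR.MINOR.PATCH tag"
```

The check is deliberately strict, so a script built on it would treat pre-0.6 tags as a separate case rather than guess at their meaning.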
Contributing code to DeepDive
1. If you are part of the Hazy Research group, push your commits to a new branch, then create a Pull Request to master. Otherwise, first fork our repository, then push your code to that fork and create a Pull Request from it.
2. If you already know who can review your code, assign that member to the Pull Request.
3. The reviewer leaves comments about the code and lets you know what to fix.
4. You improve the code and push more commits to the branch for the Pull Request, then tell the reviewer to have another look. Remember that GitHub doesn't send out notifications (emails) unless you leave an actual comment on the Pull Request. The reviewer assumes the Pull Request is not ready for another look until you explicitly say so.
5. Steps 3-4 repeat until the reviewer says everything looks good.
6. The reviewer may merge your code into the master branch or ask you to do so (if you have permission).
7. Delete your branch after the Pull Request is merged or closed.
DeepDive is written in several programming languages.
- Bash and jq are the main programming languages for generating the SQL queries and shell scripts that run the actual data pipeline defined by the user's extractors and inference rules.
- C++ is used for the high-performance Gibbs sampler that takes care of learning and inference for the model defined by the user's inference rules.
- C is used for the high-performance data router, mkmimo, which enables many UDF processes to be executed in parallel efficiently.
- Python is the main language we use for the UDFs in our examples.
- Scala and other mini languages are used for other minor parts.
DeepDive code structure
- compiler/ contains the code that compiles DeepDive application configuration into an execution plan.
- database/ contains database drivers as well as code implementing other database operations.
- ddlib/ contains the ddlib Python library that helps users write their applications.
- doc/ contains the Markdown/Jekyll source for the DeepDive website and documentation.
- examples/ contains the DeepDive examples.
- extern/ contains scripts for building and bundling runtime dependencies from external 3rd parties.
- inference/ contains the engine and necessary utilities for statistical learning and inference.
- runner/ contains the engine for running the execution plan compiled by the compiler.
- shell/ contains the code for the general
- test/ at the top as well as */test/ under each subdirectory contain the test code.
- util/ contains other utilities for installation, build, and development.
The DeepDive build is controlled by several files:
- Makefile takes care of the overall build process.
- stage.sh contains the commands that stage built code under dist/, which is the default location where the built executables and runtime data are staged.
- test/bats.mk contains the Make recipes for running tests written in BATS under
- test/*/should-work.sh determines the .bats files to run for
- .travis.yml enables our continuous integration builds and tests at Travis CI, which are triggered every time a new commit is pushed to our GitHub repository.
The DeepDive source tree includes several git submodules and ports:
- compiler/ddlog/ is the DDlog compiler.
- inference/dimmwitted/ is the DimmWitted Gibbs sampler.
- runner/mkmimo/ is a data routing component that is used for executing parallel UDF processes and efficiently streaming data through them.
- util/mindbender/ is the collection of tools supporting development, such as Mindtagger.
First, get DeepDive's source tree and move into it, by running:
git clone https://github.com/HazyResearch/deepdive.git
cd deepdive
To install all build and runtime dependencies, run:
Or, if you don't have even
util/install.sh _deepdive_build_deps _deepdive_runtime_deps
To build from source, DeepDive requires a C/C++ compiler, a JDK, Python, and several libraries with their headers.
The util/install/install.Ubuntu.sh script enumerates most of the build dependencies as APT packages. You can find the corresponding packages for your platform and install them. On the other hand, most of the runtime dependencies are built and bundled (see: depends/bundled/), so eventually users can just grab a DeepDive binary and run it without spending time installing the correct software packages.
To build most of what's under DeepDive's source tree and install at
The PREFIX variable allows the installation destination to be changed. For example:
make install PREFIX=/opt/deepdive
To run all tests, from the top of the source tree, run:
Note that at least one PostgreSQL, MySQL, or Greenplum database must be running to run the tests.
Setting the TEST_DBHOST environment variable to a value of the form user:password@hostname specifies which database the tests should run against. For specifying non-default ports for different database types, there are more specific variables:
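To make the user:password@hostname format concrete, here is a minimal shell sketch that splits such a value into its parts. The function split_dbhost and the sample credentials are hypothetical. It only illustrates the format and is not how DeepDive's test scripts necessarily read the variable. It assumes all three parts are present.

```shell
# Hypothetical illustration (not DeepDive code): decompose a value in the
# user:password@hostname form used for TEST_DBHOST.
# Assumes the value contains both ":" and "@".
split_dbhost() {
    value=$1
    credentials=${value%@*}      # everything before the last "@"
    host=${value##*@}            # everything after the last "@"
    user=${credentials%%:*}      # text before the first ":"
    password=${credentials#*:}   # text after the first ":"
    echo "user=$user password=$password hostname=$host"
}

# Made-up sample value; prints "user=alice password=secret hostname=localhost"
split_dbhost 'alice:secret@localhost'

# A value of this shape would be passed to the tests through the environment,
# for example: TEST_DBHOST='alice:secret@localhost' make test
```

Splitting at the last "@" and the first ":" lets the password contain either character, which is a design choice of this sketch rather than documented DeepDive behavior.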
To run tests selectively, use the ONLY and EXCEPT Make variables for make test.
For example, to run only the spouse example test against PostgreSQL:
make test ONLY=test/postgresql/spouse_example.bats
Or, to skip the tests against MySQL:
make test EXCEPT=test/mysql/*.bats
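The selection behavior described above can be pictured with a small shell sketch. It assumes that ONLY and EXCEPT are applied as glob patterns over the list of .bats paths. The function select_tests and the MySQL path in the example are hypothetical, and DeepDive's own Make recipes may implement the selection differently.

```shell
# Hypothetical sketch (not DeepDive's implementation): filter a list of test
# paths read from stdin by an ONLY pattern and an EXCEPT pattern.
# An empty pattern means "no restriction".
select_tests() {
    only=$1
    except=$2
    while IFS= read -r path; do
        # Skip paths that do not match ONLY, when ONLY is given.
        if [ -n "$only" ]; then
            case $path in
                $only) ;;
                *) continue ;;
            esac
        fi
        # Skip paths that match EXCEPT, when EXCEPT is given.
        if [ -n "$except" ]; then
            case $path in
                $except) continue ;;
            esac
        fi
        echo "$path"
    done
}

# Skipping the MySQL tests leaves only the PostgreSQL one.
printf '%s\n' \
    test/postgresql/spouse_example.bats \
    test/mysql/spouse_example.bats |
    select_tests '' 'test/mysql/*.bats'
```

In a case pattern the * wildcard also matches "/", so a single pattern can cover a whole directory of tests.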
To create a tarball package from the built and staged code, run:
The tarball is created at
To build the DDlog compiler from source and place the jar under
To build the sampler from source and replace the binaries, run:
To build the Mindbender toolchain from source and place the binary under
All commands shown above should be run from the top of the source tree.
Modifying DeepDive documentation
To preview your changes to the documentation locally, run:
make -C doc/ test
To deploy changes to the main website, run:
make -C doc/ deploy