# DeepDive developer's guide
This document describes useful information for those who want to modify the DeepDive infrastructure itself and contribute new code. Most of the content here is irrelevant for DeepDive users who just want to build a DeepDive application.
## DeepDive project at GitHub

Nearly all DeepDive development activity happens on GitHub.
### Branches and releases of DeepDive
- The `master` branch points to the latest code.
- We use Semantic Versioning.
- Every MAJOR.MINOR version has a maintenance branch whose name starts with `v` and ends with `.x`, e.g., `v0.6.x` (since 0.6).
- Every release is pointed to by a tag, e.g., `v0.6.0`, `0.05-RELEASE`. Since 0.6, release tag names start with `v`, followed by the MAJOR.MINOR.PATCH version. They usually point to a commit on the release maintenance branch (see the git example after this list).
- Any other branch points to someone's work in progress.
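For example, the usual git commands work with these naming conventions; a quick sketch using the 0.6 names mentioned above:

```bash
git fetch --tags        # make sure release tags are available locally
git checkout v0.6.x     # the maintenance branch for the 0.6 series
git checkout v0.6.0     # the exact 0.6.0 release tag
```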
### Contributing code to DeepDive
1. If you are part of the Hazy Research group, you can push your commits to a new branch, then create a Pull Request against `master` (see the sketch after this list). Otherwise, you need to fork our repository first, then push your code to that fork to create a Pull Request.
2. If you already know who can review your code, assign that member to the Pull Request.
3. The reviewer leaves comments about the code, then lets you know to address them.
4. You improve the code and push more commits to the branch of the Pull Request, then tell the reviewer to have another look. Remember that GitHub doesn't send out notifications (emails) unless you leave an actual comment on the Pull Request; the reviewer assumes the Pull Request is not ready for another look until you explicitly say so.
5. Steps 3-4 repeat until the reviewer says everything looks good.
6. The reviewer either merges your code into the master branch or asks you to do so (if you have permission).
7. Your branch should be deleted after the Pull Request is merged or closed.
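As a minimal sketch of step 1 (with a hypothetical branch name):

```bash
git checkout -b fix-some-bug      # hypothetical topic branch
# ...edit, then commit your changes...
git push -u origin fix-some-bug
# finally, open a Pull Request against master on GitHub
```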
## DeepDive code
DeepDive is written in several programming languages.
- Bash and jq are the main programming languages, used to generate the SQL queries and shell scripts that run the actual data pipeline defined by the user's extractors and inference rules.
- C++ is used for the high-performance Gibbs sampler that takes care of learning and inference for the model defined by the user's inference rules.
- C is used for the high-performance data router, mkmimo, which enables executing many UDF processes in parallel efficiently.
- Python is the main language we use for the UDFs in our examples.
- Scala and other mini languages are used for other minor parts.
### DeepDive code structure
- `compiler/` contains the code that compiles a DeepDive application's configuration into an execution plan.
- `database/` contains database drivers as well as code implementing other database operations.
- `ddlib/` contains the ddlib Python library that helps users write their applications.
- `doc/` contains the Markdown/Jekyll source for the DeepDive website and documentation.
- `examples/` contains the DeepDive examples.
- `extern/` contains scripts for building and bundling runtime dependencies from external third parties.
- `inference/` contains the engine and necessary utilities for statistical learning and inference.
- `runner/` contains the engine for running the execution plan compiled by the compiler.
- `shell/` contains the code for the general `deepdive` command-line interface.
- `test/` at the top, as well as `*/test/` under each subdirectory, contains the test code.
- `util/` contains other utilities for installation, build, and development.
The DeepDive build is controlled by several files:

- `Makefile` takes care of the overall build process.
- `stage.sh` contains the commands that stage built code under `dist/`, the default location where the built executables and runtime data are staged.
- `test/bats.mk` contains the Make recipes for running tests written in BATS under `test/`.
- `test/enumerate-tests.sh` and `test/*/should-work.sh` determine the .bats files to run for `make test` (see the example after this list).
- `.travis.yml` enables our continuous integration builds and tests at Travis CI, which are triggered every time a new commit is pushed to our GitHub repository.
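For instance, to preview which .bats files `make test` would pick up, the enumeration script can presumably be invoked on its own (an assumption based on its role in the build; the exact output format may vary):

```bash
test/enumerate-tests.sh    # lists the .bats files that make test would run
```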
The DeepDive source tree includes several git submodules and ports:

- `compiler/ddlog/` is the DDlog compiler.
- `inference/dimmwitted/` is the DimmWitted Gibbs sampler.
- `runner/mkmimo/` is a data-routing component used for executing parallel UDF processes and efficiently streaming data through them.
- `util/mindbender/` is the collection of tools supporting development, such as Mindtagger.
## Building and Testing DeepDive
First, get DeepDive's source tree and move into it by running:

```bash
git clone https://github.com/HazyResearch/deepdive.git
cd deepdive
```
### Containerized builds and tests
DeepDive builds and tests can be done using Docker, which can dramatically simplify setting up the development environment.
To build the source tree inside a container and create a new Docker image, run:

```bash
make build-in-container
```
Or, if you don't even have `make`, just run:

```bash
./DockerBuild/build-in-container
```
This pulls the `latest` image from Docker Hub (hazyresearch/deepdive-build), then runs the build inside a fresh container after applying the changes made to the current source tree. This is the default behavior for `make` (without any target argument) when Docker is available on your system.

CAVEAT: Note that only files that are tracked by git are reflected in the build inside containers. Use `git add` to make sure any new files are also considered when transferring changes to containers.

To test the most recent build, run:
```bash
make test-in-container
```
You can pass the `ONLY=` and `EXCEPT=` filters as you do for the normal builds (described below).

Or, the equivalent without `make` is:

```bash
./DockerBuild/test-in-container-postgres
```
You can in fact override the entire test command with this:

```bash
./DockerBuild/test-in-container-postgres make test ONLY=test/postgresql/*.bats
```
To inspect the most recent build or test, run:

```bash
./DockerBuild/inspect-container
```

You can pass a command to run as arguments:

```bash
./DockerBuild/inspect-container latest-run make test
```
The most recent image for the current branch is automatically updated after the most recent test finishes successfully. To also make it the new `latest` image for all other branches on your local machine, run:

```bash
./DockerBuild/update-latest-image
```
Until you run this command, new builds will always start from the `latest` image from the central Docker Hub, not from the latest build on your local machine. If your source tree has diverged a lot from the master branch, it's a good idea to update the latest image once the initial long build finishes and passes all tests; that way, builds for your branches won't have to repeat the same long build.

If you have permission, you can push your master image to Docker Hub and have others start builds from there by running:

```bash
docker push hazyresearch/deepdive-build
```
### Normal builds and tests
Running containerized builds and tests in Docker is the recommended way, but you are welcome to run normal builds directly on the host in the old way. Everything described here about normal builds in fact also applies to the source tree inside the container. Moreover, a normal build is the only way to produce releases for Mac and environments other than the one used in the master image.
To disable containerized builds even if you have Docker installed and to force a normal build, simply set:

```bash
export NO_DOCKER_BUILD=true
```
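Alternatively, a sketch of a one-off invocation that sets the variable only for a single command, using standard shell environment prefixing:

```bash
NO_DOCKER_BUILD=true make test    # forces a normal (non-containerized) test run
```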
To install all build and runtime dependencies, run:

```bash
make depends
```

Or, if you don't even have `make` installed:

```bash
util/install.sh _deepdive_build_deps _deepdive_runtime_deps
```
Basically, DeepDive requires a C/C++ compiler, a JDK, Python, GNU coreutils, and several libraries with headers to build from source. The `install__deepdive_build_deps` part of the `util/install/install.Ubuntu.sh` script enumerates most of the build dependencies as APT packages, so you can easily find the corresponding packages for your platform and install them. On the other hand, most of the runtime dependencies will be built and bundled (see: `depends/bundled/`), so eventually users will just grab a DeepDive binary and run it without having to waste time installing the correct software packages.

To build most of what's under DeepDive's source tree and install it at `~/local/`, run:

```bash
make install
```
Overriding the `PREFIX` variable allows the installation destination to be changed. For example:

```bash
make install PREFIX=/opt/deepdive
```
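After installation, you will typically want the `deepdive` executable on your `PATH`; a minimal sketch, assuming the default `~/local/` prefix places executables under `~/local/bin/`:

```bash
export PATH=~/local/bin:"$PATH"
deepdive    # should print usage information if the installation is on your PATH
```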
To run all tests, from the top of the source tree, run:

```bash
make test
```
Note that at least one PostgreSQL, MySQL, or Greenplum database must be running to run the tests.
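If you don't have one of these databases handy, one convenient option (not prescribed by this guide) is a throwaway PostgreSQL container; a sketch assuming Docker and the official `postgres` image, with a made-up password:

```bash
docker run -d --name deepdive-test-db \
  -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres
```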
By setting the `TEST_DBHOST` environment variable to a `user:password@hostname` value, it is possible to specify against which database the tests should run. For specifying non-default ports for different database types, there are more specific variables: `TEST_POSTGRES_DBHOST`, `TEST_GREENPLUM_DBHOST`, and `TEST_MYSQL_DBHOST`.
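For example, to point the tests at the hypothetical PostgreSQL container above (the official image's default superuser is `postgres`):

```bash
TEST_DBHOST=postgres:secret@localhost make test
```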
To run tests selectively, use the `ONLY` and `EXCEPT` Make variables for `make test`. For example, to run only the spouse example test against PostgreSQL:

```bash
make test ONLY=test/postgresql/spouse_example.bats
```

Or, to skip the tests against MySQL:

```bash
make test EXCEPT=test/mysql/*.bats
```
To create a tarball package from the built and staged code, run:

```bash
make package
```

The tarball is created at `dist/deepdive.tar.gz`.

To build the DDlog compiler from source and place the jar under `util/`, run:

```bash
make build-ddlog
```

To build the sampler from source and replace the binaries, run:

```bash
make build-sampler
```

To build the Mindbender toolchain from source and place the binary under `util/`, run:

```bash
make build-mindbender
```
All commands shown above should be run from the top of the source tree.
## Modifying DeepDive documentation

DeepDive documentation is written in Markdown under `doc/`, and the website is compiled using Jekyll.

To preview your changes to the documentation locally, run:

```bash
make -C doc/ test
```

To deploy changes to the main website, run:

```bash
make -C doc/ deploy
```