Mist

MIST is a Multivariable Information Theory-based dependence Search Tool. The Mist library computes entropy-based measures that detect functional dependencies between variables. Mist provides the libmist library and the mistcli Linux command-line tool.

  • Mist source is hosted on GitHub.

  • Mist for Python is available on PyPI.

  • Mist documentation is hosted on Read the Docs.

Background

A biological system is intrinsically complex and can be viewed as a large set of components, variables, and attributes that store and transmit information to one another. This information depends on how each component interacts with, and is related to, the other components of the system. Representing and measuring this information is the goal of Mist.

A central question of this problem is: how can we fully describe the joint probability density of the N variables that define the system? Characterization of the joint probability distribution is at the heart of describing the mathematical dependency among the variables. Mist provides a number of tools that are useful in the pursuit of describing and quantifying dependencies in complex biological systems.

A function between variables defines a deterministic relationship between them, a dependency; it can be as simple as "if X then Y" or something more complicated involving many variables. Thus a functional dependency among variables implies the existence of such a function; see [Galas2014]. Here we focus on the task of finding a functional dependency without concerning ourselves with the nature of the underlying function.

Mist is designed to quickly find functional dependencies among many variables. It uses model-free Information Theory measures based on entropy to compute the strength of the dependence. Mist allows us to detect functional dependencies for any function, involving any number of variables, limited only by processing capabilities and statistical power. This makes Mist a great tool for paring down a large set of variables into an interesting subset of dependencies, which may then be studied by other methods. This may be seen as compression of data by identifying redundant variables.
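To make the idea concrete, the sketch below estimates one of the simplest entropy-based measures, the mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), from discrete samples using plain NumPy. It is illustrative only and is not how libmist is implemented; the variable names and data are hypothetical.

import numpy as np

def entropy(counts):
    # Shannon entropy (in bits) of the empirical distribution given by counts
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from discrete samples
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    h_x = entropy(joint.sum(axis=1))
    h_y = entropy(joint.sum(axis=0))
    h_xy = entropy(joint.ravel())
    return h_x + h_y - h_xy

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=1000)
y = (3 * x) % 4  # y is a deterministic function of x
print(mutual_information(x, y))  # close to H(X), about 2 bits, signalling a dependency

When y is a function of x, the mutual information approaches H(X); when the variables are independent, it approaches zero. Mist applies measures of this kind to tuples of variables at scale.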

Quick Start

The easiest way to run Mist is by using the libmist Python module. The following minimal example sets up an exhaustive search for dependencies between pairs of variables, estimated with the default measure:

import libmist
search = libmist.Search()
search.load_file('/path/to/data.csv')
search.outfile = '/dev/stdout'
search.start()

There are numerous functions to configure Mist; some of the most important are listed below. For a full list, see the Mist documentation and consult the User Guide.

search.load_ndarray()     # load data from a NumPy ndarray (see docs for restrictions)
search.tuple_size         # set the number of variables in each tuple
search.measure            # set the Information Theory measure
search.threads            # set the number of computation threads
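
Putting these together, a typical configuration might look like the sketch below. The array shape, parameter values, and the load_ndarray call signature are illustrative assumptions; check the documentation for the expected data layout and the valid measure names.

import numpy as np
import libmist

search = libmist.Search()

# Hypothetical discrete data; see the docs for load_ndarray's shape and dtype restrictions
data = np.random.randint(0, 2, size=(20, 500))
search.load_ndarray(data)

search.tuple_size = 3        # consider all 3-variable tuples
search.threads = 4           # run on 4 computation threads
# search.measure can select a different Information Theory measure (see the docs for valid values)
search.outfile = '/dev/stdout'
search.start()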

This Python syntax is virtually identical to the C++ code you would write for a program using the Mist library, as you can see in the examples directory.

Installation

Docker

Mist can be built into a Docker image with the included Dockerfile:

cd /path/to/mist
docker image build . -t mist
docker run --rm -v "$(pwd)":/mist mist

The default command builds the Mist Python module, which can then be used in an interactive session or with a Python script, e.g.

docker run -it --rm -v "$(pwd)":/mist mist python3

mist

These packages are required to build libmist and mistcli:

  • CMake (minimum version 3.5)

  • Boost (minimum version 1.58.0)

Run cmake in an out-of-tree build directory:

mkdir /path/to/build
cd /path/to/build
cmake /path/to/mist
make install

mist python library

Use the pip package manager to install libmist:

pip install libmist

Or build and install from source.

Additional build requirements:

  • Python development packages (python3-dev or python-dev).

  • Boost Python and Numpy components. For Boost newer than 1.63, use the integrated Boost.Numpy (libboost-numpy) package; for earlier versions, install ndarray/Boost.Numpy.

Run cmake with BuildPython set to ON:

mkdir /path/to/build
cd /path/to/build
cmake -DBuildPython:BOOL=ON /path/to/mist
make install

Note: both the mist and ndarray/Boost.Numpy builds use the default Python version installed on the system. To use a different Python version, change the FindPythonInterp, FindPythonLibs, and FindNumpy invocations in both packages so that they all refer to the same Python version.

Documentation

Additional Requirements

  • Doxygen

  • Sphinx

Run cmake with BuildDoc set to ON:

mkdir /path/to/build
cd /path/to/build
cmake -DBuildDoc:BOOL=ON /path/to/mist
make Sphinx

The Sphinx target generates the documentation.

For Developers

This project follows the Pitchfork Layout. Namespaces are encapsulated in separate directories. Any physical unit must only include headers within its namespace, the root namespace (core), or interface headers in other namespaces. The build system discourages violations by making it awkward to link objects across namespaces.

Documentation for this project is dynamically generated with Doxygen and Sphinx. Comments in the source that follow Javadoc style are included in the docs. Non-documentation comments, e.g. implementation notes and developer advice, follow standard C++ comment style.

The included .clang-format file defines the code format; it should be applied with the tools/format.sh script.

Credits

Mist is written by Andrew Banman. It is based on software written by Nikita Sakhanenko. The ideas behind entropy-based functional dependency come from Information Theory research by David Galas, Nikita Sakhanenko, and James Kunert.

For copyright information see the LICENSE.txt file included with the source.

References

Galas2014

Galas, David J., et al. “Describing the complexity of systems: multivariable ‘set complexity’ and the information basis of systems biology.” Journal of Computational Biology 21.2 (2014): 118-140. doi:10.1089/cmb.2013.0039

Shannon1949

Shannon, Claude Elwood, and Warren Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1949.