API

mist consists of logically distinct components encapsulated by namespaces. Classes access other namespaces via an interface class. Users typically need only the classes in the root namespace, whereas developers will need the rest.

mist

The root namespace includes composition classes and classes common to the sub-namespaces.

class mist::Search

Main user interface for mist runtime.

C++ and Python users instantiate this class, load data, and optionally call various configuration methods to define the computation. Computations begin with start(). The object maintains state between runs, such as intermediate value caches for improved performance.

Public Functions

void set_measure(std::string const &measure)

Set the IT Measure to be computed.

  • Entropy : Compute only combined entropy.

  • SymmetricDelta (default) : A novel symmetric measure of shared information. See Sakhanenko, Galas in the literature.

void set_cutoff(it::entropy_type cutoff)

Set the minimum IT Measure value to keep in results.

This option is most useful for dealing with very large TupleSpaces, the results for which cannot be stored in memory or on disk.

void set_probability_algorithm(std::string const &algorithm)

Set the algorithm for generating probability distributions.

  • Vector (default) : Process each Variable as a vector. Gives best performance when Variable size is small or when there are many value bins.

  • Bitset : Convert each distinct Variable value into a bitset to leverage bitwise operations. Gives best performance when Variable size is large and the number of value bins is small.

Performance of each algorithm depends strongly on the problem, i.e. the data, and potentially also on the system. After the number of threads, this parameter has the largest effect on runtime since distribution generation dominates the computation.

void set_outfile(std::string const &filename)

Set output CSV file.

void set_ranks(int ranks)

Set number of concurrent ranks to use in this Search.

A rank on a computation node is one execution thread. The default number of ranks is the number of threads allowed by the node. Setting ranks to 0 causes the system to use the maximum.

void set_start_rank(int rank)

Set the starting rank for this Search.

A Mist search can run in parallel on multiple nodes in a system. For each node, configure a Search with the starting rank, the number of ranks (i.e. threads) on the node, and the total ranks among all nodes. In this way you can divide the search space among nodes in the system.

The starting rank is the zero-indexed rank number, valid over the range [0, total_ranks).

Parameters

rank – Zero-indexed rank number

void set_total_ranks(int ranks)

Set the total number of ranks among all participating Searches.

Each thread on each node is counted as a rank, so total_ranks is the sum of the configured ranks (threads) on each node.
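
As a rough illustration of how a starting rank, a rank count, and a total rank count can divide a search space, the sketch below splits a range of tuple positions evenly among ranks. The even-split rule and the names RankRange and rank_range are assumptions made for this example; they are not part of mist, and the library's actual scheduling may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of mist: half-open range [start, stop)
// of tuple positions assigned to one rank.
struct RankRange {
    std::uint64_t start;
    std::uint64_t stop;
};

// Split `total_tuples` positions as evenly as possible among `total_ranks`
// ranks. The first (total_tuples % total_ranks) ranks get one extra tuple.
RankRange rank_range(std::uint64_t total_tuples, int rank, int total_ranks) {
    assert(total_ranks > 0 && rank >= 0 && rank < total_ranks);
    std::uint64_t const n = static_cast<std::uint64_t>(total_ranks);
    std::uint64_t const r = static_cast<std::uint64_t>(rank);
    std::uint64_t const base = total_tuples / n;
    std::uint64_t const extra = total_tuples % n;
    std::uint64_t const start = r * base + (r < extra ? r : extra);
    std::uint64_t const stop = start + base + (r < extra ? 1 : 0);
    return RankRange{start, stop};
}
```

Under this scheme, a node configured with starting rank 2 and 2 ranks out of 4 total would cover the ranges returned by rank_range(n, 2, 4) and rank_range(n, 3, 4).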

void set_tuple_size(int size)

Set the number of Variables to include in each IT measure computation.

void set_tuple_space(algorithm::TupleSpace const &ts)

Set the custom tuple space for the next computation

Side effects: sets the thread algorithm to TupleSpace so that the tuple space becomes effective immediately.

void set_tuple_limit(long limit)

Set the maximum number of tuples to process. The default is 0, meaning unlimited.

void set_show_progress(bool)

Toggle whether to write program progress to stderr.

When true, an extra thread is created to watch progress through the TupleSpace. This option is especially useful for estimating how long a large search will take.

void set_output_intermediate(bool)

Include all subcalculations in the output

void set_cache_enabled(bool)

Enable caching intermediate entropy calculation

void set_cache_size_bytes(unsigned long)

Set maximum size of entropy cache in bytes

void load_file(std::string const &filename)

Load Data from CSV or tab-separated file.

By default, the file is loaded in row-major order, i.e. each row is a variable.

Parameters
  • filename – path to file

  • is_row_major – Set to true for row-major variables

Pre

Each row has an equal number of columns.

void load_ndarray(np::ndarray const &np)

Load Data from Python Numpy::ndarray.

Data is loaded into the library following a zero-copy guarantee.

Parameters

np – ndarray

Pre

Array is NxM matrix of the expected dtype and C memory layout.

np::ndarray python_get_results()

Return a Numpy ndarray copy of all results

np::ndarray python_start()

Start search.

Compute the configured IT measure for all Variable tuples in the configured search space, and return up to tuple_limit results.

void start()

Begin computation.

Compute the configured IT measure for all Variable tuples in the configured search space.

std::vector<it::entropy_type> const &get_results()

Return a copy of all results

void printCacheStats()

Print cache statistics for each cache in each thread to stdout.

std::string version()

Return the Search library Version string

class mist::Variable

Variable wraps a pointer to a data column.

Public Types

using data_t = std::int8_t

Variable values must be signed so that negative values can represent missing data, and should be as small as possible to save space for very large data sets.

Public Functions

Variable(data_ptr src, std::size_t size, std::size_t index, std::size_t bins)

Variable constructor.

Wrap a shared pointer to column data along with metadata.

Parameters
  • src – Shared pointer to memory allocated for the data column

  • size – Number of rows in the data column

  • index – Identifying column index into data matrix

  • bins – Number of data value bins

Throws

invalid_argument – the stored data pointer is null, or the size or bins argument is zero.

Pre

src data has been allocated memory for at least size elements.

Pre

src data values are binned to a contiguous non-negative integer array starting at 0.

Pre

src missing data values are represented by negative integers.

inline bool missing(std::size_t pos) const

Test if data at position is missing.

Throws

std::out_of_range

data_t &at(std::size_t const pos)

Throws

out_of_range

Variable deepCopy()

Variable uses default move and copy constructors that are shallow and maintain the const requirement on the underlying data. A deep copy is made with this extra method.

bool operator==(Variable const &other) const noexcept

Variable equality test. Resorts to a deep inspection, so two Variables with identical content in different memory locations are equivalent. Returns false if either Variable has invalid data, e.g. as a side effect of std::move.

bool operator!=(Variable const &other) const noexcept

Variable inequality test.

Public Static Functions

static bool missingVal(data_t const val)

Test if value is classified as missing.
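
The data conventions stated in the Variable preconditions (bins numbered from 0, negative values for missing data) can be sketched as follows. The helpers is_missing, count_present, and is_binned are hypothetical names for this example and are not mist functions; the negative-means-missing rule is taken from the preconditions above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using data_t = std::int8_t;  // same underlying type as Variable::data_t

// Assumed convention from the preconditions: any negative value is missing.
bool is_missing(data_t val) { return val < 0; }

// Count the non-missing samples in a binned column.
std::size_t count_present(std::vector<data_t> const &column) {
    std::size_t n = 0;
    for (data_t v : column) {
        if (!is_missing(v)) ++n;
    }
    return n;
}

// Check that every non-missing value falls in [0, bins).
bool is_binned(std::vector<data_t> const &column, std::size_t bins) {
    assert(bins > 0);
    for (data_t v : column) {
        if (!is_missing(v) && static_cast<std::size_t>(v) >= bins) return false;
    }
    return true;
}
```

A column such as {0, 1, -1, 2, -1} has three present samples and satisfies the binning precondition for 3 bins but not for 2.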

mist::algorithm

Algorithms to divide and conquer Information Theory computations.

namespace mist::algorithm
class TupleSpace
#include <TupleSpace.hpp>

Tuple Space defines the set of tuples over which to run a computation search.

Public Functions

int addVariableGroup(std::string const &name, tuple_t const &vars)

Define a named logical group of variables

Parameters
  • name – group name

  • vars – set of variables in the group, duplicates will be ignored

Throws

TupleSpaceException – variable already listed in existing variable group

Returns

index of created variable group

void addVariableGroupTuple(std::vector<std::string> const &groups)

Add a variable group tuple

The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.

Parameters

groups – Array of group names

Throws

TupleSpaceException – group does not exist
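
To make the cross product of groups concrete, here is a minimal sketch that expands a group tuple into variable tuples. It assumes a tuple is a list of integer variable indexes and uses a plain Cartesian product; how mist orders the tuples, or treats a group that appears more than once, is not covered here. The function cross_product is a hypothetical name for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using tuple_t = std::vector<int>;  // assumed: a tuple is a list of variable indexes

// Expand a group tuple into variable tuples by taking the Cartesian product
// of the groups, varying the last group fastest.
std::vector<tuple_t> cross_product(std::vector<tuple_t> const &groups) {
    std::vector<tuple_t> result(1);  // start with one empty tuple
    for (tuple_t const &group : groups) {
        std::vector<tuple_t> next;
        next.reserve(result.size() * group.size());
        for (tuple_t const &prefix : result) {
            for (int var : group) {
                tuple_t extended = prefix;
                extended.push_back(var);
                next.push_back(extended);
            }
        }
        result = std::move(next);
    }
    return result;
}
```

With two groups holding variables {0, 1} and {2, 3}, the group tuple expands to (0, 2), (0, 3), (1, 2), (1, 3).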

void addVariableGroupTuple(tuple_t const &groups)

Add a variable group tuple

The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.

Parameters

groups – Array of group indexes, numbered in order of creation

Throws

TupleSpaceException – group index out of range

std::vector<std::string> names() const

Get variable names

void set_names(std::vector<std::string> const &names)

Set variable names

count_t count_tuples() const

Calculate the size of the tuple space, i.e. count the generated number of tuples.

void traverse(TupleSpaceTraverser &traverser) const

Walk through all tuples in the tuple space

Parameters

traverser – Process each tuple with methods defined in specialization

void traverse(count_t start, count_t stop, TupleSpaceTraverser &traverser) const

Walk through a subset of tuples in the tuple space

TupleSpace generates an ordered list of tuples, so a walk can begin at any position in the list.

Parameters
  • start – Begin the walk at tuple in position start

  • stop – End the walk at tuple in position stop

  • traverser – Process each tuple with methods defined in specialization

void traverse_entropy(it::EntropyCalculator &ecalc, TupleSpaceTraverser &traverser) const

Walk through all tuples in the tuple space, computing entropy values as you go.

Some it::Measure classes compute the entropy values of sub-tuples. It is most efficient to compute these as you walk through the tuple space so intermediary values can be reused many times.

Parameters
  • ecalc – it::EntropyCalculator object to perform entropy computations

  • traverser – Process each tuple with methods defined in specialization

void traverse_entropy(count_t start, count_t stop, it::EntropyCalculator &ecalc, TupleSpaceTraverser &traverser) const

Walk through a subset of tuples in the tuple space, computing entropy values as you go.

class TupleSpaceException : public exception
class TupleSpaceTraverser
#include <TupleSpace.hpp>

Interface for processing tuples in the TupleSpace

A class can specialize the TupleSpaceTraverser to gain access to the stream of tuples generated by TupleSpace::traverse family of functions.

Subclassed by mist::algorithm::Worker

class Worker : public mist::algorithm::TupleSpaceTraverser
#include <Worker.hpp>

The Worker class divides and conquers the tuple search space.

The Worker processes each tuple in the configured search space, or a portion of the search space depending on the rank parameters. It is common for each computing thread on the system to have a unique Worker instance.

Public Functions

Worker(tuple_space_ptr const &ts, count_t start, count_t stop, result_t cutoff, entropy_calc_ptr &calc, std::vector<output_stream_ptr> const &out_streams, measure_ptr const &measure)

Construct and configure a Worker instance.

Parameters
  • ts – TupleSpace that defines the tuple search space

  • start – Start processing at start tuple number

  • stop – Stop processing when stop tuple number is reached

  • cutoff – Discard all tuples from output with a measure less than cutoff

  • out_streams – Collection of OutputStream pointers to which results are sent

  • measure – The it::Measure to calculate the results

Worker(tuple_space_ptr const &ts, count_t start, count_t stop, entropy_calc_ptr &calc, std::vector<output_stream_ptr> const &out_streams, measure_ptr const &measure)

Construct and configure a Worker instance.

Cutoff is not used in this instance.

void start()

Start the Worker search space execution. Returns when all tuples in the search space have been processed.

class WorkerException : public exception

mist::cache

Cache intermediate results for performance improvement.

namespace mist::cache

Typedefs

using K = Variable::indexes
using V = it::entropy_type
class Cache
#include <Cache.hpp>

Cache interface

Subclassed by mist::cache::Flat1D, mist::cache::Flat2D

Public Functions

virtual bool has(K const&) = 0

Test that key is in table

virtual void put(K const&, V const&) = 0

Insert value at key.

virtual V get(K const&) = 0

Return value at key.

Throws

out_of_range – Key not in table

virtual std::size_t size() = 0

Number of entries in table

virtual std::size_t bytes() = 0

Size in bytes of table

inline std::size_t hits()

Number of cache hits

inline std::size_t misses()

Number of cache misses

inline std::size_t evictions()

Number of cache evictions

class Flat1D : public mist::cache::Cache
#include <Flat1D.hpp>

Fixed sized associative cache

Public Functions

virtual bool has(key_type const &key)

Test that key is in table

virtual void put(key_type const &key, val_type const &val)

Insert value at key.

virtual val_type get(key_type const &key)

Return value at key.

Throws

out_of_range – Key not in table

virtual std::size_t size()

Number of entries in table

virtual std::size_t bytes()

Size in bytes of table
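
The has/put/get contract and the hit, miss, and eviction counters can be illustrated with a small fixed-size cache. This is a sketch only: TinyCache is a hypothetical class, not mist's Flat1D; it uses plain integer keys instead of Variable::indexes, maps each key to one slot by modulo, and counts a miss when get() fails. All of these choices are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative stand-in for the Cache contract; not mist's Flat1D.
// Keys are plain indexes and values are doubles (it::entropy_type is double).
class TinyCache {
public:
    explicit TinyCache(std::size_t slots) : slots_(slots) { assert(slots > 0); }

    // True when the slot for this key currently holds this key.
    bool has(std::size_t key) const {
        Slot const &s = slots_[key % slots_.size()];
        return s.used && s.key == key;
    }

    // Store the value, evicting whatever other key occupied the slot.
    void put(std::size_t key, double val) {
        Slot &s = slots_[key % slots_.size()];
        if (s.used && s.key != key) ++evictions_;
        s = Slot{true, key, val};
    }

    // Throws std::out_of_range when the key is absent, as documented for get().
    double get(std::size_t key) {
        if (!has(key)) {
            ++misses_;
            throw std::out_of_range("key not in table");
        }
        ++hits_;
        return slots_[key % slots_.size()].val;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }
    std::size_t evictions() const { return evictions_; }

private:
    struct Slot {
        bool used;
        std::size_t key;
        double val;
    };
    std::vector<Slot> slots_;  // value-initialized, so every slot starts unused
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
    std::size_t evictions_ = 0;
};
```

Callers that want to avoid the exception can test has() before calling get().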

class Flat1DException : public exception
class Flat1DOutOfRange : public out_of_range
class Flat2D : public mist::cache::Cache
#include <Flat2D.hpp>

Fixed sized associative cache

Public Functions

virtual bool has(key_type const &key)

Test that key is in table

virtual void put(key_type const &key, val_type const &val)

Insert value at key.

virtual val_type get(key_type const &key)

Return value at key.

Throws

out_of_range – Key not in table

virtual std::size_t size()

Number of entries in table

virtual std::size_t bytes()

Size in bytes of table

class Flat2DException : public exception
class Flat2DOutOfRange : public out_of_range

mist::io

Input/Output

namespace mist::io
class DataMatrix
#include <DataMatrix.hpp>

N x M input data matrix.

Columns are interpreted as variables with each row a sample.

class DataMatrixException : public exception
class FileOutputStream : public mist::io::OutputStream
class FileOutputStreamException : public exception
class FlatOutputStream : public mist::io::OutputStream

Public Functions

void relocate(FlatOutputStream &other)

Move all data in other to this object

class FlatOutputStreamException : public exception
class MapOutputStream : public mist::io::OutputStream
class OutputStream

Subclassed by mist::io::FileOutputStream, mist::io::FlatOutputStream, mist::io::MapOutputStream

mist::it

Information Theory definitions and algorithms.

namespace mist::it

Typedefs

using Bitset = boost::dynamic_bitset<unsigned long long>
using BitsetVariable = std::vector<Bitset>
using BitsetTable = std::vector<BitsetVariable>
using DistributionData = double
using entropy_type = double
using Entropy = std::vector<entropy_type>

Enums

enum d1

Values:

enumerator e0
enumerator size
enum d2

Values:

enumerator e0
enumerator e1
enumerator e01
enumerator size
enum d3

Values:

enumerator e0
enumerator e1
enumerator e2
enumerator e01
enumerator e02
enumerator e12
enumerator e012
enumerator size
enum d4

Values:

enumerator e0
enumerator e1
enumerator e2
enumerator e3
enumerator e01
enumerator e02
enumerator e03
enumerator e12
enumerator e13
enumerator e23
enumerator e012
enumerator e013
enumerator e023
enumerator e123
enumerator e0123
enumerator size
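
The enums above are listed without a description. One plausible reading, assumed here and not confirmed by this reference, is that each enumerator names the slot of an Entropy vector that holds the entropy of the corresponding sub-tuple (e01 for variables 0 and 1 together), with size giving the vector length. Under that assumption, a two-variable Entropy vector could be read as in the sketch below, which redeclares the names locally so it compiles on its own. The function mutual_information is a hypothetical example, not a mist measure.

```cpp
#include <cassert>
#include <vector>

// Local copies of the documented names so the sketch compiles on its own.
using entropy_type = double;
using Entropy = std::vector<entropy_type>;
enum d2 { e0, e1, e01, size };

// Mutual information I(X0; X1) = H(X0) + H(X1) - H(X0, X1),
// reading each term from its assumed slot in the Entropy vector.
entropy_type mutual_information(Entropy const &e) {
    assert(e.size() == static_cast<Entropy::size_type>(d2::size));
    return e[d2::e0] + e[d2::e1] - e[d2::e01];
}
```
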
class BitsetCounter : public mist::it::Counter
#include <BitsetCounter.hpp>

Generates a ProbabilityDistribution from a Variable tuple.

Recasts each Variable as an array of bitsets, one for each bin value. Computes the ProbabilityDistribution using bitwise AND operation and bit counting algorithm.
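
The bitset approach can be sketched for two variables as follows. This is an illustration under stated assumptions, not the library's implementation: it uses a fixed-width std::bitset in place of boost::dynamic_bitset, limits a column to 64 samples, and the names to_bitsets and joint_counts are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMaxSamples = 64;  // sketch limit; mist uses dynamic bitsets
using Bits = std::bitset<kMaxSamples>;
using BitsetVar = std::vector<Bits>;     // one bitset per bin value

// Recast a binned column: bit i of result[b] is set when sample i has value b.
// Negative (missing) samples set no bit.
BitsetVar to_bitsets(std::vector<std::int8_t> const &column, std::size_t bins) {
    assert(column.size() <= kMaxSamples);
    BitsetVar result(bins);
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (column[i] >= 0 && static_cast<std::size_t>(column[i]) < bins) {
            result[static_cast<std::size_t>(column[i])].set(i);
        }
    }
    return result;
}

// Joint counts for two variables: counts[a * y.size() + b] is the number of
// samples where x == a and y == b, found with bitwise AND plus a bit count.
std::vector<std::size_t> joint_counts(BitsetVar const &x, BitsetVar const &y) {
    std::vector<std::size_t> counts(x.size() * y.size());
    for (std::size_t a = 0; a < x.size(); ++a) {
        for (std::size_t b = 0; b < y.size(); ++b) {
            counts[a * y.size() + b] = (x[a] & y[b]).count();
        }
    }
    return counts;
}
```

The inner loop runs once per pair of bin values rather than once per sample, which is consistent with the note that Bitset performs best when the number of value bins is small and the Variable size is large.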

class BitsetCounterOutOfRange : public out_of_range
class Counter
#include <Counter.hpp>

Abstract class. Generates a Probability Distribution from a Variable tuple

Subclassed by mist::it::BitsetCounter, mist::it::VectorCounter

class Distribution
#include <Distribution.hpp>

Joint probability array for N variables

Public Functions

template<class Container>
inline Distribution(Container const &strides)

Construct directly from dimension strides

inline Distribution(Variable::tuple const &vars)

Construct from a Variable tuple

inline void scale(double factor)

Multiply each value in distribution by factor

inline void normalize()

Normalize distribution
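
As a worked illustration of normalization and the entropy computed from it, the sketch below turns raw counts into probabilities and evaluates Shannon entropy. The free functions normalize and entropy_bits are hypothetical stand-ins, not the Distribution members, and the base-2 logarithm is an assumption; this reference does not state which base mist uses.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Divide each count by the total so the values sum to 1.
void normalize(std::vector<double> &dist) {
    double total = 0.0;
    for (double v : dist) total += v;
    assert(total > 0.0);
    for (double &v : dist) v /= total;
}

// Shannon entropy in bits; the log base is an assumption of this sketch.
// Zero-probability cells contribute nothing to the sum.
double entropy_bits(std::vector<double> const &probs) {
    double h = 0.0;
    for (double p : probs) {
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}
```

Four equally likely cells normalize to 0.25 each and give an entropy of 2 bits.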

class DistributionOutOfRange : public out_of_range
class EntropyCalculator
class EntropyCalculatorException : public exception
class EntropyMeasure : public mist::it::Measure

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const

Compute the information theory measure with the computation ecalc for the given variables.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const

Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.

virtual std::string header(int d, bool full_output) const

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

virtual std::vector<std::string> const &names(int d, bool full_output) const

Return array of names for each column in the output

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

array of column names in the output

inline virtual bool full_entropy() const

Whether this measure uses intermediate entropy calculations

class EntropyMeasureException : public exception
class Measure

Subclassed by mist::it::EntropyMeasure, mist::it::SymmetricDelta

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const = 0

Compute the information theory measure with the computation ecalc for the given variables.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &entropy) const = 0

Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.

virtual std::string header(int d, bool full_output) const = 0

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

virtual std::vector<std::string> const &names(int d, bool full_output) const = 0

Return array of names for each column in the output

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

array of column names in the output

virtual bool full_entropy() const = 0

Whether this measure uses intermediate entropy calculations

class SymmetricDelta : public mist::it::Measure

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const

Compute the information theory measure with the computation ecalc for the given variables.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const

Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.

virtual std::string header(int d, bool full_output) const

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

virtual std::vector<std::string> const &names(int d, bool full_output) const

Return array of names for each column in the output

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

array of column names in the output

inline virtual bool full_entropy() const

Whether this measure uses intermediate entropy calculations

class SymmetricDeltaException : public exception
class VectorCounter : public mist::it::Counter
#include <VectorCounter.hpp>

Generates a ProbabilityDistribution from a Variable tuple.

Counts using standard algorithm.

class VectorCounterException : public exception