API

Mist is composed of logically distinct components, each encapsulated in its own namespace. Classes access other namespaces through an interface class. Users typically need only the classes in the root namespace; developers will need the rest.

mist

The root namespace includes composition classes and classes common to the sub-namespaces.

class mist::Search

Main user interface for mist runtime.

C++ and Python users instantiate this class, load data, and optionally call configuration methods to define the computation. Computation begins with start(). The object maintains state between runs, such as caches of intermediate values that improve performance.

Public Functions

void set_measure(std::string const &measure)

Set the IT Measure to be computed.

  • Entropy : Compute only combined entropy.

  • SymmetricDelta (default) : A novel symmetric measure of shared information. See Sakhanenko, Galas in the literature.

void set_probability_algorithm(std::string const &algorithm)

Set the algorithm for generating probability distributions.

  • Vector (default) : Process each Variable as a vector. Gives best performance when Variable size is small or when there are many value bins.

  • Bitset : Convert each distinct Variable value into a bitset to leverage bitwise operations. Gives best performance when Variable size is large and the number of value bins is small.

Performance of each algorithm depends strongly on the problem, i.e. the data, and potentially also on the system. After the number of threads, this parameter has the largest effect on runtime since distribution generation dominates the computation.

void set_outfile(std::string const &filename)

Set output CSV file.

void set_ranks(int ranks)

Set number of concurrent ranks to use in this Search.

A rank on a computation node is one execution thread. The default number of ranks is the number of threads allowed by the node. Setting ranks to 0 causes the system to use the maximum.

void set_start_rank(int rank)

Set the starting rank for this Search.

A Mist search can run in parallel on multiple nodes in a system. For each node, configure a Search with the starting rank, the number of ranks (i.e. threads) on the node, and the total ranks among all nodes. In this way you can divide the search space among the nodes in the system.

The starting rank is the zero-indexed rank number, valid over the range [0, total_ranks), i.e. 0 through total_ranks - 1.

Parameters

rank – Zero-indexed rank number

void set_total_ranks(int ranks)

Set the total number of ranks among all participating Searches.

Each thread on each node is counted as a rank, so total_ranks is the sum of the configured ranks (threads) on all nodes.
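
The rank bookkeeping described above can be sketched as follows. This is a minimal standalone illustration, not part of mist: NodeConfig and plan_ranks are hypothetical names, and the sketch assumes that the ranks on a node are numbered contiguously from that node's starting rank.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type (not part of mist): settings for one node.
struct NodeConfig {
    int start_rank;   // value to pass to set_start_rank()
    int ranks;        // value to pass to set_ranks()
    int total_ranks;  // value to pass to set_total_ranks()
};

// Given the thread count of each node, compute the settings for every node.
std::vector<NodeConfig> plan_ranks(std::vector<int> const &threads_per_node) {
    int total = 0;
    for (int t : threads_per_node) {
        assert(t > 0);  // each node must contribute at least one thread
        total += t;
    }
    std::vector<NodeConfig> plan;
    int next_start = 0;
    for (int t : threads_per_node) {
        plan.push_back(NodeConfig{next_start, t, total});
        next_start += t;  // the next node starts where this one ends
    }
    return plan;
}
```

With three nodes of 4, 8, and 4 threads, the sketch yields starting ranks 0, 4, and 12 with a total of 16 ranks on every node.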

void set_tuple_size(int size)

Set the number of Variables to include in each IT measure computation.

void set_tuple_space(algorithm::TupleSpace const &ts)

Set the custom tuple space for the next computation

Side effects: sets the thread algorithm to TupleSpace so that the tuple space becomes effective immediately.

void set_tuple_limit(long limit)

Set the maximum number of tuples to process. The default is 0, meaning unlimited.

void set_output_intermediate(bool)

Include all subcalculations in the output

void set_cache_enabled(bool)

Enable caching intermediate entropy calculation

void set_cache_size_bytes(unsigned long)

Set maximum size of entropy cache in bytes

void load_file(std::string const &filename)

Load Data from CSV or tab-separated file.

By default, the file is loaded in row-major order, i.e. each row is a variable.

Parameters
  • filename – path to file

  • is_row_major – Set to true for row-major variables

Pre

Each row has an equal number of columns.

void load_ndarray(np::ndarray const &np)

Load Data from Python Numpy::ndarray.

Data is loaded into the library following a zero-copy guarantee.

Parameters

np – ndarray

Pre

Array is NxM matrix of the expected dtype and C memory layout.

np::ndarray python_get_results()

Return a Numpy ndarray copy of all results

np::ndarray python_start()

Start search.

Compute the configured IT measure for all Variable tuples in the configured search space, and return up to tuple_limit results.

void start()

Begin computation.

Compute the configured IT measure for all Variable tuples in the configured search space.

io::MapOutputStream::map_type get_results()

Return a copy of all results

void printCacheStats()

Print cache statistics for each cache in each thread to stdout.

std::string version()

Return the Search library Version string

class mist::Variable

Variable wraps a pointer to a data column.

Public Functions

Variable(data_ptr src, std::size_t size, std::size_t index, std::size_t bins)

Variable constructor.

Wrap a shared pointer to column data along with metadata.

Parameters
  • src – Shared pointer to memory allocated for the data column

  • size – Number of rows in the data column

  • index – Identifying column index into data matrix

  • bins – Number of data value bins

Throws

invalid_argument – the stored data pointer is null, or the size or bins argument is zero.

Pre

src data has been allocated memory for at least size elements.

Pre

src data values are binned to a contiguous non-negative integer array starting at 0.

Pre

src missing data values are represented by negative integers.
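
The preconditions above describe how data must be encoded before it is wrapped in a Variable. As a rough illustration, the following standalone sketch bins raw values into contiguous integers starting at 0 and marks missing values with -1. The helper bin_values is hypothetical and not part of mist, and treating NaN as the missing marker is an assumption made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper (not part of mist): map raw values to contiguous bin
// indexes 0..bins-1 in ascending value order; NaN is treated as missing (-1).
std::vector<int> bin_values(std::vector<double> const &raw, std::size_t &bins) {
    assert(!raw.empty());  // a zero-size column is not a valid Variable
    std::map<double, int> bin_of;  // distinct value -> bin index
    for (double v : raw)
        if (!std::isnan(v)) bin_of.emplace(v, 0);
    int next = 0;
    for (auto &kv : bin_of) kv.second = next++;  // std::map iterates in key order
    bins = bin_of.size();
    std::vector<int> out;
    out.reserve(raw.size());
    for (double v : raw)
        out.push_back(std::isnan(v) ? -1 : bin_of.at(v));
    return out;
}
```

The returned vector and bin count correspond to the column data and the bins argument that the constructor expects.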

inline bool missing(std::size_t pos) const

Test if data at position is missing.

Throws

std::out_of_range

data_type &at(std::size_t const pos)

Throws

std::out_of_range

Variable deepCopy()

Variable uses default move and copy constructors that are shallow and maintain the const requirement on the underlying data. A deep copy is made with this extra method.

bool operator==(Variable const &other) const noexcept

Resorts to a deep inspection, so two Variables with identical content in different memory locations are equivalent. Returns false if either Variable has invalid data, e.g. as a side effect of std::move.

bool operator!=(Variable const &other) const noexcept

Variable inequality test.

Public Static Functions

static bool missingVal(data_type const val)

Test if value is classified as missing.

mist::algorithm

Algorithms to divide and conquer Information Theory computations.

namespace mist::algorithm
class TupleSpace
#include <TupleSpace.hpp>

TupleSpace defines the set of tuples over which to run a computation search.

Public Functions

int addVariableGroup(std::string const &name, tuple_type const &vars)

Define a named logical group of variables

Parameters
  • name – group name

  • vars – set of variables in the group, duplicates will be ignored

Throws

TupleSpaceException – variable already listed in existing variable group

Returns

index of created variable group

void addVariableGroupTuple(std::vector<std::string> const &groups)

Add a variable group tuple

The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.

Parameters

groups – Array of group names

Throws

TupleSpaceException – group does not exist

void addVariableGroupTuple(tuple_type const &groups)

Add a variable group tuple

The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.

Parameters

groups – Array of group indexes, numbered in the order the groups were created

Throws

TupleSpaceException – group index out of range

std::vector<std::string> names() const

Get variable names

void set_names(std::vector<std::string> const &names)

Set variable names

unsigned long long count_tuples() const

Count the number of tuples generated by the TupleSpace as configured
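
To make the cross product of a group tuple concrete, here is a minimal standalone sketch. Tuple is a local stand-in for tuple_type, and cross_product and count_cross_product are hypothetical helpers, not mist functions. The documentation does not say whether mist drops or reorders tuples (for example when a variable would appear twice), so the sketch performs a plain cross product.

```cpp
#include <cassert>
#include <vector>

using Tuple = std::vector<int>;  // variable indexes; stand-in for tuple_type

// Expand one group tuple into the cross product of its groups' members.
std::vector<Tuple> cross_product(std::vector<Tuple> const &groups) {
    std::vector<Tuple> result(1);  // start with a single empty tuple
    for (Tuple const &group : groups) {
        assert(!group.empty());
        std::vector<Tuple> next;
        for (Tuple const &prefix : result)
            for (int var : group) {
                Tuple t = prefix;
                t.push_back(var);  // extend each partial tuple by one member
                next.push_back(t);
            }
        result.swap(next);
    }
    return result;
}

// Number of tuples the cross product yields, without materializing them.
unsigned long long count_cross_product(std::vector<Tuple> const &groups) {
    unsigned long long n = 1;
    for (Tuple const &group : groups) n *= group.size();
    return n;
}
```

For groups {0, 1} and {2, 3, 4}, the sketch produces six tuples, from (0, 2) through (1, 4).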

class TupleSpaceException : public exception
class Worker
#include <Worker.hpp>

The Worker class divides and conquers the tuple search space.

The Worker processes each tuple in the configured search space, or a portion of the search space depending on the rank parameters. It is common for each computing thread on the system to have a unique Worker instance.
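
One way a worker could select its portion of the search space is a strided (round-robin) split over tuple indexes. The documentation does not state which scheme Worker uses, so the sketch below is an assumed scheme for illustration only; indexes_for_rank is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tuple indexes handled by one worker under a strided (round-robin) split.
// Assumed scheme for illustration; not necessarily the one mist uses.
std::vector<std::size_t> indexes_for_rank(std::size_t n_tuples, int rank, int ranks) {
    assert(ranks > 0 && rank >= 0 && rank < ranks);
    std::vector<std::size_t> mine;
    for (std::size_t i = static_cast<std::size_t>(rank); i < n_tuples;
         i += static_cast<std::size_t>(ranks))
        mine.push_back(i);  // every ranks-th tuple, offset by this rank
    return mine;
}
```

Under this scheme every tuple index belongs to exactly one rank, and the workloads of any two ranks differ by at most one tuple.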

Public Functions

Worker(int rank, int ranks, long limit, TupleSpace const &ts, entropy_calc_ptr calc, std::vector<output_stream_ptr> out_streams, measure_ptr measure)

Construct and configure a Worker instance.

Parameters
  • rank – Zero-indexed rank number in [0, ranks)

  • ranks – Total number of Workers participating in the search

  • limit – Upper limit on the number of tuples to process by all Workers

  • ts – TupleSpace that defines the tuple search space

  • out_streams – Collection of OutputStream pointers to which results are sent

  • measure – The it::Measure to calculate the results

void start()

Start the Worker search space execution. Returns when all tuples in the search space have been processed.

class WorkerException : public exception

mist::cache

Cache intermediate results for performance improvement.

namespace mist::cache

Typedefs

using K = Variable::indexes
template<class V>
class Cache
#include <Cache.hpp>

Cache interface

Subclassed by mist::cache::Flat< V >, mist::cache::Map< V >, mist::cache::MRU< V >, mist::cache::SmallFiles< V >

Public Functions

virtual bool has(K const&) = 0

Test that key is in table

virtual std::pair<K, V> put(K const&, V const&) = 0

Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.

virtual std::shared_ptr<V> get(K const&) = 0

Return value at key.

Throws

out_of_range – Key not in table

virtual std::size_t size() = 0

Number of entries in table

virtual std::size_t bytes() = 0

Size in bytes of table

inline std::size_t hits()

Number of cache hits

inline std::size_t misses()

Number of cache misses

inline std::size_t evictions()

Number of cache evictions

template<class V>
class Flat : public mist::cache::Cache<V>
#include <Flat.hpp>

Fixed-size associative cache

Public Functions

inline virtual bool has(key_type const &key)

Test that key is in table

inline virtual std::pair<key_type, val_type> put(key_type const &key, val_type const &val)

Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.

inline virtual std::shared_ptr<V> get(key_type const &key)

Return value at key.

Throws

out_of_range – Key not in table

inline virtual std::size_t size()

Number of entries in table

inline virtual std::size_t bytes()

Size in bytes of table

class FlatException : public exception
class FlatOutOfRange : public out_of_range
template<class V>
class Map : public mist::cache::Cache<V>
#include <Map.hpp>

Dynamically-expanding associative cache.

Public Functions

inline virtual bool has(K const &key)

Test that key is in table

inline virtual std::pair<K, V> put(K const &key, V const &val)

Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.

inline virtual std::shared_ptr<V> get(K const &key)

Return value at key.

Throws

out_of_range – Key not in table

inline virtual std::size_t size()

Number of entries in table

inline virtual std::size_t bytes()

Size in bytes of table

class MapOutOfRange : public out_of_range
template<class V>
class MRU : public mist::cache::Cache<V>
#include <MRU.hpp>

Fixed-size associative cache that evicts the least recently added entry.

Public Functions

inline virtual bool has(K const &key)

Test that key is in table

inline virtual std::pair<K, V> put(K const &key, V const &val)

Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.

inline virtual std::shared_ptr<V> get(K const &key)

Return value at key.

Throws

out_of_range – Key not in table

inline virtual std::size_t size()

Number of entries in table

inline virtual std::size_t bytes()

Size in bytes of table
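
The put/get contract shared by the cache classes (put returns the evicted element, get throws when the key is absent, and hit, miss, and eviction counters are kept) can be illustrated with a small standalone sketch. FifoCache is a hypothetical class, not the mist implementation: it stores double values, uses a std::vector<int> key as a stand-in for Variable::indexes, returns values by copy rather than by shared pointer, and evicts the entry that was added earliest.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

using Key = std::vector<int>;  // stand-in for Variable::indexes

// Hypothetical fixed-capacity cache that evicts the earliest-added entry.
class FifoCache {
public:
    explicit FifoCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool has(Key const &k) const { return table_.count(k) != 0; }

    // Insert or overwrite. Returns the evicted entry, or an empty key and 0.0
    // when nothing had to be removed.
    std::pair<Key, double> put(Key const &k, double v) {
        std::pair<Key, double> evicted(Key(), 0.0);
        if (!has(k)) {
            if (table_.size() == capacity_) {
                Key oldest = order_.front();
                order_.pop_front();
                evicted = std::make_pair(oldest, table_.at(oldest));
                table_.erase(oldest);
                ++evictions_;
            }
            order_.push_back(k);
        }
        table_[k] = v;
        return evicted;
    }

    // Throws std::out_of_range when the key is absent; counts hits and misses.
    double get(Key const &k) {
        std::map<Key, double>::const_iterator it = table_.find(k);
        if (it == table_.end()) {
            ++misses_;
            throw std::out_of_range("key not in table");
        }
        ++hits_;
        return it->second;
    }
    std::size_t size() const { return table_.size(); }
    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }
    std::size_t evictions() const { return evictions_; }

private:
    std::size_t capacity_;
    std::map<Key, double> table_;
    std::deque<Key> order_;  // insertion order, oldest at the front
    std::size_t hits_ = 0, misses_ = 0, evictions_ = 0;
};
```

Returning the evicted pair from put lets the caller decide what to do with it, for example hand it to a slower backing cache.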

class MRUOutOfRange : public out_of_range
template<class V>
class SmallFiles : public mist::cache::Cache<V>
#include <SmallFiles.hpp>

Filesystem cache with each value a small file.

Public Functions

inline virtual bool has(K const &key)

Test that key is in table

inline virtual std::pair<K, V> put(K const &key, V const &val)

Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.

inline virtual std::shared_ptr<V> get(K const &key)

Return value at key.

Throws

out_of_range – Key not in table

inline virtual std::size_t size()

Number of entries in table

inline virtual std::size_t bytes()

Size in bytes of table

class SmallFilesOutOfRange : public out_of_range

mist::io

Input/Output

namespace mist::io
class DataMatrix
#include <DataMatrix.hpp>

N x M input data matrix.

Columns are interpreted as variables with each row a sample.

class DataMatrixException : public exception
class FileOutputStream : public mist::io::OutputStream
class FileOutputStreamException : public exception
class MapOutputStream : public mist::io::OutputStream
class OutputStream

Subclassed by mist::io::FileOutputStream, mist::io::MapOutputStream

mist::it

Information Theory definitions and algorithms.

namespace mist::it

Typedefs

using Bitset = boost::dynamic_bitset<unsigned long long>
using BitsetVariable = std::vector<Bitset>
using BitsetTable = std::vector<BitsetVariable>
using DistributionData = double
using entropy_type = double
using Entropy = std::vector<entropy_type>

Enums

enum d1

Values:

enumerator e0
enumerator size
enum d2

Values:

enumerator e0
enumerator e1
enumerator e01
enumerator size
enum d3

Values:

enumerator e0
enumerator e1
enumerator e2
enumerator e01
enumerator e02
enumerator e12
enumerator e012
enumerator size
enum d4

Values:

enumerator e0
enumerator e1
enumerator e2
enumerator e3
enumerator e01
enumerator e02
enumerator e03
enumerator e12
enumerator e13
enumerator e23
enumerator e012
enumerator e013
enumerator e023
enumerator e123
enumerator e0123
enumerator size
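
The enumerators of d1 through d4 appear to list every non-empty subset of the tuple positions, ordered by subset size and then lexicographically, with the trailing size enumerator equal to the number of subsets. The standalone sketch below reproduces that ordering; subset_labels is a hypothetical helper, and the reading of the enumerators as entropy sub-calculation indexes is an inference from the names rather than something the documentation states.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Labels of all non-empty subsets of d variables, ordered by subset size and
// then lexicographically, e.g. d = 3 gives e0 e1 e2 e01 e02 e12 e012.
std::vector<std::string> subset_labels(int d) {
    assert(d >= 1 && d <= 9);  // single-digit variable positions only
    std::vector<std::string> labels;
    for (int k = 1; k <= d; ++k) {
        std::vector<int> pick(k);
        for (int i = 0; i < k; ++i) pick[i] = i;  // first k-subset: 0,1,...,k-1
        while (true) {
            std::string label = "e";
            for (int v : pick) label += static_cast<char>('0' + v);
            labels.push_back(label);
            // Advance to the next k-subset in lexicographic order.
            int i = k - 1;
            while (i >= 0 && pick[i] == d - k + i) --i;
            if (i < 0) break;  // last k-subset reached
            ++pick[i];
            for (int j = i + 1; j < k; ++j) pick[j] = pick[j - 1] + 1;
        }
    }
    return labels;
}
```

For d variables the sketch returns 2^d - 1 labels, which matches the position of the size enumerator in each enum (1, 3, 7, and 15).
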
class BitsetCounter : public mist::it::Counter
#include <BitsetCounter.hpp>

Generates a ProbabilityDistribution from a Variable tuple.

Recasts each Variable as an array of bitsets, one for each bin value. Computes the ProbabilityDistribution using bitwise AND operation and bit counting algorithm.
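
The bitset technique described above can be sketched for two variables as follows. This is a standalone illustration, not the BitsetCounter code: it uses a fixed-width std::bitset rather than boost::dynamic_bitset, the helper names are hypothetical, and it assumes that rows with a missing (negative) value are simply left out of every bitset.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kMaxRows = 64;  // fixed width keeps the sketch short
using Bits = std::bitset<kMaxRows>;

// One bitset per bin: bit r is set when row r holds that bin value.
std::vector<Bits> to_bitsets(std::vector<int> const &column, int bins) {
    assert(bins > 0 && column.size() <= kMaxRows);
    std::vector<Bits> out(static_cast<std::size_t>(bins));
    for (std::size_t r = 0; r < column.size(); ++r) {
        assert(column[r] < bins);
        if (column[r] >= 0)  // negative values are missing and set no bit
            out[static_cast<std::size_t>(column[r])].set(r);
    }
    return out;
}

// Joint counts for two variables: counts[a][b] = rows where x == a and y == b.
std::vector<std::vector<std::size_t> > joint_counts(std::vector<Bits> const &x,
                                                   std::vector<Bits> const &y) {
    std::vector<std::vector<std::size_t> > counts(
        x.size(), std::vector<std::size_t>(y.size(), 0));
    for (std::size_t a = 0; a < x.size(); ++a)
        for (std::size_t b = 0; b < y.size(); ++b)
            counts[a][b] = (x[a] & y[b]).count();  // AND, then population count
    return counts;
}
```

The work per tuple grows with the product of the bin counts rather than with a per-row loop, which is consistent with the guidance that the Bitset algorithm suits large Variables with few bins.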

class BitsetCounterOutOfRange : public out_of_range
class Counter
#include <Counter.hpp>

Abstract class. Generates a Probability Distribution from a Variable tuple

Subclassed by mist::it::BitsetCounter, mist::it::VectorCounter

class Distribution
#include <Distribution.hpp>

Joint probability array for N variables

Public Functions

template<class Container>
inline Distribution(Container const &strides)

Construct directly from dimension strides

inline Distribution(Variable::tuple const &vars)

Construct from a Variable tuple

inline void scale(double factor)

Multiply each value in distribution by factor

inline void normalize()

Normalize distribution
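
A joint probability array addressed through dimension strides can be sketched like this. FlatDistribution is a hypothetical standalone type, not the mist Distribution class, and the choice that the first dimension varies fastest is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat joint array: one cell per combination of bin values.
struct FlatDistribution {
    std::vector<std::size_t> strides;  // step in the flat array per dimension
    std::vector<double> data;

    explicit FlatDistribution(std::vector<std::size_t> const &bins) {
        std::size_t total = 1;
        for (std::size_t b : bins) {
            assert(b > 0);
            strides.push_back(total);  // first dimension varies fastest
            total *= b;
        }
        data.assign(total, 0.0);
    }
    // Flat offset of the cell addressed by one bin value per dimension.
    std::size_t offset(std::vector<std::size_t> const &idx) const {
        assert(idx.size() == strides.size());
        std::size_t pos = 0;
        for (std::size_t i = 0; i < idx.size(); ++i) pos += idx[i] * strides[i];
        return pos;
    }
    // Multiply every cell by factor.
    void scale(double factor) {
        for (double &v : data) v *= factor;
    }
    // Rescale so the cells sum to 1; an all-zero array is left unchanged.
    void normalize() {
        double sum = 0.0;
        for (double v : data) sum += v;
        if (sum > 0.0) scale(1.0 / sum);
    }
};
```

Filling the array with raw counts and then calling normalize turns it into a probability distribution.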

class DistributionOutOfRange : public out_of_range
class EntropyCalculator
class EntropyCalculatorException : public exception
class EntropyMeasure : public mist::it::Measure

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const

Compute the information theory measure for the given variables, using ecalc to perform the entropy calculations.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const

Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.

virtual std::string header(int d, bool full_output) const

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

inline virtual bool full_entropy() const

Whether this measure uses intermediate entropy calculations
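
For reference, the entropy of a joint distribution can be computed from a table of counts as in the standalone sketch below. entropy_bits is a hypothetical helper, and reporting the result in bits (base-2 logarithm) is an assumption; the documentation does not state which logarithm base mist uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy in bits of a count table; empty cells contribute nothing.
double entropy_bits(std::vector<std::size_t> const &counts) {
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    assert(total > 0);  // at least one observation is required
    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;  // the limit of p * log(p) as p -> 0 is 0
        double p = static_cast<double>(c) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

Four equally likely cells give 2 bits, and a table with a single occupied cell gives 0.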

class EntropyMeasureException : public exception
class Measure

Subclassed by mist::it::EntropyMeasure, mist::it::SymmetricDelta

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const = 0

Compute the information theory measure for the given variables, using ecalc to perform the entropy calculations.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &entropy) const = 0

Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.

virtual std::string header(int d, bool full_output) const = 0

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

virtual bool full_entropy() const = 0

Whether this measure uses intermediate entropy calculations
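
To show how a measure might be derived from pre-computed entropies, the standalone sketch below computes mutual information for a pair of variables. It is not the SymmetricDelta formula. The local enum mirrors the documented d2 enumerators, and the assumption that an Entropy vector is ordered by those enumerators is mine, not something the documentation states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-ins that mirror the documented d2 enumerators and Entropy typedef.
enum d2_index { e0, e1, e01, d2_size };
using Entropy = std::vector<double>;

// Mutual information I(X0;X1) = H(X0) + H(X1) - H(X0,X1), from cached entropies.
double mutual_information(Entropy const &e) {
    assert(e.size() == static_cast<std::size_t>(d2_size));
    return e[e0] + e[e1] - e[e01];
}
```

Because the sub-entropies arrive pre-computed, the measure itself reduces to a few additions, which is why caching intermediate entropies can pay off.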

class SymmetricDelta : public mist::it::Measure

Public Functions

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const

Compute the information theory measure for the given variables, using ecalc to perform the entropy calculations.

Returns

final result

virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const

Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.

virtual std::string header(int d, bool full_output) const

Return a comma-separated header string corresponding to the full results

Parameters
  • d – tuple size

  • full_output – whether header should include all subcalculation names

Returns

header string

inline virtual bool full_entropy() const

Whether this measure uses intermediate entropy calculations

class SymmetricDeltaException : public exception
class VectorCounter : public mist::it::Counter
#include <VectorCounter.hpp>

Generates a ProbabilityDistribution from a Variable tuple.

Counts using standard algorithm.
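
A plain counting pass over a tuple of binned columns might look like the following standalone sketch. count_joint is a hypothetical helper, not the VectorCounter code, and skipping every row that has a missing (negative) value in any column is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count joint occurrences over a tuple of binned columns in a single pass.
// Rows with a missing (negative) value in any column are skipped.
std::vector<std::size_t> count_joint(std::vector<std::vector<int> > const &columns,
                                     std::vector<int> const &bins) {
    assert(!columns.empty() && columns.size() == bins.size());
    std::size_t cells = 1;
    for (int b : bins) {
        assert(b > 0);
        cells *= static_cast<std::size_t>(b);
    }
    std::vector<std::size_t> counts(cells, 0);
    for (std::size_t r = 0; r < columns[0].size(); ++r) {
        std::size_t pos = 0, stride = 1;
        bool missing = false;
        for (std::size_t v = 0; v < columns.size(); ++v) {
            int val = columns[v][r];
            if (val < 0) { missing = true; break; }
            pos += static_cast<std::size_t>(val) * stride;  // flat cell index
            stride *= static_cast<std::size_t>(bins[v]);
        }
        if (!missing) ++counts[pos];
    }
    return counts;
}
```

The cost here scales with the number of rows and is largely independent of the number of bins, which fits the guidance that the Vector algorithm performs best for small Variables or many value bins.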

class VectorCounterException : public exception