API
mist consists of logically distinct components, each encapsulated in its own namespace. Classes access other namespaces through an interface class. Users typically need only the classes in the root namespace; developers will need the rest.
mist
The root namespace includes composition classes and classes common to the sub-namespaces.
-
class mist::Search
Main user interface for mist runtime.
C++ and Python users instantiate this class, load data, and optionally call configuration methods to define the computation. Computations begin with start(). The object maintains state between runs, such as intermediate value caches, for improved performance.
Public Functions
-
void set_measure(std::string const &measure)
Set the IT Measure to be computed.
Entropy : Compute only combined entropy.
SymmetricDelta (default) : A novel symmetric measure of shared information. See Sakhanenko, Galas in the literature.
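As a rough illustration of what the Entropy measure computes, the sketch below evaluates the joint (combined) Shannon entropy of a tuple of binned variables in plain Python. It does not call mist. The function name joint_entropy and the choice of log base 2 are assumptions made for illustration, and missing values are not modeled.

```python
from collections import Counter
from math import log2

def joint_entropy(*columns):
    """Shannon entropy (in bits) of the joint distribution of binned columns."""
    rows = list(zip(*columns))          # one tuple of bin values per sample
    n = len(rows)
    counts = Counter(rows)              # occurrences of each joint value
    return -sum((c / n) * log2(c / n) for c in counts.values())

x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
print(joint_entropy(x))      # 1.0 bit: two equally likely bins
print(joint_entropy(x, y))   # 2.0 bits: four equally likely joint values
```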
-
void set_cutoff(it::entropy_type cutoff)
Set the minimum IT Measure value to keep in results.
This option is most useful for dealing with very large TupleSpaces, the results for which cannot be stored in memory or on disk.
-
void set_probability_algorithm(std::string const &algorithm)
Set the algorithm for generating probability distributions.
Vector (default) : Process each Variable as a vector. Gives best performance when Variable size is small or when there are many value bins.
Bitset : Convert each distinct Variable value into a bitset to leverage bitwise operations. Gives best performance when Variable size is large and the number of value bins is small.
Performance of each algorithm depends strongly on the problem, i.e. the data, and potentially also on the system. After the number of threads, this parameter has the largest effect on runtime since distribution generation dominates the computation.
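The difference between the two strategies can be sketched in plain Python. This is not the library's implementation: the function names are hypothetical, Python integers stand in for bitsets, and only two variables are counted. It is meant only to show why the two approaches yield the same joint counts while doing different amounts of work.

```python
from collections import Counter

def count_vector(x, y):
    """Vector-style counting: scan the samples once and tally each (x, y) pair."""
    return dict(Counter(zip(x, y)))

def to_bitsets(column, bins):
    """One Python int per bin value; bit i is set when sample i falls in that bin."""
    masks = [0] * bins
    for i, value in enumerate(column):
        masks[value] |= 1 << i
    return masks

def count_bitset(x, y, x_bins, y_bins):
    """Bitset-style counting: AND the per-bin masks, then count the set bits."""
    xb, yb = to_bitsets(x, x_bins), to_bitsets(y, y_bins)
    counts = {}
    for a in range(x_bins):
        for b in range(y_bins):
            n = bin(xb[a] & yb[b]).count("1")   # population count
            if n:
                counts[(a, b)] = n
    return counts

x = [0, 1, 1, 0, 1]
y = [1, 1, 0, 1, 1]
print(count_vector(x, y) == count_bitset(x, y, 2, 2))   # True
```

In this sketch the bitset loop runs once per combination of bin values, which is consistent with the guidance above that Bitset suits long Variables with few bins.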
-
void set_outfile(std::string const &filename)
Set output CSV file.
-
void set_ranks(int ranks)
Set number of concurrent ranks to use in this Search.
A rank on a computation node is one execution thread. The default ranks is the number of threads allowed by the node. Setting ranks to 0 causes the system to use the maximum.
-
void set_start_rank(int rank)
Set the starting rank for this Search.
A Mist search can run in parallel on multiple nodes in a system. For each node, configure a Search with the starting rank, the number of ranks (i.e. threads) on the node, and the total ranks among all nodes. In this way you can divide the search space among the nodes in the system.
The starting rank is the zero-indexed rank number, valid over the range [0, total_ranks).
- Parameters
rank – Zero-indexed rank number
-
void set_total_ranks(int ranks)
Set the total number of ranks among all participating Searches.
Each thread on each node counts as one rank, so total_ranks is the sum of the configured ranks (threads) across all nodes.
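To make the relationship between start rank, ranks, and total ranks concrete, here is a minimal sketch that divides a tuple range among ranks. The contiguous, near-equal split and the helper names rank_slices and node_slices are assumptions for illustration; the source does not state how mist actually assigns tuples to ranks.

```python
def rank_slices(n_tuples, total_ranks):
    """Split [0, n_tuples) into total_ranks contiguous, near-equal half-open slices."""
    base, extra = divmod(n_tuples, total_ranks)
    slices, start = [], 0
    for rank in range(total_ranks):
        stop = start + base + (1 if rank < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices

def node_slices(n_tuples, start_rank, ranks, total_ranks):
    """Slices handled by one node that owns ranks start_rank .. start_rank + ranks - 1."""
    return rank_slices(n_tuples, total_ranks)[start_rank:start_rank + ranks]

# Two nodes: node A runs 2 threads, node B runs 3 threads, so total_ranks = 5.
print(node_slices(10, start_rank=0, ranks=2, total_ranks=5))  # [(0, 2), (2, 4)]
print(node_slices(10, start_rank=2, ranks=3, total_ranks=5))  # [(4, 6), (6, 8), (8, 10)]
```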
-
void set_tuple_size(int size)
Set the number of Variables to include in each IT measure computation.
-
void set_tuple_space(algorithm::TupleSpace const &ts)
Set the custom tuple space for the next computation
Side effects: sets the thread algorithm to TupleSpace so that the tuple space becomes effective immediately.
-
void set_tuple_limit(long limit)
Set the maximum number of tuples to process. The default is 0, meaning unlimited.
-
void set_show_progress(bool)
Toggle whether to write program progress to stderr.
When true, an extra thread will be made to watch progress through the TupleSpace. This option is especially useful for large searches to estimate how long the run will take.
-
void set_output_intermediate(bool)
Include all subcalculations in the output
-
void set_cache_enabled(bool)
Enable caching intermediate entropy calculation
-
void set_cache_size_bytes(unsigned long)
Set maximum size of entropy cache in bytes
-
void load_file(std::string const &filename)
Load Data from CSV or tab-separated file.
By default, the file is loaded in row-major order, i.e. each row is a variable.
- Parameters
filename – path to file
is_row_major – Set to true for row-major variables
- Pre
Each row has an equal number of columns.
-
void load_ndarray(np::ndarray const &np)
Load Data from Python Numpy::ndarray.
Data is loaded into the library following a zero-copy guarantee.
- Parameters
np – ndarray
- Pre
Array is NxM matrix of the expected dtype and C memory layout.
-
np::ndarray python_get_results()
Return a Numpy ndarray copy of all results
-
np::ndarray python_start()
Start search.
Compute the configured IT measure for all Variable tuples in the configured search space, and return up to tuple_limit results.
-
void start()
Begin computation.
Compute the configured IT measure for all Variable tuples in the configured search space.
-
std::vector<it::entropy_type> const &get_results()
Return a copy of all results
-
void printCacheStats()
Print cache statistics for each cache in each thread to stdout.
-
class mist::Variable
Variable wraps a pointer to a data column.
Public Functions
-
Variable(data_ptr src, std::size_t size, std::size_t index, std::size_t bins)
Variable constructor.
Wrap a shared pointer to column data along with metadata.
- Parameters
src – Shared pointer to memory allocated for the data column
size – Number of rows in the data column
index – Identifying column index into data matrix
bins – Number of data value bins
- Throws
invalid_argument – src holds a null pointer, or the size or bins argument is zero.
- Pre
src data has been allocated memory for at least size elements.
- Pre
src data values are binned to a contiguous non-negative integer array starting at 0.
- Pre
src missing data values are represented by negative integers.
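The preconditions above describe the encoding that the column data must already have. A minimal sketch of producing such an encoding from raw labels follows. The helper bin_column and its missing parameter are hypothetical and not part of mist; assigning bins in sorted label order is an arbitrary choice.

```python
def bin_column(values, missing=None):
    """Map raw values to contiguous integer bins 0..k-1; missing entries become -1."""
    labels = sorted({v for v in values if v != missing})
    index = {label: i for i, label in enumerate(labels)}
    binned = [index[v] if v != missing else -1 for v in values]
    return binned, len(labels)

binned, bins = bin_column(["low", "high", None, "mid", "low"])
print(binned, bins)   # [1, 0, -1, 2, 1] 3
```

The second return value corresponds to the bins argument of the constructor, and the length of the list to size.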
-
inline bool missing(std::size_t pos) const
Test if data at position is missing.
- Throws
std::out_of_range –
-
Variable deepCopy()
Variable uses default move and copy constructors, which are shallow and maintain the const requirement on the underlying data. A deep copy is made with this extra method.
mist::algorithm
Algorithms to divide and conquer Information Theory computations.
-
namespace mist::algorithm
-
class TupleSpace
- #include <TupleSpace.hpp>
Tuple Space defines the set of tuples over which to run a computation search.
Public Functions
-
int addVariableGroup(std::string const &name, tuple_t const &vars)
Define a named logical group of variables
- Parameters
name – group name
vars – set of variables in the group, duplicates will be ignored
- Throws
TupleSpaceException – variable already listed in existing variable group
- Returns
index of created variable group
-
void addVariableGroupTuple(std::vector<std::string> const &groups)
Add a variable group tuple
The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.
- Parameters
groups – Array of group names
- Throws
TupleSpaceException – group does not exist
-
void addVariableGroupTuple(tuple_t const &groups)
Add a variable group tuple
The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.
- Parameters
groups – Array of group indexes, numbered in the order the groups were created
- Throws
TupleSpaceException – group index out of range
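The cross product described for both overloads can be pictured with a short Python sketch. The group names and contents are made up, and expand_group_tuple is a hypothetical helper. How mist treats a group that appears more than once in a group tuple is not covered by the source and is not modeled here.

```python
from itertools import product

groups = {"genes": [0, 1], "traits": [2, 3, 4]}   # hypothetical group contents

def expand_group_tuple(groups, group_names):
    """Cross product of the named groups, yielding one variable tuple per combination."""
    return list(product(*(groups[name] for name in group_names)))

tuples = expand_group_tuple(groups, ["genes", "traits"])
print(tuples)   # [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]
```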
-
std::vector<std::string> names() const
Get variable names
-
void set_names(std::vector<std::string> const &names)
Set variable names
-
count_t count_tuples() const
Calculate the size of the tuple space, i.e. count the generated number of tuples.
-
void traverse(TupleSpaceTraverser &traverser) const
Walk through all tuples in the tuple space
- Parameters
traverser – Process each tuple with methods defined in specialization
-
void traverse(count_t start, count_t stop, TupleSpaceTraverser &traverser) const
Walk through a subset of tuples in the tuple space
TupleSpace generates an ordered list of tuples, so a walk can begin at any position in the list.
- Parameters
start – Begin the walk at tuple in position start
stop – End the walk at tuple in position stop
traverser – Process each tuple with methods defined in specialization
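A minimal sketch of walking a sub-range of an ordered tuple list follows. It assumes lexicographic ordering of variable combinations and a half-open range (start included, stop excluded); neither is stated in the source. Unlike a real implementation, islice simply skips the leading tuples rather than jumping directly to the start position.

```python
from itertools import combinations, islice

def traverse(n_variables, tuple_size, start, stop, visit):
    """Visit tuples at positions start <= i < stop of an ordered tuple list."""
    ordered = combinations(range(n_variables), tuple_size)
    for t in islice(ordered, start, stop):
        visit(t)

seen = []
traverse(4, 2, 2, 5, seen.append)
print(seen)   # [(0, 3), (1, 2), (1, 3)]
```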
-
void traverse_entropy(it::EntropyCalculator &ecalc, TupleSpaceTraverser &traverser) const
Walk through all tuples in the tuple space, computing entropy values as you go.
Some it::Measure classes compute the entropy values of sub-tuples. It is most efficient to compute these as you walk through the tuple space so intermediary values can be reused many times.
- Parameters
ecalc – it::EntropyCalculator object to perform entropy computations
traverser – Process each tuple with methods defined in specialization
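The benefit of reusing sub-tuple entropies can be sketched with a memoized entropy function. This is an illustration only: make_entropy is a hypothetical helper, a plain dict stands in for the library's caches, and the loop requests the single-variable and pairwise entropies that a pairwise measure might need.

```python
from collections import Counter
from itertools import combinations
from math import log2

def make_entropy(data):
    """Return a memoized entropy function over tuples of column indexes."""
    cache, calls = {}, Counter()

    def entropy(indexes):
        if indexes not in cache:
            calls["computed"] += 1
            rows = list(zip(*(data[i] for i in indexes)))
            counts = Counter(rows)
            cache[indexes] = -sum(c / len(rows) * log2(c / len(rows))
                                  for c in counts.values())
        return cache[indexes]

    return entropy, calls

data = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
entropy, calls = make_entropy(data)
for pair in combinations(range(3), 2):
    # Each pair needs its own joint entropy plus the two single-variable ones.
    values = [entropy((i,)) for i in pair] + [entropy(pair)]
print(calls["computed"])   # 6 distinct entropies computed for 9 requests
```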
-
void traverse_entropy(count_t start, count_t stop, it::EntropyCalculator &ecalc, TupleSpaceTraverser &traverser) const
Walk through a subset of tuples in the tuple space, computing entropy values as you go.
-
class TupleSpaceException : public exception
-
class TupleSpaceTraverser
- #include <TupleSpace.hpp>
Interface for processing tuples in the TupleSpace
A class can specialize the TupleSpaceTraverser to gain access to the stream of tuples generated by TupleSpace::traverse family of functions.
Subclassed by mist::algorithm::Worker
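The callback pattern behind this interface can be sketched in Python. The class names and the process_tuple method are hypothetical; the source does not list the actual methods that a specialization must define.

```python
from abc import ABC, abstractmethod
from itertools import combinations

class TupleTraverser(ABC):
    """Callback interface: the tuple space calls process_tuple once per tuple."""
    @abstractmethod
    def process_tuple(self, index, variables):
        ...

class CollectingTraverser(TupleTraverser):
    """Example specialization that simply records what it is given."""
    def __init__(self):
        self.collected = []

    def process_tuple(self, index, variables):
        self.collected.append((index, variables))

def traverse_pairs(n_variables, traverser):
    """Stand-in for a tuple space that streams all variable pairs to a traverser."""
    for index, pair in enumerate(combinations(range(n_variables), 2)):
        traverser.process_tuple(index, pair)

collector = CollectingTraverser()
traverse_pairs(3, collector)
print(collector.collected)   # [(0, (0, 1)), (1, (0, 2)), (2, (1, 2))]
```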
-
class Worker : public mist::algorithm::TupleSpaceTraverser
- #include <Worker.hpp>
The Worker class divides and conquers the tuple search space.
The Worker processes each tuple in the configured search space, or a portion of the search space depending on the rank parameters. It is common for each computing thread on the system to have a unique Worker instance.
Public Functions
-
Worker(tuple_space_ptr const &ts, count_t start, count_t stop, result_t cutoff, entropy_calc_ptr &calc, std::vector<output_stream_ptr> const &out_streams, measure_ptr const &measure)
Construct and configure a Worker instance.
- Parameters
ts – TupleSpace that defines the tuple search space
start – Start processing at start tuple number
stop – Stop processing when stop tuple number is reached
cutoff – Discard all tuples from output with a measure less than cutoff
out_streams – Collection of OutputStream pointers to which results are sent
measure – The it::Measure to calculate the results
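How these parameters interact can be sketched as a simple loop. This is a simplified stand-in, not the Worker implementation: run_worker and toy_measure are hypothetical, plain lists stand in for output streams, the range is assumed half-open, and the entropy calculator is omitted.

```python
from itertools import combinations, islice

def run_worker(tuples, start, stop, cutoff, measure, outputs):
    """Score tuples in [start, stop) and send results at or above cutoff to every output."""
    for t in islice(tuples, start, stop):
        value = measure(t)
        if value >= cutoff:          # discard results with a measure below cutoff
            for out in outputs:
                out.append((t, value))

def toy_measure(t):
    """Placeholder score, not an information theory measure."""
    return float(sum(t))

results = []
run_worker(combinations(range(4), 2), 0, 6, 3.0, toy_measure, [results])
print(results)   # [((0, 3), 3.0), ((1, 2), 3.0), ((1, 3), 4.0), ((2, 3), 5.0)]
```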
-
class WorkerException : public exception
mist::cache
Cache intermediate results for performance improvement.
-
namespace mist::cache
-
class Cache
- #include <Cache.hpp>
Cache interface
Subclassed by mist::cache::Flat1D, mist::cache::Flat2D
-
class Flat1D : public mist::cache::Cache
- #include <Flat1D.hpp>
Fixed sized associative cache
Public Functions
-
virtual bool has(key_type const &key)
Test that key is in table
-
virtual void put(key_type const &key, val_type const &val)
Insert value at key.
-
virtual val_type get(key_type const &key)
Return value at key.
- Throws
out_of_range – Key not in table
-
virtual std::size_t size()
Number of entries in table
-
virtual std::size_t bytes()
Size in bytes of table
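The five operations above can be mirrored by a small fixed-size table in Python. This is a sketch of the interface only: the internal layout of Flat1D is not described in the source, so the integer-indexed slot list, the FixedCache name, and the choice to ignore keys beyond capacity are all assumptions.

```python
import sys

class FixedCache:
    """Fixed-size table indexed by a non-negative integer key."""
    _EMPTY = object()

    def __init__(self, capacity):
        self._slots = [self._EMPTY] * capacity

    def has(self, key):
        return 0 <= key < len(self._slots) and self._slots[key] is not self._EMPTY

    def put(self, key, val):
        if 0 <= key < len(self._slots):   # keys beyond capacity are silently skipped
            self._slots[key] = val

    def get(self, key):
        if not self.has(key):
            raise KeyError(key)           # the C++ classes document out_of_range here
        return self._slots[key]

    def size(self):
        return sum(slot is not self._EMPTY for slot in self._slots)

    def bytes(self):
        return sys.getsizeof(self._slots)  # rough size of the slot table only

cache = FixedCache(4)
cache.put(2, 1.5)
print(cache.has(2), cache.has(3), cache.get(2), cache.size())   # True False 1.5 1
```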
-
class Flat1DException : public exception
-
class Flat1DOutOfRange : public out_of_range
-
class Flat2D : public mist::cache::Cache
- #include <Flat2D.hpp>
Fixed sized associative cache
Public Functions
-
virtual bool has(key_type const &key)
Test that key is in table
-
virtual void put(key_type const &key, val_type const &val)
Insert value at key.
-
virtual val_type get(key_type const &key)
Return value at key.
- Throws
out_of_range – Key not in table
-
virtual std::size_t size()
Number of entries in table
-
virtual std::size_t bytes()
Size in bytes of table
-
class Flat2DException : public exception
-
class Flat2DOutOfRange : public out_of_range
mist::io
Input/Output
-
namespace mist::io
-
class DataMatrix
- #include <DataMatrix.hpp>
N x M input data matrix.
Columns are interpreted as variables with each row a sample.
-
class DataMatrixException : public exception
-
class FileOutputStream : public mist::io::OutputStream
-
class FileOutputStreamException : public exception
-
class FlatOutputStream : public mist::io::OutputStream
Public Functions
-
void relocate(FlatOutputStream &other)
Move all data in other to this object
-
class FlatOutputStreamException : public exception
-
class MapOutputStream : public mist::io::OutputStream
-
class OutputStream
Subclassed by mist::io::FileOutputStream, mist::io::FlatOutputStream, mist::io::MapOutputStream
mist::it
Information Theory definitions and algorithms.
-
namespace mist::it
Typedefs
-
using Bitset = boost::dynamic_bitset<unsigned long long>
-
using BitsetTable = std::vector<BitsetVariable>
-
using DistributionData = double
-
using entropy_type = double
-
using Entropy = std::vector<entropy_type>
Enums
-
class BitsetCounter : public mist::it::Counter
- #include <BitsetCounter.hpp>
Generates a ProbabilityDistribution from a Variable tuple.
Recasts each Variable as an array of bitsets, one for each bin value. Computes the ProbabilityDistribution using bitwise AND operation and bit counting algorithm.
-
class BitsetCounterOutOfRange : public out_of_range
-
class Counter
- #include <Counter.hpp>
Abstract class. Generates a Probability Distribution from a Variable tuple
Subclassed by mist::it::BitsetCounter, mist::it::VectorCounter
-
class Distribution
- #include <Distribution.hpp>
Joint probability array for N variables
-
class DistributionOutOfRange : public out_of_range
-
class EntropyCalculator
-
class EntropyCalculatorException : public exception
-
class EntropyMeasure : public mist::it::Measure
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const
Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.
-
virtual std::string header(int d, bool full_output) const
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
virtual std::vector<std::string> const &names(int d, bool full_output) const
Return array of names for each column in the output
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
array of column names in the output
-
inline virtual bool full_entropy() const
Whether this measure uses intermediate entropy calculations
-
class EntropyMeasureException : public exception
-
class Measure
Subclassed by mist::it::EntropyMeasure, mist::it::SymmetricDelta
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const = 0
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &entropy) const = 0
Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.
-
virtual std::string header(int d, bool full_output) const = 0
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
virtual std::vector<std::string> const &names(int d, bool full_output) const = 0
Return array of names for each column in the output
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
array of column names in the output
-
virtual bool full_entropy() const = 0
Whether this measure uses intermediate entropy calculations
-
class SymmetricDelta : public mist::it::Measure
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const
Compute the information theory measure for the given variables, using pre-computed entropies. Only useful for measures that use entropy sub-calculations.
-
virtual std::string header(int d, bool full_output) const
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
virtual std::vector<std::string> const &names(int d, bool full_output) const
Return array of names for each column in the output
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
array of column names in the output
-
inline virtual bool full_entropy() const
Whether this measure uses intermediate entropy calculations
-
class SymmetricDeltaException : public exception
-
class VectorCounter : public mist::it::Counter
- #include <VectorCounter.hpp>
Generates a ProbabilityDistribution from a Variable tuple.
Counts using standard algorithm.
-
class VectorCounterException : public exception