API¶
Mist is composed of logically distinct components, each encapsulated in its own namespace. Classes access other namespaces through an interface class. Users typically need only the classes in the root namespace; developers will also need the sub-namespaces.
mist¶
The root namespace includes composition classes and classes common to the sub-namespaces.
-
class mist::Search¶
Main user interface for mist runtime.
C++ and Python users instantiate this class, load data, and optionally call configuration methods to define the computation. Computation begins with start(). The class maintains state between runs, such as intermediate value caches, to improve performance.
Public Functions
-
void set_measure(std::string const &measure)¶
Set the IT Measure to be computed.
Entropy : Compute only combined entropy.
SymmetricDelta (default) : A novel symmetric measure of shared information. See Sakhanenko, Galas in the literature.
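As a rough illustration of what an entropy measure over a Variable tuple involves, the sketch below estimates the joint Shannon entropy of several binned variables from sample counts. The function name `joint_entropy_bits`, the base-2 logarithm, and the rule of skipping any sample that contains a negative (missing) value are assumptions made for this example; they are not taken from the mist implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, not part of mist: joint Shannon entropy in bits of
// several equally sized binned variables (one inner vector per variable).
// Samples in which any variable holds a negative value are treated as
// missing and skipped (an assumption for this sketch).
double joint_entropy_bits(std::vector<std::vector<int>> const &vars)
{
    if (vars.empty()) return 0.0;
    std::size_t const n = vars[0].size();

    std::map<std::vector<int>, std::size_t> counts; // joint value -> occurrences
    std::size_t total = 0;
    for (std::size_t row = 0; row < n; ++row) {
        std::vector<int> key;
        bool missing = false;
        for (auto const &v : vars) {
            if (v[row] < 0) { missing = true; break; }
            key.push_back(v[row]);
        }
        if (missing) continue;
        ++counts[key];
        ++total;
    }

    // H = -sum over observed joint values of p * log2(p)
    double h = 0.0;
    for (auto const &kv : counts) {
        double const p = static_cast<double>(kv.second) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}
```

With a tuple size of 1 this reduces to the ordinary entropy of a single variable.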
-
void set_probability_algorithm(std::string const &algorithm)¶
Set the algorithm for generating probability distributions.
Vector (default) : Process each Variable as a vector. Gives best performance when Variable size is small or when there are many value bins.
Bitset : Convert each distinct Variable value into a bitset to leverage bitwise operations. Gives best performance when Variable size is large and the number of value bins is small.
Performance of each algorithm depends strongly on the problem, i.e. the data, and potentially also on the system. After the number of threads, this parameter has the largest effect on runtime since distribution generation dominates the computation.
-
void set_outfile(std::string const &filename)¶
Set output CSV file.
-
void set_ranks(int ranks)¶
Set number of concurrent ranks to use in this Search.
A rank on a computation node is one execution thread. The default number of ranks is the number of threads allowed by the node. Setting ranks to 0 causes the system to use the maximum.
-
void set_start_rank(int rank)¶
Set the starting rank for this Search.
A Mist search can run in parallel on multiple nodes in a system. For each node, configure a Search with the starting rank, the number of ranks (i.e. threads) on the node, and the total ranks among all nodes. In this way you can divide the search space among nodes in the system.
The starting rank is the zero-indexed rank number, valid over the range [0, total_ranks).
- Parameters
rank – Zero-indexed rank number
-
void set_total_ranks(int ranks)¶
Set the total number of ranks among all participating Searches.
Each thread on each node is counted as a rank. So the total_ranks is the sum of configured ranks (threads) on each node.
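The relationship between the three rank settings can be sketched as a small planning helper. The struct `NodeRanks` and the function `plan_ranks` are hypothetical names for this example, and laying ranks out contiguously node by node is an assumption; the resulting values are what one would then pass to set_start_rank(), set_ranks(), and set_total_ranks() on each node's Search.

```cpp
#include <vector>

// Hypothetical struct for this sketch: the three values a node would pass to
// set_start_rank(), set_ranks(), and set_total_ranks().
struct NodeRanks {
    int start_rank;
    int ranks;
    int total_ranks;
};

// Given the number of threads on each node, lay the ranks out contiguously:
// node 0 starts at rank 0, and each later node starts where the previous
// node's ranks end. total_ranks is the sum over all nodes.
std::vector<NodeRanks> plan_ranks(std::vector<int> const &threads_per_node)
{
    int total = 0;
    for (int t : threads_per_node) total += t;

    std::vector<NodeRanks> plan;
    int next_start = 0;
    for (int t : threads_per_node) {
        plan.push_back({next_start, t, total});
        next_start += t;
    }
    return plan;
}
```

For three nodes with 4, 8, and 4 threads, the plan gives starting ranks 0, 4, and 12, with 16 total ranks on every node.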
-
void set_tuple_size(int size)¶
Set the number of Variables to include in each IT measure computation.
-
void set_tuple_space(algorithm::TupleSpace const &ts)¶
Set the custom tuple space for the next computation
Side effects: sets the thread algorithm to TupleSpace so that the tuple space becomes effective immediately.
-
void set_tuple_limit(long limit)¶
Set the maximum number of tuples to process. The default is 0, meaning unlimited.
-
void set_output_intermediate(bool)¶
Include all subcalculations in the output
-
void set_cache_enabled(bool)¶
Enable caching of intermediate entropy calculations
-
void set_cache_size_bytes(unsigned long)¶
Set maximum size of entropy cache in bytes
-
void load_file(std::string const &filename)¶
Load Data from CSV or tab-separated file.
By default, the file is loaded in row-major order, i.e. each row is a variable.
- Parameters
filename – path to file
is_row_major – Set to true for row-major variables
- Pre
Each row has an equal number of columns.
-
void load_ndarray(np::ndarray const &np)¶
Load Data from Python Numpy::ndarray.
Data is loaded into the library following a zero-copy guarantee.
- Parameters
np – ndarray
- Pre
Array is NxM matrix of the expected dtype and C memory layout.
-
np::ndarray python_get_results()¶
Return a Numpy ndarray copy of all results
-
np::ndarray python_start()¶
Start search.
Compute the configured IT measure for all Variable tuples in the configured search space, and return up to tuple_limit results.
-
void start()¶
Begin computation.
Compute the configured IT measure for all Variable tuples in the configured search space.
-
io::MapOutputStream::map_type get_results()¶
Return a copy of all results
-
void printCacheStats()¶
Print cache statistics for each cache in each thread to stdout.
-
class mist::Variable¶
Variable wraps a pointer to a data column.
Public Functions
-
Variable(data_ptr src, std::size_t size, std::size_t index, std::size_t bins)¶
Variable constructor.
Wrap a shared pointer to column data along with metadata.
- Parameters
src – Shared pointer to memory allocated for the data column
size – Number of rows in the data column
index – Identifying column index into data matrix
bins – Number of data value bins
- Throws
invalid_argument – the pointer stored in src is null, or the size or bins argument is zero.
- Pre
src data has been allocated memory for at least size elements.
- Pre
src data values are binned to a contiguous non-negative integer array starting at 0.
- Pre
src missing data values are represented by negative integers.
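The preconditions above describe the encoding a data column must have before it is wrapped. A minimal sketch of producing that encoding from raw values follows. The helper `bin_values`, the use of `int` as the element type, and the choice of NaN as the raw marker for missing data are assumptions for this example, not part of mist.

```cpp
#include <cmath>
#include <map>
#include <vector>

// Hypothetical helper: map each distinct raw value to a bin index 0, 1, 2, ...
// in ascending order of value, and encode NaN (treated here as "missing")
// as -1. The number of bins found is written to bins_out.
std::vector<int> bin_values(std::vector<double> const &raw, int &bins_out)
{
    std::map<double, int> bin_of; // distinct non-missing value -> bin index
    for (double x : raw)
        if (!std::isnan(x)) bin_of.emplace(x, 0);

    int next = 0;
    for (auto &kv : bin_of) kv.second = next++; // contiguous, starting at 0

    std::vector<int> binned;
    binned.reserve(raw.size());
    for (double x : raw)
        binned.push_back(std::isnan(x) ? -1 : bin_of.at(x));

    bins_out = next;
    return binned;
}
```

The returned vector satisfies the two encoding preconditions: bins are contiguous non-negative integers starting at 0, and missing values are negative.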
-
inline bool missing(std::size_t pos) const¶
Test if data at position is missing.
- Throws
std::out_of_range –
-
data_type &at(std::size_t const pos)¶
- Throws
out_of_range –
-
Variable deepCopy()¶
Variable uses default move and copy constructors, which are shallow and maintain the const requirement on the underlying data. A deep copy is made with this extra method.
Public Static Functions
-
static bool missingVal(data_type const val)¶
Test if value is classified as missing.
mist::algorithm¶
Algorithms to divide and conquer Information Theory computations.
-
namespace mist::algorithm¶
-
class TupleSpace¶
- #include <TupleSpace.hpp>
Tuple Space defines the set of tuples over which to run a computation search.
Public Functions
-
int addVariableGroup(std::string const &name, tuple_type const &vars)¶
Define a named logical group of variables
- Parameters
name – group name
vars – set of variables in the group, duplicates will be ignored
- Throws
TupleSpaceException – variable already listed in existing variable group
- Returns
index of created variable group
-
void addVariableGroupTuple(std::vector<std::string> const &groups)¶
Add a variable group tuple
The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.
- Parameters
groups – Array of group names
- Throws
TupleSpaceException – group does not exist
-
void addVariableGroupTuple(tuple_type const &groups)¶
Add a variable group tuple
The cross product of groups in the group tuple generates a set of variable tuples that will be added to the TupleSpace by TupleSpaceTupleProducer.
- Parameters
groups – Array of group indexes, in order of creation
- Throws
TupleSpaceException – group index out of range
-
std::vector<std::string> names() const¶
Get variable names
-
void set_names(std::vector<std::string> const &names)¶
Set variable names
-
unsigned long long count_tuples() const¶
Count the number of tuples generated by the TupleSpace as configured
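The cross product described for addVariableGroupTuple can be sketched independently of the library. In the example below, `Tuple`, `cross_product`, and `count_cross_product` are hypothetical names, and the sketch forms a plain Cartesian product; whether mist additionally filters tuples (for example, repeated variables or permutations of the same set) is not stated here and is not modeled.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Tuple = std::vector<int>; // variable indexes; alias is local to this sketch

// Plain Cartesian product of the listed groups, first group varying slowest.
std::vector<Tuple> cross_product(std::vector<std::vector<int>> const &groups)
{
    std::vector<Tuple> tuples{Tuple{}}; // start with one empty tuple
    for (auto const &group : groups) {
        std::vector<Tuple> extended;
        for (auto const &prefix : tuples) {
            for (int var : group) {
                Tuple t = prefix;
                t.push_back(var);
                extended.push_back(t);
            }
        }
        tuples = std::move(extended);
    }
    return tuples;
}

// Number of tuples the product would contain, without materializing them:
// the product of the group sizes.
unsigned long long count_cross_product(std::vector<std::vector<int>> const &groups)
{
    unsigned long long n = 1;
    for (auto const &group : groups) n *= group.size();
    return n;
}
```

A group tuple made of a 2-variable group and a 3-variable group yields 6 variable tuples, each of size 2.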
-
class TupleSpaceException : public exception¶
-
class Worker¶
- #include <Worker.hpp>
The Worker class divides and conquers the tuple search space.
The Worker processes each tuple in the configured search space, or a portion of the search space depending on the rank parameters. It is common for each computing thread on the system to have a unique Worker instance.
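One common way to split a tuple search space among ranks is a strided assignment, sketched below. This is an illustrative assumption only: the function `positions_for_rank` is hypothetical, and the actual rule the Worker uses to pick its portion of the search space is not described in this section.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical partitioning rule: worker `rank` out of `ranks` takes tuple
// positions rank, rank + ranks, rank + 2 * ranks, ... below `tuple_count`.
// Every position is covered by exactly one valid rank.
std::vector<std::size_t> positions_for_rank(std::size_t rank, std::size_t ranks,
                                            std::size_t tuple_count)
{
    std::vector<std::size_t> mine;
    if (ranks == 0 || rank >= ranks) return mine; // invalid rank: no work
    for (std::size_t pos = rank; pos < tuple_count; pos += ranks)
        mine.push_back(pos);
    return mine;
}
```

A strided split keeps the per-rank workloads within one tuple of each other without any coordination between workers.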
Public Functions
-
Worker(int rank, int ranks, long limit, TupleSpace const &ts, entropy_calc_ptr calc, std::vector<output_stream_ptr> out_streams, measure_ptr measure)¶
Construct and configure a Worker instance.
- Parameters
rank – Zero-indexed rank number [0, ranks)
ranks – Total number of Workers participating in the search
limit – Upper limit on the number of tuples to process by all Workers
ts – TupleSpace that defines the tuple search space
out_streams – Collection of OutputStream pointers to which results are sent
measure – The it::Measure to calculate the results
-
class WorkerException : public exception¶
mist::cache¶
Cache intermediate results for performance improvement.
-
namespace mist::cache¶
-
template<class V>
class Cache¶
- #include <Cache.hpp>
Cache interface
Subclassed by mist::cache::Flat< V >, mist::cache::Map< V >, mist::cache::MRU< V >, mist::cache::SmallFiles< V >
Public Functions
-
virtual std::pair<K, V> put(K const&, V const&) = 0¶
Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.
-
virtual std::size_t size() = 0¶
Number of entries in table
-
virtual std::size_t bytes() = 0¶
Size in bytes of table
-
inline std::size_t hits()¶
Number of cache hits
-
inline std::size_t misses()¶
Number of cache misses
-
inline std::size_t evictions()¶
Number of cache evictions
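The put-returns-the-evicted-element contract and the hit, miss, and eviction counters can be illustrated with a small stand-in class. `TinyCache` below is hypothetical and is not one of the mist cache classes: it evicts the entry that was added earliest, reports an eviction through an out-parameter rather than through the return type documented above, and returns a reference from get() instead of a shared pointer.

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in, not a mist class: a bounded cache that evicts the
// entry added earliest and hands it back to the caller from put().
template <class K, class V>
class TinyCache {
public:
    explicit TinyCache(std::size_t capacity) : capacity_(capacity)
    {
        if (capacity_ == 0) throw std::invalid_argument("capacity must be > 0");
    }

    // Insert or overwrite. Returns true and fills `evicted` when an older
    // entry had to be removed to make room.
    bool put(K const &key, V const &val, std::pair<K, V> &evicted)
    {
        auto it = table_.find(key);
        if (it != table_.end()) { it->second = val; return false; }

        bool did_evict = false;
        if (table_.size() == capacity_) {
            K const oldest = order_.front();
            order_.pop_front();
            evicted = std::make_pair(oldest, table_.at(oldest));
            table_.erase(oldest);
            ++evictions_;
            did_evict = true;
        }
        table_.emplace(key, val);
        order_.push_back(key);
        return did_evict;
    }

    // Look up a key, counting the hit or miss; throws if the key is absent.
    V const &get(K const &key)
    {
        auto it = table_.find(key);
        if (it == table_.end()) {
            ++misses_;
            throw std::out_of_range("key not in table");
        }
        ++hits_;
        return it->second;
    }

    bool has(K const &key) const { return table_.count(key) != 0; }
    std::size_t size() const { return table_.size(); }
    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }
    std::size_t evictions() const { return evictions_; }

private:
    std::size_t capacity_;
    std::unordered_map<K, V> table_;
    std::deque<K> order_; // keys in insertion order, oldest at the front
    std::size_t hits_ = 0, misses_ = 0, evictions_ = 0;
};
```

Returning the evicted entry lets the caller decide what to do with it, for example pushing it into a slower second-level cache.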
-
template<class V>
class Flat : public mist::cache::Cache<V>¶
- #include <Flat.hpp>
Fixed sized associative cache
Public Functions
-
inline virtual bool has(key_type const &key)¶
Test that key is in table
-
inline virtual std::pair<key_type, val_type> put(key_type const &key, val_type const &val)¶
Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.
-
inline virtual std::shared_ptr<V> get(key_type const &key)¶
Return value at key.
- Throws
out_of_range – Key not in table
-
inline virtual std::size_t size()¶
Number of entries in table
-
inline virtual std::size_t bytes()¶
Size in bytes of table
-
class FlatException : public exception¶
-
class FlatOutOfRange : public out_of_range¶
-
template<class V>
class Map : public mist::cache::Cache<V>¶
- #include <Map.hpp>
Dynamically-expanding associative cache.
Public Functions
-
inline virtual std::pair<K, V> put(K const &key, V const &val)¶
Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.
-
inline virtual std::shared_ptr<V> get(K const &key)¶
Return value at key.
- Throws
out_of_range – Key not in table
-
inline virtual std::size_t size()¶
Number of entries in table
-
inline virtual std::size_t bytes()¶
Size in bytes of table
-
class MapOutOfRange : public out_of_range¶
-
template<class V>
class MRU : public mist::cache::Cache<V>¶
- #include <MRU.hpp>
Fixed sized associative cache with least recently added eviction.
Public Functions
-
inline virtual std::pair<K, V> put(K const &key, V const &val)¶
Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.
-
inline virtual std::shared_ptr<V> get(K const &key)¶
Return value at key.
- Throws
out_of_range – Key not in table
-
inline virtual std::size_t size()¶
Number of entries in table
-
inline virtual std::size_t bytes()¶
Size in bytes of table
-
class MRUOutOfRange : public out_of_range¶
-
template<class V>
class SmallFiles : public mist::cache::Cache<V>¶
- #include <SmallFiles.hpp>
Filesystem cache with each value a small file.
Public Functions
-
inline virtual std::pair<K, V> put(K const &key, V const &val)¶
Insert value at key. An element will be removed if the table size would be exceeded and returned for handling.
-
inline virtual std::shared_ptr<V> get(K const &key)¶
Return value at key.
- Throws
out_of_range – Key not in table
-
inline virtual std::size_t size()¶
Number of entries in table
-
inline virtual std::size_t bytes()¶
Size in bytes of table
-
class SmallFilesOutOfRange : public out_of_range¶
mist::io¶
Input/Output
-
namespace mist::io¶
-
class DataMatrix¶
- #include <DataMatrix.hpp>
N x M input data matrix.
Columns are interpreted as variables with each row a sample.
-
class DataMatrixException : public exception¶
-
class FileOutputStream : public mist::io::OutputStream¶
-
class FileOutputStreamException : public exception¶
-
class MapOutputStream : public mist::io::OutputStream¶
-
class OutputStream¶
Subclassed by mist::io::FileOutputStream, mist::io::MapOutputStream
mist::it¶
Information Theory definitions and algorithms.
-
namespace mist::it¶
Typedefs
-
using Bitset = boost::dynamic_bitset<unsigned long long>¶
-
using BitsetTable = std::vector<BitsetVariable>¶
-
using DistributionData = double¶
-
using entropy_type = double¶
-
using Entropy = std::vector<entropy_type>¶
Enums
-
class BitsetCounter : public mist::it::Counter¶
- #include <BitsetCounter.hpp>
Generates a ProbabilityDistribution from a Variable tuple.
Recasts each Variable as an array of bitsets, one for each bin value. Computes the ProbabilityDistribution using bitwise AND operation and bit counting algorithm.
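The bitset approach can be sketched for a pair of variables as follows. The names `Words`, `to_bitsets`, and `joint_counts` are hypothetical, and the sketch uses plain 64-bit words with `std::bitset<64>::count()` rather than the `boost::dynamic_bitset` type that mist declares, so it shows the idea rather than the library's code.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

using Words = std::vector<std::uint64_t>; // one bit per sample, 64 samples per word

// One bitset per bin: bit `row` of result[b] is set when values[row] == b.
// Negative (missing) values set no bit in any bin.
std::vector<Words> to_bitsets(std::vector<int> const &values, int bins)
{
    std::size_t const words = (values.size() + 63) / 64;
    std::vector<Words> result(static_cast<std::size_t>(bins), Words(words, 0));
    for (std::size_t row = 0; row < values.size(); ++row) {
        int const v = values[row];
        if (v >= 0 && v < bins)
            result[static_cast<std::size_t>(v)][row / 64] |= std::uint64_t{1} << (row % 64);
    }
    return result;
}

// Joint counts for two variables: counts[i][j] = popcount(a[i] AND b[j]).
// Precondition: a and b were built from variables with the same sample count.
std::vector<std::vector<std::size_t>> joint_counts(std::vector<Words> const &a,
                                                   std::vector<Words> const &b)
{
    std::vector<std::vector<std::size_t>> counts(
        a.size(), std::vector<std::size_t>(b.size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            for (std::size_t w = 0; w < a[i].size(); ++w)
                counts[i][j] += std::bitset<64>(a[i][w] & b[j][w]).count();
    return counts;
}
```

Dividing each count by the number of non-missing samples gives the joint probability estimate. The work scales with the number of bin combinations times the number of 64-bit words, which is consistent with the guidance that this algorithm suits large Variables with few bins.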
-
class BitsetCounterOutOfRange : public out_of_range¶
-
class Counter¶
- #include <Counter.hpp>
Abstract class. Generates a Probability Distribution from a Variable tuple
Subclassed by mist::it::BitsetCounter, mist::it::VectorCounter
-
class Distribution¶
- #include <Distribution.hpp>
Joint probability array for N variables
-
class DistributionOutOfRange : public out_of_range¶
-
class EntropyCalculator¶
-
class EntropyCalculatorException : public exception¶
-
class EntropyMeasure : public mist::it::Measure¶
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const¶
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const¶
Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.
-
virtual std::string header(int d, bool full_output) const¶
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
inline virtual bool full_entropy() const¶
Whether this measure uses intermediate entropy calculations
-
class EntropyMeasureException : public exception¶
-
class Measure¶
Subclassed by mist::it::EntropyMeasure, mist::it::SymmetricDelta
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const = 0¶
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &entropy) const = 0¶
Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.
-
virtual std::string header(int d, bool full_output) const = 0¶
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
virtual bool full_entropy() const = 0¶
Whether this measure uses intermediate entropy calculations
-
class SymmetricDelta : public mist::it::Measure¶
Public Functions
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple) const¶
Compute the information theory measure with the computation ecalc for the given variables.
- Returns
final result
-
virtual result_type compute(EntropyCalculator &ecalc, Variable::indexes const &tuple, Entropy const &e) const¶
Compute the information theory measure with the given variables, using pre-computed entropies. Only useful for measures that use entropy subcalculations.
-
virtual std::string header(int d, bool full_output) const¶
Return a comma-separated header string corresponding to the full results
- Parameters
d – tuple size
full_output – whether header should include all subcalculation names
- Returns
header string
-
inline virtual bool full_entropy() const¶
Whether this measure uses intermediate entropy calculations
-
class SymmetricDeltaException : public exception¶
-
class VectorCounter : public mist::it::Counter¶
- #include <VectorCounter.hpp>
Generates a ProbabilityDistribution from a Variable tuple.
Counts using standard algorithm.
-
class VectorCounterException : public exception¶