Canopy 1.0
The header-only random forests library
canopy::circularRegressor< TNumParams > Class Template Reference

Implements a random forest regression model to predict a circular-valued output label. More...

#include <circularRegressor.hpp>

Inheritance diagram for canopy::circularRegressor< TNumParams >: (diagram not shown; inherits from canopy::randomForestBase)
Collaboration diagram for canopy::circularRegressor< TNumParams >: (diagram not shown)

Public Member Functions

 circularRegressor ()
 Default constructor. More...
 
 circularRegressor (const int num_trees, const int num_levels, const float info_gain_tresh=C_DEFAULT_MIN_INFO_GAIN)
 Full constructor. More...
 
- Public Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >
 randomForestBase (const int num_trees, const int num_levels)
 Full constructor. More...
 
bool readFromFile (const std::string filename, const int trees_used=-1, const int max_depth_used=-1)
 Read a pre-trained model in from a file. More...
 
bool writeToFile (const std::string filename) const
 Write a trained model to a .tr file to be stored and re-used. More...
 
bool isValid () const
 Check whether a forest model is valid. More...
 
void setFeatureDefinitionString (const std::string &header_str, const std::string &feat_str)
 Store arbitrary strings that define parameters of the feature extraction process. More...
 
void getFeatureDefinitionString (std::string &feat_str) const
 Retrieve a stored feature string. More...
 
void train (const TIdIterator first_id, const TIdIterator last_id, const TLabelIterator first_label, TFeatureFunctor &&feature_functor, TParameterFunctor &&parameter_functor, const unsigned num_param_combos_to_test, const bool bagging=true, const float bag_proportion=C_DEFAULT_BAGGING_PROPORTION, const bool fit_split_nodes=true, const unsigned min_training_data=C_DEFAULT_MIN_TRAINING_DATA)
 Train the random forest model on training data. More...
 
void predictDistGroupwise (TIdIterator first_id, const TIdIterator last_id, TOutputIterator out_it, TFeatureFunctor &&feature_functor) const
 Predict the output distribution for a number of IDs. More...
 
void predictDistSingle (TIdIterator first_id, const TIdIterator last_id, TOutputIterator out_it, TFeatureFunctor &&feature_functor) const
 Predict the output distribution for a number of IDs. More...
 
void probabilityGroupwise (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TFeatureFunctor &&feature_functor) const
 Evaluate the probability of a certain value of the label for a set of data points. More...
 
void probabilitySingle (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TFeatureFunctor &&feature_functor) const
 Evaluate the probability of a certain value of the label for a set of data points. More...
 
void probabilityGroupwiseBase (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TBinaryFunction &&binary_function, TFeatureFunctor &&feature_functor, TPDFFunctor &&pdf_functor) const
 A generalised version of the probabilityGroupwise() function that enables the creation of more general functions. More...
 
void probabilitySingleBase (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TBinaryFunction &&binary_function, TFeatureFunctor &&feature_functor, TPDFFunctor &&pdf_functor) const
 A generalised version of the probabilitySingle() function that enables the creation of more general functions. More...
 

Protected Types

typedef randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >::scoreInternalIndexStruct scoreInternalIndexStruct
 Forward the definition of the type declared in the randomForestBase class.
 

Protected Member Functions

void initialiseNodeDist (const int t, const int n)
 Initialise a vonMisesDistribution as a node distribution for training. More...
 
template<class TLabelIterator >
float singleNodeImpurity (const TLabelIterator first_label, const std::vector< int > &nodebag, const int, const int) const
 Calculate the impurity of the label set in a single node. More...
 
template<class TLabelIterator , class TIdIterator >
void trainingPrecalculations (const TLabelIterator first_label, const TLabelIterator last_label, const TIdIterator)
Preliminary calculations to perform before training begins. More...
 
void cleanupPrecalculations ()
 Clean-up of data to perform after training ends. More...
 
template<class TLabelIterator >
void bestSplit (const std::vector< scoreInternalIndexStruct > &data_structs, const TLabelIterator first_label, const int, const int, const float initial_impurity, float &info_gain, float &thresh) const
 Find the best way to split training data using the scores of a certain feature. More...
 
float minInfoGain (const int, const int) const
 Get the information gain threshold for a given node. More...
 
void printHeaderDescription (std::ofstream &) const
 Prints a string that allows a human to interpret the header information to a stream. More...
 
void printHeaderData (std::ofstream &) const
 Print the header information specific to the circularRegressor model to a stream. More...
 
void readHeader (std::ifstream &)
 Read the header information specific to the circularRegressor model from a stream. More...
 
- Protected Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >
void findLeavesGroupwise (TIdIterator first_id, const TIdIterator last_id, const int treenum, std::vector< const vonMisesDistribution * > &leaves, TFeatureFunctor &&feature_functor) const
 Function to query a single tree model with a set of data points and store a pointer to the leaf distribution that each reaches. More...
 
const vonMisesDistribution * findLeafSingle (const TId id, const int treenum, TFeatureFunctor &&feature_functor) const
 Function to query a single tree model with a single data point and return a pointer to the leaf distribution that it reaches. More...
 

Protected Attributes

std::vector< double > sin_precalc
 Used during training to store pre-calculated sines of the training labels.
 
std::vector< double > cos_precalc
 Used during training to store pre-calculated cosines of the training labels.
 
float min_info_gain
If, during training, the best information gain at a node falls below this threshold, a leaf node is declared.
 
- Protected Attributes inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >
int n_trees
 The number of trees in the forest.
 
int n_levels
 The maximum number of levels in each tree.
 
int n_nodes
 The number of nodes in each tree.
 
bool valid
 Whether the forest model is currently valid and usable for predictions (true = valid)
 
bool fit_split_nodes
 Whether a node distribution is fitted to all nodes (true) or just the leaf nodes (false)
 
std::vector< tree > forest
 Vector of tree models.
 
std::string feature_header
 String describing the content of the feature string.
 
std::string feature_string
 Arbitrary string describing the feature extraction process.
 
std::default_random_engine rand_engine
Random engine for generating random numbers during training; may also be used by derived classes.
 
std::uniform_int_distribution< int > uni_dist
For generating random integers during training; may also be used by derived classes.
 

Static Protected Attributes

static constexpr int C_NUM_SPLIT_TRIALS = 100
 This is the number of possible splits tested for each feature during training.
 
static constexpr float C_DEFAULT_MIN_INFO_GAIN = 0.1
 Default value for the information gain threshold.
 
- Static Protected Attributes inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >
static constexpr int C_DEFAULT_MIN_TRAINING_DATA
Default value for the minimum number of training data points in a node before a leaf is declared.
 
static constexpr float C_DEFAULT_BAGGING_PROPORTION
Default value for the proportion of the training set used to train each tree.
 

Additional Inherited Members

- Static Protected Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams >
static double fastDiscreteEntropy (const std::vector< int > &internal_index, const int n_labels, const TLabelIterator first_label, const std::vector< double > &xlogx_precalc)
 Calculates the entropy of the discrete labels of a set of data points using an efficient method. More...
 
static int fastDiscreteEntropySplit (const std::vector< scoreInternalIndexStruct > &data_structs, const int n_labels, const TLabelIterator first_label, const std::vector< double > &xlogx_precalc, double &best_split_impurity, float &thresh)
 Find the split in a set of training data that results in the best information gain for discrete labels. More...
 
static std::vector< double > preCalculateXlogX (const int N)
 Calculate an array of x*log(x) for integer x. More...
 

Detailed Description

template<unsigned TNumParams>
class canopy::circularRegressor< TNumParams >

Implements a random forest regression model to predict a circular-valued output label.

This class uses the vonMisesDistribution as both the output distribution and the node distribution, and float as the type of the label to predict.

Template Parameters
TNumParams: The number of parameters used by the features callback functor.
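
The following is a minimal usage sketch using only the members documented on this page; it is not part of the library. The TNumParams value of 2, the forest size, the feature definition strings and the file name are illustrative assumptions, and the call to train() is omitted because its functor arguments are specified by randomForestBase rather than reproduced here.

#include <string>
#include <circularRegressor.hpp>

int main()
{
    // Full constructor: 128 trees of maximum depth 12, default information gain threshold.
    // TNumParams = 2 is an illustrative choice for the number of feature parameters.
    canopy::circularRegressor<2> forest(128, 12);

    // Store arbitrary strings describing the feature extraction process, so that
    // the saved model file is self-describing.
    forest.setFeatureDefinitionString("patch_size,orientation_bins", "32,8");

    // ... train the forest here using train() ...

    // Write the trained model to a .tr file for later re-use.
    if (forest.isValid())
        forest.writeToFile("circular_model.tr");

    return 0;
}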

Constructor & Destructor Documentation

template<unsigned TNumParams>
canopy::circularRegressor< TNumParams >::circularRegressor ( )

Default constructor.

Note that an object initialised in this way should not be trained, but may be used to read in a pre-trained model using readFromFile().
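
For example, a minimal sketch of this pattern (not part of the library; the TNumParams value of 2 and the file name "circular_model.tr" are illustrative assumptions):

#include <iostream>
#include <string>
#include <circularRegressor.hpp>

int main()
{
    // A default-constructed forest is not trained directly; it is populated
    // from a previously saved model file.
    canopy::circularRegressor<2> forest;

    if (!forest.readFromFile("circular_model.tr"))
    {
        std::cerr << "Could not read model file" << std::endl;
        return 1;
    }

    if (forest.isValid())
    {
        std::string feat_str;
        forest.getFeatureDefinitionString(feat_str);
        std::cout << "Loaded model, feature definition: " << feat_str << std::endl;
    }
    return 0;
}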

template<unsigned TNumParams>
canopy::circularRegressor< TNumParams >::circularRegressor ( const int  num_trees,
const int  num_levels,
const float  info_gain_tresh = C_DEFAULT_MIN_INFO_GAIN 
)

Full constructor.

Creates a full forest with a specified number of trees and levels, ready to be trained.

Parameters
num_trees: The number of decision trees in the forest
num_levels: The maximum depth of any node in the trees
info_gain_tresh: The information gain threshold to use when training the model. Nodes where the best split is found to result in an information gain value less than this threshold are made into leaf nodes. Default: C_DEFAULT_MIN_INFO_GAIN

Member Function Documentation

template<unsigned TNumParams>
template<class TLabelIterator >
void canopy::circularRegressor< TNumParams >::bestSplit ( const std::vector< scoreInternalIndexStruct > &  data_structs,
const TLabelIterator  first_label,
const int  ,
const int  ,
const float  initial_impurity,
float &  info_gain,
float &  thresh 
) const
protected

Find the best way to split training data using the scores of a certain feature.

This method takes a set of training data points and their scores resulting from some feature, and calculates the best score threshold that may be used to split the data into two partitions. The best split is the one that results in the greatest information gain in the child nodes, which in this case is based on squared circular distances from the circular mean (a conceptual sketch follows the parameter list below).

This method is called automatically by the base class.

Template Parameters
TLabelIterator: Type of the iterator used to access the labels. Must be a random access iterator that dereferences to a floating point data type.
Parameters
data_structs: A vector in which each element is a structure containing the internal id (.id) and score (.score) for the current feature of the training data points. The vector is assumed to be sorted according to the score field in ascending order.
first_label: Iterator to the labels for which the impurity is to be calculated. The labels should be located at the offsets from this iterator given by the IDs of elements of the data_structs vector. I.e.
first_label[data_structs[0].id]
first_label[data_structs[1].id]
etc.
-: The third parameter is unused but required for compatibility with randomForestBase
-: The fourth parameter is unused but required for compatibility with randomForestBase
initial_impurity: The initial impurity of the node before the split. This must be calculated with singleNodeImpurity() and passed in.
info_gain: The information gain associated with the best split (i.e. the maximum achievable information gain with this feature) is returned by reference in this parameter.
thresh: The threshold value of the feature score corresponding to the best split is returned by reference in this parameter.
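
As a rough illustration of the idea, and of why the sines and cosines of the labels are pre-calculated (see sin_precalc and cos_precalc), the sketch below scans candidate thresholds over scores sorted in ascending order and keeps the split with the greatest reduction in a simple circular spread measure. It is a conceptual sketch only, not the library's implementation: the ScoredAngle struct is a hypothetical stand-in for the internal data structures, and the resultant-length-based spread (count minus resultant length) stands in for the library's squared-circular-distance criterion.

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in pairing a feature score with its angular label (radians).
struct ScoredAngle { float score; float label; };

// Returns {best_info_gain, best_threshold}; 'data' must be sorted by score, ascending.
std::pair<float, float> scanSplitsSketch(const std::vector<ScoredAngle>& data)
{
    const std::size_t n = data.size();

    // "Pre-calculation" step: cache the sine and cosine of every label once.
    std::vector<double> sinp(n), cosp(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        sinp[i] = std::sin(data[i].label);
        cosp[i] = std::cos(data[i].label);
    }

    // Circular spread of a partition from its summed sines/cosines: count - R,
    // where R is the resultant length. Zero when all angles coincide.
    auto spread = [](double s, double c, double count)
    {
        return count - std::sqrt(s * s + c * c);
    };

    double s_total = 0.0, c_total = 0.0;
    for (std::size_t i = 0; i < n; ++i) { s_total += sinp[i]; c_total += cosp[i]; }
    const double parent = spread(s_total, c_total, static_cast<double>(n));

    double s_left = 0.0, c_left = 0.0;
    float best_gain = 0.0f, best_thresh = 0.0f;
    for (std::size_t i = 0; i + 1 < n; ++i)   // candidate split between elements i and i+1
    {
        s_left += sinp[i];
        c_left += cosp[i];
        const double left  = spread(s_left, c_left, static_cast<double>(i + 1));
        const double right = spread(s_total - s_left, c_total - c_left,
                                    static_cast<double>(n - i - 1));
        const double gain  = parent - (left + right);
        if (gain > best_gain)
        {
            best_gain   = static_cast<float>(gain);
            best_thresh = 0.5f * (data[i].score + data[i + 1].score);
        }
    }
    return {best_gain, best_thresh};
}
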
template<unsigned TNumParams>
void canopy::circularRegressor< TNumParams >::cleanupPrecalculations ( )
protected

Clean-up of data to perform after training ends.

In this case this clears the pre-calculated arrays created by trainingPrecalculations()

This method is called automatically by the base class.

template<unsigned TNumParams>
void canopy::circularRegressor< TNumParams >::initialiseNodeDist ( const int  t,
const int  n 
)
protected

Initialise a vonMisesDistribution as a node distribution for training.

This method is called automatically by the base class.

Parameters
t: Index of the tree in which the distribution is to be initialised
n: Index of the node to be initialised within its tree

template<unsigned TNumParams>
float canopy::circularRegressor< TNumParams >::minInfoGain ( const int  ,
const int   
) const
protected

Get the information gain threshold for a given node.

In this case, this is a fixed value for all nodes. This method is called automatically by the base class.

Parameters
-: The first parameter is unused but required for compatibility with randomForestBase
-: The second parameter is unused but required for compatibility with randomForestBase
Returns
The threshold value for information gain. If a split results in an information gain less than this value, the node should be made a leaf instead.

template<unsigned TNumParams>
void canopy::circularRegressor< TNumParams >::printHeaderData ( std::ofstream &  ) const
protected

Print the header information specific to the circularRegressor model to a stream.

This header is blank in the case of the circularRegressor.

This method is called automatically by the base class.

Parameters
stream: The stream to which the header is printed.

template<unsigned TNumParams>
void canopy::circularRegressor< TNumParams >::printHeaderDescription ( std::ofstream &  ) const
protected

Prints a string that allows a human to interpret the header information to a stream.

This header is blank in the case of the circularRegressor. This method is called automatically by the base class.

Parameters
stream: The stream to which the header description is printed.

template<unsigned TNumParams>
void canopy::circularRegressor< TNumParams >::readHeader ( std::ifstream &  )
protected

Read the header information specific to the circularRegressor model from a stream.

This header is blank in the case of the circularRegressor.

This method is called automatically by the base class.

Parameters
stream: The stream from which the header information is read.

template<unsigned TNumParams>
template<class TLabelIterator >
float canopy::circularRegressor< TNumParams >::singleNodeImpurity ( const TLabelIterator  first_label,
const std::vector< int > &  nodebag,
const int  ,
const int   
) const
protected

Calculate the impurity of the label set in a single node.

This method takes the angular labels of a set of training data points and calculates the impurity of that set. In this case, this is based on the squared circular distance from the circular mean of the set (a conceptual sketch follows the parameter list below).

This method is called automatically by the base class.

Template Parameters
TLabelIterator: Type of the iterator used to access the labels. Must be a random access iterator that dereferences to a floating point data type.
Parameters
first_label: Iterator to the labels for which the impurity is to be calculated. The labels should be located at the offsets from this iterator given by the elements of the nodebag vector. I.e.
first_label[nodebag[0]]
first_label[nodebag[1]]
etc.
nodebag: Vector containing the internal training indices of the data points. These are the indices through which the labels may be accessed in first_label.
-: The third parameter is unused but required for compatibility with randomForestBase
-: The fourth parameter is unused but required for compatibility with randomForestBase
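
A minimal conceptual sketch of such an impurity measure is given below (not the library's code; it assumes the labels are angles in radians and uses a flat vector of labels for clarity, whereas the library reads them indirectly as first_label[nodebag[i]]).

#include <cmath>
#include <vector>

// Mean squared circular deviation of a set of angles from their circular mean.
double circularImpuritySketch(const std::vector<float>& labels)
{
    if (labels.empty())
        return 0.0;

    double s = 0.0, c = 0.0;
    for (const float theta : labels)
    {
        s += std::sin(theta);
        c += std::cos(theta);
    }
    const double mean = std::atan2(s, c);        // circular mean of the set
    const double two_pi = 2.0 * std::acos(-1.0); // avoids the non-standard M_PI

    double impurity = 0.0;
    for (const float theta : labels)
    {
        // Wrap each deviation into [-pi, pi] before squaring.
        const double d = std::remainder(theta - mean, two_pi);
        impurity += d * d;
    }
    return impurity / static_cast<double>(labels.size());
}
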
template<unsigned TNumParams>
template<class TLabelIterator , class TIdIterator >
void canopy::circularRegressor< TNumParams >::trainingPrecalculations ( const TLabelIterator  first_label,
const TLabelIterator  last_label,
const TIdIterator   
)
protected

Preliminary calculations to perform before training begins.

In this case, this pre-calculates arrays of the sines and cosines of the training labels so that they do not have to be recalculated many times during training (a minimal sketch follows the parameter list below).

This method is called automatically by the base class.

Template Parameters
TLabelIterator: Type of the iterator used to access the training labels. Must be a random access iterator that dereferences to a floating point data type.
TIdIterator: Type of the iterator used to access the IDs of the training data. The IDs are unused but required for compatibility with randomForestBase.
Parameters
first_label: Iterator to the first label in the training set
last_label: Iterator to the last label in the training set
-: The third parameter is unused but required for compatibility with randomForestBase
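
A minimal sketch of the idea (not the library's code; it assumes the labels are angles in radians and writes into free-standing vectors standing in for the sin_precalc and cos_precalc members):

#include <cmath>
#include <cstddef>
#include <vector>

// Cache the sine and cosine of every training label once, so that later
// impurity and split calculations avoid repeated trigonometric calls.
void precalculateTrig(const std::vector<float>& labels,
                      std::vector<double>& sin_precalc,
                      std::vector<double>& cos_precalc)
{
    sin_precalc.resize(labels.size());
    cos_precalc.resize(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        sin_precalc[i] = std::sin(labels[i]);
        cos_precalc[i] = std::cos(labels[i]);
    }
}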

The documentation for this class was generated from the following files: