Implements a random forest regression model to predict a circular-valued output label.
|
| circularRegressor () |
| Default constructor. More...
|
|
| circularRegressor (const int num_trees, const int num_levels, const float info_gain_tresh=C_DEFAULT_MIN_INFO_GAIN) |
| Full constructor. More...
|
|
Public Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams > |
| randomForestBase (const int num_trees, const int num_levels) |
| Full constructor. More...
|
|
bool | readFromFile (const std::string filename, const int trees_used=-1, const int max_depth_used=-1) |
| Read a pre-trained model in from a file. More...
|
|
bool | writeToFile (const std::string filename) const |
| Write a trained model to a .tr file to be stored and re-used. More...
|
|
bool | isValid () const |
| Check whether a forest model is valid. More...
|
|
void | setFeatureDefinitionString (const std::string &header_str, const std::string &feat_str) |
| Store arbitrary strings that define parameters of the feature extraction process. More...
|
|
void | getFeatureDefinitionString (std::string &feat_str) const |
| Retrieve a stored feature string. More...
|
|
void | train (const TIdIterator first_id, const TIdIterator last_id, const TLabelIterator first_label, TFeatureFunctor &&feature_functor, TParameterFunctor &&parameter_functor, const unsigned num_param_combos_to_test, const bool bagging=true, const float bag_proportion=C_DEFAULT_BAGGING_PROPORTION, const bool fit_split_nodes=true, const unsigned min_training_data=C_DEFAULT_MIN_TRAINING_DATA) |
| Train the random forest model on training data. More...
|
|
void | predictDistGroupwise (TIdIterator first_id, const TIdIterator last_id, TOutputIterator out_it, TFeatureFunctor &&feature_functor) const |
| Predict the output distribution for a number of IDs. More...
|
|
void | predictDistSingle (TIdIterator first_id, const TIdIterator last_id, TOutputIterator out_it, TFeatureFunctor &&feature_functor) const |
| Predict the output distribution for a number of IDs. More...
|
|
void | probabilityGroupwise (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TFeatureFunctor &&feature_functor) const |
| Evaluate the probability of a certain value of the label for a set of data points. More...
|
|
void | probabilitySingle (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TFeatureFunctor &&feature_functor) const |
| Evaluate the probability of a certain value of the label for a set of data points. More...
|
|
void | probabilityGroupwiseBase (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TBinaryFunction &&binary_function, TFeatureFunctor &&feature_functor, TPDFFunctor &&pdf_functor) const |
| A generalised version of the probabilityGroupwise() function that enables the creation of more general functions. More...
|
|
void | probabilitySingleBase (TIdIterator first_id, const TIdIterator last_id, TLabelIterator label_it, TOutputIterator out_it, const bool single_label, TBinaryFunction &&binary_function, TFeatureFunctor &&feature_functor, TPDFFunctor &&pdf_functor) const |
| A generalised version of the probabilitySingle() function that enables the creation of more general functions. More...
|
|
|
void | initialiseNodeDist (const int t, const int n) |
| Initialise a vonMisesDistribution as a node distribution for training. More...
|
|
template<class TLabelIterator > |
float | singleNodeImpurity (const TLabelIterator first_label, const std::vector< int > &nodebag, const int, const int) const |
| Calculate the impurity of the label set in a single node. More...
|
|
template<class TLabelIterator , class TIdIterator > |
void | trainingPrecalculations (const TLabelIterator first_label, const TLabelIterator last_label, const TIdIterator) |
| Preliminary calculations to perform before training begins. More...
|
|
void | cleanupPrecalculations () |
| Clean-up of data to perform after training ends. More...
|
|
template<class TLabelIterator > |
void | bestSplit (const std::vector< scoreInternalIndexStruct > &data_structs, const TLabelIterator first_label, const int, const int, const float initial_impurity, float &info_gain, float &thresh) const |
| Find the best way to split training data using the scores of a certain feature. More...
|
|
float | minInfoGain (const int, const int) const |
| Get the information gain threshold for a given node. More...
|
|
void | printHeaderDescription (std::ofstream &) const |
| Prints a string that allows a human to interpret the header information to a stream. More...
|
|
void | printHeaderData (std::ofstream &) const |
| Print the header information specific to the circularRegressor model to a stream. More...
|
|
void | readHeader (std::ifstream &) |
| Read the header information specific to the circularRegressor model from a stream. More...
|
|
Protected Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams > |
void | findLeavesGroupwise (TIdIterator first_id, const TIdIterator last_id, const int treenum, std::vector< const vonMisesDistribution * > &leaves, TFeatureFunctor &&feature_functor) const |
| Function to query a single tree model with a set of data points and store a pointer to the leaf distribution that each reaches. More...
|
|
const vonMisesDistribution * | findLeafSingle (const TId id, const int treenum, TFeatureFunctor &&feature_functor) const |
| Function to query a single tree model with a single data point and return a pointer to the leaf distribution that it reaches. More...
|
|
|
std::vector< double > | sin_precalc |
| Used during training to store pre-calculated sines of the training labels.
|
|
std::vector< double > | cos_precalc |
| Used during training to store pre-calculated cosines of the training labels.
|
|
float | min_info_gain |
| If, during training, the best information gain at a node falls below this threshold, a leaf node is declared.
|
|
Protected Attributes inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams > |
int | n_trees |
| The number of trees in the forest.
|
|
int | n_levels |
| The maximum number of levels in each tree.
|
|
int | n_nodes |
| The number of nodes in each tree.
|
|
bool | valid |
| Whether the forest model is currently valid and usable for predictions (true = valid)
|
|
bool | fit_split_nodes |
| Whether a node distribution is fitted to all nodes (true) or just the leaf nodes (false)
|
|
std::vector< tree > | forest |
| Vector of tree models.
|
|
std::string | feature_header |
| String describing the content of the feature string.
|
|
std::string | feature_string |
| Arbitrary string describing the feature extraction process.
|
|
std::default_random_engine | rand_engine |
| Random engine for generating random numbers during training, may also be used by derived classes.
|
|
std::uniform_int_distribution< int > | uni_dist |
| For generating random integers during training; may also be used by derived classes.
|
|
|
Static Protected Member Functions inherited from canopy::randomForestBase< circularRegressor< TNumParams >, float, vonMisesDistribution, vonMisesDistribution, TNumParams > |
static double | fastDiscreteEntropy (const std::vector< int > &internal_index, const int n_labels, const TLabelIterator first_label, const std::vector< double > &xlogx_precalc) |
| Calculates the entropy of the discrete labels of a set of data points using an efficient method. More...
|
|
static int | fastDiscreteEntropySplit (const std::vector< scoreInternalIndexStruct > &data_structs, const int n_labels, const TLabelIterator first_label, const std::vector< double > &xlogx_precalc, double &best_split_impurity, float &thresh) |
| Find the split in a set of training data that results in the best information gain for discrete labels. More...
|
|
static std::vector< double > | preCalculateXlogX (const int N) |
| Calculate an array of x*log(x) for integer x. More...
|
|
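The inherited static helpers preCalculateXlogX() and fastDiscreteEntropy() point to a common trick: the entropy of discrete labels can be computed from per-class counts n_i via H = log(N) - (1/N) * sum_i n_i*log(n_i), so a lookup table of x*log(x) for integer x avoids repeated logarithm calls. The following is a minimal sketch of that idea only. The function names, the natural-log base, and the table length of N+1 are assumptions made for illustration and are not taken from canopy.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in (not canopy's preCalculateXlogX): table[x] = x*log(x)
// for x = 0..N, with the x = 0 entry defined as 0.
std::vector<double> xlogxTable(const int N)
{
    assert(N >= 0);
    std::vector<double> table(N + 1, 0.0);
    for (int x = 1; x <= N; ++x)
        table[x] = x * std::log(static_cast<double>(x));
    return table;
}

// Entropy in nats from per-class counts, using the lookup table:
// H = log(N) - (1/N) * sum_i n_i*log(n_i), where N is the total count.
double entropyFromCounts(const std::vector<int>& counts,
                         const std::vector<double>& table)
{
    int total = 0;
    double sum_xlogx = 0.0;
    for (const int c : counts)
    {
        assert(c >= 0 && c < static_cast<int>(table.size()));
        total += c;
        sum_xlogx += table[c];
    }
    if (total == 0)
        return 0.0; // an empty set is treated as pure
    return std::log(static_cast<double>(total)) - sum_xlogx / total;
}
```

With this sketch, two equally populated classes give an entropy of log(2), and a set containing a single class gives zero.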
template<unsigned TNumParams>
class canopy::circularRegressor< TNumParams >
Implements a random forest regression model to predict a circular-valued output label.
This class uses the vonMisesDistribution as both the output distribution and the node distribution, and float as the type of the label to predict.
- Template Parameters
-
TNumParams | The number of parameters used by the features callback functor. |
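Because the labels are angles, ordinary averaging breaks down near the wrap-around point: the arithmetic mean of 3.0 and -3.0 radians is 0, although both angles lie close to pi. The sketch below shows the standard circular mean and mean resultant length computed from summed sines and cosines, which is the kind of quantity a von Mises fit is built on. The helper names are hypothetical and this is not canopy's vonMisesDistribution code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: circular mean of a set of angles given in radians.
// The result lies in [-pi, pi].
double circularMean(const std::vector<double>& angles)
{
    assert(!angles.empty());
    double sum_sin = 0.0, sum_cos = 0.0;
    for (const double a : angles)
    {
        sum_sin += std::sin(a);
        sum_cos += std::cos(a);
    }
    return std::atan2(sum_sin, sum_cos);
}

// Hypothetical helper: mean resultant length in [0, 1].
// Values near 1 indicate tightly clustered angles, values near 0 indicate
// widely dispersed angles.
double meanResultantLength(const std::vector<double>& angles)
{
    assert(!angles.empty());
    double sum_sin = 0.0, sum_cos = 0.0;
    for (const double a : angles)
    {
        sum_sin += std::sin(a);
        sum_cos += std::cos(a);
    }
    return std::sqrt(sum_sin * sum_sin + sum_cos * sum_cos) / angles.size();
}
```

For the angles 3.0 and -3.0 the circular mean has magnitude pi, as expected. Storing the sines and cosines of all training labels once means that sums like these can be formed for any subset of the labels without re-evaluating trigonometric functions.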
template<unsigned TNumParams>
template<class TLabelIterator >
void canopy::circularRegressor< TNumParams >::bestSplit (const std::vector< scoreInternalIndexStruct > &data_structs, const TLabelIterator first_label, const int, const int, const float initial_impurity, float &info_gain, float &thresh) const
protected
Find the best way to split training data using the scores of a certain feature.
This method takes a set of training data points and their scores resulting from some feature, and calculates the best score threshold that may be used to split the data into two partitions. The best split is the one that results in the greatest information gain in the child nodes, which in this case is based on squared circular distances from the circular mean.
This method is called automatically by the base class.
- Template Parameters
-
TLabelIterator | Type of the iterator used to access the circular-valued labels. Must be a random access iterator that dereferences to a floating point data type. |
- Parameters
-
data_structs | A vector in which each element is a structure containing the internal id (.id) and score (.score) for the current feature of the training data points. The vector is assumed to be sorted according to the score field in ascending order. |
first_label | Iterator to the labels from which the impurity is to be calculated. The labels should be located at the offsets from this iterator given by the IDs of the elements of the data_structs vector, i.e. first_label[data_structs[0].id], first_label[data_structs[1].id], etc. |
- | The third parameter is unused but required for compatibility with randomForestBase |
- | The fourth parameter is unused but required for compatibility with randomForestBase |
initial_impurity | The initial impurity of the node before the split. This must be calculated with singleNodeImpurity() and passed in. |
info_gain | The information gain associated with the best split (i.e. the maximum achievable information gain with this feature) is returned by reference in this parameter. |
thresh | The threshold value of the feature score corresponding to the best split is returned by reference in this parameter. |
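The split search described above can be illustrated with a brute-force sketch. It makes several assumptions that the documentation does not confirm: impurity is taken as the mean squared circular distance from the circular mean, information gain as the initial impurity minus the size-weighted child impurities, and the threshold as the midpoint between neighbouring scores. ScoredId is a hypothetical stand-in for scoreInternalIndexStruct, and all function names are illustrative. The loop is quadratic in the number of data points; an efficient version would presumably update running sums of precalculated sines and cosines.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for canopy's scoreInternalIndexStruct.
struct ScoredId
{
    int id;      // index of the data point's label
    float score; // feature score of the data point
};

// Smallest signed difference between two angles, wrapped to [-pi, pi].
double circularDiff(const double a, const double b)
{
    return std::atan2(std::sin(a - b), std::cos(a - b));
}

// Assumed impurity: mean squared circular distance of the labels of
// data[first..last) from their circular mean.
template <class TLabelIterator>
double impurity(const std::vector<ScoredId>& data, const std::size_t first,
                const std::size_t last, const TLabelIterator labels)
{
    assert(last > first && last <= data.size());
    double s = 0.0, c = 0.0;
    for (std::size_t i = first; i < last; ++i)
    {
        s += std::sin(labels[data[i].id]);
        c += std::cos(labels[data[i].id]);
    }
    const double mean = std::atan2(s, c);
    double sum_sq = 0.0;
    for (std::size_t i = first; i < last; ++i)
    {
        const double d = circularDiff(labels[data[i].id], mean);
        sum_sq += d * d;
    }
    return sum_sq / static_cast<double>(last - first);
}

// Try every boundary between distinct consecutive scores (data must be sorted
// by score in ascending order). Returns false if no boundary exists.
template <class TLabelIterator>
bool bestSplitSketch(const std::vector<ScoredId>& data,
                     const TLabelIterator labels,
                     const double initial_impurity,
                     float& info_gain, float& thresh)
{
    const std::size_t n = data.size();
    bool found = false;
    info_gain = -std::numeric_limits<float>::infinity();
    for (std::size_t k = 1; k < n; ++k)
    {
        if (data[k - 1].score == data[k].score)
            continue; // equal scores cannot be separated by a threshold
        const double left = impurity(data, 0, k, labels);
        const double right = impurity(data, k, n, labels);
        const double weighted = (k * left + (n - k) * right) / static_cast<double>(n);
        const float gain = static_cast<float>(initial_impurity - weighted);
        if (gain > info_gain)
        {
            info_gain = gain;
            thresh = 0.5f * (data[k - 1].score + data[k].score);
            found = true;
        }
    }
    return found;
}
```

For four points with labels 0.1, 0.2, 3.0 and 3.1 radians and scores 1, 2, 5 and 6, this sketch selects the boundary between the second and third points, giving a threshold of 3.5.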