Canopy  1.0
The header-only random forests library
Advanced Usage - Defining Your Own Models

If you want to define your own model using canopy, there are up to four steps you need to complete:

1. Choose the label type that your forest will predict.
2. Define the node distribution.
3. Define the output distribution.
4. Define the forest model itself.

These steps must be completed in this order because each depends on the previous. However, in some situations it may be possible to re-use one of the node or output distributions already defined in canopy, in which case you can skip that step. Furthermore, in some situations the node distribution and the output distribution may be the same (this is the case for the canopy::classifier and canopy::circularRegressor models), in which case you just need to implement the behaviour of both within one class.

The next four sections describe this process in detail. You should read them alongside the existing code for canopy::classifier and canopy::circularRegressor, along with canopy::discreteDistribution and canopy::vonMisesDistribution, in order to understand how to build your own forest model.

Choosing A Label Type

Firstly, you need to choose the data type that you want your forest to predict. In principle, this can be any data type. For example, the label type of the canopy::classifier is int, and the label type of the canopy::circularRegressor is float.

For the sake of this tutorial, let's assume that we've chosen a label type of myLabelType.
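For instance, if your labels were 2D offsets for a regression task, myLabelType might be a small struct. This is a purely hypothetical illustration (the struct and its fields are not part of canopy); any node and output distributions you define must then be defined over this type:

struct myLabelType
{
    // Hypothetical compound label type: a 2D offset to be predicted for each data point
    float dx;
    float dy;
};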

Defining Your Own Node Distribution

The node distribution class defines the distribution that is stored in the leaf node of each tree in the forest. Conceptually, it should capture the behaviour of a probability distribution over the label type (myLabelType in our example).

There are a number of methods that the class must define in order to be used as a node distribution within canopy. You are of course free to add other methods or properties to give the class the behaviour you need.

The following example gives the required layout of the class. It might be easier for you to copy this file, change the type names, and fill in the blanks:

#include <fstream>

class myNodeDist
{
public:
    template <class TLabelIterator, class TIdIterator>
    void fit(TLabelIterator first_label, TLabelIterator last_label, TIdIterator first_id)
    {
        /* Function used to fit the distribution to training data during
        forest training. The data are passed in using iterators pointing
        to the set of labels to fit to, and their IDs. In most cases, the
        IDs will be unused and only the labels will be relevant.
        Due to the way the function is called by the randomForestBase class,
        TLabelIterator will be a random access iterator type (supports []
        syntax) that dereferences to myLabelType. */
    }

    void printOut(std::ofstream& stream) const
    {
        /* Output parameters to 'stream' that can later be used by
        readIn() to fully reconstruct the distribution.
        This will probably involve recording parameters such as mean and
        variance, and possibly other information.
        This will be called by randomForestBase when storing the model
        to a file. */
    }

    void readIn(std::ifstream& stream)
    {
        /* Read in parameters from 'stream' and store them.
        This must match the format written by printOut().
        This will be called by randomForestBase when reading a stored
        model back in from a file. */
    }

    template <class TId>
    float pdf(const myLabelType x, const TId id) const
    {
        /* Return the probability of label x under the distribution.
        Note that the id parameter will be unused in many cases, as the
        probability will not depend on the ID, only the label.
        This is used by randomForestBase to perform the probability
        evaluation task. */
    }

    // Use operator<< to print to the file stream
    friend std::ofstream& operator<< (std::ofstream& stream, const myNodeDist& dist)
    {
        dist.printOut(stream);
        return stream;
    }

    // Use operator>> to read from the file stream
    friend std::ifstream& operator>> (std::ifstream& stream, myNodeDist& dist)
    {
        dist.readIn(stream);
        return stream;
    }

protected:
    // Distribution parameters etc
};
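To make the required interface more concrete, the following is a minimal sketch of how fit() and pdf() might look for a simple univariate Gaussian node distribution over a float label type, ignoring the ID values. This is an illustration only, not part of canopy, and a real implementation would also need the stream operators shown in the skeleton above:

#include <cmath>
#include <fstream>

class gaussianNodeDist
{
public:
    template <class TLabelIterator, class TIdIterator>
    void fit(TLabelIterator first_label, TLabelIterator last_label, TIdIterator /*first_id unused*/)
    {
        // Maximum-likelihood estimates of the mean and variance of the labels
        double sum = 0.0, sum_sq = 0.0;
        int n = 0;
        for (TLabelIterator it = first_label; it != last_label; ++it, ++n)
        {
            sum += *it;
            sum_sq += (*it) * (*it);
        }
        mu = (n > 0) ? sum / n : 0.0;
        sigma2 = (n > 0) ? sum_sq / n - mu * mu : 1.0;
        if (sigma2 <= 0.0)
            sigma2 = 1e-6; // guard against a degenerate (zero-variance) leaf
    }

    template <class TId>
    float pdf(const float x, const TId /*id unused*/) const
    {
        // Gaussian density evaluated at x
        const double pi = 3.14159265358979323846;
        return std::exp(-0.5 * (x - mu) * (x - mu) / sigma2) / std::sqrt(2.0 * pi * sigma2);
    }

    double mean() const { return mu; }

    void printOut(std::ofstream& stream) const { stream << mu << " " << sigma2; }
    void readIn(std::ifstream& stream) { stream >> mu >> sigma2; }

protected:
    double mu = 0.0;
    double sigma2 = 1.0;
};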

Defining Your Own Output Distribution

The output distribution class defines the distribution that is created as the result of the distribution prediction task. It combines (in some sense that you can define) the node distributions reached in each tree of the forest for a given data point to produce a new distribution.

The class is required to have the layout in the following example:

class myOutputDist
{
public:
    template <class TId>
    void combineWith(const myNodeDist& dist, const TId id)
    {
        /* Update the distribution to reflect the effect of combining it
        with a node distribution 'dist' */
    }

    void normalise()
    {
        /* Normalise the distribution after combining with several node
        distributions to ensure a valid distribution */
    }

    void reset()
    {
        /* Clear the results of all previous combinations to give a
        distribution that can be used to start the process on fresh data */
    }

protected:
    // Distribution parameters etc
};
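Again as an illustration only (not canopy code), an output distribution built on the hypothetical gaussianNodeDist sketched above might simply average the means of the leaf distributions it is combined with:

class averagingOutputDist
{
public:
    template <class TId>
    void combineWith(const gaussianNodeDist& dist, const TId /*id unused*/)
    {
        // Accumulate the mean of each leaf distribution reached
        sum_mean += dist.mean();
        ++n_combined;
    }

    void normalise()
    {
        // Convert the accumulated sum into an average
        if (n_combined > 0)
            mean_estimate = sum_mean / n_combined;
    }

    void reset()
    {
        // Discard the results of any previous prediction
        sum_mean = 0.0;
        mean_estimate = 0.0;
        n_combined = 0;
    }

    double mean() const { return mean_estimate; }

protected:
    double sum_mean = 0.0;
    double mean_estimate = 0.0;
    int n_combined = 0;
};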

To make sense of the tasks that should be performed by the three methods, it is helpful to understand the order in which they are called by the canopy::randomForestBase class. Recall from the basic usage instructions that the user code creates the output distribution object and passes it to the forest's canopy::randomForestBase::predictDistSingle() or canopy::randomForestBase::predictDistGroupwise() method (via an iterator). That method then uses the output distribution's methods as follows:

1. reset() is called to clear the results of any previous prediction.
2. combineWith() is called once per tree, passing the node distribution in the leaf node reached by the data point (or group of data points).
3. normalise() is called once at the end so that the result is a valid distribution.
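In other words, for a single data point the call pattern inside the base class is roughly equivalent to the following sketch, written here using the hypothetical types from the previous sections. This is illustrative pseudocode for the pattern, not the actual canopy implementation:

#include <vector>

averagingOutputDist predictionSketch(const std::vector<gaussianNodeDist>& leaves_reached, const int id)
{
    averagingOutputDist out;
    out.reset();                      // start from a clean distribution
    for (const gaussianNodeDist& leaf : leaves_reached)
        out.combineWith(leaf, id);    // one leaf distribution per tree
    out.normalise();                  // ensure the result is a valid distribution
    return out;
}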

Defining Your Own Forest Model

In order to define your own forest model you need to define a class whose layout follows the example below:

template <unsigned TNumParams>
class myForest : public randomForestBase<myForest<TNumParams>,myLabelType,myNodeDist,myOutputDist,TNumParams>
{
public:
    /* You'll probably want to define a custom constructor here, plus any
    other public methods */

protected:
    typedef typename randomForestBase<myForest<TNumParams>,myLabelType,myNodeDist,myOutputDist,TNumParams>::scoreInternalIndexStruct scoreInternalIndexStruct;

    void printHeaderDescription(std::ofstream &stream) const
    {
        /* Print a human-readable description of the contents of the header
        data. Anything printed here is ignored by the library */
    }

    void printHeaderData(std::ofstream &stream) const
    {
        /* Print a single line containing any parameters that must be stored
        in order to reconstruct the model (such as number of classes etc) */
    }

    void readHeader(std::ifstream &stream)
    {
        /* Read in the data printed using printHeaderData() in order to
        reconstruct a stored forest model from file */
    }

    void initialiseNodeDist(const int t, const int n)
    {
        /* Initialise a node distribution before fitting it during training.
        This can be used to perform any arbitrary action on the node distribution
        in this->forest[t].nodes[n].post[0] to prepare for fitting, such as
        initialising it with certain parameters and/or calling a custom
        constructor */
    }

    float minInfoGain(const int tree, const int node) const
    {
        /* Return the value of the information gain threshold for this node
        during training.
        If the actual information gain from the best split is below this,
        the node will become a leaf node. This can be used to give different
        behaviour in different nodes of the forest if desired, or can simply
        return a constant. */
    }

    template <class TLabelIterator, class TIdIterator>
    void trainingPrecalculations(const TLabelIterator first_label, const TLabelIterator last_label, const TIdIterator first_id)
    {
        /* This is called once at the start of the training routine and may
        be used to prepare for training on the supplied dataset, for example
        by precalculating values to speed up subsequent processes */
    }

    void cleanupPrecalculations()
    {
        /* This is called once at the end of the training routine and may be
        used, for example, to clear up any data no longer needed */
    }

    template <class TLabelIterator>
    float singleNodeImpurity(const TLabelIterator first_label, const std::vector<int>& nodebag, const int tree, const int node) const
    {
        /* Calculate the impurity of the labels in a given node before
        splitting in a given tree and node. This is used to compare to the
        value after splitting (found with bestSplit) in order to determine
        information gain. The labels are accessed via first_label[nodebag[0]],
        first_label[nodebag[1]] etc */
    }

    template <class TLabelIterator>
    void bestSplit(const std::vector<scoreInternalIndexStruct> &data_structs, const TLabelIterator first_label, const int tree, const int node, const float initial_impurity, float& info_gain, float& thresh) const
    {
        /* This is the key function in the training routine. It takes a list
        of labels in data_structs, which contains an integer-valued internal
        index (.id) and a float-valued feature score (.score) according to the
        feature functor with the chosen parameters. The elements of
        data_structs are sorted by increasing values of the score before being
        passed to this method.
        The labels of each of the training samples in the node can be accessed
        via first_label[data_structs[0].id], first_label[data_structs[1].id]
        etc.
        This method should find the best way to split the training samples
        with a single score threshold. The 'best' split is calculated with
        regard to the labels of the training samples, and is left to the
        user to define.
        The method returns (by reference) the chosen threshold (thresh) and
        the resulting information gain between the initial impurity before
        splitting (which is passed in as initial_impurity in order to avoid
        redundant repeated calculations) and the impurity after splitting. */
    }

    /* Other data etc */
};

Note that your class should inherit from the canopy::randomForestBase class, and therefore provide the 5 template parameters of the base class. The 2nd, 3rd, and 4th template parameters are respectively the label type, the node distribution type and the output distribution type, which need to match the relevant types of your node and output distribution models.

The first template parameter must be the type of your own forest model (being declared). This is in order to implement the CRTP idiom of static polymorphism.

The final template parameter is the number of parameters of the feature functor, and you will often want to make this a template parameter of your model also (as in the example) to allow for maximum flexibility.
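As a hypothetical usage sketch (the constructor and its parameters are assumptions for illustration; define whatever constructor suits your model), a custom model with a two-parameter feature functor might then be instantiated as follows:

// Assumes myForest defines a constructor taking the number of trees and the
// maximum tree depth; these parameters are hypothetical, not part of canopy
const unsigned n_trees = 128;
const unsigned tree_depth = 10;
myForest<2> forest(n_trees, tree_depth);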

The most complicated task in defining your own model is defining the training procedure, which is controlled by the implementations of the trainingPrecalculations, cleanupPrecalculations, singleNodeImpurity, bestSplit, and minInfoGain methods.

These methods are called by the base class's training procedure as follows:

1. trainingPrecalculations() is called once at the start of training, with access to the full set of training labels and IDs.
2. For each node to be trained, singleNodeImpurity() is called to find the impurity of that node's training labels before splitting.
3. bestSplit() is then called for each candidate set of feature parameters, receiving that initial impurity, and returns the best threshold and the resulting information gain.
4. The information gain of the best split found is compared to the value returned by minInfoGain() for that node; if it is lower, the node becomes a leaf node.
5. cleanupPrecalculations() is called once at the end of training.

The node distribution's fit method is also called as necessary to fit a node distribution to the training dataset.
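To make the split search more concrete, the following is a minimal sketch of a bestSplit-style threshold search for a float regression label, using the variance of the labels as the impurity measure. It is illustrative only, not canopy code: the tree and node arguments are omitted for brevity, and the struct below stands in for the base class's scoreInternalIndexStruct, assuming only the .id and .score fields described above:

#include <limits>
#include <vector>

struct scoreIndexSketch
{
    int id;      // internal index into the label range
    float score; // feature score; data_structs is sorted by this value
};

template <class TLabelIterator>
void bestSplitSketch(const std::vector<scoreIndexSketch>& data_structs,
                     const TLabelIterator first_label,
                     const float initial_impurity,
                     float& info_gain, float& thresh)
{
    const int n = static_cast<int>(data_structs.size());
    info_gain = 0.0f;
    thresh = 0.0f;
    if (n < 2)
        return; // nothing to split

    // Prefix sums of the labels (in score order) let us evaluate the variance
    // of the left and right partitions in constant time per candidate split
    std::vector<double> cum_sum(n + 1, 0.0), cum_sum_sq(n + 1, 0.0);
    for (int i = 0; i < n; ++i)
    {
        const double y = first_label[data_structs[i].id];
        cum_sum[i + 1] = cum_sum[i] + y;
        cum_sum_sq[i + 1] = cum_sum_sq[i] + y * y;
    }

    float best_impurity = std::numeric_limits<float>::max();
    int best_split = 1;

    // Consider splitting between each pair of adjacent (score-sorted) samples
    for (int i = 1; i < n; ++i)
    {
        const double n_l = i, n_r = n - i;
        const double mean_l = cum_sum[i] / n_l;
        const double mean_r = (cum_sum[n] - cum_sum[i]) / n_r;
        const double var_l = cum_sum_sq[i] / n_l - mean_l * mean_l;
        const double var_r = (cum_sum_sq[n] - cum_sum_sq[i]) / n_r - mean_r * mean_r;

        // Impurity after the split: sample-weighted average of the two variances
        const float impurity = static_cast<float>((n_l * var_l + n_r * var_r) / n);
        if (impurity < best_impurity)
        {
            best_impurity = impurity;
            best_split = i;
        }
    }

    // Place the threshold halfway between the scores either side of the best split
    thresh = 0.5f * (data_structs[best_split - 1].score + data_structs[best_split].score);
    info_gain = initial_impurity - best_impurity;
}

A classification model would follow the same pattern with a different impurity measure, such as the Gini impurity or the entropy of the class labels.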