Canopy  1.0
The header-only random forests library
Advanced Usage - Defining Your Own Models

If you want to define your own model using canopy, there are up to four steps you need to complete:

1. Choose the label type that your forest will predict.
2. Define the node distribution.
3. Define the output distribution.
4. Define the forest model itself.

These steps must be completed in this order because each depends on the previous. However, in some situations it may be possible to re-use one of the node or output distributions already defined in canopy, in which case you can skip that step. Furthermore, in some situations the node distribution and the output distribution may be the same (this is the case for the canopy::classifier and canopy::circularRegressor models), in which case you just need to implement the behaviour of both within one class.

The next four sections describe this process in detail. You should read them alongside the existing code for canopy::classifier and canopy::circularRegressor, along with canopy::discreteDistribution and canopy::vonMisesDistribution, in order to understand how to build your own forest model.

Choosing A Label Type

Firstly, you need to choose the data type that you want your forest to predict. In principle, this can be any data type. For example, the label type of the canopy::classifier is int, and the label type of the canopy::circularRegressor is float.

For the sake of this tutorial, let's assume that we've chosen a label type of myLabelType.
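For instance, if your labels were 2D offsets for a regression task, myLabelType might be a small struct. This is a purely hypothetical illustration (the struct and its fields are not part of canopy); any node and output distributions you define must then be defined over this type:

struct myLabelType
{
    // Hypothetical compound label type: a 2D offset to be predicted for each data point
    float dx;
    float dy;
};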

Defining Your Own Node Distribution

The node distribution class defines the distribution that is stored in the leaf node of each tree in the forest. Conceptually, it should capture the behaviour of a probability distribution over the label type (myLabelType in our example).

There are a number of methods that the class must define in order to be used as a node distribution within canopy. You are of course free to add other methods or properties to give the class the behaviour you need.

The following example gives the required layout of the class. It might be easier for you to copy this file, change the type names, and fill in the blanks:

#include <fstream>

class myNodeDist
{
public:
    template <class TLabelIterator, class TIdIterator>
    void fit(TLabelIterator first_label, TLabelIterator last_label, TIdIterator first_id)
    {
        /* Function used to fit the distribution to training data during
        forest training. The data are passed in using iterators pointing
        to the set of labels to fit to, and their IDs. In most cases, the
        IDs will be unused and only the labels will be relevant.
        Due to the way the function is called by the randomForestBase class,
        TLabelIterator will be a random access iterator type (supports []
        syntax) that dereferences to myLabelType. */
    }

    void printOut(std::ofstream& stream) const
    {
        /* Output parameters to 'stream' that can later be used by
        readIn() to fully reconstruct the distribution.
        This will probably involve recording parameters such as mean and
        variance, and possibly other information.
        This will be called by randomForestBase when storing the model
        to a file. */
    }

    void readIn(std::ifstream& stream)
    {
        /* Read in parameters from 'stream' and store them.
        This must match the format written by printOut().
        This will be called by randomForestBase when reading a stored
        model back in from a file. */
    }

    template <class TId>
    float pdf(const myLabelType x, const TId id) const
    {
        /* Return the probability of label x under the distribution.
        Note that the id parameter will be unused in many cases, as the
        probability will not depend on the ID, only the label.
        This is used by randomForestBase to perform the probability
        evaluation task. */
    }

    // Use operator<< to print to the file stream
    friend std::ofstream& operator<< (std::ofstream& stream, const myNodeDist& dist)
    {
        dist.printOut(stream);
        return stream;
    }

    // Use operator>> to read from the file stream
    friend std::ifstream& operator>> (std::ifstream& stream, myNodeDist& dist)
    {
        dist.readIn(stream);
        return stream;
    }

protected:
    // Distribution parameters etc
};
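To make the required interface more concrete, the following is a minimal sketch of how fit() and pdf() might look for a simple univariate Gaussian node distribution over a float label type, ignoring the ID values. This is an illustration only, not part of canopy, and a real implementation would also need the stream operators shown in the skeleton above:

#include <cmath>
#include <fstream>

class gaussianNodeDist
{
public:
    template <class TLabelIterator, class TIdIterator>
    void fit(TLabelIterator first_label, TLabelIterator last_label, TIdIterator /*first_id unused*/)
    {
        // Maximum-likelihood estimates of the mean and variance of the labels
        double sum = 0.0, sum_sq = 0.0;
        int n = 0;
        for (TLabelIterator it = first_label; it != last_label; ++it, ++n)
        {
            sum += *it;
            sum_sq += (*it) * (*it);
        }
        mu = (n > 0) ? sum / n : 0.0;
        sigma2 = (n > 0) ? sum_sq / n - mu * mu : 1.0;
        if (sigma2 <= 0.0)
            sigma2 = 1e-6; // guard against a degenerate (zero-variance) leaf
    }

    template <class TId>
    float pdf(const float x, const TId /*id unused*/) const
    {
        // Gaussian density evaluated at x
        const double pi = 3.14159265358979323846;
        return std::exp(-0.5 * (x - mu) * (x - mu) / sigma2) / std::sqrt(2.0 * pi * sigma2);
    }

    double mean() const { return mu; }

    void printOut(std::ofstream& stream) const { stream << mu << " " << sigma2; }
    void readIn(std::ifstream& stream) { stream >> mu >> sigma2; }

protected:
    double mu = 0.0;
    double sigma2 = 1.0;
};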

Defining Your Own Output Distribution

The output distribution class defines the distribution that is created as the result of the distribution prediction task. It combines (in some sense that you can define) the node distributions reached in each tree of the forest for a given data point to produce a new distribution.

The class is required to have the layout in the following example:

class myOutputDist
{
public:
    template <class TId>
    void combineWith(const myNodeDist& dist, const TId id)
    {
        /* Update the distribution to reflect the effect of combining it
        with a node distribution 'dist' */
    }

    void normalise()
    {
        /* Normalise the distribution after combining with several node
        distributions to ensure a valid distribution */
    }

    void reset()
    {
        /* Clear the results of all previous combinations to give a
        distribution that can be used to start the process on fresh data */
    }

protected:
    // Distribution parameters etc
};
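Again as an illustration only (not canopy code), an output distribution built on the hypothetical gaussianNodeDist sketched above might simply average the means of the leaf distributions it is combined with:

class averagingOutputDist
{
public:
    template <class TId>
    void combineWith(const gaussianNodeDist& dist, const TId /*id unused*/)
    {
        // Accumulate the mean of each leaf distribution reached
        sum_mean += dist.mean();
        ++n_combined;
    }

    void normalise()
    {
        // Convert the accumulated sum into an average
        if (n_combined > 0)
            mean_estimate = sum_mean / n_combined;
    }

    void reset()
    {
        // Discard the results of any previous prediction
        sum_mean = 0.0;
        mean_estimate = 0.0;
        n_combined = 0;
    }

    double mean() const { return mean_estimate; }

protected:
    double sum_mean = 0.0;
    double mean_estimate = 0.0;
    int n_combined = 0;
};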

To make sense of the tasks that should be performed by the three methods, it is helpful to understand the order in which they are called by the canopy::randomForestBase class. Recall from the basic usage instructions that the user code creates the output distribution object and passes it to the forest's canopy::randomForestBase::predictDistSingle() or canopy::randomForestBase::predictDistGroupwise() method (via an iterator). That method then uses the output distribution's methods as follows:

1. reset() is called to clear the results of any previous prediction.
2. combineWith() is called once per tree, passing the node distribution in the leaf node reached by the data point (or group of data points).
3. normalise() is called once at the end so that the result is a valid distribution.
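In other words, for a single data point the call pattern inside the base class is roughly equivalent to the following sketch, written here using the hypothetical types from the previous sections. This is illustrative pseudocode for the pattern, not the actual canopy implementation:

#include <vector>

averagingOutputDist predictionSketch(const std::vector<gaussianNodeDist>& leaves_reached, const int id)
{
    averagingOutputDist out;
    out.reset();                      // start from a clean distribution
    for (const gaussianNodeDist& leaf : leaves_reached)
        out.combineWith(leaf, id);    // one leaf distribution per tree
    out.normalise();                  // ensure the result is a valid distribution
    return out;
}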

Defining Your Own Forest Model

In order to define your own forest model you need to define a class whose layout follows the example below:

template <unsigned TNumParams>
class myForest : public randomForestBase<myForest<TNumParams>,myLabelType,myNodeDist,myOutputDist,TNumParams>
{
public:
    /* You'll probably want to define a custom constructor here, plus any
    other public methods */

protected:
    typedef typename randomForestBase<myForest<TNumParams>,myLabelType,myNodeDist,myOutputDist,TNumParams>::scoreInternalIndexStruct scoreInternalIndexStruct;

    void printHeaderDescription(std::ofstream &stream) const
    {
        /* Print a human-readable description of the contents of the header
        data. Anything printed here is ignored by the library */
    }

    void printHeaderData(std::ofstream &stream) const
    {
        /* Print a single line containing any parameters that must be stored
        in order to reconstruct the model (such as number of classes etc) */
    }

    void readHeader(std::ifstream &stream)
    {
        /* Read in the data printed using printHeaderData() in order to
        reconstruct a stored forest model from file */
    }

    void initialiseNodeDist(const int t, const int n)
    {
        /* Initialise a node distribution before fitting it during training.
        This can be used to perform any arbitrary action on the node distribution
        in this->forest[t].nodes[n].post[0] to prepare for fitting, such as
        initialising it with certain parameters and/or calling a custom
        constructor */
    }

    float minInfoGain(const int tree, const int node) const
    {
        /* Return the value of the information gain threshold for this node
        during training.
        If the actual information gain from the best split is below this,
        the node will become a leaf node. This can be used to give different
        behaviour in different nodes of the forest if desired, or can simply
        return a constant. */
    }

    template <class TLabelIterator, class TIdIterator>
    void trainingPrecalculations(const TLabelIterator first_label, const TLabelIterator last_label, const TIdIterator first_id)
    {
        /* This is called once at the start of the training routine and may
        be used to prepare for training on the supplied dataset, for example
        by precalculating values to speed up subsequent processes */
    }

    void cleanupPrecalculations()
    {
        /* This is called once at the end of the training routine and may be
        used, for example, to clear up any data no longer needed */
    }

    template <class TLabelIterator>
    float singleNodeImpurity(const TLabelIterator first_label, const std::vector<int>& nodebag, const int tree, const int node) const
    {
        /* Calculate the impurity of the labels in a given node before
        splitting in a given tree and node. This is used to compare to the
        value after splitting (found with bestSplit) in order to determine
        information gain. The labels are accessed via first_label[nodebag[0]],
        first_label[nodebag[1]] etc */
    }

    template <class TLabelIterator>
    void bestSplit(const std::vector<scoreInternalIndexStruct> &data_structs, const TLabelIterator first_label, const int tree, const int node, const float initial_impurity, float& info_gain, float& thresh) const
    {
        /* This is the key function in the training routine. It takes a list
        of labels in data_structs, which contains an integer-valued internal
        index (.id) and a float-valued feature score (.score) according to the
        feature functor with the chosen parameters. The elements of
        data_structs are sorted by increasing values of the score before being
        passed to this method.
        The labels of each of the training samples in the node can be accessed
        via first_label[data_structs[0].id], first_label[data_structs[1].id]
        etc.
        This method should find the best way to split the training samples
        with a single score threshold. The 'best' split is calculated with
        regard to the labels of the training samples, and is left to the
        user to define.
        The method returns (by reference) the chosen threshold (thresh) and
        the resulting information gain between the initial impurity before
        splitting (which is passed in as initial_impurity in order to avoid
        redundant repeated calculations) and the impurity after splitting. */
    }

    /* Other data etc */
};

Note that your class should inherit from the canopy::randomForestBase class, and therefore provide the 5 template parameters of the base class. The 2nd, 3rd, and 4th template parameters are respectively the label type, the node distribution type and the output distribution type, which need to match the relevant types of your node and output distribution models.

The first template parameter must be the type of your own forest model (being declared). This is in order to implement the CRTP idiom of static polymorphism.

The final template parameter is the number of parameters of the feature functor, and you will often want to make this a template parameter of your model also (as in the example) to allow for maximum flexibility.
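As a hypothetical usage sketch (the constructor and its parameters are assumptions for illustration; define whatever constructor suits your model), a custom model with a two-parameter feature functor might then be instantiated as follows:

// Assumes myForest defines a constructor taking the number of trees and the
// maximum tree depth; these parameters are hypothetical, not part of canopy
const unsigned n_trees = 128;
const unsigned tree_depth = 10;
myForest<2> forest(n_trees, tree_depth);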

The most complicated task in defining your own model is defining the training procedure, which is controlled by the implementations of the trainingPrecalculations, cleanupPrecalculations, singleNodeImpurity, bestSplit, and minInfoGain methods.

These methods are called by the base class's training procedure as follows:

1. trainingPrecalculations() is called once at the start of training, with access to the full set of training labels and IDs.
2. For each node to be trained, singleNodeImpurity() is called to find the impurity of that node's training labels before splitting.
3. bestSplit() is then called for each candidate set of feature parameters, receiving that initial impurity, and returns the best threshold and the resulting information gain.
4. The information gain of the best split found is compared to the value returned by minInfoGain() for that node; if it is lower, the node becomes a leaf node.
5. cleanupPrecalculations() is called once at the end of training.

The node distribution's fit method is also called as necessary to fit a node distribution to the training dataset.
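To make the split search more concrete, the following is a minimal sketch of a bestSplit-style threshold search for a float regression label, using the variance of the labels as the impurity measure. It is illustrative only, not canopy code: the tree and node arguments are omitted for brevity, and the struct below stands in for the base class's scoreInternalIndexStruct, assuming only the .id and .score fields described above:

#include <limits>
#include <vector>

struct scoreIndexSketch
{
    int id;      // internal index into the label range
    float score; // feature score; data_structs is sorted by this value
};

template <class TLabelIterator>
void bestSplitSketch(const std::vector<scoreIndexSketch>& data_structs,
                     const TLabelIterator first_label,
                     const float initial_impurity,
                     float& info_gain, float& thresh)
{
    const int n = static_cast<int>(data_structs.size());
    info_gain = 0.0f;
    thresh = 0.0f;
    if (n < 2)
        return; // nothing to split

    // Prefix sums of the labels (in score order) let us evaluate the variance
    // of the left and right partitions in constant time per candidate split
    std::vector<double> cum_sum(n + 1, 0.0), cum_sum_sq(n + 1, 0.0);
    for (int i = 0; i < n; ++i)
    {
        const double y = first_label[data_structs[i].id];
        cum_sum[i + 1] = cum_sum[i] + y;
        cum_sum_sq[i + 1] = cum_sum_sq[i] + y * y;
    }

    float best_impurity = std::numeric_limits<float>::max();
    int best_split = 1;

    // Consider splitting between each pair of adjacent (score-sorted) samples
    for (int i = 1; i < n; ++i)
    {
        const double n_l = i, n_r = n - i;
        const double mean_l = cum_sum[i] / n_l;
        const double mean_r = (cum_sum[n] - cum_sum[i]) / n_r;
        const double var_l = cum_sum_sq[i] / n_l - mean_l * mean_l;
        const double var_r = (cum_sum_sq[n] - cum_sum_sq[i]) / n_r - mean_r * mean_r;

        // Impurity after the split: sample-weighted average of the two variances
        const float impurity = static_cast<float>((n_l * var_l + n_r * var_r) / n);
        if (impurity < best_impurity)
        {
            best_impurity = impurity;
            best_split = i;
        }
    }

    // Place the threshold halfway between the scores either side of the best split
    thresh = 0.5f * (data_structs[best_split - 1].score + data_structs[best_split].score);
    info_gain = initial_impurity - best_impurity;
}

A classification model would follow the same pattern with a different impurity measure, such as the Gini impurity or the entropy of the class labels.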