Canopy
1.0
The header-only random forests library
If you want to define your own model using canopy, there are up to four steps you need to complete:

1. Choose the label type that your forest will predict.
2. Define the node distribution class.
3. Define the output distribution class.
4. Define the forest model class itself.

These steps must be completed in this order because each depends on the previous. However, in some situations it may be possible to re-use one of the node or output distributions already defined in canopy, in which case you can skip that step. Furthermore, in some situations the node distribution and output distribution may be the same (this is the case for the canopy::classifier and canopy::circularRegressor models). In this case you just need to implement the behaviour of both models within one class.
The next four sections describe this process in detail. You should read these and read the existing code for canopy::classifier and canopy::circularRegressor along with canopy::discreteDistribution and canopy::vonMisesDistribution in order to understand how to build your own forest model.
Firstly, you need to choose the data type that you want your forest to predict. In principle, this can be any data type. For example, the label type of the canopy::classifier is int, and the label type of the canopy::circularRegressor is float.
For the sake of this tutorial, let's assume that we've chosen a label type of myLabelType.
The node distribution class defines the distribution that is stored in the leaf nodes of each tree in the forest. Conceptually, it should capture the behaviour of a probability distribution over the label type (myLabelType in our example).
There are a number of methods that the class must define in order to be used as a node distribution within canopy. You are of course free to add other methods or properties to give the class the behaviour you need.
The following example gives the required layout of the class. It might be easier for you to copy this file, change the type names, and fill in the blanks:
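As a rough illustration, here is one way such a class might be sketched for a float label type. Be aware that only the fit method is named explicitly in this documentation; the other members (the pdf method, the default constructor, and the Gaussian parameterisation) are illustrative assumptions, so consult the canopy::discreteDistribution and canopy::vonMisesDistribution sources for the exact required interface.

```cpp
#include <cmath>
#include <vector>

// Illustrative label type for this sketch.
using myLabelType = float;

// Sketch of a node distribution over myLabelType, parameterised here as a
// simple Gaussian. The method names other than fit() are assumptions.
class myNodeDist
{
public:
    // Node distributions are created internally by the forest, so a default
    // constructor is assumed to be required.
    myNodeDist() : mean(0.0f), var(1.0f) {}

    // Fit the distribution's parameters to a range of training labels.
    // Here: sample mean and variance.
    template <class TLabelIterator>
    void fit(TLabelIterator first, TLabelIterator last)
    {
        float sum = 0.0f, sumSq = 0.0f;
        int n = 0;
        for (TLabelIterator it = first; it != last; ++it, ++n)
        {
            sum += *it;
            sumSq += (*it) * (*it);
        }
        if (n > 0)
        {
            mean = sum / n;
            var = sumSq / n - mean * mean;
            if (var <= 0.0f) var = 1e-6f; // guard against degenerate variance
        }
    }

    // Evaluate the probability density at a given label value (illustrative).
    float pdf(myLabelType x) const
    {
        return std::exp(-(x - mean) * (x - mean) / (2.0f * var))
               / std::sqrt(2.0f * 3.14159265f * var);
    }

    float mean, var;
};
```

Any other methods or state your distribution needs (sufficient statistics, serialisation, etc.) can be added freely alongside the required interface.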
The output distribution class defines the distribution that is created as the result of the distribution prediction task. It combines (in some sense that you can define) the node distributions reached in each tree of the forest for a given data point to produce a new distribution.
The class is required to have the layout in the following example:
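As a rough sketch, an output distribution built around the reset(), combineWith(), and normalise() methods could look like the following. The averaging behaviour and the myNodeDist stand-in are illustrative assumptions, not canopy's real interface.

```cpp
// Minimal stand-in for the node distribution class, included only so that
// this sketch is self-contained.
struct myNodeDist
{
    float mean, var;
};

// Sketch of an output distribution that averages Gaussian node distributions.
class myOutputDist
{
public:
    // Clear any state from a previous prediction so the object can be reused.
    void reset()
    {
        mean = 0.0f;
        var = 0.0f;
        count = 0;
    }

    // Called once per tree with the node distribution of the reached leaf.
    // Called from a single thread, so no locking is needed.
    void combineWith(const myNodeDist& node)
    {
        mean += node.mean;
        var += node.var;
        ++count;
    }

    // Called once after all trees have contributed; turn the accumulated
    // sums into a valid distribution (here: a simple average).
    void normalise()
    {
        if (count > 0)
        {
            mean /= count;
            var /= count;
        }
    }

    float mean = 0.0f, var = 0.0f;
    int count = 0;
};
```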
To make sense of the tasks that should be performed by the three methods, it is helpful to understand the order in which they are called by the canopy::randomForestBase class. Recall from the basic usage instructions that the user code creates the output distribution object and passes it to the forest's canopy::randomForestBase::predictDistSingle() or canopy::randomForestBase::predictDistGroupwise() method (via an iterator). That method then uses the output distribution's methods as follows:

1. First, the reset() method is called. This should clear any existing data from the class such that it is ready for use with the combineWith() method. This ensures, for example, that if the user passes the same output distribution object to two calls of the predictDistSingle/predictDistGroupwise method, the second call overwrites any information from the first.
2. The forest then finds the leaf node reached by the data point in each tree, and calls the combineWith() method of the output distribution once for each of these node distributions (i.e. once for each of the trees in the forest), passing a reference to the relevant node distribution object. The method should update the parameters of the output distribution object to reflect the inclusion of that leaf node. Note that the node distributions are passed in the order they appear in the list of trees in the forest, and in a single thread (i.e. you do not need to worry about data races within the combineWith() method).
3. After the combineWith() method has been called for all the trees in the forest, the normalise() method is called once. This can therefore be used to ensure that the resulting parameters represent a valid distribution.

In order to define your own forest model, you need to define a class whose layout follows the example below:
Note that your class should inherit from the canopy::randomForestBase class, and must therefore provide the five template parameters of the base class. The second, third, and fourth template parameters are, respectively, the label type, the node distribution type, and the output distribution type, which must match the corresponding types of your node and output distribution models.
The first template parameter must be the type of your own forest model (the class being declared); this implements the curiously recurring template pattern (CRTP) idiom for static polymorphism.
The final template parameter is the number of parameters of the feature functor, and you will often want to make this a template parameter of your model as well (as in the example) to allow for maximum flexibility.
The most complicated task in defining your own model is defining the training procedure, which is controlled by the implementations of the trainingPrecalculations, cleanupPrecalculations, singleNodeImpurity, bestSplit, and minInfoGain methods.
These methods are called by the base class's training procedure as follows:

1. First, the trainingPrecalculations method is called on the entire training dataset in order to perform any necessary pre-calculations.
2. At each node, the impurity of the training data reaching the node is calculated using the singleNodeImpurity method (in a sense defined by you).
3. Each candidate feature parameter set is passed to the bestSplit method, which calculates the best (in a sense defined by you) feature score threshold to use to split the data into two child nodes using this parameter set.
4. The parameter set giving the best split (as scored by the bestSplit method) is chosen as the feature set for that node.
5. If the best achievable information gain falls below the value returned by the minInfoGain method, the node is declared as a leaf node. Training also stops when the maximum depth is reached or the number of training data in the node falls below the user-supplied threshold.
6. Once training is complete, the cleanupPrecalculations method is called.

The node distribution's fit method is also called as necessary to fit a node distribution to the training dataset.
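The call order above can be summarised in simplified form. This sketch is purely illustrative: the hook signatures are stand-ins, and the real base class logic (candidate parameter generation, recursion over child nodes, threading) is omitted entirely.

```cpp
#include <string>
#include <vector>

// Stand-in model recording which training hooks are invoked, with simplified
// signatures; the real canopy hooks take iterators over the training data.
struct hookTrace
{
    std::vector<std::string> calls;

    void trainingPrecalculations() { calls.push_back("trainingPrecalculations"); }
    double singleNodeImpurity()    { calls.push_back("singleNodeImpurity"); return 1.0; }
    double bestSplit()             { calls.push_back("bestSplit"); return 0.5; }
    double minInfoGain()           { calls.push_back("minInfoGain"); return 0.1; }
    void cleanupPrecalculations()  { calls.push_back("cleanupPrecalculations"); }
};

// Simplified single-node training pass following the documented call order.
inline void trainOneNode(hookTrace& model)
{
    model.trainingPrecalculations();              // 1. once, on the whole dataset
    double impurity = model.singleNodeImpurity(); // 2. impurity of this node
    double gain = impurity - model.bestSplit();   // 3. best threshold per candidate
    if (gain < model.minInfoGain())               // 5. stopping criterion
    {
        // Declare a leaf node and fit the node distribution here.
    }
    model.cleanupPrecalculations();               // 6. once, after training
}
```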