Canopy
1.0
The header-only random forests library
In order to describe how the canopy library works, we will first need an abstract description of the random forests algorithm. I recommend the following as an excellent thorough introduction for those who need to learn or revise the basics:
A random forest is a collection of binary decision trees that is used to predict some quantity based on some input features. We will refer to the quantity being predicted as a label. The nature of the label will vary with the task being performed. For example, in a classification task, the label will be a discrete (integer) variable representing the class. For a regression problem, the label will be a continuous quantity (represented by a floating point number), etc. The features can be similarly general: they could be a set of pre-recorded measurements, or the result of some function applied to an image, video, or signal.
A data point is passed into the root node of each tree in the forest, and is passed down the tree until it reaches a leaf node. Each node looks at the values of some of the features in order to decide whether to pass the data point down to its left or right child node. This repeats until the leaf node, which has no children, is reached. The leaf node contains some distribution over the value of the label given the input; we will call such a distribution a node distribution.
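The traversal described above can be sketched in a few lines. This is a minimal, self-contained illustration of the idea; canopy's actual node layout is internal to the library and will differ:

```cpp
#include <memory>

// A minimal decision-tree node: internal nodes test one feature against a
// threshold; leaf nodes hold an index into a table of node distributions
struct TreeNode
{
    int feature_index = 0;      // which feature this node examines
    float threshold = 0.0f;     // decision boundary
    int leaf_dist_index = -1;   // >= 0 for leaf nodes
    std::unique_ptr<TreeNode> left, right;

    bool isLeaf() const { return leaf_dist_index >= 0; }
};

// Pass a data point (its feature vector) down the tree until a leaf is
// reached, and return the index of that leaf's node distribution
int traverse(const TreeNode& root, const float* features)
{
    const TreeNode* node = &root;
    while (!node->isLeaf())
    {
        // Go left if the tested feature falls below the threshold,
        // right otherwise
        node = (features[node->feature_index] < node->threshold)
                   ? node->left.get()
                   : node->right.get();
    }
    return node->leaf_dist_index;
}
```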
There are two tasks that the forest models created using canopy can perform at this point: distribution prediction, in which the node distributions reached in the different trees are combined into a single output distribution over the label, and probability evaluation, in which the probability of a particular label value is evaluated under each node distribution and the results are averaged. Both are described in detail below.
In order to train each tree in the forest, a randomly selected set of features is tested to see which can split the labels in the training set to give the purest child nodes. The concrete definition of purity depends on the type of the labels and the particular problem at hand.
The canopy library uses a single base class called canopy::randomForestBase to provide the general framework for training and testing random forests as described in the previous section.
The randomForestBase class looks something like this:
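A simplified sketch of that declaration follows. The parameter names match those used in this documentation, but the exact method signatures are indicative only; consult the header for the real interface:

```cpp
#include <array>

namespace canopy
{

// TDerived    : the derived model class (CRTP)
// TLabel      : type of the predicted label (e.g. int for classification)
// TNodeDist   : distribution type stored in each leaf node
// TOutputDist : distribution type produced by distribution prediction
// TNumParams  : number of integer parameters of the split functions
template <class TDerived, class TLabel, class TNodeDist, class TOutputDist,
          unsigned int TNumParams>
class randomForestBase
{
public:
    // Construct an untrained forest with a given size and depth
    randomForestBase(const int num_trees, const int num_levels);

    // Train the forest on labelled data using a groupwise feature
    // functor and a parameter generator functor
    template <class TIdIterator, class TLabelIterator,
              class TFeatureFunctor, class TParameterFunctor>
    void train(TIdIterator first_id, TIdIterator last_id,
               TLabelIterator first_label,
               TFeatureFunctor&& feature_functor,
               TParameterFunctor&& parameter_functor,
               const unsigned int num_param_combos_to_test);

    // ... distribution prediction, probability evaluation and
    // file input/output methods ...
};

} // namespace canopy
```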
There are several template parameters that allow classes derived from this base class to implement a wide range of different behaviours:
Specific random forest models may be derived from this base class using CRTP with a specific set of these template parameters (although TNumParams is typically left unspecified).
Canopy provides two pre-defined random forest models:

- canopy::classifier performs classification tasks with a discrete label (TLabel = int) with a canopy::discreteDistribution object serving as the node and output distribution type (TNodeDist = canopy::discreteDistribution and TOutputDist = canopy::discreteDistribution).
- canopy::circularRegressor performs regression of a circular-valued (angular) quantity with a continuous label (TLabel = float) and using a canopy::vonMisesDistribution object as the node and output distribution type.

Furthermore, you can inherit from this base class to define your own random forest model for other tasks.
Canopy handles features during both testing and training using functor objects, in order to allow maximum flexibility for the feature calculation process to execute arbitrary user code and access arbitrary user data.
There are two different forms of feature functor that are accepted by canopy under different circumstances. The difference between these two forms is how they handle multiple test data.
In both cases, the purpose of the functor object is to take some test points identified by some "ID" type, apply some function to them specified by some number of integer-valued parameters, and return a single floating-point value for each input ID as the result.
The ID can be any arbitrary data type used to identify a test data point. The ID variables are not used or moved around by Canopy methods, so it really doesn't matter what they are as long as your own functors can make sense of them. In the simplest case (as in the provided example), the ID is simply an integer representing the index of the test sample in some list. However, it could be more complicated than this, for example it could be a 2D vector representing an image location in an image, or a 3D vector representing a 3D location within a volume image.
The function that the functor applies to each test data point can be parameterised by an arbitrary number of integer-valued parameters, which are passed to the functor by const reference in a std::array. The number of parameters is controlled by the TNumParams template parameter of the base class. Again it is up to you to define the meaning of these parameters, the Canopy code just passes them around without using them.
The first of the two forms calculates the features for test data in a one-by-one fashion. We will refer to this form as the Single Feature Functor form. This is the simplest option and involves the smallest overhead.
A Single Feature Functor to be used with a forest with TNumParams = MyNumParams and using ID type MyIDType must look like this:
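Concretely, a conforming functor might be written as follows. The struct name and the toy feature computation are placeholders; only the signature of operator() matters:

```cpp
#include <array>

constexpr unsigned int MyNumParams = 2; // matches the forest's TNumParams
using MyIDType = int;                   // index into a list of test samples

struct MySingleFeatureFunctor
{
    // Apply the feature function identified by 'params' to the data point
    // identified by 'id', returning a single floating-point value
    float operator()(const MyIDType id,
                     const std::array<int, MyNumParams>& params) const
    {
        // A toy feature: weight the sample index by the first parameter
        // and shift by the second (stand-in for real feature extraction)
        return static_cast<float>(params[0] * id + params[1]);
    }
};
```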
The second form allows for calculating the features for multiple test points together. We will refer to this as the Groupwise Feature Functor form. This can be advantageous when such a process can be made quicker than evaluating the test data individually. For example, if the test data are different image locations in the same image, then it may be faster to perform some process (e.g. an FFT) on the entire image than to work on each point individually.
In contrast to the Single Feature Functor, which receives a single ID variable directly, the Groupwise Feature Functor receives multiple IDs via iterators. Because canopy uses some moderately complex iterator hackery internally, working out the type of the iterator it will pass your functor can be a little complicated. Therefore, I strongly recommend defining a method template and letting the compiler do the type inference for you. Taking this approach, a Groupwise Feature Functor for use with a forest with TNumParams = MyNumParams should look like this:
If you are unsure about what a random access iterator is and how to use it, see here.
Note also that it is entirely possible to define an object that can serve as either a Single Feature Functor or a Groupwise Feature Functor by overloading the operator() for both cases.
Because canopy uses OpenMP threads to query the different trees in a forest in parallel, the feature functors will be used simultaneously by multiple threads. It is your responsibility to ensure that the functors are thread-safe, i.e. there are no data races that can be caused by access from multiple threads. If there are potential data races, you should use OpenMP lock variables to prevent them.
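For illustration, here is a functor that guards shared mutable state. A std::mutex is used so the sketch is self-contained; an OpenMP lock (omp_lock_t with omp_set_lock/omp_unset_lock) works analogously and is what the text above recommends:

```cpp
#include <array>
#include <mutex>

constexpr unsigned int MyNumParams = 1;

struct CountingFeatureFunctor
{
    mutable std::mutex mtx;      // protects the shared counter below
    mutable long num_calls = 0;  // shared mutable state: a data race hazard

    float operator()(const int id,
                     const std::array<int, MyNumParams>& params) const
    {
        {
            // Lock only around the shared write, keeping the critical
            // section as small as possible
            std::lock_guard<std::mutex> guard(mtx);
            ++num_calls;
        }
        // The feature calculation itself touches no shared state, so it
        // runs outside the lock
        return static_cast<float>(params[0] * id);
    }
};
```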
Alternatively, you can run with a single thread by compiling without OpenMP (remove the -fopenmp flag from the g++ invocation). This will give you some compiler warnings about unknown pragmas, but the code will otherwise compile and run fine, though it will usually execute more slowly than with multiple threads.
So far we have considered creating a custom struct (or class) to act as our functor. This is attractive as it allows you to encapsulate the feature extraction process along with all its data in a single object.
There is another option though: the use of a C++11 lambda. This approach is attractive because it requires less "boilerplate" code and the lambda can easily access data defined elsewhere in your programme via lambda captures. Defining a Single Feature Functor for a forest model with TNumParams = MyNumParams and using ID type MyIDType using a lambda would look like this:
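A sketch of this follows. The makeFeatureLambda helper and the toy feature are illustrative; the lambda's parameter list is the important part:

```cpp
#include <array>
#include <vector>

constexpr unsigned int MyNumParams = 2;
using MyIDType = int;

// Build the functor inside a function so the lambda can capture local data;
// the capture gives it access to data defined elsewhere in the programme
// with no boilerplate
auto makeFeatureLambda(const std::vector<float>& my_data)
{
    return [&my_data](const MyIDType id,
                      const std::array<int, MyNumParams>& params)
    {
        // Toy feature: scale the pre-stored value for this sample
        return my_data[id] * static_cast<float>(params[0]);
    };
}
```

Note that my_data is captured by reference, so it must outlive the returned lambda (and hence the forest methods that call it).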
However, extending this to the Groupwise Feature Functor causes some headaches because of those tricky iterator types. With C++11, you would need to manually work out all those types. Luckily, the C++14 standard fixes this problem by allowing you to define generic lambdas using the auto keyword in place of the types. The compiler will then deduce the types for you based on how the lambda is called from within the canopy code. Again assuming we are working with a forest model with TNumParams = MyNumParams and using ID type MyIDType, a Groupwise Feature Functor implemented using a lambda will look like this:
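For example (again the helper function and toy feature are illustrative; the generic 'auto' parameters are what save you from naming the iterator types):

```cpp
#include <array>
#include <vector>

constexpr unsigned int MyNumParams = 2;

auto makeGroupwiseLambda(const std::vector<float>& my_data)
{
    // Generic lambda (C++14): 'auto' lets the compiler deduce the iterator
    // types that canopy passes in, so we never have to work them out
    return [&my_data](auto first_id, auto last_id,
                      const std::array<int, MyNumParams>& params,
                      auto out_it)
    {
        // Write one feature value per ID through the output iterator
        for (auto it = first_id; it != last_id; ++it, ++out_it)
            *out_it = my_data[*it] * static_cast<float>(params[0]);
    };
}
```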
The example program demonstrates using a lambda as a feature functor that captures local data.
In order to train a model you must create a Groupwise Feature Functor object and pass it to the randomForestBase's train() method.
Additionally, you must define a second functor object that generates valid combinations of parameters for your split functors on demand. Which combinations are valid depends entirely on the feature calculation process that you are using.
The Parameter Generator Functor for a forest model with TNumParams = MyNumParams must take the following form:
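For example (the generation strategy shown here, independent uniform draws, is a placeholder for whatever combinations are valid for your features; the operator() signature is the important part):

```cpp
#include <array>
#include <random>

constexpr unsigned int MyNumParams = 2;

struct MyParameterGenerator
{
    std::default_random_engine engine;
    std::uniform_int_distribution<int> dist{0, 9};

    // Fill 'params' with one randomly generated, valid combination of
    // split-function parameters (here: independent draws from [0, 9])
    void operator()(std::array<int, MyNumParams>& params)
    {
        for (int& p : params)
            p = dist(engine);
    }
};
```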
Alternatively you could use a lambda, as follows:
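A sketch using C++14 init-captures, so the lambda carries its own random engine and advances it on each call (the range [0, 9] is arbitrary):

```cpp
#include <array>
#include <random>

constexpr unsigned int MyNumParams = 2;

// The engine and distribution are captured by value; 'mutable' lets each
// call update the engine's internal state
auto param_lambda = [engine = std::default_random_engine{},
                     dist = std::uniform_int_distribution<int>{0, 9}]
    (std::array<int, MyNumParams>& params) mutable
{
    for (int& p : params)
        p = dist(engine);
};
```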
A third option, if your parameter generation problem is straightforward, is to use the provided canopy::defaultParameterGenerator class. This can handle cases where the parameters can be generated independently from a uniform distribution over the integers between 0 and some user-defined upper limit.
Once you have created your feature functor and parameter generation functor, and a list of IDs of the data points in the training set, you are ready to train the model. You do this by creating a forest model object and calling the canopy::randomForestBase::train() method.
As an example, for a canopy::classifier model, this looks something like this:
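A sketch follows, assuming the functors from the previous sections. The header path is an assumption, and all my_-prefixed names are placeholders you would define yourself; consult the class documentation for the exact signature:

```cpp
// Header path may differ in your installation
#include <canopy/classifier/classifier.hpp>

// Construct the forest model: number of classes, number of trees,
// and number of levels per tree
canopy::classifier<MyNumParams> the_classifier(my_num_classes,
                                               my_num_trees,
                                               my_num_levels);

// Train using the groupwise feature functor, the parameter generator,
// and the training set defined by the ID iterators and labels
the_classifier.train(my_first_training_id, my_last_training_id,
                     my_first_label,
                     my_groupwise_feature_functor,
                     my_parameter_generator,
                     my_num_param_trials);
```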
my_num_classes defines the number of classes in the classification task. my_num_trees and my_num_levels respectively define the number of trees in the forest model and the number of levels in each tree. The training set is defined by a pair of iterators to ID variables (my_first_training_id and my_last_training_id) along with the corresponding values of the label (my_first_label). The my_num_param_trials parameter controls how many features are tested in each split node.
Once a model has been trained, it can be stored in a file for later use using the canopy::randomForestBase::writeToFile() method.
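A minimal sketch (the file name is arbitrary, and the_classifier is assumed to be a trained model as above):

```cpp
// Store the trained model in a file for later use
the_classifier.writeToFile("my_trained_model.tr");
```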
Recall that the distribution prediction task is that of predicting an output distribution over the label variable by combining the node distributions reached in each of the trees.
To do this we first need to create an output distribution object for each of the data points. We can then pass these objects to the forest model and have it update the parameters based on the features. The type of the output distribution depends on the specific forest model being used. For a canopy::classifier model, it is a canopy::discreteDistribution, and for a canopy::circularRegressor it is a canopy::vonMisesDistribution.
There are two methods for distribution prediction. The first, canopy::randomForestBase::predictDistSingle(), uses a Single Feature Functor to calculate the features. The second, canopy::randomForestBase::predictDistGroupwise(), performs the same function but uses a Groupwise Feature Functor instead.
Doing this for a canopy::classifier looks something like this:
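A sketch, continuing the classifier example. The my_-prefixed names are placeholders, the constructor argument to canopy::discreteDistribution and the pdf() query on the final line are indicative only:

```cpp
#include <vector>

// One output distribution per test data point, each initialised for the
// number of classes in the problem, to be filled in by the forest
std::vector<canopy::discreteDistribution> dists(
    my_num_test_points, canopy::discreteDistribution(my_num_classes));

// Combine the node distributions reached in each tree into the output
// distributions, using the groupwise feature functor
the_classifier.predictDistGroupwise(my_first_test_id, my_last_test_id,
                                    dists.begin(),
                                    my_groupwise_feature_functor);

// Query the combined distribution, e.g. the probability of class 2
// for the first test point
const float p = dists[0].pdf(2);
```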
Recall that probability evaluation is the task of evaluating the probability of a certain label, according to the forest. It is worth briefly considering how this differs from simply performing the distribution prediction task and then evaluating the probability of the label according to the output distribution (which is what we did in the final line of the above snippet). The probability evaluation task evaluates the probability of the label under each node distribution and then averages the result, whereas distribution prediction combines several node distributions and then finds the probability according to this combined distribution.
Where it is possible to find a method of combining the distributions in a way that does not result in loss of information, the result is exactly the same. This is the case with the discrete distribution. However, it is not the case with the von Mises distribution in the circular regression task. In the latter case, the distribution prediction procedure combines several von Mises distributions to create a further von Mises distribution using a sensor-fusion approach that results in loss of information. This is easy to see, because a uni-modal von Mises distribution cannot possibly fully represent the combination of several other distributions that each have different means. Consequently, you will get a different answer for the probability of a certain label if you use the direct probability evaluation approach than if you use distribution prediction.
Therefore, in general, if you want to evaluate the probability of a certain label you should use the direct probability evaluation approach. The distribution prediction task is more useful if you want some summary statistics, such as a point estimate of the mean or variance.
With that said, let's look at how to perform probability evaluation. As with distribution prediction, there are both single and groupwise versions that are otherwise equivalent (canopy::randomForestBase::probabilitySingle() and canopy::randomForestBase::probabilityGroupwise()). Now, however, we don't need to pass in any distributions, but we do need to pass the method some container of floats in which to place the results, along with the values of the labels for which it is to evaluate the probability.
These methods can either evaluate the probability of a single label for multiple test IDs, or each test ID can have its own value of the label, depending on the value of the 5th parameter.
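A sketch, continuing the classifier example. The argument order, the name and semantics of the boolean flag, and all my_-prefixed names are assumptions based on the description above; consult the method documentation for the exact signature:

```cpp
#include <vector>

// One result per test point, and the label value(s) to evaluate
std::vector<float> probs(my_num_test_points);
std::vector<int> query_labels(my_num_test_points, 2);

// Evaluate the probability of each label directly, averaging over the
// node distributions reached in the trees. The boolean flag (the 5th
// parameter) is assumed to select between a single shared label and
// one label per test ID
the_classifier.probabilityGroupwise(my_first_test_id, my_last_test_id,
                                    query_labels.begin(),
                                    probs.begin(),
                                    false,
                                    my_groupwise_feature_functor);
```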
This example file may be found in the examples directory of the canopy repository, along with a Makefile that can be used to compile it using GNU make and g++.
It implements a toy example using a random forest classifier to classify a number of data points into different classes. The features are randomly generated from a Gaussian distribution with a different, randomly-generated mean and variance for each class and pre-stored in an array.
To build this example in GNU/Linux with GNU make, do the following:
To run:
To clean-up:
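The three steps above might look like this from a shell (the executable name is an assumption; use whatever name make produces in your checkout):

```shell
# Build, from the examples directory of the canopy repository
make

# Run the compiled example
./example

# Remove the built files
make clean
```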
For hints on how to do this in other environments, see Compiling User Code.