Canopy
1.0
The header-only random forests library
In order to describe how the canopy library works, we will first need an abstract description of the random forests algorithm. I recommend the following as an excellent thorough introduction for those who need to learn or revise the basics:
A random forest is a collection of binary decision trees that is used to predict some quantity based on some input features. We will refer to the quantity being predicted as a label. The nature of the label will vary with the task being performed. For example, in a classification task, the label will be a discrete (integer) variable representing the class. For a regression problem, the label will be a continuous quantity (represented by a floating point number), etc. The features can be similarly general: they could be a set of pre-recorded measurements, or the result of some function applied to an image, video, or signal.
A data point is passed into the root node of each tree in the forest, and is passed down the tree until it reaches a leaf node. Each node looks at the values of some of the features in order to decide whether to pass the data point down to its left or right child node. This repeats until the leaf node, which has no children, is reached. The leaf node contains some distribution over the value of the label given the input; we will call such a distribution a node distribution.
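The traversal described above can be sketched in a few lines. This is a minimal, self-contained illustration of the idea; canopy's actual node layout is internal to the library and will differ:

```cpp
#include <memory>

// A minimal decision-tree node: internal nodes test one feature against a
// threshold; leaf nodes hold an index into a table of node distributions
struct TreeNode
{
    int feature_index = 0;      // which feature this node examines
    float threshold = 0.0f;     // decision boundary
    int leaf_dist_index = -1;   // >= 0 for leaf nodes
    std::unique_ptr<TreeNode> left, right;

    bool isLeaf() const { return leaf_dist_index >= 0; }
};

// Pass a data point (its feature vector) down the tree until a leaf is
// reached, and return the index of that leaf's node distribution
int traverse(const TreeNode& root, const float* features)
{
    const TreeNode* node = &root;
    while (!node->isLeaf())
    {
        // Go left if the tested feature falls below the threshold,
        // right otherwise
        node = (features[node->feature_index] < node->threshold)
                   ? node->left.get()
                   : node->right.get();
    }
    return node->leaf_dist_index;
}
```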
There are two tasks that the forest models created using canopy can perform at this point: distribution prediction, in which the node distributions reached in the different trees are combined into a single output distribution over the label, and probability evaluation, in which the probability of a particular label value is evaluated under each node distribution and the results are averaged. Both are described in detail below.
In order to train each tree in the forest, a randomly selected set of features is tested to see which can split the labels in the training set to give the purest child nodes. The concrete definition of purity depends on the type of the labels and the particular problem at hand.
The canopy library uses a single base class called canopy::randomForestBase to provide the general framework for training and testing random forests as described in the previous section.
The randomForestBase class looks something like this:
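A simplified sketch of that declaration follows. The parameter names match those used in this documentation, but the exact method signatures are indicative only; consult the header for the real interface:

```cpp
#include <array>

namespace canopy
{

// TDerived    : the derived model class (CRTP)
// TLabel      : type of the predicted label (e.g. int for classification)
// TNodeDist   : distribution type stored in each leaf node
// TOutputDist : distribution type produced by distribution prediction
// TNumParams  : number of integer parameters of the split functions
template <class TDerived, class TLabel, class TNodeDist, class TOutputDist,
          unsigned int TNumParams>
class randomForestBase
{
public:
    // Construct an untrained forest with a given size and depth
    randomForestBase(const int num_trees, const int num_levels);

    // Train the forest on labelled data using a groupwise feature
    // functor and a parameter generator functor
    template <class TIdIterator, class TLabelIterator,
              class TFeatureFunctor, class TParameterFunctor>
    void train(TIdIterator first_id, TIdIterator last_id,
               TLabelIterator first_label,
               TFeatureFunctor&& feature_functor,
               TParameterFunctor&& parameter_functor,
               const unsigned int num_param_combos_to_test);

    // ... distribution prediction, probability evaluation and
    // file input/output methods ...
};

} // namespace canopy
```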
There are several template parameters that allow classes derived from this base class to implement a wide range of different behaviours:
Specific random forest models may be derived from this base class using CRTP with a specific set of these template parameters (although TNumParams is typically left unspecified).
Canopy provides two pre-defined random forest models:

- canopy::classifier performs classification tasks with a discrete label (TLabel = int) with a canopy::discreteDistribution object serving as the node and output distribution type (TNodeDist = canopy::discreteDistribution and TOutputDist = canopy::discreteDistribution).
- canopy::circularRegressor performs regression of a circular-valued (angular) quantity with a continuous label (TLabel = float) and using a canopy::vonMisesDistribution object as the node and output distribution type.

Furthermore, you can inherit from this base class to define your own random forest model for other tasks.
Canopy handles features during both testing and training using functor objects, in order to allow maximum flexibility for the feature calculation process to execute arbitrary user code and access arbitrary user data.
There are two different forms of feature functor that are accepted by canopy under different circumstances. The difference between these two forms is how they handle multiple test data.
In both cases, the purpose of the functor object is to take some test points identified by some "ID" type, apply some function to them specified by some number of integer-valued parameters, and return a single floating-point value for each input ID as the result.
The ID can be any arbitrary data type used to identify a test data point. The ID variables are not used or moved around by Canopy methods, so it really doesn't matter what they are as long as your own functors can make sense of them. In the simplest case (as in the provided example), the ID is simply an integer representing the index of the test sample in some list. However, it could be more complicated than this, for example it could be a 2D vector representing an image location in an image, or a 3D vector representing a 3D location within a volume image.
The function that the functor applies to each test data point can be parameterised by an arbitrary number of integer-valued parameters, which are passed to the functor by const reference in a std::array. The number of parameters is controlled by the TNumParams template parameter of the base class. Again it is up to you to define the meaning of these parameters, the Canopy code just passes them around without using them.
The first of the two forms calculates the features for test data in a one-by-one fashion. We will refer to this form as the Single Feature Functor form. This is the simplest option and involves the smallest overhead.
A Single Feature Functor to be used with a forest with TNumParams = MyNumParams and using ID type MyIDType must look like this:
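Concretely, a conforming functor might be written as follows. The struct name and the toy feature computation are placeholders; only the signature of operator() matters:

```cpp
#include <array>

constexpr unsigned int MyNumParams = 2; // matches the forest's TNumParams
using MyIDType = int;                   // index into a list of test samples

struct MySingleFeatureFunctor
{
    // Apply the feature function identified by 'params' to the data point
    // identified by 'id', returning a single floating-point value
    float operator()(const MyIDType id,
                     const std::array<int, MyNumParams>& params) const
    {
        // A toy feature: weight the sample index by the first parameter
        // and shift by the second (stand-in for real feature extraction)
        return static_cast<float>(params[0] * id + params[1]);
    }
};
```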
The second form allows for calculating the features for multiple test points together. We will refer to this as the Groupwise Feature Functor form. This can be advantageous when such a process can be made quicker than evaluating the test data individually. For example, if the test data are different image locations in the same image, then it may be faster to perform some process (e.g. an FFT) on the entire image than to work on each point individually.
In contrast to the Single Feature Functor, which receives a single ID variable directly, the Groupwise Feature Functor receives multiple IDs via iterators. Because canopy uses some moderately complex iterator hackery internally, working out the type of the iterator it will pass your functor can be a little complicated. Therefore, I strongly recommend defining a method template and letting the compiler do the type inference for you. Taking this approach, a Groupwise Feature Functor for use with a forest with TNumParams = MyNumParams should look like this:
If you are unsure about what a random access iterator is and how to use it, see here.
Note also that it is entirely possible to define an object that can serve as either a Single Feature Functor or a Groupwise Feature Functor by overloading the operator() for both cases.
Because canopy uses OpenMP threads to query the different trees in a forest in parallel, the feature functors will be used simultaneously by multiple threads. It is your responsibility to ensure that the functors are thread-safe, i.e. there are no data races that can be caused by access from multiple threads. If there are potential data races, you should use OpenMP lock variables to prevent them.
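For illustration, here is a functor that guards shared mutable state. A std::mutex is used so the sketch is self-contained; an OpenMP lock (omp_lock_t with omp_set_lock/omp_unset_lock) works analogously and is what the text above recommends:

```cpp
#include <array>
#include <mutex>

constexpr unsigned int MyNumParams = 1;

struct CountingFeatureFunctor
{
    mutable std::mutex mtx;      // protects the shared counter below
    mutable long num_calls = 0;  // shared mutable state: a data race hazard

    float operator()(const int id,
                     const std::array<int, MyNumParams>& params) const
    {
        {
            // Lock only around the shared write, keeping the critical
            // section as small as possible
            std::lock_guard<std::mutex> guard(mtx);
            ++num_calls;
        }
        // The feature calculation itself touches no shared state, so it
        // runs outside the lock
        return static_cast<float>(params[0] * id);
    }
};
```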
Alternatively, you can run with a single thread by compiling without OpenMP (remove the -fopenmp flag from the g++ invocation). This will give you some compiler warnings about unknown pragmas, but the code will otherwise compile and run fine, though it will usually execute more slowly than with multiple threads.
So far we have considered creating a custom struct (or class) to act as our functor. This is attractive as it allows you to encapsulate the feature extraction process along with all its data in a single object.
There is another option though: the use of a C++11 lambda. This approach is attractive because it requires less "boilerplate" code and the lambda can easily access data defined elsewhere in your programme via lambda captures. Defining a Single Feature Functor for a forest model with TNumParams = MyNumParams and using ID type MyIDType using a lambda would look like this:
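A sketch of this follows. The makeFeatureLambda helper and the toy feature are illustrative; the lambda's parameter list is the important part:

```cpp
#include <array>
#include <vector>

constexpr unsigned int MyNumParams = 2;
using MyIDType = int;

// Build the functor inside a function so the lambda can capture local data;
// the capture gives it access to data defined elsewhere in the programme
// with no boilerplate
auto makeFeatureLambda(const std::vector<float>& my_data)
{
    return [&my_data](const MyIDType id,
                      const std::array<int, MyNumParams>& params)
    {
        // Toy feature: scale the pre-stored value for this sample
        return my_data[id] * static_cast<float>(params[0]);
    };
}
```

Note that my_data is captured by reference, so it must outlive the returned lambda (and hence the forest methods that call it).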
However, extending this to the Groupwise Feature Functor causes some headaches because of those tricky iterator types. With C++11, you would need to manually work out all those types. Luckily, the C++14 standard fixes this problem by allowing you to define generic lambdas using the auto keyword in place of the types. The compiler will then deduce the types for you based on how the lambda is called from within the canopy code. Again assuming we are working with a forest model with TNumParams = MyNumParams and using ID type MyIDType, a Groupwise Feature Functor implemented using a lambda will look like this:
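For example (again the helper function and toy feature are illustrative; the generic 'auto' parameters are what save you from naming the iterator types):

```cpp
#include <array>
#include <vector>

constexpr unsigned int MyNumParams = 2;

auto makeGroupwiseLambda(const std::vector<float>& my_data)
{
    // Generic lambda (C++14): 'auto' lets the compiler deduce the iterator
    // types that canopy passes in, so we never have to work them out
    return [&my_data](auto first_id, auto last_id,
                      const std::array<int, MyNumParams>& params,
                      auto out_it)
    {
        // Write one feature value per ID through the output iterator
        for (auto it = first_id; it != last_id; ++it, ++out_it)
            *out_it = my_data[*it] * static_cast<float>(params[0]);
    };
}
```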
The example program demonstrates using a lambda as a feature functor that captures local data.
In order to train a model you must create a Groupwise Feature Functor object and pass it to the randomForestBase's train() method.
Additionally, you must define a second functor object that generates valid combinations of parameters for your split functors on demand. Which combinations are valid depends entirely on the feature calculation process that you are using.
The Parameter Generator Functor for a forest model with TNumParams = MyNumParams must take the following form:
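For example (the generation strategy shown here, independent uniform draws, is a placeholder for whatever combinations are valid for your features; the operator() signature is the important part):

```cpp
#include <array>
#include <random>

constexpr unsigned int MyNumParams = 2;

struct MyParameterGenerator
{
    std::default_random_engine engine;
    std::uniform_int_distribution<int> dist{0, 9};

    // Fill 'params' with one randomly generated, valid combination of
    // split-function parameters (here: independent draws from [0, 9])
    void operator()(std::array<int, MyNumParams>& params)
    {
        for (int& p : params)
            p = dist(engine);
    }
};
```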
Alternatively you could use a lambda, as follows:
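A sketch using C++14 init-captures, so the lambda carries its own random engine and advances it on each call (the range [0, 9] is arbitrary):

```cpp
#include <array>
#include <random>

constexpr unsigned int MyNumParams = 2;

// The engine and distribution are captured by value; 'mutable' lets each
// call update the engine's internal state
auto param_lambda = [engine = std::default_random_engine{},
                     dist = std::uniform_int_distribution<int>{0, 9}]
    (std::array<int, MyNumParams>& params) mutable
{
    for (int& p : params)
        p = dist(engine);
};
```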
A third option, if your parameter generation problem is straightforward, is to use the provided canopy::defaultParameterGenerator class. This can handle cases where the parameters can be generated independently from a uniform distribution over the integers between 0 and some user-defined upper limit.
Once you have created your feature functor and parameter generation functor, and a list of IDs of the data points in the training set, you are ready to train the model. You do this by creating a forest model object and calling the canopy::randomForestBase::train() method.
As an example, for a canopy::classifier model, this looks something like this:
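A sketch follows, assuming the functors from the previous sections. The header path is an assumption, and all my_-prefixed names are placeholders you would define yourself; consult the class documentation for the exact signature:

```cpp
// Header path may differ in your installation
#include <canopy/classifier/classifier.hpp>

// Construct the forest model: number of classes, number of trees,
// and number of levels per tree
canopy::classifier<MyNumParams> the_classifier(my_num_classes,
                                               my_num_trees,
                                               my_num_levels);

// Train using the groupwise feature functor, the parameter generator,
// and the training set defined by the ID iterators and labels
the_classifier.train(my_first_training_id, my_last_training_id,
                     my_first_label,
                     my_groupwise_feature_functor,
                     my_parameter_generator,
                     my_num_param_trials);
```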
my_num_classes defines the number of classes in the classification task. my_num_trees and my_num_levels respectively define the number of trees in the forest model and the number of levels in each tree. The training set is defined by a pair of iterators to ID variables (my_first_training_id and my_last_training_id) along with the corresponding values of the label (my_first_label). The my_num_param_trials parameter controls how many features are tested in each split node.
Once a model has been trained, it can be stored in a file for later use using the canopy::randomForestBase::writeToFile() method.
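A minimal sketch (the file name is arbitrary, and the_classifier is assumed to be a trained model as above):

```cpp
// Store the trained model in a file for later use
the_classifier.writeToFile("my_trained_model.tr");
```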
Recall that the distribution prediction task is that of predicting an output distribution over the label variable by combining the node distributions reached in each of the trees.
To do this we first need to create an output distribution object for each of the data points. We can then pass these objects to the forest model and have it update the parameters based on the features. The type of the output distribution depends on the specific forest model being used. For a canopy::classifier model, it is a canopy::discreteDistribution, and for a canopy::circularRegressor it is a canopy::vonMisesDistribution.
There are two methods for distribution prediction. The first, canopy::randomForestBase::predictDistSingle(), uses a Single Feature Functor to calculate the features. The second, canopy::randomForestBase::predictDistGroupwise(), performs the same function but uses a Groupwise Feature Functor instead.
Doing this for a canopy::classifier looks something like this:
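A sketch, continuing the classifier example. The my_-prefixed names are placeholders, the constructor argument to canopy::discreteDistribution and the pdf() query on the final line are indicative only:

```cpp
#include <vector>

// One output distribution per test data point, each initialised for the
// number of classes in the problem, to be filled in by the forest
std::vector<canopy::discreteDistribution> dists(
    my_num_test_points, canopy::discreteDistribution(my_num_classes));

// Combine the node distributions reached in each tree into the output
// distributions, using the groupwise feature functor
the_classifier.predictDistGroupwise(my_first_test_id, my_last_test_id,
                                    dists.begin(),
                                    my_groupwise_feature_functor);

// Query the combined distribution, e.g. the probability of class 2
// for the first test point
const float p = dists[0].pdf(2);
```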
Recall that probability evaluation is the task of evaluating the probability of a certain label, according to the forest. It is worth briefly considering how this differs from simply performing the distribution prediction task and then evaluating the probability of the label according to the output distribution (which is what we did in the final line of the above snippet). The probability evaluation task evaluates the probability of the label under each node distribution and then averages the result, whereas distribution prediction combines several node distributions and then finds the probability according to this combined distribution.
Where it is possible to find a method of combining the distributions in a way that does not result in loss of information, the result is exactly the same. This is the case with the discrete distribution. However, it is not the case with the von Mises distribution in the circular regression task. In the latter case, the distribution prediction procedure combines several von Mises distributions to create a further von Mises distribution using a sensor-fusion approach that results in loss of information. This is easy to see, because a uni-modal von Mises distribution cannot possibly fully represent the combination of several other distributions that each have different means. Consequently, you will get a different answer for the probability of a certain label if you use the direct probability evaluation approach than if you use distribution prediction.
Therefore, in general, if you want to evaluate the probability of a certain label you should use the direct probability evaluation approach. The distribution prediction task is more useful if you want some summary statistics, such as a point estimate of the mean or variance.
With that said, let's look at how to perform probability evaluation. As with distribution prediction, there are both single and groupwise versions that are otherwise equivalent (canopy::randomForestBase::probabilitySingle() and canopy::randomForestBase::probabilityGroupwise()). Now, however, we don't need to pass in any distributions, but we do need to pass the method some container of floats in which to place the results, along with the values of the labels for which it is to evaluate the probability.
These methods can either evaluate the probability of a single label for multiple test IDs, or each test ID can have its own value of the label, depending on the value of the 5th parameter.
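A sketch, continuing the classifier example. The argument order, the name and semantics of the boolean flag, and all my_-prefixed names are assumptions based on the description above; consult the method documentation for the exact signature:

```cpp
#include <vector>

// One result per test point, and the label value(s) to evaluate
std::vector<float> probs(my_num_test_points);
std::vector<int> query_labels(my_num_test_points, 2);

// Evaluate the probability of each label directly, averaging over the
// node distributions reached in the trees. The boolean flag (the 5th
// parameter) is assumed to select between a single shared label and
// one label per test ID
the_classifier.probabilityGroupwise(my_first_test_id, my_last_test_id,
                                    query_labels.begin(),
                                    probs.begin(),
                                    false,
                                    my_groupwise_feature_functor);
```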
This example file may be found in the examples directory of the canopy repository, along with a Makefile that can be used to compile it using GNU make and g++.
It implements a toy example using a random forest classifier to classify a number of data points into different classes. The features are randomly generated from a Gaussian distribution with a different, randomly-generated mean and variance for each class and pre-stored in an array.
To build this example in GNU/Linux with GNU make, do the following:
To run:
To clean-up:
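The three steps above might look like this from a shell (the executable name is an assumption; use whatever name make produces in your checkout):

```shell
# Build, from the examples directory of the canopy repository
make

# Run the compiled example
./example

# Remove the built files
make clean
```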
For hints on how to do this in other environments, see Compiling User Code.