Random Forest

Random Forest is an ensemble method based on decision trees: many individual decision trees are combined to build a powerful overall model.


The concept of Random Forest is based on the method of bagging, which belongs to the so-called ensemble methods. Ensemble methods have in common that they combine many weak learners to create one strong predictive model (a strong learner). Bagging stands for bootstrap aggregating. To counteract the high variance of a single decision tree and to increase prediction accuracy, one would ideally draw many training data sets from the population, estimate a separate model on each training data set, and then average all models to obtain an overall model.


Since such a procedure cannot be implemented in practice (multiple independent training data sets are usually not available), bootstrapping is used to simulate different training data sets. In bootstrapping, a new data set with the same number of observations is generated from the original data set by drawing with replacement. As a consequence, observations from the original data set can occur several times in a bootstrap sample, while others are left out entirely.
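Drawing with replacement can be sketched in a few lines of NumPy (a toy data set of 10 observations, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # toy "original data set" with 10 observations

# Draw a bootstrap sample: same number of observations, drawn with
# replacement, so some observations repeat while others are left out.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```

Running this repeatedly with different seeds simulates the different training data sets that bagging averages over.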

In the context of classification with decision trees, B bootstrap samples (= training data sets) are drawn by bootstrapping. A decision tree is estimated on each bootstrap sample. To obtain an overall model, the B trees are aggregated by majority vote: a new, unseen observation is classified by all B decision trees, and the class predicted by the majority of trees is taken as the final decision.
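The majority vote can be sketched with plain NumPy. The predictions below are made up for illustration (B = 5 trees, 4 observations, binary classes):

```python
import numpy as np

# Hypothetical class predictions: one row per tree, one column per observation.
tree_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

# For each observation (column), the final class is the most frequent vote.
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, tree_preds)
print(majority)  # -> [0 1 1 0]
```

This is exactly what scikit-learn's RandomForestClassifier does internally (based on averaged class probabilities rather than hard votes).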

The Random Element

With bagging, the estimated decision trees tend to be highly correlated. On average, about two thirds of all observations are used in each bootstrap sample. This means that a feature with a strong influence will appear in the majority of the decision trees (possibly even in the first split), so the trees produced by bagging correlate strongly. The Random Forest approach addresses this weakness: the goal is to obtain B decision trees that are as uncorrelated as possible. As in bagging, B bootstrap training samples are generated.

However, at each split in each tree, only a random subset of m features out of all p features is considered; the split is chosen from this random subset only. If an extremely strong feature is present, it will on average not even be considered in (p − m)/p of the splits (James et al. 2013). If m is set equal to p, bagging and Random Forest do not differ. The B estimated trees are therefore much more different from each other, so the correlation between them decreases.
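In scikit-learn, the subset size m is controlled by the max_features parameter; setting it to None (i.e. m = p) turns the forest into plain bagging of full trees. A minimal sketch on toy data generated for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data for illustration: 200 observations, p = 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Random Forest: consider only m = sqrt(p) features per split (a common choice).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Bagging: m = p, i.e. every split considers all features.
bagging = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)

rf.fit(X, y)
bagging.fit(X, y)
```

Comparing the two fitted models on held-out data typically shows the decorrelated forest generalizing at least as well as the bagged trees.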

Code Snippet

import numpy as np
import pandas as pd

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Build a random forest with 1,000 trees
randomforest = RandomForestClassifier(n_estimators=1000, random_state=42)

# Estimate recall via 3-fold cross-validation on the training data
cv_scores = cross_val_score(randomforest, X_train, y_train, cv=3, scoring='recall')
print("Average 3-Fold CV recall score: {}".format(np.mean(cv_scores)))

# Fit on the full training set, then predict classes and probabilities on the test set
randomforest.fit(X_train, y_train)
y_pred = randomforest.predict(X_test)
y_pred_proba = randomforest.predict_proba(X_test)[:, 1]

The code snippet is written in the programming language Python and is based on the module scikit-learn. It assumes that the training and test data (X_train, y_train, X_test) have already been prepared.
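The metrics imported in the snippet can then be applied to the test predictions. A self-contained sketch using toy data (generated here only so the example runs; it stands in for the X_train/X_test split assumed above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score

# Toy data standing in for an already-prepared train/test split.
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

randomforest = RandomForestClassifier(n_estimators=100, random_state=42)
randomforest.fit(X_train, y_train)
y_pred = randomforest.predict(X_test)

# Evaluate the predictions on the test set
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```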

For a hands-on tutorial on how to build a random forest model, visit TowardsDataScience.

More resources about machine learning

Data integration

How machine learning benefits from data integration
The causal chain “data integration-data quality-model performance” describes the necessity of effective data integration for easier and faster implementable and more successful machine learning. In short, good data integration results in better predictive power of machine learning models due to higher data quality.

From a business perspective, there are both cost-reducing and revenue-increasing effects. The development of the models becomes cheaper (less custom code, thus less maintenance, etc.). Revenue increases stem from the better predictive power of the models, which leads to more precise targeting, cross- and upselling, and more accurate evaluation of leads and opportunities - both B2B and B2C. You can find a detailed article on the topic here:


How to use machine learning with the Integration Platform
You can make the data from your centralized Marini Integration Platform available to external machine learning services and applications. The integration works seamlessly via the HubEngine or direct access to the platform, depending on the requirements of the third-party provider. For example, one vendor for standard machine learning applications in sales is Omikron. But you can also use standard applications on AWS or in the Google Cloud. Connecting to your own servers is just as easy if you want to program your own models there.

If you need support on how to integrate machine learning models into your platform, please contact our sales team. We will be happy to help you!


Frequent applications of machine learning in sales
Machine learning can support sales in a variety of ways. For example, it can calculate closing probabilities, estimate cross-selling and up-selling potential, or predict recommendations. The essential point is that salespeople are supported and receive additional decision-making assistance, which they can use to better concentrate on their actual activity, i.e., selling. For example, a salesperson can more quickly identify which leads, opportunities, or customers are currently most promising and contact them. However, the salesperson still makes the final decision and is merely supported by machine learning. In the end, it is not the model that sells, but the human being.

Here you will find a short introduction to machine learning and the most common applications in sales.
