Cross Validation

Cross-validation is a procedure for validating models in data mining and machine learning.

The input data is prepared during data preparation according to current machine learning standards (pre-processing). Normally, the data is then split into a training and a test data set. The model is trained on the training data and evaluated on the test data. However, an unfavorable split into training and test data can lead to a seriously misleading evaluation of the model. For example, the distribution of the target variable (target class) may differ between training and test data, or the distribution of certain features may differ greatly. This is usually not a problem if the data set is large enough. But when is the data set large enough?
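A minimal sketch of such a single train/test split, assuming scikit-learn is used (the library choice is our assumption; the article only refers to Python) and using a synthetic data set for illustration:

```python
# A single train/test split, sketched with scikit-learn (assumed library);
# the data set is synthetic and for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # fictitious data set of 1000 rows

# 80% training data, 20% test data; stratify=y keeps the class distribution similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on the test data:", model.score(X_test, y_test))
```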

k-fold Cross Validation

In k-fold cross validation, the data is split into training and test data in k iterations. Accordingly, we obtain k model evaluations. In a 5-fold cross validation (see picture) the data is split 5 times. In each of the k = 5 iterations, 1/k = 1/5 = 20% of the data serves as test data and the remaining 80% as training data. Likewise, with k = 10, each iteration uses 1/k = 1/10 = 10% of the data as test data and 90% as training data, etc. The split is done systematically by index and not randomly: in the first iteration the first 1/k of the rows is used as test data, in the second iteration the next 1/k, and so on (see picture).

Figure: 5-fold cross validation. In each iteration, a different part of the data is used as test data and the rest as training data. At the end, the performance metrics are averaged.
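The systematic, index-based splitting can be reproduced as follows, assuming scikit-learn's KFold with shuffling disabled; the data is a placeholder:

```python
# Systematic (index-based, non-shuffled) 5-fold splitting, sketched with scikit-learn's KFold.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # stands in for a data set with 1000 rows

kf = KFold(n_splits=5, shuffle=False)  # shuffle=False keeps the folds in index order
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Iteration 1 tests on rows 0-199, iteration 2 on rows 200-399, and so on.
    print(f"Iteration {i}: test rows {test_idx[0]}-{test_idx[-1]}, "
          f"{len(train_idx)} training rows")
```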

Example

We have a data set of 1000 rows. For model evaluation we use accuracy, i.e. (True Positives + True Negatives) / total number of rows. The following numbers are fictitious and serve only as an illustration. We apply a 5-fold cross validation.

  1. Iteration:
    • Test data: rows 1-200
    • Training data: rows 201-1000
    • True Negative: 745
    • True Positive: 85
    • Accuracy: (745+85)/1000 = 83%
  2. Iteration:
    • Test data: rows 201-400
    • Training data: rows 1-200 & 401-1000
    • True Negative: 764
    • True Positive: 76
    • Accuracy: (764+76)/1000 = 84%
  3. Iteration:
    • Test data: rows 401-600
    • Training data: rows 1-400 & 601-1000
    • True Negative: 789
    • True Positive: 21
    • Accuracy: (789+21)/1000 = 81%
  4. Iteration:
    • Test data: rows 601-800
    • Training data: rows 1-600 & 801-1000
    • True Negative: 755
    • True Positive: 85
    • Accuracy: (755+85)/1000 = 84%
  5. Iteration:
    • Test data: rows 801-1000
    • Training data: rows 1-800
    • True Negative: 758
    • True Positive: 52
    • Accuracy: (758+52)/1000 = 81%

We obtain an average accuracy of (83% + 84% + 81% + 84% + 81%) / 5 = 82.6%.
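The same procedure (train on each training fold, score on the corresponding test fold, then average) can be expressed compactly, again assuming scikit-learn; data and model are placeholders:

```python
# Average accuracy over 5 folds, sketched with scikit-learn; data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # one accuracy per iteration
print("Accuracies per iteration:", scores.round(3))
print("Average accuracy:", scores.mean().round(3))
```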

If a single random split had instead yielded an accuracy of, say, 89%, this would point to a randomly unfavorable (overly optimistic) split of training and test data and thus to a risk of overfitting.

When is cross validation recommended?

Advantages of k-fold cross-validation at a glance

  • Identification of overfitting
  • Testing the robustness of the model
  • More reliable estimates of model performance on small data sets
  • More reliable estimates of model performance on imbalanced data sets
  • Useful for model (hyperparameter) tuning

The final model parameters are then determined with a conventional split into training and test data (see the sketch below).
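A minimal sketch of using cross-validation for tuning and then evaluating the final model on a conventional hold-out split, again assuming scikit-learn; the model and parameter grid are illustrative assumptions:

```python
# Hyperparameter tuning via 5-fold CV on the training data, then a final evaluation on a
# conventional hold-out test set. Assumes scikit-learn; model and parameter grid are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross validation on the training data selects the regularization strength C
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Accuracy on the hold-out test data:", search.best_estimator_.score(X_test, y_test))
```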

Looking for a hands-on tutorial in Python on how to use cross-validation? Then check out TowardsDataScience!

More resources about machine learning

Data integration

How machine learning benefits from data integration
The causal chain “data integration - data quality - model performance” describes why effective data integration is necessary for machine learning that is easier and faster to implement and more successful. In short, good data integration leads to higher data quality and thus to better predictive power of machine learning models.

From a business perspective, there are both cost-reducing and revenue-increasing effects. Model development becomes cheaper (less custom code and therefore less maintenance, etc.). Revenue increases result from the better predictive power of the models, which enables more precise targeting, cross- and upselling, and a more accurate evaluation of leads and opportunities, both in B2B and B2C. You can find a detailed article on the topic here:

Platform

How to use machine learning with the Integration Platform
You can make the data from your centralized Marini Integration Platform available to external machine learning services and applications. The integration works seamlessly via the HubEngine or direct access to the platform, depending on the requirements of the third-party provider. For example, one vendor for standard machine learning applications in sales is Omikron. But you can also use standard applications on AWS or in the Google Cloud. Connecting to your own servers is just as easy if you want to program your own models there.

If you need support on how to integrate machine learning models into your platform, please contact our sales team. We will be happy to help you!

Applications

Frequent applications of machine learning in sales
Machine learning can support sales in a variety of ways. For example, it can calculate closing probabilities, estimate cross-selling and up-selling potential, or generate recommendations. The essential point is that the salesperson is supported and receives additional decision-making aids, allowing them to concentrate better on their actual activity: selling. For example, the salesperson can identify more quickly which leads, opportunities or customers are currently most promising and contact them. The salesperson still makes the final decision and is merely assisted by machine learning. In the end, it is not the model that sells, but the human being.

Here you will find a short introduction to machine learning and the most common applications in sales.

Further articles