At first glance, data integration and machine learning have little in common. However, clean and effective data integration can have a significant impact on the performance of machine learning models: good integration raises the quality of your data, which in turn is reflected in the performance of your models. But what exactly is the relationship between data integration and data quality? And how are data quality and model performance related? In this article, we disentangle the causal chain “data integration – data quality – model performance”.
Anyone who has been through the machine learning process (CRISP-DM) knows from experience how the Pareto principle plays out: you spend 80% of your time on data preparation (gathering, cleansing, standardization, etc.) and building proper data pipelines, and only 20% on the actually interesting topic – modeling. In training, you mostly learn how to build good models, how each algorithm works and when to apply it. Then, in practice, you are thrown in at the deep end and spend most of your time gathering and preparing the data. You are almost happy when the data is dumped on the table as a CSV file – metaphorically speaking. Yet anyone who wants to run machine learning in production shudders at the thought of CSV files. Direct access to the database remains the silver bullet, since the most current data can be retrieved with a simple query. However, this quickly involves a number of access methods and database types – e.g. MySQL, PostgreSQL, MongoDB, MSSQL or Apache Cassandra. And we are still only at data gathering. It is much more convenient to have the relevant data centrally available without writing a pile of custom code, because each retrieval from each source otherwise has to be programmed individually.
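To make that fragmentation concrete, here is a minimal, hypothetical sketch in Python: two in-memory SQLite databases stand in for separate source systems (say, a CRM and a shop database), and each one needs its own hand-written retrieval code before the data can even be joined. All table and field names are invented for illustration.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two separate source systems.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Acme')")

shop = sqlite3.connect(":memory:")
shop.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
shop.execute("INSERT INTO orders VALUES (1, 250.0)")

# Without integration: one hand-written connection and query per system.
customers = crm.execute("SELECT id, name FROM customers").fetchall()
orders = shop.execute("SELECT customer_id, amount FROM orders").fetchall()

# The data scientist then still has to join the pieces manually in code.
merged = [
    (name, amount)
    for (cid, name) in customers
    for (ocid, amount) in orders
    if cid == ocid
]
print(merged)  # the joined customer/order data
```

With a centralized platform, the join and the per-source retrieval code disappear from the data scientist's workload; the data arrives already consolidated.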
Data integration and data quality
Clean and effective data integration can improve your data quality in many ways. For an in-depth look at how data integration impacts your data quality, see the article below.
By integrating your data into a centralized platform, you can improve your data quality along dimensions such as:
- Completeness: Data is available in its entirety and is no longer distributed across systems. As a result, it is accessible at a centralized location.
- Consistency: The data is merged and therefore consistent, so inconsistencies no longer have to be resolved during data preparation in machine learning.
- Accuracy: Inaccuracies in the data are resolved by merging, which likewise removes a step from data preparation in machine learning.
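As a small illustration of the consistency and accuracy dimensions, the following hypothetical Python sketch merges two duplicate customer records from different systems into one consolidated record using a simple survivorship rule (prefer non-empty values). The field names and values are invented.

```python
# Duplicate records for the same customer in two different systems.
crm_record = {"email": "a@example.com", "phone": None, "city": "Berlin"}
shop_record = {"email": "a@example.com", "phone": "+49 30 1234567", "city": "Berlin"}

def merge_records(a, b):
    """Simple survivorship rule: prefer non-empty values from record a,
    fall back to record b."""
    return {key: a[key] if a[key] is not None else b[key] for key in a}

golden = merge_records(crm_record, shop_record)
print(golden)  # one consolidated record; the missing phone number is filled
```

Real integration platforms apply far more sophisticated matching and survivorship rules, but the effect is the same: one consistent record instead of several partly contradictory ones.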
Validate and enrich data
If your data is consolidated in a centralized platform, you can also validate and enrich it via external data providers. Validation positively influences the data quality dimensions of validity and currency; address data, for example, can be validated this way. Current address data, whether B2C or B2B, becomes particularly valuable when it is enriched with microgeographic metrics such as purchasing power (B2C) or location quality (B2B). Customer data can also be enriched in many other ways:
- B2B: financial data, payment data, general company information, risk data.
- B2C: Microgeographical data such as purchasing power, interests, behavior (e.g. at zip code or even address level).
This makes the data richer in both quantity and quality, positively influencing the completeness dimension.
You can validate and enrich your data via the Data Marketplace. Here we list all available external data providers.
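As a hypothetical sketch of such enrichment, the following Python snippet appends a microgeographic purchasing-power index to customer records by zip code. The index values, names and zip codes are invented; in practice, the lookup data would come from an external provider.

```python
# Customer records from the centralized platform (invented data).
customers = [
    {"name": "Müller", "zip": "10115"},
    {"name": "Schmidt", "zip": "80331"},
]

# External microgeographic data: purchasing-power index per zip code
# (100 = national average; values invented for illustration).
purchasing_power = {"10115": 97.3, "80331": 128.4}

# Enrichment: attach the index to each customer record by zip code.
for customer in customers:
    customer["purchasing_power"] = purchasing_power.get(customer["zip"])

print(customers)  # each record now carries the purchasing-power feature
```

The enriched field can later serve directly as a feature in a machine learning model.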
Data quality and model performance
In the machine learning process, data preparation is followed by model building. The data is now prepared so that it has the right format for training the model. In the modeling step, multiple models are trained; different models can result from different algorithms or from different parameter settings within an algorithm. Procedures such as cross-validation are also applied in this step. Subsequently, the trained models are evaluated and compared against each other (model evaluation), and usually the model that predicts best is selected. In general, one wants predictions that are as accurate as possible. The target variable of the prediction can be either an event (classification) or a continuous numerical value such as sales (regression).
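The modeling and evaluation steps described above can be sketched as follows, assuming scikit-learn and a synthetic dataset: two candidate algorithms are compared via 5-fold cross-validation, and the best-scoring model is selected.

```python
# Sketch of modeling + model evaluation: train several candidate models,
# compare them with cross-validation, keep the best one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for prepared training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds per candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In a real project, the candidate set would also cover different hyperparameter settings of each algorithm, not just different algorithms.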
What influences the prediction of the target variable?
One uses so-called independent variables (in machine learning: features) to predict the target variable. Example: I want to predict a customer's sales. For this I can use information such as:
- Age of the customer
- Length of the customer relationship
- Previous purchases
- Marketing interactions (e.g. mailing openings)
- Sales interactions (e.g., number of conversations with sales representatives)
- Service interactions (e.g., number of complaints)
- Interests (e.g., indicated in newsletter)
- Purchasing power (external microgeographic data)
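A minimal sketch of such a prediction, assuming scikit-learn and entirely invented numbers: a linear regression predicts a customer's sales from a few of the features listed above.

```python
# Predict customer sales from a handful of features (all values invented).
from sklearn.linear_model import LinearRegression

# Feature columns: age, years as customer, previous purchases, mailing opens.
X = [
    [34, 2, 3, 5],
    [51, 8, 12, 1],
    [29, 1, 1, 9],
    [45, 6, 7, 4],
]
y = [1200, 4800, 600, 3100]  # sales in EUR

model = LinearRegression().fit(X, y)

# Predict sales for a new, unseen customer.
prediction = model.predict([[40, 4, 5, 3]])[0]
print(round(float(prediction), 2))
```

Each additional meaningful feature – e.g. the purchasing power from external enrichment – would appear here as one more column of X.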
How to get better predictions
A model is by definition a simplified representation of reality. We try to predict a target variable with only part of all the information available to us (see list above). It thus becomes clear that we can never include all influencing factors. If I want to predict a purchase decision, thousands of factors that I cannot measure can have an influence, e.g.
- Attitudes towards the brand
- Values of the consumer
- Character traits
- Degree of information available to the consumer
In general, however, the more information I have available, the better I can predict a target variable with a model. In other words, data volume usually (not always – cf. the bias-variance tradeoff!) has a positive effect on model performance. The more information available for the prediction, the more accurately my model can represent reality.
In addition to the volume, the information value of the data is also crucial, and it varies across target variables. For example, age has a significant influence on whether individuals are willing to take out private pension insurance. On average, that willingness declines with increasing age, as the usefulness and economic benefit decline as well. To stay with the same example, a much harder factor to measure is risk appetite: risk-averse individuals are more likely to take out pension (or any) insurance. How could this be approximated in practice? Possibly by the number of insurance policies a person has already taken out.
So different factors matter to different degrees for predicting different target variables. For example, whether a person likes to do sports probably plays little role in the purchase of a television, but it may well play a role in the purchase of a bicycle. In other words, the information value of data depends strongly on what I want to predict.
Let's put everything together
Two benefits emerge from the previous paragraphs. First, through data integration the data is more easily available to the data scientist – and often already in the appropriate format. Second, the quality of the data is higher – recall the dimensions of data quality. This affects both the volume and the information value of the data, which in turn can improve model performance.
Benefit 1 – data more easily available and in the appropriate format
If you have consolidated your data in a centralized platform, extensive transformation and validation options are available there. This also applies to validating and enriching data via external data providers such as Dun & Bradstreet in B2B. As already mentioned, building the data pipeline (data gathering, data preparation) in machine learning is complex. If the data is centrally available, the pipeline becomes less complex, and we reduce the costly 80% of time in the machine learning process. Furthermore, the data is standardized, consistently formatted and already unified, so data preparation becomes easier. This further reduces the time needed to build the data pipeline, which in turn lowers costs. In addition, less custom code is written, which makes the application more maintainable and also reduces costs (less maintenance, etc.).
Benefit 2 – higher data quality
Effective data integration increases data quality (completeness, consistency, accuracy); additional enrichment via external data services increases it further (currency, validity, completeness). Both the volume and the information value of the data (through more and more accurate data) are positively affected, which increases model performance. In B2B, additional company information can be critical to how well a model predicts the chances of closing a deal. In B2C, purchasing power often plays a major role and is therefore well suited for predicting purchases. Behavioral data such as interaction data is also often a good predictor. For example, interaction data from the marketing automation application and the CRM system can be consolidated in the centralized platform and used there by the machine learning service.
Long story short
The causal chain “data integration – data quality – model performance” demonstrated here underscores the necessity of effective data integration for machine learning that is easier and faster to implement – and more successful. In short: good data integration leads to better predictive power of machine learning models.
From a business perspective, there are both cost-reducing and revenue-increasing effects. Model development becomes cheaper (less custom code, hence less maintenance, etc.). The improved predictive power of the models increases revenue through more precise targeting, cross-selling and upselling, and more accurate evaluation of leads and opportunities – in both B2B and B2C.
How to use machine learning with the Integration Platform
You can make the data from your centralized Marini Integration Platform available to external machine learning services and applications. The integration works seamlessly via the HubEngine or direct access to the platform, depending on the requirements of the third-party provider. For example, one vendor for standard machine learning applications in sales is Omikron. But you can also use standard applications on AWS or in the Google Cloud. Connecting to your own servers is just as easy if you want to program your own models there.
If you need support on how to integrate machine learning models into your platform, please contact our sales team. We will be happy to help you!
Applications of machine learning in sales
Machine learning can support sales in a variety of ways. For example, it can calculate closing probabilities, estimate cross-selling and up-selling potential, or predict recommendations. The essential point is that salespeople are supported and receive additional decision-making aids, which they can use to concentrate better on their actual activity: selling. For example, a salesperson can identify more quickly which leads, opportunities or customers are currently most promising and contact them. The final decision, however, remains with the salesperson, who is merely supported by machine learning. In the end, it is not the model that sells, but the human being.
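As a hedged sketch of the closing-probability use case, assuming scikit-learn and invented training data: a classifier estimates the win probability of open opportunities so that a salesperson can prioritize the most promising ones.

```python
# Estimate closing probabilities for open opportunities (invented data).
from sklearn.ensemble import RandomForestClassifier

# Features per opportunity: deal size (kEUR), sales interactions, lead score.
X_train = [
    [10, 2, 30],
    [80, 9, 85],
    [25, 4, 55],
    [60, 7, 70],
    [5, 1, 20],
    [90, 10, 95],
]
y_train = [0, 1, 0, 1, 0, 1]  # 1 = deal was won

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Score two open opportunities and rank them by win probability.
open_opportunities = [[70, 8, 80], [12, 2, 25]]
probabilities = model.predict_proba(open_opportunities)[:, 1]
ranked = sorted(zip(probabilities, open_opportunities), reverse=True)
print(ranked)  # most promising opportunity first
```

The ranked list is exactly the kind of decision aid described above: the model prioritizes, the salesperson decides.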
Here you will find a short introduction to machine learning and the most common applications in sales.