These days, you often come across terms like Customer Data Platform or Customer 360. The concept is quickly grasped: it’s all about a unified and complete view of the customer. This means that all important information is centrally available and visible at a glance (or two).
What's all the fuss about the Golden Record?
In theory, it sounds simple: I pull together all the information about a customer, store it, and visualize the data, whether as a chart or as a table. We’ll leave aside the fact that it’s often the pulling together (the integration) that fails. If we assume that the integration has been realized, e.g. with the Marini Integration Platform, we face the next challenge: the creation of the Golden Record, which is inseparably linked to a Customer Data Platform. The Golden Record brings together the available data of an entity (e.g. a person or a company). But let’s start at the beginning.
The old problem: data silos
If customer data is distributed across many systems (e.g. CRM, ERP, e-commerce, marketing automation), then each system shows only an incomplete view of the theoretically available data. The first focus is on master data, which is merged and consolidated. Master data is the data that is necessary for the regular processing of a record and that does not change frequently. Examples of master data for a contact: first name, last name, address, gender. Examples for another entity, such as a product: name, description, category. The management of master data is also called Master Data Management. There is often a separate type of system, the master data management system, for governing and consolidating master data.
If the master data of an entity is combined from several systems, it is condensed into a golden record. This golden record now combines data from several sources which refer to the same real object (e.g. to the same person, the same company). Uniserv also refers to the Golden Record as the unique data record. The Golden Record is provided with a Persistent ID (PID). This is used to uniquely identify an object from the real world (e.g. contact, company, etc.). The PID can now be transferred back to the various source systems from which the data was collected. This allows a person or company to be uniquely identified even across systems, since information such as names may not be unique. Let’s take the following example:
Example of distributed master data across different systems
Field | ERP | CRM | MA |
ID | 1234 | abcd | 1a2b |
First Name | John | John | John |
Last Name | Doe | Doe | D |
E-Mail Address | john.doe@marini.de | – | j.doe@marini.de |
Street | Kaiserstraße 57 | Kaiserstr. 57 | Kaiser Straße |
City | Frankfurt am Main | Frankfurt am Main | Frankfurt |
Country | Germany | GER | DE |
The three records above would now be pooled into one Golden Record, as they relate to the same object (in our case, a person) in the real world, namely John Doe from Marini Systems. The next step would be to assign the same PID to these three records so that the record can be uniquely identified as the same across ERP, CRM and Marketing Automation. Now the master data from the different systems used in the company are no longer isolated. The management of master data (master data management) can now be conducted in a targeted manner. A next logical step would be, for example, to standardize the data or to merge it in a central system (CDP, EDP, MDM).
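As a small illustration of the standardization step, the three spellings of the country in the table above can be mapped to a single value. This is only a minimal sketch: the alias table `COUNTRY_ALIASES` and the function `standardize_country` are hypothetical names, and choosing the ISO 3166-1 alpha-2 code `DE` as the target format is an assumption, not something prescribed by a particular platform.

```python
# Hypothetical alias table: maps the spellings seen in the source systems
# to one canonical value (here the ISO 3166-1 alpha-2 code, an assumption).
COUNTRY_ALIASES = {
    "germany": "DE",
    "ger": "DE",
    "de": "DE",
}


def standardize_country(value: str) -> str:
    """Return the canonical country code, or the trimmed input if unknown."""
    cleaned = value.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


# The three values from the ERP, CRM and MA records in the table above.
for raw in ("Germany", "GER", "DE"):
    print(raw, "->", standardize_country(raw))
```

Unknown values are passed through unchanged here, so that nothing is silently overwritten; in practice they would probably be flagged for review.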
From the Golden Record to Customer 360
In addition to master data, there is also transaction data. Unlike master data, which tends not to change frequently, transaction data is dynamic: each record literally represents a single transaction. In the context of personal data, this can mean many things, such as:
- the purchase of a product
- a visit to a website
- the subscription to a newsletter
- the registration for a webinar
- a product complaint
Generally speaking, this covers all stored interactions with the customer at touchpoints, whether B2B or B2C.
If the golden record is now enriched with transaction data, we also speak of Customer 360, which provides a 360-degree view of the customer. All the information that has been collected about the customer is available at a glance: in the golden record, which represents a holistic customer profile. The view of such a profile is realized via customer data platforms. With such a holistic view, more informed marketing and sales actions can be triggered. The focus shifts to the customer and the company can act in a more customer-centric manner and thus increase its success.
How to build a Golden Record?
If records are to be linked into a golden record, a decision on the linkage criterion must be made. This is known as record linkage: two records are assigned to each other because they refer to the same object in the real world. There are three options:
- If-then rule set
- Machine Learning (probability-based)
- A combination of the two
If-then rule set
The simpler option, though not always feasible, is an if-then rule set. Rules are defined for how to link records. The simplest case considers one field of the record. For example, two records could be matched if their email addresses are identical. However, an e-mail address is not always unique. Married couples can share e-mail addresses, children use their parents’ e-mail addresses, collective addresses (e.g. info@marini.de) are used in companies, or the secretary’s office uses the boss’s address. This produces plenty of cases in which the simple rule of “same e-mail address” is no longer sufficient or runs the risk of being too imprecise.
Therefore, multiple and combined rules are often used, i.e. a set of rules is formed. This set must be formed on an application-specific and company-specific basis and requires not only domain knowledge but also company-specific knowledge. For example, a simple set of rules could be: if last name and e-mail address, or first name, last name and street, are identical, then the records are matched to each other. However, it becomes clear that the formulation of the set of rules is based on common sense and is difficult to validate.
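The example rule set just described could be sketched roughly as follows. The `Contact` class and the function names are hypothetical, and comparing values case-insensitively (and never matching on missing values) is an assumption on top of the plain "identical" from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    first_name: str
    last_name: str
    email: Optional[str]
    street: str


def same(a: Optional[str], b: Optional[str]) -> bool:
    """Exact comparison ignoring case and outer whitespace.

    Missing or empty values never count as identical (an assumption).
    """
    if not a or not b:
        return False
    return a.strip().lower() == b.strip().lower()


def rule_match(a: Contact, b: Contact) -> bool:
    """Rule 1: last name AND e-mail; rule 2: first name, last name AND street."""
    rule_1 = same(a.last_name, b.last_name) and same(a.email, b.email)
    rule_2 = (
        same(a.first_name, b.first_name)
        and same(a.last_name, b.last_name)
        and same(a.street, b.street)
    )
    return rule_1 or rule_2


# ERP and CRM records of John Doe from the table above.
erp = Contact("John", "Doe", "john.doe@marini.de", "Kaiserstraße 57")
crm = Contact("John", "Doe", None, "Kaiserstr. 57")
print(rule_match(erp, crm))  # False: no e-mail in CRM, street spelled differently
```

Note that the two records of John Doe are not matched here, although they clearly belong together: the CRM record has no e-mail address and the street is abbreviated. This is exactly the weakness of rules that require identical fields.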
Machine Learning
A disadvantage of any if-then rule set is that fields must be identical. If this and/or that field are the same, then the records are linked. In contrast, a machine learning solution uses so-called “fuzzy matching”. For each field of two records a distance measure is calculated. Depending on the data type, different distance measures can be used, e.g.
- String: Jaro-Winkler, Levenshtein, Jaccard
- Date: Difference in days
- Integer: Absolute difference
- Boolean: Dummy variable
These are then used as features in a machine learning model (support vector classifiers, for example, are well suited for this problem). The disadvantage of this approach is that a training data set with labeled data is required (cf. supervised learning). Once the model is trained, it can predict the probability that a new pair of records is identical. Based on the probability, automated matching could be performed starting at a certain threshold. Alternatively, the suggestions can be linked manually to achieve more unambiguous results.
Record linkage using machine learning has the advantage that even records that are not exactly the same can be matched. Theoretically, no field would have to be exactly the same; if the overall similarity is sufficiently high, the algorithm would return a correspondingly high probability. Thus, the strings Kaiserstrasse 57, Kaiserstr. 57, and Kaiser Strasse would be given a high similarity. Despite the different spellings, the strings are recognized as very similar.
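To make the idea of a string distance tangible, here is a minimal sketch using the Levenshtein distance, one of the measures listed above, implemented in plain Python. Turning the distance into a similarity between 0 and 1 by dividing by the length of the longer string, and lowercasing beforehand, are assumptions made for this illustration. "Hauptbahnhof 1" is an invented counterexample.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


reference = "Kaiserstrasse 57"
for variant in ("Kaiserstr. 57", "Kaiser Strasse", "Hauptbahnhof 1"):
    print(f"{variant!r}: {similarity(reference, variant):.2f}")
```

In a full solution, one such value per field (name, street, city, and so on) would form the feature vector that is passed to the classifier.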
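However, record linkage presents many other challenges, such as suitable blocking rules (which limit the number of record pairs that are compared, for performance reasons), transitive mapping (if A matches B and B matches C, all three should end up together), class imbalance, or a sufficient amount of training data.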
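Two of these challenges, blocking and transitive mapping, can be illustrated with a short sketch. Blocking here means that only records sharing a cheap key are compared at all; the transitive step groups records with a simple union-find structure. The blocking key (first letter of the last name), the extra record `CRM:efgh` and the list of matched pairs are all invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield only pairs of record IDs that fall into the same block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def cluster(ids, matched_pairs):
    """Group IDs transitively: A~B and B~C puts A, B and C in one cluster."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for i in ids:
        groups[find(i)].add(i)
    return sorted(groups.values(), key=sorted)


records = {
    "ERP:1234": {"last_name": "Doe"},
    "CRM:abcd": {"last_name": "Doe"},
    "MA:1a2b": {"last_name": "D"},
    "CRM:efgh": {"last_name": "Smith"},  # invented additional record
}

pairs = list(candidate_pairs(records, lambda r: r["last_name"][:1].lower()))
print(len(pairs))  # 3 candidate pairs instead of 6 without blocking

# Assumed output of a matcher: ERP~CRM and CRM~MA, but not ERP~MA directly.
matched = [("CRM:abcd", "ERP:1234"), ("CRM:abcd", "MA:1a2b")]
clusters = cluster(list(records), matched)
print(clusters)
```

Even though the ERP and MA records were never matched directly, they end up in the same cluster via the CRM record.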
If-then rule set + Machine Learning
Of course, both approaches above can also be combined. For example, a probability can be calculated using machine learning and serve as a preselection, after which a rule is applied. In general, the trade-off between false positives and false negatives (missed matches) must always be considered in automation. The more lax the rules or the matching threshold, the more false positives (i.e., records that were incorrectly matched). On the other hand, the stricter the rules or the probability threshold, the fewer of the truly identical records are found.
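The effect of the threshold can be made concrete with a few lines of code. The scored pairs below are invented for illustration: each entry is a predicted match probability together with the true label, as one would have in a labeled validation set.

```python
def errors_at(threshold, scored_pairs):
    """Count false positives and missed matches for a given threshold."""
    false_positives = sum(
        1 for prob, is_match in scored_pairs if prob >= threshold and not is_match
    )
    missed_matches = sum(
        1 for prob, is_match in scored_pairs if prob < threshold and is_match
    )
    return false_positives, missed_matches


# (predicted probability, truly the same entity?), made-up values
scored_pairs = [
    (0.95, True),
    (0.85, True),
    (0.80, False),
    (0.65, True),
    (0.60, False),
    (0.30, False),
]

for threshold in (0.5, 0.7, 0.9):
    fp, missed = errors_at(threshold, scored_pairs)
    print(f"threshold {threshold}: {fp} false positives, {missed} missed matches")
# threshold 0.5: 2 false positives, 0 missed matches
# threshold 0.7: 1 false positives, 1 missed matches
# threshold 0.9: 0 false positives, 2 missed matches
```

Which threshold is appropriate depends on what is more expensive in the given use case: a wrongly merged customer or an undetected duplicate.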
Persistent Identifier
If records are linked using one of the above methods, a PID is assigned to the Golden Record subsuming the linked records. The PID ties together records of the same entity across different systems in a 1:n relation: one or more IDs from the different systems (effectively the source IDs) are assigned to one PID. The PID now allows unique identification of an object (such as a person) across the complete system landscape.
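The 1:n relation between a PID and its source IDs could be sketched as follows. Using a random UUID as the PID is an assumption for this example, and `assign_pids` is a hypothetical helper; any stable, unique identifier would serve the same purpose.

```python
import uuid


def assign_pids(clusters):
    """Give each cluster of (system, source_id) pairs one shared PID.

    Returns two lookups: PID -> source IDs (1:n) and source ID -> PID.
    """
    pid_to_sources = {}
    source_to_pid = {}
    for linked_records in clusters:
        pid = str(uuid.uuid4())  # assumption: a random UUID serves as PID
        pid_to_sources[pid] = sorted(linked_records)
        for source in linked_records:
            source_to_pid[source] = pid
    return pid_to_sources, source_to_pid


# The three linked records of John Doe from the table above.
john_doe = {("ERP", "1234"), ("CRM", "abcd"), ("MA", "1a2b")}
pid_to_sources, source_to_pid = assign_pids([john_doe])

# The same PID can now be written back to each source system.
for (system, source_id), pid in sorted(source_to_pid.items()):
    print(system, source_id, "->", pid)
```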
And what's its connection to data integration?
It is inherent to integrations that as soon as data is transferred from one system to another (especially when master data is transferred), duplicates are bound to occur. The Golden Record addresses this problem. As soon as master data is integrated, the formation of a Golden Record is recommended.
Data redundancies cause dispersed information and thus information loss. This results in costs. Data integration with Golden Records (and the creation of a PID) counters data redundancies and incomplete data.
For the “account” entity, for example, the data provider Dun & Bradstreet handles the record linkage by delivering the so-called DUNS (the PID for accounts from D&B). This way, not only can duplicate accounts be identified, but the individual data record can also be enriched with a wide range of data, such as risk or financial data. In addition, the data is validated by D&B, i.e. inconsistencies such as street spellings are corrected.
Conclusion
The golden record is increasingly coming into focus – especially in today’s corporate system landscape. Master data and associated transaction data are scattered in silos. The Golden Record is an attempt to unite all this data and thus provide a holistic view of the customer – Customer 360. If I want a complete view of my customer, there is no way around the Golden Record. And even in a previous step, as soon as I integrate master data from different sources, it is advisable to create a golden record and thus resolve duplicates. Because data redundancy is cost-intensive and hampers sales!