Record Linkage

When records are merged into a golden record, a criterion is needed for deciding whether two records should be merged at all. The methodology for this matching is called record linkage: two records are linked when they refer to the same real-world object. There are two basic approaches:

  1. If-Then Rule Set
  2. Machine Learning (probability-based)

If-Then Rule Set

The simplest option is an if-then rule set, in which rules determine when records are matched. In the simplest case, only a single field of a record is considered. For example, two records could be matched if their email addresses are identical. However, a single field does not always identify a real object uniquely; several people may share one email address, for instance.

Therefore, multiple and combined rules are often used, i.e. a set of rules is formed. An example of a simple rule set: if (last name and email address) or (first name, last name, and street address) match, then the records are linked.
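The example rule set above can be sketched in a few lines of Python. This is only a minimal illustration: the field names (first_name, last_name, email, street), the helper functions, and the sample records are hypothetical, and the lower-casing and trimming step is an added assumption so that purely cosmetic differences do not block a match.

```python
def normalize(value):
    """Lower-case and trim a field value (an assumed, optional clean-up step)."""
    return (value or "").strip().lower()


def fields_match(a, b, fields):
    """True if every listed field is non-empty and identical in both records."""
    return all(
        normalize(a.get(f)) != "" and normalize(a.get(f)) == normalize(b.get(f))
        for f in fields
    )


# Each rule lists the fields that must ALL match; the rules are combined with OR.
RULES = [
    ("last_name", "email"),
    ("first_name", "last_name", "street"),
]


def is_link(a, b, rules=RULES):
    """Link two records if at least one rule is satisfied."""
    return any(fields_match(a, b, rule) for rule in rules)


# Hypothetical sample records
rec1 = {"first_name": "Anna", "last_name": "Schmidt",
        "email": "anna@example.com", "street": "Main St 1"}
rec2 = {"first_name": "A.", "last_name": "schmidt",
        "email": "Anna@example.com ", "street": "Main Street 1"}
rec3 = {"first_name": "Anna", "last_name": "Schmitt",
        "email": "a.schmitt@example.com", "street": "Main St 1"}

print(is_link(rec1, rec2))  # True: last name and email agree after normalization
print(is_link(rec1, rec3))  # False: the last names differ by one letter
```

The second comparison fails even though "Schmidt" and "Schmitt" are probably the same person, which illustrates the limitation of exact matching.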

Machine Learning

A disadvantage of any if-then rule set is that fields must be identical. In contrast, machine learning can also use “fuzzy matching”. For each field of two records, a distance metric (or a similarity metric, where similarity = 1 – distance) is calculated. Examples per data type:

  • String: Jaro-Winkler, Levenshtein, Jaccard
  • Date: Difference in days
  • Integer: Absolute difference
  • Boolean: Dummy variable (0 or 1)
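As a rough sketch, some of the metrics listed above could be implemented in plain Python as follows. The function names are illustrative only. Dividing the Levenshtein distance by the longer string length to obtain a similarity, computing Jaccard on whitespace-separated tokens, and coding a Boolean mismatch as 1 are assumptions made for this example; other conventions are equally possible.

```python
from datetime import date


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Similarity = 1 - distance, with the distance scaled to [0, 1] (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a, b):
    """Jaccard index on sets of lower-cased, whitespace-separated tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def date_distance(a, b):
    """Difference in days."""
    return abs((a - b).days)


def int_distance(a, b):
    """Absolute difference."""
    return abs(a - b)


def bool_distance(a, b):
    """Dummy variable: 0 if both values agree, 1 otherwise."""
    return 0 if a == b else 1


print(levenshtein("Schmidt", "Schmitt"))                           # 1
print(jaccard_similarity("Main St 1", "Main Street 1"))            # 0.5
print(date_distance(date(1990, 5, 17), date(1990, 5, 20)))         # 3
```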

The distance measures are used as features in a machine learning model (e.g. a support vector classifier). Training the model requires a labeled dataset, i.e. pairs of records that are already marked as match or non-match (cf. supervised learning). The trained model can then calculate a matching probability for new pairs of records.
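The training step can be illustrated with a small, dependency-free sketch. Note the substitutions: instead of a support vector classifier, it uses logistic regression fitted by plain gradient descent, simply because that fits in a few lines and outputs a probability directly. The feature vectors (name similarity, email similarity, same-city flag), the labels, and the hyperparameters are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Matching probability for one record pair, given its similarity features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_features = len(pairs[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(pairs, labels):
            error = predict_proba(weights, bias, x) - y
            for k in range(n_features):
                grad_w[k] += error * x[k]
            grad_b += error
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(pairs)
    return weights, bias


# Hypothetical labeled pairs: [name similarity, email similarity, same-city flag]
X = [
    [0.95, 1.00, 1.0],
    [0.90, 0.85, 1.0],
    [0.85, 1.00, 0.0],
    [1.00, 0.90, 1.0],
    [0.30, 0.10, 0.0],
    [0.20, 0.25, 1.0],
    [0.40, 0.00, 0.0],
    [0.10, 0.15, 0.0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = same real object, 0 = different objects

weights, bias = train(X, y)

# Score two new, unlabeled pairs
p_similar = predict_proba(weights, bias, [0.92, 0.95, 1.0])
p_different = predict_proba(weights, bias, [0.15, 0.05, 0.0])
print(f"similar pair:   {p_similar:.3f}")
print(f"different pair: {p_different:.3f}")
```

In practice, a threshold on the predicted probability (0.5 is a common but arbitrary choice) would decide whether a pair is linked, and a real project would rely on far more labeled pairs than the eight used here.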

Video on Record Linkage

  • How does record linkage work?
  • What exactly does it have to do with machine learning?
  • Where is record linkage applied in marketing and sales?

We answer these questions in this short video (in German).
