Linking Administrative Data: Strategies and Methods

WHITE PAPER: Linking Administrative Data: Strategies and Methods (PDF)

We review methods for linking datasets that contain identifying information (e.g., names, birthdates) but no common unique identifier for each individual. We discuss three families of strategies for identifying matches: rules-based matching, supervised machine learning, and unsupervised machine learning. These differ in how they combine human knowledge with computing power. We define several measures of accuracy and explore the performance of common algorithms on test data.
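To make the first of these families concrete, here is a minimal sketch of rules-based matching together with two common accuracy measures, precision and recall. It is not the code from the white paper: the records, the field names (`id`, `name`, `dob`), the helper functions, and the 0.85 similarity threshold are all hypothetical choices made for illustration, and the rule itself (same birthdate plus a similar name) is only one of many rules an analyst might write.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """Hypothetical rule: identical birthdate AND sufficiently similar name."""
    if rec_a["dob"] != rec_b["dob"]:
        return False
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold

def link(file_a, file_b, threshold=0.85):
    """Compare every pair of records and return the set of linked ID pairs."""
    return {
        (a["id"], b["id"])
        for a in file_a
        for b in file_b
        if is_match(a, b, threshold)
    }

def precision_recall(predicted, truth):
    """Precision: share of predicted links that are true.
    Recall: share of true links that were found."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Made-up records for illustration only.
file_a = [
    {"id": "a1", "name": "Maria Garcia", "dob": "1980-02-14"},
    {"id": "a2", "name": "John Smith", "dob": "1975-07-01"},
    {"id": "a3", "name": "Wei Chen", "dob": "1990-11-30"},
]
file_b = [
    {"id": "b1", "name": "Maria Garcia", "dob": "1980-02-14"},
    {"id": "b2", "name": "Jon Smith", "dob": "1975-07-01"},   # misspelled name
    {"id": "b3", "name": "Wei Chan", "dob": "1991-11-30"},    # mistyped birth year
]

# Assumed ground truth: all three people appear in both files.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

predicted = link(file_a, file_b)
precision, recall = precision_recall(predicted, truth)
print(sorted(predicted))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example the rule tolerates the misspelled name but misses the pair with the mistyped birth year, so every link it makes is correct while one true link goes unfound. Loosening the rule would recover that pair at the risk of false links, which is the kind of trade-off the accuracy measures are meant to expose.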

Our goal is to demystify data linking for non-technical readers. We explain the criteria that should inform the choice of a linking method and the decisions required to implement it.

Additional resources, including code and public data referenced on pp. 26-34, are available at: https://github.com/californiapolicylab/data-linking.