What is data linking?

Data linking means taking information about a person or an entity from various sources and collating it to reveal a trend or pattern. This research tool has a broad scope of applications, along with benefits and challenges at both the micro and macro levels. Data linking has been an integral part of research and policymaking for many years, and its processes have evolved with advances in technology.

Data linking is the process of collating information from different sources in order to create a more valuable and helpful data set. Linking information about the same person or entity from disparate sources allows, among other things, the construction of a chronological sequence of events. Such a sequence is of great value at the policy level because it supports better-informed decisions.

Linking and combining information from a range of sources creates a large data set that covers many different parameters. The main aim of this exercise is to gain information at a macro level. For example, information about children in a local community can help decide how many early childhood programs are required and where schools should be located.

In the past, people had to rely on government data to obtain this level of information. Now, however, it is possible to link data from different sources while maintaining high standards of privacy and safety. Social researchers can ethically use this linked data to understand the characteristics and needs of a population. This ultimately promotes better health and social services for the community.

How does data linking work?

Over a lifetime, an individual accesses many services provided by different organizations. In many cases, their identity and the data they provide are recorded. This is called administrative data. In this way, over time, organizations such as schools and hospitals collect a huge amount of administrative data.

Within each organization, data custodians manage this data. Part of their role is to ensure that the privacy of individuals is maintained. This data is mainly used for internal purposes because it often contains identifiers, or because a single data set can be too small to reveal accurate correlations and trends. However, the process of linking multiple data sets allows the data to be de-identified. This makes it easier to share data for important decisions and policymaking while maintaining privacy, ethical data standards, and security protocols.

Procedure for data linking

Consider a researcher who wants to use linked data to make a valuable contribution to society. They believe that children born early might have some learning difficulties and that, if so, these children could be given support at an appropriate age to encourage better development. The researcher will need birth, education, and health data to test this theory properly. In this instance, linked data becomes a valuable research tool.

In such a situation, a data link research advisor comes into play. The advisor acts as a facilitator between the researcher and the custodians of the hospital birth data, childhood health data, and education data. The researcher explains why the data is being requested. This shows that the request serves a genuine purpose and assures the data custodians that their data will be safe.

After this, the researcher seeks ethics approval to assure participants and governing bodies that the research is meant for the benefit of the community and will be conducted in an ethical manner. The ethics committee permits the data linkage to proceed. The custodians of the data are also notified of this approval and give their final consent for the use of their data. The data link research advisor is then notified.

Now the process of linking the data begins. The advisor obtains the data needed for this particular project from the different custodians. The data is run through specialized linkage software, which links individual records across all the required data sets. Once the data is linked, each individual is given a unique code, called the ‘linkage key,’ and identifying details are removed. The researcher later uses the linkage key to connect each individual’s records across all these data sets.
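The de-identification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `deidentify` function, the `IDENTIFIERS` field names, and the sample records are hypothetical, and the random key stands in for whatever code a real linkage unit would assign.

```python
import secrets

# Assumed identifying fields; a real project would agree on these with the custodians.
IDENTIFIERS = {"name", "date_of_birth", "address"}

def deidentify(linked_people):
    """Replace identifiers with a linkage key.

    linked_people: list of dicts, each mapping a data set name to the
    record that the linkage step found for one individual.
    Returns a dict mapping each data set name to its de-identified records.
    """
    output = {}
    for person in linked_people:
        key = secrets.token_hex(8)  # random code, not derived from the identifiers
        for dataset, record in person.items():
            content = {k: v for k, v in record.items() if k not in IDENTIFIERS}
            output.setdefault(dataset, []).append({"linkage_key": key, **content})
    return output

# Hypothetical records for one child, already linked across two data sets.
people = [
    {
        "births": {"name": "Sam Lee", "date_of_birth": "2015-03-02", "gestation_weeks": 34},
        "school": {"name": "Sam Lee", "address": "12 High St", "reading_score": 71},
    }
]
research_data = deidentify(people)
```

The researcher receives only `research_data`: the same key appears on the birth record and the school record, so the two can still be connected, but no name, date of birth, or address is present.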

The linked data is given to the researcher without any names or identifiers. Using it, the researcher discovers that some babies born too early become prodigies, but most have some developmental problems that might interfere with early education. These findings are shared with early education planners, data custodians, and policymakers. Based on the linked data and the correlations found, pilot early childhood development programs are launched.

In this way, linked data research programs help government departments and policymakers work closely with one another for the benefit of the community while protecting the privacy of individuals.

Ways to link data sets

There are numerous ways to link data based on the information available to the organization.

1. Unique identifier

This is the most straightforward way to link data between different data sets. A unique identifier is present in each data set and establishes the links between them. It is also called deterministic or exact linking because the unique identifiers either match completely or do not match at all. With this method there is no uncertainty about a match, but many data sets do not carry a shared unique identifier.
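As a rough illustration of exact linking, the sketch below joins two small lists of records on a shared identifier. The records, the `person_id` field, and the `link_exact` helper are all made up for the example.

```python
# Hypothetical records; "person_id" stands in for a shared unique identifier.
births = [
    {"person_id": "A1", "gestation_weeks": 34},
    {"person_id": "B2", "gestation_weeks": 40},
]
school = [
    {"person_id": "A1", "reading_score": 71},
    {"person_id": "C3", "reading_score": 88},
]

def link_exact(left, right, key):
    """Join two lists of records wherever the key values match exactly."""
    # Assumes each key value appears at most once in `right`.
    index = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            linked.append({**row, **match})
    return linked

linked = link_exact(births, school, "person_id")
print(linked)
```

Only the record with identifier `A1` appears in both lists, so it is the only one linked. Records without a counterpart are simply left out, which is the all-or-nothing behavior that gives exact linking its name.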

2. Linkage key

When a unique identifier is not available, or is not of high enough quality to rely on, another approach called a linkage key is used. In this method, the linkage key works as a substitute for the unique identifier. The key is created from information, such as name and address, that is available in both data sets. Linkage keys help maintain the privacy of the person or entity because the key is used in place of the name and address.
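One possible way to build such a key is to normalize the name and address and pass them through a keyed hash. The sketch below assumes this approach; the `make_linkage_key` function, the secret value, and the 16-character length are illustrative choices, not a description of any particular linkage scheme.

```python
import hashlib
import hmac

def make_linkage_key(name, address, secret=b"example-secret"):
    """Derive a linkage key from name and address (illustrative only)."""
    # Normalize case and whitespace so trivial differences do not change the key.
    normalized = " ".join(name.lower().split()) + "|" + " ".join(address.lower().split())
    digest = hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an assumption, not a standard

key_a = make_linkage_key("Ana  Silva", "12 High St")
key_b = make_linkage_key("ana silva", "12 high st")
```

Both calls produce the same key because the inputs differ only in case and spacing, so the two data sets can be joined on the key without exchanging the name or address. The key is only as reliable as the fields behind it: a misspelled name yields a different key.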

3. Probabilistic linking

This is another style of data linking, and it is also used when a unique identifier is unavailable. It is based on the probability that a pair of records, one taken from each data set, refers to the same entity or person. In this method, specialized data linking software is used to obtain accurate results.
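A common formulation of probabilistic linking is the Fellegi-Sunter model, in which each compared field adds or subtracts a weight depending on whether it agrees. The sketch below follows that idea; the field names, the m- and u-probabilities, and the thresholds are invented for illustration and would normally be estimated from the data.

```python
import math

# Assumed probabilities per field (illustrative values, not estimates):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "municipality": (0.90, 0.10),
}

def match_weight(rec_a, rec_b):
    """Sum log2 agreement and disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

def classify(weight, upper=8.0, lower=0.0):
    """Turn a weight into a decision using two assumed thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

a = {"surname": "souza", "birth_date": "2015-03-02", "municipality": "Recife"}
b = {"surname": "sousa", "birth_date": "2015-03-02", "municipality": "Recife"}
print(classify(match_weight(a, b)))
```

Here the surnames differ by one letter, so the pair scores between the two thresholds and is flagged as a possible link for review rather than being accepted or rejected outright.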

4. Statistical linking

This technique combines records that belong to similar entities but not necessarily to the same person or organization. This kind of data linking may not give the most accurate results, but it does reveal a pattern or trend in the given information or statistics.
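One simple way to picture statistical linking is to pair each record with a record from another data set that shares the same broad characteristics. The sketch below assumes that approach; the `statistical_match` helper, the fields, and the sample data are hypothetical.

```python
def statistical_match(recipients, donors, match_on):
    """Pair each recipient with a donor that shares the same characteristics.

    The paired records describe similar people, not necessarily the same person.
    """
    pools = {}
    for donor in donors:
        pools.setdefault(tuple(donor[f] for f in match_on), []).append(donor)

    matched = []
    for rec in recipients:
        pool = pools.get(tuple(rec[f] for f in match_on))
        if pool:
            donor = pool[0]  # a real workflow might choose at random or by distance
            matched.append({**donor, **rec})
    return matched

# Hypothetical survey and health records with no shared identifier.
survey = [{"age_band": "30-39", "region": "North", "income": 52000}]
health = [
    {"age_band": "30-39", "region": "North", "smoker": False},
    {"age_band": "60-69", "region": "South", "smoker": True},
]
combined = statistical_match(survey, health, ["age_band", "region"])
```

The combined record is useful for studying how income and smoking relate across groups, but it says nothing reliable about any one individual.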

Scope of data linking

Policymakers and researchers who have received approval from data custodians and the ethics committee can undertake such projects for policy development, research, evaluation, and quality assurance. Data linking is used for various other purposes, too.

Life sciences

Data linkage is used extensively in life sciences and medical fields. Genomics, proteomics, and other such fields use linked data to find correlations between proteins, genes, and medical trials. For example, in Australia, a portal called the Atlas of Living Australia combines data about names, descriptions, and images of all life forms in Australia. It has created permanent identifiers for these species’ names and concepts so the user can combine data about the same species held in different institutions without matching errors. It also lets the user track the name changes of organisms to find relevant matches.


Government

There has been a strong push from a range of groups asking governments across the globe to open up the public data sets under their control. This allows for more transparency and cross-linkages among different data sets. These linkages can lead to changes in policies or to new regulations and schemes. Linked data also allows data to be harmonized and links to be created between disparate groups.


Healthcare

A wide variety of studies in healthcare have been carried out using linked data. Researchers, for example, can use linked data to find correlations between a mother’s age and lifestyle and her child’s development and growth. Similarly, linked data can be used to study the emergence of lifestyle diseases like diabetes.


Libraries

Libraries have been among the earliest adopters of linked data. For instance, the Swedish Union Catalog started linking data in 2008, and the German National Library did so in 2010. There is also a project to link the German linked data with the British National Bibliography. This kind of data linking gives more in-depth knowledge of certain books, authors, and philosophies.

Archives and universities

Similar to libraries, archives and universities have also been linking data about people, places, organizations, and published resources. For instance, the Australian Science Archives Project came up with the Online Heritage Resource Manager (OHRM), an early example of linking data on the internet.

Social media

More recently, social media channels such as Facebook have started using data linking, through the Open Graph protocol, to track relationships between people and the web content they use. When Open Graph tags are added to the metadata of a web page, Facebook can display an image and a description for that page within the Facebook site.
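To make the mechanism concrete, the sketch below reads Open Graph tags from a page using Python’s standard `html.parser` module. The `og:title`, `og:description`, and `og:image` properties are part of the Open Graph protocol; the sample page, the example URL, and the `OpenGraphParser` class are made up for illustration and do not reflect how Facebook itself processes pages.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and "content" in attr_map:
            self.properties[prop] = attr_map["content"]

# Hypothetical page head with three Open Graph tags.
PAGE = """<html><head>
<meta property="og:title" content="What is data linking?" />
<meta property="og:description" content="An overview of data linking." />
<meta property="og:image" content="https://example.com/preview.png" />
</head><body></body></html>"""

parser = OpenGraphParser()
parser.feed(PAGE)
print(parser.properties)
```

A site that shows link previews could use the collected title, description, and image to render a card for the page.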

Business uses

Linked data is also commonly used for a wide variety of business purposes. It is advantageous when composite data needs to be distributed across a number of stakeholders or systems. Logistics is one example, where there are many players in the chain and the scale is massive. Connecting a range of distributed data can show where an organization could change a shipping route, for instance, or streamline warehousing operations.

Benefits of data linking

Data linking is a valuable exercise for research of all kinds. It can reveal correlations and links between varying data sets, leading to findings that can be of great use to the researcher. Some of the significant benefits of data linking are:

Helps in research and policymaking

Linked data sets offer the opportunity to undertake research and help in the formulation of policies under varied fields such as education and healthcare.

Integral tool for business research

Data linking is useful on the business front, too. It can be used to find a correlation between different parameters. For instance, an organization can link taxation data with business data to give information about employment outcomes of tertiary education, the transition from work to retirement, or any number of other metrics.

Time saving

Data linking uses the available information and avoids wasting time on collecting a whole new set of data for the same research.

Challenges of data linking

Data linking is a powerful tool for researchers and businesses. However, linking data is not an easy task, and it has its share of challenges.

Lack of common entity identifiers

One of the major problems in linking data from disparate sources is the lack of common entity identifiers across different data sets. For instance, an organization may not find patient identifiers in all the data sets to be linked for healthcare research. In that case, data scientists might have to use quasi-identifiers (QIDs), such as name or date of birth, to identify and link information about the same entity.

Long delays in approvals

Data linking requires permission from a range of custodians of data sets and relevant ethics committees. This process can be a long-drawn-out affair that requires a considerable investment of the researcher’s time. It often leads to long delays that are not in alignment with the project schedule and funding timelines.

Inconsistent or incomplete data

Administrative data sets often contain inconsistent or incomplete data that differs in content, structure, and format, hampering the data linkage. For instance, in Brazil, the individual’s name is one of the leading variables used to link information, along with sex, date of birth, and municipality. However, although the name can be a highly discriminating variable, it is recorded in different ways in Brazil.

A person with five names may have recorded all five in one data set and just the first name and surname in another. Thus, standardization of variables across data sets is needed to reduce the variability between identifiers.
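A very simple form of name standardization is sketched below. It assumes one particular rule, keeping only the first and last name after removing case and accent differences; the `standardize_name` function and the sample names are hypothetical, and a real project might choose a different rule.

```python
import unicodedata

def standardize_name(full_name):
    """Reduce a name to 'first last' in lowercase, without accents."""
    # Split accented letters into base letter plus combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", full_name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = unaccented.lower().split()
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    return f"{tokens[0]} {tokens[-1]}"

long_form = standardize_name("Maria Aparecida da Conceição Souza")
short_form = standardize_name("MARIA SOUZA")
```

Both forms reduce to the same value, so the two records can be compared directly. The trade-off is that discarding middle names makes the name less discriminating, which is why it is usually combined with other variables such as date of birth.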

Financial barriers

Research involving data linking is expensive because it requires collecting information from various sources and using specialized software. This could prove to be a challenge for some researchers.

Data linking is an important challenge for the future

Data linking is the process of bringing together information about the same person or entity from different sources to create a more robust data set. It matters at both the micro and macro levels. At the micro level, linking information from disparate sources allows a proper sequence of events to be constructed. At the macro level, it makes valuable information available to policymakers in a range of fields, including healthcare and education.

Data linking can be an invaluable way to draw insights from different data sets, and it has applications in varied fields, from life sciences and healthcare to social media and academia. It is without question a powerful tool that can be harnessed to great benefit.
