
Research Data Management Strategy: Secure and Organize

Data Protection Terminology

Some kinds of data are sensitive and cannot be shared for legal or ethical reasons. This can include:

  • Personal identifiers
  • Sensitive ecological data
  • Sacred or protected cultural practices

De-identification means removing identifying data from a dataset. Once a dataset has been de-identified, it can be shared without disclosing identifying information.

Removing identifiers is important to protect the confidentiality of research participants. However, some risk of re-identification always remains, and changing technology introduces new ways to re-identify data. Managing that risk is an important part of sharing research data.

There are several ways of approaching de-identification: 

Anonymization

Anonymization refers to processing personal data so that individuals can no longer be identified from it. For example, the data can be aggregated to a general level or converted into statistics so that individuals can no longer be identified. The prevention of identification must be permanent: neither the controller nor a third party should be able to convert the data back into identifiable form using the information they hold.

Example: Anonymization | Research Data Management (ubc.ca)
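
As a rough illustration of the aggregation approach described above, the following Python sketch drops direct identifiers, groups ages into bands, and reports only counts per group. The field names, the sample records, and the small-group threshold (min_count) are assumptions made for this example and do not come from any standard. Aggregation on its own does not guarantee that data is anonymized; the remaining re-identification risk still has to be assessed.

```python
from collections import Counter

# Hypothetical participant records; field names and values are illustrative only.
records = [
    {"name": "Participant One", "age": 23, "city": "Vancouver"},
    {"name": "Participant Two", "age": 27, "city": "Vancouver"},
    {"name": "Participant Three", "age": 21, "city": "Vancouver"},
    {"name": "Participant Four", "age": 29, "city": "Vancouver"},
    {"name": "Participant Five", "age": 45, "city": "Burnaby"},
]


def age_band(age, width=10):
    """Replace an exact age with a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def aggregate(records, min_count=3):
    """Count records per (age band, city) and drop small groups.

    Names are never copied into the output. Groups with fewer than
    min_count records are suppressed because small groups are easier
    to re-identify. The threshold of 3 is an assumption for this sketch.
    """
    counts = Counter((age_band(r["age"]), r["city"]) for r in records)
    return {group: n for group, n in counts.items() if n >= min_count}


summary = aggregate(records)
print(summary)
```

With the sample records above, only the group of four Vancouver participants aged 20-29 is reported; the single Burnaby record is suppressed.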

Pseudonymization

Pseudonymization means processing personal data so that it can no longer be attributed to a specific person without the use of additional information. This additional information must be kept carefully separate from the personal data. Pseudonymized data can still be used to single individuals out and to combine their data from different records.

Example: Pseudonymization | Research Data Management (ubc.ca)
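
A minimal Python sketch of the idea above: each name is replaced with a random code, and the table linking codes to names is returned separately so that it can be stored apart from the shared data. The function name, field names, and code format ("P-" plus random characters) are hypothetical choices for this example.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random participant code.

    Returns two objects:
      - the pseudonymized records (safe to keep with the research data)
      - the key table mapping real identifiers to codes, which must be
        stored separately and securely
    """
    key_table = {}
    used_codes = set()
    output = []
    for record in records:
        identifier = record[id_field]
        if identifier not in key_table:
            # Random codes avoid leaking information about the name itself.
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            key_table[identifier] = code
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant_code"] = key_table[identifier]
        output.append(cleaned)
    return output, key_table


# Hypothetical interview records; the same person appears twice.
interviews = [
    {"name": "Participant One", "session": 1, "score": 12},
    {"name": "Participant Two", "session": 1, "score": 15},
    {"name": "Participant One", "session": 2, "score": 14},
]

shared_data, key_table = pseudonymize(interviews)
```

Because the same person always receives the same code, their records can still be combined across sessions, which is why pseudonymized data is still treated as personal data.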

Types of Information

TCPS2 (2022) provides the following categories as guidance for assessing the extent to which information could be used to identify an individual:

  • Directly identifying information – the information identifies a specific individual through direct identifiers (e.g., name, social insurance number, personal health number).
     
  • Indirectly identifying information – the information can reasonably be expected to identify an individual through a combination of indirect identifiers (e.g., date of birth, place of residence or unique personal characteristic).
     
  • Coded information – direct identifiers are removed from the information and replaced with a code. Depending on access to the code, it may be possible to re-identify specific participants (e.g., the principal investigator retains a list that links the participants' code names with their actual names so data can be re-linked if necessary).
     
  • Anonymized information – the information is irrevocably stripped of direct identifiers, a code is not kept to allow future re-linkage, and risk of re-identification of individuals from remaining indirect identifiers is low or very low.
     
  • Anonymous information – the information never had identifiers associated with it (e.g., anonymous surveys) and risk of identification of individuals is low or very low.
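
To illustrate how a combination of indirect identifiers can single someone out, the Python sketch below counts how many records share each combination of values and lists the combinations that occur fewer than k times. This is one common way of exploring re-identification risk, not a method prescribed by TCPS2; the field names, the sample data, and the default k are assumptions for the example.

```python
from collections import Counter


def rare_combinations(records, indirect_identifiers, k=2):
    """Return combinations of indirect identifiers shared by fewer than k records.

    A combination that appears only once points to a single individual,
    even though no direct identifier is present.
    """
    counts = Counter(
        tuple(r[field] for field in indirect_identifiers) for r in records
    )
    return sorted(combo for combo, n in counts.items() if n < k)


# Hypothetical de-identified records with no names attached.
survey = [
    {"birth_year": 1990, "postal_prefix": "V5K", "answer": "yes"},
    {"birth_year": 1990, "postal_prefix": "V5K", "answer": "no"},
    {"birth_year": 1985, "postal_prefix": "V6B", "answer": "yes"},
]

at_risk = rare_combinations(survey, ["birth_year", "postal_prefix"])
print(at_risk)
```

Here the 1985 / V6B combination appears only once, so that record could reasonably be expected to identify an individual and may need further generalization or suppression before sharing.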

Infographic: Data De-Identification

Content by Vancouver Community College Library is licensed under a
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License