Preserving Privacy in Medical Data Sets

Background

Privacy is a fundamental right and needs to be protected. Disclosure of health-care-related information is governed by regulations, which were motivated by public concern that breaches of confidentiality might result in discrimination. Recent progress in electronic medical record technology, the Internet, and the genetic revolution, together with media reports on violations of privacy, has generated increasing interest in this topic. A common belief is that networked computers make sensitive information more easily available. Since a total lack of disclosure is not realistic, current regulations require that only the "minimal amount" of information be given to a given party. In the US, the HIPAA privacy rule requires that reasonable safeguards against unwanted disclosure be taken before dissemination of patient data. What constitutes "reasonable safeguards" has not been quantified, however. Hence, most de-identification strategies used in practice today rely on simple suppression of identifiers such as name, address, and social security number. Several studies, by our group and others, have shown that these simple strategies are insufficient. There might also be data that cannot be disclosed without compromising privacy. Ultimately, a quantitative analysis must guide the determination of whether safeguards are reasonable.


We will formally define and study anonymity in databases, from both a theoretical and a practical standpoint. We will develop and implement algorithms that anonymize data sets while balancing anonymity against the "usefulness" of the disclosed data. We will also develop and implement algorithms that verify the anonymity of a given data set and indicate which types of records are at highest risk of a privacy attack. We will make our methods and documented tools freely available to researchers via the WWW.
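To make the idea of verifying anonymity concrete, here is a minimal sketch in Python. It assumes one common formalization, k-anonymity, in which every record must share its combination of quasi-identifying attributes with at least k-1 other records. The project text above does not state which definition of anonymity it adopts, so this choice, the function names, the attribute names, and the sample records are all illustrative assumptions rather than the project's actual method.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of indistinguishable records.

    Under the assumed k-anonymity definition, the data set is k-anonymous
    for every k up to and including this value.
    """
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def records_at_risk(records, quasi_identifiers, k):
    """Return the records whose group is smaller than k (highest risk)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Hypothetical sample data: names are already suppressed, but ZIP code,
# birth year, and sex remain and can act as quasi-identifiers.
records = [
    {"zip": "02115", "birth_year": 1970, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02115", "birth_year": 1970, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "diagnosis": "influenza"},
]
quasi_identifiers = ["zip", "birth_year", "sex"]

level = anonymity_level(records, quasi_identifiers)
risky = records_at_risk(records, quasi_identifiers, k=2)
print("anonymity level:", level)
print("records at risk for k=2:", risky)
```

In this sample the third record is the only one with its combination of ZIP code, birth year, and sex, so it stays unique even though no direct identifier is present. This is the kind of weakness in simple identifier suppression that the background section describes.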

In addition to DBMI members (Staal Vinterbo, Lucila Ohno-Machado, MD, PhD), the following collaborators are involved:

  • Tom Lasko, Google
  • Stephan Dreiseitl

This project is funded by the NIH.