Through my position with the Council for Big Data, Ethics, and Society, I recently led the drafting of a collective public comment on the proposed revisions to the Common Rule, the federal regulation that requires federally funded research projects to receive independent, prior ethics review. The proposed revisions—the first in three decades—are largely a response to the rise of big data analytics in scientific research. Although the changes to biomedical research have received the most public attention, there are important lessons here for any company utilizing data analytics.
Academic research on human subjects is governed by a set of ethical guidelines referred to as the “Common Rule.” These guidelines apply to all human-subjects research that receives government funding, and most universities and research-granting foundations apply them to all research they oversee. The best known stipulation of the Common Rule is the requirement that research projects receive independent prior review to mitigate harms to research subjects. Private companies are not bound by the Common Rule insofar as they do not receive government funding, but the Common Rule sets the tone and agenda of research ethics in general—it has an outsized footprint well beyond its formal purview. Thus even private industry has good reason to pay attention to the norms animating the Common Rule, even if it is not obligated to follow these regulations.
The most notable problem posed by the revisions in the NPRM is the move to exclude from oversight all research that utilizes public datasets. Research ethics norms and regulations have long assumed that public datasets cannot pose additional informational harms—by definition, the harm was already caused when the data they contain became public. However, big data analytics render that assumption anachronistic. We used to be able to assume that data would stay put within its original context of collection. The power and peril of big data is that datasets are now architected to be (at least in theory) infinitely repurposable, perpetually updated, and indefinitely available. A public, open dataset that appears entirely innocuous in one context can be munged with another public dataset and pose genuine harms to the subjects of that research. See, for example, the case of the NYC taxi database, and the many private details about drivers and riders that were revealed from a public dataset.
The risks posed by this type of research are amplified by the manner in which the NPRM defines “public” and “private,” which is slightly off from our colloquial understanding of these terms as antonyms. “Public” modifies “datasets,” describing access or availability. “Private” modifies “information” or “data,” describing a reasonable subject’s expectations about sensitivity. Indeed, many of the datasets most interesting to researchers and dangerous to subjects are publicly available datasets containing private data. And many of those datasets are held by private enterprise—the NPRM explicitly defines “public” to include any dataset that can be purchased. The NPRM appears to assume that what matters about a dataset is its status, not what will be done with it or what it contains. However, the public status of a dataset is no longer a reasonable proxy for “inherently low risk.” It is troublesome that the HHS is declaring that research using public datasets should be automatically excluded from oversight because it poses no risk.
This space between public and private is highly relevant to the ethics needs of private enterprise, even though these specific regulations only directly apply to academic research. It is a sound prediction that the role of big-data analytics in business decision-making and product development will continue to grow rapidly for the foreseeable future. But the question of how that growth will be managed is under-appreciated. Data ethics scholars Jules Polonetsky, Omar Tene and Joseph Jerome have argued that as big-data ethics becomes a more visible issue within our society, organizations will “no longer view privacy strictly as a compliance matter to be addressed by legal departments or a technical issue handled by IT.” Instead, ethics review practices will become the standard, and resting on outdated assumptions about the risks posed by data will no longer be acceptable. Companies need to be able to show users, regulators and the wider public that their analytic practices do not infringe on people’s legal and moral rights.
It is now clear that, even if anonymized, data that contains private information can be used to draw conclusions about the activities of specific individuals. If your organization collects, analyzes, or disseminates public data, you need to carefully consider how this data might be repurposed in unintended ways. Insofar as your organization might be implicated in the consequences of data that it makes available, its members may be held accountable. To protect yourself and your organization from this type of risk, you will need to employ processes for determining and managing the risks your data practices expose you to. In particular, your processes should ensure that:
- decontextualized or anonymized data is analyzed to understand what conclusions might still be drawn about individuals and groups;
- your data collection practices are narrow enough to avoid gathering information you do not need and that might expose you to risk;
- your data sharing practices limit the use of that data in ways consistent with your vision for its use.
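The first point above can be made concrete with a toy sketch of a linkage attack, in the spirit of the NYC taxi case: an “anonymized” record set is joined with an independently public dataset on shared quasi-identifiers, re-attaching names to supposedly de-identified records. All field names and data here are hypothetical, invented purely for illustration.

```python
# Hypothetical "anonymized" trip records: names removed, but quasi-identifiers
# (pickup ZIP code and timestamp) remain.
anonymized_trips = [
    {"pickup_zip": "10001", "pickup_time": "2015-03-07T23:14", "fare": 52.50},
    {"pickup_zip": "11211", "pickup_time": "2015-03-08T01:02", "fare": 8.00},
]

# A second, independently public dataset (e.g. geotagged social media posts)
# that happens to share those same quasi-identifiers.
public_sightings = [
    {"name": "Alice", "zip": "10001", "time": "2015-03-07T23:14"},
]

def reidentify(trips, sightings):
    """Join two 'public' datasets on quasi-identifiers (ZIP + timestamp),
    attaching a name to each matching anonymized record."""
    matches = []
    for trip in trips:
        for sighting in sightings:
            if (trip["pickup_zip"] == sighting["zip"]
                    and trip["pickup_time"] == sighting["time"]):
                matches.append({"name": sighting["name"], "fare": trip["fare"]})
    return matches

# One shared (zip, time) pair is enough to link Alice to a specific fare,
# even though neither dataset looks sensitive on its own.
print(reidentify(anonymized_trips, public_sightings))
```

The point of the sketch is that neither dataset, taken alone, appears to pose a privacy risk; the harm emerges only from the join, which is exactly the scenario the NPRM's status-based definition of “public” fails to anticipate.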