Most Internet services, and especially social media services, routinely conduct experiments on users’ experiences, even though few of us are aware of it and consent procedures are murky. In a recent New York Times op-ed, Michelle Meyer and Christopher Chabris argue that we should enthusiastically embrace the model of experimentation on users called “A/B testing.” This type of data-intensive experimentation is the bread and butter of the Internet economy and is now at the heart of a sprawling ethical dispute over whether experimenting on Internet users’ data amounts to human experimentation on legal, ethical, or regulatory grounds. In their op-ed, Meyer and Chabris argue that A/B testing is on the whole ethical because without it Internet services would have no idea what works, let alone what works best. They suggest that whatever outrage users might feel about such experiments is due to a “moral illusion”: we are prone to assuming that the status quo is natural and that any experimental changes need to be justified and regulated, but the reality of Internet services is that there is no non-experimental state.
While they’re right that this type of experimentation is a poor fit for the ways we currently regulate research ethics, they fall short of explaining that data scientists need to earn the social trust that is the foundation of ethical research in any field. Ultimately, the foundations of ethical research are about trusting social relationships, not our assumptions about how experiments are constituted. This is a critical moment for data-driven enterprises to get creative and thoughtful about building such trust.
A/B testing is the process of randomly dividing users into two groups and comparing their responses to different user experiences in order to determine which experience is “better.” Whichever website design or feed algorithm best achieves the preferred outcome—such as increased sales, regular feed refreshes, or more accurate media recommendations—becomes the default user experience. A/B testing appears innocuous enough when a company is looking for hard data about which tweaks to a website design drive sales, such as the color of the “buy” button. Few would argue that testing the color of a buy button or the placement of an ad requires the informed consent of every visitor to a website. However, when a company possesses (or is accessing via data mining) a vast store of data on your life, your political preferences, your daily activities, your calendar, your personal networks, and your location, A/B testing takes on a different flavor.
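To make the mechanics concrete, here is a minimal sketch of what an A/B test amounts to in code. The deterministic bucketing by hashed user ID, the 50/50 split, the “buy-button-color” experiment name, and the two-proportion significance check are my own illustrative assumptions and hypothetical helper names, not any particular company’s implementation.

```python
import hashlib
import random
from math import sqrt

def assign_variant(user_id: str, experiment: str = "buy-button-color") -> str:
    """Bucket a user into 'A' (control) or 'B' (treatment) by hashing the
    user id with the experiment name, so a returning user always sees the
    same variant. (Illustrative scheme, not any company's actual one.)"""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def conversion_rate(conversions: int, visitors: int) -> float:
    return conversions / visitors if visitors else 0.0

def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-proportion z-test: is B's conversion rate different enough from
    A's that the gap is unlikely to be chance?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

if __name__ == "__main__":
    random.seed(0)
    # Simulate 10,000 visitors; pretend variant B converts slightly better.
    counts = {"A": [0, 0], "B": [0, 0]}   # [conversions, visitors]
    for i in range(10_000):
        variant = assign_variant(f"user-{i}")
        counts[variant][1] += 1
        base_rate = 0.030 if variant == "A" else 0.034
        if random.random() < base_rate:
            counts[variant][0] += 1
    (c_a, n_a), (c_b, n_b) = counts["A"], counts["B"]
    print(f"A: {conversion_rate(c_a, n_a):.3%} of {n_a} visitors")
    print(f"B: {conversion_rate(c_b, n_b):.3%} of {n_b} visitors")
    print(f"z = {z_score(c_a, n_a, c_b, n_b):.2f}  (|z| > 1.96 is roughly p < 0.05)")
```

Hashing the user ID rather than flipping a coin on every visit is one common way to keep a returning user in the same bucket for the life of the experiment; the statistics are the easy part, which is exactly why the ethics question does not go away once the test is well designed.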
The most notorious case of A/B testing ethics is the Facebook emotional contagion study, published in the Proceedings of the National Academy of Sciences in 2014. In this paper, Facebook data scientist Adam Kramer and Cornell information scientists Jamie Guillory and Jeff Hancock present the results of experimentally modifying the Facebook feed algorithm for 689,003 people. They demonstrated that the negative or positive emotional valence of the posts that show up in users’ News Feeds alters the emotional valence of the posts those users themselves make. This supported the hypothesis that emotional contagion—the spread of emotional states through social contact—occurs on a massive scale in social networks. It also launched an enormous controversy about big data research ethics, because the users whose feeds were altered did not consent to having their emotional states experimented upon. Furthermore, the research was not approved through channels such as an Institutional Review Board (IRB), as is the norm for most university researchers. (Although Facebook’s terms of service now require users to consent to such experiments, they did not at the time of the experiment.) Similar criticisms have been raised about Facebook’s experiment with voting behaviors and OKCupid’s experiments with users’ dating preferences.
There have been many excellent, varied analyses of the ethical dynamics of the emotional contagion study. But I want to focus on one area that hasn’t been explored much thus far, and that is raised indirectly by Meyer and Chabris’s op-ed and in Meyer’s article in the Colorado Technology Law Journal: many of the seemingly timeless ethical concepts and regulatory regimes that they claim cannot be applied to A/B testing are not arbitrary rules imposed by external agencies, but are ultimately rooted in hard-won trust between researchers and subjects. Internet service companies may not be a good fit for the specific rules, but they have much to learn about how that trust historically came to be and much to lose if they don’t pursue a similar path.
Meyer suggests that critics of corporate experimentation suffer from what she terms the “A/B illusion.” The A/B illusion arises when we assume that the “A” state is the natural state and the “B” state exists because experimenters are manipulating us, when in fact the “A” state is just as artificial as the “B” state. The illusion grows out of the assumption that every experiment has a “natural” control, but when it comes to the algorithms that generate your social media feeds or recommend new music, there is no “natural” state to use as a control. Meyer and Chabris suggest that we should welcome A/B testing precisely because companies generally lack any data about whether A is any good in itself. In other words, Facebook “experiments” on users’ emotional states every time it routinely tweaks the News Feed algorithm, even if it does not name it as such. Similarly, as OkCupid’s founder Christian Rudder points out, a dating service has no data about whether A works at all until it tests some B’s, C’s, D’s, etc.
Since A is just as artificial as B, Meyer and Chabris contend, we make an ethical and scientific mistake if we try to protect users from B because it is “experimental” but don’t require the company to prove that A works and is good for users. Insofar as A/B testing is necessary for any functioning Internet service, and actually obtaining informed consent from users would be a huge burden, they suggest that Internet service users should welcome experimentation without burdensome regulation.
As Meyer explores extensively in her journal article, though not in the op-ed, a critical philosophical distinction between practice and research is at play here, and it is a distinction undergirded by assumptions that apply largely to medical research. The defining ethical conflict for medical research is the line between the everyday treatment of patients and medical treatment that is also research. Both are medical interventions, but the former is by definition for the good of the patient, while the latter is statistically likely not to result in better outcomes for the patient and even risks harm. If medical research is to determine which interventions are best, it is epistemically required that some people not receive what is ultimately the best treatment. The Belmont Report, the foundational document for the HHS’s human subjects protection rules, published in 1979, draws a clear distinction between medical “practice” and “research.” When physicians cross into a research role, they are obligated to signal their changing ethical commitments with practices like review boards and informed consent. That is an essential component of medical ethics, but critics like Meyer are right to point out a problem: our statutory research ethics regulations are largely designed to manage the dual social roles of physician-as-caregiver and physician-as-researcher, even though other research professions involve very different social roles and ethical obligations. There simply is no analogous bright line between practice and research in data science, and therefore the grounds for straightforwardly exporting research regulations from biomedicine to data science are highly suspect. Meyer instead advocates developing a model of “responsible innovation” that would obligate researchers to transparently share and make use of knowledge generated through field testing.
For the most part, I agree with Meyer that there is a serious mismatch between big data and our familiar methods for regulating research ethics. We cannot simply export the Common Rule to A/B testing as if algorithm jockeys were physicians—these are genuinely different epistemologies and social roles. However, I think she misses a critically important point: it is hard-won trust that facilitates the move between practice and research for physicians. Not to be overdramatic, but in some cases that social trust was paid for in blood, such as the Nuremberg trials that resulted in the hanging of war criminals and established the Nuremberg Code as the first worldwide code for regulating medical research. Following the medical and psychological research scandals of the mid-20th century in the U.S. (Tuskegee, Milgram, the Stanford Prison Experiment, etc.) and formal responses like the Belmont Report, sensitive scientific research on humans became both more regulated and more efficacious. Medical practice evolved significantly to adapt to these regulations, and the culture of medical care is much less paternalistic than it once was. The medical profession expends a lot of time and energy ensuring that ethical self-regulation works most of the time—and that functioning backstops exist for when it doesn’t—because doing so ultimately results in better knowledge. Although it can feel like regulations are imposed by outside bodies such as HHS, these are ultimately policies that the medical community has developed and assents to in order to maintain the public’s trust.
So even if a firm line between practice and research is a bad fit for big data research, it doesn’t mean that there is nothing to be learned from how such a line is created and maintained. We have these formal structures and legally enforceable assurances because they sustain the public’s trust that research which risks their well-being will be conducted for the good of society and that all reasonable measures will be taken to mitigate potential harm. Even if those specific regulations do not work for A/B testing, it does not follow that fostering and maintaining such trust is not an essential component of knowledge production in the era of big data.
Even though I would only hazard guesses about what formal regulations should look like, or whether they are even desirable, the history of research ethics demonstrates that research communities need to put in the time and energy to earn and maintain public trust. We don’t yet know what that looks like for big data, but we can anticipate that asking the public to simply celebrate corporate experimentation won’t be enough. There are actionable items that companies can take today to begin building that trust, such as providing greater transparency about their algorithms, establishing robust internal or external ethics committees, and rigorously checking their business decisions against their values. As my colleague at the Data & Society Research Institute’s Big Data Council, Kate Crawford, has argued about the Facebook study, Silicon Valley has done a poor job of publicly accounting for its own power and what it owes people. The sustainability of the Internet economy will be partly determined by which companies figure out that formula for trust—and not just by those who figure out the best algorithms for serving feed refreshes.