Troves of Personal Data, Forbidden to Researchers

PALO ALTO, Calif. — When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.

At least that is how the system is supposed to work. But lately social scientists have come up against an exception that is, true to its name, huge.

It is “big data,” the vast sets of information gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.

The issue came to a boil last month at a scientific conference in Lyon, France, when three scientists from Google and the University of Cambridge declined to release data they had compiled for a paper on the popularity of YouTube videos in different countries.

The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience.

In February, Dr. Huberman had published a letter in the journal Nature warning that privately held data was threatening the very basis of scientific research. “If another set of data does not validate results obtained with private data,” he asked, “how do we know if it is because they are not universal or the authors made a mistake?”

He added that corporate control of data could give preferential access to an elite group of scientists at the largest corporations. “If this trend continues,” he wrote, “we’ll see a small group of scientists with access to private data repositories enjoy an unfair amount of attention in the community at the expense of equally talented researchers whose only flaw is the lack of right ‘connections’ to private data.”

Facebook and Microsoft declined to comment on the issue. Hal Varian, Google’s chief economist, said he sympathized with the idea of open data but added that the privacy issues were significant.

“This is one of the reasons the general pattern at Google is to try to release data to everyone or no one,” he said. “I have been working to get companies to release more data about their industries. The idea is that you can provide proprietary data aggregated in a way that poses no threats to privacy.”

The debate will only intensify as large companies with deep pockets do more research about their users. “In the Internet era,” said Andreas Weigend, a physicist and former chief scientist at Amazon, “research has moved out of the universities to the Googles, Amazons and Facebooks of the world.”

But while social and data scientists agree on the importance of replicating experimental results, there is less consensus on what should be done and how to deal with concerns about privacy.

At leading social science journals, there are few clear guidelines on data sharing. “The American Journal of Sociology does not at present have a formal position on proprietary data,” its editor, Andrew Abbott, a sociologist at the University of Chicago, wrote in an e-mail. “Nor does it at present have formal policies enforcing the sharing of data.”

The problem is not limited to the social sciences. A recent review found that 44 of 50 leading scientific journals instructed their authors on sharing data but that fewer than 30 percent of the papers they published fully adhered to the instructions. A 2008 review of sharing requirements for genetics data found that 40 of 70 journals surveyed had policies, and that 17 of those were “weak.”

The data-sharing policy of the journal Science says, “All data necessary to understand, assess and extend the conclusions of the manuscript must be available to any reader of Science.” But in the case of a 2010 article based on data from cellphone patterns, a legal agreement with the data provider prevented the researchers from even disclosing the country of origin.

Ginger Pinholster, a spokeswoman for the American Association for the Advancement of Science, which publishes the journal, acknowledged that on “rare occasions” Science does allow exceptions to its publication guidelines to protect privacy. “Information about movements in particular locations” could provide personal information, she said, “and the authors also had to promise privacy in order to get the information from the phone company.”

The journal did not note the policy exception when it published the article.

Similarly, an April 2011 article in the journal PLoS One stated that the research was “based on the records of 72.4 million calls and 17.1 million text messages accumulated over a one-month period,” but did not identify the provider of the information.

A founder of PLoS, Michael Eisen, a cell biologist at the University of California, Berkeley, who is a forceful advocate for “open science,” sounded rueful about that paper in an e-mail message. “It’s antithetical to the basic norms of science to make claims that cannot be validated because the necessary data are proprietary,” he wrote.

The issue was foreshadowed in a 2009 essay in Science whose authors included Albert-Laszlo Barabasi, a physicist at Northeastern University who was also an author of the controversial papers in Science and PLoS One.

“Perhaps the thorniest challenges exist on the data side, with respect to access and privacy,” they wrote. They warned that even anonymizing data sets could be imperfect, and they called for new models for collaboration between industry and academia to aid research and safeguard privacy.

Last year the National Science Foundation said that researchers who receive its funds would be “expected” to share data with other researchers.

Many scientists agree that this is as it should be.

“The obvious answer is that there needs to be more access to data,” said Alex Pentland, director of the Human Dynamics Laboratory at M.I.T. “That is beginning to happen as governments and industry realize that they need to better understand the promise and limits of big data; for instance, we will be announcing a huge, multicountry release of phone data soon.”

A version of this article appears in print in Section D, Page 1 of the New York edition with the headline: Troves of Personal Data, Forbidden to Researchers.
