The Low-Down: What Happens When You Put 500,000 People's DNA Online

It is important to note, first and foremost, that the data came from volunteers who freely offered their information.

But what has been revolutionary - dare we say disruptive - about this approach is that it broke down barriers and competitive data hoarding, spurring greater and faster collaborative research into the origins of dozens of diseases.

The lesson is that the returns to sharing information can be greater, thanks to the potential of scale, than the returns to squirreling it away. JL

Sarah Zhang reports in The Atlantic:

In the past, research groups that had DNA data sets hoarded them for themselves, so that they could be the first to mine them for publishable insights. U.K. Biobank’s data is open to anyone in the world, as long as they are a legitimate researcher and pay a fee. Research groups have built freely available tools to help other scientists make use of U.K. Biobank’s data. Never had genetics research moved so fast.

Every big, ambitious project has to start somewhere, and for U.K. Biobank, it was at an office building south of Manchester, where the project convinced its very first volunteer to pee into a cup and donate a tube of blood in 2006.

U.K. Biobank would go on to recruit 500,000 volunteers for a massive study on the origins of disease. In addition to collecting blood and urine, the study recorded volunteers’ height, weight, blood pressure; tested their cognitive function, bone density, hand-grip strength; scanned their brains, livers, hearts; analyzed their DNA. In breadth and depth, the study is the first of its kind.

Handling all the samples was a logistical challenge. To process thousands of tubes of blood, for example, U.K. Biobank’s lab needed a new robotics system. (This ultimately came from a company that builds machines for packing sausages, not unlike tubes of blood in shape.) Each tube of blood was split into its component parts—red blood cells, white blood cells, plasma—and run through a battery of tests. White blood cells contain DNA, which the project had analyzed, too. When all was said and done, U.K. Biobank had assembled one of the largest single genetic data sets ever. It all took a while.

This spring, 11 years after the first volunteer gave up a tube of blood, U.K. Biobank announced it would release its full genetic data set to registered scientists in July. This huge amount of genetic information, combined with the thousands of other characteristics tracked by U.K. Biobank, allows scientists to look for the genetic determinants of virtually any disease. Geneticists marked their calendars. “We heard stories that people who head groups had canceled holidays,” says Jonathan Marchini, a statistical geneticist at the University of Oxford. “Everyone has been waiting for this for so long.”

U.K. Biobank had done data releases before, including an earlier subset of the genetic data set with just over 100,000 people. In the past, research groups using the data wrote up their papers, submitted to journals, waited for peer review, and eventually their papers trickled out to the public. In the last year, however, an increasingly popular website called bioRxiv—pronounced “bio archive”—has changed the game. BioRxiv allows biologists to publish preprints, or preliminary drafts of their papers that have not yet been peer-reviewed.

Preprints based on the latest U.K. Biobank data started to come out almost immediately. Within two weeks, David Howard and Andrew McIntosh, psychiatry researchers at the University of Edinburgh, had posted not one but two preprints, one on genetic variants linked to depression and the other to neuroticism. Their team subsisted on pizza and worked “constantly.”

Others soon followed, and the flood of preprints has continued ever since. Never had genetics research moved so fast.

* * *

Ask scientists what’s so revolutionary about U.K. Biobank and they’ll say it’s big. But they’ll also say this: Nobody gets preferential access.

In the past, research groups that had gone through the trouble and expense of building DNA data sets hoarded them for themselves, so that they could be the first to mine them for publishable insights. U.K. Biobank, however, is supported by the United Kingdom’s National Health Service. Its data is open to anyone in the world, as long as they are a legitimate researcher and pay a fee commensurate with the amount of data they want to access—a couple thousand dollars for the full genetic data.

When it came to releasing the 500,000-person data set, making sure everyone got the huge file (12 terabytes uncompressed) at the same time was no trivial matter. U.K. Biobank decided to allow registered researchers to start downloading the data weeks before its official July release. The catch: It was encrypted. The decryption keys went out to all research groups simultaneously on the official release date. Nobody got a head start of a few days, or even a few hours. Even Marchini, who helped U.K. Biobank process some of the data, was not allowed to analyze it for his own research purposes until it was available to all.

“The vision for providing the data to any bona fide researcher without preferential access was really a game changer,” says Manny Rivas, a biostatistician at Stanford University. Rivas, who is an assistant professor, noted it is a real boon for junior faculty, who haven’t had years to amass their own data. The availability of a data set as rich and deep as U.K. Biobank democratizes genetics research.

On top of this shared data set, several research groups have now built freely available tools to help other scientists make use of U.K. Biobank’s data. Marchini’s group made a web browser dedicated to parsing genetic and brain data from U.K. Biobank. Albert Tenesa, from the University of Edinburgh, created GeneATLAS, which accounts for family members in the database, the presence of whom usually screws up the math used to find links between genetic variants and disease. Rivas made the Global Biobank Engine, which is essentially a search engine for genes potentially associated with any disease. The Global Biobank Engine, in turn, is partly based on calculations done by Ben Neale, a geneticist at the Broad Institute, who looked at nearly 2,500 traits and disorders and how they corresponded with genetic variants in the U.K. Biobank.

(Unlike U.K. Biobank’s full data, these tools are accessible to anyone with an internet connection, but they show only aggregate data, so study participants should not be individually identifiable.)

In the past, looking at how a single trait corresponded with a set of genetic variants could be a paper in itself. It’s called a genome-wide association study, or GWAS. Neale’s group did 2,500 GWASs in a single day—and he didn’t even bother to write a paper. It’s a blog post on his website. Neale says it didn’t quite feel like a discrete journal article. It’s more a starting point for scientists interested in specific genes or traits. He’s since heard from both pharmaceutical companies and academic researchers using his GWAS data.

Tenesa, who uploaded a preprint describing GeneATLAS on bioRxiv in August, says he has also heard from a couple dozen researchers using the tool. Some have asked him to run calculations for specific traits. This is happening as he’s still working to publish a paper about GeneATLAS in an official journal. It’s the way things are now. “When I get my email from Nature Genetics these days, and they tell you what papers have just been published, I’ve often seen the papers nine months earlier on bioRxiv,” says Marchini.

But is there such a thing as too fast? Jeffrey Barrett, a geneticist at the Sanger Institute, has cautioned against hastily posting preprints based on a quick GWAS. “I understand why,” he says. “It’s the quickest way to get out the stamp that you’ve done this analysis first.” But it’s easy to miss possible artifacts or mistakes in a data set as big and complex as this one. And now that it’s easy to identify genetic variants linked to a disorder, says Barrett, simply enumerating the variants doesn’t add much value. U.K. Biobank has made genetics research easier, but it is also raising the bar.

To publish, researchers increasingly will have to tease out how a genetic variant found via GWAS may be causing a disease—perhaps by tracking how it’s expressed in different parts of the body or sequencing the gene. U.K. Biobank did not fully sequence the DNA in its blood samples, which would have been far too expensive; it used a technique called genotyping that spot-checked 820,967 sites in the genome. In March, though, it signed a deal with the pharmaceutical companies Regeneron and GSK to sequence the DNA of everyone in the study. This deal gives the pharma companies exclusive access to the sequences for nine months—a change in access policy. Eventually, the data will be available to the wider research community.

U.K. Biobank is still following its 500,000 volunteers, and will continue to do so for many years as they age. The technology available to scientists will advance over time, too. When planning for the study was going on, says U.K. Biobank’s principal investigator Rory Collins, studying 820,967 genetic markers for 500,000 people seemed unlikely: “No one envisaged that being possible so soon.” A decade on, any scientist in the world can do it.