Where Will Healthcare's Data Scientists Find The Rich Phenotypic Data They Need?


The big hairy audacious goal of most every data scientist I know in healthcare is what you might call the Integrated Medical Record, or IMR, a dataset that combines detailed genetic data and rich phenotypic information, including both clinical and “real-world” (or, perhaps, “dynamic”) phenotypic data (the sort you might get from wearables).

In my last post, I noted that, as Craig Venter told Congress this summer, the combination of next-generation sequencing (to generate raw data) and cloud computing (to efficiently process the information) begins to address the genomic data component of an IMR (disclosure: I work at DNAnexus, a genomics data company).  The question is: where will the dense phenotypic data come from?

The gold standard for clinical phenotyping is the academic clinical study (like ALLHAT and the Dallas Heart Study).  These studies are typically focused on a disease category (e.g. cardiovascular), and the clinical phenotyping of these subjects -- at least around the areas of scientific interest -- is generally superb.  The studies themselves can be enormous, are often multi-institutional, and typically create a database that’s independent of the hospital’s medical record.

However, large, prospective studies can take many years to complete.  In addition, there’s generally not much real-world/dynamic measurement.

The other obvious source for phenotypic data is the electronic medical record (EMR).  The logic is simple: every patient has a medical record, and increasingly, especially in hospital systems, this is electronic – i.e. an EMR.  EMRs (examples include Epic and Cerner) generally contain lab values, test reports, provider notes, and medication and problem lists.  In theory, this should offer a broad, rich, and immediately available source of data for medical discovery.

I discussed some of the problems with EMRs in my last post – the recorded information is of variable quality, often incomplete, generally doesn’t include real-world phenotype (though this may be changing), and is typically extremely difficult to extract.  Interoperability remains a huge problem, meaning that even two hospitals that might be running the same brand of EMR can have trouble exchanging information.  Consequently, pooling EMR information at scale from multiple hospitals remains a significant problem, despite the obvious utility of being able to do this for cancer and other diseases.  (It’s also fair to say that many of the central challenges around data sharing by academic physicians are not attributable to problems with the EMR.)

DIY phenotyping (enabled by companies such as PatientsLikeMe) represents another approach, and allows patients to share data with other members of the community.  The obvious advantages here include the breadth and richness of data associated with what can be an unfiltered patient perspective – to say nothing of the benefit of patient empowerment.  An important limitation is that the quality and consistency of the data are highly dependent upon the individuals posting the information.

Pharma clinical trials would seem to represent another useful opportunity for phenotyping, given the focus on specific conditions and the rigorous attention to process and detail characteristic of pharmaceutical studies.  However, pharma studies tend to be extremely focused, and companies are typically reluctant to expand protocols to pursue exploratory endpoints if there’s any chance this will diminish recruitment or adversely impact the development of the drug.

Given these complexities, it’s not surprising that some researchers have decided the best path forward is to start from scratch.  This is essentially the premise of Google’s Baseline Study, which aims to track a cohort of healthy patients, obtaining rich clinical phenotypic data at the outset (like a traditional academic clinical study), but then supplementing this with dynamic information from a range of wearable devices.  (Disclosure: Google Ventures is an investor in DNAnexus.)

Like prospective academic studies, Baseline will presumably take a while to yield results.  In addition, the study will ultimately need to enroll a very large number of volunteers -- far more than the 175 or so it’s starting out with.  Finally, while the reliance on objective measurement is understandable, it’s not clear to me whether there’s a component of traditional clinical assessment as well.

Over time, I suspect, the distinction between some of these approaches may disappear as wearable data increasingly become part of the EMR, and are obtained more regularly in prospective academic studies, and eventually, in pharma studies, like this recently announced Boehringer Ingelheim/Propeller Health pilot.  There also seems to be an increased focus on patient experience, as I discussed here.

At the same time, I’m especially sympathetic to the idea of creating a more perfect data union from the ground up.  The real question to ask: is the right place to start a novel clinical study, like Baseline -- or a novel clinical system?

Addendum:

Readers might enjoy two related posts:

"Next Hurdle For Medical Research: Capture and Integration Of Phenotype At Scale"

"Creating The Data-Inhaling Health Clinic Of The Future"