Researchers at Columbia University, Princeton University and Harvard University have developed a new approach for analyzing big data that can drastically improve the ability to make accurate predictions about complex diseases, social science phenomena, and other issues in medicine and beyond.
In a study published in the December 13 issue of Proceedings of the National Academy of Sciences (PNAS), the authors introduce the Influence score, or “I-score,” a statistic that tracks how much a set of variables can inherently predict (its “predictivity”) and can consequently be used to identify highly predictive variables.
“In our last paper, we showed that significant variables may not necessarily be predictive, and that good predictors may not appear statistically significant,” said principal investigator Shaw-Hwa Lo, a professor of statistics at Columbia University. “This left us with an important question: how can we find highly predictive variables then, if not through a guideline of statistical significance? In this article, we provide a theoretical framework from which to design good measures of prediction in general. Importantly, we introduce a variable set’s predictivity as a new parameter of interest to estimate, and provide the I-score as a candidate statistic to estimate variable set predictivity.”
Current approaches to prediction generally take one of two routes: using a significance-based criterion to select the variables that go into a model, or evaluating variables and models together for predictive performance on cross-validation or independent test data.
“Using the I-score prediction framework allows us to define a novel measure of predictivity based on observed data, which in turn enables assessing variable sets for high predictivity,” Lo said, adding that, while the idea is intuitively obvious, predictivity has received too little attention as a parameter of interest to estimate in its own right. Motivated by the needs of current genome-wide association studies (GWAS), the study authors provide such a discussion.
In the paper, the authors define the predictivity of a variable set and show that naively estimating predictivity from the sample does not, by itself, give the prediction-oriented researcher usable information. They go on to demonstrate that the I-score can be used to compute a measure that asymptotically approaches predictivity. The I-score can effectively differentiate between noisy and predictive variables, Lo explained, making it helpful in variable selection. A further benefit is that, while the usual approaches require heavy use of cross-validation or test data to evaluate predictors, the I-score approach relies far less on such data.
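For readers who want to experiment, below is a minimal sketch of a partition-based influence statistic in the spirit of the I-score. It assumes the normalized form often quoted for the statistic, the sum over partition cells of n_j²(ȳ_j − ȳ)² scaled by n times the sample variance of the response; the paper’s exact definition and normalization may differ. The toy XOR example echoes the authors’ earlier point that jointly predictive variables can look insignificant one at a time.

```python
import numpy as np

def i_score(X, y):
    """Partition-based influence statistic (assumed normalized form).

    X: (n, k) array of discrete explanatory variables; each unique row
    of X defines one cell of the partition. y: (n,) response vector.
    """
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    n, ybar, s2 = len(y), y.mean(), y.var()
    _, cells = np.unique(X, axis=0, return_inverse=True)
    score = sum(
        (cells == j).sum() ** 2 * (y[cells == j].mean() - ybar) ** 2
        for j in np.unique(cells)
    )
    return score / (n * s2)

rng = np.random.default_rng(0)
# Two variables whose XOR determines y: each is marginally useless,
# but the pair is perfectly predictive.
x_signal = rng.integers(0, 2, size=(500, 2))
y = (x_signal[:, 0] ^ x_signal[:, 1]).astype(float)
x_noise = rng.integers(0, 2, size=(500, 2))  # pure noise for comparison

print(i_score(x_signal, y))         # large: the pair is predictive
print(i_score(x_noise, y))          # near zero: noise
print(i_score(x_signal[:, :1], y))  # near zero: one variable alone
```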
“We offer simulations and an application of the I-score to real data to demonstrate the statistic’s predictive performance on sample data,” he said. “These show that the I-score captures highly predictive variable sets, estimates a lower bound for the theoretical correct prediction rate, and correlates well with the out-of-sample correct rate. We suggest that the I-score method can aid in finding variable sets with promising prediction rates; however, further research on sample-based measures of predictivity is needed.”
The authors conclude that the I-score would be useful in many applications: for example, formulating predictions about diseases from high-dimensional data such as gene datasets, and, in the social sciences, forecasting phenomena ranging from text and financial markets to terrorism, civil war, and elections.
“We’re hoping to impress upon the scientific community the notion that for those of us who might be interested in predicting an outcome of interest, possibly with rather complex or high dimensional data, we might gain by reconsidering the question as one of how to search for highly predictive variables (or variable sets) and using statistics that measure predictivity to help us identify those variables to then predict well,” Lo said. “For statisticians in particular, we’re hoping this opens up a new field of work that would focus on designing new statistics that measure predictivity.”
Report calls for more integration of physical, life sciences for needed advances in biomedical research.
What if lost limbs could be regrown? Cancers detected early with blood or urine tests, instead of invasive biopsies? Drugs delivered via nanoparticles to specific tissues or even cells, minimizing unwanted side effects? While such breakthroughs may sound futuristic, scientists are already exploring these and other promising techniques.
But the realization of these transformative advances is not guaranteed. The key to bringing them to fruition, a landmark new report argues, will be strategic and sustained support for “convergence”: the merging of approaches and insights from historically distinct disciplines such as engineering, physics, computer science, chemistry, mathematics, and the life sciences.
The report, “Convergence: The Future of Health,” was produced by a committee co-chaired by Tyler Jacks, the David H. Koch Professor of Biology and director of MIT’s Koch Institute for Integrative Cancer Research; Susan Hockfield, noted neuroscientist and president emerita of MIT; and Phillip Sharp, Institute Professor at MIT and Nobel laureate. It will be presented at the National Academies of Sciences, Engineering, and Medicine in Washington on June 24.
The report draws on insights from several dozen expert participants at two workshops, as well as input from scientists and researchers across academia, industry, and government. Their efforts have produced a wide range of recommendations for advancing convergence research, but the report emphasizes one critical barrier above all: the shortage of federal funding for convergence fields.
“Convergence science has advanced across many fronts, from nanotechnology to regenerative tissue,” says Sharp. “Although the promise has been recognized, the funding allocated for convergence research in biomedical science is small and needs to be expanded. In fact, there is no federal agency with the responsibility to fund convergence in biomedical research.”
Early influenza detection and the ability to predict outbreaks are critical to public health. Reliable estimates of when influenza will peak can help drive proper timing of flu shots and prevent health systems from being blindsided by unexpected surges, as happened in the 2012-2013 flu season.
The Centers for Disease Control and Prevention collects accurate data, but with a time lag of one to two weeks. Google Flu Trends began offering real-time data in 2008, based on people’s Internet searches for flu-related terms. But it ultimately failed, at least in part because not everyone who searches “flu” is actually sick. As of last year, Google instead sends its search data to scientists at the CDC, Columbia University and Boston Children’s Hospital.
Now, a Boston Children’s-led team demonstrates a more accurate way to pick up flu trends in near-real-time — at least a week ahead of the CDC — by harnessing data from electronic health records (EHRs).
As Mauricio Santillana, PhD, John Brownstein, PhD, and colleagues describe in Scientific Reports, the team combined EHR data, historical patterns of flu activity and a machine-learning algorithm to interpret the data. This clinical “big data” approach produced predictions of national and local influenza activity that closely matched the CDC’s subsequent reporting.
“Our study shows the true value of considering multiple data streams in disease surveillance,” says Brownstein, the study’s senior investigator and Chief Innovation Officer at Boston Children’s Hospital. “While Google data provide incredible real-time, population-wide information, clinical data add a more accurate and precise assessment of disease state.”
Crunching EHR data
Instrumental to the study were data from collaborator Athenahealth, encompassing more than 72,000 healthcare providers and EHRs for more than 23 million patients.
The investigators first trained their flu-prediction algorithm, called ARES, with data captured from June 2009 through January 2012: weekly total visit counts, visit counts for flu and flu-like illness, visit counts for flu vaccination and more. ARES then used that intelligence to estimate flu activity over the next three years, through June 2015.
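The article stops short of a full recipe, but the general shape of such a model can be sketched. Everything in the snippet below is illustrative: the file name and column names are hypothetical, the training and test windows mirror the article’s dates, and the cross-validated LASSO regression is an assumed stand-in for ARES’ actual machine-learning method, not the study’s published pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Hypothetical weekly table of EHR-derived visit counts plus the CDC's
# influenza-like-illness (ILI) rate, which is the prediction target.
df = pd.read_csv("weekly_flu.csv", parse_dates=["week"]).set_index("week")
counts = ["total_visits", "flu_visits", "ili_visits", "vaccine_visits"]

# Autoregressive terms encode historical flu patterns: recent CDC rates
# help predict the coming week's rate.
for lag in (1, 2, 3):
    df[f"cdc_ili_lag{lag}"] = df["cdc_ili"].shift(lag)
df = df.dropna()

train = df.loc["2009-06":"2012-01"]  # training window from the article
test = df.loc["2012-02":"2015-06"]   # the "next three years"

cols = counts + [f"cdc_ili_lag{lag}" for lag in (1, 2, 3)]
model = LassoCV(cv=5).fit(train[cols], train["cdc_ili"])
pred = model.predict(test[cols])

rmse = np.sqrt(np.mean((pred - test["cdc_ili"]) ** 2))
print(f"out-of-sample RMSE: {rmse:.3f}")
```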
The team showed that ARES’ estimates of national and regional flu activity had error rates two to three times lower than those of earlier predictive models. ARES also correctly estimated the timing and magnitude of the national flu “peak week.” It was slightly less accurate in predicting regional peak weeks, but clearly outperformed Google Flu Trends on all measures.
The idea of capturing data directly from health care encounters definitely makes sense — assuming such data can be liberated from proprietary, HIPAA-bound healthcare IT systems. “As EHR data become more ubiquitously available, we will see major leaps in our ability to monitor and track disease outbreaks,” says Brownstein.
“Having access to near-real-time aggregated EHR information has enabled us to significantly improve our flu tracking and forecasting systems,” agrees Santillana, a member of Boston Children’s Computational Health Informatics Program (CHIP), and also affiliated with Harvard Medical School and the Harvard Institute for Applied Computational Sciences. “Real-time tracking will enable local public health officials to better prepare for unusual flu activity and potentially save lives.”
Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately.
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
A conventional way that computers process data is called representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it — a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
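As a toy illustration of the unsupervised route (not the A*STAR framework itself), the sketch below clusters unlabeled feature vectors into groups. The synthetic “face features” are a hypothetical stand-in for the learned representations a real system would extract from mugshots.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 64-dimensional "face features": three identity-like groups of
# 100 images each, standing in for a learned image representation.
rng = np.random.default_rng(42)
faces = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 64))
    for c in (0.0, 3.0, 6.0)
])

# Unsupervised learning: no labels are provided; the algorithm groups
# images purely by similarity in feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(faces)
print(np.bincount(labels))  # expect roughly 100 images per cluster
```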
PolyU breaks world record for fastest optical communications for data centres
The Hong Kong Polytechnic University (PolyU) has achieved the world’s fastest optical communications speed for data centres, reaching 240 Gbit/s over 2 km, 24 times the fastest speed currently available in the market. The technology developed by PolyU also cuts the cost of data transmission per unit to a quarter of that of existing alternatives, making it practical for commercialisation. Speedy, low-cost transmission for data centres enables end users to widely adopt new forms of communications such as immersive video, augmented reality and virtual reality.
On a societal level, the increased transmission speed will open up a new era for Big Data and Internet of Things (IoT) applications, driving innovation and technology advancement.
With this breakthrough, around 10,000 people can stream 4K video at the same time, compared with only 400 under currently available speeds.
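For context, the quoted figures hang together under an assumed 4K bitrate of roughly 24 to 25 Mbit/s per viewer, as this quick back-of-the-envelope check shows:

```python
# Back-of-the-envelope check (assumes ~24-25 Mbit/s per 4K stream).
new_gbps = 240
market_gbps = new_gbps / 24      # 10 Gbit/s: the implied existing speed

print(new_gbps * 1000 / 10_000)  # 24.0 Mbit/s per viewer at 10,000 streams
print(market_gbps * 1000 / 400)  # 25.0 Mbit/s per viewer at 400 streams
```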