Data-Intensive Computing Initiative (DICI)

Data-Intensive Machine Learning for Real-Time Decision Analysis

Technical Contacts: Bobbie-Jo Webb-Robertson; Christopher Oehmen

Figure 1: The overall architecture of the new SVM framework. Training on data-intensive problems will be performed offline, separately from classification, using powerful new parallel algorithms and specialized hardware. The classifier generated from this data-intensive training is reduced in size relative to the original input space and is a simple linear computation that can be performed in real time.
SHOT overcomes the sensitivity issues associated with PSI-BLAST, as well as the data-mining caveats of non-data-intensive SVM applications to this problem. Non-data-intensive SVM approaches are family based and require classifying a protein into pre-defined families. Embedded in the performance curve is the overall process of classifying a pair of proteins as homologous using the single SHOT classifier.
Calculating fragments of the kernel on the fly is extremely time consuming. We use a data-intensive approach that stores the computed kernel for reuse, resulting in a 23-fold speedup. Parallelizing the calculation of the kernel matrix yields additional speedup in proportion to the number of processors involved.
Training on large datasets drives the need for data-intensive computing, in part because of the large memory footprint required to store the kernel. Storing the kernel is key to attaining a reasonable time-to-solution for large problems.
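
The trade-off described in these captions, recomputing kernel entries on demand versus computing the kernel matrix once and keeping it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the RBF kernel, the `gamma` value, the function names, and the strided split of rows across workers are not taken from the project's implementation. Threads are used only to show that rows of the matrix are independent units of work; in CPython, a real speedup for pure-Python arithmetic would need processes or a distributed runtime.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rbf_kernel(x, y, gamma=0.5):
    """Hypothetical kernel choice: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_rows(data, row_indices, gamma=0.5):
    """Compute the requested rows of the kernel matrix.

    Each row depends only on the input data, so different workers
    can compute different rows without coordinating.
    """
    return [[rbf_kernel(data[i], x, gamma) for x in data] for i in row_indices]


def precompute_kernel(data, n_workers=2, gamma=0.5):
    """Build the full kernel matrix once so training can reuse it."""
    n = len(data)
    # Strided assignment of rows to workers: worker w gets rows w, w+k, w+2k, ...
    chunks = [list(range(w, n, n_workers)) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda idx: kernel_rows(data, idx, gamma), chunks))
    # Reassemble the rows in their original order.
    matrix = [None] * n
    for idx, rows in zip(chunks, results):
        for i, row in zip(idx, rows):
            matrix[i] = row
    return matrix


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    K = precompute_kernel(points, n_workers=2)
    # A training loop would now read K[i][j] instead of calling rbf_kernel again.
    print(K[0][1])
```

The stored matrix grows with the square of the number of training examples, which is the memory footprint the caption above refers to.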

Executive Summary

Diverse problems, such as detecting anomalous events within a network or determining the function of biological molecules based on homology signatures, can often be reliably addressed using the classification approach of support vector machines (SVMs). However, emerging data-intensive applications are pushing SVM training requirements far beyond the capacity of most architectures. To mitigate this training cost, (1) new parallel implementations of the SVM optimization algorithm will be developed and benchmarked on multiple computing platforms, and (2) the learning process will be automated and run offline, separately from the decision making, to allow real-time decisions (Figure 1). The final product will be a novel SVM framework that enables offline and online training in either an automated or user-driven setting.
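
To make the split between offline training and real-time decision making concrete, the following is a minimal sketch of the online half only. It assumes that training has already produced a weight vector and a bias, and that the deployed classifier is a linear decision function, as the Figure 1 caption describes. The function names and the numeric weights are hypothetical and are not output of the project's framework.

```python
def decision_value(weights, bias, features):
    """Linear decision function: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return +1 or -1 depending on the sign of the decision value."""
    return 1 if decision_value(weights, bias, features) >= 0 else -1


def classify_stream(weights, bias, stream):
    """Classify observations one at a time as they arrive."""
    for features in stream:
        yield classify(weights, bias, features)


if __name__ == "__main__":
    # Hypothetical parameters standing in for the result of offline training.
    weights = [2.0, -1.0]
    bias = -0.5
    incoming = [[1.0, 0.0], [0.0, 1.0]]
    for label in classify_stream(weights, bias, incoming):
        print(label)
```

Each decision costs one multiplication and one addition per feature, which is what makes this form suitable for streaming or sensor-level use.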

Accomplishments / Highlights

Data-Intensive Application Demonstration -- Bioinformatics

Develop a new SVM HOmology Tool (SHOT) that formulates remote homology detection as a pairwise problem: determining whether two proteins share a common evolutionary ancestor. The challenge:

  • Given a small benchmark dataset (~4000 proteins), training the pairwise problem requires observing all non-redundant pairs -- about 9 million.
  • Conventional methods require storing over a petabyte of data.
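
The scale of the pairwise problem can be checked with a back-of-envelope calculation. This sketch rests on assumptions not stated in the text: exactly 4,000 proteins, a dense kernel matrix over all pairs, and 8 bytes per entry. Under those assumptions the pair count comes to roughly 8 million and the dense kernel to roughly half a petabyte, the same order of magnitude as the figures quoted above; the exact values depend on the true dataset size and storage format.

```python
from math import comb


def pair_count(n_proteins):
    """Number of unordered, non-redundant protein pairs: n choose 2."""
    return comb(n_proteins, 2)


def dense_kernel_bytes(n_examples, bytes_per_entry=8):
    """Storage for a dense n x n kernel matrix (8-byte entries assumed)."""
    return n_examples ** 2 * bytes_per_entry


if __name__ == "__main__":
    n_pairs = pair_count(4000)  # assumed benchmark size
    total_bytes = dense_kernel_bytes(n_pairs)
    print(f"pairs: {n_pairs:,}")
    print(f"dense kernel: {total_bytes / 1e15:.2f} PB")
```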

Data-Intensive Training

  • Base capability allowing production of SHOT.
  • Parallel version of SVM implemented and benchmarked.

Collaboration

Data-Intensive Computing Initiative (DICI)

  • Adaptive Composite Analysis for Complex Systems project, to enable fusion of either classifications or kernels.
  • Integrated Demonstration of Biological Workflows to Support Threat Detection and Biomarker Discovery project, to enable final decision-making in the workflow.
  • Real-Time Situational Awareness from Massive Sensor Data project, to enable identification of threats in cyber security.

Environmental Biomarker Initiative (EBI)

  • The SVM methodology is being applied to non-data-intensive problems with the EBI and is also part of the integrated demonstration.

Environmental Molecular Sciences Laboratory

  • Grand challenge in Membrane Biology.
  • NMR applied to metabolomics.

Demonstration

Scientific Discovery and Insight

  • Defining ion distributions in the data.
  • Final classification of diagnostic action.
  • Defining multiple actions over multiple data sources (integrated with composite analysis).

Decision Support and Control

  • Rapid training for anomaly detection.
  • Rapid, lightweight classification system.
  • Classification of streaming data for near-real-time reaction.

Impacts

The development of a new SVM framework for data-intensive problems will enable more accurate solutions to problems that currently cannot be tackled using this technology. Such a capability will apply across a wide range of domains, and the overall framework will allow domains that need real-time decisions to deploy the final classifier at the sensor level if necessary. The ability to perform data-intensive SVMs adds a powerful new approach to the arsenal of methods being developed for predictive science across Laboratory initiatives and the DOE.

Highlights

Ian Gorton, DICI Chief Architect, is Guest Editor of IEEE Computer's April 2008 issue, a special issue on data-intensive computing.

The MeDICi Integration Framework is now available for download and use in developing applications.
