Data-Intensive Computing Initiative (DICI)
Data-Intensive Machine Learning for Real-Time Decision Analysis
Technical Contacts: Bobbie-Jo Webb-Robertson; Christopher Oehmen
Executive Summary
Diverse problems, such as detecting anomalous events within a network or determining the function of biological molecules from homology signatures, can often be addressed reliably with support vector machine (SVM) classification. However, emerging data-intensive applications are pushing SVM training requirements far beyond the capacity of most architectures. To mitigate this training cost, (1) new parallel implementations of the SVM optimization algorithm will be developed and benchmarked on multiple computing platforms, and (2) the learning process will be automated and run offline, separate from the decision-making step, so that decisions can be made in real time (Figure 1). The final product will be a novel SVM framework that enables offline and online training in either an automated or user-driven setting.
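The separation of offline learning from real-time decision making can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a linear SVM without a bias term, trained with a Pegasos-style stochastic subgradient method, and the names train_linear_svm and classify, the toy data, and the JSON model format are all hypothetical choices made for illustration.

```python
import json
import random


def train_linear_svm(samples, labels, lam=0.01, epochs=50, seed=0):
    """Offline phase: Pegasos-style subgradient descent on the hinge loss.

    Assumption: linear kernel, no bias term, labels in {+1, -1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)      # decaying step size
            shrink = 1.0 - 1.0 / t     # equals 1 - eta * lam (regularization)
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            w = [shrink * wj for wj in w]
            if margin < 1:             # sample violates the margin
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return {"w": w}


def classify(model, x):
    """Online phase: one dot product per decision."""
    score = sum(wj * xj for wj, xj in zip(model["w"], x))
    return 1 if score >= 0 else -1


# Offline: train once and serialize the model (toy, linearly separable data).
train_x = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0),
           (-2.0, -2.0), (-3.0, -1.0), (-2.0, -3.0)]
train_y = [1, 1, 1, -1, -1, -1]
serialized = json.dumps(train_linear_svm(train_x, train_y))

# Online: load the stored model and classify incoming samples as they arrive.
model = json.loads(serialized)
stream = [(1.5, 2.5), (-1.0, -4.0), (4.0, 0.5)]
decisions = [classify(model, x) for x in stream]
```

In this arrangement the expensive step (training) runs once, while each online decision costs a single dot product, which is what makes a lightweight classifier deployable close to the data source.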
Accomplishments / Highlights
Data-Intensive Application Demonstration -- Bioinformatics
Develop a new SVM HOmology Tool (SHOT) that formulates remote homology detection as a pairwise problem: determining whether two proteins share a common evolutionary ancestor. The challenge:
- Given a small benchmark dataset (~4000 proteins), training the pairwise problem requires observing all non-redundant pairs -- about 9 million.
- Conventional methods require storing over a petabyte of data.
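The growth in training examples under the pairwise formulation can be sketched as follows. Every unordered pair of proteins becomes one training example, so n proteins yield n(n-1)/2 examples. The function names, the toy identifiers, and the use of a shared family label as a stand-in for common ancestry are assumptions made for illustration, not details of SHOT.

```python
from itertools import combinations
from math import comb


def pair_count(n_proteins):
    """Number of unordered, non-redundant pairs: n choose 2."""
    return comb(n_proteins, 2)


def pairwise_examples(families):
    """Yield ((a, b), label) for every unordered pair of proteins.

    Assumption: a shared family label stands in for "common ancestor".
    """
    for a, b in combinations(sorted(families), 2):
        yield (a, b), (1 if families[a] == families[b] else -1)


# Hypothetical toy dataset: protein ID -> family label.
toy = {"P1": "famA", "P2": "famA", "P3": "famB", "P4": "famB"}
examples = list(pairwise_examples(toy))
positives = [pair for pair, label in examples if label == 1]

# For exactly 4,000 proteins the count is 7,998,000; the exact figure
# depends on the true size of the benchmark dataset.
pairs_for_4000 = pair_count(4000)
```

Because the number of examples grows quadratically with the number of proteins, any method whose storage itself grows quadratically with the number of examples quickly becomes impractical, which is the data-intensive aspect of the problem.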
Data-Intensive Training
- Base capability allowing production of SHOT.
- Parallel version of SVM implemented and benchmarked.
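The source does not describe how the parallel SVM is structured. One common data-parallel pattern, shown here only as a hedged sketch, is to split the kernel (Gram) matrix into row blocks and compute the blocks concurrently. The RBF kernel, the function names, and the use of a thread pool are assumptions; in CPython, threads illustrate the decomposition but give little speedup for pure-Python arithmetic, so a real implementation would more likely use processes or message passing.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_rows(data, row_indices):
    """Compute the kernel matrix rows for the given sample indices."""
    return [[rbf_kernel(data[i], x) for x in data] for i in row_indices]


def parallel_gram(data, n_workers=2):
    """Compute the full kernel matrix by distributing row blocks to workers."""
    n = len(data)
    if n == 0:
        return []
    block = math.ceil(n / n_workers)
    chunks = [range(start, min(start + block, n))
              for start in range(0, n, block)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves chunk order, so rows come back in index order.
        parts = pool.map(lambda rows: kernel_rows(data, rows), chunks)
        return [row for part in parts for row in part]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
gram = parallel_gram(points, n_workers=2)
serial = kernel_rows(points, range(len(points)))
```

Each worker needs the full dataset but only produces its own rows, so the blocks can be computed independently and concatenated without further coordination.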
Collaboration
Data-Intensive Computing Initiative (DICI)
- Adaptive Composite Analysis for Complex Systems project, to enable fusion of either classifications or kernels.
- Integrated Demonstration of Biological Workflows to Support Threat Detection and Biomarker Discovery project, to enable final decision-making in the workflow.
- Real-Time Situational Awareness from Massive Sensor Data project, to enable identification of threats in cyber security.
Environmental Biomarker Initiative (EBI)
- The SVM methodology is being applied to non-data-intensive problems within the EBI and is also part of the integrated demonstration.
Environmental Molecular Sciences Laboratory
- Grand challenge in Membrane Biology.
- NMR applied to metabolomics.
Demonstration
Scientific Discovery and Insight
- Defining ion distributions in the data.
- Final classification of diagnostic action.
- Defining multiple actions over multiple data sources (integrated with composite analysis).
Decision Support and Control
- Rapid training for anomaly detection.
- Rapid, lightweight classification system.
- Classification of streaming data for near-real-time reaction.
Impacts
The development of a new SVM framework for data-intensive problems will enable more accurate solutions to problems that currently cannot be tackled using this technology. The capability is applicable across a wide range of domains, and the framework will allow domains that need real-time decisions to deploy the final classifier at the sensor level if necessary. The ability to train SVMs on data-intensive problems adds a powerful new approach to the arsenal of methods being developed for predictive science across Laboratory initiatives and the DOE.
