Data Intensive Computing
Interactive Hypothesis Identification and Evaluation with Data Intensive Visual Analytics and Algorithms
Challenge:
Understanding genomic data from community samples (i.e., metagenomes) is a driving need for revolutionary advances in biotechnology. But making sense of the data avalanche that accompanies metagenomes is generally intractable, because most analytical tools are geared toward spreadsheet-style analysis of small datasets. Multi-genome and metagenome analysis need more powerful tools that leverage high performance computing (HPC) and visual analytics approaches.
Approach:
We demonstrate a human-in-the-loop workflow that combines high performance computing (ScalaBLAST and SHOT), efficient post-processing, and advanced visualization (Starlight) within a single integration framework (MeDICi). The workflow uses a two-pass method. The first pass provides a visual representation of multigenome information from well-characterized species, which allows a user to formulate a specific hypothesis. The second pass tests that hypothesis on metagenome sequence data. At each stage, the user's query guides the HPC applications, which rapidly calculate the underlying sequence similarities needed as the basis for more sophisticated analysis.
The goal is to provide a framework for characterizing the space of functions that can be performed by a microbial community by relating its collective molecular components to those from well-characterized microbial isolates.
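The two-pass idea can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the ScalaBLAST, SHOT, Starlight, or MeDICi interfaces: the function names (`first_pass`, `second_pass`, `similarity`) are hypothetical, and a k-mer Jaccard score stands in for the alignment-based sequence similarity that the HPC tools would compute at scale. The first pass summarizes annotated reference sequences so a user can pick a function of interest; the second pass scores metagenome reads only against references tied to that choice.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data shapes: reference id -> (annotated function, sequence)
References = Dict[str, Tuple[str, str]]


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (toy stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def first_pass(references: References) -> Dict[str, List[str]]:
    """Pass 1: group well-characterized reference sequences by function.

    In the workflow described above this summary would be shown visually;
    here it is just a dictionary the user can inspect.
    """
    by_function: Dict[str, List[str]] = {}
    for ref_id, (function, _seq) in references.items():
        by_function.setdefault(function, []).append(ref_id)
    return by_function


def second_pass(
    references: References,
    metagenome: Dict[str, str],
    function: str,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Pass 2: test the user's hypothesis on metagenome reads.

    Only references annotated with the chosen function are compared,
    so the user's query limits the similarity work that is done.
    """
    hits = []
    for read_id, read in metagenome.items():
        for ref_id, (ref_function, ref_seq) in references.items():
            if ref_function != function:
                continue
            score = similarity(read, ref_seq)
            if score >= threshold:
                hits.append((read_id, ref_id, score))
    # Best-scoring hits first
    return sorted(hits, key=lambda hit: -hit[2])
```

A user would inspect the output of `first_pass`, choose one function as the hypothesis, and pass it to `second_pass`. The threshold of 0.5 is an arbitrary illustrative value.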
Impact:
This capability can simultaneously advance the state of the art in several fields: biology, visual analytics, high performance computing, and scientific workflows.

