Data-Intensive Computing Initiative (DICI)
A Data Virtualization Architecture
Technical Contacts: Eric Stephan; Karen Schuchardt
Executive Summary
This research provides high-performance data middleware and services for provenance and metadata tracking, mechanisms to store and/or reference raw scientific data and related artifacts with virtualized query access, and an architecture upon which to experiment with advanced semantic query mechanisms. These services are critical to support long-term data management, cross-discipline data sharing, verification, and data dissemination.
A focus on provenance is the key unifying mechanism to describe, annotate, and track relationships between data, and to search disparate data sources while those sources retain their autonomy.
Accomplishments / Highlights
- Completed a preliminary study of twelve characteristics across six emerging provenance store and content management technologies; used the results to select and benchmark RDF and content stores and to define a preliminary provenance architecture.
- Prototyped new provenance system by combining Alfresco, Sesame, and SAM translation metadata extraction components.
- Tested the prototype provenance system with Kepler workflows developed in the Semantic Data Grid project, and converted the automated provenance capture tools to record provenance and metadata.
- Schuchardt KL, TD Gibson, EG Stephan, and G Chin. Applying Content Management to Automated Provenance Store, 2007. Concurrency and Computation: Practice and Experience.
Collaboration
- Defined the technical strategy for how the MeDICI, Adaptive Workflow, and Data Virtualization projects together form the core architecture for the Data-Intensive Computing Initiative (DICI).
- Collaborated with 20 teams from the International Provenance and Annotation Workshop (IPAW) to compare provenance architectures and technologies; continuing to work with the IPAW community to demonstrate leading-edge concepts in provenance exploration, integration, and representation. Participating in the Provenance Challenge to explore interoperability of provenance models and work toward a common core model.
- Conducted cross-organizational focus groups on various technical topics including content management systems, semantic technologies, and workflow systems.
- Collaborating with members of the Environmental Molecular Sciences Laboratory Greenbook Data Management study; developed preliminary requirements needed to support large-scale data management.
Demonstration
The Data Virtualization project is tightly integrated with DICI's Adaptive Workflow and MeDICI projects. The overall integration strategy allows workflow, remote processing, and provenance capabilities to interoperate. Our asynchronous message-based provenance listener can capture provenance from MeDICI and other applications. An API will allow a workflow designer tool to store provenance in, and retrieve it from, the provenance store directly, and a search interface will allow users to query for workflow instances.
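The pattern described above (an asynchronous message-based listener feeding a provenance store that can then be searched) can be illustrated with a minimal sketch. This is not the project's implementation: the class names, the in-memory store, the triple-style record, and identifiers such as "workflow:run-1" and "instanceOf" are all hypothetical, and a Python standard-library queue stands in for whatever messaging layer the project actually uses.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """One subject-predicate-object statement, in the spirit of an RDF triple."""
    subject: str
    predicate: str
    obj: str


class ProvenanceStore:
    """Hypothetical in-memory stand-in for the provenance store."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def store(self, record):
        with self._lock:
            self._records.append(record)

    def query(self, subject=None, predicate=None, obj=None):
        """Return records matching every field that is not None."""
        with self._lock:
            return [
                r for r in self._records
                if (subject is None or r.subject == subject)
                and (predicate is None or r.predicate == predicate)
                and (obj is None or r.obj == obj)
            ]


class ProvenanceListener:
    """Drains a message queue on a background thread and writes to the store."""

    _STOP = object()  # sentinel that tells the worker thread to exit

    def __init__(self, store):
        self._store = store
        self._messages = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def publish(self, subject, predicate, obj):
        """Called by producers; enqueues a record without touching the store."""
        self._messages.put(ProvenanceRecord(subject, predicate, obj))

    def stop(self):
        """Process everything already queued, then shut the listener down."""
        self._messages.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._messages.get()
            if message is self._STOP:
                break
            self._store.store(message)


def demo():
    store = ProvenanceStore()
    listener = ProvenanceListener(store)
    listener.start()
    # Hypothetical provenance emitted by a workflow engine.
    listener.publish("workflow:run-1", "instanceOf", "workflow:alignment")
    listener.publish("data:output-1", "generatedBy", "workflow:run-1")
    listener.stop()
    # Search for instances of a given workflow.
    return store.query(predicate="instanceOf", obj="workflow:alignment")
```

In this sketch, producers only enqueue messages, so a slow store never blocks the application emitting provenance; the query method plays the role of the search interface for workflow instances.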
Impacts
Data virtualization will make it cost-effective and efficient to support scalable cross-discipline data sharing and dissemination across data-intensive computing problem spaces without imposing a centralized or rigid structure on data providers. Our high-performance data middleware and services will be built generically to support diverse application domains including bioinformatics, cybersecurity, intelligence analysis, and the power grid.
