Data-Intensive Computing Initiative (DICI)

A Data Virtualization Architecture

Technical Contacts: Eric Stephan; Karen Schuchardt

Query results on workflow provenance captured during the IPAW Provenance Challenge in January.
Data virtualization architecture integrated with the MeDICI architecture.

Executive Summary

This research provides high-performance data middleware and services for provenance and metadata tracking, mechanisms to store and/or reference raw scientific data and related artifacts with virtualized query access, and an architecture upon which to experiment with advanced semantic query mechanisms. These services are critical to support long-term data management, cross-discipline data sharing, verification, and data dissemination.

A focus on provenance is the key unifying mechanism to describe, annotate, and track relationships between data and to search disparate sources for data while retaining autonomy.

Accomplishments / Highlights

  • Completed a preliminary study of twelve characteristics across six emerging provenance store and content management technologies; used the results to select and benchmark RDF and content stores and to define a preliminary provenance architecture.
  • Prototyped a new provenance system by combining Alfresco, Sesame, and SAM translation metadata extraction components.
  • Tested the prototype provenance system using Kepler workflows developed in the Semantic Data Grid project, and converted the automated provenance capture tools to record provenance and metadata.
  • Schuchardt KL, TD Gibson, EG Stephan, and G Chin. Applying Content Management to Automated Provenance Store, 2007. Concurrency and Computation: Practice and Experience.

Collaboration

  • Defined technical strategy for how the MeDICI, Adaptive Workflow, and Data Virtualization projects comprise the core architecture for the Data-Intensive Computing Initiative (DICI).
  • Collaborated with 20 teams from the International Provenance and Annotation Workshop (IPAW) to compare provenance architectures and technologies; continuing to work with the IPAW community to demonstrate leading-edge concepts in provenance exploration, integration, and representation. Participating in the Provenance Challenge to explore interoperability of provenance models and work toward a common core model.
  • Conducted cross-organizational focus groups on various technical topics including content management systems, semantic technologies, and workflow systems.
  • Collaborating with members of the Environmental Molecular Sciences Laboratory Greenbook Data Management study; developed preliminary requirements needed to support large-scale data management.

Demonstration

The Data Virtualization project is tightly integrated with DICI's Adaptive Workflow and MeDICI projects. The overall integration strategy allows interoperability of workflow, remote processing, and provenance capabilities. Our asynchronous message-based provenance listener can capture provenance from MeDICI and other applications. An API will allow a workflow designer tool to retrieve and store provenance directly to the provenance store, and a search interface will allow users to query for workflow instances.
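
To make the listener-plus-search pattern concrete, the following is a minimal sketch in Python. It is not the project's implementation: the class names (ProvenanceEvent, ProvenanceStore, ProvenanceListener), the triple layout, and the predicate names "instanceOf" and "generated" are all hypothetical, and an in-memory set stands in for the real provenance store. It only illustrates how producers can publish provenance asynchronously while a background listener records it for later queries.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    """One provenance statement as a subject-predicate-object triple (hypothetical layout)."""
    subject: str
    predicate: str
    obj: str


class ProvenanceStore:
    """Hypothetical in-memory stand-in for the provenance store."""

    def __init__(self):
        self._triples = set()
        self._lock = threading.Lock()

    def add(self, event):
        with self._lock:
            self._triples.add(event)

    def query(self, subject=None, predicate=None, obj=None):
        """Return the triples matching every field that is not None, in sorted order."""
        with self._lock:
            return sorted(
                (t for t in self._triples
                 if (subject is None or t.subject == subject)
                 and (predicate is None or t.predicate == predicate)
                 and (obj is None or t.obj == obj)),
                key=lambda t: (t.subject, t.predicate, t.obj),
            )


class ProvenanceListener:
    """Drains a message queue on a background thread and records each event."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, store):
        self._store = store
        self._inbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def publish(self, event):
        """Called by producers; returns without waiting on the store."""
        self._inbox.put(event)

    def stop(self):
        """Signal shutdown and wait until all queued events are stored."""
        self._inbox.put(self._STOP)
        self._worker.join()

    def _run(self):
        while True:
            message = self._inbox.get()
            if message is self._STOP:
                break
            self._store.add(message)


def run_demo():
    store = ProvenanceStore()
    listener = ProvenanceListener(store)
    listener.start()
    listener.publish(ProvenanceEvent("run-1", "instanceOf", "align-workflow"))
    listener.publish(ProvenanceEvent("run-1", "generated", "output.dat"))
    listener.publish(ProvenanceEvent("run-2", "instanceOf", "align-workflow"))
    listener.stop()
    # Search for every recorded instance of one workflow.
    return [t.subject for t in store.query(predicate="instanceOf", obj="align-workflow")]
```

Calling run_demo() publishes three events and then searches for instances of the hypothetical "align-workflow". Because producers only enqueue messages, a slow store would not block the application emitting provenance, which is the main reason to prefer a message-based listener over direct writes in a sketch like this.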

Impacts

Data virtualization will make it cost-effective and efficient to support scalable cross-discipline data sharing and dissemination across data-intensive computing problem spaces without imposing a centralized or rigid structure on data providers. Our high-performance data middleware and services will be built generically to support diverse application domains including bioinformatics, cybersecurity, intelligence analysis, and the power grid.

Highlights

Ian Gorton, DICI Chief Architect, is Guest Editor of IEEE Computer's April 2008 issue, a special issue on data-intensive computing.

The MeDICi Integration Framework is now available for download and use in developing applications.
