Grantee Research Project Results

Final Report: Knowledge-based Environmental Data Analysis

EPA Grant Number: R825199
Title: Knowledge-based Environmental Data Analysis
Investigators: Fine, Steven S.
Institution: MCNC / North Carolina Supercomputing Center
EPA Project Officer: Aja, Hayley
Project Period: October 29, 1996 through October 28, 1999 (Extended to October 27, 2000)
Project Amount: $597,994
RFA: High Performance Computing (1996)
Research Category: Aquatic Ecosystems, Environmental Statistics, Human Health

Objective:

The purpose of this project was to develop a prototype Knowledge-Based Environmental Data Analysis Assistant (KEDAA) and to evaluate the Assistant's usefulness for supporting environmental data analysis. KEDAA was intended to provide an intelligent interface between the user and existing data analysis packages and to provide support at a higher conceptual level for creating and managing analyses than is currently available. The Assistant would use its knowledge to suggest appropriate analyses (e.g., contour ozone fields after running an oxidant model), to provide convenient access to previous analyses, and to utilize existing analysis packages, such as the Package for Analysis and Visualization of Environmental data (PAVE) or the Grid Analysis and Data System (GrADS). The KEDAA developers hoped that encouraging people to apply a wide variety of analysis techniques and tools to their work, simplifying that process, and allowing people to easily track and reuse analyses would allow environmental modelers and decision makers to more thoroughly and efficiently gain insight from disparate environmental data sets.

Summary/Accomplishments (Outputs/Outcomes):

This work was performed at the MCNC/North Carolina Supercomputing Center's Environmental Programs group. The project was terminated with roughly a third of the funds remaining because the Principal Investigator accepted a government position to which the grant could not be transferred. The prototype KEDAA that was developed includes a number of approaches that could support those analyzing environmental data. The prototype supports organization and reuse of analyses. Supported analysis techniques include contour plots, scatterplots, and time series plots. Analyses are performed by several external packages:

PAVE (http://www.emc.mcnc.org/EDSS/pave_doc/Pave.html),
GrADS (http://grads.iges.org/grads/head.html), and
gnuplot (http://www.gnuplot.info/).

Unfortunately, the project was terminated just as it was reaching the stage where it would be practical for people to incorporate KEDAA into their work, which would have provided a realistic evaluation of our approaches. Also, the final phase of the project is where the capability to suggest analyses would have been implemented and evaluated. The KEDAA source code, byte code, and help files have been submitted to the National High Performance Computing and Communications Software Exchange (http://www.nhse.org/) as open source so interested parties can benefit from our work. The following section describes the innovative approaches that were implemented in KEDAA and the limited feedback that was obtained.

As a user performs analyses, he/she can record the description of what was done as entries in an electronic outline called a "notebook." This allows the user to reperform, copy, paste, edit, and rearrange hierarchies of analyses. This storage and reuse of analysis specifications through a graphical user interface provides several benefits including easily reviewing details of how an analysis was performed, repeating analyses when data change, and applying old analyses to new data sets. As would be expected from an electronic outline, levels can be collapsed and expanded to control the level of detail visible in the history of analyses.

Users can insert into the notebook outline items that contain arbitrary text. This provides an integrated method for recording ideas, observations, hypotheses, and conclusions with the specification of analyses that have been performed, providing an interactive lab notebook. Integrated notes and the ability to assign arbitrary names to analyses and topics in the outline should make it easier to reconstruct work and rationales at a later time. An additional feature that would be easy to implement would be to allow the user to select all or part of a notebook and to export it to HTML including graphics of the selected analyses.
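
The notebook described above can be pictured as a tree of entries. The following is only a minimal sketch under stated assumptions, not KEDAA's actual code: the `Entry` class, its `Kind` values, and the `(+)` collapse marker are hypothetical names chosen for illustration. It shows analyses and free-text notes sharing one outline, a deep copy that supports copy and paste of whole hierarchies, and collapsing a level to hide its detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical notebook entry: a topic, an analysis specification, or a free-text note. */
class Entry {
    enum Kind { TOPIC, ANALYSIS, NOTE }

    final Kind kind;
    final String text;
    final List<Entry> children = new ArrayList<>();
    boolean collapsed = false;

    Entry(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    /** Append a child entry and return this entry so calls can be chained. */
    Entry add(Entry child) {
        children.add(child);
        return this;
    }

    /** Deep copy, so a pasted hierarchy can be edited without touching the original. */
    Entry copy() {
        Entry c = new Entry(kind, text);
        c.collapsed = collapsed;
        for (Entry child : children) {
            c.children.add(child.copy());
        }
        return c;
    }

    /** Render the visible outline lines, hiding the children of collapsed entries. */
    List<String> render() {
        List<String> out = new ArrayList<>();
        render("", out);
        return out;
    }

    private void render(String indent, List<String> out) {
        String marker = (collapsed && !children.isEmpty()) ? " (+)" : "";
        out.add(indent + kind + ": " + text + marker);
        if (collapsed) {
            return;
        }
        for (Entry child : children) {
            child.render(indent + "  ", out);
        }
    }
}
```

In this sketch a note is just another entry kind, which is one simple way to keep observations next to the analyses they refer to.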

A third role that the notebooks perform is to provide scopes for variables used in analyses. Each topic in the outline can have variables attached to it. Variables can correspond to variables found in a data file and variables can be derived from other variables via arithmetic operations, subsetting, and reductions (e.g., mean or maximum). A variable defined in a topic can be referenced by name in any contained analysis or topic. When the user defines a variable or analysis he/she can use any variable defined in any topic that contains his/her current location in the outline. This approach provides several advantages for users. They can easily define variables that have a well defined and easy-to-understand scope, which can be very narrow or very wide, without learning a new programming language. Nesting topics allows users to build sets of related variables. For instance, the outermost topic could contain variables read from a file and two topics that represent different subsets of the file's variables. Finally, because variables are referenced by name, topics and analyses that are moved or copied into a different parent topic apply to variables defined in their new parent topic. For instance, if a topic contained a variable called "O3" and a sibling topic contained a different variable called "O3", copying analyses that depend on O3 from the first to the second topic would cause the analyses to apply to a different set of values. Thus, a user could apply a set of analyses to multiple data sets that contain the same variable names (e.g., evaluations of multiple management alternatives or simulation results for different time periods) by creating a topic for each data set and pasting analyses into each topic. This approach allows users to easily perform repetitive analyses.

There are a couple of weaknesses with KEDAA's implementation of outline topics as variable scopes. If a user wants to reapply analyses by copying them to different topics, he/she must use variable names consistently. For instance, if a variable name appears in multiple topics where the same analyses will be applied, each instance of the variable name should represent the same concept and have the same structure (e.g., gridded vs. point values). Also, there is no convenient way to compare variables that are developed in topics that do not have an ancestor-descendant relationship. This problem could be resolved in several ways, such as allowing the user to copy variables from one topic to another.
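
The scoping rule described above amounts to a name lookup that walks from a topic outward through its enclosing topics. The sketch below is an illustration under assumptions rather than KEDAA's implementation: `Topic`, `MaxAnalysis`, and the use of plain `double[]` values are hypothetical simplifications. It shows why an analysis that refers to "O3" by name applies to different values when it is placed in a different topic, and why sibling topics cannot see each other's variables.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical outline topic that doubles as a variable scope. */
class Topic {
    private final String name;
    private Topic parent;  // null for the outermost topic
    private final Map<String, double[]> variables = new LinkedHashMap<>();

    Topic(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    /** Create a nested topic whose lookups fall back to this topic. */
    Topic addChild(String childName) {
        Topic child = new Topic(childName);
        child.parent = this;
        return child;
    }

    /** Attach a variable to this topic. */
    void define(String varName, double... values) {
        variables.put(varName, values);
    }

    /** Resolve a name by searching this topic, then each enclosing topic in turn. */
    Optional<double[]> resolve(String varName) {
        for (Topic t = this; t != null; t = t.parent) {
            double[] values = t.variables.get(varName);
            if (values != null) {
                return Optional.of(values);
            }
        }
        return Optional.empty();
    }
}

/** Hypothetical analysis (a maximum reduction) that refers to its input by name only. */
class MaxAnalysis {
    private final String varName;

    MaxAnalysis(String varName) {
        this.varName = varName;
    }

    /** Evaluate against whichever topic currently contains the analysis. */
    double runIn(Topic scope) {
        double[] values = scope.resolve(varName)
            .orElseThrow(() -> new IllegalStateException("Undefined variable: " + varName));
        double max = Double.NEGATIVE_INFINITY;  // result for an empty variable
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }
}
```

Because the analysis stores only the variable name, pasting it under a different topic rebinds it without any change to the analysis itself. The same design explains the second weakness: a lookup never descends into a sibling topic, so variables in unrelated topics cannot be compared directly.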

KEDAA relies on external analysis packages to perform analyses. This allows KEDAA to take advantage of capabilities that already exist. Of course, this implies that KEDAA must be able to generate commands for each analysis package that it drives. This required KEDAA to account for the different behaviors and capabilities of the packages. For instance, some packages open a new window with each analysis while other packages reuse the same window for all analyses. To address these issues, KEDAA was designed such that package-specific code is isolated in two Java classes devoted to that package. One class indicates why a package cannot satisfy an analysis specification or generates commands to perform an analysis. The other class manages package-specific information related to sending commands, processing prompts, and managing windows.
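
The two-class split described above might look roughly like the following. This is a sketch under assumptions and not the actual KEDAA classes: the interface names `CommandGenerator` and `PackageSession`, the `AnalysisSpec` fields, and the `toyplot` command text are all hypothetical, and the session here only records commands instead of talking to a real package.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical analysis request; a real specification would carry many more options. */
final class AnalysisSpec {
    final String plotType;  // e.g. "contour", "scatter", "timeseries"
    final String variable;

    AnalysisSpec(String plotType, String variable) {
        this.plotType = plotType;
        this.variable = variable;
    }
}

/** Role 1: say why a package cannot satisfy a specification, or generate its commands. */
interface CommandGenerator {
    /** Empty if the package can perform the analysis; otherwise the reason it cannot. */
    Optional<String> whyUnsupported(AnalysisSpec spec);

    List<String> commandsFor(AnalysisSpec spec);
}

/** Role 2: package-specific details of sending commands and managing windows. */
interface PackageSession {
    boolean opensNewWindowPerAnalysis();

    void send(List<String> commands);
}

/** Toy generator for an imaginary package that can only draw time series plots. */
final class ToyGenerator implements CommandGenerator {
    public Optional<String> whyUnsupported(AnalysisSpec spec) {
        return "timeseries".equals(spec.plotType)
            ? Optional.empty()
            : Optional.of("toy package cannot draw " + spec.plotType + " plots");
    }

    public List<String> commandsFor(AnalysisSpec spec) {
        // Placeholder command syntax, not the syntax of any real package.
        return Collections.singletonList("toyplot " + spec.plotType + " " + spec.variable);
    }
}

/** Stand-in session that records commands instead of driving an external process. */
final class RecordingSession implements PackageSession {
    final List<String> sent = new ArrayList<>();

    public boolean opensNewWindowPerAnalysis() {
        return false;  // assume this toy package reuses one window
    }

    public void send(List<String> commands) {
        sent.addAll(commands);
    }
}

/** Package-independent code: check feasibility first, then send the commands. */
final class Dispatcher {
    static String perform(AnalysisSpec spec, CommandGenerator generator, PackageSession session) {
        Optional<String> problem = generator.whyUnsupported(spec);
        if (problem.isPresent()) {
            return "skipped: " + problem.get();
        }
        session.send(generator.commandsFor(spec));
        return "sent";
    }
}
```

Keeping the dispatcher ignorant of any particular package is what would let a new package be supported by adding only its own pair of classes.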

Because KEDAA can drive multiple analysis packages, it in essence provides a common user interface for those packages. When analysis packages do not provide a graphical user interface, KEDAA can provide an alternate way to interact with the package. KEDAA's graphical user interface (GUI) includes a set of options that might be available for each type of analysis. For instance, people generating contour plots might want to specify contour colors. The package-specific code incorporated into KEDAA determines if and how these options can be satisfied. Because KEDAA does not expose every option that the analysis packages provide, a user can generate the commands for an analysis through the GUI and then edit those commands to include additional behaviors or options. Of course, interacting with analyses that have been performed (e.g., probing data on a graphical display) requires interacting directly with the external analysis package.

KEDAA includes an iteration capability that is more flexible than is found in most analysis packages. A user can select an arbitrary set of analyses and indicate that they should be iterated in time or space. Spatial iteration options include vertical layer and station. KEDAA's iteration control also allows the user to specify where output of analyses should be sent. This makes it easy, for instance, for someone to preview a number of analyses on-screen and then to print or save images.
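
One way to picture the iteration capability is as the expansion of a selected set of analyses into one job per analysis per index, each tagged with an output destination. The sketch below is hypothetical: `IterationAxis`, `OutputTarget`, `IterationPlan`, the job string format, and the choice to run every selected analysis at one index before moving to the next are assumptions made for illustration, not details taken from KEDAA.

```java
import java.util.ArrayList;
import java.util.List;

/** What to iterate over: time, or one of the spatial options named in the report. */
enum IterationAxis { TIME_STEP, LAYER, STATION }

/** Where the output of each iterated analysis should be sent. */
enum OutputTarget { SCREEN, PRINTER, IMAGE_FILE }

final class IterationPlan {
    /**
     * Expand the selected analyses over the inclusive index range [first, last].
     * All analyses run at one index before the index advances (an assumed ordering).
     */
    static List<String> expand(List<String> analyses, IterationAxis axis,
                               int first, int last, OutputTarget target) {
        List<String> jobs = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            for (String analysis : analyses) {
                jobs.add(analysis + " [" + axis + "=" + i + "] -> " + target);
            }
        }
        return jobs;
    }
}
```

Under this sketch, previewing on-screen and then saving images would mean expanding the same selection twice, changing only the `OutputTarget` argument.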

Even though the project ended before we explored the most advanced concepts in our plans and before users evaluated the techniques, we believe that the techniques and prototype we developed can assist people performing environmental analyses, especially those who perform repetitive analyses. Few if any environmental data analysis packages provide analysis management capabilities (e.g., copy, paste, and edit) for previous analyses and easy application of analyses to new data sets without the use of a scripting language. The prototype that was developed and provided to the community as open source includes support for three data analysis packages (PAVE, GrADS, and gnuplot). Support for additional packages and analyses can be easily incorporated. Further, the approaches we developed for managing analyses also might be useful for managing computational studies, where a modeler specifies a set of programs to execute and then wants to apply the entire set to multiple data sets.

Journal Articles:

No journal articles submitted with this report: View all 3 publications for this project

Supplemental Keywords:

supercomputing, modeling, data analysis, RFA, Ecosystem Protection/Environmental Exposure & Risk, computing technology, ecosystem simulation, data management, environmental decision making, information technology, integrated visualization, environmental visualization packages, computer science

Relevant Websites:

http://www.emc.mcnc.org/projects/kedaa/
http://www.emc.mcnc.org/EDSS/pave_doc/Pave.html
http://grads.iges.org/grads/head.html
http://www.gnuplot.info/
http://www.nhse.org

The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.