Grantee Research Project Results
Final Report: High-Performance Environmental Models and Class Libraries of Parallel Algorithms
EPA Grant Number: R825293
Title: High-Performance Environmental Models and Class Libraries of Parallel Algorithms
Investigators: Sheng, Peter; Rajasekaran, Sanguthevar
Institution: University of Florida
EPA Project Officer: Aja, Hayley
Project Period: January 1, 1997 through December 31, 1999 (Extended to December 31, 2000)
Project Amount: $730,605
RFA: High Performance Computing (1996)
Research Category: Human Health, Aquatic Ecosystems, Environmental Statistics
Objective:
To meet the demand for increased spatial and temporal resolution and uncertainty analysis in environmental models for ecological assessment and water resources management, it is essential to develop high-performance hydrodynamics and water quality models using parallel techniques. In this project, we aim to develop such high-performance environmental models as well as class libraries for parallel numerical solvers. Specifically, we will: (1) develop reusable object-model-based class libraries for parallel numerical solvers and scientific processes related to cross-media environmental modeling and uncertainty; (2) apply the reusable library or other methods to improve the performance of existing hydrodynamics and water quality models; (3) couple a parallel hydrodynamics and water quality model to a geographic information system (GIS) model to enhance the model simulation; (4) apply the hydrodynamics-water quality-GIS model to the Indian River Lagoon in Florida; and (5) make the class libraries of parallel algorithms accessible to U.S. Environmental Protection Agency (EPA) scientists.
Summary/Accomplishments (Outputs/Outcomes):
Class Library Development. On the library side, we have completed the implementation of parallel algorithms for the problems of multiplying matrices, solving banded systems of linear equations, and solving a full system of linear equations. Several algorithms, such as Gaussian elimination and Csanky's algorithm, have been explored for solving a full system. Currently, we are considering the solution of sparse systems.
In our recent work (Rajasekaran, 1998), we have come up with a simple algorithm for external sorting that is not only asymptotically optimal but also practical. It performs better than the DSM algorithm commonly used in practice.
External sorting and selection are important problems in the manipulation of voluminous data, not only in simulation but also in other areas of engineering. These problems have been studied extensively by many researchers in the past decade. Though a number of asymptotically optimal algorithms are known, they are not practical owing to large constants in their time bounds. We had reported the discovery of a simple algorithm that is not only asymptotically optimal but also has the potential to be practical. Recent implementation results of Cormen and Pearson (at Dartmouth College) indicate that our algorithm is practical and that it outperforms its competitors. We also have developed asymptotically optimal randomized and deterministic external selection algorithms. The deterministic selection algorithm has been shown to be competitive in practice.
Matrix operations play a vital role in engineering simulations. The operations of interest are: vector-vector multiplication, matrix-vector multiplication, matrix-matrix multiplication, and matrix inverse (for both full and sparse matrices).
We have developed scalable programs for all of the above. An efficient matrix multiplication algorithm also can be employed for solving the vector-related problems; thus, it suffices to discuss matrix multiplication and matrix inverse. Our programs were developed for both PVM and MPI. Next, we describe our experience using PVM in the context of matrix multiplication (Konyala, et al., 1998).
PVM (Parallel Virtual Machine) is a software package that allows a heterogeneous network of parallel and serial computers to appear as a single concurrent computational resource. We have obtained good speedups for matrix multiplication using PVM.
Parallel models of computing differ in the ways in which they effect inter-processor communication. In PVM, the functions pvm_send and pvm_recv can be used by the participating machines to communicate. PVM consists of two parts: a daemon process and a library that contains routines for spawning processes on other machines, communicating between processes, changing the configuration, and so on. To run PVM applications, a "virtual machine" must be set up. This is accomplished by starting the PVM daemon and specifying the machines to be used. The PVM library provides a large assortment of functions to configure and utilize the virtual machine, including debugging and process tracing abilities. PVM provides blocking and non-blocking sends and receives. Broadcast and multicast also are supported. Group functions such as scatter, gather, and reduce are supplied. Trace information can be generated chronicling the PVM calls made by a process; selective control of exactly what information to trace is possible.
Task spawning and process control in PVM are robust. PVM allows multiple instances of the same program to be spawned, or a separate program can be made a slave task, again allowing for many different paradigms. A typical PVM program will enroll itself in PVM first. Then, it can get any necessary information and spawn its children. Enrollment takes place with the first call to a PVM library function, usually pvm_mytid. Spawning of processes is accomplished by calling pvm_spawn; the exact number and placement of tasks can be specified or left to the daemon to choose. Process communication takes place with calls to pvm_send and pvm_recv after some initialization and packaging of data.
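As an illustration, the following minimal master program sketches this lifecycle. It is not taken from the project code: the slave executable name ("slave"), the message tags, and the omission of error handling are our illustrative assumptions.

/* master.c: minimal sketch of the PVM program lifecycle described above.
   The slave executable name and the message tags (1, 2) are illustrative. */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int mytid = pvm_mytid();            /* enroll in PVM (first PVM call) */
    int tids[4];
    int n = pvm_spawn("slave", NULL, PvmTaskDefault, "", 4, tids);

    int work = 42;
    pvm_initsend(PvmDataDefault);       /* initialize a send buffer */
    pvm_pkint(&work, 1, 1);             /* package the data */
    pvm_send(tids[0], 1);               /* blocking send, message tag 1 */

    int result;
    pvm_recv(tids[0], 2);               /* blocking receive, tag 2 */
    pvm_upkint(&result, 1, 1);

    printf("task %x spawned %d slaves; first slave returned %d\n",
           mytid, n, result);
    pvm_exit();                         /* leave the virtual machine */
    return 0;
}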
The problem of matrix multiplication has been extensively investigated. The trivial algorithm for matrix multiplication takes O(n^3) time on two n × n matrices. Strassen improved this bound to O(n^2.81). The best known algorithm for matrix multiplication is due to Coppersmith and Winograd, and it runs in O(n^2.376) time (Horowitz, et al., 1998). Parallel algorithms also have been devised for matrix multiplication and related problems (Ja Ja, 1992). To achieve good performance, we should avoid unnecessary point-to-point communication between the master and the slaves. The problem of matrix multiplication is subdivided equally among the slaves, and the matrices to be multiplied are broadcast to all the slaves, which reduces the communication overhead incurred when the matrices are sent to each of the slaves separately. Each slave also is sent an array called slave_bounds, where slave_bounds[i] is the index of the first element to be calculated by slave i. After the local computations by the slaves, the master uses the non-blocking receive to collect the result from each slave, which is found to save time compared to the blocking receive.
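A sketch of the master's distribution step under this scheme is given below. The matrix size, the number of slaves, and the message tag are illustrative assumptions, not values from the original program; pvm_mcast is used for the single broadcast to the slave group.

/* Sketch of the master's distribution step. Dimensions, slave count,
   and the tag are illustrative; error handling is omitted. */
#include <pvm3.h>

#define N      600   /* matrix dimension */
#define SLAVES 6

void distribute(double a[N][N], double b[N][N], int tids[SLAVES])
{
    int slave_bounds[SLAVES + 1];
    for (int i = 0; i <= SLAVES; i++)      /* even split of the output rows */
        slave_bounds[i] = i * N / SLAVES;

    pvm_initsend(PvmDataDefault);
    pvm_pkint(slave_bounds, SLAVES + 1, 1);
    pvm_pkdouble(&a[0][0], N * N, 1);      /* both inputs go out once ... */
    pvm_pkdouble(&b[0][0], N * N, 1);
    pvm_mcast(tids, SLAVES, 1);            /* ... in a single multicast */
}

The master would then poll pvm_nrecv for each slave's block of the product, so that whichever slave finishes first is serviced first.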
We have measured the performance of the program for matrix sizes 100, 200, 300, and 600. For matrix size 100, the sequential program shows better performance than its PVM counterpart because the communication overhead exceeds the gain due to parallel execution. For matrix sizes greater than 100, the PVM implementation shows substantial improvement over the sequential program, with a speedup of more than 3 in the case of matrix size 600. The experimental verification of the above results was done on our LAN, where the traffic varies from time to time; when the program is run in a dedicated environment, we expect the speedup to be even higher.
1-D and 2-D Parallel Models. On the high-performance simulation algorithm side, we have developed methods to convert existing 1-D and 2-D models of hydrodynamics into parallel models using PVM. Because our primary focus is on the speedup of the parallel programs, we decided to solve 1-D and 2-D test problems in relatively simple geometry and bathymetry: tidal circulation in a 1-D channel and a 2-D basin with constant depth and width. For solving the 1-D model equations in parallel, we had introduced the notions of communication factor and LessTalk. The idea is to perform some redundant computations in order to reduce the number of communication steps. A normal parallel simulation algorithm will perform a communication step for every step of the simulation. In our approach, we perform a communication step only once every 10 or 20 simulation steps. We had reported previously that this idea resulted in impressive speedups. For example, with our explicit parallel models, we were able to speed up by a factor of 11.4 using 12 processors, and 5.8 using 6 processors.
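The sketch below illustrates the idea for a 1-D explicit stencil, written with MPI-style halo exchange for concreteness (the project code used PVM). The ghost width K, the local size NLOC, and the stencil coefficients are illustrative assumptions: one exchange of K ghost cells pays for K local time steps, with the values near the boundaries redundantly recomputed instead of communicated.

/* Sketch of the LessTalk idea for a 1-D explicit stencil. After one halo
   exchange of width K, the valid region shrinks by one cell per step,
   so K steps can be taken with no further messages. */
#include <mpi.h>
#include <string.h>

#define NLOC 1000   /* cells owned by this processor               */
#define K    10     /* steps between communications = ghost width  */

static void update(double *dst, const double *src, int lo, int hi)
{
    for (int i = lo; i < hi; i++)   /* simple 3-point explicit update */
        dst[i] = 0.5 * src[i] + 0.25 * (src[i - 1] + src[i + 1]);
}

/* u has layout [K left ghosts | NLOC owned cells | K right ghosts];
   left/right are neighbor ranks (MPI_PROC_NULL at physical boundaries). */
void advance_k_steps(double u[NLOC + 2 * K], int left, int right, MPI_Comm comm)
{
    double tmp[NLOC + 2 * K];

    /* one exchange of K ghost cells with each neighbor */
    MPI_Sendrecv(&u[K],        K, MPI_DOUBLE, left,  0,
                 &u[NLOC + K], K, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOC],     K, MPI_DOUBLE, right, 1,
                 &u[0],        K, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);

    /* K local steps; step s may only trust cells [s-1, NLOC+2K-s+1) */
    for (int s = 1; s <= K; s++) {
        int lo = s, hi = NLOC + 2 * K - s;
        update(tmp, u, lo, hi);
        memcpy(&u[lo], &tmp[lo], (size_t)(hi - lo) * sizeof(double));
    }
}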
Parallelization of 1-D and 2-D Explicit Models. PVM (e.g., Geist, 1994) is a software package that permits a heterogeneous collection of UNIX and/or NT computers hooked together by a network to be used as a single large parallel computer. It supports a master/slave computation model by using message passing. A similar system is MPI (Snir, et al., 1998). We ran our experiments on a 10-node Beowulf cluster in the Civil and Coastal Engineering Department, University of Florida. Each node has two Pentium II 450-MHz processors, 256 MB of main memory, and a 9-GB hard disk. All nodes are connected through Ethernet. The operating system is Red Hat Linux Release 6.1. We computed and compared the speedup as a function of: (1) the time interval for synchronization, (2) the number of slave processors, and (3) the number of grid points. The results indicate that the speedup (between 1 and n, where n is the number of slave processors) increases with the number of slave processors and the number of grid points (from 1,000 to 100,000). With an increased time interval for synchronization, the speedup increases initially, but then decreases.
We also ran the multithreaded programs on an 8-processor Sun workstation server in the Computer and Information Science and Engineering (CISE) Department of the University of Florida. The main memory is 2 GB and the virtual memory is 4.5 GB. Each CPU runs at 248 MHz. The operating system of the server is SunOS 5.8. The results show that as the size of the problem (number of grid points) grows, the total run time grows much more slowly than that of the sequential program. The speedup is better when the problem is bigger.
The following two figures show the speedup as a function of the number of grid points for the 1-D test on a 20-CPU Beowulf cluster (left figure) and a shared memory system (right figure). It is evident that the speedup increases with the total number of grid points.
We have applied the same idea to the 2-D explicit model as well. We tested the speedup of the PVM, MPI, and multithreaded programs on the following 2-D grid systems: 100x100, 500x500, and 1000x1000 points. As in the 1-D case, we obtained good speedups when a large number of grid points is used (1000x1000); for example, the PVM program achieves a 6-fold speedup when 6 slaves are used. When only 100x100 points are used, however, the speedup is less than 5, because the value at any grid point depends on five neighboring values and the communication time is correspondingly longer relative to the computation. MPI gives slightly better speedup than PVM. The gain in speedup with increasing grid points is slower for the multithreaded program.
A New Simple Algorithm for Solving Tri-Diagonal Systems. For the solution of our 1-D implicit simulation model, one of the subroutines needed is for solving a tri-diagonal system of linear equations. We wanted to identify the best algorithm for solving a tri-diagonal system; hence, we have experimentally compared many algorithms and have reported under what conditions each algorithm is preferable over the others (Davis, et al., 1998). We also have extensively investigated available parallel algorithms for solving tri-diagonal systems, such as the ones arising in the hydrodynamics model when an implicit time integration method is used. Even though a number of algorithms are available, they are fairly complex.
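For reference, here is a minimal sketch of the classic sequential Thomas algorithm, the usual baseline in such comparisons. This is standard textbook code, not the project's implementation, and it assumes the system is diagonally dominant so that no pivoting is needed.

/* Solve the tri-diagonal system a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i]
   (a[0] and c[n-1] unused) by forward elimination and back-substitution. */
#include <stdlib.h>

void thomas(int n, const double *a, const double *b, const double *c,
            const double *d, double *x)
{
    double *cp = malloc((size_t)n * sizeof *cp);   /* modified c */
    double *dp = malloc((size_t)n * sizeof *dp);   /* modified d */

    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {                  /* forward elimination */
        double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }

    x[n - 1] = dp[n - 1];
    for (int i = n - 2; i >= 0; i--)               /* back-substitution */
        x[i] = dp[i] - cp[i] * x[i + 1];

    free(cp);
    free(dp);
}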
We have discovered a new simple algorithm for solving a general banded system of linear equations (Rajasekaran, et al., 1999). Let p ≥ 1 be the number of processors available. We have to solve the system Ax = b, where A is an n × n matrix and b is an n × 1 column vector. For simplicity, assume that n is an integral multiple of p. There are three stages in the algorithm. In the first stage, the equations are assigned to the processors evenly: the first processor gets the first n/p equations, the second processor is assigned the next n/p equations, and so on. Each processor eliminates as many unknowns as possible from the equations it received. At the end of this stage, each processor will be left with two equations, for a total of 2p equations. These 2p equations form a tri-diagonal system. In the second stage, this new system is solved, either in parallel or on a single processor. In the third stage, each processor back-substitutes the known values to obtain the values of all of its unknowns.
Note that if all the equations involving an unknown reside with the same processor, the unknown can be eliminated in O(1) time. Thus, the total work done in the first stage is O(n). The work done in the second stage also is O(n). Call this algorithm "Band." This algorithm is optimal and also performs better than the other algorithms we explored. If we have a system of n banded equations (with a constant bandwidth), our new algorithm takes only O(n/p) time using p processors. This algorithm has been tested using PVM. The method has now been applied to the 2-D implicit equations as well. The speedups of the implicit models are not as impressive as those for the explicit models.
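One way to account for the per-processor running time (our reading of the stage costs above, assuming the reduced 2p-equation system is solved sequentially with a linear-time solver) is

\[
T(n,p) \;=\; \underbrace{O(n/p)}_{\text{local elimination}}
\;+\; \underbrace{O(p)}_{\text{reduced } 2p \text{ system}}
\;+\; \underbrace{O(n/p)}_{\text{back-substitution}}
\;=\; O\!\left(\frac{n}{p} + p\right),
\]

which matches the stated O(n/p) bound whenever p = O(\sqrt{n}).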
Parallelization of 1-D and 2-D Implicit Models. For the 1-D and 2-D model tests, we considered the same test problems as used for the 1-D and 2-D explicit models, respectively. For the 1-D problem, we considered speedups for grid points ranging from 5,000 to 500,000. Although the implicit model attains almost no speedup (speedup factor ~1) when 5,000 grid points are used, the speedup factor increases to 4.5 when 7 slaves and 500,000 grid points are used. When we use multithreading and machines such as Sun Enterprise and SGI Origin, the speedup also is quite impressive. For example, using 6 threads and 1,000,000 grid points, we can get a speedup of 4.5. Details of the 1-D parallel models can be found in Luo, et al. (2001a).
For the 2-D implicit model, the speedup is not as good as in the 1-D case due to the larger communication overhead. Nevertheless, with 5 slave processors and 500x500 grid points, the speedup is about 2. Details of the 2-D parallel models can be found in Luo, et al. (2001b).
Parallelization of a Legacy 3-D Hydrodynamic Model: CH3D. Based on the results of our 1-D and 2-D parallel models, we explored the parallelization of a multi-dimensional estuarine circulation model. Our objective was to determine the best way to produce a high-performance estuarine hydrodynamic model.
There are many estuarine hydrodynamic models. A few of them (including the CH3D model) are widely used and can be considered legacy models. Most of these models are being used on single-CPU Unix-based workstations or Windows-based PCs. We decided to work with the three-dimensional model CH3D (Sheng 1986, 1989) and the integrated modeling system based on CH3D (Sheng 1998, 1999, 2000). We considered two possible approaches: (1) use the PVM or MPI approach, or (2) use the multithread approach. Because the CH3D model involves the inversion of a 2-D matrix, just like the matrix inversion method used in its Cartesian version (Sheng and Butler, 1982), it is not strictly explicit and, hence, would not easily benefit from the PVM or MPI approach. Moreover, the PVM or MPI approach would require substantial code modification. Thus, we decided to embark on the multithread approach using an SGI Origin 2000 4-CPU system.
We have conducted extensive experiments on applying the shared memory concept to significantly speed up the CH3D code by parallelizing the factorized sparse matrix solver (Sheng and Butler, 1982; Sheng, 1986) for the "propagation" module as well as the other modules (advection, diffusion, Coriolis, density gradient, and bottom friction terms). The shared memory concept has been applied to the CH3D model by dividing the model domain into n (where n is the number of processors) parts during the x-sweeps and y-sweeps (Davis and Sheng, 2000). These tests were conducted with a 500x60x4 (x-y-z directions) numerical grid for the Indian River Lagoon, a large coastal lagoon estuary in Florida. We were able to achieve a nearly 4-fold speedup. In addition to the shared memory approach, we also have used "macros" and scripting languages, e.g., cpp (the C preprocessor), to speed up model compilation and computation, and to enhance the code's portability. Work has been done to convert the sediment transport model and the water quality model into preliminary shared memory versions, and preliminary results showed very good speedup.
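The division of the domain among threads during the sweeps can be pictured with the generic sketch below. This is not the CH3D code itself: OpenMP is used here in place of the native SGI threading, and solve_row/solve_col are hypothetical stand-ins for the model's solves along each grid line.

/* Generic sketch of thread-parallel x- and y-sweeps. The rows of the
   x-sweep (and the columns of the y-sweep) are mutually independent,
   so each thread takes a strip of the domain. The grid sizes match the
   Indian River Lagoon test; solve_row/solve_col are hypothetical. */
#include <omp.h>

#define NX 500
#define NY 60

extern void solve_row(int j);   /* solve along x for row j    (hypothetical) */
extern void solve_col(int i);   /* solve along y for column i (hypothetical) */

void sweeps(void)
{
    #pragma omp parallel for          /* x-sweep: NY independent rows */
    for (int j = 0; j < NY; j++)
        solve_row(j);

    #pragma omp parallel for          /* y-sweep: NX independent columns */
    for (int i = 0; i < NX; i++)
        solve_col(i);
}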
The multithread approach, however, is not very scalable. Because of the added communication effort with increased processors and the fact that the code is not 100 percent parallelized, the speedup will not scale linearly with the number of processors. For example, we found that the speedup factor for CH3D on the SGI 4-CPU system is almost 3.5, but our theoretical calculation based on the percentage of serial code (Davis and Sheng, 2000) showed that the speedup for a 32-CPU system would be only 6.6. The cost of individual CPUs on the SGI Origin system is quite prohibitive. We therefore decided to explore the use of the message passing approach on a Beowulf cluster with a different existing model, developed by Davis and Sheng (2000).
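A theoretical calculation of this kind is presumably of the Amdahl's-law form (our assumption; the cited report may also include communication terms): if a fraction s of the code is serial, then on N CPUs

\[
S(N) \;=\; \frac{1}{\,s + (1 - s)/N\,}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s},
\]

so even a modest serial fraction caps the achievable speedup no matter how many processors are added.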
Parallelization of a 2-D Model Using MPI and Domain Decomposition on a Beowulf Cluster. One of the achievements of this project is the application of the message passing approach to an existing 2-D hydrodynamics model (Davis and Sheng, 2000) on the Beowulf cluster consisting of 20 Linux-based Intel processors. Although we would have liked to convert the CH3D model to a message passing version on the Beowulf cluster, a detailed review of the CH3D code revealed that it is not amenable to the message passing approach without major restructuring. Hence, we decided to first test the message passing approach on the Beowulf cluster with a simpler 2-D model (Davis and Sheng, 2000). This approach is based on domain decomposition, as opposed to the master-slave approach discussed previously. The model differs from the 2-D version of CH3D in several aspects: (1) it uses a rectangular grid instead of the non-orthogonal curvilinear grid used by CH3D; (2) it solves the 2-D vertically-averaged equations of motion instead of the vertically-integrated equations used in CH3D; and (3) it uses a robust method for calculating the wetting and drying of horizontal model boundaries, following the procedure of Davis and Sheng (1999).
With a total of n CPUs, the model domain is divided into n sub-domains, and the solution of each sub-domain is carried out by one CPU only. Communication between neighboring sub-domains is accomplished using MPI. The majority of the computational effort of the parallel model goes into solving for the surface elevation and calculating the vertically integrated velocities, while less is spent on MPI communication among sub-domains and on other operations such as I/O. We have tested the model on a large grid with 200x200 cells, which is eight times the number of grid points used for the Indian River Lagoon with CH3D. Using a grid spacing of 100 meters in both the x and y directions and a time step of 1 second, we obtained only modest speedup with multiple CPUs: about 6 when 20 CPUs are used. However, when the grid is increased to 200x1600 points, the speedup increases to 12. Because a 1-second time step is rather small for long-term simulation, we are continuing our research to develop an implicit model on the Beowulf cluster.
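The setup of such a decomposition can be sketched as follows. This is an illustrative skeleton using MPI's Cartesian topology routines, not the project's actual code, and the particular split of the 200x200 grid is only an example.

/* Illustrative setup for 2-D domain decomposition: each rank owns one
   sub-domain and learns its halo-exchange neighbors from a Cartesian
   communicator. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);          /* e.g. 20 CPUs -> 5 x 4 */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    int west, east, south, north;
    MPI_Cart_shift(cart, 0, 1, &west, &east);  /* x-direction neighbors */
    MPI_Cart_shift(cart, 1, 1, &south, &north);/* y-direction neighbors */

    /* each rank now owns a (200/dims[0]) x (200/dims[1]) block of the
       200x200 grid and exchanges one row/column of halo cells per step */

    MPI_Finalize();
    return 0;
}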
Coupling of GIS with the Hydrodynamic-Water Quality Model. With funding from the State of Florida, we have developed a GIS model for the Indian River Lagoon. In this project, we advanced the GIS model so that it can be coupled to the parallel CH3D model. The GIS model, developed with ARCVIEW, can accept and display field data from various sources, produce initial conditions for model simulation, and process and display results (both scalar quantities and vectors) from the CH3D model. The GIS model can be used to aid model simulation and to facilitate comparison between model results and field data (Sun, et al., 1999).
References:
Davis J, Rajasekaran S, Sheng YP. An experimental evaluation of techniques for solving tridiagonal systems. Software Practice and Experience 1998 (submitted for publication).
Geist A. PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing, Scientific and Engineering Computation Series, MIT Press, 1994.
Ja Ja J. Introduction to Parallel Algorithms, Addison-Wesley Publishers, 1992.
Rajasekaran S. Sorting and selection on interconnection networks. DIMACS series in Discrete Mathematics and Theoretical Computer Science 1995;21:275-296.
Snir M, Gropp W, et al. MPI: The Complete Reference (2 Volume Set), MIT Press, 1998.
Journal Articles on this Report: 2 Displayed

Rajasekaran S. A framework for simple sorting algorithms on parallel disk systems. Theory of Computing Systems 2001;34(2):101-114. R825293 (1999), R825293 (Final)

Rajasekaran S. Selection algorithms for parallel disk systems. Journal of Parallel and Distributed Computing 2001;61(4):536-544. R825293 (Final)
Supplemental Keywords:
estuary, marine, water, watersheds, global climate, sediments, ecological effects, ecosystem, integrated assessment, aquatic, habitat, engineering, ecology, mathematics, physics, hydrology, modeling, general circulation models, satellite, southeast, Atlantic coast, Florida, FL, EPA Region 4, university, RFA, Scientific Discipline, Geographic Area, Ecosystem Protection/Environmental Exposure & Risk, State, computing technology, Environmental Monitoring, class libraries, computer science, parallel numerical solvers, reusable components, software reuse, cross-media modeling, data management, MasPar2, component-based software, information technology, parallel computing, water quality modeling, hydrodynamics, Florida, FL

Relevant Websites:
COASTAL & OCEANOGRAPHIC ENGINEERING
Progress and Final Reports:
Original Abstract

The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.