Science Inventory

Principle Component Analysis with Incomplete Data: A simulation of R pcaMethods package in Constructing an Environmental Quality Index with Missing Data

Citation:

Grabich, S., C. Gray, L. Messer, K. Rappazzo, J. Jagai, AND D. Lobdell. Principle Component Analysis with Incomplete Data: A simulation of R pcaMethods package in Constructing an Environmental Quality Index with Missing Data. Presented at Society for Epidemiologic Research, Seattle, WA, June 24 - 27, 2014.

Impact/Purpose:

This work furthers the methodologic work being conducted through the creation of the Environmental Quality Index, to help improve this measure in the next iteration. This abstract will be presented to other methodologists so that we can discuss our methods and improve them.

Description:

Missing data is a common problem in the application of statistical techniques. In principal component analysis (PCA), a technique for dimensionality reduction, incomplete data points are either discarded or imputed using interpolation methods. Such approaches are less valid when a significant portion of the data is missing. We simulated alternative methods for handling incomplete data with PCA using the Environmental Quality Index (EQI) developed by the Environmental Protection Agency. The EQI was developed for all U.S. counties (n=3,141) and includes 5 domains: air, water, land, sociodemographic, and built environment. We simulated varying levels of missing data (5%, 10%, 20%, 30%) in the data matrix and implemented four algorithms from the R pcaMethods package for handling the missing cases: Probabilistic PCA (PPCA), Bayesian PCA (BPCA), Inverse non-linear PCA (IPCA), and Non-linear estimation by iterative partial least squares (Nipals) PCA. In simulations with 30% missing data, three of the four algorithms gave eigenvalues and variable weights similar to those from the full data. For example, weights for 1,1,2,2-Tetrachloroethane (air domain) for the first component ranged from 0.10 to 0.18, with the complete data yielding a weight of 0.12. Overall, BPCA and Nipals were computationally the least efficient. BPCA and IPCA consistently had the lowest standard deviations (e.g., standard deviations for the air-domain PCA were 0.1 for BPCA and IPCA, 5.0 for PPCA, and 8.0 for Nipals). Nipals diverged significantly from the complete-data results as the data became more sparse. PPCA was the most efficient and was unbiased for large datasets even at 30% missing data. These simulations introduce an efficient way to address incomplete data when using PCA to construct indices such as the EQI. This abstract does not necessarily reflect EPA policy.
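
The simulation design described above (delete a fixed share of cells at random, refit the PCA, compare first-component weights against the complete-data fit) can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python using the standard library, not the R pcaMethods package used in the study; the data are synthetic rather than EQI variables; only a NIPALS-style algorithm is shown; and every function name, the weight vector, and the sample size are hypothetical choices made for the example.

```python
import math
import random


def simulate(n_rows=200, seed=0):
    """Synthetic data with one dominant component (hypothetical, not EQI data)."""
    rng = random.Random(seed)
    weights = [0.5, -0.4, 0.6, 0.3, -0.2, 0.3]
    rows = []
    for _ in range(n_rows):
        score = 3.0 * rng.gauss(0, 1)
        rows.append([score * w + rng.gauss(0, 0.1) for w in weights])
    return rows


def mask_at_random(X, fraction, rng):
    """Replace roughly `fraction` of the cells with None (missing at random)."""
    return [[None if rng.random() < fraction else v for v in row] for row in X]


def column_center(X):
    """Subtract each column's mean, computed over observed entries only."""
    means = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[None if v is None else v - means[j] for j, v in enumerate(row)]
            for row in X]


def nipals_first_component(X, max_iter=500, tol=1e-10):
    """First-component scores and unit-length loadings, skipping missing cells."""
    n_rows, n_cols = len(X), len(X[0])
    # Start from the first column, treating missing cells as zero.
    t = [0.0 if row[0] is None else row[0] for row in X]
    p = [0.0] * n_cols
    for _ in range(max_iter):
        # Loadings: regress each column on the scores over its observed rows.
        for j in range(n_cols):
            rows = [i for i in range(n_rows) if X[i][j] is not None]
            den = sum(t[i] ** 2 for i in rows)
            p[j] = sum(X[i][j] * t[i] for i in rows) / den if den > 0 else 0.0
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: regress each row on the loadings over its observed columns.
        t_new = []
        for i in range(n_rows):
            cols = [j for j in range(n_cols) if X[i][j] is not None]
            den = sum(p[j] ** 2 for j in cols)
            num = sum(X[i][j] * p[j] for j in cols)
            t_new.append(num / den if den > 0 else 0.0)
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol * sum(a * a for a in t):
            break
    return t, p


def abs_cosine(a, b):
    """Absolute cosine similarity; the sign of a component is arbitrary."""
    dot = sum(x * y for x, y in zip(a, b))
    return abs(dot) / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


_, p_full = nipals_first_component(column_center(simulate()))
rng = random.Random(1)
agreement = {}
for fraction in (0.05, 0.10, 0.20, 0.30):
    masked = column_center(mask_at_random(simulate(), fraction, rng))
    _, p_miss = nipals_first_component(masked)
    agreement[fraction] = abs_cosine(p_full, p_miss)
    print(f"{fraction:.0%} missing: loading agreement = {agreement[fraction]:.4f}")
```

An agreement value near 1 means the first-component weights from the incomplete data point in nearly the same direction as those from the complete data. Because this toy dataset has a single strong component, it is not expected to reproduce the divergence the abstract reports for Nipals on the sparser EQI simulations.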

Record Details:

Record Type: DOCUMENT (PRESENTATION/ABSTRACT)
Product Published Date: 06/27/2014
Record Last Revised: 07/21/2014
OMB Category: Other
Record ID: 281617