Science Inventory

Informatics-Based Approaches for Collecting and Curating Consumer Product Data for Exposure Assessments

Citation:

Wall, J., T. Wall, S. Burns, K. Phillips, K. Dionisio, V. Hull, AND K. Isaacs. Informatics-Based Approaches for Collecting and Curating Consumer Product Data for Exposure Assessments. SOT Annual Conference, Virtual, N/A, N/A, March 15 - 19, 2021. https://doi.org/10.23645/epacomptox.14681118

Impact/Purpose:

This abstract describes an informatics-based approach to curating public documents for use in exposure assessments. The approach will rapidly increase the number of products with usable data in ORD's Chemicals and Products Database (CPDat). These new data will allow for the development of refined exposure predictions for thousands of chemicals in consumer products, supporting EPA decision-making.

Description:

Quantitative data on product chemical composition are necessary for characterizing consumer exposure to chemicals. EPA's Office of Research and Development (ORD) has built rapid models, including the High-Throughput Stochastic Human Exposure and Dose Simulation model (SHEDS-HT), that use these data to estimate exposures for over 300 hierarchical, harmonized product use categories (PUCs). However, these data are often lacking or exist in varied formats, making them difficult to use in models like SHEDS-HT. To fill this data need, ORD has developed automated approaches for collecting and curating data on thousands of individual products and chemicals from public documents (safety data sheets, manufacturer ingredient disclosures, ingredient lists). However, assigning the documents for individual products to PUCs has historically been a bottleneck, requiring manual assessment of product names. Here, we use natural language processing and machine-learning approaches to speed up this curation step. The model training dataset comprised all products within ORD's Chemicals and Products Database (CPDat) that had a manually assigned PUC; models were built for PUCs with at least 30 products (63,593 products; 161 PUCs). For modeling, each product's brand and product names were combined, cleaned, lemmatized to word roots, and converted to a quantitative vector using standard libraries. A Support Vector Machine (SVM) classifier was created for each level of the 3-tier PUC classification, each informed by the higher-tier prediction. The probabilistic SVM models were used to generate multiple predictions per tier; the median predicted PUC was selected. Five-fold cross-validation, stratified by PUC to ensure proportional representation, yielded an average classification accuracy of 94%. The final models were applied to 460,518 additional products from documents within CPDat, increasing its scope to 524,111 products and 7,134 chemicals associated with PUCs.
The expanded data were used to update consumer exposure predictions using SHEDS-HT, which provided refined aggregate and PUC-specific consumer exposure distributions, particularly for home care and home maintenance PUCs (which previously had few products in CPDat). In summary, implementation of informatics approaches for managing and curating public documents is rapidly expanding the quantity and quality of data available for assessing consumer exposure to chemicals in consumer products. “This abstract does not necessarily reflect U.S. EPA policy.”
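
The name-cleaning and stratified cross-validation steps described above can be illustrated with a minimal sketch. This is not the authors' code: the function names `normalize_name` and `stratified_folds`, the toy category labels, and the round-robin fold assignment are all assumptions made for illustration, and the lemmatization, vectorization, and SVM steps (which the abstract attributes to standard libraries) are omitted.

```python
import random
import re
from collections import Counter, defaultdict


def normalize_name(brand, product):
    """Combine brand and product name, lowercase, and keep alphanumeric tokens.

    Hypothetical stand-in for the cleaning step; lemmatization is not performed.
    """
    text = f"{brand} {product}".lower()
    return re.findall(r"[a-z0-9]+", text)


def stratified_folds(labels, k=5, seed=0):
    """Assign each item a fold index so every label is spread across the k folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    folds = [None] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)  # randomize which items of a label land in which fold
        for position, index in enumerate(indices):
            folds[index] = position % k  # deal items round-robin into folds
    return folds


# Hypothetical toy data: 30 products in each of two made-up categories.
labels = ["cat_a"] * 30 + ["cat_b"] * 30
folds = stratified_folds(labels, k=5)
per_fold = Counter(zip(folds, labels))  # count of each label within each fold
```

Under this even split, a category with exactly 30 products contributes six products to each of the five folds, so every fold contains examples of every category.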

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/19/2021
Record Last Revised: 05/26/2021
OMB Category: Other
Record ID: 351770