Impact/Purpose:
In the area of improving data resources for structure-based mining, the NCCT is supporting further development and expansion of the DSSTox (Distributed Structure-Searchable Toxicity) database network. The DSSTox project is primarily focused on migrating toxicity data from diverse areas of study into structure-annotated, standardized form for use in relational structure-based searching and structure-activity model development. An essential element of this effort involves bridging understanding and forging productive linkages between the toxicology domain experts and the data users and modelers by focusing on clarifying the chemistry content and summary presentation of the toxicology data. The larger goal of these efforts is to, in effect, overcome inherent and limiting data constraints in focused domains of toxicological study (e.g., cancer, developmental toxicity, neurotoxicity, etc.) by expanding the searchable and mine-able data network across both chemical and biological domains.
As an extension of the DSSTox project, NCCT researchers are promoting adoption of standardized chemical structure data fields for public toxicogenomics datasets to enable broader searchability across these data domains, and to enable integration of these datasets with legacy toxicity data and other public data. In particular, collaboration of the DSSTox project with the NIEHS Chemical Effects in Biological Systems (CEBS) project is working towards incorporation of DSSTox data fields and providing structure-searching capability and linkages across CEBS data and public genomics data, as well as DSSTox and National Toxicology Program legacy toxicity databases. Chemical structure and genomic expression patterns provide common metrics for exploring diverse toxicological effects, and can provide the basis for development of predictive patterns or signatures of a toxicological effect. Similarly, biological activity profiles consisting of experimentally determined or computationally predicted interaction spectra (receptors, proteins, enzymes) could be viewed as expanded “properties” of the chemical and could augment structure-based information for enhancing toxicity classification and prediction algorithms.
Finally, NCCT researchers are taking a lead in efforts to address more fundamental and essential needs to migrate older paper legacy data (for example, within EPA Program Offices such as OPP and OPPT) into electronic form suitable for incorporation into standardized, searchable relational databases. New commercial technologies from IBM, SciTegic, and others that allow for more automated structure annotation and chemical indexing and retrieval procedures are being evaluated to facilitate efficient electronic conversion and structured content-annotation of legacy EPA data. In addition, related issues of quality control of chemical information are being addressed, and Agency-wide chemical structure-browser capabilities are being explored.
Description:
A central regulatory mandate of the Environmental Protection Agency, spanning many Program Offices and issues, is to assess the potential health and environmental risks of large numbers of chemicals released into the environment, often in the absence of relevant test data. Models for predicting potential adverse effects of chemicals based primarily on chemical structure play a central role in prioritization and screening strategies yet are highly dependent and conditional upon the data used for developing such models. Hence, limits on data quantity, quality, and availability are considered by many to be the largest hurdles to improving prediction models in diverse areas of toxicology. Generation of new toxicity data for additional chemicals and endpoints, development of new high-throughput, mechanistically relevant bioassays, and increased generation of genomics and proteomics data that can clarify relevant mechanisms will all play important roles in improving future SAR prediction models. The potential for much greater immediate gains, across large domains of chemical and toxicity space, comes from maximizing the ability to mine and model useful information from existing toxicity data, data that represent huge past investment in research and testing expenditures. In addition, the ability to place newer “omics” data, data that potentially span many possible domains of toxicological effects, in the broader context of historical data is the means for optimizing the value of these new data.
The challenges for application of information technologies, including chem-informatics and bioinformatics, are fourfold: 1) to more efficiently migrate legacy toxicity data from diverse sources into standardized, electronic, open, and searchable forms in the public domain; 2) to employ new technologies to mine existing data for coherent patterns that can provide scientific underpinning for extrapolations; 3) to place a new chemical, of unknown hazard, appropriately in the context of existing data and chemical and biological understanding; and 4) to integrate data from different domains of toxicology and newer “omics” experiments to look beyond traditional means for classifying chemicals, inferring modes of action, and predicting potential adverse effects.
Project Information:
Progress:
DSSTox Project Status
The EPA DSSTox website (http://www.epa.gov/nheerl/dsstox/), launched in March 2004, provides detailed information on DSSTox standard chemical fields, guidance for creating new DSSTox databases, and links to a wide range of public information resources. A major emphasis of the DSSTox project is on creating field-delimited, content-enhanced data files for diverse toxicity endpoints. Five DSSTox databases are currently published on the website and several others are in progress or currently undergoing review. Toxicity endpoints considered include: rodent carcinogenicity, mutagenicity, estrogen receptor binding affinity, fish acute toxicity, and pharmaceutical maximum adverse effect dose levels. Additional toxicity endpoints slated for DSSTox database publication include: skin sensitization, acute toxicity, nasal and eye irritation, androgen receptor binding, rodent developmental toxicity, DNA intercalation, pesticide ecotoxicity, and immunotoxicity. A large emphasis has been placed on the quality review of chemical information, which has led to the creation of a central DSSTox Master chemical structure reference data file and detailed quality data review procedures.
CEBS DSSTox Project Status
The DSSTox project is collaborating with the NIEHS National Toxicogenomics Center CEBS Knowledge-Base project, first by incorporating DSSTox standard chemical fields into the data dictionary and CEBS data entry system. DSSTox Standard Chemical Fields (SCFs) have recently been revised to better handle the diverse chemical content of public toxicity databases, which include all varieties of mixtures and less well-defined substances, and to better coordinate with other public data standards efforts, such as the ToxML public toxicity data schema project. These DSSTox SCFs will additionally be employed to index the largest public genomics databases, to provide expanded structure-searchability within and outside CEBS data content. A chemical inventory of these databases has begun and will be followed by attempts to encourage external public data standards organizations (e.g., MGED) and database sources (ArrayExpress, NCBI) to adopt more rigorous chemical structure annotations of genomics data. Coordination with other large public database efforts, such as the National Library of Medicine’s PubChem, NIH Molecular Libraries Initiative, and NIH National Cancer Institute molecular database projects, will also directly affect the CEBS DSSTox project collaboration.
EPA Communities of Practice – Chemoinformatics Workgroup Status
As part of the effort to improve our ability to index, search and link chemical information data files across EPA Labs, Centers, and Program Offices, NCCT researchers have formed a “Communities of Practice” Chemoinformatics Workgroup to begin to forge Agency-wide collaborations and coordination with respect to improving treatment and utility of chemical structure-related information within EPA data files. Additionally, the National Computer Center’s Scientific Visualization group is evaluating possible solutions for providing Agency-wide structure browsing capability.
Approach:
In the area of improving data resources for structure-based mining, the NCCT is supporting further development and expansion of the DSSTox (Distributed Structure-Searchable Toxicity) database network. The DSSTox project is primarily focused on migrating toxicity data from diverse areas of study into structure-annotated, standardized form for use in relational structure-based searching and structure-activity model development. An essential element of this effort involves bridging understanding and forging productive linkages between the toxicology domain experts and the data users and modelers by focusing on clarifying the chemistry content and summary presentation of the toxicology data. The larger goal of these efforts is to, in effect, overcome inherent and limiting data constraints in focused domains of toxicological study (e.g., cancer, developmental toxicity, neurotoxicity, etc.) by expanding the searchable and mine-able data network across both chemical and biological domains.
As an extension of the DSSTox project, NCCT researchers are promoting adoption of standardized chemical structure data fields for public toxicogenomics datasets to enable broader searchability across these data domains, and to enable integration of these datasets with legacy toxicity data and other public data. In particular, collaboration of the DSSTox project with the NIEHS Chemical Effects in Biological Systems (CEBS) project is working towards incorporation of DSSTox data fields and providing structure-searching capability and linkages across CEBS data and public genomics data, as well as DSSTox and National Toxicology Program legacy toxicity databases. Chemical structure and genomic expression patterns provide common metrics for exploring diverse toxicological effects, and can provide the basis for development of predictive patterns or signatures of a toxicological effect. Similarly, biological activity profiles consisting of experimentally determined or computationally predicted interaction spectra (receptors, proteins, enzymes) could be viewed as expanded “properties” of the chemical and could augment structure-based information for enhancing toxicity classification and prediction algorithms.
Finally, NCCT researchers are taking a lead in efforts to address more fundamental and essential needs to migrate older paper legacy data (for example, within EPA Program Offices such as OPP and OPPT) into electronic form suitable for incorporation into standardized, searchable relational databases. New commercial technologies from IBM, SciTegic, and others that allow for more automated structure annotation and chemical indexing and retrieval procedures are being evaluated to facilitate efficient electronic conversion and structured content-annotation of legacy EPA data. In addition, related issues of quality control of chemical information are being addressed, and Agency-wide chemical structure-browser capabilities are being explored.
Relevance:
NCCT researchers are involved in efforts that are poised to dramatically improve capabilities to access, mine, and integrate useful chemical-biological activity information from existing and new data, both within and outside EPA. These efforts have the potential to impact a wide variety of EPA program offices that rely heavily on chemical information resources, have large internal stores of data, and have a need for structure-based data exploration, analog searching, and improved toxicity prediction models. These include many programs within OPPTS [e.g., Green Chemistry, Premanufacture-Notification Program (PMN), Office of Pesticide Programs (OPP), High-Production Volume (HPV) Testing Program] as well as EPA’s Integrated Risk Information System (IRIS) Program, Office of Water, and Office of Environmental Information. New information technologies that incorporate more flexible and diverse means for assessing biological and chemical similarity will also improve the identification of toxicologically relevant analogs by enhancing the ability to explore data and quantify associations in diverse chemical and biological domains.
Project IDs:
ID Code: IIB-1
Project type: Partner Specific