Science Inventory

20190825 - Does Bigger Mean Better in the World of Chemistry Databases? (ACS Fall 2019)

Citation:

Williams, A. AND C. Southan. 20190825 - Does Bigger Mean Better in the World of Chemistry Databases? (ACS Fall 2019). American Chemical Society Fall Meeting, San Diego, CA, August 25 - 29, 2019. https://doi.org/10.23645/epacomptox.9773609

Impact/Purpose:

Presentation to the American Chemical Society Fall Meeting, August 2019. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. As a result of some of this noise, the value of the larger databases depends heavily on the specific application. This presentation covers examples of these noise and quality issues and suggests some options to ameliorate the problem.

Description:

The internet has changed the way we access chemistry data and has opened access to data that can quickly proliferate and become referenceable. Web access to chemical structures and their integration with biological data has become massively enabling, with structure counts for UniChem, PubChem and ChemSpider reaching 157, 97 and 71 million, respectively (at the time of writing). A range of specialist databases that are small enough to be curated have stand-alone utility and offer synergies when integrated into the larger collections. These include DrugBank, BindingDB, ChEBI, and many others. Databases of any size have inherent quality challenges, but at large scale various forms of “noise” accumulate to problematic levels. The unfortunate consequence is that “bigger gets worse”. This is particularly associated with large uncurated submissions from vendors and automated document extractions (even though these are high-value). Virtual enumerations and circularity between overlapping sources add to the problem. As a result of some of this noise, the value of the larger databases depends heavily on the specific application. One example is the use of these databases to support non-targeted analysis. This presentation covers examples of these noise and quality issues and suggests some options to ameliorate the problem. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 08/29/2019
Record Last Revised: 09/05/2019
OMB Category: Other
Record ID: 346354