Science Inventory

Automating an ecological data workflow with GitHub Actions and R

Citation:

Hollister, J. Automating an ecological data workflow with GitHub Actions and R. Government Advances in Statistical Programming (GASP) 2022, NA, Virtual, February 02 - 03, 2022.

Impact/Purpose:

Ecology research projects often have data from many sources. These datasets need to be cleaned and combined before they are analyzed. Manually combining datasets is time-consuming and error-prone, so finding methods that speed up the process and make it more accurate and repeatable is important. To address this problem on a research project in two ponds in Barnstable, MA, we use GitHub Actions and custom R code to download, merge, and visualize data from two water quality buoys. As a result, our data from these buoys are updated automatically, and we can create daily figures with very little effort that may be used to update partners on the water quality in these two ponds.

Description:

Ecological data collection often relies on processing and merging data from multiple sources. Typically, researchers combine and clean multiple data files using a manual workflow in spreadsheets or, in the best case, use code to combine multiple files into a single structured file or database for subsequent analysis. This workflow requires interaction every time new data or files are added and may result in out-of-date datasets, incomplete data, or errors from manual data entry. Continuous integration tools (e.g., Travis CI and GitHub Actions) have long been used in software development to automatically build, test, and deploy software products. These same tools may be combined with data processing code (e.g., in R) to fully automate an ecological data workflow. In this talk, I demonstrate and discuss an automated workflow for processing water quality data from two buoys that were deployed to study harmful algal blooms. The automated workflow uses GitHub Actions and custom R code to download, clean, merge, and visualize the buoy data. I will also discuss an internal R package we are developing to facilitate automation of all our data processing steps. In addition to functions for managing the buoy data, this package will also have functions to automate the processing of fluorometry measurements and combine data from multi-parameter sondes.
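
As a rough illustration of the clean and merge steps described above, a minimal sketch follows. It is written in Python for illustration only; the workflow in the talk uses custom R code, which is not reproduced here. The column names, the missing-value codes, the buoy labels, and the sample readings are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw exports from two buoys. Column names and values are
# invented for illustration and do not come from the project.
BUOY_A = """timestamp,temp_c,chl_ugl
2021-07-01 00:00,22.1,4.2
2021-07-01 00:15,,4.4
2021-07-01 00:30,22.3,4.9
"""
BUOY_B = """timestamp,temp_c,chl_ugl
2021-07-01 00:00,21.8,3.1
2021-07-01 00:15,21.9,-999
"""

# Assumed codes that mark a missing reading in the raw files.
MISSING = {"", "-999"}


def clean(raw_csv, buoy_id):
    """Parse one buoy file, tag each row with its buoy, and blank out missing codes."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        row = {
            "buoy": buoy_id,
            "timestamp": datetime.strptime(rec["timestamp"], "%Y-%m-%d %H:%M"),
        }
        for col in ("temp_c", "chl_ugl"):
            value = rec[col].strip()
            row[col] = None if value in MISSING else float(value)
        rows.append(row)
    return rows


def merge(*cleaned):
    """Stack cleaned tables into one long table sorted by time, then by buoy."""
    combined = [row for table in cleaned for row in table]
    return sorted(combined, key=lambda r: (r["timestamp"], r["buoy"]))


merged = merge(clean(BUOY_A, "A"), clean(BUOY_B, "B"))
```

In an automated setup, a script of this kind would be run on a schedule by a continuous integration service such as GitHub Actions, so the merged table is rebuilt without manual steps whenever new buoy data arrive.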

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 02/03/2022
Record Last Revised: 03/04/2022
OMB Category: Other
Record ID: 354248