checklist_change

Research project aggregating biodiversity data from checklists. Child project from chase-lab/homogenisation and brother project of chase-lab/metacommunity-surveys

https://github.com/chase-lab/checklist_change

Last synced: 10 months ago · JSON representation ·

Repository

Research project aggregating biodiversity data from checklists. Child project from chase-lab/homogenisation and brother project of chase-lab/metacommunity-surveys

Basic Info

Host: GitHub
Owner: chase-lab
License: mit
Language: R
Default Branch: main
Homepage:
Size: 38.2 MB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 4

Created over 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

Checklist change

Description

This research compendium regroups scripts used to download, re-structure and aggregate data sets to constitute a large meta-analysis of communities sampled at least twice, 10 years apart or more. The specificity of this data set is that it aggregates data from studies varying greatly in their focus and methods but all sampled an area in the most exhaustive way allowing us to consider it a checklist. In some studies, only past or only present communities were provided and the other species community was built by adding invading species or excluding extinct species.

Data

Raw and aggregated data tables are provided. Raw data are stored for each data set individually in the data/raw data/ folder in compressed .rds files Aggregated data are in data/communities.csv and data/metadata.csv and column definitions are given in data/definitions_communities.txt and data/definitions_metadata.txt, and reproduced at the bottom of this readme.

Here are commands exploring the data set: ``` r dt <- data.table::fread(file = "data/communities.csv", select = c("dataset_id","regional","local","year","species"), stringsAsFactors = TRUE) |> unique()

meta <- data.table::fread(file = "data/metadata.csv", select = c("datasetid","regional","local","taxon","realm", "year", "latitude","longitude", "gammaboundingboxkm2", "gammasumgrains_km2"), stringsAsFactors = TRUE, colClasses = c(latitude = "numeric", longitude = "numeric")) [ j = year := as.integer(as.character(year)) ] |> unique()

dt[i = meta, j = ":="( taxon = i.taxon, realm = i.realm ), on = .(dataset_id, regional, local) ][ j = data.table::uniqueN(species), by = taxon ][ order(-V1) ]

How many dataset_ids

base::nlevels(meta$dataset_id)

How many dataset_ids/regions

data.table::uniqueN(meta[, .(dataset_id, regional)])

How many dataset_ids/regions/sites?

data.table::uniqueN(meta[, .(dataset_id, regional, local)])

How many sites per regions on average?

meta[j = data.table::uniqueN(local), keyby = .(dataset_id, regional) ][ j = mean(V1)]

How many dataset_ids/regions/sites with unique coordinates?

data.table::uniqueN(meta[i = meta[j = data.table::uniqueN(.SD), .SDcols = c("latitude", "longitude"), keyby = .(datasetid, regional, local)][V1 == 1L], on = .(datasetid, regional, local)])

How many samples?

data.table::uniqueN(meta[, .(dataset_id, regional, local, year)])

What is the mean year range?

unique(meta[, .(dataset_id, regional, local, year)])[j = mean(diff(range(year)))]

How many localities with 2 samples?

How many localities with at least 4, 5, 10 samples?

meta[j = data.table::uniqueN(year), by = .(dataset_id, regional, local)][j = .(y2 = sum(V1 == 2L), y4 = sum(V1 >= 4L), y5 = sum(V1 >= 5L), y10 = sum(V1 >= 10L))]

How many regions with 2 samples?

How many regions with at least 4, 5, 10 samples?

meta[j = data.table::uniqueN(year), by = .(dataset_id, regional)][j = .(y2 = sum(V1 == 2L), y4 = sum(V1 >= 4L), y5 = sum(V1 >= 5L), y10 = sum(V1 >= 10L))]

Beginning year range

meta[j = min(year), by = .(dataset_id, regional)][j = range(V1)]

End year range

meta[j = max(year), by = .(dataset_id, regional)][j = range(V1)]

How many regions were first sampled before 1800?

meta[j = min(year) <= 1800L, by = .(dataset_id, regional)][j = sum(V1)]

What is the maximum number of samples in a site?

meta[j = data.table::uniqueN(year), by = .(dataset_id, regional, local) ][ j = max(V1)]

How many samples per taxon groups?

meta[j = data.table::uniqueN(.SD), .SDcols = c("dataset_id", "regional", "local", "year"), by = "taxon"][order(-V1)]

How many regions per realm groups?

meta[j = data.table::uniqueN(.SD), .SDcols = c("dataset_id", "regional"), by = "realm"][order(-V1)]

How many samples per realm groups?

meta[j = data.table::uniqueN(.SD), .SDcols = c("dataset_id", "regional", "local", "year"), by = "realm"][order(-V1)]

Mean richness per sample?

dt[j = data.table::uniqueN(species), by = .(dataset_id, regional, local, year)][j = mean(V1)]

gamma_extent range

meta[j = .(sum = range(gammasumgrainskm2, na.rm = TRUE), box = range(gammaboundingboxkm2, na.rm = TRUE))]

```

Workflow and reproducibility

Environment

To ensure reproducibility, the working environment (R version and package version) was documented and isolated using the package renv. By running renv::restore(), renv will install all missing packages at once. This function will use the renv.lock file to download the same versions of packages that we used and install them on your system.

Relative paths

Included in the repository is a Rstudio project file: checklist_change.Rproj that should always be used to open the project to ensure that the working directory is set correctly. All paths in the project have the same relative root which is the checklist_change folder where the .Rproj file is located. Using setwd() is discouraged (read more).

Workflow

After downloading or cloning this repository, run the following scripts in order to wrangle raw data and merge all data sets together into one long table.

``` r renv::restore()

Raw data are stored in the project so that users do not need to download the data

source("R/1.0downloadingraw_data.r")

source("R/2.0wranglingrawdata.r") source("R/3.0merginglong-formattables") ```

Additional installations

You might need to install the 64-bit version of Java to run Tabulizer.

Variable definitions

Community data

`/data/definitions_communities.txt`

| Variable name | Definition | | :-------------|:-----------| | dataset_id | Unique ID linked to a publication (article or data set). If the data set was split because different taxa group are provided, a letter is added at the end. No missing value. | | year | Year of sampling. If sampling was pooled over several years, the last sampling year is used here. No missing value. | | regional | Region name, contains at least two localities. Can be a national park, a state or a forest name for example but smaller scales are also included where the region is an experimental sites. A data set can have several regions. No missing value. UTF-8 encoding. | | local | Name or code of the sampled locality or experimental sample. For example, it can correspond to the name of an island, a lake or forest. No missing value. UTF-8 encoding. | | species | Species names. Whenever possible, complete (Genus + species epithet) names were included rather than codes. No missing value. UTF-8 encoding. |

Metadata

`/data/definitions_metadata.txt`

| Variable name | Definition | | :-------------|:-----------| | datasetid | Unique ID linked to a publication (article or data set). If the data set was split because different taxa group are provided, a letter is added at the end. No missing value. | | year | Year of sampling. If sampling was pooled over several years, the last sampling year is used here. Where year (i.e., date) for historical lists was not available, they were estimated based on human visitation/colonisation history. No missing value. | | regional | Region name, contains at least two localities. Can be a national park, a state or a forest name for example but smaller scales are also included where the region is an experimental sites. A data set can have several regions. No missing value. | | local | Name or code of the sampled locality or experimental sample as given by the original data provider. For example, it can correspond to the name of an island, a lake or forest. No missing value. | | latitude | Latitude North in decimal degree, WGS84. NA values indicate that information could not be collected. | | longitude | Longitude East in decimal degree, WGS84. NA values indicate that information could not be collected. | | effort | Sampling effort expressed for example as the number of visits to a plot or the total area sampled in a given year. See Comment column for a description. NA value means that exact effort is unknown but considered extensive and exhaustive. | | datapooledbyauthors | TRUE if the data provided by the authors was already pooled covering several years: several samples made over several years pooled together. No missing value. | | datapooledbyauthorscomment | If there was pooling by the original authors, countains free text describing how the authors pooled their data. NA values when no pooling was done. | | samplingyears | If there was pooling by the original authors, contains the years sampled for each period. "1997, 1999" means 1997 and 1999, "1997-1999" means 1997, 1998 and 1999. NA values indicate that information could not be collected. | | alphagrainm2 | Area of the local unit or area of the sampling gear (in which case, alphagraintype = sample). NA values indicate that information could not be collected. | | alphagraintype | Category of alphagrain specifying what does the alphagrain measure relate to. Allowed values are: "island", "plot", "administrative" (eg the area of a park or state), "watershed", "sample", "lakepond", "archipelago", "trap", "transect", "ecosystem" (eg the area of the whole wetland or generally the area of adjacent comparable habitat), "box" (a box covering the sites) or "quadrat"". No missing value. | | alphagraincomment | Description of how the alphagrain was measured. NA values indicate that information could not be collected. | | gammaboundingboxkm2 | Measure of the extent/regional scale area as the area covering all sites, computed as a convex-hull or a rectangle box or as the area of the administrative unit in which sites are found. NA values indicate that information could not be collected. | | gammaboundingboxtype | Category of gammaboundingbox specifying what does the gammaboundingbox relate to. Generally the bounding box is computed based on coordinates of the sites of by using a coarse area such as the area of the park, the state or the whole island. Allowed values are: "administrative" (eg the area of a park or state), "island", "convex-hull", "watershed", "box" (a box covering the sites), "buffer", "functional" (eg the area of the whole wetland or generally the area of adjacent comparable habitat), "shore" or "lakepond"". NA values indicate that information could not be collected. | gammaboundingboxcomment | Description or source (eg the paper, Wikipedia or measured on Google Earth) of the gammaboundingbox value. NA values indicate that information could not be collected. | gammasumgrainskm2 | Measure of the extent/regional scale area as the sum of the grains sampled each year. NA values indicate that information could not be collected. | | gammasumgrainstype | Category of gammasumgrains specifying what does the gammasumgrains relate to. Generally the type is related to the type of the alphagrain since it is a sum of all alphagrains in a region in a given year. Allowed values are: "archipelago", "administrative" (eg the area of a park or state), "watershed", "sample", "lakepond", "plot", "quadrat", "transect", "functional" (eg the area of the whole wetland or generally the area of adjacent comparable habitat) or "box" (a box covering the sites). NA values indicate that information could not be collected. | | gammasumgrainscomment | Description or source (eg the paper, Wikipedia or measured on Google Earth) of the gammasumgrains value. NA values indicate that information could not be collected. | | realm | Realm in which the sampling was done, one of: Freshwater or Terrestrial. No missing value. | | taxon | Taxon group of the data set, one of: "Plants", "Invertebrates" or "Fish"". No missing value. | | comment | A description of the data set origin, goal and sampling method. No missing value. | | commentstandardisation | A short description of the modifications we made to the data set to ensure standard effort: excluded sites or years, excluded taxa, etc. No missing value. | | doi | One or several DOIs separated by |" that have to be cited for each data set. | | iscoordinatelocal_scale | Logical. TRUE if coordinates re given at the scale of the site/locality. FALSE if the coordinates are at the regional scale or missing. No missing value. |

Owner

Name: chase-lab
Login: chase-lab
Kind: organization

Repositories: 4
Profile: https://github.com/chase-lab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use these data, please cite as below and cite all data sets used."
authors:
  - family-names: Sagouis
    given-names: Alban
    orcid: https://orcid.org/0000-0002-3827-1063
  - family-names: Blowes
    given-names: Shane
    orcid: https://orcid.org/0000-0001-6310-3670
  - family-names: Chase
    given-names: Jonathan
    orcid: https://orcid.org/0000-0001-5580-4303
  - family-names: Xu
    given-names: Wubing
    orcid: https://orcid.org/0000-0002-6566-4452
title: "chase-lab/checklist_change - Checklist Change data for 'Synthesis reveals approximately balanced biotic differentiation and homogenization'"
version: v1.2
date-released: 2024-01-20