R2camtrapdp: schema-driven workflow

library(R2camtrapdp)

Overview

R2camtrapdp converts camera-trap data held in an arbitrary spreadsheet into a Camera Trap Data Package (Camtrap DP).

This version is schema-driven: the structure, types and constraints of the output tables are read from the official Frictionless table schemas of the Camtrap DP version you choose.

The classic helper functions (create_deployments(), create_media(), create_observations()) and the R6_CamtrapDP class keep the same names and arguments as before, so existing scripts continue to work. The new schema-driven behaviour is added on top.

Note on internet access. Setting a table (set_deployments() etc.) downloads the table schema for the chosen version from GitHub the first time it is needed, and then caches it. If you work offline, pass a downloaded schema file with the local_schema = argument.

Data

The package ships with example data for several deployments with image records.

# multiple deployments with image data
data("Idep")   # deployment table
data("Iobs")   # observation table

Idep holds one row per deployment (camera placement) with columns such as deploymentID, longitude, latitude, locationID, startDate/startTime, endDate/endTime, cameraID, cameraModel, Delay, Height, bait and setupBy. Iobs holds one row per observation with the institution/collection codes, filename, deploymentID, date/time, obsID, eventID, eventStart/eventEnd, object, genus, species, class and individualCount.

1. Choose a version and inspect its schema (optional)

The whole pipeline is driven by the schema of the version you pick. Camtrap DP versions 1.0, 1.0.1 and 1.0.2 are all supported; their table schemas share the same field names, types and constraints, so the only practical difference is that 1.0.2 recognises a few more missing-value tokens (NA, NaN, nan). You can inspect the schema of any version directly with TableSchema.

Note: the official 1.0 profile (the metadata JSON Schema) has an upstream bug, a malformed internal $ref, that newer Frictionless releases reject. Specifying version = "1.0" therefore emits a warning; validate_frictionless() works around the bug automatically, but 1.0.1 or later is recommended.

version <- "1.0.1"

dep_schema <- TableSchema$new("deployments", version = version)
dep_schema$field_names()           # every column the schema defines
dep_schema$required_field_names()  # columns that must be present and non-missing
dep_schema$empty_table()           # a 0-row, correctly typed "shell" table

You rarely need to do this by hand — the R6_CamtrapDP object loads and caches the right schema for you — but it is useful for understanding what a given version expects.
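
To make the missing-value difference between versions concrete, here is a small base-R sketch. It does not use the package: it only shows what recognising more missing-value tokens means when a CSV is read. Treating the empty string as the baseline token is an assumption made for this sketch, and the variable names are hypothetical.

```r
# Baseline token (assumed for this sketch) and the extra tokens that,
# according to the text above, version 1.0.2 also recognises.
tokens_base   <- c("")
tokens_v1_0_2 <- c(tokens_base, "NA", "NaN", "nan")

csv_text <- "deploymentID,cameraHeight\nA01,1.2\nA02,nan\nA03,NA\n"

# Read every column as text so that only na.strings decides what is missing.
old <- read.csv(text = csv_text, na.strings = tokens_base,
                colClasses = "character")
new <- read.csv(text = csv_text, na.strings = tokens_v1_0_2,
                colClasses = "character")

sum(is.na(old$cameraHeight))    # "nan" and "NA" are kept as ordinary text
which(is.na(new$cameraHeight))  # rows whose token is now treated as missing
```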

check_schema() confirms that the schema itself is a well-formed Frictionless Table Schema (supported field types, constraints that are valid for each type, primary/foreign keys that reference defined fields) — useful before adopting a brand-new or hand-edited schema.

dep_schema$check_schema()
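
The kind of structural check described above can be illustrated in base R. The sketch below is not the package's implementation: the schema list is a hand-written stand-in rather than a real Camtrap DP schema, and all variable names are made up for illustration. It only checks that each field type is one defined by the Frictionless Table Schema specification and that the primary key names a defined field.

```r
# A hand-written stand-in for a table schema (hypothetical, for illustration).
schema <- list(
  fields = list(
    list(name = "deploymentID", type = "string"),
    list(name = "latitude",     type = "number"),
    list(name = "cameraHeight", type = "numbr")   # deliberate typo
  ),
  primaryKey = "deploymentID"
)

# Field types defined by the Frictionless Table Schema specification.
known_types <- c("string", "number", "integer", "boolean", "object", "array",
                 "date", "time", "datetime", "year", "yearmonth", "duration",
                 "geopoint", "geojson", "any")

field_names <- vapply(schema$fields, function(f) f$name, character(1))
field_types <- vapply(schema$fields, function(f) f$type, character(1))

bad_types   <- field_names[!field_types %in% known_types]  # unsupported type
missing_key <- setdiff(schema$primaryKey, field_names)     # key not a field

bad_types
missing_key
```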

External (URL) references in a schema

Some Camtrap DP information is specified not as a machine-checkable constraint but as a URL: semantic mappings (skos:exactMatch / broadMatch / narrowMatch to Darwin Core, Audubon Core, … terms) and reference URLs in field descriptions (for example the IANA media-type registry for fileMediatype, or method DOIs for individualSpeed). The package only enforces the structured constraints; the URL-referenced meaning is not validated. To make sure you never overlook such a specification when adopting a version or a new schema flavor, list them with:

dep_schema$external_references()   # every URL the schema declares (skos, descriptions, schema URL)
dep_schema$semantic_only_fields()  # fields whose meaning is URL-defined and cannot be value-checked

external_references() returns a tidy table (resource, field, key, category, url); semantic_only_fields() flags the columns you should check against the referenced authority by hand. The whole package can be scanned at once with datapackage$external_references().

2. Build the three core tables

Create deployments

Using the deployment data (Idep), the deployments table is created exactly as before. create_deployments() accepts either combined datetimes or separate date/time columns.

deployments <- create_deployments(
  deploymentID         = Idep$deploymentID,
  longitude            = Idep$longitude,
  latitude             = Idep$latitude,
  locationID           = Idep$locationID,
  deploymentStart_date = Idep$startDate,
  deploymentStart_time = Idep$startTime,
  deploymentEnd_date   = Idep$endDate,
  deploymentEnd_time   = Idep$endTime,
  cameraID             = Idep$cameraID,
  cameraModel          = Idep$cameraModel,
  cameraDelay          = Idep$Delay,
  cameraHeight         = Idep$Height,
  baitUse              = Idep$bait,
  setupBy              = Idep$setupBy)

create_deployments() also accepts (not shown above): deploymentStart / deploymentEnd (combined datetimes, used instead of the *_date / *_time pairs), locationName, coordinateUncertainty, cameraDepth (mutually exclusive with cameraHeight), cameraTilt, cameraHeading, detectionDistance, timestampIssues, featureType, habitat, deploymentGroups, deploymentTags, deploymentComments, and tz (time zone, default "Asia/Tokyo").
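
As a rough picture of what combining a date column and a time column amounts to, the base-R sketch below builds a datetime string with a numeric UTC offset. This is an illustration under stated assumptions, not the code inside create_deployments(): the example values and variable names are hypothetical, and the output format shown is the ISO 8601 style with offset.

```r
start_date <- c("2023-04-01", "2023-04-02")
start_time <- c("09:00:00", "10:30:00")

# Parse the pasted "date time" strings as local time in the given zone.
start <- as.POSIXct(paste(start_date, start_time),
                    format = "%Y-%m-%d %H:%M:%S", tz = "Asia/Tokyo")

# Format as ISO 8601 with a numeric UTC offset.
deploymentStart <- format(start, "%Y-%m-%dT%H:%M:%S%z")
deploymentStart
```

The tz value matters: the same clock time parsed in a different zone yields a different offset, which is why the helpers expose a tz argument.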

Create media

# media ID
mediaIDi <- paste(Iobs$institutionCode,
                  Iobs$collectionCode,
                  Iobs$locationID,
                  as.numeric(factor(Iobs$filename)),
                  sep = "_")

# file information
fileName      <- Iobs$filename
filetype      <- tolower(tools::file_ext(fileName))   # extension after the last dot
fileMediatype <- paste("image", filetype, sep = "/")
filePublic    <- !grepl("ヒト", fileName)   # "ヒト" means "human": hide human images from the public

media <- create_media(
  mediaID        = mediaIDi,
  deploymentID   = Iobs$deploymentID,
  timestamp_date = Iobs$date,
  timestamp_time = Iobs$time,
  filePath       = "Image",
  filePublic     = filePublic,
  fileMediatype  = fileMediatype,
  captureMethod  = "activityDetection",
  fileName       = fileName)

create_media() also accepts (not shown above): timestamp (combined datetime, instead of timestamp_date / timestamp_time), exifData, favorite, mediaComments, tz, and omitduplicate (drop duplicate mediaIDs, default TRUE).
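
File extensions do not always coincide with registered media-type names: for example, the IANA name for JPEG images is image/jpeg, not image/jpg. The base-R sketch below shows one way to derive fileMediatype with that in mind. The file names and the small lookup vector are hypothetical and would need extending for other formats.

```r
fileName <- c("IMG_0001.JPG", "site.A_IMG_0002.jpg", "IMG_0003.png")

# Extension after the last dot, lower-cased.
ext <- tolower(tools::file_ext(fileName))

# Extensions whose registered subtype differs from the extension itself
# (only one entry here; extend as needed).
iana <- c(jpg = "jpeg")

subtype       <- unname(ifelse(ext %in% names(iana), iana[ext], ext))
fileMediatype <- paste("image", subtype, sep = "/")
fileMediatype
```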

Create observations

# event-based observations
observationLevel <- "event"

# observationType must be one of the schema enum values
observationType <- ifelse(Iobs$object == "hito", "human",
                   ifelse(Iobs$object == "none", "blank",
                   ifelse(Iobs$object == "unidentifiable", "unknown", "animal")))

# scientific name
scientificName <- ifelse(is.na(Iobs$genus), Iobs$class, paste(Iobs$genus, Iobs$species))

# unique observation IDs
observationID <- paste(mediaIDi, Iobs$obsID, sep = "_")

observations <- create_observations(
  observationID             = observationID,
  deploymentID              = Iobs$deploymentID,
  eventID                   = Iobs$eventID,
  eventStart                = Iobs$eventStart,
  eventEnd                  = Iobs$eventEnd,
  observationLevel          = observationLevel,
  observationType           = observationType,
  scientificName            = scientificName,
  count                     = Iobs$individualCount,
  classificationMethod      = "human",
  classificationProbability = 1)

create_observations() also accepts (not shown above): mediaID, the eventStart_date / eventStart_time and eventEnd_date / eventEnd_time pairs (instead of combined eventStart / eventEnd), cameraSetupType, lifeStage, sex, behavior, individualID, individualPositionRadius, individualPositionAngle, individualSpeed, bboxX, bboxY, bboxWidth, bboxHeight, classifiedBy, classificationTimestamp, observationTags, observationComments, tz, and omitduplicate.
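
When the source labels grow beyond a handful, nested ifelse() calls become hard to read. A named lookup vector is one alternative, sketched below in base R with hypothetical labels. The allowed values listed are the observationType enum of Camtrap DP 1.0; checking against them before calling create_observations() catches typos early.

```r
object <- c("hito", "none", "unidentifiable", "sika deer", NA)

# Source labels with a fixed meaning; any other non-missing label is
# treated as an animal (an assumption that mirrors the example above).
lookup <- c(hito = "human", none = "blank", unidentifiable = "unknown")

observationType <- unname(lookup[object])
observationType[is.na(observationType) & !is.na(object)] <- "animal"

# Values permitted by the Camtrap DP 1.0 observationType enum.
allowed <- c("animal", "human", "vehicle", "blank", "unknown", "unclassified")
bad <- which(!is.na(observationType) & !observationType %in% allowed)

observationType
bad   # positions of values outside the enum (none here)
```

Unlike the nested ifelse() version, a missing source label stays missing instead of propagating silently through the comparisons.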

3. Assemble the data package

Create the R6 object (with a version)

datapackage <- R6_CamtrapDP$new(version = "1.0.1")

The version you give here selects the schemas used for validation and written into datapackage.json. Change it to target a different Camtrap DP release.

Import the tables (now schema-validated)

set_deployments(), set_media() and set_observations() keep their original names, but now each one coerces the table to the schema types and validates it against the schema for the chosen version. Any problems are printed as a summary; you can switch the printing off with validate = FALSE.

datapackage$set_deployments(deployments)
datapackage$set_media(media)
datapackage$set_observations(observations)

(The chunks that download a schema, write files, look up taxonomy, or call Python are shown but not executed when this vignette is built, so they produce no output here.)

The validation summary tells you, for every issue, the file, the column, the row, the violated rule and a message — for example a value that breaks an enum, a number outside its minimum/maximum, or a datetime that does not match the required format. A value that does not even fit the column type (e.g. a non-numeric string in a number field) is reported as a type error rather than being silently turned into NA.
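
The distinction between a type error and a constraint violation can be sketched in base R. This is only an illustration of the idea, not the package's validator: the helper make_issues(), the column names of the result and the messages are all hypothetical. The only rule assumed is that latitude is a number in the range -90 to 90.

```r
raw <- data.frame(
  deploymentID = c("A01", "A02", "A03"),
  latitude     = c("35.1", "95.0", "north"),
  stringsAsFactors = FALSE)

lat <- suppressWarnings(as.numeric(raw$latitude))

type_rows  <- which(!is.na(raw$latitude) & is.na(lat))     # not a number at all
range_rows <- which(!is.na(lat) & (lat < -90 | lat > 90))  # a number, out of range

# Hypothetical helper: one row per issue, also valid for zero issues.
make_issues <- function(rows, rule, message) {
  data.frame(file    = rep("deployments", length(rows)),
             column  = rep("latitude", length(rows)),
             row     = rows,
             rule    = rep(rule, length(rows)),
             message = rep(message, length(rows)),
             stringsAsFactors = FALSE)
}

issues <- rbind(
  make_issues(type_rows,  "type",            "value is not a number"),
  make_issues(range_rows, "minimum/maximum", "latitude must lie in [-90, 90]"))
issues
```

Recording the type failure separately keeps the non-numeric value visible instead of letting the coercion turn it into NA without a trace.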

Check relations between tables

Foreign keys (e.g. media.deploymentID must exist in deployments, and observations.mediaID must exist in media) and primary-key uniqueness are read from each table’s schema and checked across the tables you have added.

datapackage$check_relations()

If a primary-key or a required foreign-key column is entirely missing in a stored table (often a column-name mismatch that coercion filled with NA), check_relations() warns and points at the data, e.g. datapackage$data$observations has 'deploymentID' entirely missing ..., so you can inspect datapackage$data$<resource> directly.
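
For intuition, the two kinds of relation check can be written in a few lines of base R. The tables below are hypothetical toy data, and the sketch is not what check_relations() does internally; it only shows what a duplicated primary key and a dangling foreign key look like.

```r
deployments <- data.frame(deploymentID = c("A01", "A02"),
                          stringsAsFactors = FALSE)
media <- data.frame(mediaID      = c("m1", "m2", "m2"),
                    deploymentID = c("A01", "A03", "A02"),
                    stringsAsFactors = FALSE)

# Primary key: every mediaID should occur exactly once.
dup_keys <- unique(media$mediaID[duplicated(media$mediaID)])

# Foreign key: every media.deploymentID should exist in deployments.
orphans <- setdiff(media$deploymentID, deployments$deploymentID)

dup_keys
orphans
```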

4. Metadata

Camtrap DP requires five metadata properties (contributors, project, spatial, temporal and taxonomic), plus created. Six further properties are optional. The metadata functions are unchanged from previous versions.

Check which metadata the profile requires

The required metadata is itself read from the package profile (a JSON Schema). metadata_requirements() lists every required top-level property, the method that sets it, and whether it is currently set; check_metadata() validates the current object against the profile and reports anything missing (including nested keys such as project.samplingDesign).

datapackage$metadata_requirements()   # checklist: property, required, set_with, currently_set
datapackage$check_metadata()          # report missing required metadata

This is the R-side counterpart of the metadata (profile) validation that Frictionless performs (§6), so you can confirm the required structure before writing the package and calling Python.

Required metadata

Contributors

add_contributors() imports a data frame with columns title, email, path, role and organization. role may be contact, principalInvestigator, rightsHolder, publisher or contributor.

cd <- data.frame(
  title        = c("Keita Fukasawa", "Kana Terayama"),
  email        = c("fukasawa@nies.go.jp", "terayama.kana@nies.go.jp"),
  path         = c("https://orcid.org/0000-0003-0272-9180",
                   "https://orcid.org/0000-0001-6935-7233"),
  role         = c("contact", "principalInvestigator"),
  organization = c("National Institute for Environmental Studies (NIES)",
                   "National Institute for Environmental Studies (NIES)"))
datapackage$add_contributors(cd)

Project

datapackage$set_project(
  title            = "DummyData",
  samplingDesign   = "simpleRandom",
  captureMethod    = "activityDetection",
  individualAnimals = FALSE,
  observationLevel = "event")

samplingDesign is one of simpleRandom, systematicRandom, clusteredRandom, experimental, targeted or opportunistic; captureMethod is activityDetection or timeLapse; observationLevel is media or event. The optional id, acronym, description and path arguments are also available.

Spatial and temporal

set_st() derives the spatial and temporal coverage from the deployments, so it must be called after set_deployments().

datapackage$set_st()
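
Conceptually, the coverage is a bounding box over the deployment coordinates and the earliest start and latest end date. The base-R sketch below shows that derivation on hypothetical data. It is an illustration only: how set_st() encodes the result (for example as GeoJSON) and how it handles time zones is not shown here, and taking the calendar date from the first ten characters is an assumption of the sketch.

```r
dep <- data.frame(
  longitude       = c(139.5, 140.1),
  latitude        = c(35.1, 36.2),
  deploymentStart = c("2023-04-01T09:00:00+0900", "2023-04-02T10:30:00+0900"),
  deploymentEnd   = c("2023-05-01T09:00:00+0900", "2023-05-02T10:30:00+0900"),
  stringsAsFactors = FALSE)

# Spatial coverage: bounding box of all deployment coordinates.
bbox <- c(xmin = min(dep$longitude), ymin = min(dep$latitude),
          xmax = max(dep$longitude), ymax = max(dep$latitude))

# Temporal coverage: earliest start date and latest end date.
temporal <- c(
  start = format(min(as.Date(substr(dep$deploymentStart, 1, 10)))),
  end   = format(max(as.Date(substr(dep$deploymentEnd,   1, 10)))))

bbox
temporal
```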

Taxonomic

set_taxon() lists the unique scientificName values from the observations and looks up taxonID, taxonRank and the higher taxonomy from a taxonomic database (gbif by default; also itis / ncbi; see taxadb::get_ids). The Camtrap DP taxonomic block requires a taxonID (a GBIF / IUCN identifier or URI), so taxadb is a required dependency of R2camtrapdp (installed with it); this step also needs internet access.

datapackage$set_taxon()

Names that cannot be matched get taxonID = NA, which is omitted from the output rather than written as a bogus <uri>NA. set_taxon() warns about scientificName values with unnecessary whitespace and about names with no taxonID in the chosen database, so you can clean or check those names.
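
Stray whitespace is a common reason for a failed lookup, and it can be cleaned before the observations are built. A minimal base-R sketch, with hypothetical names, that trims the ends and collapses internal runs of whitespace:

```r
scientificName <- c("Sus scrofa", " Cervus nippon",
                    "Nyctereutes  procyonoides", "Sus scrofa")

# Trim both ends, then collapse any run of whitespace to a single space.
cleaned <- gsub("\\s+", " ", trimws(scientificName))

# Names that were changed by the cleaning, for manual review.
needs_cleaning <- unique(scientificName[scientificName != cleaned])

needs_cleaning
unique(cleaned)
```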

Created

datapackage$update_created(tz = "Asia/Tokyo")

Optional metadata

Licenses

Camtrap DP expects at least one license for the data and one for the media.

datapackage$add_license(name = "CC-BY-4.0",
                        path = "http://creativecommons.org/licenses/by/4.0/",
                        scope = "data")
datapackage$add_license(name = "CC-BY-4.0",
                        path = "http://creativecommons.org/licenses/by/4.0/",
                        scope = "media")

Properties, sources and references

datapackage$set_properties(
  name     = "dummy-nies",
  homepage = "https://www.nies.go.jp/biology/snapshot_japan/index.html")
datapackage$add_sources(title = "DummyData")
datapackage$add_references(reference = "DummyNIES https://doi.org/xxxxx")

Custom resources

set_custom() attaches an extra resource (for example data used by an abundance estimator) as metadata. It must be called after the three core tables have been set.

RD <- data.frame(id = seq_len(388), Time = sample(1:29, 388, replace = TRUE))
datapackage$set_custom(name = "rest",
                       description = "data for the REST method",
                       data = RD)

5. Output the data package

# return the camtrapdp object
data_camtrapdp <- datapackage$out_camtrapdp()

# or also write deployments.csv / media.csv / observations.csv + datapackage.json
# (`path` is the output directory; define it before this call)
datapackage$out_camtrapdp(write = TRUE, directory = path)

When written, the CSV files contain every schema column, booleans are written as true/false, and unset metadata is omitted so that empty placeholders do not cause spurious validation errors.
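
R prints logical values as TRUE/FALSE, so a plain write.csv() would not produce the lowercase true/false described above. The base-R sketch below shows the conversion on a hypothetical table; it illustrates the idea and is not the package's writer.

```r
media <- data.frame(mediaID    = c("m1", "m2", "m3"),
                    filePublic = c(TRUE, FALSE, NA),
                    stringsAsFactors = FALSE)

# Convert every logical column to lowercase text; NA stays NA.
is_lgl <- vapply(media, is.logical, logical(1))
media[is_lgl] <- lapply(media[is_lgl],
                        function(x) ifelse(x, "true", "false"))

# Write missing values as empty cells.
out <- file.path(tempdir(), "media.csv")
write.csv(media, out, row.names = FALSE, na = "", quote = FALSE)

csv_lines <- readLines(out)
csv_lines
```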

6. Validate the written package with Frictionless

Conformance pre-checks (before calling Python)

Before running Python, you can check on the R side whether the package is a well-formed Frictionless data package at all, and whether it conforms to Camtrap DP. These checks mirror, in R, the structural checks that Frictionless performs, so problems with a brand-new or unusual schema surface early.

datapackage$check_descriptor()        # package + table-schema structure (Frictionless spec)
datapackage$check_camtrap_profile()   # warn if the profile is not a Camtrap DP profile

A package can be a valid Frictionless data package without conforming to Camtrap DP: that depends on whether its profile is the Camtrap DP profile (the default). The authoritative check, including GeoJSON validity and the physical file structure, is still the Frictionless run below.

Run Frictionless

You can confirm the written package against the official schemas with the Python Frictionless validator. This requires Python with frictionless installed (pip install frictionless).

issues <- datapackage$validate_frictionless(directory = path, python = "python")
ctdp_is_valid(issues)   # TRUE if there are no errors

Note — this rewrites path. validate_frictionless() defaults to write = TRUE, so it calls out_camtrapdp() and overwrites the datapackage.json and CSVs in directory from the current object before validating. To validate a package that already exists on disk without overwriting it, use write = FALSE, or the standalone validate-only function (no R6 object needed):

ctdp_validate_frictionless("path/to/existing/package", python = "python")

issues is a tidy table with one row per problem, giving the source file, the field (column or property path), the row, the violated constraint, the offending value, and a message, so you can see exactly where any error occurs. For cell errors value is the failing cell; for metadata (profile) errors it is resolved from datapackage.json via the property path in the note (e.g. contributors[].email → the actual email value(s)). You can also aggregate the R-side schema checks, the relation checks, the metadata (profile) checks, the conformance pre-checks and (optionally) the Frictionless report in one call:

datapackage$validate(relations = TRUE, metadata = TRUE, conformance = TRUE,
                     frictionless = TRUE, directory = path, python = "python")

7. Converting an arbitrary spreadsheet directly

The helpers above assume you pass each Camtrap DP field explicitly as a named argument. If instead you have a raw spreadsheet with its own column names, you can map and validate it in one step with ctdp_build_table(), which applies a column mapping, merges separate date/time columns, coerces to the schema types and validates, for any version.

version    <- "1.0.1"
dep_schema <- TableSchema$new("deployments", version = version)

# an example raw sheet with arbitrary column names + a custom column
raw <- data.frame(
  station   = c("A01", "A02"),
  lat       = c(35.1, 36.2),
  lon       = c(139.5, 140.1),
  start_day = c("2023-04-01", "2023-04-02"),
  start_clk = c("09:00:00", "10:30:00"),
  end_day   = c("2023-05-01", "2023-05-02"),
  end_clk   = c("09:00:00", "10:30:00"),
  myNote    = c("kept as a custom column", "kept too"),
  stringsAsFactors = FALSE)

# mapping: names are SOURCE columns, values are Camtrap DP FIELD names
mapping <- c(station = "deploymentID", lat = "latitude", lon = "longitude")

built <- ctdp_build_table(
  dep_schema, raw, mapping = mapping,
  datetime_merges = list(
    list(date_col = "start_day", time_col = "start_clk", target = "deploymentStart"),
    list(date_col = "end_day",   time_col = "end_clk",   target = "deploymentEnd")))

ctdp_summarize_validation(built$issues)   # any schema problems
datapackage$set_deployments(built$data)   # feed the result into the package

Custom columns such as myNote are kept; when the package is written, the custom column is declared in an inline extended schema in datapackage.json so that Frictionless accepts it.
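
The direction of the mapping (names are source columns, values are Camtrap DP fields) is easy to get backwards. The base-R sketch below applies such a mapping by hand to a hypothetical sheet, purely to show the semantics; ctdp_build_table() does this and more for you.

```r
raw <- data.frame(station = c("A01", "A02"),
                  lat     = c(35.1, 36.2),
                  lon     = c(139.5, 140.1),
                  myNote  = c("a", "b"),
                  stringsAsFactors = FALSE)

# names = SOURCE columns, values = Camtrap DP FIELD names
mapping <- c(station = "deploymentID", lat = "latitude", lon = "longitude")

renamed <- raw
hit <- names(renamed) %in% names(mapping)
names(renamed)[hit] <- unname(mapping[names(renamed)[hit]])

names(renamed)   # unmapped columns such as myNote keep their name
```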

8. Other schema flavors (e.g. bioacoustics)

Because every table is driven by the schema you point it at, the package is not limited to the camera-trap schemas hosted by TDWG. To target a different flavor — for instance the bioacoustics extension of Camtrap DP — give the table and profile URLs explicitly. These schemas live in a different repository and use their own field set (e.g. deviceID instead of cameraID, plus samplingFrequency, frequencyLow/frequencyHigh, …) and per-table datetime formats (the media / observations event timestamps use fractional seconds %Y-%m-%dT%H:%M:%S.%f%z, while the deployments times do not); the schema-driven validation adapts to all of this automatically. If your raw media / observations timestamps lack the fractional part, .000 is added automatically so the value matches the schema’s %f format.
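
The padding of the fractional part can be pictured with a single regular expression. The sketch below is a base-R illustration with hypothetical values, not the package's code: it appends .000 to the seconds only when no fractional part is present.

```r
timestamp <- c("2023-04-01T09:05:00+0900",       # no fractional seconds
               "2023-04-01T09:05:00.250+0900")   # already has them

# Match "Thh:mm:ss" only when it is NOT followed by a dot, then add ".000".
fixed <- sub("(T\\d{2}:\\d{2}:\\d{2})(?!\\.)", "\\1.000", timestamp,
             perl = TRUE)
fixed
```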

Point the package at the flavor once with set_properties(), then add tables as usual — the set_*() methods use the configured schema_urls, so you do not need to pass schema = to each call:

ba <- "https://raw.githubusercontent.com/camera-traps/bioacoustics/main/camtrap-dp/1.0.2/%s"

dp <- R6_CamtrapDP$new(version = "1.0.2")
dp$set_properties(
  version     = "1.0.2",
  profile     = sprintf(ba, "camtrap-dp-profile-acoustic.json"),
  schema_urls = list(
    deployments  = sprintf(ba, "deployments-table-schema-acoustic.json"),
    media        = sprintf(ba, "media-table-schema-acoustic.json"),
    observations = sprintf(ba, "observations-table-schema-acoustic.json")))

# audio timestamps carry fractional seconds to match the acoustic schema format
dp$set_media(data.frame(
  mediaID = "m1", deploymentID = "D1",
  timestamp = "2023-04-01T09:05:00.000+0900",
  filePath = "audio/m1.wav", filePublic = TRUE, fileMediatype = "audio/wav",
  samplingFrequency = 48000L, channels = 1L,
  stringsAsFactors = FALSE))

Mapping camera-trap columns to the acoustic flavor

You only need a mapping for columns whose name differs from the acoustic field. Columns that already use the acoustic field name (deploymentID, latitude, deploymentStart, …) are matched automatically — no mapping needed. For deployments, the camera-trap camera* fields are renamed to device*; the camera-only fields have no acoustic equivalent and should be dropped; and a few acoustic-only fields can be set if you have the data.

library(dplyr)

# camera-trap deployments -> acoustic deployments (only the renamed columns)
mapping <- c(
  cameraID      = "deviceID",
  cameraModel   = "deviceModel",
  cameraDelay   = "deviceDelay",
  cameraHeight  = "deviceHeight",
  cameraDepth   = "deviceDepth",
  cameraTilt    = "deviceTilt",
  cameraHeading = "deviceHeading")

# camtrap_deployments stands for your existing camera-trap deployments data frame
dep_acoustic <- camtrap_deployments %>%
  select(-any_of(c("featureType", "timestampIssues")))   # camera-only: no acoustic field

dp$set_deployments(dep_acoustic, mapping = mapping)

Field correspondence — deployments:

| Camera-trap field | Acoustic field | Action |
|---|---|---|
| deploymentID, locationID, locationName, latitude, longitude, coordinateUncertainty, deploymentStart, deploymentEnd, setupBy, detectionDistance, baitUse, habitat, deploymentGroups, deploymentTags, deploymentComments | same name | no mapping |
| cameraID / cameraModel / cameraDelay / cameraHeight / cameraDepth / cameraTilt / cameraHeading | deviceID / deviceModel / deviceDelay / deviceHeight / deviceDepth / deviceTilt / deviceHeading | map |
| featureType, timestampIssues | (none) | drop |
| (none) | elevation, devicePlatform, recordingSchedule, locationType | acoustic-only (set if available) |

For observations the only renamed field is cameraSetupType → deviceSetupType (acoustic also adds frequencyLow / frequencyHigh / classificationConfirmation). For media there are no renames, only extra fields (duration, bitDepth, samplingFrequency, gain, channels).
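
Because every renamed field follows the same camera to device prefix pattern, the mapping vector can also be generated instead of typed out. A small base-R sketch, assuming the list of camera* fields given above:

```r
camera_fields <- c("cameraID", "cameraModel", "cameraDelay", "cameraHeight",
                   "cameraDepth", "cameraTilt", "cameraHeading")

# names = camera-trap columns, values = acoustic field names
mapping <- setNames(sub("^camera", "device", camera_fields), camera_fields)
mapping
```

Typing the mapping out by hand remains the safer choice if a future schema version breaks the prefix pattern.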

Inspect a flavor the same way as any other schema. Note that TableSchema$new("deployments", version = "1.0.2") without url_template loads the camera-trap deployments schema; pass the acoustic URL to inspect the acoustic requirements. requirements() returns a tidy table of every field’s type, format and constraints.

acoustic_dep <- TableSchema$new(
  "deployments", version = "1.0.2",
  url_template = sprintf(ba, "deployments-table-schema-acoustic.json"))
acoustic_dep$field_names()
acoustic_dep$required_field_names()
acoustic_dep$requirements()        # field / type / format / required / enum / min / max / pattern
acoustic_dep$external_references()

Note that create_deployments(), create_media() and create_observations() are tailored to the camera-trap schema. For a different flavor (or for new columns in a future version), build the tables with the schema-driven path (ctdp_build_table() or the set_*() methods with a custom schema =) rather than the create_*() helpers.