wgtrutta database description

DIASPARA WP3.2 working document

Technical analysis of the wgtrutta database
Author

Briand Cédric, Oliviero Jules, Helminen Jani

Published

19-05-2025

The following working document is a technical analysis of the WGTRUTTA templates. This document does not engage the WGTRUTTA, it’s to try to get how this works. The documents have been kindly provided by Iain Malcolm & Carlos Alexandre. The dataset is analysed in separation without trying to fit it in a common format for the different species (which will be the final product). This analysis doesn’t question the format of the excel table for a datacall (it’s probably best to ask for a wide format, this is what we do in eel). We are just trying to puzzle out how we would store this info, and thus gaining knowledge about the data structure for future discussions. This document is listed as a task there github link to diaspara

Part I : BIO-ECOLOGICAL DATA FOR SEA TROUT RIVERS ACROSS EUROPE

Regional site spatio-temporal units (table general information)

Code to read excel files.
# note I have created 3 named regions in the excel to load the data
general_information <- openxlsx::read.xlsx(file.path(datawd, "WGTRUTTA Short BioEcological Template_DIASPARA.xlsx"), namedRegion = "general_information")
population_data <- openxlsx::read.xlsx(file.path(datawd, "WGTRUTTA Short BioEcological Template_DIASPARA.xlsx"), namedRegion = "population_data")
individual_data <- openxlsx::read.xlsx(file.path(datawd, "WGTRUTTA Short BioEcological Template_DIASPARA.xlsx"), namedRegion = "individual_data")
Code to read excel files.
kable(general_information) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
Table 1: General information table metadata in the WGTRUTTA DB
Variable.name Description
Site nº Order in which the site is inserted in the database
Country Country in which the site is located
River Basin Name of the main basin in which the site and respetive river are located
River Name Local name of the river in which the site is located
River ID Unique code attributed to the river (Country_River; e.g. PT_Mondego)
Site ID Unique code attributed to the site (Country_River_Number; e.g. PT_Mondego_#1)
Coordinates Latitute and Longitude of the site; WGS84 datum (Degrees, minutes and seconds format; e.g. 40°39'15.00'' N - 8°16'32.21'' W)
Data Provider Name of the researcher or entity providing the data
Time series available Year, or time series, for which the bioecological data is available for the site
Weblink Please add weblink, if available, with additional information about the river, sampling site or related projects
Comments Please provide additional comments on the data provided (e.g., catch methods and protocols used; information on sampling efficiency and other constraints; Please also indicate restrictions to data use, if they exist)
NOTE WGTRUTTA

Site n° is the primary key

QUESTION WGTRUTTA: River basin

Do you have specific frame for this, like in the Baltic, not all basins have Sea Trouts, so how do we proceed there ? Would there be regional units (quite large) like those of WGNAS or do you intend this to be specific river codes (which we intend to provide) soon in WP3.1. Note you might want to have a look at how we plan to handle hierarchies theoretically, and practically the integration of river basins in this report

WGTRUTTA Answer (Iain Malcolm)

I suspect we would plan to use the spatial data and catchment codes provided by Andrew and Phil from their rivers datasets. We would snap electrofishing point data to lines and extract the relevant spatial data and covariates. We may also incorporate regional groups as some sort of mrf smoother, like we have used in Scotland for characterising large scale correlated smoothed patters not identified by covariates. We haven’t defined any of this yet. In terms of species distribution we intend to take the spatial dataset provided by Andrew / Phil, make this available as an online GIS and ask members to populate barriers and (where known) sea trout distribution. This is one of the work packages.

QUESTION WGTRUTTA: River ID

If you have a referential why do you need the country ? Because of the country providing the data and these will come as separate lines ?

QUESTION WGTRUTTA: Site & Site ID

Are those the same ?

NOTE WGTRUTTA

ESPG 4326 WGS84 is given as degree decimals like -7.522888 37.663168

Question ICES

Do you have a vocabulary of data providers to ICES. Do you need one. Then wouldn’t this make adding new data time demanding ? Should we leave this open ?

ICES answer (Maria)

We are currently using EDMO for data providers: EDMO Here you can also find the list of EDMO codes for all institutes. If you can not find your institution, I think that as an EU country, you should have national managers that can add the new codes. If you do not know who it is, or are referring to a Non-EU institution, you can send an EDMO request to info@maris.nl, and CC to peter@maris.nl. As minimum, you need to provide the institution name, native name, address, contact info, web-link. In combination with the EDMO codes, we have this list of ‘roles’ where you can specify whether it is the data provider, custodian or else.

WGTRUTTA Answer (Iain Malcolm)

Not sure what you mean here. We will need to record the organisation providing the data, contact details, data references (in the case of published DOI datasets) and any constraints on data use. We need to explore issues around data license agreements before we ask people for data.

QUESTION WGTRUTTA: Time series available

Would you agree that if the database is built, a query to detailed table should give this information and that it would be tedious to maintain ?

WGTRUTTA Answer (Iain Malcolm)

Not sure what you mean here. However, I think information on observations, years and data timespan can be generates from the data themselves. > DIASPARA : Yes this is what we meant there. That this kind of information should be contained in the database already.

QUESTION WGTRUTTA: comments

This information should be more specific ? We had this in WGEEL and have now to do additional datacalls to qualify the data in a harmonized way among series. Would you agree to have several columns there ?

G global general information sampling sampling table sampling id <site id> country method protocol comment public status site site table <site id> <river id> geom sampling:f1->site:f0 country country table <country id> sampling:f2->country:f0 river river table <river id> <basin id> geom site:f1->river:f0 basin basin table <basin id> geom river:f1->basin:f0

Graph: general information: A temptative table format for the general information.

The ordering of site > river > basin is straithforward. With proper postgis tables, Gis attributes (geometries) should allow constraint to check wheter a site is within the basin. A river (Mondego) or basin can belong to many countries, country is related to the <sampling id>. For this reason there is no link between site and country. You can have a look at the format envisioned for the migratory fishes database for areas.

Population data

Table 2: Population data table metadata in the WGTRUTTA DB
Variable.name Description
Site nº Order in which the site is inserted in the database
Country Country in which the site is located
River Basin Name of the main basin in which the site and respetive river are located
River Name Local name of the river in which the site is located
River ID Unique code attributed to the river (Country_River; e.g. PT_Mondego)
Site ID Unique code attributed to the site (Country_River_Number; e.g. PT_Mondego_#1)
Sampling Year Year in which the provided information was collected
Sampling Season Season in whih the provided information was collected (Winter; Spring; Summer; Autumn)
Number of 0+ juveniles Absolute number of 0+ trout juveniles (parr) captured in a given sampling site and sampling campaign
Number of 1+ juveniles Absolute number of 1+ trout juveniles (parr) captured in a given sampling site and sampling campaign
Number of 2+ or older juveniles Absolute number of 2+ or older trout juveniles (parr) captured in a given sampling site and sampling campaign
Number of smolts Absolute number of trout smolts captured in a given sampling site and sampling campaign
Number of finnocks Absolute number of trout finnocks captured in a given sampling site and sampling campaign
Total number of adults Absolute number of trout adults captured in a given sampling site and sampling campaign
0+ Juvenile Density Density of 0+ trout juveniles (parr) captured in a given sampling site and sampling campaign; nº of fish/m2
1+ Juvenile Density Density of 1+ trout juveniles (parr) captured in a given sampling site and sampling campaign; nº of fish/m2
2+ or older Juvenile Density Density of 2+ or older trout juveniles (parr) captured in a given sampling site and sampling campaign; nº of fish/m2
0+ Juvenile Average Length Average total length of 0+ trout juveniles (parr) for a given sampling site and sampling campaign (mm)
1+ Juvenile Average Length Average total length of 1+ trout juveniles (parr) for a given sampling site and sampling campaign (mm)
2+ or older Juvenile Average Length Average total length of 2+ or older trout juveniles (parr) for a given sampling site and sampling campaign (mm)
0+ Juvenile Average Weight Average total weight of 0+ trout juveniles (parr) for a given sampling site and sampling campaign (g)
1+ Juvenile Average Weight Average total weight of 1+ trout juveniles (parr) for a given sampling site and sampling campaign (g)
2+ or older Juvenile Average Weight Average total weight of 2+ or older trout juveniles (parr) for a given sampling site and sampling campaign (g)
Smolt Average Length Average total length of trout smolts for a given sampling site and sampling campaign (mm)
Smolt Average Weight Average total weight of trout smolts for a given sampling site and sampling campaign (g)
Adult Average Length Average total length of trout adults for a given sampling site and sampling campaign (mm)
Adult Average Weight Average total weight of trout adults for a given sampling site and sampling campaign (g)
Average Sea Age Average age (nº of years spent at sea) of adult trouts returning to freshwaters for a given site and sampling campaign
Average River Age Average age (nº of years spent at river) of trouts captured at river (adults, smolts or juveniles) for a given site and sampling campaign
% Anadromous Trouts Proportion (%) of the fish identified as part of the anadromous ecotype for a given river and sampling campaign
% Resident Trouts Proportion (%) of the fish identified as part of the resident ecotype for a given river and sampling campaign
% Marine Survival (Finnock) Ratio (%) of the total number of 0+ sea age fish (finnock) recorded returning to freshwaters from the annual number of smolts recorded, in the same year, migrating downstream for a given site and sampling campaign
% Marine Survival (Total first return) Ratio (%) of the total number of maiden fish recorded returning to freshwaters for a given smolt cohort, site and sampling campaign
Total Ova Deposition Total nº of oocytes deposited by sea trout females estimated for a given river and sampling campaign (please provide details on the method of variable estimation at the comments section)
% Sea Lice (Total) Total proportion (%) of the trouts in a given sampling site and sampling campaign with any sign of sea lice infestation
% Sea Lice (Attached) Proportion (%) of the trouts in a given sampling site and sampling campaign with signs of sea lice infestation (attached sea lice)
% Sea Lice (Mobile) Proportion (%) of the trouts in a given sampling site and sampling campaign with signs of sea lice infestation (mobile sea lice)
% Sea Lice (Wounds) Proportion (%) of the trouts in a given sampling site and sampling campaign with signs of sea lice infestation (only sea lice wounds)
Other Occurring Salmonids Information about the presence of other salmonid species in a given sampling site
Comments Please provide additional comments on the data provided (e.g., catch methods and protocols used; information on how each variable was obtained and/or estimated; restrictions to data use, if they exist)

The structuring of the population data has in fact three different levels. It is obvious that the year / season creates a different population id (a unit for analysis). For this reason, it is separated from the rest, and creates the unicity constraint, this table also stores an eventual comment and info on other occurring salmonids. From this table two types of data would probably best be separated, it would make more sense to store together in a row level format all data pertaining to the same level of complexity, so some data will depend on stage, age and whether the trouts are resident or anadromous. This table is called popdetailmetric table in the diagram. Some other information are however related to the whole population and don’t need details per stage, age, resident and would go to as popgroupmetrics table

G population population data popsample popsample table <population_id> <popsample_id> year season comments other occuring salmonids popdetailmetrics popdetailmetrics table <popsample_id> value datatype stage age anadromous popdetailmetrics:f0->popsample:f1 stage stage juvenile smolt finnock adult popdetailmetrics:f2->stage age age   river age sea age   popdetailmetrics:f1->age datatypedetail datatypedetail number density   length   weight popdetailmetrics:f3->datatypedetail popgroupmetrics popgroupmetrics table <popsample_id> value datatypegroupdetail popgroupmetrics:f0->popsample:f1 datatypegroup datatypegroup %anadromous %resident %Marine survival finnock Total Ova Deposition % Sea Lice (Total) % Sea Lice (Attached) % Sea Lice (Mobile) % Sea Lice (Wounds) popgroupmetrics:f1->datatypegroup

Graph:population table. Proposed structure for the population data. The content of referential values for datatype, stage, age is given to facilitate the understanding of the proposed table structure.

Individual data

Table 3: Individual data table metadata in the WGTRUTTA DB
Variable.name Description
Site nº Order in which the site is inserted in the database
Site ID Unique code attributed to the site (Country_River_Number; e.g. PT_Mondego_#1)
Sampling Year Year in which the provided information was collected
Sampling Season Season in whih the provided information was collected (Winter; Spring; Summer; Autumn)
Date of capture Specific date of capture for a fiven fish included in the databse (dd/mm/yyyy)
Fish ID Unique code attributed to each fish (River_Number; e.g. MOND#1)
Life Stage Life stage of the captured fish; Juvenile / Smolt / Adult
Habitat Habitat in which the fish was captured: Sea / Estuary / River
Length Total length of the fish (mm)
Weight Total weight of the fish (g)
Sex Sex of the fish; Male / Female / Undefined
River age Total number of years spent by the fish in freshwater
Sea age Total number of years spent by the fish at sea
Sea Lice Infestation Presence (Y) or absence (N) of signs of sea lice infestation in the fish. If Yes, please state the type of sign: attached sea lice; mobile sea lice; only sea lice wounds)
Catch Method Method used to catch/record the fish (e.g., electrofishing; trap/counter; rod cacth; nets)
Tagged (Y/N) Please state if the fish was subjected to any type of tagging procedure (Y/N)
Type(s) of Tags If Yes in the previous field, please identify the type(s) of tag(s) used in the fish (e.g., conventional; PIT tag; radio; acoustic)
Comments Please provide additional comments on the data provided (e.g., particularities of each fish and/or in its catch method; restrictions to data use, if they exist)

QUESTION WGTRUTTA: comments

Won’t the year and season attibutes be provided by the dates ?

WGTRUTTA Comment (Iain Malcolm)

I would think so. That is how I would do it.

QUESTION WGTRUTTA: site and habitats

What is the definition of a site. Site is the lowest level in the basin / river /site but seems the basins for the population structuring. Is that right ? I was just wondering about the relation between site and habitat, and whether a site if it’s small extended over several habitats.

WGTRUTTA Comment (Iain Malcolm)

In the case of the juvenile electrofishing work I would consider I site to be a unique sampling location that extends over a specified distance and area. I would characterise sites using point locations. A site could extend over multiple “habitats” depending on how these are defined.

QUESTION WGTRUTTA: site and habitats

It seems that a population is related to a site. Will there be simple relations between site and populations or will there be situations where for given site you will have several populations ?

WGTRUTTA Comment (Iain Malcolm)

Again, I’m not sure exactly what is being asked here. However, there will be many instances where you have multiple SiteVisits to a Site, with a different population record each time > DIASPARA : OK, same site, different records…. So the schema works. Sorry our question was not clear.

QUESTION WGTRUTTA: public

Will the public status be managed at the fish level ? Shouldn’t it be related to an upper level ?

WGTRUTTA Comment (Iain Malcolm)

I would think that any designation on data availability should be at the SiteVisit level or a higher level e.g. Campaign or Organisation. In the Marine Directorate databases we have Campaigns and Projects that identify coherent pieces of work. This is likely to be useful for understanding data and could be used to identify data availability as well?

QUESTION WGTRUTTA: tag

Can there be several tag per fish ?

WGTRUTTA Comment (Iain Malcolm)

Not sure if we would use this for the juvenile work. However, in the Marine Directorate databases we can store several different tags for each individual fish e.g. Floy and micro tag.

WGTRUTTA Comment (Iain Malcolm)

For the purposes of WGTRUTTA work going forwards we may not store raw data (measurements of individual fish) for the juvenile work, but we will want to store counts (and possible mean size) for each combinations of species, lifestage, age, pass (sweep). I think it unlikely that we will be able to get individual fish data from the countries involved in the case of EF data. > DIASPARA. OK. Then we need to add sweep in the population table. Note that WGEEL does collect individual data, and most is coming from electrofishing operations. You might want to have a look a the individual metrics database when it’s created. I guess if it works for salmon it could be easily translated to WGTRUTTA.

G individual Individual data indivtable individual table <fish_id> fish_idorigin <popsample_id> stage date x y catch method habitat comments stage stage juvenile smolt finnock adult indivtable:f2->stage popsample popsample indivtable:f1->popsample:f1 fishmetrics fishmetrics table <fish_id> metric value fishmetrics:f0->indivtable:f0 metric metric length weight sex river age sea age Sea Lice Infestation Sea Lice (Attached) Sea Lice (Mobile) Sea Lice (Wounds) fishmetrics:f1->metric:f0 tag tag table <fish_id> tag_type tag_code tag_actionvalue tag:f0->indivtable:f0 tagaction tagaction tag:f1->tagaction

Graph:individual table. Proposed structure for the population data. The content of referential values for datatype, stage, age is given to facilitate the understanding of the proposed table structure. Fish idorigin is the id in the orginal table (necessary to update the right fish when changes are made). More information about tags have been added. In the end it should just be few fishes. We’ve also added more details on sea lice, those that are at population level have been reported there. Normally everything in value should work as numeric, some values (sex) or sea Lice infestation can be recoded as zero or one.

Part II: ENVIRONMENTAL DATA FOR SEA TROUT RIVERS ACROSS EUROPE

Site level environmental data

Table 4: Site - env table metadata in the WGTRUTTA DB
column
site_no
country
river_basin
river_name
river_id
site_id
year_s
season
month
length
average_width
predominant_substrate_first
predominant_substrate_second
predominant_substrate_third
average_depth
pool
run
riffle
vegetation_extension
vegetation_predominant_type
large_woody_debris
riparian_gallery_density_left_bank
riparian_gallery_density_right_bank
shade
ths_width
ths_depth
ths_velocity
ths_substrate
ths_shade
ths_slope
estimated_ths
alcalinity
comments

QUESTION WGTRUTTA: what is a site ?

The main problem is to relate this information to site. What is a site ? Is it large, several segments, is it the scale of a subcatchment, does it correspond to an electrofishing site ? It seems, as you have reach level information, that the data at a site is maybe an average of data collected in different reaches. We will probably need more explanation.

WGTRUTTA Comment (Iain Malcolm)

I suspect the problem is that the definition of site varies depending on the data being stored (e.g. traps vs. electrofishing sites). In the Marine Directorate database we store site data as points with a spatial extent noted. The point is the downstream end. An extent field then indicates how far upstream the site extends. For the juvenile work we are likely to snap points (electrofishing sites) to lines on the digital river network and extract spatial habitat data based on a distance weighted average of up and downstream nodes.

QUESTION WGTRUTTA: what is a site ?

There is an information about the year season and month. Can there really be information about multiple years ? How do you relate that to month and season ? Do you expect to fill one or the other column ? Do you expect to have chronicles of habitat measures ? If it’s the case would it be possible to simply use a date here ?

WGTRUTTA Comment (Iain Malcolm)

If the habitat is measured on each visit then yes, this is a time varying SiteVisit table e.g. wetted width, length etc may vary between years for a given starting point (Site). In the Marine Directorate database we have a Site table (static information) and a Site Visit table (Time varying records) associated with particular dates.

Macroscale information

Table 5: Macroscale information columns in the WGTRUTTA DB
column
site_no
country
river_basin
river_name
river_id
site_id
distance_to_sea
distance_to_source
wetted_area
altitude
slope
nb_dam_downstream
nb_dam_upstream
regulated_flow
habitat_restoration
fisheries
restocking
comments

To my understanding, the macroscale information will be collected in a gis table of river segments. There will be several gis, possibly with segments coverage that will be different from the portion of habitat described at site level. This table will carry out the macroscale information and be used to transfer it.

WGTRUTTA Comment (Iain Malcolm)

This looks like site level data to me. It isn’t going to be time varying This will be the critical data that we use to describe habitat for forthcoming juvenile work, extracted from the Andrew and Phil Digital River Network

Reach level habitat

Table 6: Reach level habitat columns in the WGTRUTTA DB
column
site_no
country
river_basin
river_name
river_id
site_id
year_s
season
month
predominant_substrate_first
predominant_substrate_second
predominant_substrate_third
pool
run
riffle
vegetation_extension
vegetation_predominant_type
large_woody_debris
riparian_gallery_density_left_bank
riparian_gallery_density_right_bank
shade
ths_width
ths_depth
ths_velocity
ths_substrate
ths_shade
ths_slope
estimated_ths
alcalinity
comments

QUESTION WGTRUTTA: what is a reach ?

Could the reach level habitat be characterized as a point geometry or is it a segment ? How do you relate reach and sites ?

QUESTION WGTRUTTA: what is a reach ?

Looks like this is time varying, collected on each SiteVisit. Should be in a Site Visit table

G global general information population population data individual Individual data environmental Envionmental data sampling sampling table sampling id <site id> country method protocol comment public status site site table <site id> <river id> geom sampling:f1->site:f0 country country table <country id> sampling:f2->country:f0 segmentsite segment-site table <site id> <segment id> segment segment table <segment id> distance to sea distance to farthest source wetted area altitude slope nb dam downstream regulated flow habitat restoration fisheries restocking <river id> geom segmentsite:f1->segment:f0 reach reach table <reach id> <site id> geom reach:f1->site:f0 site:f0->segmentsite:f0 river river table <river id> <basin id> geom site:f1->river:f0 basin basin table <basin id> geom river:f1->basin:f0 popsample popsample table <population_id> popsample_id year season omments other occuring salmonids popsample:f0->sampling:f0 popdetailmetrics popdetailmetrics table <popsample_id> value datatype stage age anadromous popdetailmetrics:f0->popsample:f1 stage stage juvenile smolt finnock adult popdetailmetrics:f2->stage age age   river age sea age   popdetailmetrics:f1->age datatypedetail datatypedetail number density   length   weight popdetailmetrics:f3->datatypedetail popgroupmetrics popgroupmetrics table <popsample_id> value datatypegroupdetail popgroupmetrics:f0->popsample:f1 datatypegroup datatypegroup %anadromous %resident %Marine survival finnock Total Ova Deposition % Sea Lice (Total) % Sea Lice (Attached) % Sea Lice (Mobile) % Sea Lice (Wounds) popgroupmetrics:f1->datatypegroup indivtable individual table <fish_id> fish_idorigin <popsample_id> stage date x y catch method habitat comments indivtable:f1->popsample:f1 indivtable:f2->stage fishmetrics fishmetrics table <fish_id> metric value fishmetrics:f0->indivtable:f0 metric metric length weight sex river age sea age Sea Lice Infestation Sea Lice (Attached) Sea Lice (Mobile) Sea Lice (Wounds) fishmetrics:f1->metric:f0 tag tag table <fish_id> tag_type tag_code tag_actionvalue tag:f0->indivtable:f0 tagaction tagaction read remove apply     tag:f1->tagaction sitehabitat site habitat <site id> geom year month length average_width predominant_substrate_first predominant_substrate_second predominant_substrate_third average_depth pool run riffle vegetation_extension vegetation_predominant_type large_woody_debris riparian_gallery_density_left_bank riparian_gallery_density_right_bank shade ths_width ths_depth ths_velocity ths_substrate ths_shade ths_slope estimated_ths alcalinity comments sitehabitat:f0->site:f0 reachhabitat reach habitat <reach id> geom year month length average_width predominant_substrate_first predominant_substrate_second predominant_substrate_third average_depth pool run riffle vegetation_extension vegetation_predominant_type large_woody_debris riparian_gallery_density_left_bank riparian_gallery_density_right_bank shade ths_width ths_depth ths_velocity ths_substrate ths_shade ths_slope estimated_ths alcalinity comments

graph: A temptative table format for the WGTRUTTA datacall template.

WGTRUTTA Comment (Iain Malcolm)

For WGTRUTTA moving forwards we are envisaging the need to record fish counts by electrofishing pass (sweep) and potentially also some methodological metadata on the collection method and the Teams collecting the data. This is to properly model the capture probability Modelling capture probability of Atlantic salmon (Salmo salar) from a diverse national electrofishing dataset: Implications for the estimation of abundance - ScienceDirect > OK that’s clear, I guess this is for the electrofishing database.

WGTRUTTA Comment (Iain Malcolm)

It looks like you are identifying that the current table includes information on VisitID (i.e. the specific sampling site visit). In the Marine Directorate databases we have different tables for the individual fish data / counts, the Site table (location information) and Site Visit (dates, times, site measurements etc). > DIASPARA Yes agreed electrofishing data will be structured like this.

 

EU is not reponsible for the content of the project