Starting postgres
To use the database, two methods are possible. The first method is to create the database locally and then populate it with the data; you will then be able to use the data from R, QGIS, or PostgreSQL. The second method is to use the data straight from parquet files: no database is created, data are pulled directly from the files for good efficiency and quicker computation, and they are usable within QGIS or R.
First, you will have to launch psql (the PostgreSQL terminal client) using this command.
psql -U postgres
Once psql is running, you will have to follow these steps.
-- 1. Database creation
CREATE DATABASE diaspara;
-- 2. Connection to the database
\c diaspara
-- 3. Role diaspara_admin creation with password
DO $$BEGIN
IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'diaspara_admin') THEN
CREATE ROLE diaspara_admin WITH LOGIN PASSWORD '*********';
END IF;
END$$;
-- 4. Granting privileges to diaspara_admin
GRANT ALL PRIVILEGES ON DATABASE diaspara TO diaspara_admin;
ALTER DATABASE "diaspara" OWNER TO diaspara_admin;
-- 5. Creating a user role
DO $$BEGIN
IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'user') THEN
CREATE ROLE "user" WITH
LOGIN
NOSUPERUSER
CREATEDB
CREATEROLE
INHERIT
CONNECTION LIMIT -1;
ALTER ROLE "user" WITH PASSWORD '*****************';
END IF;
END$$;
GRANT diaspara_admin TO "user";
-- 6. Managing extension
CREATE EXTENSION postgis SCHEMA "public";
-- 7. Exiting psql
\q
Once the DB is created, we can populate it with the data, which are available on Zenodo. The files of use in this first part are the .pgc files. You can either download all of them if you wish to retrieve the whole network, or download only the areas you need. Once the files are downloaded, open a terminal and change your working directory to the folder containing them using the command cd C:/path/to/my/folder (note that the restore loop below uses Bash syntax, so on Windows use e.g. Git Bash rather than PowerShell). If you wish to restore the whole database, you can simply use the code from the next chunk.
# File list to restore
schemas=(
h_adriatic
h_baltic22to26
h_baltic27to29_32
h_baltic30to31
h_barents
h_biscayiberian
h_blacksea
h_celtic
h_iceland
h_medcentral
h_medeast
h_medwest
h_norwegian
h_nseanorth
h_nseasouth
h_nseauk
h_southatlantic
h_southmedcentral
h_southmedeast
h_southmedwest
h_svalbard)
echo "Starting restoration......"
for schema in "${schemas[@]}"
do
file="${schema}.pgc"
echo "Restoring $file..."
pg_restore -U postgres -d diaspara -n "$schema" "$file"
done
echo "Restoration done."
Otherwise, if you wish to restore the database only partially, you will have to specify which files you want to restore, e.g. pg_restore -U postgres -d diaspara h_medcentral.pgc.
Once the files are restored, you should be able to access the data directly from DBeaver, once you have created a connection to the DB.
I would recommend creating two connections: one using postgres (superuser) and one with the user profile that you set up earlier.
Host : localhost
Database : diaspara
User : postgres / user
Password : ******
Once the connection is set up, you will be able to run your own queries on the data with DBeaver.
To use data from the PostgreSQL DB directly from R, you can use the DBI package. I will briefly explain how to use the main features of the package, but more details are available in the documentation.
install.packages("DBI")
install.packages("RPostgres")
library(DBI)
library(RPostgres)
# Create a connection to a PostgreSQL database
con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "diaspara",
  host = "localhost",
  port = 5432,
  user = "postgres",  # or "user"
  password = "*******"
)
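As a quick check that the connection works, you can list the tables visible through it; dbListTables() is part of DBI.
# Should return the names of the tables you have access to
dbListTables(con)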
dbReadTable can be used to read complete tables from the DB. dbGetQuery can be used to retrieve specific data from a table: it returns the result of the SQL query. No modification can be applied with this function.
# Read entire table
data <- dbReadTable(con, "your_table")

# Or run a custom query
result <- dbGetQuery(con, "SELECT * FROM your_table WHERE value > 100")
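dbGetQuery also accepts parameterised queries, which keeps values out of the SQL string; with RPostgres the placeholders are written $1, $2, and so on. A minimal sketch, reusing the hypothetical table above:
# Same query with the threshold passed as a parameter
result <- dbGetQuery(
  con,
  "SELECT * FROM your_table WHERE value > $1",
  params = list(100)
)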
dbWriteTable can be used to create a new table, importing data from outside the DB. dbRemoveTable can be used to delete a table. dbExecute can be used to modify the DB using a SQL query. When using it, be very careful and be sure of what you are doing: changes are permanent (a safer pattern using transactions is sketched after the next chunk).
# Create a new table from a data frame
dbWriteTable(con, "new_table", data_frame, overwrite = TRUE)
# Delete a table
dbRemoveTable(con, "new_table")
# Or send any SQL command
dbExecute(con, "DELETE FROM your_table WHERE value < 0")
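Since changes made with dbExecute are permanent, one way to stay safe is to try the statement inside a transaction and roll it back unless the effect is what you expected. A sketch using DBI's transaction functions, with the table and condition as placeholders:
# Preview a destructive change inside a transaction
dbBegin(con)
n <- dbExecute(con, "DELETE FROM your_table WHERE value < 0")
cat("Rows deleted:", n, "\n")
dbRollback(con)  # undo the deletion; use dbCommit(con) to keep it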
Once you are done, you should disconnect from the database using this function.
dbDisconnect(con)
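If you query the database from inside a function, a common R idiom is to register the disconnect with on.exit() so it runs even when the query errors. A sketch with a hypothetical helper, using the same credentials as above:
fetch <- function(sql) {
  con <- dbConnect(RPostgres::Postgres(),
                   dbname = "diaspara", host = "localhost", port = 5432,
                   user = "user", password = "*******")
  on.exit(dbDisconnect(con))  # runs even if the query fails
  dbGetQuery(con, sql)
}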
To access the data from QGIS, create a new PostgreSQL connection with the following settings.
Name : name that will appear in your connection list
Host : 127.0.0.1 (localhost)
Port : 5432
Database : diaspara
For every connection, you will be asked for a password. Alternatively, you can store your credentials within QGIS if you wish to have them filled in automatically. Once the connection is successful, you have access to all the schemas within the database, although only tables with geometries will appear. Modifications that you make within QGIS can be saved straight onto the database.
Data in the parquet format are available on Zenodo, table by table: each schema has a riversegments and a catchments table. You will be able to retrieve one or both tables for each schema.
Usage in QGIS is straightforward: you can simply drag and drop .parquet files into QGIS and you should have access to the data.
To be able to use .parquet files in R, you will have to use the arrow and duckdb packages. Parquet being a compressed, columnar format, it is ideal for large datasets. Arrow allows you to work lazily with parquet files, filtering and selecting columns without loading everything into memory. DuckDB, on the other hand, brings fast SQL querying directly on parquet files, avoiding costly data imports.
install.packages("arrow")
install.packages("duckdb")
install.packages("dplyr")
library(arrow)
library(duckdb)
library(dplyr)
With arrow you can read, filter, or even convert your parquet data without loading them into memory. This helps to keep a fast workflow while working with large datasets.
ds <- open_dataset("data.parquet")

ds %>%
  select(country, year, value) %>%
  filter(year >= 2020) %>%
  head()
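The pipeline above only defines a lazy query. To actually pull the rows into R, end it with collect(); you can also write a filtered subset back to parquet with write_dataset(). A sketch reusing the ds object and columns from the example above, with an illustrative output path:
# Pull the filtered rows into a regular R data frame
recent <- ds %>%
  filter(year >= 2020) %>%
  collect()

# Or write the subset to a new parquet dataset without loading it all
ds %>%
  filter(year >= 2020) %>%
  write_dataset("data_recent", format = "parquet")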
With DuckDB you can run SQL queries in R without having to create a database locally: DuckDB opens a temporary local database on which you will be working. Applications are multiple: you can run queries, join multiple parquet files, or write query results into a new parquet file (see the sketch after the next chunk).
# Connection to a temporary DuckDB database
con <- dbConnect(duckdb::duckdb())

# Querying the parquet file without loading it into R
res <- dbGetQuery(con, "
  SELECT country, AVG(value) AS avg_val
  FROM 'data.parquet'
  WHERE year >= 2020
  GROUP BY country
  ORDER BY avg_val DESC
  LIMIT 5
")
Please feel free to contact us if you encounter any issue during this process. Moreover, to give us your feedback, questions, and remarks, send us an email; we will answer promptly.
Jules Oliviéro jules.oliviero@inrae.fr
Cedric Briand cedric.briand@eaux-et-vilaine.bzh
Jani Helminen jani.helminen@luke.fi
The EU is not responsible for the content of this project.