Preparing your data

To ensure smooth and accurate analyses in GLOSSA, your data must be formatted correctly. This guide will walk you through preparing the necessary files: species occurrence data, environmental data, and optional projection layers and study area polygons.

Species occurrence data

Species occurrence data must be provided as a tab-separated text file (for example, a .tsv file, or a .csv file that uses tabs as the delimiter). The file should include the following columns:

decimalLongitude: Longitude of the occurrence point in decimal degrees.
decimalLatitude: Latitude of the occurrence point in decimal degrees.
pa: Presence (1) or absence (0) of the species. If this column is missing, GLOSSA will assume all rows represent presences (1, presence-only) and will generate pseudo-absences before the modeling step.
timestamp: The time period in which the occurrence was recorded (an integer index in the example below). This column is optional. If provided, GLOSSA matches each occurrence to the environmental data for that time period. If omitted, GLOSSA assumes all occurrences belong to the same time period.

Example:

decimalLongitude decimalLatitude pa   timestamp
5.42909          43.20937        1    1
-43.05000        49.03000        0    1
-2.52369         47.29234        1    2
34.05400         -26.91300       1    3

Tip

Ensure that all occurrence points fall within the study area defined by your environmental data to avoid losing data points due to missing covariate values.
Double-check for formatting errors or missing columns before uploading to avoid processing issues.
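
The checks above can be sketched in base R before uploading. This is only an illustration and not part of GLOSSA: the helper check_occurrences() and the file name are hypothetical, and the sample rows are the ones from the example table.

```r
# Hypothetical pre-upload check; check_occurrences() is not a GLOSSA function.
check_occurrences <- function(path) {
  occ <- read.delim(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

  # The two coordinate columns are required.
  required <- c("decimalLongitude", "decimalLatitude")
  missing_cols <- setdiff(required, names(occ))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }

  # Coordinates must be numeric and within valid decimal-degree ranges.
  stopifnot(
    is.numeric(occ$decimalLongitude), is.numeric(occ$decimalLatitude),
    all(abs(occ$decimalLongitude) <= 180),
    all(abs(occ$decimalLatitude) <= 90)
  )

  # If present, pa may only contain 0 (absence) or 1 (presence).
  if ("pa" %in% names(occ)) {
    stopifnot(all(occ$pa %in% c(0, 1)))
  }
  occ
}

# Write the example table as a tab-separated file, then validate it.
example <- data.frame(
  decimalLongitude = c(5.42909, -43.05, -2.52369, 34.054),
  decimalLatitude  = c(43.20937, 49.03, 47.29234, -26.913),
  pa               = c(1, 0, 1, 1),
  timestamp        = c(1, 1, 2, 3)
)
occ_file <- file.path(tempdir(), "occurrences.tsv")
write.table(example, occ_file, sep = "\t", quote = FALSE, row.names = FALSE)
occ <- check_occurrences(occ_file)
```

The function stops with an error on the first problem it finds, so a file that passes has both coordinate columns, valid coordinate ranges, and a clean pa column if one is present.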

Environmental data

Environmental data is provided as raster layers in formats like .tif or .nc (NetCDF). These layers are used as predictors for species distributions and should represent variables like temperature, salinity, or other relevant data. All environmental layers must be uploaded as a ZIP file, with each variable organized into separate subdirectories. Each subdirectory should contain raster files corresponding to the relevant time periods.

GLOSSA supports categorical predictors, such as ice cover, habitat classes, or other discrete variables. GLOSSA and the dbarts package automatically one-hot encode these variables, converting each category into a binary column for modeling. For this to work, categorical layers must be formatted as described below before uploading.

Including categorical variables in GLOSSA

Categorical layers must meet the following requirements:

As with continuous variables, categorical layers must be provided as raster files (e.g., .tif, .nc, .asc) where the values are integer IDs that map to specific categories. For example, if representing ice cover, you might define 0 as “no ice” and 1 as “ice-covered.”
Rasters must be defined as factors, which typically requires metadata files (e.g., .xml) to enable proper mapping between the integer values and their corresponding categories. If categorical variables are not uploaded as factors, GLOSSA will treat them as continuous variables.
When using categorical layers across time steps or for projections, ensure the categories remain consistent. The set of categories must match or be a subset of those observed in the training data. Avoid attempting to predict on unobserved categories.

To include categorical variables in GLOSSA, you need to provide categorical factor rasters. You can create these using functions like terra::as.factor() to convert integer rasters into factor rasters with defined levels and labels:

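library(terra)

# Read the integer-coded raster (0 = no ice, 1 = ice-covered)
ice <- rast("ice_cover.tif")
# Convert it to a factor (categorical) raster
ice <- as.factor(ice)
# Map each integer ID to its category label
levels(ice) <- data.frame(id = c(0, 1), label = c("no", "yes"))
# Write the factor raster to disk
writeRaster(ice, "ice_cover_factor.tif")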

When saving your file, the raster metadata is often stored in a separate file (e.g., .xml, .aux.xml). Ensure these metadata files are placed alongside the corresponding raster file in the same directory of the ZIP file before uploading.

environmental_data.zip
    ├── continuous_variable
    │     └── cont_var.tif
    ├── categorical_variable
    │     ├── cat_var.tif
    │     └── cat_var.tif.aux.xml
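
To confirm that a layer will be read as categorical rather than continuous, one option is to reload the written file and check terra::is.factor(). The following is a minimal sketch on a small synthetic raster. The values, labels, and temporary file name are made up for illustration.

```r
library(terra)

# Synthetic 4 x 4 layer with integer codes 0 and 1 (illustrative values).
ice <- rast(nrows = 4, ncols = 4, xmin = 0, xmax = 4, ymin = 0, ymax = 4,
            crs = "EPSG:4326")
values(ice) <- rep(c(0, 1), times = 8)

# Attach the category table: first column is the ID, second the label.
levels(ice) <- data.frame(id = c(0, 1), label = c("no ice", "ice-covered"))

# Write to disk and read back, mimicking a fresh read of the uploaded file.
out_file <- file.path(tempdir(), "ice_cover_factor.tif")
writeRaster(ice, out_file, overwrite = TRUE)
reloaded <- rast(out_file)

# TRUE if the category table survived the round trip.
is_categorical <- is.factor(reloaded)

# List any sidecar metadata files written next to the raster.
sidecars <- list.files(tempdir(), pattern = "^ice_cover_factor\\.tif\\.")
```

If is_categorical is FALSE, the layer would be treated as continuous. Any files listed in sidecars belong in the same ZIP subdirectory as the raster.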

Here are some guidelines for preparing your files:

Consistent resolution and extent: Ensure that all raster layers have the same resolution and geographic extent. GLOSSA will check for mismatches and notify you with warnings or errors:
- If the geographic extents differ, GLOSSA will extend the smaller layers with missing values to match the largest raster.
- If the resolutions differ, GLOSSA will stop and return an error, as all layers need to be at the same resolution.
Temporal alignment: If your occurrence data spans multiple time periods (the timestamp column), the rasters must align with those timestamps. GLOSSA expects the files within each variable's subdirectory to be in chronological order when sorted alphabetically, because files are matched to time periods by position. For example, if your occurrence data includes years 1 and 3, each environmental variable needs raster files for years 1, 2, and 3, even if you have no occurrence data for year 2. In this case, you can provide a blank or duplicate raster for year 2, but the file must exist so that year 3 is indexed correctly.
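
The resolution and extent rules can be checked locally before building the ZIP file. The sketch below uses three synthetic terra rasters. The helper names same_resolution() and same_extent() are hypothetical and only illustrate the comparison, not how GLOSSA implements it.

```r
library(terra)

# Three synthetic layers (illustrative only; no real data).
temp   <- rast(nrows = 18, ncols = 36, xmin = -180, xmax = 180, ymin = -90, ymax = 90)
sal    <- rast(nrows = 9,  ncols = 36, xmin = -180, xmax = 180, ymin = 0,   ymax = 90)
coarse <- rast(nrows = 9,  ncols = 18, xmin = -180, xmax = 180, ymin = -90, ymax = 90)
values(temp)   <- seq_len(ncell(temp))
values(sal)    <- seq_len(ncell(sal))
values(coarse) <- seq_len(ncell(coarse))

# Hypothetical helpers comparing cell size and bounding box.
same_resolution <- function(a, b) isTRUE(all.equal(res(a), res(b)))
same_extent <- function(a, b) {
  isTRUE(all.equal(as.vector(ext(a)), as.vector(ext(b))))
}

res_ok_sal    <- same_resolution(temp, sal)     # both have 10-degree cells
ext_ok_sal    <- same_extent(temp, sal)         # sal covers only the northern half
res_ok_coarse <- same_resolution(temp, coarse)  # 20-degree cells: the error case

# A smaller layer can be padded with NA cells to match the largest extent.
sal_padded <- extend(sal, ext(temp))
```

Here sal differs from temp only in extent, which is the case that gets padded with missing values, while coarse differs in resolution and would need resampling or aggregation before upload.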

Example ZIP structure:

environmental_data.zip
    ├── temperature
    │     ├── temp_1.tif
    │     ├── temp_2.tif
    │     └── temp_3.tif
    ├── salinity
    │     ├── sal_1.tif
    │     ├── sal_2.tif
    │     └── sal_3.tif
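
Because files are expected in alphabetical order, numeric suffixes without zero-padding can sort out of chronological order once there are ten or more time steps. The sketch below assumes a plain alphabetical sort (it is not stated here whether GLOSSA applies any numeric-aware sorting), and the file names are hypothetical.

```r
# File names as they might appear in one variable's subdirectory (hypothetical).
plain <- c("temp_1.tif", "temp_2.tif", "temp_10.tif")
plain_sorted <- sort(plain, method = "radix")
# "temp_1.tif" "temp_10.tif" "temp_2.tif"   (step 10 lands before step 2)

# Zero-padding the index keeps alphabetical and chronological order identical.
padded <- sprintf("temp_%02d.tif", c(1, 2, 10))
padded_sorted <- sort(padded, method = "radix")
# "temp_01.tif" "temp_02.tif" "temp_10.tif"
```

Padding every index to the same width is a simple way to make the intended order unambiguous.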

Tip

Large raster files may slow down processing. For testing purposes, consider using lower-resolution rasters or aggregating cells with the terra::aggregate() function in R.
Use clear and consistent file names, especially when handling multiple time periods or variables.
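
As a rough illustration of the aggregation tip, the sketch below coarsens a synthetic raster by a factor of 5. The raster and the factor are arbitrary choices for the example.

```r
library(terra)

# Synthetic 100 x 100 layer standing in for a large environmental raster.
fine <- rast(nrows = 100, ncols = 100, xmin = 0, xmax = 10, ymin = 0, ymax = 10)
values(fine) <- seq_len(ncell(fine))

# Merge each 5 x 5 block of cells into one cell holding the block mean.
coarse <- aggregate(fine, fact = 5, fun = "mean", na.rm = TRUE)

dim(fine)[1:2]    # 100 100
dim(coarse)[1:2]  # 20 20
res(coarse)       # 0.5 0.5
```

Apply the same factor to every layer so that all resolutions still match afterwards. For categorical layers, averaging integer IDs is not meaningful, so a function such as "modal" is the more sensible choice.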

Projection layers (optional, but likely your primary interest)

Projection layers allow you to forecast species distributions under different environmental conditions, such as future climate scenarios. These layers should follow exactly the same format as the environmental data, with identical subdirectory names and matching variable names.

Some guidelines for preparing your files:

File order: Ensure that the files within each subdirectory are ordered consistently across variables. The first file of each variable (e.g., temperature, salinity) is treated as belonging to the same time period, and GLOSSA stacks and projects them together as one step of the same time series. In the example below, GLOSSA makes one projection from temp_projection_1.tif and sal_projection_1.tif, and a second projection from temp_projection_2.tif and sal_projection_2.tif.
Multiple scenarios: If you’re working with multiple independent scenarios (e.g., two different climate models), upload each scenario in a separate ZIP file. This way, they won’t be included in the same time series, allowing you to compare scenarios separately during plotting and exporting.

Example projection ZIP structure:

projection_layers.zip
    ├── temperature
    │     ├── temp_projection_1.tif
    │     └── temp_projection_2.tif
    ├── salinity
    │     ├── sal_projection_1.tif
    │     └── sal_projection_2.tif

Tip

Ensure that projection layers have the same resolution, geographic extent, coordinate reference system (CRS; WGS84), and variable names as the environmental layers used during model fitting. This consistency is crucial for accurate forecasting and projections.
If your projections involve different time periods, ensure the raster files are clearly organized and ordered to reflect these periods accurately.
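
Before zipping, the folder layout can be compared against the fitting layers with a few lines of base R. This sketch builds two throwaway directory trees in a temporary folder. The folder names "fit" and "proj" and the helper functions are hypothetical, and the second salinity file is left out on purpose to show a failing check.

```r
# Build two small directory trees in a temporary folder (hypothetical layout).
root <- file.path(tempdir(), "glossa_layout_check")
unlink(root, recursive = TRUE)
make_tree <- function(base, files) {
  for (f in files) {
    dir.create(dirname(file.path(base, f)), recursive = TRUE, showWarnings = FALSE)
    file.create(file.path(base, f))
  }
}
make_tree(file.path(root, "fit"), c("temperature/temp_1.tif", "salinity/sal_1.tif"))
make_tree(file.path(root, "proj"), c(
  "temperature/temp_projection_1.tif", "temperature/temp_projection_2.tif",
  "salinity/sal_projection_1.tif"  # second salinity file deliberately missing
))

# Variable names are the names of the first-level subdirectories.
variables <- function(base) {
  sort(list.dirs(base, full.names = FALSE, recursive = FALSE))
}
# Number of files found in each variable's subdirectory.
n_files <- function(base) {
  vapply(variables(base),
         function(v) length(list.files(file.path(base, v))),
         integer(1))
}

names_match <- identical(variables(file.path(root, "fit")),
                         variables(file.path(root, "proj")))
proj_counts <- n_files(file.path(root, "proj"))
counts_consistent <- length(unique(proj_counts)) == 1
```

In this example names_match is TRUE, but counts_consistent is FALSE because temperature has two time steps and salinity only one, so the second projection step would lack a salinity layer.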

Study area polygon (optional)

You can define a study area polygon to limit the geographic scope of your analysis. This will crop the environmental layers and filter out occurrence points that fall outside the study area. By default, GLOSSA uses the extent of your environmental rasters to define the study area, fitting the model only with occurrences that have valid values for all predictor variables. It also projects only onto cells covered by all environmental variables.

However, if your rasters cover a larger region than your area of interest, or if you have occurrence points outside that region that you'd like to filter out, you can upload a custom polygon. This allows you to specify the geographic region of interest, and GLOSSA will automatically crop the environmental layers and restrict the analysis to within the polygon boundaries.

The supported formats of this file are:

GPKG (GeoPackage)
KML
GeoJSON

Example use case:

You might have environmental data for an entire ocean but only want to model species distributions within the Mediterranean Sea. Uploading a Mediterranean Sea polygon will crop the data accordingly.
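
If you do not already have a polygon file, one way to produce one is with terra. The sketch below saves a rough bounding box as a GeoPackage. The coordinates are illustrative only and are not an authoritative outline of the Mediterranean Sea; a real analysis would normally use a proper boundary dataset.

```r
library(terra)

# Rough bounding box in longitude/latitude (illustrative coordinates only).
med_box <- as.polygons(ext(-6, 36.5, 30, 46), crs = "EPSG:4326")

# Save as GeoPackage, one of the supported formats.
poly_file <- file.path(tempdir(), "study_area.gpkg")
writeVector(med_box, poly_file, filetype = "GPKG", overwrite = TRUE)

# Read it back to confirm it holds a single polygon in lon/lat coordinates.
study_area <- vect(poly_file)
```

The resulting study_area.gpkg file is the kind of file that can be uploaded as the study area polygon.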

Tip

If the resolution of your polygon is too coarse, you can apply a buffer to expand or refine it. This buffer option can be tuned in the next section of the documentation.

Conclusion

By properly formatting your data, you’ll ensure that GLOSSA runs smoothly and provides accurate results. Once your data is ready, move on to the next step: Running a new analysis.