Preparing your data
To ensure smooth and accurate analyses in GLOSSA, your data must be formatted correctly. This guide will walk you through preparing the necessary files: species occurrence data, environmental data, and optional projection layers and study area polygons.
Species occurrence data
Species occurrence data must be provided as a tab-separated text file (for example, a TSV file or a CSV file that uses tabs as the delimiter). This file should include the following columns:
- decimalLongitude: Longitude of the occurrence point in decimal degrees.
- decimalLatitude: Latitude of the occurrence point in decimal degrees.
- pa: Presence (1) or absence (0) of the species. If this column is missing, GLOSSA will assume all rows represent presences (1, presence-only) and will generate pseudo-absences before the modeling step.
- timestamp: The time when the occurrence was recorded. This column is optional. If used, GLOSSA will match each occurrence to the environmental data from that specific time period. If omitted, GLOSSA will assume all occurrences occurred at the same time.
Example:
decimalLongitude decimalLatitude pa timestamp
5.42909 43.20937 1 1
-43.05000 49.03000 0 1
-2.52369 47.29234 1 2
34.05400 -26.91300 1 3
- Ensure that all occurrence points fall within the study area defined by your environmental data to avoid losing data points due to missing covariate values.
- Double-check for formatting errors or missing columns before uploading to avoid processing issues.
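The column rules above can be checked before uploading. The following is a minimal sketch using only the Python standard library. The function name check_occurrences and the coordinate range checks are assumptions made for illustration and are not part of GLOSSA.

```python
import csv
import io

REQUIRED = ("decimalLongitude", "decimalLatitude")

def check_occurrences(tsv_text):
    """Return (rows, problems) for a tab-separated occurrence table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    rows = []
    if problems:
        return rows, problems
    for line_no, rec in enumerate(reader, start=2):  # line 1 is the header
        try:
            lon = float(rec["decimalLongitude"])
            lat = float(rec["decimalLatitude"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric coordinates")
            continue
        # Assumption: coordinates are plain decimal degrees in the usual ranges.
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            problems.append(f"line {line_no}: coordinates out of range")
        # Without a pa column every row counts as a presence (documented default).
        pa = rec["pa"] if "pa" in header else "1"
        if pa not in ("0", "1"):
            problems.append(f"line {line_no}: pa must be 0 or 1, got {pa!r}")
        rows.append((lon, lat, pa, rec.get("timestamp")))
    return rows, problems

example = (
    "decimalLongitude\tdecimalLatitude\tpa\ttimestamp\n"
    "5.42909\t43.20937\t1\t1\n"
    "-43.05000\t49.03000\t0\t1\n"
    "-2.52369\t47.29234\t1\t2\n"
    "34.05400\t-26.91300\t1\t3\n"
)
rows, problems = check_occurrences(example)
print(len(rows), problems)
```

An empty problems list means the file has the expected columns and values. It does not confirm that the points fall inside the extent of your environmental rasters.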
Environmental data
Environmental data is provided as raster layers in formats like .tif or .nc (NetCDF). These layers are used as predictors for species distributions and should represent variables like temperature, salinity, or other relevant data. All environmental layers must be uploaded as a ZIP file, with each variable organized into separate subdirectories. Each subdirectory should contain raster files corresponding to the relevant time periods.
GLOSSA supports categorical variables in your models, such as ice cover, habitat classes, or other discrete data. GLOSSA and the dbarts package automatically transform categorical variables using one-hot encoding, converting each into new binary columns for modeling. To fit into the GLOSSA workflow, categorical layers must be properly formatted before uploading.
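To illustrate what one-hot encoding does to a categorical layer, here is a minimal sketch in plain Python. The one_hot helper, the habitat labels, and the column naming scheme are hypothetical. GLOSSA and dbarts perform this step internally and may name the resulting columns differently.

```python
def one_hot(values, prefix):
    """Expand category labels into one binary (0/1) column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [int(v == cat) for v in values]
        for cat in categories
    }

# Hypothetical habitat class for four raster cells.
cells = ["sand", "rock", "sand", "seagrass"]
encoded = one_hot(cells, "habitat")
for name, column in encoded.items():
    print(name, column)
```

Each cell has a 1 in exactly one of the new columns, so a variable with many categories adds many predictors to the model.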
Here are some guidelines for preparing your files:
- Consistent resolution and extent: Ensure that all raster layers have the same resolution and geographic extent. GLOSSA will check for mismatches and notify you with warnings or errors:
  - If the geographic extents differ, GLOSSA will extend the smaller layers with missing values to match the largest raster.
  - If the resolutions differ, GLOSSA will stop and return an error, as all layers need to be at the same resolution.
- Temporal alignment: If your occurrence data spans multiple time periods (the timestamp column), the rasters must align with those timestamps. GLOSSA reads the raster files of each variable in alphabetical order, so name them so that alphabetical order matches chronological order. For example, if your occurrence data includes years 1 and 3, you must provide raster files for each environmental variable for years 1, 2, and 3, even though there are no occurrences for year 2. You can use a blank or duplicate raster for year 2, but the file must exist so that year 3 is indexed correctly.
Example ZIP structure:
environmental_data.zip
├── temperature
│ ├── temp_1.tif
│ ├── temp_2.tif
│ └── temp_3.tif
├── salinity
│ ├── sal_1.tif
│ ├── sal_2.tif
│ └── sal_3.tif
- Large raster files may slow down the processing. For testing purposes, consider using lower-resolution rasters or aggregating the cells using the terra::aggregate() function in R.
- Use clear and consistent file names, especially when handling multiple time periods or variables.
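The layout and temporal alignment rules in this section can be checked before uploading. The sketch below uses only the Python standard library and builds a small in-memory ZIP with empty placeholder files. The helper names layers_by_variable and check_layers are hypothetical, and it assumes that timestamps are 1-based positions in the alphabetically sorted file list, as the year 1 to 3 example suggests. It does not read the rasters, so it cannot check resolution or extent.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

RASTER_SUFFIXES = {".tif", ".nc"}

def layers_by_variable(zip_file):
    """Map each variable subdirectory to its raster file names, sorted alphabetically."""
    layers = defaultdict(list)
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            if name.endswith("/") or path.suffix.lower() not in RASTER_SUFFIXES:
                continue  # skip directory entries and non-raster files
            if len(path.parts) < 2:
                continue  # rasters must sit inside a variable subdirectory
            layers[path.parts[-2]].append(path.name)
    return {var: sorted(files) for var, files in layers.items()}

def check_layers(layers, max_timestamp):
    """List layout problems given the largest timestamp in the occurrence data."""
    problems = []
    counts = {var: len(files) for var, files in layers.items()}
    if len(set(counts.values())) > 1:
        problems.append(f"variables have different numbers of files: {counts}")
    for var, n in counts.items():
        if n < max_timestamp:
            problems.append(f"{var}: {n} files but timestamps go up to {max_timestamp}")
    return problems

# Build a placeholder ZIP that mirrors the example structure (empty files).
names = [
    "temperature/temp_1.tif", "temperature/temp_2.tif", "temperature/temp_3.tif",
    "salinity/sal_1.tif", "salinity/sal_2.tif", "salinity/sal_3.tif",
]
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for name in names:
        zf.writestr(name, b"")
buffer.seek(0)

layers = layers_by_variable(buffer)
print(layers)
print(check_layers(layers, max_timestamp=3))

# Alphabetical order is not numeric order: temp_10.tif sorts before temp_2.tif.
print(sorted(["temp_2.tif", "temp_10.tif"]))
```

If you have ten or more time periods, zero-padded names such as temp_01.tif and temp_10.tif keep alphabetical and chronological order the same.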
Projection layers (optional, but likely your primary interest)
Projection layers allow you to forecast species distributions under different environmental conditions, such as future climate scenarios. These layers must follow exactly the same format as the environmental data, with identical subdirectory names and matching variable names.
Some guidelines for preparing your files:
- File order: Ensure that the files within each subdirectory are ordered consistently across variables. The first file of each variable (e.g., temperature, salinity) will be treated as corresponding to the same time period, and GLOSSA will stack and project them together as part of the same time series. That is, in the following example, GLOSSA will make one projection for the temp_projection_1.tif and sal_projection_1.tif scenario, and a different projection for the conditions of temp_projection_2.tif and sal_projection_2.tif.
- Multiple scenarios: If you’re working with multiple independent scenarios (e.g., two different climate models), upload each scenario in a separate ZIP file. This way, they won’t be included in the same time series, allowing you to compare scenarios separately during plotting and exporting.
Example projection ZIP structure:
projection_layers.zip
├── temperature
│ ├── temp_projection_1.tif
│ └── temp_projection_2.tif
├── salinity
│ ├── sal_projection_1.tif
│ └── sal_projection_2.tif
- Ensure that projection layers have the same resolution, geographic extent, coordinate reference system (CRS; WGS84), and variable names as the environmental layers used during model fitting. This consistency is crucial for accurate forecasting and projections.
- If your projections involve different time periods, ensure the raster files are clearly organized and ordered to reflect these periods accurately.
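The pairing of projection files by position can be sketched as follows. This is an illustration in plain Python and not GLOSSA's own code. The function name pair_projection_layers is hypothetical, and it assumes that files are matched by their position in the alphabetically sorted list of each variable.

```python
def pair_projection_layers(fit_variables, projection_layers):
    """Group projection files by position so that each group is one time step."""
    if set(projection_layers) != set(fit_variables):
        raise ValueError(
            f"projection variables {sorted(projection_layers)} "
            f"do not match fitted variables {sorted(fit_variables)}"
        )
    ordered = {var: sorted(projection_layers[var]) for var in fit_variables}
    if len({len(files) for files in ordered.values()}) != 1:
        raise ValueError("each variable needs the same number of projection files")
    # zip(*...) walks all variables in parallel: first files together, then second, ...
    return [dict(zip(ordered, step)) for step in zip(*ordered.values())]

fit_variables = ["temperature", "salinity"]
projection_layers = {
    "temperature": ["temp_projection_1.tif", "temp_projection_2.tif"],
    "salinity": ["sal_projection_2.tif", "sal_projection_1.tif"],
}
steps = pair_projection_layers(fit_variables, projection_layers)
for i, step in enumerate(steps, start=1):
    print(i, step)
```

A missing variable or an unequal file count raises an error here, which makes a misaligned time series visible before any projection is run.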
Study area polygon (optional)
You can define a study area polygon to limit the geographic scope of your analysis. This will crop the environmental layers and filter out occurrence points that fall outside the study area. By default, GLOSSA uses the extent of your environmental rasters to define the study area, fitting the model only with occurrences that have valid values for all predictor variables. It also projects only onto cells covered by all environmental variables.
However, if your rasters cover a larger region than your area of interest or if you have occurrence points outside the region you’d like to filter, you can upload a custom polygon. This allows you to specify the geographic region of interest, and GLOSSA will automatically crop the environmental layers and restrict the analysis to within the polygon boundaries.
The supported formats of this file are:
- GPKG (GeoPackage)
- KML
- GeoJSON
Example use case:
You might have environmental data for an entire ocean but only want to model species distributions within the Mediterranean Sea. Uploading a Mediterranean Sea polygon will crop the data accordingly.
- If the resolution of your polygon is too coarse, you can apply a buffer to expand or refine it. This buffer option can be tuned in the next section of the documentation.
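To show what filtering occurrences by a study area polygon means, here is a minimal ray-casting sketch in plain Python. It treats coordinates as planar, ignores holes and buffers, and uses a crude rectangle as a stand-in for a real Mediterranean outline. GLOSSA's own spatial filtering may handle boundaries differently.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many polygon edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical study area: a rough rectangle, not a real coastline.
study_area = [(-6.0, 30.0), (36.0, 30.0), (36.0, 46.0), (-6.0, 46.0)]

occurrences = [
    (5.42909, 43.20937),
    (-43.05000, 49.03000),
    (-2.52369, 47.29234),
    (34.05400, -26.91300),
]
kept = [p for p in occurrences if point_in_polygon(p[0], p[1], study_area)]
print(kept)
```

Only the first example point lies inside this rectangle, so the other three would be dropped from the analysis.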
Conclusion
By properly formatting your data, you’ll ensure that GLOSSA runs smoothly and provides accurate results. Once your data is ready, move on to the next step: Running a new analysis.