Sourcing data

Finding species data

Species distribution modelling requires sample locality data for each species of interest. These data can be presence-only or presence-and-absence data. Typically, your data will be derived from one or more studies of that particular species.

Estimating population size

If you are unsure of the population of a species in a specific area, you can estimate it.

  • The PRESENCE occupancy model enables you to estimate the proportion of the area occupied or the likelihood that it is occupied by a particular species.
  • The MARK occupancy model, on the other hand, allows the user to make use of mark–recapture, dead recovery, and telemetry data to calculate the proportion of sites occupied.
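The idea behind occupancy estimation can be illustrated with a minimal sketch. It assumes a single-season model with one constant occupancy probability (psi) and one constant detection probability (p), fitted by a coarse grid search. The function names and the detection histories are hypothetical, and this is not the PRESENCE or MARK implementation; it only shows why repeat visits allow occupancy to be estimated above the naive proportion of sites with detections.

```python
import math

def occupancy_log_likelihood(psi, p, histories):
    """Log-likelihood of a constant-psi, constant-p single-season occupancy model."""
    total = 0.0
    for history in histories:
        detections = sum(history)
        visits = len(history)
        # Probability of this detection history if the site is occupied.
        occupied = psi * p**detections * (1 - p) ** (visits - detections)
        # A site with no detections may also simply be unoccupied.
        unoccupied = (1 - psi) if detections == 0 else 0.0
        total += math.log(occupied + unoccupied)
    return total

def fit_by_grid_search(histories, steps=99):
    """Return the (psi, p) pair on a coarse grid with the highest log-likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(
        ((psi, p) for psi in grid for p in grid),
        key=lambda pair: occupancy_log_likelihood(pair[0], pair[1], histories),
    )

# Hypothetical detection histories: one row per site, one column per visit
# (1 = species detected on that visit, 0 = not detected).
histories = [
    [1, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]

# Naive occupancy ignores imperfect detection.
naive = sum(1 for h in histories if any(h)) / len(histories)
psi_hat, p_hat = fit_by_grid_search(histories)
print("naive occupancy:", naive)
print("estimated occupancy:", psi_hat, "detection probability:", p_hat)
```

Because some occupied sites can go undetected on every visit, the estimated occupancy is expected to be at least as high as the naive proportion.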

There are a number of websites where you can begin your search for locality data for species, such as:

When sourcing species data, the following aspects should be taken into account:

  • Standardize the data: It is preferable to make use of the same species locality datasets in different models or programs when comparing the model outcomes, since minor differences in values or localities could affect the model results.
  • Check the data accuracy: In most data-gathering scenarios, creating the data or combining different datasets can take far longer than the modelling itself. When combining different datasets, note that your final dataset can only be considered as accurate as your least accurate dataset.
  • Choose your data type: When modelling with presence-only data, check the dataset for negative values or absence records, which do not belong in a presence-only dataset.
  • Check the attributes:  In sourcing data it is possible to receive a number of species packaged in a single table.  Take care to make sure that only the localities for the species to be modelled are listed in the attribute table.
  • Consider the age of the dataset: If the goal is to model the current distribution, it is important to use the most recent dataset available. Match the age of the species dataset with the age of the explanatory data variables.
  • Clusters and outliers: Sample data should cover the area of interest evenly. In certain cases, locality data might be collected exclusively along roadsides; the distribution of these data will show clusters along roads, and a deduction could be made that pavements are the preferred habitat of the species. In other cases, you may find that certain points are outliers. These are not necessarily incorrect, but they can skew your result. How you deal with clusters and outliers is a matter of model application and personal choice.
  • False absences:  It is important to note that there are two different methods of mapping species absence localities.  The first is through fieldwork and observations that the species is notably absent in a certain area.  The other method relies on environmental data and known species habitats, where species are marked as absent in localities where the habitat for that species is absent.  This second method can result in false absences where it is only assumed the species does not occur at the locality.
  • Duplicate data: If the data were located via the internet, check their source to make sure that you are not accessing the same data from a single source via multiple web portals.
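Several of the checks above (attributes, data type, duplicates) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field names (species, lon, lat, presence), the species names, and the coordinates are all hypothetical, and two records are treated as duplicates only if their coordinates agree to five decimal places.

```python
def clean_presence_records(records, species):
    """Keep unique presence localities for a single species."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec["species"] != species:
            continue  # another species packaged in the same table
        if rec["presence"] != 1:
            continue  # absence or negative value in a presence-only dataset
        key = (round(rec["lon"], 5), round(rec["lat"], 5))
        if key in seen:
            continue  # same locality obtained through more than one portal
        seen.add(key)
        cleaned.append(rec)
    return cleaned

# Hypothetical records, as they might arrive after merging several sources.
records = [
    {"species": "Species A", "lon": 18.42, "lat": -33.92, "presence": 1},
    {"species": "Species A", "lon": 18.42, "lat": -33.92, "presence": 1},  # duplicate
    {"species": "Species A", "lon": 18.43, "lat": -33.91, "presence": 0},  # absence
    {"species": "Species B", "lon": 18.50, "lat": -33.80, "presence": 1},  # other species
    {"species": "Species A", "lon": 19.10, "lat": -33.50, "presence": 1},
]

cleaned = clean_presence_records(records, "Species A")
for rec in cleaned:
    print(rec["species"], rec["lon"], rec["lat"])
```

Clusters, outliers, and false absences are not handled here, since how to treat them depends on the model application.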

Finding environmental data

Before continuing with species distribution modelling, you will also need to acquire sample environmental and climatic data for your area of interest. You can source environmental data from:

Finding climatic data 

The IPCC Data Distribution Centre provides four main types of data and guidance. They are listed below and described in more detail on the Intergovernmental Panel on Climate Change (IPCC) website.

  • Observed Climate Data Sets
  • Global Climate Model Data
  • Socio-economic data and scenarios
  • Data and scenarios for other environmental changes

When sourcing climate data, the following aspects should be taken into account:

  • Modelling current or current and future distributions: Depending on whether you want to model the present distribution of a species or the future distribution of a species, you should download the climate data from the IPCC data distribution centre according to your modelling needs.  If you only intend to model the present distribution of a species, then ignore the section on Carbon emission scenarios and move directly to the “Choosing a modelling method or program” section.
  • Use applicable environmental data: It is important to use environmental data that represent the expected habitat of the species and that vary over the study area; otherwise the result is likely to be uniform across the study area. Beware of over-fitting the model: each explanatory variable should be efficient in explaining the distribution. Approaches based on Akaike’s Information Criterion (AIC), for example, can be used to determine this.
  • Mapping precision: There is always an inherent error in assigning map coordinates (x and y, or latitude and longitude) to species localities. Different map projections cause different types of distortion, and species localities could be mapped at different map scales. Both of these could have a negative impact on the spatial accuracy of the data. It is important to note in your study which map projection and which scale the data were captured at. When finding the environmental data required for the study, it is crucial to match the map projection of the species locality data. It is less important to match the map scale of the environmental and locality data; however, environmental data that are coarser than the locality data could result in localities falling in unexpected environmental areas.
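The AIC comparison mentioned under "Use applicable environmental data" can be sketched as follows. AIC is defined as 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximised likelihood; lower values indicate a better trade-off between fit and complexity. The candidate models, log-likelihoods, and parameter counts below are hypothetical and serve only to illustrate the calculation.

```python
def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood

# Hypothetical fitted models: (maximised log-likelihood, number of parameters).
candidates = {
    "temperature only": (-120.4, 2),
    "temperature + rainfall": (-112.9, 3),
    "temperature + rainfall + elevation": (-112.6, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

# Print the models from lowest (preferred) to highest AIC.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("lowest AIC:", best)
```

In this made-up comparison, adding elevation improves the log-likelihood only slightly, so the extra parameter is penalised and the two-variable model has the lowest AIC.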