User Tools

Site Tools


wiki:spdistr1

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

wiki:spdistr1 [2017/12/05 22:53] (current)
Line 1: Line 1:
 +====== Species Distribution Modeling I ======
 + 
 +In this session we will: 
 +  - Read in species occurrence and range data
 +  - Pre-process environmental data
 +  - Fit a generalized linear model to estimate the species distribution
 +  - Predict across the landscape and write the results to disk (for use in GIS, etc.)
 +\\
 +R code courtesy of Adam M. Wilson ([[ https://​github.com/​adammwilson/​SpatialAnalysisTutorials/​blob/​master/​SDM_intro/​SDM.md | original version]]) and adapted by Marta Jarzyna, October 2015\\
 +
 +[[ http://​www.spatial-ecology.net/​ost4sem/​exercise/​SDM/​SDM_I_code_final.R | SDM_I_code_final.R ]] 
 +
 +   wget http://​www.spatial-ecology.net/​ost4sem/​exercise/​SDM/​SDM_I_code_final.R
 +
 +   R
 +
 +   # Set working directory
 +   ​rm(list=ls(all=TRUE))
 +   wdir <- "/​media/​sf_LVM_share" ​
 +   ​setwd(wdir)
 +   
 +   # Define data directotry - this is where we store the data, but also where we will save the output to
 +   ​datadir <- "/​media/​sf_LVM_share/​data" ​
 +   
 +   # if your shared folder is not writable, create a different folder for the output
 +   ​if(!file.exists("​~/​sdm_out"​)) dir.create("​~/​sdm_out",​ recursive=T)
 +   ​outdir <- "​~/​sdm_out"​
 +   
 +   # Loading libraries
 +   ​require(rgdal)
 +   ​require(lattice)
 +   ​require(ggplot2)
 +      ​
 +   ​packages=c("​hSDM","​dismo","​maptools","​sp","​maps","​coda","​rgdal","​rgeos","​doParallel","​reshape",​
 +           "​ggplot2","​knitr","​rasterVis","​texreg"​)
 +           
 +   ​l=lapply(packages,​ library, character.only=T,​quietly=T)
 +   
 +   ​rasterOptions(chunksize=1000,​maxmemory=1000)
 +We will use Montane woodcreper (Lepidocolaptes lacrymiger) as example species. This species has a large range, occurring from the coastal cordillera of Venezuela along the Andes south to south-east Peru and central Bolivia.
 +
 +   # Read in species occurrence data and species range
 +   # Species: Montane woodcreper, Lepidocolaptes lacrymiger
 +   
 +   # Point occurrence data
 +   ​points <- as.data.frame(read.csv("/​media/​sf_LVM_share/​data/​occurrence/​Lepidocolaptes_lacrymiger_allpoints.csv",​ header=TRUE))
 +   ​head(points)
 +   
 +   # indicate that these data are presences
 +   ​presence <- matrix(1,​nrow(points),​1)
 +   ​points <- cbind(points,​presence)
 +   
 +   # retrieving spatial coordinates from a Spatial object
 +   ​points <- SpatialPointsDataFrame(points[,​c(1,​2)],​ points, coords.nrs = numeric(0), ​
 +                proj4string = CRS(as.character(NA)),​ match.ID = TRUE)
 +   # assign projection
 +   ​projection(points)="​+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,​0,​0"​
 +   ​points
 +   ​plot(points)
 +   
 +   # Species'​ range
 +   range <- readOGR("​.",​ "​cartodb-query"​)
 +   range
 +   ​plot(range,​ add=TRUE)
 +Loading eBird sampling dataset, in order to obtain "​absence"​ data
 +
 +   # link to global sampling raster
 +   ​gsampling=raster(file.path(datadir,"​eBirdSampling_filtered.tif"​))
 +   
 +   # crop to species range to create modelling domain
 +   ​sampling=crop(gsampling,​range,​file.path(outdir,"​sampling.grd"​),​overwrite=T) ​  
 +   
 +   # assign projection
 +   ​projection(sampling)="​+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,​0,​0"​
 +   
 +   # convert to points within data region
 +   ​samplingp=as(sampling,"​SpatialPointsDataFrame"​)
 +   ​samplingp=samplingp[samplingp$eBirdSampling_filtered>​0,​]
 +   
 +   # edit column names to allow aligning with presence observations
 +   ​colnames(samplingp@data)=c("​observation"​)
 +   ​samplingp$presence=0
 +   ​plot(samplingp,​ col="​red"​)#​absences
 +   ​plot(points,​ add=TRUE)#​presences
 +   ​plot(range,​ col="​blue",​add=TRUE)#​species range
 +Create a combined presence-nondetection point dataset
 +   pdata <- rbind(points[,"​presence"​],​samplingp[,"​presence"​])
 +   ​pdata@data[,​c("​lon","​lat"​)]=coordinates(pdata)
 +   ​table(pdata$presence)
 +Environmental data processing
 +   # you can add a lot of other variables, but in the interest of time we'll focus on these 4
 +   fenv <- c(cld="​data/​cloud/​meanannual.tif",​ cld_intra="​data/​cloud/​intra.tif",​ elev="​data/​elevation_mn_GMTED2010_mn.tif",​ forest="​data/​tree_mn_percentage_GFC2013.tif"​)
 +To facilitate modelling, let's crop the global rasters to a smaller domain. ​ We can use the extent of the expert range and the `crop()` function in raster package.
 +   # crop to species domain and copy to project folder ​
 +   
 +   env <- foreach(i=1:​length(fenv))%do%{
 +   ​fo=file.path(outdir,​paste0(names(fenv)[i],"​_clipped.grd"​))
 +   ​crop(raster(file.path(wdir,​fenv[i])),​range,​file=fo,​overwrite=T) ​  
 +   }
 +Read the environmental data in as a raster stack and plot layers
 +   env <- stack(list.files(path = outdir, pattern="​*_clipped.grd$"​ , full.names = TRUE ))
 +   env
 +   
 +   # rename layers for convenience
 +   ​names(env)=names(fenv)
 +   
 +   # mask by elevation to set ocean to 0
 +   ​env=mask(env,​env[["​elev"​]],​maskvalue=0)
 +   
 +   # check out the plot
 +   ​plot(env)
 +Variable selection is tricky business and we're not going to dwell on it here... We'll use the following variables.
 +   ​vars=c("​cld","​cld_intra","​elev","​forest"​)
 +Scaling and centering the environmental variables to zero mean and variance of 1, using the ```scale``` function is typically a good idea.  However, with so many people using this node at the same time, we'll skip this memory intensive step and use the unstandardized variables. ​ The downside of this is the regression coefficients are more difficult to interpret.
 +   #​senv=scale(env[[vars]])
 +   ​senv=env[[vars]]
 +Annotate the point records with the scaled environmental data; i.e., Add the (scaled) environmental data to each point
 +   ​pointsd=raster::​extract(senv,​pdata,​sp=T) ​
 +   ​pointsd=na.exclude(pointsd)
 +   
 +   # Look at the data table:
 +   ​kable(head(pointsd))
 +Explore the data
 +   # Plotting the response (presence/​absence data) and the predictors:
 +   # convert to '​long'​ format for easier plotting
 +   ​pointsdl=reshape::​melt(pointsd@data,​id.vars=c("​lat","​lon","​presence"​),​variable.name="​variable"​
 +   
 +   ​ggplot(pointsdl,​aes(x=value,​y=presence))+facet_wrap(~variable)+
 +   ​geom_point()+
 +   ​stat_smooth(method = "​lm",​ formula = y ~ x + I(x^2), col="​red"​)+
 +   ​geom_smooth(method="​gam",​formula=y ~ s(x, bs = "​cs"​))
 +Model Fitting. Here, we will fit a simple glm to the data. Choosing terms to include in a parametric model can be challenging,​ especially given the large number of possible interactions,​ etc.  In this example we'll keep it fairly simple and include only a quadratic term for elevation (as suggested by the above plot). (Feel free to try various model formulas (adding or removing terms) and see how the model performs.)
 +   ​head(pointsd)
 +   
 +   #​model 1
 +   m1 <- glm(presence~cld+elev,​data=pointsd,​family=binomial(logit))
 +   ​summary(m1)
 +   
 +   #​model 2
 +   m2 <- glm(presence~cld+cld_intra+elev*I(elev^2)+forest,​data=pointsd,​family=binomial(logit))
 +   ​summary(m2)
 +Prediction; i.e., calculating estimates of probability of occurrence of Montane woodcreper for each cell. We can use the `predict` function in the `raster` package to make the predictions across the full raster grid and save the output.
 +      #​predictions based on model 1
 +   p1 <- raster::​predict(senv,​m1,​type="​response", ​
 +   ​file=file.path(outdir,"​prediction_m1.grd"​),​overwrite=T)
 +   
 +   #​predictions based on model 2
 +   p2 <- raster::​predict(senv,​m2,​type="​response",​
 +   ​file=file.path(outdir,"​prediction_m2.grd"​),​overwrite=T)
 +   
 +   p <- stack(p1,​p2);​ names(p)=c("​Model 1","​Model 2")
 +Plot results of predictions on a map
 +   ​gplot(p,​max=1e5)+geom_tile(aes(fill=value))+
 +   ​facet_wrap(~variable)+
 +   ​scale_fill_gradientn(colours=c("​blue","​green","​yellow","​orange","​red"​),​
 +   ​na.value = "​transparent"​)+
 +   ​geom_polygon(aes(x=long,​y=lat,​group=group),​
 +   ​data=fortify(range),​fill="​transparent",​col="​darkred"​)+
 +   ​geom_point(aes(x = lon, y = lat), data = points@data,​col="​black",​size=1)+
 +   ​coord_equal()
 +Model evaluation. gplot(p,​max=1e5)+geom_tile(aes(fill=value))+
 +  facet_wrap(~variable)+
 +  scale_fill_gradientn(
 +    colours=c("​blue","​green","​yellow","​orange","​red"​),​
 +    na.value = "​transparent"​)+
 +  geom_polygon(aes(x=long,​y=lat,​group=group),​
 +               ​data=fortify(range),​fill="​transparent",​col="​darkred"​)+
 +  geom_point(aes(x = lon, y = lat), data = points@data,​col="​black",​size=1)+
 +  coord_equal()
 +Model Evaluation. In general, it is a good idea to use k-fold data partitioning instead of using the data used for fitting. There is a function in the `dismo` package called `kfold` that makes this convenient. But for now, we'll just evaluate on the same data used for fitting.
 +   # Summarize model output using `screenreg` to print a more visually pleasing summary.
 +   ​screenreg(list(m1,​m2),​digits = 7,​doctype=FALSE,​align.center=TRUE)
 +   #​htmlreg(list(m1,​m2),​digits = 7,​doctype=FALSE,​align.center=TRUE)
 +Caveats of SDMs using glms:
 +
 +(1) In this example we treated eBird _non-detections_ as _absences_ when the probability of detection given presence can be much less than zero. What are the chances that an observer would see a species in a 1km grid cell if it were present there?  ​
 +(2) We ignored the spatial autocorrelation in species presences and treated each observation as an independent sample. ​ How can we account for this in SDMs?
 +
 +Summary:
 +
 +In this script we have illustrated a complete workflow, including:
 +
 +(1) Reading in species data
 +
 +(2) Pre-processing large spatial datasets for analysis
 +
 +(3) Running a (simple) logistic GLM Species Distribution Model to make a prediction of p(occurrence|environment)
 +
 +(4) Writing results to disk.
 +
  
wiki/spdistr1.txt ยท Last modified: 2017/12/05 22:53 (external edit)