```
#this is an example code script
plot(1:10,1:10)
title("Example plot in a code chunk")
```

# E2M2 2022: Model Fitting Basics

This is a Quarto document. It allows us to combine background text, code, and its output in one document. It can be “rendered” into an HTML file that can be read in any browser. All code chunks look like the following below, and can be run in RStudio just like a line in a standard `.R`

script:

A “code-only” version of this tutorial is available in the `fire-model-fitting.R`

script.

# Introduction

This is a tutorial that was developed as part of the E2M2 workshop held in December 2022. It is based on code developed by Cara Brook, further developed by Michelle Evans. If you have any questions, please send an email to Michelle at mv.evans.phd@gmail.com.

This turorial introduces the concept of model fitting. We use the term “model fitting” broadly, referring to both statistical and mechanistic models. In this tutorial you will learn the least-squares technique to fit both types of models to simulated data of forest coverage in Madagascar.

By the end of the tutorial, you should be able to:

- Construct a simple linear regression model and fit it, both manually and via R’s built-in functions
- Construct a simple mechanistic model in continuous time
- Fit a mechanistic model via least squares

# Load the packages for this tutorial

Here are the packages we’ll need for this tutorial. I also choose to set some base configurations that I like, such as never reading in strings as factors, setting a limit for the number of digits to include when printing to the console, and setting a base theme for `ggplot`

.

```
options(stringsAsFactors = F, scipen = 999, digits = 4)
#plotting
library(ggplot2); theme_set(theme_bw())
#ordinary differential equations
library(deSolve)
#data manipulation
library(tidyr)
library(dplyr)
```

# First exploration of the data

First, we load the data and inspect the dataframe’s structure:

```
<- read.csv("treedata.csv")
treedata
head(treedata)
```

```
yr forest
1 1919 708559
2 1920 710419
3 1921 733004
4 1922 739633
5 1923 733177
6 1924 717115
```

`str(treedata)`

```
'data.frame': 92 obs. of 2 variables:
$ yr : int 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 ...
$ forest: num 708559 710419 733004 739633 733177 ...
```

It has two columns:

`yr`

: the year of the measurement`forest`

: the amount of forest coverage in Madagascar in sq. km (remember this is simulated)

We can then plot the data, looking at the amount of forest cover over time. Note that we save this plot so we can easily add regression lines to it later in the tutorial.

```
<- ggplot(treedata, aes(x = yr, y = forest)) +
base.plot geom_point(color = "forestgreen") +
ylim(c(0,8e5)) +
xlab("Year") +
ylab(bquote('Forest Cover'(km^2)))
base.plot
```

# Statistical Model

First, we will fit a statistical model. We are interested in fitting forest cover as a function of the year, to look at the change overtime:

\[forest = m * year + b\]

## Fit using `glm`

This is a linear regression, that we can fit using R’s built-in functions.

```
<- glm(formula = forest ~ yr,
mod.stat family = gaussian,
data = treedata)
#investigate parameters estimated by model
summary(mod.stat)
```

```
Call:
glm(formula = forest ~ yr, family = gaussian, data = treedata)
Deviance Residuals:
Min 1Q Median 3Q Max
-100668 -27121 3331 29925 100682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5152516 343511 15.0 <0.0000000000000002 ***
yr -2293 175 -13.1 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1983480599)
Null deviance: 519643631025 on 91 degrees of freedom
Residual deviance: 178513253944 on 90 degrees of freedom
AIC: 2235
Number of Fisher Scoring iterations: 2
```

We can see that it returns two estimates that we are interest in. That for the Intercept (b in the equation above) and for the coefficient of year (slope, m in the equation above). Let’s add this fit line to the graph. The parameters can be access via `mod.stat$coefficients`

:

`$coefficients mod.stat`

```
(Intercept) yr
5152515 -2293
```

```
+
base.plot geom_abline(slope = mod.stat$coefficients[2], intercept = mod.stat$coefficients[1], color = "darkred")
```

We can also manually create this fit line through two methods. There is a function in base R called `predict`

, that will predict values using the original dataframe used in the model:

```
<- data.frame(yr = treedata$yr,
base.predict prediction = predict(mod.stat))
head(base.predict)
```

```
yr prediction
1 1919 752343
2 1920 750050
3 1921 747757
4 1922 745464
5 1923 743171
6 1924 740878
```

We can write our own function that creates a new line based on estimated m (slope) and b (intercept):

```
<- function(m, x, b){
mxb_line = m*x + b
y return(y)
}
#example of predicting with this
<- data.frame(yr = treedata$yr,
my.predict prediction = mxb_line(m = mod.stat$coefficients[2],
x = treedata$yr,
b = mod.stat$coefficients[1]))
head(my.predict)
```

```
yr prediction
1 1919 752343
2 1920 750050
3 1921 747757
4 1922 745464
5 1923 743171
6 1924 740878
```

We can add that to the plot to see how they compare:

```
+
base.plot geom_line(data = base.predict, aes(y = prediction), color = "darkred", size = 1) +
geom_line(data = my.predict, aes(y = prediction), color = "dodgerblue", linetype = "dashed",
size = 1)
```

## Manually fit by least-squares

There are two parts to manually fitting a model, corresponding to two functions. The first function takes the parameters for `m`

and `b`

, predicts y-values based on these parameters, and calculates the sum of squares wit the true data. The second function wraps around the first function to minimize the sum of squares, resulting in the optimum estimate for the parameters.

The first function to calculate a sum of squares for one set of parameters. We’ve added some `roxygen2`

code above the function to help remind us what each of the arguments does. You can read it similar to how you would read a help file for a function.

```
#' Manually Estimate Least Squares
#' @param par vector of parameter values (m and b)
#' @param xval vector of x-values in the equation (in our case year)
#' @param ytrue vector of true y-values in our dataset
#'
#' @return Sum of Squares for given parameter values
<- function(par, xval, ytrue){
lst_sq <- par["m"]
m.guess <- par["b"]
b.guess
##run your model with your guess parameters
<- mxb_line(m = m.guess, x = xval, b = b.guess)
y.pred
##compare with data via sum of squares.
<- sum((y.pred - ytrue)^2)
sum.sq
return(sum.sq)
}
```

As an example, let’s use this function to estimate the sum of squares for our “optimal” parameter estimates above to some that we have moved away from the optimum.

```
#optimal parameters from linear regression
lst_sq(par = c("m" = -2293, "b" = 5152515), xval = treedata$yr, ytrue = treedata$forest)
```

`[1] 178514122204`

```
#estimate SS with parameters lightly moved from optimum
lst_sq(par = c("m" = -2600, "b" = 5500000), xval = treedata$yr, ytrue = treedata$forest)
```

`[1] 6200461146534`

We can see that the parameters we’ve moved away have a higher sum of squares and therefore are a worse fit. We can also double-check this visually by plotting both of the lines. The displaced line is in blue, and you can see it is a worse fit to the data.

```
+
base.plot geom_abline(intercept = 5152515, slope = -2293) +
geom_abline(intercept = 5500000, slope = -2600, color = "dodgerblue")
```

The second function is a wrapper that works to minimize the output (i.e. the sum of squares) from the previous function. By minimizing the output, we will find the optimum fit. This is done using the `optim`

function in R, which searches the parameter space to find the values that result in the minimum sum of squares.

```
= function(guess.slope, guess.int, xguess, ydata){
wrap_fit
<- c("m" = guess.slope, "b" = guess.int)
par
<- optim(par=par, fn = lst_sq, xval = xguess, ytrue = ydata)
out
= out$par['m']
m.fit
= out$par['b']
b.fit
return(list(m.fit, b.fit))
}
```

The two “guesses” that we give to this function are the initial starting values that the `optim`

function will start the search at. It is always good to try and start these near where we think they might be, otherwise the model may not converge or will result in a poor fit. For example, let’s try with some nonsense values and see what happens:

```
<- wrap_fit(guess.slope = 10, guess.int = 100,
poor.fit xguess = treedata$yr, ydata = treedata$forest)
poor.fit
```

```
[[1]]
m
329.3
[[2]]
b
168.9
```

```
+
base.plot geom_abline(slope = poor.fit[[1]], intercept = poor.fit[[2]])
```

This is clearly not a great fit to the data. Let’s make a better starting guess, taking for example the intercept of our data on the y-axis (800000) and the average decrease over this time period, about 5000 per year:

```
<- wrap_fit(guess.slope = -5000, guess.int = 800000,
good.fit xguess = treedata$yr, ydata = treedata$forest)
good.fit
```

```
[[1]]
m
-2315
[[2]]
b
5195033
```

```
#predict from this model
<- mxb_line(m = good.fit[[1]], x = treedata$yr, b = good.fit[[2]])
good.predictions
#plot this data
+
base.plot geom_line(aes(x = treedata$yr, y = good.predictions))
```

This is a much better fit, and agrees with what we saw in the statistical model above. We can compare all the linear regressions we have done in a plot

```
+
base.plot #original statistical model with glm
geom_line(aes(x = treedata$yr, y = predict(mod.stat)), color = "darkred") +
#model fit using optim
geom_line(aes(x = treedata$yr, y = good.predictions), color = "blue",
linetype = "dotted", size = 1)
```

From this we can see that forest is declining, because the slope is negative, and also estimate he amount of forest cover at the beginning of our time series using any of our predict functions. Remember that the intercept returned by the model isn’t the intercept at the beginning of our data in 1920, but would correspond to the amount of forest cover when x = 0 (which isn’t very useful to us), so you need to predict it using the value of the year.

This is all great, but we don’t know anything about why or how this decline took place. For this we need to use a mechanistic model.

# Mechanistic Model

Mechanistic models allow us to directly model the process that may be creating the data that we have. For example, we may hypothesize that the declining forest is due to slash and burn agriculture. We decide to build a model describing conversion of forest to savanna via slash and burn. Instead of m and b, the slope and intercept of a line, we now want to estimate the rate of slash and burn. We can describe this process with the following equations, describing changes in forest (F) and savanna (S):

\[ \frac{dF}{dt} = rF\frac{K-N}{K} - \gamma F S \] \[ \frac{dS}{dt} = \gamma F S \frac{K-N}{K}\]

The model has four parameters:

- r : the rate of forest growth, the number of sq. km of forest produced b one existing sq. km of forest per year
- K : the carrying capacity for the forest, equal to the amount of land available for forest and savanna
- N : the total amount of forest and savanna land
- \(\gamma\): the rate of slash and burn, the number of sq km of savanna produced per sq km of slashed forest. For example, a rate of 0.8 would mean 0.2 of the forest that is slashed reconverts to forest

We already know the rate of forest growth (r = 1.01) and the carrying capacity (K = 900000). We will build a continuous model to estimate the rate of slash and burn that explains our data.

## Create the initial model

We define the model as a function that can be read by the ODE solver:

```
<- function(t,y,parms){
ForestSavannaCont # The with() function gives access to the named values of parms within the
# local environment created by the function
with(c(as.list(y),parms),{
= For + Sav
N <- r*((K-N)/K)*For - gamma*For*Sav
dFordt <- gamma*((K-N)/K)*For*Sav
dSavdt
#Return forest to compare with data
list(c(dFordt, dSavdt))
}) }
```

To then run the model, we need to provide the initial conditions (amount of forest and savanna), the time series we want to run it over, and values for the parameters:

```
## Intialize population
<- c(For = 450000, #sq. km of forest in Mada
Mada.start Sav = 100000) #sq. km of savanna in Mada
## Set up time-steps and units - here go for slightly longer to allow model to equilibrate
<- seq(1900,2010,by=1)
times
## Define parameters
<- c(r=1.01,
values gamma = .00000045,
K = 900000)
```

We can then run the model and plot it with the data:

```
<- data.frame(lsoda(y = Mada.start, times = times, func = ForestSavannaCont,
slash.mod1parms = values))
names(slash.mod1) = c("yr", "forest", "savanna")
+
base.plot geom_line(data = slash.mod1, aes(x = yr, y = forest)) +
xlim(c(1915, 2012))
```

It looks like the model is overestimating the forest conversion rate. Let’s adjust the rate by fitting the model to our dataset

## Fitting a mechanistic model

Like the statistical model, we use a workflow consisting of several functions to fit a mechanistic model:

- A function that runs our model (we already made this above
`ForestSavannaCont`

) - A function that compares our model with the data via a statistical test (we’ll use least squares)
- A wrapper function that minimizes the difference between our model and the data (here, we’ll use R’s function ‘optim’ to minimize the function we write under #2)

The comparison function takes a set of parameters, estimates the the forest cover for each year, and compares these predictions to the true data using the sum of least squares:

```
<- function(gamma, r, K, xstart, times, data){
sum_sq_mech <- c(r = r,
parms gamma = gamma,
K = K)
<- data.frame(lsoda(y = xstart, times = times,
model.out func = ForestSavannaCont, parms = parms))
names(model.out) <- c("yr", "forest", "savanna")
# the model runs for longer than our data, so we need to make sure
# we are comparing the same years across the two datasets
<- model.out[match(data$yr, model.out$yr),]
forest.preds
## and compare the output of the model with the data
<- sum((forest.preds$forest - data$forest)^2)
sum.sq
return(sum.sq)
}
```

Let’s now test the function on one value of parameters:

```
sum_sq_mech(gamma = 0.00000045,
r = 1.01,
K = 900000,
xstart = Mada.start,
times = times,
data = treedata)
```

`[1] 13332008425590`

Now we write the wrapper function that minimizes the least squares to fit the optimum fit. It manually explores a range of values for the slashing rate and then chooses the one which results in the lowest sum of squares.

```
<- function(xstart, times, r, K, data){
wrap_mech
## make a sequence of guesses for the rate
= seq(from = 0, to = .000001, by = .0000001)
slash.list
## then, search for the minimum
= list()
sum.sq.lst for (i in 1:length(slash.list)){
= sum_sq_mech(gamma=slash.list[i], xstart=xstart, r=r, K=K, times=times, data=data)
sum.sq.lst[[i]]
}= c(unlist(sum.sq.lst))
sum.sq.lst
plot(slash.list, sum.sq.lst, type="b")
##find the minimum
= slash.list[sum.sq.lst==min(sum.sq.lst)]
fit.slash
return(fit.slash)
}
```

Then we can fit the model.

`<- wrap_mech(xstart = Mada.start, times = times, r = 1.01, K = 900000, data = treedata) fit.mech `

It returns a plot of the estimated value of the sum of squares across the values of \(\gamma\) that we evaluted. We can see that \(3e10^-7\) is the best fit. This is also what the function returns!

` fit.mech`

`[1] 0.0000003`

With this estimated parameter, let’s fit the mechanistic model and compare it to the data:

```
<- c(r = 1.01, gamma = fit.mech, K = 900000)
param.fit
<- data.frame(lsoda(y = Mada.start, times = times, func = ForestSavannaCont,
mech.fit parms = param.fit))
names(mech.fit) <- c("yr", "forest", "savanna")
+
base.plot geom_line(data = slash.mod1, aes(x = yr, y = forest)) +
geom_line(data = mech.fit, aes(x = yr, y = forest), color = "dodgerblue") +
xlim(c(1920, 2010))
```

We can see that our fit line in blue is a much better fit to the data than our original guess.