Gavin Fay
02/05/2025
1/22: Introduction to R and R Studio, working with data
1/29: Visualizing data
2/05: Probability, linear modeling
2/12: Data wrangling, model summaries
2/19: Simulation, Resampling
2/26: Iteration
3/05: Creating functions, debugging
3/19: Flex: automated reporting & Quarto
4/02: working with Spatial data
An introduction to R (Venables et al.)
– http://cran.r-project.org/doc/manuals/R-intro.pdf -
Today’s material: Chapters 8, 11.
RR includes a set of probability distributions that can
be used to simulate and model data.
If the function for the probability model is named
xxx
pxxx: the cumulative distribution \(P(X \leq x)\)dxxx: the probability distribution/density function
\(f(x)\)qxxx: the quantile \(q\), the smallest \(x\) such that \(P(X \leq x) > q\)rxxx: generate a random variable from the model
xxxR(\(\mu=0\), \(\sigma^2=1\))
Values of \(x\) for different
quantiles
qnorm(p, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
> quants <- qnorm(c(0.01,0.025,0.05,0.95,0.975,0.99))
> round(quants,2)
[1] -2.33 -1.96 -1.64 1.64 1.96 2.33\(P(X \leq x)\)
pnorm(q, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
Density (probability ‘mass’ per unit value of \(x\))
> dnorm(quants, mean = 0, sd = 1)
[1] 0.02665214 0.05844507 0.10313564 0.10313564 0.05844507 0.02665214Generating standard normal random variables
Often a good idea to use set.seed() and save the script
detailing which number was used.
This ensures you can exactly repeat your results.
sample() functionTo generate random numbers from discrete sets of values:
- With or without replacement
- Equal or weighted probability
Extremely useful function that underlies many modern statistical
techniques:
- Resampling
- Bootstrapping
- Markov-chain Monte-Carlo (MCMC)
e.g. Roll 10 dice
Pick 3 species of bear
Within your class Project, open a new R markdown file. Save it. (name it lastname_lab3.Rmd or something similar). At the top of the script, add comments with your name and lab 3. Work in pairs or individually.
RRecall linear regression model:
\[y_i = \displaystyle\sum_{j=0}^{p}\beta_j x_{ij} + \varepsilon_i \text{ , } \varepsilon_i \sim N(0,\sigma^2) \text{ , } i=1,\dots,n\]
In matrix form: \(y = X\beta + \varepsilon\)
Model formulae in R, \(y \sim
x\), try ?formula
lm()The basic function for fitting ordinary multiple models is
lm()
fitted.model <- lm(formula, data = data.frame)
e.g. species richness on beaches (Zuur Chapters 5 & 27)
The value of lm() is a fitted model object.
- a list of results of class lm.
Information about the fitted model can be extracted, displayed, plotted, using some generic functions, inlcuding:
anova(object1,object2) Compares a submodel with an
outer model and produces an analysis of variance tablecoef(object) Extract the regression coefficientdeviance(object) Residual sum of squaresformula(object) Extract the model formulaplot(object) Produce four plots, showing residuals,
fitted values and some diagnosticspredict(object, newdata=data.frame) Model predictions
on new dataresiduals(object) or resid(object) Extract
the matrix of residualsstep(object) forward or backward model selection using
AICsummary(object) Print a comprehensive summary of the
resultsvcov(object) Return variance-covariance matrix of main
parameterssummary(lm1)
Call:
lm(formula = Richness ~ NAP, data = RIKZ)
Residuals:
Min 1Q Median 3Q Max
-5.0675 -2.7607 -0.8029 1.3534 13.8723
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6857 0.6578 10.164 5.25e-13 ***
NAP -2.8669 0.6307 -4.545 4.42e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.16 on 43 degrees of freedom
Multiple R-squared: 0.3245, Adjusted R-squared: 0.3088
F-statistic: 20.66 on 1 and 43 DF, p-value: 4.418e-05
broom::tidy()
> library(broom)
> tidy(RIKZ_lm1, conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.69 0.658 10.2 5.25e-13 5.36 8.01
2 NAP -2.87 0.631 -4.55 4.42e- 5 -4.14 -1.59broom::glance()
The {modelsummary} package provides a wealth of options
for creating publication-quality summary tables from fitted model
objects, with bindings for most of the types of models we are using in
this class.
There are lots of options….
In the MASS library there are many functions.
Are samples independent? (Sample design.)
Residuals normally distributed?
- Histograms, qq-plots: qqplot() and qqline()
- Kolmogorov-Smirnov normality test: ks.test()
- Shapiro-Wilk normality test: shapiro.test()
Similar variance among samples?
- Boxplots
- Bartlett’s test for equal variance: bartlett.test()
- Fligner-Killeen test for equal variance: fligner.test()
Model assumptions can be evaluated by plotting the model object.
broom::augment()
> RIKZ_lm1_aug <- augment(RIKZ_lm1, se_fit = )
> glimpse(RIKZ_lm1_aug)
Rows: 45
Columns: 8
$ Richness <dbl> 11, 10, 13, 11, 10, 8, 9, 8, 19, 17, 6, 1, 4, 3, 3, 1, 3, 3…
$ NAP <dbl> 0.045, -1.036, -1.336, 0.616, -0.684, 1.190, 0.820, 0.635, …
$ .fitted <dbl> 6.556653, 9.655722, 10.515778, 4.919680, 8.646589, 3.274107…
$ .resid <dbl> 4.4433465, 0.3442783, 2.4842223, 6.0803197, 1.3534106, 4.72…
$ .hat <dbl> 0.02432839, 0.06623475, 0.08738853, 0.02387714, 0.04669014,…
$ .sigma <dbl> 4.151533, 4.208801, 4.189991, 4.100641, 4.203722, 4.142941,…
$ .cooksd <dbl> 0.0145788938, 0.0002601525, 0.0187094818, 0.0267685231, 0.0…
$ .std.resid <dbl> 1.08136537, 0.08564557, 0.62511744, 1.47940908, 0.33321661,…> RIKZ_lm1_aug <- augment(RIKZ_lm1,
+ se_fit = TRUE)
> glimpse(RIKZ_lm1_aug)
Rows: 45
Columns: 9
$ Richness <dbl> 11, 10, 13, 11, 10, 8, 9, 8, 19, 17, 6, 1, 4, 3, 3, 1, 3, 3…
$ NAP <dbl> 0.045, -1.036, -1.336, 0.616, -0.684, 1.190, 0.820, 0.635, …
$ .fitted <dbl> 6.556653, 9.655722, 10.515778, 4.919680, 8.646589, 3.274107…
$ .se.fit <dbl> 0.6488474, 1.0706040, 1.2297395, 0.6428018, 0.8988733, 0.81…
$ .resid <dbl> 4.4433465, 0.3442783, 2.4842223, 6.0803197, 1.3534106, 4.72…
$ .hat <dbl> 0.02432839, 0.06623475, 0.08738853, 0.02387714, 0.04669014,…
$ .sigma <dbl> 4.151533, 4.208801, 4.189991, 4.100641, 4.203722, 4.142941,…
$ .cooksd <dbl> 0.0145788938, 0.0002601525, 0.0187094818, 0.0267685231, 0.0…
$ .std.resid <dbl> 1.08136537, 0.08564557, 0.62511744, 1.47940908, 0.33321661,…> RIKZ_lm1_aug <- augment(RIKZ_lm1,
+ interval = "confidence")
> glimpse(RIKZ_lm1_aug)
Rows: 45
Columns: 10
$ Richness <dbl> 11, 10, 13, 11, 10, 8, 9, 8, 19, 17, 6, 1, 4, 3, 3, 1, 3, 3…
$ NAP <dbl> 0.045, -1.036, -1.336, 0.616, -0.684, 1.190, 0.820, 0.635, …
$ .fitted <dbl> 6.556653, 9.655722, 10.515778, 4.919680, 8.646589, 3.274107…
$ .lower <dbl> 5.24812804, 7.49664302, 8.03577164, 3.62334705, 6.83383871,…
$ .upper <dbl> 7.865179, 11.814800, 12.995784, 6.216014, 10.459340, 4.9208…
$ .resid <dbl> 4.4433465, 0.3442783, 2.4842223, 6.0803197, 1.3534106, 4.72…
$ .hat <dbl> 0.02432839, 0.06623475, 0.08738853, 0.02387714, 0.04669014,…
$ .sigma <dbl> 4.151533, 4.208801, 4.189991, 4.100641, 4.203722, 4.142941,…
$ .cooksd <dbl> 0.0145788938, 0.0002601525, 0.0187094818, 0.0267685231, 0.0…
$ .std.resid <dbl> 1.08136537, 0.08564557, 0.62511744, 1.47940908, 0.33321661,…> RIKZ_lm1_aug |>
+ janitor::clean_names() |>
+ ggplot() +
+ aes(x = nap, y = fitted) +
+ geom_line() +
+ geom_point(aes(x = nap, y= richness))> ggplot(RIKZ, aes(x = NAP, y = Richness)) +
+ geom_point() +
+ geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'> library(moderndive)
> ggplot(RIKZ) +
+ aes(x = NAP, y = Richness, color = factor(week)) +
+ geom_point() +
+ labs(x = "NAP", y = "Species Richness",
+ color = "Week") +
+ geom_parallel_slopes(se = TRUE)An alternative model for the RIKZ species richness is:
> RIKZ_lm2 <- lm(Richness ~ NAP + factor(week), data = RIKZ)
> #Can also create this model using the `update()` function:
> RIKZ_lm2 <- update(RIKZ_lm1, .~. + factor(week))Compare models with AIC() and via
anova()
Both augment() and the predict() function
can be used to obtain predictions from a fitted model object to a new
data frame.
This can be useful when:
- a subset of the data was reserved from the fitting process (‘test
data’)
- you want to obtain model predictions at values other than the
data
(for example when plotting fitted values)
You can also interrogate model objects directly. Type
names(object) to get a list of the components of
object.
> names(RIKZ_lm1)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model" str() can also be used.
Note that summary(object) is also a list, and components
can be extracted from here too.
> names(summary(RIKZ_lm1))
[1] "call" "terms" "residuals" "coefficients"
[5] "aliased" "sigma" "df" "r.squared"
[9] "adj.r.squared" "fstatistic" "cov.unscaled" > str(summary(RIKZ_lm1))
List of 11
$ call : language lm(formula = Richness ~ NAP, data = RIKZ)
$ terms :Classes 'terms', 'formula' language Richness ~ NAP
.. ..- attr(*, "variables")= language list(Richness, NAP)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "Richness" "NAP"
.. .. .. ..$ : chr "NAP"
.. ..- attr(*, "term.labels")= chr "NAP"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Richness, NAP)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "Richness" "NAP"
$ residuals : Named num [1:45] 4.443 0.344 2.484 6.08 1.353 ...
..- attr(*, "names")= chr [1:45] "1" "2" "3" "4" ...
$ coefficients : num [1:2, 1:4] 6.686 -2.867 0.658 0.631 10.164 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "NAP"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:2] FALSE FALSE
..- attr(*, "names")= chr [1:2] "(Intercept)" "NAP"
$ sigma : num 4.16
$ df : int [1:3] 2 43 2
$ r.squared : num 0.325
$ adj.r.squared: num 0.309
$ fstatistic : Named num [1:3] 20.7 1 43
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:2, 1:2] 0.025 -0.00799 -0.00799 0.02299
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "(Intercept)" "NAP"
.. ..$ : chr [1:2] "(Intercept)" "NAP"
- attr(*, "class")= chr "summary.lm"When dealing with numeric covariates, it often makes sense to center
these variables before performing a regression.
- i.e. subtract the mean from each covariate.
This helps to make our parameters more meaningful.
- they now correspond to the average case in the data, rather than some
extrapolated value.
- this is particularly useful for intercept parameters.
We may be interested in fitting many (many) models.
R has functionality to do this efficiently.
> library(broom)
> ggplot(gapminder, aes(x=year, y=lifeExp, group = country)) +
+ geom_line() +
+ facet_wrap(~continent)> ggplot(gapminder_models, aes(x = estimate, group = term)) +
+ geom_histogram(fill="black",col="white") +
+ facet_wrap(~term, scales= "free")> slopes <- filter(gapminder_models, term == "year") %>%
+ arrange(desc(estimate))
> continents <- select(gapminder, country, continent) |>
+ distinct()
> slopes <- slopes |> left_join(continents)
Joining with `by = join_by(country)`> ggplot(slopes, aes(x = fct_reorder(country, estimate), y = estimate, group = term, col = continent)) +
+ #geom_histogram(fill="black",col="white") +
+ geom_point(alpha=0.5) +
+ geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.05) +
+ coord_flip() +
+ labs(y = "slope",
+ x = "Country") +
+ theme_minimal()lm(gdpPercap ~ continent, where gdpPercap is the new
outcome variable y. Get information about the best-fitting
line from the regression table. How do the regression results match up
with those of an analysis of life expectancy by continent?