Gavin Fay
2023-02-08
1/18: Introduction to R and R Studio, working with data
1/25: Intro to Visualization
2/01: Probability, linear modeling
2/08: Data wrangling, model summaries
2/15: Simulation, Resampling
2/22: Iteration
3/01: Creating functions, debugging
3/15: Flex: more modeling (brms, glmmTMB)
3/29: Spatial data or tidymodeling
Acknowledgements: Mine Çetinkaya-Rundel, Allison Horst
Topics
All GLMs have the following three characteristics:
\[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
Recall from lecture: \[\eta = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\] \[f_Y(y;\mu,\varphi) = \exp\left[\frac{A}{\varphi}{y\lambda(\mu)-\gamma(\lambda(\mu))}+\tau(y,\varphi)\right]\] \[\mu = m(\eta) \text{, } \eta = m^{-1}(\mu)=l(\mu)\]
The combination of a response distribution, a link function and other information needed to carry out the modeling exercise is called the family of the generalized linear model.
glm()
functionThe R
function to fit a generalized linear model is glm()
which uses the form:
fitted.model <- glm(formula, family=family.generator, data=data.frame)
Only new piece is the call to ‘family.generator’
Although complex, its use is fairly simple.
Where there is a choice of link, link can be supplied with the family name as a parameter.
Simple (inefficient) use: The following are equivalent.
Most of the extraction functions that can be applied to lm()
can also be used with glm()
.
glm()
functiontidy(RIKZ_lm1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.69 0.658 10.2 5.25e-13
2 NAP -2.87 0.631 -4.55 4.42e- 5
tidy(RIKZ_glm1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.69 0.658 10.2 5.25e-13
2 NAP -2.87 0.631 -4.55 4.42e- 5
\[P(X=x) = \frac{e^{-\mu} \mu^x}{x!} \text{, } \mu_i = e^{\alpha+\beta_1 x_{1,i} + \dots + \beta_j x_{j,i}}\]
Note that the default link for the poisson is log
so we don’t have to specify here (see ?family
).
summary(RIKZ_poisson)
Call:
glm(formula = Richness ~ NAP, family = poisson, data = RIKZ)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2029 -1.2432 -0.9199 0.3943 4.3256
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.79100 0.06329 28.297 < 2e-16 ***
NAP -0.55597 0.07163 -7.762 8.39e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 179.75 on 44 degrees of freedom
Residual deviance: 113.18 on 43 degrees of freedom
AIC: 259.18
Number of Fisher Scoring iterations: 5
broom::augment()
As with lm()
, the augment()
function can be used to obtain the predictions from a fitted model object, and for a new data frame.
Note: default for type.predict
is on the scale of the linear predictors. Set to "response"
to obtain predictions on the scale of the response variable.
# A tibble: 45 × 9
Richness NAP .fitted .se.fit .resid .std.resid .hat .sigma .cooksd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11 0.045 1.77 0.0639 1.90 1.92 0.0239 1.61 0.0569
2 10 -1.04 2.37 0.0896 -0.206 -0.215 0.0856 1.64 0.00213
3 13 -1.34 2.53 0.106 0.112 0.121 0.142 1.64 0.00122
4 11 0.616 1.45 0.0827 2.72 2.76 0.0291 1.59 0.165
5 10 -0.684 2.17 0.0737 0.406 0.416 0.0476 1.64 0.00453
6 8 1.19 1.13 0.114 2.32 2.37 0.0401 1.60 0.170
7 9 0.82 1.34 0.0929 2.26 2.30 0.0328 1.60 0.125
8 8 0.635 1.44 0.0836 1.64 1.66 0.0294 1.62 0.0532
9 19 0.061 1.76 0.0641 4.33 4.38 0.0238 1.50 0.376
10 17 -1.33 2.53 0.106 1.18 1.27 0.141 1.63 0.148
# … with 35 more rows
#create new values for beach height
new_data <- tibble(NAP = seq(-1.5,2.5,length.out=100))
#predict values for response based on new salinity
RIKZ_pois_pred <- augment(RIKZ_poisson,
newdata = new_data,
type.predict = "response",
se_fit = "TRUE")
RIKZ_pois_pred
# A tibble: 100 × 3
NAP .fitted .se.fit
<dbl> <dbl> <dbl>
1 -1.5 13.8 1.60
2 -1.46 13.5 1.53
3 -1.42 13.2 1.46
4 -1.38 12.9 1.40
5 -1.34 12.6 1.34
6 -1.30 12.3 1.28
7 -1.26 12.1 1.23
8 -1.22 11.8 1.17
9 -1.18 11.5 1.12
10 -1.14 11.3 1.07
# … with 90 more rows
factor
).?add1
).The case_when()
function is like a really friendly if-else statement. When used within mutate()
, it allows you to add a new column containing values dependent on your condition(s).
To penguins
, add a new column size_bin
that contains:
penguins |>
mutate(size_bin = case_when(
body_mass_g > 4500 ~ "large",
body_mass_g > 3000 & body_mass_g <= 4500 ~ "medium",
body_mass_g <= 3000 ~ "small"
)
) |>
select(species, island, size_bin, everything())
# A tibble: 344 × 9
species island size_bin bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <chr> <dbl> <dbl> <int>
1 Adelie Torgersen medium 39.1 18.7 181
2 Adelie Torgersen medium 39.5 17.4 186
3 Adelie Torgersen medium 40.3 18 195
4 Adelie Torgersen <NA> NA NA NA
5 Adelie Torgersen medium 36.7 19.3 193
6 Adelie Torgersen medium 39.3 20.6 190
7 Adelie Torgersen medium 38.9 17.8 181
8 Adelie Torgersen large 39.2 19.6 195
9 Adelie Torgersen medium 34.1 18.1 193
10 Adelie Torgersen medium 42 20.2 190
# … with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
# year <int>
Starting with penguins
:
Limit the columns to species
, year
, and flipper_length_mm
Rename the year
column to study_year
Only keep observations for Adelie penguins
Add a new column called flipper_rank
that contains:
flipper_length_mm
is < 200 mmflipper_length_mm
is >= 200 mmflipper_length_mm
is anything else (e.g. NA
)penguins |>
select(species, year, flipper_length_mm) |>
rename(study_year = year) |>
filter(species == "Adelie") |>
mutate(flipper_rank = case_when(
flipper_length_mm < 200 ~ 1,
flipper_length_mm >= 200 ~ 2,
TRUE ~ 0 # 0 for anything else
))
# A tibble: 152 × 4
species study_year flipper_length_mm flipper_rank
<fct> <int> <int> <dbl>
1 Adelie 2007 181 1
2 Adelie 2007 186 1
3 Adelie 2007 195 1
4 Adelie 2007 NA 0
5 Adelie 2007 193 1
6 Adelie 2007 190 1
7 Adelie 2007 181 1
8 Adelie 2007 195 1
9 Adelie 2007 193 1
10 Adelie 2007 190 1
# … with 142 more rows
Add a new column to penguins
called study_year
that contains:
Starting with penguins
, only keep observations for chinstrap penguins, then only keep the flipper_length_mm
and body_mass_g
variables. Add a new column called fm_ratio
that contains the ratio of flipper length to body mass for each penguin. Next, add another column named ratio_bin
which contains the word “high” if fm_ratio
is greater than or equal to 0.05, “low” if the ratio is less than 0.05, and “no record” if anything else (e.g. NA
).
wider more columns ::: {.cell} ::: {.cell-output .cell-output-stdout}
# A tibble: 2 × 4
customer_id item_1 item_2 item_3
<dbl> <chr> <chr> <chr>
1 1 bread milk banana
2 2 milk toilet paper <NA>
:::
longer more rows ::: {.cell} ::: {.cell-output .cell-output-stdout}
# A tibble: 6 × 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
::: ::: :::
pivot_longer()
data
(as usual)cols
: columns to pivot into longer formatnames_to
: name of the column where column names of pivoted variables go (character string)values_to
: name of the column where data in pivoted variables go (character string)purchases <- customers |>
pivot_longer( #<<
cols = item_1:item_3, # variables item_1 to item_3 #<<
names_to = "item_no", # column names -> new column called item_no #<<
values_to = "item" # values in columns -> new column called item #<<
) #<<
purchases
# A tibble: 6 × 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
data
(as usual)names_from
: which column in the long format contains the what should be column names in the wide formatvalues_from
: which column in the long format contains the what should be values in the new columns in the wide formatWe have data on Steller sea lion pup counts over time at a bunch of rookeries in Alaska.
The number of data points for each rookery is not the same.
We want to investigate the annual trend in counts for each rookery.
SSLpupcounts.csv
.05:00
have multiple data frames
want to bring them together
Information on 10 women in science who changed the world
name |
---|
Ada Lovelace |
Marie Curie |
Janaki Ammal |
Chien-Shiung Wu |
Katherine Johnson |
Rosalind Franklin |
Vera Rubin |
Gladys West |
Flossie Wong-Staal |
Jennifer Doudna |
.footnote[ Source: Discover Magazine]
# A tibble: 10 × 2
name profession
<chr> <chr>
1 Ada Lovelace Mathematician
2 Marie Curie Physicist and Chemist
3 Janaki Ammal Botanist
4 Chien-Shiung Wu Physicist
5 Katherine Johnson Mathematician
6 Rosalind Franklin Chemist
7 Vera Rubin Astronomer
8 Gladys West Mathematician
9 Flossie Wong-Staal Virologist and Molecular Biologist
10 Jennifer Doudna Biochemist
# A tibble: 9 × 2
name known_for
<chr> <chr>
1 Ada Lovelace first computer algorithm
2 Marie Curie theory of radioactivity, discovery of elements polonium a…
3 Janaki Ammal hybrid species, biodiversity protection
4 Chien-Shiung Wu confim and refine theory of radioactive beta decy, Wu expe…
5 Katherine Johnson calculations of orbital mechanics critical to sending the …
6 Vera Rubin existence of dark matter
7 Gladys West mathematical modeling of the shape of the Earth which serv…
8 Flossie Wong-Staal first scientist to clone HIV and create a map of its genes…
9 Jennifer Doudna one of the primary developers of CRISPR, a ground-breaking…
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
left_join()
: all rows from xright_join()
: all rows from yfull_join()
: all rows from both x and ysemi_join()
: all rows from x where there are matching values in y, keeping just columns from xinner_join()
: all rows from x where there are matching values in y, return all combination of multiple matches in the case of multiple matchesanti_join()
: return all rows from x where there are not matching values in y, never duplicate rows of xFor the next few slides…
left_join()
left_join()
# A tibble: 10 × 4
name profession birth_year death_year
<chr> <chr> <dbl> <dbl>
1 Ada Lovelace Mathematician NA NA
2 Marie Curie Physicist and Chemist NA NA
3 Janaki Ammal Botanist 1897 1984
4 Chien-Shiung Wu Physicist 1912 1997
5 Katherine Johnson Mathematician 1918 2020
6 Rosalind Franklin Chemist 1920 1958
7 Vera Rubin Astronomer 1928 2016
8 Gladys West Mathematician 1930 NA
9 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA
10 Jennifer Doudna Biochemist 1964 NA
right_join()
right_join()
# A tibble: 8 × 4
name profession birth_year death_year
<chr> <chr> <dbl> <dbl>
1 Janaki Ammal Botanist 1897 1984
2 Chien-Shiung Wu Physicist 1912 1997
3 Katherine Johnson Mathematician 1918 2020
4 Rosalind Franklin Chemist 1920 1958
5 Vera Rubin Astronomer 1928 2016
6 Gladys West Mathematician 1930 NA
7 Flossie Wong-Staal Virologist and Molecular Biologist 1947 NA
8 Jennifer Doudna Biochemist 1964 NA
full_join()
full_join()
# A tibble: 10 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protec…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioac…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics c…
4 Rosalind Franklin 1920 1958 <NA>
5 Vera Rubin 1928 2016 existence of dark matter
6 Gladys West 1930 NA mathematical modeling of the shape …
7 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cr…
8 Jennifer Doudna 1964 NA one of the primary developers of CR…
9 Ada Lovelace NA NA first computer algorithm
10 Marie Curie NA NA theory of radioactivity, discovery…
inner_join()
inner_join()
# A tibble: 7 × 4
name birth_year death_year known_for
<chr> <dbl> <dbl> <chr>
1 Janaki Ammal 1897 1984 hybrid species, biodiversity protect…
2 Chien-Shiung Wu 1912 1997 confim and refine theory of radioact…
3 Katherine Johnson 1918 2020 calculations of orbital mechanics cr…
4 Vera Rubin 1928 2016 existence of dark matter
5 Gladys West 1930 NA mathematical modeling of the shape o…
6 Flossie Wong-Staal 1947 NA first scientist to clone HIV and cre…
7 Jennifer Doudna 1964 NA one of the primary developers of CRI…
# A tibble: 10 × 5
name profession birth_year death_year known_for
<chr> <chr> <dbl> <dbl> <chr>
1 Ada Lovelace Mathematician NA NA first co…
2 Marie Curie Physicist and Chemist NA NA theory o…
3 Janaki Ammal Botanist 1897 1984 hybrid s…
4 Chien-Shiung Wu Physicist 1912 1997 confim a…
5 Katherine Johnson Mathematician 1918 2020 calculat…
6 Rosalind Franklin Chemist 1920 1958 <NA>
7 Vera Rubin Astronomer 1928 2016 existenc…
8 Gladys West Mathematician 1930 NA mathemat…
9 Flossie Wong-Staal Virologist and Molecular … 1947 NA first sc…
10 Jennifer Doudna Biochemist 1964 NA one of t…
–>
–> –> –> –> –> –>
SSL_Sites.csv
contains the latitude and longitude of sea lion rookeries.join
function to add the lat/lon information and geographic region to the counts dataset.ggplot()
to map the rookery counts in 2015.–>
–>
–> –> –> –> –> –> –> –> –>
–>
–> –> –>
–>
–> –> –> –> –> –> –> –> –> –> –> –> –> –> –> –>
–>
–>
–> –> –> –> –> –> –> –> –>