class: center, middle, inverse, title-slide

# Working with multiple data frames
## Data Science in a Box
### datasciencebox.org

---

layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle

# .hand[We...]

.huge[.green[have]] .hand[multiple data frames]

.huge[.pink[want]] .hand[to bring them together]

---

## Data: Women in science

Information on 10 women in science who changed the world

.small[

|name               |
|:------------------|
|Ada Lovelace       |
|Marie Curie        |
|Janaki Ammal       |
|Chien-Shiung Wu    |
|Katherine Johnson  |
|Rosalind Franklin  |
|Vera Rubin         |
|Gladys West        |
|Flossie Wong-Staal |
|Jennifer Doudna    |

]

.footnote[
Source: [Discover Magazine](https://www.discovermagazine.com/the-sciences/meet-10-women-in-science-who-changed-the-world)
]

---

## Inputs

.panelset[

.panel[.panel-name[professions]

```r
professions
```

```
## # A tibble: 10 x 2
##    name               profession
##    <chr>              <chr>
##  1 Ada Lovelace       Mathematician
##  2 Marie Curie        Physicist and Chemist
##  3 Janaki Ammal       Botanist
##  4 Chien-Shiung Wu    Physicist
##  5 Katherine Johnson  Mathematician
##  6 Rosalind Franklin  Chemist
##  7 Vera Rubin         Astronomer
##  8 Gladys West        Mathematician
##  9 Flossie Wong-Staal Virologist and Molecular Biologist
## 10 Jennifer Doudna    Biochemist
```

]

.panel[.panel-name[dates]

```r
dates
```

```
## # A tibble: 8 x 3
##   name               birth_year death_year
##   <chr>                   <dbl>      <dbl>
## 1 Janaki Ammal             1897       1984
## 2 Chien-Shiung Wu          1912       1997
## 3 Katherine Johnson        1918       2020
## 4 Rosalind Franklin        1920       1958
## 5 Vera Rubin               1928       2016
## 6 Gladys West              1930         NA
## 7 Flossie Wong-Staal       1947         NA
## 8 Jennifer Doudna          1964         NA
```

]

.panel[.panel-name[works]

```r
works
```

```
## # A tibble: 9 x 2
##   name            known_for
##   <chr>           <chr>
## 1 Ada Lovelace    first computer algorithm
## 2 Marie Curie     theory of radioactivity, discovery of element…
## 3 Janaki Ammal    hybrid species, biodiversity protection
## 4 Chien-Shiung Wu confim and refine theory of radioactive beta d…
## 5 Katherine John… calculations of orbital mechanics critical to …
## 6 Vera Rubin      existence of dark matter
## 7 Gladys West     mathematical modeling of the shape of the Eart…
## 8 Flossie Wong-S… first scientist to clone HIV and create a map …
## 9 Jennifer Doudna one of the primary developers of CRISPR, a gro…
```

]

]

---

## Desired output

```
## # A tibble: 10 x 5
##    name     profession    birth_year death_year known_for
##    <chr>    <chr>              <dbl>      <dbl> <chr>
##  1 Ada Lov… Mathematician         NA         NA first computer a…
##  2 Marie C… Physicist an…         NA         NA theory of radioa…
##  3 Janaki … Botanist            1897       1984 hybrid species, …
##  4 Chien-S… Physicist           1912       1997 confim and refin…
##  5 Katheri… Mathematician       1918       2020 calculations of …
##  6 Rosalin… Chemist             1920       1958 <NA>
##  7 Vera Ru… Astronomer          1928       2016 existence of dar…
##  8 Gladys … Mathematician       1930         NA mathematical mod…
##  9 Flossie… Virologist a…       1947         NA first scientist …
## 10 Jennife… Biochemist          1964         NA one of the prima…
```

---

## Inputs, reminder

.pull-left[

```r
names(professions)
```

```
## [1] "name"       "profession"
```

```r
names(dates)
```

```
## [1] "name"       "birth_year" "death_year"
```

```r
names(works)
```

```
## [1] "name"      "known_for"
```

]

.pull-right[

```r
nrow(professions)
```

```
## [1] 10
```

```r
nrow(dates)
```

```
## [1] 8
```

```r
nrow(works)
```

```
## [1] 9
```

]

---

class: middle

# Joining data frames

---

## Joining data frames

```r
something_join(x, y)
```

- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x
- `inner_join()`: all rows from x where there are matching values in y, returning all combinations when there are multiple matches
- `anti_join()`: all rows from x where there are no matching values in y, never duplicating rows of x
- ...

---

## Setup

For the next few slides...

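The deck shows `x` and `y` only as printed tibbles. A minimal sketch of how they might be built, assuming `tibble::tibble()` and column types that match the printed `<dbl>` and `<chr>` headers (the construction itself is not in the original slides):

```r
library(tibble)

# Assumed construction of the two toy data frames; the slides show only
# their printed form, so this is one way to get the same values.
x <- tibble(
  id = c(1, 2, 3),
  value_x = c("x1", "x2", "x3")
)

y <- tibble(
  id = c(1, 2, 4),
  value_y = c("y1", "y2", "y4")
)
```

Both frames share the `id` column, and `id` 3 appears only in `x` while `id` 4 appears only in `y`, which is what makes the join types behave differently.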
.pull-left[

```r
x
```

```
## # A tibble: 3 x 2
##      id value_x
##   <dbl> <chr>
## 1     1 x1
## 2     2 x2
## 3     3 x3
```

]

.pull-right[

```r
y
```

```
## # A tibble: 3 x 2
##      id value_y
##   <dbl> <chr>
## 1     1 y1
## 2     2 y2
## 3     4 y4
```

]

---

## `left_join()`

.pull-left[
<img src="img/left-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

.pull-right[

```r
left_join(x, y)
```

```
## # A tibble: 3 x 3
##      id value_x value_y
##   <dbl> <chr>   <chr>
## 1     1 x1      y1
## 2     2 x2      y2
## 3     3 x3      <NA>
```

]

---

## `left_join()`

```r
professions %>%
* left_join(dates)
```

```
## # A tibble: 10 x 4
##    name           profession                birth_year death_year
##    <chr>          <chr>                          <dbl>      <dbl>
##  1 Ada Lovelace   Mathematician                     NA         NA
##  2 Marie Curie    Physicist and Chemist             NA         NA
##  3 Janaki Ammal   Botanist                        1897       1984
##  4 Chien-Shiung … Physicist                       1912       1997
##  5 Katherine Joh… Mathematician                   1918       2020
##  6 Rosalind Fran… Chemist                         1920       1958
##  7 Vera Rubin     Astronomer                      1928       2016
##  8 Gladys West    Mathematician                   1930         NA
##  9 Flossie Wong-… Virologist and Molecular…       1947         NA
## 10 Jennifer Doud… Biochemist                      1964         NA
```

---

## `right_join()`

.pull-left[
<img src="img/right-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

.pull-right[

```r
right_join(x, y)
```

```
## # A tibble: 3 x 3
##      id value_x value_y
##   <dbl> <chr>   <chr>
## 1     1 x1      y1
## 2     2 x2      y2
## 3     4 <NA>    y4
```

]

---

## `right_join()`

```r
professions %>%
* right_join(dates)
```

```
## # A tibble: 8 x 4
##   name           profession                 birth_year death_year
##   <chr>          <chr>                           <dbl>      <dbl>
## 1 Janaki Ammal   Botanist                         1897       1984
## 2 Chien-Shiung … Physicist                        1912       1997
## 3 Katherine Joh… Mathematician                    1918       2020
## 4 Rosalind Fran… Chemist                          1920       1958
## 5 Vera Rubin     Astronomer                       1928       2016
## 6 Gladys West    Mathematician                    1930         NA
## 7 Flossie Wong-… Virologist and Molecular …       1947         NA
## 8 Jennifer Doud… Biochemist                       1964         NA
```

---

## `full_join()`

.pull-left[
<img src="img/full-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

.pull-right[

```r
full_join(x, y)
```

```
## # A tibble: 4 x 3
##      id value_x value_y
##   <dbl> <chr>   <chr>
## 1     1 x1      y1
## 2     2 x2      y2
## 3     3 x3      <NA>
## 4     4 <NA>    y4
```

]

---

## `full_join()`

```r
dates %>%
* full_join(works)
```

```
## # A tibble: 10 x 4
##    name       birth_year death_year known_for
##    <chr>           <dbl>      <dbl> <chr>
##  1 Janaki Am…       1897       1984 hybrid species, biodiversity…
##  2 Chien-Shi…       1912       1997 confim and refine theory of …
##  3 Katherine…       1918       2020 calculations of orbital mech…
##  4 Rosalind …       1920       1958 <NA>
##  5 Vera Rubin       1928       2016 existence of dark matter
##  6 Gladys We…       1930         NA mathematical modeling of the…
##  7 Flossie W…       1947         NA first scientist to clone HIV…
##  8 Jennifer …       1964         NA one of the primary developer…
##  9 Ada Lovel…         NA         NA first computer algorithm
## 10 Marie Cur…         NA         NA theory of radioactivity, di…
```

---

## `inner_join()`

.pull-left[
<img src="img/inner-join.gif" width="80%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

.pull-right[

```r
inner_join(x, y)
```

```
## # A tibble: 2 x 3
##      id value_x value_y
##   <dbl> <chr>   <chr>
## 1     1 x1      y1
## 2     2 x2      y2
```

]

---

## `inner_join()`

```r
dates %>%
* inner_join(works)
```

```
## # A tibble: 7 x 4
##   name        birth_year death_year known_for
##   <chr>            <dbl>      <dbl> <chr>
## 1 Janaki Amm…       1897       1984 hybrid species, biodiversity…
## 2 Chien-Shiu…       1912       1997 confim and refine theory of …
## 3 Katherine …       1918       2020 calculations of orbital mech…
## 4 Vera Rubin        1928       2016 existence of dark matter
## 5 Gladys West       1930         NA mathematical modeling of the…
## 6 Flossie Wo…       1947         NA first scientist to clone HIV…
## 7 Jennifer D…       1964         NA one of the primary developer…
```

<!-- --- -->

<!-- ## `semi_join()` -->

<!-- .pull-left[ -->
<!-- ```{r echo=FALSE, out.width="80%", out.extra ='style="background-color: #FDF6E3"'} -->
<!-- include_graphics("img/semi-join.gif") -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r} -->
<!-- semi_join(x, y) -->
<!-- ``` -->
<!-- ] -->

<!-- --- -->

<!-- ## `semi_join()` -->

<!-- ```{r} -->
<!-- dates %>% -->
<!--   semi_join(works) #<< -->
<!-- ``` -->

<!-- --- -->

<!-- ## `anti_join()` -->

<!-- .pull-left[ -->
<!-- ```{r echo=FALSE, out.width="80%", out.extra ='style="background-color: #FDF6E3"'} -->
<!-- include_graphics("img/anti-join.gif") -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r} -->
<!-- anti_join(x, y) -->
<!-- ``` -->
<!-- ] -->

<!-- --- -->

<!-- ## `anti_join()` -->

<!-- ```{r} -->
<!-- dates %>% -->
<!--   anti_join(works) #<< -->
<!-- ``` -->

---

## Putting it all together

```r
professions %>%
  left_join(dates) %>%
  left_join(works)
```

```
## # A tibble: 10 x 5
##    name     profession    birth_year death_year known_for
##    <chr>    <chr>              <dbl>      <dbl> <chr>
##  1 Ada Lov… Mathematician         NA         NA first computer a…
##  2 Marie C… Physicist an…         NA         NA theory of radioa…
##  3 Janaki … Botanist            1897       1984 hybrid species, …
##  4 Chien-S… Physicist           1912       1997 confim and refin…
##  5 Katheri… Mathematician       1918       2020 calculations of …
##  6 Rosalin… Chemist             1920       1958 <NA>
##  7 Vera Ru… Astronomer          1928       2016 existence of dar…
##  8 Gladys … Mathematician       1930         NA mathematical mod…
##  9 Flossie… Virologist a…       1947         NA first scientist …
## 10 Jennife… Biochemist          1964         NA one of the prima…
```

<!-- --- -->

<!-- class: middle -->

<!-- # Case study: Student records -->

<!-- --- -->

<!-- ## Student records -->

<!-- - Have: -->
<!--   - Enrolment: official university enrolment records -->
<!--   - Survey: Student provided info missing students who never filled it out and including students who filled it out but dropped the class -->
<!-- - Want: Survey info for all enrolled in class -->

<!-- -- -->

<!-- ```{r include=FALSE} -->
<!-- enrolment <- read_csv("data/students/enrolment.csv") -->
<!-- survey <- read_csv("data/students/survey.csv") -->
<!-- ``` -->

<!-- .pull-left[ -->
<!-- ```{r message = FALSE} -->
<!-- enrolment -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r message = FALSE} -->
<!-- survey -->
<!-- ``` -->
<!-- ] -->

<!-- --- -->

<!-- ## Student records -->

<!-- .panelset[ -->

<!-- .panel[.panel-name[In class] -->
<!-- ```{r} -->
<!-- enrolment %>% -->
<!--   left_join(survey, by = "id") #<< -->
<!-- ``` -->
<!-- ] -->

<!-- .panel[.panel-name[Survey missing] -->
<!-- ```{r} -->
<!-- enrolment %>% -->
<!--   anti_join(survey, by = "id") #<< -->
<!-- ``` -->
<!-- ] -->

<!-- .panel[.panel-name[Dropped] -->
<!-- ```{r} -->
<!-- survey %>% -->
<!--   anti_join(enrolment, by = "id") #<< -->
<!-- ``` -->
<!-- ] -->

<!-- ] -->

<!-- --- -->

<!-- class: middle -->

<!-- # Case study: Grocery sales -->

<!-- --- -->

<!-- ## Grocery sales -->

<!-- - Have: -->
<!--   - Purchases: One row per customer per item, listing purchases they made -->
<!--   - Prices: One row per item in the store, listing their prices -->
<!-- - Want: Total revenue -->

<!-- -- -->

<!-- ```{r include=FALSE} -->
<!-- purchases <- read_csv("data/sales/purchases.csv") -->
<!-- prices <- read_csv("data/sales/prices.csv") -->
<!-- ``` -->

<!-- .pull-left[ -->
<!-- ```{r message = FALSE} -->
<!-- purchases -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r message = FALSE} -->
<!-- prices -->
<!-- ``` -->
<!-- ] -->

<!-- --- -->

<!-- ## Grocery sales -->

<!-- .panelset[ -->

<!-- .panel[.panel-name[Total revenue] -->
<!-- .pull-left[ -->
<!-- ```{r} -->
<!-- purchases %>% -->
<!--   left_join(prices) #<< -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r} -->
<!-- purchases %>% -->
<!--   left_join(prices) %>% -->
<!--   summarise(total_revenue = sum(price)) #<< -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->

<!-- .panel[.panel-name[Revenue per customer] -->
<!-- .pull-left[ -->
<!-- ```{r} -->
<!-- purchases %>% -->
<!--   left_join(prices) -->
<!-- ``` -->
<!-- ] -->
<!-- .pull-right[ -->
<!-- ```{r} -->
<!-- purchases %>% -->
<!--   left_join(prices) %>% -->
<!--   group_by(customer_id) %>% #<< -->
<!--   summarise(total_revenue = sum(price)) -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->

<!-- ] -->
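
The joins in this deck match on the shared `name` column without naming it. A minimal sketch with the key spelled out through `by = "name"`, using three-row stand-ins for `professions`, `dates`, and `works` (values copied from the slides; the reduced data frames are an assumption made to keep the example self-contained):

```r
library(dplyr)
library(tibble)

# Three-row stand-ins for the deck's data frames (hypothetical subsets)
professions <- tibble(
  name = c("Ada Lovelace", "Rosalind Franklin", "Vera Rubin"),
  profession = c("Mathematician", "Chemist", "Astronomer")
)
dates <- tibble(
  name = c("Rosalind Franklin", "Vera Rubin"),
  birth_year = c(1920, 1928),
  death_year = c(1958, 2016)
)
works <- tibble(
  name = c("Ada Lovelace", "Vera Rubin"),
  known_for = c("first computer algorithm", "existence of dark matter")
)

# Same pipeline as on the slide, with the join key stated explicitly
combined <- professions %>%
  left_join(dates, by = "name") %>%
  left_join(works, by = "name")

# anti_join() keeps the rows of x that have no match in y,
# here the people for whom `dates` has no entry
missing_dates <- professions %>%
  anti_join(dates, by = "name")
```

Naming the key keeps the result stable if the data frames later gain another column with the same name in both, which would otherwise silently become part of the match.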