class: center, middle, inverse, title-slide

# Tidying data
## Data Science in a Box
### datasciencebox.org
---
layout: true

<div class="my-footer">
<span>
<a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---
class: middle

# .hand[We...]

.huge[.green[have]] .hand[data organised in an unideal way for our analysis]

.huge[.pink[want]] .hand[to reorganise the data to carry on with our analysis]

---

## Data: Sales

<br>

.pull-left[
### .green[We have...]

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--

.pull-right[
### .pink[We want...]

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]

.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---
class: middle

# Pivoting data

---

## Not this...

<img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" />

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3; display: block; margin: auto;" />
]

---

## Wider vs. longer

.pull-left[
### .green[wider]
more columns

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

--

.pull-right[
### .pink[longer]
more rows

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
]

.pull-right[
```r
pivot_longer(
* data,
  cols,
  names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
]

.pull-right[
```r
pivot_longer(
  data,
* cols,
  names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
- `names_to`: name of the column where column names of pivoted variables go (character string)
]

.pull-right[
```r
pivot_longer(
  data,
  cols,
* names_to = "name",
  values_to = "value"
)
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]

.pull-right[
```r
pivot_longer(
  data,
  cols,
  names_to = "name",
* values_to = "value"
)
```
]

---

## Customers `\(\rightarrow\)` purchases

```r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_no",  # column names -> new column called item_no
*   values_to = "item"     # values in columns -> new column called item
* )

purchases
```

```
## # A tibble: 6 x 3
##   customer_id item_no item
##         <dbl> <chr>   <chr>
## 1           1 item_1  bread
## 2           1 item_2  milk
## 3           1 item_3  banana
## 4           2 item_1  milk
## 5           2 item_2  toilet paper
## 6           2 item_3  <NA>
```

---

## Why pivot?

Most likely, because the next step of your analysis needs it

--

.pull-left[
```r
prices
```

```
## # A tibble: 5 x 2
##   item         price
##   <chr>        <dbl>
## 1 avocado       0.5
## 2 banana        0.15
## 3 bread         1
## 4 milk          0.8
## 5 toilet paper  3
```
]

.pull-right[
```r
purchases %>%
* left_join(prices)
```

```
## # A tibble: 6 x 4
##   customer_id item_no item         price
##         <dbl> <chr>   <chr>        <dbl>
## 1           1 item_1  bread         1
## 2           1 item_2  milk          0.8
## 3           1 item_3  banana        0.15
## 4           2 item_1  milk          0.8
## 5           2 item_2  toilet paper  3
## 6           2 item_3  <NA>         NA
```
]

---

## Purchases `\(\rightarrow\)` customers

.pull-left-narrow[
- `data` (as usual)
- `names_from`: which column in the long format contains what should be column names in the wide format
- `values_from`: which column in the long format contains what should be values in the new columns in the wide format
]

.pull-right-wide[
```r
purchases %>%
* pivot_wider(
*   names_from = item_no,
*   values_from = item
* )
```

```
## # A tibble: 2 x 4
##   customer_id item_1 item_2       item_3
##         <dbl> <chr>  <chr>        <chr>
## 1           1 bread  milk         banana
## 2           2 milk   toilet paper <NA>
```
]

<!-- --- -->
<!-- class: middle -->
<!-- # Case study: Approval rating of Donald Trump -->
<!-- --- -->
<!-- ```{r echo=FALSE, out.width="70%"} -->
<!-- knitr::include_graphics("img/trump-approval.png") -->
<!-- ``` -->
<!-- .footnote[ -->
<!-- Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/trump-approval-ratings/adults/) -->
<!-- ] -->
<!-- --- -->
<!-- ## Data -->
<!-- ```{r include=FALSE} -->
<!-- trump <- read_csv("data/trump/trump.csv") -->
<!-- ``` -->
<!-- ```{r} -->
<!-- trump -->
<!-- ``` -->
<!-- --- -->
<!-- ## Goal -->
<!-- .pull-left-wide[ -->
<!-- ```{r echo=FALSE, out.width="100%"} -->
<!-- trump %>% -->
<!--   pivot_longer( -->
<!--     cols = c(approval, disapproval), -->
<!--     names_to = "rating_type", -->
<!--     values_to = "rating_value" -->
<!--   ) %>% -->
<!--   ggplot(aes(x = date, y = rating_value, -->
<!--              color = rating_type, group = rating_type)) + -->
<!--   geom_line() + -->
<!--   facet_wrap(~ subgroup) + -->
<!--   scale_color_manual(values = c("darkgreen", "orange")) + -->
<!--   labs( -->
<!--     x = "Date", y = "Rating", -->
<!--     color = NULL, -->
<!--     title = "How (un)popular is Donald Trump?", -->
<!--     subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", -->
<!--     caption = "Source: FiveThirtyEight modeling estimates" -->
<!--   ) + -->
<!--   theme_minimal() + -->
<!--   theme(legend.position = "bottom") -->
<!-- ``` -->
<!-- ] -->
<!-- -- -->
<!-- .pull-right-narrow[ -->
<!-- **Aesthetic mappings:** -->
<!-- ✅ x = `date` -->
<!-- ❌ y = `rating_value` -->
<!-- ❌ color = `rating_type` -->
<!-- **Facet:** -->
<!-- ✅ `subgroup` (Adults and Voters) -->
<!-- ] -->
<!-- --- -->
<!-- ## Pivot -->
<!-- ```{r output.lines=11} -->
<!-- trump_longer <- trump %>% -->
<!--   pivot_longer( -->
<!--     cols = c(approval, disapproval), -->
<!--     names_to = "rating_type", -->
<!--     values_to = "rating_value" -->
<!--   ) -->
<!-- trump_longer -->
<!-- ``` -->
<!-- --- -->
<!-- ## Plot -->
<!-- ```{r fig.asp = 0.5} -->
<!-- ggplot(trump_longer, -->
<!--        aes(x = date, y = rating_value, color = rating_type, group = rating_type)) + -->
<!--   geom_line() + -->
<!--   facet_wrap(~ subgroup) -->
<!-- ``` -->
<!-- --- -->
<!-- .panelset[ -->
<!-- .panel[.panel-name[Code] -->
<!-- ```{r "trump-plot", fig.show="hide"} -->
<!-- ggplot(trump_longer, -->
<!--        aes(x = date, y = rating_value, -->
<!--            color = rating_type, group = rating_type)) + -->
<!--   geom_line() + -->
<!--   facet_wrap(~ subgroup) + -->
<!--   scale_color_manual(values = c("darkgreen", "orange")) + #<< -->
<!--   labs( #<< -->
<!--     x = "Date", y = "Rating", #<< -->
<!--     color = NULL, #<< -->
<!--     title = "How (un)popular is Donald Trump?", #<< -->
<!--     subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", #<< -->
<!--     caption = "Source: FiveThirtyEight modeling estimates" #<< -->
<!--   ) #<< -->
<!-- ``` -->
<!-- ] -->
<!-- .panel[.panel-name[Plot] -->
<!-- ```{r ref.label="trump-plot", echo = FALSE, out.width="75%"} -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
<!-- --- -->
<!-- .panelset[ -->
<!-- .panel[.panel-name[Code] -->
<!-- ```{r "trump-plot-2", fig.show="hide"} -->
<!-- ggplot(trump_longer, -->
<!--        aes(x = date, y = rating_value, -->
<!--            color = rating_type, group = rating_type)) + -->
<!--   geom_line() + -->
<!--   facet_wrap(~ subgroup) + -->
<!--   scale_color_manual(values = c("darkgreen", "orange")) + -->
<!--   labs( -->
<!--     x = "Date", y = "Rating", -->
<!--     color = NULL, -->
<!--     title = "How (un)popular is Donald Trump?", -->
<!--     subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", -->
<!--     caption = "Source: FiveThirtyEight modeling estimates" -->
<!--   ) + -->
<!--   theme_minimal() + #<< -->
<!--   theme(legend.position = "bottom") #<< -->
<!-- ``` -->
<!-- ] -->
<!-- .panel[.panel-name[Plot] -->
<!-- ```{r ref.label="trump-plot-2", echo = FALSE, out.width="75%", fig.width=6} -->
<!-- ``` -->
<!-- ] -->
<!-- ] -->
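---

## Sketch: a pivot round trip

The two pivots from the sales example can be chained into one self-contained sketch. The `customers` tibble below is rebuilt by hand with `tribble()`, which is an assumption: the slides print `customers` but never show how it was created. `customers_again` is a hypothetical name used only for this illustration.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for `customers` (assumed; the deck does not show its source)
customers <- tribble(
  ~customer_id, ~item_1, ~item_2,        ~item_3,
  1,            "bread", "milk",         "banana",
  2,            "milk",  "toilet paper", NA_character_
)

# Wide -> long: one row per customer-item pair
purchases <- pivot_longer(
  customers,
  cols = item_1:item_3,
  names_to = "item_no",
  values_to = "item"
)

# Long -> wide: back to one row per customer
customers_again <- pivot_wider(
  purchases,
  names_from = item_no,
  values_from = item
)

# Compare the original and the round-tripped table
all.equal(as.data.frame(customers), as.data.frame(customers_again))
```

Pivoting longer and then wider should give back the original layout. Passing `values_drop_na = TRUE` to `pivot_longer()` would drop the row with the missing `item`; `pivot_wider()` would then fill that cell with `NA` again by default.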