---
title: "Reshaping data frames using pivot functions from {tidyr} and tally from {dplyr}"
description: |
  Demonstrating methods for changing the shape of data frames (number of columns
  and rows) using the #TidyTuesday data set for week 27 of 2022
  (5/7/2022): "San Francisco Rentals"
author:
  - name: Ronan Harrington
    url: https://github.com/rnnh/
date: 2022-07-05
repository_url: https://github.com/rnnh/TidyTuesday/
preview: sf-rents_files/figure-html5/fig2-1.png
output:
  distill::distill_article:
    self_contained: false
    toc: true
---
```{r knitr, include=FALSE}
knitr::opts_chunk$set(include = TRUE)
knitr::opts_chunk$set(fig.height = 6)
knitr::opts_chunk$set(fig.width = 9)
```
## Introduction
In this post, the [San Francisco Rentals](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-07-05) data set is used to demonstrate data reshaping in R.
Reshaping changes the number of columns and rows in a data frame to fit a given use case.
A data frame is made narrower by decreasing the number of columns (pivoting longer also makes it taller by adding rows), and wider by increasing the number of columns.

The three reshaping methods covered in this article are:

- [Making a data frame narrower by summarising variables using `group_by()` and `tally()`](#reshaping-a-data-frame-by-summarising-variables)
- [Making a data frame wider with `pivot_wider()`](#reshaping-a-data-frame-to-make-it-wider-with-the-tidyr-function-pivot-wider)
- ["Lengthening" a data frame with `pivot_longer()`](#reshaping-a-data-frame-to-make-it-more-narrow-with-the-tidyr-function-pivot-longer)

Data frames created with these methods are used to make two plots:

- [Count of construction permits by type per street](#plotting-permit-type-counts-per-street-using-a-tidy-data-frame-of-value-counts)
- [Annual construction by type per San Francisco county](#plotting-annual-construction-per-san-francisco-county-using-a-data-frame-created-with-pivot-longer)

## Setup
Loading the [R](https://www.r-project.org/) libraries and
[data set](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-07-05/readme.md).
```{r setup}
# Loading libraries
library(tidytuesdayR)
library(tidyverse)
library(tidytext)
library(ggthemes)
# Loading data
tt <- tt_load("2022-07-05")
```
## Reshaping a data frame by summarising variables
```{r permits_per_street}
# Printing a summary of the San Francisco (SF) permits data frame
tt$sf_permits

# Printing a summary of the shape of the data frame
paste("tt$sf_permits has", nrow(tt$sf_permits), "rows and", ncol(tt$sf_permits),
      "columns.")

# Creating a tall/narrow data set of permits per street
permits_per_street <- tt$sf_permits %>%
  # Selecting variables/columns to keep
  select(permit_type_definition, street_name, permit_number) %>%
  # Grouping the permit numbers by type and street name for counting
  group_by(permit_type_definition, street_name) %>%
  # Counting/tallying the number of permits by type per street
  tally()

# Printing a summary of the permits per street data frame
permits_per_street

# Printing a summary of the shape of the data frame
paste("permits_per_street has", nrow(permits_per_street), "rows and",
      ncol(permits_per_street), "columns.")
```
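
To see what `group_by()` followed by `tally()` does on a smaller scale, here is a minimal sketch using a made-up data frame. The `toy_permits` tibble and its column names are hypothetical and not part of the TidyTuesday data; the block is shown for illustration only and is not evaluated when the article is knit.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data: one row per permit
toy_permits <- tibble(
  permit_type = c("alterations", "alterations", "alterations", "demolitions"),
  street      = c("Market", "Market", "Mission", "Market"),
  permit_id   = 1:4
)

# Grouping by type and street, then counting the rows in each group
tallied <- toy_permits %>%
  group_by(permit_type, street) %>%
  tally()

# count() combines the grouping and tallying steps and returns an
# ungrouped data frame
counted <- toy_permits %>%
  count(permit_type, street)

tallied
counted
```

Four permit rows collapse into three rows, one per combination of permit type and street, with the count in a new column `n`. After `tally()`, the result is still grouped by the first grouping variable (`permit_type` here), so later verbs such as `slice_max()` operate within each group; `count()` returns the same counts without any grouping.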
## Reshaping a data frame to make it wider with the {tidyr} function pivot wider
```{r permits_per_street_wider}
# Creating a wider copy of the permits per street data frame
permits_per_street_wider <- permits_per_street %>%
  # Pivoting the street names wider (creating a column for each street) and
  # selecting the "n" variable for the values in this data frame
  pivot_wider(names_from = street_name, values_from = n)

# Printing the wider permits per street data frame
permits_per_street_wider

# Printing a summary of the shape of the data frame
paste("permits_per_street_wider has", nrow(permits_per_street_wider), "rows and",
      ncol(permits_per_street_wider), "columns.")
```
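
The same pivot can be sketched on a small, made-up table of counts to show where missing values come from. The `toy_counts` tibble below is hypothetical; `values_fill` is a `pivot_wider()` argument that is not used in the chunk above. The block is shown for illustration only and is not evaluated when the article is knit.

```r
library(tibble)
library(tidyr)

# Hypothetical toy counts: there is no row for demolitions on Mission
toy_counts <- tibble(
  permit_type = c("alterations", "alterations", "demolitions"),
  street      = c("Market", "Mission", "Market"),
  n           = c(2L, 1L, 1L)
)

# One column per street; combinations missing from the input become NA
toy_wide <- pivot_wider(toy_counts, names_from = street, values_from = n)

# values_fill replaces those missing combinations with a chosen value
toy_wide_filled <- pivot_wider(toy_counts, names_from = street,
                               values_from = n, values_fill = 0L)

toy_wide
toy_wide_filled
```

Both results have two rows (one per permit type) and three columns (`permit_type`, `Market`, `Mission`). In `toy_wide`, the demolitions row holds `NA` under `Mission` because that combination never occurred; in `toy_wide_filled` it holds `0`. Whether a zero or an `NA` is the more honest value depends on what the counts are used for.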
## Reshaping a data frame to make it more narrow with the {tidyr} function pivot longer
```{r production_per_county_longer}
# Printing a summary of the new construction data frame
tt$new_construction

# Printing a summary of the shape of the data frame
paste("tt$new_construction has", nrow(tt$new_construction), "rows and",
      ncol(tt$new_construction), "columns.")

# Creating a taller/more narrow subset of production type per county
production_per_county <- tt$new_construction %>%
  # Selecting variables/columns from tt$new_construction
  select(county, year, totalproduction, sfproduction, mfproduction, mhproduction) %>%
  # "Lengthening" the data frame by selecting columns to be pivoted to a longer format
  pivot_longer(cols = c(totalproduction, sfproduction, mfproduction, mhproduction)) %>%
  # Copying the "name" column to a more descriptive "production_type" column, as the
  # pivoted columns all describe types of production, and removing the original
  # "name" column
  mutate(production_type = name, name = NULL) %>%
  # Changing "production_type" from a character to a factor variable, with more
  # descriptive factor levels
  mutate(production_type = fct_recode(production_type,
    "Total" = "totalproduction", "Single family" = "sfproduction",
    "Multi family" = "mfproduction", "Mobile home" = "mhproduction"))

# Printing a summary of the production per county data frame
production_per_county

# Printing a summary of the shape of the data frame
paste("production_per_county has", nrow(production_per_county), "rows and",
      ncol(production_per_county), "columns.")
```
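
As a possible alternative to renaming the `name` column after pivoting, `pivot_longer()` accepts `names_to` and `values_to` arguments that name the new columns directly. The sketch below uses a made-up `toy_construction` tibble (hypothetical values, with column names borrowed from the data set) and a hypothetical `units` column name. The block is shown for illustration only and is not evaluated when the article is knit.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: one row per county and year, one column per
# production type
toy_construction <- tibble(
  county       = c("A", "B"),
  year         = c(2000, 2000),
  sfproduction = c(10, 20),
  mfproduction = c(5, 15)
)

# Pivoting the two production columns into name/value pairs, naming the
# resulting columns in the same step
toy_long <- pivot_longer(
  toy_construction,
  cols      = c(sfproduction, mfproduction),
  names_to  = "production_type",
  values_to = "units"
)

toy_long
```

Two rows with two production columns become four rows: each county and year now appears once per production type, with the former column name in `production_type` and the former cell value in `units`. The chunk above keeps the default `value` column name, which the plotting code later in this post relies on.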
## Plotting permit type counts per street using a tidy data frame of value counts
```{r fig1}
# Plotting the 20 streets with the most permits within each permit type
# (after tally(), permits_per_street is still grouped by permit_type_definition,
# so slice_max() selects the top rows per group)
permits_per_street %>%
  slice_max(order_by = n, n = 20) %>%
  mutate(street_name = reorder_within(street_name, n, permit_type_definition)) %>%
  ggplot(aes(x = n, y = street_name, fill = permit_type_definition)) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  theme_solarized_2() +
  facet_wrap(~permit_type_definition, ncol = 2, scales = "free") +
  labs(title = "Count of construction permits by type per street",
       x = "Tally", y = "Street name")
```
## Plotting annual construction per San Francisco county using a data frame created with pivot longer
```{r fig2, fig.cap = "In San Francisco county, new construction plateaued in 2008 before plummeting."}
# Plotting the annual construction by type per San Francisco county
production_per_county %>%
  ggplot(aes(x = year, y = value,
             colour = fct_reorder2(production_type, year, value))) +
  geom_line() +
  theme_clean() +
  facet_wrap(~county, scales = "free") +
  scale_colour_brewer(palette = "Dark2") +
  scale_x_continuous(breaks =
    seq(min(production_per_county$year), max(production_per_county$year), 8)) +
  # "linewidth" replaces "size" for lines from ggplot2 3.4.0 onwards; use
  # size = 0.4 with older versions
  geom_vline(xintercept = 2008, linetype = 2, colour = "red", linewidth = 0.4) +
  labs(colour = "Production type", x = "Year", y = "Units",
       title = "Annual construction by type per San Francisco county",
       subtitle = "Red vertical line marks 2008")
```