---
title: 'Guided Project: Creating An Efficient Data Analysis Workflow'
author: "Lisa Reiber"
date: "26/07/2021"
output:
  pdf_document: default
  html_document:
    self_contained: yes
    code_folding: hide
    download_md: yes
    keep_md: yes
    theme: flatly
    css: styles.css
---
```{r setup, include=FALSE}
pacman::p_load(flextable, glue, here, knitr, kableExtra, rmdformats, readr,
               tidyverse, janitor, printr, rmarkdown, tidylog, xaringan)

# global options
options(max.print = "100")
opts_chunk$set(echo = TRUE,
               cache = FALSE,
               prompt = FALSE,
               tidy = FALSE,
               comment = NA, # gets rid of the two ## before print outs
               message = FALSE,
               warning = FALSE,
               class.output = "solution-text")
opts_knit$set(width = 100)

reviews_df <- readr::read_csv(here("data/data-raw/book_reviews.csv"),
                              col_types = cols())
```
# Getting familiar with the data
*It's easy to lose context when we're just talking about data analysis in general. The first thing we should do before we do any analysis is to get acquainted with our dataset. There are many, many things to check with a dataset before we dive into any analysis. How much data is there? What kind of data do we actually have on hand? Is there anything "weird" that might interfere with any analyses we might need to do? Is there missing data? Answering these questions now saves us time and effort later.*
*If we don't check the data beforehand, it's easy to make some false assumptions about the data that can hinder our progress later. Maybe we think that one column looks like a number, but it's actually been read in as a string. Perhaps some things were misspelled. In any case, getting familiar with the data is our first step in the data analysis workflow. Here are a few helpful questions to start out any data analysis. Answer each of these questions for the sales dataset. (instructions from the course are always in italic)*
## Instructions
### 1. How big is the dataset?
::: {.instruction-notes}
- Recall we can use the [`dim()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/dim.html) to see how many rows and columns are in a tibble.
- You should take a note of how many columns there are and how many rows there are.
- Write some notes to ourselves to keep track of what we see in the dataset.
:::
::: {.solution-text}
There are **`r dim(reviews_df)[1]` rows** and **`r dim(reviews_df)[2]` columns/variables** in the data
:::
### 2. What are the column names?
::: {.instruction-notes}
- Recall we can use the [`colnames()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/colnames.html) to return each of the column names of a tibble in a vector.\
- What do each of the columns seem to represent?
:::
::: {.solution-text}
The names of the columns are: **`r colnames(reviews_df)`**
:::
### 3. What are the *types* of each of the columns? {.tabset}
::: {.instruction-notes}
- Sometimes you may find that data will look one way, but is actually disguised as another. As mentioned above, a common example of this is numbers that are actually strings in the data!\
- You can use the [`typeof()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/typeof.html) to learn what the type is for each column.\
- This would be a good application of a `for` loop. Since we have the column names from `colnames()`, we can loop through these and check the types of each of the columns. Recall that you can access a column in a tibble with double bracket syntax `tibble_data[["column_name"]]` where `tibble_data` is a tibble and `column_name` is a string with the column name. Combining this with `typeof()` will give you back the type of the column itself.
:::
#### for loop + print
Here, the `class.output` option does not work because we set the chunk results to `'asis'`; in the output, the code would appear inside the green box rather than above it. As a workaround, the code chunk appears twice: once unevaluated (`eval = FALSE`) so that the code is shown, and once evaluated but hidden (`echo = FALSE`) so that only the output is shown.
Helpful here: the [`results = 'asis'`](https://bookdown.org/yihui/rmarkdown-cookbook/results-asis.html) chunk option.
```{r eval=FALSE, results='asis'}
for (var in seq_along(reviews_df)) {
var_type <- typeof(reviews_df[[var]])
names(var_type) <- names(reviews_df[var])
var_type %>% glue_data("- `{names(.)}`: {.}") %>% print()
# var_type %>% cat(labels = paste0("- `", names(.), "`: "),
# fill = TRUE,
# append = TRUE)
}
```
::: {.solution-text}
**Variables and their unique types:**
```{r echo=FALSE, results='asis'}
for (var in seq_along(reviews_df)) {
var_type <- typeof(reviews_df[[var]])
names(var_type) <- names(reviews_df[var])
var_type %>% glue_data("- `{names(.)}`: {.}") %>% print()
}
```
:::
#### map
```{r}
reviews_df %>%
map_chr(typeof) %>%
as_tibble(rownames = "column") %>%
flextable() %>%
theme_alafoli()
# reviews_df %>% map_df(typeof) %>%
# pivot_longer(cols = everything(),
# names_to = "column",
# values_to = "type")
```
### 4. What are the unique values present in each of the columns? {.tabset}
::: {.instruction-notes}
- If we're dealing with numbers, it's good to get a sense of how high and how low the values go. If we're dealing with strings, it's good to see all of the different possible values that are there.
:::
- We want to print the columns and all the unique values per column
- We can use glue, which makes it possible to add the "and" before the last element
#### for loop + print
```{r eval = FALSE}
for (var in seq_along(reviews_df)) {
  unique_values <- reviews_df[[var]] %>%
    unique() %>%
    na.omit() %>%
    glue_collapse(sep = ", ", last = ", and ")
names(unique_values) <- names(reviews_df[var])
unique_values %>%
glue_data("- `{names(.)}`: {.}") %>% print()
# unique_values %>% cat(labels = paste0("- `", names(.), "`: "),
# fill = TRUE,
# append = TRUE)
}
```
::: {.solution-text}
**Variables and their unique values:**
```{r results = 'asis', echo = FALSE}
for (var in seq_along(reviews_df)) {
unique_values <- reviews_df[[var]] %>%
unique() %>%
na.omit() %>%
glue_collapse(sep = ", ", last = ", and ")
names(unique_values) <- names(reviews_df[var])
unique_values %>%
glue_data("- `{names(.)}`: {.}") %>% print()
}
```
:::
#### predefined for loop
[how to set names to a list in loop](https://stackoverflow.com/questions/24659680/naming-list-items-via-loop-in-r)
```{r results='asis'}
list_unique_values <- vector(mode = "list", length = ncol(reviews_df))
names(list_unique_values) <- colnames(reviews_df)
for (var in seq_along(reviews_df)) {
list_unique_values[[var]] <- reviews_df[[var]] %>% unique() %>% na.omit()
# don't like this line to get rid of NA values
# list_unique_values[[var]] <- list_unique_values[[var]][!is.na(list_unique_values[[var]])]
}
list_unique_values %>%
map(~glue_collapse(.x, sep = ", ", last = ", and ")) %>%
glue_data("- `{names(.)}`: {.}")
```
#### iwalk
```{r results='asis'}
cat_description <- function(myvector, vec_names){
  cat("- `", vec_names, "`: ",
      paste(unique(na.omit(myvector)), collapse = ", "), " \n", sep = "")
}
reviews_df %>% iwalk(cat_description)
```
#### map
Approach 1 is to print a tibble
```{r}
reviews_df %>%
map_chr(~glue_collapse(unique(.x) %>% na.omit(), sep = ", ", last = ", and ")) %>%
as_tibble(rownames = "column") %>%
flextable() %>%
theme_alafoli() %>%
autofit()
```
Approach 2 is to use glue_data to print each row in the format we specify
```{r results = 'asis'}
reviews_df %>%
map(~glue_collapse(unique(.x) %>% na.omit(), sep = ", ", last = ", and ")) %>%
glue_data("- Column `{names(.)}` unique values: {.}.")
```
# Handling Missing Data
*Now that we are more familiar with the data itself, we can get into the more specific details of data analysis. A large part of a data analyst's job is to take a raw dataset and turn it into a form that we can use for analysis. Many times, we will not be able to just take a dataset and start analyzing it. It's good practice to examine the data beforehand and make note of any changes we need to make for it. This process has many names, but for future reference, we'll call it data cleaning or data processing.*
*The first issue we will contend with is the issue of missing data. Missing data is annoying because there's nothing we can really do with it. We can't perform any analysis or calculations with missing data. In R, missing data is typically shown with NA, which stands for "not available". Some other datasets may convey non-existence in a different way, but NA is the most common.*
*There are two ways that we can deal with missing data: 1) remove any rows or columns that have missing data (typically, rows) or 2) fill in the missing data in an informed, disciplined way. This second way is known as imputation, and it's outside of the scope of what we've learned so far. For now, we'll take the first approach with this dataset.*
## Instructions
### 1. Examine the data and get an understanding of which columns have data missing. {.tabset}
::: {.instruction-notes}
- Since we're going to check through each of the columns, a `for` loop would be useful in this case.
:::
#### for loop + kable
::: {.solution-text}
```{r}
n_missings <- vector(mode = "integer", length = ncol(reviews_df))
names(n_missings) <- colnames(reviews_df)
for (var in seq_along(reviews_df)) {
n_missings[var] <- sum(is.na(reviews_df[[var]]))
}
n_missings %>%
kableExtra::kbl(col.names = "Number of missing values") %>%
kableExtra::kable_minimal(full_width = F)
```
:::
#### map+kable
::: {.solution-text}
```{r}
reviews_df %>%
map_lgl(~any(is.na(.x))) %>%
kableExtra::kbl(col.names = "Is Missing?") %>%
kableExtra::kable_minimal(full_width = F)
```
```{r}
reviews_df %>%
map_int(~sum(is.na(.x))) %>%
kableExtra::kbl(col.names = "Number of missing values") %>%
kableExtra::kable_minimal(full_width = F)
```
:::
### 2. Subset Data
*Create a new copy of the dataset that removes all of the rows that have missing data.*
::: {.instruction-notes}
- Here, you should try to apply the `filter()` and `is.na()` functions to get to this new dataset without missing values.
:::
In the previous section, we found that there are missings only in the review column. Instead of dropping those rows, we can add a flag that indicates whether a row has missings or not. This allows us to use all of the information for questions that do not involve the reviews, and also to compare which books do not have reviews (and why).
```{r}
reviews_meta <- reviews_df %>%
rowid_to_column(var = "id") %>%
mutate(is_complete = rowSums(is.na(.)) == 0)
```
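For reference, the straight drop-the-rows version that the instructions ask for would be a one-liner with `filter()` and `is.na()`. This is a sketch, assuming (as found above) that `review` is the only column with missing values:

```r
# Alternative to the flag: drop every row with a missing value
reviews_complete <- reviews_df %>%
  filter(!is.na(review))   # review is the only column with missings here

# A column-agnostic equivalent:
# reviews_complete <- reviews_df %>% tidyr::drop_na()
```

We stick with the `is_complete` flag instead, so that rows without reviews can still contribute to questions that don't involve the review column.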
### 3. Inspect Data
*Make a note to ourselves on the dimensions of the new dataset. How much data was removed by taking out the rows with missing data?*
::: {.instruction-notes}
- There's no explicitly correct answer for this, but do we think that removing that much data will affect the findings of any calculations we end up doing? It's important to keep questions like this in mind as we clean our data.
:::
Dropping the incomplete rows would remove 206 rows:
::: {.solution-text}
```{r}
reviews_meta %>% tabyl(is_complete)
```
:::
<br>
- Is there any book with an exceptionally high percentage of missings? *--No*
::: {.solution-text}
```{r}
reviews_meta %>%
tabyl(book, is_complete) %>%
adorn_percentages() %>%
adorn_rounding(2) %>%
adorn_title()
```
:::
<br>
- Is there any state with an exceptionally high percentage of missings? *--No*
```{r, results='asis'}
reviews_meta %>%
select(-review) %>%
dplyr::mutate_if(is.numeric, as.character) %>%
pivot_longer(-c(id, is_complete)) %>%
tabyl(value, is_complete, name, show_missing_levels = FALSE) %>%
adorn_percentages() %>%
adorn_rounding(2) %>%
adorn_title() %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, font_size = 14, position = "left")
```
*We can also calculate summary statistics such as the average, minimum and maximum proportion of complete rows per variable value. e.g.: are there differences in the proportion of missings within the categories of the `books` column (aka for different books)?*
In the table below we can see that the missing data is not more prevalent in some books, some states or price categories than in others.
```{r}
reviews_meta %>%
select(-review) %>%
mutate_if(is.numeric, as.character) %>%
pivot_longer(-c(id, is_complete), names_to = "column") %>%
group_by(column, value) %>%
summarise(prop_complete = round((sum(is_complete) / n()), 3),
n = n()) %>%
summarise(avg_prop_complete = round(mean(prop_complete), 3),
min = min(prop_complete),
max = max(prop_complete)
) %>%
flextable() %>%
theme_alafoli() %>%
autofit()
```
# Dealing With Inconsistent Labels
*Now that we've removed all of the missing data from the dataset, we have a complete dataset. This is the ideal case in which we would like to start any data analysis, so we're already working towards a better dataset.*
*The next thing that we need to work on is the state column. You may have noticed that the labeling for each state is inconsistent. For example, California is written as both "California" and "CA". Both "California" and "CA" refer to the same place in the United States, so we should try to clean this up. We need to choose one of the ways to refer to the state, and stick to that convention. Making labels/strings more consistent in the data will make things easier to analyze later on, so we'll handle this issue with this screen.*
*If we're unfamiliar with the shortened postal codes of the states, we can refer to this helpful guide here. It would be helpful to write down the relevant states in our RMarkdown for reference later if we forget, so we don't have to refer back to the site itself.*
## Instructions
### 1. What are all the states that are present in the dataset?
::: {.instruction-notes}
- Look back to what we've logged about the state column to see what we have here.
:::
```{r}
reviews_meta %>% tabyl(state) %>%
adorn_pct_formatting() %>%
flextable() %>%
theme_alafoli()
```
### 2. With this information in mind, choose a convention to stick to:
::: {.instruction-notes}
- either use the full name like "California" or the postal code like "CA".
- Once we decide on the convention, create a new column using the mutate() function based on the state column that makes the labeling more consistent.
:::
#### transform state
::: {.solution-text}
We want it both ways :) One short column that stores the information as the state's postal code, `state_code` (useful for joining data from other sources), and one column that stores the information in an informative, readable way for tables and figures, `state_name`.
```{r}
reviews_state <- reviews_meta %>%
mutate(state_code = case_when(state == "California" ~ "CA",
state == "Florida" ~ "FL",
state == "New York" ~ "NY",
state == "Texas" ~ "TX",
TRUE ~ state),
state_code = as.factor(state_code),
state_name = case_when(state == "CA" ~ "California",
state == "FL" ~ "Florida",
state == "NY" ~ "New York",
state == "TX" ~ "Texas",
TRUE ~ state),
state_name = as.factor(state_name)) %>%
select(id, state_name, state_code)
reviews_state %>%
tabyl(state_code, state_name) %>%
flextable() %>%
theme_alafoli()
```
:::
# Transforming The Review Data
## Instructions
*The first things we'll handle in the dataset are the reviews themselves. You may have noticed in our data exploration that the reviews take the form of strings, ranging from "Poor" to "Excellent". Our goal is to evaluate the ratings of each of the textbooks, but there's not much we can do with text versions of the review scores. It would be better if we were to convert the reviews into a numerical form.*
### 1. transform review {.tabset}
::: {.instruction-notes}
#### review_num
- Using the `mutate()` function, create a new column in the dataset called `review_num`. It should take the original `review` column and convert it into numerical form. The column should be coded as follows:
    - "Poor" should receive a numerical score of 1
    - "Fair" should receive a numerical score of 2
    - "Good" should receive a numerical score of 3
    - "Great" should receive a numerical score of 4
    - "Excellent" should receive a numerical score of 5
- The `case_when()` function might be useful here since we know how each of the reviews should be reclassified into numbers.
#### is_high_review
It would also be helpful to have another column that helps us decide if a score is "high" or not.\
- For the sake of this exercise, let's decide that a score of 4 or higher qualifies as a "high" score.\
- Create a new column in the dataset called `is_high_review` that denotes whether or not the review has a high score. In other words, it should be TRUE if `review_num` is 4 or higher, and FALSE otherwise.
:::
```{r}
reviews_review <- reviews_meta %>%
mutate(review_num = case_when(review == "Poor" ~ 1,
review == "Fair" ~ 2,
review == "Good" ~ 3,
review == "Great" ~ 4,
review == "Excellent" ~ 5,
is.na(review) ~ NA_real_,
TRUE ~ 99),
is_high_review = (review_num >= 4),
) %>%
select(id, review, review_num, is_high_review)
reviews_review %>%
tabyl(review_num, is_high_review) %>%
flextable() %>%
add_header_row(
values = c("", "Review Classification High:"),
colwidths = c(1, 3)
) %>%
theme_alafoli() %>%
align(part = "header", i = 1, align = "center")
```
Once we are done cleaning the variables of interest, we assemble them into one dataset:
```{r}
reviews_clean <- reviews_meta %>%
select(id, is_complete, book, price) %>%
left_join(reviews_state, by = "id") %>%
left_join(reviews_review, by = "id")
```
# Analyzing The Data
## Instructions
*It's important to keep the overall goal in mind as we handle all the little details of the cleaning. We are acting as an analyst trying to figure out which books are the most profitable for the company. The initial data wasn't in a form that was ready for analysis, so we needed to do this cleaning to prepare it. A lot of analysts starting their first jobs believe that the analysis part of their job will be the bulk of their work. To the contrary, a lot of our work will focus on data cleaning itself, while by comparison the data analysis might only take a few lines.*
*With all of our data cleaning, now we're ready to do some analysis of the data. Our main goal is to figure out what book is the most profitable. How will we judge what the "most profitable" book is though? Our dataset represents customer purchases. One way to define "most profitable" might be to just choose the book that's purchased the most. Another way to define it would be to see how much money each book generates overall.*
### 1. define most_profitable
Choose a metric by which we will define the "most profitable" book.
::: {.instruction-notes}
- Whichever way we choose, we should write down a few notes to ourself to justify our decision and make it clear which method we chose.
- One definition of profitable may use the price column, so we can see how much money a book generated. We may also prefer to count the number of books purchased since it could be interpreted as how popular the book is.
:::
::: {.solution-text}
- `profit_rank` We will rank books based on the profit they generated. We measure profit as the total of all book prices.
- `purchased_rank` We will rank books based on purchases. We measure purchases as the number of times a book was sold.
- `review_rank` We will rank books based on reviews. We measure ratings as the average review for one book. We also consider the share of high reviews as a potential indicator.
- `overall_rank` From all measures we will build an overall rank. Here, profit is weighted double for two reasons. First, a business ultimately needs to generate profits; second, the variance across books is highest in the profit ranking, and we want this to be reflected in the final rank. In the profit ranking, rank one generates more than three times the total profit of rank five, while in the review ranking rank one is much closer to rank five (3.05 vs 2.83, and 36.6 vs 31.2). We could generate z-scores to make the different rankings comparable, but we decided to keep it simple.
:::
### 2. Summarize Data
*For each book, calculate the chosen metric that we chose to measure "most profitable" book from the data.*
```{r}
profitable_book <- reviews_clean %>%
group_by(book) %>%
summarise(
n_purchased = n(),
avg_price = mean(price, na.rm = TRUE),
total_profit = round(sum(price, na.rm = TRUE)),
avg_review = round(mean(review_num, na.rm = TRUE),2),
n_high_review = sum(is_high_review, na.rm = TRUE),
n_complete = sum(is_complete),
pct_high_review = round(n_high_review / n_complete * 100, 2)
) %>%
mutate(prop_complete = n_complete/n_purchased,
purchased_rank = rank(-n_purchased),
profit_rank = rank(-total_profit),
review_rank = rank(-avg_review)) %>%
arrange(-total_profit) %>%
mutate(overall_rank = ((purchased_rank + profit_rank * 2 + review_rank) / 4),
overall_rank = rank(overall_rank))
profitable_book_by_state <- reviews_clean %>%
group_by(book, state_name) %>%
summarise(
n_purchased = n(),
avg_price = mean(price, na.rm = TRUE),
total_profit = round(sum(price, na.rm = TRUE)),
avg_review = round(mean(review_num, na.rm = TRUE),2),
n_high_review = sum(is_high_review, na.rm = TRUE),
pct_high_review = round(n_high_review / n() * 100, 2)
) %>%
mutate(purchased_rank = rank(-n_purchased),
profit_rank = rank(-total_profit),
review_rank = rank(-avg_review)) %>%
arrange(-total_profit) %>%
mutate(overall_rank = ((purchased_rank + profit_rank * 2 + review_rank) / 4),
overall_rank = rank(overall_rank))
```
### 3. Communicate Findings
*Investigate the results of our analysis and write out some notes about what books are the most profitable.*
```{r include = FALSE}
library(ggstatsplot)
plot_data <- reviews_clean %>%
filter(is_complete) %>%
mutate(book = fct_reorder(book, price)) %>%
group_by(book) %>%
add_count() %>%
mutate(profit = n * price)
ggscatterstats(data = plot_data,
x = review_num,
y = profit,
xlab = "Review of Book",
ylab = "Profit",
title = "Understanding Book Sales and reviews"
)
```
# Reporting the results
*After performing the cleaning and analysis, we might think that that's the end. Throughout the guided project, we've advised that we write some notes to ourselves about exploring the data, cleaning it and analyzing it. Remember that we are performing this analysis for the benefit of the company that hired us, so they'll most likely be expecting some form of report from us. They're not going to do the analysis themselves, so we'll need to be able to explain everything that we did in our report and our findings.*
*Writing a report or creating some polished product after the analysis is important because we need to be able to communicate our findings to others. This is especially important when the people reading our analysis are not programmers themselves and will not be able to understand the code that we wrote for it. Take some time to organize our notes into a small report. This report doesn't need to be a long essay! It only needs to communicate the answer to our company's question: "What's our most profitable book?" There are several sub-questions that might need to be answered along with this question like, "How do we know it's the most profitable?" or "How did we calculate our measure for profitability?" Your report is just trying to demonstrate that we have an answer and that we gave some thought to how we got to this answer.*
*Aside from our hypothetical situation, writing a good report can help show future employers that we are able to think both programmatically and communicate well. For any future analysis we might do, say in another Guided Project, having a polished final product that can be easily read will always be looked upon favorably.*
## Instructions
### 1. Structure
*A good, common way to structure a report is into three parts: Introduction, Findings, Conclusion*
### 2. Summarize the notes we've made into polished paragraphs for each section. As a guideline, we can try to answer each of the questions below for each section:
::: {.instruction-notes}
- Introduction: What motivated our analysis? What kind of data do we have? What is the main question we're trying to answer?
- Findings: What did we need to do to the data to do our analysis? What things are we calculating to answer our main question?
- Conclusion: What is the answer to our main question? Was there anything that we feel limits our analysis? What should the reader do with our findings?
:::
# Report
## Introduction
In this analysis we compared the sales of books to answer the question: "What's our most profitable book?". We defined three criteria for a profitable book.
A book is profitable if
1. it generates high profit margins (profit)\
2. it gets good customer review ratings (quality)\
3. it sells many times (quantity)
Based on those three metrics we ranked the books, with the profit dimension weighted double for two reasons. First, generated profit is the most important metric, since a business needs to generate profits; second, the variance between books is highest in the profit metric, meaning this is the dimension with the largest differences between books.
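Concretely, the weighting is a weighted average of the three rank columns with profit counted twice, as in the analysis chunk above. A sketch with toy numbers (not the real data):

```r
# hypothetical per-book ranks for three books
purchased_rank <- c(1, 2, 3)
profit_rank    <- c(2, 1, 3)
review_rank    <- c(3, 1, 2)

# profit counts twice in the weighted average, then the averages are re-ranked
overall_rank <- rank((purchased_rank + profit_rank * 2 + review_rank) / 4)
```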
## Findings
As a result of our analysis, we find that the book "Fundamentals of R For Beginners" is the most profitable book in terms of our metrics profit, quality and quantity.
```{r}
profitable_book %>%
select(book, contains("rank")) %>%
arrange(overall_rank) %>%
flextable() %>%
set_header_labels(
values = list(book = "Book",
purchased_rank = "Purchase Volume",
profit_rank = "Profit",
review_rank = "Popularity",
overall_rank = "Overall")
) %>%
add_header_row(
values = c("", "Ranking"),
colwidths = c(1, 4)
) %>%
theme_alafoli() %>%
autofit() %>%
align(part = "header", i = 1, align = "center") %>%
bold(j = "overall_rank", bold = TRUE, part = "all") %>%
vline(j = "book", part = "body") %>%
bg(i = 1, bg = "#fceef5")
```
It generated the second-highest profit and the second-highest customer review ratings, and ranked first with regard to purchase volume.
```{r}
profitable_book %>%
arrange(overall_rank) %>%
select(book,
n_purchased,
total_profit,
pct_high_review,
avg_price) %>%
flextable() %>%
set_header_labels(
values = list(book = "Book",
n_purchased = "Purchase Volume",
total_profit = "Profit",
pct_high_review = "Share of high reviews",
avg_price = "Price")
) %>%
theme_alafoli() %>%
autofit() %>%
vline(j = "book", part = "body") %>%
bg(i = 1, bg = "#fceef5")
```
### Comparing the overall winner to the most profitable book
The book that stands at the top in terms of our overall ratings is not the book that generated the highest profits. Why are we considering other measures as well? We are considering customer reviews, since we assume that having a book with a high percentage of good consumer reviews and many copies sold is more profitable in the long run, since happy customers are more likely to come back and purchase other books as well.
"Fundamentals of R For Beginners" (the top-ranked book) generated `r profitable_book %>% filter(book == 'Fundamentals of R For Beginners') %>% pull(total_profit) %>% scales::dollar(suffix = "$", prefix = "")` compared to the book with the highest profit, "Secrets Of R For Advanced Students", which generated `r profitable_book %>% filter(book == 'Secrets Of R For Advanced Students') %>% pull(total_profit) %>% scales::dollar(suffix = "$", prefix = "")`. This difference arises because both books sold about the same number of copies, while "Fundamentals of R For Beginners" is in a cheaper price range at 39.99\$ compared to 50.00\$ for "Secrets Of R For Advanced Students". And while higher-priced books generate more profit for every copy sold, not all books in the store should be expensive. The book store benefits from a diverse set of price ranges, because it needs to compete with the prices of other book stores and also wants to offer books to a broad range of customers with different income levels. This is supported by the fact that the book with the second-highest number of purchases, "R for Dummies", is in the lowest price range at 15.99\$.
### What about local differences? Are there books that are big sellers only in one state?
To go one step further, we analyzed whether the profits generated by the books show large differences between the different states. The following figure helps to see which books contributed most to the overall profits within one state and whether this share differs across states.
We can see that the books seem to sell equally well across all locations and their share of the overall generated profits remains constant across states.
```{r, out.width="80%", message = FALSE}
profitable_book_by_state %>%
bind_rows(profitable_book %>%
mutate(state_name = "All States")
) %>%
group_by(state_name, book) %>%
summarise(total_profit_book = sum(total_profit),
total_sales_book = sum(n_purchased)) %>%
dplyr::add_count(state_name, wt = total_profit_book, name = "total_profit_state") %>%
mutate(prop = total_profit_book / total_profit_state,
book = str_wrap(book, 20),
highlight = state_name == "All States") %>%
mutate(highest_prop = prop == max(prop)) %>%
ungroup() %>%
ggplot(
aes(x = fct_reorder(book, prop),
y = prop,
fill = highest_prop)
) +
geom_col(width = 0.1) +
geom_point() +
gghighlight::gghighlight(
max(prop),
max_highlight = 1L,
use_direct_label = FALSE,
calculate_per_facet = TRUE) +
coord_flip() +
theme_minimal(base_size = 12) +
#scale_y_continuous(labels = scales::percent) +
scale_y_continuous(breaks = c(0, 0.15, 0.30),
labels = scales::percent) +
scale_fill_manual(values = c("#5ab4ac", "#5ab4ac20")) +
facet_wrap(~ state_name, nrow = 1) +
labs(title = "Books show similar shares of profit across states",
subtitle = "Composition of profits by book and states, highest share is highlighted",
x = "",
y = "Share of total profit") +
theme(legend.position = "none",
#legend.justification = "left",
plot.title.position = "plot",
axis.title.x = element_text(colour = "grey20", size = 10),
axis.text.x = element_text(colour = "grey"),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank())
```
## Conclusion
In summary, we found that while "Secrets Of R For Advanced Students" is the book that generates the highest profits overall and in each state, there are good reasons to regard "Fundamentals of R For Beginners" as the most profitable book: it generated high profits and also ranked at the top in terms of copies sold and customer satisfaction.
# Further Steps
Beyond the data that is available right now, it could be interesting to see how the prices and purchase numbers of individual books changed over time. Analyses like these could provide insight into what prices the book store's customers can afford and what the relationship between price, popularity, and number of copies sold looks like. One could also approach this analysis in a more qualitative way and talk to customers to find out what made them come to this specific store, or what their different reasons for buying a book are. Based on the information from customers, we might find new ways to help them find the books they want, or help them find books for their friends or relatives. Making sure that su