-
Notifications
You must be signed in to change notification settings - Fork 0
/
report.Rmd
565 lines (441 loc) · 22.8 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
---
output:
pdf_document:
number_sections: true
header-includes:
- \hypersetup{colorlinks = false,pdfborder={1 1 1}}
- \usepackage{float}
---
<!-- Setup -------------------------------------------------------------------->
```{r setup, include=FALSE}
options(scipen=999)
# RMarkdown settings
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(fig.width=10, fig.height=5, fig.align="center")
knitr::opts_chunk$set(fig.pos = "H", out.extra = "")
# required libraries
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(knitr)
library(caret)
library(randomForest)
# load data from r-script
load("rmd-input.RData")
```
<!-- Title Page + TOC --------------------------------------------------------->
\newpage
\vspace*{4cm}
\begin{center}
{\Large Data Science: Capstone \\ (HarvardX PH125.9x)}
\end{center}
\vspace*{1.5cm}
\begin{center}
\thispagestyle{empty}
{\LARGE \bf Capstone Project II}\\[1.5cm]
{\Huge Classic Trader }\\[1.5cm]
{\bf \Large Predict prices of classic and exclusive cars}\\[1cm]
{\Large February 2022}\\[1.5cm]
{\large submitted by Bela Koch} \\[2.5cm]
\end{center}
\newpage
\tableofcontents
\newpage
<!-- Report ------------------------------------------------------------------->
# Introduction
The goal of this exploratory machine learning project is, to understand the key
drivers of the prices of classic and exclusive cars and to predict the prices
using machine learning techniques. In a first step, the data used in the project
is described and analysed. Then, the relevant metrics are described, the process
of training the models is addressed and the results of the algorithms with the
test data and the validation data are shown. Finally, possible weaknesses and
limitations are discussed and suggestions for improvement are made.
\newpage
# The data
## Data Source
The data used in this project was scraped from [classic-trader.com](https://www.classic-trader.com/) -
a website which describes itself as the international marketplace for classic
vehicles. All advertisements currently being placed were downloaded. If you intend
to run the scraping process, please note that this will take some time.
Alternatively (and as default in the corresponding script), the already scraped
data is downloaded via the GitHub repository associated with this project.
During the processing of this project, about 11 thousand observations could be
obtained by scraping classic-trader.com. After data cleansing, about 8 thousand
observations were available for the analysis.
## Overview of the data
### Variables considered
Variable | Type | Description
------------ | ------- | -------------------------------------------------------
ID | integer | Identification number of vehicle from website
manufacturer | Factor | Manufacturer of vehicle (factorized)
model | Factor | Model of vehicle (general, e.g. 3 Series; factorized)
model_name | Factor | Model name of vehicle (more specific, e.g. 318is; factorized)
year | Factor | Year of manufacture (factorized)
decade | Factor | Decade of manufacture (factorized)
mileage | integer | Mileage, all converted to kilometers
body_style | Factor | Body style of vehicle, e.g. sedan (factorized)
color | Factor | Color of exterior body (factorized)
power | integer | Metric to indicate power produced by engine, measured in kilowatts
ccm | integer | Measure of engine displacement, measured in cubic centimeter
cylinders | integer | Number of cylinders
steering | Factor | Placement of steering position, e.g. lefthand drive (factorized)
transmission | Factor | Type of transmission (factorized)
drive | Factor | Type of drivetrain, e.g. rear-wheel drive (factorized)
fuel | Factor | Fuel type (factorized)
condition | Factor | Condition of vehicle (predefined classes, factorized)
price | numeric | Advertised price measured in euro
### Missing data
The following table shows the missing data. As we can see, the variables mileage,
color, condition and price contain many missing values.
```{r nas}
missing_data %>%
mutate(
na_perc = paste0(round(na_perc*100, 2), "%")
) %>%
rename(`NAs absolute` = na_count) %>%
rename(`NAs relative` = na_perc) %>%
kable(align = c("r", "r"))
```
The missing values for
the variable price are due to the fact that the vehicles can be advertised with
"price on request". For the variable "condition", the category "original" is
assumed in case of missing values, which describes the condition of the car as
"unchanged, little signs of usage". After filling the missing values of
"condition", the remaining observations which contain missing values were dropped.
## Exploratory Data Analysis
### Price
The distribution of the variable to be estimated - the price - is analysed below.
By means of visualisation of the distribution, a right skewed distribution of
prices can be detected. Further checks are waived, since the right skewness is
clearly recognisable and is typical for variables with a natural lower limit
(i.e. zero) and a theoretical infinite upper limit.
```{r distr_prices, fig.cap="Distribution of prices", fig.width=10, fig.height=5}
par(mfrow = c(1,2))
hist(development_set_raw$price, probability = T, breaks = 25,
ylim = c(0, max(density(development_set_raw$price)$y)),
main = "", xlab = "Price")
lines(density(development_set_raw$price))
max_filter <- 500000
hist(development_set_raw$price[development_set_raw$price < max_filter], probability = T, breaks = 25,
ylim = c(0, max(density(development_set_raw$price[development_set_raw$price < max_filter])$y)),
main = "", xlab = "Price (zoomed)")
lines(density(development_set_raw$price[development_set_raw$price < max_filter]))
```
While this is, for example, not a problem for decision trees, linear models
(among others) are sensitive to outliers and skewed data. For this reason, a log
transformation was applied to approximate the distribution of prices to the
normal distribution. The graph below shows that the distribution of prices after
applying the transformation indeed is more similar - although not perfect - to
the normal distribution.
```{r distr_log_prices, fig.cap="Distribution of prices (logarithmically transfomated)"}
hist(log(development_set_raw$price), probability = T, breaks = 25,
main = "", xlab = "log(Price)")
lines(density(log(development_set_raw$price)))
```
### Manufacturer
The following graph shows the 15 manufacturers from which the most vehicles are
advertised on classic-trader.com.
```{r manufac, fig.cap="Top 15 manufacturers in dataset", fig.height=6}
manufac <- table(development_set_raw$manufacturer) %>%
as.data.frame() %>%
arrange(desc(Freq))
manufac_top <- manufac[1:15,]$Freq
names(manufac_top) <- manufac[1:15,]$Var1
manufac_top <- sort(manufac_top)
par(mar = c(5.1, 7, 4.1, 2.1))
barplot(manufac_top, horiz = T, las=1, xlab="Number of advertisments")
```
As can already be expected, the manufacturer of a vehicle has an influence on
the price of the corresponding vehicle. Thus, different manufacturers produce
vehicles of different classes and thus also of different price ranges. The
distribution of prices thus differs between the different manufacturers, which
can be seen in the following graph of the ten most represented manufacturers
in the data set.
```{r price_by_manufac, fig.cap="Prices by manufacturers", fig.height=6}
manufac_top_boxplot <- development_set_raw %>%
filter(manufacturer %in% manufac[1:10,]$Var1)
boxplot(manufac_top_boxplot$price~as.character(manufac_top_boxplot$manufacturer),
outline=F, xlab="Manufacturer (Top 10)", ylab="Price", cex.axis=0.75)
```
### Mileage
In the used car market, the mileage of a car is an important indicator to determine
the value and the condition of a vehicle. Although it is not possible to draw direct
conclusions about the condition and value of the vehicle, it nevertheless provides
a tendency in this respect. In formal terms, correlation but not necessarily
causality is expected between condition/value and mileage of a vehicle.
The graph below shows the relationship between the price and the mileage of the
cars advertised on classic-trader.com. Note that the data for creating the
following plot was filtered to be under 1 million Euros so that the relationship
is more clearly visible.
```{r plot_mileage, fig.cap="Mileage vs. Price (Price filtered $< 10^6$)"}
plot(data_raw[data_raw$price < 10^6,]$mileage, data_raw[data_raw$price < 10^6,]$price,
xlab="Mileage", ylab="Price")
```
As we can see, there is a relationship between price and mileage which could be
used as an indicator when estimating prices. Unfortunately, there are a lot of
missing values for mileage in the data set, which is why this indicator is not
considered in the analysis. For further studies however, it is recommended to
include this indicator in the models.
### Year
In order to better understand the structure of the data, the distribution of the
production years of the advertised vehicles are shown below. It is interesting to
note that major historical events are discernible in the distribution of production
years. For example, one sees a major slump in the period of World War 2 (1939-1945)
and during the oil crisis of 1973. This can be relevant for this study insofar as
car manufacturers are affected by what is happening in the world and in the global
economy and therefore are factoring this into their business decisions when
producing and developing new cars.
```{r year, fig.cap="Year of manufacture"}
barplot(table(development_set_raw$year), ylab="Count")
```
### Power and engine displacement
Since both variables are integer, the distributions are analysed. In both cases,
a tendency towards right skewness can be observed, although not as clearly as in
the distribution of prices as seen before. The best result when trying to make
the distribution of the variables conform to the normal distribution was obtained
with the box cox transformation. Although this transformation does not provide
perfect results, an improvement in the performance of the models was nevertheless
observed.
```{r power_ccm_plots, fig.cap="Distribution of Power and ccm", fig.height=7.5}
par(mfrow = c(2,2))
# plot 1
dens <- density(development_set_raw$power)
hist(development_set_raw$power, probability = T, breaks = 25, ylim = c(0, max(dens$y)),
main="Power (no transformation)", xlab="Power")
lines(dens)
# plot 2
hist((development_set_raw$power^lambda_power - 1)/lambda_power, probability = T, breaks = 25,
main="Power (box cox transformation)", xlab="Power")
lines(density((development_set_raw$power^lambda_power - 1)/lambda_power))
# plot 3
dens <- density(development_set_raw$ccm)
hist(development_set_raw$ccm, probability = T, breaks = 25, ylim = c(0, max(dens$y)),
main="ccm (no transformation)", xlab="ccm")
lines(dens)
# plot 4
hist((development_set_raw$ccm^lambda_ccm - 1)/lambda_ccm, probability = T, breaks = 25,
main="ccm (box cox transformation)", xlab="ccm")
lines(density((development_set_raw$ccm^lambda_ccm - 1)/lambda_ccm))
```
### One-hot encoding
A large part of the variables considered in this project correspond to categorical
variables. Since some algorithms cannot work with categorical data directly,
the data was also encoded using the one-hot encoding method, since the categories
have no ordinal relationship.
\newpage
# The model
## Performance Metrics
The performance metrics used in this project are briefly described below. Thereby.
$\hat{y}_n$ corresponds to the predicted value of the dependent variable $y_n$ for
the $n$th observation of $N$ observations.
### Root Mean Squared Error (RMSE)
Root Mean Squared Error calculates the average of the squared errors across all
samples and takes the square root of the result. The RMSE provides an error
measure in the same unit as the target variable, which makes RMSE as a performance
metric directly interpretable. In general, a lower RMSE is better than a higher.
The square avoids the
errors of cancel each other out and penalizes larger errors more than smaller ones.
$$
\text{RMSE} = \sqrt{\frac{\sum^N_{n=1} (y_n - \hat{y}_n)^2}{N}}
$$
### Mean Absolute Error (MAE)
The Mean Absolute Error calculates the absolute value of the errors (since the
direction of the error is not relevant for this study) and takes the average of
these values. This also prevents the errors of cancelling each other out. In contrast
to the RMSE, MAE does not penalize larger errors more severely. The error is also
measured in the same unit as the target variable, which allows direct interpretation.
Likewise, a lower value is better.
$$
\text{MAE} = \frac{\sum^N_{n=1} | y_n - \hat{y}_n |}{N}
$$
### Mean Absolute Percentage Error (MAPE)
The Mean Absolute Percentage Error is calculated similarly to the Mean Absolute
Error, but the error is shown as percentage, which allows to understand how off
a prediction is in relative terms. Here, too, a lower value is better.
$$
\text{MAPE} = \frac{1}{N} \sum^N_{n=1} \left| \frac{y_n - \hat{y}_n}{y_n} \right|
$$
## Developing the model
### Linear Regression
First, a linear model is estimated using all predictor variables. To evaluate
the model, the diagnostic plots are analysed.
1. Residual vs Fitted:
Since equally spread residuals around the horizontal line without distinct
patterns can be found, it is a good indication that there are no non-linear
relationships in the data that could not be captured by the linear model.
2. Normal Q-Q:
With the Normal Q-Q plot it can be determined that the distribution of the
residuals corresponds to a distribution with "heavy tails", which indicates
that the data has more extreme values than expected if they truly came from a
Normal distribution. The coefficient estimators should still be quite reasonable,
however, the prediction intervals are likely to be too short (since they do
not account for heavy tails), which is expected to worsen the performance of
the model.
3. Scale-Location:
This plot shows that the residuals are equally spread along the ranges of
predictors, which indicates that the residuals have a constant variance. If
this was not the case, less precise coefficients would be expected.
4. Residuals vs Leverage:
Since no dashed red lines are visible on the graph, which would correspond to
a Cook's distance of 0.5 and 1, none of the observations lay outside these
boundaries and therefore no action is required.
```{r lm_plots, fig.cap="Linear model", fig.height=10, warning=F}
par(mfrow=c(2,2))
plot(train_lm, 1)
plot(train_lm, 2)
plot(train_lm, 3)
plot(train_lm, 5)
```
When applying the linear model on the training set, following measures are
reached:
```{r lm_results}
results[1,] %>%
kable()
```
### Random Forest
Second, a random forest algorithm is trained. In a first step, the best value of
mtry - the number of variables randomly sampled as candidates at each split -
is sought. The best value of mtry minimized the minimum out of bag (OBB) error.
```{r rf_tune, fig.cap="Random Forest Tuning"}
plot(mtry, type = "b")
```
Next, the random forest algorithm is trained with the previously determined best
value of mtry (here: `r best.m`). In the following graph, the variable
importance as measured by a Random Forest can be seen.
```{r rf_plot, fig.cap="Variable Importance of Random Forest"}
varImpPlot(train_rf, main="Variable Importance")
```
When applying the random forest algorithm on the training set, following measures
are reached:
```{r rf_results}
results[1:2,] %>%
kable()
```
### K-nearest Neighbors
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
```{r tune_knn, fig.cap="Tune K-nearest neighbors"}
plot(train_knn, col = "black")
```
The following table shows the variable importance of the K-nearest neighbors
algorithm
```{r knn_plots, fig.cap="K-nearest neighbors"}
varImp(train_knn)$importance %>%
arrange(desc(Overall)) %>%
top_n(n = 10, wt = Overall) %>%
kable()
```
When applying the K-nearest Neighbor algorithm on the training set, following
measures are reached:
```{r knn_results}
results[1:3,] %>%
kable()
```
### Ensemble model (averaging)
Finally, the individual predictions of the different models are combined with the
calculation of the arithmetic mean to form a single prediction.
When applying the ensemble model on the training set, following measures are reached:
```{r ens_results}
results[1:5,] %>%
kable()
```
## Applying the model
As must be noted with the development set, the models unfortunately do not provide
reliable predictions of the prices. Nevertheless, the models are applied with the
validation set in order to further analyse the results.
Since the random forest algorithm outperforms the other models in both MAE and
MAPE and the ensemble model that does not include the linear model outperforms
the other models in RMSE, these two models are used with the validation set.
With the data from the validation set, following results are obtained:
```{r final_results}
tail(results, 2) %>%
kable()
```
The following plot shows the distribution of the relative errors made by the two models:
```{r plots_final, fig.cap="Distribution of relative errors"}
par(mfrow = c(1,2))
hist(analyze_results_final_rf$diff, breaks = 50, ylim = c(0, 200),
main="Random forest", xlab="Relative error")
hist(analyze_results_final_avg_2$diff, breaks = 60, ylim = c(0, 200),
main="Ensemble model", xlab="Relative error")
```
As can be seen in the table and the plot, random forest seems to make better
predictions - even if they are not really of much use. What is much more
interesting to note is that the distribution of the errors made by both models
are heavily skewed to the right. Thus, there are few estimates with very strong
deviations, which degrade the observed performance of the models very strongly.
\newpage
# Conclusion and Limitations
## Data Quantity and Quality
With around eight thousand observations, the analysis does not have very much
data available for developing the models. With an increasing number of observations,
the performance of the models could be expected to improve.
As was noticed when applying the models with the validation set, very high relative
errors are made in estimating prices for some observations, which is why some
estimates with very high deviations were investigated. It was found that some of
these (very) large errors are due to the assumption which was made when filling
the NAs in the variable "condition" (i.e. filling NAs with "original").
For example, an Austin-Healey 100/4 (BN1) was estimated to have a price of
100'546.10 euros, whereas the vehicle was advertised with a price of 15'850 euros
(this represents a relative discrepancy of 534.36 percent). At the time of
writing, the average price of this vehicle was 78'266.61 euros (calculated
across all conditions of vehicles), with some vehicles being advertised
with a price as high as 155'383 euros. The error of the estimate in relation to
the average price corresponds to 28.47 percent, i.e. significantly lower than the
534.36 percent. It was found that the observation with the very large difference
was incorrectly attributed the condition "original", although the vehicle is merely
a rusty bodyshell without an engine (remember: original condition describes the
condition of the car as unchanged with little signs of usage)^[The advertisements
can be found with the ID, for exmaple contained in the object "analyze_results_final_rf",
using the search function on classic-trader.com].
To improve the model, the assumption in filling the condition-variable has to
be reconsidered. Using the data, which can also be obtained via the web-scraping
process, one could also try, for example, to use text classification with the
description of the vehicle (unsystematic text written by the seller) to infer the
condition. With the available data, it would even be possible to attempt to
classify the condition using image recognition with the pictures of the vehicles
which are uploaded by the seller.
## Omitted Variable Bias
The prices of vehicles, especially classic and exclusive vehicles, are often
influenced by various other variables which are not taken into account in this
project. Possible examples that could significantly influence prices of a
vehicle and that were not taken into account in this project are successes in
Motorsports of the vehicle model or the cult status of a vehicle, established
for example through appearances in movies or through ownership by various
celebrities.
For example, the BMW E30 M3 (built from 1986-1991) is traded at
comparatively high prices, which is certainly partly due to the car's success in
Motorsport and the reputation it has built up as a result: it is often entitled
as the "most victorious racing car of all time" with 1'436 victories in just
1'628 days.
Another example is the VW Type 2 T1 (the "Hippie bus" built from 1950-1967), which is
traded at comparatively high prices (in some special versions in top condition,
prices reach six figures). One possible reason for the high prices is that this
vehicle is a symbol of the counterculture movement of the 1960s and stands for
cultural openness and diversity. Even today, this vehicle conveys a certain
attitude of life and stands for freedom, which is why the vehicle is also used
as a symbol in modern advertisements, for example, precisely because of this
characteristics.
For the reasons mentioned, success in Motorsport, for example, could be introduced
as a variable (e.g. number of wins as ratio of participation in races).
Emotional factors, such as the cult status, are more difficult to capture.
## Interaction Effects
In addition, some features of a vehicle may have positive effect on the price
of a vehicle in some cases and a negative effect in others. For example, the
impact of a right-hand drive is expected to be positive in the case of a Japanese
vehicle, as this is (in many cases) due to the fact that the vehicle was built
for the Japanese market, which may increase collector's value, whereas a right-hand
drive might have a negative impact on the price of a German vehicle
As another example, it would also be expected that the number of cylinders in an
engine tends to increase the value of the corresponding vehicle. However, it is
possible that for some vehicle models a smaller engine is more popular, for example
due to reliability, which means that the model with the smaller engine is priced
higher.
A possible solution to this potential source of error could be the introduction
of interaction-terms, for example between steering ($x_1$) and country of origin
of the vehicle ($x_2$):
$$
\hat{y} = \alpha + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \epsilon
$$