-
Notifications
You must be signed in to change notification settings - Fork 0
/
the_data.Rmd
276 lines (233 loc) · 16.9 KB
/
the_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
---
title: "Exhibition A: The Data"
output:
html_document:
toc: true
toc_float: true
theme: cosmo
code_folding: hide
---
## Data Source
```{r setup, include = FALSE}
library(tidyverse)
library(ggplot2)
library(plotly)
load("data/met_10.RData")
load("data/met.RData")
mypal = c("#78B7C5", "#EBCC2A", "#FF0000", "#EABE94",
"#3B9AB2", "#B40F20", "#0B775E", "#F2300F",
"#5BBCD6", "#F98400", "#ab0213", "#E2D200",
"#ff7700", "#46ACC8", "#00A08A", "#78B7C5",
"#a7ba42", "#f94f8a", "#DD8D29")
```
Since its beginning in 1870, The Metropolitan Museum (The Met) has acquired and displayed over 5,000 years worth of art work from around the world.
Thankfully for us data scientists, The Met offers open access to data of its collection. They did this to encourage interaction with the museum's collection, and to help people around the world to use the wealth of data they have stored on an impressive number of artefacts. The dataset can be found [here](https://github.com/metmuseum/openaccess), for anyone interested in running their own analyses or reproducing ours.
## Generating our Sample
The Met dataset consists of over 400,000 artistic pieces. If you think the Met is overwhelming to get through now - just imagine if all of the artworks in storage were also on display. Sadly a dataset of this size is too much even for us art aficionados, so we decided to divide and conquer (which, fittingly, is how many of the artifacts from around the world ended up in an American art museum).
A dataset of this size is far too big to handle, as any analysis would take an excessively long time to process. Given this scale, we decided that it would be wise to just work with a smaller, more manageable subset of the data.
Using the code below, we took a random sample of 10% of the data, creating our final dataset of over 40,000 observations. You can use this code to create the same subset we did from The Met's raw data file `MetOBjects.txt`:
- `set.seed(1)`
- `met_10 <- sample_n(MetOBjects, nrow(MetOBjects)*.10)`
- `save(met_10, file = "data/met_10.RData")`
Once we had our 10% sample (amounting to `r format(nrow(met_10), big.mark=",")` observations), we embarked on some data cleaning.
## Checking for Completeness
First, we looked into which variables were the most complete, based percentage of works having data recorded for that characteristic.
The following table shows the % completeness for all variables.
```{r}
met_10 <- met_10 %>%
janitor::clean_names()
data.frame(sapply(met_10, function(x) round(sum(!is.na(x))/nrow(met_10)*100,2))) %>%
magrittr::set_colnames("Completeness") |>
arrange(desc(Completeness)) |>
head(15)
```
Across all observations, `object_name` and `accession_year` were some of the most complete. `department` was 100% complete. This informed which analytic questions we could answer.
Based on the missing data table and our decision to focus on objects and years, we once again limited the dataset to those observations with complete object and year data - resulting in a final analytic dataset with `r format(nrow(met), big.mark=",")` observations and `r format(ncol(met), big.mark=",")` variables.
All in all, we chose the following variables to work with:
* `department`: Indicates The Met's curatorial department responsible for the artwork
* `accession_year`: Year the artwork was acquired
* `culture`: Information about the culture, or people from which an object was created
* `object_name`: Describes the physical type of the object
* `country`: Country where the artwork was created or found
* `subregion`: Geographic location more specific than Region, but less specific than Locale, where the artwork was created or found
* `object_begin_date`: Date indicating the year the artwork was started
* `object_end_date`: Date indicating the year the artwork was completed
* `dynasty`: Dynasty (a succession of rulers of the same line or family) under which an object was created
## Cleaning Up Categorization
With the scope of our analytic dataset decided, we now turned our attention to the actual contents of the included variables.
### Object Name
First, we noticed that the values in `object_name` were far too detailed for our purposes. For example, who needs to differentiate between a relief fragment from the _Tomb of Maketre_ and a relief fragment form the _Tomb of Nespekashuty_? If you want to, don't run the code below. But we consolidated object types, like _reliefs_, to create simpler and clearer summaries.
```{r}
met <- met %>%
mutate(object_name = ifelse(
grepl("Textile", object_name), "Textile",
ifelse(grepl("Painting", object_name), "Painting",
ifelse(grepl("Relief", object_name), "Relief",
ifelse(grepl("Print", object_name), "Print",
ifelse(grepl("aseball card", object_name), "Baseball card",
ifelse(grepl("Vase", object_name), "Vase",
ifelse(grepl("rnament", object_name), "Vase",
ifelse(grepl("arring", object_name), "Earring",
ifelse(grepl("ecklace", object_name), "Necklace",
ifelse(grepl("hotograph", object_name), "Photograph",
ifelse(grepl("tatue", object_name), "Statue",
object_name))))))))))))
```
### Dynasty
As we transported ourselves to Ancient Egypt and the Met's many excavations in the area, we had to take some additional steps to dust off the data without hurting the artifacts. `Dynasty` was a key variable for the Egyptian data; because the Egyptian Art Department is the only Department in the Met to have fairly well-identified dynasties, we felt it was important to include it. However, instead of keeping to the traditional 30 Dynasties of Egypt, many of the observations were coded as "Dynasty 1-5" or "second half 11."
To harmonize the `Dynasty` variable, we used the following strategy:
* If the intended dynasty was easily identifiable from the name, we recoded the variable to said dynasty
* If the value was of a consecutive dynasty (for example, "Dynasty 12-13"), we kept the value the same, but only if that specific value was common in the data (n>20). This was to prevent having 50+ dynasty ranges and making visualization tedious.
* If the value was of an infrequently occurring consecutive dynasty, or for larger dynastic range, we recoded the variable to "Dynasty Range"
* Any other values were coded as NA
Some other minor cleaning of the Egyptian data also occurred for object creation dates.
```{r}
met_egypt = met |>
filter(department == "Egyptian Art") |>
mutate(
dynasty = case_when(
dynasty == "11" ~ "Dynasty 11",
dynasty == "Dynasty 1, 3–4" | dynasty == "Dynasty 1 3-4" ~ "Dynasty Range",
dynasty == "Dynasty 11, late" ~ "Dynasty 11",
dynasty == "Dynasty 11?" ~ "Dynasty 11",
dynasty == "Dynasty 11–13" | dynasty == "Dynasty 11-13" ~ "Dynasty Range",
dynasty == "Dynasty 11–17" | dynasty == "Dynasty 11-17" ~ "Dynasty Range",
dynasty == "Dynasty 11-17" ~ "Dynasty Range",
dynasty == "Dynasty 11–18" ~ "Dynasty Range",
dynasty == "Dynasty 11–mid 12" | dynasty == "Dynasty 11-mid 12" ~ "Dynasty Range",
dynasty == "Dynasty 11–12" | dynasty == "Dynasty 11-12" ~ "Dynasty Range",
dynasty == "Dynasty 12, early" ~ "Dynasty 12",
dynasty == "Dynasty 12, early – mid" | dynasty == "Dynasty 12, early - mid" ~ "Dynasty 12",
dynasty == "Dynasty 12, early–mid" | dynasty == "Dynasty 12, early-mid" ~ "Dynasty 12",
dynasty == "Dynasty 12, late" ~ "Dynasty 12",
dynasty == "Dynasty 12, late – 13 up to 1700" | dynasty == "Dynasty 12, late - 13 up to 1700" ~ "Dynasty Range",
dynasty == "Dynasty 12, late-13 up to 1700 B. C." ~ "Dynasty Range",
dynasty == "Dynasty 12, late–17" ~ "Dynasty Range",
dynasty == "Dynasty 12, late-17" ~ "Dynasty Range",
dynasty == "Dynasty 12, mid" ~ "Dynasty 12",
dynasty == "Dynasty 12–13" ~ "Dynasty 12-13",
dynasty == "Dynasty 12, late–early 13" ~ "Dynasty 12-13",
dynasty == "Dynasty 12–17" ~ "Dynasty Range",
dynasty == "Dynasty 12–18" ~ "Dynasty Range",
dynasty == "Dynasty 13 to 1700 B.C." ~ "Dynasty Range",
dynasty == "Dynasty 13, mid" ~ "Dynasty 13",
dynasty == "Dynasty 13–17" ~ "Dynasty Range",
dynasty == "Dynasty 13–18, early" ~ "Dynasty Range",
dynasty == "Dynasty 13–SIP" ~ "Dynasty Range",
dynasty == "Dynasty 14–15" ~ "Dynasty Range",
dynasty == "Dynasty 15–17" ~ "Dynasty Range",
dynasty == "Dynasty 17–early Dynasty 18" ~ "Dynasty 17-18",
dynasty == "Dynasty 17–Early Dynasty 18" ~ "Dynasty 17-18",
dynasty == "Dynasty 17–18" ~ "Dynasty 17-18",
dynasty == "Dynasty 18 (?)" ~ "Dynasty 18",
dynasty == "Dynasty 18 or later (?)" ~ "Dynasty 18",
dynasty == "Dynasty 18, early" ~ "Dynasty 18",
dynasty == "Dynasty 18, late" ~ "Dynasty 18",
dynasty == "Dynasty 18, possibly later" ~ "Dynasty 18",
dynasty == "Dynasty 18, second half" ~ "Dynasty 18",
dynasty == "Dynasty 18–19" | dynasty == "Dynasty 18-19" ~ "Dynasty Range",
dynasty == "Dynasty 18–20" ~ "Dynasty Range",
dynasty == "Dynasty 19–20" ~ "Dynasty 19-20",
dynasty == "Dynasty 19–20 (Ramesside)" ~ "Dynasty 19-20",
dynasty == "Dynasty 19–20 or later (?)" ~ "Dynasty 19-20",
dynasty == "Dynasty 19–21" ~ "Dynasty Range",
dynasty == "Dynasty 19–25" ~ "Dynasty Range",
dynasty == "Dynasty 19–30" ~ "Dynasty Range",
dynasty == "Dynasty 1–2" ~ "Dynasty Range",
dynasty == "Dynasty 1-2" ~ "Dynasty Range",
dynasty == "Dynasty 2, second half" ~ "Dynasty 2",
dynasty == "Dynasty 20 (Ramesside)" ~ "Dynasty 2",
dynasty == "Dynasty 20 or later" ~ "Dynasty 20",
dynasty == "Dynasty 20–21" | dynasty == "Dynasty 20-21" ~ "Dynasty Range",
dynasty == "Dynasty 20–22" ~ "Dynasty Range",
dynasty == "Dynasty 20–26" ~ "Dynasty Range",
dynasty == "Dynasty 21 (?)" ~ "Dynasty 21",
dynasty == "Dynasty 21 or 22" ~ "Dynasty Range",
dynasty == "Dynasty 21–22" ~ "Dynasty Range",
dynasty == "Dynasty 21-22" ~ "Dynasty Range",
dynasty == "Dynasty 21–24" ~ "Dynasty Range",
dynasty == "Dynasty 21–25" ~ "Dynasty Range",
dynasty == "Dynasty 21–26" ~ "Dynasty Range",
dynasty == "Dynasty 21–30" ~ "Dynasty Range",
dynasty == "Dynasty 21–early Dynasty 22" ~ "Dynasty 21-22",
dynasty == "Dynasty 22, early" ~ "Dynasty 22",
dynasty == "Dynasty 22–24" ~ "Dynasty Range",
dynasty == "Dynasty 22–26" ~ "Dynasty Range",
dynasty == "Dynasty 25-26" ~ "Dynasty Range",
dynasty == "Dynasty 25–26" ~ "Dynasty Range",
dynasty == "Dynasty 25 (Kushite)" ~ "Dynasty 25",
dynasty == "Dynasty 25–30" ~ "Dynasty Range",
dynasty == "Dynasty 26 (Saite)" ~ "Dynasty 26",
dynasty == "Dynasty 26 and later" ~ "Dynasty Range",
dynasty == "Dynasty 26 or later" ~ "Dynasty Range",
dynasty == "Dynasty 26–29" ~ "Dynasty Range",
dynasty == "Dynasty 26–4th century" ~ "Dynasty Range",
dynasty == "Dynasty 26–30" ~ "Dynasty Range",
dynasty == "Dynasty 3–4" | dynasty == "Dynasty 3-4" ~ "Dynasty Range",
dynasty == "Dynasty 27–30" ~ "Dynasty Range",
dynasty == "Dynasty 27–30?" ~ "Dynasty Range",
dynasty == "Dynasty 30 or later" ~ "Dynasty 30",
dynasty == "Dynasty 5 (?)" ~ "Dynasty 5",
dynasty == "Dynasty 5, second half" ~ "Dynasty 5",
dynasty == "Dynasty 5-6" ~ "Dynasty Range",
dynasty == "Dynasty 5–6" ~ "Dynasty Range",
dynasty == "Dynasty 6–8" ~ "Dynasty Range",
dynasty == "Dynasty 6-8" ~ "Dynasty Range",
dynasty == "Dynasty 6–11" ~ "Dynasty Range",
dynasty == "Dynasty 6–12" ~ "Dynasty 2",
dynasty == "Dynasty 7–10 (?)" ~ "Dynasty Range",
dynasty == "Dynasty 8–11" ~ "Dynasty Range",
dynasty == "Dynasty 8–12" ~ "Dynasty Range",
dynasty == "Dynasty 8–18" ~ "Dynasty Range",
dynasty == "Dynasty 9?" ~ "Dynasty 9",
dynasty == "Dynasty 9–12" ~ "Dynasty Range",
dynasty == "Dynasty 18, early" ~ "Dynasty 18",
dynasty == "original Dynasty 19" ~ "Dynasty 19",
dynasty == "mid-Dynasty 18" ~ "Dynasty 18",
dynasty == "late Dynasty 22" ~ "Dynasty 22",
dynasty == "late Dynasty 21" ~ "Dynasty 21",
dynasty == "late Dynasty 13-17" ~ "Dynasty Range",
dynasty == "late Dynasty 13–17" ~ "Dynasty Range",
dynasty == "late Dynasty 12–early Dynasty 13" ~ "Dynasty 12-13",
dynasty == "late Dynasty 12–early Dynasty 13" ~ "Dynasty 12-13",
dynasty == "late Dynasty 12–Dynasty 13" ~ "Dynasty 12-13",
dynasty == "late Dynasty 12–13" ~ "Dynasty 12-13",
dynasty == "Probably Dynasty 1" ~ "Dynasty 1",
dynasty == "Probably Dynasty 26" ~ "Dynasty 26",
dynasty == "mid Dynasty 13" ~ "Dynasty 13",
dynasty == "early Dynasty 18" ~ "Dynasty 18",
dynasty == "dynasty 11" ~ "Dynasty 11",
dynasty == "Late dynasty 11" ~ "Dynasty 11",
dynasty == "Late Dynasty 21–early Dynasty 22" ~ "Dynasty 21-22",
dynasty == "Late Dynasty 12–13" ~ "Dynasty 12-13",
dynasty == "Dynasty 9–early Dnyasty 11" ~ "Dynasty Range",
dynasty == "mid to late Dynasty 13" ~ "Dynasty 13",
dynasty == "Dynasty 9–12" ~ "Dynasty Range",
dynasty == "Second Intermediate Period" ~ "Dynasty Range",
reign == "reign of Ramesses IV" ~ "Dynasty 20",
reign == "reign of Sethnakht" ~ "Dynasty 20",
object_date == "ca. 2649–1640 B.C." | object_name == "ca. 2649-1640 B.C." ~ "Dynasty Range",
TRUE ~ dynasty,
TRUE ~ reign,
TRUE ~ object_date
),
object_begin_date = case_when(
object_begin_date == 1899 ~ -1479,
TRUE ~ object_begin_date),
object_end_date = case_when(
object_end_date == 1899 ~ -1458,
TRUE ~ object_end_date),
dynasty = ifelse(is.na(dynasty), "No Dynasty Specified", dynasty),
dynasty = factor(dynasty, levels = c("Dynasty 1", "Dynasty 2", "Dynasty 3", "Dynasty 4", "Dynasty 5", "Dynasty 6", "Dynasty 8", "Dynasty 9", "Dynasty 11", "Dynasty 12", "Dynasty 12-13", "Dynasty 13", "Dynasty 14", "Dynasty 15", "Dynasty 17", "Dynasty 17-18", "Dynasty 18", "Dynasty 19", "Dynasty 19-20", "Dynasty 20", "Dynasty 21", "Dynasty 22", "Dynasty 23", "Dynasty 24", "Dynasty 25", "Dynasty 26", "Dynasty 27", "Dynasty 30", "Dynasty Range", "NA")),
dyn_number = ifelse(is.na(dynasty), 30, as.numeric(dynasty)))
```
### Country
Geolocating the source of artworks and depicting them on a world map required some additional data management. This began with merged our already-prepped Met data with world data from `rnaturalearth` package. This world data contains information about countries and locations such as longitudes and latitudes.
However, after merging, it became clear that the `country` variable in our data need some additional cleaning, as some countries were coded in a different format than the world data dataset. This difference created more missing data on countries and total number of objects. In order to resolve the issue, we updated the name of the countries in our Met data to match the names of countries in the world data. We also pulled out the subset of the remaining objects with attributed countries that didn't match country names in the world data, so that we could investigate these non-matches in their own plot.
### Egyptian Subregion
Continuing with our art theme, the Met took some LARGE creative liberties when it described the `subregion` in which artifacts in Egypt were found. Unfortunately, the regions and subregions that the Met identified were not what the Egyptian government themselves self-identified as regions and subregions within their country.
As such, some (archeological) digging was required to map Egypt. First, we snagged a shapefile of First-Level Administrative Districts from a [UTexas library] (https://geodata.lib.utexas.edu/catalog/stanford-bb409wq6265). To better streamline our repositories, we decided to upload and then save the `EGY_adm1.shp` as an RData file and keep only the modified file for quicker loading.
Using the code below, we decided to upload and then save the `EGY_adm1.shp` as an RData file and keep only the modified file to streamline our reposiories and speed up loading time:
- `library(sf)`
- `shape_file <- st_read("data/EGY_adm1.shp")`
Thankfully the file was in Polygon form, so we had a map - now to get to the real artifacts. Due to the general ugliness of the regions/subregions data, we decided to use the subregions data and simply find the latitude and longitudes for the areas. To make sure we finished this task before being mummified, we restricted the data to only subregions with more than 10 artifact numbers. Sadly, the resulting data for artifacts is point data rather than polygons, but it would still be useful for our later mapping of artefacts by Egyptian region.