Some tips and tricks to guide/help you - working with, processing, and visualising time series data #32
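Pure base R (no package dependencies) solution
## Read data ----
turkana_cmam <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14375302/turkana_cmam.csv")
## Convert month into class Date so that it can be plotted ----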
turkana_cmam$month <- as.Date(
paste("01", turkana_cmam$month), format = "%d %B %Y"
)
## Calculate overall values per month ----
turkana_cmam_overall <- aggregate(
cbind(admissions, cured, death, default, not_cured, discharges) ~ month,
data = turkana_cmam,
FUN = sum
)
## Calculate performance metrics ----
### By area ----
turkana_performance <- with(
turkana_cmam,
{
data.frame(
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
not_cured_rate = not_cured / discharges
)
}
)
### Concatenate by area performance metrics to main data.frame ----
turkana_cmam <- data.frame(turkana_cmam, turkana_performance)
### Overall performance metrics ----
turkana_performance_overall <- with(
turkana_cmam_overall,
{
data.frame(
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
not_cured_rate = not_cured / discharges
)
}
)
### Concatenate overall performance metrics to overall cmam data.frame ----
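turkana_cmam_overall <- data.frame(turkana_cmam_overall, turkana_performance_overall)
This workflow now gives me two data.frame objects: turkana_cmam (by area and month) and turkana_cmam_overall (all areas combined, by month), each with the performance metrics appended as columns.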
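One detail of the workflow above is worth spelling out: the overall rates are calculated from the summed counts, not by averaging the by-area rates. The sketch below uses a small made-up data.frame (the object `toy` and its values are invented for illustration and are not from turkana_cmam.csv) to show that the two calculations generally give different answers when areas have different numbers of discharges.

```r
## Made-up counts for two hypothetical areas over two months
toy <- data.frame(
  area = c("A", "B", "A", "B"),
  month = as.Date(c("2010-01-01", "2010-01-01", "2010-02-01", "2010-02-01")),
  cured = c(8, 1, 5, 5),
  discharges = c(10, 2, 10, 5)
)

## Sum the counts per month first, then calculate the rate
toy_overall <- aggregate(cbind(cured, discharges) ~ month, data = toy, FUN = sum)
toy_overall$cure_rate <- toy_overall$cured / toy_overall$discharges

## For comparison only: the unweighted mean of the by-area rates
toy$cure_rate <- toy$cured / toy$discharges
toy_mean_of_rates <- aggregate(cure_rate ~ month, data = toy, FUN = mean)

toy_overall
toy_mean_of_rates
```

Summing first weights each area by its number of discharges, which is usually what is wanted for an overall programme figure.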
Approach using dplyr and friends as dependencies
## Load libraries ----
library(dplyr)
library(lubridate) ## For working with dates and date formats
## Read data ----
turkana_cmam <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14375302/turkana_cmam.csv")
## Process data and calculate performance metrics by area and month ----
turkana_cmam <- turkana_cmam %>%
mutate(
month = lubridate::my(month),
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
non_response_rate = not_cured / discharges
)
## Process data and calculate performance metrics per month overall ----
turkana_cmam_overall <- turkana_cmam %>%
group_by(month) %>%
summarise(
admissions = sum(admissions),
cured = sum(cured),
death = sum(death),
default = sum(default),
not_cured = sum(not_cured),
discharges = sum(discharges)
) %>%
mutate(
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
non_response_rate = not_cured / discharges
) %>%
mutate(area = "Overall", .before = "month")
## Concatenate the by area and overall data.frames ----
turkana_cmam <- rbind(turkana_cmam, turkana_cmam_overall)
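A possible edge case with the rate calculations above: if an area reports zero discharges in a month, dividing by `discharges` gives `NaN` (0 / 0) or `Inf`, which can then leak into plots. The helper below is a hypothetical sketch (`safe_rate` is not part of the original workflow or of any package) that returns `NA` in those months instead.

```r
## Hypothetical helper: return NA when there are no discharges in a month
safe_rate <- function(numerator, denominator) {
  ifelse(denominator > 0, numerator / denominator, NA_real_)
}

## Made-up values: a normal month, a month with 0 / 0, and a month with 3 / 0
safe_rate(numerator = c(8, 0, 3), denominator = c(10, 0, 0))
```

Whether `NA` is the right choice depends on the analysis; it is used here because both base R and ggplot2 line plots leave a gap at missing values rather than drawing a misleading point.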
Plotting admissions over time using only base R functions
### Split data into a list of areas ----
turkana_cmam_list <- split(turkana_cmam, f = turkana_cmam$area)
### Plot admissions over time for Overall, Kainuk and Lokori overlaid ----
#### Lengthen margins at bottom of plot to accommodate vertical labels ----
par(mar = c(7, 4, 4, 2))
#### Plot overall ----
plot(
x = turkana_cmam_overall$month,
y = turkana_cmam_overall$admissions,
type = "l",
lwd = 3,
col = "gray70",
xlim = c(min(turkana_cmam_overall$month), max(turkana_cmam_overall$month)),
ylim = c(0, max(turkana_cmam_overall$admissions)),
main = "CMAM Admissions over time",
xlab = "", ylab = "Admissions",
xaxt = "n", yaxt = "n", frame.plot = FALSE
)
#### Add overall points ----
points(
x = turkana_cmam_overall$month,
y = turkana_cmam_overall$admissions,
pch = 21, col = "gray70", bg = "gray70", cex = 1
)
#### Add prettified x and y axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam$month),
to = max(turkana_cmam$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam$month),
to = max(turkana_cmam$month),
by = "2 month"
)
)
axis(
side = 2,
at = seq(
from = 0,
to = max(turkana_cmam_overall$admissions),
by = 20
),
labels = seq(
from = 0,
to = max(turkana_cmam_overall$admissions),
by = 20
)
)
#### Add Kainuk time series ----
lines(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$admissions,
col = "purple", lwd = 3
)
#### Add Kainuk points ----
points(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$admissions,
pch = 21, col = "purple", bg = "purple", cex = 1
)
#### Add Lokori time series ----
lines(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$admissions,
col = "darkgreen", lwd = 3
)
#### Add Lokori points ----
points(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$admissions,
pch = 21, col = "darkgreen", bg = "darkgreen", cex = 1
)
#### Add legend ----
legend(
x = "topright",
legend = c("Kainuk", "Lokori", "Overall"),
cex = 1,
col = c("purple", "darkgreen", "gray70"),
pch = 19, pt.cex = 2,
bty = "n"
)
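The code above produces a single plot titled "CMAM Admissions over time", with the Overall (grey), Kainuk (purple) and Lokori (dark green) admissions lines overlaid on shared axes.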
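The repeated lines() and points() calls for each area can also be written as a loop over the split list. This is only a sketch using made-up data: the area names, values, and the objects `toy`, `toy_list`, `area_cols` and `area_ltys` are invented for illustration. It also varies the line type per area, so the series stay distinguishable without relying on colour alone.

```r
## Made-up monthly admissions for three hypothetical areas
months <- seq(from = as.Date("2010-01-01"), by = "month", length.out = 6)
toy <- data.frame(
  area = rep(c("Area A", "Area B", "Area C"), each = 6),
  month = rep(months, times = 3),
  admissions = c(
    12, 15, 20, 18, 25, 22,
    8, 10, 9, 14, 12, 11,
    20, 25, 29, 32, 37, 33
  )
)
toy_list <- split(toy, f = toy$area)

## One colour and one line type per area, named so they can be looked up
area_cols <- c("Area A" = "purple", "Area B" = "darkgreen", "Area C" = "gray70")
area_ltys <- c("Area A" = 1, "Area B" = 2, "Area C" = 3)

## Empty plot that fixes the shared x and y ranges
plot(
  x = range(toy$month), y = c(0, max(toy$admissions)),
  type = "n", xlab = "", ylab = "Admissions", frame.plot = FALSE
)

## Add one line (with points) per area
for (area in names(toy_list)) {
  lines(
    x = toy_list[[area]]$month,
    y = toy_list[[area]]$admissions,
    type = "o", pch = 19, lwd = 3,
    col = area_cols[[area]], lty = area_ltys[[area]]
  )
}

legend(
  x = "topright", legend = names(toy_list),
  col = area_cols[names(toy_list)], lty = area_ltys[names(toy_list)],
  lwd = 3, bty = "n"
)
```

Adding an area then only needs a new entry in the colour and line type vectors, not another pair of lines() and points() calls.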
Some notes and comments on the above approach
Overlaying several lines on a single plot makes it much easier to compare time trends. The axes are shared, so the scales are directly comparable and a quick visual check is enough to spot differences and similarities. In the example above, this works well with 3 time series overlaid. However, as more time series are added, the plot gets messy and hard to read and interpret. Because each line has to be differentiated by colour and/or line type, the plot can become muddled as lines overlap and colours clash. I think 3 to 5 time series can be overlaid and still be mostly legible. Beyond that, you should begin to question whether you need a different approach. Below is one possible pathway.
Plotting admissions over time using only base R functions - side-by-side plots
### Plot Overall, Kainuk, and Lokori side-by-side in facets/panels ----
#### Create 1 x 3 plotting window ----
par(mfrow = c(1, 3))
#### Lengthen margins at bottom of plot to accommodate vertical labels ----
par(mar = c(7, 4, 4, 2))
#### Plot overall ----
plot(
x = turkana_cmam_overall$month,
y = turkana_cmam_overall$admissions,
type = "l",
lwd = 2,
col = "gray70",
xlim = c(min(turkana_cmam_overall$month), max(turkana_cmam_overall$month)),
ylim = c(0, max(turkana_cmam_overall$admissions)),
xlab = "", ylab = "Admissions",
xaxt = "n", frame.plot = FALSE
)
#### Add overall points ----
points(
x = turkana_cmam_overall$month,
y = turkana_cmam_overall$admissions,
pch = 21, col = "gray70", bg = "gray70", cex = 1
)
title(main = "Overall", line = -1)
#### Add prettified x axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
)
)
#### Plot kainuk ----
plot(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$admissions,
type = "l",
lwd = 2,
col = "gray70",
xlim = c(min(turkana_cmam_overall$month), max(turkana_cmam_overall$month)),
ylim = c(0, max(turkana_cmam_overall$admissions)),
xlab = "", ylab = "Admissions",
xaxt = "n", frame.plot = FALSE
)
#### Add Kainuk points ----
points(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$admissions,
pch = 21, col = "gray70", bg = "gray70", cex = 1
)
title(main = "Kainuk", line = -1)
#### Add prettified x axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
)
)
#### Plot lokori ----
plot(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$admissions,
type = "l",
lwd = 2,
col = "gray70",
xlim = c(min(turkana_cmam_overall$month), max(turkana_cmam_overall$month)),
ylim = c(0, max(turkana_cmam_overall$admissions)),
xlab = "", ylab = "Admissions",
xaxt = "n", frame.plot = FALSE
)
#### Add Lokori points ----
points(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$admissions,
pch = 21, col = "gray70", bg = "gray70", cex = 1
)
title(main = "Lokori", line = -1)
#### Add prettified x axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
)
)
### Reset graphics window to 1 x 1 ----
par(mfrow = c(1, 1))
### Add overall title ----
title(main = "Admissions over time", line = 2)
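The code above produces three side-by-side panels (Overall, Kainuk and Lokori) of admissions over time, all drawn in grey with the same x and y axis limits, under the overall title "Admissions over time".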
Some notes and comments on the side-by-side approach
The side-by-side approach has the advantage of not needing colour to differentiate the lines, because each line has its own facet/panel. In base R, as long as you make sure the x and y axis scales are the same for each plot, the plots are comparable, although the reader has to scan across each plot. As more time series are added, the plot just needs to be organised into appropriate rows and columns. The plot gets larger, but clarity is maintained because each time series has its own facet/panel. I personally recommend considering this approach when you have 3 or more plots to show.
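The "organise into rows and columns with the same axis scales" idea can be sketched as a small function that loops over a list of areas. Everything named here is hypothetical and for illustration only: `plot_panels`, `toy_list` and the made-up values are not part of the original workflow. The sketch uses `n2mfrow()` from grDevices to choose the grid, and `mtext(outer = TRUE)` with an outer margin for the overall title, which is an alternative to resetting `mfrow` before calling `title()`.

```r
## Made-up monthly admissions for four hypothetical areas
months <- seq(from = as.Date("2010-01-01"), by = "month", length.out = 6)
toy_list <- list(
  "Area A" = data.frame(month = months, admissions = c(12, 15, 20, 18, 25, 22)),
  "Area B" = data.frame(month = months, admissions = c(8, 10, 9, 14, 12, 11)),
  "Area C" = data.frame(month = months, admissions = c(20, 25, 29, 32, 37, 33)),
  "Area D" = data.frame(month = months, admissions = c(5, 7, 6, 9, 8, 10))
)

## Shared limits so that every panel is directly comparable
shared_xlim <- range(months)
shared_ylim <- c(0, max(sapply(toy_list, function(d) max(d$admissions))))

## Hypothetical helper: one panel per list element, same limits throughout
plot_panels <- function(data_list, xlim, ylim) {
  old_par <- par(
    mfrow = n2mfrow(length(data_list)),  ## grid size chosen from panel count
    mar = c(7, 4, 4, 2),
    oma = c(0, 0, 2, 0)                  ## outer margin for the overall title
  )
  on.exit(par(old_par))                  ## restore settings when done

  for (area in names(data_list)) {
    plot(
      x = data_list[[area]]$month,
      y = data_list[[area]]$admissions,
      type = "o", pch = 19, lwd = 2, col = "gray70",
      xlim = xlim, ylim = ylim,
      xlab = "", ylab = "Admissions", main = area,
      frame.plot = FALSE
    )
  }

  mtext("Admissions over time", side = 3, outer = TRUE, line = 0.5, font = 2)
  invisible(NULL)
}

plot_panels(toy_list, xlim = shared_xlim, ylim = shared_ylim)
```

With four areas `n2mfrow()` gives a 2 x 2 grid; for a single row, as in the 1 x 3 layout above, `mfrow` would be set by hand instead.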
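ggplot approach to plotting the same plots above
### Load libraries ----
library(dplyr)
library(lubridate)
library(ggplot2)
### Process turkana_cmam data (starting from the data as read from the CSV) ----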
turkana_cmam <- turkana_cmam %>%
mutate(
month = lubridate::my(month),
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
non_response_rate = not_cured / discharges
)
turkana_cmam_overall <- turkana_cmam %>%
group_by(month) %>%
summarise(
admissions = sum(admissions),
cured = sum(cured),
death = sum(death),
default = sum(default),
not_cured = sum(not_cured),
discharges = sum(discharges)
) %>%
mutate(
cure_rate = cured / discharges,
death_rate = death / discharges,
default_rate = default / discharges,
non_response_rate = not_cured / discharges
) %>%
mutate(area = "Overall", .before = "month")
turkana_cmam <- rbind(turkana_cmam, turkana_cmam_overall)
#### Plot overall admissions over time overlaid with kainuk and lokori ----
turkana_cmam %>%
ggplot(mapping = aes(x = month, y = admissions)) +
geom_line(mapping = aes(group = area, colour = area), linewidth = 1) +
geom_point(mapping = aes(group = area, colour = area), size = 2) +
scale_colour_manual(values = c("purple", "darkgreen", "gray70"), name = "") +
scale_x_date(
breaks = seq(
from = min(turkana_cmam$month), to = max(turkana_cmam$month),
by = "2 month"
)
) +
scale_y_continuous(
breaks = seq(from = 0, to = max(turkana_cmam_overall$admissions), by = 20)
) +
labs(
title = "Admissions over time",
subtitle = "CMAM Programme in Turkana District, Kenya"
) +
xlab(NULL) +
ylab("Admissions") +
theme_bw() +
theme(
legend.position = "top",
axis.text.x.bottom = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
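The code above produces a single plot titled "Admissions over time" (subtitle "CMAM Programme in Turkana District, Kenya"), with the Kainuk, Lokori and Overall lines distinguished by colour and the legend placed at the top.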
ggplot approach to side-by-side plot
#### Plot Overall, Kainuk, and Lokori in faceted plot ----
turkana_cmam %>%
mutate(area = factor(area, levels = c("Overall", "Kainuk", "Lokori"))) %>%
ggplot(mapping = aes(x = month, y = admissions)) +
geom_line(colour = "gray70", linewidth = 1.5) +
geom_point(colour = "gray70", size = 2) +
scale_x_date(
breaks = seq(
from = min(turkana_cmam$month), to = max(turkana_cmam$month),
by = "2 month"
)
) +
scale_y_continuous(
breaks = seq(from = 0, to = max(turkana_cmam_overall$admissions), by = 20)
) +
labs(
title = "Admissions over time",
subtitle = "CMAM Programme in Turkana District, Kenya"
) +
xlab(NULL) +
ylab("Admissions") +
facet_wrap(. ~ area, ncol = 3) +
theme_bw() +
theme(
legend.position = "top",
axis.text.x.bottom = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
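The code above produces a plot titled "Admissions over time", faceted by area in a single row, with all panels drawn in grey and sharing the same axes.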
Notes on the ggplot approach
The plots created with ggplot are very similar to the base R output, although a few differences make the ggplot versions look more elegant. And as you will have noticed already, far fewer lines of code are needed to produce roughly the same output (plus some nice features) with the ggplot approach.
Plotting performance indicators over time using base R
### Split data into a list of areas ----
turkana_cmam_list <- split(turkana_cmam, f = turkana_cmam$area)
### Create 1 x 2 plotting window ----
par(mfrow = c(1, 2))
### Plot Kainuk ----
plot(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$cure_rate * 100,
type = "l",
xlab = "", ylab = "%",
main = "Kainuk",
col = "darkgreen", lwd = 2, xaxt = "n"
)
#### Add prettified x axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
)
)
lines(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$default_rate * 100,
type = "l",
col = "red", lwd = 2, xaxt = "n"
)
lines(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$death_rate * 100,
type = "l",
col = "black", lwd = 2, xaxt = "n"
)
lines(
x = turkana_cmam_list$Kainuk$month,
y = turkana_cmam_list$Kainuk$not_cured_rate * 100,
type = "l",
col = "gray70", lwd = 2, xaxt = "n"
)
#### Add cut-offs ----
abline(h = 75, col = "darkgreen", lty = 2)
abline(h = 15, col = "red", lty = 2)
abline(h = 10, col = "black", lty = 2)
### Plot Lokori ----
plot(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$cure_rate * 100,
type = "l",
xlab = "", ylab = "%",
main = "Lokori",
col = "darkgreen", lwd = 2, xaxt = "n"
)
#### Add prettified x axis ----
axis(
side = 1,
at = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
),
las = 2,
labels = seq(
from = min(turkana_cmam_overall$month),
to = max(turkana_cmam_overall$month),
by = "2 month"
)
)
lines(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$default_rate * 100,
type = "l",
col = "red", lwd = 2, xaxt = "n"
)
lines(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$death_rate * 100,
type = "l",
col = "black", lwd = 2, xaxt = "n"
)
lines(
x = turkana_cmam_list$Lokori$month,
y = turkana_cmam_list$Lokori$not_cured_rate * 100,
type = "l",
col = "gray70", lwd = 2, xaxt = "n"
)
#### Add cut-offs ----
abline(h = 75, col = "darkgreen", lty = 2)
abline(h = 15, col = "red", lty = 2)
abline(h = 10, col = "black", lty = 2)
par(mfrow = c(1, 1))
legend(
x = as.Date("2010-02-01"), y = 60,
legend = c("Cure", "Default", "Death", "Non-response"),
cex = 1,
col = c("darkgreen", "red", "black", "gray70"),
pch = 15, pt.cex = 2,
bty = "n"
)
title(main = "Performance over time", line = 2)
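The code above produces two side-by-side panels (Kainuk and Lokori) showing cure (dark green), default (red), death (black) and non-response (grey) rates in % over time, with dashed lines at the 75%, 15% and 10% cut-offs, under the overall title "Performance over time".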
Notes/comments on performance over time
For this plot, only a side-by-side layout per area, with the cure, default, death and non-response rates overlaid within each panel, really makes sense for showing the pattern and relationship between the performance indicators. When cure rates are high, all the other metrics are low. As soon as cure rates dip, either death or default goes up. The dashed lines indicate the cut-off standards for the performance metrics: cure rate is expected to be greater than 75%, default rate less than 15%, and death rate less than 10%.
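The cut-offs described above can also be checked in code rather than by eye. A minimal sketch, assuming rates are stored as proportions (as in the calculations earlier in this thread) and using made-up values; the data.frame `perf` and the `meets_*` column names are invented for illustration.

```r
## Made-up performance rates (as proportions) for three months
perf <- data.frame(
  month = as.Date(c("2010-01-01", "2010-02-01", "2010-03-01")),
  cure_rate = c(0.80, 0.70, 0.80),
  default_rate = c(0.10, 0.20, 0.05),
  death_rate = c(0.02, 0.05, 0.12)
)

## Compare each rate against its cut-off: cure > 75%, default < 15%, death < 10%
perf$meets_cure <- perf$cure_rate > 0.75
perf$meets_default <- perf$default_rate < 0.15
perf$meets_death <- perf$death_rate < 0.10

## A month meets the standards only if all three checks pass
perf$meets_all <- perf$meets_cure & perf$meets_default & perf$meets_death

perf[, c("month", "meets_cure", "meets_default", "meets_death", "meets_all")]
```

In this made-up example only the first month passes all three checks: the second fails on cure and default, and the third fails on death.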
Thanks a lot, Ernest.
The examples here use real-world data from a CMAM programme in Turkana District, Kenya: turkana_cmam.csv.
We read the data as follows:
turkana_cmam <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14375302/turkana_cmam.csv")
This is what turkana_cmam looks like (showing first 10 rows):
The variables in the dataset are:
In this example, I would like to answer the following: