Tips and tricks for counting/tabulating values #8

ernestguevarra · 2023-02-16T14:13:12Z

ernestguevarra
Feb 16, 2023
Maintainer

@OxfordIHTM/class-2023

Counting/tabulating values using base function `sum()` and use of indexing

Counting or tabulating values is one of the most important data processing skills to learn in R. This post describes some basic approaches to counting or tabulating values in R, using the fem dataset as an example.

## Read the fem.dat dataset
fem <- read.table(
  file = "https://raw.githubusercontent.com/OxfordIHTM/teaching_datasets/main/fem.dat",
  header = TRUE
)

With this dataset, we might be interested in counting how many of the respondents/samples in the dataset have age greater than or equal to 40. We can get this value with the sum() function as follows:

sum(fem$AGE >= 40)

which gives us a value of:

[1] 50

So, there are 50 women in the dataset who have ages greater than or equal to 40 years old.

If we want to know the complement of this, that is, the number of women in the dataset who are less than 40 years old, we use the following command:

sum(fem$AGE < 40)

which gives us a value of:

[1] 68

So, there are 68 women in the dataset who have ages less than 40 years old.
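The reason sum() works as a counter here is that a comparison such as fem$AGE >= 40 returns a logical vector, and sum() treats TRUE as 1 and FALSE as 0. The short sketch below illustrates this on a small made-up vector (the `age` values are hypothetical and are not taken from the fem dataset). It also shows that mean() on the same logical vector gives the proportion directly.

```r
## Hypothetical toy ages for illustration only (not the fem data)
age <- c(25, 41, 38, 40, 52)

## A comparison returns one TRUE/FALSE value per element
is_40_plus <- age >= 40
is_40_plus
## [1] FALSE  TRUE FALSE  TRUE  TRUE

## sum() adds up the TRUE values (each counted as 1)
n_40_plus <- sum(is_40_plus)

## Negating the logical vector counts the complementary group
n_under_40 <- sum(!is_40_plus)

## mean() of a logical vector is the proportion of TRUE values
p_40_plus <- mean(is_40_plus)
```

Note that the two counts always add up to the length of the vector as long as there are no missing values.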

Then, we might want to perform a cross-tabulation of those women with ages greater than or equal to 40 years old and those who are not by whether they are sleeping normally (SLP variable; 1 = YES and 2 = NO). Continuing with our approach above, we can use the following code to count these different groupings of the women in the dataset:

## Count the number of women with ages greater than or equal to 40 and those who are sleeping normally
sum(fem$AGE >= 40 & fem$SLP == 1)

Interestingly, this gives us:

[1] NA

This is unexpected. We would expect the result to be a count of those who are 40 years and older who are also sleeping normally (SLP == 1). To investigate why this happens, we can look in more detail at how the sum() function works by opening its help page with ?sum. With this we get:

Sum of Vector Elements

Description
sum returns the sum of all the values present in its arguments.

Usage
sum(..., na.rm = FALSE)

Arguments
... numeric or complex or logical vectors.

na.rm logical. Should missing values (including NaN) be removed?

Details
This is a generic function: methods can be defined for it directly or via the Summary group generic. For this to work properly, the arguments ... should be unnamed, and dispatch is on the first argument.

If na.rm is FALSE an NA or NaN value in any of the arguments will cause a value of NA or NaN to be returned, otherwise NA and NaN values are ignored.

Logical true values are regarded as one, false values as zero. For historical reasons, NULL is accepted and treated as if it were integer(0).

Loss of accuracy can occur when summing values of different signs: this can even occur for sufficiently long integer inputs if the partial sums would cause integer overflow. Where possible extended-precision accumulators are used, typically well supported with C99 and newer, but possibly platform-dependent.

The description of the sum() function indicates that if any of the arguments has a value of NA or NaN and the na.rm argument is set to FALSE (the default), then NA or NaN is returned by the function. So, in our case, the SLP variable most likely has NA values (because the AGE variable did not produce this issue earlier).

We can check this by examining the SLP variable:

fem$SLP

which gives us:

  [1]  2  2  2  2  1  2 NA  2  2  2  1  2  2  2  2  1  2  2  2  2 NA  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2
 [45]  2  2  1  1  2  2  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2 NA  2  2  2  2  2  2  2  2  2  2  2  1  2  2  2  2  2  2  2  2  2  2
 [89]  2  2 NA  2  2  2  2  2  2  2  2  2  2  2  2  2 NA  2  2  2  1  2  1  2  2  1  2  1  1  2

There are indeed NA values in the SLP variable.
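Instead of scanning the printed values by eye, we can also ask R directly whether a vector has missing values and how many. The sketch below uses small made-up vectors (`age` and `slp` are hypothetical, not the fem data) and also shows how NA behaves when combined with `&`, which is what caused the NA result above.

```r
## Hypothetical toy vectors for illustration only (not the fem data)
age <- c(45, 45, 30, 30)
slp <- c(1, NA, 1, NA)

## Check for and count missing values
has_na <- anyNA(slp)       # TRUE if at least one NA is present
n_na   <- sum(is.na(slp))  # number of NA values

## NA combined with TRUE stays NA; NA combined with FALSE is FALSE
both <- age >= 40 & slp == 1
both
## [1]  TRUE    NA FALSE FALSE

## One NA in the logical vector is enough to make sum() return NA
count_default <- sum(both)
count_na_rm   <- sum(both, na.rm = TRUE)
```

Here `count_default` is NA while `count_na_rm` is 1, which mirrors what we see with the fem dataset.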

Knowing more about the sum() function, we can adjust our earlier code as follows:

sum(fem$AGE >= 40 & fem$SLP == 1, na.rm = TRUE)

which gives us:

[1] 6

So, there are 6 women in the dataset who have ages >= 40 and who are sleeping normally. Continuing on with this approach, we get the values for the other groupings as follows:

sum(fem$AGE >= 40 & fem$SLP == 1, na.rm = TRUE)  ## Number of women 40 or older who sleep normally
sum(fem$AGE >= 40 & fem$SLP == 2, na.rm = TRUE)  ## Number of women 40 or older who don't sleep normally
sum(fem$AGE < 40 & fem$SLP == 1, na.rm = TRUE)   ## Number of women less than 40 who sleep normally
sum(fem$AGE < 40 & fem$SLP == 2, na.rm = TRUE)   ## Number of women less than 40 who don't sleep normally

Depending on the kind of analysis that we want to perform, it may be helpful to save the outputs of these counts into an object so we can use them later on. So, we can update the code as follows:

group1 <- sum(fem$AGE >= 40 & fem$SLP == 1, na.rm = TRUE)  ## Number of women 40 or older who sleep normally
group2 <- sum(fem$AGE >= 40 & fem$SLP == 2, na.rm = TRUE)  ## Number of women 40 or older who don't sleep normally
group3 <- sum(fem$AGE < 40 & fem$SLP == 1, na.rm = TRUE)   ## Number of women less than 40 who sleep normally
group4 <- sum(fem$AGE < 40 & fem$SLP == 2, na.rm = TRUE)   ## Number of women less than 40 who don't sleep normally

It might also be useful to count samples that didn't respond or do not have information on specific variables. In our example, we might want to count how many women 40 or older have not responded to whether they are sleeping normally, and likewise how many women less than 40 have not responded. Because is.na() always returns TRUE or FALSE (never NA) and AGE did not produce any NA issue earlier, we can leave na.rm at its default of FALSE here. We can do this as follows:

group5 <- sum(fem$AGE >= 40 & is.na(fem$SLP), na.rm = FALSE)  ## Number of women 40 or older with no information on sleeping normally
group6 <- sum(fem$AGE < 40 & is.na(fem$SLP), na.rm = FALSE)   ## Number of women less than 40 with no information on sleeping normally
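A useful sanity check at this point is that the six groups should add up to the number of rows in the data, since every woman falls into exactly one group. The sketch below demonstrates the check on a small made-up data frame (`toy` and its values are hypothetical and stand in for fem; the `counts` vector name is also just for illustration).

```r
## Hypothetical toy data standing in for fem (values are made up)
toy <- data.frame(
  AGE = c(45, 52, 40, 30, 22, 39, 61),
  SLP = c(1, 2, NA, 1, 2, NA, 2)
)

## Logical vector marking women 40 or older
older <- toy$AGE >= 40

## Same six groupings as above, collected in a named vector
counts <- c(
  older_normal     = sum(older & toy$SLP == 1, na.rm = TRUE),
  older_abnormal   = sum(older & toy$SLP == 2, na.rm = TRUE),
  older_missing    = sum(older & is.na(toy$SLP)),
  younger_normal   = sum(!older & toy$SLP == 1, na.rm = TRUE),
  younger_abnormal = sum(!older & toy$SLP == 2, na.rm = TRUE),
  younger_missing  = sum(!older & is.na(toy$SLP))
)

## The groups are mutually exclusive, so they should cover every row
all_rows_counted <- sum(counts) == nrow(toy)
```

If `all_rows_counted` is FALSE, a group has been missed or double counted (for example, SLP holding a value other than 1, 2 or NA).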

Sometimes, it would be useful to present these values as a table rather than just having individual values. This can be helpful when you want to perform calculations using these counts later on.

The table can look something like this dummy table:

age	normal_sleep	abnormal_sleep
40 years or older
less than 40 years

We can do this as follows using the data.frame() function:

age_by_slp_table <- data.frame(
  age = c("40 years or older", "less than 40 years"),
  normal_sleep = c(group1, group3),
  abnormal_sleep = c(group2, group4)
)

The age_by_slp_table we created gives the following output:

                 age normal_sleep abnormal_sleep
1  40 years or older            6             43
2 less than 40 years            8             56

If we wanted to add the groupings for those with SLP values of NA into the table, we can do this:

age_by_slp_table <- data.frame(
  age = c("40 years or older", "less than 40 years"),
  normal_sleep = c(group1, group3),
  abnormal_sleep = c(group2, group4),
  na_sleep = c(group5, group6)
)

The updated age_by_slp_table gives us the following output:

                 age normal_sleep abnormal_sleep na_sleep
1  40 years or older            6             43        1
2 less than 40 years            8             56        4

Creating a single object that combines all the counts of the different groupings of the data that we are interested in is helpful when performing calculations that require these counts.

For example, if we want to report the proportion/percentage of women in the dataset who are 40 years and older, we can use the age_by_slp_table to perform the following calculation using our knowledge of indexing in R:

## Calculate proportion of women 40 or older in the dataset
sum(age_by_slp_table[1, 2:4]) / sum(age_by_slp_table[ , 2:4])

which gives:

[1] 0.4237288

## Calculate percentage of women 40 or older in the dataset
(sum(age_by_slp_table[1, 2:4]) / sum(age_by_slp_table[ , 2:4])) * 100

which gives:

[1] 42.37288

If we want to report the proportion/percentage of women who are 40 years and older and sleeping normally, among the women with information on sleep (that is, leaving out those with SLP values of NA), we can calculate as follows:

## Calculate the proportion of women 40 or older who sleep normally in the dataset
age_by_slp_table[1 , 2] / sum(age_by_slp_table[ , 2:3])

which gives:

[1] 0.05309735

## Calculate the percentage of women 40 or older who sleep normally in the dataset
(age_by_slp_table[1 , 2] / sum(age_by_slp_table[ , 2:3])) * 100

which gives:

[1] 5.309735
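Indexing by position (such as [1, 2] or [ , 2:3]) works, but it is easy to pick the wrong row or column once the table changes. As a possible alternative, the sketch below rebuilds the table from the counts reported above and indexes it by row condition and column name instead. The helper names `older` and `answered` are introduced here for illustration only. The last line is an assumption about what one might also want to report: the proportion within the 40 or older group, which uses a different denominator.

```r
## Rebuild the table from the counts reported in this post
age_by_slp_table <- data.frame(
  age = c("40 years or older", "less than 40 years"),
  normal_sleep = c(6, 8),
  abnormal_sleep = c(43, 56),
  na_sleep = c(1, 4)
)

## Select the row by its label and the columns by their names
older    <- age_by_slp_table$age == "40 years or older"
answered <- c("normal_sleep", "abnormal_sleep")

## Same calculation as above: women 40 or older who sleep normally,
## out of all women with information on sleep
prop_older_normal <- age_by_slp_table[older, "normal_sleep"] /
  sum(age_by_slp_table[ , answered])

## Different denominator: only women 40 or older with information on sleep
prop_within_older <- age_by_slp_table[older, "normal_sleep"] /
  sum(age_by_slp_table[older, answered])
```

The first value matches the 0.05309735 reported above. Which denominator is appropriate depends on the question being asked.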

ernestguevarra · 2023-02-17T23:27:29Z

ernestguevarra
Feb 17, 2023
Maintainer Author

@OxfordIHTM/class-2023

Counting/tabulating values using base function `table()`

The earlier solution of counting/tabulating values using sum() and indexing builds on the very basic R techniques that a beginner will learn.

However, we would all agree that this approach can be very tedious with several stages/steps needing to be done to actually get to the type of outputs that are useful.

Here we will explore the use of another base function that was designed specifically for creating tabulations of values and which is aptly called table(). To explore what the function does, run ?table in the R console.

Cross Tabulation and Table Creation

Description

table uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

Usage

table(...,
      exclude = if (useNA == "no") c(NA, NaN),
      useNA = c("no", "ifany", "always"),
      dnn = list.names(...), deparse.level = 1)

as.table(x, ...)
is.table(x)

## S3 method for class 'table'
as.data.frame(x, row.names = NULL, ...,
              responseName = "Freq", stringsAsFactors = TRUE,
              sep = "", base = list(LETTERS))

Arguments

... one or more objects which can be interpreted as factors (including numbers or character strings), or a list (such as a data frame) whose components can be so interpreted. (For as.table, arguments passed to specific methods; for as.data.frame, unused.)

exclude levels to remove for all factors in .... If it does not contain NA and useNA is not specified, it implies useNA = "ifany". See Details for its interpretation for non-factor arguments.

useNA whether to include NA values in the table. See ‘Details’. Can be abbreviated.

dnn the names to be given to the dimensions in the result (the dimnames names).

deparse.level controls how the default dnn is constructed. See Details.

x an arbitrary R object, or an object inheriting from class "table" for the as.data.frame method. Note that as.data.frame.table(x, *) may be called explicitly for non-table x for "reshaping" arrays.

row.names a character vector giving the row names for the data frame.

responseName The name to be used for the column of table entries, usually counts.

stringsAsFactors logical: should the classifying factors be returned as factors (the default) or character vectors?

sep, base passed to provideDimnames.

Details

If the argument dnn is not supplied, the internal function list.names is called to compute the dimname names as follows: If ... is one list with its own names(), these names are used. Otherwise, if the arguments in ... are named, those names are used. For the remaining arguments, deparse.level = 0 gives an empty name, deparse.level = 1 uses the supplied argument if it is a symbol, and deparse.level = 2 will deparse the argument.

Only when exclude is specified (i.e., not by default) and non-empty, will table potentially drop levels of factor arguments.

useNA controls if the table includes counts of NA values: the allowed values correspond to never ("no"), only if the count is positive ("ifany") and even for zero counts ("always"). Note the somewhat "pathological" case of two different kinds of NAs which are treated differently, depending on both useNA and exclude, see d.patho in the 'Examples:' below.

Both exclude and useNA operate on an "all or none" basis. If you want to control the dimensions of a multiway table separately, modify each argument using factor() or addNA().

Non-factor arguments a are coerced via factor(a, exclude=exclude). Since R 3.4.0, care is taken not to count the excluded values (where they were included in the NA count, previously).

The summary method for class "table" (used for objects created by table or xtabs() which gives basic information and performs a chi-squared test for independence of factors (note that the function chisq.test() currently only handles 2-d tables).

Value

table() returns a contingency table, an object of class "table", an array() of integer values. Note that unlike S the result is always an array, a 1D array if one factor is given.

as.table and is.table coerce to and test for contingency table, respectively.

The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs().

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

So, let us try to use the table() function to replicate the steps we did earlier using the sum() function with indexing.

## Create table of those women with age greater than or equal to 40 and those age less than 40
table(fem$AGE >= 40, useNA = "ifany")

which gives:

FALSE  TRUE 
   68    50

So, there are 68 women in the dataset who have age < 40 years and 50 women with age of 40 or older.

With just one line of code, we are able to arrive at the same answers as we did earlier with 2 lines of code.

Now, let us find out the grouping of women by their age and by their sleeping pattern. We use the following code:

table(
  fem$AGE >= 40, fem$SLP, 
  useNA = "ifany"  ## tabulate NAs if found
)

which gives:

         1  2 <NA>
  FALSE  8 56    4
  TRUE   6 43    1

From this result, we can see that of the women aged less than 40 years in the dataset, 8 sleep normally and 56 don't, and of the women aged 40 and older, 6 sleep normally and 43 don't. There are 4 women aged less than 40 and 1 woman aged 40 or older who didn't provide information on whether they are sleeping normally or not.
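The row labels FALSE/TRUE and column labels 1/2 in this output need some decoding by the reader. One possible way to make the table self-explanatory is to label the values before tabulating them. The sketch below does this on a small made-up data frame (`toy`, `age_group` and `sleep` are hypothetical names and values used only for illustration, not the fem data).

```r
## Hypothetical toy data standing in for fem (values are made up)
toy <- data.frame(
  AGE = c(45, 52, 40, 30, 22, 39, 61),
  SLP = c(1, 2, NA, 1, 2, NA, 2)
)

## Turn the age condition into readable labels
age_group <- ifelse(toy$AGE >= 40, "40 or older", "under 40")

## Attach labels to the SLP codes (1 = sleeps normally, 2 = does not)
sleep <- factor(toy$SLP, levels = c(1, 2), labels = c("normal", "abnormal"))

## Tabulate; the object names become the dimension names of the table
labelled <- table(age_group, sleep, useNA = "ifany")
labelled
```

With labels in place, cells can also be picked out by name, for example labelled["40 or older", "normal"].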

We were able to get this information with just one line of code compared to 10 lines of code in the earlier example. Also, with this approach, the output is already in a format/layout that is tabular compared to the earlier example in which we had to write additional lines of code to structure the output into a table format.

Now, we can assign this output into an object so that we can perform some calculations with it:

age_by_slp_table <- table(
  fem$AGE >= 40, fem$SLP, 
  useNA = "ifany"  ## tabulate NAs if found
)

Then with this object, let us try to calculate the proportion and percentage of women 40 or older who sleep normally:

## Proportion of women 40 or older who sleep normally
age_by_slp_table[2, 1] / sum(age_by_slp_table[, 1:2])

which gives

[1] 0.05309735

In this regard, the approach will be similar to that in the previous example but in total, we only needed 2 lines of code to get to this answer compared to 11 lines of code in the previous example.

However, there is a function called prop.table() (run ?prop.table in the R console to see the help file for this function) that can be applied to objects of class table and that facilitates these kinds of calculations, as follows:

prop.table(x = age_by_slp_table)

which gives

                  1           2        <NA>
  FALSE 0.067796610 0.474576271 0.033898305
  TRUE  0.050847458 0.364406780 0.008474576

However, this result gives us proportions including the NA category which we don't really want to count as there is no information for this category. So, we adjust our code as follows:

prop.table(x = age_by_slp_table[ , 1:2])

which gives

                 1          2
  FALSE 0.07079646 0.49557522
  TRUE  0.05309735 0.38053097

We get an output that calculates all the proportions of the groupings with just one function.
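By default prop.table() divides every cell by the grand total. If instead we want proportions within each age group, the margin argument can be used, and addmargins() can add row and column totals to the counts. The sketch below is one way this might look. It rebuilds the table from the counts reported above (leaving out the NA column) so that it runs on its own, and the object names `counts`, `row_props` and `with_totals` are made up for illustration.

```r
## Rebuild the 2 x 2 table from the counts reported in this post
## (matrix() fills column by column: first SLP = 1, then SLP = 2)
counts <- as.table(matrix(
  c(8, 6, 56, 43),
  nrow = 2,
  dimnames = list(age_40_plus = c("FALSE", "TRUE"), SLP = c("1", "2"))
))

## margin = 1 divides each cell by its row total,
## so each row (age group) sums to 1
row_props <- prop.table(counts, margin = 1)

## Add a "Sum" row and column holding the totals
with_totals <- addmargins(counts)
```

In `row_props`, the cell for women 40 or older who sleep normally is 6 / 49, which answers a different question from the 6 / 113 calculated earlier.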
