Tips and tricks for counting/tabulating values #8
Replies: 1 comment
@OxfordIHTM/class-2023
Counting/tabulating values using base function `sum()` and use of indexing

Counting or tabulating values is one of the most important data processing skills to learn in R. This post describes some basic approaches to counting or tabulating values in R using the `fem` dataset.

With this dataset, we might be interested in counting how many of the respondents in the dataset are aged 40 years or older. We can get this value by passing the condition on the `AGE` variable to the `sum()` function, which gives us a value of 50. So, there are 50 women in the dataset who are 40 years old or older.
If we want to know the complement of this, that is, the women in the dataset who are younger than 40 years old, we reverse the condition, which gives us a value of 68. So, there are 68 women in the dataset who are younger than 40 years old.
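As a minimal sketch of the counting approach described above, the example below uses a small made-up data frame called `toy` (a hypothetical stand-in, not the `fem` dataset, so the counts differ from those reported in this post). It relies on the fact that a comparison returns a logical vector and that `sum()` counts the `TRUE` values.

```r
# Hypothetical toy data standing in for the fem dataset (values are made up)
toy <- data.frame(AGE = c(25, 40, 52, 38, 61, 19))

# A comparison returns a logical vector; sum() counts the TRUE values
n_40_plus  <- sum(toy$AGE >= 40)
n_under_40 <- sum(toy$AGE < 40)

n_40_plus     # 3
n_under_40    # 3

# The two counts together should account for every row
n_40_plus + n_under_40 == nrow(toy)    # TRUE
```

Checking that the two counts add up to the number of rows is a quick way to confirm that no observation has been missed or counted twice.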
Next, we might want to cross-tabulate the women aged 40 years or older and those younger than 40 by whether they are sleeping normally (`SLP` variable; `1 = YES` and `2 = NO`). Continuing with our approach above, we can use `sum()` with a combined condition to count these different groupings of the women in the dataset. Interestingly, counting the women aged 40 or older who are sleeping normally this way gives us `NA`.
This is unexpected. We would expect the result to be a count of those who are 40 years and older who are also sleeping normally (`SLP == 1`). To investigate why this happens, we can look in more detail at how the `sum()` function works by opening its help page with `?sum`. With this we get:

Sum of Vector Elements
Description

`sum` returns the sum of all the values present in its arguments.

Usage

`sum(..., na.rm = FALSE)`

Arguments

`...`: numeric or complex or logical vectors.

`na.rm`: logical. Should missing values (including `NaN`) be removed?

Details

This is a generic function: methods can be defined for it directly or via the Summary group generic. For this to work properly, the arguments `...` should be unnamed, and dispatch is on the first argument.

If `na.rm` is `FALSE` an `NA` or `NaN` value in any of the arguments will cause a value of `NA` or `NaN` to be returned, otherwise `NA` and `NaN` values are ignored.

Logical true values are regarded as one, false values as zero. For historical reasons, `NULL` is accepted and treated as if it were `integer(0)`.

Loss of accuracy can occur when summing values of different signs: this can even occur for sufficiently long integer inputs if the partial sums would cause integer overflow. Where possible extended-precision accumulators are used, typically well supported with C99 and newer, but possibly platform-dependent.
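The behaviours described in the help text above can be tried out on small literal vectors. This is only an illustrative sketch; the object names are made up for the example.

```r
# Logical values are summed as 1 (TRUE) and 0 (FALSE)
logical_sum <- sum(c(TRUE, FALSE, TRUE))
logical_sum      # 2

# With the default na.rm = FALSE, a single NA makes the result NA
na_sum <- sum(c(1, 2, NA))
na_sum           # NA

# With na.rm = TRUE, missing values are dropped before summing
na_rm_sum <- sum(c(1, 2, NA), na.rm = TRUE)
na_rm_sum        # 3

# NULL is treated like integer(0), so the sum is 0
null_sum <- sum(NULL)
null_sum         # 0
```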
The description of the `sum()` function indicates that if any of the arguments has a value of `NA` or `NaN` and the `na.rm` argument is set to `FALSE` (the default), then `NA` or `NaN` is returned by the function. So, in our case, the `SLP` variable most likely has `NA` values (because the `AGE` variable did not produce this issue earlier).

We can check this by examining the `SLP` variable.
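There are several base R ways to check a variable for missing values. The sketch below uses a made-up data frame `toy` (hypothetical values, not the `fem` dataset) and also shows how a missing `SLP` value carries through a combined condition into `sum()`.

```r
# Hypothetical toy data (values are made up); SLP: 1 = yes, 2 = no
toy <- data.frame(
  AGE = c(25, 40, 52, 38, 61, 19),
  SLP = c(1, 2, NA, 1, 1, NA)
)

# Is there any missing value, and how many are there?
anyNA(toy$SLP)                      # TRUE
n_missing <- sum(is.na(toy$SLP))
n_missing                           # 2

# Tabulate the values, showing NA as its own category
table(toy$SLP, useNA = "ifany")

# TRUE & NA is NA, so the combined condition contains an NA
# for the woman aged 52 whose SLP value is missing
cond <- toy$AGE >= 40 & toy$SLP == 1
cond                                # FALSE FALSE NA FALSE TRUE FALSE
sum(cond)                           # NA
```

Note that `FALSE & NA` is `FALSE`, so only missing `SLP` values among women who meet the age condition turn the result into `NA`.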
There are indeed `NA` values in the `SLP` variable.

Knowing more about the `sum()` function, we can adjust our earlier code by setting `na.rm = TRUE`, which gives us a value of 6. So, there are 6 women in the dataset who are aged 40 or older and who are sleeping normally. Continuing with this approach, we can get the values for the other groupings.
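A minimal sketch of the four groupings, assuming the fix is `na.rm = TRUE` as the help text suggests, and again using a made-up data frame `toy` rather than the `fem` dataset:

```r
# Hypothetical toy data (values are made up); SLP: 1 = yes, 2 = no
toy <- data.frame(
  AGE = c(25, 40, 52, 38, 61, 19),
  SLP = c(1, 2, NA, 1, 1, NA)
)

# na.rm = TRUE drops the NA elements of the combined condition before summing
older_slp_yes   <- sum(toy$AGE >= 40 & toy$SLP == 1, na.rm = TRUE)
older_slp_no    <- sum(toy$AGE >= 40 & toy$SLP == 2, na.rm = TRUE)
younger_slp_yes <- sum(toy$AGE < 40 & toy$SLP == 1, na.rm = TRUE)
younger_slp_no  <- sum(toy$AGE < 40 & toy$SLP == 2, na.rm = TRUE)

older_slp_yes      # 1
older_slp_no       # 1
younger_slp_yes    # 2
younger_slp_no     # 0
```

The four counts add up to 4 rather than 6 because the two women with a missing `SLP` value belong to none of these groupings.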
Depending on the kind of analysis that we want to perform, it may be helpful to save these counts into objects so we can use them later on. So, we can update the code to assign each count to a named object.
It might also be useful to count samples who didn't respond or do not have information on specific variables. In our example, we might want to count how many women aged 40 or older have not responded to whether they are sleeping normally, and how many women younger than 40 have not responded. We can do this by combining the age condition with a check for `NA` values in `SLP`.
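One way to count the non-responders is with `is.na()`, saving each count into an object. This is a sketch on a made-up data frame `toy` (not the `fem` dataset), and the object names are hypothetical.

```r
# Hypothetical toy data (values are made up); SLP: 1 = yes, 2 = no
toy <- data.frame(
  AGE = c(25, 40, 52, 38, 61, 19),
  SLP = c(1, 2, NA, 1, 1, NA)
)

# is.na() returns TRUE or FALSE (never NA), so na.rm is not needed here
# as long as AGE itself has no missing values
older_slp_na   <- sum(toy$AGE >= 40 & is.na(toy$SLP))
younger_slp_na <- sum(toy$AGE < 40 & is.na(toy$SLP))

older_slp_na      # 1
younger_slp_na    # 1
```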
Sometimes, it would be useful to present these values as a table rather than just having individual values. This can be helpful when you want to perform calculations using these counts later on.
The table can look something like this dummy table:
We can do this using the `data.frame()` function, storing the result in an object called `age_by_slp_table`.

If we wanted to add the groupings for those with `SLP` values of `NA` into the table, we can extend `age_by_slp_table` to include them.

Creating a single object that combines all the counts of the different groupings of the data that we are interested in is helpful when performing calculations that require these counts.
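The sketch below shows one possible layout for such a table: age groups as rows and sleep status as columns, including a column for missing `SLP` values. The layout, the row and column names, and the `toy` data are assumptions for illustration; only the object name `age_by_slp_table` comes from this post.

```r
# Hypothetical toy data (values are made up); SLP: 1 = yes, 2 = no
toy <- data.frame(
  AGE = c(25, 40, 52, 38, 61, 19),
  SLP = c(1, 2, NA, 1, 1, NA)
)

# Each column holds the counts for the two age groups, in row order
age_by_slp_table <- data.frame(
  slp_yes = c(
    with(toy, sum(AGE >= 40 & SLP == 1, na.rm = TRUE)),
    with(toy, sum(AGE < 40 & SLP == 1, na.rm = TRUE))
  ),
  slp_no = c(
    with(toy, sum(AGE >= 40 & SLP == 2, na.rm = TRUE)),
    with(toy, sum(AGE < 40 & SLP == 2, na.rm = TRUE))
  ),
  slp_na = c(
    with(toy, sum(AGE >= 40 & is.na(SLP))),
    with(toy, sum(AGE < 40 & is.na(SLP)))
  ),
  row.names = c("age_40_plus", "age_under_40")
)

age_by_slp_table
```

With the missing-value column included, the cells add up to the number of rows in the data, which is a useful consistency check.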
For example, if we want to report the proportion/percentage of women in the dataset who are 40 years and older, we can use the `age_by_slp_table` to perform the calculation using our knowledge of indexing in R.

If we want to report the proportion/percentage of women in the dataset who are 40 years and older and who are sleeping normally, we can calculate it in the same way, which gives:

0.05309735

that is, about 5.3% when expressed as a percentage.
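As a sketch of how indexing can be used for these calculations, the example below rebuilds a small table with made-up counts. The layout and names are assumptions, so the results differ from the values reported above.

```r
# Hypothetical counts (made up); rows are age groups, columns are sleep status
age_by_slp_table <- data.frame(
  slp_yes = c(1, 2),
  slp_no  = c(1, 0),
  slp_na  = c(1, 1),
  row.names = c("age_40_plus", "age_under_40")
)

# Total number of women is the sum of all cells
total <- sum(age_by_slp_table)

# Index a whole row by name to get the women aged 40 and older
n_older    <- sum(age_by_slp_table["age_40_plus", ])
prop_older <- n_older / total
prop_older          # 0.5
prop_older * 100    # 50

# Index a single cell by row and column name (or by position, e.g. [1, 1])
prop_older_slp_yes <- age_by_slp_table["age_40_plus", "slp_yes"] / total
prop_older_slp_yes  # 1/6
```

Indexing by name rather than by position keeps the calculation readable and still works if the columns are later reordered.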