STAT 220 Lab 3 - Exploring Categorical Data

In this lab, we will again make use of the lattice and tidyverse packages:

library(lattice)
library(tidyverse)

1. The CDC Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

The data file can be downloaded at jingshuw.org/materials/stat220/data/cdc.dat. Then set the working directory as in Lab01, and load the data using read.table as follows.

cdc = read.table("cdc.dat", header=TRUE)

If you have trouble setting working directory and loading data file, don’t hesitate to ask TA for help. To view the names of the variables, type the command

str(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (y) or did not (n). Likewise, hlthplan indicates whether the respondent had some form of health coverage (y) or did not (n). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its variable type (e.g. numerical–continuous, numerical– discrete, categorical–ordinal, categorical–nominal).

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

2. Review of Lab 2

In Lab 2, we have learned how to find numerical summaries in R, like summary(), mean(), median(), sd() , var() , min() , max() , sum() , IQR(), as well as how to make histograms, boxplots, and scatter plots. As a review, let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI)(http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

The following two lines first make a new object called bmi and then creates box plots of these values, defining groups by the variable genhlth.

cdc = transform(cdc, bmi = (weight / height^2) * 703)
bwplot(genhlth ~ bmi, data=cdc)

Notice genhlth is an ordinal categorical variable that the five levels are ordered: “excellent”, “very good”, “good”, “fair”, and “poor”. We need to tell R that genhlth is ordered, otherwise R is going to order the five levels alphabetically, just as in the boxplot above. We can specify the order of levels using the ordered

cdc = transform(cdc, 
                genhlth = ordered(genhlth, 
                levels=c("excellent", "very good", "good", "fair", "poor")))
bwplot(genhlth ~ bmi, data=cdc)

Likewise, we can make side-by-side histograms comparing the bmi of people with different self-rated health status.

histogram(~ bmi | genhlth, data=cdc, nint=50, layout = c(1,5))

We see all 5 histograms are right skewed, and the center of the histograms increase slightly as the health status gets worse.

The following scatterplot of weight versus desired weight.

qplot(weight, wtdesire, data=cdc)

Apparently there are two outliers, with incredibly large desired weights (the two guys must be kidding). To remove the two outliers so that we take a closer look at big chunk of points, we can remove these two observation from the plot using subset.

qplot(weight, wtdesire, data=subset(cdc,wtdesire <400), size= I(0.8))

Exercise 2: Based on the scatterplot, describe the relationship between these two variables.

One can use color of the point to indicate the gender (or other categorical variables) of the subjects.

qplot(weight, wtdesire, color=gender, data=subset(cdc,wtdesire <400))

Or one can make separate scatterplots of desire weight and true weight for men and women, using facets.

qplot(weight, wtdesire, facets=~gender, data=subset(cdc,wtdesire <400))

Note that the two scatter plots are simply labelled m and f, which are the two levels of the variable gender. It would be more clear if they could be labelled male and female which can be done in R as follows:

cdc = transform(cdc, gender = ordered(gender, levels=c("m", "f"), labels=c("male","female")))
qplot(weight, wtdesire, facets=~gender, data=subset(cdc,wtdesire <400))

We can use the color of points to represent the bmi of subjects:

qplot(weight, wtdesire, color=bmi, facets=~gender, data=subset(cdc,wtdesire <400))

Not surprisingly, for people with the same desired weight, the higher the actually weight, the heigher their bmi.

Finally, we should label the plot properly.

qplot(weight, wtdesire, facets=~gender, color=bmi, data=subset(cdc,wtdesire <400), xlab="Actual Weight (lb)", ylab="Desired Weight (lb)")

3. Summarizing and Visualizing Categorical Data

3.1 Tables

While it makes sense to describe a numerical variable like weight in terms of summary statistics like mean or sd, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function tally does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

## 
##     n     y 
## 10559  9441

One can convert the table of counts to proportions by the funstion prop.table.

prop.table(table(cdc$smoke100))

## 
##       n       y 
## 0.52795 0.47205

The table command can be used to tabulate any number of variables that you provide. For example, to cross tablulate the two variable smoke100 and genhlth, we could type

table(cdc$smoke100, cdc$genhlth)

##    
##     excellent very good good fair poor
##   n      2879      3758 2782  911  229
##   y      1778      3214 2893 1108  448

We can add marginal totals to the two-way table by the addmargins command.

addmargins(table(cdc$smoke100, cdc$genhlth))

##      
##       excellent very good  good  fair  poor   Sum
##   n        2879      3758  2782   911   229 10559
##   y        1778      3214  2893  1108   448  9441
##   Sum      4657      6972  5675  2019   677 20000

If one wants to view the proportions instead of counts, one can use:

prop.table(table(cdc$smoke100, cdc$genhlth))       # overall proportions

##    
##     excellent very good    good    fair    poor
##   n   0.14395   0.18790 0.13910 0.04555 0.01145
##   y   0.08890   0.16070 0.14465 0.05540 0.02240

prop.table(table(cdc$smoke100, cdc$genhlth),1)     # row proportions

##    
##      excellent  very good       good       fair       poor
##   n 0.27265840 0.35590492 0.26347192 0.08627711 0.02168766
##   y 0.18832751 0.34043004 0.30642940 0.11736045 0.04745260

prop.table(table(cdc$smoke100, cdc$genhlth),2)     # column proportions

##    
##     excellent very good      good      fair      poor
##   n 0.6182091 0.5390132 0.4902203 0.4512135 0.3382570
##   y 0.3817909 0.4609868 0.5097797 0.5487865 0.6617430

The first line gives the overall proportions(dividing the counts by the total number of cases), the second lines gives the row proportions (dividing the counts by the row totals), the last lines gives the column proportions (dividing the counts by the column totals), Likewise, three-way tables cross-tabulating 3 categorical variables can be created as.

table(cdc$smoke100, cdc$genhlth, cdc$gender)

## , ,  = male
## 
##    
##     excellent very good good fair poor
##   n      1357      1677 1121  326   66
##   y       941      1705 1601  558  217
## 
## , ,  = female
## 
##    
##     excellent very good good fair poor
##   n      1522      2081 1661  585  163
##   y       837      1509 1292  550  231

3.2 Bar Plots (or Bar Charts)

Next, we make a bar chart of the entries in the table by putting the table inside the barchart command.

barchart(table(cdc$smoke100), horizontal=FALSE)

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barchart. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

smoke = table(cdc$smoke100)
barchart(smoke, horizontal=FALSE)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it in as the input for barchart. The special symbol = performs an assignment, taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we’ll return to later.

To create a segmented bar chart representing a two-way table, we can create the two-way table first, and then barchart the table.

smoke.hlth = table(cdc$smoke100, cdc$genhlth)
barchart(smoke.hlth, horizontal=FALSE)

To get the standardized version of the barchart above (standardized the heights of all the bars), we can first convert the table to proportions and then make barchart it.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE)

In the barcharts above, it’s better to add a legend for the colors used. One can add a legend using the auto.key option.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, auto.key=TRUE)

The default location of the legend is on the top. To move the legend to the right, and to add an title to it, one can do the following

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, 
         auto.key=list(space="right", title="General Health"))

From the above, we can see the size of the legend title is too big. We can adjust the size of the legend title as follows.

barchart(prop.table(smoke.hlth,1), horizontal=FALSE, 
         auto.key=list(space="right", title="General Health",cex.title=1.2))

Exercise 3: Compute the relative frequency distribution for ‘gender’ and ‘genhlth’. How many males are in the sample? What proportion of the sample reports being in excellent health?

3.3 Mosaic Plots

To investigate whether exercise has any effect on what people feel about their general health, we may look at the two-way table

exer.hlth = table(cdc$exerany, cdc$genhlth)

To create a mosaic plot of this table, we would enter the following command.

mosaicplot(exer.hlth, color=TRUE)

As a general rule, plots should always be properly labelled. The next line shows how to change the title, x-label and y-label of the mosaic plot. The `las=1’ option makes the labels perpendicular to the axes so that they don’t overlap.

mosaicplot(exer.hlth, 
           main="Exercise and general health", 
           ylab="Self-Rated General Health", 
           xlab="Exercise in the Past Month",
           las=2, color=TRUE)

Exercise 4: What does the mosaic plot reveal about smoking habits and general health status?

4.Subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

cdc$gender == "m"

cdc$age > 65

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 65 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command

mdata = subset(cdc, gender == "male")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

m_and_over65 = subset(cdc, gender == "male" & age > 65)

will give you the data for men over the age of 65. The | character is read “or” so that

m_or_over65 = subset(cdc, gender == "male" | age > 65)

will take people who are men or over the age of 65 (who are males eligible to Medicare). In principle, you may use as many “and” and “or” clauses as you like when forming a subset, and we might be interested in their self-rated health status and exercise activity.

exer.hlth.over65 = exer.hlth = table(m_or_over65$exerany, m_or_over65$genhlth)
mosaicplot(exer.hlth.over65, 
           main="Male Over 65", 
           ylab="Self-Rated General Health", 
           xlab="Exercise in the Past Month",
           las=2, 
           color=TRUE)

Exercise 5: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise. What percentage of them are males? What percentage of them exercise at least once in the past month?

5. Ending

At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.

6. On Your Own

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

Disclaimer

This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics. This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.