In this lab, we will again make use of the lattice
and tidyverse
packages:
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.
The data file can be downloaded at jingshuw.org/materials/stat220/data/cdc.dat. Then set the working directory as in Lab01, and load the data using read.table
as follows.
If you have trouble setting working directory and loading data file, don’t hesitate to ask TA for help. To view the names of the variables, type the command
This returns the names genhlth
, exerany
, hlthplan
, smoke100
, height
, weight
, wtdesire
, age
, and gender
. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth
, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany
variable indicates whether the respondent exercised in the past month (y) or did not (n). Likewise, hlthplan
indicates whether the respondent had some form of health coverage (y) or did not (n). The smoke100
variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height
in inches, weight
in pounds as well as their desired weight, wtdesire
, age
in years, and gender
.
Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its variable type (e.g. numerical–continuous, numerical– discrete, categorical–ordinal, categorical–nominal).
You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc
has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head
, tail
or the subsetting techniques that you’ll learn in a moment.
In Lab 2, we have learned how to find numerical summaries in R, like summary()
, mean()
, median()
, sd()
, var()
, min()
, max()
, sum()
, IQR()
, as well as how to make histograms, boxplots, and scatter plots. As a review, let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI)(http://en.wikipedia.org/wiki/Body_mass_index). BMI is a weight to height ratio and can be calculated as:
\[ \text{BMI} = \frac{\text{weight}~(lb)}{\text{height}~(\text{in})^2} * 703 \]
703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).
The following two lines first make a new object called bmi
and then creates box plots of these values, defining groups by the variable genhlth
.
Notice genhlth
is an ordinal categorical variable that the five levels are ordered: “excellent”, “very good”, “good”, “fair”, and “poor”. We need to tell R that genhlth
is ordered, otherwise R is going to order the five levels alphabetically, just as in the boxplot above. We can specify the order of levels using the ordered
cdc = transform(cdc,
genhlth = ordered(genhlth,
levels=c("excellent", "very good", "good", "fair", "poor")))
bwplot(genhlth ~ bmi, data=cdc)
Likewise, we can make side-by-side histograms comparing the bmi
of people with different self-rated health status.
We see all 5 histograms are right skewed, and the center of the histograms increase slightly as the health status gets worse.
The following scatterplot of weight versus desired weight.
Apparently there are two outliers, with incredibly large desired weights (the two guys must be kidding). To remove the two outliers so that we take a closer look at big chunk of points, we can remove these two observation from the plot using subset
.
Exercise 2: Based on the scatterplot, describe the relationship between these two variables.
One can use color of the point to indicate the gender (or other categorical variables) of the subjects.
Or one can make separate scatterplots of desire weight and true weight for men and women, using facets
.
Note that the two scatter plots are simply labelled m
and f
, which are the two levels of the variable gender.
It would be more clear if they could be labelled male
and female
which can be done in R as follows:
cdc = transform(cdc, gender = ordered(gender, levels=c("m", "f"), labels=c("male","female")))
qplot(weight, wtdesire, facets=~gender, data=subset(cdc,wtdesire <400))
We can use the color of points to represent the bmi
of subjects:
Not surprisingly, for people with the same desired weight, the higher the actually weight, the heigher their bmi
.
Finally, we should label the plot properly.
While it makes sense to describe a numerical variable like weight
in terms of summary statistics like mean or sd, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function tally
does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type
##
## n y
## 10559 9441
One can convert the table of counts to proportions by the funstion prop.table.
##
## n y
## 0.52795 0.47205
The table
command can be used to tabulate any number of variables that you provide. For example, to cross tablulate the two variable smoke100
and genhlth
, we could type
##
## excellent very good good fair poor
## n 2879 3758 2782 911 229
## y 1778 3214 2893 1108 448
We can add marginal totals to the two-way table by the addmargins
command.
##
## excellent very good good fair poor Sum
## n 2879 3758 2782 911 229 10559
## y 1778 3214 2893 1108 448 9441
## Sum 4657 6972 5675 2019 677 20000
If one wants to view the proportions instead of counts, one can use:
##
## excellent very good good fair poor
## n 0.14395 0.18790 0.13910 0.04555 0.01145
## y 0.08890 0.16070 0.14465 0.05540 0.02240
##
## excellent very good good fair poor
## n 0.27265840 0.35590492 0.26347192 0.08627711 0.02168766
## y 0.18832751 0.34043004 0.30642940 0.11736045 0.04745260
##
## excellent very good good fair poor
## n 0.6182091 0.5390132 0.4902203 0.4512135 0.3382570
## y 0.3817909 0.4609868 0.5097797 0.5487865 0.6617430
The first line gives the overall proportions(dividing the counts by the total number of cases), the second lines gives the row proportions (dividing the counts by the row totals), the last lines gives the column proportions (dividing the counts by the column totals), Likewise, three-way tables cross-tabulating 3 categorical variables can be created as.
## , , = male
##
##
## excellent very good good fair poor
## n 1357 1677 1121 326 66
## y 941 1705 1601 558 217
##
## , , = female
##
##
## excellent very good good fair poor
## n 1522 2081 1661 585 163
## y 837 1509 1292 550 231
Next, we make a bar chart of the entries in the table by putting the table inside the barchart
command.
Notice what we’ve done here! We’ve computed the table of cdc$smoke100
and then immediately applied the graphical function, barchart
. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:
Here, we’ve made a new object, a table, called smoke
(the contents of which we can see by typing smoke
into the console) and then used it in as the input for barchart
. The special symbol =
performs an assignment, taking the output of one line of code and saving it into an object in your workspace.
This is another important idea that we’ll return to later.
To create a segmented bar chart representing a two-way table, we can create the two-way table first, and then barchart
the table.
To get the standardized version of the barchart above (standardized the heights of all the bars), we can first convert the table to proportions and then make barchart
it.
In the barcharts above, it’s better to add a legend for the colors used. One can add a legend using the auto.key
option.
The default location of the legend is on the top. To move the legend to the right, and to add an title to it, one can do the following
barchart(prop.table(smoke.hlth,1), horizontal=FALSE,
auto.key=list(space="right", title="General Health"))
From the above, we can see the size of the legend title is too big. We can adjust the size of the legend title as follows.
barchart(prop.table(smoke.hlth,1), horizontal=FALSE,
auto.key=list(space="right", title="General Health",cex.title=1.2))
Exercise 3: Compute the relative frequency distribution for ‘gender’ and ‘genhlth’. How many males are in the sample? What proportion of the sample reports being in excellent health?
To investigate whether exercise has any effect on what people feel about their general health, we may look at the two-way table
To create a mosaic plot of this table, we would enter the following command.
As a general rule, plots should always be properly labelled. The next line shows how to change the title, x-label and y-label of the mosaic plot. The `las=1’ option makes the labels perpendicular to the axes so that they don’t overlap.
mosaicplot(exer.hlth,
main="Exercise and general health",
ylab="Self-Rated General Health",
xlab="Exercise in the Past Month",
las=2, color=TRUE)
Exercise 4: What does the mosaic plot reveal about smoking habits and general health status?
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like
or
These commands produce a series of TRUE
and FALSE
values. There is one value for each respondent, where TRUE
indicates that the person was male (via the first command) or older than 65 (second command).
Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset
to do that for us. For example, the command
will create a new data set called mdata
that contains only the men from the cdc
data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.
As an aside, you can use several of these conditions together with &
and |
. The &
is read “and” so that
will give you the data for men over the age of 65. The |
character is read “or” so that
will take people who are men or over the age of 65 (who are males eligible to Medicare). In principle, you may use as many “and” and “or” clauses as you like when forming a subset, and we might be interested in their self-rated health status and exercise activity.
exer.hlth.over65 = exer.hlth = table(m_or_over65$exerany, m_or_over65$genhlth)
mosaicplot(exer.hlth.over65,
main="Male Over 65",
ylab="Self-Rated General Health",
xlab="Exercise in the Past Month",
las=2,
color=TRUE)
Exercise 5: Create a new object called under23_and_smoke
that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise. What percentage of them are males? What percentage of them exercise at least once in the past month?
At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.
Let’s consider a new variable: the difference between desired weight (wtdesire
) and current weight (weight
). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff
.
What type of data is wdiff
? If an observation wdiff
is 0, what does this mean about the person’s weight and desired weight. What if wdiff
is positive or negative?
Describe the distribution of wdiff
in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
Now it’s time to get creative. Find the mean and standard deviation of weight
and determine what proportion of the weights are within one standard deviation of the mean.
This lab was expanded for STAT 220 by Yibi Huang from a lab released from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel, which originally was based on a lab written by Mark Hansen of UCLA Statistics. This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.