In this lab, we will make use of the lattice and tidyverse packages.
library(lattice)
library(tidyverse)
We will also make use of the following function that provides us with useful summaries:
favstats = function(mydata){ # input mydata is a numerical vector or matrix
  result = rep(0, 9);
  mysummary = summary(mydata);
  result[1] = mysummary[1];
  result[2] = mysummary[2];
  result[3] = mysummary[3];
  result[4] = mysummary[4];
  result[5] = mysummary[5];
  result[6] = mysummary[6];
  result[7] = sd(mydata);
  result[8] = length(mydata);
  result[9] = sum(is.na(mydata));
  names(result) = c("min", "Q1", "median", "mean", "Q3", "max", "sd", "n", "missing");
  result;
}
In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. (A description of all variables can be found at http://www.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.) For this lab, we’ll focus our attention on just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
area = ames$Gr.Liv.Area
price = ames$SalePrice
Let’s look at the distribution of area in our population of home sales.
favstats(area)
## min Q1 median mean Q3 max sd n
## 334.0000 1126.0000 1442.0000 1499.6904 1742.7500 5642.0000 505.5089 2930.0000
## missing
## 0.0000
histogram(area)
If you would like to adjust the number of class intervals of your histogram, you can do so by specifying the nint
(for number of intervals) argument.
histogram(area, nint=60)
The command above gives a histogram with 60 class intervals. Try a number of different values of nint
until you get a histogram that best depicts the shape of the distribution.
Exercise 1: Comment on the shape of this population distribution. Is it symmetric, left-skewed or right-skewed?
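If you want a numerical check to accompany the histogram (not required for the exercise), one simple option is to compare the mean and the median of the population:
mean(area)    # the mean is pulled in the direction of the longer tail
median(area)  # a mean noticeably larger than the median suggests right skew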
In this lab we have access to the entire population, but this is rarely the case in real life.
Gathering information on an entire population is often extremely costly or impossible.
Because of this, we often take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.
samp1 = sample(area, 50)
This command collects a simple random sample of size 50 from the vector area, and then stores the sample in the variable samp1.
This is like going into the City Assessor’s database and pulling up the files on 50 random home sales.
Working with these 50 files would be considerably simpler than working with all 2930 home sales.
Exercise 2: Describe the distribution of this sample. How does it compare to the distribution of the population?
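One way to start, for example, is to apply the same tools we used for the population to the sample:
favstats(samp1)   # summary statistics of the sample
histogram(samp1)  # compare this with histogram(area)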
If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.
mean(samp1)
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
Exercise 3: Take a second sample, also of size 50, and call it samp2
. How does the mean of samp2
compare with the mean of samp1
? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
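A possible starting point is sketched below; your samples are random, so your numbers will differ from your neighbor's:
samp2 = sample(area, 50)
mean(samp2)               # compare with mean(samp1)
mean(sample(area, 100))   # sample of size 100
mean(sample(area, 1000))  # sample of size 1000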
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times.
Computing a single sample mean is easy:
samp = mean(sample(area, 50))
But how can we compute, say, 5000 sample means, and put those results into a data frame to get a sampling distribution? We can use the replicate
function in R, which is a useful tool for repeating a procedure.
sample_means50 = as.data.frame(replicate(5000, mean(sample(area, 50))))
names(sample_means50) = "mean"
head(sample_means50)
## mean
## 1 1345.84
## 2 1445.50
## 3 1561.68
## 4 1495.90
## 5 1484.46
## 6 1580.90
We can examine the sampling distribution from the summary statistics and the histogram.
favstats(sample_means50$mean)
## min Q1 median mean Q3 max sd
## 1187.56000 1451.52500 1498.33000 1499.87481 1546.10500 1780.10000 71.38532
## n missing
## 5000.00000 0.00000
histogram(sample_means50$mean, nint=50, xlab="mean")
Again you should try a number of different nint
values until you get a histogram that best displays the shape of the distribution. You can overlay a normal density on the histogram using the same approach as in Lab 5.
hist(sample_means50$mean, breaks = 50, freq=F, xlab="mean", main='')
x = seq(1000, 2000, by=10);
y = dnorm(x, mean(sample_means50$mean), sd(sample_means50$mean))
lines(x, y)
Exercise 4: How many elements are there in sample_means50
? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
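The commands below, for instance, answer the first question and summarize the sampling distribution; the question about 50,000 sample means is for you to reason about (or to try, if you don't mind a longer run time):
nrow(sample_means50)           # number of sample means stored
favstats(sample_means50$mean)  # center and spread of the sampling distribution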
The replicate function

Let’s take a break from the statistics for a moment to let that last block of code sink in. The idea behind the replicate function is repetition: it allows you to execute a line of code as many times as you want and put the results in a data frame. In the case above, we wanted to repeatedly take a random sample of size 50 from area and then save the mean of that sample into the sample_means50 data frame.
Without the replicate function, this would be painful. First, we’d have to create an empty vector (filled with NAs) to hold the 5000 sample means. Then, we’d have to compute each of the 5000 sample means one line at a time, putting them individually into the slots of the sample_means50 vector:
sample_means50 = rep(NA, 5000)
sample_means50[1] = mean(sample(area, 50))
sample_means50[2] = mean(sample(area, 50))
sample_means50[3] = mean(sample(area, 50))
sample_means50[4] = mean(sample(area, 50))
# ...and so on, 5000 times
With the replicate
function, these thousands of lines of code are compressed into one line:
sample_means50 = as.data.frame(replicate(5000, mean(sample(area, 50))))
Note that for each of the 5000 times we computed a mean, we did so from a different sample!
Exercise 5: To make sure you understand what the replicate
function does, try modifying the code to take only 100 sample means and put them in a data frame named sample_means_small
. Print the output to your screen (type sample_means_small
into the console and press enter). How many elements are there in this object called sample_means_small
? What does each element represent?
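A minimal sketch, mirroring the one-line replicate code above and keeping the sample size at 50:
sample_means_small = as.data.frame(replicate(100, mean(sample(area, 50))))
names(sample_means_small) = "mean"
sample_means_small        # print all of the sample means
nrow(sample_means_small)  # number of elements stored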
Mechanics aside, let’s return to the reason we used the replicate
function: to compute a sampling distribution. The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
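You can see both properties in the numbers themselves; the center of the sampling distribution should land very close to the population mean:
mean(area)                 # population mean
mean(sample_means50$mean)  # center of the sampling distribution
sd(sample_means50$mean)    # variability induced by sampling only 50 home sales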
To get a sense of the effect that sample size has on our distribution, let’s build up three sampling distributions: one based on a sample size of 10, one based on a sample size of 50, and another based on a sample size of 100.
sample_means10 = as.data.frame(replicate(5000, mean(sample(area, 10))))
names(sample_means10) = "mean"
sample_means50 = as.data.frame(replicate(5000, mean(sample(area, 50))))
names(sample_means50) = "mean"
sample_means100 = as.data.frame(replicate(5000, mean(sample(area, 100))))
names(sample_means100) = "mean"
hist(sample_means10$mean, breaks = 50, freq=F, xlab="mean", main="")
x = seq(1000, 2000, by=10);
y = dnorm(x, mean(sample_means10$mean), sd(sample_means10$mean))
lines(x, y)

hist(sample_means50$mean, breaks = 50, freq=F, xlab="mean", main="")
x = seq(1000, 2000, by=10);
y = dnorm(x, mean(sample_means50$mean), sd(sample_means50$mean))
lines(x, y)

hist(sample_means100$mean, breaks = 50, freq=F, xlab="mean", main="")
x = seq(1000, 2000, by=10);
y = dnorm(x, mean(sample_means100$mean), sd(sample_means100$mean))
lines(x, y)
From the histograms above, we can see that when the sample size is 100, the sampling distribution of the sample mean is fairly close to normal. Based on the Central Limit Theorem, the sampling distribution of the sample mean is roughly \(N(\mu, \sigma/\sqrt{n})\), where \(\mu=1499.69\) Sq.ft. is the population mean and \(\sigma=505.5089\) Sq.ft. is the population SD. The values of \(\mu\) and \(\sigma\) were found at the beginning of the lab.
As the sampling distribution of the sample mean looks fairly normal when the sample size is 100, we can use the CLT to find the (approximate) probability of getting a sample mean below 1450 Sq. ft. That is, to find \(P(\overline{X}<1450)\) where \(\overline{X}\) is normal with mean \(\mu=1499.69\) and SD \(\sigma/\sqrt{n}=505.5089/\sqrt{100}=50.55089.\) The z-score for 1450 is \(Z=(1450-1499.69)/50.55089\approx-0.983.\) The probability is hence \[ P(\overline{X}<1450)\approx P(Z<-0.98)\approx 0.1635\approx 16\% \] So when drawing a random sample of size 100, the sample mean will be below 1450 Sq. ft. about 16% of the time.
Alternatively, we can compute the probability above using R as follows
pnorm(1450, 1499.69, 505.5089/sqrt(100))
We can check how well the CLT works by comparing the probability computed above with the actual proportion of the 5000 sample means that are below 1450. The R commands below count the number of entries in sample_means100 that are \(<1450\) (TRUE) and \(\ge 1450\) (FALSE), and then convert the counts to proportions.
table(sample_means100 < 1450)
table(sample_means100 < 1450)/5000
Is the proportion close to 16%?
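Since the mean of a logical vector is the proportion of TRUE values, the same proportion can also be obtained in one step:
mean(sample_means100$mean < 1450)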
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
(a) Make a histogram of price, which shows the population distribution of the sale price of all 2930 homes in the data. Comment on the shape of the histogram.
(b) Find the mean and the SD of price, which are the population mean and the population SD.
(c) Take a random sample from price. Find the sample mean, and compare it with the population mean you found in the previous part.
(d) Take 5000 random samples of size 25 from price, compute the mean of each sample, and store the 5000 sample means in a data frame called sample_means25. Make a histogram of the 5000 sample means.
(e) Repeat the previous part with samples of size 4, storing the sample means in a data frame called sample_means4.
(f) Repeat with samples of size 100, storing the sample means in a data frame called sample_means100.
(g) Compute the means and SDs for the 5000 sample means as follows.
mean(sample_means4$mean)
mean(sample_means25$mean)
mean(sample_means100$mean)
sd(sample_means4$mean)
sd(sample_means25$mean)
sd(sample_means100$mean)
Are they close to their respective theoretical values \(\mu\) and \(\sigma/\sqrt{n}\)?
(h) Use the CLT to find the approximate probability that the mean sale price of a random sample of 100 homes is below $170,000.
(i) Find the percentage of values in sample_means100$mean that are below $170,000. Is the percentage close to the probability computed in the previous part using CLT?
(j) Use the CLT to find the approximate probability that the mean sale price of a random sample of 4 homes is between $130,000 and $190,000.
(k) Find the percentage of values in sample_means4$mean that are between $130,000 and $190,000.
table(sample_means4 < 190000 & sample_means4 > 130000)
table(sample_means4 < 190000 & sample_means4 > 130000)/5000
Is the percentage close to the probability computed in the previous part using CLT? Does the CLT work well when the sample size is only 4?
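If you get stuck on the sampling-distribution parts, here is a minimal sketch of how the pieces could be set up, reusing the same functions from earlier in the lab (the sample size in part (c), the histograms for parts (e) and (f), and all interpretation are left to you):
histogram(price)  # population distribution of sale prices
favstats(price)   # population mean and SD
# sampling distributions of the mean sale price for three sample sizes
sample_means25 = as.data.frame(replicate(5000, mean(sample(price, 25))))
names(sample_means25) = "mean"
sample_means4 = as.data.frame(replicate(5000, mean(sample(price, 4))))
names(sample_means4) = "mean"
sample_means100 = as.data.frame(replicate(5000, mean(sample(price, 100))))
names(sample_means100) = "mean"
histogram(sample_means25$mean, nint=50, xlab="mean")
# CLT approximations, using the population mean and SD of price
pnorm(170000, mean(price), sd(price)/sqrt(100))
pnorm(190000, mean(price), sd(price)/sqrt(4)) - pnorm(130000, mean(price), sd(price)/sqrt(4))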
This lab was expanded for STAT 220 by Yibi Huang from a lab released by OpenIntro, written by Andrew Bray and Mine Çetinkaya-Rundel, which was originally based on a lab written by Mark Hansen of UCLA Statistics.
This lab can be shared or edited under a Creative Commons Attribution-ShareAlike 3.0 Unported licence.