STAT 220 Lab 2 — Exploring Numerical Data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, we will show you how to use R to compute numerical summaries of data and how to display data graphically. Before we start, we will need to load the following libraries:

library(tidyverse)
library(lattice)

1. The Diamonds Dataset

The diamonds dataset is built into the ggplot2 library, which is one of the components of the tidyverse library. This means that we can access it using the data() function after loading the tidyverse library:

data(diamonds)

The variables in the diamonds dataset are

price: price in US dollars
carat: weight of the diamond
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond color, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is, from I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (best)

and five physical measurements, depth, table, x, y and z, as shown in Figure 1 below

Figure 1

The dimension of the diamonds dataset is

dim(diamonds)

## [1] 53940    10

from which we can see the data contains 53940 rows and 10 variables.

To view the names and types of the variables, type the command

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Here we can see that carat, depth, table, price and x, y, z are numerical variables, and cut, color, and clarity are ordinal categorical variables. Specifically, price is an integer-valued variable.
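Because cut, color, and clarity are ordered factors, the order of their levels determines how groups are sorted in later tables and plots. A quick way to check that order, sketched here with base R functions only:

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)

# Levels are printed in factor order, from the first level to the last
levels(diamonds$cut)
levels(diamonds$clarity)

# TRUE for ordered factors, FALSE for ordinary (unordered) factors
is.ordered(diamonds$clarity)

# First class label of every column at a glance
sapply(diamonds, function(v) class(v)[1])
```

Note that for color the factor order runs D < E < ... < J, that is, from the best color to the worst, whereas cut and clarity run from worst to best.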

2. Numerical Summary of Data

To calculate the mean, median, SD, variance, five-number summary, IQR, minimum, and maximum of the price variable in the diamonds dataset, type

# Numerical Summary of Data
mean(diamonds$price)
median(diamonds$price)
sd(diamonds$price)
var(diamonds$price)
fivenum(diamonds$price)
IQR(diamonds$price)
min(diamonds$price)
max(diamonds$price)
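Several of these summaries are related to one another, and a short check can make the definitions concrete. The sketch below assumes R's default quantile method; fivenum() uses Tukey's hinges instead, which can differ slightly from the quartiles, mostly in small samples.

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)
price <- diamonds$price

# summary() reports the minimum, quartiles, median, mean and maximum in one call
summary(price)

# The variance is the square of the standard deviation
all.equal(var(price), sd(price)^2)

# The IQR is the distance between the third and first quartiles
q <- quantile(price, c(0.25, 0.75))
all.equal(IQR(price), unname(q[2] - q[1]))

# The five-number summary starts at the minimum and ends at the maximum
f <- fivenum(price)
c(f[1] == min(price), f[5] == max(price))
```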

In statistics, it is also important to be able to compute averages of variables within specific groupings of the data. For example, in the diamonds dataset the cut variable has 5 possible values: [Fair, Good, Very Good, Premium, Ideal], and we may be interested in the average price of diamonds in each category of cut. To compute this, we can use the aggregate() function:

aggregate(list(price=diamonds$price), list(cut=diamonds$cut), FUN=mean)

##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542

The aggregate() call above uses three arguments. The first is a list containing the columns of data we wish to aggregate; computing a mean, a sum, or a maximum are all examples of aggregation. In this case, we want to compute the mean of price. The second argument is a list of the variables we wish to group by, which in this case is the cut variable. Finally, we must tell the function which aggregation method to use: FUN=mean tells it to compute a mean. The second argument is a list because we may wish to group by multiple variables. For example, try to figure out what the following commands do:

aggregate(list(price=diamonds$price), list(cut=diamonds$cut, color=diamonds$color), FUN=median)
aggregate(list(price=diamonds$price, carat=diamonds$carat), list(color=diamonds$color), FUN=min)
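aggregate() also accepts a formula, which some readers find easier to follow than the list form. A minimal sketch comparing the two interfaces on the same grouping (the object names by_list and by_formula are only illustrative):

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)

# List interface, as used above
by_list <- aggregate(list(price = diamonds$price),
                     list(cut = diamonds$cut),
                     FUN = mean)

# Formula interface: "price grouped by cut", with columns looked up in `data`
by_formula <- aggregate(price ~ cut, data = diamonds, FUN = mean)

# Both calls should give the same group means
all.equal(by_list$price, by_formula$price)

# Grouping by two variables uses `+` on the right-hand side
head(aggregate(price ~ cut + color, data = diamonds, FUN = median))
```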

Surprisingly, diamonds with a higher-quality cut are not necessarily more expensive: the mean price of diamonds with the best cut (Ideal) is $3457.5, lower than that of the worst cut (Fair), $4358.8. This is because we didn’t take the weight (carat) of diamonds into account. Diamonds with Ideal cut tend to be smaller than diamonds with Fair cut.

aggregate(list(carat=diamonds$carat), 
          list(cut=diamonds$cut), 
          FUN=mean)

##         cut     carat
## 1      Fair 1.0461366
## 2      Good 0.8491847
## 3 Very Good 0.8063814
## 4   Premium 0.8919549
## 5     Ideal 0.7028370
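One way to probe this explanation is to compare cuts only among diamonds of similar weight. The sketch below restricts the data to an arbitrary illustrative range of 0.7 to 1 carat and then aggregates price and carat together; the results are not shown here and should be inspected rather than assumed.

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)

# Keep only diamonds between 0.7 and 1 carat (an arbitrary illustrative range)
similar <- subset(diamonds, carat >= 0.7 & carat < 1)

# Mean price and mean carat by cut within that narrower weight range
res <- aggregate(cbind(price, carat) ~ cut, data = similar, FUN = mean)
res
```

If the mean carat values in the output are close across cuts, the remaining differences in mean price are less affected by weight.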

We can find the mean price of diamonds grouped by cut and clarity:

aggregate(list(price=diamonds$price), 
          list(cut=diamonds$cut, clarity=diamonds$clarity), 
          FUN=mean)

The aggregate() function also works for other aggregation methods such as median(), sd(), var(), min(), max(), sum(), and IQR().
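Since the tidyverse is already loaded, several summaries per group can also be computed in one step with dplyr's group_by() and summarise(). This is an alternative to aggregate(), not a replacement for it; the column names on the left of each = are arbitrary choices.

```r
library(tidyverse)   # loads dplyr and ggplot2 (which provides diamonds)
data(diamonds)

price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n            = n(),            # number of diamonds in the group
    mean_price   = mean(price),
    median_price = median(price),
    sd_price     = sd(price),
    iqr_price    = IQR(price)
  )

price_by_cut
```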

3. Graphical Display of Data

3.1 Histogram

Histogram of the carat variable:

histogram(~carat, data=diamonds)

You can adjust the binwidth of the histogram

histogram(~carat, data=diamonds, width = 0.1)

histogram(~carat, data=diamonds, width = 0.01)

or you can adjust the number of intervals (nint)

histogram(~carat, data=diamonds, nint = 50)
histogram(~carat, data=diamonds, nint = 500)

or you can set the end points of the intervals yourself if you want unequal binwidths

histogram(~carat, data=diamonds, breaks=c(0.2,0.3,0.5,0.7,1,1.5,2,2.5,3,4,5.5))
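Equal-width bins can also be specified through breaks, by generating the end points with seq(). The sketch below uses a bin width of 0.25 carat, an arbitrary choice, and checks first that the end points cover every observed value.

```r
library(ggplot2)   # provides the diamonds data
library(lattice)
data(diamonds)

# Bin edges every 0.25 carat, wide enough to cover the whole range of carat
edges <- seq(0, 5.25, by = 0.25)
range(diamonds$carat)   # all values should fall inside the edges

histogram(~carat, data = diamonds, breaks = edges)

# type = "count" puts counts rather than percentages on the vertical axis
histogram(~carat, data = diamonds, breaks = edges, type = "count")
```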

You can split the diamonds by the quality of cut and make separate histograms for each level of cut

histogram(~price | cut, data=diamonds)

You can adjust the number of bins of side-by-side histograms by changing width, nint, or breaks as for a single histogram.

histogram(~price | cut, data=diamonds, width = 1000)

It’s usually better to stack the five histograms vertically so that they share the same horizontal axis, which makes them easier to compare.

histogram(~price | cut, data=diamonds, width = 250, layout = c(1,5))

3.2 Boxplots

R has a built-in function boxplot() for making boxplots, e.g., the boxplot for the price of diamonds

boxplot(diamonds$price, horizontal=TRUE)

There is also a bwplot() function in the lattice library which is more versatile:

bwplot(~price, data=diamonds)
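A boxplot is drawn from a handful of summary numbers. In base R these can be inspected with boxplot.stats(), which by default treats points lying more than 1.5 box lengths beyond the box as outliers. A short sketch:

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)
price <- diamonds$price

b <- boxplot.stats(price)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
b$stats

# The box itself (hinges and median) matches the five-number summary
all.equal(b$stats[2:4], fivenum(price)[2:4])

# Number of points that would be drawn individually beyond the whiskers
length(b$out)
```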

Side-by-Side Boxplots

We can use a side-by-side boxplot to examine the relationship between a categorical variable and a numerical variable. For example, we compare the prices of diamonds with different clarity.

bwplot(price ~ clarity, data=diamonds)

If we flip price and clarity, the boxplots become horizontal.

bwplot(clarity ~ price, data=diamonds)

It might seem surprising that diamonds with better clarity (IF, VVS1) tend to have lower prices than those with lower clarity. This is because we didn’t adjust for carat: larger diamonds are more valuable and are also more likely to have defects or impurities. If we take diamonds of similar size (e.g., 0.7 to 1 carat) and make side-by-side boxplots of price by clarity, then diamonds with better clarity generally have higher prices.

bwplot(clarity~ price, 
       data=subset(diamonds, carat >= 0.7 & carat < 1))

You can change the range of carat and see if the same relationship persists. Alternatively, one can create a categorical variable that groups diamonds of similar size together, and create side-by-side boxplots of price by clarity within each group:

carat.grp = cut(diamonds$carat, breaks=c(0.2, 0.5, 0.7, 1, 1.5, 2, Inf), right=FALSE)
bwplot(clarity~ price | carat.grp, data=diamonds, layout=c(6,1))
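Before reading the panels, it can help to check how many diamonds fall in each weight group, since boxplots based on very few diamonds are less reliable. A brief sketch using the same breaks as above:

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)

carat.grp <- cut(diamonds$carat,
                 breaks = c(0.2, 0.5, 0.7, 1, 1.5, 2, Inf),
                 right = FALSE)

# Number of diamonds in each weight group
table(carat.grp)

# A nonzero count here would mean some carat values fell outside the breaks
sum(is.na(carat.grp))

# Cross-tabulate weight group against clarity
table(carat.grp, diamonds$clarity)
```

With right = FALSE each interval includes its left end point and excludes its right one, so a 1-carat diamond falls in [1,1.5) rather than [0.7,1).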

3.3 Scatterplots

Scatterplot between carat and price of diamonds:

qplot(carat, price, data=diamonds)

Scatterplot between carat and price of diamonds, both log-transformed

qplot(log(carat), log(price), data=diamonds)

Coded Scatterplot between carat and price, with the clarity of diamonds represented by the color of dots.

qplot(log(carat), log(price), color= clarity, data=diamonds)

From the above we can see that for diamonds of the same carat, those with better clarity are more valuable. In addition to color, one can also use the shape or size of the dots to represent the third variable, e.g.,

qplot(log(carat), log(price), shape= clarity, data=diamonds)
qplot(log(carat), log(price), size= clarity, data=diamonds)

However, it’s hard to view the relationship among four variables in a single plot. To see the effect of cut on price after accounting for carat and clarity, we can split the diamonds by clarity and make a scatterplot of price against carat in each group, using the color of the dots to represent the cut of the diamonds. To split the data by clarity, we add facets= ~clarity.

qplot(log(carat), log(price), color= cut, facets= ~clarity, data=diamonds)
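qplot() is a shortcut, and it is marked as deprecated from ggplot2 3.4.0 onward, so the same plot may also be worth writing with ggplot(). The sketch below is one equivalent formulation; the alpha value is an arbitrary choice added here to reduce overplotting and is not part of the original call.

```r
library(ggplot2)   # provides ggplot() and the diamonds data
data(diamonds)

p <- ggplot(diamonds, aes(x = log(carat), y = log(price), color = cut)) +
  geom_point(alpha = 0.3) +   # semi-transparent points; 0.3 is arbitrary
  facet_wrap(~clarity)        # one panel per clarity level

p
```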

Now we examine how the color of diamonds affects price after accounting for carat and clarity. Apparently, for diamonds with the same carat and clarity, value increases with color grade, from J (least valuable) to D (most valuable).

qplot(log(carat), log(price), color= color, facets= ~ clarity, data=diamonds)
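The visual impression can be cross-checked numerically with the tools from Section 2. The sketch below fixes one clarity level and a narrow carat range (both arbitrary illustrative choices) and compares the median price across colors; the output is not shown here and should be inspected rather than assumed.

```r
library(ggplot2)   # provides the diamonds data
data(diamonds)

# One clarity level and a narrow weight range, chosen only for illustration
similar <- subset(diamonds, clarity == "VS2" & carat >= 0.7 & carat < 1)

# Median price by color within this subset (rows run from D to J)
res <- aggregate(price ~ color, data = similar, FUN = median)
res
```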
