Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information, that is, the data. In this lab, we will show you how to use R to compute numerical summaries of data and to display data graphically. Before we start, we need to load the following libraries:

library(tidyverse)
library(lattice)

1. The Diamonds Dataset

The diamonds dataset is built into the ggplot2 library, which is one of the components of the tidyverse library. This means that we can access it using the data() function after loading the tidyverse library:

data(diamonds)

The variables in the diamonds dataset are price, carat, cut, color, clarity, and five physical measurements (depth, table, x, y, and z), as shown in Figure 1 below.

Figure 1

The dimension of the diamonds dataset is

dim(diamonds)
## [1] 53940    10

from which we can see the data contains 53940 rows and 10 variables.

To view the names and types of the variables, type the command

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Here we can see that carat, depth, table, price, x, y, and z are numerical variables, and cut, color, and clarity are ordinal categorical variables (ordered factors). Among the numerical variables, price is integer-valued.
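As a quick check on the variable types described above, the following sketch lists the class of each column and the ordered levels of cut. It assumes only that ggplot2 is installed; the names col_classes and cut_levels are illustrative and not part of the lab.

```r
# Load the data directly from ggplot2
data(diamonds, package = "ggplot2")

# First class of each column; ordered factors report "ordered"
col_classes <- sapply(diamonds, function(col) class(col)[1])
print(col_classes)

# Levels of the ordinal variable cut, listed from lowest to highest
cut_levels <- levels(diamonds$cut)
print(cut_levels)

# is.ordered() confirms that the levels carry an ordering
print(is.ordered(diamonds$cut))
```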

2. Numerical Summary of Data

To calculate the mean, median, standard deviation (SD), variance, five-number summary, interquartile range (IQR), minimum, and maximum of the price variable in the diamonds dataset, type

# Numerical Summary of Data
mean(diamonds$price)
median(diamonds$price)
sd(diamonds$price)
var(diamonds$price)
fivenum(diamonds$price)
IQR(diamonds$price)
min(diamonds$price)
max(diamonds$price)
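Several of these summaries are related to one another. The sketch below, which uses an illustrative variable named price_vec, checks that the variance is the square of the SD, that the IQR is the difference between the third and first quartiles, and that the five-number summary begins at the minimum and ends at the maximum. Note that fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by quantile().

```r
data(diamonds, package = "ggplot2")
price_vec <- diamonds$price

# summary() reports the minimum, quartiles, median, mean and maximum in one call
print(summary(price_vec))

# The variance is the square of the standard deviation
var_matches_sd <- isTRUE(all.equal(var(price_vec), sd(price_vec)^2))
print(var_matches_sd)

# IQR() is the distance between the third and first quartiles
quartiles <- quantile(price_vec, c(0.25, 0.75))
iqr_matches <- isTRUE(all.equal(IQR(price_vec), unname(quartiles[2] - quartiles[1])))
print(iqr_matches)

# fivenum() starts at the minimum and ends at the maximum
five <- fivenum(price_vec)
print(c(five[1] == min(price_vec), five[5] == max(price_vec)))
```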

In statistics, it is also important to be able to compute averages of variables within specific groupings of the data. For example, in the diamonds dataset the cut variable has 5 possible values (Fair, Good, Very Good, Premium, Ideal), and we may be interested in the average price of diamonds in each category of cut. To compute this, we can use the aggregate() function:

aggregate(list(price=diamonds$price), list(cut=diamonds$cut), FUN=mean)
##         cut    price
## 1      Fair 4358.758
## 2      Good 3928.864
## 3 Very Good 3981.760
## 4   Premium 4584.258
## 5     Ideal 3457.542

The aggregate() function takes three arguments here. The first is a list containing the columns of data we wish to aggregate. Aggregation means reducing a set of values to a single summary, such as a mean, a sum, or a maximum. In this case, we want to compute the mean of price. The second argument is a list of the variables we wish to group by, which in this case is the cut variable. Finally, we must tell the function which aggregation method to use: FUN=mean tells it to compute a mean. The second argument is a list because we may wish to group by multiple variables. For example, try to figure out what the following commands do:

aggregate(list(price=diamonds$price), list(cut=diamonds$cut, color=diamonds$color), FUN=median)
aggregate(list(price=diamonds$price, carat=diamonds$carat), list(color=diamonds$color), FUN=min)

Surprisingly, diamonds with a higher quality cut are not necessarily more expensive. For example, the mean price of diamonds with the best cut (Ideal) is $3457.5, lower than that of diamonds with the worst cut (Fair), $4358.8. This is because we didn’t take the weight (carat) of the diamonds into account. Diamonds with an Ideal cut tend to be smaller than diamonds with a Fair cut:

aggregate(list(carat=diamonds$carat), 
          list(cut=diamonds$cut), 
          FUN=mean)
##         cut     carat
## 1      Fair 1.0461366
## 2      Good 0.8491847
## 3 Very Good 0.8063814
## 4   Premium 0.8919549
## 5     Ideal 0.7028370
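One way to take weight into account is to compare prices only among diamonds of similar size. The sketch below restricts the data to diamonds between 0.9 and 1.1 carats and recomputes the mean price by cut. The band limits and the names near_one_carat and price_by_cut are assumptions chosen for illustration, not part of the lab.

```r
data(diamonds, package = "ggplot2")

# Keep only diamonds of roughly one carat (band chosen for illustration)
in_band <- diamonds$carat >= 0.9 & diamonds$carat <= 1.1
near_one_carat <- diamonds[in_band, ]

# Mean price by cut within this narrower weight range
price_by_cut <- aggregate(list(price = near_one_carat$price),
                          list(cut = near_one_carat$cut),
                          FUN = mean)
print(price_by_cut)
```

Comparing this table with the unrestricted means shows how much of the difference between cuts is tied to diamond size.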

We can find the mean price of diamonds grouped by cut and clarity:

aggregate(list(price=diamonds$price), 
          list(cut=diamonds$cut, clarity=diamonds$clarity), 
          FUN=mean)

The aggregate() function also works with other aggregation methods such as median(), sd(), var(), min(), max(), sum(), and IQR().
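The grouped means computed with the list-based calls can also be written more compactly. The sketch below shows two alternatives: the formula interface of aggregate() and a dplyr pipeline (dplyr is loaded as part of the tidyverse). The object names by_cut_formula and by_cut_dplyr are illustrative.

```r
library(dplyr)
data(diamonds, package = "ggplot2")

# Formula interface: response on the left, grouping variable(s) on the right
by_cut_formula <- aggregate(price ~ cut, data = diamonds, FUN = mean)

# dplyr equivalent: group the rows, then summarise each group
by_cut_dplyr <- diamonds %>%
  group_by(cut) %>%
  summarise(price = mean(price))

print(by_cut_formula)
print(by_cut_dplyr)
```

To group by more than one variable, the formula becomes price ~ cut + clarity, and the dplyr call becomes group_by(cut, clarity).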

3. Graphical Display of Data

3.1 Histogram

Histogram of the carat variable:

histogram(~carat, data=diamonds)

You can adjust the bin width of the histogram:

histogram(~carat, data=diamonds, width = 0.1)

histogram(~carat, data=diamonds, width = 0.01)
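If width appears to have no effect in your session (it may rely on an extension such as the one the mosaic package provides, rather than on lattice alone), the same equal-width bins can be requested explicitly through breaks. The following is a minimal sketch under that assumption; bin_width, carat_breaks, and carat_hist are illustrative names.

```r
library(lattice)
data(diamonds, package = "ggplot2")

# Equally spaced break points, 0.1 carat apart, covering all observed values
bin_width <- 0.1
lower <- floor(min(diamonds$carat) / bin_width) * bin_width
upper <- ceiling(max(diamonds$carat) / bin_width) * bin_width
carat_breaks <- seq(lower, upper, by = bin_width)

# Pass the break points to histogram() and draw the plot
carat_hist <- histogram(~carat, data = diamonds, breaks = carat_breaks)
print(carat_hist)
```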

Alternatively, you can set the number of intervals with nint:

# nint sets the number of bins
histogram(~carat, data=diamonds, nint = 50)
histogram(~carat, data=diamonds, nint = 500)

You can also set the end points of the intervals yourself if you want unequal bin widths:

histogram(~carat, data=diamonds, breaks=c(0.2,0.3,0.5,0.7,1,1.5,2,2.5,3,4,5.5))

You can split the diamonds by the quality of cut and make a separate histogram for each level of cut:

histogram(~price | cut, data=diamonds)

You can adjust the number of bins of side-by-side histograms by changing width, nint, or breaks as for a single histogram.

histogram(~price | cut, data=diamonds, width = 1000)

It’s usually better to stack the five histograms in a single column so that they share the same horizontal axis. Here layout = c(1,5) requests 1 column and 5 rows of panels.

histogram(~price | cut, data=diamonds, width = 250, layout = c(1,5))

3.2 Boxplots

R has a built-in function boxplot() for making boxplots. For example, the boxplot of diamond prices is

boxplot(diamonds$price, horizontal=TRUE)
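A boxplot is drawn from numbers closely related to the five-number summary from Section 2. The sketch below uses boxplot.stats() to retrieve those numbers and compares the box (lower hinge, median, upper hinge) with fivenum(). The names price_vec and bp are illustrative.

```r
data(diamonds, package = "ggplot2")
price_vec <- diamonds$price

# boxplot.stats() returns the values the boxplot is drawn from
bp <- boxplot.stats(price_vec)

# The lower hinge, median and upper hinge match the middle of fivenum()
print(bp$stats[2:4])
print(fivenum(price_vec)[2:4])

# Whiskers stop at the most extreme points within 1.5 box lengths of the box;
# points beyond them are stored in $out and drawn individually
print(length(bp$out))
print(max(price_vec) > bp$stats[5])
```

When the data contain outliers, the whisker ends differ from the minimum and maximum reported by fivenum().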