The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the texbook and also to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
Download the installation file here. You will need to choose a location that is close to you (this affects the download speed). Please install the correct version for your operating system. You should agree to all of the installation defaults (unless you already know how to customize R).
RStudio provides a friendlier working environment for R
Download RStudio Preview. Select an isntaller from (Desktop Version) based on your OS and then install.
We recommend always starting up the RStudio program instead of working in R directly.
The RStudio interface consists of several windows (see Figure 1 below).
Bottom left: command window. Here you can type simple commands after the “>” prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
Top left: script window. Collections of commands (scripts) can be edited and saved. When you don’t get this window, you can open it with [File] – [New] – [R script]. Just typing a command in the editor window is not enough, it has to get into the command window before R executes the command. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER
to send it to the command window.
Top right: workspace / history window. In the workspace window you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.
Bottom right: files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function.
You can change the size of the windows by dragging the grey bars between the windows.
tidyverse
and lattice
LibrariesThere are many libraries (also called packages) that can be installed to expand the standard R functions. For STAT220, we will rely maainly on two libraries named tidyverse
, which contains useful libraries for working with data and lattice
, which contains useful plotting functions for data visualisation. To find out more about the tidyverse
family of packages, please refer to the documentation. To install these packages, you need to run the following:
This may take a few minutes to complete. During the installation process
Note that you will only have to install libraries once. However, once the package is installed, you will have to load it into the workspace before it can be used, as follows:
Note that you will have to do this every time you start a new R session.
Your working directory is the folder on your computer in which you are currently working. When you ask R to open a certain file, it will look in the working directory for this file, and when you tell R to save a data file or figure, it will save it in the working directory.
Before you start working, please set your working directory to where all your data and script files are or should be stored.
Type in the command window: setwd("directoryname")
. For example:
Make sure that the slashes are forward slashes and that you don’t forget the apostrophes. R is case sensitive, so make sure you write capitals where necessary. Within RStudio you can also go to the menu on the top : [Session] –> [Set Working Directory] –> [Choose Directory…]. Or you can also go the [Files] pane on the bottom right, navigate to the directory you want, choose [More] and then [Set As Working Directory].
The Arbuthnot data set we are going to work on today comes from Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710.
The Arbuthnot data set can be download at https://jingshuw.org/materials/data/arbuthnot.dat and save it to the working directory that you just set.
After setting the working directory, we can use the read.table
function to load the data set.
The argument header=TRUE
means that the first line of the data file is the header of the data file, which contains the names the variables, and the data stars from the second line.
If the data set is loaded successfully, you should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot
that has 82 observations on 3 variables.
Setting working directory and loading data file are two steps many students struggle with. Ask TA for help if you have any trouble.
We can take a look at the data by typing its name into the console.
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.
Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame.
You can see the dimensions of this data frame by typing:
This command should output [1] 82 3
, indicating that there are 82 rows and 3 columns (we’ll get to what the [1]
means shortly), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:
You should see that the data frame contains the columns year
, boys
, and girls
. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim
and names
commands, for example, each took a single argument, the name of a data frame.
One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot
in the workspace panel (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.
We can access a single variable of a data frame using the dollar-sign $
notation. For example, the command below
## [1] 5218 4858 4422 4994 5158 5035 5106 4917 4703 5359 5366 5518 5470 5460 4793
## [16] 4107 4047 3768 3796 3363 3079 2890 3231 3220 3196 3441 3655 3668 3396 3157
## [31] 3209 3724 4748 5216 5411 6041 5114 4678 5616 6073 6506 6278 6449 6443 6073
## [46] 6113 6058 6552 6423 6568 6247 6548 6822 6909 7577 7575 7484 7575 7737 7487
## [61] 7604 7909 7662 7602 7676 6985 7263 7632 8062 8426 7911 7578 8102 8031 7765
## [76] 6113 8366 7952 8379 8239 7840 7640
will return the boys
variable in the data frame arbuthnot
, which shows the number of boys baptized each year. Likewise arbuthnot$girls
will return just the counts of girls baptized each year.
Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218
follows [1]
, indicating that 5218
is the first entry in the vector. And the second line starts with [15]
means that first number 4793
in the second line is the 15-th entry in the vector, and so on for the rest of the output.
To see the element in the row 56 and the column 3 (in this case, girls
) of the table, we can type
To see the first 10 numbers in the third column, we can type
In this expression, we have asked just for rows in the range 1 through 10. R uses the :
to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering
Finally, if we want all of the data for the first 10 years, type
By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 56-th, or rows 1 through 10. Try the following to see the 3rd variable for all 82 years
Recall that column 3 represents the number of girls baptized in that year, so the command above should output the same list of numbers as arbuthnot$girls
. We see the number of girls baptized for the 56th year by typing
Similarly, for just the first 10 years
The command above returns the same result as the arbuthnot[1:10,3]
command. Both row-and-column notation and dollar-sign notation are widely used, which one you choose to use depends on your personal preference.
We can use the qplot
function in the tidyverse
package to make scatterplots. For example, we can create a simple plot of the number of girls baptized per year with the command
Note you need to load library(tidyverse)
before using qplot
. Otherwise, you will get the error message: Error: could not find function "qplot"
. You just nead to load tidyverse
once in each R session.
The plot itself should appear under the [Plots] tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with two arguments separated by a comma. The first argument in the plot function specifies the variable for the x-axis and the second for the y-axis. If we wanted to connect the data points with lines, we could add a third argument, geom="line"
.
It’s a good habit to label the axes of the plot.
Now, suppose we are interested in the ratio of the number of boys to the number of girls baptized in 1629 To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like
to see the boys/girls ratio in 1629. We could repeat this once for each year, but there is a faster way. If we take the ratio of the vector for baptisms for boys and girls, R will compute all ratios simultaneously.
You can also use with
to avoid repeatedly typing the name of the data frame. with
instructs R to interpret everything else from within the data frame that you specify.
What you will see are 82 numbers, each one representing the sum we’re after. Take a look at a few of them and verify that they are right. Therefore, we can make a plot of the total number of baptisms per year with the command
Similarly to how we computed the proportion of boys, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with
or we can act on the complete vectors with the expression
Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
Now, we make a plot of the proportion of boys over time. What do you see?
A Useful Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.
To export a plot, go to the [Files] tab in the bottomright panel, click on [Export], and then select [Copy to Clipboard]. And then a small window of the plot will pop up. In the pop up window (you may adjust the size of the plot at this point), select the option [Copy as Metafile] and then click [Copy Plot] and then paste it directly into your Word document.
Many students do STAT220 homework by typing answers in a Word document, copy-and-pasting R outputs (including plots, numerical outputs, and possiblly some R codes) to where they are needed from R Studio, and then submitting the doc file. This way is simple. One downside of this method is that it can be bothersome to copy and paste back forth between RStudio and Word document when the R work is long. Another downside is that the copied numerical R outputs look ugly in Word document since they don’t line up so well as they were in RStudio.
A more advanced way (and still simple) is to write the text answers and R codes in a R markdown file, and then RStudio can convert the R markdown file to a Word documents, with the R outputs and R plots inserted in the text. For R markdown tutorial, visit the page http://shiny.rstudio.com/articles/rmarkdown.html or ask CAs for a demonstration.
This seems like a fair bit for your first lab, so let’s stop here. To exit RStudio you can click the x in the upper right corner of the whole window.
You will be prompted to save your workspace. If you click save, RStudio will save the history of your commands and all the objects in your workspace so that the next time you launch RStudio, you will see arbuthnot
and you will have access to the commands you typed in your previous session. For now, click save, then start up RStudio again.
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. The “present” data set can be downloaded at https://jingshuw.org/materials/data/present.dat
What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?
Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.
In what year did we see the most total number of births in the U.S.? You can refer to the help files or the R reference card http://cran.r-project.org/doc/contrib/Short-refcard.pdf to find helpful commands.