R Tutorial

In R we generally work with the following common packages:

* dplyr, used to manipulate data
* ggplot2, used to plot data
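
A minimal sketch of loading these packages at the start of a session. It assumes both are already installed; the commented install.packages() line only needs to be run once.

```r
# Install once if needed (assumes access to a CRAN mirror):
# install.packages(c("dplyr", "ggplot2"))

# Attach the packages so their functions can be called without a prefix
library(dplyr)
library(ggplot2)
```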

Dealing with Data

Once we have loaded a dataset, we can explore it by executing:

# if our dataset is called 'ds', we can do:
dim(ds)
# this returns the number of rows and columns of dataset ds

We can get the column names by executing:

names(ds)

To take a quick look at the structure of our data, we can use the str function:

# str stands for "structure"
str(ds)

To select a single column we can do:

ds$colname

To find the range (minimum and maximum) of a column:

range(present$year)

We can inspect head and tail of a dataset with:

head(ds)
tail(ds, 18) # shows the last 18 rows
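
The commands above can be tried on a dataset that ships with R. This is a small sketch that assumes the built-in mtcars data frame stands in for 'ds'; any other data frame would work the same way.

```r
# mtcars ships with R (datasets package), so nothing needs to be installed
ds <- mtcars

dim(ds)         # 32 rows, 11 columns
names(ds)       # column names: "mpg", "cyl", "disp", ...
str(ds)         # type and first values of every column
ds$mpg          # a single column, returned as a vector
range(ds$mpg)   # smallest and largest value of the mpg column
head(ds, 3)     # first 3 rows
tail(ds, 3)     # last 3 rows
```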

Summary Statistics

Some useful functions for computing summary statistics of a single numerical variable are:

  • summary
  • mean
  • median
  • sd
  • var
  • IQR
  • range
  • min
  • max
  • n(), from dplyr, which counts the rows in the current group (only usable inside verbs such as summarise())
  • n_distinct(), from dplyr, which is the number of distinct values of a vector
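
As an illustration, here is a sketch that applies these functions to a small vector. The values are hypothetical and chosen so the results are easy to check by hand. Since n() only works inside dplyr verbs such as summarise(), length() stands in for it here.

```r
library(dplyr)  # only needed for n_distinct()

# Hypothetical values, made up for illustration
x <- c(1, 2, 2, 4, 6)

summary(x)      # minimum, quartiles, median, mean and maximum in one call
mean(x)         # 3
median(x)       # 2
var(x)          # 4
sd(x)           # 2
IQR(x)          # 2
range(x)        # 1 6
min(x)          # 1
max(x)          # 6
length(x)       # 5
n_distinct(x)   # 4
```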

Data Wrangling with dplyr

The dplyr package offers seven main verbs (functions) for basic data manipulation:

  • filter()
  • select()
  • arrange()
  • distinct()
  • mutate()
  • summarise()
  • sample_n()

Besides the flexibility of these functions, dplyr re-exports the pipe operator %>% from another package, magrittr. This operator passes the result of one call as the first argument of the next, which lets us chain dplyr operations and write shorter, more readable code when transforming data.
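
To make the effect of the pipe concrete, the sketch below writes the same operation twice, once with nested calls and once with %>%. It assumes the built-in mtcars dataset purely for illustration.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(select(mtcars, mpg, cyl), 3)

# Piped calls: read from top to bottom
piped <- mtcars %>%
    select(mpg, cyl) %>%
    head(3)

identical(nested, piped)  # TRUE
```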

Basics: Selecting Columns and Rows

The two most basic dplyr verbs are select(), which picks columns, and filter(), which picks rows.

Selecting Columns
# select only the columns name and sleep_total from the data frame msleep
sleepData <- select(msleep, name, sleep_total)

We can also exclude columns by doing:

select(msleep, -name) # we are excluding the column name

To select a range of columns by name, use the ":" (colon) operator:

head(select(msleep, col1name:col4name))

We can also select columns using other criteria, such as:

  • starts_with() = Select all columns that start with a character string
  • ends_with() = Select columns that end with a character string
  • contains() = Select columns that contain a character string
  • matches() = Select columns that match a regular expression
  • one_of() = Select columns whose names are in a given set of names (superseded by any_of() and all_of() in recent versions)

For example:

select(ds, starts_with("col_"))
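
The helpers can be tried on the built-in iris dataset, whose columns are Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. This is only a sketch; the patterns are arbitrary choices for illustration.

```r
library(dplyr)

names(select(iris, starts_with("Sepal")))        # "Sepal.Length" "Sepal.Width"
names(select(iris, ends_with("Width")))          # "Sepal.Width"  "Petal.Width"
names(select(iris, contains("etal")))            # "Petal.Length" "Petal.Width"
names(select(iris, matches("^P.*h$")))           # "Petal.Length" "Petal.Width"
names(select(iris, Sepal.Length:Petal.Length))   # the three columns in that range
names(select(iris, -Species))                    # every column except Species
```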

Selecting Rows
filter(ds, colname1 >= 26)

We can also build more complex filters e.g.:

filter(ds, colname1 >= 16, colname2 >= 1)
filter(ds, country %in% c("Italy", "Romania"))
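
A small runnable sketch of these filters follows. The 'people' data frame and its values are hypothetical, made up for illustration. Conditions separated by commas are combined with AND; the | operator gives OR.

```r
library(dplyr)

# Hypothetical data, made up for illustration
people <- data.frame(
    name    = c("Ana", "Bo", "Cy", "Di"),
    age     = c(15, 26, 31, 18),
    country = c("Italy", "Romania", "France", "Italy")
)

# Comma-separated conditions must all hold (AND)
adults <- filter(people, age >= 16, country %in% c("Italy", "Romania"))
adults$name   # "Bo" "Di"

# Use | when either condition may hold (OR)
either <- filter(people, age < 16 | country == "France")
either$name   # "Ana" "Cy"
```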

Here is an example using the pipe operator:

msleep %>% 
    select(name, sleep_total) %>% 
    head

Arrange

The arrange() verb orders the rows of a dataset by one or more columns, for example:

ds %>% arrange(col1) %>% head # we are ordering data by the column called col1

Let's see a more complex example, where we first select some of the columns, then arrange the rows by a column called col1 and, within that, in descending order by a column called col2, and then filter rows by imposing a condition on column col3.

ds %>% 
    select(col1, col2, col3) %>%
    arrange(col1, desc(col2)) %>% 
    filter(col3 >= 16)
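
The same pipeline can be run end to end on a tiny data frame. The values below are hypothetical, chosen so that the effect of each step is visible in the result.

```r
library(dplyr)

# Hypothetical data, made up for illustration
ds <- data.frame(
    col1 = c("b", "a", "a", "b"),
    col2 = c(1, 5, 9, 7),
    col3 = c(20, 10, 30, 16)
)

result <- ds %>%
    select(col1, col2, col3) %>%
    arrange(col1, desc(col2)) %>%   # col1 ascending, ties broken by col2 descending
    filter(col3 >= 16)              # drops the row where col3 is 10

result
#   col1 col2 col3
# 1    a    9   30
# 2    b    7   16
# 3    b    1   20
```

Filtering before arranging would give the same rows in the same order and sorts fewer rows, which can matter on large datasets.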

Mutate

The mutate() verb adds new columns computed from existing ones:

ds %>% 
    mutate(col5 = col2 / col1) %>%
    head

We can also create several columns at once:

ds %>% 
    mutate(col5 = col2 / col1,
           col6 = col2 + col1) %>%
    head  

In this example we create a new column that contains the string "on time" if dep_delay is less than 5, and the string "delayed" otherwise:

ds %>% 
    mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
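
A runnable sketch of this conditional column is shown here, using a hypothetical 'flights' data frame with made-up delays in minutes. Note that mutate() returns a new data frame, so the result has to be assigned back to keep the new column.

```r
library(dplyr)

# Hypothetical departure delays in minutes, made up for illustration
flights <- data.frame(dep_delay = c(-3, 0, 4, 5, 42))

# Assign the result back, otherwise the new column is only printed
flights <- flights %>%
    mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

flights$dep_type   # "on time" "on time" "on time" "delayed" "delayed"
```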

Summarise

summarise() collapses a dataset into a single row of summary values computed from specific columns.

Let's see some examples:

msleep %>% 
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total),
              max_sleep = max(sleep_total),
              total = n())

Groupby

The group_by() verb implements the concept of "split-apply-combine". It is generally used when we want to split the data by some variable, apply a function to each of the resulting groups, and then combine the output.

In this example we split by order and then compute some statistics.

msleep %>% 
    group_by(order) %>%
    summarise(avg_sleep = mean(sleep_total), 
              min_sleep = min(sleep_total), 
              max_sleep = max(sleep_total),
              total = n())
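
To see split-apply-combine on data small enough to check by hand, here is a sketch with a hypothetical 'sleep' data frame. The column names mirror msleep, but the values are made up.

```r
library(dplyr)

# Hypothetical sleep data, made up for illustration
sleep <- data.frame(
    order       = c("Rodentia", "Rodentia", "Carnivora", "Carnivora", "Carnivora"),
    sleep_total = c(14, 16, 10, 12, 14)
)

by_order <- sleep %>%
    group_by(order) %>%                        # split: one group per order
    summarise(avg_sleep = mean(sleep_total),   # apply: statistics per group
              total     = n())                 # combine: one row per group

by_order
# Carnivora: avg_sleep 12, total 3
# Rodentia:  avg_sleep 15, total 2
```

If a column contains missing values, mean() returns NA unless na.rm = TRUE is passed, for example mean(sleep_total, na.rm = TRUE).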

Data Plotting with ggplot2

# Plot a histogram
ggplot(housing, aes(x = col1)) +
  geom_histogram()
# Plot a scatter plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_point()
# Plot a line plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line()
# Plot a line plot with points as well
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line() + geom_point()
# Plot a scatter plot of a subset of the data, colored by State (this adds a legend)
ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x=col1,
           y=col2,
           color=State))+
  geom_point()
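
The snippets above use placeholder datasets and column names. As a sketch that runs as is, the example below uses the built-in mtcars dataset instead; the axis labels and the choice of 10 bins are assumptions made for illustration.

```r
library(ggplot2)

# Histogram of one numeric column; bins is set explicitly
# to avoid the message about the default number of bins
p_hist <- ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 10)

# Scatter plot colored by a categorical variable, which adds a legend
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
    geom_point() +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

# Printing a plot object draws it on the active graphics device
print(p_scatter)
```

Storing a plot in a variable makes it possible to add further layers later, for example p_scatter + geom_line().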

Adding Columns

arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)

Playing with Built-in Datasets

We can list the available example datasets with the command:

data()
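
As a sketch of how the listing can be used programmatically, the example below reads the dataset names from the object that data() returns. It relies only on the 'datasets' package that ships with R.

```r
# Names of the datasets bundled with the 'datasets' package
items <- data(package = "datasets")$results[, "Item"]
head(items)

# Built-in datasets can be used directly by name
"mtcars" %in% items   # TRUE
head(mtcars, 3)

# Datasets of another installed package, here ggplot2 (home of msleep):
# data(package = "ggplot2")
```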