R tutorial

In R we generally work with the following common packages:

• dplyr, used to manipulate data
• ggplot2, used to plot data
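
Both packages have to be installed once and then loaded in every session. A minimal sketch, assuming the packages are installed from CRAN (the install line is commented out so that it is not re-run by accident):

```r
# Install the packages once (uncomment if they are not installed yet)
# install.packages(c("dplyr", "ggplot2"))

# Load them at the start of every session
library(dplyr)
library(ggplot2)
```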

Dealing with Data

Once we have loaded a dataset, we can explore it by executing:

```r
# if our dataset is called 'ds':
dim(ds)
# returns the number of rows and columns of ds
```

We can get the column names by executing:

```r
names(ds)
```

To take a quick look at the structure of our data, we can use the `str` function:

```r
# str stands for "structure"
str(ds)
```

To select a single column we can do:

```r
ds$colname
```

To see the range of a column (here the `year` column of a dataset called `present`):

```r
range(present$year)
```

We can inspect the first and last rows of a dataset with:

```r
head(ds)
tail(ds, 18) # shows the last 18 rows
```
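
To try these commands without loading a file first, here is a small sketch that uses `mtcars`, a data frame that ships with base R. Using `mtcars` as `ds` is an assumption made only for illustration; any data frame works the same way.

```r
# mtcars is a built-in data frame (32 cars, 11 variables)
ds <- mtcars

dim(ds)          # number of rows and columns
names(ds)        # column names
str(ds)          # type and first values of every column
range(ds$mpg)    # smallest and largest value of the mpg column
head(ds, 3)      # first 3 rows
tail(ds, 3)      # last 3 rows
```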

Summary Statistics

Some useful functions for computing summary statistics of a single numerical variable are:

• summary
• mean
• median
• sd
• var
• IQR
• range
• min
• max
• n, which is the number of rows in the current group (from dplyr, used inside summarise(); for a plain vector use length)
• n_distinct, which is the number of distinct values of a vector (from dplyr)
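
As a quick illustration, the functions above can be applied to a single numeric column. This sketch assumes the built-in `mtcars` dataset; the vector name `x` is arbitrary.

```r
library(dplyr)  # only needed for n_distinct()

x <- mtcars$mpg  # a numeric vector from a built-in dataset

summary(x)       # min, quartiles, median, mean and max in one call
mean(x)
median(x)
sd(x)            # standard deviation
var(x)           # variance, equal to sd(x)^2
IQR(x)           # interquartile range
length(x)        # number of values in the vector
n_distinct(mtcars$cyl)  # number of distinct values

# With missing values, most of these functions return NA
# unless na.rm = TRUE is passed
mean(c(1, 2, NA), na.rm = TRUE)
```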

Data Wrangling with dplyr

The dplyr package offers seven main verbs (functions) for basic data manipulation:

• filter()
• select()
• arrange()
• distinct()
• mutate()
• summarise()
• sample_n()

Besides the flexibility of these functions, dplyr re-exports the pipe operator `%>%` from another package called magrittr. This operator lets us chain dplyr operations, so that a multi-step transformation reads as a sequence of short steps.
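
To make the effect of the pipe concrete, the sketch below writes the same two steps once as nested calls and once as a pipeline. It assumes the built-in `mtcars` dataset; the names `nested` and `piped` are illustrative only.

```r
library(dplyr)

# Nested calls have to be read from the inside out
nested <- head(arrange(mtcars, mpg), 3)

# The same steps with the pipe read from top to bottom:
# x %>% f(y) is equivalent to f(x, y)
piped <- mtcars %>%
  arrange(mpg) %>%
  head(3)

# Both versions produce the same data frame
identical(nested, piped)
```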

Basics: Selecting Columns and Rows

The two most basic dplyr verbs are `filter`, which selects rows, and `select`, which selects columns.

Selecting Columns
```r
sleepData <- select(msleep, name, sleep_total)
# this keeps only the columns name and sleep_total
# of the data frame msleep
```

We can also exclude columns by doing:

```r
select(msleep, -name) # we are excluding the column name
```

To select a range of columns by name, use the `:` (colon) operator:

```r
head(select(msleep, col1name:col4name))
```

We can also select columns using other criteria:

• starts_with() = Select all columns that start with a character string
• ends_with() = Select columns that end with a character string
• contains() = Select columns that contain a character string
• matches() = Select columns that match a regular expression
• one_of() = Select columns whose names are in a given group of names (superseded by any_of() and all_of() in recent dplyr versions)

For example:

```r
select(ds, starts_with("col_"))
```
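
The helpers listed above can be tried on `iris`, a dataset that ships with base R. This is a sketch under the assumption that dplyr 1.0 or later is installed, which is needed for `any_of()`; the vector `wanted` is a hypothetical example.

```r
library(dplyr)

# iris has the columns Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width and Species
iris %>% select(starts_with("Sepal")) %>% head(3)
iris %>% select(ends_with("Width")) %>% head(3)
iris %>% select(contains("etal")) %>% head(3)
iris %>% select(matches("^Petal\\.")) %>% head(3)

# any_of() takes a character vector of names and ignores names that are absent
wanted <- c("Species", "Sepal.Length", "not_a_column")
iris %>% select(any_of(wanted)) %>% head(3)
```
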
Selecting Rows
```r
filter(ds, colname1 >= 26)
```

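We can also build more complex filters, for example:

```r
# conditions separated by commas must all hold
filter(ds, colname1 >= 16, colname2 >= 1)
# %in% keeps the rows whose value is one of those listed
filter(ds, country %in% c("Italy", "Romania"))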
```
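
A runnable version of these filters, sketched on the built-in `mtcars` dataset (the thresholds are arbitrary and chosen only for illustration):

```r
library(dplyr)

# Conditions separated by commas are combined with AND
small_efficient <- mtcars %>% filter(cyl == 4, mpg >= 30)
small_efficient

# Use | for OR, and %in% to match any value from a set
mtcars %>% filter(cyl == 4 | gear == 5)
mtcars %>% filter(carb %in% c(1, 2))
```

Note that `filter` drops rows for which the condition evaluates to `NA`.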

Let's do an example, using the pipe operator:

```r
msleep %>%
  select(name, sleep_total) %>%
  head()
```

Arrange

This dplyr verb is used to order or reorder rows, for example:

```r
ds %>% arrange(col1) %>% head() # we are ordering the rows by the column called col1
```

Let's see a more complex example: we first select some of the columns, then arrange the rows by `col1` in ascending order and by `col2` in descending order, and finally keep only the rows that satisfy a condition on `col3`.

```r
ds %>%
  select(col1, col2, col3) %>%
  arrange(col1, desc(col2)) %>%
  filter(col3 >= 16)
```
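
The same pipeline can be run on real data. In this sketch the built-in `mtcars` dataset stands in for `ds`, with `cyl`, `mpg` and `hp` playing the roles of `col1`, `col2` and `col3`; the threshold of 100 is arbitrary.

```r
library(dplyr)

sorted <- mtcars %>%
  select(cyl, mpg, hp) %>%
  arrange(cyl, desc(mpg)) %>%   # cyl ascending, mpg descending within each cyl
  filter(hp >= 100)             # keep only rows with at least 100 horsepower

head(sorted)
```

Placing `filter` before `arrange` gives the same result and usually does less work, since fewer rows need to be sorted.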

Mutate

`mutate` adds new columns computed from existing ones:

```r
ds %>%
  mutate(col5 = col2 / col1)
```

We can also create several columns at once:

```r
ds %>%
  mutate(col5 = col2 / col1,
         col6 = col2 + col1)
```

In this example we create a new column that contains the string "on time" if `dep_delay` is less than 5 and "delayed" otherwise:

```r
ds %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
```
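
One detail that is easy to miss is that `mutate` returns a new data frame and leaves the original untouched. The sketch below shows this on the built-in `mtcars` dataset; the column names `kpl` and `efficiency`, the data frame name `cars_extra`, and the threshold of 20 are assumptions made for illustration.

```r
library(dplyr)

# Assign the result to keep the new columns
cars_extra <- mtcars %>%
  mutate(kpl = mpg * 0.425,                               # miles/gallon to km/litre (approximate)
         efficiency = ifelse(mpg >= 20, "high", "low"))   # label based on a threshold

head(cars_extra[, c("mpg", "kpl", "efficiency")])

# The original data frame does not contain the new column
"kpl" %in% names(mtcars)
```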

Summarise

`summarise` collapses a data frame into a single row of summary values computed from specific columns.

Let's see some examples:

```r
msleep %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
```
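
The example above works because `sleep_total` has no missing values. Other columns of `msleep`, such as `sleep_rem`, do contain `NA`, and then the summary functions need `na.rm = TRUE`. A small sketch (the name `rem_summary` and the output column names are illustrative):

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

rem_summary <- msleep %>%
  summarise(avg_rem_raw = mean(sleep_rem),                 # NA, because sleep_rem has missing values
            avg_rem     = mean(sleep_rem, na.rm = TRUE),   # missing values are ignored
            n_missing   = sum(is.na(sleep_rem)),           # how many values are missing
            total       = n())                             # number of rows

rem_summary
```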

Groupby

The group_by() verb implements the concept of "split-apply-combine". It is generally used when we want to split the data by some variable, apply a function to each of the resulting pieces, and then combine the outputs.

In this example we split by the `order` column and then compute some statistics for each group.

```r
msleep %>%
  group_by(order) %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
```
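
The three stages of split-apply-combine can be marked explicitly in a pipeline. This sketch assumes the built-in `mtcars` dataset and groups by the number of cylinders; the name `by_cyl` is illustrative.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                   # split: one group per distinct cyl value
  summarise(avg_mpg = mean(mpg),      # apply: computed separately for each group
            n_cars  = n()) %>%        # combine: one output row per group
  arrange(desc(avg_mpg))

by_cyl
```

The result has one row per group, and the group sizes in `n_cars` add up to the number of rows in the original data.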

Data Plotting with ggplot2

```r
# Plot a histogram
ggplot(housing, aes(x = col1)) +
  geom_histogram()
```
```r
# Plot a scatter plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_point()
```
```r
# Plot a line plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line()
```
```r
# Plot a line plot with points as well
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line() +
  geom_point()
```
```r
# Plot a scatter plot of only specific rows, colored by State (this adds a legend)
ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x = col1,
           y = col2,
           color = State)) +
  geom_point()
```
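
The snippets above rely on datasets (`housing`, `ds`) that have to be loaded separately. For a plot that runs as is, here is a sketch on the built-in `mtcars` dataset; the axis labels and the object name `p` are choices made for this example.

```r
library(ggplot2)

# Map columns to aesthetics, then add layers with +
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)   # draws the plot; at the console, typing p has the same effect
```

Wrapping `cyl` in `factor()` makes ggplot2 treat it as a category, so each value gets its own color in the legend instead of a continuous color scale.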

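To keep the columns created with `mutate`, assign the result back to the dataset, here a dataset called `arbuthnot`:

```r
arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)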
```

Playing with built-in datasets

We can explore the example datasets that ship with R and its packages with the command:

```r
data()
```
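
A few related calls that may be useful, sketched under the assumption that ggplot2 is installed (`mtcars` is used only as an example of a built-in dataset):

```r
data(package = "ggplot2")    # lists only the datasets that come with ggplot2

data(mtcars)                 # loads a copy of mtcars into the workspace
head(mtcars)                 # first rows of the dataset
?mtcars                      # opens the help page that describes the dataset
```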