R Tutorial
In R we generally work with the following common packages:
* dplyr, used to manipulate data
* ggplot2, used to plot data
Dealing with Data
Once we have loaded a dataset, we can explore it by executing:
# if our dataset is called 'ds', we can do:
dim(ds) # this will tell us the number of rows and columns of dataset ds
We can get the column names by executing:
names(ds)
To give a brief peek into our data we can use the str function:
# str stands for "structure"
str(ds)
To select a single column we can do:
ds$colname
To understand the range of a column (here, the year column of a dataset called present):
range(present$year)
We can inspect head and tail of a dataset with:
head(ds)
tail(ds, 18) # shows the last 18 rows
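The exploration commands above can be tried end to end on a dataset that ships with R. This is a minimal sketch that assumes the built-in mtcars data frame as a stand-in for ds; the column mpg is specific to that dataset.

```r
# mtcars comes with base R (datasets package), so no extra packages are needed
ds <- mtcars

dim(ds)        # 32 rows and 11 columns
names(ds)      # column names
str(ds)        # compact overview: type and first values of each column
ds$mpg         # a single column, returned as a vector
range(ds$mpg)  # smallest and largest value of that column
head(ds, 3)    # first 3 rows
tail(ds, 3)    # last 3 rows
```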
Summary Statistics
Some useful functions for computing summary statistics of a single numerical variable are:
- summary
- mean
- median
- sd
- var
- IQR
- range
- min
- max
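- n, which is the number of rows in the current group (a dplyr function, used inside verbs such as summarise; for a plain vector, use length)
- n_distinct, which is the number of distinct values of a vector (a dplyr function)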
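As a rough illustration, the functions listed above can be applied to a small vector. The values in x are hypothetical and were chosen so that the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical sample values
x <- c(2, 4, 4, 5, 7, 9, 11)

summary(x)   # minimum, quartiles, median, mean and maximum in one call
mean(x)      # 6
median(x)    # 5
var(x)       # 10
sd(x)        # square root of the variance
IQR(x)       # 4 (third quartile 8 minus first quartile 4)
range(x)     # 2 11
length(x)    # 7

# n() only works inside dplyr verbs such as summarise()
counts <- data.frame(x = x) %>%
  summarise(total = n(), distinct = n_distinct(x))
counts       # total = 7, distinct = 6
```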
Data Wrangling with dplyr
The dplyr package offers seven main verbs (functions) for basic data manipulation:
- filter()
- select()
- arrange()
- distinct()
- mutate()
- summarise()
- sample_n()
Besides the flexibility of these functions, dplyr inherits the pipe operator %>%
from another package, called magrittr. This operator lets us chain dplyr
operations, so we can write shorter snippets of code when transforming data.
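To make the benefit of the pipe concrete, here is a small sketch that writes the same transformation twice, once with nested calls and once with %>%. The data frame ds and its columns col1 and col2 are made up for the example.

```r
library(dplyr)

# Hypothetical data
ds <- data.frame(col1 = c(3, 1, 2), col2 = c(10, 20, 30))

# Nested calls: must be read from the inside out
nested <- head(arrange(filter(ds, col2 >= 20), col1), 1)

# Same steps with the pipe: read from top to bottom
piped <- ds %>%
  filter(col2 >= 20) %>%
  arrange(col1) %>%
  head(1)

identical(nested, piped)  # TRUE: x %>% f(y) is just another way to write f(x, y)
```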
Basics: Selecting Columns and Rows
We can select rows and columns in R by using the most basic dplyr commands,
which are filter (for rows) and select (for columns).
Selecting Columns
sleepData <- select(msleep, name, sleep_total)
# this selects from the dataframe msleep only
# the columns name and sleep_total
We can also exclude columns by doing:
select(msleep, -name) # we are excluding the column name
To select a range of columns by name, use the ":" (colon) operator:
head(select(msleep, col1name:col4name))
We can also select columns by other criteria, using helper functions such as:
- starts_with() = Select all columns that start with a character string
- ends_with() = Select columns that end with a character string
- contains() = Select columns that contain a character string
- matches() = Select columns that match a regular expression
- one_of() = Select columns whose names are in a group of names (superseded by any_of() and all_of() in recent dplyr versions)
For example:
select(ds, starts_with("col_"))
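The selection helpers are easier to compare side by side on a small data frame. In this sketch the data frame ds and all of its column names are invented for the illustration.

```r
library(dplyr)

# Hypothetical data frame with a mix of column names
ds <- data.frame(
  id         = 1:3,
  col_a      = c(1.5, 2.5, 3.5),
  col_b      = c("x", "y", "z"),
  total_2020 = c(10, 20, 30),
  total_2021 = c(11, 21, 31)
)

names(select(ds, starts_with("col_")))           # "col_a" "col_b"
names(select(ds, ends_with("2021")))             # "total_2021"
names(select(ds, contains("total")))             # "total_2020" "total_2021"
names(select(ds, matches("^total_[0-9]{4}$")))   # "total_2020" "total_2021"
names(select(ds, id:col_b))                      # "id" "col_a" "col_b"
names(select(ds, -id))                           # every column except id
```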
Selecting Rows
filter(ds, colname1 >= 26)
We can also build more complex filters e.g.:
filter(ds, colname1 >= 16, colname2 >= 1)
filter(ds, country %in% c("Italy", "Romania"))
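A short sketch of how these filters behave on concrete rows. The data frame below is hypothetical; the point to notice is that conditions separated by commas are combined with AND, while | expresses OR.

```r
library(dplyr)

# Hypothetical data
ds <- data.frame(
  country  = c("Italy", "Romania", "France", "Italy"),
  colname1 = c(30, 12, 26, 16),
  colname2 = c(0, 3, 2, 1)
)

one_cond <- filter(ds, colname1 >= 26)                  # 2 rows: 30 and 26
both     <- filter(ds, colname1 >= 16, colname2 >= 1)   # AND: 2 rows (France, second Italy)
either   <- filter(ds, colname1 >= 26 | colname2 >= 3)  # OR: 3 rows
members  <- filter(ds, country %in% c("Italy", "Romania"))  # 3 rows
```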
Let's do an example, using the pipe operator:
msleep %>% select(name, sleep_total) %>% head
Arrange
This dplyr verb is used to order or reorder the rows of a dataset, for example:
ds %>% arrange(col1) %>% head # we are ordering data by the column called col1
Let's see a more complex example, where we first select some of the columns,
then arrange first by a column called col1 and then, in descending order, by a
column called col2, and finally filter rows by imposing a condition on column col3.
ds %>%
  select(col1, col2, col3) %>%
  arrange(col1, desc(col2)) %>%
  filter(col3 >= 16)
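To see what this pipeline does row by row, here is a sketch on a tiny, made-up data frame (the values and the extra column col4 are assumptions for the example).

```r
library(dplyr)

# Hypothetical data
ds <- data.frame(
  col1 = c(2, 1, 1, 2),
  col2 = c(5, 7, 9, 8),
  col3 = c(20, 10, 16, 30),
  col4 = c("a", "b", "c", "d")
)

result <- ds %>%
  select(col1, col2, col3) %>%     # drop col4
  arrange(col1, desc(col2)) %>%    # col1 ascending, ties broken by col2 descending
  filter(col3 >= 16)               # removes the row with col3 = 10

result$col1  # 1 2 2
result$col2  # 9 8 5
```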
Mutate
Mutate is used to create new columns, typically computed from existing ones:
ds %>% mutate(col5 = col2 / col1) %>% head
We can also create several columns at once:
ds %>% mutate(col5 = col2 / col1, col6 = col2 + col1) %>% head
In this example we create a new column that contains the string "on time" if dep_delay is less than 5, and the string "delayed" otherwise:
ds %>% mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
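The boundary case is worth checking: a delay of exactly 5 is not less than 5, so it counts as "delayed". A minimal sketch with invented flight data:

```r
library(dplyr)

# Hypothetical flights; dep_delay is in minutes, negative means an early departure
ds <- data.frame(
  flight    = c("A1", "B2", "C3"),
  dep_delay = c(-2, 5, 40)
)

out <- ds %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

out$dep_type  # "on time" "delayed" "delayed"
```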
Summarise
Summarise is used to create summaries on specific columns.
Let's see some examples:
msleep %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
Groupby
The group_by() verb implements the concept of "split-apply-combine". This is generally used when we want to split the data by some variable, apply a function to each of the resulting data frames, and then combine the outputs together.
In this example we split by order and then compute some statistics.
msleep %>%
  group_by(order) %>%
  summarise(avg_sleep = mean(sleep_total),
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total = n())
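The split-apply-combine idea can be verified by hand on a very small dataset. The sketch below uses made-up values in place of msleep, with only two groups, so each summary row is easy to recompute mentally.

```r
library(dplyr)

# Hypothetical data: two groups, "A" with 2 rows and "B" with 3 rows
ds <- data.frame(
  order       = c("A", "A", "B", "B", "B"),
  sleep_total = c(10, 14, 3, 5, 7)
)

out <- ds %>%
  group_by(order) %>%                          # split: one group per value of order
  summarise(avg_sleep = mean(sleep_total),     # apply: one summary row per group
            min_sleep = min(sleep_total),
            max_sleep = max(sleep_total),
            total     = n())                   # combine: results stacked in one table

out
# order A: avg_sleep 12, min_sleep 10, max_sleep 14, total 2
# order B: avg_sleep  5, min_sleep  3, max_sleep  7, total 3
```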
Data Plotting with ggplot2
# Plot a histogram
ggplot(housing, aes(x = col1)) +
  geom_histogram()

# Plot a scatter plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_point()

# Plot a line plot
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line()

# Plot a line plot with also points
ggplot(data = ds, aes(x = col1, y = col2)) +
  geom_line() +
  geom_point()

# Plot a scatter plot of a subset of the data, colored by State
# (mapping color to a variable adds a legend automatically)
ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x = col1, y = col2, color = State)) +
  geom_point()
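Since housing and ds are placeholders, here is a runnable sketch of the same pattern on the built-in mtcars dataset. The choice of columns (wt, mpg, cyl) and the axis labels are assumptions made for the example.

```r
library(ggplot2)

# Scatter plot of a subset of mtcars, colored by number of cylinders.
# factor(cyl) makes the color scale discrete, which yields one legend entry per value.
p <- ggplot(subset(mtcars, cyl %in% c(4, 8)),
            aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)  # a plot is only drawn when it is printed
```

Storing the plot in a variable is optional; at the console, typing the ggplot(...) expression directly prints it.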
Adding Columns
arbuthnot <- arbuthnot %>% mutate(total = boys + girls)
arbuthnot <- arbuthnot %>% mutate(more_boys = boys > girls)
Playing with integrated datasets
We can explore the example datasets that come with R and its loaded packages with the command:
data()
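The same function can also list or load the datasets of one specific package. The sketch below assumes ggplot2 is installed, since that is where the msleep dataset used earlier comes from.

```r
# List the names of the datasets bundled with ggplot2
available <- data(package = "ggplot2")$results[, "Item"]
"msleep" %in% available   # TRUE

# Load one of them into the workspace without attaching the whole package
data("msleep", package = "ggplot2")
dim(msleep)               # 83 rows and 11 columns
```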