In pandas we have two data structures: Series (which can be thought of as an array) and DataFrame (which can be thought of as a matrix).

A DataFrame can also be viewed as an array of Series. DataFrames and Series are flexible: they can carry labels for fields, indexes, and many other useful features that would be complicated to implement on plain arrays/matrices.
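As a quick sketch (the city/temperature data here is just illustrative), both structures can be built directly from Python containers:

```python
import pandas as pd

# A Series is a labeled one-dimensional array
temperatures = pd.Series([21, 14, 35], index=["new york", "chicago", "orlando"])

# A DataFrame is a labeled two-dimensional table; each column is a Series
df = pd.DataFrame({"city": ["new york", "chicago", "orlando"],
                   "temperature": [21, 14, 35]})
```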

## Creating a DataFrame

```python
df1 = pd.DataFrame({
    "city": ["new york", "chicago", "orlando"],
    "temperature": [21, 14, 35],
})
```

## Describing a Dataset

```python
ds.describe(include="all")
ds.memory_usage(deep=True)
```


We can view and inspect the types of a DataFrame's columns with:

```python
ds.dtypes
```
In order to change a column to a categorical type we can do:

```python
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
ds['column_name'] = ds['column_name'].astype(cat_type)
```

Note that passing categories directly to astype('category', categories=...) was removed in modern pandas; an explicit CategoricalDtype is used instead.

## Basic Pandas Statistics

```python
ds.field.quantile([0.1, 0.15, 0.9])
```

## Manipulating CSV Files

### Reading a CSV File

```python
ds = pd.read_csv(filename, sep=None, engine='python',
                 parse_dates=['fcast_date', 'timestamp'],
                 dtype={'user_id': 'category', 'stringa': 'object'})
```

### Writing to a CSV File

```python
ds.to_csv(filename, index=False)
```

## Selecting Data

### Selecting with Labels

In order to select by labels we use the loc accessor:

```python
ds.loc[0:4, ['column1', 'column2']]
```

This can also be viewed as a way to drop columns, keeping only those we are interested in:

```python
ds.loc[:, ['column1', 'column2']]
ds.loc[0:4, 'column1':'column2']
```

Note that label-based slices such as 'column1':'column2' are inclusive of both endpoints.
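A minimal runnable sketch of loc-based selection (with made-up columns):

```python
import pandas as pd

ds = pd.DataFrame({"column1": [1, 2, 3, 4, 5],
                   "column2": list("abcde"),
                   "column3": [10, 20, 30, 40, 50]})

# loc slices are label-based and inclusive on both ends:
# rows labeled 0 through 2 give three rows, not two
subset = ds.loc[0:2, ["column1", "column2"]]
```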

### Selecting and Changing a Specific Value

If we want to modify the value in column 'b' for the row labeled 1 we can do:

```python
df.loc[1, 'b'] = 'XXXXXX'
```

### Selecting with Numbers
We can use iloc to select data by integer position; for example, to take all rows and the first four columns:

```python
ds.iloc[:, 0:4]
```


We can also filter rows with boolean conditions, combining them with & (and) and | (or):

```python
ds[(ds.column1 >= 200) & (ds.column2 == 'Drama')]
```
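For instance, with hypothetical data in those columns:

```python
import pandas as pd

ds = pd.DataFrame({"column1": [150, 250, 300],
                   "column2": ["Drama", "Drama", "Comedy"]})

# keep only the rows satisfying both conditions
filtered = ds[(ds.column1 >= 200) & (ds.column2 == "Drama")]
```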

## Pandas Conditionals

We can conditionally assign values with loc:

```python
df.loc[df.AAA >= 5, ['BBB', 'CCC']] = 555
```

An if/else-style column can be built with np.where:

```python
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
```
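A self-contained sketch of the np.where pattern (toy values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AAA": [4, 5, 6, 7]})

# label each row 'high' or 'low' depending on the condition
df["logic"] = np.where(df["AAA"] > 5, "high", "low")
```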

## Column Operations

### Remove Columns

```python
ds.drop(columns=['column1', 'column2'], inplace=True)
```

### Remove Rows on a Condition

To keep only the rows where a condition holds (here, dropping the rows where n_opts equals 5):

```python
c = c[c.n_opts != 5]
```

### Rename Columns

```python
ds.rename(columns={'fcast_date_a': 'date_fcast'}, inplace=True)
```

### Create New Columns

```python
ds["days_from_start"] = ds["fcast_date_a"] - ds["date_start"]
```

### Create New Columns with Apply

```python
import numpy as np
from scipy.spatial import distance

def compute_euclidean_distance(row):
    a = np.array([row['value_a'], row['value_b'], row['value_c']])
    b = np.array([row['a'], row['b'], row['c']])
    return distance.euclidean(a, b)

ds['new_distance'] = ds.apply(compute_euclidean_distance, axis=1)
```
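A runnable sketch of the same row-wise pattern; to avoid the scipy dependency it uses np.linalg.norm, which computes the same Euclidean distance (the column names and values are made up):

```python
import numpy as np
import pandas as pd

ds = pd.DataFrame({"value_a": [1.0], "value_b": [2.0], "value_c": [2.0],
                   "a": [1.0], "b": [0.0], "c": [0.0]})

def compute_euclidean_distance(row):
    p = np.array([row["value_a"], row["value_b"], row["value_c"]])
    q = np.array([row["a"], row["b"], row["c"]])
    # numpy-only equivalent of scipy's distance.euclidean
    return np.linalg.norm(p - q)

# axis=1 passes one row at a time to the function
ds["new_distance"] = ds.apply(compute_euclidean_distance, axis=1)
```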

### Create Dummy Columns for One-Hot Encoding

```python
one_hot_cols = pd.get_dummies(ds['outcome'], prefix='outcome')
ds.drop('outcome', axis=1, inplace=True)
ds = ds.join(one_hot_cols)
```
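A runnable sketch with a toy 'outcome' column:

```python
import pandas as pd

ds = pd.DataFrame({"outcome": ["win", "lose", "win"], "score": [3, 1, 2]})

# one indicator column per distinct outcome value
one_hot_cols = pd.get_dummies(ds["outcome"], prefix="outcome")
ds = ds.drop("outcome", axis=1).join(one_hot_cols)
```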

### Create a DataFrame as a Combination of Two DataFrames with Different Columns


```python
ds = pd.concat([df_even, df_odd], axis=1)
```

## Row Operations


### Split a Dataset into Train/Test

```python
train = dataset.sample(frac=0.95, random_state=200)
test = dataset.drop(train.index)
```

### Concatenate rows of two different datasets with same columns
In order to concatenate the rows of multiple datasets we can do:

```python
pd.concat([df1, df2, df3], ignore_index=True)
```

A useful shortcut to concat() used to be the append() instance method on Series and DataFrame. These methods actually predated concat and concatenated along axis=0, namely the index; however, append() was deprecated and then removed in pandas 2.0, so concat() should be used instead:

```python
result = pd.concat([df1, df2], ignore_index=True)
```

Another common case is a dataset split among several files; then we can do:

```python
frames = [process_your_file(f) for f in files]
result = pd.concat(frames, ignore_index=True)
```
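For instance, with two small frames standing in for processed files:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# ignore_index=True renumbers the rows instead of keeping 0,1,0,1
result = pd.concat([df1, df2], ignore_index=True)
```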


## Merging DataFrames

In order to merge on a field which can be considered a primary key we can do:

```python
c = pd.merge(ds1, ds2, on='ifp_id')
```

This is an inner join by default: only the 'ifp_id' values that appear in both ds1 and ds2 are taken into account.

We can perform an outer join by specifying the 'how' keyword:

```python
df3 = pd.merge(df1, df2, on="city", how="outer")
```

We can also keep all the keys contained only in the left dataset (or right dataset) with:

```python
df3 = pd.merge(df1, df2, on="city", how="left")
```

If both datasets share column names we can easily add suffixes, for example:

```python
df3 = pd.merge(df1, df2, on="city", how="outer", suffixes=('_first', '_second'))
```

Or, say, each user appears in both tables with predictions value1, value2, and value3; then we can do:

```python
df3 = pd.merge(df1, df2, on="ifp_id", how="inner", suffixes=('_user1', '_user2'))
```
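A small end-to-end sketch of merging with suffixes (toy ifp_id values):

```python
import pandas as pd

df1 = pd.DataFrame({"ifp_id": [1, 2], "value1": [0.1, 0.2]})
df2 = pd.DataFrame({"ifp_id": [2, 3], "value1": [0.3, 0.4]})

# inner join keeps only ifp_id 2; the shared column gets suffixed
df3 = pd.merge(df1, df2, on="ifp_id", how="inner", suffixes=("_user1", "_user2"))
```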

## Dealing with Null Values

### Summarizing Null Values

To count the null values per column:

```python
ds.isnull().sum()
```

### Removing Null Values

In order to drop all the rows which have a null value in any field we do:

```python
ds.dropna(how='any')
```

In order to drop all the rows which have null values in all the fields we do:

```python
ds.dropna(how='all')
```

In order to drop all the rows which have a null value in any field within a subset we do:

```python
ds.dropna(subset=['column', 'column2'], how='any')
```

In order to drop all the rows which have null values in all the fields within a subset we do:

```python
ds.dropna(subset=['column', 'column2'], how='all')
```
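A runnable contrast between the two modes (with made-up columns):

```python
import numpy as np
import pandas as pd

ds = pd.DataFrame({"column": [1.0, np.nan, np.nan],
                   "column2": [2.0, 3.0, np.nan]})

# how='any' drops a row if at least one subset field is null
any_kept = ds.dropna(subset=["column", "column2"], how="any")

# how='all' drops a row only if every subset field is null
all_kept = ds.dropna(subset=["column", "column2"], how="all")
```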

### Replacing Null Values

```python
ds['column_name'].fillna(value='not assigned', inplace=True)
```

We can then check the value counts, including nulls, with:

```python
ds['column_name'].value_counts(dropna=False)
```

## Dealing with Duplicates

### Counting Duplicates

In order to count all the duplicated rows we do:

```python
ds.duplicated().sum()
```

In order to count all the duplicates with respect to a certain subset of fields we do:

```python
ds.duplicated(subset=['column1', 'column2']).sum()
```

In order to count duplicates with respect to a certain column we can do:

```python
ds['column_name'].duplicated().sum()
```

### Visualizing Duplicates

In order to view all the duplicated rows we can do:

```python
ds.loc[ds.duplicated(keep='last'), :]
```

where keep='last' marks every occurrence except the last one as a duplicate, so the rows shown are the earlier instances of each duplicated row.

### Removing Duplicates

To remove duplicates and just keep the first encountered instances we do:

```python
ds.drop_duplicates(keep='first')
```

To remove duplicates and just keep the last encountered instances we do:

```python
ds.drop_duplicates(keep='last')
```

To remove duplicates with respect to a subset of fields:

```python
ds.drop_duplicates(subset=['age', 'zip_code'])
```
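For example, with toy rows that only collide on age and zip_code:

```python
import pandas as pd

ds = pd.DataFrame({"age": [20, 20, 30],
                   "zip_code": ["10001", "10001", "94105"],
                   "name": ["a", "b", "c"]})

# rows 0 and 1 match on the subset, so only the first is kept
deduped = ds.drop_duplicates(subset=["age", "zip_code"])
```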

## Sorting Values

```python
ds.sort_values(['column_1'], ascending=False)
```

## Grouping Values
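Grouping aggregates the rows that share a key; a minimal sketch with hypothetical city/temperature data:

```python
import pandas as pd

ds = pd.DataFrame({"city": ["ny", "ny", "chicago"],
                   "temperature": [20, 30, 10]})

# mean temperature per city
means = ds.groupby("city")["temperature"].mean()
```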


## Map, Apply and ApplyMap

Map applies a translation (a dict or a function) to each element of a Series:

```python
ds['new_column'] = ds['column_name'].map({'female': 0, 'male': 1})
```

Apply applies a function to each element of a Series:

```python
ds['new_column'] = ds.col1.apply(len)
```
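ApplyMap, the third method in this section's title, applies a function to every element of a DataFrame (it was renamed to DataFrame.map in pandas 2.1). A runnable sketch of all three, with made-up columns:

```python
import pandas as pd

ds = pd.DataFrame({"sex": ["female", "male"], "name": ["ann", "bob"]})

# map: elementwise translation on a Series
ds["sex_code"] = ds["sex"].map({"female": 0, "male": 1})

# apply: a function on each element of a Series
ds["name_len"] = ds["name"].apply(len)

# applymap: a function on every element of a DataFrame
lengths = ds[["sex", "name"]].applymap(len)
```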

## Cross Tab

The main purpose of a cross-tabulation is to enable readers to readily compare two categorical variables:

```python
pd.crosstab(ds.column_x, ds.column_y)
```
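For instance, with two toy categorical columns, crosstab produces a table of co-occurrence counts:

```python
import pandas as pd

ds = pd.DataFrame({"column_x": ["a", "a", "b"],
                   "column_y": ["x", "y", "x"]})

# rows indexed by column_x values, columns by column_y values
table = pd.crosstab(ds.column_x, ds.column_y)
```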

## Plotting with Pandas

```python
ds.column_name.plot(kind='hist')
ds.column_name.plot(kind='bar')
```