Pandas DataFrames

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How can I do statistical analysis of tabular data?

Objectives

Select individual values from a Pandas dataframe.

Select entire rows or entire columns from a dataframe.

Select a subset of both rows and columns from a dataframe in a single operation.

Select a subset of a dataframe by a single Boolean criterion.

Note about Pandas DataFrames/Series

A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.

Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.

Selecting values

To access a value at the position [i,j] of a DataFrame, we have two options, depending on what is the meaning of i in use. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Can specify location by numerical index analogously to 2D version of character selection in strings.

import pandas as pd
data = pd.read_csv("data/Thames_Initiative_2009-2017.csv")
data.iloc[0, 4]

'7.2'

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Can specify location by row and/or column name.

data.loc[0, "Site name"]

'Thame at Wheatley'

Use `:` on its own to mean all columns or all rows.

Just like Python’s usual slicing notation.

data.loc[:, "Lab pH"]

     8.01
     7.94
     8.05
     8.14
     8.20
        ...
  7.80
  7.81
  7.70
  7.66
  7.75
Name: Lab pH, Length: 9130, dtype: float64

Would get the same result with data["Lab pH"] (without a second index).

Select multiple columns or rows using `DataFrame.loc` and a named slice.

data.loc[:, "Site name": "Lab pH"]

	Site name 	Sampling Date 	Time of sampling 	Temperature (oC) 	Lab pH
Thame at Wheatley 	03/03/2009 	09:25 	7.2 	8.01
Thame at Wheatley 	09/03/2009 	09:40 	6.8 	7.94
Thame at Wheatley 	16/03/2009 	10:00 	9.3 	8.05
Thame at Wheatley 	24/03/2009 	09:45 	7.8 	8.14
Thame at Wheatley 	01/04/2009 	09:46 	8.9 	8.20
... 	... 	... 	... 	... 	...
Cole at Lynt Bridge 	29/08/2017 	12:15 	17.5 	7.80
Cole at Lynt Bridge 	04/09/2017 	12:10 	15.4 	7.81
Cole at Lynt Bridge 	11/09/2017 	13:40 	14.5 	7.70
Cole at Lynt Bridge 	18/09/2017 	13:40 	12.8 	7.66
Cole at Lynt Bridge 	25/09/2017 	13:20 	15 	7.75

rows × 5 columns

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.

And here’s selecting a subset of rows as well as columns.

data.loc[0: 10, "Site name": "Lab pH"]

 	Site name 	Sampling Date 	Time of sampling 	Temperature (oC) 	Lab pH
Thame at Wheatley 	03/03/2009 	09:25 	7.2 	8.01
Thame at Wheatley 	09/03/2009 	09:40 	6.8 	7.94
Thame at Wheatley 	16/03/2009 	10:00 	9.3 	8.05
Thame at Wheatley 	24/03/2009 	09:45 	7.8 	8.14
Thame at Wheatley 	01/04/2009 	09:46 	8.9 	8.20
Thame at Wheatley 	06/04/2009 	09:48 	11.3 	8.20
Thame at Wheatley 	14/04/2009 	09:10 	11.9 	8.11
Thame at Wheatley 	20/04/2009 	10:00 	11.9 	8.00
Thame at Wheatley 	27/04/2009 	09:40 	12.2 	7.98
Thame at Wheatley 	05/05/2009 	09:25 	12.5 	7.90
Thame at Wheatley 	11/05/2009 	09:44 	14.3 	7.93

Result of slicing can be used in further operations.

Usually don’t just print a slice.
All the statistical operators that work on entire dataframes work the same way on slices.
E.g., calculate max of a slice.

data.loc[:, "Site name": "Lab pH"].max()

Site name        Wye at Bourne End
Sampling Date           31/10/2016
Lab pH                        9.36
dtype: object

data.loc[:, "Site name": "Lab pH"].min()

Site name        Cherwell at Hampton Poyle
Sampling Date                   01/02/2010
Lab pH                                6.67
dtype: object

Use comparisons to select data based on value.

Comparison is applied element by element.
Returns a similarly-shaped dataframe of True and False.

# Use a subset of data to keep output readable.
subset = data.loc[0: 10, "Lab pH"]
subset

# Which values were greater than 8?
subset > 8

   8.01
   7.94
   8.05
   8.14
   8.20
   8.20
   8.11
   8.00
   7.98
   7.90
  7.93
Name: Lab pH, dtype: float64

    True
   False
    True
    True
    True
    True
    True
   False
   False
   False
  False
Name: Lab pH, dtype: bool

Select values or NaN using a Boolean mask.

A frame full of Booleans is sometimes called a mask because of how it can be used.

mask = subset > 8
subset[mask]

  8.01
  8.05
  8.14
  8.20
  8.20
  8.11
Name: Lab pH, dtype: float64

What would the following lines do?

mask = data["Lab pH"] > data["Lab pH"].quantile(0.999)
basic = data[mask]
basic.loc[:, "Site name": "Lab pH"]

	Site name 	Sampling Date 	Time of sampling 	Temperature (oC) 	Lab pH
Ock at Abingdon 	03/05/2016 	14:45 	11.9 	8.59
The Cut at Paley Street 	08/09/2014 	12:00 	16.3 	9.36
Thames at Wallingford 	11/07/2011 	16:35 	19.2 	8.69
Thames at Wallingford 	26/03/2012 	17:18 	13 	8.63
Thames at Hannington 	01/08/2016 	13:10 	16.8 	8.61
Kennet at Woolhampton 	01/08/2016 	09:35 	15.3 	8.73
Enborne at Brimpton 	01/08/2016 	09:50 	15.3 	8.66
Jubilee River at Datchet 	05/03/2012 	12:31 	9.4 	8.88
Colne at Staines 	01/08/2016 	13:50 	17.7 	8.79
Cole at Lynt Bridge 	03/05/2016 	12:10 	11.7 	8.59

Group By: split-apply-combine

Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.

data.loc[:, ["Site name", "Lab pH"]].groupby("Site name").mean()

 	Lab pH
Site name
Cherwell at Hampton Poyle 	7.938881
Cole at Lynt Bridge 	7.946381
Coln at Whelford 	8.006821
Colne at Staines 	8.121493
Enborne at Brimpton 	7.796158
Evenlode at Cassington Mill 	7.933395
Jubilee River at Datchet 	7.950212
Kennet at Woolhampton 	8.034962
Leach at Lechlade 	7.890348
Lodden at Charvil 	7.843077
Ock at Abingdon 	8.007889
Pang at Tidmarsh 	7.930935
Ray at Islip 	7.696256
Thame at Wheatley 	7.846891
Thames at Hannington 	7.915225
Thames at Newbridge 	8.007239
Thames at Runnymede 	7.956557
Thames at Sonning 	7.953984
Thames at Sonning 	8.021600
Thames at Swinford 	8.019977
Thames at Wallingford 	8.021651
The Cut at Paley Street 	7.558159
Windrush at Newbridge 	8.081605
Wye at Bourne End 	8.058515

Various other aggregation functions, like sum, median, max, etc. are available

Can you create a list of the maximum flow rates by sampling site?

e.g.

data[["Site name", "Mean daily flow (m3/s)"]].groupby("Site name").max()

Now explain what each line in the following short program does: what is in first, second, etc.?

first = pd.read_csv("data/Thames_Initiative_2009-2017.csv")
second = first[["Site name", "Lab pH", "Mean daily flow (m3/s)"]]
third = second.groupby("Site name").mean()
fourth = third.sort_values("Mean daily flow (m3/s)")
fourth.to_csv("processed.csv")

Datetime indicies

Pandas is very adept at manipulating timeseries, but the data must be in the correct format first.

Let’s take the date and the time columns, combine them, and create a DatetimeIndex

data["datetime"] = pd.to_datetime(data["Sampling Date"] + " " + data["Time of sampling"])
temporal = data.set_index("datetime")
temporal.index

DatetimeIndex(['2009-03-03 09:25:00', '2009-09-03 09:40:00',
               '2009-03-16 10:00:00', '2009-03-24 09:45:00',
               '2009-01-04 09:46:00', '2009-06-04 09:48:00',
               '2009-04-14 09:10:00', '2009-04-20 10:00:00',
               '2009-04-27 09:40:00', '2009-05-05 09:25:00',
               ...
               '2017-07-24 12:30:00', '2017-07-31 12:10:00',
               '2017-07-08 12:00:00', '2017-08-14 12:15:00',
               '2017-08-21 12:10:00', '2017-08-29 12:15:00',
               '2017-04-09 12:10:00', '2017-11-09 13:40:00',
               '2017-09-18 13:40:00', '2017-09-25 13:20:00'],
              dtype='datetime64[ns]', name='datetime', length=9130, freq=None)

We can now index the data by date as well as all the other methods of indexing!

e.g.

temporal.loc['2009-03-03 09:25:00', "Lab pH"]
temporal.loc['2009-03-03', "Lab pH"]
temporal.loc['2009', "Lab pH"]

It also paves the way for split-apply-combine on time ranges

temporal["Lab pH"].resample("1W").mean()

More on plotting in a moment, but here’s the data resampled to yearly.

temporal["Lab pH"].resample("1A").mean().plot()

Many Ways of Access

There are at least two ways of accessing a value or slice of a DataFrame: by name or index. However, there are many others. For example, a single column or row can be accessed either as a DataFrame or a Series object.

Suggest different ways of doing the following operations on a DataFrame:

Access a single column

Access a single row

Access an individual DataFrame element

Access several columns

Access several rows

Access a subset of specific rows and columns

Access a subset of row and column ranges

Solution

1. Access a single column:

# by name
data["col_name"]   # as a Series
data[["col_name"]] # as a DataFrame

# by name using .loc
data.T.loc["col_name"]  # as a Series
data.T.loc[["col_name"]].T  # as a DataFrame

# Dot notation (Series)
data.col_name

# by index (iloc)
data.iloc[:, col_index]   # as a Series
data.iloc[:, [col_index]] # as a DataFrame

# using a mask
data.T[data.T.index == "col_name"].T

2. Access a single row:

# by name using .loc
data.loc["row_name"] # as a Series
data.loc[["row_name"]] # as a DataFrame

# by name
data.T["row_name"] # as a Series
data.T[["row_name"]].T # as a DataFrame

# by index
data.iloc[row_index]   # as a Series
data.iloc[[row_index]]   # as a DataFrame

# using mask
data[data.index == "row_name"]

3. Access an individual DataFrame element:

# by column/row names
data["column_name"]["row_name"]         # as a Series

data[["col_name"]].loc["row_name"]  # as a Series
data[["col_name"]].loc[["row_name"]]  # as a DataFrame

data.loc["row_name"]["col_name"]  # as a value
data.loc[["row_name"]]["col_name"]  # as a Series
data.loc[["row_name"]][["col_name"]]  # as a DataFrame

data.loc["row_name", "col_name"]  # as a value
data.loc[["row_name"], "col_name"]  # as a Series. Preserves index. Column name is moved to `.name`.
data.loc["row_name", ["col_name"]]  # as a Series. Index is moved to `.name.` Sets index to column name.
data.loc[["row_name"], ["col_name"]]  # as a DataFrame (preserves original index and column name)

# by column/row names: Dot notation
data.col_name.row_name

# by column/row indices
data.iloc[row_index, col_index] # as a value
data.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
data.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
data.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)

# column name + row index
data["col_name"][row_index]
data.col_name[row_index]
data["col_name"].iloc[row_index]

# column index + row name
data.iloc[:, [col_index]].loc["row_name"]  # as a Series
data.iloc[:, [col_index]].loc[["row_name"]]  # as a DataFrame

# using masks
data[data.index == "row_name"].T[data.T.index == "col_name"].T

4. Access several columns:

# by name
data[["col1", "col2", "col3"]]
data.loc[:, ["col1", "col2", "col3"]]

# by index
data.iloc[:, [col1_index, col2_index, col3_index]]

5. Access several rows

# by name
data.loc[["row1", "row2", "row3"]]

# by index
data.iloc[[row1_index, row2_index, row3_index]]

6. Access a subset of specific rows and columns

# by names
data.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]

# by indices
data.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]

# column names + row indices
data[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]

# column indices + row names
data.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]

7. Access a subset of row and column ranges

# by name
data.loc["row1":"row2", "col1":"col2"]

# by index
data.iloc[row1_index:row2_index, col1_index:col2_index]

# column names + row indices
data.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]

# column indices + row names
data.iloc[:, col1_index:col2_index].loc["row1":"row2"]

Key Points

Use DataFrame.iloc[..., ...] to select values by integer location.

Use : on its own to mean all columns or all rows.

Select multiple columns or rows using DataFrame.loc and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Split apply combine operations

Select values or NaN using a Boolean mask.

Datetime indicies for operations on timeseries

previous episode

Plotting and Programming in Python

next episode

Pandas DataFrames

Overview

Note about Pandas DataFrames/Series

Selecting values

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Use `:` on its own to mean all columns or all rows.

Select multiple columns or rows using `DataFrame.loc` and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Select values or NaN using a Boolean mask.

Group By: split-apply-combine

Datetime indicies

Many Ways of Access

Solution

Key Points

previous episode

next episode

previous episode

Plotting and Programming in Python

next episode

Pandas DataFrames

Overview

Note about Pandas DataFrames/Series

Selecting values

Use DataFrame.iloc[..., ...] to select values by their (entry) position

Use DataFrame.loc[..., ...] to select values by their (entry) label.

Use : on its own to mean all columns or all rows.

Select multiple columns or rows using DataFrame.loc and a named slice.

Result of slicing can be used in further operations.

Use comparisons to select data based on value.

Select values or NaN using a Boolean mask.

Group By: split-apply-combine

Datetime indicies

Many Ways of Access

Solution

Key Points

previous episode

next episode

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

Use `:` on its own to mean all columns or all rows.

Select multiple columns or rows using `DataFrame.loc` and a named slice.