Introduction to Pandas

Pandas introduces new data structures, the most important of which are the Series and the DataFrame.

Series

The Series data structure consists of an index plus data. It is similar to a dictionary with the differences that

  • the size is fixed
  • requesting a non-existent index label raises a KeyError (no dynamic creation on lookup)

Series objects are one-dimensional and can contain any type. The indices are treated like row labels.
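As a minimal sketch of the dictionary comparison above, the following builds a small Series with string labels, reads one value by its label, and shows that asking for a missing label raises a KeyError. The name temps and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
temps = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])

print(temps["Tue"])          # look up a value by its index label

missing = False
try:
    temps["Fri"]             # this label does not exist
except KeyError as err:
    missing = True
    print("KeyError:", err)
```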

This simple example loads a Series with normally-distributed random numbers, then prints some of them, prints the basic statistics, and creates a line plot.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

randns=pd.Series(np.random.randn(1000))
print(randns.head())
print(randns.tail())
print(randns.describe())
randns.plot()
plt.show()

Series are conceptually like a single column of a spreadsheet, with any headers omitted. The values are similar to a NumPy ndarray, and most NumPy methods can be applied to the Series data, provided they are defined for the data type. Missing data, represented by np.nan by default, is skipped by Pandas methods such as mean and sum unless we request otherwise.

However, if the values are needed as an actual ndarray, they must be converted.

vals=randns.to_numpy()
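The treatment of missing data is one place where a Series and its converted ndarray behave differently. The following is a small sketch, using a made-up three-element Series named with_gap, that compares the two.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
with_gap = pd.Series([1.0, np.nan, 3.0])

series_mean = with_gap.mean()             # Pandas skips the NaN: (1 + 3) / 2
array_mean = with_gap.to_numpy().mean()   # the ndarray method propagates the NaN

print(series_mean)
print(array_mean)
print(with_gap.count())                   # number of non-missing values
```

If the NaN should propagate in the Series calculation as well, mean accepts skipna=False.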

We can load a Series with a dictionary.

scores=pd.Series({'Cougars':11,'Bears':9,'Cubs':8,'Tigers':6})

We can still slice it by position.

scores[1:3]

We can still use iloc to extract by row number. We can also use loc to extract by the row name.

scores.loc['Cubs']
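To make the difference between the two accessors concrete, here is a short sketch using the same scores Series. One detail worth noticing is that a position slice with iloc excludes its end, while a label slice with loc includes both endpoints.

```python
import pandas as pd

scores = pd.Series({'Cougars': 11, 'Bears': 9, 'Cubs': 8, 'Tigers': 6})

print(scores.iloc[0])                # first row, selected by position
print(scores.loc['Cubs'])            # row selected by its label
print(scores.iloc[1:3])              # positions 1 and 2; the end is excluded
print(scores.loc['Bears':'Cubs'])    # label slice; both endpoints are included
```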

Remember to print if using Spyder, or to run in the interpreter pane.

DataFrames

The most important data structure in Pandas is the DataFrame. It can be conceptualized as a representation of a spreadsheet. DataFrames are two-dimensional. Each column has a name, which can be read from the headers of a spreadsheet; rows are numbered; and data types may differ from column to column. Alternatively, a DataFrame may be regarded as a dictionary with values that can be lists, ndarrays, dictionaries, or Series.

The DataFrame is a mutable type.

We can create a DataFrame by passing a dictionary. Consider a simple grade-book example.


grade_book=pd.DataFrame({"Name":["Jim Dandy","Betty Boop","Minnie Moocher",
                                 "Joe Friday","Teddy Salad"],
                         "Year":[2,4,1,2,3],"Grade":[85.4,91.7,73.2,82.3,98.5]})
print(grade_book)

The result of printing the DataFrame should look like this:

             Name  Year  Grade
0       Jim Dandy     2   85.4
1      Betty Boop     4   91.7
2  Minnie Moocher     1   73.2
3      Joe Friday     2   82.3
4     Teddy Salad     3   98.5
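Since each column may hold a different data type, it can be useful to inspect the types and dimensions after building a DataFrame. A brief sketch using the same grade book and the dtypes, shape, and columns attributes:

```python
import pandas as pd

grade_book = pd.DataFrame({"Name": ["Jim Dandy", "Betty Boop", "Minnie Moocher",
                                    "Joe Friday", "Teddy Salad"],
                           "Year": [2, 4, 1, 2, 3],
                           "Grade": [85.4, 91.7, 73.2, 82.3, 98.5]})

print(grade_book.dtypes)           # one data type per column
print(grade_book.shape)            # (number of rows, number of columns)
print(list(grade_book.columns))    # column names, taken from the dictionary keys
```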

Now we can apply methods to the grade_book DataFrame.

grade_book.describe() #Summarizes
grade_book.head()     #print first few lines
grade_book.tail()     #print last lines

The head and tail methods are more useful for longer datasets. We can provide them parameters to print a specified number of rows other than the default 5.

grade_book.head(2)
grade_book.tail(1)

Accessing and Modifying Data

We can access individual columns by name. If the name of the column is a valid Python variable name then we may use it as an attribute; otherwise we must refer to it as we would to a dictionary key.

grade_book.Name
grade_book['Name']
grade_book.Grade.mean()

An individual column is of type Series.
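The sketch below, which uses a shortened two-row grade book for brevity, checks that both access forms return the same Series and shows a column whose name contains a space, so that only the dictionary-style form applies.

```python
import pandas as pd

# Shortened grade book, for illustration only
grade_book = pd.DataFrame({"Name": ["Jim Dandy", "Betty Boop"],
                           "Letter Grade": ["B", "A"],
                           "Grade": [85.4, 91.7]})

print(type(grade_book["Grade"]))                       # a Pandas Series
print(grade_book.Grade.equals(grade_book["Grade"]))    # both forms give the same column
print(grade_book["Letter Grade"])                      # a name with a space needs brackets
```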

Columns can be deleted with the drop method. By default this does not change the original DataFrame; it returns a new DataFrame. To modify the original instead, either assign the result back to the same name or pass the option inplace=True.

grades_only=grade_book.drop(columns='Year')
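A quick way to confirm that drop leaves the original untouched is to compare the column lists afterward. This sketch uses a shortened two-row grade book.

```python
import pandas as pd

# Shortened grade book, for illustration only
grade_book = pd.DataFrame({"Name": ["Jim Dandy", "Betty Boop"],
                           "Year": [2, 4],
                           "Grade": [85.4, 91.7]})

grades_only = grade_book.drop(columns="Year")

print(list(grades_only.columns))   # Year is absent from the new DataFrame
print(list(grade_book.columns))    # the original still has all three columns
```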

A new column can be appended by assignment (its length must equal the number of rows).

grade_book["Letter Grade"]=["B","A","C","B","A"]
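The row-count requirement can be seen in a small sketch with a shortened two-row grade book. The column name Rank is hypothetical and serves only to trigger the error.

```python
import pandas as pd

# Shortened grade book, for illustration only
grade_book = pd.DataFrame({"Name": ["Jim Dandy", "Betty Boop"],
                           "Grade": [85.4, 91.7]})

grade_book["Letter Grade"] = ["B", "A"]    # two values for two rows: accepted

length_error = False
try:
    grade_book["Rank"] = [1]               # one value for two rows: rejected
except ValueError as err:
    length_error = True
    print("ValueError:", err)

print(list(grade_book.columns))
```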

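Extract values into an ndarray. The values attribute works here; the to_numpy() method shown earlier is the form the Pandas documentation recommends.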

grades=grade_book["Grade"].values

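To add a row, we should use concat. Any column missing from the new row, such as Letter Grade here, is filled with NaN. Pass ignore_index=True so that the rows are renumbered; otherwise the new row keeps its own label 0.

new_row=pd.DataFrame([["Dinsdale Piranha",1,75.5]],columns=["Name","Year","Grade"])
grade_book=pd.concat([grade_book,new_row],axis=0,ignore_index=True)

To delete a row, pass its index label to drop. As with columns, this returns a new DataFrame, so we assign the result back.

grade_book=grade_book.drop([len(grade_book)-1])

With the default integer labels, this drops the last row.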
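Because drop works on index labels rather than positions, the labels produced by concat matter. The following sketch, on a shortened two-row grade book, compares the index with and without ignore_index=True and then drops the last row of the renumbered result. The names kept, renumbered, and trimmed are illustrative.

```python
import pandas as pd

# Shortened grade book, for illustration only
grade_book = pd.DataFrame({"Name": ["Jim Dandy", "Betty Boop"],
                           "Year": [2, 4],
                           "Grade": [85.4, 91.7]})
new_row = pd.DataFrame([["Dinsdale Piranha", 1, 75.5]],
                       columns=["Name", "Year", "Grade"])

kept = pd.concat([grade_book, new_row], axis=0)
renumbered = pd.concat([grade_book, new_row], axis=0, ignore_index=True)

print(list(kept.index))          # the new row keeps its own label 0
print(list(renumbered.index))    # labels are renumbered consecutively

# With consecutive labels, the last label equals len - 1
trimmed = renumbered.drop([len(renumbered) - 1])
print(list(trimmed["Name"]))
```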

Plots

We can directly apply basic plotting methods, which are built on Matplotlib, to DataFrame columns. Call plt.show() to display each figure.

grade_book.Grade.hist()
plt.show()
grade_book.Grade.plot()
plt.show()

Exercise

Set up a dataframe with the following “weather” data (it is synthetic):

Date, Minimum Temp, Maximum Temp
"2000-01-01 00:00:00",-5.87,8.79
"2000-01-02 00:00:00",-3.82,4.78
"2000-01-03 00:00:00",-4.58,5.10
"2000-01-04 00:00:00",-6.40,2.68
"2000-01-05 00:00:00",-5.50,6.18
"2000-01-06 00:00:00",-3.29,4.50

Run describe. Print the mean values. Extract the minimum temperature and the maximum temperature into ndarrays. Plot the data using Pandas.

Example solution

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

weather=pd.DataFrame({"Date":["2000-01-01 00:00:00","2000-01-02 00:00:00", 
                              "2000-01-03 00:00:00","2000-01-04 00:00:00",
                              "2000-01-05 00:00:00","2000-01-06 00:00:00"],
                      "Minimum Temp":[-5.87,-3.82,-4.58,-6.40,-5.50,-3.29],
                      "Maximum Temp":[8.79,4.78,5.10,2.68,6.18,4.50]})

print(weather.describe())
print(weather["Minimum Temp"].mean())
print(weather["Maximum Temp"].mean())

tmin_vals=weather["Minimum Temp"].values
tmax_vals=weather["Maximum Temp"].values

weather["Minimum Temp"].plot()
weather["Maximum Temp"].plot()

plt.show()
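As an optional variation on the solution, and not part of the exercise as stated, the Date strings can be converted with pd.to_datetime and used as the index. Every remaining column is then numeric, so a single call to mean covers both temperatures. This is a sketch under that assumption; the name means is illustrative.

```python
import pandas as pd

weather = pd.DataFrame({"Date": ["2000-01-01 00:00:00", "2000-01-02 00:00:00",
                                 "2000-01-03 00:00:00", "2000-01-04 00:00:00",
                                 "2000-01-05 00:00:00", "2000-01-06 00:00:00"],
                        "Minimum Temp": [-5.87, -3.82, -4.58, -6.40, -5.50, -3.29],
                        "Maximum Temp": [8.79, 4.78, 5.10, 2.68, 6.18, 4.50]})

# Convert the strings to timestamps and use them as row labels
weather["Date"] = pd.to_datetime(weather["Date"])
weather = weather.set_index("Date")

means = weather.mean()    # one mean per column
print(means)
```

With the dates as the index, calling weather.plot() should draw both temperature columns against the date on a single set of axes.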



