Project Set 4

Pandas

These are in generally in order of difficulty, from easiest to most difficult. We have not covered all of these items in our notes. There are many good resources online, e.g. you can Google: “python pandas create new dataframe”.

Project 14

Create a new dataframe using the following table:

Coach School Email Career Wins National Championships
Tony Bennett Virginia tbennett@virginia.edu 346 1
Roy Williams North Carolina rwilliams@unc.edu 871 3
Mike Krzyzewski Duke coachk@duke.edu 1132 5
Tom Izzo Michigan State tizzo@msu.edu 606 2
Jim Boeheim Syracuse jboeheim@syracuse.edu 944 1

  • Begin by importing pandas
  • Create a new column that is the ratio of career wins to national championships.
  • Rename your columns with a variable name. This is not required but often makes your code more readable.
  • Print a list of all columns.
  • split a subset of your dataframe based on some condition. Do it at least twice. Once using .loc and once using .iloc.
  • use the groupby() method to group your dataframe in some way that makes sense for your data
Example solution

#import pandas
import pandas as pd

#create new dataframe
basketball_coaches = pd.DataFrame({"Name": ['Tony Bennett', 'Roy Williams', 'Mike Krzyzewski', 'Tom Izzo', 'Jim Boeheim'],
                          "School": ['Virginia', 'North Carolina', 'Duke', 'Michigan State', 'Syracuse'],
                          "Email": ['tbennett@virginia.edu', 'rwilliams@unc.edu', 'coachk@duke.edu','tizzo@msu.edu', 'jboeheim@syracuse.edu'],
                          "Career Wins": [346, 871, 1132, 606, 944],
                          "National Championships": [1, 3, 5, 2, 1]})


print(basketball_coaches)

#rename columns with a variable name
name = basketball_coaches['Name']
school = basketball_coaches['School']
email = basketball_coaches['Email']
career_wins = basketball_coaches['Career Wins']
championships = basketball_coaches['National Championships']

#make new subset of data
wins_per_championship = career_wins/championships

#or if you want it to be a part of the dataframe
basketball_coaches['Wins Per Championship'] = basketball_coaches['Career Wins'] / basketball_coaches['National Championships']

#print a list of all the columns
print(basketball_coaches.columns.tolist())

#split a subset of the dataframe based on some condition. Several examples listed here
uva_coach = basketball_coaches.loc[basketball_coaches['School'] == 'Virginia']

multiple_championships = basketball_coaches.loc[basketball_coaches['National Championships'] > 1]

first_three_coaches = basketball_coaches.iloc[:3]

print()
wins=basketball_coaches.groupby(championships)
for key, item in wins:
    print(wins.get_group(key))

Project 15

Take the following code block:

import pandas

df1 = pd.DataFrame({'ID #': [1,2,3,4,5,6,7,8,9,10],
                    'Names': ['John','Dave','Mary','Sarah','Mohammed','Rohan','Prisha','Vijay','Ananya','Raj'],
                    'Country': ['USA','USA','USA','UK','India','India','UK','India','UK','India']})

df2 = pd.DataFrame({'ID #': [1,2,3,4,5,6,7,8,9,10],
                    'Salary': [50000, 60000, 65000, 53000, 59000, 74000, 86000, 41000, 94000, 66000],
                    'Age': [24, 46, 51, 29, 33, 38, 70, 46, 49, 35]})
  • Join the two dataframes together using the merge function
  • How many people come from each country? (Hint: Don’t just count them. Which function allows you to see that easily?)
  • Reshape the data and create a pivot table view of people by country using the pivot_table function. Also include the name, age, and salary in the results
Example solution

import pandas as pd


df1 = pd.DataFrame({'ID #': [1,2,3,4,5,6,7,8,9,10],
                    'Name': ['John','Dave','Mary','Sarah','Mohammed','Rohan','Prisha','Vijay','Ananya','Raj'],
                    'Country': ['USA','USA','USA','UK','India','India','UK','India','UK','India']})

df2 = pd.DataFrame({'ID #': [1,2,3,4,5,6,7,8,9,10],
                    'Salary': [50000, 60000, 65000, 53000, 59000, 74000, 86000, 41000, 94000, 66000],
                    'Age': [24, 46, 51, 29, 33, 38, 70, 46, 49, 35]})


#merge
result = pd.merge(df1, df2, on="ID #")

#see counts of people by country
#print(result['Country'].value_counts())

#pivot table reshape
result2 = pd.pivot_table(result, index=['Country', 'Name', 'Age', 'Salary'])
print(result2)

Project 16

Download the file pandas_demo.ipynb and the data files eighthr.data and eightr.names. If you are using Windows, check that it does not append “.txt” to the data files. You may need to open File Explorer, go to View, and check “File name extensions.” Open the notebook in JupyterLab or Jupyter. Go through the exercises in each cell.

Example solution, zipped Jupyter notebook

pandas_demo.zip

Project 17

Download the file cigarette-smoking-behaviour-2018-census.csv, which is about cigarette smoking in New Zealand.

  • Read the file into a pandas dataframe
  • Make a bar plot in Matplotlib of types of cigarette smokers (‘Regular Smoker’, ‘Ex-Smoker’, etc.) and their count
  • Because we have a total number of respondents, let’s make a new column that is a ratio of # of each category / total number of respondents
Example solution

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

smokers=pd.read_csv("cigarette-smoking-behaviour-2018-census.csv") 

smokers.columns=["Code","Behavior","Count"]

labels=smokers.Behavior.iloc[:5].tolist()
counts=smokers.Count.iloc[:5].tolist()
countbc=pd.DataFrame(counts,index=labels)
countbc.plot.bar(legend=False,rot=70,title="Smoking in New Zealand as of 2018")

smokers["Ratio"]=smokers.Count/smokers.at[6,'Count']
ratios=smokers["Ratio"]

percents=[100.*x for x in ratios.tolist()[:5]]
countbc=pd.DataFrame(percents,index=labels)
countbc.plot.bar(legend=False,rot=70,title="Smoking in New Zealand as of 2018")
plt.ylabel("Percent of Total Responses")
plt.show()

Project 18

Redo Project 14 using Pandas.

Download cville_2017_april.xlsx, which contains April 2017 weather data for Charlottesville, VA.

  • Read the file into a pandas dataframe
  • Make a line plot of average wind speed for each day
  • Add main titles, and label axes to be more descriptive
  • Play with the bottom axis (x axis) to make sure all dates are visible
  • Make a bar and line plot showing average wind speed (in bars) and max wind gust (as a line). Add legend to distinguish between the two.
  • Make stacked bar chart of minimum and maximum temperatures for each day
  • Make grouped bar chart of minimum and maximum temperatures for each day
  • Plot the number of each weather ‘condition’. Plot sunny days, partly cloudy days, and rain days. There are several ways to do this.
Example solution

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

weather=pd.read_excel("cville_2017_april.xlsx")

xlabels=weather.Date.dt.strftime("%Y-%m-%d")

axs1=weather["Avg Wind (mph)"].plot(title="Average Wind Speed")
axs1.set_xticklabels(rotation=45,labels=xlabels)
plt.xlabel("Date")
plt.ylabel("Speed in MPH")
plt.show()

axs2=weather["Avg Wind (mph)"].plot.bar(title="Average Wind Speed and Gusts")
weather["Wind Gust (mph)"].plot(color="r")
axs2.xaxis.set_major_locator(plt.MaxNLocator(8))
axs2.set_xticklabels(rotation=45,labels=xlabels)
plt.xlabel("Date")
plt.ylabel("Speed in MPH")
axs2.legend()
plt.show()

#How to make a stacked chart if you want the values added
axs3=weather[["Lo Temp","Hi Temp"]].plot.bar(title="Stacked High and Low Temperatures",stacked=True)
axs3.xaxis.set_major_locator(plt.MaxNLocator(8))
axs3.set_xticklabels(rotation=45,labels=xlabels)
plt.xlabel("Date")
plt.ylabel("Temperature (F)")
plt.show()

#What we really want
fig,axs4=plt.subplots(1,1)
lows=weather["Lo Temp"]
plt.bar(weather["Date"],weather["Hi Temp"],color="orange")
plt.bar(weather["Date"],lows)
axs4.xaxis.set_major_locator(plt.MaxNLocator(8))
axs4.set_xticklabels(rotation=45,labels=xlabels)
plt.xlabel("Date")
plt.ylabel("Temperature (F)")
plt.title("Daily High and Low Temperatures")
plt.legend(labels=["High","Low"])
plt.show()

axs5=weather[["Hi Temp","Lo Temp"]].plot.bar(title="Daily High and Low Temperatures")
axs5.xaxis.set_major_locator(plt.MaxNLocator(8))
axs5.set_xticklabels(rotation=45,labels=xlabels)
plt.xlabel("Date")
plt.ylabel("Temperature (F)")
plt.show()

days=weather.groupby("Condition")["Date"].count()

days.plot.bar(title="Days Per Cloud Condition")
plt.ylabel("Count")

plt.show()


Previous
Next