raise
The raise keyword has already come up, in passing.
raise Exception is the simplest way to have your program stop when something goes wrong:
raise Exception
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
Exception
raise <something>; common choices:
Exception is the most general case (“something happened”)
TypeError: some variable is the wrong type
ValueError: some variable is the right type but the wrong value
x = -1
if not isinstance(x,str):  ## check if x is a str
    raise TypeError
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
TypeError
import math
x = -1
if x<0:
    raise ValueError
print(math.sqrt(x))
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
ValueError
x = -1
if not isinstance(x,str):  ## check if x is a str
    errstr = "x is of type "+type(x).__name__+", should be str"
    raise TypeError(errstr)
TypeError: x is of type int, should be str
f-strings are a convenient way to construct error messages: anything inside curly brackets is interpreted as a Python expression. e.g.
x=1
print(f"x is of type {type(x).__name__}, should be str")
## x is of type int, should be str
So we could use
if not isinstance(x,str):  ## check if x is a str
    ## note the f prefix: without it the curly brackets are printed literally
    raise TypeError(f"x is of type {type(x).__name__}, should be str")
x = -1
if x<0:
    raise ValueError(f"x should be non-negative, but it equals {x}")
ValueError: x should be non-negative, but it equals -1
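Putting the two checks together: the sketch below wraps the type check and the value check in one function. The name checked_sqrt is made up for illustration, and accepting int or float (rather than str, as in the toy example above) is an assumption.

```python
import math

def checked_sqrt(x):
    """Return sqrt(x), raising an informative error for bad input (hypothetical helper)."""
    # wrong type -> TypeError
    if not isinstance(x, (int, float)):
        raise TypeError(f"x is of type {type(x).__name__}, should be int or float")
    # right type but wrong value -> ValueError
    if x < 0:
        raise ValueError(f"x should be non-negative, but it equals {x}")
    return math.sqrt(x)

print(checked_sqrt(16))
```

Calling checked_sqrt(-1) or checked_sqrt("a") stops the program with the corresponding message.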
An error means “it’s impossible to continue” or “you shouldn’t continue without fixing the problem”. You might want to issue a warning instead. This is not too different from just using print(), but it allows advanced users to decide if they want to suppress warnings.
import warnings
warnings.warn("something bad happened")
## <string>:1: UserWarning: something bad happened
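As a rough illustration of what “deciding to suppress warnings” can look like, the sketch below uses the standard library's warnings.catch_warnings context manager. The function noisy_half is a made-up example, not part of any library.

```python
import warnings

def noisy_half(x):
    """Hypothetical helper that warns instead of failing outright."""
    if x % 2 != 0:
        warnings.warn(f"{x} is odd; result will be rounded down", UserWarning)
    return x // 2

# Suppress warnings inside this block only
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet = noisy_half(7)

# Record warnings so they can be inspected instead of printed
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud = noisy_half(7)

print(quiet, loud, len(caught))
```

The result is the same either way; only the handling of the warning changes, which is the main difference from a plain print().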
Now suppose you are getting an error and you don’t want your program to stop. “Wrapping” your code in a try: clause will allow you to specify what to do in this case. pass is a special Python statement called a “null operation” or a “no-op”; it does nothing except keep going.
try:
    x = math.sqrt(-1)
except:
    pass
## keep going (but x will not be set)
You can specify something you want to do with only a particular set of errors:
try:
    x = math.sqrt(-1)
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## keep going (but x will not be set)
## a ValueError occurred
If the error isn’t caught because it isn’t the right type, it will act like it normally does (without the try:)
try:
    z += 5  ## not defined yet
except ValueError:
    print("a ValueError occurred")
NameError: name 'z' is not defined
We could catch this with a general-purpose except:
try:
    z += 5  ## not defined yet
except ValueError:
    print("a ValueError occurred")
except:
    print("some other error occurred")
## some other error occurred
Or add another clause to catch it:
try:
    z += 5  ## not defined yet
except ValueError:
    print("a ValueError occurred")
except NameError:
    print("a NameError occurred")
except:
    print("some other error occurred")
## a NameError occurred
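Two related idioms, sketched here under the assumption that you want to report what went wrong: several error types can share one clause by listing them in a tuple, and "as err" gives access to the exception object itself. The function name describe_failure is invented for this example. Note that "except Exception" is slightly narrower than a bare "except:" (it does not catch KeyboardInterrupt, for instance).

```python
import math

def describe_failure(func):
    """Run a zero-argument function and report which error (if any) it raised."""
    try:
        func()
    except (ValueError, NameError) as err:
        # err is the exception object; its type and message are both available
        return f"{type(err).__name__}: {err}"
    except Exception as err:
        return f"some other error: {type(err).__name__}"
    return "no error"

print(describe_failure(lambda: math.sqrt(-1)))
print(describe_failure(lambda: undefined_name + 5))
print(describe_failure(lambda: 1 / 0))
```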
Sometimes it makes sense to catch the error and substitute a placeholder value (nan, …):
try:
    x = math.sqrt(-1)
except ValueError:
    x = math.nan
print(x)
## nan
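The same pattern can be wrapped in a function so that it applies to a whole list of values. This is only a sketch; sqrt_or_nan is a made-up name.

```python
import math

def sqrt_or_nan(x):
    """Square root of x, or nan if x is outside the domain of math.sqrt."""
    try:
        return math.sqrt(x)
    except ValueError:
        return math.nan

values = [4, -1, 9]
roots = [sqrt_or_nan(v) for v in values]
print(roots)
```

Because nan is not equal to anything (including itself), use math.isnan() rather than == to test for it afterwards.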
pandas
pandas stands for panel data system. It’s a convenient and powerful system for handling large, complicated data sets. (The author pronounces it “pan-duss”.)
(… NaN)
import pandas as pd  ## standard abbreviation
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
## initialize DataFrame with a *dictionary*
p = pd.DataFrame({'Name': names, 'Count': births})
print(p)
## Name Count
## 0 Bob 968
## 1 Jessica 155
## 2 Mary 77
## 3 John 578
## 4 Mel 973
What can we do with it?
.iloc gives row/column indices (like an array)
p["Count"]           ## extract a column = Series (by *name*)
p[2:3] ## slice one row (3-2 = 1)
p[2:5] ## slice multiple rows
p[["Name","Count"]] ## extract multiple columns (data frame)
p.iloc[1,1] ## index with row/column integers like an array
p.iloc[0:5,:] ## can also slice
Indexing by name
p["Name"][4] ## 5th element of Name
p.Name ## attribute!
p.loc[1:2,"Name"] ## index by *label*, _inclusive_
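The difference between label-based and integer-based slicing is easy to get wrong, so here is a small check on the same baby-names frame. With the default index the row labels happen to be the integers 0 to 4, which is why the two slices look so similar.

```python
import pandas as pd

p = pd.DataFrame({'Name': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                  'Count': [968, 155, 77, 578, 973]})

by_label = p.loc[1:2, "Name"]     # label slice: rows 1 AND 2
by_position = p.iloc[1:2, 0]      # integer slice: row 1 only
print(list(by_label))
print(list(by_position))
print(p.iloc[1, 1])               # row 1, column 1
```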
Download US measles data from Project Tycho.
read_csv reads a CSV file as a data frame; it automatically interprets the first row as headings
df.iloc[] indexes the result as though it were an array
df.head() shows just the beginning; df.tail() shows just the end
Let’s look at the first few rows of a data set on measles in US states:
## "Weekly Measles Cases, 1909-2001"
## ...
## "Data provided by Project Tycho, Data Version 1.0.0, released 28 Novem...
## "YEAR","WEEK","ALABAMA","ALASKA","AMERICAN SAMOA","ARIZONA","ARKANSAS"...
## 1909,1,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,2,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,3,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
fn = "../data/MEASLES_Cases_1909-2001_20150322001618.csv"
p = pd.read_csv(fn,skiprows=2,na_values=["-"]) ## read in data
p.head() ## look at the first little bit
## YEAR WEEK ALABAMA ALASKA ... WEST VIRGINIA WISCONSIN WYOMING Unnamed: 61
## 0 1909 1 NaN NaN ... NaN NaN NaN NaN
## 1 1909 2 NaN NaN ... NaN NaN NaN NaN
## 2 1909 3 NaN NaN ... NaN NaN NaN NaN
## 3 1909 4 NaN NaN ... NaN NaN NaN NaN
## 4 1909 5 NaN NaN ... NaN NaN NaN NaN
##
## [5 rows x 62 columns]
Mostly NaN values at the beginning! (NaN = “not a number”: similar to nan from math or numpy)
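To see what skiprows and na_values do without downloading anything, the sketch below reads a made-up miniature file from a string. The contents are invented and only mimic the layout of the measles file (title lines, a header row, "-" for missing values).

```python
import io
import pandas as pd

# Made-up miniature file: two title lines, a header row, then data with "-" for missing
raw = '''"Weekly Cases (toy example)"
"not real data"
"YEAR","WEEK","A","B"
1909,1,-,3
1909,2,5,-
'''
toy = pd.read_csv(io.StringIO(raw), skiprows=2, na_values=["-"])
print(toy)
print(toy.isna().sum())   # number of missing values in each column
```

Because "-" is converted to NaN, columns A and B are read as floating-point numbers rather than as strings.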
Like numpy array indexing, but a little different …
df.loc[:,"MASSACHUSETTS":"NEVADA"] (index by label; includes endpoint)
iloc method, df.iloc[:,range] (index by integer; doesn’t include endpoint)
p.loc[:,"MASSACHUSETTS":"NEVADA"]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
This is the same:
pc = list(p.columns)  ## list of column names
print(pc[:5])
## ['YEAR', 'WEEK', 'ALABAMA', 'ALASKA', 'AMERICAN SAMOA']
## find the locations of these two state names
mass_ind = pc.index("MASSACHUSETTS")
neva_ind = pc.index("NEVADA")
## index using `.iloc` (with extended range)
p.iloc[:,mass_ind:neva_ind+1]
## MASSACHUSETTS MICHIGAN MINNESOTA ... MONTANA NEBRASKA NEVADA
## 0 NaN NaN NaN ... NaN NaN NaN
## 1 NaN NaN NaN ... NaN NaN NaN
## 2 NaN NaN NaN ... NaN NaN NaN
## 3 NaN NaN NaN ... NaN NaN NaN
## 4 NaN NaN NaN ... NaN NaN NaN
## ... ... ... ... ... ... ... ...
## 4856 NaN NaN NaN ... NaN NaN NaN
## 4857 NaN NaN NaN ... NaN NaN NaN
## 4858 NaN NaN NaN ... NaN NaN NaN
## 4859 NaN NaN NaN ... NaN NaN NaN
## 4860 NaN NaN NaN ... NaN NaN NaN
##
## [4861 rows x 8 columns]
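A compact way to convince yourself that the label-based and integer-based versions agree is sketched below on a tiny made-up frame. Here the column positions are looked up with the Index method get_loc instead of converting the columns to a list.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])

start = df.columns.get_loc("B")   # integer position of column "B"
stop = df.columns.get_loc("C")

by_label = df.loc[:, "B":"C"]               # endpoint included
by_position = df.iloc[:, start:stop + 1]    # +1 because the endpoint is excluded
print(by_label.equals(by_position))
```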
You can also refer to individual columns as attributes (i.e. just p.<name>)
p.ARIZONA[:5]
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
p.ARIZONA.head()
## 0 NaN
## 1 NaN
## 2 NaN
## 3 NaN
## 4 NaN
## Name: ARIZONA, dtype: float64
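.drop() gets rid of rows or columns
pp = p.drop(["YEAR","WEEK"],axis=1)
## equivalent to
pp2 = p.iloc[:,2:]          ## all rows; every column from position 2 onward
pp3 = p.loc[:,"ALABAMA":]   ## all rows; every column from ALABAMA onward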
Always use name-indexing whenever you can!
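As a quick sanity check, the sketch below compares three ways of removing the first two columns of a small made-up frame, and also shows that .drop() removes rows when no axis is given.

```python
import pandas as pd

df = pd.DataFrame({"YEAR": [1909, 1909], "WEEK": [1, 2],
                   "ALABAMA": [1.0, 2.0], "ALASKA": [3.0, 4.0]})

dropped = df.drop(columns=["YEAR", "WEEK"])   # same as drop([...], axis=1)
by_position = df.iloc[:, 2:]                  # skip the first two columns by position
by_label = df.loc[:, "ALABAMA":]              # start at a named column
print(dropped.equals(by_position), dropped.equals(by_label))

fewer_rows = df.drop([0])   # default axis=0: drop the row labelled 0
print(len(fewer_rows))
```

The name-based versions keep working if the column order changes; the position-based one silently does not.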
.index is a special attribute of data frames that governs searching, plotting, etc. Here we’ll set it to a decimal date value:
pp.index = p.YEAR+(p.WEEK-1)/52
Choosing specific rows of a data frame; &, |, ~ correspond to and, or, not (individual elements must be in parentheses)
ariz = p.ARIZONA ## pull out a column (attribute)
ariz[(p.YEAR==1970) & (ariz>50)] ## *must* use parentheses!
## 3196 69.0
## 3197 57.0
## 3198 62.0
## 3200 56.0
## 3203 73.0
## 3205 54.0
## 3209 55.0
## Name: ARIZONA, dtype: float64
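The three operators can be tried out on a small made-up frame, as sketched below. The parentheses are needed because & and | bind more tightly than == and >, so without them Python groups the expression differently from what you intended.

```python
import pandas as pd

df = pd.DataFrame({"YEAR": [1969, 1970, 1970, 1971],
                   "CASES": [80.0, 20.0, 60.0, 10.0]})

both = df[(df.YEAR == 1970) & (df.CASES > 50)]          # and
either = df[(df.YEAR == 1970) | (df.CASES > 50)]        # or
neither = df[~((df.YEAR == 1970) | (df.CASES > 50))]    # not
print(len(both), len(either), len(neither))
```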
pandas will automatically plot data frames in a (reasonably) sensible way
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
## pp.plot()
pp.plot(ax=ax,legend=False,logy=True)  ## plot method (non-Pythonic); ax=ax draws on the axes created above
plt.savefig("pix/measles1.png")
Or we can create our own (less complex) plots
import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(pp.index,np.log10(pp.ARIZONA))
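ptot = pp.sum(axis=1)
df.min, df.max, df.mean all work too …
## .values: group by position (ptot has the decimal-date index, p.WEEK still has the original integer index)
ptotweek = ptot.groupby(p.WEEK.values)
ptotweekmean = ptotweek.aggregate("mean")
ptotweekmean.plot()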
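The split-then-summarize idea behind groupby is easier to see on a handful of made-up numbers. In this sketch both Series share the default index, so grouping one by the other lines up element by element.

```python
import pandas as pd

week = pd.Series([1, 2, 1, 2, 1])
cases = pd.Series([10.0, 20.0, 30.0, 40.0, 20.0])

weekly = cases.groupby(week)   # one group per distinct week value
print(weekly.mean())
print(weekly.agg(["min", "max", "mean"]))   # several summaries at once
```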
Date formats are written with codes like %Y-%m-%d; separators just match whatever’s in your data (usually “/” or “-”). The format needs to be unambiguous, because ambiguity is dangerous (how is day of month specified? lower case, capital? etc.)
pandas tries to guess, but you shouldn’t let it.
print(pd.to_datetime("05-01-2004"))
## 2004-05-01 00:00:00
print(pd.to_datetime("05-01-2004",format="%m-%d-%Y"))
## 2004-05-01 00:00:00
import pandas as pd
print(pd.to_datetime("1212004",format="%m%d%Y"))
## 2004-12-01 00:00:00
print(pd.to_datetime("12012004",format="%m%d%Y"))
## 2004-12-01 00:00:00
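To make the danger concrete, the sketch below parses the same string under two different explicit formats, and shows that a string which does not fit the stated format is rejected rather than guessed at.

```python
import pandas as pd

s = "05-01-2004"
month_first = pd.to_datetime(s, format="%m-%d-%Y")   # read as May 1
day_first = pd.to_datetime(s, format="%d-%m-%Y")     # read as January 5
print(month_first.date(), day_first.date())

rejected = False
try:
    pd.to_datetime("13-01-2004", format="%m-%d-%Y")
except ValueError:
    rejected = True   # there is no month 13
print(rejected)
```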
For our measles data we have week of year, so things get a little complicated:
yearstr = p.YEAR.apply(format)               ## year as a string
weekstr = p.WEEK.apply(format,args=["02"])   ## week as a zero-padded two-digit string
datestr = yearstr+"-"+weekstr+"-0"           ## "-0" = day 0 (Sunday) of that week
dateindex = pd.to_datetime(datestr,format="%Y-%U-%w")
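The same conversion on a two-row made-up frame shows what the intermediate strings look like. This sketch uses astype(str) and str.zfill for the zero-padding, which is just one alternative to apply(format, ...).

```python
import pandas as pd

toy = pd.DataFrame({"YEAR": [1909, 1909], "WEEK": [1, 2]})

weekstr = toy.WEEK.astype(str).str.zfill(2)         # "01", "02"
datestr = toy.YEAR.astype(str) + "-" + weekstr + "-0"
dates = pd.to_datetime(datestr, format="%Y-%U-%w")  # %U = week of year, %w = day of week
print(list(datestr))
print(dates)
```

With %U, week 1 starts on the first Sunday of the year, so the dates land on Sundays rather than on January 1.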
pd.cut(x,bins=...); decide on bins
pd.qcut(x,n); decide on number of bins (equal occupancy)
## fancy stuff: automatically look for index and convert it to a date/time
p = pd.read_csv("../data/eng2.csv",skiprows=14,encoding="latin1",index_col="Date/Time",parse_dates=True)
## rename columns
p.columns = [
'Year', 'Month', 'Day', 'Time', 'Data Quality', 'Temp (C)',
'Temp Flag', 'Dew Point Temp (C)', 'Dew Point Temp Flag',
'Rel Hum (%)', 'Rel Hum Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag',
'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)', 'Visibility Flag',
'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx', 'Hmdx Flag', 'Wind Chill',
'Wind Chill Flag', 'Weather']
## drop columns that are *all* NA
p = p.dropna(axis=1,how='all')
p["Temp (C)"].plot()
## get rid of columns (axis=1) we don't want
p = p.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)
Now pull out the temperature and take the median by hour:
temp = p[['Temp (C)']].copy()  ## .copy() avoids a SettingWithCopyWarning on the next line
temp["Hour"] = temp.index.hour
temphr = temp.groupby('Hour')
medtmp = temphr.aggregate("median")
maxtmp = temphr.aggregate("max")
mintmp = temphr.aggregate("min")
Now plot these …
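One possible way to get the three hourly summaries into a single frame (convenient for plotting) is to pass a list of function names to agg. The sketch below uses synthetic readings as a stand-in for the weather file, so the numbers are meaningless.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data: three days of hourly readings
hours = pd.Timestamp("2012-01-01") + pd.to_timedelta(np.arange(72), unit="h")
temp = pd.DataFrame({"Temp (C)": np.arange(72, dtype=float)}, index=hours)
temp["Hour"] = temp.index.hour

# One row per hour of the day, one column per summary
by_hour = temp.groupby("Hour")["Temp (C)"].agg(["min", "median", "max"])
print(by_hour.head(3))
# by_hour.plot() would then draw one line per summary (needs matplotlib)
```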