generating errors

Use the raise statement to signal an error (“raise an exception”) yourself:

raise Exception
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception


x = -1
if not isinstance(x,str): ## check if x is a str
    raise TypeError
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError

import math
x = -1
if x<0:
    raise ValueError
print(math.sqrt(x))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ValueError

error messages

A bare exception isn’t very informative; you can attach an explanatory message:

x = -1
if not isinstance(x,str): ## check if x is a str
    errstr = "x is of type "+type(x).__name__+", should be str"
    raise TypeError(errstr)
TypeError: x is of type int, should be str

f-strings are a convenient way to construct error messages: anything inside curly brackets is interpreted as a Python expression, e.g.

x=1
print(f"x is of type {type(x).__name__}, should be str")
## x is of type int, should be str

So we could use

if not isinstance(x,str): ## check if x is a str
    raise TypeError("x is of type {type(x).__name__}, should be str")
x = -1
if x<0:
    raise ValueError(f"x should be non-negative, but it equals {x}")
ValueError: x should be non-negative, but it equals -1

warnings

An error means “it’s impossible to continue” or “you shouldn’t continue without fixing the problem”. If a problem is worth flagging but isn’t fatal, you can issue a warning instead. This is not much different from just using print(), but it allows advanced users to decide whether they want to suppress warnings (as sketched below).

import warnings
warnings.warn("something bad happened")
## <string>:1: UserWarning: something bad happened
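
For example, a user can silence warnings for a block of code (a minimal sketch using the standard warnings filters):

import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")          ## ignore all warnings in this block
    warnings.warn("something bad happened")  ## now silent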

handling errors

Now suppose you are getting an error and you don’t want your program to stop. “Wrapping” your code in a try: clause lets you specify what to do in that case. pass is a special Python statement called a “null operation” or a “no-op”: it does nothing, so execution simply keeps going.

try:
    x= math.sqrt(-1)
except:
    pass
## keep going (but x will not be set)

You can specify something you want to do with only a particular set of errors:

try:
    x = math.sqrt(-1)
except ValueError: 
    print("a ValueError occurred")
except:
    print("some other error occurred")
## keep going (but x will not be set)
## a ValueError occurred

If the error isn’t of a type that is caught, it behaves as it normally would (as if there were no try:):

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
NameError: name 'z' is not defined

We could catch this with a general-purpose except:

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
except:
    print("some other error occurred")
## some other error occurred

Or add another clause to catch it:

try:
    z += 5  ## not defined yet
except ValueError: 
    print("a ValueError occurred")
except NameError:
    print("a NameError occurred")
except:
    print("some other error occurred")
## a NameError occurred
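
You can also bind the caught exception to a name to inspect its message (a small sketch):

try:
    x = math.sqrt(-1)
except ValueError as err:
    print(f"caught: {err}")
## caught: math domain error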

general rules

Catch the most specific exception you can, and handle it by substituting a sensible fallback value:

try:
    x = math.sqrt(-1)
except ValueError: 
    x = math.nan
print(x)
## nan

pandas

definition and reference

Data frames

  • rectangular data structure, looks a lot like an array.
  • each column is a Series; each column can be of a different type
  • rows and columns act differently
  • can index by (column) labels as well as positions
  • handles missing data (NaN)
  • convenient plotting
  • fast operations based on keys (the index)
  • lots of facilities for input/output
import pandas as pd  ## standard abbreviation
# The initial set of baby names and birth rates
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]
## initialize DataFrame with a *dictionary*
p = pd.DataFrame({'Name': names, 'Count': births})
print(p)
##       Name  Count
## 0      Bob    968
## 1  Jessica    155
## 2     Mary     77
## 3     John    578
## 4      Mel    973

What can we do with it?

  • “Simple” indexing
    • Indexing (a single value) selects a column by its key
    • key could be a number, if column names weren’t given when setting up the data frame
    • Slicing selects rows by number
    • indexing with a list gives multiple columns
    • .iloc gives row/column indices (like an array)
p["Count"]  ## extract a column = Series (by *name*)
p[2:3]      ## slice one row (3-2 = 1)
p[2:5]      ## slice multiple rows
p[["Name","Count"]]    ## extract multiple columns (data frame)
p.iloc[1,1]     ## index with row/column integers like an array
p.iloc[0:5,:]   ## can also slice

Indexing by name

p["Name"][4]  ## 5th element of Name
p.Name  ## attribute!
p.loc[1:2,"Name"]  ## index by *label*, _inclusive_

Measles data

Download US measles data from Project Tycho.

  • read_csv reads a CSV file as a data frame; it automatically interprets the first row as headings
  • df.iloc[] indexes the result as though it were an array
  • df.head() shows just at the beginning; df.tail() shows just the end

Let’s look at the first few rows of a data set on measles in US states:

## "Weekly Measles Cases, 1909-2001"
## ...
## "Data provided by Project Tycho, Data Version 1.0.0, released 28 Novem...
## "YEAR","WEEK","ALABAMA","ALASKA","AMERICAN SAMOA","ARIZONA","ARKANSAS"...
## 1909,1,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,2,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
## 1909,3,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-...
fn = "../data/MEASLES_Cases_1909-2001_20150322001618.csv"
p  = pd.read_csv(fn,skiprows=2,na_values=["-"])  ## read in data
p.head()                     ## look at the first little bit
##    YEAR  WEEK  ALABAMA  ALASKA  ...  WEST VIRGINIA  WISCONSIN  WYOMING  Unnamed: 61
## 0  1909     1      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 1  1909     2      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 2  1909     3      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 3  1909     4      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 4  1909     5      NaN     NaN  ...            NaN        NaN      NaN          NaN
## 
## [5 rows x 62 columns]

Mostly NaN values at the beginning! (NaN = “not a number”: similar to nan from math or numpy)

Selecting

  • Like numpy array indexing, but a little different …
  • Pandas doc, indexing and selecting
    • extract by name: df.loc[:,"MASSACHUSETTS":"NEVADA"] (index by label; includes endpoint)
    • extract by integer index: iloc method, df.iloc[:,range] (index by integer; doesn’t include endpoint)
p.loc[:,"MASSACHUSETTS":"NEVADA"]
##       MASSACHUSETTS  MICHIGAN  MINNESOTA  ...  MONTANA  NEBRASKA  NEVADA
## 0               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 1               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 2               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 3               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4               NaN       NaN        NaN  ...      NaN       NaN     NaN
## ...             ...       ...        ...  ...      ...       ...     ...
## 4856            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4857            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4858            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4859            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4860            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 
## [4861 rows x 8 columns]

This is the same:

pc = list(p.columns) ## list of column names
print(pc[:5])
## ['YEAR', 'WEEK', 'ALABAMA', 'ALASKA', 'AMERICAN SAMOA']
## find the locations of these two state names
mass_ind = pc.index("MASSACHUSETTS")
neva_ind = pc.index("NEVADA")
## index using `.iloc` (endpoint extended by 1, since iloc *excludes* it)
p.iloc[:,mass_ind:neva_ind+1]
##       MASSACHUSETTS  MICHIGAN  MINNESOTA  ...  MONTANA  NEBRASKA  NEVADA
## 0               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 1               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 2               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 3               NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4               NaN       NaN        NaN  ...      NaN       NaN     NaN
## ...             ...       ...        ...  ...      ...       ...     ...
## 4856            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4857            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4858            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4859            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 4860            NaN       NaN        NaN  ...      NaN       NaN     NaN
## 
## [4861 rows x 8 columns]

More examples

You can also refer to individual columns as attributes (i.e. just p.<name>)

p.ARIZONA[:5]
## 0   NaN
## 1   NaN
## 2   NaN
## 3   NaN
## 4   NaN
## Name: ARIZONA, dtype: float64
p.ARIZONA.head()
## 0   NaN
## 1   NaN
## 2   NaN
## 3   NaN
## 4   NaN
## Name: ARIZONA, dtype: float64

.drop() gets rid of elements

pp = p.drop(["YEAR","WEEK"],axis=1)
## equivalent to
pp2 = p.iloc[:,2:]          ## drop the first two *columns* by position
pp3 = p.loc[:,"ALABAMA":]   ## or keep everything from the first state onward, by label

Always use name-indexing whenever you can!

.index is a special attribute of data frames that governs searching, plotting, etc. Here we’ll set it to a decimal date value:

pp.index = p.YEAR+(p.WEEK-1)/52
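
A quick check that the new index advances in steps of 1/52 of a year (a small sketch):

print([round(d,3) for d in pp.index[:3]])
## [1909.0, 1909.019, 1909.038]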

Filtering

Filtering means choosing specific rows of a data frame; &, |, ~ correspond to and, or, not (each individual condition must be wrapped in parentheses):

ariz = p.ARIZONA                                ## pull out a column (attribute)
ariz[(p.YEAR==1970) & (ariz>50)]                ## *must* use parentheses!
## 3196    69.0
## 3197    57.0
## 3198    62.0
## 3200    56.0
## 3203    73.0
## 3205    54.0
## 3209    55.0
## Name: ARIZONA, dtype: float64
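
| and ~ work the same way, e.g. (a sketch; output not shown):

ariz[(p.YEAR==1970) | (p.YEAR==1971)]   ## rows from either year
ariz[~(p.YEAR==1970)]                   ## rows from every year *except* 1970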

Basic plotting

pandas will automatically plot data frames in a (reasonably) sensible way

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
## pp.plot()
pp.plot(legend=False,logy=True,ax=ax)           ## plot method (non-Pythonic); draw into the axes we made
plt.savefig("pix/measles1.png")

Or we can create our own (less complex) plots

import numpy as np
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(pp.index,np.log10(pp.ARIZONA))

Column and row manipulations

  • totals by week
ptot = pp.sum(axis=1)
  • df.min, df.max, df.mean all work too …
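
For example (a minimal sketch of the same pattern):

pp.mean(axis=0)   ## mean weekly case count for each state (down columns)
pp.max(axis=1)    ## largest single-state count in each week (across rows)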

Aggregation

ptotweek = ptot.groupby(p.WEEK.values)      ## group the weekly totals by week of year
## (.values: ptot is now indexed by date, p.WEEK by row number, so group by *position*)
ptotweekmean = ptotweek.aggregate(np.mean)  ## average over years, for each week
ptotweekmean.plot()

Dates and times

reference

  • (Another) complex subject.
  • Lots of possible date formats
  • Basic idea: a format string like %Y-%m-%d, where each % code matches one date component; separators just match whatever’s in your data (usually “/” or “-”). The format needs to be unambiguous, and ambiguity is dangerous (which code means day of month? case matters: %m is month, %M is minutes; etc.)
  • pandas tries to guess, but you shouldn’t let it.
print(pd.to_datetime("05-01-2004"))
## 2004-05-01 00:00:00
print(pd.to_datetime("05-01-2004",format="%m-%d-%Y"))
## 2004-05-01 00:00:00
  • Time zones and daylight savings time can be a nightmare
  • May need to have the right number of digits, especially in the absence of separators:
import pandas as pd
print(pd.to_datetime("1212004",format="%m%d%Y"))
## 2004-12-01 00:00:00
print(pd.to_datetime("12012004",format="%m%d%Y"))
## 2004-12-01 00:00:00

For our measles data we have week of year, so things get a little complicated

yearstr = p.YEAR.apply(format)                ## year as a string
weekstr = p.WEEK.apply(format,args=["02"])    ## zero-padded two-digit week
datestr = yearstr+"-"+weekstr+"-0"            ## e.g. "1909-01-0" (year-week-weekday)
dateindex = pd.to_datetime(datestr,format="%Y-%U-%w")  ## %U: week of year, %w: day of week
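
As a sanity check on %U (week of year, weeks starting on Sunday) and %w (day of week, 0 = Sunday): 1909 began on a Friday, so the Sunday of week 1 should be January 3 (a small sketch):

print(pd.to_datetime("1909-01-0",format="%Y-%U-%w"))
## 1909-01-03 00:00:00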

Binning results

  • turn a quantitative variable into categories
  • pd.cut(x,bins=...); decide on bins
  • pd.qcut(x,n); decide on number of bins (equal occupancy)
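
For example (a minimal sketch with made-up values):

x = pd.Series([1, 7, 5, 4, 6, 3])
print(pd.cut(x,bins=[0,2,4,8]))   ## explicit bin edges: categories (0,2], (2,4], (4,8]
print(pd.qcut(x,2))               ## two bins, each holding (roughly) half the values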

Weather data

## fancy stuff: automatically look for index and convert it to a date/time
p = pd.read_csv("../data/eng2.csv",skiprows=14,encoding="latin1",index_col="Date/Time",parse_dates=True)
## rename columns
p.columns = [
    'Year', 'Month', 'Day', 'Time', 'Data Quality', 'Temp (C)', 
    'Temp Flag', 'Dew Point Temp (C)', 'Dew Point Temp Flag', 
    'Rel Hum (%)', 'Rel Hum Flag', 'Wind Dir (10s deg)', 'Wind Dir Flag', 
    'Wind Spd (km/h)', 'Wind Spd Flag', 'Visibility (km)', 'Visibility Flag',
    'Stn Press (kPa)', 'Stn Press Flag', 'Hmdx', 'Hmdx Flag', 'Wind Chill', 
    'Wind Chill Flag', 'Weather']
## drop columns that are *all* NA
p = p.dropna(axis=1,how='all')
p["Temp (C)"].plot()
## get rid of columns (axis=1) we don't want
p = p.drop(['Year', 'Month', 'Day', 'Time', 'Data Quality'], axis=1)

Now pull out the temperature and take the median by hour:

temp = p[['Temp (C)']]
temp["Hour"] = temp.index.hour
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
temphr = temp.groupby('Hour')
medtmp = temphr.aggregate(np.median)
maxtmp = temphr.aggregate(np.max)
mintmp = temphr.aggregate(np.min)
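
The warning says we may be modifying a temporary copy; to be safe, make an explicit copy first:

temp = p[['Temp (C)']].copy()   ## independent copy: no SettingWithCopyWarning
temp["Hour"] = temp.index.hour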

Now plot these …
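
One way to do it (a minimal sketch, assuming the objects computed above):

fig, ax = plt.subplots()
medtmp["Temp (C)"].plot(ax=ax,label="median")
maxtmp["Temp (C)"].plot(ax=ax,label="max")
mintmp["Temp (C)"].plot(ax=ax,label="min")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Temp (C)")
ax.legend()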