Benford’s law

Benford’s law describes the (surprising) distribution of first (leading) digits of many different sets of numbers:

Benford’s law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the expected 11.1% (i.e., one digit out of 9). Benford’s law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881).
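The ~30% figure comes from the usual statement of Benford's law, which gives the probability of leading digit d as log10(1 + 1/d). The passage quoted above does not spell out this formula, so the sketch below is an added illustration; the names benford_probability and benford_probs are made up for this example.

```python
import math

def benford_probability(d):
    """Expected frequency of leading digit d (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / d)

# Tabulate the expected frequency of each possible leading digit.
benford_probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in benford_probs.items():
    print(f"{d}: {p:.3f}")
```

Digit 1 comes out at about 0.301, and the nine probabilities sum to 1.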

Read about it on Wikipedia or MathWorld.

We’ll write a Python function benford_count that tabulates how often each leading digit occurs in a set of numbers.

File I/O

A bunch of ways to read a file


f = open("test.txt")
print(type(f))
print(f)
print(f.closed)
f.close()
print(f.closed)

.closed is a data attribute: it’s a value rather than a function, so it takes no parentheses (unlike a method attribute such as list.append()).


f = open('/Users/bolker/Documents/temp/temp.txt', 'r')
f.close()

Once a file has been opened, its entire contents can be read in as a str using the read() method on the associated file object.

f = open('test.txt', 'r')
contents = f.read()
print(contents)
print(repr(contents))
f.close()


f = open('test.txt', 'r')
print('first 10 characters of the file:')
print(f.read(10))
print('next 6 characters (including the end-of-line character `\\n`):')
print(f.read(6))
print("this is the rest of the file:")
contents = f.read()
print(contents)
f.close()

f = open('test.txt')
lines = f.readlines()
# print the list of lines.
# Each line ends with a new line character \n.
print(lines)
print(lines[2])
f.close()

f = open("test.txt")
for line in f:
    print(line)
    line_u = line.upper()
    print(line_u)
f.close()

If each line instead held a single number and you wanted to print its square, you could use print(int(line)**2); int() ignores surrounding whitespace, including the \n at the end of the line.
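As a minimal sketch of the squaring idea, the snippet below first writes a small file with one integer per line and then loops over it. The file name numbers.txt and its contents are invented for this illustration, and the squares are collected in a list (in addition to being printed) only so the result is easy to inspect.

```python
import os
import tempfile

# Hypothetical data file: one integer per line.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
f = open(path, "w")
f.write("3\n10\n-4\n")
f.close()

squares = []
f = open(path)
for line in f:
    value = int(line)  # int("3\n") is 3: the trailing newline is ignored
    print(value ** 2)
    squares.append(value ** 2)
f.close()
```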

More I/O details

f = open("test.txt")
L = f.readline()  ## read one line
print(repr(L))
print(repr(L.strip()))  ## strip() removes the trailing newline
f.close()
f = open("test.txt")
L = f.readline()  ## read one line
LL = L.strip().split(" ")  ## split the line into words
print(LL)
f.close()

And even more

next(), and more flow control

f = open("test.txt")
line = next(f)  ## read one line
print(line)
f.close()

f = open("test.txt")
finished = False
while not finished:
    try:
        line = next(f)
    except StopIteration:  ## raised when there are no more lines
        finished = True
f.close()

while True:
    try:
        x = input("enter a number: ")
        print(int(x))
        break
    except ValueError:  ## int() failed, so ask again
        print("Try again!")

Benford’s Law

File format

To properly process a data file, we need to make some assumptions about its format. We’ll assume that each line contains a number of words (for example, the name of a town followed by its province/state and country) and that the last word is a number giving the size of the quantity being counted in the file.

Steps:

  1. Initialize a digits_count list of length 10, filled with zeros (why a list??)
  2. Open the file
  3. For each line in the file,
    • Retrieve the last word from the line
    • If it is a string of digits that doesn’t start with 0, get the leading digit and update digits_count.
  4. Return tuple(digits_count)
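
The steps above can be sketched as follows. This is one possible implementation, not necessarily the one developed in class: skipping blank lines and using str.isdecimal() for the "string of digits" test are assumptions, and the file towns.txt with its contents is invented purely for the demonstration.

```python
import os
import tempfile

def benford_count(filename):
    """Count leading digits of the last word on each line of a file."""
    digits_count = [0] * 10  # index d holds the count for leading digit d
    f = open(filename)
    for line in f:
        words = line.split()
        if not words:
            continue  # assumption: blank lines are skipped
        last = words[-1]
        # Only count strings of digits that do not start with 0.
        if last.isdecimal() and last[0] != "0":
            digits_count[int(last[0])] += 1
    f.close()
    return tuple(digits_count)

# Made-up sample data in the assumed format (number as the last word).
path = os.path.join(tempfile.mkdtemp(), "towns.txt")
f = open(path, "w")
f.write("Townville Ontario Canada 1200\n"
        "Bigcity Ontario Canada 250000\n"
        "Hamlet Quebec Canada 190\n"
        "Ghosttown Nowhere 0\n"
        "Badline n/a\n")
f.close()

counts = benford_count(path)
print(counts)
```

With this sample file, two numbers lead with 1 and one leads with 2; the lines ending in 0 and n/a are ignored. Index 0 of the result is always 0 here, which keeps the digit itself usable as the list index.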

Other considerations

Sets


vowels = {'a', 'e', 'i', 'o', 'u'}
print(type(vowels))
print(vowels)
vowels = {'a', 'e', 'i', 'o', 'u', 'e', 'a'}
print(vowels)
print({1, 2, 3, 'a'} == {'a', 1, 1, 3, 2, 'a'})

print(set())
print(set([1, 2, 0, -1, 3, 1, 1, 2]))
print(set('hello world!'))
print(set(range(0, 10, 2)))

small = {0, 1, 2, 3}
mid = {3, 4, 5, 6, 7}
big = {7, 8, 9}
big.add(10)
small.remove(0)
print(small, big)
print(small.intersection(mid))
print(small.union(big))

Sets can be compared using methods or set operators.

d = {0,1}
print(d.issubset({0, 1, 2, 3}))
print(d <= {0, 1, 2, 3})

Code for testing whether a string is hexadecimal:

hex_char = "0123456789abcdef"

def is_hex(word):
    ## return False as soon as a non-hex character is found
    for char in word:
        if char not in hex_char:
            return False
    return True

word = "12aac"
print(is_hex(word))

This loop can be replaced by a single subset test:

set(word) <= set(hex_char)
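
To make the subset test easy to try out, here is a small sketch that wraps it in a function. The names HEX_CHARS and is_hex_set are chosen for this illustration only. Note two consequences of this version: only lowercase letters count as hexadecimal, and the empty string passes because the empty set is a subset of every set.

```python
# Build the set of allowed characters once, outside the function.
HEX_CHARS = set("0123456789abcdef")

def is_hex_set(word):
    """Return True if every character of word is a lowercase hex digit."""
    return set(word) <= HEX_CHARS

print(is_hex_set("12aac"))
print(is_hex_set("12xyz"))
```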