These are brief notes on some very rough Python scripts that I'm writing to help me grasp the material in R. Harald Baayen's "Word Frequency Distributions". To a certain extent, they replicate the tables and charts in the text.

A Python Companion to

R. Harald Baayen's "Word Frequency Distributions"

Sec 1.1, Chapter 1

$Date: 2006-06-09 21:14:52 +0800 (Fri, 09 Jun 2006)$

Below are links to the scripts. They use the re, math, and Gnuplot modules (Gnuplot depends on Numeric). You will have to modify the shebang lines to get them to run on your system. As in the text, "Alice's Adventures in Wonderland" was used. I worked with the Project Gutenberg etext (alice30.txt). I emailed Dr. Baayen to ask which edition/form of the book he'd used while preparing the text, but I haven't yet heard back.

When the demo script is run, it will print the total number of word tokens and the number of unique words to the console:

N = 26694 word tokens
V(N) = V(26694) = 2635 distinct word types
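
The counts above can be reproduced in outline with a few lines of Python. This is only a minimal sketch, not the demo script itself: the tokenizing regular expression and the names `tokenize` and `count_tokens_and_types` are my own assumptions, and a different definition of "word" (or a different edition of the etext) will give somewhat different values of N and V(N).

```python
import re

# Assumption: a word is a run of letters, optionally followed by an
# apostrophe and more letters (so "alice's" is a single token).
# The actual scripts may tokenize differently.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Return the lowercased word tokens of text, in order of appearance."""
    return WORD_RE.findall(text.lower())


def count_tokens_and_types(tokens):
    """Return (N, V(N)): the number of tokens and of distinct word types."""
    return len(tokens), len(set(tokens))


# Small inline sample; to use the etext instead, read alice30.txt into a string.
sample = ("Alice was beginning to get very tired of sitting by her sister, "
          "and of having nothing to do.")
n, v = count_tokens_and_types(tokenize(sample))
print("N = %d word tokens" % n)
print("V(N) = V(%d) = %d distinct word types" % (n, v))
```

Using a set for the types keeps the sketch short; a dictionary of counts would be the natural choice if the per-word frequencies are also needed.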

Four text files are also generated:

Running the demo script will also generate five graphs. The two graphs below correspond to Figures 1.1 A and B in the text. Baayen's graphs included 20 measurement points. The ones below, for no particular reason, include 30. You can specify more or fewer segments in the demo script.


Figure: V(N) vs. N for 'Alice's Adventures in Wonderland'.

Figure: N/V(N) vs. N for 'Alice's Adventures in Wonderland'.
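
The data behind these two curves can be computed by splitting the token list into equally sized segments and recording the vocabulary size at the end of each one. The following is a minimal sketch under that assumption. The function name `growth_curve` and its `segments` parameter are hypothetical and not taken from the demo script, and plotting is left out.

```python
def growth_curve(tokens, segments=20):
    """Return a list of (N, V(N), N / V(N)) tuples.

    The values are measured at `segments` equally spaced points in
    `tokens`, the last of which is the end of the text.
    """
    total = len(tokens)
    if segments < 1 or total < segments:
        raise ValueError("need at least one token per segment")

    seen = set()   # word types encountered so far
    points = []
    start = 0
    for k in range(1, segments + 1):
        n = k * total // segments        # k-th measurement point
        seen.update(tokens[start:n])     # add the types of this segment
        start = n
        v = len(seen)
        points.append((n, v, n / v))     # N, V(N), mean tokens per type
    return points


# Tiny example with eight tokens and four measurement points.
for n, v, ratio in growth_curve("a b a c b d a e".split(), segments=4):
    print(n, v, round(ratio, 3))
```

Each segment is scanned only once, so the whole curve costs a single pass over the text regardless of how many measurement points are requested.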

The next two graphs correspond to Figures 1.2 A and B in the text. The third graph, of p('alice', N) vs. N, is one that I ran for fun:

Figure: p('a', N) vs. N for 'Alice's Adventures in Wonderland'.

Figure: p('the', N) vs. N for 'Alice's Adventures in Wonderland'.

Figure: p('alice', N) vs. N for 'Alice's Adventures in Wonderland'.
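
Here p(word, N) is taken to be the relative frequency of a word among the first N tokens, that is, its count in the first N tokens divided by N. Assuming that reading, a minimal sketch of the computation follows. The function name `relative_frequency_curve` and its `segments` parameter are hypothetical, and the tokens are assumed to be lowercased already.

```python
def relative_frequency_curve(tokens, word, segments=20):
    """Return a list of (N, p(word, N)) tuples.

    p(word, N) is the number of occurrences of `word` among the first
    N tokens divided by N, measured at `segments` equally spaced points.
    """
    total = len(tokens)
    if segments < 1 or total < segments:
        raise ValueError("need at least one token per segment")

    count = 0      # occurrences of `word` seen so far
    points = []
    start = 0
    for k in range(1, segments + 1):
        n = k * total // segments                # k-th measurement point
        count += tokens[start:n].count(word)     # occurrences in this segment
        start = n
        points.append((n, count / n))
    return points


# Tiny example with eight tokens and four measurement points.
tokens = "the cat saw the dog and the bird".split()
for n, p in relative_frequency_curve(tokens, "the", segments=4):
    print(n, round(p, 3))
```

The same function covers all three graphs: only the `word` argument changes ('a', 'the', or 'alice').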