These are brief notes on some very rough Python scripts that I'm writing to help me grasp the material in R. Harald Baayen's "Word Frequency Distributions". To a certain extent, they replicate the tables and charts in the text.
Problems? Suggestions for improvement? Email me (substitute 'heart' for the image of the heart):
A Python Companion to
R. Harald Baayen's "Word Frequency Distributions"
Sec 1.1, Chapter 1
$Date: 2006-06-09 21:14:52 +0800 (Fri, 09 Jun 2006)$
Below are links to the scripts. They use the re, math, and Gnuplot modules (Gnuplot depends on Numeric). You will have to modify the shebang lines to get them to run on your system. As in the text, "Alice's Adventures in Wonderland" was used. I worked with the Project Gutenberg etext (alice30.txt). I emailed Dr. Baayen to ask which edition/form of the book he used while preparing the text, but I haven't heard back yet.
- module PgeIncrementalWordCountsAndCharting.py
- demo script demoPgeIncrementalWordCountsAndCharting.py
When the demo script is run, it will print the total number of word tokens and the number of unique words to the console:
N = 26694 word tokens
V(N) = V(26694) = 2635 distinct word types
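A minimal sketch of how these two figures could be computed. This is not the code in the module: the tokenizing regex is an assumption (runs of letters, with internal apostrophes kept), and the names tokenize and summarize are made up for illustration. A different tokenization rule will give different values of N and V(N) for the same etext.

```python
import re

# Assumed tokenization rule: runs of letters, optionally joined by apostrophes
# (so "didn't" stays one word). The real module may split words differently.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(text):
    """Return the lower-cased word tokens of text, in order of appearance."""
    return WORD_RE.findall(text.lower())

def summarize(tokens):
    """Return (N, V(N)): the number of tokens and the number of distinct types."""
    return len(tokens), len(set(tokens))

sample = "Alice was tired of sitting by her sister; Alice didn't mind."
n, v = summarize(tokenize(sample))
print("N = %d word tokens" % n)
print("V(N) = V(%d) = %d distinct word types" % (n, v))
```

To try it on the full text, read alice30.txt into a string and pass that to tokenize instead of the sample sentence.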
Four text files are also generated:
- out-01-no_legalese.txt, alice30.txt minus the fine print at the top of the file (you could trim this by hand instead and skip this step in the demo script)
- out-02-words_only.txt, a space-delimited list of all of the words (lower-cased) in the text in the order they appeared
- out-03-word-counts.txt, a list of all of the words in the text and the number of times they occurred, ranked in descending order by number of occurrences
- out-04-word-counts-per-segment.txt, a set of lists similar to the one in the previous file, but with the text segmented (30 segments in the run stored in this file) and processed incrementally.
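The ranked counts and the incremental, per-segment counts described above can be sketched roughly as follows. This is an illustration, not the module's code: splitting the text into segments with (nearly) equal numbers of tokens is an assumption, and ranked and incremental_counts are hypothetical names.

```python
from collections import Counter

def ranked(counts):
    """Return (word, count) pairs, most frequent first; ties in alphabetical order."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def incremental_counts(tokens, segments):
    """Yield (N, cumulative Counter) after each of `segments` near-equal chunks."""
    counts = Counter()
    total = len(tokens)
    start = 0
    for k in range(1, segments + 1):
        end = total * k // segments      # assumed rule: equal-sized token chunks
        counts.update(tokens[start:end])
        start = end
        yield end, counts.copy()         # copy so earlier snapshots stay fixed

tokens = "the cat saw the dog and the dog saw a bird".split()
for n, counts in incremental_counts(tokens, 3):
    # N, V(N), and the two most frequent words seen so far
    print(n, len(counts), ranked(counts)[:2])
```

Each snapshot gives one measurement point: N is the number of tokens read so far and len(counts) is V(N), so 30 segments would give the 30 points mentioned for the graphs.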
Running the demo script will also generate 5 graphs. The two graphs below correspond to those in Figures 1.1 A and B in the text. Baayen's graphs included 20 points. The ones below, for no particular reason, include 30. You can specify more or fewer segments in the demo script.
The next two graphs below correspond to those in Figures 1.2 A and B in the text. The third graph, of p('alice', N) vs. N, is one that I ran for fun:
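For reference, a rough sketch of how a curve like p('alice', N) vs. N could be computed, assuming p(w, N) means the relative frequency of word w in the first N tokens, that is, its count so far divided by N. The function name and the equally spaced measurement points are assumptions, and plotting is left out.

```python
def relative_frequency_curve(tokens, word, points):
    """Return [(N, p(word, N))] at `points` equally spaced values of N."""
    total = len(tokens)
    curve = []
    seen = 0       # occurrences of `word` counted so far
    start = 0
    for k in range(1, points + 1):
        end = total * k // points
        seen += tokens[start:end].count(word)
        start = end
        if end:    # skip an empty prefix, which would divide by zero
            curve.append((end, seen / end))
    return curve

tokens = "alice saw the cat and alice ran to the cat".split()
for n, p in relative_frequency_curve(tokens, "alice", 5):
    print("N = %2d  p('alice', N) = %.3f" % (n, p))
```

The resulting (N, p) pairs can be handed to whatever plotting tool is available.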




