Re.Mark

My Life As A Blog

Simple Text Processing with Python

On a couple of recent projects, it was useful to be able to calculate word frequency in text.  There are a number of ways to do this, but I wondered how hard it would be to do in Python.  I saw a little bit of Michael Sparks’ talk at DevDays in London and, while I don’t remember the detail, it did act as a reminder.  It turns out that a naive implementation is very easy indeed.  Here it is:

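from collections import defaultdict

class Reader:
    def Read(self, strings):
        # defaultdict(int) starts every missing key at 0
        self.data = defaultdict(int)
        for string in strings:
            # Naive tokenising: split on single spaces only
            for word in string.split(" "):
                self.data[word] += 1
        # Print each word with its count (Python 2 print statement;
        # on Python 3 write print(key, self.data[key]))
        for key in self.data:
            print key, self.data[key]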

The code takes a collection of strings, splits each one into words, and keeps track of the count of each word, using the word as a key.  The important class here is defaultdict: without it, I’d have to check whether a word had already been inserted (not difficult, but it would have been more code).  Here’s some code that uses the Reader class, along with its output:

>>> from TextProcessor import Reader
>>> reader = Reader()
>>> strings = ["Hello", "Hello", "Hello World", "Some other stuff with more spaces in it"]
>>> reader.Read(strings)
spaces 1
stuff 1
Some 1
it 1
other 1
in 1
World 1
with 1
Hello 3
more 1
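
For comparison, here is a rough sketch of what the counting loop looks like with and without defaultdict. It is written for Python 3, and the function names count_words_plain and count_words_defaultdict are made up for this illustration; they are not part of the Reader class above.

```python
from collections import defaultdict


def count_words_plain(strings):
    """Count words with a plain dict, checking for each key by hand."""
    counts = {}
    for string in strings:
        for word in string.split(" "):
            if word in counts:
                counts[word] += 1
            else:
                # First time we see this word: insert it explicitly
                counts[word] = 1
    return counts


def count_words_defaultdict(strings):
    """Same count, but missing keys start at int() == 0 automatically."""
    counts = defaultdict(int)
    for string in strings:
        for word in string.split(" "):
            counts[word] += 1
    return counts


strings = ["Hello", "Hello", "Hello World", "Some other stuff with more spaces in it"]
plain = count_words_plain(strings)
compact = count_words_defaultdict(strings)
print(plain == compact)  # True: both approaches produce the same counts
print(plain["Hello"])    # 3
```

The plain dict version needs the extra if/else branch; defaultdict folds that check into the lookup itself.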

There’s more to be done before this is practical (punctuation is an obvious issue for this implementation: "Hello," and "Hello" would be counted as different words), but it’s a useful start.
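
One possible way to deal with punctuation, sketched here for Python 3, is to pull words out with a regular expression instead of splitting on spaces. Everything in this sketch is an assumption rather than part of the original Reader: the WORD_PATTERN regex, the count_words name, the decision to lower-case the text first, and the use of collections.Counter in place of defaultdict. The pattern only recognises unaccented ASCII letters, so it would need adjusting for other alphabets.

```python
import re
from collections import Counter

# Hypothetical pattern: runs of ASCII letters, allowing internal
# apostrophes so that "don't" stays one word
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def count_words(strings):
    """Count words across strings, ignoring case and punctuation."""
    counts = Counter()
    for string in strings:
        # findall returns every non-overlapping match, so commas,
        # full stops and extra spaces are simply skipped
        counts.update(WORD_PATTERN.findall(string.lower()))
    return counts


counts = count_words(["Hello, world!", "hello again; don't panic."])
print(counts["hello"])         # 2
print(counts["don't"])         # 1
print(counts.most_common(1))   # [('hello', 2)]
```

Counter behaves like defaultdict(int) for counting purposes and adds helpers such as most_common, which is handy when only the most frequent words matter.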

Written by remark

January 5, 2010 at 4:46 pm

Posted in Development, Python
