Simple Text Processing with Python
On a couple of recent projects it would have been useful to calculate word frequencies in text. There are a number of ways to do this, and I wondered how hard it would be in Python. I saw a little of Michael Sparks’ talk at DevDays in London, and, while I don’t remember the detail, it acted as a reminder. It turns out that a naive implementation is very easy indeed. Here it is:
from collections import defaultdict

class Reader:
    def Read(self, strings):
        self.data = defaultdict(int)
        for string in strings:
            for word in string.split(" "):
                self.data[word] += 1
        for key in self.data:
            print key, self.data[key]
Take a collection of strings, split each one into words, and keep a count for each word, using the word itself as the key. The important class here is defaultdict, without which I’d have to check whether a word had already been inserted (not difficult, but it would have been more code). Here’s some code that uses the Reader class, together with its output:
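To show exactly what defaultdict saves, here’s a small side-by-side sketch (written for Python 3, unlike the snippets above) comparing the manual existence check with defaultdict(int):

```python
from collections import defaultdict

words = "Hello Hello World".split(" ")

# Without defaultdict: each word needs a check before incrementing.
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 0
    counts[word] += 1

# With defaultdict(int): missing keys default to 0, so the
# increment works directly and the check disappears.
dcounts = defaultdict(int)
for word in words:
    dcounts[word] += 1

print(dict(dcounts))  # {'Hello': 2, 'World': 1}
```

Both loops produce the same mapping; defaultdict just removes the boilerplate.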
>>> from TextProcessor import Reader
>>> reader = Reader()
>>> strings = ["Hello", "Hello", "Hello World",
...            "Some other stuff with more spaces in it"]
>>> reader.Read(strings)
spaces 1
stuff 1
Some 1
it 1
other 1
in 1
World 1
with 1
Hello 3
more 1
To make it genuinely useful there’s more to be done (punctuation is an obvious issue for this implementation), but it’s a useful start.
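One possible next step, sketched here in Python 3 rather than the article’s Python 2 (this is an illustration, not the original code): replacing string.split with a regular expression so punctuation is stripped and case is folded before counting.

```python
import re
from collections import defaultdict

class Reader:
    def Read(self, strings):
        self.data = defaultdict(int)
        for string in strings:
            # \w+ extracts runs of word characters, dropping punctuation;
            # lower() folds "Hello" and "hello" into one count.
            for word in re.findall(r"\w+", string.lower()):
                self.data[word] += 1
        for key in self.data:
            print(key, self.data[key])

reader = Reader()
reader.Read(["Hello, world!", "Hello again; world..."])
```

With this change "Hello," and "Hello" are counted as the same word, which the split-on-space version above does not manage.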