Re.Mark

My Life As A Blog

A Bit More Text Processing

Yesterday I wrote a simple Python class that counts word frequency. The next step, I figured, was to strip out punctuation. So, taking the class I wrote yesterday, I ended up with this:

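from collections import defaultdict
import re

class Reader:

    def read(self, strings):
        # Map each lower-cased word to the number of times it was seen
        self.data = defaultdict(int)
        for string in strings:
            # Drop embedded newlines, then mark sentence ends with fresh ones
            clean_string = string.replace('\n', '')
            clean_string = self.__split_sentences(clean_string)
            clean_string = self.__remove_punctuation(clean_string)
            for sentence in clean_string.splitlines():
                clean_sentence = sentence.strip()
                # split() with no argument collapses runs of whitespace,
                # so empty strings are never counted as words
                for word in clean_sentence.split():
                    self.data[word.lower()] += 1
        for key in self.data:
            # This form prints the same text on Python 2 and Python 3
            print("%s %d" % (key, self.data[key]))

    def __split_sentences(self, string):
        # Replace sentence-ending punctuation with a newline
        return re.sub(r'[.:?!]', '\n', string)

    def __remove_punctuation(self, string):
        # Remove the remaining punctuation characters outright
        return re.sub(r'[;,()\-"]', '', string)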

Still seems fairly simple: strip out newlines, replace sentence-ending punctuation with newline characters, strip out other punctuation, split into sentences, strip out extra whitespace and then split into words. Also, this time around I’m doing all word comparison in lower case. I’ve also changed the naming convention to be a bit more pythonic. The double leading underscores, for those wondering, mark the methods as private: Python mangles such names (to _Reader__split_sentences, for example), which discourages outside callers rather than strictly preventing them. I’ve used the Python regular expression library (re) to do the (only slightly) more complex string replacements. OK, so here’s some code to use in an interactive shell session to see the new class at work:

>>> from TextProcessor import Reader
>>> reader = Reader()
>>> text = ["This is a sentence.  And another one!", "This is"]
>>> reader.read(text)
a 1
and 1
sentence 1
this 2
is 2
one 1
another 1
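
To make the individual steps easier to follow, here is a small stand-alone sketch that applies the same two substitutions to the first sample string and shows the intermediate result. It is an illustration only, not part of the class; the variable names (sentences, no_punct, words) are made up for this example.

```python
import re

text = "This is a sentence.  And another one!"

# Step 1: turn sentence-ending punctuation into newlines.
sentences = re.sub(r'[.:?!]', '\n', text)

# Step 2: remove the remaining punctuation (this sample has none).
no_punct = re.sub(r'[;,()\-"]', '', sentences)

# Step 3: split into sentences, trim each one, split into lower-case words.
words = [word.lower()
         for sentence in no_punct.splitlines()
         for word in sentence.strip().split()]

print(repr(sentences))  # 'This is a sentence\n  And another one\n'
print(words)            # ['this', 'is', 'a', 'sentence', 'and', 'another', 'one']
```

The full stop and the exclamation mark each become a newline, which is why splitlines() then yields exactly two sentences.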

It’s edging closer to a usable class and it’s still pretty simple. It probably needs to return something rather than print the output.
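
One possible shape for that change, as a rough sketch rather than a finished design: a plain function that runs the same steps and returns the counts, leaving printing to the caller. The name count_words is hypothetical and does not appear in the class above.

```python
from collections import defaultdict
import re

def count_words(strings):
    """Return a dict mapping each lower-cased word to its frequency."""
    counts = defaultdict(int)
    for string in strings:
        cleaned = string.replace('\n', '')
        cleaned = re.sub(r'[.:?!]', '\n', cleaned)    # sentence ends -> newlines
        cleaned = re.sub(r'[;,()\-"]', '', cleaned)   # drop other punctuation
        for sentence in cleaned.splitlines():
            for word in sentence.split():
                counts[word.lower()] += 1
    return dict(counts)

counts = count_words(["This is a sentence.  And another one!", "This is"])
for word in sorted(counts):
    print("%s %d" % (word, counts[word]))
```

Returning a dict means the caller decides what to do with the result, for example sorting the words before printing as the last two lines do.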

Written by remark

January 6, 2010 at 3:19 pm

Posted in Development, Python
