A Bit More Text Processing
Yesterday I wrote a simple Python class counting word frequency. The next step, I figured, was to strip out punctuation. So, taking the class I wrote yesterday, I ended up with this:
from collections import defaultdict
import re

class Reader:
    def read(self, strings):
        self.data = defaultdict(int)
        for string in strings:
            clean_string = string.replace('\n', '')
            clean_string = self.__split_sentences(clean_string)
            clean_string = self.__remove_punctuation(clean_string)
            for sentence in clean_string.splitlines():
                clean_sentence = sentence.strip()
                for word in clean_sentence.split(" "):
                    if not (word.isspace()):
                        self.data[word.lower()] += 1
        for key in self.data:
            print key, self.data[key]

    def __split_sentences(self, string):
        return re.sub('[.:?!]','\n', string)

    def __remove_punctuation(self, string):
        return re.sub('[;,\(\)\-"]', '', string)
Still seems fairly simple: strip out newlines, replace sentence-ending punctuation with newline characters, strip out the other punctuation, split into sentences, strip out extra whitespace and then split into words. Also, this time around I’m doing all word comparison in lower case. I’ve also changed the naming convention to be a bit more Pythonic. The double leading underscores, for those wondering, trigger Python’s name mangling, which is about as close as Python gets to a private method (a single leading underscore is the usual "private by convention" marker). I’ve used the Python regular expression library (re) to do the (only slightly) more complex string replacements. OK, so here’s some code to use in an interactive shell session to see the new class at work:
>>> from TextProcessor import Reader
>>> reader = Reader()
>>> text = ["This is a sentence. And another one!", "This is"]
>>> reader.read(text)
a 1
and 1
sentence 1
this 2
is 2
one 1
another 1
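To see what each cleaning step does to the text, here is a rough sketch that runs them one at a time outside the class. It is written for Python 3, unlike the class above, which uses the Python 2 print statement. The standalone function names and the sample string are made up for illustration; the regular expressions match the same characters as the ones in Reader.

```python
import re


def split_sentences(text):
    # Hypothetical standalone version of Reader's sentence splitter:
    # sentence-ending punctuation becomes a newline character.
    return re.sub(r'[.:?!]', '\n', text)


def remove_punctuation(text):
    # Hypothetical standalone version of Reader's punctuation stripper:
    # the remaining punctuation marks are simply dropped.
    return re.sub(r'[;,()\-"]', '', text)


raw = 'Well, this is (mostly) fine.\nIs it? Yes!'

no_newlines = raw.replace('\n', '')      # step 1: strip existing newlines
sentences = split_sentences(no_newlines)  # step 2: one sentence per line
cleaned = remove_punctuation(sentences)   # step 3: drop other punctuation

# steps 4-6: split into sentences, trim whitespace, split into words
words = [word.lower()
         for sentence in cleaned.splitlines()
         for word in sentence.strip().split(" ")
         if not word.isspace()]

print(repr(cleaned))  # 'Well this is mostly fine\nIs it\n Yes\n'
print(words)
# ['well', 'this', 'is', 'mostly', 'fine', 'is', 'it', 'yes']
```

One thing this makes visible: the leading space in " Yes" survives until the strip() call, which is why that step is there.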
It’s edging closer to a usable class, and it’s still pretty simple. It probably needs to return the counts rather than print them, though.
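As a rough idea of what that might look like, here is a sketch of the same class with read returning a plain dictionary instead of printing. It is written for Python 3 and is only one possible shape for the next version. It also makes one change that is my assumption rather than part of the original: it uses split() with no argument, which avoids counting empty strings when a sentence contains repeated spaces.

```python
import re
from collections import defaultdict


class Reader:
    def read(self, strings):
        # Same cleaning steps as before, but the counts are returned
        # to the caller rather than printed.
        counts = defaultdict(int)
        for string in strings:
            clean_string = string.replace('\n', '')
            clean_string = self.__split_sentences(clean_string)
            clean_string = self.__remove_punctuation(clean_string)
            for sentence in clean_string.splitlines():
                # Assumption: split() with no argument splits on any run
                # of whitespace, so no empty "words" are produced.
                for word in sentence.split():
                    counts[word.lower()] += 1
        return dict(counts)

    def __split_sentences(self, string):
        return re.sub(r'[.:?!]', '\n', string)

    def __remove_punctuation(self, string):
        return re.sub(r'[;,()\-"]', '', string)


counts = Reader().read(["This is a sentence. And another one!", "This is"])
for word, count in sorted(counts.items()):
    print(word, count)
```

Returning a dictionary leaves the ordering and formatting of the output up to the caller; here the words are sorted alphabetically before printing.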