My Life As A Blog

Archive for January 2010

Text Processing in WPF

Having seen that I could run my Python code to calculate word frequency in IronPython, my thoughts turned to displaying the results.  Having previously built a simple WPF application in IronPython, I decided to start there.  The first step was to create a simple XAML file:

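<Window xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" Title="Text Processor App" Width="640" Height="480">
    <StackPanel> <!-- outer panel assumed: the script reaches the ListBox via window.Content.Children[1] -->
        <Label>Text Processor</Label>
        <ListBox x:Name="listbox1">
            <ListBox.ItemTemplate><DataTemplate>
                <StackPanel Orientation="Horizontal">
                    <TextBlock Text="{Binding Path=[0]}" Margin="0,0,10,0" />
                    <TextBlock Text="{Binding Path=[1]}" />
                </StackPanel>
            </DataTemplate></ListBox.ItemTemplate>
        </ListBox>
    </StackPanel>
</Window>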

I called this file main.xaml.  The results from the Reader class in Python are a list of tuples: each tuple holds a word and the frequency with which it appears.  The CLR type is IronPython.Runtime.PythonTuple, which exposes an indexer, so the bindings Path=[0] and Path=[1] pick out the word and its count.  The next step was to create a Python script in the same folder as main.xaml:

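import clr

# Load the WPF assemblies before importing from their namespaces.
clr.AddReference("PresentationFramework")
clr.AddReference("PresentationCore")
clr.AddReference("WindowsBase")

from System.IO import File
from System.Windows.Markup import XamlReader
from System.Windows import Application
from TextProcessor import Reader

# Build the window from the XAML file.
stream = File.OpenRead('main.xaml')
window = XamlReader.Load(stream)

# Count the words, then hand the sorted (word, count) tuples to the ListBox.
reader = Reader()
strings = ["This is a multiline string.\nWith many lines.", "This isn't.", "And nor is this"]
reader.read(strings)
window.Content.Children[1].ItemsSource = reader.get_sorted_results()

# Show the window and start the WPF message loop.
Application().Run(window)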


The next step is to fire up a command prompt and type:


And it works. Here’s the output on my machine.


Written by remark

January 11, 2010 at 7:49 pm

Posted in .NET, Development, Python

Running Text Processor in IronPython

I couldn’t resist trying to run the Python code I wrote earlier today in IronPython.  And the good news is it just works.  I spun up an IronPython interactive shell and went through the same lines as I had done with Python earlier.  Not a surprise, of course, but it may help when it comes to displaying the results.

Written by remark

January 7, 2010 at 6:58 pm

Posted in .NET, Development, Python

Sorting results in Python

After yesterday’s exercise, I decided that today it’d be good to be able to retrieve the results.  And I’d like the results to be sorted.  A quick trawl across the internet and I found this post about sorting dictionaries in Python.  For an introduction to sorting in Python, this article is helpful.  So, I modified the Reader class by adding an import statement:

from operator import itemgetter

I took out the result printing loop and added a new function:

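def get_sorted_results(self):
    # self.counts stands in for the word-count dictionary built by read(); the attribute name is a placeholder
    return sorted(self.counts.items(), key=itemgetter(1), reverse=True)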

And that’s it.  To see the sorted results, here’s some code entered into the interactive shell:

>>> from TextProcessor import Reader
>>> reader = Reader()
>>> strings = ["This is a multiline string.\nWith many lines.", "This isn't.", "And nor is this."]
>>> reader.read(strings)
>>> for key, value in reader.get_sorted_results():
...     print key, value
...
this 3
is 2
a 1
and 1
string 1
many 1
lines 1
multiline 1
nor 1
with 1
isn't 1

The next step would seem to be displaying the results in something other than a console.
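
As a minimal sketch of the sorting step on its own (written for Python 3, with a hand-made `counts` dictionary standing in for the Reader's data), `items()` yields (word, count) pairs and `itemgetter(1)` selects the count as the sort key. The second sort is an optional variation, not something the Reader class does: it breaks ties alphabetically.

```python
from operator import itemgetter

# Hypothetical sample data standing in for the Reader's word counts.
counts = {"this": 3, "is": 2, "nor": 1, "a": 1}

# Sort the (word, count) pairs by count, highest first.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(by_count[0])

# Optional variation: highest count first, then alphabetical within a tie.
by_count_then_word = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(by_count_then_word)
```

Because `sorted` is stable, words with equal counts keep whatever order the dictionary gave them in the first version, which is why the tied words in the session above appear in no particular order.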

Written by remark

January 7, 2010 at 6:18 pm

Posted in Development, Python

A Bit More Text Processing

Yesterday I wrote a simple Python class counting word frequency.  The next step, I figured, was to strip out punctuation.  So, taking the class I wrote yesterday, I ended up with this:

from collections import defaultdict
import re

class Reader:
    def read(self, strings):
        # "counts" is a placeholder attribute name for the word-count dictionary
        self.counts = defaultdict(int)
        for string in strings:
            clean_string = string.replace('\n', '')
            clean_string = self.__split_sentences(clean_string)
            clean_string = self.__remove_punctuation(clean_string)
            for sentence in clean_string.splitlines():
                # split() with no argument also discards leading, trailing and repeated whitespace
                for word in sentence.split():
                    self.counts[word.lower()] += 1
        for key in self.counts:
            print key, self.counts[key]

    def __split_sentences(self, string):
        # Turn sentence-ending punctuation into line breaks
        return re.sub(r'[.:?!]', '\n', string)

    def __remove_punctuation(self, string):
        # Drop the remaining punctuation characters
        return re.sub(r'[;,\(\)\-"]', '', string)

Still seems fairly simple: strip out new lines, replace end of sentences with new line characters, strip out other punctuation, split into sentences, strip out extra whitespace and then split into words.  Also, this time around I’m doing all word comparison in lower case.  I’ve also changed the naming convention to be a bit more Pythonic.  The double underscores, for those wondering, trigger Python’s name mangling, which makes the methods effectively private to the class (a single leading underscore is the purely conventional way to mark something as private).  I’ve used the Python regular expression library (re) to do the (only slightly) more complex string replacements.  OK, so here’s some code to use in an interactive shell session to see the new class at work:

>>> from TextProcessor import Reader
>>> reader = Reader()
>>> text = ["This is a sentence.  And another one!", "This is"]
>>> reader.read(text)
a 1
and 1
sentence 1
this 2
is 2
one 1
another 1
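
For anyone curious what the double-underscore prefix on the helper methods actually does, here is a small sketch in Python 3. The cut-down `Reader` and its `split` method are hypothetical stand-ins for illustration, not the class above.

```python
class Reader:
    def __split_sentences(self, string):
        # Inside the class body this name is rewritten to _Reader__split_sentences.
        return string.replace(".", "\n")

    def split(self, string):
        # Calls from inside the class are mangled the same way, so this works.
        return self.__split_sentences(string)

reader = Reader()
print(reader.split("One. Two."))

# From outside the class, only the mangled name exists.
print(hasattr(reader, "__split_sentences"))
print(hasattr(reader, "_Reader__split_sentences"))
```

The method is still reachable under its mangled name, so this discourages outside access rather than preventing it.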

It’s edging closer to a usable class and still pretty simple.  It probably needs to return something rather than printing the output.
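
One way that could look, as a rough sketch rather than the class itself: the same cleaning steps collapsed into a single function that returns a dictionary. The name `count_words` is made up for this example, and the code targets Python 3.

```python
import re
from collections import defaultdict

def count_words(strings):
    """Return a dict mapping each lower-cased word to its frequency."""
    counts = defaultdict(int)
    for string in strings:
        text = string.replace("\n", "")          # strip out new lines
        text = re.sub(r"[.:?!]", "\n", text)     # sentence ends become line breaks
        text = re.sub(r'[;,()\-"]', "", text)    # drop the remaining punctuation
        for sentence in text.splitlines():
            for word in sentence.split():        # split() ignores extra whitespace
                counts[word.lower()] += 1
    return dict(counts)

result = count_words(["This is a sentence.  And another one!", "This is"])
print(result["this"], result["is"], result["one"])
```

Returning a plain dict leaves the caller free to print, sort or display the counts however it likes.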

Written by remark

January 6, 2010 at 3:19 pm

Posted in Development, Python

Simple Text Processing with Python

On a couple of recent projects, being able to calculate word frequency in text was a good idea.  There are a number of ways to do this, but I wondered how hard it would be to do in Python.  I saw a little bit of Michael Sparks’ talk at DevDays in London, and, while I don’t remember the detail, it did act as a reminder.  It turns out that a naive implementation is very easy indeed.  Here it is:

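from collections import defaultdict

class Reader:
    def Read(self, strings):
        # "counts" is a placeholder attribute name for the word-count dictionary
        self.counts = defaultdict(int)
        for string in strings:
            for word in string.split(" "):
                # defaultdict(int) starts every unseen word at 0
                self.counts[word] += 1
        for key in self.counts:
            print key, self.counts[key]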

Take a collection of strings, split them into words and then keep track of the count of each word using the word as a key.  The important class here is defaultdict, without which I’d have to check whether a word had already been inserted (not difficult, but it would have been more code).  Here’s some code that uses the Reader class and its output:

>>> from TextProcessor import Reader
>>> reader = Reader()
>>> strings = ["Hello", "Hello", "Hello World", "Some other stuff with more spaces in it"]
>>> reader.Read(strings)
spaces 1
stuff 1
Some 1
it 1
other 1
in 1
World 1
with 1
Hello 3
more 1

To make it useful, there’s more to be done (punctuation being an obvious issue for this implementation) but it’s a useful start.
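
To make the defaultdict point concrete, here is a small side-by-side sketch in Python 3. The two function names are invented for the comparison and are not part of the Reader class.

```python
from collections import defaultdict

def count_with_dict(strings):
    """Count words with a plain dict, checking for the key each time."""
    counts = {}
    for string in strings:
        for word in string.split(" "):
            if word in counts:
                counts[word] += 1
            else:
                counts[word] = 1
    return counts

def count_with_defaultdict(strings):
    """Same count, but defaultdict(int) supplies 0 for unseen words."""
    counts = defaultdict(int)
    for string in strings:
        for word in string.split(" "):
            counts[word] += 1
    return counts

strings = ["Hello", "Hello", "Hello World", "Some other stuff with more spaces in it"]
plain = count_with_dict(strings)
default = count_with_defaultdict(strings)
print(plain == default)
print(default["Hello"])
```

Both versions produce the same counts; the defaultdict one simply drops the membership check.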

Written by remark

January 5, 2010 at 4:46 pm

Posted in Development, Python