Could you post an example of your source data, as well as the kind of output you want?
Rather than writing your own parser, why not load the text into a hidden instance of Word and use the Word object model to split it into words, sentences, etc.?
Of course, it would depend on the output you want.
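To give a rough idea of the hidden-Word approach, here is a minimal sketch. It assumes Python with the pywin32 package on a Windows machine that has Word installed; you haven't said what language you're using, so treat it as an illustration only. The helper name split_with_word is made up for this example. Word treats punctuation and paragraph marks as items in the Words collection, and items usually carry trailing whitespace, so the sketch strips and filters them.

```python
def split_with_word(app, text):
    """Return (words, sentences) for `text`, using a Word Application object.

    `app` is assumed to be a Word.Application COM object (see below).
    """
    doc = app.Documents.Add()  # new blank document
    try:
        doc.Content.Text = text
        # Word collections are 1-based: Item(1) .. Item(Count)
        words = [doc.Words.Item(i).Text.strip()
                 for i in range(1, doc.Words.Count + 1)]
        sentences = [doc.Sentences.Item(i).Text.strip()
                     for i in range(1, doc.Sentences.Count + 1)]
    finally:
        doc.Close(0)  # 0 = wdDoNotSaveChanges
    # Drop entries that were only whitespace or a paragraph mark
    words = [w for w in words if w]
    sentences = [s for s in sentences if s]
    return words, sentences


if __name__ == "__main__":
    import win32com.client  # from pywin32 (assumption: installed)

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False  # keep the Word window hidden
    try:
        words, sentences = split_with_word(word, "Hello world. This is a test.")
        print(words)
        print(sentences)
    finally:
        word.Quit()
```

Whether Word's idea of a "word" or a "sentence" matches what you need is something you'd have to check against your real data.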