String splitting for inconsistent strings

Ron Rosenfeld · 06-21-2006, 11:00 PM

On Thu, 22 Jun 2006 00:04:37 +0100, Richard Buttrey
<chaos.theory.nospam.removethis@zen.co.uk> wrote:

>On Tue, 20 Jun 2006 13:29:40 -0400, Ron Rosenfeld
><ronrosenfeld@nospam.org> wrote:
>
>>
>>It seems as if this can be done with "Regular Expressions". I will make use of
>>the fact that the value which you want to pull out is the first floating point
>>number that ends with a "p " (thats "p" followed by <space>). If there are
>>any issues where the stock name includes a string of the form <number>p<space>
>>you might want to change the formula to take that into account.
>>
>>Download and install Longre's free morefunc.xll add-in from
>>http://xcell05.free.fr
>>
>>To extract the company name:
>>
>>=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s))")
>>
>>To extract the price:
>>
>>=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s)")
>>
>>To be a little more bullet-proof one can require that the price number be
>>identified as the first positive floating point number followed by the sequence
>>of p<space> then optionally a "-" and another digit. So the formula would
>>then be:
>>
>>Company Name:
>>=REGEX.MID(A1,"(\w+\s+)+(?=\d+(\.\d*)?(?=p\s-?\d))")
>>
>>Price:
>>=REGEX.MID(A1,"\d+(\.\d*)?(?=p\s-?\d)")
>>
>>
>>--ron
>
>Ron,
>
>Many thanks for the pointer to this Add In which I've just downloaded.
>The REGEX function looks very interesting - I just need to get my mind
>around the syntax and terminology.
>
>One supplementary if I may. I've just used your example on my data and
>it works a treat.
>However there's one other small variation I've just noticed, and
>that's where the integer part of the share price is four characters
>with a comma separating the thousands.
>
>e.g. the following string
>
>GlaxoSmithKline 1,480.00p 0.16p (1.09)
>
>returns nothing for the name in B1, and in C1 480.00, not 1480.00
>
>Can you suggest a modification that will overcome this please - it'll
>probably take me a couple of weeks to understand the syntax!
>
>Kind regards,
>
>Richard Buttrey
>
>
>
>__
>Richard Buttrey
>Grappenhall, Cheshire, UK
>__________________________

What you need to do, I believe, is optionally look for zero or more sequences,
within the number, of a comma followed by three digits. The syntax for that
would be: (,\d{3})*

I think this modification will do that:

\d+(,\d{3})*(\.\d*)(?=p\s)

So the modified "more bulletproof" formulas would be:

Company Name:

=REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s-?\d))")

Price:

=REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)?(?=p\s-?\d)")

If you look at the Company Name formula, you will note I made a small change as
the previous formula would include a trailing <space> after each name. This
one does not.

The simpler formulas would be

Name:

=REGEX.MID(A1,"(\w+\s?)+(?=\s\d+(,\d{3})*(\.\d*)?(?=p\s))")

Price:

=REGEX.MID(A1,"\d+(,\d{3})*(\.\d*)(?=p\s)")

=========================================

http://www.regular-expressions.info/reference.html

is a place on the net to start learning a bit about syntax.

I would also recommend the book "Mastering Regular Expressions" by Jeffrey
Friedl. I got mine for just a few dollars, used, at amazon.com

Finally, be aware that a limitation of Longre's functions is that they cannot
process strings longer than 255 characters. I believe this limitation is due
to the data types used in the XLL, and also applies to the other functions in
the add-in.

If you need to process longer strings, it can be done in VBA by setting a
reference to Microsoft VBScript Regular Expressions 5.5 and writing some code.
I've got a little of that done, but nothing as flexible as what Longre
provides. And there's also some syntax that won't work in the VBScript
implementation, so I rarely bother with it.

Have fun
--ron

String splitting for inconsistent strings

LinkBack

Thread Tools

Rate This Thread

Display

Hybrid View

Re: String splitting for inconsistent strings

Thread Information

Users Browsing this Thread

Bookmarks

Bookmarks

Posting Permissions