+ Reply to Thread
Results 1 to 37 of 37

Text manipulation

Hybrid View

  1. #1
    Harlan Grove
    Guest

    Re: Text manipulation

    Ron Rosenfeld wrote...
    ....
    >For the "0 lost cuts"
    >
    >B1: =REGEX.MID(seq,"(\w+?([^FP]K|$)){"&COLUMN()-1&"}",ROW())
    >
    >ROW() resolves to a '1' which means take the 'first' sequence that matches the
    >pattern. As you copy/drag the formula down, ROW() will resolve to '2', '3',
    >etc. which means match the 2nd, 3rd, etc sequence that matches the pattern.
    >
    >The basic pattern is defined by "(\w+?([^FP]K|$)){" which means look for a
    >sequence of letters that ends with a K that is not preceded by an FP, or that
    >is at the end of the string.

    ....

    Wrong. [^FP] means any characters other than F or P, not the sequence
    FP. While your formula may have produced the OP's expected results, if
    there were an "FK" in the sequence, it wouldn't have. You need to use a
    negative look-ahead asertion.

    B2:
    =REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))

    Note that if 0, 1, 2, etc. cuts are always wanted, the 1, 2, etc. cuts
    could all be generated from the 0 cuts by concatenating each group of
    2, 3, etc. 0-cut items, respectively.

    If there could be multiple prohibited sequences, e.g., never cut before
    FPK or DK (note the different lengths), it actually gets simpler if you
    use an inner REGEX.SUBSTITUTE call to mask the cut character (K) ending
    prohibited sequences, then unmask it after the REGEX.MID calls.

    B2:
    =SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"(FP|D)K","[1]_"),
    ".*?K",ROWS(B$10:B10)),"_","K")

    If the cut character were named CC, the list of prohibited sequences -
    not including the ending cut character - were named PS, then this could
    be rewritten as

    B2:
    =SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"("&MCONCAT(PS,"|")&")"&CC,
    "[1]#"),".*?"&CC,ROWS(B$10:B10)),"#",CC)

    making use if MOREFUNC.XLL's MCONCAT function as well.

    If you REALLY want to learn regular expressions, read
    comp.lang.perl.misc, comp.editors or comp.unix.shell. Then when you
    THINK you know them, reply to threads in those ngs. You'll find out
    VERY QUICKLY whether you know them or not.


  2. #2
    paulinoluciano
    Guest

    Re: Text manipulation

    Thank you very much Harlan Grove!
    In fact, I just realize that it was just a language problem. My excel
    has formula in portuguese; different from yours. I`m now trying put
    your formula to run as VBA codes but in this case I think it will be
    necessary to define each variable because some people will use thes
    worksheet will not have morefunc.xll.
    Luciano


  3. #3
    paulinoluciano
    Guest

    Re: Text manipulation


    paulinoluciano wrote:
    > Thank you very much Harlan Grove!
    > In fact, I just realize that it was just a language problem. My excel
    > has formula in portuguese; different from yours. I`m now trying put
    > your formula to run as VBA codes but in this case I think it will be
    > necessary to define each variable because some people will use thes
    > worksheet will not have morefunc.xll.
    > Luciano



  4. #4
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On 29 Dec 2005 14:08:41 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >Ron Rosenfeld wrote...
    >...
    >>For the "0 lost cuts"
    >>
    >>B1: =REGEX.MID(seq,"(\w+?([^FP]K|$)){"&COLUMN()-1&"}",ROW())
    >>
    >>ROW() resolves to a '1' which means take the 'first' sequence that matches the
    >>pattern. As you copy/drag the formula down, ROW() will resolve to '2', '3',
    >>etc. which means match the 2nd, 3rd, etc sequence that matches the pattern.
    >>
    >>The basic pattern is defined by "(\w+?([^FP]K|$)){" which means look for a
    >>sequence of letters that ends with a K that is not preceded by an FP, or that
    >>is at the end of the string.

    >...
    >
    >Wrong. [^FP] means any characters other than F or P, not the sequence
    >FP. While your formula may have produced the OP's expected results, if
    >there were an "FK" in the sequence, it wouldn't have. You need to use a
    >negative look-ahead asertion.
    >
    >B2:
    >=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))
    >
    >Note that if 0, 1, 2, etc. cuts are always wanted, the 1, 2, etc. cuts
    >could all be generated from the 0 cuts by concatenating each group of
    >2, 3, etc. 0-cut items, respectively.
    >
    >If there could be multiple prohibited sequences, e.g., never cut before
    >FPK or DK (note the different lengths), it actually gets simpler if you
    >use an inner REGEX.SUBSTITUTE call to mask the cut character (K) ending
    >prohibited sequences, then unmask it after the REGEX.MID calls.
    >
    >B2:
    >=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"(FP|D)K","[1]_"),
    >".*?K",ROWS(B$10:B10)),"_","K")
    >
    >If the cut character were named CC, the list of prohibited sequences -
    >not including the ending cut character - were named PS, then this could
    >be rewritten as
    >
    >B2:
    >=SUBSTITUTE(REGEX.MID(REGEX.SUBSTITUTE($A$2,"("&MCONCAT(PS,"|")&")"&CC,
    >"[1]#"),".*?"&CC,ROWS(B$10:B10)),"#",CC)
    >
    >making use if MOREFUNC.XLL's MCONCAT function as well.
    >
    >If you REALLY want to learn regular expressions, read
    >comp.lang.perl.misc, comp.editors or comp.unix.shell. Then when you
    >THINK you know them, reply to threads in those ngs. You'll find out
    >VERY QUICKLY whether you know them or not.



    Thank you for pointing that out. And also for the references to those other
    NG's.


    --ron

  5. #5
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On 29 Dec 2005 14:08:41 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >B2:
    >=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))


    By the way, I think in keeping with the OP's specifications, this should be:

    =REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))

    I think he always wants the end of the string, even if it doesn't end with a
    'K'.


    --ron

  6. #6
    Harlan Grove
    Guest

    Re: Text manipulation

    Ron Rosenfeld wrote...
    ....
    >By the way, I think in keeping with the OP's specifications, this should be:
    >
    > =REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))
    >
    >I think he always wants the end of the string, even if it doesn't end with a
    >'K'.


    Good point, but it begs the question whether any symbols after the last
    K would be considered valid data.


  7. #7
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On 30 Dec 2005 11:04:25 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >Ron Rosenfeld wrote...
    >...
    >>By the way, I think in keeping with the OP's specifications, this should be:
    >>
    >> =REGEX.MID(seq,"([^K]?|.*?((?!FP)..))(K|$)",ROWS(B$2:B2))
    >>
    >>I think he always wants the end of the string, even if it doesn't end with a
    >>'K'.

    >
    >Good point, but it begs the question whether any symbols after the last
    >K would be considered valid data.


    It's been a long time since I was fragmenting amino acid sequences ...


    --ron

  8. #8
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On 29 Dec 2005 14:08:41 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >Wrong. [^FP] means any characters other than F or P, not the sequence
    >FP. While your formula may have produced the OP's expected results, if
    >there were an "FK" in the sequence, it wouldn't have. You need to use a
    >negative look-ahead asertion.
    >
    >B2:
    >=REGEX.MID($A$2,"([^K]?|.*?((?!FP)..))K",ROWS(B$2:B2))


    Harlan,

    How about a negative look-behind assertion?

    =REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))


    --ron

  9. #9
    Harlan Grove
    Guest

    Re: Text manipulation

    "Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote...
    ....
    >How about a negative look-behind assertion?
    >
    >=REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))


    I keep forgetting Longre uses PCRE rather than VBScript regexp syntax. Too
    many regexp varieties.

    That should work for one literal substring in the assertion, but multiple
    options, e.g., ".+?(?<!(FP|XYZ|D))K" [note: it's more efficient to append a
    K to A1 and use a simple K at the end of the regexp than to use (K|$) - to
    me at least that's a clearer indicator that trailing symbols after the last
    K should be included] could cause backtracking problems. Assertions with
    alternation subexpressions with closures (not the case here, but generally)
    can become a big PITA. If there were multiple prohibited sequences, it may
    be more efficient to mask immediately following cut characters and replace
    them in the results. This is the regexp analog to using ancillary cells for
    intermediate calculations rather than single huge formulas in spreadsheets.
    Just as 2 or 3 formulas may recalc more quickly than a single equivalent
    formula, 2 or 3 simpler regexp operations may recalc more quickly than a
    single, more complex one. Assertions may work well in this case, but I tend
    to avoid them when possible myself.



  10. #10
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On Mon, 2 Jan 2006 14:26:16 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >"Ron Rosenfeld" <ronrosenfeld@nospam.org> wrote...
    >...
    >>How about a negative look-behind assertion?
    >>
    >>=REGEX.MID($A$1,".*?(?<!FP)(K|$)",ROWS($B$2:B2))

    >
    >I keep forgetting Longre uses PCRE rather than VBScript regexp syntax. Too
    >many regexp varieties.
    >
    >That should work for one literal substring in the assertion, but multiple
    >options, e.g., ".+?(?<!(FP|XYZ|D))K" [note: it's more efficient to append a
    >K to A1 and use a simple K at the end of the regexp than to use (K|$) - to
    >me at least that's a clearer indicator that trailing symbols after the last
    >K should be included] could cause backtracking problems. Assertions with
    >alternation subexpressions with closures (not the case here, but generally)
    >can become a big PITA. If there were multiple prohibited sequences, it may
    >be more efficient to mask immediately following cut characters and replace
    >them in the results. This is the regexp analog to using ancillary cells for
    >intermediate calculations rather than single huge formulas in spreadsheets.
    >Just as 2 or 3 formulas may recalc more quickly than a single equivalent
    >formula, 2 or 3 simpler regexp operations may recalc more quickly than a
    >single, more complex one. Assertions may work well in this case, but I tend
    >to avoid them when possible myself.
    >


    I will need to keep that in mind. Clearly I have insufficient experience with
    regular expressions to have run into some of the issues you have mentioned.

    I will need to think more about the backtracking issues in order to better
    understand what you write. As well as the pros and cons of trying to avoid
    assertions.

    I have been trying to work out some of the questions posed on the NG's you
    referred me to. Educational. Thanks.

    Best wishes and Happy New Year,

    --ron
    --ron

  11. #11
    Harlan Grove
    Guest

    Re: Text manipulation

    Ron Rosenfeld wrote...
    ....
    >I will need to keep that in mind. Clearly I have insufficient experience with
    >regular expressions to have run into some of the issues you have mentioned.
    >
    >I will need to think more about the backtracking issues in order to better
    >understand what you write. As well as the pros and cons of trying to avoid
    >assertions.

    ....

    If you want to get a real grounding in regular expressions, read
    Jeffrey Friedl's book 'Mastering Regular Expressions, 2nd Ed.', ISBN
    0-596-00289-0. Lots of discussion about backtracking and optimizing
    regular expessions.


  12. #12
    Ron Rosenfeld
    Guest

    Re: Text manipulation

    On 2 Jan 2006 19:05:16 -0800, "Harlan Grove" <hrlngrv@aol.com> wrote:

    >Ron Rosenfeld wrote...
    >...
    >>I will need to keep that in mind. Clearly I have insufficient experience with
    >>regular expressions to have run into some of the issues you have mentioned.
    >>
    >>I will need to think more about the backtracking issues in order to better
    >>understand what you write. As well as the pros and cons of trying to avoid
    >>assertions.

    >...
    >
    >If you want to get a real grounding in regular expressions, read
    >Jeffrey Friedl's book 'Mastering Regular Expressions, 2nd Ed.', ISBN
    >0-596-00289-0. Lots of discussion about backtracking and optimizing
    >regular expessions.


    Thanks. Just ordered a copy from Amazon.


    --ron

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1