+ Reply to Thread
Results 1 to 24 of 24

hash function for large strings

  1. #1
    jheby
    Guest

    hash function for large strings

    Excel vlookup function returns #VALUE when the lookup value exceeds 256
    characters. I need a hash function to transform large strings into a value
    that does not exceed the 256 character limit. Are there any quick solutions
    to this problem?

  2. #2
    Bernie Deitrick
    Guest

    Re: hash function for large strings

    It depends on your strings, and how similar they are. Anything from a simple =LEFT(A1,255) to a
    more colplicated sampling every third (or fourth or thenth) letter could work....

    HTH,
    Bernie
    MS Excel MVP


    "jheby" <jheby@discussions.microsoft.com> wrote in message
    news:FB83AD4A-0184-4717-A8E7-8EC893FC86AE@microsoft.com...
    > Excel vlookup function returns #VALUE when the lookup value exceeds 256
    > characters. I need a hash function to transform large strings into a value
    > that does not exceed the 256 character limit. Are there any quick solutions
    > to this problem?




  3. #3
    Niek Otten
    Guest

    Re: hash function for large strings

    It depends on what possible values of each character of the string you can
    exclude. If they can be each of the possible 256 characters, there are
    limited hashing possibilities; you can treat upper- and lowercase letters
    the same, since VLOOKUP will do that anyway. But that doesn't save you a
    lot. If there are just the alphabet and 0-9, you can exclude a lot more.
    Then you can use a number base conversion function to realize the gain in
    storage requirement. It may, however, be difficult to find one that uses all
    256 possibilities. I have seen one (from Ron Rosenfeld) that uses 0-9 and
    upper- and lowercase letters.

    I find it difficult to believe that there isn't any logic behind the
    population of your keys. Still that is the "key" to hashing.

    Can you tell a bit more about the nature of your keys?

    --
    Kind regards,

    Niek Otten


    "jheby" <jheby@discussions.microsoft.com> wrote in message
    news:FB83AD4A-0184-4717-A8E7-8EC893FC86AE@microsoft.com...
    > Excel vlookup function returns #VALUE when the lookup value exceeds 256
    > characters. I need a hash function to transform large strings into a
    > value
    > that does not exceed the 256 character limit. Are there any quick
    > solutions
    > to this problem?




  4. #4
    Niek Otten
    Guest

    Re: hash function for large strings

    <Still that is the "key" to hashing>

    That isn't true. But it is if you want to use it with VLOOKUP.
    Maybe you should tell us more about the nature of the lookup problem too.

    --
    Kind regards,

    Niek Otten

    "Niek Otten" <nicolaus@xs4all.nl> wrote in message
    news:ejDMvAhQGHA.3672@TK2MSFTNGP14.phx.gbl...
    > It depends on what possible values of each character of the string you can
    > exclude. If they can be each of the possible 256 characters, there are
    > limited hashing possibilities; you can treat upper- and lowercase letters
    > the same, since VLOOKUP will do that anyway. But that doesn't save you a
    > lot. If there are just the alphabet and 0-9, you can exclude a lot more.
    > Then you can use a number base conversion function to realize the gain in
    > storage requirement. It may, however, be difficult to find one that uses
    > all 256 possibilities. I have seen one (from Ron Rosenfeld) that uses 0-9
    > and upper- and lowercase letters.
    >
    > I find it difficult to believe that there isn't any logic behind the
    > population of your keys. Still that is the "key" to hashing.
    >
    > Can you tell a bit more about the nature of your keys?
    >
    > --
    > Kind regards,
    >
    > Niek Otten
    >
    >
    > "jheby" <jheby@discussions.microsoft.com> wrote in message
    > news:FB83AD4A-0184-4717-A8E7-8EC893FC86AE@microsoft.com...
    >> Excel vlookup function returns #VALUE when the lookup value exceeds 256
    >> characters. I need a hash function to transform large strings into a
    >> value
    >> that does not exceed the 256 character limit. Are there any quick
    >> solutions
    >> to this problem?

    >
    >




  5. #5
    jheby
    Guest

    Re: hash function for large strings

    I am pulling log entries into Excel from Essbase, a multi-dimensional
    database, and want to assign categories to each entry using a vlookup. The
    entries could use pretty much any character. Below is a sample of a large
    string entry:
    'Calculating [
    Accounts(AC_IS_Asset_Based_Fixed,AC_IS_Conditional_Sales_Contracts_Fixed,AC_IS_Construction_Fixed,AC_IS_Factoring_Fixed,AC_IS_Finance_Leases_Fixed,AC_IS_Floorplan_Fixed,AC_IS_Inventory_Fixed...]
    with fixed members [Business_Type(BT_Adds_Non_Recourse_NB_Recv_Amt,
    BT_Adds_Non_Recourse_NB_Equip_Cost_Amt, BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
    BT_Adds_Non_Recourse_EB_Recv_Amt, BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ]

    Regards,
    Jon

    "Niek Otten" wrote:

    > It depends on what possible values of each character of the string you can
    > exclude. If they can be each of the possible 256 characters, there are
    > limited hashing possibilities; you can treat upper- and lowercase letters
    > the same, since VLOOKUP will do that anyway. But that doesn't save you a
    > lot. If there are just the alphabet and 0-9, you can exclude a lot more.
    > Then you can use a number base conversion function to realize the gain in
    > storage requirement. It may, however, be difficult to find one that uses all
    > 256 possibilities. I have seen one (from Ron Rosenfeld) that uses 0-9 and
    > upper- and lowercase letters.
    >
    > I find it difficult to believe that there isn't any logic behind the
    > population of your keys. Still that is the "key" to hashing.
    >
    > Can you tell a bit more about the nature of your keys?
    >
    > --
    > Kind regards,
    >
    > Niek Otten
    >
    >
    > "jheby" <jheby@discussions.microsoft.com> wrote in message
    > news:FB83AD4A-0184-4717-A8E7-8EC893FC86AE@microsoft.com...
    > > Excel vlookup function returns #VALUE when the lookup value exceeds 256
    > > characters. I need a hash function to transform large strings into a
    > > value
    > > that does not exceed the 256 character limit. Are there any quick
    > > solutions
    > > to this problem?

    >
    >
    >


  6. #6
    Pete_UK
    Guest

    Re: hash function for large strings

    Looking at your example string, you can see many repeating groups of
    characters, eg:

    AC_IS_ _Fixed BT_Adds_Non_Recourse _Cost_Amt _Recv_Amt

    to pick just five. You could envisage, then, a user-defined function
    which is applied to the original string and returns a reduced string by
    repeated application of SUBSTITUTE( ) to change these character groups
    to some other character which is not contained within any of the
    strings.

    If you apply the UDF to the lookup table as well, then you can used the
    modified string as the key field.

    Is this the kind of thing you were looking for?

    Pete


  7. #7
    Niek Otten
    Guest

    Re: hash function for large strings

    Also, it is almost only letters, commas, underscores and round and square
    brackets. That is less than 30 possibilities and should fit in just 5 bits
    instead of 8. Or, store 8 positions in 1 character.

    But probably Pete's suggestion for repeating strings is easier to code.

    --
    Kind regards,

    Niek Otten

    "Pete_UK" <pashurst@auditel.net> wrote in message
    news:1141832867.034904.79700@i40g2000cwc.googlegroups.com...
    > Looking at your example string, you can see many repeating groups of
    > characters, eg:
    >
    > AC_IS_ _Fixed BT_Adds_Non_Recourse _Cost_Amt _Recv_Amt
    >
    > to pick just five. You could envisage, then, a user-defined function
    > which is applied to the original string and returns a reduced string by
    > repeated application of SUBSTITUTE( ) to change these character groups
    > to some other character which is not contained within any of the
    > strings.
    >
    > If you apply the UDF to the lookup table as well, then you can used the
    > modified string as the key field.
    >
    > Is this the kind of thing you were looking for?
    >
    > Pete
    >




  8. #8
    jheby
    Guest

    RE: hash function for large strings

    Unfortunately, the repeating strings, while present in this example, don't
    occur in most cases. This example is more of an anomoly in that regard.

    "jheby" wrote:

    > Excel vlookup function returns #VALUE when the lookup value exceeds 256
    > characters. I need a hash function to transform large strings into a value
    > that does not exceed the 256 character limit. Are there any quick solutions
    > to this problem?


  9. #9
    Niek Otten
    Guest

    Re: hash function for large strings

    I can't see how you would use such a string as a key. As a search argument,
    OK, but not as an identifier that uniquely extracts one key.
    Can you tell us how you intend to use these items? How is you search key
    assembled/where does it come from? What data are you trying to look up using
    the key?

    --
    Kind regards,

    Niek Otten

    "jheby" <jheby@discussions.microsoft.com> wrote in message
    news:9D5A6507-411D-42EC-918A-6743BD98D35C@microsoft.com...
    > Unfortunately, the repeating strings, while present in this example, don't
    > occur in most cases. This example is more of an anomoly in that regard.
    >
    > "jheby" wrote:
    >
    >> Excel vlookup function returns #VALUE when the lookup value exceeds 256
    >> characters. I need a hash function to transform large strings into a
    >> value
    >> that does not exceed the 256 character limit. Are there any quick
    >> solutions
    >> to this problem?




  10. #10
    Harlan Grove
    Guest

    Re: hash function for large strings

    jheby wrote...
    >I am pulling log entries into Excel from Essbase, a multi-dimensional
    >database, and want to assign categories to each entry using a vlookup. The
    >entries could use pretty much any character. Below is a sample of a large
    >string entry:

    ....

    There are tasks for which Excel is most definitely not the most
    appropriate tool, and in this case, Excel may be one of the worst tools
    for the task. Most scripting languages (Perl, Pyton, Ruby, gawk)
    provide associative arrays that handle hashing arbitrarily long
    strings. But if you want a simple spreadsheet approach, get a capable
    spreadsheet like the one in OpenOffice. OpenOffice Calc doesn't choke
    on string operations with string lengths > 256 characters. This is one
    of the ugly little backwaters of Excel functionality that hasn't been
    changed in more than a decade.


  11. #11
    jheby
    Guest

    Re: hash function for large strings

    The log that I'm analyzing consist of perhaps 5K unique entries that get
    recorded perhaps 300K times over a period of several days. I've created a
    map that groups the 5K unique entries into about a hundred categories. I
    want to go back and tag each of the 300K instances with its category so that
    I can summarize & analyze the log entries by those category groupings -
    that's where the need for the vlookup comes in. The majority of the unique
    entries are well under 255 characters, but unfortunately, a significant
    number of them are not.

    "Niek Otten" wrote:

    > I can't see how you would use such a string as a key. As a search argument,
    > OK, but not as an identifier that uniquely extracts one key.
    > Can you tell us how you intend to use these items? How is you search key
    > assembled/where does it come from? What data are you trying to look up using
    > the key?
    >
    > --
    > Kind regards,
    >
    > Niek Otten
    >
    > "jheby" <jheby@discussions.microsoft.com> wrote in message
    > news:9D5A6507-411D-42EC-918A-6743BD98D35C@microsoft.com...
    > > Unfortunately, the repeating strings, while present in this example, don't
    > > occur in most cases. This example is more of an anomoly in that regard.
    > >
    > > "jheby" wrote:
    > >
    > >> Excel vlookup function returns #VALUE when the lookup value exceeds 256
    > >> characters. I need a hash function to transform large strings into a
    > >> value
    > >> that does not exceed the 256 character limit. Are there any quick
    > >> solutions
    > >> to this problem?

    >
    >
    >


  12. #12
    jheby
    Guest

    Re: hash function for large strings

    thanks - I hope that the Excel designers are planning to fix this along with
    a number of other items in the new product (including the eight level nesting
    limit and the 65K row limit)

    "Harlan Grove" wrote:

    > jheby wrote...
    > >I am pulling log entries into Excel from Essbase, a multi-dimensional
    > >database, and want to assign categories to each entry using a vlookup. The
    > >entries could use pretty much any character. Below is a sample of a large
    > >string entry:

    > ....
    >
    > There are tasks for which Excel is most definitely not the most
    > appropriate tool, and in this case, Excel may be one of the worst tools
    > for the task. Most scripting languages (Perl, Pyton, Ruby, gawk)
    > provide associative arrays that handle hashing arbitrarily long
    > strings. But if you want a simple spreadsheet approach, get a capable
    > spreadsheet like the one in OpenOffice. OpenOffice Calc doesn't choke
    > on string operations with string lengths > 256 characters. This is one
    > of the ugly little backwaters of Excel functionality that hasn't been
    > changed in more than a decade.
    >
    >


  13. #13
    Pete_UK
    Guest

    Re: hash function for large strings

    Well, you could wait for Excel 12 later in the year, but if you want to
    do this now, the following illustrates my point:

    Your example text string is 428 characters long. I put it in A1 and
    this formula in B1:

    =SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161))

    The length came down to 333 characters.

    Changing the formula to this:

    =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE
    (A1,"BT_Adds_Non_Recourse",CHAR(161)),"_Cost_Amt",CHAR(162)),
    "_Recv_Amt",CHAR(163)),"AC_IS_",CHAR(164)),"_Fixed",CHAR(165))

    brought the length down to 233 characters - so you could use this as a
    key in your VLOOKUP formula. The formula merely changes the character
    groups identified earlier into a single character.

    Hopefully, you can see how this method could be extended by using a UDF
    to pick up other character groups.

    Pete


  14. #14
    jheby
    Guest

    Re: hash function for large strings

    Thanks - that approach will definately work in this instance. It still
    appears to me that I will have to create a VBA-based hashing function to
    address the general case.

    Regards,
    Jon

    "Pete_UK" wrote:

    > Well, you could wait for Excel 12 later in the year, but if you want to
    > do this now, the following illustrates my point:
    >
    > Your example text string is 428 characters long. I put it in A1 and
    > this formula in B1:
    >
    > =SUBSTITUTE(A1,"BT_Adds_Non_Recourse",CHAR(161))
    >
    > The length came down to 333 characters.
    >
    > Changing the formula to this:
    >
    > =SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE
    > (A1,"BT_Adds_Non_Recourse",CHAR(161)),"_Cost_Amt",CHAR(162)),
    > "_Recv_Amt",CHAR(163)),"AC_IS_",CHAR(164)),"_Fixed",CHAR(165))
    >
    > brought the length down to 233 characters - so you could use this as a
    > key in your VLOOKUP formula. The formula merely changes the character
    > groups identified earlier into a single character.
    >
    > Hopefully, you can see how this method could be extended by using a UDF
    > to pick up other character groups.
    >
    > Pete
    >
    >


  15. #15
    Pete_UK
    Guest

    Re: hash function for large strings

    Hi Jon,

    You won't be able to extend the above formula very much because of the
    limits on nesting. Here's an example UDF to do the same thing:

    Function reduce(test As String) As String
    reduce = test
    If Len(reduce) < 255 Then Exit Function
    reduce = Application.WorksheetFunction.Substitute(reduce,
    "BT_Adds_Non_Recourse", Chr(161))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt",
    Chr(162))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt",
    Chr(163))
    reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_",
    Chr(164))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed",
    Chr(165))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_NB",
    Chr(166))
    reduce = Application.WorksheetFunction.Substitute(reduce, "_EB",
    Chr(167))
    reduce = Application.WorksheetFunction.Substitute(reduce,
    "Finance_Leases", Chr(168))
    reduce = Application.WorksheetFunction.Substitute(reduce,
    "Conditional_Sales_Contracts", Chr(169))
    End Function

    I've added a few extra functions to the end, which reduce the length of
    the original string to 184 characters when used as:

    =reduce(A1)

    The function does not change any original text which is already less
    than 255 characters in length. Obviously you can apply this to other
    strings by copying down, but you can also extend the UDF to cope with
    other patterns - just change the character number at the end of the
    line.

    Hope this helps.

    Pete


  16. #16
    Harlan Grove
    Guest

    Re: hash function for large strings

    Pete_UK wrote...
    >You won't be able to extend the above formula very much because of the
    >limits on nesting. Here's an example UDF to do the same thing:
    >
    >Function reduce(test As String) As String
    >reduce = test
    >If Len(reduce) < 255 Then Exit Function
    >reduce = Application.WorksheetFunction.Substitute(reduce,
    >"BT_Adds_Non_Recourse", Chr(161))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt",
    >Chr(162))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt",
    >Chr(163))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_",
    >Chr(164))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed",
    >Chr(165))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "_NB",
    >Chr(166))
    >reduce = Application.WorksheetFunction.Substitute(reduce, "_EB",
    >Chr(167))
    >reduce = Application.WorksheetFunction.Substitute(reduce,
    >"Finance_Leases", Chr(168))
    >reduce = Application.WorksheetFunction.Substitute(reduce,
    >"Conditional_Sales_Contracts", Chr(169))
    >End Function
    >
    >I've added a few extra functions to the end, which reduce the length of
    >the original string to 184 characters when used as:
    >
    >=reduce(A1)
    >
    >The function does not change any original text which is already less
    >than 255 characters in length. Obviously you can apply this to other
    >strings by copying down, but you can also extend the UDF to cope with
    >other patterns - just change the character number at the end of the
    >line.


    And thinking holistically, how does this udf help? While it may reduce
    the length of the 1st argument to VLOOKUP, it's not usable directly for
    the first column of the 2nd argument to VLOOKUP. It's not even usable
    directly in an INDEX/MATCH approach because it can't handle range or
    array arguments and return values. At best, the OP would need to use
    one additional cell per log file record with each of those cells
    calling this UDF with the corresponding record value. Calling *any* udf
    a few thousand times even on a superfast system will bring back fond
    memories of the responsiveness of 110 baud dial-up mainframe
    connections.

    MATCH is afflicted with the same 256 character limit on comparisons as
    VLOOKUP. However, the = operator isn't so afflicted. Use INDEX/MATCH.
    Given A1:A6 containing

    =REPT("0123456789ABCDEF",16)&" - "&CHAR(64+ROW())

    and B1:B6 containing

    =ROW()

    Then instead of VLOOKUP(x,A1:B6,2,0) use arrays formulas like

    =INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))

    A kludge, but this will be faster than any udf.


  17. #17
    Niek Otten
    Guest

    Re: hash function for large strings

    Hi Harlan,

    An ingenious solution! I like the Out of the box approach.

    However, I'm not sure I agree with what you say about performance; in my
    experience array formulas are rather slow.
    I tested your solution (somewhat modified because of the limits of CHAR()).
    Because it requires some efforts to arrange for more records than there are
    rows, I tested with 30,000 (instead of 300,000) of your formulas; that took
    (on my machine) 284 seconds.
    I built a reducekey function similar to Pete's one; 30,000 reducements took
    15 seconds.
    If the 5K table is or can be sorted, Pete's solutions (which indeed requires
    some more work, like adding the reduced key to the lookup table) might be
    quite a lot faster than applying array formulas.

    I know UDF's have a bad reputation for performance. Indeed the overhead of
    switching between Worksheet and VBA is significant. But often the
    performance problems seem to be caused by inefficient programming,
    especially wrong data types.
    My actuarial function system manages 10,000 calls from a worksheet per
    second.

    It would be a lot of work, but it would be nice to test Jon's actual case
    with the different approaches suggested.
    --
    Kind regards,

    Niek Otten


    "Harlan Grove" <hrlngrv@aol.com> wrote in message
    news:1141928694.168585.58400@j33g2000cwa.googlegroups.com...
    > Pete_UK wrote...
    >>You won't be able to extend the above formula very much because of the
    >>limits on nesting. Here's an example UDF to do the same thing:
    >>
    >>Function reduce(test As String) As String
    >>reduce = test
    >>If Len(reduce) < 255 Then Exit Function
    >>reduce = Application.WorksheetFunction.Substitute(reduce,
    >>"BT_Adds_Non_Recourse", Chr(161))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt",
    >>Chr(162))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt",
    >>Chr(163))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_",
    >>Chr(164))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed",
    >>Chr(165))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "_NB",
    >>Chr(166))
    >>reduce = Application.WorksheetFunction.Substitute(reduce, "_EB",
    >>Chr(167))
    >>reduce = Application.WorksheetFunction.Substitute(reduce,
    >>"Finance_Leases", Chr(168))
    >>reduce = Application.WorksheetFunction.Substitute(reduce,
    >>"Conditional_Sales_Contracts", Chr(169))
    >>End Function
    >>
    >>I've added a few extra functions to the end, which reduce the length of
    >>the original string to 184 characters when used as:
    >>
    >>=reduce(A1)
    >>
    >>The function does not change any original text which is already less
    >>than 255 characters in length. Obviously you can apply this to other
    >>strings by copying down, but you can also extend the UDF to cope with
    >>other patterns - just change the character number at the end of the
    >>line.

    >
    > And thinking holistically, how does this udf help? While it may reduce
    > the length of the 1st argument to VLOOKUP, it's not usable directly for
    > the first column of the 2nd argument to VLOOKUP. It's not even usable
    > directly in an INDEX/MATCH approach because it can't handle range or
    > array arguments and return values. At best, the OP would need to use
    > one additional cell per log file record with each of those cells
    > calling this UDF with the corresponding record value. Calling *any* udf
    > a few thousand times even on a superfast system will bring back fond
    > memories of the responsiveness of 110 baud dial-up mainframe
    > connections.
    >
    > MATCH is afflicted with the same 256 character limit on comparisons as
    > VLOOKUP. However, the = operator isn't so afflicted. Use INDEX/MATCH.
    > Given A1:A6 containing
    >
    > =REPT("0123456789ABCDEF",16)&" - "&CHAR(64+ROW())
    >
    > and B1:B6 containing
    >
    > =ROW()
    >
    > Then instead of VLOOKUP(x,A1:B6,2,0) use arrays formulas like
    >
    > =INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))
    >
    > A kludge, but this will be faster than any udf.
    >




  18. #18
    Harlan Grove
    Guest

    Re: hash function for large strings

    Niek Otten wrote...
    ....
    >However, I'm not sure I agree with what you say about performance; in my
    >experience array formulas are rather slow.

    ....

    It's not the array formulas per se, it's the 30K LONG string
    comparisons. If the OP had relatively static data (apparently the
    case), a macro essentially performing repeated Edit > Replace
    operations on copies of the log file records would be better than UDFs.
    One VBA invocation beats 30K separate ones.

    The greater difficulty is whether the chaff that Pete_UK has
    *hardcoded* into his udf would apply to *all* the OP's records. Usually
    it's a bad idea to generalize from one sample record. But there's also
    the question whether given the *entire* population of Essbase fields,
    categories, dimensions, etc. could prove awkward. For instance, the
    OP's single sample record contained

    BT_Adds_Non_Recourse_NB_Recv_Amt,
    BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,
    BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
    BT_Adds_Non_Recourse_EB_Recv_Amt,
    BT_Adds_Non_Recourse_EB_Equip_Cost_Amt,

    Pete_UK's udf would reduce these to

    ,
    Equip,
    RE,
    ,
    Equip,

    Do you begin to see a problem?

    GIGO processing can be very runtime efficient. When you're after
    correct results, however, it may require more calculation time and
    perhaps a wee bit more time figuring out the full extent of the
    problem.

    Myself, I don't believe the OP can *RELIABLY* accomplish his/her task
    by removing any text other than space characters on either side of
    punctuation.


  19. #19
    Harlan Grove
    Guest

    Re: hash function for large strings

    Niek Otten wrote...
    ....
    >However, I'm not sure I agree with what you say about performance; in my
    >experience array formulas are rather slow.

    ....

    It's not the array formulas per se, it's the 30K LONG string
    comparisons. If the OP had relatively static data (apparently the
    case), a macro essentially performing repeated Edit > Replace
    operations on copies of the log file records would be better than UDFs.
    One VBA invocation beats 30K separate ones.

    The greater difficulty is whether the chaff that Pete_UK has
    *hardcoded* into his udf would apply to *all* the OP's records. Usually
    it's a bad idea to generalize from one sample record. But there's also
    the question whether given the *entire* population of Essbase fields,
    categories, dimensions, etc. could prove awkward. For instance, the
    OP's single sample record contained

    BT_Adds_Non_Recourse_NB_Recv_Amt,
    BT_Adds_Non_Recourse_NB_Equip_Cost_Amt,
    BT_Adds_Non_Recourse_NB_RE_Cost_Amt,
    BT_Adds_Non_Recourse_EB_Recv_Amt,
    BT_Adds_Non_Recourse_EB_Equip_Cost_Amt,

    Pete_UK's udf would reduce these to

    ,
    Equip,
    RE,
    ,
    Equip,

    Do you begin to see a problem?

    GIGO processing can be very runtime efficient. When you're after
    correct results, however, it may require more calculation time and
    perhaps a wee bit more time figuring out the full extent of the
    problem.

    Myself, I don't believe the OP can *RELIABLY* accomplish his/her task
    by removing any text other than space characters on either side of
    punctuation.


  20. #20
    Pete_UK
    Guest

    Re: hash function for large strings

    Unfortunately Jon has only posted one example of his long strings - we
    have no idea of what the other strings look like, but I had assumed
    that similar character groupings would appear in many of them. He says
    a significant number of his strings are very long, but that could mean
    a couple of thousand if he has a batch of 15k records.

    Applying the reduced key formula to the lookup table would only have to
    be done once - the values could be fixed and then copied over the
    original keys.

    Jon said he does 300k weekly, so performance is an issue, and like you,
    Niek, I have found that array formulae can be very slow. He wanted a
    way to reduce his long strings so that he could use his existing
    vlookup formulae, and my approach allows him to do that with not a lot
    of effort in setting it up nor in using it - I also tested it on 30k
    identical records following Harlan's posting, and it took less than 10
    seconds.

    Anyway, it's up to him - I think writing a generalised hashing
    algorithm as he was contemplating would take considerable time.

    Pete


  21. #21
    Pete_UK
    Guest

    Re: hash function for large strings

    Harlan,

    this is what the final part of the string gets converted into:

    [Business_Type(=A1=A6=A3,=A1=A6_Equip=A2, =A1=A6_RE=A2,=A1=A7=A3, =A1=A7_Eq=
    uip=A2, ]

    which represents the following:

    [Business_Type(
    [Business_Type(
    BT_Adds_Non_Recourse_NB_Recv_Amt, =A1=A6=A3,
    BT_Adds_Non_Recourse_NB_Equip_Cost_Amt, =A1=A6_Equip=A2,
    BT_Adds_Non_Recourse_NB_RE_Cost_Amt, =A1=A6_RE=A2,
    BT_Adds_Non_Recourse_EB_Recv_Amt, =A1=A7=A3,
    BT_Adds_Non_Recourse_EB_Equip_Cost_Amt, ] =A1=A7_Equip=A2, ]

    I am substituting the phrases with individual characters in the range
    161 onwards, not with blanks. The uniqueness is maintained.

    Pete


  22. #22
    Harlan Grove
    Guest

    Re: hash function for large strings

    "Pete_UK" <pashurst@auditel.net> wrote...
    >this is what the final part of the string gets converted into:
    >
    >[Business_Type(¡¦£,¡¦_Equip¢, ¡¦_RE¢,¡§£, ¡§_Equip¢, ]

    ....

    OK, I screwed up. Sorry.

    This might shorten the OP's records enough, or it might not. Also, if there
    just happen to be more than 128 such common phrases, you'd need to alter
    your approach. I still believe a macro approach would be better than a udf
    approach, and a macro approach could prompt for selection of a different
    range containing a table of common susbtrings rather than hard coding them
    into a udf.



  23. #23
    Harlan Grove
    Guest

    Re: hash function for large strings

    Pete_UK wrote...
    >Well, you could wait for Excel 12 later in the year, but if you want to

    ....

    Maybe you could wait for Excel 13 or 14, but this particular problem
    won't be fixed in Excel 12. David Gainer confirmed in the Excel 12 blog
    that VLOOKUP will remain limited to searching up to but no more than
    255 chars in Excel 12.

    FWIW, Lotus 123 97 Edition (almost 10 years old) handles lookup values
    > 255 characters, as does the Windows port of Gnumeric 1.6.0. I'd be surprised if Excel wasn't the only arguably high-end spreadsheet that can't handle such strings. Microsoft has a very bad tendency to leave matters of actual functionality unchanged version to version over decades but spend thousands of developers' hours on interface and other eyewash issues.


    Excel 12 is Microsoft's latest attempt to prove their assertion that
    they're years ahead of the competition. Then stuff like this comes
    along an proves that when low-level backwaters of functionality are at
    issue, it's Excel that's a decade or so behind the competition. And a
    long as the lemmings dutifully upgrade, there'll be no incentive for
    them to fix this.


  24. #24
    jheby
    Guest

    Re: hash function for large strings

    Regarding the array solution that you proposed - I'm assuming that the large
    string gets placed at the location of "x" in the following array formula:

    =INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))

    Was that your intent? - when I tried that, I got a "formula too long" error.

    "Harlan Grove" wrote:

    > Pete_UK wrote...
    > >You won't be able to extend the above formula very much because of the
    > >limits on nesting. Here's an example UDF to do the same thing:
    > >
    > >Function reduce(test As String) As String
    > >reduce = test
    > >If Len(reduce) < 255 Then Exit Function
    > >reduce = Application.WorksheetFunction.Substitute(reduce,
    > >"BT_Adds_Non_Recourse", Chr(161))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "_Cost_Amt",
    > >Chr(162))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "_Recv_Amt",
    > >Chr(163))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "AC_IS_",
    > >Chr(164))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "_Fixed",
    > >Chr(165))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "_NB",
    > >Chr(166))
    > >reduce = Application.WorksheetFunction.Substitute(reduce, "_EB",
    > >Chr(167))
    > >reduce = Application.WorksheetFunction.Substitute(reduce,
    > >"Finance_Leases", Chr(168))
    > >reduce = Application.WorksheetFunction.Substitute(reduce,
    > >"Conditional_Sales_Contracts", Chr(169))
    > >End Function
    > >
    > >I've added a few extra functions to the end, which reduce the length of
    > >the original string to 184 characters when used as:
    > >
    > >=reduce(A1)
    > >
    > >The function does not change any original text which is already less
    > >than 255 characters in length. Obviously you can apply this to other
    > >strings by copying down, but you can also extend the UDF to cope with
    > >other patterns - just change the character number at the end of the
    > >line.

    >
    > And thinking holistically, how does this udf help? While it may reduce
    > the length of the 1st argument to VLOOKUP, it's not usable directly for
    > the first column of the 2nd argument to VLOOKUP. It's not even usable
    > directly in an INDEX/MATCH approach because it can't handle range or
    > array arguments and return values. At best, the OP would need to use
    > one additional cell per log file record with each of those cells
    > calling this UDF with the corresponding record value. Calling *any* udf
    > a few thousand times even on a superfast system will bring back fond
    > memories of the responsiveness of 110 baud dial-up mainframe
    > connections.
    >
    > MATCH is afflicted with the same 256 character limit on comparisons as
    > VLOOKUP. However, the = operator isn't so afflicted. Use INDEX/MATCH.
    > Given A1:A6 containing
    >
    > =REPT("0123456789ABCDEF",16)&" - "&CHAR(64+ROW())
    >
    > and B1:B6 containing
    >
    > =ROW()
    >
    > Then instead of VLOOKUP(x,A1:B6,2,0) use arrays formulas like
    >
    > =INDEX(B1:B6,MATCH(TRUE,(A1:A6=x),0))
    >
    > A kludge, but this will be faster than any udf.
    >
    >


+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1