+ Reply to Thread
Results 1 to 13 of 13

Detecting transitions between alpha/numeric characters

  1. #1
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Detecting transitions between alpha/numeric characters

    I am trying to clean out an e-mail newsletter list w/ nearly a million rows compiled over 10+ years. I've gotten the list down to <100k entries and now the tricky ones are left (there's an unbelievable amount of "bot" entries). I'm having a tough time separating these from genuine addresses. I'd like to:
    1. find any instance where more than 5 numbers exist in the first half of an address
    2. find any instance where there are more than 2 transitions between numbers and letters in the first half of the address

    Can you help? Here's a sample of some addresses:
    506ecc94.8070804@gardenoffrancis.com
    20a12.2f2bceae.3d7fe7f5@aol.com
    500fd301.5090605@hazelst.com
    50d4fa18.8010200@gmail.com
    50b4090c.2060102@gmail.com
    50adc5a2.3010904@gmail.com
    54824.71a65b4a.3d415bc3@aol.com

  2. #2
    Forum Expert
    Join Date
    01-15-2007
    Location
    Brisbane, Australia
    MS-Off Ver
    2007
    Posts
    6,591

    Re: Detecting transitions between alpha/numeric characters

    Hi

    Define "first half of an address". Is it the first half of the part before the @? Or really the first half of the entire address? Also, for your examples, show which ones would be identified (and why) and which would be OK.

    rylo

  3. #3
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Re: Detecting transitions between alpha/numeric characters

    Yes, first half of the address is everything before @.
    All of those addresses would be identified as bad. None of those are good address (which is why I shared them publicly). I could provide a much longer list if necessary.

  4. #4
    Forum Expert
    Join Date
    01-15-2007
    Location
    Brisbane, Australia
    MS-Off Ver
    2007
    Posts
    6,591

    Re: Detecting transitions between alpha/numeric characters

    Hi

    Assume your example data is in the range A1:A7, then
    B1: =IF(SUM(IF(ISNUMBER(MID(F1,ROW($1:$30),1)*1),1,0))>5,"Too Many Numbers","") This formula is array entered (ctrl, shift, enter)
    This should identify the items with too many numbers.

    For the second criteria, is it only a one direction transitions from numbers to letters or can it be letters to numbers as well? So 5a5a5a would be a problem but a5a5a5 would not?

    rylo
    rylo

  5. #5
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Re: Detecting transitions between alpha/numeric characters

    transitions in either direction can be counted, so 5a5a5a and a5a5a5 would both be equally problematic. It's really the transition that's significant. For instance, there seem to be a lot of people who will do something like, "bill90210@domain.com" - where they use a zip code - some type of a year -- but those almost always come at the beginning or end of the name portion of an address.

    As for the solution you provide for telling how many numbers are in a string, I'm not sure that does what I need...I'm particularly stuck on the ROW() function...I don't see why that's necessary...I'll play with it a bit, though. Thanks!

  6. #6
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Re: Detecting transitions between alpha/numeric characters

    I'll be darned. Clearly, your kung fu is better than my kung fu. That function works great! Thanks!

  7. #7
    Forum Expert JBeaucaire's Avatar
    Join Date
    03-21-2004
    Location
    Bakersfield, CA
    MS-Off Ver
    2010, 2016, Office 365
    Posts
    33,492

    Re: Detecting transitions between alpha/numeric characters

    Don't let the ROW() trick you in the middle like that. Notice it is in the middle of a MID() function? The second parameter of the MID is the "start" position. He is using ROW($1:$30) as a way of creating an array of the numbers 1-30 so that each letter in those positions will be individually checked to see if it's a number. The CTRL-SHIFT-ENTER effectively runs that formula once for each of the values 1-30, and the SUM is counting how many times the numeric test comes back as true.

    If the SUM is greater than 5, so more than 5 numbers were found in the first half of the string, you will get a "bad" result. My guess is this is going to be pretty accurate, will surely misconnect to a few, but not many. You can UP the >5 to >7 if you think that's a better minimum.

    ===============
    If that takes care of your original query, please select Thread Tools from the menu above and mark the thread as solved. Thanks.
    _________________
    Microsoft MVP 2010 - Excel
    Visit: Jerry Beaucaire's Excel Files & Macros

    If you've been given good help, use the icon below to give reputation feedback, it is appreciated.
    Always put your code between code tags. [CODE] your code here [/CODE]

    ?None of us is as good as all of us? - Ray Kroc
    ?Actually, I *am* a rocket scientist.? - JB (little ones count!)

  8. #8
    Forum Expert
    Join Date
    01-15-2007
    Location
    Brisbane, Australia
    MS-Off Ver
    2007
    Posts
    6,591

    Re: Detecting transitions between alpha/numeric characters

    Hi

    Actually, I forgot a step I'd done in the testing

    real formula should be

    =IF(SUM(IF(ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)/2),ROW($1:$30),1)*1),1,0))>5,"Too Many Numbers","")

    The E1 and F1 in the original did the "first half of the string" bit,and I've now put it into the formula directly.

    Now to start thinking about the transition.....

    rylo
    Last edited by rylo; 01-17-2013 at 08:05 PM.

  9. #9
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Re: Detecting transitions between alpha/numeric characters

    JBeaucaire, thanks fir the explanation! That's fantastic.

    The counting function works so good, that I don't think I even need the other one...and probably even better w/ the update, Rylo. Thanks!

    I did find that there are quite a few e-mail addresses w/ 6 numbers and even a few with up to 9. I do have name fields though, and I'm pretty happy w/ this result...
    If there are >5 numbers in a string and nothing in any of the name fields, then I can get rid of all of those and lose ~0.01% of valid e-mails (that's acceptable). If I want to protect the majority of that 0.01%, then I just manually sort through the few dozen whose first character comes after "f". It seems the bots like HEX.

  10. #10
    Registered User
    Join Date
    01-17-2013
    Location
    San Francisco
    MS-Off Ver
    Excel 2010
    Posts
    6

    Re: Detecting transitions between alpha/numeric characters

    Almost 100k seemingly valid addresses, or ~9% Thanks for the final push.

  11. #11
    Forum Expert JBeaucaire's Avatar
    Join Date
    03-21-2004
    Location
    Bakersfield, CA
    MS-Off Ver
    2010, 2016, Office 365
    Posts
    33,492

    Re: Detecting transitions between alpha/numeric characters

    Rylo, my suggestion at a tweak:

    =IF(SUM(IF(ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)-1),ROW($1:$30),1)*1),1,0))>5,"Too Many Numbers","")


    Quietsnow: If that takes care of your original query, please select Thread Tools from the menu above and mark the thread as solved. Thanks.

  12. #12
    Forum Expert
    Join Date
    01-15-2007
    Location
    Brisbane, Australia
    MS-Off Ver
    2007
    Posts
    6,591

    Re: Detecting transitions between alpha/numeric characters

    Hi

    OK, try this for the transitions

    =IF(SUM(IF((ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)/2),ROW($1:$30),1)*1))*(NOT(ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)/2),ROW($2:$31),1)*1))),1,0))+SUM(IF((NOT(ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)/2),ROW($1:$30),1)*1)))*(ISNUMBER(MID(LEFT(A1,SEARCH("@",A1)/2),ROW($2:$31),1)*1)),1,0))>3,"Too Many Transitions","")

    rylo

  13. #13
    Forum Expert
    Join Date
    01-15-2007
    Location
    Brisbane, Australia
    MS-Off Ver
    2007
    Posts
    6,591

    Re: Detecting transitions between alpha/numeric characters

    Jerry

    Good catch on the tweak. I was still in the mode of all before the @ was the "address", so had to take half of that.

    quietsnow

    you will have to update the transition formula for the tweak Jerry suggested.

    rylo

+ Reply to Thread

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts

Search Engine Friendly URLs by vBSEO 3.6.0 RC 1