We are working with Excel sheets that have over 100,000 rows trying to compare and remove duplicates. The list is basically a contact list that has company name, email, and some contact data. We have a "do-not-call" list that we compare it to then remove the contacts that show up in the "dnc" list. The problem is a company may be formatted in several ways. For example, the list may contain, "Cogswell's Cogs" or "Cogswell's Cogs, Inc" or "Coswell's Cog Co." or "Chicago Cogswell's Cogs" and multiple other inconsistent variations.
Searching through so many rows in Excel is becoming prohibitive, especially when trying to filter based on fuzzy matches which requires multiple passes, and the list is only growing. We have tried using VLookups with wildcards, parsing and concatenating the data, but it's hard to filter because the variations themselves vary quite a bit. Sometimes the variations actually are different companies! So we've tried using the MS Fuzzy Lookup COM Add-in, but it runs very slowly. Not to mention we're not 100% confident in the results, though they seem to be pretty close.
I'm looking for a better way. I'm sure there must be one. Maybe the answer is 'not use Excel', in which case I'd be open to using other tools. Here's hoping that someone else has a similar experience they can share.
Thanks.
Bookmarks