Aggregating data based on a search key, across a very large dataset possible?
Hey friends,
I have a database full of receipt IDs and need to filter out particular records (i.e. receipts belonging to buyers/sellers that no longer exist). I have spent almost 3-4 hours after work, every day, over the last 5 weeks, and have only finished around ~2200 records, one by one. I underestimated how long it would take, as there are around 330,000 records. So I’m 0.67% of the way there...
I’m thinking there must be a faster way to do this in Excel?
So the process is I’m performing:
1) A look-up of a specific string (in the ‘sub_seller’ column) that can appear in one to many (1:M) records in another table’s (‘Search Table’) column called ‘id’, and
2) Once found, across all the rows found, aggregates all the values in the column next to it (‘owner_id’), including the search/lookup string itself, and dumping this result set into the adjacent cell in the original table
3) I’d then like to repeat this process for the other column (obj_buyer) in that same original table.
4) Once the process (lookup, aggregation, data dump) has been performed for both tables, I’d like to join only the common values across the two data dumps only the common values (i.e. INTERSECT operation) into a separate column (‘INTERSECT’).
5) I then have a filter/exclusion table, where if any of the values (id’s) in ‘UNION’ (sorry I just realised I meant to use 'INTERSECT') are found in said table, they are to be excluded from the results (which will help to also exclude all the receipts associated with those excluded id’s
I’ve attached a clean file containing a complete example of this process above. I’ve also put screenshots below to show what each step above looks like / is doing:
(Also just a note, I realised '1636' (lookup value) should also be in the 'aggregate_seller_owners' column, apologies!
Step 1)
Step 2)
Step 3)
Step 4)
Step 5)
Any ideas? Any assistance would be greatly appreciated.
Thanks so much guys!
Last edited by Restrictedz; 08-15-2019 at 07:28 AM.
Reason: I meant 'intersection', not 'union'
Re: Aggregating data based on a search key, across a very large dataset possible?
Apologies, but I thought the pictures attached would fall below each step but doesn't look like to be the case! Hopefully all the pictures are in order for you (they are for me!) - Steps 1 to 5.
Link is here: dropbox dot com/s/hzu0a0nhjk18qv2/receipts_db_Cleaned.xlsx
Last edited by Restrictedz; 08-06-2019 at 10:13 AM.
Re: Aggregating data based on a search key, across a very large dataset possible?
Hello Restrictedz and Welcome to Excel Forum.
Please upload the file directly to this site by clicking on the GO ADVANCED button below the Quick Reply window and then scrolling down to Manage Attachments to open the upload window.
Also your profile states that you are using Excel 2001(?). Prior to Excel 2003 .xlsx was not supported, so I assume the profile to be incorrect. This is important because the proposals offered may be tailored to the version of Excel you have, for example if you are actually using 2016 or later you may be able to use Power Query (Get & Transform) to do some of what you are asking, so please update that information.
Let us know if you have any questions.
Consider taking the time to add to the reputation of everybody that has taken the time to respond to your query.
Re: Aggregating data based on a search key, across a very large dataset possible?
Thanks for your reply JeteMc. I have updated my profile accordingly to correctly reflect the Excel version I'm currently using.
I have tried to upload, but given the size of the original file (16MB), I had to delete a lot of rows to reduce it down to say, 6MB, which, when compressed into a .zip, falls under the 9.79MB file limit for a .zip however I cannot upload it as the website doesn't allow me to: "receipts_db_Cleaned.zip - Upload of file failed.", or I still get the error "file too large to upload". Please advise? Could you perhaps try downloading it via the Dropbox URL I pasted above and uploading it? I'm not sure how else to upload this file, but would really appreciate any assistance.
Last edited by AliGW; 08-11-2019 at 06:39 AM.
Reason: Please don't quote unnecessarily!
Re: Aggregating data based on a search key, across a very large dataset possible?
Deleted a majority of the records in the file to reduce file down as close to 1MB as possible. Attempt #2 at upload.
JeteMc, can I please get your help with this? You're the only one who seemed to notice my post. Steps 4 & 5 is what I'm trying to automate instead of doing this manually which I've completed 2500 rows over the last 7 days, putting in 4 hours each night.
Last edited by Restrictedz; 08-11-2019 at 06:41 AM.
This demonstration works only with numeric ID or its execution raises an error …
To paste to the Database worksheet module :
PHP Code:
Function FilterEval(ByVal S$, ByVal T$)
FilterEval = Filter(Evaluate("IF(ISNUMBER(MATCH({" & S & "},{" & T & "},0)),{" & S & "})"), False, False)
End Function
Sub Demo1()
Const S = ", "
Dim V, X$, R&, T$(3), W
V = ['Search Table'!A1].CurrentRegion.Columns("A:B").Value2
With ['Exclusion Table'!A1].CurrentRegion.Columns(1).Rows
X = Join(Application.Transpose(.Item("2:" & .Count)), S)
End With
Application.ScreenUpdating = False
With CreateObject("Scripting.Dictionary")
For R = 2 To UBound(V)
If .Exists(V(R, 1)) Then .Item(V(R, 1)) = .Item(V(R, 1)) & S & V(R, 2) Else .Add V(R, 1), V(R, 2)
Next
V = [A1].CurrentRegion.Columns("C:D").Value2
For R = 2 To UBound(V)
If .Exists(V(R, 1)) Then T(0) = .Item(V(R, 1))
If .Exists(V(R, 2)) Then
W = .Item(V(R, 2))
T(1) = V(R, 2) & S & W
If T(0) > "" Then
W = FilterEval(T(0), W)
If UBound(W) > -1 Then
T(2) = Join(W, S)
W = FilterEval(T(2), X)
If UBound(W) > -1 Then T(3) = "YES: '" & Join(W, S) & "'"
End If
End If
End If
If T(0) & T(1) > "" Then Cells(R, 5).Resize(, 4).Value2 = T: Erase T
Next
.RemoveAll
End With
Application.ScreenUpdating = True
End Sub
Do you like it ? So thanks to click on bottom left star icon « ★ Add Reputation » !
Last edited by Marc L; 08-11-2019 at 11:10 AM.
Reason: code typo …
This works well with the file attached - thanks so much.
The only issue is that the code breaks on row/line 52412, with a run-time error "Type mismatch" for the FilterEval() function.
Any ideas why it breaks? If you can please refer to the dropbox link (dropbox dot com/s/hzu0a0nhjk18qv2/receipts_db_Cleaned.xlsx)I put in one of my posts above for the full file which I am running your code on.
Unfortunately I cannot attach the file directly here because the file size is too big, even if I compress it...
Mods, I have even uploaded the same file as a .xlsb, which is under file size limit, but I get a "Upload of file failed." error; same happens if I upload a .zip very well under the size limit. I think it's a host issue...
Thanks Marc
Last edited by Restrictedz; 08-12-2019 at 10:39 AM.
As your dropbox is not accessible, the better is to attach a new workbook (no needs much rows)
but this time with all the possible data configurations in order to have all your initial post steps result …
Re: Aggregating data based on a search key, across a very large dataset possible?
Originally Posted by Marc L
No idea, as I warned maybe a non numeric …
As your dropbox is not accessible, the better is to attach a new workbook (no needs much rows)
but this time with all the possible data configurations in order to have all your initial post steps result …
That's what I thought as well, but I can confirm they are all numeric (there is no non-numeric values) - I checked using TYPE(), and all the values in the two (2) columns are '1' (numeric).
Also, the Dropbox link works for me - what did you mean by not accessible? If you can replace the spaces and the 'dot' with an actual ".", it should work fine.
I have attached a snippet of one part which broke (hopefully the root cause of where it breaks is the reason for the rest of the other breaks).
Last edited by Restrictedz; 08-12-2019 at 09:43 PM.
Hi Marc the only reason for that is because I need to reduce the file size to below 1MB. What I intended is if you can please insert the other two tabs from the first workbook into this recently-uploaded workbook and proceed as usual?
Alternatively, copy the contents of the 'Database' tab from the recently-uploaded workbook, into the same tab in the initial workbook uploaded.
Hope that makes sense - really appreciate your help in looking into this Marc.
Last edited by Restrictedz; 08-13-2019 at 07:34 PM.
I finished to succeed to download from dropbox (first time I had such issue, a mess) and I got any issue running Demo1 !
But as my test computer has an old Excel version limited to 65 536 rows max
I will grab later (~ half a day) another computer with a more recent one …
Re: Aggregating data based on a search key, across a very large dataset possible?
Not sure what you meant by: '
Originally Posted by Marc L
So wait for another one may reach your dropbox, I hope you shared the file without displaying and restriction …
Originally Posted by Marc L
I finished to succeed to download from dropbox (first time I had such issue, a mess) and I got any issue running Demo1 !
But as my test computer has an old Excel version limited to 65 536 rows max
I will grab later (~ half a day) another computer with a more recent one …
Thanks for looking into this Marc; I'm interested to see whether we can find what made the code break and how we can resolve it.
Re: Aggregating data based on a search key, across a very large dataset possible?
Originally Posted by jindon
See if this works.
This doesn't cause any breaks (versus Marc's code), and runs through the entire 330,000 rows, but, the aggregation of the strings is buggy for the 'buyer' data. For instance, in the first row, this is the output in column F:
7249884, 72498837249884, 51485837249884, 5115537
However, expected output should be: (Marc's code has no issue with this aspect)
7249884, 7249883, 5148583, 5115537
Looks like it's aggregating the search string (from column D) to some of the results.
Also, there's nothing listed in the 'IN_EXCLUSION' column (column H) - so not exactly sure what the code is doing for records that fall in the exclusion table.
interested to see whether we can find what made the code break
and how we can resolve it.
With all your data I got the same issue, occuring on row #5413 in the FilterEval formula.
I tested with small arrays with same elements # (9 & 33 like 1, 2, ) : no error !
So I suppose it's some formula string size limitation …
This new demonstration where you can follow its execution within the Excel status bar (at screen bottom left)
works as text even with numeric ID, just paste it this time to an empty standard module (or to ThisWorkbook) Edit v3 :
PHP Code:
Sub Demo2()
Const D = "Database", S = ", "
Dim R&, oDic(4) As Object, V, K$, W, T$(), L#, N#, U, X, P%
Sheets(D).UsedRange.Offset(1).Columns("E:H").ClearContents
Application.StatusBar = "Initializing …"
For R = 0 To 4: Set oDic(R) = CreateObject("Scripting.Dictionary"): Next
V = ['Search Table'!A1].CurrentRegion.Columns("A:B").Value2
For R = 2 To UBound(V)
K = V(R, 1)
If oDic(2).Exists(K) Then
W = oDic(2)(K)
ReDim Preserve W(UBound(W) + 1)
W(UBound(W)) = CStr(V(R, 2))
oDic(2)(K) = W
Else
oDic(2).Add K, Array(CStr(V(R, 2)))
End If
Next
V = ['Exclusion Table'!A1].CurrentRegion.Columns(1).Value2
For R = 2 To UBound(V): oDic(3)(CStr(V(R, 1))) = "": Next
V = Sheets(D).[A1].CurrentRegion.Columns("C:D").Value2
ReDim T(2 To UBound(V), 3)
Application.StatusBar = "1%"
L = UBound(V) / 20
N = L
For R = 2 To UBound(V)
K = V(R, 1)
If oDic(2).Exists(K) Then U = oDic(2)(K): T(R, 0) = Join$(U, S)
K = V(R, 2)
If oDic(2).Exists(K) Then
W = oDic(2)(K)
T(R, 1) = K & S & Join$(W, S)
If T(R, 0) > "" Then
For Each X In W: oDic(0)(X) = "": Next
For Each X In U
If oDic(0).Exists(X) Then
oDic(1)(X) = ""
If oDic(3).Exists(X) Then oDic(4)(X) = ""
End If
Next
oDic(0).RemoveAll
If oDic(1).Count Then
T(R, 2) = Join$(oDic(1).Keys, S)
oDic(1).RemoveAll
If oDic(4).Count Then T(R, 3) = "YES: '" & Join$(oDic(4).Keys, S) & "'": oDic(4).RemoveAll
End If
End If
End If
If R >= N Then P = P + 5: Application.StatusBar = P & "%": N = N + L
Next
oDic(2).RemoveAll: oDic(3).RemoveAll
Erase oDic
Application.StatusBar = "Writing data …"
DoEvents
Sheets(D).Rows("2:" & UBound(T)).Columns("E:H").Value2 = T
Application.StatusBar = False
End Sub
Do you like it ? So thanks to click on bottom left star icon « ★ Add Reputation » !
Last edited by Marc L; 08-16-2019 at 06:06 PM.
Reason: optimization …
Re: Aggregating data based on a search key, across a very large dataset possible?
Originally Posted by jindon
If this doesn't work, just ignore.
Thanks for your attempt to help with this jindon; this worked really well as well but has errors in the final column (not outputting correct results; false positives etc); but have also given you reputation for looking into this and giving it ago as I would say this was a challenging task.
Originally Posted by Marc L
With all your data I got the same issue, occuring on row #5413 in the FilterEval formula.
I tested with small arrays with same elements # (9 & 33 like 1, 2, ) : no error !
So I suppose it's some formula string size limitation …
This new demonstration where you can follow its execution within the Excel status bar (at screen bottom left)
works as text even with numeric ID, just paste it this time to an empty standard module (or to ThisWorkbook) Edit v2 :
Thanks so much Marc, this is great. I also suspect it must have something to do with the 255 string limit that caused the break.
Consider this closed/solved. Marc, you've saved me with countless months of work! I estimated this task would have taken me another 6-7 months every weekend... bless y
Bookmarks