I'm pretty sure there's no way to do this in excel, but hoping someone can prove me wrong...
I have a large-ish data set of ~1200 columns of data, each about ~1200 rows tall. The columns represent profiles, and the rows represent attributes of each profile. The attributes are scored 0 - 4, and we can assume the scale is consistent and each level of it equal to another.
What I want to do is create a matrix that shows the names of each profile across the top X axis (1200 columns wide) and those same names along the left side Y axis (1200 rows tall) and where each column and row intersect I want to show a % that expresses how closely the two profiles match each other by comparing their values for each row of attributes.
I have the formulas I want to use for this, meaning I can calculate the "match %" number, but my problem is I can't figure out how cram the math into a single cell so that I can do this in each intersection of row and column. I've got it down to using one "helper" column, but trying to put in like ~600 helper columns in order to compare all 1200 profiles with each other sounds like a nightmare. The data I'm using for this is just one set of many that I need to provide this analysis on, so I need this to work efficiently-ish.
I have started looking at creating macros to attempt to automate the process, but I'm getting a bit lost with how this should work. And deep down I can't help but wonder if I was just better with array formulas if I could fit all this into the one cell and have a simple and easy solution after all. Can any Obi-Wan's out there help??
Here is some example data, very simplified... actual data has 1200 rows of attributes, however typically only 20 attributes have values (median count).
Column A . Column B . Column C . Column D .
Profile 1 . Profile 2 . Profile 3
Attribute 1 . 0 1 0
Attribute 2 . 0 2 3
Attribute 3 . 1 0 1
Attribute 4 . 0 1 1
Attribute 5 . 4 4 0
Attribute 6 . 0 1 0
Attribute 7. 3 0 0
So on and so forth........
The above sample is a bit misleading, as the real data has 1200 rows of attributes but on average only 20 of them actually have values. This is why, when both profiles have values in the same attribute, comparing them and considering how close those values are to each other in the equation that calculates 'match %' is important.
There are so many zeros that it's not really useful to include them in any of my calculations, so they are basically ignored except when one job has a zero, and the other has a value... in that case I do include the row in the COUNT used to determine "total number of potential matching attributes" that is the divisor in step #2 below.
So, I have two formulas that I'm using.
Step #1
The first one relies on a "helper column" that I insert between two profile columns and copy/paste down all 1200 rows so that it shows a number anytime both profiles on either side of it have a value in the same attribute. It expresses a ratio meant to articulate how close the two numbers are to each other by adding both together then dividing the sum by 2 * the larger of the two numbers. In other words if both are 1's it will show a 1.0 and if both are 4's it will show a 1.0, but if one is a 1 and the other is a 4 it will show .625 and if one is a 2 and the other a 3 it will show .833 because the 2 & 3 are closer than the 1 and 4 but not as close as two 4's or two 1's. Make sense?
Step #2
Once I've calculated the above for all 1200 rows, I sum that up and divide it by the total number of potential matching attributes, which I've decided to define as the number of attributes of the profile which has more attributes between the two. So, if one profile has 17 attributes with values above zero and the other has 39 attributes with values above zero, I do the sum discussed in Step #1, then I divide that by 39 and format the output as a % to express my "match %" for the two profiles.
Here are the two formulas in question, the 'Step 1' formula assumes it is between two profiles in A5:A1198 and C5:C1198;
Step 1a (this is in a "helper" column B for all ~1200 rows):
=IFERROR(IF(AND(A5>0, C5>0), (A5+C5)/(MAX(A5, C5)*2), ""), "")
Step 1b (this just sums up the above in cell B1199):
=SUM($B$5:$B$1198)
Step 2a (this creates the divisor):
=((SUMPRODUCT((A$5:A$1198>=1)+(C$5:C$1198>=1)))-(COUNTIFS(A$5:A$1198, ">=1", C$5:C$1198, ">=1")))
Step 2b (this divides Step 1b over Step 2a):
=B1199/B1200
I don't know if anyone can help with this... this is a hail mary here. I have been beating my head against this for over a week, and am about to give up. If someone can provide a single cell formula that does all this, I'd be willing to pay for the consulting if that's a motivation. Or post praises all over the interwebs, or put your picture on a t-shirt and wear it around.. whatever. Thanks for reading!
Bookmarks