RANDBETWEEN - maybe not so random

**Imbizile** · 02-28-2018, 02:05 PM

I have not gone into the details of the statistical calculation here, but I'm quite puzzled by the following.

I need to generate usernames with 2 letters (from A-Z), followed by 2 numbers (from 0-9) and then again 2 letters, for example AB45DE.

I made a formula to do that:

=CHAR(RANDBETWEEN(65,90))&CHAR(RANDBETWEEN(65,90))&RANDBETWEEN(0,9)&RANDBETWEEN(0,9)&CHAR(RANDBETWEEN(65,90))&CHAR(RANDBETWEEN(65,90))

I'm not much into probability calculations, but this should have more than 45 million combinations, i.e. (26*26*10*10*26*26) = 45,697,600.

I then do this for 10,000 rows, hoping for 10,000 unique usernames. But what happens more often than not is that I end up with 1, 2, 3 or more duplicates in the 10,000 rows. This seems very strange and unlikely to me.

I made a sheet to illustrate this. It generates the names in column A, shows in column B whether each one is duplicated (2 or more means a duplicate), and puts the sum of duplicates in D1. It runs when you click the Randomize button (this takes a little while), and it converts the formulas to values at the end so the results stay the same until the next run. Try running it a few times and you will get duplicates.

Can it really be statistically correct that you get duplicates so often in 10,000 rows out of 45 million possible combinations, or is Excel's RANDBETWEEN really not that random after all?

**6StringJazzer** · 02-28-2018, 02:30 PM

Two things to note. First, software does not generate random numbers. It generates pseudo-random numbers. This means, roughly, that a sequence of numbers should be intractably hard to predict and should approximate a random distribution, but is not the same as true randomness. This is a pretty big topic and I am not a mathematician so am limiting my explanation here.

Second, even with true random numbers, if you generate 10,000 numbers at random with replacement from a population of 45,697,600, the chances are, by rough calculation, at least 1/9 of having at least two duplicates in that list. (This is a bit back-of-the-envelope but I'll show my work if you're really interested.)

If you need to generate 10,000 strings that are guaranteed to be unique then I would use VBA code.
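A minimal sketch of what such VBA code could look like, not a definitive implementation. The names `GenerateUniqueUsernames` and `RandomLetter` are hypothetical, writing to column A of the active sheet is an assumption, and the late-bound `Scripting.Dictionary` is only available on Windows. Note that VBA's `Rnd` is a different generator from the worksheet's RANDBETWEEN.

```vba
' Sketch only: procedure names and the output location (column A) are assumptions.
Sub GenerateUniqueUsernames()
    Const N As Long = 10000          ' how many unique usernames to produce
    Dim seen As Object               ' dictionary of usernames already issued
    Dim result() As Variant          ' output buffer, written to the sheet in one go
    Dim candidate As String
    Dim made As Long

    Set seen = CreateObject("Scripting.Dictionary")
    ReDim result(1 To N, 1 To 1)
    Randomize                        ' seed Rnd from the system timer

    Do While made < N
        ' Pattern: 2 letters, 2 digits, 2 letters (for example AB45DE)
        candidate = RandomLetter() & RandomLetter() & _
                    CStr(Int(Rnd() * 10)) & CStr(Int(Rnd() * 10)) & _
                    RandomLetter() & RandomLetter()
        ' Keep the candidate only if it has not been issued before;
        ' otherwise loop again and draw a fresh one.
        If Not seen.Exists(candidate) Then
            seen.Add candidate, True
            made = made + 1
            result(made, 1) = candidate
        End If
    Loop

    ActiveSheet.Range("A1").Resize(N, 1).Value = result
End Sub

' Returns one random uppercase letter A-Z (character codes 65 to 90).
Private Function RandomLetter() As String
    RandomLetter = Chr$(65 + Int(Rnd() * 26))
End Function
```

Because rejected candidates are simply redrawn, the loop always ends with exactly N distinct names, and with roughly 45 million possibilities it rarely needs more than a few extra draws.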

**Glenn Kennedy** · 02-28-2018, 02:33 PM

Originally Posted by 6StringJazzer

This is a bit back-of-the-envelope but I'll show my work if you're really interested.

I am... and would be interested to see... Since I'm not the OP, don't bother if it's going to be a lengthy discourse!!

**6StringJazzer** · 02-28-2018, 04:00 PM

Originally Posted by Glenn Kennedy

I am... and would be interested to see... Since I'm not the OP, don't bother if it's going to be a lengthy discourse!!

In writing out my answer I found an error, so let's wait on that.

But ben_hensel said it well--sometimes math clashes with intuition. If you have 23 randomly chosen people, chances are about 50% that you'll get at least one duplicate birthday.
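The birthday figure can be checked with a small user-defined function. This is a sketch under stated assumptions: `ProbAllDistinct` is a hypothetical name, it must be placed in a standard module to be callable from a cell, and it assumes every value in the pool is equally likely.

```vba
' Sketch only: ProbAllDistinct is a hypothetical helper name.
' Returns the probability that `picks` random draws (with replacement)
' from `poolSize` equally likely values are all different.
Function ProbAllDistinct(ByVal picks As Long, ByVal poolSize As Double) As Double
    Dim i As Long
    Dim p As Double
    p = 1
    For i = 0 To picks - 1
        ' Draw number i+1 must avoid the i values already taken.
        p = p * (1 - i / poolSize)
    Next i
    ProbAllDistinct = p
End Function
```

Entered in a cell, `=1-ProbAllDistinct(23,365)` gives roughly 0.507, the familiar "23 people" result. The same function can be pointed at the username case with `=1-ProbAllDistinct(10000,45697600)`.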

**Glenn Kennedy** · 02-28-2018, 03:11 PM

Forget it. I was asleep. The first letter and digit don't count. So it's 26*26*26*10

**ben_hensel** · 02-28-2018, 03:21 PM

RAND(), and by extension RANDBETWEEN(), kind of have a reputation for being not all that great. You can read the gossip.

Specifically for your case, though, this is in the same family as the Birthday problem: the probability that random events/numbers will duplicate. (The short answer is that duplicates are more likely than naive intuition would assume.)

If you just want to generate a bunch of mostly-random usernames as a one-off, then I would tell you to generate extras, and then just cull the duplicates out.
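One way the culling step could be automated, as a sketch only. The name `CullDuplicateUsernames` is hypothetical, and the code assumes the usernames sit in column A starting at A1 with no header row and have already been pasted as values, so no live RANDBETWEEN formulas remain to recalculate.

```vba
' Sketch only: the procedure name and the column A layout are assumptions.
Sub CullDuplicateUsernames()
    Dim lastRow As Long
    With ActiveSheet
        ' Find the last used row in column A.
        lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row
        ' Keep the first occurrence of each username and delete the repeats.
        .Range("A1:A" & lastRow).RemoveDuplicates Columns:=1, Header:=xlNo
    End With
End Sub
```

Generating a few dozen more rows than needed before running this should leave comfortably more than 10,000 unique names, from which the first 10,000 can be taken.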

**MarvinP** · 02-28-2018, 05:45 PM

Hi Imbizile,

Random numbers don't mean they won't repeat. If you think of a 6-sided die and you roll it twice, could you get the same number again (at random)? Of course you could. Just because you have 45 million possible sides on the die doesn't mean it won't repeat.

Now if you want to ensure you get unique names, simply filter your B column, in your example, by 1 so any duplicate counts won't show. Click your button and sort column B from Large to Small. Then drop down the B column filter and unselect the 2. You have a good answer then: unique random names.

**shg** · 02-28-2018, 06:06 PM

The probability of picking 10,000 distinct items when sampling with replacement from 45,697,600 is ...

{=PRODUCT(1 - (ROW(INDIRECT("1:10000")) - 1)/45697600)} ~ 33%

You should see the same result if you enter =randbetween(1, 45697600) down 10,000 rows.

If you just generate a few extra values the way you are, you could copy, paste as values, and then do Remove Duplicates.
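As a rough cross-check of the 33% figure, the product can be approximated with the standard birthday-problem estimate. The symbols $n$ (number of rows) and $M$ (number of possible usernames) are introduced here for illustration only.

$$
P(\text{all distinct}) = \prod_{i=0}^{n-1}\left(1 - \frac{i}{M}\right) \approx \exp\!\left(-\frac{n(n-1)}{2M}\right) = \exp\!\left(-\frac{10000 \cdot 9999}{2 \cdot 45697600}\right) \approx e^{-1.094} \approx 0.335
$$

The exponent, about 1.09, is also the expected number of duplicate pairs among the 10,000 rows, which is in line with typically seeing 1, 2 or 3 duplicates per run.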

**leelnich** · 02-28-2018, 07:02 PM

Odds of 1 or more duplicates in 10000 rows : 66.52 %
(The loop calcs the probability of 10000 uniques, which is then subtracted from 1.)

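Sub CalcOdds()
    Dim M As Double, i As Double, k As Double
    M = 26 ^ 4 * 10 ^ 2    ' number of possible usernames: 45,697,600
    k = 1                  ' running probability that all rows so far are unique
    For i = 0 To 9999
        k = k * (1 - i / M)    ' row i+1 must miss the i names already used
    Next
    MsgBox "Odds of 1 or more duplicates in 10000 rows: " & Format(100 * (1 - k), "0.00") & "%"
End Sub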

**Imbizile** · 03-01-2018, 07:45 AM

Thanks everyone for the input. And yes, I do realize now that the probability of duplicates is actually much higher than I first anticipated. I don't actually use the exact setup I have in the demo file here; it was just meant to illustrate what I meant. But I made some adjustments to my actual setup to make sure duplicates do not appear (or rather, they are generated again if they occur).

I just thought it was not likely to have that many duplicates, but after some research based on the input here, specifically on the Birthday problem, I can see why it is in fact very likely. This site explains it very well, btw: https://betterexplained.com/articles...thday-paradox/

I believe the 66.52% probability is the correct one.

Thanks again everyone.

**MarvinP** · 03-01-2018, 01:47 PM

Hi Imbizile,

I smiled when you wrote:

I believe the 66.52% probability is the correct one.

shg (who is rarely wrong) gave the probability of picking 10,000 distinct items, while leelnich gave the probability of "1 or more duplicates". I think they may both be right.

I used to teach statistics, and there were two types of problems where reading the problem carefully was imperative. They involved putting black and white marbles in a bag and drawing them out. One problem asked for the probability of drawing marbles out "with replacement" and the other with NO replacement. Many times students would confuse the two types of games. Reread both answers above and see if they might both be correct.

**Imbizile** · 03-01-2018, 06:05 PM

Ahh yes I see that now - misread the first one
