One of the projects I worked on in 2013 gave me access to a large list of South African ID numbers. Since the South African ID number encodes the person’s birthdate in its first 6 digits, I realised it would be possible to make a South African version of this map of common birthdays in the US. I grabbed digits 3 – 6 (month and day) of each ID and discarded the rest to be safe. From there, it was simple enough to calculate the frequency distribution. The US map shows only the ranking, not the actual distribution, so I have made an interactive version that shows both:
The picture above shows the ranking from 1 to 366. It is kind of interesting, but random things often look like patterns. It gets more interesting when you look at the frequency distribution. The 1st of January is a massive outlier, which was very unexpected given how things look on the US version of the map. The 1st of January sits at about 400,000, more than twice the next highest date, the 10th of October, at about 189,000.
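For anyone who wants to repeat the counting step, here is a minimal Python sketch of the approach described above: keep only digits 3 to 6 of each ID, count how often each month-day pair occurs, then rank the days. The function names and the sample IDs are made up for illustration (the IDs are not real and are not from my data), and this is not necessarily the exact code I used.

```python
from collections import Counter


def birthday_key(id_number: str) -> str:
    """Return digits 3 to 6 (MMDD) of an ID number; everything else is dropped."""
    return id_number[2:6]


def birthday_counts(id_numbers):
    """Count how many IDs fall on each month-day pair."""
    return Counter(birthday_key(i) for i in id_numbers)


def rank_days(counts):
    """Rank days from most to least common (1 = most common).

    Ties are broken by calendar order so the ranking is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {mmdd: rank for rank, (mmdd, _) in enumerate(ordered, start=1)}


# Made-up, obviously fake IDs: YYMMDD followed by zero padding.
sample_ids = [
    "8001010000000",
    "7501010000000",
    "9010100000000",
    "6206160000000",
]

counts = birthday_counts(sample_ids)
ranks = rank_days(counts)

for mmdd, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(rank, mmdd, counts[mmdd])
```

The `Counter` gives the frequency distribution, and the rank is derived from it, which is why it is easy to show both on the same map.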
The spread across days of the week and across the months looks about right. To confirm that there was not a problem with the source data, I ran the same analysis on a 10% sample set from the Census data and saw the same trend.
Either a disproportionately large number of South Africans are New Year’s babies, or (and this is my personal hypothesis) before 1994 a large number of South Africans did not have official ID numbers or birth certificates. A valid ID number was required to vote in the first official democratic election, and the process of allocating ID numbers to those who did not have one must have started before then. If the person applying for the ID number did not have a valid birth certificate, or their date of birth was not known, then they were probably given the date of 1 January and a guess at the year. The other dates that stand out are 2 Feb, 3 Mar, 4 April, etc. This fits the same explanation: if you only know the month, it would be simplest to just match the day to the month numerically, i.e. 2-2, 3-3, 4-4 etc. The remaining days that stand out are the 16th of June and the 25th of December. The anomalies become less pronounced as you narrow the date range to exclude older people. Given that my source data was the voters’ roll, I could not do this for people born in the last 20 years. If I can get access to that data, I will do an updated version.
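One rough way to check the matching day-and-month pattern is to flag dates where the day number equals the month number (1-1, 2-2, ... 10-10 and so on) and the count is well above a typical day. The sketch below assumes the counts are keyed by MMDD strings; the function name, the 1.2 threshold factor and the counts are all made up for illustration and are not figures from my data.

```python
from statistics import median


def flag_matching_dates(counts, factor=1.2):
    """Return MMDD keys where day == month and the count is elevated.

    "Elevated" here means more than `factor` times the median daily count.
    The factor is an arbitrary assumption, not a tested threshold.
    """
    typical = median(counts.values())
    flagged = []
    for mmdd, n in counts.items():
        month, day = int(mmdd[:2]), int(mmdd[2:])
        if month == day and n > factor * typical:
            flagged.append(mmdd)
    return sorted(flagged)


# Made-up counts, for illustration only.
example_counts = {
    "0101": 400,
    "0202": 170,
    "0203": 100,
    "0303": 98,
    "0615": 95,
    "1010": 189,
    "1224": 105,
}

print(flag_matching_dates(example_counts))
```

The median is used as the baseline because a handful of inflated days barely moves it, whereas they would pull the mean upwards. Days like the 16th of June and the 25th of December would not be caught by this check, since the day and month do not match.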