ewams

Analyzing SSA Name Data - Letters


Continuing the analysis of the Social Security Administration's baby names, this time we are going to look at the letters in the names. Part 1 of the analysis looked at total names, which was fun, but let's go a little deeper. All results are for first names only. All data is publicly provided by the SSA, and no, I do not have a text file with your social security number.
Chart 1 - Our first chart in this post looks at the length of names, as in how many letters are in each name. This chart is cumulative, so even duplicate names are included in the results. There are more female than male names, so not surprisingly the pink overshadows the blue. The actual results for both genders are very similar. For females the most common name length is 6, then 7, then 5, while for males it is 6, then 5, then 7. The shortest names for both genders had 2 letters (Jo, Al, Wm), while the longest for both had 15 (Johnchristopher and Mariadelrosario). FYI, the SSA states on their website that they remove non-alpha characters such as "-" and spaces from names, so "John-Anthony", "John Anthony" and "Johnanthony" all become the same result. I think they must also truncate names longer than 15 characters, because I can see names in the results like "Christophermich", which I assume is supposed to be "Christopher-Michael".
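For anyone who wants to try the length count themselves, here is a minimal sketch in Python. It is not the code behind the chart. It assumes rows shaped like the SSA files (name, sex, count), and the sample rows and the `length_totals` name are made up for illustration.

```python
from collections import Counter

# Hypothetical sample rows in a (name, sex, count) layout.
# The counts are invented for illustration, not real SSA figures.
SAMPLE_ROWS = [
    ("Anna", "F", 10),
    ("Eric", "M", 5),
    ("Jo", "F", 2),
    ("Johnchristopher", "M", 1),
]


def length_totals(rows):
    """Return {sex: Counter({name_length: total_count})}.

    Each row is weighted by its count, so duplicate names are
    included in the totals (the "cumulative" view).
    """
    totals = {"F": Counter(), "M": Counter()}
    for name, sex, count in rows:
        totals[sex][len(name)] += count
    return totals


totals = length_totals(SAMPLE_ROWS)
for sex in ("F", "M"):
    # Print name lengths from shortest to longest for each gender.
    print(sex, sorted(totals[sex].items()))
```

Weighting by the count column is what makes a popular name pull its length's bar up more than a rare name does.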




Chart 2 - This next chart is actually pretty neat, and I stumbled on it by accident. It compares the first and last letters used by every name (including duplicate names). The letter "A" is used the most for both the first and last letter (thanks Anna!). The letters used at the start of names are more evenly distributed than the last letters: pretty much every letter is used as the first letter of names fairly regularly, except for Q, U, and X. The most used letters for the start of names are A, then M, then J, with a very close battle for 4th between C, D, K, L, and S.

For the last letters in names the most popular is A, then E, then N, with those 3 letters being used the vast majority of the time. The fourth most commonly used last letter is Y, but it is used less than half as much as N. You can click on the legend above the chart to show or hide metrics, for example hiding First letter to see the last-letter details better. After A, E, N, and Y the remaining letters are used fairly infrequently. So Americans are picky about which letters come at the end of their names, but not the start?
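The first and last letter tally can be sketched the same way. This is only an illustration under the same assumptions as before: rows are (name, sex, count), the sample counts are invented, and `first_last_counts` is a name I made up.

```python
from collections import Counter

# Hypothetical sample rows (name, sex, count); counts are invented.
SAMPLE_ROWS = [
    ("Anna", "F", 10),
    ("Eric", "M", 5),
    ("Mary", "F", 4),
    ("John", "M", 3),
]


def first_last_counts(rows):
    """Tally first and last letters, weighting each name by its count."""
    first, last = Counter(), Counter()
    for name, _sex, count in rows:
        letters = name.upper()  # treat "a" and "A" as the same letter
        first[letters[0]] += count
        last[letters[-1]] += count
    return first, last


first, last = first_last_counts(SAMPLE_ROWS)
print("first:", first.most_common())
print("last: ", last.most_common())
```

A name like "Anna" counts towards A in both tallies, which is part of why A does so well on both sides of the chart.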




Chart 3 - Looking deeper into the data from the previous chart, now we look at just the first letter used in each name (including duplicates), broken up by gender. It is kind of interesting to see letters like A, M and S more heavily skewed towards females than males, though the data overall is skewed towards females since there are more total females than males in the data. Apparently, more males than females have names that start with Q and U.
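Since the raw totals lean female, one possible way to compare the genders on equal footing is to turn each letter's count into a share of that gender's total. The chart itself shows raw counts, so treat this as a side sketch: the rows, counts, and the `first_letter_share` name are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows (name, sex, count); counts are invented.
SAMPLE_ROWS = [
    ("Anna", "F", 6),
    ("Mary", "F", 4),
    ("Adam", "M", 3),
    ("Mark", "M", 1),
    ("Quinn", "M", 1),
]


def first_letter_share(rows):
    """Return {sex: {letter: fraction of that sex's total count}}."""
    counts = defaultdict(Counter)
    for name, sex, count in rows:
        counts[sex][name[0].upper()] += count

    shares = {}
    for sex, letter_counts in counts.items():
        total = sum(letter_counts.values())
        shares[sex] = {
            letter: n / total for letter, n in letter_counts.items()
        }
    return shares


shares = first_letter_share(SAMPLE_ROWS)
for sex, by_letter in shares.items():
    print(sex, {letter: round(share, 2) for letter, share in by_letter.items()})
```

With shares instead of counts, a letter only looks skewed towards one gender if it really takes up a bigger slice of that gender's names.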




Chart 4 - Now we look at all letters of the alphabet and how many times they are used in every single name through the years. A name like "Anna" increases the count for A and N by 2 each, while "Eric" only increases the count by 1 for each of the letters E, R, I, and C. A takes a massive lead, followed by E, then N, then I. The least used letter is X, then Q, then W, F, P, and Z. Both F and P surprised me initially, but it makes sense because both are really only used at the start of names.
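The Anna and Eric example translates to code fairly directly. This is a sketch, not the code behind the chart: the `letter_totals` name is made up, rows are assumed to be (name, sex, count), and I am assuming each name is weighted by its count, with both sample counts set to 1 so the numbers match the example above.

```python
from collections import Counter


def letter_totals(rows):
    """Count every letter in every name, weighted by the name's count."""
    totals = Counter()
    for name, _sex, count in rows:
        # Counter("ANNA") gives {"A": 2, "N": 2}, so repeated letters
        # inside one name are counted each time they appear.
        for letter, times in Counter(name.upper()).items():
            totals[letter] += times * count
    return totals


# Hypothetical rows with a count of 1 each, matching the example above.
totals = letter_totals([("Anna", "F", 1), ("Eric", "M", 1)])
print(sorted(totals.items()))
```

Letters that never appear, like X here, simply come back as 0 when you look them up in the `Counter`.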




Written by Eric Wamsley
Posted: May 9th, 2020 8:12am
Topic: Chartjs
Tags: chartjs


 ©Eric Wamsley - ewams.net