# How do baby names come and go?

**Quick caveat**: In this analysis I used government data that assumes binary biological sex.

#### I’m in my mid-thirties, and as many of my friends start families of their own, I’m learning lots of baby names. I’ve heard people say that “older” names are becoming popular again, and the names I’m hearing suggest there’s something to this.

#### One of my good friends has a sweet, one-year-old baby named Vera. It’s a beautiful, old-fashioned name for sure, but is it becoming popular again?

#### After plotting some data from the Social Security Administration (see plot to the right), it does look to be making a comeback. Funny aside: It turns out the name Ariel spiked in popularity after Disney's 1989 release of The Little Mermaid.

#### This made me wonder more generally about names and their trends. Do complicated dynamics drive the popularity of names? Does popularity come and go in waves? Or do most names peak for a bit and then fade into history? I recently attended SciPy 2019, where I saw a great session on time series clustering. This question seemed like a great problem for trying out some of the methods I learned there, such as dynamic time warping (DTW).
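To give a flavor of what DTW does (this is a minimal textbook sketch, not the exact implementation used in the analysis): unlike a pointwise distance, DTW lets two series stretch and shift in time to find the cheapest alignment, so two name-popularity curves with the same shape but offset peaks still register as similar.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] and b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Two toy "popularity curves" with the same shape but shifted peaks:
s1 = [0, 1, 3, 1, 0, 0]
s2 = [0, 0, 1, 3, 1, 0]
print(dtw_distance(s1, s2))                      # → 0.0 (shapes align after warping)
print(sum(abs(x - y) for x, y in zip(s1, s2)))   # → 6 (pointwise distance is large)
```

This is why DTW is a natural distance for clustering trend shapes: it groups names whose curves look alike even when their peaks land in different decades.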

#### I ran an analysis and found the following:
- While there are some clear clusters in name popularity over time, there is also a ton of heterogeneity. Bottom line: you won't be able to neatly categorize the rise and fall of names with a few simple rules.
- Although I pulled out more clusters for boys, it seems like there is more complexity in girl naming trends. See the final girl name cluster, for example, which the algorithm couldn't disentangle.

#### Here are the name trend clusters I was able to pull out. Click on the links below to see the full plots for each (each line in a plot represents a unique name's popularity over time). I also share a few exemplars for each (the names with the smallest DTW distance to the center of their cluster). Note: to simplify things I excluded names with fewer than 10,000 occurrences over the last 100 years. See this notebook for more details on the analysis.

### Boy name clusters:

**Cluster #1:** *Not popular before but exploding in popularity over the last decade* (e.g. Owen, Cruz, Carter)

**Cluster #2:** *A sharp peak in the mid-20th century but that’s it* (e.g. Dale, Roger, Tony)

**Cluster #3:** *Peaked in the late 90s / early aughts but dying out* (e.g. Jacob, Trenton, Brennan)

**Cluster #4:** *Very old names that have died out* (e.g. Archie, Walter, Louis)

**Cluster #5:** *Popular towards the end of the 20th century but dying out* (e.g. Timothy, Brian, Eric)

### Girl name clusters:

**Cluster #1:** *Super popular the last two decades but mostly dropping off* (e.g. Arianna, Sophia, Makenzie)

**Cluster #2:** *Old-timey names that have died out (with some making a comeback)* (e.g. Flora, Maxine, Lillie)

**Cluster #3:** *Wildcards / difficult to cluster!* (e.g. Melissa, Amy, Erin)

# An Automated Elbow Method

#### For some types of unsupervised learning analyses, machine learning practitioners have typically needed to examine a plot and make a somewhat subjective judgment call to tune the model (the so-called "elbow method"). I can think of two examples of this, but others certainly exist:

#### 1) In any sort of clustering analysis: finding the appropriate number of clusters by plotting the within-cluster sum of squares against the number of clusters.

#### 2) When reducing feature space via PCA or a Factor Analysis: using a Scree plot to determine the number of components/factors to extract.
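To make the second case concrete, here is a minimal sketch (with made-up synthetic data) of the values you would eyeball on a scree plot: the eigenvalues of a covariance matrix where only two latent factors carry real signal, so the elbow should appear after the second component.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 samples where 5 observed features are noisy mixes of 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Scree values: eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
print(eigvals)  # the first two dominate; the "elbow" sits after component 2
```

With a plot this would be an easy visual call; the point of the next section is making that call without a human in the loop.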

#### For one-off analyses, using your eyeballs and some subjectivity might be fine, but what if you are using these methods as part of a pipeline in an automated process? I came across a very simple and elegant solution to this, which is described by Mu Zhu in this paper. Lots of heuristics exist to solve this but I've found this method to be particularly robust.

#### Zhu's idea is to take the data you would typically plot to identify the elbow/kink and treat it as a composite of two different samples, separated by the cutoff you are trying to identify. You then loop through all possible cutoffs, looking for the one that maximizes the profile log-likelihood (using each sample's mean and a pooled SD in the calculations). Below is code I created to implement Zhu's method:

```python
import numpy as np
from scipy.stats import norm


def calc_logl(x, mu, sd):
    """Helper function: sum of normal log-likelihoods for the values in x."""
    return norm.logpdf(x, mu, sd).sum()


def find_optimal_k(data):
    """Provide a 1-D numpy array; returns the index to serve as the cut-off."""
    n = len(data)
    profile_logl = []
    for q in range(1, n):
        # Split the series at candidate cutoff q.
        s1, s2 = data[:q], data[q:]
        mu1, mu2 = s1.mean(), s2.mean()
        sd1, sd2 = s1.std(), s2.std()
        # Pooled standard deviation across the two segments.
        sd_pooled = np.sqrt(((q - 1) * sd1**2 + (n - q - 1) * sd2**2) / (n - 2))
        profile_logl.append(calc_logl(s1, mu1, sd_pooled)
                            + calc_logl(s2, mu2, sd_pooled))
    return np.argmax(profile_logl) + 1
```
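As a quick sanity check, here is the same profile log-likelihood search written inline and applied to a synthetic scree-like series with an obvious elbow (the numbers are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Synthetic scree-like series: a clear drop after the second value.
data = np.array([10.0, 8.0, 1.0, 0.9, 0.8, 0.7])
n = len(data)

logls = []
for q in range(1, n):
    s1, s2 = data[:q], data[q:]
    # Pooled SD across the two candidate segments.
    sd = np.sqrt(((q - 1) * s1.std()**2 + (n - q - 1) * s2.std()**2) / (n - 2))
    logls.append(norm.logpdf(s1, s1.mean(), sd).sum()
                 + norm.logpdf(s2, s2.mean(), sd).sum())

best_q = int(np.argmax(logls)) + 1
print(best_q)  # → 2: the method recovers the visually obvious cutoff
```

The split that separates {10, 8} from {1, 0.9, 0.8, 0.7} wins because each segment is tight around its own mean, so the pooled SD is small and the likelihood is high; every other cutoff mixes the two regimes and pays for it.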