How do baby names come and go?

Quick caveat: In this analysis I used government data that assumes binary biological sex.

I'm in my mid-thirties, and as many of my friends start families of their own, I'm learning lots of baby names. I've heard plenty of people say that "older" names are becoming popular again, and from the names I'm hearing, I think there is something to it.

One of my good friends has a sweet, one-year-old baby named Vera. It's a beautiful, old-fashioned name for sure, but is it becoming popular again?

After plotting some data from the Social Security Administration (see plot to the right), it does look to be making a comeback. Funny aside: It turns out the name Ariel spiked in popularity after Disney's 1989 release of The Little Mermaid.
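
If you want to poke at the data yourself, here is a minimal sketch of loading the SSA national baby-name files and plotting one name. The "names/" directory and the year range are assumptions about how you downloaded and unzipped the data.

import pandas as pd
import matplotlib.pyplot as plt

# The SSA national data ships as one yobYYYY.txt file per year, with
# columns name, sex, count and no header; "names/" is a hypothetical path
frames = []
for year in range(1919, 2019):
    df = pd.read_csv(f"names/yob{year}.txt", names=["name", "sex", "count"])
    df["year"] = year
    frames.append(df)
names = pd.concat(frames)

# Yearly counts for girls named Vera
vera = names[(names["name"] == "Vera") & (names["sex"] == "F")]
vera.plot(x="year", y="count", legend=False, title='Popularity of the name "Vera"')
plt.show()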

This made me wonder more generally about names and their trends. Do names follow complicated dynamics, does their popularity come and go in waves, or do most names peak for a bit and then fade into history? I recently attended SciPy 2019, where I caught a great session on time series clustering. This question seemed like a great problem for trying out some of the methods I learned there, such as dynamic time warping (DTW).

I ran an analysis and found the following:
  • While there are some definite, clear clusters in name popularity over time, there is a lot of heterogeneity. Bottom line: you won't be able to neatly categorize the rise and fall of names with a few simple rules.
  • Although I pulled out more clusters for boys, it seems like there is more complexity in girl naming trends. See the final girl name cluster, for example, which the algorithm couldn't disentangle.

Here are the name trend clusters I was able to pull out. Click on the links below to see the full plots for each (each line in a plot represents a unique name's popularity over time). I also share a few exemplars for each cluster (the names with the smallest DTW distance to the center of their cluster). Note: to simplify things, I excluded names with fewer than 10,000 total instances over the last 100 years. See this notebook for more details on the analysis.
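
As a rough illustration of the approach, here is a minimal sketch using the tslearn library. The name_counts DataFrame (a names-by-years matrix of yearly counts) is hypothetical, n_clusters=5 just mirrors the boy-name clusters, and the notebook linked above has the actual analysis details.

import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.clustering import TimeSeriesKMeans
from tslearn.metrics import dtw

# name_counts: hypothetical names-by-years DataFrame (one row per name),
# e.g. the SSA data above pivoted so each row is a name's yearly counts
series = TimeSeriesScalerMeanVariance().fit_transform(name_counts.values)

# k-means clustering with dynamic time warping as the distance between curves
model = TimeSeriesKMeans(n_clusters=5, metric="dtw", random_state=0)
labels = model.fit_predict(series)

# Exemplars: for each cluster, the name whose curve is closest (in DTW distance) to the center
for k in range(model.n_clusters):
    members = np.where(labels == k)[0]
    dists = [dtw(series[i], model.cluster_centers_[k]) for i in members]
    print(k, name_counts.index[members[np.argmin(dists)]])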


Boy name clusters:

Cluster #1: Not popular before but exploding in popularity over the last decade (e.g. Owen, Cruz, Carter)

Cluster #2: A sharp peak in the mid-20th century but that’s it (e.g. Dale, Roger, Tony)

Cluster #3: Peaked in the late 90s / early aughts but dying out (e.g. Jacob, Trenton, Brennan)

Cluster #4: Very old names that have died out (e.g. Archie, Walter, Louis)

Cluster #5: Popular towards the end of the 20th century but dying out (e.g. Timothy, Brian, Eric)


Girl name clusters:

Cluster #1: Super popular the last two decades but mostly dropping off (e.g. Arianna, Sophia, Makenzie)

Cluster #2: Old-timey names that have died out (with some making a comeback) (e.g. Flora, Maxine, Lillie)

Cluster #3: Wildcards / difficult to cluster! (e.g. Melissa, Amy, Erin)

POPULARITY OF THE NAME "VERA"

THE 1989 RELEASE OF "THE LITTLE MERMAID" AND THE POPULARITY OF THE NAME "ARIEL"

A BOY NAME CLUSTER

A GIRL NAME CLUSTER


An Automated Elbow Method

For some types of unsupervised learning analyses, machine learning practitioners have typically needed to examine a plot and make a somewhat subjective judgment call to tune the model (the so-called "elbow method"). I can think of two examples of this, but others certainly exist:

1) In any sort of clustering analysis: finding the appropriate number of clusters by plotting the within-cluster sum of squares against the number of clusters.

2) When reducing feature space via PCA or a factor analysis: using a scree plot to determine the number of components/factors to extract.

For one-off analyses, using your eyeballs and some subjectivity might be fine, but what if you are using these methods as part of an automated pipeline? I came across a very simple and elegant solution, described by Mu Zhu in this paper. Lots of heuristics exist for this problem, but I've found Zhu's method to be particularly robust.

Zhu's idea is to take the data you would normally plot to spot the elbow/kink and treat it as a composite of two different samples, separated by the cutoff you are trying to identify. You then loop through all possible cutoffs, looking for the one that maximizes the profile log-likelihood (computed from the two sample means and a pooled standard deviation). Below is code I created to implement Zhu's method:


import numpy as np
from scipy.stats import norm


def calc_logl(x, mu, sd):
    """
    Helper function: log-likelihood of the observations in x under Normal(mu, sd)
    """
    return norm.logpdf(x, mu, sd).sum()


def find_optimal_k(data):
    """
    Provide a 1-D numpy array (e.g. a scree or inertia curve); returns the index to serve as the cut-off
    """
    n = len(data)
    profile_logl = []
    for q in range(1, n):
        # Split the curve into two candidate samples at position q
        s1, s2 = data[:q], data[q:]
        mu1, mu2 = s1.mean(), s2.mean()
        sd1, sd2 = s1.std(), s2.std()
        # Pooled standard deviation across the two candidate samples
        sd_pooled = np.sqrt(((q - 1) * sd1**2 + (n - q - 1) * sd2**2) / (n - 2))
        # Profile log-likelihood for this cutoff
        profile_logl.append(calc_logl(s1, mu1, sd_pooled) + calc_logl(s2, mu2, sd_pooled))
    # The cutoff with the highest profile log-likelihood wins
    return np.argmax(profile_logl) + 1
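
As a hypothetical usage example, you can feed the function the same curve you would otherwise eyeball, such as the within-cluster sum of squares from scikit-learn's KMeans. The toy data and parameter choices below are just for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with a known cluster structure (illustrative only)
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Within-cluster sum of squares (inertia) for k = 1..10
inertias = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
])

# Instead of eyeballing the elbow, let the profile log-likelihood pick the cutoff
print(find_optimal_k(inertias))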

Neural Networks and DNA

Biology is super complicated. If you don't believe me, check out the Roche Biochemical Pathways chart. I wonder if you can buy a laminated poster of that chart...

One of the more complicated biochemical processes is the one by which cells convert DNA into functional proteins. As a first step in this process, DNA is converted to mRNA in a process called transcription. Transcription is controlled by transcription factors (TFs), which are proteins that bind to specific parts of the DNA. Understanding which specific parts of the DNA TFs bind to might lead to new insights into how transcription occurs and why certain genes are or are not expressed.

If you formulate this problem statistically, it is essentially sequence classification: taking an n-length sequence of symbols from a finite alphabet (e.g. the four DNA bases T, A, C, and G) and learning to categorize it (say, as bindable to a protein or not).

There are a number of statistical models one can use for sequence classification. Long Short-Term Memory (LSTM) neural networks are an exciting approach that could be useful for this problem. These types of neural networks have been increasingly used by Google, Apple, and others. Unlike standard feedforward networks, LSTMs have loops that allow them to excel at retaining information and learning patterns within sequences. These networks are composed of LSTM blocks, which have "gates" that control the flow of information (see image on the right).

Using data from the publicly available UniProbe dataset (more details here), I implemented an LSTM using Python's Keras module. The dataset consisted of about 20,000 DNA sequences that bind to a protein of interest and about 20,000 sequences that don't, each 60 bases long. The model included an embedding layer (to transform the discrete symbols into a continuous vector space), followed by two hidden LSTM layers with 20% dropout to prevent overfitting. I trained on batches of 64 observations for 10 epochs, using the Adam optimizer and a tanh activation function. Bottom line: I was able to obtain almost 80% accuracy on the test set (see line plot on the right).
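
For reference, a stripped-down sketch of that kind of architecture in Keras looks roughly like the following. The layer sizes, the integer encoding, and the variable names here are assumptions for illustration rather than the exact model I trained.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Integer-encode the bases, e.g. A=1, C=2, G=3, T=4 (0 reserved for padding)
base_to_int = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(sequence):
    return [base_to_int[base] for base in sequence]

model = Sequential([
    # Embedding layer: map each discrete base to a point in continuous vector space
    Embedding(input_dim=5, output_dim=8),
    # Two hidden LSTM layers (tanh activation by default) with 20% dropout
    LSTM(32, return_sequences=True, dropout=0.2),
    LSTM(32, dropout=0.2),
    # Binary output: binds to the protein of interest or not
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X: (n_sequences, 60) integer-encoded array; y: 0/1 binding labels
# model.fit(X, y, batch_size=64, epochs=10, validation_split=0.2)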

AN LSTM BLOCK

DNA MODEL ACCURACY


Machine Learning and Art

In 2015, Leon Gatys and colleagues wrote a paper describing an algorithm that could "separate and recombine content and style of arbitrary images." If you have ever wondered what the Mona Lisa would look like painted in the style of van Gogh's Starry Night, it turns out this is something a machine can do quite well! This is done with a convolutional neural network (CNN), a model inspired by our understanding of the human brain and the connections between its neurons. The CNN learns abstract, high-level features of a style (like the texture of Starry Night) and is able to apply them to the content of a given image.

To try this out, I spun up a GPU-enabled EC2 Linux server on AWS and used Justin Johnson's "neural-style", a Torch implementation of the algorithm described above. I combined a selfie with a picture of the rings in a cross-section of a log. The result was really cool!

Here is an animation of the progressive iterations of the model.

ORIGINAL IMAGES

BLENDED IMAGE


Preparing for the Transition to Data Science

As a Data Scientist and Program Director at Insight, I've helped hundreds of academics with PhDs transition into the data science industry. A colleague and I recently wrote this blog post on the skills one should cultivate to become a data scientist. It includes some wonderful resources that everyone should know about. Check it out!


SciClarify

SciClarify is a data product I created during the 2015 Insight Data Science Fellowship. Using the PubMed API and a machine learning algorithm, SciClarify compares your text against recent top papers in your field. I used natural language processing to engineer text features related to structure, syntax, and semantics.


Visualizing Socioeconomic Disadvantage Across US Counties

When we create maps to view the spatial variation of socioeconomic status, we typically only view one factor at a time (e.g. just income or just the unemployment rate). I thought it would be useful to create and visualize a summary score of overall "socioeconomic disadvantage" built from many socioeconomic indicators. Using publicly available county-level US Census data from 2005, I created the following map. I conducted a factor analysis to combine the following indicators into a single disadvantage measurement (a sketch of this step follows the list):


* Net 5-year population change

* % residents with less than a bachelor's degree

* % households with annual income below $75,000

* % residents living at or below the poverty line

* Infant deaths per 1,000 live births

* Medicare recipients per 100,000 residents

* % residents that own their dwelling

* Unemployment rate
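
For the curious, here is a rough sketch of that factor-analysis step using scikit-learn. The counties DataFrame of county-level indicators is hypothetical, and in practice you need to check the factor's orientation so that higher scores mean more disadvantage.

import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# counties: hypothetical DataFrame, one row per county, one column per indicator above
scaled = StandardScaler().fit_transform(counties)

# Extract a single latent factor and use its score as the "disadvantage" index
fa = FactorAnalysis(n_components=1, random_state=0)
disadvantage = fa.fit_transform(scaled).ravel()

# Note: the factor's sign is arbitrary; check the loadings and flip if needed so
# that higher scores correspond to more disadvantage
ranked = pd.Series(disadvantage, index=counties.index).sort_values(ascending=False)
print(ranked.head(3))  # most disadvantaged counties
print(ranked.tail(3))  # least disadvantaged counties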


The three most disadvantaged counties were:

1) McDowell County, West Virginia

2) Owsley County, Kentucky

3) Buffalo County, South Dakota


The three least disadvantaged counties were:

1) Douglas County, Colorado

2) Fairfax County, Virginia

3) Loudoun County, Virginia

SOCIOECONOMIC DISADVANTAGE ACROSS US COUNTIES


Professional Basketball Simulation

The terms "Moneyball" and "sabermetrics" are increasingly being used in pop culture. In fact, there was a 2011 movie on the topic. These terms refer to the relatively new, evidence-based, statistical approach used in baseball management. Can this approach be applied to the game of basketball? The short answer is: it is much trickier. Baseball involves clear, discrete intervals of play surrounding one interaction (the pitcher interacting with the batter). Basketball consists of many players interacting simultaneously with possessions of variable length! In my spare time, a friend and I are attempting to create a Monte Carlo simulation of a professional basketball game. The procedure will involve pulling the most recent player statistics off of various websites and simulating a match between two teams 1000 times. We are excited to try out some recent machine learning algorithms in the program, and hopefully they will contribute something unique and helpful. The output will be distributions of 1000 final scores for each team, and will look something like this...

EXAMPLE OF THE SIMULATION OUTPUT

SIMULATION OVERVIEW