An Automated Elbow Method

For some types of unsupervised learning analyses, machine learning practitioners have typically needed to examine a plot and make a somewhat subjective judgement call to tune the model (the so-called "elbow method"). I can think of two examples of this but others certainly exist:

1) In any sort of clustering analysis: finding the appropriate number of clusters by plotting the within-cluster sum of squares against the number of clusters.

2) When reducing feature space via PCA or a Factor Analysis: using a Scree plot to determine the number of components/factors to extract.

For one-off analyses, using your eyeballs and some subjectivity might be fine, but what if you are using these methods as part of an automated pipeline? I came across a simple and elegant solution to this, described by Mu Zhu in this paper. Many heuristics exist for this problem, but I've found Zhu's method to be particularly robust.

Zhu's idea is to take the data you would normally plot to find the elbow/kink and treat it as a composite of two different samples, separated by the cutoff you are trying to identify. He then loops through all possible cutoffs, looking for the one that maximizes the profile log-likelihood (computed from the two sample means and a pooled standard deviation). Below is code I created to implement Zhu's method:

import numpy as np
from scipy.stats import norm

def calc_logl(x, mu, sd):
  """Helper function: log-likelihood of sample x under Normal(mu, sd)."""
  # norm.logpdf is numerically safer than summing np.log(norm.pdf(...))
  return np.sum(norm.logpdf(x, mu, sd))

def find_optimal_k(data):
  """Provide a 1-D numpy array; returns the cut-off q that maximizes the profile log-likelihood."""
  n = len(data)
  profile_logl = []
  for q in range(1, n):
    s1 = data[:q]
    s2 = data[q:]
    mu1 = s1.mean()
    mu2 = s2.mean()
    sd1 = s1.std()
    sd2 = s2.std()
    # Pooled standard deviation across the two sub-samples
    sd_pooled = np.sqrt(((q - 1) * sd1**2 + (n - q - 1) * sd2**2) / (n - 2))
    profile_logl.append(calc_logl(s1, mu1, sd_pooled) + calc_logl(s2, mu2, sd_pooled))
  # q runs from 1 to n-1, so shift the argmax back onto the cut-off scale
  return np.argmax(profile_logl) + 1
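As a quick sanity check, the cut-off finder can be exercised on a made-up scree-like curve with an obvious kink. So the snippet runs on its own, it re-declares the two functions in compact form (with the argmax shifted back onto the cut-off scale, since q starts at 1); the sample values are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

def calc_logl(x, mu, sd):
    # Log-likelihood of sample x under Normal(mu, sd)
    return np.sum(norm.logpdf(x, mu, sd))

def find_optimal_k(data):
    # Profile log-likelihood cut-off finder, compact restatement
    n = len(data)
    profile_logl = []
    for q in range(1, n):
        s1, s2 = data[:q], data[q:]
        sd_pooled = np.sqrt(((q - 1) * s1.std()**2 + (n - q - 1) * s2.std()**2) / (n - 2))
        profile_logl.append(calc_logl(s1, s1.mean(), sd_pooled) +
                            calc_logl(s2, s2.mean(), sd_pooled))
    return np.argmax(profile_logl) + 1

# Three large values followed by a flat tail: the kink sits after the third value
scree = np.array([10.0, 9.0, 8.0, 1.2, 1.1, 1.0, 0.9, 0.8])
print(find_optimal_k(scree))  # 3
```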

Neural Networks and DNA

Biology is super complicated. If you don't believe me, check out the Roche Biochemical Pathways chart. I wonder if you can buy a laminated poster of that chart...

One of the more complicated biochemical processes is the one by which cells convert DNA to functional proteins. As a first step in this process, DNA is converted to mRNA in a process called transcription. Transcription is controlled by transcription factors (TFs), which are proteins that bind to specific parts of the DNA. Understanding which specific parts of DNA TFs bind to might lead to new insights into how transcription occurs and why certain genes are or are not expressed.

If you formulate this problem statistically, it is essentially sequence classification. That is, taking an n-length sequence of symbols from a small alphabet (e.g. the four DNA bases T, A, C, and G) and learning to categorize it (say, as bindable by a protein or not).
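Before any sequence model can be fit, the symbols have to be mapped to numbers. A minimal sketch of this encoding step (the particular base-to-integer mapping here is an arbitrary choice, not taken from the analysis):

```python
# Map each DNA base to an integer index; the ordering is arbitrary
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq):
    """Turn a DNA string like 'TACG' into a list of integer codes."""
    return [BASE_TO_INT[base] for base in seq.upper()]

print(encode_sequence("TACG"))  # [3, 0, 1, 2]
```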

There are a number of statistical models one can use for sequence classification. Long Short-Term Memory (LSTM) neural networks are an exciting approach that could be useful for this problem. These types of neural networks have been increasingly used by Google, Apple, and others. Unlike standard feedforward networks, LSTMs have loops that allow them to excel at retaining information and learning patterns within sequences. These networks are composed of LSTM blocks, which have "gates" that determine the flow of information (see image on the right).
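To make the "gates" concrete, a single LSTM time step can be sketched in plain numpy. This is purely illustrative (randomly initialized weights, toy dimensions); real implementations live in libraries like Keras:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, output, and candidate gates."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b     # all four gate pre-activations at once
    i = sigmoid(z[0:d])            # input gate: how much new information to write
    f = sigmoid(z[d:2*d])          # forget gate: how much old cell state to keep
    o = sigmoid(z[2*d:3*d])        # output gate: how much cell state to expose
    g = np.tanh(z[3*d:4*d])        # candidate values for the cell state
    c = f * c_prev + i * g         # updated cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                 # e.g. a one-hot DNA base in, 8 hidden units
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in np.eye(n_in):             # feed a toy one-hot sequence through the block
    h, c = lstm_step(x, h, c, W, U, b)
```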

Using data from the publicly available UniProbe dataset (more details here), I implemented an LSTM using Python's Keras module. The dataset consisted of about 20,000 DNA sequences that bind to a protein of interest and about 20,000 sequences that do not; each sequence was 60 bases long. The model I created began with an embedding layer (to transform the discrete symbols into a continuous vector space), followed by two hidden LSTM layers with 20% dropout to prevent overfitting. I trained on batches of 64 observations for 10 epochs, using the Adam stochastic optimizer and a tanh activation function. Bottom line: I was able to obtain almost 80% accuracy on the test set (see line plot on right).



Machine Learning and Art

In 2015, Leon Gatys and colleagues wrote a paper describing an algorithm that could "separate and recombine content and style of arbitrary images." If you ever wondered what the Mona Lisa would look like if done in the style of van Gogh's Starry Night, it turns out this is something a machine can do quite well! This is done with a convolutional neural network (CNN), a model inspired by our understanding of the human brain and the connections between its neurons. The CNN learns abstract, high-level features of a style (like the texture of Starry Night) and can apply them to the content of a given image.

To try this out, I spun up a GPU-enabled EC2 Linux server on AWS and used Justin Johnson's "neural-style", a Torch implementation of the algorithm mentioned above. I combined a selfie with a picture of the rings in a cross-section of a log. The result was really cool!

Here is an animation of the progressive iterations of the model.



Preparing for the Transition to Data Science

As a Data Scientist and Program Director at Insight, I've helped hundreds of academics with PhDs transition into the data science industry. A colleague and I recently wrote this blog post on what skills one should cultivate to become a data scientist. It includes some wonderful resources that everyone should know about. Check it out!


SciClarify

SciClarify is a data product I created during the 2015 Insight Data Science Fellowship. Using the PubMed API and a machine learning algorithm, SciClarify compares your text against the recent top papers in your field. I used natural language processing to engineer text features related to structure, syntax, and semantics.
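The exact feature set isn't listed here, but structural text features of this general kind are easy to sketch. The two below (average sentence length and lexical diversity) are illustrative stand-ins of my own choosing, not the actual SciClarify features:

```python
def avg_sentence_length(text):
    """Mean number of words per sentence: a crude structural feature."""
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

def lexical_diversity(text):
    """Type-token ratio (unique words / total words): a crude lexical feature."""
    words = text.lower().split()
    return len(set(words)) / len(words)

doc = "Short sentence. Another short one."
print(avg_sentence_length(doc))   # 2.5
print(lexical_diversity(doc))
```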

Visualizing Socioeconomic Disadvantage Across US Counties

When we create maps to view the spatial variation of socioeconomic status, we are typically only viewing the variation of one factor at a time (e.g. just income or just unemployment rate). I thought it would be useful to create and visualize a summary score of overall "socioeconomic disadvantage" built from many socioeconomic indicators. Using publicly available county-level US Census data from 2005, I created the following map. I conducted a factor analysis to combine the following indicators into one disadvantage measurement:

* Net 5-year population change

* % residents with less than a bachelor's degree

* % households with below $75,000 annual income

* % residents living at or below the poverty line

* Infant deaths per 1,000 live births

* Medicare recipients per 100,000 residents

* % residents that own their dwelling

* Unemployment rate
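The analysis itself used a factor analysis; as a rough standalone sketch of this kind of one-factor summary, you can standardize the indicators and project onto the first principal component (PCA here is a stand-in for the factor model, and the numbers are made up):

```python
import numpy as np

def first_component_score(X):
    """Standardize each column, then project rows onto the first principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[0]          # one summary score per row (e.g. per county)

# Made-up data: 5 "counties" x 3 correlated indicators driven by one latent factor
rng = np.random.default_rng(1)
latent = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # underlying "disadvantage"
X = np.column_stack([latent + 0.1 * rng.normal(size=5) for _ in range(3)])
scores = first_component_score(X)              # tracks the latent factor (up to sign)
```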

The three most disadvantaged counties were:

1) McDowell County, West Virginia

2) Owsley County, Kentucky

3) Buffalo County, South Dakota

The three least disadvantaged counties were:

1) Douglas County, Colorado

2) Fairfax County, Virginia

3) Loudoun County, Virginia


[Map: socioeconomic disadvantage across US counties]

Childhood Academic Achievement and Socioeconomic Status: An Application of Predictive Modeling

The pursuit of knowledge is one of our most important social values and we all want our children to succeed academically. Because of this, it is important to understand the correlates of early academic success. It is also important to understand what factors explain the achievement gap that has persisted in the US for decades. These figures from the National Center for Education Statistics illustrate the problem.

Educational researchers have clearly established that socioeconomic factors have a profound influence on academic success. However, it is still unclear which specific socioeconomic factors are the most important predictors of academic achievement. Knowing this might allow policymakers to focus their efforts for maximum impact. I am currently collaborating with the New York City Department of Health and Mental Hygiene to analyze data from the Longitudinal Study of Early Development (LSED). This rich dataset contains 3rd grade math and language achievement scores and socioeconomic information on hundreds of thousands of children born in New York City from 1994 to 2004. I will examine a suite of predictive models/machine learning techniques to find an optimally predictive model and then calculate variable importance scores to isolate the top 10 or 20 predictive socioeconomic factors.
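The modeling hasn't been run yet, but the variable-importance step can be sketched generically: fit any model, then shuffle one feature at a time and measure how much prediction quality degrades. A minimal sketch with a plain least-squares fit on made-up data:

```python
import numpy as np

def permutation_importance(X, y, coef, n_repeats=20, seed=0):
    """Mean increase in MSE when each column is shuffled (higher = more important)."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((X @ coef - y) ** 2)
    importances = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break this feature's link to y
            increases.append(np.mean((Xp @ coef - y) ** 2) - base_mse)
        importances.append(np.mean(increases))
    return np.array(importances)

# Made-up data: y depends strongly on column 0, weakly on 1, not at all on 2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = permutation_importance(X, y, coef)   # ranks column 0 first, then 1, then 2
```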


[Figures: math scores preview, reading scores preview, and predictive model output]

Professional Basketball Simulation

The terms "Moneyball" and "sabermetrics" are increasingly being used in pop culture; in fact, there was a 2011 movie on the topic. These terms refer to the relatively new, evidence-based, statistical approach used in baseball management. Can this approach be applied to the game of basketball? The short answer is: it is much trickier. Baseball involves clear, discrete intervals of play surrounding one interaction (the pitcher facing the batter). Basketball consists of many players interacting simultaneously, with possessions of variable length! In my spare time, a friend and I are attempting to create a Monte Carlo simulation of a professional basketball game. The procedure will involve pulling the most recent player statistics from various websites and simulating a match between two teams 1000 times. We are excited to try out some recent machine learning algorithms in the program, and hopefully they will contribute something unique and helpful. The output will be distributions of 1000 final scores for each team, and will look something like this...
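The simulation isn't finished, but the core Monte Carlo loop is simple to sketch. Team scoring is treated as normally distributed here purely for illustration (the real model will be richer), and the team averages are made up:

```python
import numpy as np

def simulate_matchup(mean_a, sd_a, mean_b, sd_b, n_sims=1000, seed=42):
    """Draw n_sims final scores per team; return each team's win fraction."""
    rng = np.random.default_rng(seed)
    scores_a = rng.normal(mean_a, sd_a, n_sims)
    scores_b = rng.normal(mean_b, sd_b, n_sims)
    return (scores_a > scores_b).mean(), (scores_b > scores_a).mean()

# Made-up team averages: Team A scores 108 +/- 10, Team B scores 102 +/- 10
win_a, win_b = simulate_matchup(108, 10, 102, 10)
```

The two score arrays are exactly the "distributions of 1000 final scores" mentioned above; a histogram of each gives the plot described.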


[Figures: simulated NBA score-distribution plot and simulation overview]