An Automated Elbow Method

Zhu's idea is to start from the same data you would normally generate to identify the elbow (or kink). He then treats this data as a composite of two different samples, separated by the cutoff he is trying to identify. He loops through all possible cutoffs to find the one that maximizes the profile log-likelihood (using the two sample means and a pooled SD in the calculations). Below is code I created to implement Zhu's method:

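```python
import numpy as np
from scipy.stats import norm

def calc_logl(x, mu, sd):
    """
    Helper function to calculate the Gaussian log-likelihood of x
    """
    # logpdf avoids the underflow that np.log(norm.pdf(...)) can hit
    return norm.logpdf(x, mu, sd).sum()

def find_optimal_k(data):
    """
    Provide a 1-D numpy array (at least 3 values); returns the index of
    the last element before the cut-off, i.e. the best split is
    data[:k+1] / data[k+1:]
    """
    n = len(data)
    profile_logl = []
    for q in range(1, n):
        s1 = data[0:q]
        s2 = data[q:]
        mu1 = s1.mean()
        mu2 = s2.mean()
        # Pooled SD from the summed squared deviations; equivalent to
        # ((q-1)*var1 + (n-q-1)*var2) / (n-2) with sample (ddof=1) variances.
        # np.std defaults to ddof=0, which does not match those weights.
        ss1 = ((s1 - mu1) ** 2).sum()
        ss2 = ((s2 - mu2) ** 2).sum()
        sd_pooled = np.sqrt((ss1 + ss2) / (n - 2))
        profile_logl.append(calc_logl(s1, mu1, sd_pooled) + calc_logl(s2, mu2, sd_pooled))
    # profile_logl[i] corresponds to q = i + 1
    return np.argmax(profile_logl)
```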
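To make the cutoff search concrete, here is a minimal, self-contained sketch of the same idea applied to a made-up array with an obvious drop. The function name `profile_loglik_cutoff` and the sample values are hypothetical, chosen only for illustration. It assumes at least three values and groups that are not perfectly constant. Unlike `find_optimal_k`, it returns `q`, the number of leading values in the first group, which is one more than the index `find_optimal_k` returns.

```python
import numpy as np

def profile_loglik_cutoff(values):
    """Return q, the size of the first group at the best split (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    best_q, best_logl = None, -np.inf
    for q in range(1, n):
        s1, s2 = values[:q], values[q:]
        # Sum of squared deviations of each group around its own mean
        ss = ((s1 - s1.mean()) ** 2).sum() + ((s2 - s2.mean()) ** 2).sum()
        var_pooled = ss / (n - 2)
        # Gaussian log-likelihood of all n points with a shared (pooled) variance
        logl = -0.5 * n * np.log(2 * np.pi * var_pooled) - ss / (2 * var_pooled)
        if logl > best_logl:
            best_q, best_logl = q, logl
    return best_q

# Made-up data: four large values followed by five small ones
scores = np.array([10.0, 9.5, 9.8, 10.2, 1.0, 1.1, 0.9, 1.2, 0.8])
q = profile_loglik_cutoff(scores)
print(q, scores[:q])
```

Because the pooled variance is re-estimated at every candidate split, the profile log-likelihood is largest where the within-group sum of squares is smallest, so the sketch selects the split between the fourth and fifth values here.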

Childhood Academic Achievement and Socioeconomic Status: An Application of Predictive Modeling

Educational researchers have clearly established that socioeconomic factors have a profound influence on academic success. However, it is still unclear which specific socioeconomic factors are the most important predictors of academic achievement. Knowing this might allow policymakers to focus their efforts for maximum impact. I am currently collaborating with the New York City Department of Health and Mental Hygiene to analyze data from the Longitudinal Study of Early Development (LSED). This rich dataset contains third-grade math and language achievement scores and socioeconomic information on hundreds of thousands of children born in New York City from 1994 to 2004. I will examine a suite of predictive models and machine learning techniques to find an optimally predictive model, and then calculate variable importance scores to isolate the top 10 or 20 predictive socioeconomic factors.