Biology is super complicated. If you don't believe me, check out the Roche Biochemical Pathways chart. I wonder if you can buy a laminated poster of that chart...
One of the more complicated biochemical processes is the one by which cells convert DNA into functional proteins. As a first step in this process, DNA is converted to mRNA in a process called transcription. Transcription is controlled by transcription factors (TFs), which are proteins that bind to specific parts of the DNA. Understanding which parts of the DNA particular TFs bind to could lead to new insights into how transcription occurs and why certain genes are or are not expressed.
If you formulate this problem statistically, it is essentially sequence classification: taking an n-length sequence of symbols from a finite alphabet (e.g. the four DNA bases T, A, C, and G) and learning to categorize it (say, as bindable to a protein or not).
There are a number of statistical models one can use for sequence classification. Long Short-Term Memory (LSTM) neural networks are an exciting deep learning approach that could be useful for this problem, and they have been increasingly used by Google, Apple, and others. Unlike standard feedforward networks, LSTMs have recurrent loops that allow them to excel at retaining information and learning patterns within sequences. These networks are composed of LSTM blocks, which have "gates" that determine the flow of information (see image on the right).
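For the curious, the gating mechanism can be written down compactly. These are the standard textbook LSTM equations, not anything specific to my model:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)  % forget gate
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)  % input gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)  % output gate
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)  % cell state
h_t = o_t \odot \tanh(c_t)  % hidden state
```

The sigmoid-valued gates f, i, and o control, respectively, how much of the old cell state is kept, how much new information is written, and how much of the cell state is exposed as output.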
Using data from the publicly available UniPROBE dataset, I implemented an LSTM using Python's Keras library. The dataset consisted of about 20,000 DNA sequences that bind to a protein of interest and about 20,000 sequences that do not; each sequence was 60 bases long. The model included an embedding layer (to transform the discrete symbols into a continuous vector space), followed by two hidden LSTM layers with 20% dropout to prevent overfitting. I trained with a batch size of 64 for 10 epochs, using the stochastic Adam optimizer and tanh activations. Bottom line: I obtained almost 80% accuracy on the test set (see line plot on right).
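To give a flavor of the preprocessing, here is how 60-base DNA strings can be integer-encoded so that an embedding layer (e.g. Keras's `Embedding(input_dim=4, ...)`) can learn a dense vector per base. This is an illustrative sketch, not my exact pipeline:

```python
# Map each DNA base to an integer index; an embedding layer then
# learns a continuous vector for each of the four indices.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_sequence(seq):
    """Convert a DNA string like 'ACGT...' into a list of integer indices."""
    return [BASE_TO_INDEX[base] for base in seq.upper()]

print(encode_sequence("ACGTAC"))  # -> [0, 1, 2, 3, 0, 1]
```

Each 60-base sequence becomes a length-60 integer vector, and the binding/non-binding label becomes the binary target.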
Machine Learning and Art
In 2015, Leon Gatys and colleagues wrote a paper describing an algorithm that could "separate and recombine content and style of arbitrary images." If you ever wondered what the Mona Lisa would look like painted in the style of van Gogh's Starry Night, it turns out this is something a machine can do quite well! This is done with a convolutional neural network (CNN), a model inspired by our understanding of the human brain and the connections between its neurons. The CNN learns abstract, high-level features of a style (like the texture of Starry Night) and can apply them to the content of a given image.
To try this out, I spun up a GPU-enabled EC2 Linux server on AWS and used Justin Johnson's "neural-style", a Torch implementation of the above-mentioned algorithm. I combined a selfie with a picture of the rings in a cross-section of a log. The result was really cool!
As a Data Scientist and Program Director at Insight, I've helped hundreds of academics with PhDs transition into the data science industry. A colleague and I recently wrote a blog post on what skills one should cultivate to become a data scientist. It includes some wonderful resources that everyone should know about. Check it out!
SciClarify.com is a data product I created during the 2015 Insight Data Science Fellowship. Using the PubMed API and a machine learning algorithm, SciClarify compares your text against the recent top papers in your field. I used natural language processing to engineer text features relating to structure, syntax, and semantics.
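The exact features SciClarify uses aren't listed here, but structural and syntactic text features of the kind described can be sketched in plain Python. The function and feature names below are illustrative, not the product's actual code:

```python
import re

def text_features(text):
    """Compute a few simple structural/syntactic features of a text."""
    # Rough sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    # Lowercased word tokens.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "n_sentences": len(sentences),
        # Average words per sentence (a structure feature).
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Lexical diversity: unique words over total words (a semantics-ish feature).
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

feats = text_features("Cells are small. Cells divide. Proteins fold!")
```

Features like these, computed for both a draft and a corpus of top papers, give a classifier something concrete to compare.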
Visualizing Socioeconomic Disadvantage Across US Counties
When we create maps to view the spatial variation of socioeconomic status, we typically view only one factor at a time (e.g. just income or just unemployment rate). I thought it would be useful to create and visualize a summary score of overall "socioeconomic disadvantage" from many socioeconomic indicators. Using publicly available county-level US Census data from 2005, I created the following map. I conducted a factor analysis to combine the indicators into a single disadvantage score.
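The actual analysis used factor analysis (available in, e.g., scikit-learn's `FactorAnalysis` or R's `factanal`); as a simplified stand-in that conveys the idea, here is a composite score built by z-scoring each indicator and averaging. The county values are invented for illustration, not the Census data:

```python
import statistics

def composite_score(rows):
    """Standardize each indicator (z-score) and average across indicators.

    rows: one dict per county, mapping indicator name -> value.
    Returns one summary score per county."""
    names = list(rows[0].keys())
    params = {}
    for name in names:
        vals = [r[name] for r in rows]
        params[name] = (statistics.mean(vals), statistics.pstdev(vals))
    scores = []
    for r in rows:
        zs = [(r[n] - params[n][0]) / params[n][1] for n in names]
        scores.append(sum(zs) / len(zs))
    return scores

# Three hypothetical counties: higher poverty/unemployment -> higher score.
counties = [
    {"poverty_rate": 10.0, "unemployment": 4.0},
    {"poverty_rate": 20.0, "unemployment": 6.0},
    {"poverty_rate": 30.0, "unemployment": 8.0},
]
scores = composite_score(counties)
```

Factor analysis improves on this naive average by weighting each indicator by how strongly it loads on the shared latent "disadvantage" factor.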
Childhood Academic Achievement and Socioeconomic Status: An Application of Predictive Modeling
The pursuit of knowledge is one of our most important social values, and we all want our children to succeed academically. Because of this, it is important to understand the correlates of early academic success. It is also important to understand what factors explain the achievement gap that has persisted in the US for decades. These figures from the National Center for Education Statistics illustrate the problem.
Educational researchers have clearly established that socioeconomic factors have a profound influence on academic success. However, it is still unclear which specific socioeconomic factors are the most important predictors of academic achievement. Knowing this might allow policymakers to focus their efforts for maximum impact.
I am currently collaborating with the New York City Department of Health and Mental Hygiene to analyze data from the Longitudinal Study of Early Development (LSED). This rich dataset contains 3rd grade math and language achievement scores and socioeconomic information on hundreds of thousands of children born in New York City from 1994 to 2004. I will examine a suite of predictive models and machine learning techniques to find an optimally predictive model, then calculate variable importance scores to isolate the top 10 or 20 predictive socioeconomic factors.
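Variable importance can be computed several ways depending on the model. One model-agnostic option is permutation importance: shuffle one feature's column and measure how much the model's error grows. Here is a minimal sketch with a toy linear "model" and made-up data (nothing here is the LSED analysis itself):

```python
import random

def permutation_importance(model, X, y, feature_idx, n_repeats=20, seed=0):
    """Average increase in mean squared error when one feature is shuffled.

    model: callable mapping a feature row (list) to a prediction."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(rows)

    baseline = mse(X)
    increases = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)  # break the feature-target relationship
        shuffled = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                    for row, v in zip(X, col)]
        increases.append(mse(shuffled) - baseline)
    return sum(increases) / n_repeats

# Toy setup: the target depends entirely on feature 0, not at all on feature 1.
model = lambda row: 3.0 * row[0]
X = [[float(i), float(i % 2)] for i in range(30)]
y = [3.0 * r[0] for r in X]
imp0 = permutation_importance(model, X, y, 0)
imp1 = permutation_importance(model, X, y, 1)
```

As expected, shuffling the informative feature inflates the error while shuffling the irrelevant one leaves it untouched; ranking features by this increase yields an importance ordering.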
Professional Basketball Simulation
The terms "Moneyball" and "sabermetrics" are increasingly being used in pop culture; in fact, there was a 2011 movie on the topic. These terms refer to the relatively new, evidence-based, statistical approach used in baseball management. Can this approach be applied to the game of basketball? The short answer is: it is much trickier. Baseball involves clear, discrete intervals of play surrounding one interaction (the pitcher facing the batter), while basketball consists of many players interacting simultaneously, with possessions of variable length! In my spare time, a friend and I are attempting to create a Monte Carlo simulation of a professional basketball game. The procedure will involve pulling the most recent player statistics from various websites and simulating a match between two teams 1,000 times. We are excited to try out some recent machine learning algorithms in the program, and hopefully they will contribute something unique and helpful. The output will be distributions of 1,000 final scores for each team, and will look something like this...
Here is an overview of the process for a single simulated game:
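In the spirit of that overview, the core Monte Carlo loop might look like the sketch below. The possession count and per-possession scoring rates are invented for illustration; the real version would derive them from scraped player statistics:

```python
import random

def simulate_game(rng, possessions=100):
    """Simulate one game as a fixed number of possessions per team.

    Each possession yields 0, 2, or 3 points with fixed probabilities."""
    def team_score(two_pt_rate, three_pt_rate):
        pts = 0
        for _ in range(possessions):
            r = rng.random()
            if r < three_pt_rate:
                pts += 3
            elif r < three_pt_rate + two_pt_rate:
                pts += 2
        return pts

    # Hypothetical per-possession rates for the two teams.
    return team_score(0.35, 0.10), team_score(0.33, 0.12)

rng = random.Random(42)  # seeded for reproducibility
games = [simulate_game(rng) for _ in range(1000)]
home_scores = [g[0] for g in games]
away_scores = [g[1] for g in games]
```

Repeating the game 1,000 times yields the score distributions mentioned above; win probability is then just the fraction of simulations one team wins.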