I recently started to delve into machine learning and data science to broaden my knowledge and incorporate both into my work. Machine learning uses algorithms that iteratively improve their ability to make predictions or decisions, and its applications have multiplied in recent years. Through a mix of formal university classes, Coursera courses, and Kaggle competitions, I built up a working knowledge of machine learning in Python. For my first real project, I took this recent paper (https://www.pnas.org/content/115/46/E10988) and tried to improve its classification algorithm (you can read the full account in my bioRxiv preprint: https://www.biorxiv.org/content/early/2018/12/19/499780 and, if you would like to play with the code, you can find it here: https://github.com/Bribak/SURFY2).
Using a Random Forest classifier and a highly curated database, the authors predicted whether human transmembrane proteins reside in the surface-exposed plasma membrane or in intracellular membranes, and thereby delivered a set of human surface proteins predicted by their algorithm, SURFY. To mitigate the weaknesses of a single Random Forest ensemble, I built a multi-layer meta-ensemble classifier that relies on engineered features. On the lowest level, 12 optimized classifiers predicted the protein localization as probabilities. These predictions were then added to the original data as new features. The next layer, consisting of three optimized classifiers, was then trained on the expanded dataset containing the newly engineered features. Their predicted class probabilities were then fed into the uppermost layer, a voting classifier that weighted each of the middle-layer classifiers and determined the final prediction according to their votes.
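To make the layering concrete, here is a minimal sketch of the same idea in scikit-learn. It is not the actual SURFY2 code (that lives in the repository linked above). Everything specific in it is an assumption for illustration: the data are synthetic, there are only three base classifiers instead of 12, the model choices and voting weights are arbitrary, and the helper `add_probability_features` is a hypothetical name. I also assume out-of-fold predictions for the training rows, which is a common way to keep the engineered features from leaking the labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the protein feature table (assumption, not real data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Layer 1: base classifiers (three here for brevity; the post describes 12).
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=7),
]


def add_probability_features(models, X_fit, y_fit, X_new):
    """Append each model's positive-class probability as a new feature column."""
    fit_cols, new_cols = [], []
    for model in models:
        # Out-of-fold probabilities for the training rows (assumed scheme),
        # so no row is scored by a model that saw its label.
        oof = cross_val_predict(model, X_fit, y_fit, cv=5, method="predict_proba")
        fit_cols.append(oof[:, 1])
        # Refit on all training rows to score the held-out rows.
        model.fit(X_fit, y_fit)
        new_cols.append(model.predict_proba(X_new)[:, 1])
    return (
        np.column_stack([X_fit] + fit_cols),
        np.column_stack([X_new] + new_cols),
    )


X_train_ext, X_test_ext = add_probability_features(
    base_models, X_train, y_train, X_test
)

# Layers 2 and 3: three classifiers trained on the expanded data, combined by
# a weighted soft-voting classifier (weights are arbitrary placeholders).
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=1)),
        ("gb", GradientBoostingClassifier(random_state=1)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
    weights=[2, 1, 1],
)
voter.fit(X_train_ext, y_train)

final_pred = voter.predict(X_test_ext)
accuracy = accuracy_score(y_test, final_pred)
print(f"held-out accuracy: {accuracy:.3f}")
```

With `voting="soft"`, the voting classifier averages the class probabilities of the middle-layer models using the given weights, which mirrors the "weighted votes" step described above. In a real project the weights and each model's hyperparameters would be tuned rather than hard-coded.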
This meta-ensemble approach, SURFY2, reached a final prediction accuracy of 95.5% on an unseen test set, an improvement of over 3% compared with the formerly best approach, SURFY, in a controlled comparison. Thus, the human surfaceome predicted by SURFY2 is now the most accurate version to date and will hopefully assist research in that area. Additionally, I dug into how SURFY2 makes its predictions and compared this to the decision-making procedure of SURFY, especially with regard to discrepant predictions. This investigation also yielded some interesting observations and hypotheses regarding human transmembrane proteins.
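For readers curious what looking at discrepant predictions can mean in practice, the following is a small sketch under stated assumptions: two generic scikit-learn models stand in for SURFY and SURFY2, the data are synthetic, and all variable names are made up for illustration. Both models are scored on the same held-out split, and the samples on which they disagree are pulled out for closer inspection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; both models see exactly the same split,
# which is what makes the comparison controlled.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder models (assumption): "a" plays the older approach, "b" the newer.
model_a = RandomForestClassifier(n_estimators=100, random_state=0)
model_b = GradientBoostingClassifier(random_state=0)
pred_a = model_a.fit(X_train, y_train).predict(X_test)
pred_b = model_b.fit(X_train, y_train).predict(X_test)

acc_a = accuracy_score(y_test, pred_a)
acc_b = accuracy_score(y_test, pred_b)

# Indices of held-out samples where the two models disagree.
discrepant = np.flatnonzero(pred_a != pred_b)

# With two classes, exactly one model is right on every discrepant sample.
b_right = int((pred_b[discrepant] == y_test[discrepant]).sum())
a_right = len(discrepant) - b_right

print(f"accuracy a: {acc_a:.3f}, accuracy b: {acc_b:.3f}")
print(f"discrepant samples: {len(discrepant)} (a right: {a_right}, b right: {b_right})")
```

Because the two models agree everywhere else, the whole accuracy gap between them comes from these discrepant samples, which is why they are a natural place to start when asking how two classifiers differ.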