Reproducible Analysis of Post-Translational Modifications in Proteomes-Application to Human Mutations
PLoS One 10, e0144692. (2015)
Holehouse, A.S., and Naegle, K.M.
There are two main takeaways from this paper:
- Our understanding of biology changes with time, and we should repeatedly compare conclusions drawn previously with the conclusions that can be drawn today, asking 'If we know more now, do our conclusions change at all?' As an example, if you'd never seen a car before (perhaps you're a time traveller?) and saw one red car, you might conclude that all cars are red. After seeing 100 cars you would know better.
- We found that disease-causing mutations are significantly more likely to occur close to some (but not all) types of post-translational modification site. Sites of post-translational modification are special places on proteins, often linked to cell signaling.
In more detail...
Our understanding of biological sciences changes with time (...). This is not exactly a revolutionary idea, but while our 'big picture' understanding changes (evolution, DNA, epigenetics etc.) we are also constantly just collecting more data. We discover new things every week, such that the outcome of large-scale analyses which consider 'all human knowledge' on some specific subject may change as we gain more and more knowledge.
In this work, we made the point that we should use scientific tools that allow for the reproducible analysis of databases using specific 'versions' of datasets, such that we can systematically compare how our conclusions change over time as more data are collected. This would let us (as a scientific community) ask if previously drawn conclusions are truly robust, or if they are the consequence of limited data availability (e.g. our 'all cars are red' scenario from above).
To make this more tangible, we were interested in post-translational modifications (PTMs). PTMs are basically just chemical groups that are added to or removed from proteins to change their behavior. There is a massive amount of data on the location of these PTMs, i.e. where on an amino acid sequence these changes occur. The ProteomeScout database is an easy-to-use datastore of this information which, as well as being an interactive database, allows access to the complete dataset through a downloadable file. To facilitate this reproducible analysis, we developed a general toolkit for analyzing data from ProteomeScout called ProteomeScoutAPI. ProteomeScoutAPI lets you take an arbitrary ProteomeScout data file and run a wide range of analyses. Importantly, you can build your own analysis pipelines around the data, and then plug different versions of the ProteomeScout dataset into the ProteomeScoutAPI to ask how the ever-changing dataset of PTMs leads to changes in the conclusions drawn.
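The idea of plugging different dataset versions into one fixed analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not use the ProteomeScoutAPI itself, the `Dataset` structure, the function names and the version labels are all hypothetical, and the two tiny "versions" are made-up toy data rather than real ProteomeScout content.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory representation of one dataset version:
# protein ID -> list of (residue position, modification type).
Dataset = Dict[str, List[Tuple[int, str]]]


def count_sites_by_type(dataset: Dataset) -> Dict[str, int]:
    """Example analysis: count how many sites of each PTM type are known."""
    counts: Dict[str, int] = {}
    for sites in dataset.values():
        for _position, mod_type in sites:
            counts[mod_type] = counts.get(mod_type, 0) + 1
    return counts


def run_across_versions(
    versions: Dict[str, Dataset],
    analysis: Callable[[Dataset], Dict[str, int]],
) -> Dict[str, Dict[str, int]]:
    """Apply the same, unchanged analysis to every dataset version."""
    return {label: analysis(dataset) for label, dataset in versions.items()}


# Toy data (invented for illustration only).
versions: Dict[str, Dataset] = {
    "older": {"P1": [(10, "phosphoserine")]},
    "newer": {
        "P1": [(10, "phosphoserine"), (25, "ubiquitination")],
        "P2": [(7, "phosphotyrosine")],
    },
}

results = run_across_versions(versions, count_sites_by_type)
for label, counts in results.items():
    print(label, counts)
```

Because the analysis function never changes, any difference between the "older" and "newer" results can only come from the data, which is the comparison the paragraph above argues for.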
To demonstrate the ProteomeScoutAPI's power, we analyzed the relationship between disease-causing mutations and PTMs. To cut a long story short, after correcting for study bias, we found that mutations within 8 residues of ubiquitination, phosphoserine and phosphotyrosine sites were significantly more likely to be pathogenic. This may indicate a link between these types of PTMs and disease, but we should be aware that classifying mutations as 'disease-causing' in a discrete, binary way is fraught with its own biases: the effects of mutations are likely to exist on a continuous scale, and many mutations will go unreported. However, fundamentally, this does at least suggest that mutations near PTMs might be expected to have a greater pathogenic impact on function than other mutations.
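To give a feel for what "within 8 residues" and "significantly more likely" mean in practice, here is a minimal Python sketch. It rests on several assumptions: the window is treated as inclusive, the significance test is a standard one-sided Fisher's exact test swapped in for illustration (not necessarily the test used in the paper), no study-bias correction is attempted, and the positions and labels are invented toy data. All function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def near_ptm(mutation_pos: int, ptm_positions: List[int], window: int = 8) -> bool:
    """True if the mutation lies within `window` residues (inclusive) of any PTM site."""
    return any(abs(mutation_pos - p) <= window for p in ptm_positions)


def contingency(
    mutations: List[Tuple[int, bool]], ptm_positions: List[int], window: int = 8
) -> Tuple[int, int, int, int]:
    """Return (near & pathogenic, near & benign, far & pathogenic, far & benign)."""
    near_path = near_benign = far_path = far_benign = 0
    for position, is_pathogenic in mutations:
        if near_ptm(position, ptm_positions, window):
            if is_pathogenic:
                near_path += 1
            else:
                near_benign += 1
        elif is_pathogenic:
            far_path += 1
        else:
            far_benign += 1
    return near_path, near_benign, far_path, far_benign


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test: P(near & pathogenic >= a) with margins fixed."""
    n = a + b + c + d
    near_total = a + b   # all mutations close to a PTM site
    path_total = a + c   # all pathogenic mutations
    upper = min(near_total, path_total)
    favourable = sum(
        comb(near_total, k) * comb(n - near_total, path_total - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n, path_total)


# Toy data (invented): one PTM site at residue 50,
# mutations given as (position, is_pathogenic).
ptm_sites = [50]
mutations = [(45, True), (52, True), (58, True), (59, False),
             (100, False), (120, False), (41, True)]

table = contingency(mutations, ptm_sites)
p_value = fisher_one_sided(*table)
print(table, p_value)
```

With only seven toy mutations the p-value cannot be very small; the point is the shape of the calculation, which a real analysis would run over many proteins and with appropriate bias corrections.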