Friday, November 19, 2010

Protein and gene model inference based on statistical modeling in k-partite graphs


Today, Natalie presented the paper “Protein and gene model inference based on statistical modeling in k-partite graphs” by Gerster et al. in PNAS.

The paper claims that their MIPGEM (Markovian Inference of Proteins and Gene Models) is comparable to previous approaches to peptide scoring such as the N-peptide rule, ProteinProphet, the nested mixture model, the hierarchical statistical model, and MSBayesPro.

The authors claim the method is novel in several ways:
1. They allow dependencies between peptide scores.
2. They allow shared peptides.
3. They model peptide scores as random values to allow for low-quality peptide scoring.
4. They can infer the probability of a gene model being present.

There are two main points in this proposal:
1. Dependencies between peptides and proteins based on Markovian assumptions
- Their model computes the probability of a protein being present given the probabilities or scores of the observed peptides.
2. Treatment of shared and unshared peptides
- Shared peptides increase or decrease the probability that a protein is present, depending on whether their scores are above or below the median of all peptide scores.
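The bookkeeping behind the second point can be sketched in a few lines of Python. This is only an illustration of how peptides might be labelled as shared or unique and compared with the median score; it is not the MIPGEM probability model, and the function name, peptide names, and scores are all hypothetical.

```python
from statistics import median


def classify_peptides(peptide_to_proteins, peptide_scores):
    """Label each peptide as shared/unique and as above/not above the median score.

    Hypothetical helper for illustration only; it does not compute
    any protein probability.
    """
    med = median(peptide_scores.values())
    labels = {}
    for pep, prots in peptide_to_proteins.items():
        kind = "shared" if len(prots) > 1 else "unique"
        side = "above" if peptide_scores[pep] > med else "not_above"
        labels[pep] = (kind, side)
    return labels


# Toy data (assumed): PEPB maps to two proteins, so it is shared.
peptide_to_proteins = {
    "PEPA": {"P1"},
    "PEPB": {"P1", "P2"},
    "PEPC": {"P2"},
}
peptide_scores = {"PEPA": 0.9, "PEPB": 0.4, "PEPC": 0.7}

labels = classify_peptides(peptide_to_proteins, peptide_scores)
```

In this toy example the shared peptide PEPB scores below the median, so under the rule described above it would lower, rather than raise, the evidence for P1 and P2.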

The paper compares the performance of MIPGEM with other methods on two datasets (a mixture of 18 purified proteins and the sigma49 dataset) by plotting the number of true positives against the number of false positives. On the mixture of 18 purified proteins, their model behaves similarly to the others but performs slightly worse. On the sigma49 dataset, however, the number of true positives for MIPGEM alone rises straight up at zero false positives and then flattens out, so the method can be used to achieve zero false positives.
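For readers unfamiliar with this kind of plot, here is a minimal sketch of how a true-positive versus false-positive curve can be built from a ranked protein list. The protein names and ranking are made up, and the helper names are assumptions for illustration, not anything taken from the paper.

```python
def tp_fp_curve(ranked_proteins, true_proteins):
    """Walk down a ranked list and record (false positives, true positives)."""
    tp = fp = 0
    curve = []
    for prot in ranked_proteins:
        if prot in true_proteins:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


def max_tp_at_zero_fp(curve):
    """Number of true positives reached before the first false positive."""
    return max((tp for fp, tp in curve if fp == 0), default=0)


# Toy ranking (assumed): A, B, C are truly present; X and Y are not.
ranked = ["A", "B", "X", "C", "Y"]
truth = {"A", "B", "C"}

curve = tp_fp_curve(ranked, truth)
zero_fp_hits = max_tp_at_zero_fp(curve)
```

A curve that "rises straight up at zero false positives" corresponds to a large value of `max_tp_at_zero_fp`; the proteins left unreported at that cutoff are the false negatives.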

To sum up, the paper claims that MIPGEM is reliable for protein and gene model inference; however, our discussion concluded that the proposal is not much better than other methods.

Specific criticism:
1.  The tradeoff of FP and FN does not warrant the added complexity of the method.  Zero FP rate was obtained with a very high FN rate.
2.  The authors choose a mixture model for conditional probabilities of peptide scores given the presence/absence of proteins.  However, this function is given no justification.  Nor is any advice given about choosing this function based on the nature of the peptide scores.
3.  They discard useful information about spectral counts for peptides, instead adopting a more conservative approach of only accepting the best spectrum per peptide.
4.  The method does not present much novelty.

Speaker: Natalie
Slides: Click Me
Scribe: Yoona





Monday, November 15, 2010

Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation

Today, Sangtae presented the paper "Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation" by the Matthias Mann group.

The main question discussed in this paper is whether or not HCD technology can be used for large-scale phosphoproteomics, and the authors' answer is 'yes'.

Mass spectrometers can be divided into three groups according to their mass accuracy:
1. Low Resolution (0.1-1Da): Ion Trap, Quadrupoles, Triple Quadrupoles
2. Medium Resolution (0.01-0.1Da): Time of Flights, Hybrids with Quadrupoles
3. High Resolution (0.001-0.01Da): FTICR MS, FT Orbitraps, Hybrids with ion traps

But why do we need higher resolution data? At the MS1 level, high-resolution masses help with unambiguous charge determination and with shrinking the search space. At the MS2 level, they help with fragment ion charge determination and with separating signal from noise.
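The charge determination argument can be made concrete with a small sketch. Adjacent isotope peaks of an ion with charge z are spaced roughly 1.00335/z apart in m/z, so the charge can only be read off if the mass accuracy is finer than the difference between candidate spacings. The peak values, tolerances, and function name below are assumptions chosen for illustration.

```python
# Approximate mass difference between 13C and 12C, in Da.
ISOTOPE_SPACING = 1.00335


def consistent_charges(mz_peaks, tolerance, max_charge=4):
    """Return every charge whose expected isotope spacing matches the
    observed spacing between the first two peaks, within the tolerance."""
    spacing = mz_peaks[1] - mz_peaks[0]
    return [
        z
        for z in range(1, max_charge + 1)
        if abs(spacing - ISOTOPE_SPACING / z) <= tolerance
    ]


# Toy isotope pair (assumed) from a triply charged ion: spacing = 1.00335 / 3.
peaks = [500.0, 500.0 + ISOTOPE_SPACING / 3]

high_res = consistent_charges(peaks, tolerance=0.01)
low_res = consistent_charges(peaks, tolerance=0.2)
```

With the tight tolerance only charge 3 fits; with the loose one, charges 2, 3, and 4 are all consistent with the same pair of peaks, which is the ambiguity that high-resolution data removes.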

The high-high strategy means collecting both MS1 and MS2 spectra with high precision. The high-low strategy means collecting high-precision MS1 and low-precision MS2 spectra. Up to now, high-low has been the preferred choice, because it's cheaper and many database search tools cannot handle high-precision data anyway. Moreover, many high-res MS2 methods yield poor fragmentation. The ion trap is still the most popular data collection method, because it's cheap, fast, and sensitive (rich fragmentation).

ETD fragmentation is the most popular fragmentation method for longer and higher charged peptides. It is particularly interesting for PTM discovery, because modifications such as phosphorylation are not lost during the fragmentation.

For charge +1 and +2, CID is the fragmentation method of choice. For charge +3 spectra there is a disagreement among the researchers. Kyowon believes ETD outperforms CID for triply charged spectra, while Sangtae believes CID is still the better fragmentation.

HCD fragmentation can only be used with the LTQ-Orbitrap (high-high strategy), and it has lower sensitivity. However, the advantage of HCD over ETD and CID is that low-mass ions are not lost. This means one can observe ions such as immonium ions in the spectra and use that information for better identification.

For a dataset of enriched phosphopeptides from HeLa S3 cells, HCD high-high marginally outperformed CID high-low. However, the main issue was that HCD requires more ions than CID to generate good-quality spectra, so the target value was not reached for many of the spectra because of the limited injection time. Still, the authors claim that a large portion of the spectra with insufficient ions can be identified.

The main criticism of the paper is that Mascot, the database search tool used to identify peptides, does not fully benefit from high-accuracy data (the tolerance is set to 0.25Da). This means the HCD results are under-represented.
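A rough sketch of why a wide fragment tolerance wastes high-accuracy data: with a 0.25Da window, peaks that are clearly off-target at high resolution still count as matches. The code below is a simple peak-counting illustration, not Mascot's scoring, and all m/z values and names are hypothetical.

```python
def count_matches(observed, theoretical, tol):
    """Count theoretical fragment m/z values that have at least one
    observed peak within +/- tol Da."""
    return sum(
        any(abs(o - t) <= tol for o in observed)
        for t in theoretical
    )


# Toy values (assumed): only the first observed peak is a close match.
theoretical = [175.119, 246.156, 347.204]
observed = [175.120, 246.300, 347.400]

wide = count_matches(observed, theoretical, tol=0.25)
narrow = count_matches(observed, theoretical, tol=0.01)
```

At 0.25Da all three fragments appear matched; at 0.01Da only one does. A search that uses the wide window cannot tell these cases apart, so the extra accuracy of HCD spectra goes unused.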

Speaker: Sangtae
Slides: Click Me
Scribe: Hosein