Friday, December 17, 2010

MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies

MaxQuant enables high peptide identification rates, individualized 
p.p.b.-range mass accuracies and proteome-wide protein quantification
Jurgen Cox & Matthias Mann, Nature Biotechnology


* What is MaxQuant?
===================
high-res quantitative MS data SILAC => MS/MS ID and prot quantification
SILAC = stable isotope labeling by amino acids in cell culture
   labeled with light vs. heavy isotope => pairs of heavy and light isotope 
    patterns  => ratio heavy/light

pipeline:
1. feature detection and peptide quantitation
  1a. peak detection: 
    + 2D (centroid) and 3D (peak alignment)
    + bootstrap estimation
  1b. SILAC pair detection
    + for all pairs of isotope patterns: 
      - correlation test > 0.5, equal charge, close enough mass
      comments: 0.5 too low? lots of false positive? rely on mass too much?
    + convolute 2 isotope patterns
      - with possible KR combinations (K, R, KK, KR, RR, KKK, and more)
      - find same atomic composition
    comments: how to treat those that did not fully incorporate heavy isotope?
  1c. quantitation (discussed)
    used SILAC intensity ratio: heavy/light. one peak, total? 
                                 monoisotopic? most intense peak?

2. MS/MS ion search -- Mascot

3. ID and validation
  3a. posterior error probability for calculating FDR
    + probability that ID is false knowing Mascot score and peptide length
      comments: 
        normalize Mascot score by peptide length?
        what is the training data? don't need one, use decoy
        why didn't they do FDR like everyone else? by normalizing by length, 
          get more IDs
          how do they justify it?
    + peptide score distributions: decoy vs target hit
      after length normalization: score distributions target vs decoy 
       more separated
      comments:
        proportions of target hits vs decoy hits are off
        why does length normalization make such a difference? 
          Mascot gives higher score to longer peptides.
  
4. visualization
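
The pair test in step 1b can be sketched as follows. This is a minimal illustration, not MaxQuant's implementation: the `IsotopePattern` type, the ppm tolerance, and the label shifts (Lys8 and Arg10) are assumptions, since the notes only mention the 0.5 correlation cutoff, equal charge, close mass, and the K/R combinations.

```python
from dataclasses import dataclass
from itertools import combinations, combinations_with_replacement

# Assumed heavy-label mass shifts in Da (Lys8, Arg10); the notes do not name the labels.
LABEL_SHIFT = {"K": 8.014199, "R": 10.008269}

@dataclass
class IsotopePattern:          # hypothetical container for one detected feature
    mass: float                # monoisotopic neutral mass in Da
    charge: int
    profile: list              # intensities on a shared retention-time grid

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def label_shifts(max_labels=3):
    """Mass shifts for K, R, KK, KR, RR, KKK, ... up to max_labels labeled residues."""
    shifts = {}
    for n in range(1, max_labels + 1):
        for combo in combinations_with_replacement("KR", n):
            shifts["".join(combo)] = sum(LABEL_SHIFT[aa] for aa in combo)
    return shifts

def find_silac_pairs(patterns, min_corr=0.5, tol_ppm=10.0, max_labels=3):
    """Return (light, heavy, label_combo) for every pair passing all three tests."""
    shifts = label_shifts(max_labels)
    pairs = []
    for light, heavy in combinations(sorted(patterns, key=lambda p: p.mass), 2):
        if light.charge != heavy.charge:
            continue                                   # equal charge
        if pearson(light.profile, heavy.profile) <= min_corr:
            continue                                   # co-elution: correlation > 0.5
        delta = heavy.mass - light.mass
        tol = tol_ppm * 1e-6 * heavy.mass
        for combo, shift in shifts.items():            # close enough mass
            if abs(delta - shift) <= tol:
                pairs.append((light, heavy, combo))
                break
    return pairs
```

A partially labeled peptide would fail the mass test here, which is the open question raised in the comments above.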


* What are the benefits of MaxQuant?
====================================
1. improving peptide mass accuracy
  after re-calibration: p.p.m. much lower
2. high rate of identified MS/MS spectra
  not many ID for low mass but many IDs at high mass
  comments: doesn't look like 70% ID
3. proteome-wide protein quantification
  + protein ratio = median(all SILAC peptide ratio)
    comments:
      1 peptide or both peptides in pair IDed?
        both peptides in pair must ID as same peptide
      what if MS/MS for only one of the pair?
      if 2 peptides IDed but not paired earlier, can pair after (10% extra)
      how helpful is it to use tight tolerance? did not say
  + P-value for detection of significant outlier ratios
    significance A: computed over all protein ratios (as above)
    significance B: bin proteins by intensity, compute within each bin
    significance A better than significance B
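
The median protein ratio and an outlier P-value can be sketched as below. This is a hedged illustration: the notes only say that a P-value flags significant outlier ratios, so the robust percentile-based z-score on log ratios (15.87th, 50th, 84.13th percentiles) and all function names are assumptions, not a statement of the exact MaxQuant definition.

```python
import math
from statistics import median

def protein_ratio(peptide_ratios):
    """Protein ratio as the median of its SILAC peptide ratios (heavy/light)."""
    return median(peptide_ratios)

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def outlier_p_values(protein_ratios):
    """One-sided P-value per protein that its log ratio is an outlier.

    Assumption: log ratios of unchanged proteins are roughly normal, with the
    spread estimated separately on each side of the median.
    """
    logs = [math.log2(r) for r in protein_ratios]
    s = sorted(logs)
    r_lo, r_mid, r_hi = percentile(s, 15.87), percentile(s, 50), percentile(s, 84.13)
    p_values = []
    for x in logs:
        width = (r_hi - r_mid) if x >= r_mid else (r_mid - r_lo)
        # Degenerate spread: treat as no evidence of an outlier.
        z = abs(x - r_mid) / width if width > 0 else 0.0
        p_values.append(0.5 * math.erfc(z / math.sqrt(2)))
    return p_values
```

An intensity-binned variant would call `outlier_p_values` once per intensity bin instead of once over all proteins.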


* Conclusion
============
MaxQuant improves:
 peptide ID rate
 mass accuracy
 proteome-wide protein quantitation


* Criticism
===========
All experimental results are based on Mascot search 
  Mascot does not fully benefit from high-accuracy data (0.25 Da)
comments: LTQ so it's fine
Speaker: Yoona Kim
Scribe: Jocelyne Bruand
Slides: here

Friday, November 19, 2010

Protein and gene model inference based on statistical modeling in k-partite graphs


Today, Natalie presented the paper “Protein and gene model inference based on statistical modeling in k-partite graphs” by Gerster et al. in PNAS.

The paper proposes that MIPGEM (Markovian Inference of Proteins and Gene Models) is comparable to previous approaches such as the N-peptide rule, ProteinProphet, the nested mixture model, the hierarchical statistical model, and MSBayesPro.

The authors claim the method is novel in several ways:
1. They allow dependencies between peptide scores.
2. They allow shared peptides.
3. They model peptide scores as random values to allow for low-quality peptide scoring.
4. They can infer the probability of a gene model being present.

The proposal has two main points:
1. Dependencies between peptides and proteins based on Markovian assumptions
- Their model computes the probability of a protein being present given the probabilities or scores of the observed peptides.
2. Treatment of shared and unshared peptides
- Shared peptides increase or decrease the probability that a protein is present, depending on whether the peptide scores are above or below the median of all peptide scores.

Using two datasets (a mixture of 18 purified proteins and the Sigma49 dataset), the paper compares the performance of MIPGEM with other methods by plotting the number of true positives against the number of false positives. On the mixture of 18 purified proteins, MIPGEM behaves similarly to the other methods but performs slightly worse. On the Sigma49 dataset, the number of true positives for MIPGEM rises straight up at zero false positives and then flattens out, so the method can be used to achieve zero false positives.
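
The true-positive versus false-positive comparison can be reproduced for any ranked protein list with a few lines of code. The sketch below is only an illustration of how such a curve is built; the function names and the toy input format (protein name, probability) are assumptions and do not come from the paper.

```python
def tp_fp_curve(scored_proteins, true_proteins):
    """Cumulative (false positives, true positives) along a descending ranking.

    scored_proteins: iterable of (name, probability) pairs.
    true_proteins: set of names known to be in the sample.
    """
    ranked = sorted(scored_proteins, key=lambda item: item[1], reverse=True)
    tp = fp = 0
    curve = []
    for name, _prob in ranked:
        if name in true_proteins:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve

def true_positives_at_zero_fp(curve):
    """Number of true positives accepted before the first false positive."""
    return max((tp for fp, tp in curve if fp == 0), default=0)
```

A curve that rises steeply at zero false positives and then flattens means the top of the ranking is clean but many true proteins are ranked low, which is the high false-negative cost raised in the criticism below.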

To sum up, the paper claims that MIPGEM is reliable for protein and gene model inference; however, our discussion concluded that the proposal is not much better than other methods.

Specific criticism:
1.  The tradeoff of FP and FN does not warrant the added complexity of the method.  Zero FP rate was obtained with a very high FN rate.
2.  The authors choose a mixture model for conditional probabilities of peptide scores given the presence/absence of proteins.  However, this function is given no justification.  Nor is any advice given about choosing this function based on the nature of the peptide scores.
3.  They discard useful information about spectral counts for peptides, instead adopting a more conservative approach of only accepting the best spectrum per peptide.
4.  The method does not present much novelty.

Speaker: Natalie
Slides: Click Me
Scribe: Yoona





Monday, November 15, 2010

Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation

Today, Sangtae presented the paper "Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation" by the Matthias Mann group.

The main question discussed in this paper is whether or not HCD technology can be used for large-scale phosphoproteomics, and the authors' answer is 'yes'.

Different mass spectrometers can be divided into three groups according to their mass accuracy:
1. Low Resolution (0.1-1 Da): Ion Trap, Quadrupoles, Triple Quadrupoles
2. Medium Resolution (0.01-0.1 Da): Time of Flights, Hybrids with Quadrupoles
3. High Resolution (0.001-0.01 Da): FTICR MS, FT Orbitraps, Hybrids with ion traps

But why do we need higher resolution data? At the MS1 level, high-resolution masses help with unambiguous charge determination and reduce the search space. At the MS2 level, they help with fragment ion charge determination and signal/noise separation.
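
To see why resolution matters for charge determination, note that neighboring isotope peaks of an ion with charge z are spaced roughly 1.00335/z apart in m/z. The sketch below is an illustration only; the function name, the tolerance values, and the example peaks are assumptions chosen to show the effect.

```python
C13_C12_DIFF = 1.0033548  # Da, mass difference between 13C and 12C

def charge_from_isotope_spacing(mz_peaks, max_charge=6, tol=0.01):
    """Infer charge from the mean spacing of isotope peaks.

    Returns the charge if exactly one candidate fits within tol (in m/z units),
    otherwise None (no fit, or ambiguous at this mass accuracy).
    """
    peaks = sorted(mz_peaks)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    if not spacings:
        return None
    mean_spacing = sum(spacings) / len(spacings)
    candidates = [
        z for z in range(1, max_charge + 1)
        if abs(mean_spacing - C13_C12_DIFF / z) <= tol
    ]
    return candidates[0] if len(candidates) == 1 else None

# Isotope peaks of a hypothetical triply charged ion.
peaks = [500.0, 500.3345, 500.6689]
high_res = charge_from_isotope_spacing(peaks, tol=0.01)  # only z = 3 fits
low_res = charge_from_isotope_spacing(peaks, tol=0.2)    # several charges fit
```

With the tight tolerance only one charge is consistent with the spacing, while with the loose tolerance several charges fit and the call is ambiguous.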

The high-high strategy means collecting both MS1 and MS2 spectra with high precision. The high-low strategy means collecting high-precision MS1 and low-precision MS2 spectra. Up to now, high-low has been the preferred choice, because it is cheaper and many database search tools cannot handle high-precision data anyway. Moreover, many high-res MS2 methods yield poor fragmentation. The ion trap is still the most popular data collection method, because it is cheap, fast and sensitive (rich fragmentation).

ETD fragmentation is the most popular fragmentation method for longer and higher charged peptides. It is particularly interesting for PTM discovery, because modifications such as phosphorylation are not lost during the fragmentation.

For charge +1 and +2, CID is the fragmentation method of choice. For charge +3 spectra there is a disagreement among the researchers. Kyowon believes ETD outperforms CID for triply charged spectra, while Sangtae believes CID is still the better fragmentation.

HCD fragmentation can only be used with LTQ-orbitrap (high-high strategy), with lower sensitivity. However, the advantage of HCD over ETD and CID is the fact that low mass ions are not lost. This means one can observe ions such as immonium ions in the spectra and use that information for better identification.

For a dataset of enriched phosphopeptides from HeLa S3 cells, HCD high-high marginally outperformed CID high-low. The main issue was that HCD requires more ions than CID to generate good-quality spectra, so the target value was not reached for many of the spectra because of the limited injection time. However, the authors claim that a large portion of those spectra with insufficient ions can still be identified.

The main criticism of the paper is that Mascot, the database search engine used for peptide identification, does not fully benefit from high-accuracy data (the tolerance is set to 0.25 Da). This means the HCD results are under-represented.

Speaker: Sangtae
Slides: Click Me
Scribe: Hosein