Friday, January 28, 2011

1-28-2011 Computing Fragmentation Trees from Tandem Mass Spectrometry Data

Rasche, F, Svatoš, A, Maddula, RK, Böttcher, C, Böcker, S (2010). Computing Fragmentation Trees from Tandem Mass Spectrometry Data. Anal Chem, no page given.

Goal: Annotate MS2 spectra of small molecules with chemical compositions.

Data: Orbitrap, QSTAR, and QTOF spectra of ~200 known compounds (<1 kDa) acquired at different CID energies
Consensus spectra are built from the different energies

Strategy
  1. Calculate chemical formula of parent mass from MS1 spectrum (given)
  2. Build graph (DAG) of possible formulas to explain peaks
    • Node for each possible chem formula for a single peak (usually 3-4 nodes per peak)
    • Color nodes from the same peak with the same color
    • Edge between nodes if they could be generated by a plausible neutral loss
  3. Build tree from graph giving the most likely pattern of fragmentation to explain spectrum
    • Score nodes based on likelihood of that formula
    • Score edges according to known probabilities of various neutral losses
    • Want a 'colorful' tree, i.e. one that explains most peaks ("colors") in the spectrum while assigning at most one formula to each peak
1-2 are solved heuristically.
Solve 3 formally:
Input: DAG + edge weights + set of colors
Output: Directed tree with maximal total edge weight in which each color appears at most once
Solved by dynamic programming.
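
A minimal sketch of what such a dynamic program over color subsets could look like. The function name, the input encoding (colors as small integers, edges as a dict), and the toy numbers are illustrative assumptions, not taken from the paper. The table entry best(node, mask) is the best weight of a tree rooted at node whose nodes use exactly the colors in mask; it is built either by hanging one child under the node or by merging two trees that share only the root. Running time is exponential in the number of colors, so this only works for a handful of peaks.

```python
from functools import lru_cache

def max_colorful_subtree(root, color, edges):
    """Weight of the best tree rooted at `root` that uses each color at most once.

    color: dict node -> color index (0, 1, 2, ...), one color per peak
    edges: dict (parent, child) -> weight, forming a DAG
    """
    children = {}
    for (parent, child), weight in edges.items():
        children.setdefault(parent, []).append((child, weight))
    neg_inf = float("-inf")

    @lru_cache(maxsize=None)
    def best(node, mask):
        own = 1 << color[node]
        if not mask & own:
            return neg_inf          # the node's own color must be in the set
        if mask == own:
            return 0.0              # tree consisting of the node alone
        rest = mask & ~own
        result = neg_inf
        # Case 1: a single child carries all remaining colors.
        for child, weight in children.get(node, []):
            result = max(result, weight + best(child, rest))
        # Case 2: split the remaining colors between two trees sharing this root.
        sub = (rest - 1) & rest
        while sub:
            left = best(node, sub | own)
            right = best(node, (rest ^ sub) | own)
            result = max(result, left + right)
            sub = (sub - 1) & rest
        return result

    n_colors = max(color.values()) + 1
    return max(best(root, mask) for mask in range(1 << n_colors))

# Toy example: a1 and a2 are two candidate formulas for the same peak (color 1).
color = {"r": 0, "a1": 1, "a2": 1, "b": 2}
edges = {("r", "a1"): 2, ("r", "a2"): 3, ("a1", "b"): 5, ("r", "b"): 1}
score = max_colorful_subtree("r", color, edges)
```

In the toy example the chain r -> a1 -> b wins over the locally better edge r -> a2, which is the kind of case where a greedy choice would go wrong.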

Conclusions
  • Fragment tree is alternate representation of spectra
  • Helps experts with manual peak annotation (somehow).
  • Better than existing greedy solution and other tools for predicting spectra from structure (Mass Frontier)
  • Could be used for comparing fragmentation trees to a library, but this isn't well developed.
  • No clear progress towards goal of determining structure from MS2

Critiques:
  •  Not very systematic in the analysis
  •  Poor paper organization
  •  Are fragment trees useful?
Speaker: Anand
Scribe: Spencer
Slides: here

1-21-2011 Identifying complex patterns of PTMs and Un-targeted database search

Guan, S, Burlingame, AL (2010). Data processing algorithms for analysis of high resolution MSMS spectra of peptides with complex patterns of posttranslational modifications. Mol. Cell Proteomics, 9, 5:804-10.

Aim: For a given Spectrum, Peptide, and set of PTMs, find a subset of plausible PTM modified peptide configurations and their relative abundances.
Achievements: 
- Greedy algorithm that repeatedly selects the maximum-scoring plausible PTM configuration from the set of all possible configurations.
- Relative abundance is computed by least squares: minimize the squared error between the theoretical peak intensities (predicted from the configuration abundances) and the observed intensities.
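
A small sketch of the least-squares step, assuming each selected configuration comes with a vector of theoretical peak intensities at unit abundance. The function name and the toy numbers are made up for illustration, and this plain normal-equation solve does not enforce non-negative abundances, which a real implementation would probably want.

```python
def least_squares_abundances(templates, observed):
    """Relative abundances that best explain the observed peak intensities.

    templates[j][i]: theoretical intensity of peak i for configuration j
    observed[i]:     measured intensity of peak i
    """
    k = len(templates)
    # Normal equations (A^T A) x = A^T b, where the columns of A are the templates.
    ata = [[sum(p * q for p, q in zip(templates[r], templates[c]))
            for c in range(k)] for r in range(k)]
    atb = [sum(p * y for p, y in zip(templates[r], observed)) for r in range(k)]
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("configurations cannot be told apart by these peaks")
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    raw = [aug[r][k] / aug[r][r] for r in range(k)]
    total = sum(raw)
    return [value / total for value in raw]

# Two configurations: each has one distinguishing peak and both share a third peak.
templates = [[1.0, 0.0, 1.0],
             [0.0, 1.0, 1.0]]
observed = [6.0, 2.0, 8.0]
abundances = least_squares_abundances(templates, observed)  # about [0.75, 0.25]
```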

Responses:
- May not consider very similar configurations
- Will not handle overlapping peaks.
- A first attempt solution, where improvements can be made.
- Validation was lacking/non-existent.


Baliban, RC, DiMaggio, PA, Plazas-Mayorca, MD, Young, NL, Garcia, BA, Floudas, CA (2010). A novel approach for untargeted post-translational modification identification using integer linear optimization and tandem mass spectrometry. Mol. Cell Proteomics, 9, 5:764-79.

Aim: For a given Spectrum and x peptides, identify the PTM and localization. (With the implicit assumption that unrestricted search means a large set of plausible PTMs)
Achievements:
- ILP formulation of the problem to output candidate PTM-peptide matches with post-finagling to acquire best modified peptide matches
- Results appear excellent, with comparison to other tools.
- 5 distinct mass spec datasets.

Critiques:
- Good preprocessing policy.
- Spectra are reduced to "b" ions only. How well this policy works depends on the instrument.
- Strategy is slow, but the accuracy of results compensates.
- Unclear what "manual validation" means.
 
Speaker: Xiaowen
Scribe: Anand
Slides: here

Wednesday, January 19, 2011

1-14-2011 Spectrum denoising

"A novel approach to denoising ion trap tandem mass spectra" by Jiarui et al., 2009


spectral pre-processing
goal for pre-processing
1. remove the noise
2. decrease the number of non-identified spectra
3. increase the number of identified peptides

procedure
1. denoising of spectrum
-signal peaks: peaks from y or b
-noisy peaks: other peaks
2. intensity normalization
using 5 interrelation features
1. F1: # of peaks p’ such that p-p’ = an a.a. mass
2. F2: # of peaks p’ such that p+p’ = precursor mass
3. F3: # of peaks p’ such that p-p’ = H2O or NH3
4. F4: # of peaks p’ such that p-p’ = CO or NH
5. F5: # of peaks p’ such that p-p’ = isotope mass
score = w0 + w1*F1 + w2*F2 + w3*F3 + w4*F4 + w5*F5
-if the score is negative, the peak is excluded as noise
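
A rough sketch of how the five features and the linear score could be computed for one peak. The weights, the tolerance, the three-residue amino acid table, and the function names are placeholders chosen for the example; the paper's actual weights are not in these notes.

```python
# Monoisotopic masses in Da. Only three residues are listed to keep the example short.
AA_MASSES = [57.02146, 71.03711, 87.03203]   # G, A, S
H2O, NH3 = 18.01056, 17.02655
CO, NH = 27.99491, 15.01090
ISOTOPE = 1.00335                             # 13C - 12C spacing

def count_partners(mz, peaks, deltas, tol):
    """Number of peaks p' with mz - p' matching any of the given mass differences."""
    return sum(1 for other in peaks
               if any(abs((mz - other) - d) <= tol for d in deltas))

def peak_score(mz, peaks, precursor_mass, weights, tol=0.3):
    """Linear score w0 + w1*F1 + ... + w5*F5 for one peak (weights are hypothetical)."""
    f1 = count_partners(mz, peaks, AA_MASSES, tol)
    f2 = sum(1 for other in peaks if abs(mz + other - precursor_mass) <= tol)
    f3 = count_partners(mz, peaks, [H2O, NH3], tol)
    f4 = count_partners(mz, peaks, [CO, NH], tol)
    f5 = count_partners(mz, peaks, [ISOTOPE], tol)
    w0, w1, w2, w3, w4, w5 = weights
    return w0 + w1 * f1 + w2 * f2 + w3 * f3 + w4 * f4 + w5 * f5

peaks = [100.0, 157.02146, 175.03202, 500.0]
weights = (-1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
scores = {mz: peak_score(mz, peaks, 657.02146, weights) for mz in peaks}
kept = [mz for mz in peaks if scores[mz] >= 0]   # negative score = noise
```

In this toy spectrum the peak at 100.0 has no supporting partner, gets a negative score, and is dropped.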
peak selection
-after intensity normalization, signal peaks are likely to be local maxima
-to select the local maxima, a morphological reconstruction filter is adopted
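
To make the idea concrete, here is a 1D sketch of morphological reconstruction by dilation used to keep only local maxima that stand out by at least h (an h-maxima style filter). The notes do not record how the paper configures its filter, so the threshold h, the 3-point window, and the function names are assumptions.

```python
def reconstruct_by_dilation(marker, mask):
    """Grow `marker` under `mask` with a 3-point dilation until nothing changes."""
    current = [min(m, k) for m, k in zip(marker, mask)]
    n = len(current)
    while True:
        dilated = [max(current[max(i - 1, 0):i + 2]) for i in range(n)]
        grown = [min(d, k) for d, k in zip(dilated, mask)]
        if grown == current:
            return current
        current = grown

def prominent_maxima(intensities, h):
    """Indices of local maxima that rise at least h above their surroundings."""
    marker = [x - h for x in intensities]
    background = reconstruct_by_dilation(marker, intensities)
    return [i for i, (x, b) in enumerate(zip(intensities, background))
            if x - b >= h]

intensities = [0.0, 5.0, 0.0, 1.0, 0.0, 4.0, 0.0]
selected = prominent_maxima(intensities, h=2.0)   # the small bump at index 3 is dropped
```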
dataset
-ISB: ESI ion trap 37044 spectra
-TOV: LCQ DECA XP ion trap 22576 spectra
-database: ipi.Human protein database
-Mascot is used to evaluate denoising
Number of identified spectra
-spectrum is identified if its Mascot ion score is larger than the identity threshold
results
-Denoising the spectra increased the number of identifications in the Mascot search
Features of spectrum that other people use in preprocessing
-Number of peaks
-total ion current
-Good Diff fraction
-Total normalized intensity of peaks with associated isotope peaks
-complements
-water losses
-signal to noise ratio
Conclusion
-intensity normalization is too heuristic
-among used features, neutral losses are often observed in noisy peaks
-features were manually selected, and no new feature was introduced
-the benefit of the morphological filter is not clear
-standard target-decoy analysis was not shown
-it is about denoising, but the result of denoising is not directly shown
-the proposed scheme may not be suitable for other tools
-the running time of their algorithm is not shown
discussion
-They increased the # of identifications of MASCOT
-Spectrum preprocessing might help de novo sequencing, but it gives no significant improvement for database search
-Preprocessing is highly dependent on scoring function itself


Speaker: Kyowon
Scribe: Sunghee
Slides: here

Monday, January 3, 2011

12/10/2010 Spatial segmentation of imaging mass spectrometry data with edge-preserving image denoising and clustering


Title : Spatial segmentation of imaging mass spectrometry data with edge-preserving image denoising and clustering
by Theodore Alexandrov, Michael Becker, Sören Deininger, Günther Ernst, Liane Wehder, Markus Grasmair, Ferdinand von Eggeling, Herbert Thiele, and Peter Maass
  1. Background: MS imaging enables visualization of the spatial distribution of, e.g., compounds, biomarkers, metabolites, peptides, or proteins by their molecular masses.
  2. What they did: the authors proposed a new procedure for spatial segmentation of MALDI-imaging datasets.
  3. How they did it: they built a pipeline that consists of
    • spectra preprocessing
    • peak picking
    • edge-preserving denoising of m/z images
    • finally, clustering.
  4. More in detail...
    • Spectra preprocessing uses baseline correction, which reduces the intensity errors.
    • Peak picking selects 10 peaks in every 10th spectrum and keeps only peaks found in at least 1% of the spectra across the entire sample. Orthogonal Matching Pursuit (OMP) is used since it is simple and fast.
    • Denoising uses the Grasmair modification of Chambolle's total variation minimization algorithm. The parameter theta controls the smoothness.
    • Clustering uses High Dimensional Discriminant Clustering (HDDC), where each cluster is modeled by a Gaussian distribution.
  5. Results
    • Datasets: a rat brain coronal section and a section of a neuroendocrine tumor (NET) invading the small intestine
    • Peak picking using OMP detects major peaks successfully.
    • Denoising with the Grasmair method removed noise efficiently without smoothing out edges. The results show that the choice of the parameter theta is important, though.
    • The image clustered by the proposed pipeline and the segmentation map of the rat brain were shown to be similar to each other. The edge-preserving denoising affects the clustering result.
    • 3 parameters for peak picking and 2 parameters for denoising and clustering should be tuned for a good result.
  6. Conclusion
    • HDDC clustering is better than k-means but slow.
    • It is important for cancer studies.
  7. Criticism
    • What they are optimizing is unclear.
    • Too many parameters can influence the result, yet no optimal values are given for various applications.
    • The slow running time makes it hard to run multiple trials.
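
For readers who have not seen total variation denoising, here is a 1D sketch of the plain Chambolle projection iteration on a single row of intensities. It is not the Grasmair modification used in the paper, and treating theta as the regularization weight is an assumption made for this example; the step size tau and the iteration count are also arbitrary choices.

```python
def tv_denoise_1d(signal, theta, iterations=300, tau=0.2):
    """Plain Chambolle-style total variation denoising of a 1D signal.

    Larger theta gives a smoother result; sharp steps survive better than
    they would under simple averaging.
    """
    n = len(signal)
    p = [0.0] * n   # dual variable; the last entry is never updated and stays 0

    def divergence(values):
        return [values[i] - (values[i - 1] if i > 0 else 0.0) for i in range(n)]

    for _ in range(iterations):
        d = divergence(p)
        r = [d[i] - signal[i] / theta for i in range(n)]
        for i in range(n - 1):
            q = r[i + 1] - r[i]                      # forward difference
            p[i] = (p[i] + tau * q) / (1.0 + tau * abs(q))
    d = divergence(p)
    return [signal[i] - theta * d[i] for i in range(n)]

def total_variation(values):
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

# A step edge with alternating noise on both plateaus.
noisy = [0.5, -0.5, 0.5, -0.5, 10.5, 9.5, 10.5, 9.5]
smooth = tv_denoise_1d(noisy, theta=1.0)
```

The output has less wiggle on each plateau while the jump between them stays large, which is the "edge-preserving" property the pipeline relies on before clustering.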
Speaker: Jocelyne
Scribe: Kyowon
Slides: here

Friday, December 17, 2010

MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies

MaxQuant enables high peptide identification rates, individualized 
p.p.b.-range mass accuracies and proteome-wide protein quantification
Jurgen Cox & Matthias Mann, Nature Biotechnology


* What is MaxQuant?
===================
high-res quantitative MS data (SILAC) => MS/MS ID and protein quantification
SILAC = stable isotope labeling by amino acids in cell culture
   labeled with light vs. heavy isotope => pairs of heavy and light isotope 
    patterns  => ratio heavy/light

pipeline:
1. feature detection and peptide quantitation
  1a. peak detection: 
    + 2D (centroid) and 3D (peak alignment)
    + bootstrap estimation
  1b. SILAC pair detection
    + for all pairs of isotope patterns: 
      - correlation test > 0.5, equal charge, close enough mass
      comments: 0.5 too low? lots of false positive? rely on mass too much?
    + convolute 2 isotope patterns
      - with possible KR combinations (K, R, KK, KR, RR, KKK, and more)
      - find same atomic composition
    comments: how to treat those that did not fully incorporate heavy isotope?
  1c. quantitation (discussed)
    used SILAC intensity ratio: heavy/light. one peak, total? 
                                 monoisotopic? most intense peak?

2. MS/MS ion search -- Mascot

3. ID and validation
  3a. posterior error probability for calculating FDR
    + probability that ID is false knowing Mascot score and peptide length
      comments: 
        normalize Mascot score by peptide length?
        what is the training data? don't need one, use decoy
        why didn't they do FDR like everyone else? by normalizing by length, 
          get more IDs
          how do they justify it?
    + peptide score distributions: decoy vs target hit
      after length normalization: score distributions target vs decoy 
       more separated
      comments:
        proportions of target hits vs decoy hits are off
        why does length normalization make such a difference? 
          Mascot gives higher score to longer peptides.
  
4. visualization
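
A sketch of the SILAC pair test from step 1b, under several assumptions that are not in the notes: the heavy labels are Lys8 and Arg10 (shifts of about 8.0142 and 10.0083 Da), the correlation is a Pearson correlation between the two intensity profiles, and the mass tolerance and dictionary field names are invented for the example.

```python
from itertools import combinations_with_replacement
from math import sqrt

LABEL_SHIFT = {"K": 8.014199, "R": 10.008269}   # assumed Lys8 / Arg10 labels, in Da

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def label_shifts(max_residues=3):
    """Mass shift for every K/R combination: K, R, KK, KR, RR, KKK, ..."""
    shifts = {}
    for count in range(1, max_residues + 1):
        for combo in combinations_with_replacement("KR", count):
            shifts["".join(combo)] = sum(LABEL_SHIFT[c] for c in combo)
    return shifts

def silac_pair_label(light, heavy, mass_tol=0.01, min_corr=0.5):
    """Return the K/R combination explaining the pair, or None if it is not a pair."""
    if light["charge"] != heavy["charge"]:
        return None
    if pearson(light["profile"], heavy["profile"]) <= min_corr:
        return None
    delta = (heavy["mz"] - light["mz"]) * light["charge"]
    for name, shift in label_shifts().items():
        if abs(delta - shift) <= mass_tol:
            return name
    return None

light = {"mz": 500.0, "charge": 2, "profile": [1.0, 3.0, 2.0]}
heavy = {"mz": 509.011234, "charge": 2, "profile": [2.0, 6.0, 4.1]}
label = silac_pair_label(light, heavy)   # one K and one R
```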


* What are the benefits of MaxQuant?
====================================
1. improving peptide mass accuracy
  after re-calibration: p.p.m. much lower
2. high rate of identified MS/MS spectra
  not many ID for low mass but many IDs at high mass
  comments: doesn't look like 70% ID
3. proteome-wide protein quantification
  + protein ratio = median(all SILAC peptide ratio)
    comments:
      1 peptide or both peptides in pair IDed?
        both peptides in pair must ID as same peptide
      what if MS/MS for only one of the pair?
      if 2 peptides IDed but not paired earlier, can pair after (10% extra)
      how helpful is it to use tight tolerance? did not say
  + P-value for detection of significant outlier ratio (significance A)
    A. as above
    B. bin proteins by intensities
    significance A better than significance B
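
A tiny illustration of why the protein ratio is taken as a median rather than a mean. The peptide ratios are made-up numbers.

```python
from statistics import mean, median

def protein_ratio(peptide_ratios):
    """Protein heavy/light ratio as the median of its SILAC peptide ratios."""
    return median(peptide_ratios)

peptide_ratios = [0.9, 1.1, 1.0, 8.0]     # one outlier peptide
robust = protein_ratio(peptide_ratios)    # about 1.05
naive = mean(peptide_ratios)              # about 2.75, dragged up by the outlier
```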


* Conclusion
============
MaxQuant improves:
 peptide ID rate
 mass accuracy
 proteome-wide protein quantitation


* Criticism
===========
All experimental results are based on Mascot search 
  Mascot does not fully benefit from high-accuracy data (0.25 Da)
comments: LTQ so it's fine
Speaker: Yoona Kim
Scribe: Jocelyne Bruand
Slides: here

Friday, November 19, 2010

Protein and gene model inference based on statistical modeling in k-partite graphs


Today, Natalie presented the paper “Protein and gene model inference based on statistical modeling in k-partite graphs” by Gerster et al. in PNAS.

The paper proposes MIPGEM (Markovian Inference of Proteins and Gene Models) and claims it is comparable to previous approaches such as the N-peptide rule, ProteinProphet, the nested mixture model, the hierarchical statistical model, and MSBayesPro.

The authors claim the method is novel in several ways:
1. They allow dependencies between peptide scores.
2. They allow shared peptides
3. They model peptide scores as random values to allow for low-quality peptide scoring.
4. They can infer the probability of a gene model being present.

The proposal has two main points:
1. Dependencies between peptides and proteins based on Markovian assumptions
- Their model computes the probability of a protein being present given the probabilities or scores of the observed peptides.
2. Considering ‘Shared and unshared’
- Shared peptides contribute to increase or decrease the probability for presence of a protein, depending on whether the peptide scores are above or below the median of all peptides scores.
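
A toy sketch of the peptide-protein layer of such a graph, marking which peptides are shared and in which direction a shared peptide would push a protein according to the median rule above. It does not compute MIPGEM probabilities; the function name, labels, and scores are invented for illustration.

```python
from statistics import median

def peptide_evidence(scores, peptide_to_proteins):
    """Group peptides by protein and label shared ones by the median rule."""
    cutoff = median(scores.values())
    evidence = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            label = "unique"
        elif scores[peptide] > cutoff:
            label = "shared, raises"
        elif scores[peptide] < cutoff:
            label = "shared, lowers"
        else:
            label = "shared, neutral"
        for protein in proteins:
            evidence.setdefault(protein, []).append((peptide, label))
    return evidence

scores = {"pep1": 0.9, "pep2": 0.2, "pep3": 0.6}
peptide_to_proteins = {"pep1": ["A"], "pep2": ["A", "B"], "pep3": ["B"]}
evidence = peptide_evidence(scores, peptide_to_proteins)
```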

On two datasets (a mixture of 18 purified proteins and the Sigma49 dataset), the paper compares MIPGEM with other methods by plotting the number of true positives against the number of false positives. On the mixture of 18 purified proteins, their model is similar to the others but performs slightly worse. On the Sigma49 dataset, the number of true positives for MIPGEM rises straight up at zero false positives and then flattens out, so it can be used to achieve zero false positives.

To sum up, the paper claims that MIPGEM is reliable for protein and gene model inference; however, our discussion concluded that it is not much better than other methods.

Specific criticism:
1.  The tradeoff of FP and FN does not warrant the added complexity of the method.  Zero FP rate was obtained with a very high FN rate.
2.  The authors choose a mixture model for conditional probabilities of peptide scores given the presence/absence of proteins.  However, this function is given no justification.  Nor is any advice given about choosing this function based on the nature of the peptide scores.
3.  They discard useful information about spectral counts for peptides, instead adopting a more conservative approach of only accepting the best spectrum per peptide.
4.  The method does not present much novelty.

Speaker: Natalie
Slides: Click Me
Scribe: Yoona





Monday, November 15, 2010

Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation

Today, Sangtae presented the paper "Feasibility of large scale Phosphoproteomics with Higher Energy Collisional Dissociation Fragmentation" by the Matthias Mann group.

The main question discussed in this paper is whether or not HCD technology can be used for large-scale phosphoproteomics, and the authors' answer is 'yes'.

Different mass spectrometers can be divided into three groups according to their mass accuracy:
1. Low Resolution (0.1-1 Da): Ion Trap, Quadrupoles, Triple Quadrupoles
2. Medium Resolution (0.01-0.1 Da): Time of Flights, Hybrids with Quadrupoles
3. High Resolution (0.001-0.01 Da): FTICR MS, FT Orbitraps, Hybrids with ion traps

But why do we need higher resolution data? At the MS1 level, high-res masses help with unambiguous charge determination and with reducing the search space. At the MS2 level, they help with fragment ion charge determination and signal/noise separation.
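
A quick illustration of the charge determination point: neighboring isotope peaks of an ion with charge z are about 1.00335/z apart in m/z, so the charge can only be read off if the mass accuracy is finer than the difference between candidate spacings. The function name, the tolerances, and the peak values are made up for the example.

```python
ISOTOPE_SPACING = 1.00335   # Da, 13C - 12C

def compatible_charges(first_mz, second_mz, tolerance, max_charge=4):
    """Charges whose expected isotope spacing fits the observed spacing."""
    spacing = second_mz - first_mz
    return [z for z in range(1, max_charge + 1)
            if abs(spacing - ISOTOPE_SPACING / z) <= tolerance]

# Two neighboring isotope peaks of a triply charged ion.
high_res = compatible_charges(500.0, 500.3345, tolerance=0.01)   # only charge 3 fits
low_res = compatible_charges(500.0, 500.3345, tolerance=0.5)     # charges 2, 3, and 4 all fit
```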

The high-high strategy means collecting both MS1 and MS2 spectra with high precision. The high-low strategy means collecting high-precision MS1 and low-precision MS2. Up to now, the high-low strategy has been the preferred choice because it's cheaper and many database search tools cannot handle high-precision data anyway. Moreover, many high-res MS2 methods yield poor fragmentation. Ion trap is still the most popular data collection method because it's cheap, fast, and sensitive (rich fragmentation).

ETD fragmentation is the most popular fragmentation method for longer and higher charged peptides. It is particularly interesting for PTM discovery, because modifications such as phosphorylation are not lost during the fragmentation.

For charge +1 and +2, CID is the fragmentation method of choice. For charge +3 spectra there is a disagreement among the researchers. Kyowon believes ETD outperforms CID for triply charged spectra, while Sangtae believes CID is still the better fragmentation.

HCD fragmentation can only be used with LTQ-orbitrap (high-high strategy), with lower sensitivity. However, the advantage of HCD over ETD and CID is the fact that low mass ions are not lost. This means one can observe ions such as immonium ions in the spectra and use that information for better identification.

For a dataset of enriched phosphopeptides from HeLa S3 cells, HCD high-high marginally outperformed CID high-low. However, the main issue was that HCD requires more ions than CID to generate good-quality spectra, so the target value was not reached for many of the spectra because of the limited injection time. Still, the authors claim that a large portion of those spectra with insufficient ions can be identified.

The main criticism of the paper is that Mascot, the database search tool used for identification of peptides, does not fully benefit from high-accuracy data (the tolerance is set to 0.25 Da). This means the HCD results are under-represented.
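
To put the 0.25 Da tolerance in perspective, a short conversion between Da and ppm. The m/z of 500 and the 5 ppm figure are illustrative values, not numbers from the paper.

```python
def da_to_ppm(mz, tolerance_da):
    """Express an absolute tolerance in Da as parts per million at a given m/z."""
    return tolerance_da / mz * 1e6

def ppm_to_da(mz, tolerance_ppm):
    """Express a ppm tolerance as an absolute window in Da at a given m/z."""
    return mz * tolerance_ppm / 1e6

search_window_ppm = da_to_ppm(500.0, 0.25)   # about 500 ppm at m/z 500
tight_window_da = ppm_to_da(500.0, 5.0)      # about 0.0025 Da for a 5 ppm window
```

At m/z 500 the 0.25 Da window is roughly a hundred times wider than a 5 ppm window, which is why a search with that tolerance cannot exploit high-accuracy fragment masses.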

Speaker: Sangtae
Slides: Click Me
Scribe: Hosein