Label-free LC-MS quantitation workflow

Analyzing label-free LC-MS data requires a series of algorithms, presented below.

Figure 1 : overview of the different stages of label-free LC-MS data processing

1. Generation of the LC-MS maps

LC-MS maps can be imported from files generated by other LC-MS peak picking tools, or created directly by Proline with its own feature extraction algorithms.

2. Feature clustering

Maps generated with peak picking algorithms are not 100% reliable and often contain redundant signals corresponding to the same compound. Furthermore, modified peptides sharing the same sequence can carry the same PTM at different positions, giving MS signals with the same m/z ratio but slightly different retention times. Comparing LC-MS maps containing such cases is a problem, as it may lead to an inversion of feature matches between maps. Creating feature clusters is a way to avoid this issue. This operation is called “clustering” (cf. figure 2).

Figure 2 : grouping features into clusters. All features with the same charge state and close m/z ratios and retention times are grouped into a single cluster. The other features are kept unclustered.

The processing consists of grouping, within a given LC-MS map, the features that have the same charge state and are close in retention time and m/z ratio (default tolerances are respectively 15 seconds and 10 ppm). Some metrics, comparable to those used for features, are calculated for each cluster (a minimal sketch follows the list below):

  • Cluster m/z is the median of the m/z of all features in the cluster
  • Cluster RT is (2 calculation options):
    • Median: median of all the retention times of the features in the cluster
    • Most intense: retention time of the most intense feature
  • Cluster intensity is:
    • Sum: sum of the intensities of all the features in the cluster
    • Most intense: intensity of the most intense feature
  • Cluster charge state is the charge state of every feature in the cluster
  • Number of MS1 in cluster is the sum of the MS1 signals of all features in the cluster
  • Number of MS2 in cluster is the sum of the MS2 signals of all features in the cluster
  • Cluster first scan is the earliest first scan among the features in the cluster
  • Cluster last scan is the latest last scan among the features in the cluster
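
A minimal Python sketch of this grouping and of the metric computation is given below. The Feature record, the greedy seeding by decreasing intensity and the function names are illustrative assumptions of this sketch, not Proline's actual code.

  from dataclasses import dataclass
  from statistics import median

  @dataclass
  class Feature:
      mz: float          # m/z ratio
      rt: float          # retention time, in seconds
      charge: int
      intensity: float
      ms1_count: int     # number of MS1 signals
      ms2_count: int     # number of MS2 events
      first_scan: int
      last_scan: int

  def cluster_features(features, rt_tol=15.0, mz_tol_ppm=10.0):
      """Greedily group features with the same charge state and close m/z and RT."""
      clusters = []
      for f in sorted(features, key=lambda f: -f.intensity):
          for cluster in clusters:
              seed = cluster[0]  # most intense feature of the cluster
              if (f.charge == seed.charge
                      and abs(f.rt - seed.rt) <= rt_tol
                      and abs(f.mz - seed.mz) / seed.mz * 1e6 <= mz_tol_ppm):
                  cluster.append(f)
                  break
          else:
              clusters.append([f])  # no compatible cluster: start a new one
      return clusters

  def cluster_metrics(cluster, rt_mode="median", intensity_mode="sum"):
      """Compute the cluster metrics listed above, with both calculation options."""
      top = max(cluster, key=lambda f: f.intensity)
      return {
          "mz": median(f.mz for f in cluster),
          "rt": median(f.rt for f in cluster) if rt_mode == "median" else top.rt,
          "intensity": (sum(f.intensity for f in cluster)
                        if intensity_mode == "sum" else top.intensity),
          "charge": cluster[0].charge,
          "ms1_count": sum(f.ms1_count for f in cluster),
          "ms2_count": sum(f.ms2_count for f in cluster),
          "first_scan": min(f.first_scan for f in cluster),
          "last_scan": max(f.last_scan for f in cluster),
      }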

The resulting maps are “cleaner” at the end of the algorithm, which reduces ambiguities for map alignment and comparison. Quantitative data extracted from these maps will be processed in the following steps. The ambiguities found by the clustering step must be eliminated; to do so, it is possible to rely on the information given by the search engine for each identified peptide. If some ambiguities remain, the end user must be made aware of them and be able to either handle them manually or exclude them from the analysis.

NB: do not mix up clustering and deconvolution, which consists of grouping all the charge states detected for a single molecule.

3. LC-MS map alignment

Feature matching

Because chromatographic separation is not completely reproducible, LC-MS maps must be aligned before being compared. The first step of the alignment algorithm is to randomly pick a reference map and then compare every other map to it. For each comparison, the algorithm determines all possible matches between detected features within given time and mass windows (the default values are respectively 600 seconds and 10 ppm). Only landmarks involving unambiguous links between the maps (only one feature on each map) are kept (cf. figure 3).

Figure 3 : Matching features with the reference map within mass (10 ppm) and time (600 s) tolerances
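
The landmark selection can be sketched as follows; the one-to-one filtering shown here is a plain reading of the rule above, and the helper names are illustrative.

  from collections import Counter

  def find_landmarks(map_features, ref_features, rt_tol=600.0, mz_tol_ppm=10.0):
      """Keep only unambiguous feature matches (one feature on each map)."""
      def within_windows(f, r):
          return (abs(f.rt - r.rt) <= rt_tol
                  and abs(f.mz - r.mz) / r.mz * 1e6 <= mz_tol_ppm)

      candidates = []
      for f in map_features:
          hits = [r for r in ref_features if within_windows(f, r)]
          if len(hits) == 1:  # exactly one reference feature in the windows
              candidates.append((f, hits[0]))

      # also discard reference features claimed by several map features
      ref_usage = Counter(id(r) for _, r in candidates)
      return [(f, r) for f, r in candidates if ref_usage[id(r)] == 1]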

The result of this alignment algorithm can be represented with a scatter plot (cf. figure 5).

Selection of the reference map

The algorithm completes this alignment process several times with randomly chosen reference maps. It then sums the absolute values of the distances between each map and an average map (cf. figure 4). The map with the lowest sum is the closest to the other maps and is considered the final reference map from this point on.

Figure 4 : Selection of the reference map. The chart on the left shows the time distances between each map and the average map obtained from multiple alignments. The chart on the right summarizes the integral of each curve of the left chart. The map closest to the average map is selected as the reference map.

Two algorithms have been implemented to make this selection.

Exhaustive algorithm

This algorithm considers every possible pair of maps (a minimal sketch follows the list below):

  1. For each map, compute the distance in time to all the other maps (sum of the distances in seconds)
  2. The reference map is the one with the lowest distance
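
A minimal sketch of this exhaustive selection, assuming a map_distance helper that returns the summed time distance (in seconds) between two aligned maps; the helper is an assumption of the sketch, not Proline's API.

  def select_reference_exhaustive(maps, map_distance):
      """Return the map with the lowest total time distance to all other maps."""
      def total_distance(m):
          return sum(map_distance(m, other) for other in maps if other is not m)
      return min(maps, key=total_distance)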

Iterative algorithm

  1. Randomly select a reference map
  2. Align this map with all the other maps
  3. Compute the distance in time to all the other maps
  4. The new reference map is the one with the lowest distance
  5. Steps 2 to 4 are repeated until either:
    1. the reference map remains the same for two consecutive iterations
    2. the maximum number of iterations is reached (default value is 3)
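
The sketch below mirrors this loop. It assumes an align(ref, m) helper returning the time-shift curve of map m against the current reference, sampled at common time points (the reference itself having an all-zero curve); these conventions are assumptions of this sketch, not Proline's API.

  import random

  def select_reference_iterative(maps, align, n_points=100, max_iter=3):
      ref = random.randrange(len(maps))                           # step 1: random map index
      for _ in range(max_iter):                                   # stop at max_iter (5.2)
          # step 2: align every map to the current reference (zero curve for the ref)
          curves = [[0.0] * n_points if i == ref else align(maps[ref], maps[i])
                    for i in range(len(maps))]
          # step 3: distance of each map to the average curve (cf. figure 4)
          avg = [sum(col) / len(maps) for col in zip(*curves)]
          dist = [sum(abs(v - a) for v, a in zip(c, avg)) for c in curves]
          new_ref = min(range(len(maps)), key=dist.__getitem__)   # step 4
          if new_ref == ref:                                      # unchanged (5.1)
              break
          ref = new_ref
      return maps[ref]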

Alignment smoothing

The last step is to find the path going through the regions with the highest density of points in the scatter plot. This step is implemented using a moving median smoothing (cf. figure 5).

Figure 5 : Alignment smoothing of two maps using a moving median calculation. The scatter plot represents the time variation (in seconds) of multiple landmarks (between the compared map and the reference map) against the observed time (in seconds) in the reference map. A user-defined window is moved along the plot, computing at each step a median time difference (left plot). The smoothed alignment curve consists of all the median values (right plot).
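
A minimal sketch of the moving-median smoothing, with the window expressed as a number of neighbouring landmarks rather than a time range (a simplification of the user-defined window):

  from statistics import median

  def smooth_alignment(landmarks, window=50):
      """landmarks: (reference_time, time_delta) pairs; returns the smoothed curve."""
      landmarks = sorted(landmarks)                 # order by reference time
      half = window // 2
      curve = []
      for i, (t, _) in enumerate(landmarks):
          lo, hi = max(0, i - half), min(len(landmarks), i + half + 1)
          curve.append((t, median(d for _, d in landmarks[lo:hi])))
      return curve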

4. Creation of the master map

Once the maps have been corrected and aligned, the final step consists of creating a consensus map, or master map. It is produced by searching for the best match of each feature detected across the different maps. The master map can be seen as a representation of all the features detected on the maps, without redundancy (cf. figure 6).

Figure 6 : Creation of the master map by matching the features detected on two LC-MS maps. The elution times used here are the ones corrected by the alignment step. The intensity of a feature can vary from one map to another, and a feature may also appear in only one map.

During the creation of the master map, the algorithm first considers matches for the most intense features (above a given threshold), and then considers the other features only if they match a high-intensity feature in another map. This is done in order to avoid including background noise in the master map (cf. figure 7).

Figure 7 : Distribution of the intensities of the maps used to build the master map. The construction is done in 3 steps: 1) removing features with a normalized intensity lower than a given threshold; 2) matching the most intense features; 3) comparing the features left without a match in at least one map against the low-intensity features set aside in the first step.
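
The three steps of figure 7 can be sketched as follows; the best_match helper, the intensity threshold handling and the group representation (map index to feature) are illustrative assumptions of this sketch.

  def build_master_map(maps, best_match, intensity_threshold):
      """maps: one feature list per LC-MS map, with alignment-corrected RTs.
      best_match(f, pool) is assumed to return f's best match in pool, or None."""
      # step 1: set aside features below the normalized intensity threshold
      intense = [[f for f in m if f.intensity >= intensity_threshold] for m in maps]
      low = [[f for f in m if f.intensity < intensity_threshold] for m in maps]

      master, used = [], set()
      # step 2: match the most intense features across maps
      for i, pool in enumerate(intense):
          for f in sorted(pool, key=lambda f: -f.intensity):
              if id(f) in used:
                  continue
              group = {i: f}  # map index -> matched feature
              used.add(id(f))
              for j, other in enumerate(intense):
                  if j != i:
                      hit = best_match(f, [g for g in other if id(g) not in used])
                      if hit is not None:
                          group[j] = hit
                          used.add(id(hit))
              master.append(group)

      # step 3: retry the maps still missing against the low-intensity features
      for group in master:
          seed = max(group.values(), key=lambda f: f.intensity)
          for j in range(len(maps)):
              if j not in group:
                  hit = best_match(seed, low[j])
                  if hit is not None:
                      group[j] = hit
      return master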

5. Solving conflicts

As seen above, ambiguous features with close m/z ratios and retention times can be grouped into clusters. Other conflicts are also generated during the creation of the master map because of wrong matches. Adding the peptide sequence is the key to solving these conflicts, since it identifies a feature without ambiguity.

Proline has access to the list of all identified and validated PSMs, as well as the identifier (id) of each MS/MS spectrum related to an identification, which means that the link between the scan id and the peptide id is known. On the other hand, the list of MS/MS events occurring within the elution window of each feature is known, and for each of these events the corresponding peptide sequences can be retrieved. If only one peptide sequence is found for a master feature, it is kept as is. Otherwise the master feature is cloned so as to obtain one feature per peptide sequence. During this duplication step, the daughter features are distributed over the new master features according to the identified peptide sequences.
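
A minimal sketch of this duplication step. The attributes used here (ms2_scan_ids, children, peptide) and the scan_to_peptide mapping are illustrative stand-ins for the links Proline holds between scan ids and peptide ids.

  import copy

  def resolve_conflicts(master_feature, scan_to_peptide):
      """Split a master feature whose MS/MS events identify several peptides."""
      peptides = {scan_to_peptide[s] for s in master_feature.ms2_scan_ids
                  if s in scan_to_peptide}
      if len(peptides) <= 1:
          return [master_feature]  # zero or one peptide sequence: keep as is
      clones = []
      for pep in sorted(peptides):
          clone = copy.deepcopy(master_feature)
          clone.peptide = pep
          # distribute the daughter features according to the identified sequences
          clone.children = [c for c in clone.children
                            if pep in {scan_to_peptide.get(s) for s in c.ms2_scan_ids}]
          clones.append(clone)
      return clones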

6. Cross assignment

When the master map is created, some intensity values may still be missing. To reduce the number of missing values, Proline reads the mzDB files and uses the expected coordinates (m/z, RT) of each missing feature to extract new features. These new extractions are added to copies of the daughter maps and of the master map, which gives a new master map with a limited number of missing values.
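
A minimal sketch of the cross assignment loop; extract_feature stands in for the targeted extraction performed in the mzDB file at the expected coordinates, and is an assumption of this sketch, not Proline's API.

  def cross_assign(master_groups, maps, extract_feature):
      """Fill the missing entries of each master feature by targeted extraction."""
      for group in master_groups:              # group: map index -> feature
          ref = max(group.values(), key=lambda f: f.intensity)
          for j in range(len(maps)):
              if j not in group:
                  # extract at the expected (m/z, RT) coordinates in map j
                  hit = extract_feature(maps[j], mz=ref.mz, rt=ref.rt)
                  if hit is not None:
                      group[j] = hit
      return master_groups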

7. Normalizing LC-MS maps

The comparison of LC-MS maps faces another problem: the variability of the MS signals measured by the instrument. This variability can be technical or biological. Technical variations between the MS signals of two analyses can depend on the injected quantity of material, the reproducibility of the instrument configuration, and the software used for signal processing. The systematic biases observed between the intensity measurements of two successive and similar analyses are mainly due to errors in the total amount of material injected in each case, or to instabilities of the nanoLC-MS system that can cause variable performance during a series of analyses, and thus a different MS signal response for peptides having the same abundance. Data may not be usable if the difference is too important, so it is always recommended to perform a quality control of the acquisition before considering any computational analysis.

However, there are always biases in any analytical measurement, and they can usually be corrected by normalizing the signals. Numerous normalization methods have been developed, each of them using a different mathematical approach (Christin, Bischoff et al. 2011). Methods are usually split into two categories, linear and non-linear calculation methods, and it has been demonstrated that linear methods can fix most of the biases (Callister, Barry et al. 2006). Three different linear methods have been implemented in Proline, calculating normalization factors as the ratio of the sums of the intensities, as the ratio of the medians of the intensities, or as the median of the intensity ratios.

Sum of the intensities

How this factor is calculated:

  1. For each map, sum the intensities of the features
  2. The reference map is the median map
  3. The normalization factor of a map = sum of the intensities of the reference map / sum of the intensities of the map

Median of the intensities

How this factor is calculated:

  1. For each map, calculate the median of the intensities in the map
  2. The reference map is the median map
  3. The normalization factor of a map = median of the intensities of the reference map / median of the intensities of the map

Median of ratios

This last strategy was published in 2006 (Dieterle, Ross et al. 2006) and gives the best results. It consists of calculating the intensity ratios between the two maps to be compared, then setting the normalization factor to the inverse of the median of these ratios (cf. figure 8). The procedure is the following:

  1. For each map in a “map set”, sum the intensities of the features
  2. The reference map is the median map
  3. For each feature of the master map, ratio = intensity of the feature in the reference map / intensity of the feature in this map
  4. Normalization factor = median of these ratios

Figure 8 : Distribution of the log2-transformed ratios calculated from the intensities of features observed in two LC-MS maps. The red line representing the median is slightly off-center. The value of the normalization factor is equal to the inverse of this median value. The normalization process recenters the ratio distribution on 0, as represented by the black arrow.

Proline performs this normalization for each match with the reference map and obtains a normalization factor for each map, independently of the chosen algorithm. The normalization factor of the reference map is equal to 1.
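
The three factor computations can be sketched as follows. Pairing the intensities per master feature and the function names are illustrative assumptions; a map's intensities are normalized by multiplying them by its factor (the reference map's factor being 1).

  from statistics import median

  def factor_sum(ref_intensities, map_intensities):
      """Ratio of the summed intensities (reference over map)."""
      return sum(ref_intensities) / sum(map_intensities)

  def factor_median(ref_intensities, map_intensities):
      """Ratio of the median intensities (reference over map)."""
      return median(ref_intensities) / median(map_intensities)

  def factor_median_of_ratios(ref_intensities, map_intensities):
      """Median of the per-feature intensity ratios, paired via the master map."""
      ratios = [r / m for r, m in zip(ref_intensities, map_intensities) if m > 0]
      return median(ratios)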

8. Building a "QuantResultSummary"

Once the master map is normalized, it is stored in the Proline LCMS database and used to create a “QuantResultSummary”. This object links the quantitative data to the identification data validated in Proline. The “QuantResultSummary” is then stored in the Proline MSI database (cf. figure 9).

Figure 9 : From raw files to the “QuantResultSummary” object.