Analyzing label-free LC-MS data requires the series of algorithms presented below.
LC-MS maps can be imported from files generated by other peak-picking LC-MS tools, or created directly within Proline using its own feature extraction algorithms.
Maps generated with peak picking algorithms are not 100% reliable and often contain redundant signals corresponding to the same compound. Furthermore, modified peptides sharing the same sequence can exist as different PTM positional isomers, yielding distinct MS signals with the same m/z ratio but slightly different retention times. Such cases make the comparison of LC-MS maps problematic, as they may lead to an inversion of feature matches between maps. Creating feature clusters is a way to avoid this issue. This operation is called “clustering” (cf. figure 2).
The processing consists of grouping, within a given LC-MS map, the features that share the same charge state and are close in retention time and m/z ratio (default tolerances are 15 seconds and 10 ppm, respectively). Metrics equivalent to those used for individual features are calculated for each cluster.
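As an illustration, here is a minimal Python sketch of such a grouping pass under the default tolerances; the `Feature` fields and the greedy, intensity-seeded strategy are assumptions made for the example, not Proline's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    mz: float         # m/z ratio
    rt: float         # retention time, in seconds
    charge: int       # charge state
    intensity: float  # apex or summed intensity

def cluster_features(features, rt_tol=15.0, mz_tol_ppm=10.0):
    """Greedily group features sharing a charge state within the retention
    time and m/z tolerances (defaults: 15 s and 10 ppm)."""
    clusters = []
    # Seed clusters with the most intense features first
    for f in sorted(features, key=lambda x: -x.intensity):
        for cluster in clusters:
            seed = cluster[0]
            mz_tol = seed.mz * mz_tol_ppm / 1e6  # convert ppm to an m/z window
            if (f.charge == seed.charge
                    and abs(f.rt - seed.rt) <= rt_tol
                    and abs(f.mz - seed.mz) <= mz_tol):
                cluster.append(f)
                break
        else:
            clusters.append([f])  # no existing cluster fits: start a new one
    return clusters
```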
The resulting maps are “cleaner” at the end of the algorithm, which reduces ambiguities during map alignment and comparison. Quantitative data extracted from these maps will be processed in the following steps, so the ambiguities found by the clustering step must be eliminated. To do so, it is possible to rely on the information given by the search engine for each identified peptide. If some ambiguities remain, the end user must be made aware of them and be able to either handle them manually or exclude them from the analysis.
NB: do not confuse clustering with deconvolution, which consists of grouping all the charge states detected for a single molecule.
Because chromatographic separation is not perfectly reproducible, LC-MS maps must be aligned before being compared. The first step of the alignment algorithm is to randomly pick a reference map and then compare every other map to it. For each comparison, the algorithm determines all possible matches between detected features within given time and mass windows (default values are 600 seconds and 10 ppm, respectively). Only landmarks involving unambiguous links between the maps (exactly one feature on each map) are kept (cf. figure 3).
Figure 3: Matching features with the reference map within a mass (10 ppm) and time (600 s) tolerance.
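A minimal sketch of this landmark search, reusing the `Feature` objects from the clustering example above; the one-to-one filtering shown here is a plausible reading of the “unambiguous links” rule, not Proline's exact code.

```python
def find_landmarks(ref_map, other_map, rt_tol=600.0, mz_tol_ppm=10.0):
    """Match features of `other_map` against `ref_map` within the time
    (600 s) and mass (10 ppm) windows, keeping only unambiguous links:
    exactly one candidate on each side."""
    by_other = {}
    for rf in ref_map:
        candidates = [
            of for of in other_map
            if abs(of.rt - rf.rt) <= rt_tol
            and abs(of.mz - rf.mz) <= rf.mz * mz_tol_ppm / 1e6
        ]
        if len(candidates) == 1:  # rf matches a single feature in other_map
            by_other.setdefault(id(candidates[0]), []).append((rf, candidates[0]))
    # Keep pairs whose other-map feature is matched by a single ref feature
    return [pairs[0] for pairs in by_other.values() if len(pairs) == 1]
```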
The result of this alignment algorithm can be represented with a scatter plot (cf. figure 5).
The algorithm repeats this alignment process several times with randomly chosen reference maps. It then sums, for each map, the absolute distances between that map and an average map (cf. figure 4). The map with the lowest sum is the closest to all the others and is considered the final reference map from this point on.
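A simplified sketch of this selection, where the comparison to the average map is folded into an assumed pairwise `distance` function; this is an illustration of the principle, not Proline's implementation.

```python
def choose_reference_map(maps, distance):
    """Keep as final reference the map whose summed absolute distance to the
    other maps is lowest. `distance(a, b)` is an assumed scalar summary of
    the alignment distance between two maps, e.g. the median absolute
    retention-time shift over their landmarks."""
    def score(candidate):
        return sum(abs(distance(candidate, m)) for m in maps if m is not candidate)
    return min(maps, key=score)
```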
Two algorithms have been implemented to make this selection.
This algorithm considers every possible pair of maps.
The last step is to find the path going through the regions with the highest density of points in the scatter plot. This step is implemented using moving median smoothing (cf. figure 5).
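The sketch below illustrates moving median smoothing applied to the (time, time-delta) landmark points of the scatter plot; the window size is an arbitrary assumption for the example.

```python
def moving_median(points, window=50):
    """Smooth (time, delta) landmark points with a moving median: for each
    point, keep the median delta of the surrounding window. The window size
    (number of points) is an assumption made for the example."""
    pts = sorted(points)               # sort landmarks by time
    half = window // 2
    smoothed = []
    for i, (t, _) in enumerate(pts):
        lo, hi = max(0, i - half), min(len(pts), i + half + 1)
        deltas = sorted(d for _, d in pts[lo:hi])
        m = len(deltas) // 2
        med = deltas[m] if len(deltas) % 2 else (deltas[m - 1] + deltas[m]) / 2
        smoothed.append((t, med))      # point on the smoothed alignment curve
    return smoothed
```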
Once the maps have been corrected and aligned, the final step consists of creating a consensus map, or master map. It is produced by searching for the best match for each feature detected across the different maps. The master map can be seen as a non-redundant representation of all the features detected across the maps (cf. figure 6).
During the creation of the master map, the algorithm first considers matches for the most intense features (above a given threshold), and then considers the remaining features only if they match a high-intensity feature in another map. This avoids including background noise in the master map (cf. figure 7).
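A sketch of this two-pass strategy; the `match` predicate (testing m/z and RT agreement after alignment) and the threshold handling are assumptions made for the example.

```python
def build_master_map(maps, intensity_threshold, match):
    """Two-pass construction: intense features seed master features; weaker
    features may only attach to existing ones, never create new ones.
    `match(feature, seed)` is an assumed predicate testing m/z / RT
    agreement after alignment."""
    master = []
    # Pass 1: only features above the intensity threshold may create entries
    for lcms_map in maps:
        for f in lcms_map:
            if f.intensity < intensity_threshold:
                continue
            for group in master:
                if match(f, group[0]):
                    group.append(f)
                    break
            else:
                master.append([f])
    # Pass 2: low-intensity features are kept only if they match an existing
    # master feature, which keeps background noise out of the master map
    for lcms_map in maps:
        for f in lcms_map:
            if f.intensity >= intensity_threshold:
                continue
            for group in master:
                if match(f, group[0]):
                    group.append(f)
                    break
    return master
```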
As seen above, ambiguous features with close m/z and retention times can be grouped into clusters. Other conflicts are also generated during the creation of the master map due to wrong matches. Adding the peptide sequence is the key to resolving these conflicts, since it identifies a feature unambiguously. Proline has access to the list of all identified and validated PSMs as well as the identifier (id) of each MS/MS spectrum related to an identification, so the link between the scan id and the peptide id is known. On the other hand, the list of MS/MS events occurring within the elution window of each feature is also known. For each of these events the corresponding peptide sequences can be retrieved. If only one peptide sequence is found for a master feature, it is kept as is. Otherwise the master feature is cloned so as to have one feature per peptide sequence. During this duplication step the daughter features are distributed among the new master features according to the identified peptide sequences.
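The following sketch illustrates this disambiguation step; the `ms2_scan_ids` attribute and the `peptide_of_scan` mapping are hypothetical names standing in for the scan-id/peptide-id link described above.

```python
def split_by_peptide(master_feature, child_features, peptide_of_scan):
    """Resolve a master-feature conflict using identifications: if the MS/MS
    events of the children all point to one peptide, keep the master feature
    as is; otherwise clone it once per peptide sequence and distribute the
    children accordingly. `peptide_of_scan` maps an MS/MS scan id to a
    peptide id; children are assumed to expose their MS/MS scan ids as
    `ms2_scan_ids` (hypothetical attribute name)."""
    annotated = []
    for child in child_features:
        peptides = {peptide_of_scan[s] for s in child.ms2_scan_ids
                    if s in peptide_of_scan}
        annotated.append((child, peptides))
    all_peptides = set().union(*(p for _, p in annotated))
    if len(all_peptides) <= 1:
        return [(master_feature, child_features)]  # unambiguous, no split
    clones = []
    for pep in sorted(all_peptides):
        children = [c for c, p in annotated if pep in p]
        clones.append(((master_feature, pep), children))  # one clone per peptide
    return clones
```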
Some intensity values may be missing when the master map is created. To reduce the number of missing values, Proline reads the mzDB files and uses the expected coordinates (m/z, RT) of each missing feature to extract new features. These new extractions are added to copies of the daughter maps and of the master map, yielding a new master map with a limited number of missing values.
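A sketch of this gap-filling pass; `extract_feature` is a hypothetical stand-in for Proline's mzDB-based extraction, and the `run_id` attribute on child features is assumed for the example.

```python
def fill_missing_values(master_map, run_ids, extract_feature):
    """For each master feature, find the runs with no child feature and try
    to extract a signal at the expected (m/z, RT) coordinates from that
    run's mzDB file. `extract_feature(run_id, mz, rt)` is a hypothetical
    stand-in for Proline's mzDB extraction, returning a feature or None;
    children are assumed to carry a `run_id` attribute."""
    for group in master_map:
        seed = group[0]  # expected coordinates taken from the first child
        covered = {child.run_id for child in group}
        for run_id in run_ids:
            if run_id in covered:
                continue
            rescued = extract_feature(run_id, seed.mz, seed.rt)
            if rescued is not None:
                group.append(rescued)  # Proline adds these to map copies
    return master_map
```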
The comparison of LC-MS maps faces another problem: the variability of the MS signals measured by the instrument. This variability can be technical or biological. Technical variations between MS signals in two analyses can depend on the injected quantity of material, the reproducibility of the instrument configuration, and the software used for signal processing. The systematic biases observed on intensity measurements between two successive and similar analyses are mainly due to errors in the total amount of material injected in each case, or to nanoLC-MS system instabilities that can cause variable performance during a series of analyses and thus a different MS signal response for peptides of the same abundance. Data may have to be discarded if the difference is too large, so it is always recommended to perform a quality control of the acquisition before considering any computational analysis. However, biases are inherent to any analytical measurement and can usually be corrected by normalizing the signals. Numerous normalization methods have been developed, each using a different mathematical approach (Christin, Bischoff et al. 2011). Methods are usually split into two categories, linear and non-linear, and it has been demonstrated that linear methods can correct most of the biases (Callister, Barry et al. 2006). Three different linear methods have been implemented in Proline, computing the normalization factor as the ratio of the sums of intensities, as the ratio of the medians of intensities, or as the median of the intensity ratios.
Sum-based factor: the normalization factor of a map is computed as the ratio between the sum of the feature intensities in the reference map and the sum of the feature intensities in the map to normalize.
Median-based factor: the factor is computed in the same way, using the median of the feature intensities instead of their sum.
This last strategy, published in 2006 (Dieterle, Ross et al. 2006), gives the best results. It consists of calculating the intensity ratios between the two maps to be compared, then setting the normalization factor to the inverse of the median of these ratios (cf. figure 8).
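A minimal sketch of this median-of-ratios factor; it assumes the two intensity lists hold the intensities of matched features, aligned pairwise.

```python
from statistics import median

def normalization_factor(ref_intensities, map_intensities):
    """Median-of-ratios factor: compute the per-feature intensity ratios
    between the map and the reference, then take the inverse of their
    median. Both lists are assumed to hold the intensities of matched
    features, in the same order."""
    ratios = [m / r for r, m in zip(ref_intensities, map_intensities) if r > 0]
    return 1.0 / median(ratios)

# Multiplying every intensity of the map by this factor brings the median
# matched-feature ratio to the reference back to 1.
```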
Proline applies this normalization process to each match with the reference map and obtains one normalization factor per map, whatever algorithm is chosen. The normalization factor of the reference map is equal to 1.
Once the master map is normalized, it is stored in the Proline LCMS database and used to create a “QuantResultSummary”. This object links the quantitative data to the identification data validated in Proline. This “QuantResultSummary” is then stored in the Proline MSI database (cf. figure below).