
Validation Algorithm

Once a result file has been imported and a search result created, the validation is performed in four main steps:

Peptide Matches filtering and validation
Protein Inference (peptides and proteins grouping)
Protein and Protein Sets scoring
Protein sets filtering and validation

Finally, the Identification Result produced by these steps is stored in the identification database. A Search Result can be validated several times, and a new Identification Summary of this Search Result is created for each validation.

Peptide Matches Filtering

Peptide Matches identified in a search result can be filtered using one or more predefined filters (described below). Only validated peptide matches are considered in the subsequent steps.

Basic Score Filter

All PSMs whose score is lower than a given threshold are invalidated.

Pretty Rank Filter

This filtering is performed after temporarily joining the target and decoy PSMs corresponding to the same query (this is only really needed for separate forward/reverse database searches). Then, for each query, the target and decoy PSMs are sorted by score. A rank (Mascot pretty rank) is computed for each PSM from its position in this sorted list: PSMs with almost equal scores (difference < 0.1) are assigned the same rank. All PSMs with a rank greater than the specified one are invalidated.
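The ranking rule above can be illustrated with a minimal Python sketch. This is not Proline's implementation: the PSM dictionaries and the function names are hypothetical, and the sketch assumes that the score difference is measured against the previous PSM in the sorted list, which the text does not specify.

```python
from collections import defaultdict

def assign_pretty_ranks(psms, tolerance=0.1):
    """Add a 'rank' key to each PSM dict (keys: 'query', 'score')."""
    # Join target and decoy PSMs of the same query.
    by_query = defaultdict(list)
    for psm in psms:
        by_query[psm["query"]].append(psm)

    for query_psms in by_query.values():
        # Best score first.
        query_psms.sort(key=lambda p: p["score"], reverse=True)
        rank = 0
        previous_score = None
        for psm in query_psms:
            # Assumption: a new rank starts when the score drops by at
            # least `tolerance` compared with the previous PSM.
            if previous_score is None or previous_score - psm["score"] >= tolerance:
                rank += 1
            psm["rank"] = rank
            previous_score = psm["score"]
    return psms

def filter_by_pretty_rank(psms, max_rank=1):
    """Keep only PSMs whose pretty rank is at most max_rank."""
    assign_pretty_ranks(psms)
    return [p for p in psms if p["rank"] <= max_rank]
```

With a maximum rank of 1, a target PSM scoring 50.0 and a decoy PSM scoring 49.95 on the same query are both kept, because they share rank 1.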

Minimum Sequence Length Filter

PSMs corresponding to short peptide sequences (shorter than the specified minimum length) can be invalidated using this parameter.

Mascot eValue Filter

Filters PSMs using the Mascot expectation value (e-value), which reflects the difference between the PSM score and the Mascot identity threshold (p=0.05). PSMs with an e-value greater than the specified one are invalidated.

Mascot adjusted eValue Filter

Proline can compute an adjusted e-value. It first selects the lower of the identity and homology thresholds (p=0.05), then computes the e-value using the selected threshold. PSMs with an adjusted e-value greater than the specified one are invalidated.
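As an illustration, the sketch below computes both values. It assumes the relation given in the Mascot documentation, E = p × 10^((T − S) / 10), where S is the PSM score and T the threshold at probability p. The function names are hypothetical and whether Proline uses exactly this expression is an assumption.

```python
def mascot_evalue(score, threshold, p_value=0.05):
    """Expectation value from a score and a threshold computed at p_value.

    Assumed relation: E = p * 10 ** ((T - S) / 10).
    """
    return p_value * 10 ** ((threshold - score) / 10.0)

def adjusted_evalue(score, identity_threshold, homology_threshold, p_value=0.05):
    """E-value computed from the lower of the two thresholds."""
    selected = min(identity_threshold, homology_threshold)
    return mascot_evalue(score, selected, p_value)
```

A PSM whose score equals the selected threshold gets an e-value equal to p, and every 10 score units above the threshold divide the e-value by 10. Because the adjusted value uses the lower threshold, it is never greater than the e-value computed from the identity threshold alone.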

Mascot p-value on Identity Filter

Given a specific p-value, the Mascot identity threshold is calculated for each query, and all peptide matches associated with the query whose score is lower than the calculated identity threshold are invalidated.
When a Mascot result file is parsed, the number of candidate PSMs for each spectrum is saved, so that the identity threshold can be recalculated for any p-value.

Mascot p-value on Homology Filter

Given a specific p-value, the Mascot homology threshold is inferred for each query, and all peptide matches associated with the query whose score is lower than the inferred homology threshold are invalidated.

Peptide Matches Validation

Specify an expected FDR and tune a chosen filter in order to obtain this FDR. See how FDR is calculated.

Once the pre-filters described above have been applied, a validation algorithm can be run to control the FDR: given a criterion, the system estimates the best threshold value to reach a specific FDR.

Protein Sets Filtering

Specific peptides Filter

Invalidates Protein Sets that do not have at least x peptides identifying only that protein set. The specificity is considered at the DataSet level.

This filter goes through all Protein Sets from the worst score to the best score. For each protein set, if it is invalidated, the properties of its associated peptides are updated before moving on to the next protein set. The peptide property in question is the number of protein sets the peptide identifies.
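A minimal sketch of this pass is shown below. The data layout (dictionaries with 'id', 'score' and a set of peptide identifiers) and the function name are hypothetical. The sketch only illustrates why the order of traversal matters: invalidating a low-scoring protein set can make a shared peptide specific to a better-scoring one.

```python
def filter_by_specific_peptides(protein_sets, min_specific=1):
    """Return the ids of protein sets kept by the specific-peptide filter.

    protein_sets: list of dicts with 'id', 'score', 'peptides' (a set).
    """
    # Peptide property: number of still-valid protein sets it identifies.
    protein_set_count = {}
    for ps in protein_sets:
        for pep in ps["peptides"]:
            protein_set_count[pep] = protein_set_count.get(pep, 0) + 1

    kept = []
    # Go from the worst score to the best score.
    for ps in sorted(protein_sets, key=lambda p: p["score"]):
        n_specific = sum(1 for pep in ps["peptides"] if protein_set_count[pep] == 1)
        if n_specific < min_specific:
            # Invalidated: update its peptides before the next protein set.
            for pep in ps["peptides"]:
                protein_set_count[pep] -= 1
        else:
            kept.append(ps["id"])
    return kept
```

Take protein sets A (score 100, peptides p1, p2, p3), C (score 50, peptides p3, p4) and B (score 20, peptide p3). With a minimum of two specific peptides, B and then C are invalidated, after which p3 counts as specific to A.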

Protein Sets Validation

Once the pre-filters (see above) have been applied, a validation algorithm can be run to control the FDR. See how FDR is calculated.

At the moment, it is only possible to control the FDR by changing the Protein Set Score threshold. Three different protein set scoring functions are available.

Given an expected FDR, the system tries to estimate the best score threshold to reach this FDR. Two validation rules (R1 and R2), corresponding to two different groups of protein sets (see the detailed procedure below), are optimized by the algorithm. Each rule defines the optimum score threshold, that is, the one giving the FDR closest to the expected one for the corresponding group of protein sets.

Here is the procedure used for FDR optimization:

Protein sets are segregated into two groups: those identified by a single validated peptide (G1) and those identified by multiple validated peptides (G2), with potentially multiple identified PSMs per peptide.

For each validation rule, the FDR computation is performed by merging the target and decoy protein sets and sorting them by descending score. The score threshold is then varied by successively taking the score of each protein set in this sorted list. For each new threshold, a new FDR is computed by counting the target and decoy protein sets whose score is greater than or equal to this value. The procedure stops when there are no more protein sets in the list or when a maximum FDR of 50% is reached. Note that the two validation rules are optimized separately:
- The G2 FDR is optimized first, leading to the R2 score threshold. The validation status of the G2 protein sets is then fixed.
- The final FDR (G1+G2) is then optimized, leading to the R1 score threshold. Only the G1 protein sets are used for the score threshold variation procedure. However, the FDR is computed by taking into account the validated G2 target/decoy protein sets.

Separating protein sets into two groups increases the power of discrimination between target and decoy hits. Indeed, the score threshold of the G1 group is often much higher than that of the G2 group. Using a single average threshold would reduce the number of validated G2 proteins, leading to a decrease in sensitivity for the same FDR value. In the future, we will try to implement such a strategy so that users can make their own comparison.
