
Protein and Protein Set scoring

Several algorithms can be used to calculate Protein and Protein Set scores. Protein scores are computed during the import phase, while Protein Set scores are computed during the validation phase.

Protein

Each individual protein match is scored according to all peptide matches associated with this protein, independently of any validation of these peptide matches. Currently, the scoring used depends on the imported result file:

  • Mascot result file: the Mascot standard scoring is used (sum of peptide match scores)
  • OMSSA result file: FIXME
  • X! Tandem result file: the X! Tandem standard hyperscore is used

Protein Set

Each individual protein set is scored according to the validated peptide matches belonging to this protein set (see inference).

Scoring schemes

Mascot Standard Scoring

The score associated with each identified protein (or protein set) is the sum of the scores of all peptide matches identifying this protein (or protein set). In the case of duplicate peptide matches (the same peptide matched by multiple queries), only the match with the best score is considered.
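The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PeptideMatch` record and the `standard_score` function are hypothetical names, not part of Proline or Mascot, and duplicates are assumed to be detected by comparing a peptide key (for instance sequence plus modifications).

```python
from typing import Dict, Iterable, NamedTuple


class PeptideMatch(NamedTuple):
    """Hypothetical record for one peptide match (not a Proline type)."""
    peptide: str   # key identifying the peptide (assumed: sequence plus modifications)
    query_id: int  # MS/MS query that produced this match
    score: float   # peptide match score


def standard_score(matches: Iterable[PeptideMatch]) -> float:
    """Sum the best score of each distinct peptide."""
    best_by_peptide: Dict[str, float] = {}
    for m in matches:
        current = best_by_peptide.get(m.peptide)
        # Keep only the best-scoring match among duplicates of a peptide.
        if current is None or m.score > current:
            best_by_peptide[m.peptide] = m.score
    return sum(best_by_peptide.values())


matches = [
    PeptideMatch("ELVISK", 1, 40.0),
    PeptideMatch("ELVISK", 2, 55.0),  # duplicate: same peptide, another query
    PeptideMatch("LIVESR", 3, 30.0),
]
print(standard_score(matches))  # 85.0 (55.0 + 30.0)
```

The lower-scoring duplicate (40.0) is ignored, so only one match per peptide contributes to the sum.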

Mascot MudPIT Scoring

This scoring scheme is also based on the sum of the scores of all non-duplicate peptide matches. However, each peptide match contributes not its absolute score but its score offset, i.e. the amount by which its score exceeds the threshold. Therefore, peptide matches with a score below the threshold do not contribute to the protein score. Finally, the average of the thresholds used is added to the score. For each peptide match, the “threshold” is the homology threshold if it exists, otherwise the identity threshold. The algorithm below illustrates the MudPIT score computation procedure:

Protein score = 0
For each peptide match {
  If there is a homology threshold and ions score > homology threshold {
    Protein score += ions score - homology threshold
  } else if ions score > identity threshold {
    Protein score += ions score - identity threshold
  }
}
Protein score += 1 * average of all the subtracted thresholds
  • If there are no significant peptide matches, the protein score is 0.
  • Homology and identity threshold values depend on a given p-value. By default, Mascot and Proline compute these thresholds with a p-value of 5%.
  • In the case of separated target-decoy searches, two values are obtained for each threshold: one for the target search and one for the decoy search. To obtain a single value, the following procedure is applied:
    • the homology threshold is the decoy value if it exists, otherwise the target value
    • the identity threshold is the mean of the target and decoy values
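The MudPIT procedure and the target-decoy threshold rules can be sketched in Python as follows. This is a minimal illustration, not Proline's or Mascot's actual code: `ScoredMatch`, `merge_homology`, `merge_identity` and `mudpit_score` are hypothetical names, a missing homology threshold is represented by `None`, and "average of all the subtracted thresholds" is assumed to mean the mean of the thresholds of the matches that actually contributed.

```python
from statistics import mean
from typing import Iterable, NamedTuple, Optional


class ScoredMatch(NamedTuple):
    """Hypothetical record: one non-duplicate peptide match with its thresholds."""
    ions_score: float
    identity_threshold: float
    homology_threshold: Optional[float] = None  # None when no homology threshold exists


def merge_homology(target: Optional[float], decoy: Optional[float]) -> Optional[float]:
    """Separated target-decoy searches: decoy value if it exists, else target value."""
    return decoy if decoy is not None else target


def merge_identity(target: float, decoy: float) -> float:
    """Separated target-decoy searches: mean of the target and decoy values."""
    return (target + decoy) / 2.0


def mudpit_score(matches: Iterable[ScoredMatch]) -> float:
    score = 0.0
    subtracted = []  # thresholds that were actually subtracted
    for m in matches:
        if m.homology_threshold is not None and m.ions_score > m.homology_threshold:
            threshold = m.homology_threshold
        elif m.ions_score > m.identity_threshold:
            threshold = m.identity_threshold
        else:
            continue  # not significant: contributes nothing
        score += m.ions_score - threshold
        subtracted.append(threshold)
    if subtracted:  # with no significant match the score stays 0
        score += mean(subtracted)
    return score


matches = [
    ScoredMatch(50.0, identity_threshold=40.0, homology_threshold=30.0),  # offset 20
    ScoredMatch(45.0, identity_threshold=41.0),                           # offset 4
    ScoredMatch(20.0, identity_threshold=40.0, homology_threshold=25.0),  # ignored
]
print(mudpit_score(matches))  # 59.5 = 20 + 4 + mean(30, 41)
```

With separated searches, the thresholds passed to `ScoredMatch` would first be combined, for example `merge_identity(40.0, 42.0)` gives 41.0 and `merge_homology(30.0, None)` gives 30.0.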

The benefit of the MudPIT score over the standard score is that it removes many junk protein sets, i.e. those that have a high standard score but no high-scoring peptide matches. Indeed, protein sets with a large number of weak peptide matches do not get a good MudPIT score.
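A small numerical comparison may help to see this effect. The sketch below uses made-up scores and thresholds, hypothetical function names, and a simplified representation in which each match carries a single threshold (the homology threshold if it exists, otherwise the identity threshold).

```python
from typing import List, Tuple

# Each match is (ions_score, threshold); values below are invented for illustration.
Match = Tuple[float, float]


def standard_score(matches: List[Match]) -> float:
    """Sum of all (non-duplicate) peptide match scores."""
    return sum(score for score, _ in matches)


def mudpit_score(matches: List[Match]) -> float:
    """Sum of score offsets above threshold, plus the mean subtracted threshold."""
    kept = [(s, t) for s, t in matches if s > t]
    if not kept:
        return 0.0
    offsets = sum(s - t for s, t in kept)
    return offsets + sum(t for _, t in kept) / len(kept)


junk = [(15.0, 40.0)] * 20            # many weak matches, none above threshold
good = [(60.0, 40.0), (70.0, 40.0)]   # two strong matches

print(standard_score(junk), mudpit_score(junk))  # 300.0 0.0
print(standard_score(good), mudpit_score(good))  # 130.0 90.0
```

Under the standard scheme the junk set outranks the good one (300 versus 130), whereas under the MudPIT scheme the ranking is reversed (0 versus 90).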

Mascot Modified MudPIT Scoring

This scoring scheme, introduced by Proline, is a modified version of the Mascot MudPIT one. It differs from the latter in that it does not add the average of the subtracted thresholds. This leads to the following scoring procedure:

Protein score = 0
For each peptide match {
  If there is a homology threshold and ions score > homology threshold {
    Protein score += ions score - homology threshold
  } else if ions score > identity threshold {
    Protein score += ions score - identity threshold
  }
}

This score has the same benefits as the MudPIT one. The main difference is that the minimum value of the modified version is always close to zero, whereas the genuine MudPIT score has a minimum value (the average of all the subtracted thresholds) that varies between datasets and proteins.

prolineconcepts/protscoring.txt · Last modified: 2015/07/10 15:21 by 132.168.72.225