Differences

This shows you the differences between two versions of the page.

--- prolineconcepts:protscoring [2013/03/25 16:30]
132.168.72.131 created
+++ prolineconcepts:protscoring [2015/07/10 15:21] (current)
132.168.72.225 [Proteins and Proteins sets scoring]
@@ Line 1: / Line 1: @@
 ====== Proteins and Proteins sets scoring ======
-They are multiple algorithm than could be use to calculate the Proteins and Proteins Sets score.
+Multiple algorithms can be used to calculate Protein and Protein Set scores.
-Actually, when
+Protein scores are computed during the import phase, while Protein Set scores are computed during the validation phase.
- * importing Mascot result file : the standard scoring is used
- * importing OMSSA result file : ...
- * validating an identification : depending on the installation (see your administrator), any of the following algorithm could been set
+===== Protein =====
-===== Standard Scoring =====
+Each individual protein match is scored according to all peptide matches associated with this protein, independently of any validation of these peptide matches.
+Currently, when
+  * importing a Mascot result file: the Mascot standard scoring is used (sum of peptide match scores)
+  * importing an OMSSA result file: FIXME
+  * importing an X! Tandem result file: the X! Tandem standard hyperscore is used
-The score associated to each identified protein (or protein set) is the sum **of the score of all peptide matches** identifying the protein (same set protein). In case of duplicate peptide matches (peptide matched by multiple queries) only the match with the best score is considered.
+===== Protein Set =====
+Each individual protein set is scored according to the validated peptide matches belonging to this protein set (see [[prolineconcepts:proteininferer|inference]]).
-===== Mascot Protein Set Scoring =====
-The score associated to each identified protein (or protein sets) is the sum of the offset between peptide matches score and corresponding threshold.
-Peptide matches considered for a protein (or protein set) is all peptide matches identifying that protein (or a same set protein). In case of duplicate peptide matches (peptide matched by multiple queries) only the best score match is considered.
-===== Mascot Mudpit Scoring =====
+===== Scoring schemes =====
+==== Mascot Standard Scoring ====
+The score associated with each identified protein (or protein set) is the sum **of the scores of all peptide matches** identifying this protein (or protein set). In case of duplicate peptide matches (a peptide matched by multiple queries), only the match with the best score is considered.
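The standard scoring rule above can be sketched in a few lines of Python. This is only an illustration, not Proline's or Mascot's actual code: the `PeptideMatch` record and the `standard_score` function are hypothetical names, and treating two matches as duplicates when they share the same peptide string is an assumption.

```python
from collections import namedtuple

# Hypothetical record: one peptide match = (peptide identifier, score).
PeptideMatch = namedtuple("PeptideMatch", ["peptide", "score"])

def standard_score(peptide_matches):
    """Sum the best score of each distinct peptide (duplicates keep the best match only)."""
    best = {}
    for pm in peptide_matches:
        # Assumption: duplicates are recognised by identical peptide identifiers.
        if pm.peptide not in best or pm.score > best[pm.peptide]:
            best[pm.peptide] = pm.score
    return sum(best.values())

matches = [
    PeptideMatch("PEPTIDEA", 40.0),
    PeptideMatch("PEPTIDEA", 55.0),  # duplicate: same peptide matched by another query
    PeptideMatch("PEPTIDEB", 30.0),
]
print(standard_score(matches))  # 55.0 + 30.0
```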
+==== Mascot MudPIT Scoring ====
+This scoring scheme is also based on the sum of the scores of all non-duplicate peptide matches. However, each peptide match contributes not its absolute score but the amount by which it exceeds the threshold: the score offset. Therefore, peptide matches with a score below the threshold do not contribute to the protein score. Finally, the average of the thresholds used is added to the score. For each peptide match, the "threshold" is the homology threshold if it exists; otherwise it is the identity threshold.
+The algorithm below illustrates the MudPIT score computation procedure:
+<code>
+Protein score = 0
+For each peptide match {
+  If there is a homology threshold and ions score > homology threshold {
+    Protein score += peptide score - homology threshold
+  } else if ions score > identity threshold {
+    Protein score += peptide score - identity threshold
+  }
+}
+Protein score += 1 * average of all the subtracted thresholds
+</code>
+  * if there are no significant peptide matches, the protein score will be 0.
+  * homology and identity threshold values depend on a given p-value. By default Mascot and Proline compute these thresholds with a p-value of 5%.
+  * with separate target and decoy searches, we obtain two values for each threshold: one for the target search and another for the decoy search. To obtain a single value, we apply the following procedure:
+    * the homology threshold is the decoy value if it exists else the target value
+    * the identity threshold is the mean of target and decoy values.
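The two rules for merging target and decoy thresholds can be written as a small helper. This is a sketch under stated assumptions: the function name `combine_thresholds` and its parameters are hypothetical, and a missing homology threshold is represented by `None`.

```python
def combine_thresholds(target_homology, decoy_homology, target_identity, decoy_identity):
    """Merge thresholds from separate target and decoy searches into single values."""
    # Homology: use the decoy value if it exists, else the target value.
    homology = decoy_homology if decoy_homology is not None else target_homology
    # Identity: mean of the target and decoy values.
    identity = (target_identity + decoy_identity) / 2
    return homology, identity
```

For example, `combine_thresholds(28.0, None, 40.0, 44.0)` falls back to the target homology threshold and averages the two identity thresholds.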
+The benefit of the MudPIT score over the standard score is that it removes many of the junk protein sets, which have a high standard score but no high-scoring peptide matches. Indeed, protein sets with a large number of weak peptide matches do not get a good MudPIT score.
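A minimal runnable version of the MudPIT procedure described in this section might look as follows. It is a sketch, not the Mascot or Proline implementation: the function name `mudpit_score` and the tuple layout are assumptions, the input is assumed to be already free of duplicate peptide matches, and a missing homology threshold is represented by `None`.

```python
def mudpit_score(peptide_matches):
    """peptide_matches: iterable of (score, homology_threshold_or_None, identity_threshold)."""
    total = 0.0
    subtracted = []  # thresholds actually subtracted, kept for the final average
    for score, homology, identity in peptide_matches:
        if homology is not None and score > homology:
            total += score - homology
            subtracted.append(homology)
        elif score > identity:
            total += score - identity
            subtracted.append(identity)
    # Add the average of the subtracted thresholds; with no significant
    # peptide match the score stays at 0.
    if subtracted:
        total += sum(subtracted) / len(subtracted)
    return total

example = [
    (50.0, 30.0, 40.0),  # above homology threshold: contributes 20
    (45.0, None, 35.0),  # no homology threshold, above identity: contributes 10
    (20.0, 25.0, 38.0),  # below both thresholds: contributes nothing
]
print(mudpit_score(example))  # 20 + 10 + average(30, 35)
```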
+==== Mascot Modified MudPIT Scoring ====
+This scoring scheme, introduced by Proline, is a modified version of the Mascot MudPIT one.
+It differs from the latter in that it does not take into account the average of the subtracted thresholds.
+This leads to the following scoring procedure:
+<code>
+Protein score = 0
+For each peptide match {
+  If there is a homology threshold and ions score > homology threshold {
+    Protein score += peptide score - homology threshold
+  } else if ions score > identity threshold {
+    Protein score += peptide score - identity threshold
+  }
+}
+</code>
+This score has the same benefits as the MudPIT one. The main difference is that the minimum value of this modified version will always be close to zero, while the genuine MudPIT score has a minimum value that is not constant across datasets and proteins (i.e. the average of all the subtracted thresholds).
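The modified procedure can be sketched in Python in the same way. As before, this is an illustration only: `modified_mudpit_score` and the tuple layout are hypothetical, duplicates are assumed to be removed beforehand, and `None` stands for a missing homology threshold.

```python
def modified_mudpit_score(peptide_matches):
    """peptide_matches: iterable of (score, homology_threshold_or_None, identity_threshold)."""
    total = 0.0
    for score, homology, identity in peptide_matches:
        if homology is not None and score > homology:
            total += score - homology
        elif score > identity:
            total += score - identity
    # Unlike the genuine MudPIT score, no average threshold is added here.
    return total

example = [
    (50.0, 30.0, 40.0),  # contributes 50 - 30
    (45.0, None, 35.0),  # contributes 45 - 35
    (20.0, 25.0, 38.0),  # below both thresholds: contributes nothing
]
print(modified_mudpit_score(example))
```

Because only offsets above the threshold are summed, a protein whose single match barely exceeds its threshold gets a score just above zero, which is the behaviour described above.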
-Not yet implemented
