Installing and configuring the Sequence Repository

Although this module is optional, installing it is recommended, especially if you want to view protein sequences in the user interfaces.

It can be installed on the same machine as the Proline Server. However, because this module parses the Mascot FASTA files to extract sequences and descriptions, it is more efficient to install it on the computer running your Mascot Server. In either case, the PostgreSQL server must be reachable from the computer where the Sequence Repository is installed.
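
One way to confirm that the PostgreSQL server is reachable from the chosen machine is a plain TCP connection test. The following is a minimal sketch, not part of Proline: the class name PostgresReachability and the "localhost" default are hypothetical, and 5432 is the port used in the host-config example of this page. It only checks that the port accepts connections, not that the credentials are valid.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Proline: checks that a TCP port accepts connections.
final class PostgresReachability {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or unknown host: treat as not reachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // "localhost" is a placeholder; pass the real database host as first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 5432;
        System.out.println(host + ":" + port + " reachable: " + isReachable(host, port, 3000));
    }
}
```

Run it on the machine where the Sequence Repository will be installed, passing the database host and port as arguments.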

Sequence Repository installation

This module is included in the Proline Server installation, whether you used the installer or installed manually.

Configuration

Configuration files are located under “<seqrepo_folder>/config”.

Server and Datastore description

The application.conf file defines the datastore and server settings used to access the UDS database (for a PostgreSQL database). The properties specified here should be the same as those specified when configuring the Proline Server.

proline-config {
  driver-type = "postgresql" // valid values are: h2, postgresql 
  max-pool-connection=3 // Beta property: maximum number of pool connections to the DB server. Defaults to 3
}

//User and Password to connect to databases server.
auth-config {
  user="<user-proline>"
  password="<password-proline>"
}

//Databases server Host
host-config {
  host="<host>"
  port="5432"
}

uds-db {
  connection-properties {
    dbName = "uds_db"
  }
}

h2-config {
  script-directory = "/h2"
  connection-properties {
    connectionMode = "FILE"
    driver = "org.h2.Driver"
  }
}

postgresql-config {
  script-directory = "/postgresql"
  connection-properties {
    connectionMode = "HOST"
    driver = "org.postgresql.Driver"
  }
}

Note:

If you did not change the default database naming scheme, keep the value 'uds_db' for dbName.
<user-proline> and <password-proline> are the same as those specified in application.conf for the Proline Server.
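
As a rough illustration of how these settings fit together, the sketch below assembles the host, port and dbName values into a URL of the standard PostgreSQL JDBC form (jdbc:postgresql://host:port/database). How the Sequence Repository builds its connection internally is not described on this page, so this is an assumption; the class name UdsJdbcUrl and the example host are hypothetical.

```java
// Hypothetical illustration, not Proline code: relates the settings in
// application.conf to a standard PostgreSQL JDBC URL.
final class UdsJdbcUrl {

    /** Builds jdbc:postgresql://host:port/dbName from the host-config and uds-db values. */
    static String build(String host, String port, String dbName) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + dbName;
    }

    public static void main(String[] args) {
        // "proline-db.example.org" is a placeholder for the <host> value.
        System.out.println(build("proline-db.example.org", "5432", "uds_db"));
    }
}
```

A URL of this form can be pasted into any PostgreSQL client, together with the user and password from auth-config, to check the settings before starting the module.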

Protein description parsing rule

Because this module extracts the protein sequence and description from a FASTA file for a specific protein accession, you must configure the rule used to parse the protein accession from the FASTA description line. This is similar to the rules specified in the Mascot Server. To do this, edit the parsing-rules.conf file. In this file, some characters must be escaped, that is, prefixed with '\': '\', ':' and '='.

// Path to the fasta files for the SeqRepository daemon. Multiple paths can be given, separated by ',' between []
// On Linux systems: local-fasta-directories =["/local/mascot/sequence"]
local-fasta-directories =["D:\\mascot\\sequence"] 

// Rules used for parsing fasta entries. Multiple rules can be specified.
// name : identifying rule definition
// fasta-name : FASTA file name must match specified Java Regex CASE_INSENSITIVE. multiple Regex separated by ',' between [] could be specified
// fasta-version : Java Regex with capturing group for fasta release version extraction (CASE_INSENSITIVE)
// protein-accession : Java Regex with capturing group for protein accession extraction

parsing-rules = [{
   name="label1",
   fasta-name=["uniprot"],
   fasta-version="[.]*_([^_]*).fasta",
   protein-accession =">\\w{2}\\|([^\\|]+)\\|"
},
{
   name="label2",
   fasta-name=["myDB"],
   fasta-version="[.]*_([^_]*).fasta",
   protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)"
}
]


//Default Java Regex with capturing group for protein accession if fasta file name doesn't match parsing_rules RegEx
// >(\\S+) :  String after '>' and before first space
default-protein-accession =">(\\S+)"
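
To get a feel for what these rules extract, the regexes can be tried directly in Java: the doubled backslashes in parsing-rules.conf are written the same way in a Java string literal. The sketch below is not the Sequence Repository implementation. It assumes the regexes are applied as a search (Matcher.find()) rather than a full match, and the class name ParsingRuleDemo and the file name are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Sequence Repository implementation.
final class ParsingRuleDemo {

    /** Returns capturing group 1 of the first match of regex in input, or null if none. */
    static String firstGroup(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    /** True if the file name contains a case-insensitive match of the fasta-name regex. */
    static boolean fastaNameMatches(String fastaNameRegex, String fileName) {
        return Pattern.compile(fastaNameRegex, Pattern.CASE_INSENSITIVE).matcher(fileName).find();
    }

    public static void main(String[] args) {
        // Values of rule "label1" and of the default rule, as written in parsing-rules.conf.
        String fastaName = "uniprot";
        String fastaVersion = "[.]*_([^_]*).fasta";
        String proteinAccession = ">\\w{2}\\|([^\\|]+)\\|";
        String defaultProteinAccession = ">(\\S+)";

        String fileName = "UniProt_mouse_20150629.fasta"; // made-up file name
        String header = ">sp|Q9CQV8|1433B_MOUSE 14-3-3 protein beta/alpha OS=Mus musculus GN=Ywhab PE=1 SV=3";

        if (fastaNameMatches(fastaName, fileName)) {
            // The file name matches the rule: use its version and accession regexes.
            System.out.println("release   = " + firstGroup(fastaVersion, Pattern.CASE_INSENSITIVE, fileName));
            System.out.println("accession = " + firstGroup(proteinAccession, 0, header));
        } else {
            // No rule matches: fall back to the default accession regex.
            System.out.println("accession = " + firstGroup(defaultProteinAccession, 0, header));
        }
    }
}
```

With these inputs the "label1" rule yields the release 20150629 and the accession Q9CQV8. The "label2" accession regex would yield 1433B_MOUSE for the same header, and the default rule would keep the whole sp|Q9CQV8|1433B_MOUSE string.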

Testing rules

To verify the configuration, once the previous files have been edited and saved, run the following tool from the Sequence Repository installation directory:

On Windows system

run-TestConfiguration.bat

On Linux system

run-TestConfiguration.sh

The output lists all FASTA files found and, for each one, the rule that will be applied. The first 3 entries of each file are also displayed with the extracted protein accession.

The output should look like:

 Scanning [D:\temp\fasta]
 [D:\temp\fasta] scan terminated
 Number of traversed dirs: 1
 Found FASTA file names: 7
  ---- Scanning Fasta local path ---- 
  Using default rule ">(\S+)" for fasta "Nouvelle_base_données_sara.fasta" 
 	 Accession "tr|F8WIX8|H2A.l_MOUSE" will be used for entry ">tr|F8WIX8|H2A.l_MOUSE Original_Name=F8WIX8_MOUSE Histone H2A OS=Mus musculus GN=Hist1h2al PE=3 SV=1".
 	 Accession "tr|Q5M8Q2|H2A.L.1.3_MOUSE" will be used for entry ">tr|Q5M8Q2|H2A.L.1.3_MOUSE Original_Name= Q5M8Q2_MOUSE Histone H2A OS=Mus musculus GN=OTTMUSG00000016789 PE=2 SV=1".
 	 Accession "tr|J3QP08|H2A.L.1.4_MOUSE" will be used for entry ">tr|J3QP08|H2A.L.1.4_MOUSE Original_Name= J3QP08_MOUSE Histone H2A OS=Mus musculus GN=Gm14501 PE=3 SV=1".
[UPS1UPS2_D_20121108.fasta] matches Fasta Name Regex "UPS1UPS2_"
  Using rule ">[^\|]*\|(\S+)" for "UPS1UPS2_D_20121108.fasta" 
    Release (using rule "_(?:D|(?:Decoy))_(.*)\.fasta") = "20121108" 
 	 Accession "ALBU_HUMAN_UPS" will be used for entry ">P02768ups|ALBU_HUMAN_UPS Serum albumin (Chain 26-609) - Homo sapiens (Human)".
 	 Accession "NEDD8_HUMAN_UPS" will be used for entry ">Q15843ups|NEDD8_HUMAN_UPS NEDD8 (Chain 1-81) - Homo sapiens (Human)".
 	 Accession "RASH_HUMAN_UPS" will be used for entry ">P01112ups|RASH_HUMAN_UPS GTPase HRas (Chain 1-189) - Homo sapiens (Human)".
  Using default rule ">(\S+)" for fasta "uniprot-taxonomy%3A-Mus+musculus+%28Mouse%29+%5B10090%5D-.fasta" 
 	 Accession "sp|Q9CQV8|1433B_MOUSE" will be used for entry ">sp|Q9CQV8|1433B_MOUSE 14-3-3 protein beta/alpha OS=Mus musculus GN=Ywhab PE=1 SV=3".
 	 Accession "sp|P62259|1433E_MOUSE" will be used for entry ">sp|P62259|1433E_MOUSE 14-3-3 protein epsilon OS=Mus musculus GN=Ywhae PE=1 SV=1".
 	 Accession "sp|P68510|1433F_MOUSE" will be used for entry ">sp|P68510|1433F_MOUSE 14-3-3 protein eta OS=Mus musculus GN=Ywhah PE=1 SV=2".
[iSa_D_20130403.fasta] matches Fasta Name Regex "ISA_"
  Using rule ">\w{2}\|([^\|]+)\|" for "iSa_D_20130403.fasta" 
    Release (using rule "_(?:D|(?:Decoy))_(.*)\.fasta") = "20130403" 
 	 Accession "Q99M51tag" will be used for entry ">sp|Q99M51tag|NCK1_strep-tag Cytoplasmic protein NCK1 OS=Mus musculus GN=Nck1 PE=1 SV=1".
 	 Accession "Q9ES52tag" will be used for entry ">sp|Q9ES52tag|SHIP1_strep-tag Phosphatidylinositol 3,4,5-trisphosphate 5-phosphatase 1 OS=Mus musculus GN=Inpp5d PE=1 SV=2".
 	 Accession "###REV###Q99M51tag" will be used for entry ">sp|###REV###Q99M51tag|NCK1_strep-tag Reverse sequence, was Cytoplasmic protein NCK1 OS=Mus musculus GN=Nck1 PE=1 SV=1".
[UP_MouseEDyP_D_20150629.fasta] matches Fasta Name Regex "UP_"
  Using rule ">\w{2}\|([^\|]*)\|\S+" for "UP_MouseEDyP_D_20150629.fasta" 
    Release (using rule "_([^_])*\.fasta") = "9" 
 	 Accession "Q9CQV8" will be used for entry ">sp|Q9CQV8|1433B_MOUSE 14-3-3 protein beta/alpha OS=Mus musculus GN=Ywhab PE=1 SV=3".
 	 Accession "Q9CQV8-2" will be used for entry ">sp|Q9CQV8-2|1433B_MOUSE Isoform Short of 14-3-3 protein beta/alpha OS=Mus musculus GN=Ywhab".
 	 Accession "P62259" will be used for entry ">sp|P62259|1433E_MOUSE 14-3-3 protein epsilon OS=Mus musculus GN=Ywhae PE=1 SV=1".
[S_cerevisiae_Decoy_20121108.fasta] matches Fasta Name Regex "S_cerevisiae_"
  Using rule ">\w{2}\|([^\|]*)\|\S+" for "S_cerevisiae_Decoy_20121108.fasta" 
    Release (using rule "_([^_])*\.fasta") = "8" 
 	 Accession "P38903" will be used for entry ">sp|P38903|2A5D_YEAST Serine/threonine-protein phosphatase 2A 56 kDa regulatory subunit delta isoform OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=RTS1 PE=1 SV=2".
 	 Accession "P31383" will be used for entry ">sp|P31383|2AAA_YEAST Protein phosphatase PP2A regulatory subunit A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=TPD3 PE=1 SV=3".
 	 Accession "Q00362" will be used for entry ">sp|Q00362|2ABA_YEAST Protein phosphatase PP2A regulatory subunit B OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=CDC55 PE=1 SV=2".
[UPS1UPS2_Decoy_20121108.fasta] matches Fasta Name Regex "UPS1UPS2_"
  Using rule ">[^\|]*\|(\S+)" for "UPS1UPS2_Decoy_20121108.fasta" 
    Release (using rule "_(?:D|(?:Decoy))_(.*)\.fasta") = "20121108" 
 	 Accession "ALBU_HUMAN_UPS" will be used for entry ">P02768ups|ALBU_HUMAN_UPS Serum albumin (Chain 26-609) - Homo sapiens (Human)".
 	 Accession "NEDD8_HUMAN_UPS" will be used for entry ">Q15843ups|NEDD8_HUMAN_UPS NEDD8 (Chain 1-81) - Homo sapiens (Human)".
 	 Accession "RASH_HUMAN_UPS" will be used for entry ">P01112ups|RASH_HUMAN_UPS GTPase HRas (Chain 1-189) - Homo sapiens (Human)".
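
In the output above, two files report a one-character release ("9" and "8") with the rule "_([^_])*\.fasta". This is what a Java regex returns when the quantifier is placed outside the capturing group: the group is overwritten on each repetition and keeps only the last character. The sketch below reproduces this and shows the variant with the quantifier inside the group. It assumes the release regex is applied with Matcher.find(), and the class name ReleaseRegexDemo is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only; assumes the release regex is applied with Matcher.find().
final class ReleaseRegexDemo {

    /** Returns capturing group 1 of the first match of regex in fileName, or null if none. */
    static String release(String regex, String fileName) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(fileName);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String fileName = "UP_MouseEDyP_D_20150629.fasta";
        // Quantifier outside the group: the group is overwritten on every repetition.
        System.out.println(release("_([^_])*\\.fasta", fileName)); // prints 9
        // Quantifier inside the group: the group spans the whole run of characters.
        System.out.println(release("_([^_]*)\\.fasta", fileName)); // prints 20150629
    }
}
```

If the test tool reports a truncated release like this, check where the quantifier sits relative to the capturing group in the fasta-version rule.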

Running Sequence Repository

To run the Sequence Repository, modify the following script if needed and execute it:

On Windows system

run-RetrieveService.bat

On Linux system

run-RetrieveService.sh

By default, RetrieveService runs with the option “-t 2”: the daemon is executed every 2 hours. To change the options, edit the run-RetrieveService script file and change the options that follow the fr.proline.module.seq.service.RetrieveService string.

Available options are:

Usage: <main class> [options]

Options:
  -f, --forceUpdate
     force update of MSIdb result summaries and biosequences (even if already
     updated)
     Default: false
  -p, --project
     the ID of the single project to process
     Default: 0
  -t, --time
     the daemon periodicity (in hours)
     Default: -1
  -debug
     set logger level to DEBUG (verbose mode)
     Default: false
