RINERATOR VERSION 0.5.1 10 October 2014 UPDATES V0.5.1 - Computation of conservation scores - Retrieval of external data - Simpler format for creating RINs UPDATES V0.4 - Ligand names are in the same format as residues. - A function for calling Probe without Reduce is available. - The chain identifier may be an empty string. - Small overlaps detected by Probe are ignored. - The format of edge attribute files is compatible with Cy 3. - A list of residues is included in the output. - Simple selection of ligands is enabled. - An example job for ligands and for multiple pdbs is available. 1.REQUIREMENTS You need to have python and Biopython installed. Both should be included in a standard linux distribution like Debian. You also need Reduce in order to get the hydrogen atoms in the PDB coordinate files, and Probe to identify the non covalent interactions. They are very easy to install and you can get them from the Richardson Lab Web Site (http://kinemage.biochem.duke.edu/subindex.php). python http://www.python.org/ tested on version 2.6 Biopython http://biopython.org tested on version 1.54 Reduce http://kinemage.biochem.duke.edu/software/reduce.php tested on version 3.23 Probe http://kinemage.biochem.duke.edu/software/probe.php tested on version 2.16 2. INSTALLATION 2.1. Set a directory for installation (INST_DIR). 2.2. Move the package RINerator_V0.5.1.tar.gz to INST_DIR and extract its content, e.g. on a linux machine, type shell command: tar xvzf RINerator_V0.5.1.tar.gz This creates RINerator_V0.5.1 directory with: README.TXT This instruction file Source Python scripts Test Example job files and results for testing installation 2.3. Create symbolic links to the Reduce and Probe executables, e.g., in your user's $USER_HOME/bin directory, using the following shell commands: cd $USER_HOME/bin ln -s PROGRAMS_DIR/reduce.3.23.130521.linuxi386 reduce ln -s PROGRAMS_DIR/probe.2.16.130520.linuxi386 probe 3. TEST INSTALLATION Go to the test directory: INST_DIR/RINerator_V0.5.1/Test Run the job file to compute the network .sif files by typing the command: INST_DIR/RINerator_V0.5.1/Source/get_chains.py PDB/pdb1hiv.ent Results/ INPUT/chains_1hiv_A.txt The following files should be generated in the directory Results: pdb1hiv_h.ent PDB with hydrogen atoms pdb1hiv_h.probe probe result file pdb1hiv_h.sif network file of residue interactions, can be loaded into Cytoscape pdb1hiv_h_nrint.ea attribute file with number of interactions between residues pdb1hiv_h_intsc.ea attribute file with score of interaction between residues These files should be identical to precomputed files that are included in directory INST_DIR/RINerator_V0.5.1/Test/OUTPUT/ (you can compare them with diff). 4. RUNNING RINERATOR V0.5.1 4.1. To create a RIN from all chains (and ligands) of a protein in a PDB file, execute the shell command: INST_DIR/RINerator_V0.5.1/Source/get_chains.py path_pdb path_output path_chains [path_ligands] Parameters: path_pdb PDB file or directory containing PDB files of the same protein structure path_output directory to save generated files path_chains file with chain identifiers separated by commas chain identifier might be any letter or an empty string examples: "A,B,I" or "" or "A," path_ligands [optional] file with ligand identifiers separated by new lines each ligand is listed with a name, a chain identifier and a residue number separated by commas examples: "NOA,I,1" or "NOA,,1" Test command from within the RINerator_V0.5.1/Test directory: ../Source/get_chains.py PDB/pdb1hiv.ent Results/ INPUT/chains_1hiv_all.txt INPUT/ligands_1hiv_all.txt 4.2. To create a RIN from a specific selection of residues in a PDB file, execute the shell command: INST_DIR/RINerator_V0.5.1/Source/get_segments.py path_pdb path_output path_segments [path_ligands] Parameters: path_pdb PDB file or directory containing PDB files of the same protein structure path_output directory to save generated files path_segments file with segment identifiers separated by new lines; each segment consists of a model number, a chain identifier, a starting and ending residue number separated by commas; starting and ending residue numbers should be one of these: "number" residue number "_" any residue, e.g., if set as starting and ending residue number, all residues in the chain are considered "None" no residue, if set as ending residue, only the starting residue is considered examples: "0,A,_,_" or "0,B,25,None" or "0,B,50,70" path_ligands [optional] file with ligand identifiers separated by new lines; each ligand is listed with a name, a chain identifier and a residue number separated by commas examples: "NOA,I,1" or "NOA,,1" Test command from within the RINerator_V0.5.1/Test directory: ../Source/get_segments.py PDB/pdb1hiv.ent Results/ INPUT/segments_1hiv.txt INPUT/ligands_1hiv_all.txt 4.3. To calculate conservation scores from a user-specified MSA file, execute the shell command: INST_DIR/RINerator_V0.5.1/Source/get_conservation.py gap_format output_format path_alignment path_out path_log [path_id] Parameters: gap_format any symbol will be considered as a gap, e.g. "-" or "." output_format "name+score" identifier conservation_score "resid+score" index conservation_score "score" conservation_score path_alignment file with multiple sequence alignment in FASTA format path_out file with 1 or 2 columns (according to output_format) separated by space path_log log file path_id [optional] file with nodes identifiers (according to the NONgaps positions in the first sequence in alignment) in TXT format (should not contain empty lines) Test command from within the RINerator_V0.5.1/Test directory: ../Source/get_conservation.py - name+score INPUT/pdb1hiv_ConsrufDB_align.fasta Results/pdb1hiv_h_cons.txt Results/pdb1hiv_h_cons.log OUTPUT/pdb1hiv_h_res.txt Note that we use the OUTPUT/pdb1hiv_h_res.txt identifier file since it contains *only* the residue nodes from chain A in contrast to Results/pdb1hiv_h_res.txt, which currently contains all residue nodes assuming the previous test commands were executed. 4.4. To retrieve data from external resources, such as AAindex and ConSurfDB, execute the shell command: INST_DIR/RINerator_V0.5.1/Source/get_data.py path_id path_output pdb_id [path_input] Parameters: path_id file with nodes identifiers (according to the RIN specifications) in TXT format path_output file to save retrieved data pdb_id PDB identifier (The script will try to retrieve the conservation scores from the ConSurfDB website, if no consurf.grades files are found in path_input.) path_input: [optional] path with additional input data, any of the following files: consurf.grades conservation scores from ConSurfDB. If more than one file, e.g., for each chain, the chain identifier should be included in the file name: consurf_A.grades *.sif network file generated by RINerator. This file will be used to calculate the number of unweighted residue interactions of each residue node. *_nrint.ea edge attributes generated by RINerator. This file will be used to calculate the number of atomic interactions of each residue node. *_intsc.ea edge attributes generated by RINerator. This file will be used to calculate the sum of atomic interaction scores for each residue node. Test command from within the RINerator_V0.5.1 directory: Source/get_data.py Test/Results/pdb1hiv_h_res.txt Test/Results/pdb1hiv_h_data.na 1hiv 5. RUNNING JOBS FILES (EARLIER VERSIONS) RINerator_V04 and below worked with user-editable job_files. There are still available in the INST_DIR/RINerator_V0.5.1/Test/JOBS directory and their usage is described in the following section. For generating RINs for single chains in a PDB file without models, go to 5.1; for including ligands, go to 5.2; for defining residue segments or specifying the model number, go to 5.3; for full selection control go to 5.4; for multiple runs with the same PDB file, go to 5.5; for multiple files of the same PDB structure, go to 5.6. If the symbolic links to Reduce and Probe do not work or could not be created for some reason, edit the test job file with text editor and set the paths of the reduce and probe programs, e.g.: reduce_cmd = 'REDUCE_INSTALLATION_DIRECTORY/reduce.3.23.130521.linuxi386' probe_cmd = 'PROBE_INSTALLATION_DIRECTORY/probe.2.16.130520.linuxi386' 5.1 SELECTING CHAINS In order to compute a network for all or for a single chain in a given PDB structure, say xxx.pdb, you need to copy the test script test_job_chains.py and edit the lines: reduce_cmd: reduce command probe_cmd: probe command pdb_path: path of the input pdb pdb_h_path: path for the output pdb file with hydrogen atoms probe_path: path for the probe output file pdb_filename: file name of the input pdb sif_file: name of sif file sel_id: string identifier for the selection, can be any string chains: identifiers of the chains to be included in the network If you want to select only chain A, write: chains = ['A'] For chain A and B, write: chains = ['A', 'B'] Run the modified script test_job_chains.py with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py test_job_chains.py The following files are generated: pdb_h_path/xxx_h.ent pdb file with hydrogen atoms probe_path/xxx_h.probe a probe result file sif_file the network sif file sif_file_name_nrint.ea edge attribute file with number of interactions between residues sif_file_name_intsc.ea edge attribute file with interaction score between residues sif_file_name_res.txt list of all residues The interaction score is defined as the sum of the interaction scores between all atoms of two residues. See probe publication: Word, et al. (1999) J. Mol. Biol. 285, 1709-1731. 5.2 SELECTING CHAINS AND LIGANDS In order to compute a network for all or for a single chain as well as ligands in a given PDB structure, say xxx.pdb, you need to copy the test script test_job_chains_ligands.py and edit the lines: reduce_cmd: reduce command probe_cmd: probe command pdb_path: path of the input pdb pdb_h_path: path for the output pdb file with hydrogen atoms probe_path: path for the probe output file pdb_filename: file name of the input pdb sif_file: name of sif file sel_id: string identifier for the selection, can be any string chains: identifiers of the chains to be included in the network ligand*: a single ligand ligands: a list of ligands to be included in the network If you want to select only chain A, write: chains = ['A'] For chain A and B, write: chains = ['A', 'B'] For example: ligand1 = ['NOA', 'I', 1] all_ligands = [ligand1] will select ligand NOA in chain I with residue number 1. The syntax for a ligand is: ligand = [1,2,3] 1: string ligand name 2: string chain identifier 3: start of ligand (could be integer, string or None) The start of a ligand could be any of these three values: '_' if the ligand contains several residues and all are to be selected None if the residue number is not specified residue_number the residue number, e.g. 1 Run the modified script test_job_chains.py with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py test_job_chains_ligands.py The following files are generated: pdb_h_path/xxx_h.ent pdb file with hydrogen atoms probe_path/xxx_h.probe a probe result file sif_file the network sif file sif_file_name_nrint.ea edge attribute file with number of interactions between residues sif_file_name_intsc.ea edge attribute file with interaction score between residues sif_file_name_res.txt list of all residues The interaction score is defined as the sum of the interaction scores between all atoms of two residues. See probe publication: Word, et al. (1999) J. Mol. Biol. 285, 1709-1731. 5.3. SELECTING RESIDUE SEGMENTS AND SPECIFYING MODEL NUMBER In order to compute a network for a residue segment or specify a model for a given PDB structure, say xxx.pdb, you need to copy the test script test_job_segments.py and edit it the lines: reduce_cmd: reduce command probe_cmd: probe command pdb_path: path of the input pdb pdb_h_path: path for the output pdb file with hydrogen atoms probe_path: path for the probe output file pdb_filename: file name of the input pdb sif_file: name of sif file sel_id: string identifier for the selection, can be any string segment*: a single segment selection all_segments: a list of segments to be included in the network For example: segment1 = [0, 'A', 20, 30] segments = [segment1] will select residues 20-30 in chain A. The syntax for a segment is: segment = [1, 2, 3, 4] 1: model number, always 0 for first model (in NMR PDB files) or when there are no models in the PDB file 2: string chain identifier 3: start of segment (could be integer, string or None) 4: end of segment The start and end of segments are strings and could be any of these four values: '_' if the whole chain is selected None if the residue is not specified residue_number the residue number, e.g. 20 'residue_number and insertion_code' the residue number and an insertion code, e.g. '20A' Select all residues in chain A: segment1 = [0, 'A', '_', '_'] Select residue 25 in chain B: segment2 = [0, 'B', 25, None] Select residues 50-70 in chain B: segment3 = [0, 'B', 50, 70] Include all three segments in the network: all_segments = [segment1, segment2, segment3] After defining the selection, run the modified script test_job_segments.py with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py test_job_segments.py The following files are generated: pdb_h_path/xxx_h.ent pdb file with hydrogen atoms probe_path/xxx_h.probe a probe result file sif_file the network sif file sif_file_name_nrint.ea edge attribute file with number of interactions between residues sif_file_name_intsc.ea edge attribute file with interaction score between residues sif_file_name_res.txt list of all residues The interaction score is defined as the sum of the interaction scores between all atoms of two residues. See probe publication: Word, et al. (1999) J. Mol. Biol. 285, 1709-1731. 5.4. ADVANCED SELECTION In order to compute a network for an arbitrary selection of residues and ligands for a given PDB structure, say xxx.pdb, you need to copy the test script test_job_ligand.py and edit it the lines: reduce_cmd: reduce command probe_cmd: probe command pdb_path: path of the input pdb pdb_h_path: path for the output pdb file with hydrogen atoms probe_path: path for the probe output file pdb_filename: file name of the input pdb sif_file: name of sif file sel_id: string identifier for the selection, can be any string component*: a single selection components: a list of selections to be included in the network For example: component1 = ['1hiv', 'protein', [[0,'A',[' ','_',' '],[' ','_',' ']], [0,'B',[' ','_',' '],[' ','_',' ']], [0,'I',[' ','_',' '],[' ','_',' ']]]] will select all residues in chain A and chain B. component2 = ['HOH', 'water', [[0,'A',['W',302,' '],['W',390,' ']], [0,'B',['W',304,' '],['W',389,' ']], [0,'I',['W',301,' '],['W',339,' ']]]] will select all water molecules in chain A and chain B. component3 = ['NOA', 'ligand', [[0,'I',['H_NOA',1,' '],[None,None,None]]]] will select the ligand NOA in chain I with residue number 1. components = [component1, component2, component3] will include all of the above selections. For more examples, see next section. The syntax for each component is: component = [1, 2, [[3,4,[5,6,7],[8,9,10]], ...]] 1: string label 2: should be either 'protein' or 'water' or 'ligand' 3: model number, always 0 for first model (in NMR PDB files) or when there are no models in the PDB file 4: string with chain identifier 5,6,7: definition of start of a segment 5: if start residue is a standard amino acid residue then is space string(' ') This is the hetero-flag defined in Biopython, only if residue is hetero atom (HETATM in PDB), then this is 'H_' plus the name of the hetero-residue or 'W' for water. See Bio.PDB FAQ documentation in Biopython web site for more information. 6: residue number of start residue in segment 7: insertion code, in most case there is no insertion code and you should use space string (' ') 8,9,10: definition of end of segment, same syntax as start of a segment if present, otherwise 'None' After defining the selection, run the modified script test_job_ligand.py with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py test_job_ligand.py The following files are generated: pdb_h_path/xxx_h.ent pdb file with hydrogen atoms probe_path/xxx_h.probe a probe result file sif_file the network sif file sif_file_name_nrint.ea edge attribute file with number of interactions between residues sif_file_name_intsc.ea edge attribute file with interaction score between residues sif_file_name_res.txt list of all residues The interaction score is defined as the sum of the interaction scores between all atoms of two residues. See probe publication: Word, et al. (1999) J. Mol. Biol. 285, 1709-1731. 5.5. ADVANCED SELECTION AND MULTIPLE RUNS For advanced selection and multiple runs, you need to use these two files: run_reduce_probe_job.py script to generate pdb file with hydrogen bonds and probe file with interactions test_job_all.py script to generate network file of residue interactions (.sif file) 5.5.1. RUN REDUCE AND PROBE In run_reduce_probe_job.py you need to set the names of several files and directories: reduce_cmd: the reduce command probe_cmd: the probe command pdb_filename: the pdb file name in pdb_path: the path of the input pdb pdb_h_path: the path for the output pdb file with hydrogen atoms probe_path: the path for the probe output file You can run the script with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py run_reduce_probe_job.py Then, you get new PDB file xxx_h.ent stored in path defined in pdb_h_path, and a probe result file xxx_h.probe stored in path defined in probe_path 5.5.2. GENERATE NETWORK SIF FILE In test_job_all.py, edit the lines: pdb_path: path where to find the pdb file with hydrogen atom coordinates pdb_filename: name of pdb file with hydrogen atoms probe_path: the path for the probe output file probe_filename: name of probe file sif_file: name of sif file sel_id: string identifier for the selection, can be any string component*: a single selection components: list of selections to be included in the network You need to select the residues that are going to be included in the network of interactions. For example: component1 = ['1hiv', 'protein', [[0,'A',[' ',2,' '],[' ',57,' ']]]] selects residues 2-57 in chain A, while component2 = ['NOA', 'ligand', [[0,'I',['H_NOA',1,' '],[None,None,None]]]] selects the ligand NOA in chain I. All components to be included in the RIN are added to the list: components = [component1, component2] The syntax for a component is: component = [1, 2, [[3,4,[5,6,7],[8,9,10]], ...]] 1: string label 2: should be either 'protein' or 'water' or 'ligand' 3: model number, always 0 for first model (in NMR PDB files) or when there are no models in the PDB file 4: string with chain identifier 5,6,7: definition of start of a segment 5: if start residue is a standard amino acid residue then is space string(' ') This is the hetero-flag defined in Biopython, only if residue is hetero atom (HETATM in PDB), then this is 'H_' plus the name of the hetero-residue or 'W' for water. See Bio.PDB FAQ documentation in Biopython web site for more information. 6: residue number of start residue in segment 7: insertion code, in most case there is no insertion code and you should use space string (' ') 8,9,10: definition of end of segment, same syntax as start of a segment if present, otherwise 'None' If the start and end of the segment are the first and last residue in the chain than use '_' as residue number. For example, select all residues in chain A: component = ['1hiv', 'protein', [[0,'A',[' ','_',' '],[' ','_',' ']]]] Additional segments can be defined, for segments 2-57 and 65-90 in chain A: component = ['1hiv', 'protein', [[0,'A',[' ',2,' '],[' ',57,' ']], [0,'A',[' ',65,' '],[' ',90,' ']]]] If the segment is only one residue, then the end of segment is defined with None. For example, residues 25 and 27 in chain B: component = ['1hiv', 'protein', [[0,'B',[' ',25,' '],[None,None,None]], [0,'B',[' ',27,' '],[None,None,None]]]] Different chains can be identified, for example, to define segments 2-57 and 65-90 in chain A and 65-90 in chain B: component = ['1hiv', 'protein', [[0,'A',[' ',2,' '],[' ', 57,' ']], [0,'A',[' ',65,' '],[' ',90,' ']], [0,'B',[' ',65,' '],[' ',90,' ']]]] Another example, residues 20-30 in chain A, plus 25,26,65-68 in chain B: component = ['1hiv', 'protein', [[0,'A',[' ',20,' '],[' ',30,' ']], [0,'B',[' ',25,' '],[None,None,None]], [0,'B',[' ',26,' '],[None,None,None]], [0,'B',[' ',65,' '],[' ',68,' ']]]] 5.6. MULTIPLE FILES FOR THE SAME STRUCTURE In order to compute a network for all chains and ligands for multiple files of the same PDB structure, you need to copy the test script test_job_directory.py and edit the lines: reduce_cmd: reduce command probe_cmd: probe command pdb_path: path of the input pdb rin_path: path for the output chains: identifiers of the chains to be included in the network ligand*: a single ligand ligands: a list of ligands to be included in the network For the chains and ligands syntax, see section 4.2. Run the modified script test_job_directory.py with the command: INST_DIR/RINerator_V0.5.1/Source/get_ncint.py test_job_directory.py The following files are generated for each pdb file xxx.pdb in pdb_path: rin_path/xxx_h.ent pdb file with hydrogen atoms rin_path/xxx_h.probe a probe result file rin_path/sif_file the network sif file rin_path/sif_file_name_nrint.ea edge attribute file with number of interactions between residues rin_path/sif_file_name_intsc.ea edge attribute file with interaction score between residues rin_path/sif_file_name_res.txt list of all residues The interaction score is defined as the sum of the interaction scores between all atoms of two residues. See probe publication: Word, et al. (1999) J. Mol. Biol. 285, 1709-1731. Send comments and questions to: Nadezhda T. Doncheva Max-Planck-Institut Informatik Campus E1 4 66123 Saarbruecken, Germany email: doncheva _AT_ mpi-inf.mpg.de