Perl Scripts for bioinformatics
This page contains Perl scripts I wrote during my bioinformatics work. Please download a fresh copy of each script from here, as they are continuously debugged and updated.
Genscan2GFF.pl (older version; only works with Genscan)
Genscan_Fgenesh2GFF3.pl (last updated on 09/06/2013)
This script takes a Genscan or Fgenesh output file as input and converts the predicted genes into GFF format.
Usage: perl Script -options
-i Genscan/fgenesh OutFile [Required]
-o Outputfile_name [infile.gff]
-f genscan or fgenesh [genscan]
-p yes|no. Parse partial genes [yes]
-e 1|0. 1: print extended features, e.g. PolyA and TF binding sites [0]
Options:
-p Print partial genes in the GFF. If set to yes, partial predictions are included in the GFF output. The default is yes; set it to no if you do not want partial predictions in the output.
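The conversion step can be illustrated with a minimal sketch of writing one GFF3 line. This is not the code from Genscan_Fgenesh2GFF3.pl: the subroutine name gff3_line and the hash keys are hypothetical, and it assumes that a predictor may report reverse-strand features with the begin coordinate greater than the end, whereas GFF3 expects start <= end with the strand in its own column.

```perl
use strict;
use warnings;

# Build one tab-separated GFF3 line from a hash of feature fields.
# The field names are hypothetical, chosen to mirror the nine GFF3 columns.
sub gff3_line {
    my (%f) = @_;
    # GFF3 expects start <= end; the strand column carries the orientation.
    my ($start, $end) = sort { $a <=> $b } @f{qw(start end)};
    return join("\t",
        $f{seqid}, $f{source}, $f{type}, $start, $end,
        $f{score}      // '.',
        $f{strand}     // '.',
        $f{phase}      // '.',
        $f{attributes} // '.',
    );
}

# Example: a reverse-strand CDS reported with begin > end.
print gff3_line(
    seqid  => 'contig1', source => 'genscan', type  => 'CDS',
    start  => 5000,      end    => 4800,      score => 12.5,
    strand => '-',       phase  => 0,
    attributes => 'ID=gene1.cds1;Parent=gene1',
), "\n";
```

Missing values are written as '.', which is the GFF3 placeholder for an undefined column.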
Translate_DNA_6frames
This script translates multiple sequences in a fasta file in all 6 reading frames. It can also produce the translation in any single requested frame.
Usage:
#################### DNA 2 PROTEIN ####################
This script will convert your DNA sequence to PROTEIN Sequence in 6 frames
Usage: perl script -options
-s sequence_file. Multifasta sequence file containing DNA sequence.
-f frame [1]. Which frame you want to translate in. choose values from 1 to 6.
-r [o]nly || [r]ange [o]. E.g. if you choose frame 3 (-f 3), selecting 'o' will translate only frame 3, while selecting 'r' will
translate the DNA in all frames from 1 to 3. Default is only one frame.
-l smallest_protein-allowed [5]. Smallest size of the translated protein allowed in the output file.
-c yes|no. Clean sequence of stop codon in frame 1 [no]. Use yes if you want to remove stop codons from sequences.
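As an illustration of what six-frame translation involves, here is a minimal sketch using the standard genetic code. It is not the code from Translate_DNA_6frames; the subroutine names are hypothetical, and it assumes that frames 1 to 3 read the forward strand at offsets 0, 1 and 2, while frames 4 to 6 read the reverse complement at the same offsets. The script's own frame numbering may differ.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $idx = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $idx++, 1);
        }
    }
}

# Reverse complement of a DNA string.
sub revcomp {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Translate one frame (1-6). Assumed numbering: 1-3 forward, 4-6 reverse.
# Codons containing ambiguous bases are translated as 'X'.
sub translate_frame {
    my ($dna, $frame) = @_;
    $dna = uc $dna;
    $dna = revcomp($dna) if $frame > 3;
    my $offset  = ($frame - 1) % 3;
    my $protein = '';
    for (my $pos = $offset; $pos + 3 <= length($dna); $pos += 3) {
        $protein .= $codon_table{ substr($dna, $pos, 3) } // 'X';
    }
    return $protein;
}

print "Frame $_: ", translate_frame('ATGGCCTAA', $_), "\n" for 1 .. 6;
```

A range mode like '-r r' would then amount to calling translate_frame for every frame from 1 up to the requested one.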
AddOrReplace_inFastaTitle_V2.0
This script modifies the headers in a fasta file according to the options selected. If you want to change, add, remove or replace anything in the headers of a fasta file, this script is what you need. It can also take a two-column table containing the pattern to search for and the new name to use for the modification. This way you can change multiple files using multiple criteria.
usage : perl script_name [....options]
options:
-s Sequence file containing sequences in fasta format
-t table_file_name|length|manual. [length]
table_file_name: table of two columns, 1:pattern to look for, 2: New name to be used for modification.
The pattern and new name should be separated by a tab, with the pattern first in each row.
e.g. pattern1 New_name1
pattern2 New_name2
length: Use length of the sequence as New_name.
manual: provide pattern with '-p' and replacement string with '-r' flags. If not, will be asked for pattern and
replacement string manually. Only append/prepend/substitutes the pattern with the replacement string in
the header.
-d delimiter to use to append or prepend the New_name ['_']
-c a|r|p|s. [default is append]
[a]ppend:add to the end,
[r]eplace: replace the whole header with New_Name,
[p]repend: add to the start.
[s]ubstitute: only substitute the pattern with New_Name.
-o Output file to store resulting sequences.[Mod_inputFileName]
-n Yes|no. Print list of patterns not found in sequence.
-m Search mode to be used. Use following options:[exact]
'match' to use pattern matching mode.
'exact' to use exact match mode.
-p Manual pattern to look for. provide only one string. works with manual mode only.
-r Manual replacement string to replace/append/prepend when manual pattern is found. works with manual
mode only.
-i yes|no. ignore case while pattern match [no].
-v yes|no. Print the list of all the replacements done.[no]
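The four '-c' modes can be sketched as a single subroutine acting on one header. This is only an illustration, not the code from AddOrReplace_inFastaTitle_V2.0: the name modify_header is hypothetical, the pattern is treated as a literal substring (an assumption), and headers that do not contain the pattern are returned unchanged.

```perl
use strict;
use warnings;

# Apply one of the modes a|r|p|s to a FASTA header (given without '>').
sub modify_header {
    my ($header, $pattern, $new_name, $mode, $delim) = @_;
    $delim //= '_';
    # Assumption: a header without the pattern is left untouched.
    return $header if index($header, $pattern) < 0;
    return $header . $delim . $new_name if $mode eq 'a';   # append
    return $new_name                    if $mode eq 'r';   # replace whole header
    return $new_name . $delim . $header if $mode eq 'p';   # prepend
    if ($mode eq 's') {                                    # substitute pattern only
        (my $changed = $header) =~ s/\Q$pattern\E/$new_name/;
        return $changed;
    }
    die "Unknown mode '$mode'\n";
}

for my $mode (qw(a r p s)) {
    printf "%s: %s\n", $mode, modify_header('seq1 sample', 'seq1', 'geneA', $mode);
}
```

With a two-column table, the same call would simply be repeated for every pattern and new-name pair read from the file.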
RandomDNA_Generator
This script generates the specified number of random DNA sequences of user-specified length and GC content. Random sequences are needed for some statistical analyses of actual sequences.
Usage:
perl script options...
-l length of random sequence[1000]
-g GC content of random sequence[0.5]
-n number of random sequence requested[1]
-v verbose. Show seq on screen also
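One simple way to generate such a sequence is to draw each base independently, as in the sketch below. This is an assumption about the approach, not the code from RandomDNA_Generator, and the name random_dna is hypothetical. With this method the GC content matches the requested value only on average, so short sequences can deviate noticeably.

```perl
use strict;
use warnings;

# Draw each base independently: G or C with probability $gc, otherwise A or T.
sub random_dna {
    my ($length, $gc) = @_;
    my $seq = '';
    for (1 .. $length) {
        my @pair = rand() < $gc ? qw(G C) : qw(A T);
        $seq .= $pair[ int rand 2 ];
    }
    return $seq;
}

# Three random sequences of 60 nt with an expected GC content of 0.5.
for my $n (1 .. 3) {
    print ">random_$n\n", random_dna(60, 0.5), "\n";
}
```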
Search and get selected sequences from multifasta file (last updated 09/26/2013)
Version 2.0 of this script pulls sequences from a fasta file based on keywords listed one per line in a pattern file. Any white space inside a pattern is removed, so do not use patterns containing internal white space; white space at the end is not a problem. The script reads all sequences into memory, so it is not advised for very large sequence files (e.g. >100 MB) on computers with little memory. The script was also modified to remove other unrecognizable symbols from the pattern and the sequence header. Version 6 appends an X to the end of both the pattern and the header, to avoid picking up wrong sequences whose names are merely similar.
USAGE INSTRUCTIONS:
This script will read sequence names and pick out the related sequences from the given fasta file.
Patterns can be provided as a list of names stored in a file, or typed manually when asked. Alternatively, a multi-line table can be provided in which each line contains an output file name followed by the sequences to be saved in that file, separated by semicolons. The script will create a new file for each line and save all the sequences in that line in one file.
usage : perl script_name [options]
options:
-s Sequence file containing sequences in fasta format
-l List file containing patterns separated by new lines.
-t Table of output file name and sequences to put in those files.
The name of the output file and the sequences should be separated by semicolons, file name first in each row.
e.g. filename1;seq1;seq2;seq3......
filename2;seq11;seq12;seq13......
-o Output file to store resulting sequences.
-m Search mode to be used. Use following options:[exact]
'match' to use pattern matching mode. Whole word match required.
'pmatch' to use partial pattern matching. get sequences with matching pattern.
'exact' to use exact match mode.
-d delimiter to split the sequence header and use -c column for search
-c use this column number for search purpose.
-h Print help and exit.
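The difference between the three '-m' search modes can be sketched as follows. This is an illustration only, not the script's own code: the subroutine names read_fasta and header_matches are hypothetical, and the interpretation of each mode (whole-header equality for 'exact', whole-word regex match for 'match', substring search for 'pmatch') is an assumption based on the option descriptions above.

```perl
use strict;
use warnings;

# Read a multi-FASTA filehandle into a hash: header (without '>') => sequence.
sub read_fasta {
    my ($fh) = @_;
    my (%seq, $name);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) { $name = $1; $seq{$name} = ''; }
        elsif (defined $name)  { $seq{$name} .= $line; }
    }
    return \%seq;
}

# Return 1 if the header matches the pattern under the given mode, else 0.
sub header_matches {
    my ($header, $pattern, $mode) = @_;
    return $header eq $pattern ? 1 : 0               if $mode eq 'exact';
    return $header =~ /\b\Q$pattern\E\b/ ? 1 : 0     if $mode eq 'match';
    return index($header, $pattern) >= 0 ? 1 : 0     if $mode eq 'pmatch';
    die "Unknown mode '$mode'\n";
}

# Demo on an in-memory FASTA file.
my $fasta = ">contig1\nACGT\nAC\n>contig10 sample\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $seqs = read_fasta($fh);
close $fh;

for my $mode (qw(exact match pmatch)) {
    my @hits = grep { header_matches($_, 'contig1', $mode) } sort keys %$seqs;
    print "$mode: @hits\n";
}
```

In this example only 'pmatch' also returns 'contig10 sample' for the pattern 'contig1', which is the kind of similar-name hit that stricter modes are meant to avoid.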
Extract portion of sequence from large sequences
This script extracts sequences from a sequence file provided by the user, based on the
coordinates given in a blast file (tabular output). The user can optionally provide a number
of nucleotides to add on both sides.
This script will read sequence names and coordinates from the
blast file and extract sequences from the sequence file provided.
usage: use following options
-m mode of input (auto or manual) [Default is auto]
-s sequence file
-b Blast output file in table format
-h (optional) number of extra nt to add at head [Default 0]
-t (optional) number of extra nt to add at tail [Default 0]
-o output file [Default is: output_seq_extract]
-g use names till first space while searching. [use full length names while]
usage for manual input mode:
-s sequence file name containing all sequences
-b sequence/contig name to extract from
-h (optional) number of extra nt to add at head [Default 0]
-t (optional) number of extra nt to add at tail [Default 0]
-o output file [Default is: output_seq_extract]
-q query name for manual input[Default is: manual_query ]
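The extraction with extra nucleotides at head and tail can be sketched as below. This is not the script's own code. It assumes the standard 12-column BLAST tabular layout, in which columns 9 and 10 hold the 1-based subject start and end, and it assumes the subject coordinates are the ones used. The name extract_region is hypothetical. The sketch returns the forward-strand substring and does not reverse-complement hits on the minus strand.

```perl
use strict;
use warnings;

# Return the 1-based region start..end of $seq, extended by $head and $tail
# nucleotides and clamped to the sequence boundaries.
sub extract_region {
    my ($seq, $start, $end, $head, $tail) = @_;
    ($start, $end) = ($end, $start) if $start > $end;   # minus-strand hits
    my $from = $start - ($head // 0);
    my $to   = $end   + ($tail // 0);
    $from = 1            if $from < 1;
    $to   = length($seq) if $to > length($seq);
    return substr($seq, $from - 1, $to - $from + 1);
}

# One BLAST tabular line (assumed standard 12-column layout) and its subject.
my %subject = (contigA => 'ACGTACGTAC');
my $line    = join "\t", qw(q1 contigA 98.5 4 0 0 1 4 3 6 1e-5 50);
my ($query, $sname, $sstart, $send) = (split /\t/, $line)[0, 1, 8, 9];

# Add 1 nt at the head and 2 nt at the tail, as with -h 1 -t 2.
print ">$query\n", extract_region($subject{$sname}, $sstart, $send, 1, 2), "\n";
```

Clamping keeps the flanks from running past either end of the contig when the hit lies close to a sequence boundary.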