Generate Sequences

source code: getvalidsequences.pl

Each ~1000 bp sequence is scanned for ATG positions that have 100 bp upstream and 100 bp downstream. Every such ATG is written out as a 203-nucleotide window sequence (100 upstream, ATG, 100 downstream); in addition, the map is written to STDOUT (redirected to mapfile in the usage below). Position 500 (array indices start at 0) in each full-length sequence is the positive TIS.
Usage: perl getvalidsequences.pl inputfile outputfile > mapfile
inputfile: ~1000 bp nucleotide sequences
outputfile: 203-nucleotide ATG window sequences
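
The window extraction described above can be sketched as follows. This is a minimal Python illustration, not the code of getvalidsequences.pl; the function name atg_windows and the flank parameter are hypothetical, and the input and map file formats are not modelled.

```python
def atg_windows(seq, flank=100):
    """Yield (index, window) for every ATG that has `flank` bases on both sides.

    `index` is the 0-based position of the A of ATG in the full-length
    sequence; `window` has length 2 * flank + 3 (203 for flank=100).
    """
    seq = seq.upper()
    # Start searching at `flank` so that enough upstream bases exist.
    start = seq.find("ATG", flank)
    # Stop as soon as fewer than `flank` bases remain downstream.
    while start != -1 and start + 3 + flank <= len(seq):
        yield start, seq[start - flank:start + 3 + flank]
        start = seq.find("ATG", start + 1)
```

In every window returned by this sketch the ATG sits at indices 100 to 102, and an ATG closer than 100 bp to either end of the sequence is skipped.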

Generate Features:

source code: genfeatures.pl

This code generates features (arff format) from the sequences (203-nucleotide ATG window: upstream 100, ATG, downstream 100). Total number of features
upstream: monomers, dimers, trimers, tetramers, pentamers, codons
downstream: monomers, dimers, trimers, tetramers, pentamers, codons
A counter shows the number of the sequence currently being processed.
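
The feature groups listed above can be illustrated with a short Python sketch. It is not the code of genfeatures.pl and rests on several assumptions: features are raw counts, codons are non-overlapping trimers in frame with the ATG, and the names kmer_counts, window_features and the feature labels are hypothetical. Writing the arff file is left out.

```python
from itertools import product

BASES = "ACGT"

def kmer_counts(region, k, step=1, offset=0):
    """Count every possible k-mer over ACGT in `region`."""
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    for i in range(offset, len(region) - k + 1, step):
        kmer = region[i:i + k]
        if kmer in counts:  # ignore k-mers with ambiguous bases such as N
            counts[kmer] += 1
    return counts

def window_features(window, flank=100):
    """Return a dict of upstream and downstream counts for one ATG window."""
    regions = {"up": window[:flank], "down": window[flank + 3:]}
    # Assumption: codons are read in frame with the ATG, so the upstream
    # region starts reading at flank % 3 and the downstream region at 0.
    codon_offset = {"up": flank % 3, "down": 0}
    features = {}
    for side, region in regions.items():
        for k in range(1, 6):  # monomers up to pentamers, overlapping
            for kmer, n in kmer_counts(region, k).items():
                features[f"{side}_{k}mer_{kmer}"] = n
        codons = kmer_counts(region, 3, step=3, offset=codon_offset[side])
        for codon, n in codons.items():
            features[f"{side}_codon_{codon}"] = n
    return features
```

Under these assumptions trimers and codons differ only in how the region is read: trimers use a sliding window of step 1, codons a step of 3.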

By default, the class type of every sequence is set to positive. When generating a training file, take care to keep only the ATG at position 500 of each full-length sequence as positive and to set all other ATGs to negative. The positions can be obtained from the map file produced by getvalidsequences.pl. Because array indices start at 0, the position is 500; in the original full-length sequence this is position 501 (1-based).
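
The relabeling rule can be written down as a small Python sketch. The class names "positive" and "negative" and both function names are assumptions for illustration; reading the ATG positions from the map file is not shown because its format is not described here.

```python
TIS_INDEX = 500  # 0-based index of the true start codon in each full-length sequence

def class_label(atg_index):
    """Return the class of an ATG window given its 0-based index in the source sequence."""
    return "positive" if atg_index == TIS_INDEX else "negative"

def to_one_based(atg_index):
    """Convert a 0-based array index to a 1-based sequence position."""
    return atg_index + 1
```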

Usage: perl genfeatures.pl inputfile outputfile
inputfile: ATG sequences file
outputfile: arff file

Score calculation

source code: mySMO.java


compilation: javac mySMO.java
Usage: java mySMO > outputfile
Options are set inside the Java code:
complexity constant = 1.0
cachesize = 250007
epsilon = 1.0E-12
tolerance parameter = 0.001
t = training file
T = Test file
outputfile contains the classification result and the score for each 203-nucleotide window sequence, i.e., the ATG score.