make
su
make install
*****************************************************************************
* BioCRF - Grammatical-Restrained HCRF tool for Bioinformatics applications *
* Copyright (C) 2009 Castrense Savojardo                                    *
* Bologna Biocomputing Group                                                *
* University of Bologna, Italy                                              *
* savojard@biocomp.unibo.it                                                 *
*****************************************************************************

Usage: biocrf <ACTION> <OPTIONS> <FILE>

    ACTION    The type of process you want to perform.
    OPTIONS   The arguments for the ACTION (action-specific).
    FILE      The file on which the ACTION will be executed (action-specific).

ACTION can be one of the following:

    -all       run both training and testing.
    -train     run training.
    -test      run testing.
    -predict   run prediction on unlabelled data.

All ACTIONs share the following OPTIONS:

    -m model_file   The model file. For -all and -train ACTIONs this file is
                    WRITTEN and at the end of the procedure model_file
                    contains the trained model. For -test and -predict
                    ACTIONs a trained model_file is needed to compute the
                    test or to predict.

The following OPTIONS are valid only for -all and -train ACTIONs:

    -g topology_file   Specify the model topology file.
    -j value           Specify the number of iterations of the training
                       algorithm to be performed.
    -i value           Specify the initial value of the parameters.
                       Optional, default: 0.0.
    -w value           Specify the half size of the sliding window used to
                       generate state features. Optional, default: 0.
    -s value           Specify the sigma squared value used by the training
                       algorithm to govern overfitting.
                       Optional, default: 0.5.
    -e value           Specify the epsilon value for convergence.
                       Optional, default: 0.0001.

The following OPTIONS are valid only for -all, -test and -predict ACTIONs:

    -d decoding_algorithm   Decoding algorithm to be used. The string
                            decoding_algorithm can be viterbi or
                            posterior-viterbi.
    -o output_file          Specify the file where predictions will be
                            stored.
    -q posterior_out_file   Store the posterior matrix in a file. This
                            option is used only in case of
                            posterior-viterbi decoding.

The following OPTIONS are specific to the -all ACTION:

    -t testing_file      Specify the testing sequences file.
    -p unlabelled_file   Specify the unlabelled sequences file.

Depending on the ACTION performed, the FILE argument will be considered as
follows:

    -all, -train   FILE is the training sequences file.
    -test          FILE is the testing sequences file.
    -predict       FILE is the unlabelled sequences file.

Examples:

    biocrf -all \
        -g topology.model \
        -m model.txt \
        -s 0.005 \
        -j 60 \
        -w 7 \
        -d viterbi \
        -p unlabelled.set \
        -o out.predictions \
        train.set

    biocrf -train -g topology.model \
        -m model.txt \
        -s 0.005 \
        -j 60 \
        -e 0.0001 \
        -w 7 \
        train.set

    biocrf -test -m model.txt \
        -d posterior-viterbi \
        -o out.predictions \
        -q posteriors.matrix \
        test.set

    biocrf -predict -m model.txt \
        -d viterbi \
        -o out.predictions \
        unlabelled.set
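When training and prediction are driven from a script, it can help to assemble the command lines programmatically. The following is a minimal Python sketch, not part of BioCRF: the helper names (train_command, predict_command, run) are hypothetical, it assumes a biocrf executable is on the PATH, and it only uses the options and defaults listed in the usage text above.

```python
import subprocess

# Decoding algorithms named in the usage text.
DECODERS = ("viterbi", "posterior-viterbi")


def train_command(train_file, topology_file, model_file, iterations,
                  window=0, sigma2=0.5, epsilon=0.0001):
    """Build the argument list for `biocrf -train` (hypothetical helper).

    The defaults for -w, -s and -e mirror those stated in the usage text.
    """
    return ["biocrf", "-train",
            "-g", topology_file,
            "-m", model_file,
            "-j", str(iterations),
            "-w", str(window),
            "-s", str(sigma2),
            "-e", str(epsilon),
            train_file]


def predict_command(unlabelled_file, model_file, output_file,
                    decoding="viterbi", posterior_file=None):
    """Build the argument list for `biocrf -predict` (hypothetical helper)."""
    if decoding not in DECODERS:
        raise ValueError("decoding must be one of %s" % (DECODERS,))
    cmd = ["biocrf", "-predict", "-m", model_file,
           "-d", decoding, "-o", output_file]
    if posterior_file is not None:
        # The usage text says -q is used only with posterior-viterbi decoding.
        if decoding != "posterior-viterbi":
            raise ValueError("-q requires posterior-viterbi decoding")
        cmd += ["-q", posterior_file]
    return cmd + [unlabelled_file]


def run(cmd):
    """Run a command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


# Print the training command instead of running it.
print(" ".join(train_command("train.set", "topology.model", "model.txt",
                             60, window=7, sigma2=0.005)))
```

Passing the arguments as a list avoids shell quoting problems with file names; calling run() on the result executes the tool.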
BioCRF is designed to be easily applied to bioinformatics problems. For this reason, two types of input are allowed: single sequences and sequence profiles.
A single-sequence input is the sequence of residues of a given protein. A sequence profile is a matrix whose rows are sequence positions and whose columns are the 20 possible amino acids. Each element X[i][a] of a sequence profile represents the frequency of amino acid a at aligned position i.
A single sequence can be encoded as a sequence profile matrix whose elements X[i][a] are equal to 1.00 for the amino acid at position i in the original sequence and 0.00 otherwise.
Thus, the same format can be used for both single sequences and sequence profiles:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.56 0.33 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 i
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.45 0.00 0.00 0.00 0.00 0.17 0.24 0.00 0.10 i
0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.05 0.01 0.00 0.66 0.01 0.00 0.11 0.02 0.00 0.01 0.00 0.13 0.00 i
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.09 0.00 0.10 0.79 0.00 0.00 0.01 0.00 0.00 0.00 0.02 0.00 T
0.25 0.49 0.11 0.00 0.14 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.00 0.01 0.87 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.12 0.09 0.12 0.00 0.04 0.00 0.00 0.13 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.01 0.00 0.00 T
0.00 0.00 0.00 0.00 0.01 0.00 0.97 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.13 0.00 0.00 0.01 0.00 0.00 0.00 0.84 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 T
0.00 0.01 0.00 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.16 0.00 0.00 0.09 0.15 0.00 0.55 0.00 0.00 0.01 T
0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.01 0.00 0.41 0.48 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 T
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.01 0.00 0.08 0.01 0.00 0.17 0.07 0.11 0.01 0.00 0.28 0.22 T
0.43 0.08 0.02 0.04 0.01 0.00 0.07 0.01 0.21 0.03 0.00 0.08 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.03 0.00 0.36 0.07 0.01 0.00 0.00 0.02 0.11 0.25 0.05 0.05 0.05 o
0.03 0.00 0.01 0.01 0.01 0.00 0.00 0.62 0.01 0.05 0.03 0.02 0.00 0.00 0.00 0.00 0.00 0.07 0.01 0.10 o
0.12 0.03 0.01 0.00 0.09 0.03 0.00 0.05 0.01 0.01 0.37 0.00 0.00 0.05 0.00 0.04 0.04 0.01 0.03 0.12 o
0.02 0.03 0.00 0.04 0.01 0.00 0.00 0.14 0.18 0.00 0.04 0.00 0.00 0.00 0.00 0.12 0.01 0.00 0.01 0.06 o
The last column is the sequence annotation.
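The encoding of a single sequence described above can be sketched in Python. This is an illustration, not code from BioCRF: the function names are hypothetical, and the column order of the amino acids (alphabetical by one-letter code) is an assumption, since the order BioCRF expects is not stated here.

```python
# ASSUMPTION: column order of the 20 amino acids. The order expected by
# BioCRF is not documented in this text; alphabetical one-letter codes
# are used purely for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_single_sequence(sequence, labels):
    """Return profile-format lines for a single sequence.

    Each line holds 20 frequencies (1.00 for the residue at that position,
    0.00 elsewhere) followed by the annotation label of the position.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    lines = []
    for residue, label in zip(sequence, labels):
        # str.index raises ValueError for residues outside AMINO_ACIDS.
        column = AMINO_ACIDS.index(residue)
        row = ["1.00" if i == column else "0.00"
               for i in range(len(AMINO_ACIDS))]
        lines.append(" ".join(row + [label]))
    return lines


def parse_profile(lines):
    """Split profile-format lines into a frequency matrix and a label list."""
    matrix, labels = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        matrix.append([float(value) for value in fields[:20]])
        labels.append(fields[20])
    return matrix, labels


# Encode a three-residue sequence annotated with the labels i, i, T.
rows = encode_single_sequence("MKT", "iiT")
print("\n".join(rows))
matrix, labels = parse_profile(rows)
```

Every encoded row sums to 1.00, so a single sequence is a special case of a sequence profile and both can be read by the same parser.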
The following is a valid grammar file:
labels 2
states 4
label_mapping
B
F
end_label_mapping
state_mapping
B1 B
B2 B
F1 F
F2 F
end_state_mapping
transitions
BEGIN B1 F1
B1 B2 F2
B2 B1 F1 END
F1 B1 F1 END
F2 B2 F2
end_transitions
The first two lines specify the number of labels and the number of states, respectively. The label_mapping section lists all the distinct labels for the problem at hand. The state_mapping section contains the list of states (first column) and, for each state, its associated label (second column). In the transitions section, each line gives a state (first column) followed by the states it may transition to (second through last columns). Two special states, BEGIN and END, are used to define the initial and final states: the BEGIN line lists the states a path may start in, and a state that lists END may terminate a path.
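To make the structure of the grammar file concrete, here is a minimal Python sketch that reads such a file and checks it for internal consistency. It is not BioCRF's own parser: the function names are hypothetical, and the checks (declared counts match the mappings, every state maps to a declared label, transitions only mention declared states or BEGIN and END) are assumptions about what a well-formed file should satisfy.

```python
SECTIONS = ("label_mapping", "state_mapping", "transitions")


def parse_grammar(text):
    """Parse a grammar file into declared counts, labels, states, transitions."""
    declared = {}
    grammar = {"labels": [], "states": {}, "transitions": {}}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        head = fields[0]
        if section is None and head in ("labels", "states") and len(fields) == 2:
            declared[head] = int(fields[1])      # e.g. "labels 2"
        elif section is None and head in SECTIONS:
            section = head                       # a section opens
        elif section is not None and head == "end_" + section:
            section = None                       # the section closes
        elif section == "label_mapping":
            grammar["labels"].append(head)
        elif section == "state_mapping":
            grammar["states"][head] = fields[1]  # state -> label
        elif section == "transitions":
            grammar["transitions"][head] = fields[1:]
        else:
            raise ValueError("unexpected line: %r" % line)
    return declared, grammar


def check_grammar(declared, grammar):
    """Return a list of consistency problems (empty if none were found)."""
    problems = []
    if declared.get("labels") != len(grammar["labels"]):
        problems.append("label count does not match label_mapping")
    if declared.get("states") != len(grammar["states"]):
        problems.append("state count does not match state_mapping")
    for state, label in grammar["states"].items():
        if label not in grammar["labels"]:
            problems.append("state %s uses undeclared label %s" % (state, label))
    known = set(grammar["states"]) | {"BEGIN", "END"}
    for source, targets in grammar["transitions"].items():
        for name in [source] + targets:
            if name not in known:
                problems.append("unknown state %s in transitions" % name)
    return problems


EXAMPLE = """labels 2
states 4
label_mapping
B
F
end_label_mapping
state_mapping
B1 B
B2 B
F1 F
F2 F
end_state_mapping
transitions
BEGIN B1 F1
B1 B2 F2
B2 B1 F1 END
F1 B1 F1 END
F2 B2 F2
end_transitions
"""

declared, grammar = parse_grammar(EXAMPLE)
print(check_grammar(declared, grammar))
```

Run on the grammar file shown above, the check reports no problems; removing a line from state_mapping, for instance, would make it report both a count mismatch and an unknown state in the transitions.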