BioCRF

Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) implementation for Bioinformatics applications.

Introduction

BioCRF is an implementation of Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs) [1], an extension of linear HCRFs [2,3] that incorporates prior knowledge about the problem at hand by means of a regular grammar. This is very useful in several Bioinformatics problems where only solutions that agree with the rules of a regular grammar are biologically meaningful. BioCRF is implemented in C++.

Download

Instructions

Installation

make
su
make install

Usage

*****************************************************************************
* BioCRF - Grammatical-Restrained HCRF tool for Bioinformatics applications *
* Copyright (C) 2009 Castrense Savojardo                                    *
*   Bologna Biocomputing Group                                              *
*   University of Bologna, Italy                                            *
*   savojard@biocomp.unibo.it                                               *
*****************************************************************************
Usage: biocrf <ACTION> <OPTIONS> <FILE>

	ACTION  The type of process you want to perform.
	OPTIONS The arguments for the ACTION (action-specific).
	FILE    The file on which the ACTION will be executed
		(action-specific).

ACTION can be one of the following:

-all      run both training and testing.
-train    run training.
-test     run testing.
-predict  run predicting unlabelled data.

All ACTIONs share the following OPTIONS:

-m model_file           The model file. For -all and -train ACTIONs this file
 			is WRITTEN and at the end of the procedure model_file
			contains the trained model. For -test and -predict 
			ACTIONs a trained model_file is needed to compute the
			test or to predict.

The following OPTIONS are valid only for -all and -train ACTIONs:

-g topology_file        Specify the model topology file.
-j value                Specify the number of iterations of the training 
			algorithm to be performed.
-i value                Specify the initial parameters value. 
			Optional, default: 0.0.
-w value                Specify the half size of the sliding window used to 
			generate state features. 
			Optional, default: 0.
-s value                Specify the sigma square value used by the training 
			algorithm to govern overfitting. 
			Optional, default: 0.5.
-e value                Specify the epsilon value for convergence. 
			Optional, default: 0.0001.

The following OPTIONS are valid only for -all, -test and -predict ACTIONs:

-d decoding_algorithm   Decoding algorithm to be used. The string 
			decoding_algorithm can be viterbi or 
			posterior-viterbi.
-o output_file          Specify the file where predictions will be stored.
-q posterior_out_file   Store the posterior matrix in a file. This option is 
			used only in case of posterior-viterbi decoding.

The following OPTIONS are specific to the -all ACTION:

-t testing_file         Specify the testing sequences file.
-p unlabelled_file      Specify the unlabelled sequences file.

Depending on the ACTION performed, the FILE argument will be considered as
follows:

-all, -train      FILE is the training sequences file.
-test             FILE is the testing sequences file.
-predict          FILE is the unlabelled sequences file.

Examples:

biocrf -all 			\
	-g topology.model 	\
	-m model.txt 		\
	-s 0.005 		\
	-j 60 			\
	-w 7 			\
	-d viterbi 		\
	-p unlabelled.set 	\
	-o out.predictions 	\
	train.set

biocrf -train 			\
	-g topology.model 	\
	-m model.txt 		\
	-s 0.005 		\
	-j 60 			\
	-e 0.0001 		\
	-w 7 			\
	train.set

biocrf -test 			\
	-m model.txt 		\
	-d posterior-viterbi 	\
	-o out.predictions 	\
	-q posteriors.matrix 	\
	test.set

biocrf -predict 		\
	-m model.txt 		\
	-d viterbi 		\
	-o out.predictions 	\
	unlabelled.set
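
The -w option above gives the half size of the sliding window. As a rough illustration only, the sketch below computes which positions a window of half size w would cover around position i, so that -w 7 corresponds to at most 15 positions. The function name windowBounds and the clipping of the window at the sequence ends are assumptions of this sketch; the usage text does not say how BioCRF treats positions near the ends.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper, not part of BioCRF.
// Returns the inclusive index range [first, last] covered by a window of
// half size w centred on position i in a sequence of length n.
// Clipping at the sequence ends is an assumption of this sketch.
std::pair<std::size_t, std::size_t> windowBounds(std::size_t i,
                                                 std::size_t w,
                                                 std::size_t n) {
    assert(n > 0 && i < n);  // the position must lie inside the sequence
    const std::size_t first = (i >= w) ? i - w : 0;
    const std::size_t last = std::min(i + w, n - 1);
    return std::make_pair(first, last);
}
```

With w equal to 0 (the default) the window reduces to the single position i.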

Input file formats

Sequence files

BioCRF is designed to be easily applied to Bioinformatics problems. For this reason, two different types of input are allowed: single sequences and sequence profiles.

A single sequence input is the sequence of residues of a given protein. A sequence profile is a matrix whose rows are sequence positions and whose columns are the 20 possible amino acids. Each element X[i][a] of a sequence profile represents the frequency of amino acid a at aligned position i. A single sequence can be encoded as a sequence profile matrix whose elements X[i][a] are equal to 1.00 only for the amino acid at position i in the original sequence, and 0.00 otherwise.
Thus, the same format can be used for both single sequences and sequence profiles:

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.56 0.33 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 i
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.45 0.00 0.00 0.00 0.00 0.17 0.24 0.00 0.10 i
0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.05 0.01 0.00 0.66 0.01 0.00 0.11 0.02 0.00 0.01 0.00 0.13 0.00 i
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.09 0.00 0.10 0.79 0.00 0.00 0.01 0.00 0.00 0.00 0.02 0.00 T
0.25 0.49 0.11 0.00 0.14 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.00 0.01 0.87 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.12 0.09 0.12 0.00 0.04 0.00 0.00 0.13 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.97 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.01 0.00 0.00 T
0.00 0.00 0.00 0.00 0.01 0.00 0.97 0.00 0.00 0.00 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 T
0.00 0.13 0.00 0.00 0.01 0.00 0.00 0.00 0.84 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.00 T
0.00 0.01 0.00 0.00 0.00 0.00 0.03 0.00 0.00 0.00 0.16 0.00 0.00 0.09 0.15 0.00 0.55 0.00 0.00 0.01 T
0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.01 0.00 0.41 0.48 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 T
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.01 0.00 0.08 0.01 0.00 0.17 0.07 0.11 0.01 0.00 0.28 0.22 T
0.43 0.08 0.02 0.04 0.01 0.00 0.07 0.01 0.21 0.03 0.00 0.08 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 T
0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.03 0.00 0.36 0.07 0.01 0.00 0.00 0.02 0.11 0.25 0.05 0.05 0.05 o
0.03 0.00 0.01 0.01 0.01 0.00 0.00 0.62 0.01 0.05 0.03 0.02 0.00 0.00 0.00 0.00 0.00 0.07 0.01 0.10 o
0.12 0.03 0.01 0.00 0.09 0.03 0.00 0.05 0.01 0.01 0.37 0.00 0.00 0.05 0.00 0.04 0.04 0.01 0.03 0.12 o
0.02 0.03 0.00 0.04 0.01 0.00 0.00 0.14 0.18 0.00 0.04 0.00 0.00 0.00 0.00 0.12 0.01 0.00 0.01 0.06 o

The last column is the sequence annotation, that is, the label of the corresponding position (i, T or o in this example).
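
As a minimal sketch of the single-sequence encoding described above, the following function writes one row per residue, with 1.00 in the column of that residue and 0.00 elsewhere, followed by the label of the position. The function name encodeSingleSequence and the column order ACDEFGHIKLMNPQRSTVWY are assumptions made for illustration: this document does not state which amino acid order BioCRF expects, so the order must be checked against the profiles actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>

// Assumed column order (alphabetical one-letter codes); not confirmed
// by the BioCRF documentation.
const std::string kAlphabet = "ACDEFGHIKLMNPQRSTVWY";

// Hypothetical helper, not part of BioCRF.
// Encodes a protein sequence and its per-residue labels as profile rows:
// 20 frequencies (a single 1.00, the rest 0.00) followed by the label.
std::string encodeSingleSequence(const std::string& residues,
                                 const std::string& labels) {
    assert(kAlphabet.size() == 20);  // one column per amino acid
    if (residues.size() != labels.size())
        throw std::invalid_argument("residues and labels differ in length");

    std::ostringstream out;
    out << std::fixed << std::setprecision(2);
    for (std::size_t i = 0; i < residues.size(); ++i) {
        const std::size_t col = kAlphabet.find(residues[i]);
        if (col == std::string::npos)
            throw std::invalid_argument("unknown residue in sequence");
        for (std::size_t a = 0; a < kAlphabet.size(); ++a)
            out << (a == col ? 1.0 : 0.0) << ' ';
        out << labels[i] << '\n';
    }
    return out.str();
}
```

For example, encodeSingleSequence("MKV", "iTo") would return three rows, one per residue, in the same layout as the profile shown above.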

Grammar file

The following is a valid grammar file:

labels 2
states 4
label_mapping
B
F
end_label_mapping
state_mapping
B1 B
B2 B
F1 F
F2 F
end_state_mapping
transitions
BEGIN B1 F1
B1 B2 F2
B2 B1 F1 END
F1 B1 F1 END
F2 B2 F2
end_transitions

The first two lines specify the number of labels and states, respectively. The label_mapping section lists all the different labels for the problem at hand. The state_mapping section contains the list of states (first column) and, for each state, the associated label (second column). In the transitions section, each line gives a state (first column) followed by the list of its allowed transitions (from the second to the last column). Two special states, BEGIN and END, are used to define initial and final states.
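
The format can be read with a few lines of code. The sketch below is not BioCRF's own parser: the names Grammar, parseGrammar and accepts are hypothetical, and error handling is kept to a minimum. It also shows one way to check whether a label sequence agrees with the grammar, assuming that a valid sequence must follow a path of allowed transitions that starts from BEGIN and ends in a state with a transition to END.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of a grammar file (not BioCRF's own types).
struct Grammar {
    std::vector<std::string> labels;
    std::map<std::string, std::string> stateLabel;         // state -> label
    std::map<std::string, std::vector<std::string>> next;  // state -> successors
};

Grammar parseGrammar(std::istream& in) {
    Grammar g;
    std::size_t nLabels = 0, nStates = 0;
    std::string line, section;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string first;
        if (!(fields >> first)) continue;  // skip blank lines
        if (!section.empty()) {
            if (first == "end_" + section) {
                section.clear();
            } else if (section == "label_mapping") {
                g.labels.push_back(first);
            } else if (section == "state_mapping") {
                std::string label;
                fields >> label;
                g.stateLabel[first] = label;
            } else {  // transitions: state followed by its successors
                std::vector<std::string>& succ = g.next[first];
                std::string s;
                while (fields >> s) succ.push_back(s);
            }
        } else if (first == "labels") {
            fields >> nLabels;
        } else if (first == "states") {
            fields >> nStates;
        } else if (first == "label_mapping" || first == "state_mapping" ||
                   first == "transitions") {
            section = first;
        } else {
            throw std::runtime_error("unexpected line: " + line);
        }
    }
    if (g.labels.size() != nLabels || g.stateLabel.size() != nStates)
        throw std::runtime_error("declared counts do not match the mappings");
    return g;
}

// True if some state path BEGIN -> ... -> END emits exactly labelSeq.
bool accepts(const Grammar& g, const std::vector<std::string>& labelSeq) {
    std::set<std::string> current;
    current.insert("BEGIN");
    for (const std::string& label : labelSeq) {
        std::set<std::string> reached;
        for (const std::string& s : current) {
            auto it = g.next.find(s);
            if (it == g.next.end()) continue;
            for (const std::string& t : it->second) {
                auto m = g.stateLabel.find(t);
                if (m != g.stateLabel.end() && m->second == label)
                    reached.insert(t);
            }
        }
        if (reached.empty()) return false;  // no allowed state for this label
        current.swap(reached);
    }
    assert(!current.empty());  // holds because of the early return above
    for (const std::string& s : current) {
        auto it = g.next.find(s);
        if (it != g.next.end() &&
            std::find(it->second.begin(), it->second.end(), "END") !=
                it->second.end())
            return true;
    }
    return false;
}
```

Under these assumptions, with the grammar file shown above the label sequence B B is accepted (BEGIN, B1, B2, END), while a single B is rejected because B1 has no transition to END.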

References

[1] Fariselli P., Savojardo C., Martelli P.L. and Casadio R., Grammatical-Restrained Hidden Conditional Random Fields for Bioinformatics applications. Algorithms for Molecular Biology, 2009, 4:13.
[2] Wang S. et al., Hidden Conditional Random Fields for Gesture Recognition. In CVPR 2006:II: 1521-1527.
[3] Lafferty et al., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML01 2001:282-289.