Parsing IgBLAST output¶
We have hosted a small example data set resulting from the Roche 454 example workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results. The files can be downloded from here:
Building the IgBLAST database¶
To ensure that the receptor sequences can be manipulated to follow
the IMGT numbering scheme, enabling identification of the junction region as
well as framework and complementarity-determining regions, it is essential to
build the standalone IgBLAST
reference database from an originally IMGT numbered reference.
The first step is to download the requisite reference database from any source
with IMGT numbered germlines in FASTA format
(IMGT reference directory).
These FASTA files must then be converted into IgBLAST databases. The
following example demonstrates how to make the IgBLAST database for the
included reference fasta files (Human B cell receptor heavy chains)
that should be placed in the IgBLAST directory
1 2 3 4 5 6 7 8 9
# V-segment database edit_imgt_file.pl IMGT_Human_IGHV.fasta > database/human_igh_v makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_v # D-segment database edit_imgt_file.pl IMGT_Human_IGHD.fasta > database/human_igh_d makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_d # J-segment database edit_imgt_file.pl IMGT_Human_IGHJ.fasta > database/human_igh_j makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_j
Once these databases are built for each segment of the receptor, they can be referenced when running IgBLAST.
To run standalone IgBLAST
with the example code below, place the example input FASTA
file into the IgBLAST directory (eg,
The example data is human B cell receptor heavy chains, but the parameters
can be changed according to analyze other types of receptors.
To ensure the requisite information is generated you must specify the special
output format shown below using the
-outfmt '7 std qseq sseq btop' argument:
1 2 3 4 5 6 7 8 9
igblastn \ -germline_db_V database/human_igh_v \ -germline_db_D database/human_igh_d \ -germline_db_J database/human_igh_j \ -auxiliary_data optional_file/human_gl.aux \ -domain_system imgt -ig_seqtype Ig -organism human \ -outfmt '7 std qseq sseq btop' \ -query S43_atleast-2.fasta \ -out S43_atleast-2.fmt7
For this example, we have suggested moving the FASTA file into the IgBLAST
directory for the sake of simplicity. See the
regarding use of the
IGDATA environment variable to control where IgBLAST
looks for its internal database files. This will allow you run IgBLAST from any location.
For a more complex example see this script.
Processing the output of IgBLAST¶
output is parsed by the igblast subcommand of
MakeDb to generate the standardized tab-delimited database file on which all
subsequent Change-O modules operate. In addition to the IgBLAST output
-i S43_atleast-2.fmt7), both the FASTA files input to
-s S43_atleast-2.fasta) and the IMGT-gapped reference
-r Human_IGH[VDJ].fasta) must be provided to MakeDb:
MakeDb.py igblast -i S43_atleast-2.fmt7 -s S43_atleast-2.fasta -r Human_IGH[VDJ].fasta \ --regions --scores
The references sequences you provide to MakeDb must contain IMGT-gapped V-segment references, and these reference must be the same sequences used to build the IgBLAST reference database. If your IgBLAST germlines are not IMGT-gapped and/or they are not identical to those provided to MakeDb, sequences assigned missing germlines will fail the parsing operation and the CDR3/junction sequences will not be correct.