Parsing IgBLAST output

Example Data

We have hosted a small example data set resulting from the Roche 454 example workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results. The files can be downloded from here:

Change-O Example Files

Building the IgBLAST database

To ensure that the receptor sequences can be manipulated to follow the IMGT numbering scheme, enabling identification of the junction region as well as framework and complementarity-determining regions, it is essential to build the standalone IgBLAST reference database from an originally IMGT numbered reference. The first step is to download the requisite reference database from any source with IMGT numbered germlines in FASTA format (IMGT reference directory). These FASTA files must then be converted into IgBLAST databases. The following example demonstrates how to make the IgBLAST database for the included reference fasta files (Human B cell receptor heavy chains) that should be placed in the IgBLAST directory (eg, /home/user/apps/igblast-1.4.0/bin).

1
2
3
4
5
6
7
8
9
# V-segment database
edit_imgt_file.pl IMGT_Human_IGHV.fasta > database/human_igh_v
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_v
# D-segment database
edit_imgt_file.pl IMGT_Human_IGHD.fasta > database/human_igh_d
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_d
# J-segment database
edit_imgt_file.pl IMGT_Human_IGHJ.fasta > database/human_igh_j
makeblastdb -parse_seqids -dbtype nucl -in database/human_igh_j

Once these databases are built for each segment of the receptor, they can be referenced when running IgBLAST.

Running IgBLAST

To run standalone IgBLAST with the example code below, place the example input FASTA file into the IgBLAST directory (eg, /home/user/apps/igblast-1.4.0/bin). The example data is human B cell receptor heavy chains, but the parameters can be changed according to analyze other types of receptors. To ensure the requisite information is generated you must specify the special output format shown below using the -outfmt '7 std qseq sseq btop' argument:

1
2
3
4
5
6
7
8
9
igblastn \
    -germline_db_V database/human_igh_v \
    -germline_db_D database/human_igh_d \
    -germline_db_J database/human_igh_j \
    -auxiliary_data optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype Ig -organism human \
    -outfmt '7 std qseq sseq btop' \
    -query S43_atleast-2.fasta \
    -out S43_atleast-2.fmt7

See also

For this example, we have suggested moving the FASTA file into the IgBLAST directory for the sake of simplicity. See the IgBLAST documentation regarding use of the IGDATA environment variable to control where IgBLAST looks for its internal database files. This will allow you run IgBLAST from any location. For a more complex example see this script.

Processing the output of IgBLAST

Standalone IgBLAST output is parsed by the igblast subcommand of MakeDb to generate the standardized tab-delimited database file on which all subsequent Change-O modules operate. In addition to the IgBLAST output (-i S43_atleast-2.fmt7), both the FASTA files input to IgBLAST (-s S43_atleast-2.fasta) and the IMGT-gapped reference sequences (-r Human_IGH[VDJ].fasta) must be provided to MakeDb:

MakeDb.py igblast -i S43_atleast-2.fmt7 -s S43_atleast-2.fasta -r Human_IGH[VDJ].fasta \
    --regions --scores

The optional (--regions) and (--scores) arguments add extra columns to the output database containing IMGT-gapped CDR/FWR regions and alignment metrics, respectively.

Warning

The references sequences you provide to MakeDb must contain IMGT-gapped V-segment references, and these reference must be the same sequences used to build the IgBLAST reference database. If your IgBLAST germlines are not IMGT-gapped and/or they are not identical to those provided to MakeDb, sequences assigned missing germlines will fail the parsing operation and the CDR3/junction sequences will not be correct.