Using IgBLAST

Example data

We have hosted a small example data set resulting from the Roche 454 example workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results. The files can be downloaded from here:

Change-O Example Files

Configuring IgBLAST

A collection of scripts for setting up the standalone IgBLAST database from the IMGT reference sequences is available on the Immcantation repository. To use these scripts, copy all the tools in the /scripts folder to a location in your PATH. At a minimum, you’ll need the following scripts:

  1. fetch_igblastdb.sh
  2. fetch_imgtdb.sh
  3. clean_imgtdb.py
  4. imgt2igblast.sh

Download and configure the IgBLAST and IMGT reference databases as follows:

# Download reference databases
fetch_igblastdb.sh -o ~/share/igblast
fetch_imgtdb.sh -o ~/share/germlines/imgt
# Build IgBLAST database from IMGT reference sequences
imgt2igblast.sh -i ~/share/germlines/imgt -o ~/share/igblast

Note

Several Immcantation tools require the observed V(D)J sequence (SEQUENCE_IMGT) and associated germline fields (GERMLINE_IMGT or GERMLINE_IMGT_D_MASK) to have gaps inserted to conform to the IMGT numbering scheme. Thus, when a tool such as MakeDb or CreateGermlines requires a reference sequence set as input, it requires the IMGT-gapped reference set: that is, the reference sequences downloaded using the fetch_imgtdb.sh script or downloaded manually from the IMGT reference directory, rather than the final ungapped reference set required by IgBLAST.

See also

The provided scripts download only the mouse and human IMGT reference databases. See the IgBLAST documentation for instructions on how to build the database in a more general case. Shown below is an example of how to perform the same steps as the Immcantation scripts using a separately downloaded IMGT reference set and the scripts provided by IgBLAST. You must have all of the associated commands in your PATH and the appropriate directories created:

# V segment database
edit_imgt_file.pl IMGT_Human_IGHV.fasta > ~/share/igblast/fasta/imgt_human_ig_v.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_v.fasta \
    -out ~/share/igblast/database/imgt_human_ig_v
# D segment database
edit_imgt_file.pl IMGT_Human_IGHD.fasta > ~/share/igblast/fasta/imgt_human_ig_d.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_d.fasta \
    -out ~/share/igblast/database/imgt_human_ig_d
# J segment database
edit_imgt_file.pl IMGT_Human_IGHJ.fasta > ~/share/igblast/fasta/imgt_human_ig_j.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_j.fasta \
    -out ~/share/igblast/database/imgt_human_ig_j

Once these databases are built for each segment they can be referenced when running IgBLAST.
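The three per-segment steps above follow one pattern, so they can be generated in a loop. The following is a minimal sketch, not part of the Immcantation or IgBLAST tooling: the function name igblast_build_commands is hypothetical, and it assumes the input files are named IMGT_Human_IGHV.fasta, IMGT_Human_IGHD.fasta and IMGT_Human_IGHJ.fasta as in the example. It only prints the commands so they can be reviewed before anything is run:

```shell
# Hypothetical helper: print (rather than run) the edit_imgt_file.pl and
# makeblastdb commands for the V, D and J segments.
# $1 is the IgBLAST base directory, e.g. ~/share/igblast
igblast_build_commands() {
    local base="$1" seg upper
    for seg in v d j; do
        # Input file names use the upper-case segment letter (assumed naming)
        upper=$(printf '%s' "$seg" | tr '[:lower:]' '[:upper:]')
        echo "edit_imgt_file.pl IMGT_Human_IGH${upper}.fasta > ${base}/fasta/imgt_human_ig_${seg}.fasta"
        echo "makeblastdb -parse_seqids -dbtype nucl -in ${base}/fasta/imgt_human_ig_${seg}.fasta -out ${base}/database/imgt_human_ig_${seg}"
    done
}
```

Calling igblast_build_commands ~/share/igblast lists six commands, two per segment. Once they look right, the output can be piped to sh to execute them.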

Running IgBLAST

Change-O provides a simple wrapper script to run IgBLAST with the required options as the igblast subcommand of AssignGenes. This wrapper can be run as follows with the database built by the Immcantation scripts:

AssignGenes.py igblast -s S43_atleast-2.fasta -b ~/share/igblast \
    --organism human --loci ig --format blast

The optional --format blast argument defines the output format of IgBLAST. The default, blast, is the blocked tabular output provided by specifying the -outfmt '7 std qseq sseq btop' argument to IgBLAST. Specifying --format airr will output a tab-delimited file compliant with the AIRR Rearrangement schema defined by the AIRR Community. AIRR format support requires IgBLAST v1.9.0 or higher.
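Because AIRR output depends on the IgBLAST version, a script may want to choose the format conditionally. The following is a minimal sketch of the version comparison only. The function name version_at_least is hypothetical, it assumes three-part dotted versions such as 1.9.0, and obtaining the installed version string is left to the reader:

```shell
# Hypothetical helper: succeed when dotted version $1 is >= version $2.
# Assumes both versions have the form MAJOR.MINOR.PATCH.
version_at_least() {
    # Sort the two versions numerically field by field; if the smaller one
    # is the required version ($2), then $1 is at least as new.
    [ "$(printf '%s\n%s\n' "$1" "$2" \
        | sort -t. -k1,1n -k2,2n -k3,3n \
        | head -n 1)" = "$2" ]
}

# Hypothetical helper: pick a --format value for AssignGenes.py from a version.
choose_format() {
    if version_at_least "$1" 1.9.0; then
        echo airr
    else
        echo blast
    fi
}
```

For example, choose_format 1.8.0 prints blast, while choose_format 1.10.0 prints airr.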

The -b ~/share/igblast argument specifies the path containing the database, internal_data, and optional_file directories required by IgBLAST. This option sets the IGDATA environment variable that controls where IgBLAST looks for internal database files. See the IgBLAST documentation for more details regarding the IGDATA environment variable.
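Before running the wrapper, it can help to confirm that the directory passed to -b has the expected layout. The following is a minimal sketch using only the three directory names mentioned above; the function name check_igdata is hypothetical and is not part of Change-O or IgBLAST:

```shell
# Hypothetical helper: report which of the directories IgBLAST expects
# are missing under the given base path. Returns non-zero if any is absent.
check_igdata() {
    local root="$1" sub missing=0
    for sub in database internal_data optional_file; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical use would be check_igdata ~/share/igblast && AssignGenes.py igblast ..., so that the wrapper only runs when all three directories are present.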

See also

The AssignGenes IgBLAST wrapper provides limited functionality. For more control, IgBLAST should be run directly. The only strict requirement for compatibility with Change-O is that the output must either be an AIRR tab-delimited file (-outfmt 19) or a blast-style tabular output with the optional query sequence, subject sequence and BTOP fields (-outfmt '7 std qseq sseq btop'). An example of how to run IgBLAST directly is shown below:

export IGDATA=~/share/igblast
igblastn \
    -germline_db_V ~/share/igblast/database/imgt_human_ig_v \
    -germline_db_D ~/share/igblast/database/imgt_human_ig_d \
    -germline_db_J ~/share/igblast/database/imgt_human_ig_j \
    -auxiliary_data ~/share/igblast/optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype Ig -organism human \
    -outfmt '7 std qseq sseq btop' \
    -query S43_atleast-2.fasta \
    -out S43_atleast-2.fmt7

Processing the output of IgBLAST

Standalone IgBLAST blast-style tabular output is parsed by the igblast subcommand of MakeDb to generate the standardized tab-delimited database file on which all subsequent Change-O modules operate. In addition to the IgBLAST output (-i S43_atleast-2.fmt7), both the FASTA files input to IgBLAST (-s S43_atleast-2.fasta) and the IMGT-gapped reference sequences (-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta) must be provided to MakeDb:

MakeDb.py igblast -i S43_atleast-2.fmt7 -s S43_atleast-2.fasta \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta \
    --regions --scores

The optional --regions and --scores arguments add extra columns to the output database containing IMGT-gapped CDR/FWR regions and alignment metrics, respectively.

Warning

The reference sequences you provide to MakeDb must contain IMGT-gapped V segment references, and these references must be the same sequences used to build the IgBLAST reference database. If your IgBLAST germlines are not IMGT-gapped and/or they are not identical to those provided to MakeDb, then sequences which were assigned missing germlines will fail the parsing operation and the junction (CDR3) sequences will not be correct.
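A quick heuristic for spotting an ungapped V reference file is to look for the "." characters that IMGT uses to represent gaps. The following is a minimal sketch; the function name count_ungapped is hypothetical, and the check is only an approximation: it flags records with no "." at all and does not verify that gaps are placed correctly or that the sequences match the IgBLAST database:

```shell
# Hypothetical helper: count FASTA records whose sequence contains no '.'
# gap character. A non-zero count for a V segment file suggests that some
# records are not IMGT-gapped.
count_ungapped() {
    awk '
        # On each header, tally the previous record if it had no gaps
        /^>/ { if (seen && !gapped) n++; seen = 1; gapped = 0; next }
        # Any sequence line containing a "." marks the record as gapped
        /\./ { gapped = 1 }
        # Tally the final record, then print the total (0 if none)
        END  { if (seen && !gapped) n++; print n + 0 }
    ' "$1"
}
```

For example, count_ungapped IMGT_Human_IGHV.fasta should print 0 for a fully gapped V reference set. D and J references are not expected to contain gaps, so the check applies to V segment files only.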