Generating MiAIRR compliant GenBank/TLS submissions
The MiAIRR standard (minimal information about adaptive immune receptor repertoires) is a minimal reporting standard for experiments using sequencing-based technologies to study adaptive immune receptors (T and B cell receptors). The current version (1.0) of the standard was published in Rubelt et al, 2017 and accepted by the general assembly at the annual AIRR Community meeting in December 2017.
This example will cover generation of files for submission to TLS starting from Change-O formatted data. For complete details of the required and optional elements of the TLS submission see the AIRR Standards documentation site.
Special attention should be paid to the
Note that GenBank expects there to be a CDS element that corresponds to the JUNCTION. If submitting single-cell heavy:light paired BCR data, GenBank expects separate files for the heavy, kappa, and lambda chains. Even though the kappa and lambda chain sequences should be in separate files, their misc_feature comments should both read "immunoglobulin light chain variable region", per AIRR standard requirements. In addition, every effort should be made to ensure that the attribute values in the GenBank submission match those of the BioSample attributes. In particular, if the BioSample specifies a strain value (e.g. for mouse data), then a strain attribute MUST be included when preparing the GenBank submission, and that value MUST match the BioSample value.
The following examples use the HD13M_db-pass.tsv database file and HD13M_template.sbt file provided in the example bundle, which has already undergone the IgBLAST annotation, parsing, and filtering
Generating files for submission
tbl2asn installed and in your PATH.
A GenBank submission template file (.sbt), generated using the NCBI Template Generator.
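The first requirement can be checked from the shell before going any further. The following is a minimal sketch and not part of Change-O: `require_tool` is a hypothetical helper name, and it relies only on the standard `command -v` lookup.

```shell
# Report whether a required executable can be found on PATH.
# require_tool is a hypothetical helper, not a Change-O command.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Warn early instead of failing later during ASN generation.
require_tool tbl2asn || echo "Install tbl2asn or add it to PATH before continuing."
```

The same helper can be reused for ConvertDb.py and ParseDb.py if those scripts are also expected on the PATH.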
C region annotations must use official gene symbols (IGHM, IGHG, etc.) so that they are properly recognized by remote databases. If your annotations are not of this form, then they must be updated prior to generating the GenBank/TLS submission files. The following example shows how to use the update subcommand of ParseDb.py to rename the values in the c_call column.
The files provided for this example already have correctly annotated c_call information, so the following is a hypothetical example (db.tsv) with existing annotations of the form IgM, IgG, etc.:
ParseDb.py update -d db.tsv -f c_call \
-u IgA IgD IgE IgG IgM \
-t IGHA IGHD IGHE IGHG IGHM
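Before renaming, it can help to see which labels are actually present in the file, so that none are left out of the update lists. The following is a minimal sketch and not part of Change-O: it builds a tiny stand-in db.tsv (the three records are invented for illustration) and tallies the distinct values of its c_call column with awk. The helper name `count_c_calls` is hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Invented three-record database standing in for db.tsv.
printf 'sequence_id\tc_call\nseq1\tIgM\nseq2\tIgG\nseq3\tIgM\n' > "$workdir/db.tsv"

# Find the c_call column from the header row, then count each value.
count_c_calls() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "c_call") col = i; next }
        col     { n[$col]++ }
        END     { for (v in n) print v "\t" n[v] }
    ' "$1" | sort
}

count_c_calls "$workdir/db.tsv"
```

Any label in this tally that is not already an official gene symbol would need an entry in the -u and -t lists.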
Creating ASN files
ASN submission files are generated using the genbank subcommand of ConvertDb.py as follows:
ConvertDb.py genbank -d HD13M_db-pass.tsv \
--product "immunoglobulin heavy chain" \
--db "IMGT/GENE-DB" \
--inf "IgBLAST:1.14.0" \
--organism "Homo sapiens" \
--tissue "Peripheral blood" \
--cell-type "B cell" \
--isolate HD13M \
--cf c_call \
--nf duplicate_count \
--sbt HD13M_template.sbt \
--asn \
--outdir HD13M_TLS
The resulting output in the HD13M_TLS folder will include a number of files. The Sequin file HD13M_db-pass_genbank.sqn is the file that will be used for submission, and the GenBank record file HD13M_db-pass_genbank.gbf is similar to what the submission will look like once it has been accepted by GenBank.
The command above manually specifies several required and optional annotations.
Alternatively, sample information (
cell_type) can be specified in a separate yaml file and provided via the
-y argument. Additional harmonized
which are not covered by the existing command-line arguments, may be provided
in the yaml file. Note, the yaml file adds only sample features, so it cannot be used
to specify source features (
--db arguments), parsing
arguments, or run parameters (--label,
--exec, etc). Features specified in the yaml
file will override equivalent features specified through the corresponding commandline arguments.
The example shown above automatically runs tbl2asn, because the
--asn argument was specified. ConvertDb.py
can be run without running tbl2asn, which will generate only the
feature table (
S43_update_genbank.tbl) and fasta (
required to run tbl2asn manually via the command:
tbl2asn -p . -a s -V vb -t S43_template.sbt
When running tbl2asn using the --asn argument to ConvertDb.py, there is no internal validation that the records passing the filters in ConvertDb.py also pass the filters in tbl2asn.
As such, it is recommended that the number of sequences in the output
file be verified against the number of sequences in the
output files. From the command line, this can be achieved via:
grep -c iupacna *.sqn
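The count check can be wrapped in a small comparison so that a mismatch is reported explicitly. The following is a minimal sketch under stated assumptions: it compares against the FASTA file by counting header lines, it assumes each record contributes exactly one line containing iupacna to the .sqn file (as the grep above implies), and the stand-in files and the helper name `compare_counts` are invented for illustration.

```shell
workdir=$(mktemp -d)

# Stand-in files; real .fsa and .sqn files come from ConvertDb.py and tbl2asn.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/example.fsa"
printf 'seq-data iupacna "ACGT"\nseq-data iupacna "GGCC"\n' > "$workdir/example.sqn"

# Compare the number of FASTA headers with the number of iupacna lines.
compare_counts() {
    fasta_n=$(grep -c '^>' "$1")
    sqn_n=$(grep -c iupacna "$2")
    if [ "$fasta_n" -eq "$sqn_n" ]; then
        echo "OK: $fasta_n records in both files"
    else
        echo "MISMATCH: $fasta_n FASTA records vs $sqn_n in .sqn" >&2
        return 1
    fi
}

compare_counts "$workdir/example.fsa" "$workdir/example.sqn"
```

A non-zero exit status signals that some records were dropped between the two steps and that the tbl2asn output should be inspected.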
There is a known issue with the
In some environments, for reasons that are presently unknown, tbl2asn may fail to recognize the input fasta file and report an error stating "Unable to read any FASTA records". Running tbl2asn manually should resolve the issue.
Submitting to GenBank/TLS using SequinMacroSend
After generating the .sqn files, you can submit them as MiAIRR compliant GenBank/TLS records using GenBank's SequinMacroSend system. When submitting, simply add the keyword AIRR to the subject line in the submission system and it will be routed accordingly.
Currently, the SequinMacroSend system cannot accept files over 512MB in size.
For submissions over the size limit, you must split them into smaller files
and note in the submission comments that they are a part of a split submission.
The .sqn files used for submission are usually about 30 times the size of the original tab-delimited Change-O file. See the split subcommand of ParseDb.py for one approach to logically dividing large submissions.
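The size rule of thumb can be turned into a rough estimate of how many pieces a submission will need. The following is a minimal sketch, not part of Change-O: `estimate_parts` is a hypothetical helper, the factor of 30 is only the approximate ratio quoted above, and the default limit treats 512MB as 512 * 1024 * 1024 bytes, which is an assumption.

```shell
# Estimate the number of submission files needed for a given input size.
# $1: size of the tab-delimited file in bytes
# $2: size limit in bytes (default assumes 512MB = 512 * 1024 * 1024)
estimate_parts() {
    tsv_bytes=$1
    limit_bytes=${2:-536870912}
    # Approximate .sqn size using the ~30x rule of thumb.
    est_sqn=$((tsv_bytes * 30))
    # Ceiling division: round up to whole files.
    echo $(( (est_sqn + limit_bytes - 1) / limit_bytes ))
}

estimate_parts 50000000   # prints 3
```

For a real file, the byte count can be taken from `wc -c`. Because the ratio is approximate, inputs that land close to a boundary are safer split into one more piece than the estimate suggests.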