Filtering records

The ParseDb.py tool provides a basic set of operations for manipulating Change-O database files from the command line, including removing or updating rows and columns.

Removing non-productive sequences

After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:

ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive

The first line above uses the select subcommand to output a single file labeled parse-select containing only records with the value of T (-u T) in the productive column (-f productive).

Alternatively, the second line above uses the split subcommand to output multiple files with each file containing records with one of the values found in the productive column (-f productive). This will generate two files labeled productive-T and productive-F.
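The difference between the two subcommands can be illustrated with a minimal Python sketch. It is not the ParseDb.py implementation: the helper names (read_rows, select_rows, split_rows) and the example records are hypothetical, and exact string matching is assumed for the select step.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for a Change-O TSV file; the records are made up.
EXAMPLE_TSV = (
    "sequence_id\tproductive\n"
    "seq1\tT\n"
    "seq2\tF\n"
    "seq3\tT\n"
)

def read_rows(text):
    """Parse tab-delimited text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def select_rows(rows, field, values):
    """Keep records whose `field` equals one of `values` (assumed exact match)."""
    wanted = set(values)
    return [row for row in rows if row[field] in wanted]

def split_rows(rows, field):
    """Group records by the value found in `field`, one group per distinct value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return dict(groups)

rows = read_rows(EXAMPLE_TSV)
productive_only = select_rows(rows, "productive", ["T"])  # one output set
by_value = split_rows(rows, "productive")                 # one set per value

print([row["sequence_id"] for row in productive_only])
print(sorted(by_value))
```

In this sketch, select yields a single set of records, while split yields one set per distinct value in the column, which mirrors the single-file versus multiple-file output described above.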

Removing disagreements between the C-region primers and the reference alignment

If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:

ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light

These commands require that all of the v_call, j_call and c_call fields (-f v_call j_call c_call and --logic all) contain the string IGH (first command) or one of IGK or IGL (second command). The --regex argument allows for partial matching and interpretation of regular expressions. The output of these two commands is two files, one containing only heavy chains (heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
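The effect of combining --regex with --logic all can be sketched in Python as follows. This is an illustration under stated assumptions, not the ParseDb.py code: the matches helper and the three example records are hypothetical, and partial matching is modelled with re.search.

```python
import re

def matches(record, fields, pattern, logic="all"):
    """Return True if `pattern` is found in all (or any) of the given fields."""
    regex = re.compile(pattern)
    hits = [regex.search(record.get(field, "")) is not None for field in fields]
    return all(hits) if logic == "all" else any(hits)

# Made-up records: a consistent heavy chain, a consistent light chain,
# and a read whose C-region call disagrees with its V and J calls.
records = [
    {"sequence_id": "h1", "v_call": "IGHV3-23*01", "j_call": "IGHJ4*02", "c_call": "IGHM"},
    {"sequence_id": "l1", "v_call": "IGKV1-39*01", "j_call": "IGKJ1*01", "c_call": "IGKC"},
    {"sequence_id": "x1", "v_call": "IGHV1-69*01", "j_call": "IGHJ6*02", "c_call": "IGKC"},
]
fields = ["v_call", "j_call", "c_call"]

heavy = [r["sequence_id"] for r in records if matches(r, fields, "IGH")]
light = [r["sequence_id"] for r in records if matches(r, fields, "IG[LK]")]

print(heavy)
print(light)
```

The discordant record x1 fails both filters, because neither pattern is found in all three fields. This is the sense in which the two commands remove disagreements between the alignment and the C-region primer assignment.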

Exporting records to FASTA files

You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.

Standard FASTA

The fasta subcommand allows you to export sequences and annotations to FASTA formatted files in the pRESTO annotation scheme:

ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count

Here, the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_alignment, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.
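To show roughly how the three arguments map onto a FASTA record, here is a minimal Python sketch. It assumes the pRESTO-style header layout of an identifier followed by |FIELD=value pairs. The to_fasta helper and the example record are hypothetical, and the exact field names and sequence formatting that ConvertDb.py writes are not reproduced here.

```python
def to_fasta(record, id_field, seq_field, meta_fields=()):
    """Build one FASTA entry with annotations appended to the header."""
    header = record[id_field]
    for field in meta_fields:
        # Assumed layout: annotations separated by "|" and written as FIELD=value.
        header += "|{}={}".format(field, record[field])
    return ">{}\n{}\n".format(header, record[seq_field])

# Made-up record using the column names from the command above.
record = {
    "sequence_id": "seq1",
    "sequence_alignment": "CAGGTGCAG",
    "v_call": "IGHV3-23*01",
    "duplicate_count": "2",
}

fasta = to_fasta(
    record,
    id_field="sequence_id",                     # corresponds to --if
    seq_field="sequence_alignment",             # corresponds to --sf
    meta_fields=["v_call", "duplicate_count"],  # corresponds to --mf
)
print(fasta, end="")
```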

BASELINe FASTA

The baseline subcommand generates a FASTA derivative format required by the BASELINe web tool. Generating these files is similar to building standard FASTA files, but requires a few more options. An example function call using an imaginary file db.tsv is provided below:

ConvertDb.py baseline -d db.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count \
    --cf clone_id --gf germline_alignment_d_mask

The additional arguments required by the baseline subcommand include the clonal grouping (--cf clone_id) and germline sequence (--gf germline_alignment_d_mask) columns added by the DefineClones and CreateGermlines tasks, respectively.

Note

The baseline subcommand requires the clone_id column to be sorted. DefineClones.py generates a sorted clone_id column by default. However, if you altered the order of the records at some point, you can re-sort the clonal assignments using the sort subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:

ParseDb.py sort -d db.tsv -f clone_id

This will sort records by the value in the clone_id column (-f clone_id).
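The practical effect of sorting is that all records belonging to the same clone end up next to each other. The following Python sketch illustrates this. The sort_by_field helper, its numeric parameter and the example records are hypothetical, and the sketch makes no claim about whether ParseDb.py compares values as text or as numbers by default.

```python
def sort_by_field(rows, field, numeric=False):
    """Return records ordered by `field`, compared as text or as numbers."""
    if numeric:
        return sorted(rows, key=lambda row: float(row[field]))
    return sorted(rows, key=lambda row: row[field])

# Made-up records in which the members of clone 10 are not adjacent.
rows = [
    {"sequence_id": "a", "clone_id": "10"},
    {"sequence_id": "b", "clone_id": "2"},
    {"sequence_id": "c", "clone_id": "10"},
]

as_text = [row["sequence_id"] for row in sort_by_field(rows, "clone_id")]
as_number = [row["sequence_id"] for row in sort_by_field(rows, "clone_id", numeric=True)]

print(as_text)
print(as_number)
```

Both orderings place records a and c side by side. Only the relative order of the clones differs between the text and numeric comparisons.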