De-multiplexing

Checkout-batch

Description

A script to perform de-multiplexing and UMI tag extraction for a set of FASTQ files that were previously split using Illumina sample indices.

Usage

General:

java -jar migec.jar CheckoutBatch [options] barcodes_file output_dir

The barcodes file specifies sample multiplexing and UMI (NNN.. region) extraction rules. It has the same structure as the barcodes file for “manual” Checkout (see the Checkout-manual section below), with two additional columns that specify the input FASTQ file names.

Sample ID | Master barcode sequence     | Slave barcode sequence | Read#1 FASTQ          | Read#2 FASTQ
S0        | acgtacgtAGGTTAcadkgag       |                        |                       |
S1        | acgtacgtGGTTAAcadkgag       | ctgkGTTCaat            | ILM1_R1_L001.fastq.gz | ILM1_R2_L001.fastq.gz
S1        | acgtacgtAAGGTTcadkgagNNNNNN |                        | ILM2_R1_L001.fastq.gz | ILM2_R2_L001.fastq.gz
S3        | acgtacgtTAAGGTcadkgagNNNNNN | NNNNNNctgkGTTCaat      | ILM1_R1_L001.fastq.gz | ILM1_R2_L001.fastq.gz

The following rules apply:

  • All specified FASTQ files are processed sequentially using Checkout.
  • If no FASTQ file is specified for a given barcode, that barcode is searched for in all FASTQ files.
  • CheckoutBatch aggregates reads from multiple FASTQ files that share the same sample ID.
  • However, the same barcode must not be specified more than once for a single FASTQ file.
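
The last rule can be checked before a batch is launched. The following is a minimal sketch, not part of MIGEC: the function name find_duplicate_barcodes is hypothetical, and it assumes a tab-delimited file without a header row in which a "barcode" is identified by its master and slave sequences taken together. Rows without a FASTQ file are compared only with each other, so the "searched in all FASTQ files" case is not covered.

```python
import csv
import io
from collections import Counter


def find_duplicate_barcodes(table_text: str) -> list[tuple[str, str, str]]:
    """Return (read 1 FASTQ, master, slave) triples that are listed more than once."""
    seen = Counter()
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        if not row:
            continue  # skip blank lines
        row += [""] * (5 - len(row))  # pad missing trailing columns
        _sample, master, slave, fastq1, _fastq2 = row[:5]
        seen[(fastq1, master, slave)] += 1
    return sorted(key for key, count in seen.items() if count > 1)


table = (
    "S1\tacgtGGTTAAcad\t\tA_R1.fastq.gz\tA_R2.fastq.gz\n"
    "S2\tacgtGGTTAAcad\t\tA_R1.fastq.gz\tA_R2.fastq.gz\n"
    "S3\tacgtGGTTAAcad\t\tB_R1.fastq.gz\tB_R2.fastq.gz\n"
)
for fastq, master, slave in find_duplicate_barcodes(table):
    print(f"{fastq}: barcode {master} {slave} is listed more than once")
```

An empty result means that no FASTQ file carries the same barcode twice under these assumptions.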

Parameters

Same as in the manual version of Checkout; see the Checkout-manual section below.

Output format

The Checkout routine produces files in the FASTQ format that have a specific UMI field added to the header. Each read successfully matched by Checkout will be output as follows:

@ILLUMINA_HEADER UMI:NNNN:QQQQ
ATAGATTATGAGTATG
+
##II#IIIIIIIIIII

The original read header (ILLUMINA_HEADER here) is preserved. The appended UMI:NNNN:QQQQ field contains the sequence of the UMI tag (NNNN bases) and its quality string (QQQQ).
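
Downstream scripts can recover the UMI from such a header with a few lines of code. The sketch below is illustrative only: parse_umi_header and UmiRead are hypothetical names, and it assumes the UMI field is the last space-separated token of the header, as in the example above.

```python
from typing import NamedTuple


class UmiRead(NamedTuple):
    header: str    # original Illumina header, without the leading "@"
    umi: str       # UMI bases
    umi_qual: str  # UMI quality string


def parse_umi_header(line: str) -> UmiRead:
    """Split '@HEADER UMI:NNNN:QQQQ' into header, UMI bases and UMI qualities."""
    if not line.startswith("@"):
        raise ValueError("not a FASTQ header line")
    head, sep, umi_field = line.rstrip("\n").rpartition(" UMI:")
    if not sep:
        raise ValueError("no UMI field in header")
    # Quality strings may contain ':' but bases never do,
    # so split on the first ':' only.
    umi, _, qual = umi_field.partition(":")
    if len(umi) != len(qual):
        raise ValueError("UMI sequence and quality differ in length")
    return UmiRead(head[1:], umi, qual)


read = parse_umi_header("@M01:1:000 1:N:0:1 UMI:ACGT:II:#")
print(read.umi, read.umi_qual)
```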

Checkout-manual

Description

A script to perform de-multiplexing and UMI tag extraction.

Usage

General:

java -jar migec.jar Checkout [options] barcodes_file R1.fastq[.gz] [. or R2.fastq[.gz]] output_dir

For paired-end data:

java -jar migec.jar Checkout -cute barcodes.txt R1.fastq.gz R2.fastq.gz ./checkout/

For unpaired library:

java -jar migec.jar Checkout -cute barcodes.txt R.fastq.gz . ./checkout/

For overlapping paired reads:

java -jar migec.jar Checkout -cute --overlap barcodes.txt R1.fastq.gz R2.fastq.gz ./checkout/

The accepted barcodes.txt format is a tab-delimited table with the following structure:

Sample ID | Master barcode sequence     | Slave barcode sequence
S0        | acgtacgtAGGTTAcadkgag       |
S1        | acgtacgtGGTTAAcadkgag       | ctgkGTTCaat
S2        | acgtacgtAAGGTTcadkgagNNNNNN |
S3        | acgtacgtTAAGGTcadkgagNNNNNN | NNNNNNctgkGTTCaat

A sequencing read is scanned for the master adapter. If it is found, the mate read is reverse-complemented to bring it onto the same strand as the master read and is then scanned for the slave adapter.

  • The slave adapter sequence can be omitted.
  • Adapter sequences can contain any IUPAC DNA letters.
  • Upper-case and lower-case letters mark the seed and fuzzy-search parts of the adapter, respectively.
  • N characters mark the UMI region to be extracted.
  • Multiple rows can correspond to the same sample.
  • To run batch pipeline operations, all samples should contain UMI regions of the same size.

For example, for sample S2 Checkout searches for an exact match to the AAGGTT seed, then matches the remaining adapter sequence with up to two mismatches allowed, and writes the NNNNNN region to the header. For sample S3, the slave read is additionally scanned for the GTTC seed, the rest of the barcode is fuzzy-matched, and the NNNNNN region is extracted and concatenated with the UMI region of the master read.
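
The barcode notation can be illustrated with a brute-force matcher. This is a sketch of the rules as described here, not the search algorithm MIGEC actually uses: match_barcode is a hypothetical name, the limit of two mismatches is taken from the S2 example, and every read offset is simply tried in turn.

```python
# IUPAC DNA codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_barcode(barcode: str, read: str, max_mismatches: int = 2):
    """Return (offset, umi) for the first acceptable match, or None."""
    read = read.upper()
    for offset in range(len(read) - len(barcode) + 1):
        window = read[offset:offset + len(barcode)]
        mismatches = 0
        umi = []
        for code, base in zip(barcode, window):
            if code == "N":
                umi.append(base)  # UMI position: any base, extracted
            elif base in IUPAC[code.upper()]:
                continue  # base is compatible with the IUPAC code
            elif code.isupper():
                break  # seed position: must match exactly
            else:
                mismatches += 1  # fuzzy (lower-case) position
                if mismatches > max_mismatches:
                    break
        else:
            return offset, "".join(umi)
    return None


s2 = "acgtacgtAAGGTTcadkgagNNNNNN"
read = "TT" + "ACGTACGT" + "AAGGTT" + "CAGTGAG" + "TTGACA" + "CCCC"
print(match_barcode(s2, read))
```

For the S3 case, the same function could be applied to the reverse-complemented mate with the slave barcode, and the two extracted UMI strings concatenated.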

Parameters

General:

-c compressed output (gzip compression).

-u perform UMI region extraction and output it to the header of de-multiplexed FASTQ files.

-t trim adapter sequence from output.

-e also remove trails of template-switching (poly-G) for the case when UMI-containing adapter is added using reverse-transcription (cDNA libraries).

--overlap tries to overlap reads (paired-end data only). Non-overlapping and overlapping reads are written to *_R1/_R2* and *_R12* FASTQ files, respectively. Within the overlap, the nucleotide with the higher quality is taken, which improves overall data quality.

--overlap-max-offset X controls the extent to which the overlapping region is searched. IMPORTANT: if the read-through extent is high (reads are embedded), this should be set to ~40.
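
The quality-based choice described for --overlap can be sketched as follows. This is an illustration under stated assumptions rather than MIGEC's implementation: merge_overlap is a hypothetical name, the second read is assumed to be already reverse-complemented onto the strand of the first, and the overlap length is assumed to be known, whereas the real tool has to search for it.

```python
def merge_overlap(seq1: str, qual1: str, seq2: str, qual2: str, overlap: int):
    """Merge two same-strand reads whose last/first `overlap` bases coincide."""
    if not 0 < overlap <= min(len(seq1), len(seq2)):
        raise ValueError("overlap must fit inside both reads")
    start = len(seq1) - overlap
    seq = list(seq1[:start])
    qual = list(qual1[:start])
    for i in range(overlap):
        base, q = seq1[start + i], qual1[start + i]
        # Phred+33 characters sort in the same order as their scores,
        # so the quality characters can be compared directly.
        if qual2[i] > q:
            base, q = seq2[i], qual2[i]
        seq.append(base)
        qual.append(q)
    seq.extend(seq2[overlap:])
    qual.extend(qual2[overlap:])
    return "".join(seq), "".join(qual)


# The low-quality 'A' (#) in read 1 is replaced by the high-quality 'T' of read 2.
merged = merge_overlap("ACGTAC", "IIII#I", "TCGG", "I#II", overlap=2)
print(merged)
```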

Barcode search:

-o speeds up the search by assuming that reads are oriented, i.e. the master adapter should be in R1.

-r applies a custom RC mask. By default, Checkout assumes Illumina reads with mates on different strands, so it reverse-complements the read with the slave adapter so that output reads are on the master strand.

--rc-barcodes also searches for both adapter sequences in reverse complement. Use it if you are unsure of your library structure.

--skip-undef does not store reads that lack an adapter sequence, to save drive space.

Note

When there is a huge number of unassigned/unused reads, the --skip-undef option greatly speeds up de-multiplexing. However, take care to investigate the reasons behind a low barcode extraction rate if that is the case.

Important

The --overlap option may not perform well for poor-quality reads, which is a typical situation for 300+300bp MiSeq sequencing. In this case, merging reads using external software after the Assemble stage is recommended.