nxCode - DNA Barcode Designer and Decoder

for Next-Gen Sequencing

Cold Spring Harbor Labs
Main Download barcode sets Download software Usage Algorithmic details FAQ

Algorithmic details

page contents
  1. Channel model
  2. nxDecoder
  3. nxCodeBuilder
  4. Input Parameters & Quality scores
Channel model
The sequencing process is prone to certain errors. A good barcode set should be relatively immune to such errors. To design such a set, it is necessary to take into account the different types of errors and their frequencies of occurrence. We  addressed this by modeling the sequencing process as a noisy communications channel, and the barcodes as messages that are sent over that channel. Our channel model is a table containing probabilities of different errors occurring at the various positions in the read. 

The types of errors considered in our model are:
  1. Substitutions (AAA => AAC), which occur mainly in the sequencing phase,
  2. Deletions (AACA => ACA), which occur in the oligonucleotide synthesis phase, and
  3. Insertions (AAC => AACC), which also occur during oligonucleotide synthesis.
The probabilities of these events are then used for finding the "neighbors" or "neighborhood" of a given barcode.  Such a neighborhood consists of the sequences which the barcode is likely to get distorted to during synthesis and sequencing. These neighborhoods are taken into consideration during barcode set construction and during decoding.
nxDecoder
About the method:
Decoding is the process of associating each sequence read with its most likely origin (barcode).  This is accomplished in two phases: a preprocessing phase and a decoding phase.

Preprocessing phase:
The preprocessing phase calculates the most likely origin of most possible sequences.
Given a barcode ci (i=1,...,n) in C, let Bi be the set of sequences (b1, ..., bk) that ci is likely to be distorted into.  ci is said to be likely to be distorted into a sequence bi if there exists a series of events (substitutions, insertions and/or deletions) in which ci is distorted to bi, that has a probability >= min_prob_to_consider.
(Note: if min_prob_to_consider were 0, than calculating Bi would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding.

Given a barcode set C = (c1,...,cn), preprocessing is performed as follows:
For each barcode ci in C, Bi is calculated (using the probabilities given by the channel model).
For each sequence b in Bi (i=1,...,n), if the likelihood of ci being distorted into b is greater than the likelihood of any previously considered barcode (c1,...,ci-1) being distorted into b, ci is remembered as the most likely origin of b.
At the end of the preprocessing phase, we have a function mapping each sequence b to its most likely origin barcode. This function will be used to decode sequences during the decoding phase (Note: not all possible sequences will have mappings, but rather only sequences which some barcode is likely to be distorted into. If other sequences are read, they will be decoded as *unknown*).

Decoding phase:
For each given sequence read S (a line in the output of your sequencer), the barcode region b is extracted from the sequence S according to its predetermined position in the oligo (an input parameter).

The barcode ci which is the most likely origin of b (as calculated in the preprocessing phase) is then chosen and written to the output file as the barcode associated with S.
Event_likelihood and Decode_quality are then calculated and written to the output file header along with ci's description (see explanation above).

nxCodeBuilder
About the method:
Barcode set design is the process of selecting a particular set of sequences of a particular length l, in a fashion that will ensure a maximal rate of correct decoding after transmission across a noisy channel.  Thus, the chosen set should be as resistant as possible to channel noise (sequencing and synthesis errors).  The sequences should also meet certain biochemical constraints, such as GC content and maximal homo polymer length.  In addition, the set should contain as many barcode sequences as possible.
Iterating over all possible sets of sequences that meet the above criteria to find an optimal set is computationally intractable (NP-hard).
nxCodeBuilder is an approximation algorithm, that searches the barcode space using a greedy approach and produces near-optimal-size barcode sets that ensure a given rate of correct decoding - exp_acc.

Let W=(w1,...,wN) be the set of all possible permutations of length l over the alphabet [A,C,G,T] (Note that N = 4l). Let W'=(wi,...,wj) be the subset of W containing all the sequences that meet the given biochemical constraints.

The algorithms begins by initializing the barcode set C to be the empty set.
Next, the algorithm considers, in lexicographic order, each sequence w in W' as a candidate barcode for addition to the growing set C.

Let wi be the candidate currently being considered, and let Bi be the set of sequences (b1, ..., bk) that wi is likely to be distorted into.  wi is said to be likely to be distorted into a sequence bj if there exists a series of events (substitutions, insertions and/or deletions) in which wi is distorted to bj, that has a probability >= min_prob_to_consider.

Let Pj be the probability of wi being distorted into bj.  Notice that there are multiple series of events in which wi is distorted to bj.  pj is the sum of probabilities over all these series that occur with probability >= min_prob_to_consider. Therefor, pj is an approximation of Pj, and its accuracy is determined by min_prob_to_consider.
(Note: if min_prob_to_consider were 0, than calculating Bi would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding.

In order for wi to be added to C, wi must satisfy the following conditions:
  1. There must exist a neighborhood Ni = (n1,...,nm), which is a subset of Bi, such that for each sequence nk in Ni, pk (the probability of wi being distorted into nk) is greater than the probability of any other codeword already in C to be distorted into nk.  In addition, the sum of probabilities p1 + ... + pm must be >= exp_acc (the parameter specifying the expected accuracy of correct decoding, see below).
  2. There must not exist a codeword c already in the code and a neighbor n of c such that the probability of wi being distorted into n is greater than the probability of c being distorted into n (in other words, wi must not interfere with the neighborhood of any previously selected barcode).

At the end of the run, C contains a set of barcodes that mutually satisfy the above conditions.

Correctness and exp_acc:
At the end of the run, each barcode c in C has a neighborhood Nc of sequences.  When c is transmitted, the probability of it being read as one of the words in Nc is >= exp_acc (because the total probability of c being distorted into one of the words in Nc is >= exp_acc).  
For each word n in Nc, c is the codeword with highest probability of being distorted into n. Therefor, n will always be decoded to c (see explanation of decode algorithm).  Overall, a transmission of c will be decoded as c at a rate >= exp_acc.
Input parameters & Quality scores
About exp_acc:
NxCodeBuilder designs codes in such a way that when they are used in experiments, the expected rate of correct decoding will be the value of the parameter --exp_acc.  A reasonable value might be --exp_acc=0.99, which will yield a barcode set designed to be correctly decoded 99% of the time.  Actual correct decoding rates may differ, due to distortions other than base substitutions, insertions and deletions. Legal values are in the range (0,1).
If you aren't getting barcode sets that are large enough for your experiment, consider alotting more nucleotides to your barcode region.  If this is unfeasable, try lowering the value of exp_acc.

About min_prob_to_consider:
Both nxCodeBuilder and nxDecoder work by calculating how likely particular series of distortions are.  For example, they calculate the different ways the barcode AACCG could be mutated to the sequence AAAGG. Note that there are theoretically an infinite number of ways for this distortion to occur, through varying numbers of base deletions, insertions and substitutions.  Fortunately, most such ways are highly unlikely and can safely be ignored.
The parameter --min_prob_to_consider determines the threshold minimum probability of a series of distortions that should be taken into account.  For example, a series of distortions in which AACCG is mutated into AAAGG by 9 substitutions, 4 insertions and 4 deletions is very unlikely, and does not have to be taken into account during code construction and decoding.  The less such mutations that are taken into account, the faster nxCodeBuilder and nxDecoder can work.

Raising the value of min_prob_to_consider (by using the optional input parameter --min_prob_to_consider) yields faster running times. However, this will produce smaller barcode sets and less accurate decoding. If your running times with the default values (usually 1e-08 to 1e-10) are too long, try higher values of min_prob_to_consider, such as 1e-07 or 1e-06.  If you can afford longer running times, consider lower values (such as 1e-11).

Both nxCodeBuilder and nxDecoder print out the default values of --min_prob_to_consider during startup, in this format:

Minimum probability of mutations to consider: 1e-09

About Event_likelihood and Decode_quality:

nxDecoder assignes each decoded barcode with two scores:
  1. Event_likelihood is a number in the range [0,10], which provides a measurement of how much
    a particular barcode is likely to be distorted into the sequence at the barcode region of the read.
    Likely events will produce values close to 10, and unlikely ones - values close to 0.
    For example, the barcode AACGT is quite likely to be read as AACTT, so the Event likelihood for decoding AACTT as AACGT will be high. The same barcode (AACGT) is less likely to be read as GGCTA, so the Event_likelihood for decoding GGCTA as AACGT will be low. Naturally, the Event_likelihood of decoding a barcode as itself will be near 10.
  2. Decode_quality is a number in the range [0,1], which provides a relative measurement of how much a particular barcode is likely to be distorted into the sequence at the barcode region of the read,
    in relation to all the other barcodes. For example, consider a barcode set comprised of just three barcodes: AACCT, TTAAA and GGTTT.  The sequence AACGG is much more similar to AACCT than to TTAAA or to GGTTT,
    so decoding AACGG to AACCT will be assigned a relatively high Decode_quality.  Note that the Event_likelihood of AACCT being distorted to AACGG is quite low, and that this has no bearing on the Decode_quality.