Algorithmic details

nxCode - DNA Barcode Designer and Decoder for Next-Gen Sequencing

Main	Download barcode sets	Download software	Usage	Algorithmic details	FAQ
Algorithmic details
page contents
Channel model nxDecoder nxCodeBuilder Input Parameters & Quality scores
Channel model
The sequencing process is prone to certain errors. A good barcode set should be relatively immune to such errors. To design such a set, it is necessary to take into account the different types of errors and their frequencies of occurrence. We addressed this by modeling the sequencing process as a noisy communications channel, and the barcodes as messages that are sent over that channel. Our channel model is a table containing probabilities of different errors occurring at the various positions in the read. The types of errors considered in our model are: Substitutions (AAA => AAC), which occur mainly in the sequencing phase, Deletions (AACA => ACA), which occur in the oligonucleotide synthesis phase, and Insertions (AAC => AACC), which also occur during oligonucleotide synthesis. The probabilities of these events are then used for finding the "neighbors" or "neighborhood" of a given barcode. Such a neighborhood consists of the sequences which the barcode is likely to get distorted to during synthesis and sequencing. These neighborhoods are taken into consideration during barcode set construction and during decoding.
nxDecoder
About the method: Decoding is the process of associating each sequence read with its most likely origin (barcode). This is accomplished in two phases: a preprocessing phase and a decoding phase. Preprocessing phase: The preprocessing phase calculates the most likely origin of most possible sequences. Given a barcode c_i (i=1,...,n) in C, let B_i be the set of sequences (b₁, ..., b_k) that c_i is likely to be distorted into. c_i is said to be likely to be distorted into a sequence b_i if there exists a series of events (substitutions, insertions and/or deletions) in which c_i is distorted to b_i, that has a probability >= min_prob_to_consider. (Note: if min_prob_to_consider were 0, than calculating B_i would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding. Given a barcode set C = (c₁,...,c_n), preprocessing is performed as follows: For each barcode c_i in C, B_i is calculated (using the probabilities given by the channel model). For each sequence b in B_i (i=1,...,n), if the likelihood of c_i being distorted into b is greater than the likelihood of any previously considered barcode (c₁,...,c_i-1) being distorted into b, c_i is remembered as the most likely origin of b. At the end of the preprocessing phase, we have a function mapping each sequence b to its most likely origin barcode. This function will be used to decode sequences during the decoding phase (Note: not all possible sequences will have mappings, but rather only sequences which some barcode is likely to be distorted into. If other sequences are read, they will be decoded as unknown). Decoding phase: For each given sequence read S (a line in the output of your sequencer), the barcode region b is extracted from the sequence S according to its predetermined position in the oligo (an input parameter). The barcode c_i which is the most likely origin of b (as calculated in the preprocessing phase) is then chosen and written to the output file as the barcode associated with S. Event_likelihood and Decode_quality are then calculated and written to the output file header along with c_i's description (see explanation above).
nxCodeBuilder
About the method: Barcode set design is the process of selecting a particular set of sequences of a particular length l, in a fashion that will ensure a maximal rate of correct decoding after transmission across a noisy channel. Thus, the chosen set should be as resistant as possible to channel noise (sequencing and synthesis errors). The sequences should also meet certain biochemical constraints, such as GC content and maximal homo polymer length. In addition, the set should contain as many barcode sequences as possible. Iterating over all possible sets of sequences that meet the above criteria to find an optimal set is computationally intractable (NP-hard). nxCodeBuilder is an approximation algorithm, that searches the barcode space using a greedy approach and produces near-optimal-size barcode sets that ensure a given rate of correct decoding - exp_acc. Let W=(w₁,...,w_N) be the set of all possible permutations of length l over the alphabet [A,C,G,T] (Note that N = 4^l). Let W'=(w_i,...,w_j) be the subset of W containing all the sequences that meet the given biochemical constraints. The algorithms begins by initializing the barcode set C to be the empty set. Next, the algorithm considers, in lexicographic order, each sequence w in W' as a candidate barcode for addition to the growing set C. Let w_i be the candidate currently being considered, and let B_i be the set of sequences (b₁, ..., b_k) that w_i is likely to be distorted into. w_i is said to be likely to be distorted into a sequence b_j if there exists a series of events (substitutions, insertions and/or deletions) in which w_i is distorted to b_j, that has a probability >= min_prob_to_consider. Let P_j be the probability of w_i being distorted into b_j. Notice that there are multiple series of events in which w_i is distorted to b_j.p_j is the sum of probabilities over all these series that occur with probability >= min_prob_to_consider. Therefor, p_j is an approximation of P_j, and its accuracy is determined by min_prob_to_consider. (Note: if min_prob_to_consider were 0, than calculating B_i would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding. In order for w_i to be added to C, w_i must satisfy the following conditions: There must exist a neighborhood N_i = (n₁,...,n_m), which is a subset of B_i,such that for each sequence n_k in N_i, p_k (the probability of w_i being distorted into n_k) is greater than the probability of any other codeword already in C to be distorted into n_k. In addition, the sum of probabilities p₁ + ... + p_m must be >= exp_acc (the parameter specifying the expected accuracy of correct decoding, see below). There must not exist a codeword c already in the code and a neighbor n of c such that the probability of w_i being distorted into n is greater than the probability of c being distorted into n (in other words, w_i must not interfere with the neighborhood of any previously selected barcode). At the end of the run, C contains a set of barcodes that mutually satisfy the above conditions. Correctness and exp_acc: At the end of the run, each barcode c in C has a neighborhood N_c of sequences. When c is transmitted, the probability of it being read as one of the words in N_c is >= exp_acc (because the total probability of c being distorted into one of the words in N_c is >= exp_acc). For each word n in N_c, c is the codeword with highest probability of being distorted into n. Therefor, n will always be decoded to c (see explanation of decode algorithm). Overall, a transmission of c will be decoded as c at a rate >= exp_acc.
Input parameters & Quality scores
About exp_acc: NxCodeBuilder designs codes in such a way that when they are used in experiments, the expected rate of correct decoding will be the value of the parameter --exp_acc. A reasonable value might be --exp_acc=0.99, which will yield a barcode set designed to be correctly decoded 99% of the time. Actual correct decoding rates may differ, due to distortions other than base substitutions, insertions and deletions. Legal values are in the range (0,1). If you aren't getting barcode sets that are large enough for your experiment, consider alotting more nucleotides to your barcode region. If this is unfeasable, try lowering the value of exp_acc. About min_prob_to_consider: Both nxCodeBuilder and nxDecoder work by calculating how likely particular series of distortions are. For example, they calculate the different ways the barcode AACCG could be mutated to the sequence AAAGG. Note that there are theoretically an infinite number of ways for this distortion to occur, through varying numbers of base deletions, insertions and substitutions. Fortunately, most such ways are highly unlikely and can safely be ignored. The parameter --min_prob_to_consider determines the threshold minimum probability of a series of distortions that should be taken into account. For example, a series of distortions in which AACCG is mutated into AAAGG by 9 substitutions, 4 insertions and 4 deletions is very unlikely, and does not have to be taken into account during code construction and decoding. The less such mutations that are taken into account, the faster nxCodeBuilder and nxDecoder can work. Raising the value of min_prob_to_consider (by using the optional input parameter --min_prob_to_consider) yields faster running times. However, this will produce smaller barcode sets and less accurate decoding. If your running times with the default values (usually 1e-08 to 1e-10) are too long, try higher values of min_prob_to_consider, such as 1e-07 or 1e-06. If you can afford longer running times, consider lower values (such as 1e-11). Both nxCodeBuilder and nxDecoder print out the default values of --min_prob_to_consider during startup, in this format: Minimum probability of mutations to consider: 1e-09 About Event_likelihood and Decode_quality: nxDecoder assignes each decoded barcode with two scores: Event_likelihood is a number in the range [0,10], which provides a measurement of how much a particular barcode is likely to be distorted into the sequence at the barcode region of the read. Likely events will produce values close to 10, and unlikely ones - values close to 0. For example, the barcode AACGT is quite likely to be read as AACTT, so the Event likelihood for decoding AACTT as AACGT will be high. The same barcode (AACGT) is less likely to be read as GGCTA, so the Event_likelihood for decoding GGCTA as AACGT will be low. Naturally, the Event_likelihood of decoding a barcode as itself will be near 10. Decode_quality is a number in the range [0,1], which provides a relative measurement of how much a particular barcode is likely to be distorted into the sequence at the barcode region of the read, in relation to all the other barcodes. For example, consider a barcode set comprised of just three barcodes: AACCT, TTAAA and GGTTT. The sequence AACGG is much more similar to AACCT than to TTAAA or to GGTTT, so decoding AACGG to AACCT will be assigned a relatively high Decode_quality. Note that the Event_likelihood of AACCT being distorted to AACGG is quite low, and that this has no bearing on the Decode_quality.

nxCode - DNA Barcode Designer and Decoder

for Next-Gen Sequencing

Algorithmic details