nxCode - DNA Barcode Designer and Decoderfor Next-Gen Sequencing |
|||||
Main | Download barcode sets | Download software | Usage | Algorithmic details | FAQ |
Algorithmic details | |||||
page contents | |||||
Channel model | |||||
The sequencing process is prone to certain
errors. A good barcode set should be relatively immune to such
errors. To design such a set, it is necessary to take into account the
different types of errors and their frequencies of occurrence. We
addressed this by modeling the sequencing process as a noisy
communications channel, and the barcodes as messages that are sent over
that channel. Our channel model is a table containing
probabilities of different errors occurring at the various positions in
the read. The types of errors considered in our model are:
|
|||||
nxDecoder | |||||
About the method: Decoding is the process of associating each sequence read with its most likely origin (barcode). This is accomplished in two phases: a preprocessing phase and a decoding phase. Preprocessing phase: The preprocessing phase calculates the most likely origin of most possible sequences. Given a barcode ci (i=1,...,n) in C, let Bi be the set of sequences (b1, ..., bk) that ci is likely to be distorted into. ci is said to be likely to be distorted into a sequence bi if there exists a series of events (substitutions, insertions and/or deletions) in which ci is distorted to bi, that has a probability >= min_prob_to_consider. (Note: if min_prob_to_consider were 0, than calculating Bi would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding. Given a barcode set C = (c1,...,cn), preprocessing is performed as follows: For each barcode ci in C, Bi is calculated (using the probabilities given by the channel model). For each sequence b in Bi (i=1,...,n), if the likelihood of ci being distorted into b is greater than the likelihood of any previously considered barcode (c1,...,ci-1) being distorted into b, ci is remembered as the most likely origin of b. At the end of the preprocessing phase, we have a function mapping each sequence b to its most likely origin barcode. This function will be used to decode sequences during the decoding phase (Note: not all possible sequences will have mappings, but rather only sequences which some barcode is likely to be distorted into. If other sequences are read, they will be decoded as *unknown*). Decoding phase: For each given sequence read S (a line in the output of your sequencer), the barcode region b is extracted from the sequence S according to its predetermined position in the oligo (an input parameter). The barcode ci which is the most likely origin of b (as calculated in the preprocessing phase) is then chosen and written to the output file as the barcode associated with S. Event_likelihood and Decode_quality are then calculated and written to the output file header along with ci's description (see explanation above). |
|||||
nxCodeBuilder | |||||
About the method: Barcode set design is the process of selecting a particular set of sequences of a particular length l, in a fashion that will ensure a maximal rate of correct decoding after transmission across a noisy channel. Thus, the chosen set should be as resistant as possible to channel noise (sequencing and synthesis errors). The sequences should also meet certain biochemical constraints, such as GC content and maximal homo polymer length. In addition, the set should contain as many barcode sequences as possible. Iterating over all possible sets of sequences that meet the above criteria to find an optimal set is computationally intractable (NP-hard). nxCodeBuilder is an approximation algorithm, that searches the barcode space using a greedy approach and produces near-optimal-size barcode sets that ensure a given rate of correct decoding - exp_acc. Let W=(w1,...,wN) be the set of all possible permutations of length l over the alphabet [A,C,G,T] (Note that N = 4l). Let W'=(wi,...,wj) be the subset of W containing all the sequences that meet the given biochemical constraints. The algorithms begins by initializing the barcode set C to be the empty set. Next, the algorithm considers, in lexicographic order, each sequence w in W' as a candidate barcode for addition to the growing set C. Let wi be the candidate currently being considered, and let Bi be the set of sequences (b1, ..., bk) that wi is likely to be distorted into. wi is said to be likely to be distorted into a sequence bj if there exists a series of events (substitutions, insertions and/or deletions) in which wi is distorted to bj, that has a probability >= min_prob_to_consider. Let Pj be the probability of wi being distorted into bj. Notice that there are multiple series of events in which wi is distorted to bj. pj is the sum of probabilities over all these series that occur with probability >= min_prob_to_consider. Therefor, pj is an approximation of Pj, and its accuracy is determined by min_prob_to_consider. (Note: if min_prob_to_consider were 0, than calculating Bi would involve iterating over all possible series of events. Thus, min_prob_to_consider serves as a threshold to restrict event-series length, as such series can theoretically be infinitely long). Higher values of min_prob_to_consider produce faster but less accurate decoding. In order for wi to be added to C, wi must satisfy the following conditions:
At the end of the run, C contains a set of barcodes that mutually satisfy the above conditions. Correctness and exp_acc: At the end of the run, each barcode c in C has a neighborhood Nc of sequences. When c is transmitted, the probability of it being read as one of the words in Nc is >= exp_acc (because the total probability of c being distorted into one of the words in Nc is >= exp_acc). For each word n in Nc, c is the codeword with highest probability of being distorted into n. Therefor, n will always be decoded to c (see explanation of decode algorithm). Overall, a transmission of c will be decoded as c at a rate >= exp_acc. |
|||||
Input parameters & Quality scores | |||||
About exp_acc: NxCodeBuilder designs codes in such a way that when they are used in experiments, the expected rate of correct decoding will be the value of the parameter --exp_acc. A reasonable value might be --exp_acc=0.99, which will yield a barcode set designed to be correctly decoded 99% of the time. Actual correct decoding rates may differ, due to distortions other than base substitutions, insertions and deletions. Legal values are in the range (0,1). If you aren't getting barcode sets that are large enough for your experiment, consider alotting more nucleotides to your barcode region. If this is unfeasable, try lowering the value of exp_acc. About min_prob_to_consider: Both nxCodeBuilder and nxDecoder work by calculating how likely particular series of distortions are. For example, they calculate the different ways the barcode AACCG could be mutated to the sequence AAAGG. Note that there are theoretically an infinite number of ways for this distortion to occur, through varying numbers of base deletions, insertions and substitutions. Fortunately, most such ways are highly unlikely and can safely be ignored. The parameter --min_prob_to_consider determines the threshold minimum probability of a series of distortions that should be taken into account. For example, a series of distortions in which AACCG is mutated into AAAGG by 9 substitutions, 4 insertions and 4 deletions is very unlikely, and does not have to be taken into account during code construction and decoding. The less such mutations that are taken into account, the faster nxCodeBuilder and nxDecoder can work. Raising the value of min_prob_to_consider (by using the optional input parameter --min_prob_to_consider) yields faster running times. However, this will produce smaller barcode sets and less accurate decoding. If your running times with the default values (usually 1e-08 to 1e-10) are too long, try higher values of min_prob_to_consider, such as 1e-07 or 1e-06. If you can afford longer running times, consider lower values (such as 1e-11). Both nxCodeBuilder and nxDecoder print out the default values of --min_prob_to_consider during startup, in this format: Minimum probability of mutations to consider: 1e-09 About Event_likelihood and Decode_quality: nxDecoder assignes each decoded barcode with two scores:
|