# 09-26【Han Mao Kiah】腾讯会议-吴文俊数学重点实验室组合图论系列讲座之157

The imbalance of a binary word refers to the absolute difference between the number of ones and the number of zeros in it. A word is balanced if its imbalance is at most one and a code is balanced if all its codewords are balanced. In 1986, motivated by applications in optical disks, Knuth [1] proposed a simple and efficient scheme that transforms binary messages into balanced codes.
In recent years, advances in synthesis and sequencing technologies have made DNA macromolecules an attractive medium for digital information storage. Besides being biochemically robust, DNA strands offer ultrahigh storage densities of 10^15 - 10^20 bytes per gram of DNA, as demonstrated in recent experiments (see [2, Table 1]). However, this non-traditional media presents new challenges to the coding community (see [3] for a survey). One particular challenge has rekindled interest in balanced codes. Specifically, a DNA string comprises four bases or letters: A, C, T and G, and a string is GC-rich (or GC-poor) if a high (or low) proportion of the bases corresponds to either G or C. Since GC-rich or GC-poor DNA strings are prone to both synthesis and sequencing errors, we aim to reduce the difference with the number of G and C and the number of A and T on every DNA codeword. This requirement is equivalent to reducing the imbalance of a related binary word.
In this talk, we revisit Knuth's balancing method and look at techniques that adapt the method to address coding problems in DNA-based data storage.
In the first part, we look at constructions of address sequences that are critical in DNA-based data storage with random-access capabilities. Specifically, Yazdi et al. [2] developed an architecture that allows selective access to encoded DNA strands through the process of hybridization. The technique involves prepending infor
mation-carrying DNA strands with specially chosen address sequences called primers. By combining Knuth's balancing method with cyclic codes, we provide efficient and explicit constructions of such primer sets [4].

In the second part, in addition to the GC-balanced constraints, we look at codes that correct a single insertion or deletion or substitution and whose codewords obey the homopolymer runlength constraints. Besides the code constructions, we also propose linear-time encoders for our codebooks [5].

In the third part, we apply Knuth's balancing method to a beautiful and important class of codes, the polar codes. Invented by Arıkan [6], polar codes achieve capacity for many channels with low encoding and decoding complexities. We study a generalization of Knuth's balancing method, specifically, a technique of Mazumdar, Roth, and Vontobel [7], and provide means of transforming messages into balanced polar codewords while retaining the low complexities of the polar encoding and decoding algorithms.