Bioinformatics, Computational and Systems Biology
Walker A. Rickord
Undergraduate Student
University of Illinois Urbana-Champaign
Wheaton, Illinois, United States
Paul Jensen
Assistant Professor
University of Michigan
Ann Arbor, Michigan, United States
Benjamin David
Ph.D. Candidate
University of Michigan
Ann Arbor, Michigan, United States
Synthetic biologists use randomized DNA barcodes, known as randomers, to label individual genetic parts. In workflows that rely on restriction enzymes, randomers may contain recognition sites that lead to barcode digestion and improper DNA assembly. Therefore, a method for designing randomers that exclude these sites is valuable.
The CutFree algorithm constructs randomers free of user-specified restriction enzyme recognition sites using mixed-integer linear programming (MILP) [1]. Its primary objective is to maximize the log of the degeneracy of the randomer, i.e., the number of unique barcodes present in the pool of oligonucleotides. However, CutFree’s MILP formulation gives rise to unpredictable and excessively long runtimes for complex inputs. CutFreeRL, a reinforcement learning-based alternative to CutFree, mitigates these limitations using heuristics and stochastic search [2]. These techniques improve scalability and yield predictable runtimes, at the cost of reduced randomer degeneracy.
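To make the objective concrete, the snippet below computes the log degeneracy of a candidate randomer from its IUPAC degenerate-base codes. It is an illustrative Python sketch, not code from the CutFree implementation.

```python
# Log degeneracy of a randomer: the log of the number of distinct barcodes
# encoded by a degenerate IUPAC sequence (the quantity CutFree maximizes).
import math

IUPAC_SIZE = {"A": 1, "C": 1, "G": 1, "T": 1,
              "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
              "B": 3, "D": 3, "H": 3, "V": 3, "N": 4}

def log_degeneracy(randomer: str) -> float:
    """Sum of log(#allowed bases) per position = log of their product."""
    return sum(math.log(IUPAC_SIZE[base]) for base in randomer.upper())

print(log_degeneracy("NNNNNNNNNN"))  # 10 * log(4) ~ 13.86, i.e., 4^10 unique barcodes
```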
Ideally, users would apply the original CutFree algorithm to small-scale problems that solve quickly while reserving CutFreeRL for larger problems that are intractable for CutFree. Because there is no obvious way to predict CutFree's runtime, we introduce a novel application of the Bidirectional Encoder Representations from Transformers (BERT) model to optimize algorithm selection for a given input. BERT, designed for natural language processing, learns contextual representations of words through attention mechanisms that can be transferred to downstream tasks [3]. Although DNA sequences lie outside BERT’s original domain, the model has demonstrated utility in sequence-based tasks [4, 5]. For the following experiment, we deployed DistilBERT, a smaller, faster, and cheaper version of BERT [6].
Materials and Methods:
The study is based on a 27,334-sample dataset. Each sample contains a starting sequence, which sets the output randomer’s length and the pool of bases allowed at each position during optimization, alongside a varying number of restriction enzyme recognition sites, which identify the sequences to block. Additionally, each sample includes the corresponding solutions, degeneracies, and runtimes from CutFree and CutFreeRL. A target column specifies the algorithm that executed faster while producing a degeneracy score within 10% of the optimal value. We craft each model input as an array containing the starting sequence length and a tokenized string concatenation of the blocking sites, padded for uniformity. Each input is labeled 0 (CutFree) or 1 (CutFreeRL).
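As an illustration of how a single input could be encoded, the following Python sketch uses the Hugging Face DistilBERT tokenizer; the example recognition sites, padding length, and field names are hypothetical and not taken from the dataset.

```python
# Hypothetical encoding of one sample: a starting-sequence length plus a
# tokenized, padded concatenation of the blocking sites.
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

starting_sequence = "N" * 20                     # 20-base degenerate starting sequence
blocking_sites = ["GGTCTC", "GAATTC", "CACGTG"]  # example recognition sites to block

encoded = tokenizer(
    " ".join(blocking_sites),   # concatenated blocking sites as one string
    padding="max_length",       # pad so every input has the same length
    max_length=64,              # illustrative padding length
    truncation=True,
    return_tensors="pt",
)
model_input = {
    "seq_length": len(starting_sequence),  # sets the output randomer's length
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
}
label = 1  # 0 = CutFree, 1 = CutFreeRL (whichever solved this sample faster)
```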
The sequence lengths were fed into a 3-layer perceptron (768, 128, and 64 neurons), with each layer using L2 regularization (1e-3), batch normalization, dropout (0.4), and rectified linear unit (ReLU) activation. Simultaneously, we passed the tokenized strings to a pre-trained DistilBERT, took the last hidden state (embedding size = 768), applied dropout (0.8), and connected it to a second perceptron with the same structure as above. The two perceptrons’ outputs were concatenated and linked to a 2-neuron layer activated by the softmax function, which performs the classification. We trained the model for 30 epochs using the Adam optimizer with weight decay (learning rate = 5e-5, weight decay = 0.01) and the sparse categorical cross-entropy loss function. Additionally, class weights of 0.7 (CutFree) and 1.7 (CutFreeRL) were applied to alleviate the effects of class imbalance.
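A minimal sketch of this architecture is shown below, assuming a PyTorch implementation with Hugging Face’s DistilBertModel; it is not the authors’ code. The per-layer L2 regularization (1e-3) is omitted for brevity (in PyTorch it could be added via per-parameter weight-decay groups or an explicit penalty term).

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel

def mlp_branch(in_dim: int) -> nn.Sequential:
    """3-layer perceptron (768, 128, 64 neurons) with batch norm, dropout 0.4, ReLU."""
    layers, dims = [], [in_dim, 768, 128, 64]
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                   nn.Dropout(0.4), nn.ReLU()]
    return nn.Sequential(*layers)

class AlgorithmSelector(nn.Module):
    """Two-branch classifier: class 0 = CutFree, class 1 = CutFreeRL."""
    def __init__(self):
        super().__init__()
        self.length_branch = mlp_branch(1)        # scalar starting-sequence length
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.bert_dropout = nn.Dropout(0.8)
        self.site_branch = mlp_branch(768)        # DistilBERT hidden size
        self.classifier = nn.Linear(64 + 64, 2)   # concatenated branch outputs -> 2 classes

    def forward(self, seq_length, input_ids, attention_mask):
        # seq_length: float tensor of shape (batch, 1)
        len_feat = self.length_branch(seq_length)                 # (batch, 64)
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        first_tok = self.bert_dropout(hidden[:, 0, :])            # (batch, 768)
        site_feat = self.site_branch(first_tok)                   # (batch, 64)
        return self.classifier(torch.cat([len_feat, site_feat], dim=1))  # logits

model = AlgorithmSelector()
# Class weights (0.7 for CutFree, 1.7 for CutFreeRL) offset the class imbalance.
# Note: nn.CrossEntropyLoss consumes logits, so the softmax described in the text
# is applied only when reporting predicted probabilities.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.7, 1.7]))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
```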
CutFree’s use is undesirable in high-runtime situations, as it may cause server timeouts, increase computational costs, or become stuck in excessively long optimization cycles. Nonetheless, we retain CutFree in these cases if CutFreeRL is predicted to produce an invalid solution, as determined by a degeneracy score that falls below the 10% flexibility threshold. Based on these parameters, we visualize the ideal output by sorting the correct algorithm selections by their runtime discrepancies (CutFree runtime minus CutFreeRL runtime) in ascending order, as seen in Figure 1A. The cutoff near 60 seconds reflects a runtime limit placed on the CutFree algorithm. Using this visualization, we can identify problematic and unproblematic classifications made by our model. Problematic classifications are input cases that satisfy all of the following criteria: (1) they are predicted to belong to the CutFree class, (2) they produce a high runtime discrepancy, and (3) they are solvable by CutFreeRL.
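This decision rule can be summarized in a short Python sketch; the runtime-discrepancy threshold used here is an assumed illustration, not a value reported in this work.

```python
# Flag cases where choosing CutFree wastes time even though CutFreeRL is valid.
def is_problematic(predicted_label: int,
                   cutfree_runtime: float,
                   cutfreerl_runtime: float,
                   cutfreerl_degeneracy: float,
                   optimal_degeneracy: float,
                   discrepancy_threshold: float = 10.0) -> bool:
    predicted_cutfree = (predicted_label == 0)
    high_discrepancy = (cutfree_runtime - cutfreerl_runtime) > discrepancy_threshold
    cutfreerl_valid = cutfreerl_degeneracy >= 0.9 * optimal_degeneracy  # within 10% of optimum
    return predicted_cutfree and high_discrepancy and cutfreerl_valid
```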
The fine-tuned DistilBERT classifier predicts the more efficient algorithm with an accuracy of 89.7% ± 0.296 (n=5). Importantly, the sensitivity is 99.8% ± 0.149 (n=5) when assessing the problematic and unproblematic classifications, as seen in Figure 1B. For comparison, we created a control model that followed the same perceptron structure described above. We extracted three key features from each input: the starting sequence length, the number of restriction sites, and the average length of the restriction sites. Using these features, the control model achieved an accuracy of 72.4% ± 0.458 (n=5).
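For reference, the three hand-crafted features used by the control model could be computed as in the sketch below; the function and variable names are illustrative.

```python
# Hand-crafted features for the control model (names are hypothetical).
def control_features(starting_sequence: str, blocking_sites: list[str]) -> list[float]:
    return [
        float(len(starting_sequence)),                               # starting sequence length
        float(len(blocking_sites)),                                  # number of restriction sites
        sum(len(s) for s in blocking_sites) / len(blocking_sites),   # average site length
    ]
```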
Overall, our approach demonstrates an effective application of large language modeling to synthetic biology. Specifically, we show how the context contained in restriction enzyme recognition sites can be used to make the design of randomized DNA sequences free of those sites more efficient. A future direction is to investigate the model’s embeddings to identify the inputs that yield the highest and lowest prediction confidence.
By adapting a transformer model to a high-performance computing classification task, we automate a challenging decision for end users unfamiliar with the implementations and trade-offs of CutFree (deterministic) and CutFreeRL (heuristic). All algorithms mentioned are available at https://github.com/wrickord/CutFree.
This work was supported by the NIH (award EB027396).
[1] A. J. Storm and P. A. Jensen, “Designing randomized DNA sequences free of restriction enzyme recognition sites,” Biotechnol J, vol. 13, no. 1, Jan. 2018, doi: 10.1002/biot.201700326.
[2] B. M. David, R. M. Wyllie, R. Harouaka, and P. A. Jensen, “A reinforcement learning framework for pooled oligonucleotide design,” Bioinformatics, vol. 38, no. 8, pp. 2219–2225, Apr. 2022, doi: 10.1093/bioinformatics/btac073.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv, May 24, 2019. doi: 10.48550/arXiv.1810.04805.
[4] N. Q. K. Le, Q.-T. Ho, T.-T.-D. Nguyen, and Y.-Y. Ou, “A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information,” Briefings in Bioinformatics, vol. 22, no. 5, p. bbab005, Feb. 2021, doi: 10.1093/bib/bbab005.
[5] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri, “DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome,” Bioinformatics, vol. 37, no. 15, pp. 2112–2120, Aug. 2021, doi: 10.1093/bioinformatics/btab083.
[6] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” arXiv, Feb. 29, 2020. doi: 10.48550/arXiv.1910.01108.