Bioinformatics, Computational and Systems Biology
Sophia Vincoff (she/her/hers)
PhD Student
Duke University
Durham, North Carolina, United States
Tianlai Chen
Research Student
Duke University
Durham, North Carolina, United States
Vivian Yudistyra (she/her/hers)
Graduate Student Researcher
Duke University
Durham, North Carolina, United States
Lauren Hong (she/her/hers)
PhD Student
Duke University
Durham, North Carolina, United States
Pranam Chatterjee, PhD
Assistant Professor of Biomedical Engineering
Duke University
Durham, North Carolina, United States
Many proteins are considered “undruggable” by standard small-molecule approaches, largely due to their disordered nature, instability, and lack of accessible binding sites. As a result, protein-based therapeutics are emerging as potent treatment options. The design of binding proteins, however, is challenging, requiring either laborious screening via yeast or phage display or structure-based computational design. While structure-based deep learning tools such as RFdiffusion have produced high-affinity binders de novo, these tools are inapplicable to the large portion of disease-causing proteins with unstable structures.
Large autoregressive language models, most notably the GPT-X series from OpenAI, have become the leading method for image and natural language generation, demonstrating state-of-the-art text generation when provided with contextual prompts. Recently, protein language models (pLMs) such as ESM-2, Ankh, and ProtT5, attention-based Transformer architectures trained on millions of amino acid sequences, have demonstrated accurate representation of the physicochemical, structural, and functional properties of input proteins. Similarly, autoregressive pLMs such as ProtGPT2 and ProGen2 have achieved both unconditional and class-specific de novo protein generation. However, no current work leverages autoregressive pLMs to generate novel binders to specific targets, which would enable the design of therapeutics against disordered and undruggable proteins. To address this gap, we develop PPI-GPT by fine-tuning ProtGPT2 on protein-protein interaction (PPI) sequences. PPI-GPT generates binder proteins de novo when prompted with a target protein sequence. We show that PPI-GPT-generated peptides exhibit low perplexity, high amino acid sequence diversity, and secondary structure distributions similar to those of natural proteins, and that they can functionally bind and modulate target proteins, motivating model usage for proteome-wide editing applications.
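As a minimal sketch of this conditional-generation setup (not the exact PPI-GPT inference code), the snippet below prompts an autoregressive pLM with a target sequence via the Hugging Face transformers API and decodes candidate binders. The base ProtGPT2 checkpoint is public; the target sequence and sampling parameters here are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Base ProtGPT2 weights are public on the Hugging Face Hub; a fine-tuned
# PPI-GPT checkpoint would be loaded the same way from a local path.
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()

target = "MSEQNNTEMTFQIQRIYTKDIS"  # hypothetical target sequence (one-letter codes)
inputs = tokenizer(target, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs["input_ids"],
        max_new_tokens=60,            # approximate binder length, in tokens
        do_sample=True,               # sample rather than greedy decode
        top_k=950,                    # broad top-k sampling, as in ProtGPT2
        repetition_penalty=1.2,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )

# Everything generated after the prompt tokens is a candidate binder
prompt_len = inputs["input_ids"].shape[1]
binders = tokenizer.batch_decode(out[:, prompt_len:], skip_special_tokens=True)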
Materials and Methods:
To train an autoregressive GPT model that generates target-binding proteins given an input target sequence, we first curated a unified, gold-standard PPI dataset by mining IMEx and BioGRID for experimentally verified PPIs. Target and binder protein sequences were acquired from UniProtKB, and only PPIs with a combined target-binder length under 1024 amino acids were retained. The dataset was clustered at 80% sequence identity using MMseqs2 to minimize redundancy. Finally, the dataset was augmented, effectively doubling it, by swapping targets and binders, yielding 245,520 PPIs. As input to the GPT-2 tokenizer, we concatenated the target and binder sequences with no separator token. The model was trained for five epochs with hyperparameter tuning. The final model, when prompted with a target protein sequence, autoregressively generates a binding protein of any desired length via next-token prediction. Candidate designs are ranked by the perplexity of the concatenated target-binder sequence, with the lowest-perplexity designs considered the top results.
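The ranking step can be made concrete: under a causal language model, the perplexity of a sequence is the exponential of its mean token-level negative log-likelihood. Below is a minimal sketch assuming a GPT-2-style model and tokenizer loaded as above; the target and candidate binders are hypothetical placeholders.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()

def sequence_perplexity(target: str, binder: str) -> float:
    # Tokenize the concatenated target+binder sequence (no separator token,
    # matching the training setup described in the Methods)
    ids = tokenizer(target + binder, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy of next-token prediction
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())  # perplexity = exp(mean negative log-likelihood)

# Hypothetical target and candidate binders; lower perplexity ranks higher
target = "MSEQNNTEMTFQIQRIYTKDIS"
candidates = ["MKKLLPTAAAGLLLLAAQPAMA", "MADEEKLPPGWEKRMSRSSGRV"]
ranked = sorted(candidates, key=lambda b: sequence_perplexity(target, b))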
Results, Conclusions, and Discussions: After training and hyperparameter tuning, we produced a model that exhibited low perplexity on a held-out validation set of PPIs, as compared to PPIs with randomly generated binding proteins. Next, we sought to confirm the validity of PPI-GPT-generated proteins in silico, comparing the predicted disorder content (via IUPred3) of 1000 PPI-GPT-generated proteins to the natural and random datasets from ProtGPT2. To validate the binding ability of PPI-GPT proteins to their intended targets in silico, we chose six short (< 350 amino acids), stable (pLDDT > 89) target proteins (PD-L1, APEX1, UBC9, PCNA, CHIP, and GAPDH) and compared PPI-GPT against RFdiffusion. For each target, six binding proteins of length equal to the target were generated with each method, and AlphaFold-Multimer was run on each resulting target-binder complex. Comparing the complexes' iPTM scores revealed similar performance between the two methods on multiple targets.
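As a rough sketch of how such an iPTM comparison can be scripted (assuming the open-source AlphaFold-Multimer pipeline, whose result pickles contain an "iptm" entry; the directory layout and filename pattern below are assumptions, not the exact paths used in this work):

import pickle
from pathlib import Path

def best_iptm(run_dir: str) -> float:
    # Each AlphaFold-Multimer result pickle stores an "iptm" score for the
    # predicted complex; exact filename patterns vary across AlphaFold versions.
    scores = []
    for pkl in Path(run_dir).glob("result_model_*.pkl"):
        with open(pkl, "rb") as fh:
            result = pickle.load(fh)
        scores.append(float(result["iptm"]))
    return max(scores)

# Hypothetical layout: one AlphaFold-Multimer run per target-binder complex
for target in ["PD-L1", "APEX1", "UBC9", "PCNA", "CHIP", "GAPDH"]:
    print(target,
          "PPI-GPT iPTM:", best_iptm(f"af_runs/ppi_gpt/{target}"),
          "RFdiffusion iPTM:", best_iptm(f"af_runs/rfdiffusion/{target}"))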
Finally, we experimentally validated PPI-GPT-generated binder proteins by fusing them to E3 ubiquitin ligases, enabling intracellular degradation of target proteins. Western blotting shows that these “ubiquibodies” drive robust target degradation in various cellular contexts, motivating their application to undruggable target proteins for which no small molecules exist.
In total, we present a novel language model, PPI-GPT, that generates binding protein sequences conditioned on an input target protein sequence. By fusing these generated proteins to E3 ubiquitin ligase domains, we create a programmable protein-targeting architecture, enabling the degradation of various target proteins without requiring tertiary structure information. To the best of our knowledge, our degraders represent the first de novo-designed therapeutic proteins generated by a purely sequence-based deep learning model, unlocking a host of biotechnology applications.