Bioinformatics, Computational and Systems Biology
Sophia Vincoff (she/her/hers)
PhD Student
Duke University
Durham, North Carolina, United States
Tianlai Chen
Research Student
Duke University
Durham, North Carolina, United States
Vivian Yudistyra (she/her/hers)
Graduate Student Researcher
Duke University
Durham, North Carolina, United States
Lauren Hong (she/her/hers)
PhD Student
Duke University
Durham, North Carolina, United States
Pranam Chatterjee, PhD
Assistant Professor of Biomedical Engineering
Duke University
Durham, North Carolina, United States
Many proteins are considered “undruggable” by standard small-molecule approaches, largely due to their disordered nature, instability, and lack of accessible binding sites. As a result, protein-based therapeutics are emerging as potent treatment options. The design of binding proteins, however, is challenging, requiring either laborious screening via yeast or phage display or structure-based computational design. While structure-based deep learning tools such as RFdiffusion have produced high-affinity binders de novo, these tools are inapplicable to the large portion of disease-causing proteins with unstable structures.
Large autoregressive language models, most notably the GPT-X series from OpenAI, have become the leading method for image and natural language generation, demonstrating state-of-the-art text generation when provided with contextual prompts. Recently, protein language models (pLMs) such as ESM-2, Ankh, and ProtT5, attention-based Transformer architectures trained on millions of amino acid sequences, have demonstrated accurate representation of the physicochemical, structural, and functional properties of input proteins. Similarly, autoregressive pLMs such as ProtGPT2 and ProGen2 have achieved both unconditional and class-specific de novo protein generation. However, no current work leverages autoregressive pLMs to generate novel binders to specific targets, which would enable the design of therapeutics against disordered and undruggable proteins. To address this gap, we develop PPI-GPT by fine-tuning ProtGPT2 on protein-protein interaction (PPI) sequences. PPI-GPT generates binder proteins de novo when prompted with a target protein sequence. We show that PPI-GPT-generated peptides exhibit low perplexity, high amino acid sequence diversity, and secondary structure distributions similar to those of natural proteins, and that they can functionally bind and modulate target proteins, motivating model usage for proteome-wide editing applications.
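As a minimal sketch of this conditional-generation setup (not the exact PPI-GPT inference code), the snippet below prompts an autoregressive pLM with a target sequence via the Hugging Face transformers API and decodes candidate binders. The base ProtGPT2 checkpoint is public; the target sequence and sampling parameters here are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Base ProtGPT2 weights are public on the Hugging Face Hub; a fine-tuned
# PPI-GPT checkpoint would be loaded the same way from a local path.
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()

target = "MSEQNNTEMTFQIQRIYTKDIS"  # hypothetical target sequence (one-letter codes)
inputs = tokenizer(target, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        inputs["input_ids"],
        max_new_tokens=60,            # approximate binder length, in tokens
        do_sample=True,               # sample rather than greedy decode
        top_k=950,                    # broad top-k sampling, as in ProtGPT2
        repetition_penalty=1.2,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )

# Everything generated after the prompt tokens is a candidate binder
prompt_len = inputs["input_ids"].shape[1]
binders = tokenizer.batch_decode(out[:, prompt_len:], skip_special_tokens=True)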
Materials and Methods:
To train an autoregressive GPT model that generates target-binding proteins given an input target sequence, we first curated a unified, gold-standard PPI dataset by mining IMEx and BioGRID for experimentally verified PPIs. Target and binder protein sequences were acquired from UniProtKB, and only PPIs with a combined target-binder length under 1024 amino acids were retained. The dataset was clustered at 80% sequence identity using MMseqs2 to minimize redundancy. Finally, the dataset was augmented, effectively doubling it, by swapping targets and binders, yielding 245,520 PPIs. As input to the GPT-2 tokenizer, we concatenated the target and binder sequences with no separator token. The model was trained for five epochs with hyperparameter tuning. The final model, when prompted with a target protein sequence, autoregressively generates a binding protein of any desired length via next-token prediction. Candidate designs are ranked by the perplexity of the concatenated target-binder sequence, with the lowest-perplexity designs considered the top results.
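The ranking step can be made concrete: under a causal language model, the perplexity of a sequence is the exponential of its mean token-level negative log-likelihood. Below is a minimal sketch assuming a GPT-2-style model and tokenizer loaded as above; the target and candidate binders are hypothetical placeholders.

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()

def sequence_perplexity(target: str, binder: str) -> float:
    # Tokenize the concatenated target+binder sequence (no separator token,
    # matching the training setup described in the Methods)
    ids = tokenizer(target + binder, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy of next-token prediction
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())  # perplexity = exp(mean negative log-likelihood)

# Hypothetical target and candidate binders; lower perplexity ranks higher
target = "MSEQNNTEMTFQIQRIYTKDIS"
candidates = ["MKKLLPTAAAGLLLLAAQPAMA", "MADEEKLPPGWEKRMSRSSGRV"]
ranked = sorted(candidates, key=lambda b: sequence_perplexity(target, b))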
Results, Conclusions, and Discussions: After training and hyperparameter tuning, we produced a model that exhibited low perplexity on a held-out validation set of PPIs, as compared to PPIs with randomly generated binding proteins. Next, we sought to confirm the validity of PPI-GPT-generated proteins in silico, comparing the predicted disorder content (via IUPred3) of 1000 PPI-GPT-generated proteins to the natural and random datasets from ProtGPT2. To validate the binding ability of PPI-GPT proteins to their intended targets in silico, we chose six short (< 350 amino acids), stable (pLDDT > 89) target proteins (PD-L1, APEX1, UBC9, PCNA, CHIP, and GAPDH) and compared PPI-GPT against RFdiffusion. For each target, six binding proteins of length equal to the target were generated with each method, and AlphaFold-Multimer was run on each resulting target-binder complex. Comparing the complexes' iPTM scores revealed similar performance between the two methods on multiple targets.
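As a rough sketch of how such an iPTM comparison can be scripted (assuming the open-source AlphaFold-Multimer pipeline, whose result pickles contain an "iptm" entry; the directory layout and filename pattern below are assumptions, not the exact paths used in this work):

import pickle
from pathlib import Path

def best_iptm(run_dir: str) -> float:
    # Each AlphaFold-Multimer result pickle stores an "iptm" score for the
    # predicted complex; exact filename patterns vary across AlphaFold versions.
    scores = []
    for pkl in Path(run_dir).glob("result_model_*.pkl"):
        with open(pkl, "rb") as fh:
            result = pickle.load(fh)
        scores.append(float(result["iptm"]))
    return max(scores)

# Hypothetical layout: one AlphaFold-Multimer run per target-binder complex
for target in ["PD-L1", "APEX1", "UBC9", "PCNA", "CHIP", "GAPDH"]:
    print(target,
          "PPI-GPT iPTM:", best_iptm(f"af_runs/ppi_gpt/{target}"),
          "RFdiffusion iPTM:", best_iptm(f"af_runs/rfdiffusion/{target}"))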
Finally, we experimentally validated PPI-GPT-generated binder proteins by fusing them to E3 ubiquitin ligases, enabling intracellular degradation of target proteins. Western blotting shows that these “ubiquibodies” drive robust target degradation in various cellular contexts, motivating their application to undruggable target proteins for which no small molecules exist.
In total, we present a novel language model, PPI-GPT, that generates binding protein sequences conditioned on an input target protein sequence. By fusing these generated proteins to E3 ubiquitin ligase domains, we create a programmable protein-targeting architecture, enabling the degradation of various target proteins without requiring tertiary structure information. To the best of our knowledge, our degraders represent the first de novo-designed therapeutic proteins generated by a purely sequence-based deep learning model, unlocking a host of biotechnology applications.