For Partners | Y-Data Advanced Data Science Program

Project in detail

Finding Genes With Similar Functional Homology

Authors

Eli Birkan

Tal Brender

Y-DATA Mentor

Shani Kotler

Industry Partner

Roy Granit

Compugen is a company focused on predictive drug discovery and development, particularly in cancer treatments.

This project aimed to develop a proof of concept for using Protein Language Models to discover and classify proteins by functionality. The focus was on the 4-Helical Cytokine Clan, which includes proteins characterized by their shape and immune system activity.

Finding unknown genes could lead to new cancer treatments. The challenge lies in identifying remote homologues, as proteins with dissimilar sequences can still belong to the same family. This project explores both traditional methods and advanced machine learning techniques to address this challenge.

Goals

Basic goal: Reproduce Compugen's experiment using the Exon Hypothesis

Advanced goal: Utilize Protein Language Models to classify and discover functionally similar proteins

Methods

Repurposed the Smith-Waterman algorithm to work with exon lengths and phases.
Used ESM2, a transformer-based model trained on 60M proteins, and developed a classification pipeline on top of its embeddings.
Created a curated, challenging dataset using similarity-based clustering to ensure robust evaluation of model generalization and remote homology detection.
Trained MLP classifiers for multi-class and binary classification.
Explainability: implemented the Attention Rollout technique for model interpretation.
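
To make the first method more concrete, here is a minimal sketch of Smith-Waterman local alignment applied to exon structures instead of residues. It is not the project's code: the `Exon` representation, the `exon_score` rule (same phase and length within a tolerance), and the score and gap values are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed representation: (exon length in nucleotides, phase 0/1/2).
Exon = Tuple[int, int]


def exon_score(a: Exon, b: Exon, tol: int = 3) -> int:
    """Hypothetical scoring rule: reward equal phase and near-equal length."""
    same_phase = a[1] == b[1]
    close_length = abs(a[0] - b[0]) <= tol
    return 2 if (same_phase and close_length) else -1


def smith_waterman(x: List[Exon], y: List[Exon], gap: int = -1) -> int:
    """Return the best local alignment score between two exon lists."""
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + exon_score(x[i - 1], y[j - 1])
            up = h[i - 1][j] + gap
            left = h[i][j - 1] + gap
            # Local alignment: scores never drop below zero.
            h[i][j] = max(0, diag, up, left)
            best = max(best, h[i][j])
    return best


# Two made-up genes that share a run of three similar exons.
gene_a = [(120, 0), (87, 1), (150, 2), (99, 0)]
gene_b = [(300, 1), (88, 1), (151, 2), (98, 0), (45, 2)]
score = smith_waterman(gene_a, gene_b)
```

Because the alignment is local, a conserved run of exons can score highly even when the flanking exons differ, which is the property that makes the approach attractive for genes whose sequences have diverged.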

Results

The exon structure similarity method achieved 100% recall and 51.8% precision on the reviewed dataset, identifying 53 potential false positives out of 14,000 genes. It also found 28 candidate genes in the unreviewed set.
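
As a sanity check on how these figures relate, the sketch below works backwards from the reported precision and false-positive count. It assumes the 53 potential false positives are exactly the flagged genes that are not known clan members. The resulting count of about 57 true positives is an inference from that assumption, not a number reported by the project.

```python
def precision(tp: int, fp: int) -> float:
    """Share of flagged genes that are true members."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of true members that were flagged."""
    return tp / (tp + fn)


def implied_true_positives(precision_value: float, fp: int) -> float:
    # precision = tp / (tp + fp)  =>  tp = precision * fp / (1 - precision)
    return precision_value * fp / (1.0 - precision_value)


# Reported values: 51.8% precision, 53 potential false positives.
implied_tp = round(implied_true_positives(0.518, 53))

# Check that the implied count reproduces the reported metrics.
check_precision = round(precision(implied_tp, 53), 3)
check_recall = recall(implied_tp, 0)  # 100% recall means no false negatives
```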

The Language Model approach demonstrated promising results, with 77% accuracy on a curated challenging dataset for multi-class classification.
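
The curated dataset is described as built with similarity-based clustering. One common way to use such clusters, sketched below, is to keep every cluster entirely on one side of the train/test split so that the test proteins are not near-duplicates of the training proteins. This is an illustration only: `similarity` uses `difflib.SequenceMatcher` as a stand-in for a real sequence identity measure, and the threshold, the greedy clustering, and the split rule are assumptions rather than the project's procedure.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Stand-in for sequence identity; a real pipeline would use an alignment tool."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(seqs: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Put each sequence in the first cluster whose representative is similar enough."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        for cluster in clusters:
            if similarity(seq, seqs[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])  # no similar cluster found: start a new one
    return clusters


def split_by_cluster(
    clusters: List[List[int]], test_fraction: float = 0.3
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to test until the target share is reached."""
    total = sum(len(c) for c in clusters)
    train: List[int] = []
    test: List[int] = []
    for cluster in sorted(clusters, key=len):
        if len(test) < test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Toy sequences: two near-duplicate pairs and one singleton.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGSSSGGGS", "GGGSSSGGGA", "PLWVHCDEFN"]
clusters = greedy_cluster(toy)
train_idx, test_idx = split_by_cluster(clusters)
```

Splitting by cluster rather than by individual protein makes the reported accuracy a stricter estimate of generalization to remote homologues.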

The binary classifier for the Cytokine clan achieved perfect F1 scores and identified additional potential clan members in unreviewed proteins. Attention maps provided insights into the model's focus on signal peptide (SP) regions, which was observed in almost all of the identified candidates.
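
For readers unfamiliar with Attention Rollout, the sketch below shows the general idea on plain Python lists: average the attention heads in each layer, mix in the identity matrix to account for residual connections, and multiply the per-layer matrices from the first layer to the last. It is a generic illustration with toy attention values, not the project's implementation, and the equal 0.5 weighting of attention and identity is the commonly used default rather than a setting confirmed by the source.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mean_heads(heads: List[Matrix]) -> Matrix:
    """Average the attention matrices of all heads in one layer."""
    n = len(heads[0])
    return [
        [sum(h[i][j] for h in heads) / len(heads) for j in range(n)]
        for i in range(n)
    ]


def attention_rollout(layers: List[List[Matrix]]) -> Matrix:
    """layers[l][h] is the token-by-token attention of head h in layer l."""
    n = len(layers[0][0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for heads in layers:  # first layer to last
        attn = mean_heads(heads)
        # Add the residual connection, then renormalize each row to sum to 1.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy input: 2 layers, 1 head each, 2 tokens; every token attends to token 0.
toy_layer = [[[1.0, 0.0], [1.0, 0.0]]]
result = attention_rollout([toy_layer, toy_layer])
```

Each row of the result estimates how much every input position contributes to that token after all layers, which is the kind of map that can highlight a region such as a signal peptide.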

Overall, the project successfully developed a proof of concept for using Language Models in protein classification, enabling future work on more advanced families and categorizations.
