The Exon structure similarity method achieved 100% recall and 51.8% precision on the reviewed dataset, identifying 53 potential false positives out of 14,000 genes. It also found 28 candidate genes in the unreviewed set.
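For readers less familiar with these metrics, the following is a minimal sketch of how recall, precision, and F1 relate to raw counts. The function names are illustrative, and the true-positive count of 57 is an assumption chosen only because, together with the 53 reported false positives, it is arithmetically consistent with the reported precision; it is not a figure stated in this report.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are true positives."""
    total = tp + fp
    return tp / total if total else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were recovered."""
    total = tp + fn
    return tp / total if total else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # ASSUMPTION: tp = 57 is hypothetical, picked to match the reported precision.
    # fp = 53 comes from the text; fn = 0 follows from 100% recall.
    tp, fp, fn = 57, 53, 0
    print(f"precision = {precision(tp, fp):.1%}")
    print(f"recall    = {recall(tp, fn):.1%}")
    print(f"F1        = {f1(tp, fp, fn):.1%}")
```

With perfect recall, every false positive lowers precision only, so a modest number of false positives relative to the true positives is enough to bring precision down to roughly one half.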
The Language Model approach showed promising results, reaching 77% accuracy in multi-class classification on a curated, challenging dataset.
The binary classifier for the Cytokine clan achieved a perfect F1 score and identified additional potential clan members among unreviewed proteins. Attention maps showed that the model focuses on signal peptide (SP) regions, a pattern observed in almost all of the identified candidates.
Overall, the project successfully developed a proof of concept for using Language Models in protein classification, enabling future work on more advanced families and categorizations.