Volume 0 - Multidisciplinary Cancer Investigation | Multidiscip Cancer Investig 2025, 0 - Multidisciplinary Cancer Investigation: 20-30

Ahmadi M, Jahed-Motlagh M R, Asgari E, Rahmani A T. Language Model–Based Representation Learning for Venom Protein Identification and Therapeutic Target Discovery in Cancer. Multidiscip Cancer Investig 2025;
URL: http://mcijournal.com/article-1-416-en.html
1- Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
2- Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran, jahedmr@iust.ac.ir
3- Qatar Computing Research Institute
Abstract:
Venom is a complex mixture of bioactive molecules produced by venomous organisms for predation, defense, or intraspecific competition, often leading to specific physiological responses in target organisms. Venom-derived peptides and proteins have recently attracted attention in biomedical research for their potential therapeutic applications, including anticancer drug discovery. However, venom sequences constitute a highly divergent class of proteins, which makes identifying them by machine learning or homology search particularly challenging. To address this, we propose ToxVec, a transfer-learning-based framework for automatic representation learning of protein sequences aimed at improving venom identification. Our approach leverages pre-trained protein language models to capture sequence-level information without manual feature engineering. ToxVec outperforms existing feature-based models, achieving a macro-F1 score of 0.89. Furthermore, an ensemble model trained on multiple balanced subsets raises performance to a macro-F1 of 0.93, representing a 7% improvement over the state of the art. Beyond benchmark performance, screening of experimentally validated anticancer peptides from the CancerPPD2 dataset revealed that many exhibit high venom-like signatures according to ToxVec, supporting the notion that toxin-inspired molecular architectures may underlie anticancer bioactivity. We further discuss how language model–based representation learning embodies a Cognitive Mind–Body–Inspired interpretation, linking abstract sequence semantics (the “mind”) to biological function (the “body”). By enabling more accurate large-scale identification of venom proteins, ToxVec provides a foundation for systematically exploring venom-derived bioactive peptides as potential therapeutic candidates, including those targeting pathways implicated in breast cancer progression and metastasis. This automated approach thus bridges computational protein informatics with translational oncology, supporting future efforts in bioactive-peptide-based anticancer research.
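The abstract reports macro-F1 scores and an ensemble trained on multiple balanced subsets, but it does not say how the subsets are drawn or how the ensemble members are combined. The sketch below is only an illustration under stated assumptions: each subset undersamples every class to the size of the smallest class, and member predictions are combined by majority vote. The function names `macro_f1`, `balanced_subsets`, and `majority_vote` are hypothetical and do not come from ToxVec.

```python
import random
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_subsets(labels, n_subsets, seed=0):
    """Assumed scheme: undersample every class to the smallest class size.

    Returns n_subsets lists of sample indices, one per ensemble member.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    size = min(len(members) for members in by_class.values())
    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for members in by_class.values():
            chosen.extend(rng.sample(members, size))
        subsets.append(sorted(chosen))
    return subsets


def majority_vote(predictions_per_model):
    """Assumed combiner: most frequent label across ensemble members."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]
```

In use, one classifier would be trained on the embeddings selected by each index list from `balanced_subsets`, and the combined predictions from `majority_vote` would be scored with `macro_f1`. Macro averaging weights the venom and non-venom classes equally, which matters when one class is much rarer than the other.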


Article type: Original/Research Article | Subject: Prevention, Early Detection and Screening
Received: 2025/10/6 | Accepted: 2025/10/28 | ePublished: 2025/12/1


Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© 2025 CC BY-NC 4.0 | Multidisciplinary Cancer Investigation
