Predictor for Condensate Proteins: MPI-CBG

The machine learning algorithm PICNIC can predict which proteins are involved in biomolecular condensates, regardless of their structure.

Membrane-less organelles called biomolecular condensates can concentrate hundreds of distinct proteins to carry out vital biological processes. Like oil droplets forming in water, these dynamic, liquid-like droplets form quickly, for example by phase separation, creating temporary compartments separated from the watery interior of the cell. In recent years, researchers have shown that condensates are involved in many physiological functions, such as DNA control, cell division, cellular signaling, and the nested structure of nucleoli in the cell nucleus. As a result, researchers are increasingly treating biomolecular condensates as a novel class of therapeutic targets. However, accurately identifying all of their components is still difficult, and existing approaches are biased toward proteins with a high degree of structural disorder. Proteins with structurally disordered regions tend to accumulate many sequence changes (mutations) over evolutionary time.

Researchers in the group of Agnes Toth-Petroczy at the Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG) and at the Center for Systems Biology Dresden (CSBD) have now developed a machine learning classifier (a type of algorithm) that is less biased toward proteins with a high level of disorder. The classifier, PICNIC (Proteins Involved in CoNdensates In Cells), accurately predicts proteins that form condensates by learning the amino acid patterns in protein sequences and structures, along with their intrinsic disorder features. Anna Hadarovich, one of the two lead authors of the publication in Nature Communications and a postdoctoral researcher in the Toth-Petroczy group, explains, “We trained the classifier with proteins from human. However, I was positively surprised to see how well the predictions of PICNIC worked on other species that it wasn't trained on. We proved this with previously published experimental data.” Hari Raj Singh, the second lead author and a postdoctoral researcher in the group of Anthony Hyman, a director at the MPI-CBG, performed the experimental validation of PICNIC. He says, “We tested 24 proteins predicted to be part of condensates in cells and found the tool to be about 82% accurate, regardless of how much structural disorder the proteins had.”

“We developed a machine-learning tool that can analyze condensate proteins across entire proteomes, the complete set of proteins produced by a cell, in different organisms. PICNIC shows that it can identify general patterns using only protein sequence information and structures derived from it across many different species,” says Agnes Toth-Petroczy, who oversaw the study, and continues, “These results can help us understand how biomolecular condensates have evolved and predict more proteins involved in condensates. This could also help identify protein targets for modifying diseased condensates and aid drug development.” PICNIC is available as an easy-to-use, open-source Python package, so anyone can apply it to any protein, whether synthetic or natural, from any species.

Original Publication

Anna Hadarovich, Hari Raj Singh, Soumyadeep Ghosh, Maxim Scheremetjew, Nadia Rostam, Anthony A. Hyman & Agnes Toth-Petroczy: PICNIC accurately predicts condensate-forming proteins regardless of their structural disorder across organisms. Nat Commun 15, 10668 (2024). https://doi.org/10.1038/s41467-024-55089-x