Metrics: Rho vs R²

Why do pLM benchmarks keep using Spearman's ρ as a performance metric?

There is something that has been bothering me for quite a while. Some papers keep reporting Spearman's rank correlation coefficient, denoted as ρ (rho), as a metric for pLM performance when fine-tuned to predict protein properties. This is concerning because it can sometimes make results appear inflated. So I wonder, how does this metric compare with others, such as the coefficient of determination (R²), and how accurately does Spearman's ρ reflect the true values? Let’s dive into these metrics.

Spearman's rank correlation coefficient or Spearman's ρ

  • Measures monotonicity, in other words, how well your predictions preserve the order of the targets.

  • Ignores scale and offset: if your predictions are all shifted, or scaled by a positive factor, but keep the same order, ρ does not change. A perfectly ranked model still gets ρ = 1.0.
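
For reference, the two bullets above correspond to the standard textbook definition of ρ as the Pearson correlation computed on ranks. The symbols are introduced here for illustration and are not from the original post: y are the targets, ŷ the predictions, R(·) the rank of a value within its own list, and n the number of samples.

```latex
\rho = \frac{\operatorname{cov}\big(R(y),\, R(\hat{y})\big)}{\sigma_{R(y)}\,\sigma_{R(\hat{y})}}
\qquad \text{and, when there are no ties,} \qquad
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)},
\quad d_i = R(y_i) - R(\hat{y}_i)
```

Because only ranks enter the formula, replacing ŷ with aŷ + b (for any a > 0) leaves every R(ŷ_i) unchanged, and therefore leaves ρ unchanged.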

Coefficient of determination (R²)

  • Measures how close predictions are to the actual values in terms of variance explained.

  • Sensitive to both scale and offset: if you get the trend right but the scale wrong, R² can be low even when ρ is high.
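
As a reminder of why R² reacts to scale and offset, here is the usual definition in terms of sums of squares. The symbols are assumptions for illustration: y_i are the targets, ŷ_i the predictions, and ȳ the mean of the targets.

```latex
R^{2} = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}
      = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
```

A constant shift ŷ_i = y_i + b adds n·b² to the numerator even though the trend is perfect, so R² = 1 − n·b² / SS_tot. Under this definition R² can also become negative, which happens when the predictions are worse than simply predicting ȳ for every sample.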

What do “scale and offset” mean?

  • Scale: how stretched or shrunk your predictions are compared to the true values. Example: predicting everything 2× too large.

  • Offset: a constant shift up or down. Example: always predicting +1 higher than the truth.

Which metric should I use for my protein property prediction?

Spearman’s ρ measures the rank correlation between predicted and actual values. It captures whether the order of predictions matches the order of the true values, regardless of scale or offset. This means that even if predictions are systematically higher or lower (offset) or proportionally different (scale), Spearman’s ρ can still be perfect as long as the relative ranking is preserved (Figure 1). As such, ρ is particularly useful when the task depends more on correctly ranking samples than on predicting their exact values, such as assessing the effect of mutations on protein expression, where the goal is simply to predict whether expression will increase, decrease, or remain unchanged.

On the other hand, R² measures how close the predicted values are to the actual values in terms of squared error. If your model captures the right shape/trend but the line is in the wrong position (offset) (Figure 1, Case B) or is tilted (scale) (Figure 1, Case C), the squared errors grow and R² drops. Even if you capture the correlation perfectly, the numbers themselves need to match for R² to be high. Therefore, when absolute prediction accuracy matters (e.g., predicting protein stability as ΔΔG, protein binding affinity as Kd, or the toxicity of a drug expressed as percentage of hemolysis), R² becomes a key metric because it reflects how close your predictions are to the true measurements.

Figure 1. Comparison of true versus predicted values across three different cases: A) Perfect correlation between true and predicted values. B) Correct trend and scale, but wrong offset (+1). C) Correct trend but wrong scale (2× larger). These three cases demonstrate how offset shifts and scale changes in the predicted variable can mislead your metrics. Link to GitHub on Colab to reproduce the figure.
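
The three cases in Figure 1 can be checked numerically with a minimal sketch like the one below. It is not the notebook behind the figure: the toy values in `y_true` and the helper names (`ranks`, `pearson`, `spearman_rho`, `r2`) are assumptions made for illustration, and everything is written with the Python standard library so it runs without extra packages.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(y_true, y_pred):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(y_true), ranks(y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets (assumed values, not the data used in Figure 1).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
cases = {
    "A: perfect": list(y_true),
    "B: offset +1": [t + 1 for t in y_true],
    "C: scale 2x": [2 * t for t in y_true],
}

for name, y_pred in cases.items():
    print(f"{name:14s} rho={spearman_rho(y_true, y_pred):.2f}  R2={r2(y_true, y_pred):.2f}")
# A: perfect     rho=1.00  R2=1.00
# B: offset +1   rho=1.00  R2=0.50
# C: scale 2x    rho=1.00  R2=-4.50
```

With these toy numbers ρ stays at 1.0 in all three cases, while R² falls to 0.5 for the offset and becomes negative for the 2× scale.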

So, in protein property prediction, if your downstream task uses the predicted order of variants (like selecting top candidates for lab testing), Spearman's ρ can be more relevant. If you need accurate numerical estimates for simulation or quantitative analysis, R² is a better metric. The choice between ρ and R² often comes down to whether you care more about getting the rank order right or the exact values right. Consequently, reporting both is important, since a model can rank variants well (high ρ) but still be numerically off (low R²), or vice versa. In addition, other metrics can be valuable, such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
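
One way to follow this advice is to compute all of these metrics in a single report. The sketch below is only an illustration: `regression_report` and `rank_no_ties` are hypothetical helper names, the numbers are made up, and the ρ calculation uses the shortcut formula that is only valid when neither list contains tied values.

```python
from math import sqrt

def rank_no_ties(values):
    """1-based ranks, assuming all values are distinct (no tie handling)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def regression_report(y_true, y_pred):
    """Return rank-based and value-based metrics side by side."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Shortcut Spearman formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties.
    d2 = sum((a - b) ** 2
             for a, b in zip(rank_no_ties(y_true), rank_no_ties(y_pred)))
    return {
        "spearman_rho": 1 - 6 * d2 / (n * (n * n - 1)),
        "r2": 1 - (mse * n) / ss_tot,
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": mae,
    }

# Made-up example: the order is right, but the predicted range is compressed.
y_true = [0.1, 0.5, 0.9, 1.4, 2.0]
y_pred = [1.0, 1.1, 1.2, 1.3, 1.4]

report = regression_report(y_true, y_pred)
for metric, value in report.items():
    print(f"{metric:12s} {value:.3f}")
```

In this made-up case the ranking is perfect, so ρ is 1.0, while R², RMSE, and MAE all show that the predicted values are far from the measurements. For real datasets with ties, a library implementation that handles ties would be the safer choice.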

Coming back to what bothers me: some papers, when evaluating the performance of multiple models across different protein benchmark datasets, often report only ρ. Many of these datasets involve predicting fitness landscapes or the effects of specific protein mutations. However, ρ only tells us whether the model got the rank order of the variants right, not whether it got the actual values right. This can make results look better than they would if judged by R². It may look good on paper, but are we setting the bar too low for evaluating our models? Or should we aim for performance that can truly translate into quantitative results at the lab bench?

For disclosure, I know some will argue that R² assumes normally distributed target values and is sensitive to outliers, which can be problematic since biological data are often noisy and skewed. This is why Spearman's ρ can be the better choice: it is insensitive to scale and offset and can be more robust in such cases. However, when comparing a few dozen datasets, we cannot assume that all of them will violate these assumptions. My main concern is how we can evaluate our models as honestly and with as little bias as possible.

References

Schober, Patrick MD, PhD, MMedStat; Boer, Christa PhD, MSc; Schwarte, Lothar A. MD, PhD, MBA. Correlation Coefficients: Appropriate Use and Interpretation. Anesthesia & Analgesia 126(5):p 1763-1768, May 2018.

Mukaka MM. Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med J. 2012 Sep;24(3):69-71.

Chicco D, Warrens MJ, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci. 2021 Jul 5;7:e623.

Spearman's rank correlation coefficient. https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient. Accessed: Aug/13/2025.

Spearman’s correlation. https://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf Accessed: Aug/13/2025.

Coefficient of determination. https://en.wikipedia.org/wiki/Coefficient_of_determination. Accessed: Aug/13/2025.

The Coefficient of Determination, r-squared. https://online.stat.psu.edu/stat462/node/95/. Accessed: Aug/13/2025.
