More about metrics

The canprot package calculates chemical metrics of proteins from amino acid compositions. This vignette was compiled on 2024-03-28 with canprot version 2.0.0.

Previous vignette: Introduction to canprot

More details on chemical metrics

Carbon oxidation state (ZC) is calculated from elemental ratios, but stoichiometric hydration state (nH2O) depends on a specific choice of thermodynamic components (or basis species). In canprot, nH2O is calculated from a theoretical reaction to form proteins from the following set of basis species, abbreviated as QEC:

The rationale for this choice is described in Dick et al. (2020) and Dick (2022).

To see how it works, consider the formation reaction of alanylglycine, which can be written using functions in the CHNOSZ package:

##           C  H N O S ispecies logact state
## C5H10N2O3 5 10 2 3 0     1723   -3.2    aq
## C5H9NO4   5  9 1 4 0     1724   -4.5    aq
## C3H7NO2S  3  7 1 2 1     1721   -3.6    aq
## H2O       0  2 0 1 0        1    0.0   liq
## O2        0  0 0 2 0     2679  -80.0   gas
CHNOSZ::subcrt("alanylglycine", 1)$reaction
##      coeff          name   formula state ispecies model
## 2665     1 alanylglycine C5H10N2O3    cr     2665   CGL
## 1723    -1     glutamine C5H10N2O3    aq     1723   HKF

As it turns out, alanylglycine has the same elemental formula as glutamine, which is one of the basis species, so there are no other basis species in the reaction. That includes H2O, so we can deduce that nH2O is zero.

Let’s do the calculation with the nH2O() function to see this result. We have to specify terminal_H2O = 1 in order to account for the terminal -H and -OH groups on alanlglycine. As described below, the default for this setting is 0 because, most of the time, we want to measure the per-amino-acid contributions for proteins.

AG <- data.frame(Ala = 1, Gly = 1)
nH2O(AG, terminal_H2O = 1)
## [1] 0

Now let’s try an actual protein, chicken egg-white lysozome, which has the name LYSC_CHICK in UniProt with accession number P00698. The amino acid compositions of this and selected other proteins are available in the CHNOSZ package. This gets the amino acid composition and prints the protein length and ZC and nH2O:

AA <- CHNOSZ::pinfo(CHNOSZ::pinfo("LYSC_CHICK"))
##   6 
## 129
##          6 
## 0.01631321
##          6 
## -0.8868217

By the way, lysozyme (a secreted protein) is highly oxidized compared to cytoplasmic proteins, most of which have negative ZC.

To see where the value of nH2O comes from, we can write the formation reaction of LYSC_CHICK from the QEC basis species. Recall that we set the basis species above with CHNOSZ::basis("QEC"); that setting is used here to balance the reaction.

(reaction <- CHNOSZ::subcrt("LYSC_CHICK", 1)$reaction) # print the reaction
##      coeff          name             formula state ispecies
## 3487   1.0    LYSC_CHICK C613H959N193O185S10    aq     3487
## 1723 -66.4     glutamine           C5H10N2O3    aq     1723
## 1724 -50.2 glutamic acid             C5H9NO4    aq     1724
## 1721 -10.0      cysteine            C3H7NO2S    aq     1721
## 1    113.4         water                 H2O   liq        1
## 2679  60.8        oxygen                  O2   gas     2679
##               model
## 3487            HKF
## 1723            HKF
## 1724            HKF
## 1721            HKF
## 1    water.SUPCRT92
## 2679            CGL
(H2Ocoeff <- with(reaction, coeff[name == "water"])) # print the coefficient on H2O
## [1] 113.4

That says 113.4, but the value for nH2O above was -0.8868217. What happened? Let’s step through the logic:

A general observation: This is a theoretical reaction in terms of thermodynamic components, so we are not dealing with biochemical mechanisms here. That’s one reason for calling this approach geochemical biology.

Implementation details

To save time, nH2O() does not calculate the formation reaction for each protein but instead uses precomputed values of nH2O for each amino acid. The two methods give equivalent results, as described in Dick et al. (2020).

Similarly, Zc() uses precomputed values of ZC and nC (number of carbon atoms) for each amino acid. NOTE: Calculating ZC of proteins from amino acid frequencies (i.e. abundances or counts in a protein sequence) requires weighting the amino-acid ZC by the number of carbon atoms in each amino acid, in addition to weighting by amino acid frequency. Using the unweighted mean of ZC of amino acids leads to artificially higher values for proteins.

GRAVY, pI, and molecular weight

There are also functions for calculating the grand average of hydropathy (GRAVY) and isoelectric point (pI) of proteins. Below, values for representative proteins are compared with results from the ProtParam tool (Gasteiger et al., 2005) in UniProt.

We first look up a few proteins in CHNOSZ’s list of proteins, then get the amino acid compositions.

iprotein <- CHNOSZ::pinfo(c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU", "CSG_HALJP"))
AAcomp <- CHNOSZ::pinfo(iprotein)

Calculate GRAVY and compare it to reference values from

G_calc <- GRAVY(AAcomp)
G_ref <- c(-0.472, -0.663, -0.325)
stopifnot(all.equal(round(G_calc[1:3], 3), G_ref, check.attributes = FALSE))

Calculate pI and compare it to reference values.

pI_calc <- pI(AAcomp)
# Reference values calculated with ProtParam
# LYSC_CHICK: residues 19-147 (sequence v1)
# RNAS1_BOVIN: residues 27-150 (sequence v1)
# AMYA_PYRFU: residues 2-649 (sequence v2)
# CSG_HALJP: residues 35-862 (sequence v1)
pI_ref <- c(9.32, 8.64, 5.46, 3.37)
stopifnot(all.equal(as.numeric(pI_calc), pI_ref))

Calculate molecular weight (MW) and compare it to reference values

# Per-residue molecular weight multiplied by number of residues
MWcalc <- MW(AAcomp) * plength(AAcomp)
# Add terminal groups
MWcalc <- MWcalc + 18.01528
# Reference values for molecular weights of proteins
MWref <- c(14313.14, 13690.29, 76178.25)
stopifnot(all.equal(round(MWcalc[1:3], 2), MWref, check.attributes = FALSE))


Dick JM. 2022. A thermodynamic model for water activity and redox potential in evolution and development. Journal of Molecular Evolution 90(2): 182–199. doi: 10.1007/s00239-022-10051-7

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. 2005. Protein identification and analysis tools on the ExPASy server. In: Walker JM, editor. The Proteomics Protocols Handbook. Humana Press. pp. 571–607. doi: 10.1385/1-59259-890-0:571