Problem formulation for enzyme active site prediction
In previous studies of enzyme active site identification, the prediction task was typically defined as either (1) amino acid residue token classification in a given enzyme’s amino acid sequence \({{{\mathcal{A}}}}^{E}=\{{a}_{1},\,\ldots,\, {a}_{n}\}\), or (2) graph node \({{{\mathcal{V}}}}_{i}\) classification in a given enzyme’s graph \({{{\mathcal{G}}}}^{E}=\left({{\mathcal{V}}}{{\mathcal{,\, }}}{{\mathcal{E}}}\right)\). These studies mainly emphasized the sequence or structural information of the enzyme, while the consideration of reaction information was insufficient. In this work, we aim to further improve the task of enzyme active site identification by taking into account of the inclusion of the corresponding enzymatic reactions. The input for this task not only includes the structural information of the enzyme (stored in PDB format) but also the chemical reaction sequence information (stored in reaction SMILES22 format). The structural information of the enzyme is converted into a graph representation through data processing, denoted as \({{{\mathcal{G}}}}^{E}=({\widetilde{{{\mathcal{V}}}}}^{E},\, {{{\mathcal{E}}}}^{E},\,{{{\mathcal{R}}}}^{E})\), where the \({{{\mathcal{V}}}}_{i}^{E}\) node represents the input feature of the i-th amino acid residue, incorporating its sequence information \({a}_{i}\). This can extract the amino acid sequence \({{{\mathcal{A}}}}^{E}=\{{a}_{i}\}\) from the graph structure of the enzyme, and \({{{\mathcal{E}}}}^{E}\) and \({{{\mathcal{R}}}}^{E}\) denote the sets of edges and edge relational types, respectively. The enzyme reaction SMILES is processed and converted into a graph representation, denoted as \({{{\mathcal{G}}}}^{R}=\{{{{\mathcal{G}}}}^{S},\,{{{\mathcal{G}}}}^{P}\}\) where \({{{\mathcal{G}}}}^{S}\) is the substrates graph of the enzyme reaction, denoted as \({{{\mathcal{G}}}}^{S}=({{{\mathcal{V}}}}^{S},\, {{{\mathcal{E}}}}^{S})\), and \({{{\mathcal{G}}}}^{P}\) is the products graph of the enzyme reaction, denoted as \({{{\mathcal{G}}}}^{P}=({{{\mathcal{V}}}}^{P},\, {{{\mathcal{E}}}}^{P})\).
In the context of the given enzyme structure graph \({{{\mathcal{G}}}}^{E}\) and reaction graph \({{{\mathcal{G}}}}^{R}\), the objective of the enzyme active site prediction task is to train a model \({{\mathcal{M}}}\). This model is capable of mapping the joint feature representation space \({{{\mathcal{G}}}}^{E}\times ({{{\mathcal{G}}}}^{S}\cup {{{\mathcal{G}}}}^{P})\) to a binary probability vector \({{\boldsymbol{P}}}=({p}_{1},\, {p}_{2},\,\ldots,\, {p}_{n})\), where \({p}_{i}\,\in \,[0,\,1]\) and \(n\) represents the length of the enzyme sequence. Alternatively, the model can map to a multi-class probability matrix:
$${{\boldsymbol{P}}}=\left[\begin{array}{cccc}{p}_{11} & {p}_{21} & \cdots & {p}_{n1}\\ {p}_{12} & {p}_{22} & \cdots & {p}_{n2}\\ \vdots & \vdots & \ddots & \vdots \\ {p}_{1m} & {p}_{2m} & \cdots & {p}_{{nm}}\end{array}\right],\,{p}_{{ij}}\,\in \,\left[0,\,1\right],\,{\sum }_{j=1}^{m}{p}_{{ij}}=1$$
(1)
where \(m\) indicates the potential functional roles (including scaffold, binding site, catalytic site, or other specific sites) for each amino acid node, \(i\) is the index of the amino acid, and \(j\) is the index for the type of functional role.
EasIFA framework
The EasIFA framework proposed is shown in Fig. 1. Given an enzyme-reaction pair as the input, we use two branches to represent the features of the enzyme and the reaction, respectively. The feature representation of the enzyme is divided into three stages, as illustrated in Supplementary Fig. S2a. In the initial stage, ESM-223 is used to convert the sequence of amino acid residues into a protein language representation. In the second stage, this protein language representation of each amino acid residue serves as the node feature in the enzyme graph \({{{\mathcal{G}}}}^{E}\), which is then input into GearNet24. Here, a message passing mechanism updates the node features of the enzyme. Subsequently, the node features of the enzyme graph are created by concatenating the protein language representation of the amino acid residues with the updated features of the enzyme graph. In the third stage, BridgeNet25 is used to map these features to the same feature size as that of the reaction information.
The feature representation of the reaction is divided into two branches as displayed in Fig. 1, which independently represent the features of substrate molecules and product molecules. After utilizing an MPNN26 to update the features of the substrate and product molecules, a substrate-product interaction network based on the attention mechanism is employed to merge the features of the product molecules with those of the substrate molecules, resulting in a substrate molecule graph that carries product information. Following the embedding of features from both the enzyme and the reaction, the information on the substrate molecule graph is merged onto the enzyme graph via an enzyme-reaction interaction network based on the attention mechanism. It is worth noting that the enzyme-reaction interaction network based on the attention mechanism differs from the substrate-product interaction network, which uses atom-wise distance-aware global attention25,27 in its self-attention mechanism.
Once the information integration is complete, the amino acid residue activity annotation network known as Multi-layer Perceptron Residue Activity Predictor, as shown in Fig. 1, is utilized to predict the activity type of each amino acid residue. We evaluated two variations of Multi-layer Perceptron Residue Activity Predictor for accomplishing two tasks: (1) the identification of active sites and (2) the assignment of active site types.
Performance evaluation strategies and metrics
To conduct the main benchmark test, we constructed the SwissProt-ECReact enzyme-reaction active site annotation dataset (SwissProt E-RXN ASA dataset) based on the UniprotKB/Swiss-Prot28 and ECReact dataset29. Additionally, we retrieved all enzyme and reaction data from the MCSA30 database to establish the MCSA enzyme catalytic site annotation dataset (MCSA E-RXN CSA dataset) for knowledge base transfer experiments. Using enzyme sequences from Swiss-Prot, we selected experimentally validated structures (with PDB records) as the validation and test sets for enzyme structures. To ensure the independence of the training, validation, and test sets, we excluded any enzyme structures with more than 80% amino acid sequence similarity to the enzymes in the training set. After dividing the ECReact dataset into the training, validation, and test sets in an 8:1:1 ratio, we utilized the EC Number as a bridge to match enzyme catalytic reactions with enzyme structures. This matching process was conducted with a maximum match multiplier of 100, generating enzyme-reaction pairs. The final dataset numbers were training set: validation set: test set = 102944: 4711: 892. We then collected and standardized the active site annotations from UniProt, which includes three categories of enzyme active sites and the index of the active sites as the prediction labels, The three types of active sites are: binding sites, catalytic sites, and other sites. It is worth noting that in UniProt, amino acid residues that directly participate in chemical reactions are referred to as the ‘active site’, which is synonymous with the ‘catalytic site’ mentioned in this paper. In the test set, we evaluated and reported the model’s performance across different sequence identity intervals to the training set sequences (0~40%, 40~50%, 50~60%, 60~70%, and 70~80%), and compared it with benchmark methods. Additionally, in the Supplementary Section 5, we provided the evaluation results of the model’s performance based on the maximum TM-Score31 intervals with the training set (0~0.2, 0.2~0.5, and 0.5~1), and conducted comparisons with benchmark methods as well.
The MCSA E-RXN CSA dataset was also divided into the training, validation, and test sets with a maximum amino acid sequence similarity threshold of 80%, and the validation and test sets did not include any data with more than 80% similarity to the training set of the SwissProt E-RXN ASA dataset. In this dataset, all categories of active sites were classified as “catalytic site”, The MCSA E-RXN CSA dataset was then divided into the training set: validation set: test set = 781:88:82, serving as the evaluation dataset for the knowledge base transfer experiment. For more detailed dataset processing steps, please refer to the Dataset Curation subsection in the Methods section.
In the performance evaluation of algorithm, we conducted separate assessments for the model on two tasks: the active site location annotation task (a binary classification task to distinguish whether an amino acid residue is an active site) and the active site type annotation task (a multi-class classification task to predict the type of activity of amino acid residues). For the active site location annotation task, we utilized five metrics: precision, recall, false positive rate (FPR), F1 score, and Matthews correlation coefficient (MCC). For the active site type annotation task, we reported the recall for each activity category and the average MCC across multiple activity categories. Each metric was calculated individually for each validation/test sample, with the final report presenting the average value across all samples. The specific formulas used are as follows:
Active site location annotation task:
$${Precision}=\frac{{TP}}{{TP}+{FP}}$$
(2)
$${Recall}=\frac{{TP}}{{TP}+{FN}}$$
(3)
$${FPR}=\frac{{FP}}{{FP}+{TN}}$$
(4)
$$F1=2\times \frac{{Precision}\times {Recall}}{{Precision}+{Recall}}$$
(5)
$${MC}{C}_{{bin}}=\frac{{TP}\times {TN}-{FP}\times {FN}}{\sqrt{\left({TP}+{FP}\right)\left({TP}+{FN}\right)\left({TN}+{FP}\right)\left({TN}+{FN}\right)}}$$
(6)
Here, \({TP}\) refers to the number of correctly predicted active sites in an enzyme sample, \({FP}\) represents the number of incorrectly predicted active sites, \({TN}\) stands for the number of correctly predicted inactive sites, and \({FN}\) is the number of incorrectly predicted inactive sites. The final reported scores are the average values across all test set samples.
Active site type annotation task:
$${Recal}{l}_{k}=\frac{T{P}_{k}}{T{P}_{k}\,+F{N}_{k}}$$
(7)
$${FP}{R}_{k}=\frac{F{P}_{k}}{F{P}_{k}+T{N}_{k}}$$
(8)
$${MC}{C}_{{multi}}=\frac{c\times s-{\sum }_{k}^{K} \, {p}_{k}\times {t}_{k}}{\sqrt{\left({s}^{2}-{\sum }_{k}^{K}{p}_{k}^{2}\right)\times \left({s}^{2}-{\sum }_{k}^{K}{t}_{k}^{2}\right)}}$$
(9)
Here, for the calculation of MCC in multi-classification, it is necessary to first define a confusion matrix C for K categories, where \({t}_{k}={\sum }_{i}^{K}{C}_{{ik}}\) represents the number of times class k truly occurred, \({p}_{k}={\sum }_{i}^{K}{C}_{{ki}}\) represents the number of times class k was predicted, \(c={\sum }_{k}^{K}{C}_{{kk}}\) represents the total number of samples correctly predicted, and \(s={\sum }_{i}^{K}{\sum }_{j}^{K}{C}_{{ij}}\) represents the total number of samples. \(T{P}_{k}\) represents the number of samples belonging to category k that are correctly predicted as category k. \(F{N}_{k}\), which represents the number of samples from category k that are misclassified into other categories. \(F{P}_{k}\) represents the number of samples that do not belong to k but are incorrectly predicted as category k. \(T{N}_{k}\), represent the number of samples that are neither actually nor predicted as category k.
Improved annotation accuracy and speed
In this study, we compared the EasIFA-ESM and EasIFA-SaProt algorithm with three other representative algorithms (as the baselines) on the SwissProt E-RXN ASA dataset, namely, BLASTp12, AEGAN15, and Schrodinger-SiteMap13,14. The primary difference between EasIFA-ESM and EasIFA-SaProt lies in the enzymatic sequence representation modules. Specifically, EasIFA-ESM utilizes ESM-2-650M as the model for representing enzyme sequences, whereas EasIFA-SaProt employs the SaProt-650M-AF model. The SaProt32 model, as an enhancement of ESM-2, integrates 3D structural encodings calculated by Foldseek33. Our evaluation encompassed not only the algorithms’ capability in labeling the active site positions of enzymes (a binary classification task for amino acid residues) but also their ability to predict the categories of enzyme active sites (a multiclass classification task for amino acid residues). The comparison results are concisely presented in Table 1. It is worth noting that: (1) The EasIFA-ESM/EasIFA-SaProt algorithm comes in two versions: EasIFA-ESM-bin/EasIFA-SaProt-bin, which solely focuses on annotating the positions of catalytic sites in amino acid residues, and EasIFA-ESM-multi/EasIFA-SaProt-multi, which not only annotates active site positions but also identifies the categories of active sites, including binding site, catalytic site and other sites; (2) Given the challenge associated with retraining AEGAN, we utilized the model state published in the original paper for testing purposes. Since a portion of our test set overlaps with their training set (approximately 25.22% of the test set), we report not only the average performance across the entire test set but also the average performance on the non-overlapping test data. To compare the computational capabilities of the algorithms on large-scale annotation tasks, we also compared the time consumption for inference by each algorithm under the same conditions, with results presented in Table 2.
The results in Table 1 demonstrate that the EasIFA-ESM/EasIFA-SaProt models outperform other baseline methods in annotation quality. In the active site location annotation task, EasIFA displays remarkable precision, recall, false positive rate (FPR), F1 score, and Matthews Correlation Coefficient (MCC). In the active site type annotation task, EasIFA-ESM/EasIFA-SaProt achieves recall comparable to the specialized AEGAN model for ‘catalytic sites’, but with a significantly lower FPR, indicating a reduced number of false positives. In the enzyme binding site identification task, the performance of EasIFA-ESM/EasIFA-SaProt is markedly superior to that of Schrodinger-SiteMap. The performance of EasIFA relative to Schrodinger-SiteMap is attributed to differences in algorithm design, Schrodinger-SiteMap’s reliance on empirical rules, and variations in the knowledge bases associated with the evaluation datasets. Moreover, when utilizing sequence alignment, BLASTp, which leverages larger databases and knowledge bases, significantly reduces FPR and notably enhances precision, f1 score, and MCC. However, it still lags significantly behind the EasIFA-ESM/EasIFA-SaProt models, with a 10.15% gap in f1 score and a 0.1012 gap in MCC. Comparing the two variants of the EasIFA model, EasIFA-ESM and EasIFA-SaProt, their performance in active site location annotation is quite similar. EasIFA-ESM-bin shows slightly higher precision, FPR, F1 score, and MCC, albeit with a marginally lower recall. In comparisons between the multi-class versions of the models, EasIFA-SaProt-multi demonstrates a slight advantage over EasIFA-ESM-multi in FPR, but EasIFA-ESM-multi slightly outperforms in other metrics. For the active site type annotation task, EasIFA-ESM-multi is slightly superior to EasIFA-SaProt-multi in all metrics except FPR.
To more clearly demonstrate the predictive capabilities of the EasIFA model and baseline methods across various sequence identity levels between the test and training samples, we used CD-HIT (version 4.8.1) to divide the test set into five subsets, each with a different level of sequence identity compared to the enzyme sequences in the training set: 0~40%, 40~50%, 50~60%, 60~70%, and 70~80%. We subsequently calculated the F1 scores, MCC, Recall, and FPR for each test subset, and the results are displayed in Fig. 2a–d. It is evident from the figures that the predictive performances of EasIFA-ESM-bin, EasIFA-SaProt-bin, and BLASTp are significantly better than those of AEGAN and Schrodinger-SiteMap across all sequence identity intervals. Moreover, the performance of these three algorithms declines as sequence identity decreases. Notably, the performance drop is more pronounced for the two variants of BLASTp, each utilizing a different sequence alignment database (see notes ① and ④ in Table 1), compared to the other algorithms. In the 0~40% sequence identity subset, the gap in F1 score and MCC between EasIFA-SaProt-bin and BLASTp widens to 15.23% and 0.1629, respectively. Except for the highest sequence identity subset of 70~80%, where the performance of the two EasIFA variants is comparable to that of BLASTp, EasIFA demonstrates a clear advantage in other sequence identity subsets. Additionally, EasIFA maintains a significantly lower false positive rate across all sequence identity intervals, consistently outperforming BLASTp and other baseline methods. The performance data based on the protein topological similarity metric TM-Score for the test set reveals consistent trends with the sequence identity-based analysis, as seen in the Supplementary Information Fig. S3.
Table 2 demonstrates that the EasIFA algorithm has exceptional inference speed, averaging only 0.144 s (EasIFA-ESM-bin) to annotate active sites for one enzyme. The BLASTp algorithm also exhibits relatively fast annotation speed, which can be significantly boosted by increasing the number of CPU threads used for sequence alignment. Nevertheless, even when utilizing the same structural database of enzymes, it is still marginally slower compared to the EasIFA algorithm. Although enhancing the annotation quality with a large-scale comparison database improves its quality, it still falls significantly short of the EasIFA algorithm, and the time required is approximately 10 times that of the EasIFA algorithm. The annotation quality of AEGAN is similar to that of EasIFA for catalytic site annotation tasks, but it relies heavily on the computation of PSSM, leading to lower inference efficiency. It takes about 1300 times longer than the EasIFA algorithm. Similarly, Schrodinger-SiteMap has significantly lower computational efficiency compared to the EasIFA algorithm. These results highlight the active sites annotation quality of the EasIFA algorithm, coupled with its high inference efficiency. Moreover, although most sequences in UniProt can now be linked to 3D structures inferred by AlphaFold, we still provide the inference speed of the EasIFA pure sequence version (EasIFA-NG-bin) in Table 2. This variant requires only 0.127 s on average to complete the inference of a single sample, offering a rapid solution for annotating active sites without 3D structures. Although its performance is not as high as that of EasIFA-ESM-bin and EasIFA-SaProt-bin, it still significantly outperforms other benchmark methods, achieving an F1 score 5.42% higher and an MCC 0.0588 higher than BLASTp. In scenarios lacking enzyme 3D structures, one can either utilize the swift EasIFA-NG-bin or rapidly infer 3D structures using ESMfold2 and then apply the complete EasIFA-ESM-bin version. In such cases, the average total inference time per sample in the test set (including prediction of 3D structures and annotation of active sites) is 19.4 s. In most real-world scenarios, the 3D structures of enzymes stored in the AlphaFold Protein Structure Database are readily accessible for inference, eliminating the need for individual 3D structure prediction. Although the EasIFA-ESM-bin model was trained on AlphaFold2-inferred structures, using ESMfold2-inferred structures for active site prediction only leads to a minor decline in annotation quality (a 0.38% drop in F1 score and 0.0005 decrease in MCC compared to EasIFA-ESM-bin). Specific performance details are provided in the Ablation Study section. We also provided a comparison of the number of trainable parameters for various variants of EasIFA and AEGAN in Supplementary Table S3.
Ablation study
In this section, we conducted ablation experiments to evaluate the influence of various factors on the annotation of enzyme active sites: the inclusion or exclusion of chemical reaction information, the pretraining of chemical reaction branches, different pretrained reaction representation schemes, the utilization of 3D graph structures to represent enzymes, and different sequence representations of enzymes. The variations in model configurations for each ablation experiment are illustrated in Supplementary Fig. S6. The results are summarized in Table 3, and the ROC and PRC curves for different EasIFA variants are visualized in Fig. 2e, f. To evaluate the predictive capabilities of models without chemical reaction information, we introduced EasIFA-E-bin, a variant that excludes the reaction representation branch and the enzyme-reaction interaction network from the EasIFA algorithm. We also examined EasIFA-RS-bin, a variant trained solely on the SwissProt E-RXN ASA dataset for reaction representation. Furthermore, to explore the impact of different pretrained reaction representations, we replaced the reaction embedding branch with RXNFP34, resulting in EasIFA-RXNFP-bin. Additionally, we explored the importance of the enzyme’s 3D structure representation module, GearNet, by excluding it to create EasIFA-NG-bin. Finally, to investigate the impact of 3D structured sequence representations calculated by Foldseek, we replaced the enzyme sequence representation method with SaProt, yielding the variant EasIFA-SaProt-bin.
Our study reveals that incorporating reaction branch information significantly enhanced the predictive performance of the EasIFA model, with a 3.79% increase in F1 score, 0.0388 improvement in MCC, and 0.0375 gain in AUPRC. This indicates the value of reactions related to enzyme specificity in annotating enzyme active sites, and the effective integration of the enzyme-reaction interaction network also improves the quality of model annotations. However, the ablation experiments on the reaction branch trained from scratch (EasIFA-RS-bin) suggest that inadequate representation of reaction information could confuse the predictions of the EasIFA model, leading to a significant decline in active site annotation quality. Specifically, this configuration achieves the lowest performance among all variants, with F1 score, MCC, and AUPRC lower by 4.27%, 0.0602, and 0.0666 respectively compared to EasIFA-ESM-bin. This suggests that accurately representing reactions based solely on a limited set of enzyme reactions may be challenging. Therefore, pretraining on a broader dataset of organic reactions is crucial for enhancing the representational capabilities of the reaction branch in annotating enzyme active sites. For the EasIFA-RXNFP-bin variant, where the reaction representation was switched to RXNFP, its performance is very close to that of EasIFA-E-bin, which does not include a reaction representation branch. This suggests that the RXNFP representation neither augmented nor hindered the model’s capabilities, indicating that a pretrained graph network representation based on atom-wise distance awareness is more suitable for this prediction task compared to the purely sequence-based RXNFP representation. The EasIFA-NG-bin model, which lacks the GearNet, exhibited a decline in performance compared to EasIFA-ESM-bin, with drops of 3.32% in F1 score, 0.0333 in MCC, and 0.0326 in AUPRC. However, this reduction was compensated by a reduced computational load and faster inference speed, with an average inference time per test sample reduced by 0.017 s. For the EasIFA-SaProt-bin variant, which changed the sequence representation to SaProt, the recall improved by 1.54% compared to EasIFA-ESM-bin. While other scores were slightly lower, the differences were negligible, indicating that the 3D structured representation information from Foldseek had a minor impact on EasIFA, which already included GearNet.
Beyond examining the aforementioned EasIFA variants, we also investigated the impact of using different PDB structure sources that represent the same structure. We replaced the structures of enzymes inferred by AlphaFold2 in the test set with the structures inferred by ESMFold2. The test results, shown in Table 3, indicate a slight decrease in model performance. However, the decline is negligible, demonstrating the strong robustness of EasIFA against variations in structural data distributions.
Moreover, we compared the differences of the ROC and PRC curves between EasIFA-ESM-bin and EasIFA-E-bin for the test samples with varying levels of sequence identity to the enzymes in the training set. As shown in Fig. 3, the AUPRC gap between EasIFA-ESM-bin and EasIFA-E-bin widens as enzyme sequence identity increases. This suggests that the enzyme-reaction interaction network can improve model performance more effectively when it can assess more similar information from the reaction branch. This is due to the fact that as enzyme sequence identity increases, the reaction patterns catalyzed by the enzymes also become more similar. The attention mechanism within the enzyme-reaction interaction network can then extract this information more easily, thereby enhancing model performance. Therefore, reaction information not only aids in identifying distant relatives but also enhances performance for all samples. In fact, the higher the enzyme similarity, the more valuable the information extracted by the interaction network, thereby further improving active site annotation performance.
Case study
We visualized the active site annotations generated by the EasIFA algorithm on the Swiss-Prot E-RXN ASA test set, presented in Fig. 4a. The left panel of Fig. 4a depicts the model’s annotation of protein tyrosine phosphatases (UniProt ID: Q4G0W2, EC Number: 3.1.3.48), enzymes that catalyze tyrosine dephosphorylation. The EasIFA model (default as EasIFA-ESM-bin) accurately predicted the active site cysteine residue at position 103, which is recorded in the UniProt database and functions as a nucleophile that reacts with the phosphate group. The right panel illustrates the annotation results for carnosine N-methyltransferase (UniProt ID: P53934, EC Number: 2.1.1.22), a transferase enzyme that catalyzes the conversion of S-adenosyl-L-methionine and carnosine to S-adenosyl-L-homocysteine. For this enzyme, UniProt only provides information on its binding sites, which are widely distributed across the amino acid residue sequence, but our model was able to accurately identify all the substrate binding sites within the active pocket. Notably, even though the ASN274 residue is not annotated in the UniProt database, it also positioned in the active pocket and holds some relevance.
EasIFA can accurately identify the catalytic active residues in TIM Barrel structures with different catalytic functions. The TIM Barrel, composed of eight strands and eight helices, is one of the most abundant folds in nature and participates in numerous biological processes. We visualized all EasIFA predictions for the SwissProt E-RXN ASA test set containing TIM Barrel structures. Figure 4b shows two enzymes with significantly different functions but shared TIM Barrel structures. The active sites predicted by EasIFA are marked on the enzymes’ 3D structures, and detailed information on the active residues is displayed in the table in the middle of Fig. 4b. The left panel shows the EasIFA’s annotation for Malate synthase A (UniProt ID: P08997, EC Number: 2.3.3.9), which catalyzes malate synthesis. This enzyme has two crucial catalytic sites: ARG166 and ASP447 residues, respectively acting as the proton acceptor and proton donor in the reaction, respectively. EasIFA accurately predicted these two key catalytic residues. The right panel displays the EasIFA’s active site annotation for Ketose 3-epimerase (UniProt ID: A0A1L7NQ96, EC Number: 5.1.3.-), which catalyzes the reversible C-3 epimerization of several ketoses (the enzymatic reaction visualized here is the conversion of L-ribulose to L-xylulose). Similarly, EasIFA accurately predicted all binding sites and catalytic residues. We also provide visualized cases of the EasIFA prediction results for four other enzymes containing TIM Barrel structures in Supplementary Fig. S7. Overall, EasIFA exhibits remarkable precision in annotating the active sites of enzymes with different functions but similar 3D structures.
Furthermore, we constructed an enzyme and pseudokinase pairing dataset to assess the discriminatory ability of EasIFA against unknown pseudoenzymes. In this experiment, we used “Pseudokinase” as a keyword to search for pseudoenzyme sequences with a length of less than 600 in the SwissProt subset of the UniProt database. Next, we used BLASTp, with an E-value threshold of 0.001, to locate the closest matching enzyme for each pseudoenzyme in the SwissProt E-RXN ASA validation and test sets. Pseudoenzymes without a similar enzyme match were excluded. Through these steps, we assembled a paired dataset consisting of pseudokinases and their corresponding true enzymes. For each pair, we hypothesized that the enzyme’s catalytic reaction could be attributed to the pseudoenzyme, and used EasIFA-ESM-multi to predict active sites. The differentiation criteria were as follows: 1. Successful differentiation was achieved when no active sites were predicted for a pseudoenzyme. 2. If a binding site was predicted for a pseudoenzyme and its paired enzyme had a documented catalytic site, differentiation was also considered successful. However, if the paired enzyme did not have a recorded catalytic site, this data was discarded due to uncertainty regarding whether the predicted protein lacked a recorded catalytic site or if it was simply a pseudoenzyme. 3. If a catalytic site was predicted for a pseudoenzyme, it was considered a failure. This prediction task was treated as a binary classification problem, where pseudoenzymes were labeled as negative data and their paired enzymes as positive data. The final dataset comprised 27 valid pairs, including 27 pseudokinases and 27 enzymes. The confusion matrix for classifier on this test set is shown in Fig. 4c, revealing that EasIFA correctly identified all enzyme samples and 21 pseudoenzyme samples. Further visualization of the EasIFA-ESM-multi’s predictions for serine/threonine-protein kinase BIK1 (UniProt ID: O48814, EC Number: 2.7.11.1) and inactive serine/threonine-protein kinase BKN1 (UniProt ID: Q9LFL7) is provided in Fig. 4d. These results demonstrate that EasIFA precisely predicted all active sites of serine/threonine-protein kinase BIK1 while making no such predictions for inactive BKN1. These findings indicate that EasIFA possesses a certain ability to discriminate between unknown pseudoenzymes.
Knowledge base transfer experiment
The primary source of enzyme active site annotation data currently comes from the UniProt database. While UniProt offers comprehensiveness and extensive collection of manually annotated protein data, its annotations in the specific field of enzymes are often lacking in detail and significantly differ from specialized enzyme mechanism databases. Professionally hand-annotated enzyme datasets like MCSA30 boast high annotation quality and detailed catalytic mechanisms, but they are costly to annotate and cannot be rapidly expanded to a wider range of enzyme entities on a large scale. Furthermore, the MCSA dataset primarily comprises exemplar and iconic instances, resulting in a limited number of entries under the same EC number. This leads to low similarity among enzymes within the dataset. Consequently, modeling on such small-scale, yet high-quality datasets like MCSA poses significant challenges due to their representative but limited data coverage.
In this study, we used the EasIFA algorithm, which was initially pretrained on the SwissProt E-RXN ASA dataset, and employed transfer learning to model the MCSA E-RXN CSA dataset. Prior to transfer learning, the MCSA E-RXN CSA dataset underwent rigorous sequence identity processing with CD-HIT to guarantee that no test set samples had a sequence identity above 80% with those in the training set or the SwissProt E-RXN ASA dataset. The detailed steps used to process the MCSA E-RXN CSA dataset are described in the Experimental Setting subsection within the Methods section. In our experiments, transfer learning studies were conducted using both EasIFA-ESM-bin and EasIFA-SaProt-bin, where the model states trained on the SwissProt E-RXN ASA dataset were directly utilized on the MCSA E-RXN CSA test set. Furthermore, we compared the performance of EasIFA with the sequence alignment-based BLASTp method, and the detailed results are presented in Table 4.
The results on the MCSA E-RXN CSA dataset indicated that the EasIFA-SaProt-bin model achieved the best performance with a precision of 66.59%, recall of 65.32%, a false positive rate of 0.54%, F1 score of 61.33%, MCC of 0.6295, AUROC of 0.9511, and AUPRC of 0.6798. The EasIFA-ESM-bin model exhibited a precision of 61.09%, recall of 63.68%, a slightly higher false positive rate of 0.71%, F1 score of 58.56%, MCC of 0.5977, AUROC of 0.9584, and AUPRC of 0.6479. When the models trained on the SwissProt E-RXN ASA dataset were directly applied to the MCSA E-RXN CSA test set, EasIFA-SaProt-bin recorded a precision of 56.60%, recall of 57.63%, a false positive rate of 1.36%, F1 score of 50.55%, MCC of 0.5271, AUROC of 0.9343, and AUPRC of 0.5501. Meanwhile, EasIFA-ESM-bin recorded a precision of 53.96%, recall of 51.64%, a false positive rate of 1.36%, F1 score of 46.04%, MCC of 0.4829, AUROC of 0.9449, and AUPRC of 0.5292. We have also provided the ROC and PRC curves for EasIFA-ESM-bin and EasIFA-SaProt-bin, obtained via transfer learning and directly using the model states trained on the SwissProt E-RXN ASA dataset (Supplementary Fig. S13).
For the BLASTp method relying on sequence alignment, two evaluation strategies were employed: (1) sequence alignment and active site annotation on the MCSA E-RXN CSA dataset using the MCSA’s enzyme catalytic site data for scoring; (2) sequence alignment with the larger SwissProt database followed by annotation using the SwissProt catalytic site data and scoring based on MCSA tags. However, the first strategy yielded poor predictive quality due to low homology between the test set and the knowledge base, achieving only a precision of 19.95%, recall of 18.54%, an F1 score of 18.12%, and an MCC of 0.1838. The second strategy did not significantly enhance annotation quality, hampered by notable differences in catalytic site labels between SwissProt and MCSA, with an F1 score of only 22.99% and an MCC of 0.2394. In contrast, the EasIFA algorithm, trained on a large yet coarse dataset of enzyme activity site annotations, successfully transferred its knowledge to a high-quality, small-scale dataset.
These results clearly demonstrate that introducing 3D structure data from Foldseek significantly enhances the EasIFA’s ability to transfer knowledge across data spaces. Moreover, the transfer learning strategy implemented on the MCSA E-RXN CSA dataset, which has substantial differences in data distribution, notably improves the model’s predictive performance on the test set. In this dataset with considerable sample variability, the EasIFA algorithm shows a significant advantage over the sequence alignment-based BLASTp method, regardless of whether a transfer learning strategy is implemented.
Exploration of potential as a catalytic sites monitor for artificially designed enzymes
In nature, enzymes with the same function typically exhibit structural patterns and catalytic mechanisms. However, with significant advancements in protein design and enzyme engineering, more artificial proteins and enzymes are being created. These artificially designed enzymes may possess structural patterns that are completely different from those of naturally occurring enzymes, posing challenges in predicting their properties of these artificial enzymes. To address these challenges, reliable in silico activity validation methods could greatly enhance the success rate of enzyme design, thereby reducing experimental costs.
In this section, we focused on the task of scaffolding active sites in enzyme design, exploring the potential applications of the EasIFA algorithm (default as EasIFA-ESM-bin) as a catalytic site monitor. The object of scaffolding active sites task is to design and construct enzymes with minimalist yet functional active sites. However, predicting the activity sites of artificially designed enzymes poses significant challenges due to their distinct amino acid sequence distribution compared to natural enzymes. We refer to these artificially designed enzyme structures as scaffolding enzymes. We tested the performance of BLASTp and AEGAN in predicting the enzyme active sites of four categories designed by Watson et al. using RFdiffusion, categorized under EC2, EC3, EC4, and EC5. Due to the multifunctional nature of the EC1 class enzymes used in the original study, we excluded this category from our analysis. BLASTp requires alignment with similar enzyme structures to predict active sites, while AEGAN relies on computing PSSM to represent enzymes. Since the scaffolding enzymes designed by RFdiffusion differ significantly from natural enzymes with the same function, except at the active sites, both algorithms struggled to accurately predict these challenging active sites accurately. Neither method detected any of the active sites in the four designed enzyme classes. The detailed prediction results can be found in the Supplementary Table S4.
To address the aforementioned issue, we developed a workflow that enables the EasIFA algorithm to annotate catalytic sites of enzymes that fall outside the natural distribution of enzymes. We studied several enzymes, including 4-alpha-glucanotransferase (UniProt ID: O87172, EC Number: 2.4.1.25), ribonuclease alpha-sarcin (UniProt ID: P00655, EC Number M-CSA: 3.1.27.10, EC Number UniProt: 4.6.1.23), deoxyribose-phosphate aldolase (UniProt ID: P0A6L0, EC Number: 4.1.2.4), and galactose mutarotase (UniProt ID: Q96C23, EC Number: 5.1.3.3), along with their corresponding artificially constructed scaffolding enzyme structures designed by Watson et al.33 using RFdiffusion35. Figure 5a shows the structure of 4-alpha-glucanotransferase and its triad active site Asp293-Glu340-Asp395. Figure 5b displays the structural alignment of the same active site across 20 other natural enzymes with the same EC number, revealing high overlap, indicating similar structures among naturally occurring enzymes with the same function. Figure 5c presents the overlapped conformations of 7 artificially scaffolded enzymes designed for 4-alpha-glucanotransferase by RFdiffusion, aligned at the triad catalytic site, demonstrating low overlap and high diversity. Figure 5d displays the sequence logo plot of the 20 natural enzymes surrounding the triad catalytic site, revealing a relatively fixed pattern, while Fig. 5e’s sequence logo plot of the artificially scaffolded enzymes exhibits almost no fixed pattern. We also present similar analysis results for ribonuclease alpha-sarcin, deoxyribose-phosphate aldolase and galactose mutarotase in Supplementary Figs. S8–S10, all of which show huge differences between the artificially constructed enzymes and natural counterparts. Furthermore, we employed CD-HIT to cluster these artificially constructed enzymes with the MCSA E-RXN CSA dataset and SwissProt E-RXN ASA dataset using a 40% sequence identity threshold. All the artificial enzymes did not cluster with sequences from the MCSA E-RXN CSA dataset and SwissProt E-RXN ASA dataset, indicating a sequence identity of less than 40%. Additionally, sequence alignments conducted with BLASTp at an E-value threshold of 0.001 failed to find any similar sequences. This explains why BLASTp and AEGAN are unable to predict the active sites of these unique artificial enzymes. To enhance EasIFA’s prediction capabilities for artificial enzymes, we augmented the MCSA E-RXN CSA dataset with sequence and structure-based data augmentation. Please refer to the MCSA E-RXN CSA dataset augmentation subsection for detailed dataset curation methods. Following the implementation of this augmentation strategy, EasIFA proved effective in accurately identifying the catalytic sites of artificially designed scaffold enzymes, thus demonstrating its superiority over BLASTp and AEGAN in recognizing the activity of the aforementioned four types of artificial enzymes. The left panel of Fig. 6a illustrates the active sites annotated by the EasIFA algorithm on an artificially scaffolded enzyme structure. The algorithm accurately identified the key triad catalytic sites, and the additional predicted Asp93 residue, located near these sites, holds potential relevance. The right panel of Fig. 6a shows the active sites annotated by the EasIFA algorithm on the artificially constructed deoxyribose-phosphate aldolase structure, accurately identifying the ternary active site of ASP35-LYS61-LYS82. We also added the prediction results of the corresponding artificial enzymes of ribonuclease alpha-sarcin and galacto-mutarotase in Supplementary Fig. S11. Th detailed predictions of EasIFA, BLASTp and AEGAN on four types of artificial active scaffolding enzymes can be found in Supplementary Table S4. In general, EasIFA has successfully identified the active sites of these artificial enzymes through our tailored data enhancement process, which is currently challenging for comparable algorithms.
Attention weight visualization of interpretable information interaction network
The integration of an attention mechanism into the enzyme-reaction information interaction network has greatly improved the high interpretability of the EasIFA model. By analyzing a few cases in the validation set of MCSA E-RXN CSA, we identified attention layers and heads that specifically focus on critical enzyme-reaction interactions. We visualized the attention weights for enzyme active residue sites on substrate atoms (marked in blue) that exceeded a threshold (determined based on the validation set). This visualization offered strong interpretability for many test set samples. Figure 6b displays the annotation results for cysteine lyase (left) and the visualization of the active site weights on substrate molecules (right). We also visualized the entire cystine cleavage reaction, highlighting the reaction center (middle), using the RDChiral’s reaction template36). It is noteworthy that His144 exhibited a high attention weight towards the reaction center of the L-cystine zwitterion, particularly the amino group37. The alignment with the enzyme’s catalytic mechanism highlights the crucial role played by His144 in deprotonating the amino group of L-cystine zwitterion. Due to the symmetrical structure of L-cystine zwitterion, the EasIFA model focuses on both sides of the reaction center. However, it is worth noting that the model’s interaction network pays less attention to water molecules, a trend that was also observed in other samples.
EasIFA webserver
We have developed a webserver for EasIFA (accessible at http://easifa.iddd.group), which offers more than just the conventional input of enzyme structure and corresponding enzyme-catalyzed reaction equation to annotate catalytic active sites. It also features an automated workflow that retrieves enzyme structure and catalyzed chemical reaction equations from UniProt, followed by automatic annotation of the enzyme’s active sites using EasIFA. This web-based graphical interface provides users with a convenient way to obtain higher quality annotations of enzyme active sites. Supplementary Fig. S12a–d demonstrates the two modes of input and the results interface. Supplementary Fig. S12a shows the interface for inputting the reaction equation and uploading the enzyme structure. Users can either paste the reaction’s SMILES in the left panel or draw the corresponding enzyme-catalyzed reaction in the JSME molecular editor. The right panel serves as an interface for uploading and previewing the enzyme structure. After uploading, users can preview the enzyme structure. Once all inputs are complete, click the ‘Start Prediction’ button, and users will be presented with the results interface as shown in Supplementary Fig. S12b. The upper left of the interface displays the enzyme’s sequence structure, with different types of catalytic active site amino acid residues marked in different colors. The upper right features an interactive enzyme structure viewing interface, with different types of active sites marked in different colors. In the center is the catalyzed reaction equation, and the detailed information about the active amino acid residues is provided at the bottom. Supplementary Fig. S12c illustrates the prediction workflow starting from a UniProt ID. Users can enter the UniProt ID of an enzyme they are interested in, and after clicking the ‘Start Prediction’ button, EasIFA Web will automatically retrieve the enzyme’s structure data and corresponding catalytic reaction data from the database for prediction. The returned results interface, as shown in Supplementary Fig. S12d, typically presents multiple sets of results, each set organized in a drawer format, with content similar to that shown in Supplementary Fig. S12b.
EasIFA’s structure-based prediction is rapid, requiring only a few seconds to annotate an enzyme’s active site with GPU support. Predictions starting from UniProt depend on the network environment of the server where EasIFA is deployed, typically resulting in an annotated enzyme and all its catalyzed reaction combinations under a UniProt ID within one minute. With sufficient database resources and GPU computational power, EasIFA has the potential to provide reliable active site annotation information for the majority of enzyme structures.