(Protein) Molecular Docking: A Concise Tutorial

Digest   2024-09-02 01:14   Jiangsu

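Last time, we gave a detailed walkthrough of how to use the HDOCK server:

HDOCK Molecular Docking: Detailed Tutorial

Today we share a condensed version of the molecular docking tutorial. Pay particular attention to section 6.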

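1. Providing input for the molecules to be docked

The HDOCK server uses a hybrid docking strategy to predict the binding complex between two molecules, such as proteins and nucleic acids. Users therefore need to provide input for both molecules to be docked. The HDOCK server accepts four types of molecular input:

a. Upload a structure file in PDB format.

b. Provide a PDB entry in PDB ID:ChainID format (e.g. 1CGI:E).

c. Copy and paste a protein sequence in FASTA format.

d. Upload a protein sequence file in FASTA format.

Only one type of input is needed for each molecule.

If more than one type of input is provided, the first one is used. For "PDB ID:ChainID" input, users can give a single chain ID or multiple chain IDs. For example, "1CGI:E" stands for chain E of the pdb file of 1CGI; "1AHW:AB" stands for chains A and B of the pdb file of 1AHW. If only a sequence is provided, the server automatically builds a model structure from a homologous template in the Protein Data Bank using an in-house modeling pipeline (HH Suite, Clustalw2, and MODELLER). In addition, if the protein contains multiple chains, users are advised to submit their own pdb file, because the HDOCK modeling pipeline is currently designed for single-chain proteins.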
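To make the "PDB ID:ChainID" convention concrete, here is a minimal Python sketch that splits such a string into its parts. The function name `parse_pdb_chain_spec` is hypothetical, and the pattern assumes a classic 4-character PDB ID and single-character chain IDs; the server's own validation rules are not documented here and may differ.

```python
import re

# Assumption: a classic 4-character PDB ID (one digit + 3 alphanumerics)
# and single-character chain IDs, as in "1CGI:E" and "1AHW:AB".
_SPEC = re.compile(r"([0-9][A-Za-z0-9]{3}):([A-Za-z0-9]+)")


def parse_pdb_chain_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'PDBID:Chains' into the PDB ID and a list of chain IDs."""
    match = _SPEC.fullmatch(spec.strip())
    if match is None:
        raise ValueError(f"not in 'PDB ID:ChainID' format: {spec!r}")
    pdb_id, chains = match.groups()
    return pdb_id.upper(), list(chains)


print(parse_pdb_chain_spec("1CGI:E"))   # ('1CGI', ['E'])
print(parse_pdb_chain_spec("1AHW:AB"))  # ('1AHW', ['A', 'B'])
```

A quick check like this before submitting can catch a missing colon or a mistyped ID.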

Note: for docking efficiency, if one molecule is much larger than the other, it is recommended to input the larger of the two as the receptor.


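Molecular Type:

"Select a type" is not needed for structure input, because the HDOCK server can determine the molecular type from the input structure. For sequence input, however, users are strongly advised to select a molecular type; otherwise the server will guess one of "Protein", "ssRNA", or "dsDNA" from the input sequence.

The molecular types are defined as follows:

Type       Description

Protein    Standard protein molecule

ssRNA      General single-stranded RNA molecule

ssDNA      General single-stranded DNA molecule

dsDNA      Double-stranded B-DNA duplex molecule

dsRNA      Double-stranded A-RNA duplex molecule

The maximum input sequence length for double-stranded (ds) RNA/DNA molecules is 500.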


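2. RNA/DNA 3D structure modeling

The HDOCK server now accepts sequence input for single-stranded (ss) or double-stranded (ds) RNA/DNA. Only the sequence of a single strand is needed. The input can contain the sequence alone, like this

>example

GGAGCGGUAGUUCAGUCGGUUAGAAUACCUGCCUGUCACGCAGGGGGUCGCGGGUUCGAGUCCCGUCCGUUCCGCCA

or, for single-stranded (ss) RNA/DNA, both the sequence and its secondary structure, like this

>example

GGAGCGGUAGUUCAGUCGGUUAGAAUACCUGCCUGUCACGCAGGGGGUCGCGGGUUCGAGUCCCGUCCGUUCCGCCA

(((((((..((((.........))))((((.(((((...))))))))).(((((.......))))))))))))....

HDOCK then builds the 3D structure from the single sequence, or models a double-stranded 3D duplex structure by constructing a complementary Watson-Crick paired second strand.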
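Before submitting a sequence with a secondary structure, it can help to check that the dot-bracket line has the same length as the sequence and that its parentheses balance. The sketch below does that, and also shows what a Watson-Crick paired second strand looks like as a reverse complement. All function names are hypothetical, the short hairpin is a toy input, and this only illustrates the idea; it is not HDOCK's modeling code.

```python
def pair_table(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (i, j) base pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def check_record(sequence: str, structure: str) -> list[tuple[int, int]]:
    """Check that sequence and structure agree in length, then pair them."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    return pair_table(structure)


def watson_crick_partner(sequence: str, rna: bool = True) -> str:
    """Return the reverse complement (A-U/G-C for RNA, A-T/G-C for DNA)."""
    table = str.maketrans("ACGU", "UGCA") if rna else str.maketrans("ACGT", "TGCA")
    return sequence.upper().translate(table)[::-1]


# Toy hairpin (hypothetical input, not from the HDOCK help page).
print(check_record("GGGAAACCC", "(((...)))"))  # [(0, 8), (1, 7), (2, 6)]
print(watson_crick_partner("GGAGCGG"))         # CCGCUCC
```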


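3. Specifying the binding site [optional]

HDOCK performs global docking to predict the binding complex between two molecules, so no binding site information is required for a docking job. However, if binding site residues are known, the server lets users specify them, so that the predicted models will be more accurate. Two types of binding site information can be provided.

Binding site residues on the receptor or ligand.

In the text box, binding site residues are given like this

195:A, 203-206:A, 108:B

which stands for residues 195 and 203-206 of chain A, and residue 108 of chain B. Note that residues on one line must be separated by commas.

The binding site residues can also be submitted as a file, like this


195:A

203-206:A

108:B

with the residues on separate lines of the file.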


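Distance restraints between interacting residues.

Users can give this information directly on one line in the text box, for example

195:A 236:B 8, 215-218:A 306:B 6

Here, residue 195 of chain A on the receptor will be within 8 Å of residue 236 of chain B on the ligand, and residues 215-218 of chain A on the receptor will be within 6 Å of residue 306 of chain B on the ligand. The same distance restraints can also be provided as a file, like this

195:A 236:B 8

215-218:A 306:B 6

Note: for each restraint, the first field is for the receptor, the second field is for the ligand, and the third field is the restraint distance. Residues must be written in num:chainID or num1-num2:chainID format, where the residue number and chain ID refer to the input structure if the input is a structure, or to the modeled structure if the input is a sequence.


Caution: for a 3D structure modeled by the server, the chain ID of a single-chain molecule is set to "A", and the residue numbering is consistent with the input sequence.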
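The residue and restraint formats in this section are simple enough to check locally before submitting. The following sketch parses both the one-line (comma-separated) and the file (one item per line) forms. The function names are hypothetical, and negative residue numbers and insertion codes are deliberately not handled.

```python
import re

Residues = tuple[str, list[int]]  # (chain ID, residue numbers)


def parse_residue_spec(token: str) -> Residues:
    """Parse 'num:chainID' or 'num1-num2:chainID'."""
    numbers, sep, chain = token.strip().partition(":")
    if not sep or not chain:
        raise ValueError(f"missing ':chainID' in {token!r}")
    first, dash, last = numbers.partition("-")
    start = int(first)
    end = int(last) if dash else start
    if end < start:
        raise ValueError(f"descending residue range in {token!r}")
    return chain, list(range(start, end + 1))


def parse_binding_site(text: str) -> list[Residues]:
    """Parse residues separated by commas (text box) or newlines (file)."""
    return [parse_residue_spec(t) for t in re.split(r"[,\n]", text) if t.strip()]


def parse_restraints(text: str) -> list[tuple[Residues, Residues, float]]:
    """Parse 'receptor ligand distance' items separated by commas or newlines."""
    restraints = []
    for item in re.split(r"[,\n]", text):
        fields = item.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {item!r}")
        receptor, ligand, distance = fields
        restraints.append(
            (parse_residue_spec(receptor), parse_residue_spec(ligand), float(distance))
        )
    return restraints


print(parse_binding_site("195:A, 203-206:A, 108:B"))
print(parse_restraints("195:A 236:B 8, 215-218:A 306:B 6"))
```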


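4. SAXS experimental data curve

Small-angle X-ray scattering (SAXS) experimental data can be provided as a post-docking filter for ranking the binding modes predicted by HDOCK. The SAXS data file contains three columns, q, I(q), and error, like this

    0.0000E+00 1.4612E+07 3.0685E+03
    1.0000E-03 1.4743E+07 4.8653E+03
    2.0000E-03 1.4827E+07 7.3394E+03
    3.0000E-03 1.4685E+07 1.0573E+04
    4.0000E-03 1.4674E+07 1.3206E+04
    5.0000E-03 1.4659E+07 1.5831E+04
    6.0000E-03 1.4729E+07 1.5466E+04
    7.0000E-03 1.4707E+07 1.7649E+04
    8.0000E-03 1.4594E+07 2.3642E+04
    9.0000E-03 1.4787E+07 2.8835E+04

Given a SAXS experimental curve, the binding models are ranked by a weighted score that combines the docking energy score calculated by the HDOCK scoring function with the CHI value, which measures how well a predicted binding mode fits the SAXS experimental data.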
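As a quick sanity check on a SAXS file before uploading it, the three-column layout can be read with a few lines of Python. This is only a sketch: `read_saxs_profile` is a hypothetical helper, and skipping non-numeric lines (such as a possible header) is an assumption about what real files may contain.

```python
def read_saxs_profile(text: str) -> list[tuple[float, float, float]]:
    """Read (q, I(q), error) rows from whitespace-separated text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank line or unexpected layout
        try:
            q, intensity, error = (float(f) for f in fields)
        except ValueError:
            continue  # non-numeric line, e.g. a header
        rows.append((q, intensity, error))
    return rows


sample = """\
    0.0000E+00 1.4612E+07 3.0685E+03
    1.0000E-03 1.4743E+07 4.8653E+03
    2.0000E-03 1.4827E+07 7.3394E+03
"""
profile = read_saxs_profile(sample)
print(len(profile), profile[1])
```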

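5. Post-docking processing (optional)

This step is for advanced users who want to obtain more than 100 predicted complex models, or who want to filter the docked complex models with their own experimental information. The downloaded package contains an HDOCK output file, with a name like hdock_5c984053e4b83.out, that includes all 4392 docking solutions, like this

    Grid spacing:     1.200
    Angle step: 15.000
    Initial rotation: 0.00000 0.00000 0.00000
    1CGI_r_b.pdb 23.562 26.523 22.675
    1CGI_l_b.pdb 47.776 34.961 33.826
    1.27246 0.01055 5.02167 -0.328 -0.164 0.264 -445.20 0.45 1.00
    2.80075 0.00162 3.49381 -0.286 -0.209 0.111 -444.37 0.38 1.00
    0.02137 0.00051 -0.00948 -0.267 -0.212 0.104 -444.28 0.36 1.00
    2.98094 0.00164 3.31735 -0.237 -0.259 0.116 -444.15 0.37 1.00
    3.04247 0.00300 3.25767 -0.340 -0.315 0.134 -442.80 0.49 1.00
    ...

The first 5 lines are defined as follows

   Line 1 is the grid spacing for the three (x, y, z) translational degrees of freedom.

   Line 2 is the Euler angle step for the three rotational degrees of freedom.

   Line 3 is the initial rotation of the ligand before docking (optional).

   Line 4 is the receptor file and its center of geometry.

   Line 5 is the ligand file and its center of geometry.

From line 6 onward are the predicted binding modes, each represented by three translations, three rotations, its binding score, the RMSD from the initial ligand orientation, and the translational ID for the rotation.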
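If you want to inspect the .out file yourself, for example to list scores beyond the top 100, the layout described above is easy to read in Python. The sketch below is an unofficial reader: the `Pose` class and `parse_hdock_out` function are hypothetical names, and the column meanings simply follow the order given in the text (three translations, three rotations, score, RMSD, translational ID).

```python
from dataclasses import dataclass


@dataclass
class Pose:
    translation: tuple[float, ...]  # columns 1-3, per the description above
    rotation: tuple[float, ...]     # columns 4-6, per the description above
    score: float                    # docking score
    rmsd: float                     # RMSD from the initial ligand orientation
    trans_id: float                 # translational ID for the rotation


def parse_hdock_out(text: str):
    """Return (header, centers, poses) from the text of an HDOCK .out file."""
    header, centers, poses = {}, {}, []
    for line in text.splitlines():
        fields = line.split()
        if ":" in line:
            # "Grid spacing: 1.200" style header line
            key, _, value = line.partition(":")
            header[key.strip()] = [float(v) for v in value.split()]
        elif len(fields) == 4:
            # file name followed by its center of geometry
            centers[fields[0]] = tuple(float(v) for v in fields[1:])
        elif len(fields) == 9:
            v = [float(f) for f in fields]
            poses.append(Pose(tuple(v[0:3]), tuple(v[3:6]), v[6], v[7], v[8]))
    return header, centers, poses


example = """\
Grid spacing:     1.200
Angle step: 15.000
Initial rotation: 0.00000 0.00000 0.00000
1CGI_r_b.pdb 23.562 26.523 22.675
1CGI_l_b.pdb 47.776 34.961 33.826
1.27246 0.01055 5.02167 -0.328 -0.164 0.264 -445.20 0.45 1.00
2.80075 0.00162 3.49381 -0.286 -0.209 0.111 -444.37 0.38 1.00
...
"""
header, centers, poses = parse_hdock_out(example)
best = min(poses, key=lambda p: p.score)  # most negative docking score
print(len(poses), best.score)
```

Lines that fit none of the three patterns, such as the "..." placeholder, are skipped. A file path containing a colon would be misread as a header line, so this reader assumes plain file names like those in the example.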

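Users can download the HDOCK "createpl_linux" program and run it locally to generate complex models, like this


    createpl_linux hdock_5c984053e4b83.out top100.pdb -nmax 100 -complex -models

where binding site residues or restraints can be applied to filter the complex models. Users can type

    createpl_linux

to see the detailed usage of the program.

After generating the complex models, users can also use a third-party program such as FoXS to calculate the SAXS CHI values of the models against their small-angle X-ray scattering (SAXS) profile file.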


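6. Explanation of the evaluation metrics (important!)

Docking score: the docking score is calculated by the knowledge-based iterative scoring function ITScorePP or ITScorePR. A more negative docking score indicates a more likely binding model, but the score should not be treated as the true binding affinity of the two molecules, because it has not been calibrated against experimental data.

Confidence score: given that protein-protein/RNA/DNA complexes in the PDB typically have a docking score of around -200 or better, a docking-score-dependent confidence score has been defined empirically to indicate how likely two molecules are to bind:

Confidence_score = 1.0 / [1.0 + e^{0.02 * (Docking_Score + 150)}]

Roughly speaking, when the confidence score is above 0.7, the two molecules are very likely to bind; when it is between 0.5 and 0.7, they may bind; when it is below 0.5, they are unlikely to bind. Because the confidence score is empirical, it should be used with caution.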
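The confidence score is easy to reproduce from a docking score. The sketch below implements the formula above; the function names and the wording of the three labels are my own, and the label boundaries simply follow the rough thresholds quoted in the text.

```python
import math


def confidence_score(docking_score: float) -> float:
    """Confidence_score = 1 / (1 + exp(0.02 * (Docking_Score + 150)))."""
    return 1.0 / (1.0 + math.exp(0.02 * (docking_score + 150.0)))


def interpret(confidence: float) -> str:
    """Map a confidence score to the rough categories quoted in the text."""
    if confidence > 0.7:
        return "very likely to bind"
    if confidence >= 0.5:
        return "may bind"
    return "unlikely to bind"


for score in (-100.0, -150.0, -200.0, -445.20):
    c = confidence_score(score)
    print(f"{score:8.2f}  {c:.3f}  {interpret(c)}")
```

Two reference points follow directly from the formula: a docking score of -150 gives a confidence of exactly 0.5, and the 0.7 threshold corresponds to a docking score of about -192.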

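Ligand RMSD: the ligand RMSD is calculated by comparing the ligand in the docking model with the input or modeled structure. The ligand RMSD is therefore not necessarily a measure of the accuracy of the corresponding model.

Interface residues: the interface information for each model lists all residue pairs within 5.0 Å between the receptor and the ligand in that model. Users can click to view or download the files for different models.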
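To illustrate what a 5.0 Å interface list contains, here is a brute-force sketch that collects receptor-ligand residue pairs having any two atoms within a cutoff. The function name and the toy coordinates are hypothetical, and the any-atom-pair criterion is an assumption: the text does not say which atoms the server uses for the distance.

```python
import math

Atom = tuple[str, int, tuple[float, float, float]]  # (chain, residue number, xyz)


def interface_pairs(receptor: list[Atom], ligand: list[Atom], cutoff: float = 5.0):
    """Return sorted residue pairs with any inter-atomic distance <= cutoff."""
    pairs = set()
    for r_chain, r_num, r_xyz in receptor:
        for l_chain, l_num, l_xyz in ligand:
            if math.dist(r_xyz, l_xyz) <= cutoff:
                pairs.add(((r_chain, r_num), (l_chain, l_num)))
    return sorted(pairs)


# Toy coordinates in Å, invented for illustration only.
receptor_atoms = [("A", 195, (0.0, 0.0, 0.0)), ("A", 196, (10.0, 0.0, 0.0))]
ligand_atoms = [("B", 236, (3.0, 3.0, 0.0)), ("B", 306, (30.0, 0.0, 0.0))]
print(interface_pairs(receptor_atoms, ligand_atoms))
```

The double loop is fine for a handful of models. For large complexes, a spatial index such as a k-d tree would be the usual replacement.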

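SAXS CHI square: the CHI value of a predicted model compared with the SAXS data curve, calculated with the FoXS program. A smaller CHI square means better agreement between the model and the SAXS data.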


Original English version

Help for using HDOCK server


1. How to provide input for docked molecules

The HDOCK server predicts the binding complexes between two molecules, such as proteins and nucleic acids, by using a hybrid docking strategy. Therefore, users need to provide input for the two molecules to be docked. The HDOCK server can accept four types of input for molecules:

  • Upload your pdb file in PDB format.

  • Provide your pdb file in PDB ID:ChainID (e.g. 1CGI:E).

  • Copy and paste your protein sequence in FASTA format.

  • Upload your protein sequence file in FASTA format

Only ONE type of input is needed for each molecule.

If more than one type of input is provided, the first one will be used. For the "PDB ID:ChainID" input, the user can provide a single chain ID or multiple chain IDs. For example, "1CGI:E" stands for chain E of the pdb file of 1CGI; "1AHW:AB" stands for chains A and B of the pdb file of 1AHW. If only a sequence is provided, the server will automatically construct a model structure from a homologous template in the Protein Data Bank using an in-house modeling pipeline of HH Suite, Clustalw2, and MODELLER. In addition, users are also recommended to submit their own pdb file if the protein contains multiple chains, as our pipeline is currently designed to model single-chain proteins.

NOTE: For docking efficiency, it is recommended that the larger one of two molecules is input as receptor if one molecule is much larger than the other one.

Molecular Type:
"Select a type" is not needed for structure input, as the HDOCK server is able to determine a molecular type according to the input structure. However, for sequence input, users are strongly recommended to select a molecular type; otherwise, the server will guess one from "Protein", "ssRNA", or "dsDNA" based on the input sequence.
Here are the definitions of different molecular types:

Type       Description
Protein    Standard protein molecule
ssRNA      General single-chain RNA molecule
ssDNA      General single-chain DNA molecule
dsDNA      Double-stranded B-DNA duplex molecule
dsRNA      Double-stranded A-RNA duplex molecule

where the maximum input sequence length is 500 for double-stranded (ds) RNA/DNA molecules.


2. RNA/DNA 3D structure modeling

The HDOCK server now accepts sequence inputs for single-stranded (ss) or double-stranded (ds) RNA/DNA. Only the sequence of a single strand is needed. The input can contain the sequence only, like this

>example
GGAGCGGUAGUUCAGUCGGUUAGAAUACCUGCCUGUCACGCAGGGGGUCGCGGGUUCGAGUCCCGUCCGUUCCGCCA

or both the sequence and its secondary structure for single-stranded (ss) RNA/DNA, like this

>example
GGAGCGGUAGUUCAGUCGGUUAGAAUACCUGCCUGUCACGCAGGGGGUCGCGGGUUCGAGUCCCGUCCGUUCCGCCA
(((((((..((((.........))))((((.(((((...))))))))).(((((.......))))))))))))....

HDOCK will then build its 3D structure based on the single sequence, or model a double-stranded 3D duplex structure by constructing a complementary Watson-Crick paired second strand.


3. How to specify the binding site [optional]

The HDOCK performs global docking to predict the binding complexes between two molecules. Therefore, no information about the binding site is necessary for the docking job. However, the server also gives users the option to specify the binding site residues if such information is available, such that the predicted models will have a higher accuracy. Two types of binding site information can be provided.
  • Binding site residues on the receptor or ligand.

    The binding site residues are provided like this in the text box

        195:A, 203-206:A, 108:B

    which stands for residues 195 and 203-206 of chain A, and residue 108 of chain B. Note that the residues in a line must be separated by commas.

    The binding site residues may also be submitted as a file that looks like this

        195:A
        203-206:A
        108:B

    The residues are put on different lines in the file.

  • Distance restraints between interacting residues.

    Users may directly provide such information on one line in the text box like

        195:A 236:B 8, 215-218:A 306:B 6

    where the distance between residue 195 of chain A on the receptor and residue 236 of chain B on the ligand will be within 8 Å, and the distance between residues 215-218 of chain A on the receptor and residue 306 of chain B on the ligand will be within 6 Å. Likewise, the above distance restraints can also be provided as a file that looks like this

        195:A 236:B 8
        215-218:A 306:B 6

    NOTE For each restraint, the first field is for the receptor, the second field is for the ligand, and the third field is for the restraint distance. The residue representation must be in num:chainID or num1-num2:chainID format, where the residue number and chain ID refer to the input structure if the input is a structure, or the modeled structure if the input is a sequence.

    CAUTION For the 3D structure modeled by the server, the chain ID is set to "A" for a single-chain molecule. The numbering of residues is consistent with that in the input sequence.


4. SAXS experimental data curve

The small-angle X-ray scattering (SAXS) experimental data can be provided as a post-docking filter for ranking the binding modes predicted by HDOCK. The SAXS data file contains three columns, q, I(q), and error, like this

    0.0000E+00 1.4612E+07 3.0685E+03
    1.0000E-03 1.4743E+07 4.8653E+03
    2.0000E-03 1.4827E+07 7.3394E+03
    3.0000E-03 1.4685E+07 1.0573E+04
    4.0000E-03 1.4674E+07 1.3206E+04
    5.0000E-03 1.4659E+07 1.5831E+04
    6.0000E-03 1.4729E+07 1.5466E+04
    7.0000E-03 1.4707E+07 1.7649E+04
    8.0000E-03 1.4594E+07 2.3642E+04
    9.0000E-03 1.4787E+07 2.8835E+04

With the SAXS experimental curve, the binding models will be ranked according to a weighted score of the docking energy score calculated by our scoring function and the CHI value, which measures how well the predicted binding modes fit the SAXS experimental data.



5. Post-docking process (optional)

This step is for advanced users who want to obtain more than 100 predicted complex models or filter the docked complex models with their own experimental information. The downloaded package contains an HDOCK output file, named like hdock_5c984053e4b83.out, that includes all 4392 docking solutions like this

    Grid spacing:     1.200
    Angle step: 15.000
    Initial rotation: 0.00000 0.00000 0.00000
    1CGI_r_b.pdb 23.562 26.523 22.675
    1CGI_l_b.pdb 47.776 34.961 33.826
    1.27246 0.01055 5.02167 -0.328 -0.164 0.264 -445.20 0.45 1.00
    2.80075 0.00162 3.49381 -0.286 -0.209 0.111 -444.37 0.38 1.00
    0.02137 0.00051 -0.00948 -0.267 -0.212 0.104 -444.28 0.36 1.00
    2.98094 0.00164 3.31735 -0.237 -0.259 0.116 -444.15 0.37 1.00
    3.04247 0.00300 3.25767 -0.340 -0.315 0.134 -442.80 0.49 1.00
    ...

where the first 5 lines have the following definitions

    The 1st line is the grid spacing of the three (x, y, z) translational degrees of freedom.
    The 2nd line is the Euler angle step for the three rotational degrees of freedom.
    The 3rd line is the initial rotation of the ligand before docking (optional).
    The 4th line is the receptor file and its center of geometry.
    The 5th line is the ligand file and its center of geometry.

Starting from the 6th line are the predicted binding modes, each of which is represented by three translations, three rotations, its binding score, the RMSD from the initial ligand orientation, and the translational ID for the rotation.

Users can download our "createpl_linux" program and run it locally to generate complex models like this

    createpl_linux hdock_5c984053e4b83.out top100.pdb -nmax 100 -complex -models

where binding site residues or restraints can be applied to filter the complex models. Users can type

    createpl_linux

for the detailed usage of the program.

After generating the complex models, users may also use a third-party program like FoXS to calculate the SAXS CHI values of the models based on their small-angle X-ray scattering (SAXS) profile file.


6. Explanations of evaluation metrics

  • Docking Score: The docking scores are calculated by our knowledge-based iterative scoring function ITScorePP or ITScorePR. A more negative docking score means a more likely binding model, but the score should not be treated as the true binding affinity of two molecules because it has not been calibrated to the experimental data.

  • Confidence Score: Given that the protein-protein/RNA/DNA complexes in the PDB normally have a docking score of around -200 or better, we have empirically defined a docking score-dependent confidence score to indicate the binding likelihood of two molecules as follows,

        Confidence_score = 1.0 / [1.0 + e^{0.02 * (Docking_Score + 150)}]

    Roughly, when the confidence score is above 0.7, the two molecules are very likely to bind; when the confidence score is between 0.5 and 0.7, the two molecules may bind; when the confidence score is below 0.5, the two molecules are unlikely to bind. Nevertheless, the confidence score should be used carefully due to its empirical nature.

  • Ligand RMSD: The ligand RMSDs are calculated by comparing the ligands in the docking models with the input or modeled structures. Therefore, the ligand RMSD is not necessarily a metric of the accuracy of the corresponding model.

  • Interface residues: The interface information for each model includes all the residue pairs within 5.0 Å between the receptor and the ligand for the corresponding model. Users can click to check/download the files for different models.

  • SAXS CHI Square: The CHI values of the predicted models compared to the SAXS data curve, calculated using the FoXS program. A smaller CHI square means better consistency between the model and the SAXS data.



    生信小博士