To test this version of HELIQUEST, we ran it on proteins whose 3D structures are known and on well-annotated datasets. More specifically: is the screening procedure able to extract transmembrane segments or amphipathic helices from proteins whose structure is known or that can bind to lipid membrane surfaces? Does the implementation of the decision tree make it possible to classify screening results?

 

1 - Test on PDB datasets

We ran HELIQUEST on sequences corresponding to two subsets of the non-redundant PDB (less than 30 % homology): one set of 7762 X-ray and NMR structures with a resolution < 3 Å, and a second set of 1858 X-ray structures with a resolution < 1.6 Å (Wang G, Dunbrack RL. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003). We examined whether increasing µH, in the absence of other constraints (see Table 1), better identifies the helical amphipathic segments that exist in the PDB. To determine this, segments positive for the screening were classified as helical or non-helical (the latter grouping the helix/random coil and high β-sheet propensity classes) and compared with the secondary structure of the PDB sequences as assigned by P-SEA (Labesse G, Colloc'h N, Pothier J, Mornon JP. P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Comput Appl Biosci. 1997, 13:291-5). For each dataset and across the µH ranges tested, we found very good sensitivity despite poor specificity. Increasing µH raises the positive predictive value (PPV) to about 85 % (for µH between 0.6 and 1.0). Conversely, at lower µH, PSIPRED helps HELIQUEST distinguish true helices from false ones with higher specificity (71.5 % for µH between 0.3 and 0.4). We also observed that, without any structural prediction by PSIPRED, increasing µH improves the detection of true amphipathic helices in structures (number of true amphipathic helices / total number of retrieved segments = 0.81 in the highest µH range). This indicates that the amphipathic moment (µH) is a good "predictor" of the presence of amphipathic helices in proteins.
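To make the quantity being varied here concrete, the following is a minimal Python sketch of a mean hydrophobic moment calculation in the style of Eisenberg. It assumes an angle of 100° between consecutive residues (ideal α-helix, 3.6 residues per turn). The function name, the `scale` argument and the toy hydrophobicity values are hypothetical and chosen only for illustration; HELIQUEST's own implementation and hydrophobicity scale may differ.

```python
import math

def hydrophobic_moment(sequence, scale, delta_deg=100.0):
    """Mean hydrophobic moment of a segment (assumed Eisenberg-style definition).

    sequence  : one-letter amino-acid string
    scale     : dict mapping residue letter -> hydrophobicity (caller-supplied)
    delta_deg : angle between consecutive side chains; 100 degrees is the
                usual value for an ideal alpha-helix (assumption)
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    delta = math.radians(delta_deg)
    # Sum the hydrophobicity vectors of the residues around the helix axis.
    sum_sin = sum(scale[res] * math.sin(i * delta) for i, res in enumerate(sequence))
    sum_cos = sum(scale[res] * math.cos(i * delta) for i, res in enumerate(sequence))
    # Normalise by the segment length so segments of different sizes compare.
    return math.hypot(sum_sin, sum_cos) / len(sequence)

# Toy scale with made-up values, for illustration only.
toy_scale = {"L": 1.0, "K": -1.0}
print(hydrophobic_moment("LKKLLKKLLKKLLKKLLK", toy_scale))
```

With this definition, a segment whose hydrophobic residues are spread evenly around the helix gives a moment close to zero, whereas a segment whose hydrophobic residues cluster on one face gives a high moment.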

 

Table 1A. Analysis of screening (a) carried out on a small, non-redundant PDB set (1858 X-ray structures, resolution < 1.6 Å)

Columns "Sensitivity" to "% Acc": classification with PSIPRED (helical/non-helical). Columns "Helical in PDB" to "NTOTAL": without PSIPRED (c).

| µH range | Sensitivity (%) | Specificity (%) | TP (b) | FP | FN | TN | % PPV | % Acc | Helical in PDB | Non-helical in PDB | NTRUE HELICES / NTOTAL | NTOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6 to 1.0 | 99 | 30.4 | 95 | 16 | 1 | 7 | 85.6 | 85.7 | 96 | 23 | 0.81 | 119 |
| 0.5 to 0.6 | 88.9 | 42.5 | 464 | 184 | 58 | 136 | 71.6 | 71.2 | 522 | 320 | 0.61 | 842 |
| 0.4 to 0.5 | 79.3 | 56.8 | 1145 | 539 | 299 | 710 | 67.9 | 68.9 | 1444 | 1249 | 0.54 | 2693 |
| 0.3 to 0.4 | 68.0 | 71.5 | 1525 | 823 | 719 | 2064 | 64.9 | 69.9 | 2244 | 2887 | 0.44 | 5131 |
| 0.5 to 1.0 | 89.2 | 41.8 | 472 | 185 | 57 | 133 | 71.8 | 71.4 | 529 | 318 | 0.63 | 847 |

(a) Parameters: 0.7 ≤ H ≤ 1.5; -8 ≤ z ≤ +8; NPolar ≥ 0; NCharged residues ≤ 10; NGly ≥ 0; Cys accepted; no Pro accepted. The algorithm refining the identification of well-defined amphipathic helices was deactivated.

(b) TP: true positive; FP: false positive; FN: false negative; TN: true negative; Sensitivity = TP/(TP+FN); Specificity = TN/(TN+FP); PPV = positive predictive value = TP/(TP+FP); Acc = accuracy = (TP+TN)/(TP+FN+FP+TN); NTOTAL: total number of segments identified by the screening.

(c) In this case, without PSIPRED prediction, all NTOTAL segments are considered helical; we simply examined whether these sequences are helical or not in the PDB.
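The definitions in the table notes can be checked with a few lines of Python. This is a minimal sketch; the function name `classification_metrics` is hypothetical, and the counts are taken from the first row of Table 1A.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and accuracy as percentages."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # TP/(TP+FN)
        "specificity": 100.0 * tn / (tn + fp),  # TN/(TN+FP)
        "ppv": 100.0 * tp / (tp + fp),          # TP/(TP+FP)
        "accuracy": 100.0 * (tp + tn) / total,  # (TP+TN)/(TP+FN+FP+TN)
    }

# Counts from the first row of Table 1A.
metrics = classification_metrics(tp=95, fp=16, fn=1, tn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.1f} %")
```

Rounded to one decimal, these counts give 99.0, 30.4, 85.6 and 85.7 %, in agreement with the tabulated values.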

 

2 - Test on a TM-containing protein dataset

We ran HELIQUEST on a dataset from the MPtopo database (Jayasinghe, S., Hristova, K., and White, S.H. MPtopo: A database of membrane protein topology. Protein Sci 10: 455-458. 2001) containing 131 sequences in which the positions of the transmembrane (TM) segments have been determined structurally. The dataset was screened with µH ≤ 0.5 and H between 0.7 and 1.5. Among the retrieved sequences, TMHMM recognizes transmembrane segments with a positive predictive value of about 95 % (sensitivity = 95.5 %; specificity = 53.5 %).

 

Table 2. Analysis of screening (a) carried out on a dataset of TM-containing proteins

| Sensitivity (%) | Specificity (%) | TP (b) | FP | FN | TN | % PPV | % Acc |
|---|---|---|---|---|---|---|---|
| 95.5 | 53.5 | 382 | 20 | 18 | 23 | 95.1 | 91.4 |

(a) Parameters: 0.7 ≤ H ≤ 1.5; µH ≤ 0.5; -8 ≤ z ≤ +8; NPolar ≥ 0; NCharged residues ≤ 10; NGly ≥ 0; Cys accepted; no Pro accepted. The algorithm refining the identification of well-defined amphipathic helices was deactivated.

(b) TP: true positive; FP: false positive; FN: false negative; TN: true negative; Sensitivity = TP/(TP+FN); Specificity = TN/(TN+FP); PPV = positive predictive value = TP/(TP+FP); Acc = accuracy = (TP+TN)/(TP+FN+FP+TN).
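The screening parameters listed in note a of Table 2 amount to a set of range checks on each candidate segment. The sketch below shows one way such a filter could be written in Python. The `Segment` structure, the `TM_SCREEN` dictionary and the function `passes_screen` are hypothetical names, and the segment properties are assumed to have been computed beforehand; this is not HELIQUEST's actual code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Pre-computed properties of one candidate segment (hypothetical structure)."""
    sequence: str
    hydrophobicity: float  # H
    moment: float          # µH
    net_charge: int        # z
    n_polar: int
    n_charged: int
    n_gly: int

# Thresholds copied from note a of Table 2.
TM_SCREEN = {
    "h_min": 0.7, "h_max": 1.5,
    "moment_max": 0.5,
    "z_min": -8, "z_max": 8,
    "n_polar_min": 0,
    "n_charged_max": 10,
    "n_gly_min": 0,
    "allow_cys": True,
    "allow_pro": False,
}

def passes_screen(seg, params=TM_SCREEN):
    """Return True if the segment satisfies every screening constraint."""
    # Residue exclusions first: they do not depend on the numeric properties.
    if not params["allow_pro"] and "P" in seg.sequence:
        return False
    if not params["allow_cys"] and "C" in seg.sequence:
        return False
    return (
        params["h_min"] <= seg.hydrophobicity <= params["h_max"]
        and seg.moment <= params["moment_max"]
        and params["z_min"] <= seg.net_charge <= params["z_max"]
        and seg.n_polar >= params["n_polar_min"]
        and seg.n_charged <= params["n_charged_max"]
        and seg.n_gly >= params["n_gly_min"]
    )

# Made-up segment properties, for illustration only.
candidate = Segment("LLIVAALLGVLLAFLLIA", 1.1, 0.2, 0, 2, 0, 1)
print(passes_screen(candidate))
```

Keeping the thresholds in a dictionary makes it straightforward to define a second parameter set for another screening without changing the filter itself.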

 

3 - Test on a dataset of lipid-binding protein helices

We screened a small database, not used for our discriminant analysis, containing 18 proteins that bind to lipid membrane surfaces. With the selected screening parameters (see Table 3), the screening retrieved 38 segments, 10 of which correspond to or overlap known lipid-binding segments. The decision tree classified 2 of these segments as Lipid-Binding Helix, 6 as Possible Lipid-Binding Helix, 1 as simply helical and 1 as a TM segment. These results suggest that the decision tree can help users identify interesting hits.

 

Table 3. Analysis of screening (a) carried out on a dataset of lipid-binding protein segments.

| Lipid-binding protein helices (bracketed numbers indicate sequence position) | UNIPROT | Lipid-binding Helix | Possible Lipid-binding Helix | TM | Helix | Helix/coil | High propensity for β-sheet |
|---|---|---|---|---|---|---|---|
| Aerobic Glycerol-3-Phosphate Dehydrogenase [355-370] | P13035 | | X | | | | |
| FHV Coat Protein [364-385] | P12870 | | X | | | | |
| Hepatitis C Core protein [117-134] | P27957 | | | X | | | |
| Dense Granule Protein 2 [69-87] | P13404 | | X | | | | |
| G protein-coupled Receptor Kinase 5 [546-565] | P34947 | - | - | - | - | - | - |
| Glucose-specific IIa component [1-18] | Q8XBL1 | - | - | - | - | - | - |
| Lactophorin [116-153] | P80195 | X | | | | | |
| Myelin Basic Protein [81-97] | P02687 | - | - | - | - | - | - |
| Sterol Carrier Protein 2 [1-32] | P22307 | - | - | - | - | - | - |
| Phosphodiesterase 4A cAMP specific [1-25] | Q684M5 | - | - | - | - | - | - |
| Synuclein α [3-37] and [45-92] | P37840 | | X | | | | |
| GMAP-210 [1829-1843] | Q15643 | | X | | | | |
| Cholinephosphate CytidylylTransferase [240-295] | P19836 | - | - | - | - | - | - |
| RGS4 [1-33] | O08899 | | X | | | | |
| DnaA [357-374] | P03004 | | | | X | | |
| Spo20p [62-79] | Q04359 | - | - | - | - | - | - |
| BVDV NS5A [1-28] | P19711 | X | | | | | |
| Measles virus F1 protein [197-225] | Q9YJ94 | - | - | - | - | - | - |

(a) Parameters: 0 ≤ H ≤ 2.25; -2 ≤ z ≤ +5; NPolar ≥ 5; NCharged residues ≤ 6; NGly ≥ 0; Cys and Pro accepted. The algorithm refining the identification of well-defined amphipathic helices was activated. Rows filled with dashes = proteins whose lipid-binding segment was not retrieved by the screening.