Pseudo amino acid composition explained

Pseudo amino acid composition, or PseAAC, in molecular biology, was originally introduced by Kuo-Chen Chou in 2001 to represent protein samples for improving protein subcellular localization prediction and membrane protein type prediction.^[1] Like the vanilla amino acid composition (AAC) method, it characterizes the protein mainly using a matrix of amino-acid frequencies, which helps when dealing with proteins that lack significant sequence homology to other proteins. Compared to AAC, additional information is also included in the matrix to represent some local features, such as the correlation between residues a certain distance apart.^[2] Chou's invariance theorem has often been used when working with PseAAC.

Background

To predict the subcellular localization of proteins and other attributes based on their sequence, two kinds of models are generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential model or discrete model.

The most typical sequential representation of a protein sample is its entire amino acid (AA) sequence, which retains the most complete information about the protein. This is an obvious advantage of the sequential model. Predictions under this model are usually made with tools based on sequence-similarity searches.

Given a protein sequence P with L amino acid residues, i.e.,

P = \begin{bmatrix} R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L \end{bmatrix} \qquad (1)

where R₁ represents the 1st residue of the protein P, R₂ the 2nd residue, and so forth. This is the representation of the protein under the sequential model.

However, this kind of approach fails when a query protein does not have significant homology to any known protein. Thus, various discrete models that do not rely on sequence order were proposed. The simplest discrete model uses the amino acid composition (AAC) to represent protein samples. Under the AAC model, the protein P of Eq.1 can also be expressed by

P = \begin{bmatrix} f_1 & f_2 & \cdots & f_{20} \end{bmatrix}^T \qquad (2)

where f_u (u = 1, 2, …, 20) are the normalized occurrence frequencies of the 20 native amino acids in P, and T is the transposition operator. The AAC of a protein is trivially derived once its primary structure is known, as in Eq.1; it can also be determined by hydrolysis without knowing the exact sequence, and such a step is in fact often a prerequisite for protein sequencing.^[3]

Owing to its simplicity, the amino acid composition (AAC) model was widely used in many earlier statistical methods for predicting protein attributes. However, all the sequence-order information is lost. This is its main shortcoming.
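As a minimal sketch of Eq.2 and of the shortcoming just described, the AAC can be computed by counting residues and dividing by the sequence length. The function name, the alphabetical ordering of the 20 components, and the two toy sequences are assumptions made for illustration; the text does not fix an order for f_1, …, f_20.

```python
from collections import Counter

# The 20 native amino acids in one-letter code. The alphabetical ordering
# is an assumption; the text does not prescribe an order for f_1..f_20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 normalized occurrence frequencies of Eq.2 as a list."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(AMINO_ACIDS)
    if not sequence or unknown:
        raise ValueError(f"empty sequence or non-native residues: {sorted(unknown)}")
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


# Two toy sequences (hypothetical) with the same residues in a different order.
aac_1 = amino_acid_composition("MKVLA")
aac_2 = amino_acid_composition("ALVKM")
print(aac_1 == aac_2)  # True: the residue order does not affect the AAC
```

Because both toy sequences contain the same residues, their AAC vectors are identical, which is exactly the loss of sequence-order information noted above.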

Concept

To avoid completely losing the sequence-order information, the concept of PseAAC (pseudo amino acid composition) was proposed. In contrast with the conventional amino acid composition (AAC), which contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein, the PseAAC contains more than 20 discrete factors: the first 20 represent the components of the conventional amino acid composition, while the additional factors incorporate some sequence-order information via various pseudo components.

The additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors, so long as they reflect some kind of sequence-order effect. The essence of PseAAC is therefore that it covers the AA composition while also containing information beyond it, and hence can better reflect the features of a protein sequence through a discrete model.

Meanwhile, various modes to formulate the PseAAC vector have also been developed, as summarized in a 2009 review article.^[2]

Algorithm

According to the PseAAC model, the protein P of Eq.1 can be formulated as

P = \begin{bmatrix} p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda} \end{bmatrix}^T, \quad (\lambda < L) \qquad (3)

where the (20+λ) components are given by

p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[10pt] \dfrac{w \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \qquad (4)

where w is the weight factor, and τ_k the k-th tier correlation factor, which reflects the sequence-order correlation between all the k-th most contiguous residues, as formulated by

\tau_k = \dfrac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \qquad (5)

with

J_{i,i+k} = \dfrac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q\left(R_{i+k}\right) - \Phi_q\left(R_i\right) \right]^2 \qquad (6)

where Φ_q(R_i) is the q-th function of the amino acid R_i, and Γ the total number of functions considered. For example, in the original paper by Chou,^[1] Φ_1(R_i), Φ_2(R_i), and Φ_3(R_i) are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid R_i, while Φ_1(R_{i+1}), Φ_2(R_{i+1}), and Φ_3(R_{i+1}) are the corresponding values for the amino acid R_{i+1}. Therefore, the total number of functions considered there is Γ = 3. It can be seen from Eq.3 that the first 20 components, i.e. p_1, p_2, …, p_{20}, are associated with the conventional AA composition of the protein, while the remaining components p_{20+1}, …, p_{20+λ} are the correlation factors that reflect the 1st-tier, 2nd-tier, …, and λ-th tier sequence-order correlation patterns (Figure 1). It is through these additional λ factors that some important sequence-order effects are incorporated.

λ in Eq.3 is an integer parameter, and choosing a different integer for λ leads to a PseAA composition of a different dimension.^[4]
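The computation in Eqs.3 to 6 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the original implementation: the function names, the toy sequence, and the weight value 0.05 are chosen for the example, and a single property table (the Kyte-Doolittle hydropathy scale, so Γ = 1) stands in for the three functions used in the original paper. Any normalization of the property values is left to the caller.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # component order is an assumption

# Kyte-Doolittle hydropathy values, used only as a stand-in property table;
# the original paper used hydrophobicity, hydrophilicity and side chain mass.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def correlation_factor(sequence, k, properties):
    """k-th tier correlation factor tau_k (Eq.5), with J from Eq.6."""
    length = len(sequence)
    total = 0.0
    for i in range(length - k):
        a, b = sequence[i], sequence[i + k]
        # J_{i,i+k}: squared property difference averaged over the Gamma functions
        total += sum((phi[b] - phi[a]) ** 2 for phi in properties) / len(properties)
    return total / (length - k)


def pseaac(sequence, lam, weight, properties):
    """Return the (20 + lam)-dimensional PseAAC vector of Eqs.3 and 4."""
    if set(sequence) - set(AMINO_ACIDS):
        raise ValueError("sequence contains non-native residues")
    if not 0 <= lam < len(sequence):
        raise ValueError("lam must satisfy 0 <= lam < L")
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, k, properties) for k in range(1, lam + 1)]
    denominator = sum(freqs) + weight * sum(taus)
    return [f / denominator for f in freqs] + [weight * t / denominator for t in taus]


# Hypothetical toy sequence; weight=0.05 is an illustrative choice.
vector = pseaac("MKVLAAGIVG", lam=3, weight=0.05, properties=[KYTE_DOOLITTLE])
print(len(vector))  # 23 components: 20 composition terms plus lam = 3
```

Because both cases of Eq.4 share the same denominator, the components of the resulting vector sum to 1, and changing `lam` changes only the number of trailing correlation components.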

Using Eq.6 is just one of the many modes for deriving the correlation factors in PseAAC or its components. The others, such as the physicochemical distance mode^[5] and amphiphilic pattern mode,^[6] can also be used to derive different types of PseAAC, as summarized in a 2009 review article.^[2] In 2011, the formulation of PseAAC (Eq.3) was extended to a form of the general PseAAC as given by:^[7]

P = \begin{bmatrix} \psi_1, \psi_2, \ldots, \psi_u, \ldots, \psi_\Omega \end{bmatrix}^T \qquad (7)

where the subscript Ω is an integer, and its value and the components ψ_1, ψ_2, … depend on how the desired information is extracted from the amino acid sequence of P in Eq.1.

The general PseAAC can be used to reflect any desired features according to the targets of research, including core features such as functional domain, sequential evolution, and gene ontology, to improve the prediction quality for the subcellular localization of proteins^[8]^[9] as well as many of their other important attributes.
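One way to picture Eq.7 is as a concatenation of whatever feature extractors a study chooses, with Ω being the total number of values they return. The sketch below is purely illustrative: `general_pseaac`, `composition`, and `dipeptide_composition` are hypothetical names, and the two extractors are simple sequence-only examples. Features such as functional domain, sequential evolution, or gene ontology require external databases and are not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # component order is an assumption


def composition(sequence):
    """20 single-residue frequencies (the conventional AAC)."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """400 frequencies of adjacent residue pairs (needs at least 2 residues)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return [pairs.count(a + b) / len(pairs) for a in AMINO_ACIDS for b in AMINO_ACIDS]


def general_pseaac(sequence, extractors):
    """Concatenate the outputs of the chosen extractors into one vector."""
    vector = []
    for extract in extractors:
        vector.extend(extract(sequence))
    return vector


# Hypothetical toy sequence; Omega is determined by the extractors chosen.
features = general_pseaac("MKVLAAGIVG", [composition, dipeptide_composition])
print(len(features))  # 420, i.e. Omega = 20 + 400 for this choice
```

Swapping in a different list of extractors changes both Ω and the meaning of each component, which is the flexibility that the general formulation describes.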

External links

PseAAC web server

Notes and References

1. Chou KC (May 2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
2. Chou KC (2009). "Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology". Current Proteomics. 6 (4): 262–274. doi:10.2174/157016409789973707.
3. Alterman MA, Hunziker P (2 December 2011). Amino Acid Analysis: Methods and Protocols. Humana Press. ISBN 978-1-61779-444-5.
4. Chou KC, Shen HB (November 2007). "Recent progress in protein subcellular location prediction". Anal. Biochem. 370 (1): 1–16. doi:10.1016/j.ab.2007.07.006. PMID 17698024.
5. Chou KC (November 2000). "Prediction of protein subcellular locations by incorporating quasi-sequence-order effect". Biochem. Biophys. Res. Commun. 278 (2): 477–83. doi:10.1006/bbrc.2000.3815. PMID 11097861.
6. Chou KC (January 2005). "Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes". Bioinformatics. 21 (1): 10–9. doi:10.1093/bioinformatics/bth466 (free access). PMID 15308540.
7. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–47. doi:10.1016/j.jtbi.2010.12.024. PMID 21168420. PMC 7125570. Bibcode 2011JThBi.273..236C.
8. Chou KC, Shen HB (2008). "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms". Nat Protoc. 3 (2): 153–62. doi:10.1038/nprot.2007.494. PMID 18274516. S2CID 226104. Archived from the original (dead link) on 2007-08-27: https://web.archive.org/web/20070827234010/http://chou.med.harvard.edu/bioinf/Cell-PLoc/ . Retrieved 2008-03-24.
9. Shen HB, Chou KC (February 2008). "PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition". Anal. Biochem. 373 (2): 386–8. doi:10.1016/j.ab.2007.10.012. PMID 17976365.