Biotechnology Forums

hinasajid · 05-16-2012, 11:57 PM

Hi,
I am currently conducting independent research on a class of proteins. I have 19,000 protein sequences to store and classify into 5 classes, based on the conserved domain of each class, using bioinformatics tools. Which database would be best for this, ideally one that is easy to use and takes the least time for such a large number of proteins?

I will be linking this database with software that identifies consensus sequences.

BojanaL · 01-24-2013, 09:38 PM

Protein sequence databases

Protein databases can be roughly divided into two groups: those that collect proteins of known structure (determined by X-ray crystallography, for example) and those that collect protein sequences without structural information.

Here are protein sequence databases that might be useful:

GenBank Gene Products Data Bank or GenPept

GenPept is a protein sequence database that acts as a repository of protein sequences with minimal annotation. Its entries are translated from coding regions in the GenBank nucleotide sequence database, so proteins known only from direct amino acid sequencing are missing. Also, the same protein can be represented by multiple entries.

NCBI’s Entrez Protein

This database also acts as a repository of proteins. It contains translations of nucleotide sequences as well as sequences imported from Swiss-Prot. Unlike GenPept, it therefore carries curated information for the imported sequences, but the sequence collection is redundant, just like in GenPept.

Reference Sequence (RefSeq) collection

RefSeq is developed by NCBI with the aim of providing a non-redundant collection of protein sequences (multiple entries are merged into a single entry). The database contains sequences from many species, from viruses and bacteria to higher organisms such as human, mouse, zebrafish and some plants. The large collection of nucleotide and protein sequences is clearly organized and includes additional information and recent findings associated with each sequence.

Protein Information Resource Protein Sequence Database (PIR-PSD)

PIR-PSD is the oldest protein sequence database. The collected data are non-redundant, well organized into families and superfamilies, and annotated with structural, functional and genetic data. The protein name, its classification and the regions of biological interest within the sequence can easily be found in PIR. The database is cross-referenced with DDBJ/EMBL/GenBank nucleic acid and protein identifiers, PubMed and MEDLINE IDs, and several other databases.

Swiss-Prot

Swiss-Prot contains a large collection of manually curated, non-redundant sequences. The high quality of the data is ensured by strict evaluation of each new entry. Swiss-Prot is detail oriented. Additional information includes experimental findings, post-translational modifications and protein function, information associated with protein structure, similarities to other proteins, diseases related to protein deficiency, developmental stages associated with protein expression, tissues that contain the protein of interest and its associated biological pathways.

Translation from EMBL - TrEMBL

TrEMBL was created to make newly submitted sequences available more quickly. It consists of computer-annotated translations of coding regions from nucleotide sequences in DDBJ/EMBL/GenBank that have not yet been incorporated into Swiss-Prot. After a sequence is added to TrEMBL, multiple entries are merged into a single entry and the data are enriched by transferring Swiss-Prot annotation to the corresponding protein group in TrEMBL. InterPro is used to determine which protein group a new sequence belongs to. It helps assign proteins to specific groups because it contains information about protein families, domains and functional sites.
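
As a rough illustration of this kind of signature-based assignment, the sketch below sorts sequences into classes by searching each one for a conserved-domain pattern. The class names and patterns are invented for the example (they are not real InterPro signatures), and real tools rely on profiles or hidden Markov models rather than plain regular expressions.

```python
import re

# Hypothetical class signatures: invented patterns, not real InterPro entries.
CLASS_PATTERNS = {
    "class_A": re.compile(r"C..C.{12}H...H"),
    "class_B": re.compile(r"G.G..G"),
}


def classify(sequence):
    """Return the names of all classes whose pattern occurs in the sequence."""
    seq = sequence.strip().upper()
    return [name for name, pattern in CLASS_PATTERNS.items() if pattern.search(seq)]


# Toy sequences built to contain (or lack) the invented patterns.
examples = {
    "seq1": "MKT" + "CAAC" + "Q" * 12 + "HAAAH" + "LL",
    "seq2": "MAGSGAAGKT",
    "seq3": "MKLVVV",
}
for name, seq in examples.items():
    print(name, classify(seq))
```

A sequence that matches no pattern gets an empty list, so unclassified proteins are easy to pull out for a closer look.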

UniProt

UniProt has existed since December 2003. It combines Swiss-Prot, TrEMBL and PIR-PSD. The database is divided into three sections: the UniProt Knowledgebase (UniProtKB), the UniProt Archive (UniParc) and the UniProt non-redundant reference databases (UniProt NREF, since renamed UniRef).
The UniProt Knowledgebase provides expert-curated data from all three member databases. If a sequence is missing from Swiss-Prot or TrEMBL, it is imported from PIR. During annotation, each entry is assigned specific GO terms (Gene Ontology terms describe the function of the protein, its location in the cell, its biological pathways and so on). The isoform identifier is a useful UniProt feature that helps identify splice isoforms, using a tool called VARSPLIC.
UniParc provides a huge collection of protein sequences, thanks to daily maintenance of existing sequences and constant upload of new data. It is the most comprehensive publicly available protein sequence database, because scientists from organizations in various parts of the world (USA, Europe, Japan) collaborate on expanding it. Each sequence in this collection is unique and is assigned a UniParc identifier. If you need to cross-check a sequence against another database, you can use accession numbers. Cross-references are available to 50 different databases.
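
The idea of one entry per unique sequence, with cross-references back to the source databases, can be sketched in a few lines. This is only an illustration: the identifier format, the use of an MD5 checksum and the accession strings are assumptions made for the example, not UniParc's actual scheme.

```python
import hashlib


def build_archive(records):
    """Collapse (accession, sequence) pairs into one entry per unique sequence."""
    archive = {}
    for accession, sequence in records:
        seq = sequence.strip().upper()
        # Illustrative identifier: a prefix plus the MD5 checksum of the sequence.
        key = "SEQ_" + hashlib.md5(seq.encode("ascii")).hexdigest()
        entry = archive.setdefault(key, {"sequence": seq, "xrefs": []})
        entry["xrefs"].append(accession)  # remember where the sequence came from
    return archive


# Hypothetical accessions from two made-up source databases.
records = [
    ("DB1:X0001", "MKTAYIAKQR"),
    ("DB2:Y0002", "mktayiakqr"),  # same protein, different source and letter case
    ("DB1:X0003", "MSLLTEVETY"),
]
archive = build_archive(records)
for key, entry in archive.items():
    print(key, entry["sequence"], entry["xrefs"])
```

The same approach works for a local collection: 19,000 sequences fit comfortably in memory, and the cross-reference list shows which source records were merged.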
Thanks to UniProt NREF, the sequence collection is non-redundant. The three separate NREF databases, NREF100, NREF90 and NREF50, provide only unique sequences and hide redundant entries. In NREF100, sequences are grouped by identity and taxonomy, and each group is represented as a single entry. The other two databases also offer non-redundant sequence collections and allow fast homology searches.
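
The numbers in the names refer to sequence identity thresholds (100%, 90% and 50%). To make that concrete, here is a minimal sketch of greedy clustering at a chosen threshold. The identity function is a deliberately crude stand-in (position-by-position comparison without alignment), and the real databases are built with dedicated clustering software, so treat this as an illustration of the idea only.

```python
def identity(a, b):
    """Fraction of matching positions relative to the longer sequence (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are trivially identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def greedy_cluster(sequences, threshold):
    """Put each sequence in the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first member is the representative
    # Longest sequences first, so that they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no similar representative found: new cluster
    return clusters


# Toy sequences: two identical, one differing at a single position, one unrelated.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MSLLTEVETY"]
for threshold in (1.0, 0.9):
    print(threshold, greedy_cluster(seqs, threshold))
```

Lowering the threshold merges more sequences into each cluster, which is why the 90% and 50% sets are smaller and faster to search than the 100% set.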

Hope this will help you decide which one is the best for you. Good luck.