Collections of Sequences


Introduction
Seq-entry: The Sequence Entry
Bioseq-set: A Set Of Seq-entrys
Bioseq-sets are Convenient Packages
ASN.1 Specification: seqset.asn
C Structures and Functions: objsset.h

Introduction


A biological sequence is often most appropriately stored in the context of other, related sequences. Such a collection might have a biological basis (e.g. a nucleic acid and its translated proteins, or the chains of an enzyme complex) or some other basis (e.g. a release of GenBank, or the sequences published in an article). The Bioseq-set provides a framework for collections of sequences.

Seq-entry: The Sequence Entry


Sometimes a sequence is not part of a collection (e.g. a single annotated protein). Thus a sequence entry could be either a single Bioseq or a collection of them. A Seq-entry is an entity which represents this choice. A great deal of NCBI software is designed to accept a Seq-entry as the primary unit of data. In general, this is the most powerful and flexible object to target in software development.
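
The choice can be pictured as a tagged pointer. The following is a minimal sketch, not the toolkit's own code: DemoBioseq, DemoBioseqSet, DemoSeqEntry, the DEMO_ macros and demo_entry_single_id() are hypothetical names invented for illustration. Only the choice values (1 for a Bioseq, 2 for a Bioseq-set) are taken from the objsset.h listing at the end of this chapter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the toolkit's types. The choice values follow
 * the convention documented in objsset.h: 1 = Bioseq, 2 = Bioseq-set. */
typedef struct DemoBioseq { const char *id; } DemoBioseq;
typedef struct DemoBioseqSet DemoBioseqSet;   /* opaque in this sketch */

typedef struct DemoSeqEntry {
    unsigned char choice;   /* which alternative of the CHOICE is present */
    void *data;             /* DemoBioseq * or DemoBioseqSet *, per choice */
} DemoSeqEntry;

#define DEMO_IS_BIOSEQ(e)      ((e)->choice == 1)
#define DEMO_IS_BIOSEQ_SET(e)  ((e)->choice == 2)

/* Return the id when the entry is a single Bioseq, or NULL when it is a
 * set (or an unrecognized choice). The tag is checked before the cast. */
const char *demo_entry_single_id(const DemoSeqEntry *entry)
{
    assert(entry != NULL);
    if (DEMO_IS_BIOSEQ(entry)) {
        const DemoBioseq *seq = (const DemoBioseq *)entry->data;
        return seq != NULL ? seq->id : NULL;
    }
    return NULL;
}
```

Code written against a Seq-entry in this style can accept a lone sequence or a whole collection through the same argument, which is the flexibility described above.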

Bioseq-set: A Set Of Seq-entrys


A Bioseq-set contains a convenient collection of Seq-entrys. It can have descriptors and annotations just like a single Bioseq (see Biological Sequences). It can have identifiers for the set, although these are less thoroughly controlled than Seq-ids at this time. Since the "heart" of a Bioseq-set is a collection of Seq-entrys, which themselves are either a Bioseq or a Bioseq-set, a Bioseq-set can recursively contain other sets. This recursive property makes for a very rich data structure, and a necessary one for biological sequence data, but presents new challenges for software to manipulate and display it. We will discuss some guidelines for building and using Bioseq-sets below, based on the NCBI experience to date.
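
Because a set can contain sets, most code that processes a Seq-entry ends up recursive. The sketch below illustrates this under simplifying assumptions: DemoEntry and DemoSet are hypothetical types (members are chained through a next pointer), and demo_count_bioseqs() and demo_nesting_depth() are invented for this example rather than taken from the toolkit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a recursive Seq-entry. */
typedef struct DemoEntry DemoEntry;
typedef struct DemoSet {
    DemoEntry *seq_set;      /* first member; siblings follow via ->next */
} DemoSet;
struct DemoEntry {
    unsigned char choice;    /* 1 = Bioseq, 2 = Bioseq-set */
    const char *bioseq_id;   /* meaningful when choice == 1 */
    DemoSet *set;            /* meaningful when choice == 2 */
    DemoEntry *next;         /* next sibling in the enclosing set */
};

/* Count every Bioseq reachable from one entry, at any nesting level. */
size_t demo_count_bioseqs(const DemoEntry *entry)
{
    size_t total = 0;
    const DemoEntry *member;

    if (entry == NULL) return 0;
    assert(entry->choice == 1 || entry->choice == 2);
    if (entry->choice == 1) return 1;
    if (entry->set == NULL) return 0;
    for (member = entry->set->seq_set; member != NULL; member = member->next)
        total += demo_count_bioseqs(member);
    return total;
}

/* Depth of set nesting: 0 for a bare Bioseq, 1 for a set holding only
 * Bioseqs, 2 for a set that holds such a set, and so on. */
size_t demo_nesting_depth(const DemoEntry *entry)
{
    size_t deepest = 0;
    const DemoEntry *member;

    if (entry == NULL || entry->choice != 2 || entry->set == NULL) return 0;
    for (member = entry->set->seq_set; member != NULL; member = member->next) {
        size_t d = demo_nesting_depth(member);
        if (d > deepest) deepest = d;
    }
    return deepest + 1;
}
```

Computing the depth on demand in this way is one reason an explicit nesting-level field adds little.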

id: local identifier for this set

The id field just contains an integer or string to identify this set for internal use by a software system or database. This is useful for building collections of sequences for temporary use while still being able to cite them.

coll: global identifier for this set

The coll field is a Dbtag, which will accept a string to identify a source database and a string or integer as an identifier within that database. This semi-controlled form provides a global identifier for the set of sequences in a simple way.
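
The two identifier forms can be sketched as follows. This is an illustration only: DemoObjectId and DemoDbtag are hypothetical C structures that mirror the description above (an integer or a string, and a database name plus such an identifier); they are not the toolkit's ObjectId or Dbtag structures, and the "db:tag" text form is an assumption made for display purposes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of an Object-id: either a string or an integer. */
typedef struct DemoObjectId {
    const char *str;   /* string form, or NULL when the integer form is used */
    long id;           /* integer form, used only when str == NULL */
} DemoObjectId;

/* Hypothetical model of a Dbtag: source database plus an id within it. */
typedef struct DemoDbtag {
    const char *db;
    DemoObjectId tag;
} DemoDbtag;

/* Write "db:tag" into buf. Returns the number of characters needed, as
 * snprintf does, so a result >= len signals truncation. */
int demo_dbtag_format(const DemoDbtag *t, char *buf, size_t len)
{
    assert(t != NULL && t->db != NULL && buf != NULL);
    if (t->tag.str != NULL)
        return snprintf(buf, len, "%s:%s", t->db, t->tag.str);
    return snprintf(buf, len, "%s:%ld", t->db, t->tag.id);
}

/* Two tags name the same collection only if database and id both match.
 * An integer id and a string id are treated as different here. */
int demo_dbtag_equal(const DemoDbtag *a, const DemoDbtag *b)
{
    assert(a != NULL && b != NULL && a->db != NULL && b->db != NULL);
    if (strcmp(a->db, b->db) != 0) return 0;
    if (a->tag.str != NULL && b->tag.str != NULL)
        return strcmp(a->tag.str, b->tag.str) == 0;
    if (a->tag.str == NULL && b->tag.str == NULL)
        return a->tag.id == b->tag.id;
    return 0;
}
```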

level: nesting level of set

Since Bioseq-sets are recursive, the level integer was conceived as a way of explicitly indicating the nesting level. In practice we have found this to be of little or no use and recommend it be ignored and eventually removed.

class: classification of sets

The class field is an attempt to classify sets of sequences that may be widely used. There are a number which are just releases of well known databases and others which represent biological groupings.

Bioseq-set classes

Value  ASN.1 name  Explanation
0      not-set     not determined
1      nuc-prot    a nucleic acid and the proteins from its coding regions
2      segset      a segmented Bioseq and the Bioseqs it is made from
3      conset      a constructed Bioseq and the Bioseqs it was assembled from
4      parts       a set contained within a segset or conset holding the Bioseqs which are the components of the segmented or constructed Bioseq
5      gibb        GenInfo Backbone entries (NCBI Journal Scanning Database)
6      gi          GenInfo entries (NCBI ID Database)
7      genbank     GenBank entries
8      pir         PIR entries
9      pub-set     all the Seq-entrys from a single publication
10     equiv       a set of equivalent representations of the same sequence (e.g. a genetic map Bioseq and a physical map Bioseq for the same chromosome)
11     swissprot   SWISSPROT entries
12     pdb-entry   all the Bioseqs associated with a single PDB structure
255    other       new type. Usually Bioseq-set.release will have an explanatory string
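
When reading or reporting on sets it is often handy to convert between the numeric class value and its ASN.1 name. The table above can be sketched as a lookup table in C. The values and names come from the table; DemoClassName, demo_class_name() and demo_class_value() are hypothetical helpers written for this example, not toolkit functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoClassName { int value; const char *name; } DemoClassName;

/* Values and ASN.1 names as listed in the Bioseq-set classes table. */
static const DemoClassName demo_class_names[] = {
    { 0, "not-set" },   { 1, "nuc-prot" },  { 2, "segset" },
    { 3, "conset" },    { 4, "parts" },     { 5, "gibb" },
    { 6, "gi" },        { 7, "genbank" },   { 8, "pir" },
    { 9, "pub-set" },   { 10, "equiv" },    { 11, "swissprot" },
    { 12, "pdb-entry" }, { 255, "other" }
};
#define DEMO_CLASS_COUNT (sizeof demo_class_names / sizeof demo_class_names[0])

/* Name for a class value, or NULL if the value is not in the table. */
const char *demo_class_name(int value)
{
    size_t i;
    for (i = 0; i < DEMO_CLASS_COUNT; i++)
        if (demo_class_names[i].value == value)
            return demo_class_names[i].name;
    return NULL;
}

/* Class value for an ASN.1 name, or -1 if the name is not in the table. */
int demo_class_value(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < DEMO_CLASS_COUNT; i++)
        if (strcmp(demo_class_names[i].name, name) == 0)
            return demo_class_names[i].value;
    return -1;
}
```

A table is used rather than indexing an array by value because the values are not contiguous (other is 255).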

release: an explanatory string

This is just a free text field which can contain a human readable description of the set. It is often used to show which release of a database (GenBank, for example) the set represents.

date:

This is a date associated with the creation of this set.

descr: Seq-descr for this set

Just like a Bioseq, a Bioseq-set can have Seq-descr (see Biological Sequences) which set it in a biological or bibliographic context, or confer a title or a name. The rule for descriptors at the set level is that they apply to "all of everything below". So if an Org-ref is given at the set level, it means that every Bioseq in the set comes from that organism. If this is not true, then Org-ref would not appear on the set, but different Org-refs would occur on lower level members.

For any Bioseq in arbitrarily deeply nested Bioseq-sets, one should be able to collect all Bioseq-set.descr from all higher level Bioseq-sets that contain the Bioseq, and move them to the Bioseq. If this process introduces any confusion or contradiction, then the set level descriptor has been incorrectly used.

The only exception to this is the title and name types, which often refer to the set level on which they are placed (a nuc-prot may have the title "Adh gene and ADH protein", while the Bioseqs have the titles "Adh gene" and "ADH protein"). The gain in code sharing by using exactly the same Seq-descr for Bioseq or Bioseq-set seemed to outweigh the price of this one exception to the rule.

To simplify access to elements like this that depend on a set context, a series of BioseqContext() functions are provided in utilities which allow easy access to all relevant descriptors starting with a specific Bioseq and moving up the levels in the set.
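
The "all of everything below" rule can be illustrated with a small sketch. This is not the BioseqContext() interface: DemoContextNode is a hypothetical structure that reduces each level to a single organism string and assumes each level keeps a pointer to its enclosing set, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one level of nesting (a Bioseq or a Bioseq-set). */
typedef struct DemoContextNode {
    const char *org;                       /* organism given here, or NULL */
    const struct DemoContextNode *parent;  /* enclosing set, NULL at top */
} DemoContextNode;

/* Organism that applies to a node: the nearest one found walking upward. */
const char *demo_effective_org(const DemoContextNode *node)
{
    for (; node != NULL; node = node->parent)
        if (node->org != NULL) return node->org;
    return NULL;
}

/* Returns 1 if all organisms on the path from node to the top agree, and
 * 0 if moving the set-level descriptors down would cause a contradiction. */
int demo_org_is_consistent(const DemoContextNode *node)
{
    const char *seen = NULL;
    for (; node != NULL; node = node->parent) {
        if (node->org == NULL) continue;
        if (seen == NULL) seen = node->org;
        else if (strcmp(seen, node->org) != 0) return 0;
    }
    return 1;
}

/* Usage: the nuc-prot set carries the organism and its members inherit it. */
void demo_org_example(void)
{
    DemoContextNode nuc_prot = { "organism A", NULL };
    DemoContextNode protein  = { NULL, &nuc_prot };
    DemoContextNode misfiled = { "organism B", &nuc_prot };  /* contradicts */

    assert(strcmp(demo_effective_org(&protein), "organism A") == 0);
    assert(demo_org_is_consistent(&protein));
    assert(!demo_org_is_consistent(&misfiled));
}
```

The misfiled case is the situation described above: the set-level descriptor does not hold for every member, so it should not have been placed on the set.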

seq-set: the sequences and sets within the Bioseq-set

The seq-set field contains a SEQUENCE OF Seq-entry which represent the contents of the Bioseq-set. As mentioned above, these may be nested internally to any level. Although there is no guarantee that members of a set will come in any particular order, NCBI finds the following conventions useful and natural.

For sets of entries from specific databases, each Seq-entry is the "natural" size of an entry from that database. Thus GenBank will contain a set of Seq-entrys which will be a mixture of Bioseq (just a nucleic acid, no coding regions), segset (segmented nucleic acid, no coding regions), or nuc-prot (nucleic acid (as Bioseq or segset) and proteins from the translated coding regions). PDB will contain a mixture of Bioseq (single chain structures) or pdb-entry (multi-chain structures).

A segset, representing a segmented sequence, combines the segmented Bioseq with the set of the Bioseqs that make it up.


segset (Bioseq-set) contains
    segmented sequence (Bioseq)
    parts (Bioseq-set) contains
        first piece (Bioseq)
        second piece (Bioseq)
        etc.


A conset has the same layout as a segset, except the top level Bioseq is constructed rather than segmented.

A nuc-prot set gives the nucleic acid and its protein products at the same level.


nuc-prot (Bioseq-set) contains
    nucleic acid (Bioseq)
    protein1 (Bioseq)
    protein2 (Bioseq)
    etc.


A nuc-prot set where the nucleic acid is segmented simply replaces the nucleic acid Bioseq with a segset.


nuc-prot (Bioseq-set) contains
    nucleic acid segset (Bioseq-set) contains
        segmented sequence (Bioseq)
        parts (Bioseq-set) contains
            first piece (Bioseq)
            second piece (Bioseq)
            etc.
    protein1 (Bioseq)
    protein2 (Bioseq)
    etc.


annot: Seq-annots for the set

A Bioseq-set can have Seq-annots just like a Bioseq can. Because all forms of Seq-annot use explicit ids for the Bioseqs they reference, there is no dependence on context. Any Seq-annot can appear at any level of nesting in the set (or even stand alone) without any loss of information.

However, as a convention, NCBI puts the Seq-annot at the nesting level of the set that contains all the Bioseqs referenced by it, if possible. So if a feature applies just to one Bioseq, it goes in the Bioseq.annot itself. If it applies to all the members of a segmented set, it goes in Bioseq-set.annot of the segset. If, like a coding region, it points to both nucleic acid and protein sequences, it goes in the Bioseq-set.annot of the nuc-prot set.

The utilities include BioseqContextGetSeqFeat() which provides a convenient way of getting all the features that apply to a particular Bioseq in a set, no matter where in the nesting they occur.
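
The placement convention amounts to finding the lowest level of the nesting that contains every Bioseq an annotation references. A minimal sketch of that idea follows. It is not how BioseqContextGetSeqFeat() or any other toolkit routine works: DemoTreeNode is a hypothetical structure that assumes each level records its enclosing set, and demo_annot_home() is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one level of nesting (a Bioseq or a Bioseq-set). */
typedef struct DemoTreeNode {
    const char *label;
    const struct DemoTreeNode *parent;   /* enclosing set, NULL at top */
} DemoTreeNode;

static size_t demo_depth(const DemoTreeNode *n)
{
    size_t d = 0;
    while (n->parent != NULL) { n = n->parent; d++; }
    return d;
}

/* Lowest level containing both a and b (a level contains itself).
 * Returns NULL if the two are not in the same top-level entry. */
static const DemoTreeNode *demo_common(const DemoTreeNode *a,
                                       const DemoTreeNode *b)
{
    size_t da = demo_depth(a), db = demo_depth(b);
    while (da > db) { a = a->parent; da--; }
    while (db > da) { b = b->parent; db--; }
    while (a != b) {
        if (a->parent == NULL || b->parent == NULL) return NULL;
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/* Conventional home for an annotation that references refs[0..n-1]. */
const DemoTreeNode *demo_annot_home(const DemoTreeNode *const *refs, size_t n)
{
    const DemoTreeNode *home;
    size_t i;

    assert(refs != NULL && n > 0);
    home = refs[0];
    for (i = 1; i < n && home != NULL; i++)
        home = demo_common(home, refs[i]);
    return home;
}
```

With the nuc-prot layout shown earlier, a feature on one piece stays on that Bioseq, a feature spanning the segmented sequence and its pieces lands on the segset, and a coding region referencing the nucleic acid and a protein lands on the nuc-prot set.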

Bioseq-sets are Convenient Packages


Remember that Bioseq-sets are just convenient ways to package Bioseqs and associated annotations. But Bioseqs may appear in various contexts and software should always be prepared to deal with them that way. A segmented Bioseq may not appear as part of a segset and a Bioseq with coding regions may not appear as part of a nuc-prot set. In both cases the elements making up the segmented Bioseq and the Bioseqs involved in the coding regions all use Seq-locs, which explicitly reference Seq-ids. So they are not dependent on context. NCBI packages Bioseqs in sets for convenience, so all the closely related elements can be retrieved together. But this is only a convenience, not a requirement of the specification. The same caveat applies to the ordering conventions within a set, described above.

ASN.1 Specification: seqset.asn

--$Revision: 2.1 $
--**********************************************************************
--
--  NCBI Sequence Collections
--  by James Ostell, 1990
--
--**********************************************************************
NCBI-Seqset DEFINITIONS ::=
BEGIN

EXPORTS Bioseq-set, Seq-entry;

IMPORTS Bioseq, Seq-annot, Seq-descr FROM NCBI-Sequence
        Object-id, Dbtag, Date FROM NCBI-General;
--*** Sequence Collections ********************************
--*
Bioseq-set ::= SEQUENCE {      -- just a collection
    id Object-id OPTIONAL ,
    coll Dbtag OPTIONAL ,          -- to identify a collection
    level INTEGER OPTIONAL ,       -- nesting level
    class ENUMERATED {
        not-set (0) ,
        nuc-prot (1) ,              -- nuc acid and coded proteins
        segset (2) ,                -- segmented sequence + parts
        conset (3) ,                -- constructed sequence + parts
        parts (4) ,                 -- parts for 2 or 3
        gibb (5) ,                  -- geninfo backbone
        gi (6) ,                    -- geninfo
        genbank (7) ,               -- converted genbank
        pir (8) ,                   -- converted pir
        pub-set (9) ,               -- all the seqs from a single publication
        equiv (10) ,                -- a set of equivalent maps or seqs
        swissprot (11) ,            -- converted SWISSPROT
        pdb-entry (12) ,            -- a complete PDB entry
        other (255) } DEFAULT not-set ,
    release VisibleString OPTIONAL ,
    date Date OPTIONAL ,
    descr Seq-descr OPTIONAL ,
    seq-set SEQUENCE OF Seq-entry ,
    annot SET OF Seq-annot OPTIONAL }
Seq-entry ::= CHOICE {
        seq Bioseq ,
        set Bioseq-set } 


END

C Structures and Functions: objsset.h

/*  objsset.h
* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE                          
*               National Center for Biotechnology Information
*                                                                          
*  This software/database is a "United States Government Work" under the   
*  terms of the United States Copyright Act.  It was written as part of    
*  the author's official duties as a United States Government employee and 
*  thus cannot be copyrighted.  This software/database is freely available 
*  to the public for use. The National Library of Medicine and the U.S.    
*  Government have not placed any restriction on its use or reproduction.  
*                                                                          
*  Although all reasonable efforts have been taken to ensure the accuracy  
*  and reliability of the software and data, the NLM and the U.S.          
*  Government do not and cannot warrant the performance or results that    
*  may be obtained by using this software or data. The NLM and the U.S.    
*  Government disclaim all warranties, express or implied, including       
*  warranties of performance, merchantability or fitness for any particular
*  purpose.                                                                
*                                                                          
*  Please cite the author in any work or product based on this material.   
*
* ===========================================================================
*
* File Name:  objsset.h
*
* Author:  James Ostell
*   
* Version Creation Date: 4/1/91
*
* $Revision: 2.0 $
*
* File Description:  Object manager interface for module NCBI-Seqset
*
* Modifications:  
* --------------------------------------------------------------------------
* Date	   Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/
#ifndef _NCBI_Seqset_
#define _NCBI_Seqset_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifndef _NCBI_General_
#include <objgen.h>
#endif
#ifndef _NCBI_Seq_
#include <objseq.h>
#endif
#ifdef __cplusplus
extern "C" {
#endif
typedef ValNodePtr SeqEntryPtr;
/*****************************************************************************
*
*   loader
*
*****************************************************************************/
extern Boolean SeqSetAsnLoad PROTO((void));
/*****************************************************************************
*
*   internal structures for NCBI-Seqset objects
*
*****************************************************************************/
/*****************************************************************************
*
*   BioseqSet - a collection of sequences
*
*****************************************************************************/
typedef struct seqset {
    ObjectIdPtr id;
    DbtagPtr coll;
    Int2 level;            /* set to INT2_MIN (ncbilcl.h) for not used */
    Uint1 _class;
    CharPtr release;
    DatePtr date;
    ValNodePtr descr;
    SeqEntryPtr seq_set;
    SeqAnnotPtr annot;
} BioseqSet, PNTR BioseqSetPtr;

BioseqSetPtr BioseqSetNew PROTO((void));

Boolean BioseqSetAsnWrite PROTO((BioseqSetPtr bsp, AsnIoPtr aip, AsnTypePtr atp));

BioseqSetPtr BioseqSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));

BioseqSetPtr BioseqSetFree PROTO((BioseqSetPtr bsp));
/*****************************************************************************
*
*   SeqEntry - implemented as a ValNode
*     choice:
*       1 = Bioseq
*       2 = Bioseq-set
*
*****************************************************************************/
SeqEntryPtr SeqEntryNew PROTO((void));
Boolean SeqEntryAsnWrite PROTO((SeqEntryPtr sep, AsnIoPtr aip, AsnTypePtr atp));
SeqEntryPtr SeqEntryAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
SeqEntryPtr SeqEntryFree PROTO((SeqEntryPtr sep));
SeqEntryPtr PNTR SeqEntryInMem PROTO((Int2Ptr numptr));
/*****************************************************************************
*
*   Options for SeqEntryAsnRead()
*
*****************************************************************************/
SeqEntryPtr SeqEntryAsnGet PROTO((AsnIoPtr aip, AsnTypePtr atp, SeqIdPtr sip, Int2 retcode));
#define SEQENTRY_OPTION_MAX_COMPLEX 1   /* option type to use with OP_NCBIOBJSSET */
    /* values for retcode, implemented with AsnIoOptions */
#define SEQENTRY_READ_BIOSEQ 1    /* read only Bioseq identified by sip */
#define SEQENTRY_READ_SEG_SET 2   /* read any seg-set it may be part of */
#define SEQENTRY_READ_NUC_PROT 3   /* read any nuc-prot set it may be in */
#define SEQENTRY_READ_PUB_SET 4    /* read pub-set it may be part of */
typedef struct objsset_option {
	SeqIdPtr sip;              /* seq-id to find */
	Int2 retcode;              /* type of set/seq to return */
	Boolean in_right_set;
	Uint1 working_on_set;      /* 2, if in first set of retcode type */
	                           /* 1, if found Bioseq, but not right set */
	                           /* 0, if Bioseq not yet found */
} Op_objsset, PNTR Op_objssetPtr;
#define IS_Bioseq(a) (a->choice == 1)
#define IS_Bioseq_set(a)  (a->choice == 2)
/*****************************************************************************
*
*   loader for ObjSeqSet and Sequence Codes
*
*****************************************************************************/
extern Boolean SeqEntryLoad PROTO((void));
#ifdef __cplusplus
}
#endif
#endif