Biological Sequences


Introduction
Bioseq: the Biological Sequence
Seq-id: Identifying the Bioseq
Seq-annot: Annotating the Bioseq
Seq-descr: Describing the Bioseq and Placing It In Context
Seq-inst: Instantiating the Bioseq
Seq-hist: History of a Seq-inst
Seq-data: Encoding the Sequence Data Itself
Tables of Sequence Codes
Mapping Between Different Sequence Alphabets
Data and Tools for Sequence Alphabets
Pubdesc: Publication Describing a Bioseq
Numbering: Applying a Numbering System to a Bioseq
ASN.1 Specification: seq.asn
ASN.1 Specification: seqblock.asn
ASN.1 Specification: seqcode.asn
C Structures and Functions: objseq.h
C Structures and Functions: objpubd.h
C Structures and Functions: objblock.h
C Structures and Functions: objcode.h

Introduction

top

A biological sequence is a single, continuous molecule of nucleic acid or protein. It can be thought of as a multiple inheritance class hierarchy. One hierarchy is that of the underlying molecule type: DNA, RNA, or protein. The other hierarchy is the way the underlying biological sequence is represented by the data structure. It could be a physical or genetic map, an actual sequence of amino acids or nucleic acids, or some more complicated data structure building a composite view from other entries. An overview of this data model has been presented previously, in the Data Model chapter. The overview will not be repeated here so if you have not read that chapter, do so now. This chapter will concern itself with the details of the specification and representation of biological sequence data.

Bioseq: the Biological Sequence

top

A Bioseq represents a single, continuous molecule of nucleic acid or protein. It can be anything from a band on a gel to a complete chromosome. It can be a genetic or physical map. All Bioseqs have more common properties than differences. All Bioseqs must have at least one identifier, a Seq-id (i.e. Bioseqs must be citable). Seq-ids are discussed in detail in the chapter Sequence Ids and Locations. All Bioseqs represent an integer coordinate system (even maps). All positions on Bioseqs are given by offsets from the first residue, and thus fall in the range from zero to (length - 1). All Bioseqs may have specific descriptive data elements (descriptors) and/or annotations such as feature tables, alignments, or graphs associated with them.

The differences in Bioseqs arise primarily from the way they are instantiated (represented). Different data elements are required to represent a map than are required to represent a sequence of residues.

The C structure for a Bioseq has pointers for a linked list of Seq-ids, a linked list of Seq-descr, and a linked list of Seq-annot, mapping quite directly from the ASN.1. However, since a Seq-inst is always required for a Bioseq, those fields have been incorporated into the Bioseq itself. There are SeqInstAsnRead() and SeqInstAsnWrite() as separate functions, but they take a pointer to a Bioseq.

A number of #defines are provided in objseq.h for the representation classes, molecule types, and types of sequence encoding used in the Bioseq C structure. Also the macros ISA_na() and ISA_aa() are provided to split Bioseqs into the two major molecule classes. A Bioseq.length equal to -1 means the length is unknown and will not appear in the ASN.1. When actual sequence data is present, Bioseq.seq_data holds the pointer to it. Bioseq.seq_data_type contains a value indicating the type of sequence encoding used (and thus the pointer type to cast Bioseq.seq_data to). Sequence encoding is discussed in more detail below.

Seq-id: Identifying the Bioseq

top

Every Bioseq MUST have at least one Seq-id, or sequence identifier. This means a Bioseq is always citable. You can refer to it by a label of some sort. This is a crucial property for different software tools or different scientists to be able to talk about the same thing. There is a wide range of Seq-ids and they are used in different ways. They are discussed in more detail in the Sequence Ids and Locations chapter.

Seq-annot: Annotating the Bioseq

top

A Seq-annot is a self-contained package of sequence annotations, or information that refers to specific locations on specific Bioseqs. Every Seq-annot can have an Object-id for local use by software, a Dbtag for globally identifying the source of the Seq-annot, and/or a name and description for display and use by a human. These describe the whole package of annotations and make it attributable to a source, independent of the source of the Bioseq.

A Seq-annot may contain a feature table, a set of sequence alignments, or a set of graphs of attributes along the sequence. These are described in detail in the Sequence Annotation chapter.

A Bioseq may have many Seq-annots. This means it is possible for one Bioseq to have feature tables from several different sources, or a feature table and set of alignments. A collection of sequences (see Sets Of Bioseqs) can have Seq-annots as well. Finally, a Seq-annot can stand alone, not directly attached to anything. This is because each element in the Seq-annot has specific references to locations on Bioseqs so the information is very explicitly associated with Bioseqs, not implicitly associated by attachment. This property makes possible the exchange of information about Bioseqs as naturally as the exchange of the Bioseqs themselves, be it among software tools or between scientists or as contributions to public databases.

Seq-descr: Describing the Bioseq and Placing It In Context

top

A Seq-descr is meant to describe a Bioseq (or set of Bioseqs.. see Sets Of Bioseqs) and place it in a biological and/or bibliographic context. Seq-descrs apply to the whole Bioseq. Some Seq-descr classes appear also as features, when used to describe a specific part of a Bioseq. But anything appearing at the Seq-descr level applies to the whole thing.

The C implementation uses a linked list of ValNodes, where the ValNode.choice indicates what kind of Seq-descr this is, and ValNode.data contains either an integer or pointer depending on the type of descriptor. The file objseq.h lists the choices and data types and is summarize in the following table. Under Value is the value of ValNode.choice. Type gives an indication of the data stored in ValNode.data. If "i", then an integer is stored in valnode->data.intvalue. Otherwise a pointer is stored in valnode->data.ptrvalue and the datatype of the pointer is given. The file objseq.h also has a series of #defines for Value below constructed by prefixing "Seq_descr_" to the Name below and replacing any hyphens (-) in the ASN.1 name with underline (_) to make it legal C (e.g. #define Seq_descr_mol_type 1).

Seq-descr
ValueName TypeExplanation
1mol-typei role of molecule in life
2modifValNodePtr modifying keywords of mol-type
3methodi protein sequencing method used
4nameCharPtr a commonly used name (e.g. "SV40")
5titleCharPtr a descriptive title or definition
6orgOrgRefPtr (single) organism from which mol comes
7commentCharPtr descriptive comment (may have many)
8numNumberingPtr a numbering system for whole Bioseq
9maplocDbtagPtr a map location from a mapping database
10pirPirBlockPtr PIR specific data
11genbankGBBlockPtr GenBank flatfile specific data
12pubPubdescPtr Publication citation and descriptive info from pub
13regionCharPtr name of genome region (e.g. B-globin cluster)
14userUserObjectPtr user defined data object for any purpose
15spSPBlockPtr SWISSPROT specific data
16neighborsLinkSetPtr ids of pre-calculated similar sequences
17emblEMBLBlockPtr EMBL specific data
18create-dateDatePtr date entry was created by source database
19update-dateDatePtr date entry last updated by source database
20prfPrfBlockPtr PRF specific data
21pdbPdbBlockPtr PDB specific data
22hetCharPtr heterogen: non-Bioseq atom/molecule

mol-type: The Molecule Type

A Seq-descr.mol-type is of type GIBB-mol. It is derived from the molecule information used in the GenInfo BackBone database. It indicates the biological role of the Bioseq in life. It can be genomic (including organelle genomes). It can be a transcription product such as pre-mRNA, mRNA, rRNA, tRNA, snRNA (small nuclear RNA), or scRNA (small cytoplasmic RNA). All amino acid sequences are peptides. No distinction is made at this level about the level of processing of the peptide (but see Prot-ref in the Sequence Annotations chapter). The type other-genetic is provided for "other genetic material" such a B chromosomes or F factors that are not normal genomic material but are also not transcription products. The type genomic-mRNA is provided to describe sequences presented in figures in papers in which the author has combined genomic flanking sequence with cDNA sequence. Since such a figure often does not accurately reflect either the sequence of the mRNA or the sequence of genome, this practice should be discouraged.

Since GIBB-mol is an ENUMERATED type, the ValNode for the Seq-descr simply places the enumerated value in ValNode.data.intvalue.

modif: Modifying Our Assumptions About a Bioseq

A GIBB-mod began as a GenInfo BackBone component and was found to be of general utility. A GIBB-mod is meant to modify the assumptions one might make about a Bioseq. If a GIBB-mod is not present, it does not mean it does not apply, only that it is part of a reasonable assumption already. For example, a Bioseq with GIBB-mol = genomic would be assumed to be DNA, to be chromosomal, and to be partial (complete genome sequences are still rare). If GIBB-mod = mitochondrial and GIBB-mod = complete are both present in Seq-descr, then we know this is a complete mitochondrial genome. Even though GIBB-mod = DNA is not present we can still assume it is DNA.

The modifier concept permits a lot of flexibility. So a peptide with GIBB-mod = mitochondrial is a mitochondrial protein. There is no implication that it is from a mitochondrial gene, only that it functions in the mitochondrion. The assumption is that peptide sequences are complete, so GIBB-mod = complete is not necessary for most proteins, but GIBB-mod = partial is important information for some. A list of brief explanations of GIBB-mod values follows:

GIBB-mod
ValueName Explanation
0dnamolecule is DNA in life
1rnamolecule is RNA in life
2extrachrommolecule is extrachromosomal
3plasmidmolecule is or is from a plasmid
4mitochondrialmolecule is from mitochondrion
5chloroplastmolecule is from chloroplast
6kinetoplastmolecule is from kinetoplast
7cyanellemolecule is from cyanelle
8syntheticmolecule was synthesized artificially
9recombinantmolecule was formed by recombination
10partialnot a complete sequence for molecule
11completesequence covers complete molecule
12mutagenmolecule subjected to mutagenesis
13natmutmolecule is a naturally occuring mutant
14transposonmolecule is a transposon
15insertion-seqmolecule is an insertion sequence
16no-leftpartial molecule is missing left end

5' end for nucleic acid, NH3 end for peptide

17no-rightpartial molecule is missing right end

3' end for nucleic acid, COOH end for peptide

18macronuclearmolecule is from macronucleus
19proviralmolecule is an integrated provirus
20estmolecule is an expressed sequence tag

Seq-descr.modif is defined as a SET OF GIBB-mod, so it must be implemented as a chain, not as a single value. The ValNode representing a Seq-descr.modif then has ValNode.choice = Seq_descr_modif and a ValNode.data.ptrvalue is the head of a chain of ValNodes. Each member of that chain has a ValNode.data.intvalue set to represent a single GIBB-mod according to the table above.

method: Protein Sequencing Method

The method Seq-descr gives the method used to obtain a protein sequence. The values for a GIBB-method are also stored in the C structure as integer values mapping directly from the ASN.1 ENUMERATED type. They are:

GIBB-method
ValueName Explanation
1concept-transconceptual translation
2seq-peptpeptide itself was sequenced
3bothconceptual translation with partial peptide sequencing
4seq-pept-overlappeptides sequenced, fragments ordered by overlap
5seq-pept-homolpeptides sequenced, fragments ordered by homology
6concept-trans-aconceptual translation, provided by author of sequence

name: A Descriptive Name

A sequence name is very different from a sequence identifier. A Seq-id uniquely identifies a specific Bioseq. A Seq-id may be no more than an integer and will not necessarily convey any biological or descriptive information in itself. A name is not guaranteed to uniquely identify a single Bioseq, but if used with caution, can be a very useful tool to identify the best current entry for a biological entity. For example, we may wish to associate the name "SV40" with a single Bioseq for the complete genome of SV40. Let us suppose this Bioseq has the Seq-id 10. Then it is discovered that there were errors in the original Bioseq designated 10, and it is replaced by a new Bioseq from a curator with Seq-id 15. The name "SV40" can be moved to Seq-id 15 now. If a biologist wishes to see the "best" or "most typical" sequence of the SV40 genome, she would retrieve on the name "SV40". At an earlier point in time she would get Bioseq 10. At a later point she would get Bioseq 15. Note that her query is always answered in the context of best current data. On the other hand, if she had done a sequence analysis on Bioseq 10 and wanted to compare results, she would cite Seq-id 10, not the name "SV40", since her results apply to the specific Bioseq, 10, not necessarily to the "best" or "most typical" entry for the virus at the moment.

title: A Descriptive Title

A title is a brief, generally one line, description of an entry. It is extremely useful when presenting lists of Bioseqs returned from a query or search. This is the same as the familiar GenBank flatfile DEFINITION line.

Because of the utility of such terse summaries, NCBI has been experimenting with algorithmically generated titles which try to pack as much information as possible into a single line in a regular and readable format. You will see titles of this form appearing on entries produced by the NCBI journal scanning component of GenBank.


DEFINITION  atp6=F0-ATPase subunit 6 {RNA edited} [Brassica napus=rapeseed,

            mRNA Mitochondrial, 905 nt]

DEFINITION  mprA=metalloprotease, mprR=regulatory protein [Streptomyces
            coelicolor, Muller DSM3030, Genomic, 3 genes, 2040 nt]
DEFINITION  pelBC gene cluster: pelB=pectate lyase isozyme B, pelC=pectate
            lyase isozyme C [Erwinia chrysanthemi, 3937, Genomic, 2481 nt]
DEFINITION  glycoprotein J...glycoprotein I [simian herpes B virus SHBV,
            prototypic B virus, Genomic, 3 genes, 2652 nt]
DEFINITION  glycoprotein B, gB [human herpesvirus-6 HHV6, GS, Peptide, 830
               aa]
DEFINITION  {pseudogene} RESA-2=ring-infected erythrocyte surface antigen 2
            [Plasmodium falciparum, FCR3, Genomic, 3195 nt]

DEFINITION  microtubule-binding protein tau {exons 4A, 6, 8 and 13/14} [human,

            Genomic, 954 nt, segment 1 of 4]

DEFINITION  CAD protein carbamylphosphate synthetase domain {5' end} [Syrian
            hamsters, cell line 165-28, mRNA Partial, 553 nt]


DEFINITION  HLA-DPB1 (SSK1)=MHC class II antigen [human, Genomic, 288 nt]

Gene and protein names come first. If both gene name and protein name are know they are linked with "=". If more than two genes are on a Bioseq then the first and last gene are given, separated by "...". A region name, if available, will precede the gene names. Extra comments will appear in {}. Organism, strain names, and molecule type and modifier appear in [] at the end. Note that the whole definition is constructed from structured information in the ASN.1 data structure by software. It is not composed by hand, but is instead a brief, machine generated summary of the entry based on data within the entry. We therefore discourage attempts to machine parse this line. It may change, but the underlying structured data will not. Software should always be designed to process the structured data.

org: What Organism Did this Come From?

If the whole Bioseq comes from a single organism (the usual case). See the Feature Table chapter for a detailed description of the Org-ref (organism reference) data structure.

comment: Commentary Text

A comment that applies to the whole Bioseq may go here. A comment may contain many sentences or paragraphs. A Bioseq may have many comments.

num: Applying a Numbering System to a Bioseq

One may apply a custom numbering system over the full length of the Bioseq with this Seq­descr. See the section on Numbering later in this chapter for a detailed description of the possible forms this can take. To report the numbering system used in a particular publication, the Pubdesc Seq-descr has its own Numbering slot.

maploc: Map Location

The map location given here is a Dbtag, to be able to cite a map location given by a map database to this Bioseq (e.g. "GDB", "4q21"). It is not necessarily the map location published by the author of the Bioseq. A map location published by the author would be part of a Pubdesc Seq-descr.

pir: PIR Specific Data

sp: SWISSPROT Data

embl: EMBL Data

prf: PRF Data

pdb: PDB Data

NCBI produces ASN.1 encoded entries from data provided by many different sources. Almost all of the data items from these widely differing sources are mapped into the common ASN.1 specifications described in this document. However, in all cases a small number of elements are unique to a particular data source, or cannot be unambiguously mapped into the common ASN.1 specification. Rather than lose such elements, they are carried in small data structures unique to each data source. These are specified in seqblock.asn and objblock.h.

genbank: GenBank Flatfile Specific Data

A number of data items unique to the GenBank flatfile format do not map readily to the common ASN.1 specification. These fields are partially populated by NCBI for Bioseqs derived from other sources than GenBank to permit the production of valid GenBank flatfile entries from those Bioseqs. Other fields are populated to preserve information coming from older GenBank entries.

pub: Description of a Publication

This Seq-descr is used both to cite a particular bibliographic source and to carry additional information about the Bioseq as it appeared in that publication, such as the numbering system to use, the figure it appeared in, a map location given by the author in that paper, and so. See the section on the Pubdesc later in this chapter for a more detailed description of this data type.

region: Name of a Genomic Region

A region of genome often has a name which is a commonly understood description for the Bioseq, such as "B-globin cluster".

user: A User-defined Structured Object

This is a place holder for software or databases to add their own structured datatypes to Bioseqs without corrupting the common specification or disabling the automatic ASN.1 syntax checking. A User-object can also be used as a feature. See the chapter on General User Objects for a detailed explanation of User-objects.

neighbors: Bioseqs Related by Sequence Similarity

NCBI computes a list of "neighbors", or closely related Bioseqs based on sequence similarity for use in the Entrez service. This descriptor is so that such context setting information could be included in a Bioseq itself, if desired.

create-date:

This is the date a Bioseq was created for the first time. It is normally supplied by the source database. It may not be present when not normally distributed by the source database.

update-date:

This is the date of the last update to a Bioseq by the source database. For several source databases this is the only date provided with an entry. The nature of the last update done is generally not available in computer readable (or any) form.

het: Heterogen

A "heterogen" is a non-biopolymer atom or molecule associated with Bioseqs from PDB. When a heterogen appears at the Seq-descr level, it means it was resolved in the crystal structure but is not associated with specific residues of the Bioseq. Heterogens which are associated with specific residues of the Bioseq are attached as features.

Seq-inst: Instantiating the Bioseq

top

Seq-inst.mol gives the physical type of the Bioseq in the living organism. If it is not certain if the Bioseq is DNA (dna) or RNA (rna), then (na) can be used to indicate just "nucleic acid". A protein is always (aa) or "amino acid". The values "not-set" or "other" are provided for internal use by editing and authoring tools, but should not be found on a finished Bioseq being sent to an analytical tool or database.

The representation class to which the Bioseq belongs is encoded in Seq-inst.repr. The values "not-set" or "other" are provided for internal use by editing and authoring tools, but should not be found on a finished Bioseq being sent to an analytical tool or database. The Data Model chapter discusses the representation class hierarchy in general. Specific details follow below.

Seq-inst: Virtual Bioseq

A "virtual" Bioseq is one in which we know the type of molecule, and possibly it's length, topology, and/or strandedness, but for which we do not have sequence data. It is not unusual to have some uncertainty about the length of a virtual Bioseq, so Seq-inst.fuzz may be used. The fields Seq-inst.seq-data and Seq-inst.ext are not appropriate for a virtual Bioseq.

Seq-inst: Raw Bioseq

A "raw" Bioseq does have sequence data, so Seq-inst.length must be set and there should be no Seq-inst.fuzz associated with it. Seq-inst.seq-data must be filled in with the sequence itself and a Seq-data encoding must be selected which is appropriate to Seq-inst.mol. The topology and strandedness may or may not be available. Seq-inst.ext is not appropriate.

Seq-inst: Segmented Bioseq

A segmented ("seg") Bioseq has all the properties of a virtual Bioseq, except that Seq-hist.ext of type Seq-ext.seg must be used to indicate the pieces of other Bioseqs to assemble to make the segmented Bioseq. A Seq-ext.seg is defined as a SEQUENCE OF Seq-loc, or a series of locations on other Bioseqs, taken in order.

For example, a segmented Bioseq (called "X") has a SEQUENCE OF Seq-loc which are an interval from position 11 to 20 on Bioseq "A" followed by an interval from position 6 to 15 on Bioseq "B". So "X" is a Bioseq with no internal gaps which is 20 residues long (no Seq-inst.fuzz). The first residue of "X" is the residue found at position 11 in "A". To obtain this residue, software must retrieve Bioseq "A" and examine the residue at "A" position 11. The segmented Bioseq contains no sequence data itself, only pointers to where to get the sequence data and what pieces to assemble in what order.

The type of segmented Bioseq described above might be used to represent the putative mRNA by simply pointing to the exons on two pieces of genomic sequence. Suppose however, that we had only sequenced around the exons on the genomic sequence, but wanted to represent the putative complete genomic sequence. Let us assume that Bioseq "A" is the genomic sequence of the first exon and some small amount of flanking DNA, and that Bioseq "B" is the genomic sequence around the second exon. Further, we may know from mapping that the exons are separated by about two kilobases of DNA. We can represent the genomic region by creating a segmented sequence in which the first location is all of Bioseq "A". The second location will be all of a virtual Bioseq (call it "C") whose length is two thousand and which has a Seq-inst.fuzz representing whatever uncertainty we may have about the exact length of the intervening genomic sequence. The third location will be all of Bioseq "B". If "A" is 100 base pairs long and "B" is 200 base pairs, then the segmented entry is 2300 base pairs long ("A"+"C"+"B") and has the same Seq-inst.fuzz as "C" to express the uncertainty of the overall length.

A variation of the case above is when one has no idea at all what the length of the intervening genomic region is. A segmented Bioseq can also represent this case. The Seq-inst.ext location chain would be first all of "A", then a Seq-loc of type "null", then all of "B". The "null" indicates that there is no available information here. The length of the segmented Bioseq is just the sum of the length of "A" and the length of "B", and Seq-inst.fuzz is set to indicate the real length is greater-than the length given. The "null" location does not add to the overall length of the segmented Bioseq and is ignored in determining the integer value of a location on the segmented Bioseq itself. If "A" is 100 base pairs long and "B" is 50 base pairs long, then position 0 on the segmented Bioseq is equivalent to the first residue of "A" and position 100 on the segmented Bioseq is equivalent to the first residue of "B", despite the intervening "null" location indicating the gap of unknown length. Utility functions such as the SeqPort (described in the Sequence Utilities chapter) can be configured to signal when crossing such boundaries, or to ignore them.

The Bioseqs referenced by a segmented Bioseq should always be from the same Seq-inst.mol class as the segmented Bioseq, but may well come from a mixture of Seq-inst.repr classes (as for example the mixture of virtual and raw Bioseq references used to describe sequenced and unsequenced genomic regions above). Other reasonable mixtures might be raw and map (see below) Bioseqs to describe a region which is fully mapped and partially sequenced, or even a mixture of virtual, raw, and map Bioseqs for a partially mapped and partially sequenced region. The "character" of any region of a segmented Bioseq is always taken from the underlying Bioseq to which it points in that region. However, a segmented Bioseq can have its own annotations. Things like feature tables are not automatically propagated to the segmented Bioseq.

Seq-inst: Reference Bioseq

A reference Bioseq is effectively a segmented Bioseq with only one pointer location. It behaves exactly like a segmented Bioseq in taking it's data and "character" from the Bioseq to which it points. Its purpose is not to construct a new Bioseq from others like a segmented Bioseq, but to refer to an existing Bioseq. It could be used to provide a convenient handle to a frequently used region of a larger Bioseq. Or it could be used to develop a customized, personally annotated view of a Bioseq in a public database without losing the "live" link to the public sequence.

In the first example, software would want to be able to use the Seq-loc to gather up annotations and descriptors for the region and display them to user with corrections to align them appropriately to the sub region. In this form, a scientist my refer to the "lac region" by name, and analyze or annotate it as if it were a separate Bioseq, but each retrieve starts with a fresh copy of the underlying Bioseq and annotations, so corrections or additions made to the underlying Bioseq in the public database will be immediately visible to the scientist, without either having to always look at the whole Bioseq or losing any additional annotations the scientist may have made on the region themselves.

In the second example, software would not propagate annotations or descriptors from the underlying Bioseq by default (because presumably the scientist prefers his own view to the public one) but the connection to the underlying Bioseq is not lost. Thus the public annotations are available on demand and any new annotations added by the scientist share the public coordinate system and can be compared with those done by others.

Seq-inst: Constructed Bioseq

A constructed (const) Bioseq inherits all the attributes of a raw Bioseq. It is used to represent a Bioseq which has been constructed by assembling other Bioseqs. In this case the component Bioseqs normally overlap each other and there may be considerable redundancy of component Bioseqs. A constructed Bioseq is often also called a "contig" or a "merge".

Most raw Bioseqs in the public databases were constructed by merging overlapping gel or sequencer readings of a few hundred base pairs each. While the const Bioseq data structure can easily accommodate this information, the const Bioseq data type was not really intended for this purpose. It was intended to represent higher level merges of public sequence data and private data, such as when a number of sequence entries from different authors are found to overlap or be contained in each other. In this case a view of the larger sequence region can be constructed by merging the components. The relationship of the merge to the component Bioseqs is preserved in the constructed Bioseq, but it is clear that the constructed Bioseq is a "better" or "more complete" view of the overall region, and could replace the component Bioseqs in some views of the sequence database. In this way an author can submit a data structure to the database which in this author's opinion supersedes his own or other scientist's database entries, without the database actually dropping the other author's entries (who may not necessarily agree with the author submitting the constructed Bioseq).

The constructed Bioseq is like a raw, rather than a segmented, Bioseq because Seq-inst.seq-data must be present. The sequence itself is part of the constructed Bioseq. This is because the component Bioseqs may overlap in a number of ways, and expert knowledge or voting rules may have been applied to determine the "correct" or "best" residue from the overlapping regions. The Seq-inst.seq-data contains the sequence which is the final result of such a process.

Seq-inst.ext is not used for the constructed Bioseq. The relationship of the merged sequence to its component Bioseqs is stored in Seq-inst.hist, the history of the Bioseq (described in more detail below). Seq-hist.assembly contains alignments of the constructed Bioseq with its component Bioseqs. Any Bioseq can have a Seq-hist.assembly. A raw Bioseq may use this to show its relationship to its gel readings. The constructed Bioseq is special in that its Seq-hist.assembly shows how a high level view was constructed from other pieces. The sequence in a constructed Bioseq is only posited to exist. However, since it is constructed from data by possibly many different laboratories, it may never have been sequenced in its entirety from a single biological source.

Seq-inst: Typical or Consensus Bioseq

A consensus (consen) Bioseq is used to represent a pattern typical of a sequence region or family of sequences. There is no assertion that even one sequence exists that is exactly like this one, or even that the Bioseq is a best guess at what a real sequence region looks like. Instead it summarizes attributes of an aligned collection of real sequences. It could be a "typical" ferredoxin made by aligning ferredoxin sequences from many organisms and producing a protein sequence which is by some measure "central" to the group. By using the NCBIpaa encoding for the protein, which permits a probability to be assigned to each position that any of the standard amino acids occurs there, one can create a "weight matrix" or "profile" to define the sequence.

While a consensus Bioseq can represent a frequency profile (including the probability that any amino acid can occur at a position, a type of gap penalty), it cannot represent a regular expression per se. That is because all Bioseqs represent fixed integer coordinate systems. This property is essential for attaching feature tables or expressing alignments. There is no clear way to attach a fixed coordinate system to a regular expression, while one can approximate allowing weighted gaps in specific regions with a frequency profile. Since the consensus Bioseq is like any other, information can be attached to it through a feature table and alignments of the consensus pattern to other Bioseqs can be represented like any other alignment (although it may be computed a special way). Through the alignment, annotated features on the pattern can be related to matched regions of the aligned sequence in a straightforward way.

Seq-hist.assembly can be used in a consensus Bioseq to record the sequence regions used to construct the pattern and their relationships with it. While Seq-hist.assembly for a constructed Bioseq indicates the relationship with Bioseqs which are meant to be superseded by the constructed Bioseq, the consensus Bioseq does not in any way replace the Bioseqs in its Seq-hist.assembly. Rather it is a summary of common features among them, not a "better" or "more complete" version of them.

Seq-inst: Map Bioseqs

A map Bioseq inherits all the properties of a virtual Bioseq. For a consensus genetic map of E.coli, we can posit that the chromosome is DNA, circular, double-stranded, and about 5 million base pairs long. Given this coordinate system, we estimate the positions of genes on it based on genetic evidence. That is, we build a feature table with Gene-ref features on it (explained in more detail in the Feature Table chapter). Thus, a map Bioseq is a virtual Bioseq with a Seq-inst.ext which is a feature table. In this case the feature table is an essential part of instantiating the Bioseq, not simply an annotation on the Bioseq. This is not to say a map Bioseq cannot have a feature table in the usual sense as well. It can. It can also be used in alignments, displays, or by any software that can process or store Bioseqs. This is the great strength of this approach. A genetic or physical map is just another Bioseq and can be stored or analyzed right along with other more typical Bioseqs.

It is understood that within a particular physical or genetic mapping research project more data will have to be present than the map Bioseq can represent. But the same is true for a big sequencing project. The Bioseq is an object for reporting the result of such projects to others in a way that preserves most or all the information of use to workers outside the particular research group. It also preserves enough information to be useful to software tools within the project, such as display tools or analysis tools which were written by others.

A number of attributes of Bioseqs can make such a generic representation more "natural" to a particular research community. For the E.coli map example, above, no E.coli geneticist thinks of the positions of genes in base pairs (yet). So a Num-ref annotation (see Seq-descr, below) can be attached to the Bioseq, which provides information to convert the internal integer coordinate system of the map Bioseq to "minutes", the floating point numbers from 0.0 to 100.0 that E.coli gene positions are traditionally given in. Seq-loc objects which the Gene-ref features use to indicate their position can represent uncertainty, and thus give some idea of the accuracy of the mapping in a simple way. This representation cannot store order information directly (e.g. B and C are after A and before D, but we don't know the absolute distance and we don't know the relative order of B and C), which would need to be stored in a genetic mapping research database. However, a reasonable enough presentation can be made of this situation using locations and uncertainties to be very useful for a wide variety of purposes. As more sequence and physical map information become available, such uncertainties in gene position, at least for the "typical" chromosome, will gradually be resolved and will then map very will to such a generic model.

A physical map Bioseq has similar strengths and weaknesses as the genetic map Bioseq. It can represent an ordered map (such as an ordered restriction map) very well and easily. For some contig building approaches, ordering information is essential to the process of building the physical map and would have to be stored and processed separately by the map building research group. However, the map Bioseq serves very well as a vehicle for periodic reports of the group's best view of the physical map for consumption by the scientific public. The map Bioseq data structure maps quite well to the figures such groups publish to summarize their work. The map Bioseq is an electronic summary that can be integrated with other data and software tools.

Seq-hist: History of a Seq-inst

top

Seq-hist is literally the history of the Seq-inst part of a Bioseq. It does not track changes in annotation at all. However, since the coordinate system provided by the Seq-inst is the critical element for tying annotations and alignments done at various times by various people into a single consistent database, this is the most important element to track.

While Seq-hist can use any valid Seq-id, in practice NCBI will use the best available Seq-id in the Seq-hist. For this purpose, the Seq-id most tightly linked to the exact sequence itself is best. See the Seq-id discussion.

Seq-hist.assembly has been mentioned above. It is a SET OF Seq-align which show the relationship of this Bioseq to any older components that might be merged into it. The Bioseqs included in the assembly are those from which this Bioseq was made or is meant to supersede. The Bioseqs in the assembly need not all be from the author, but could come from anywhere. Assembly just sets the Bioseq in context.

Seq-hist.replaces makes an editorial statement using a Seq-hist-rec. As of a certain date, this Bioseq should replace the following Bioseqs. Databases at NCBI interpret this in a very specific way. Seq-ids in Seq-hist.replaces, which are owned by the owner of the Bioseq, are taken from the public view of the database. The author has told us to replace them with this one. If the author does not own some of them, it is taken as advice that the older entries may be obsolete, but they are not removed from the public view.

Seq-hist.replaced-by is a forward pointer. It means this Bioseq was replaced by the following Seq-id(s) on a certain date. In the case described above, that an author tells NCBI that a new Bioseq replaces some of his old ones, not only is the backward pointer (Seq-hist.replaces) provided by the author in the database, but NCBI will update the Seq-hist.replaced-by forward pointer when the old Bioseq is removed from public view. Since such old entries are still available for specific retrieval by the public, if a scientist does have annotation pointing to the old entry, the new entry can be explicitly located. Conversely, the older versions of a Bioseq can easily be located as well. Note that Seq-hist.replaced-by points only one generation forward and Seq-hist.replaces points only one generation back. This makes Bioseqs with a Seq-hist a doubly linked list over its revision history. This is very different from GenBank/EMBL/DDBJ secondary accession numbers, which only indicate "some relationship" between entries. When that relationship happens to be the replacement relationship, they still carry all accession numbers in the secondary accessions, not just the last ones, so reconstructing the entry history is impossible, even in a very general way.

Another fate which may await a Bioseq is that it is completely withdrawn. This is relatively rare but does happen. Seq-hist.deleted can either be set to just TRUE, or the date of the deletion event can be entered (preferred). In the SeqHist C structure, slots for both the deleted boolean and deleted date are present. If the deleted date is present, the ASN.1 will have the Date CHOICE for Seq-hist.deleted, else if the deleted boolean is TRUE the ASN.1 will have the BOOLEAN form.

Seq-data: Encoding the Sequence Data Itself

top

In the case of a raw or constructed Bioseq, the sequence data itself is stored in Seq-inst.seq-data, which is the data type Seq-data. Seq-data is a CHOICE of different ways of encoding the data, allowing selection of the optimal type for the case in hand. Both nucleic acid and amino acid encoding are given as CHOICEs of Seq-data rather than further subclassing first. But it is still not reasonable to encode a Bioseq of Seq-inst.mol of "aa" using a nucleic acid Seq-data type.

In the C structures all types of Seq-data are stored in ByteStores in Bioseq.seq_data. The encoding is given by the value of Bioseq.seq_data_type. The file objseq.h contains a series of #defines for the values of Bioseq.seq_data_type. These #defines map exactly to the ASN.1 Seq-code-type described below.

The ASN.1 module seqcode.asn and C header objcode.h define tables for recording the allowed values for the various sequence encoding and the ways to display or map between codes. This permits useful information about the allowed encoding to be stored as ASN.1 data and read into a program at runtime. NCBI uses the text file seqcode.prt and the binary version of that, seqcode.val, with its software tools. Some of the data from this file is presented in tables in the following discussion of the different sequence encoding. The "value" is the internal numerical value of a residue in the C code. The "symbol" is a one letter or multi-letter symbol to be used in display to a human. The "name" is a descriptive name for the residue. Other data in seqcode.prt will be discussed in the section on seqcode.asn itself.

IUPACaa: The IUPAC-IUB Encoding of Amino Acids

A set of one letter abbreviations for amino acids were suggested by the IUPAC-IUB Commission on Biochemical Nomenclature, published in J. Biol. Chem. (1968) 243: 3557-3559. It is very widely used in both printed and electronic forms of protein sequence, and many computer programs have been written to analyze data in this form internally (that is the actual ASCII value of the one letter code is used internally). To support such approaches, the IUPACaa encoding represents each amino acid internally as the ASCII value of its external one letter symbol. Note that this symbol is UPPER CASE. One may choose to display the value as lower case to a user for readability, but the data itself must be the UPPER CASE value.

In the NCBI C code implementation, the values are stored one value per byte.

IUPACaa
ValueSymbol Name
65AAlanine
66BAsp or Asn
67CCysteine
68DAspartic Acid
69EGlutamic Acid
70FPhenylalanine
71GGlycine
72HHistidine
73IIsoleucine
75KLysine
76LLeucine
77MMethionine
78NAsparagine
80PProline
81QGlutamine
82RArginine
83SSerine
84TThreoine
86VValine
87WTryptophan
88XUndetermined or atypical
89YTyrosine
90ZGlu or Gln

NCBIeaa: Extended IUPAC Encoding of Amino Acids

The official IUPAC amino acid code has some limitations. One is the lack of symbols for termination, gap, or selenocysteine. Such extensions to the IUPAC codes are also commonly used by sequence analysis software. NCBI has created such a code which is simply the IUPACaa code above extended with the additional symbols.

In the NCBI C code implementation, the values are stored one value per byte.

NCBIeaa
ValueSymbol Name
42*Termination
45-Gap
65AAlanine
66BAsp or Asn
67CCysteine
68DAspartic Acid
69EGlutamic Acid
70FPhenylalanine
71GGlycine
72HHistidine
73IIsoleucine
75KLysine
76LLeucine
77MMethionine
78NAsparagine
80PProline
81QGlutamine
82RArginine
83SSerine
84TThreoine
85USelenocysteine
86VValine
87WTryptophan
88XUndetermined or atypical
89YTyrosine
90ZGlu or Gln

NCBIstdaa: A Simple Sequential Code for Amino Acids

It is often very useful to separate the external symbol for a residue from its internal representation as a data value. For amino acids NCBI has devised a simple continuous set of values that encompasses the set of "standard" amino acids also represented by the NCBIeaa code above. A continuous set of values means that compact arrays can be used in computer software to look up attributes for residues simply and easily by using the value as an index into the array. The only significance of any particular mapping of a value to an amino acid is that zero is used for gap and the official IUPAC amino acids come first in the list. In general, we recommend the use of this encoding for standard amino acid sequences.

In the NCBI C code implementation, the values are stored one value per byte.

NCBIstdaa
ValueSymbol Name
0-Gap
1AAlanine
2BAsp or Asn
3CCysteine
4DAspartic Acid
5EGlutamic Acid
6FPhenylalanine
7GGlycine
8HHistidine
9IIsoleucine
10KLysine
11LLeucine
12MMethionine
13NAsparagine
14PProline
15QGlutamine
16RArginine
17SSerine
18TThreoine
19VValine
20WTryptophan
21XUndetermined or atypical
22YTyrosine
23ZGlu or Gln
24USelenocysteine
25*Termination

NCBI8aa: An Encoding for Modified Amino Acids

Post-translational modifications can introduce a number of non-standard or modified amino acids into biological molecules. The NCBI8aa code will be used to represent up to 250 possible amino acids by using the remaining coding space in the NCBIstdaa code. That is, for the first 26 values, NCBI8aa will be identical to NCBIstdaa. The remaining 224 values will be used for the most commonly encountered modified amino acids. Only the first 250 values will be used to signify amino acids, leaving values in the range of 250-255 to be used for software control codes. Obviously there are a very large number of possible modified amino acids, especially if one takes protein engineering into account. However, the intent here is to only represent commonly found biological forms. This encoding is not yet available since decisions about what amino acids to include have not all been made yet.

IUPAC3aa: A 3 Letter Display Code for Amino Acids

The IUPAC3aa code uses exactly the same values as NCBIstdaa. The only difference is the symbol is the three letter instead of the one letter code. This code is purely for display. As such it does not appear as a valid CHOICE in Seq-data for encoding actual sequence data. However, it does appear in the seqcode.asn specification and is stored in seqcode.val. The symbols follow the IUPAC-IUB recommendations for three letter codes where possible.

IUPAC3aa
ValueSymbol Name
0---Gap
1AlaAlanine
2AsxAsp or Asn
3CysCysteine
4AspAspartic Acid
5GluGlutamic Acid
6PhePhenylalanine
7GlyGlycine
8HisHistidine
9IleIsoleucine
10LysLysine
11LeuLeucine
12MetMethionine
13AsnAsparagine
14ProProline
15GlnGlutamine
16ArgArginine
17SerSerine
18ThrThreoine
19ValValine
20TrpTryptophan
21XxxUndetermined or atypical
22TyrTyrosine
23GlxGlu or Gln
24SecSelenocysteine
25TerTermination

NCBIpaa: A Profile Style Encoding for Amino Acids

The NCBIpaa encoding is designed to accommodate a frequency profile describing a protein motif or family in a form which is consistent with the sequences in a Bioseq. Each position in the sequence is defined by 30 values. Each of the 30 values represents the probability that a particular amino acid (or gap, termination, etc.) will occur at that position. One can consider each set of 30 values an array. The amino acid for each cell of the 30 value array corresponds to the NCBIstdaa index scheme. This means that currently only the first 26 array elements will ever have a meaningful value. The remaining 4 cells are available for possible future additions to NCBIstdaa. Each cell represents the probability that the amino acid defined by the NCBIstdaa index to that cell will appear at that position in the motif or protein. The probability is encoded as an 8-bit value from 0-255 corresponding to a probability from 0.0 to 1.0 by interpolation.

This type of encoding would presumably never appear except in a Bioseq of type "consensus". In the C code implementation these amino acids are encoded at 30 bytes per amino acid in a simple linear order. That is, the first 30 bytes are the first amino acid, the second 30 the next amino acid, and so on.

IUPACna: The IUPAC-IUB Encoding for Nucleic Acids

Like the IUPACaa codes the IUPACna codes are single letters for nucleic acids and the value is the same as the ASCII value of the recommended IUPAC letter. The IUPAC recommendations for nucleic acid codes also include letters to represent all possible ambiguities at a single position in the sequence except a gap. To make the values non-redundant, U is considered the same as T. Whether a sequence actually contains U or T is easily determined from Seq-inst.mol. Since some software tools are designed to work directly on the ASCII representation of the IUPAC letters, this representation is provided. Note that the ASCII values correspond to the UPPER CASE letters. Using values corresponding to lower case letters in Seq-data is an error. For display to a user, any readable case or font is appropriate.

The C implementation encodes one value for a nucleic acid residue per byte.

IUPACna
ValueSymbol Name
65AAdenine
66BG or T or C
67CCytosine
68DG or A or T
71GGuanine
72HA or C or T
75KG or T
77MA or C
78NA or G or C or T
82RG or A
83SG or C
84TThymine
86VG or C or A
87WA or T
89YT or C

NCBI4na: A Four Bit Encoding of Nucleic Acids

It is possible to represent the same set of nucleic acid and ambiguities with a four bit code, where one bit corresponds to each possible base and where more than one bit is set to represent ambiguity. The particular encoding used for NCBI4na is the same as that used on the GenBank Floppy Disk Format. A four bit encoding has several advantages over the direct mapping of the ASCII IUPAC codes. One can represent "no base" as 0000. One can match various ambiguous or unambiguous bases by a simple AND. For example, in NCBI4na 0001=A, 0010=C, 0100=G, 1000=T/U. Adenine (0001) then matches Purine (0101) by the AND method. Finally, it is possible to store the sequence in half the space by storing two bases per byte. This is done both in the ASN.1 encoding and in the NCBI C software implementation. Utility functions (see SeqPort()) allow the developer to ignore the complexities of storage while taking advantage of the greater packing. Since nucleic acid sequences can be very long, this is a real savings.

NCBI4na
ValueSymbol Name
0-Gap
1AAdenine
2CCytosine
3MA or C
4GGuanine
5RG or A
6SG or C
7VG or C or A
8TThymine/Uracil
9WA or T
10YT or C
11HA or C or T
12KG or T
13DG or A or T
14BG or T or C
15NA or G or C or T

NCBI2na: A Two Bit Encoding for Nucleic Acids

If no ambiguous bases are present in a nucleic acid sequence it can be completely encoded using only two bits per base. This allows encoding into ASN.1 or storage in the NCBI C implementation with a four fold savings in space. As with the four bit packing, the NCBI C utility SeqPort() allows the programmer to ignore the complexities introduced by the packing. The two bit encoding selected is the same as that proposed for the GenBank CDROM.

NCBI2na
ValueSymbol Name
0AAdenine
1CCytosine
2GGuanine
3TThymine/Uracil

NCBI8na: An Eight Bit Sequential Encoding for Modified Nucleic Acids

The first 16 values of NCBI8na are identical with those of NCBI4na. The remaining possible 234 values will be used for common, biologically occurring modified bases such as those found in tRNAs. This full encoding is still being determined at the time of this writing. Only the first 250 values will be used, leaving values in the range of 250-255 to be used as control codes in software.

NCBIpna: A Frequency Profile Encoding for Nucleic Acids

Frequency profiles have been used to describe motifs and signals in nucleic acids. This can be encoded by using five bytes per sequence position. The first four bytes are used to express the probability that particular bases occur at that position, in the order A, C, G, T as in the NCBI2na encoding. The fifth position encodes the probability that a base occurs there at all. Each byte has a value from 0-255 corresponding to a probability from 0.0-1.0.

The sequence is encoded as a simple linear sequence of bytes where the first five bytes code for the first position, the next five for the second position, and so on. Typically the NCBIpna notation would only be found on a Bioseq of type consensus. However, one can imagine other uses for such an encoding, for example to represent knowledge about low resolution sequence data in an easily computable form.

Tables of Sequence Codes

top

Various sequence alphabets can be stored in tables of type Seq-code-table, defined in seqcode.asn. An enumerated type, Seq-code-type is used as a key to each table. Each code can be thought of as a square table essentially like those presented above in describing each alphabet. Each "residue" of the code has a numerical one-byte value used to represent that residue both in ASN.1 data and in internal C structures. The information necessary to display the value is given by the "symbol". A symbol can be in a one-letter series (e.g. A,G,C,T) or more than one letter (e.g. Met, Leu, etc.). The symbol gives a human readable representation the corresponds to each numerical residue value. A name, or explanatory string, is also associated with each.

So, the NCBI2na code above would be coded into a Seq-code-table very simply as:


{ -- NCBI2na


code ncbi2na ,

num 4 , -- continuous 0-3

one-letter TRUE , -- all one letter codes

table {

{ symbol "A", name "Adenine" },

{ symbol "C", name "Cytosine" },

{ symbol "G", name "Guanine" },

{ symbol "T", name "Thymine/Uracil"}

} , -- end of table

comps { -- complements

3,

2,

1,

0

}

} ,

The table has 4 rows (with values 0-3) with one letter symbols. If we wished to represent a code with values which do not start at 0 (such as the IUPAC codes) then we would set the OPTIONAL "start-at" element to the value for the first row in the table.

In the case of nucleic acid codes, the Seq-code-table also has rows for indexes to complement the values represented in the table. In the example above, the complement of 0 ("A") is 3 ("T").

Mapping Between Different Sequence Alphabets

top

A Seq-map-table provides a mapping from the values of one alphabet to the values of another, very like the way complements are mapped above. A Seq-map-table has two Seq-code-types, one giving the alphabet to map from and the other the alphabet to map to. The Seq-map-table has the same number of rows and the same "start-at" value as the Seq-code-table for the alphabet it maps FROM. This makes the mapping a simple array lookup using the value of a residue of the FROM alphabet and subtracting "start-at". Remember that alphabets are not created equal and mapping from a bigger alphabet to a smaller may result in loss of information.

Data and Tools for Sequence Alphabets

top

NCBI provides a collection of Seq-code-tables and Seq-map-tables together in a Seq-code-set as part of the software toolbox. The file is called seqcode.prt (text form) or seqcode.val (binary ASN.1 used by the software). The function SeqCodeSetLoad() will check your NCBI configuration file looking for the path to "DATA", then read seqcode.val into memory using SeqCodeSetAsnRead(). A local static pointer to the loaded SeqCodes is kept in the SeqCode module, and thus need not be kept by the caller. Additional functions use the static pointer to provide access to the codes. SeqCodeTableFind() will return the appropriate SeqCodeTablePtr given a valid sequence code, and SeqMapTableFind() will return the appropriate SeqMapTablePtr given a code to map from and a code to map to. The SeqPort functions use these functions to provide a view of a sequence in any requested alphabet by mapping residues on demand. See the chapter on Writing Sequence Software.

Pubdesc: Publication Describing a Bioseq

top

A Pubdesc is a data structure used to record how a particular publication described a Bioseq. It contains the citation itself as a Pub-equiv (see the Bibliographic References chapter) so that equivalent forms of the citation (e.g. a MEDLINE uid and a Cit-Art) can all be accommodated in a single data structure. Then a number of additional fields allow a more complete description of what was presented in the publication. These extra fields are generally only filled in for entries produced by the NCBI journal scanning component of GenBank, also known as the Backbone database. This information is not generally available in data from any other database yet.

Pubdesc.name is the name given the sequence in the publication, usually in the figure. Pubdesc.fig gives the figure the Bioseq appeared in so a scientist can locate it in the paper. Pubdesc.num preserves the numbering system used by the author (see Numbering below). Pubdesc.numexc, if TRUE, indicates that a "numbering exception" was found (i.e. the author's numbering did not agree with the number of residues in the sequence). This usually indicates an error in the preparation of the figure. If Pubdesc.poly-a is TRUE, then a poly-A tract was indicated for the Bioseq in the figure, but was not explicitly preserved in the sequence itself (e.g. ...AGAATTTCT (Poly-A) ). Pubdesc.maploc is the map location for this sequence as given by the author in this paper. Pubdesc.seq-raw allows the presentation of the sequence exactly as typed from the figure. This is never used now. Pubdesc.align-group, if present, indicates the Bioseq was presented in a group aligned with other Bioseqs. The align-group value is an arbitrary integer. Other Bioseqs from the same publication which are part of the same alignment will have the same align-group number.

Pubdesc.comment is simply a free text comment associated with this publication. SWISSPROT entries may also have this field filled.

Numbering: Applying a Numbering System to a Bioseq

top

Internally, locations on Bioseqs are ALWAYS integer offsets in the range 0 to (length - 1). However, it is often helpful to display some other numbering system. The Numbering data structure supports a variety of numbering styles and conventions. In the ASN.1 specification, it is simply a CHOICE of the four possible types. When a Numbering object is supplied as a Seq-descr, then it applies to the complete length of the Bioseq. A Numbering object can also be a feature, in which case it only applies to the interval defined by the feature's location.

Num-cont: A Continuous Integer Numbering System

The most widely used numbering system for sequences is some form of a continuous integer numbering. Num-cont.refnum is the number to assign to the first residue in the Bioseq. If Num-cont.has-zero is TRUE, the numbering system uses zero. When biologists start numbering with a negative number, it is quite common for them to skip zero, going directly from -1 to +1, so the DEFAULT for has-zero is FALSE. This only reflects common usage, not any recommendation in terms of convention. Any useful software tool should support both conventions, since they are both used in the literature. Finally, the most common numbering systems are ascending, however descending numbering systems are encountered from time to time, so Num-cont.ascending would then be set to FALSE.

Num-real: A Real Number Numbering Scheme

Genetic maps may use real numbers as "map units" since they treat the chromosome as a continuous coordinate system, instead of a discrete, integer coordinate system of base pairs. Thus a Bioseq of type "map" which may use an underlying integer coordinate system from 0 to 5 million may be best presented to user in the familiar 0.0 to 100.0 map units. Num-real supports a simply linear equation specifying the relationship:

map units = ( Num-real.a + base_pair_position) + Num-real.b

in this example. Since such numbering systems generally have their own units (e.g. "map units", "centisomes", "centimorgans", etc), Num-real.units provides a string for labeling the display.

Num-enum: An Enumerated Numbering Scheme

Occasionally biologists do not use a continuous numbering system at all. Crystallographers and immunologists, for example, who do extensive studies on one or a few sequences, may name the individual residues in the sequence as they fit them into a theoretical framework. So one might see residues numbered ... "10" "11" "12" "12A" "12B" "12C" "13" "14" ... To accommodate this sort of scheme the "name" of each residue must be explicitly given by a string, since there is no anticipating any convention that may be used. The Num-enum.num gives the number of residue names (which should agree with the number of residues in the Bioseq, in the case of use as a Seq-descr), followed by the names as strings.

Num-ref: Numbering by Reference to Another Bioseq

Two types of references are allowed. The "sources" references is meant to apply the numbering system of constituent Bioseqs to a segmented Bioseq. This is useful for seeing the mapping from the parts to the whole.

The "aligns" reference requires that the Num-ref-aligns alignment be filled in with an alignment of the target Bioseq with one or more pieces of other Bioseqs. The numbering will come from the aligned pieces.

Numbering: C Structures and Utility Functions

A Numbering object is implemented in C simply as a ValNode, where ValNode.choice is given by a series of #defines in objpubd.h and ValNode.ptrvalue is a pointer to the appropriate data structure for the Numbering type.

In sequtil.h (see the Sequence Utilities chapter) a number of functions are defined which convert from internal to display numbering systems and vice versa. These functions make the use of fairly complex numbering systems fairly straightforward.

ASN.1 Specification: seq.asn

top
--$Revision: 2.1 $

--**********************************************************************
--
--  NCBI Sequence elements
--  by James Ostell, 1990
--
--********************************************************************** NCBI-Sequence DEFINITIONS ::= BEGIN EXPORTS Bioseq, Seq-annot, Pubdesc, Seq-descr, Numbering, Heterogen; IMPORTS Date, Int-fuzz, Dbtag, Object-id, User-object FROM NCBI-General
        Seq-align FROM NCBI-Seqalign
        Seq-feat FROM NCBI-Seqfeat
        Seq-graph FROM NCBI-Seqres
        Pub-equiv FROM NCBI-Pub
        Org-ref FROM NCBI-Organism
        Seq-id, Seq-loc FROM NCBI-Seqloc
        Link-set FROM NCBI-Access
		GB-block FROM GenBank-General
		PIR-block FROM PIR-General
        EMBL-block FROM EMBL-General
		SP-block FROM SP-General
		PRF-block FROM PRF-General
		PDB-block FROM PDB-General;
--*** Sequence ********************************
--* Bioseq ::= SEQUENCE {
    id SET OF Seq-id ,            -- equivalent identifiers
    descr Seq-descr OPTIONAL , -- descriptors
    inst Seq-inst ,            -- the sequence data
    annot SET OF Seq-annot OPTIONAL }
--*** Descriptors *****************************
--* Seq-descr ::= SET OF CHOICE {
    mol-type GIBB-mol ,          -- type of molecule
    modif SET OF GIBB-mod ,      -- modifiers
    method GIBB-method ,         -- sequencing method
    name VisibleString ,         -- a name for this sequence
    title VisibleString ,        -- a title for this sequence
    org Org-ref ,                -- if all from one organism
    comment VisibleString ,      -- a more extensive comment
    num Numbering ,              -- a numbering system
    maploc Dbtag ,               -- map location of this sequence
    pir PIR-block ,              -- PIR specific info
    genbank GB-block ,           -- GenBank specific info
    pub Pubdesc ,                -- a reference to the publication
    region VisibleString ,       -- overall region (globin locus)
    user User-object ,           -- user defined object
	sp SP-block ,                -- SWISSPROT specific info
    neighbors Link-set ,         -- neighboring information
    embl EMBL-block ,            -- EMBL specific information
	create-date Date ,           -- date entry first created/released
	update-date Date ,           -- date of last update
	prf PRF-block ,				 -- PRF specific information
	pdb PDB-block ,              -- PDB specific information
	het Heterogen }              -- cofactor, etc associated but not bound GIBB-mol ::= ENUMERATED {       -- type of molecule represented
    unknown (0) ,
    genomic (1) ,
    pre-mRNA (2) ,
    mRNA (3) ,
    rRNA (4) ,
    tRNA (5) ,
    snRNA (6) ,
    scRNA (7) ,
    peptide (8) ,
	other-genetic (9) ,      -- other genetic material
	genomic-mRNA (10) ,      -- reported a mix of genomic and cdna sequence
    other (255) }
     GIBB-mod ::= ENUMERATED {        -- GenInfo Backbone modifiers
    dna (0) ,
    rna (1) ,
    extrachrom (2) ,
    plasmid (3) ,
    mitochondrial (4) ,
    chloroplast (5) ,
    kinetoplast (6) ,
    cyanelle (7) ,
    synthetic (8) ,
    recombinant (9) ,
    partial (10) ,
    complete (11) ,
    mutagen (12) ,    -- subject of mutagenesis ?
    natmut (13) ,     -- natural mutant ?
    transposon (14) ,
    insertion-seq (15) ,
	no-left (16) ,    -- missing left end (5' for na, NH2 for aa)
	no-right (17) ,   -- missing right end (3' or COOH)
	macronuclear (18) ,
	proviral (19) ,
	est (20) ,        -- expressed sequence tag
    other (255) } GIBB-method ::= ENUMERATED {        -- sequencing methods
    concept-trans (1) ,    -- conceptual translation
    seq-pept (2) ,         -- peptide was sequenced
    both (3) ,             -- concept transl. w/ partial pept. seq.
	seq-pept-overlap (4) , -- sequenced peptide, ordered by overlap
	seq-pept-homol (5) ,   -- sequenced peptide, ordered by homology
	concept-trans-a (6) ,  -- conceptual transl. supplied by author
    other (255) }
     Numbering ::= CHOICE {           -- any display numbering system
    cont Num-cont ,              -- continuous numbering
    enum Num-enum ,              -- enumerated names for residues
    ref Num-ref ,                -- by reference to another sequence
    real Num-real }              -- supports mapping to a float system
     Num-cont ::= SEQUENCE {          -- continuous display numbering system
    refnum INTEGER DEFAULT 1,         -- number assigned to first residue
    has-zero BOOLEAN DEFAULT FALSE ,  -- 0 used?
    ascending BOOLEAN DEFAULT TRUE }  -- ascending numbers? Num-enum ::= SEQUENCE {          -- any tags to residues
    num INTEGER ,                        -- number of tags to follow
    names SEQUENCE OF VisibleString }    -- the tags Num-ref ::= SEQUENCE {           -- by reference to other sequences
    type ENUMERATED {            -- type of reference
        not-set (0) ,
        sources (1) ,            -- by segmented or const seq sources
        aligns (2) } ,           -- by alignments given below
    aligns Seq-align OPTIONAL } Num-real ::= SEQUENCE {          -- mapping to floating point system
    a REAL ,                     -- from an integer system used by Bioseq
    b REAL ,                     -- position = (a * int_position) + b
    units VisibleString OPTIONAL } Pubdesc ::= SEQUENCE {              -- how sequence presented in pub
    pub Pub-equiv ,                 -- the citation(s)
    name VisibleString OPTIONAL ,   -- name used in paper
    fig VisibleString OPTIONAL ,    -- figure in paper
    num Numbering OPTIONAL ,        -- numbering from paper
    numexc BOOLEAN OPTIONAL ,       -- numbering problem with paper
    poly-a BOOLEAN OPTIONAL ,       -- poly A tail indicated in figure?
    maploc VisibleString OPTIONAL , -- map location reported in paper
    seq-raw StringStore OPTIONAL ,  -- original sequence from paper
    align-group INTEGER OPTIONAL ,  -- this seq aligned with others in paper
	comment VisibleString OPTIONAL }-- any comment on this pub in context Heterogen ::= VisibleString       -- cofactor, prosthetic group, inibitor, etc
--*** Instances of sequences *******************************
--* Seq-inst ::= SEQUENCE {            -- the sequence data itself
    repr ENUMERATED {              -- representation class
        not-set (0) ,              -- empty
        virtual (1) ,              -- no seq data
        raw (2) ,                  -- continuous sequence
        seg (3) ,                  -- segmented sequence
        const (4) ,                -- constructed sequence
        ref (5) ,                  -- reference to another sequence
        consen (6) ,               -- consensus sequence or pattern
        map (7) ,                  -- ordered map (genetic, restriction)
        other (255) } ,
    mol ENUMERATED {               -- molecule class in living organism
        not-set (0) ,              --   > cdna = rna
        dna (1) ,
        rna (2) ,
        aa (3) ,
        na (4) ,                   -- just a nucleic acid
        other (255) } ,
    length INTEGER OPTIONAL ,      -- length of sequence in residues
    fuzz Int-fuzz OPTIONAL ,       -- length uncertainty
    topology ENUMERATED {          -- topology of molecule
        not-set (0) ,
        linear (1) ,
        circular (2) ,
        tandem (3) ,               -- some part of tandem repeat
        other (255) } DEFAULT linear ,
    strand ENUMERATED {            -- strandedness in living organism
        not-set (0) ,
        ss (1) ,                   -- single strand
        ds (2) ,                   -- double strand
        mixed (3) ,
        other (255) } OPTIONAL ,   -- default ds for DNA, ss for RNA, pept
    seq-data Seq-data OPTIONAL ,   -- the sequence
    ext Seq-ext OPTIONAL ,         -- extensions for special types
	hist Seq-hist OPTIONAL }       -- sequence history
--*** Sequence Extensions **********************************
--*  for representing more complex types
--*  const type uses Seq-hist.assembly Seq-ext ::= CHOICE {
    seg Seg-ext ,        -- segmented sequences
    ref Ref-ext ,        -- hot link to another sequence (a view)
    map Map-ext }        -- ordered map of markers Seg-ext ::= SEQUENCE OF Seq-loc Ref-ext ::= Seq-loc Map-ext ::= SEQUENCE OF Seq-feat
--*** Sequence History Record ***********************************
--** assembly = records how seq was assembled from others
--** replaces = records sequences made obsolete by this one
--** replaced-by = this seq is made obsolete by another(s) Seq-hist ::= SEQUENCE {
	assembly SET OF Seq-align OPTIONAL ,-- how was this assembled?
	replaces Seq-hist-rec OPTIONAL ,    -- seq makes these seqs obsolete
	replaced-by Seq-hist-rec OPTIONAL , -- these seqs make this one obsolete
	deleted CHOICE {
		bool BOOLEAN ,
		date Date } OPTIONAL } Seq-hist-rec ::= SEQUENCE {
	date Date OPTIONAL ,
	ids SET OF Seq-id }
	
--*** Various internal sequence representations ************
--*      all are controlled, fixed length forms Seq-data ::= CHOICE {              -- sequence representations
    iupacna IUPACna ,              -- IUPAC 1 letter nuc acid code
    iupacaa IUPACaa ,              -- IUPAC 1 letter amino acid code
    ncbi2na NCBI2na ,              -- 2 bit nucleic acid code
    ncbi4na NCBI4na ,              -- 4 bit nucleic acid code
    ncbi8na NCBI8na ,              -- 8 bit extended nucleic acid code
    ncbipna NCBIpna ,              -- nucleic acid probabilities
    ncbi8aa NCBI8aa ,              -- 8 bit extended amino acid codes
    ncbieaa NCBIeaa ,              -- extended ASCII 1 letter aa codes
    ncbipaa NCBIpaa ,              -- amino acid probabilities
    ncbistdaa NCBIstdaa }          -- consecutive codes for std aas IUPACna ::= StringStore       -- IUPAC 1 letter codes, no spaces IUPACaa ::= StringStore       -- IUPAC 1 letter codes, no spaces NCBI2na ::= OCTET STRING      -- 00=A, 01=C, 10=G, 11=T NCBI4na ::= OCTET STRING      -- 1 bit each for agct
                              -- 0001=A, 0010=C, 0100=G, 1000=T/U
                              -- 0101=Purine, 1010=Pyrimidine, etc NCBI8na ::= OCTET STRING      -- for modified nucleic acids NCBIpna ::= OCTET STRING      -- 5 octets/base, prob for a,c,g,t,n
                              -- probabilities are coded 0-255 = 0.0-1.0 NCBI8aa ::= OCTET STRING      -- for modified amino acids NCBIeaa ::= StringStore       -- ASCII extended 1 letter aa codes
                              -- IUPAC codes + U=selenocysteine NCBIpaa ::= OCTET STRING      -- 25 octets/aa, prob for IUPAC aas in order:
                              -- A-Y,B,Z,X,(ter),anything
                              -- probabilities are coded 0-255 = 0.0-1.0 NCBIstdaa ::= OCTET STRING    -- codes 0-25, 1 per byte
--*** Sequence Annotation *************************************
--* Seq-annot ::= SEQUENCE {
    id Object-id OPTIONAL ,
    db Dbtag OPTIONAL ,
    name VisibleString OPTIONAL ,
    desc VisibleString OPTIONAL ,
    data CHOICE {
        ftable SET OF Seq-feat ,
        align SET OF Seq-align ,
        graph SET OF Seq-graph } } END

ASN.1 Specification: seqblock.asn

top
--$Revision: 2.0 $

--*********************************************************************
--
--  EMBL specific data
--  This block of specifications was developed by Reiner Fuchs of EMBL
--
--********************************************************************* EMBL-General DEFINITIONS ::= BEGIN EXPORTS EMBL-dbname, EMBL-xref, EMBL-block; IMPORTS Date, Object-id FROM NCBI-General; EMBL-dbname ::= CHOICE {
	code ENUMERATED {
		embl(0),
		genbank(1),
		ddbj(2),
		geninfo(3),
		medline(4),
		swissprot(5),
		pir(6),
		pdb(7),
		epd(8),
		ecd(9),
		tfd(10),
		flybase(11),
		prosite(12),
		enzyme(13),
		mim(14),
		ecoseq(15),
		hiv(16) },
	name	VisibleString } EMBL-xref ::= SEQUENCE {
	dbname EMBL-dbname,
	id SEQUENCE OF Object-id } EMBL-block ::= SEQUENCE {
	class ENUMERATED {
		not-set(0),
		standard(1),
		unannotated(2),
		other(255) } DEFAULT standard,
	div ENUMERATED {
		fun(0),
		inv(1),
		mam(2),
		org(3),
		phg(4),
		pln(5),
		pri(6),
		pro(7),
		rod(8),
		syn(9),
		una(10),
		vrl(11),
		vrt(12) } OPTIONAL,
	creation-date Date,
	update-date Date,
	extra-acc SEQUENCE OF VisibleString OPTIONAL,
	keywords SEQUENCE OF VisibleString OPTIONAL,
	xref SEQUENCE OF EMBL-xref OPTIONAL } END
--*********************************************************************
--
--  SWISSPROT specific data
--  This block of specifications was developed by Mark Cavanaugh of
--		NCBI working with Amos Bairoch of SWISSPROT
--
--********************************************************************* SP-General DEFINITIONS ::= BEGIN EXPORTS SP-block; IMPORTS Date, Dbtag FROM NCBI-General
		Seq-id FROM NCBI-SeqLoc; SP-block ::= SEQUENCE {         -- SWISSPROT specific descriptions
	class ENUMERATED {
		not-set (0) ,
		standard (1) ,      -- conforms to all SWISSPROT checks
		prelim (2) ,        -- only seq and biblio checked
		other (255) } ,
	extra-acc SET OF VisibleString OPTIONAL ,  -- old SWISSPROT ids
	imeth BOOLEAN DEFAULT FALSE ,  -- seq known to start with Met
	plasnm SET OF VisibleString OPTIONAL,  -- plasmid names carrying gene
	seqref SET OF Seq-id OPTIONAL,         -- xref to other sequences
	dbref SET OF Dbtag OPTIONAL ,          -- xref to non-sequence dbases
	keywords SET OF VisibleString OPTIONAL , -- keywords
	created Date OPTIONAL ,			-- creation date
	sequpd Date OPTIONAL ,			-- sequence update
	annotupd Date OPTIONAL }		-- annotation update END
--*********************************************************************
--
--  PIR specific data
--  This block of specifications was developed by Jim Ostell of
--		NCBI
--
--********************************************************************* PIR-General DEFINITIONS ::= BEGIN EXPORTS PIR-block; IMPORTS Seq-id FROM NCBI-SeqLoc; PIR-block ::= SEQUENCE {          -- PIR specific descriptions
    had-punct BOOLEAN OPTIONAL ,      -- had punctuation in sequence ?
    host VisibleString OPTIONAL ,
    source VisibleString OPTIONAL ,     -- source line
    summary VisibleString OPTIONAL ,
    genetic VisibleString OPTIONAL ,
    includes VisibleString OPTIONAL ,
    placement VisibleString OPTIONAL ,
    superfamily VisibleString OPTIONAL ,
    keywords SEQUENCE OF VisibleString OPTIONAL ,
    cross-reference VisibleString OPTIONAL ,
    date VisibleString OPTIONAL ,
	seq-raw VisibleString OPTIONAL ,  -- seq with punctuation
	seqref SET OF Seq-id OPTIONAL }         -- xref to other sequences END
--*********************************************************************
--
--  GenBank specific data
--  This block of specifications was developed by Jim Ostell of
--		NCBI
--
--********************************************************************* GenBank-General DEFINITIONS ::= BEGIN EXPORTS GB-block; IMPORTS Date FROM NCBI-General; GB-block ::= SEQUENCE {          -- GenBank specific descriptions
    extra-accessions SEQUENCE OF VisibleString OPTIONAL ,
    source VisibleString OPTIONAL ,     -- source line
    keywords SEQUENCE OF VisibleString OPTIONAL ,
    origin VisibleString OPTIONAL,
    date VisibleString OPTIONAL ,       -- old form Entry Date
    entry-date Date OPTIONAL ,          -- replaces date
    div VisibleString OPTIONAL ,        -- GenBank division
    taxonomy VisibleString OPTIONAL }   -- continuation line of organism END
--**********************************************************************
-- PRF specific definition
--    PRF is a protein sequence database crated and maintained by
--    Protein Research Foundation, Minoo-city, Osaka, Japan.
--
--    Written by A.Ogiwara, Inst.Chem.Res. (Dr.Kanehisa's Lab),
--            Kyoto Univ., Japan
--
--********************************************************************** PRF-General DEFINITIONS ::= BEGIN EXPORTS PRF-block; PRF-block ::= SEQUENCE {
      extra-src       PRF-ExtraSrc OPTIONAL,
      keywords        SEQUENCE OF VisibleString OPTIONAL
} PRF-ExtraSrc ::= SEQUENCE {
      host    VisibleString OPTIONAL,
      part    VisibleString OPTIONAL,
      state   VisibleString OPTIONAL,
      strain  VisibleString OPTIONAL,
      taxon   VisibleString OPTIONAL
} END
--*********************************************************************
--
--  PDB specific data
--  This block of specifications was developed by Jim Ostell and
--		Steve Bryant of NCBI
--
--********************************************************************* PDB-General DEFINITIONS ::= BEGIN EXPORTS PDB-block; IMPORTS Date FROM NCBI-General; PDB-block ::= SEQUENCE {          -- PDB specific descriptions
	deposition Date ,         -- deposition date  month,year
	class VisibleString ,
	compound SEQUENCE OF VisibleString ,
	source SEQUENCE OF VisibleString ,
	exp-method VisibleString OPTIONAL ,  -- present if NOT X-ray diffraction
	replace PDB-replace OPTIONAL } -- replacement history PDB-replace ::= SEQUENCE {
	date Date ,
	ids SEQUENCE OF VisibleString }   -- entry ids replace by this one END

ASN.1 Specification: seqcode.asn

top
--$Revision: 2.0 $

--  *********************************************************************
--
--  These are code and conversion tables for NCBI sequence codes
--  ASN.1 for the sequences themselves are define in seq.asn
--
--  Seq-map-table and Seq-code-table REQUIRE that codes start with 0
--    and increase continuously.  So IUPAC codes, which are upper case
--    letters will always have 65 0 cells before the codes begin.  This
--    allows all codes to do indexed lookups for things
--
--  Valid names for code tables are:
--    IUPACna
--    IUPACaa
--    IUPACeaa
--    IUPACaa3     3 letter amino acid codes : parallels IUPACeaa
--                   display only, not a data exchange type
--    NCBI2na
--    NCBI4na
--    NCBI8na
--    NCBI8aa
--    NCBIstdaa
--     probability types map to IUPAC types for display as characters NCBI-SeqCode DEFINITIONS ::= BEGIN EXPORTS Seq-code-table, Seq-map-table, Seq-code-set; Seq-code-type ::= ENUMERATED {              -- sequence representations
    iupacna (1) ,              -- IUPAC 1 letter nuc acid code
    iupacaa (2) ,              -- IUPAC 1 letter amino acid code
    ncbi2na (3) ,              -- 2 bit nucleic acid code
    ncbi4na (4) ,              -- 4 bit nucleic acid code
    ncbi8na (5) ,              -- 8 bit extended nucleic acid code
    ncbipna (6) ,              -- nucleic acid probabilities
    ncbi8aa (7) ,              -- 8 bit extended amino acid codes
    ncbieaa (8) ,              -- extended ASCII 1 letter aa codes
    ncbipaa (9) ,              -- amino acid probabilities
    iupacaa3 (10) ,            -- 3 letter code only for display
    ncbistdaa (11) }           -- consecutive codes for std aas, 0-25 Seq-map-table ::= SEQUENCE { -- for tables of sequence mappings 
    from Seq-code-type ,      -- code to map from
    to Seq-code-type ,        -- code to map to
    num INTEGER ,             -- number of rows in table
    start-at INTEGER DEFAULT 0 ,   -- index offset of first element
    table SEQUENCE OF INTEGER }  -- table of values, in from-to order Seq-code-table ::= SEQUENCE { -- for names of coded values
    code Seq-code-type ,      -- name of code
    num INTEGER ,             -- number of rows in table
    one-letter BOOLEAN ,   -- symbol is ALWAYS 1 letter?
    start-at INTEGER DEFAULT 0 ,   -- index offset of first element
    table SEQUENCE OF
        SEQUENCE {
            symbol VisibleString ,      -- the printed symbol or letter
            name VisibleString } ,      -- an explanatory name or string
    comps SEQUENCE OF INTEGER OPTIONAL } -- pointers to complement nuc acid Seq-code-set ::= SEQUENCE {    -- for distribution
    codes SET OF Seq-code-table OPTIONAL ,
    maps SET OF Seq-map-table OPTIONAL } END

C Structures and Functions: objseq.h

top
/*  objseq.h

* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE                          
*               National Center for Biotechnology Information
*                                                                          
*  This software/database is a "United States Government Work" under the   
*  terms of the United States Copyright Act.  It was written as part of    
*  the author's official duties as a United States Government employee and 
*  thus cannot be copyrighted.  This software/database is freely available 
*  to the public for use. The National Library of Medicine and the U.S.    
*  Government have not placed any restriction on its use or reproduction.  
*                                                                          
*  Although all reasonable efforts have been taken to ensure the accuracy  
*  and reliability of the software and data, the NLM and the U.S.          
*  Government do not and cannot warrant the performance or results that    
*  may be obtained by using this software or data. The NLM and the U.S.    
*  Government disclaim all warranties, express or implied, including       
*  warranties of performance, merchantability or fitness for any particular
*  purpose.                                                                
*                                                                          
*  Please cite the author in any work or product based on this material.   
*
* ===========================================================================
*
* File Name:  objseq.h
*
* Author:  James Ostell
*   
* Version Creation Date: 4/1/91
*
* $Revision: 2.0 $
*
* File Description:  Object manager interface for module NCBI-Seq
*
* Modifications:  
* --------------------------------------------------------------------------
* Date	   Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/
#ifndef _NCBI_Seq_
#define _NCBI_Seq_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifndef _NCBI_General_
#include <objgen.h>
#endif
#ifndef _NCBI_Seqloc_
#include <objloc.h>
#endif
#ifndef _NCBI_Pub_
#include <objpub.h>
#endif
#ifndef _NCBI_Seqalign_
#include <objalign.h>
#endif
#ifndef _NCBI_Pubdesc_
#include <objpubd.h>     /* separated out to avoid typedef order problems */
#endif
#ifndef _NCBI_Seqfeat_
#include <objfeat.h>       /* include organism for now */
#endif
#ifndef _NCBI_Seqres_
#include <objres.h>
#endif
#ifndef _NCBI_Access_
#include <objacces.h>
#endif
#ifndef _NCBI_SeqBlock_
#include <objblock.h>
#endif
#ifndef _NCBI_SeqCode_
#include <objcode.h>
#endif
#ifdef __cplusplus extern "C" {
#endif
/*****************************************************************************
*
*   loader
*
*****************************************************************************/ extern Boolean SeqAsnLoad PROTO((void));
/*****************************************************************************
*
*   internal structures for NCBI-Seq objects
*
*****************************************************************************/
/*****************************************************************************
*
*   SeqAnnot - Sequence annotations
*
*****************************************************************************/ typedef struct seqannot {
    ObjectIdPtr id;
    DbtagPtr db;
    CharPtr name,
        desc;
    Uint1 type;             /* 1=ftable, 2=align, 3=graph */
    Pointer data;
    struct seqannot PNTR next;
} SeqAnnot, PNTR SeqAnnotPtr; SeqAnnotPtr SeqAnnotNew PROTO((void)); Boolean SeqAnnotAsnWrite PROTO((SeqAnnotPtr sap, AsnIoPtr aip, AsnTypePtr atp)); SeqAnnotPtr SeqAnnotAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SeqAnnotPtr SeqAnnotFree PROTO((SeqAnnotPtr sap));
/*****************************************************************************
*
*   Sets of SeqAnnots
*
*****************************************************************************/ Boolean SeqAnnotSetAsnWrite PROTO((SeqAnnotPtr sap, AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element)); SeqAnnotPtr SeqAnnotSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));
/*****************************************************************************
*
*   SeqHist
*
*****************************************************************************/ typedef struct seqhist {
	SeqAlignPtr assembly;
	DatePtr replace_date;
	SeqIdPtr replace_ids;
	DatePtr replaced_by_date;
	SeqIdPtr replaced_by_ids;
	Boolean deleted;
	DatePtr deleted_date;
} SeqHist, PNTR SeqHistPtr; SeqHistPtr SeqHistNew PROTO((void)); Boolean SeqHistAsnWrite PROTO((SeqHistPtr shp, AsnIoPtr aip, AsnTypePtr atp)); SeqHistPtr SeqHistAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SeqHistPtr SeqHistFree PROTO((SeqHistPtr shp));
/*****************************************************************************
*
*   Bioseq.
*     Inst is incorporated within Bioseq for efficiency
*     seq_data_type
*       0 = not set
*       1 = IUPACna
*       2 = IUPACaa
*       3 = NCBI2na
*       4 = NCBI4na
*       5 = NCBI8na
*       6 = NCBIpna
*       7 = NCBI8aa
*       8 = NCBIeaa
*       9 = NCBIpaa
*      11 = NCBIstdaa
*   seq_ext_type
*       0 = none
*       1 = seg-ext
*       2 = ref-ext
*       3 = map-ext
*
*****************************************************************************/
#define Seq_code_iupacna 1
#define Seq_code_iupacaa 2
#define Seq_code_ncbi2na 3
#define Seq_code_ncbi4na 4
#define Seq_code_ncbi8na 5
#define Seq_code_ncbipna 6
#define Seq_code_ncbi8aa 7
#define Seq_code_ncbieaa 8
#define Seq_code_ncbipaa 9
#define Seq_code_iupacaa3 10
#define Seq_code_ncbistdaa 11
#define Seq_repr_virtual 1
#define Seq_repr_raw    2
#define Seq_repr_seg    3
#define Seq_repr_const  4
#define Seq_repr_ref    5
#define Seq_repr_consen 6
#define Seq_repr_map    7
#define Seq_repr_other  255
#define Seq_mol_dna     1
#define Seq_mol_rna     2
#define Seq_mol_aa      3
#define Seq_mol_na      4
#define Seq_mol_other   255
#define ISA_na(x)  ((x==1)||(x==2)||(x==4))
#define ISA_aa(x)  (x == 3) typedef struct bioseq {
    SeqIdPtr id;          /* Seq-ids */
    ValNodePtr descr;              /* Seq-descr */
    Uint1 repr,
        mol;
    Int4 length;            /* -1 if not set */
    IntFuzzPtr fuzz;
    Uint1 topology,
        strand,
        seq_data_type,      /* as in Seq_code_type above */
        seq_ext_type;
    ByteStorePtr seq_data;
    Pointer seq_ext;
    SeqAnnotPtr annot;
	SeqHistPtr hist;
} Bioseq, PNTR BioseqPtr; BioseqPtr BioseqNew PROTO((void)); Boolean BioseqAsnWrite PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr atp)); BioseqPtr BioseqAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); BioseqPtr BioseqFree PROTO((BioseqPtr bsp)); Boolean BioseqInstAsnWrite PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr orig)); Boolean BioseqInstAsnRead PROTO((BioseqPtr bsp, AsnIoPtr aip, AsnTypePtr orig)); BioseqPtr PNTR BioseqInMem PROTO((Int2Ptr numptr));
/*****************************************************************************
*
*   Initialize bioseq and seqcode tables and default numbering
*
*****************************************************************************/ Boolean BioseqLoad PROTO((void));
/*****************************************************************************
*
*   BioseqAsnRead Options
*
*****************************************************************************/ typedef struct op_objseq {
	SeqIdPtr sip;          /* seq id to find */
	Boolean found_it;      /* set to TRUE when BioseqAsnRead matches sip */
	Boolean load_by_id;    /* if TRUE, load only if sip matches */
} Op_objseq, PNTR Op_objseqPtr;
/* types for AsnIoOption OP_NCBIOBJSEQ */
#define BIOSEQ_CHECK_ID 1    /* match Op_objseq.sip */
/*****************************************************************************
*
*   SeqDescr uses an ValNode with choice = 
    1 = * mol-type GIBB-mol ,          -- type of molecule
    2 = ** modif SET OF GIBB-mod ,      -- modifiers
    3 = * method GIBB-method ,         -- sequencing method
    4 = name VisibleString ,         -- a name for this sequence
    5 = title VisibleString ,        -- a title for this sequence
    6 = org Org-ref ,                -- if all from one organism
    7 = comment VisibleString ,      -- a more extensive comment
    8 = num Numbering ,              -- a numbering system
    9 = maploc Dbtag ,               -- map location of this sequence
    10 = pir PIR-block ,              -- PIR specific info
    11 = genbank GB-block ,           -- GenBank specific info
    12 = pub Pubdesc                 -- a reference to the publication
    13 = region VisibleString        -- name for this region of sequence
    14 = user UserObject             -- user structured data object
    15 = sp SP-block                 -- SWISSPROT specific info
    16 = neighbors Entrez-link       -- links to sequence neighbors
	17 = embl EMBL-block             -- EMBL specific info
	18 = create-date Date            -- date entry created
	19 = update-date Date            -- date of last update
	20 = prf PRF-block 				 -- PRF specific information
	21 = pdb PDB-block              -- PDB specific information
	22 = het Heterogen              -- cofactor, etc associated but not bound
    types with * use data.intvalue.  Other use data.ptrvalue
    ** uses a chain of ValNodes which use data.intvalue for enumerated type
*
*****************************************************************************/
#define Seq_descr_mol_type 1
#define Seq_descr_modif 2
#define Seq_descr_method 3
#define Seq_descr_name 4
#define Seq_descr_title 5
#define Seq_descr_org 6
#define Seq_descr_comment 7
#define Seq_descr_num 8
#define Seq_descr_maploc 9
#define Seq_descr_pir 10
#define Seq_descr_genbank 11
#define Seq_descr_pub 12
#define Seq_descr_region 13
#define Seq_descr_user 14
#define Seq_descr_sp 15
#define Seq_descr_neighbors 16
#define Seq_descr_embl 17
#define Seq_descr_create_date 18
#define Seq_descr_update_date 19
#define Seq_descr_prf 20
#define Seq_descr_pdb 21
#define Seq_descr_het 22 Boolean SeqDescrAsnWrite PROTO((ValNodePtr anp, AsnIoPtr aip, AsnTypePtr atp)); ValNodePtr SeqDescrAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); ValNodePtr SeqDescrFree PROTO((ValNodePtr anp));
/*****************************************************************************
*
*   Pubdesc and Numbering types defined in objpubd.h
*
*****************************************************************************/
#ifdef __cplusplus
}
#endif
#endif

C Structures and Functions: objpubd.h

top
/*  objpubd.h

* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE                          
*               National Center for Biotechnology Information
*                                                                          
*  This software/database is a "United States Government Work" under the   
*  terms of the United States Copyright Act.  It was written as part of    
*  the author's official duties as a United States Government employee and 
*  thus cannot be copyrighted.  This software/database is freely available 
*  to the public for use. The National Library of Medicine and the U.S.    
*  Government have not placed any restriction on its use or reproduction.  
*                                                                          
*  Although all reasonable efforts have been taken to ensure the accuracy  
*  and reliability of the software and data, the NLM and the U.S.          
*  Government do not and cannot warrant the performance or results that    
*  may be obtained by using this software or data. The NLM and the U.S.    
*  Government disclaim all warranties, express or implied, including       
*  warranties of performance, merchantability or fitness for any particular
*  purpose.                                                                
*                                                                          
*  Please cite the author in any work or product based on this material.   
*
* ===========================================================================
*
* File Name:  objpubd.h
*
* Author:  James Ostell
*   
* Version Creation Date: 4/1/91
*
* $Revision: 2.0 $
*
* File Description:  Object manager interface for type Pubdesc from
*                    NCBI-Sequence.  This is separate to avoid typedef
*                    order problems with NCBI-Sequence and NCBI-Seqfeat
*                    which both reference Pubdesc
*   				   Numbering and Heterogen have now also been added
*   				for the same reason.  Heterogen is just a string,
*   				so no special typedefs are required.
*
* Modifications:  
* --------------------------------------------------------------------------
* Date	   Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/
#ifndef _NCBI_Pubdesc_
#define _NCBI_Pubdesc_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifdef __cplusplus extern "C" {
#endif
/*****************************************************************************
*
*   Pubdesc
*
*****************************************************************************/ typedef struct pd {
    ValNodePtr pub;          /* points to Pub-equiv */
    CharPtr name,
        fig;
    ValNodePtr num;          /* points to Numbering */
    Boolean numexc,
        poly_a;
    Uint1 align_group;       /* 0 = not part of a group */
    CharPtr maploc,
        seq_raw,
		comment;
} Pubdesc, PNTR PubdescPtr; PubdescPtr PubdescNew PROTO((void)); Boolean PubdescAsnWrite PROTO((PubdescPtr pdp, AsnIoPtr aip, AsnTypePtr atp)); PubdescPtr PubdescAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); PubdescPtr PubdescFree PROTO((PubdescPtr pdp)); typedef ValNodePtr NumberingPtr;
/*****************************************************************************
*
*   Numbering uses an ValNode with choice = 
    1 = cont Num-cont ,              -- continuous numbering
    2 = enum Num-enum ,              -- enumerated names for residues
    3 = ref Num-ref, type 1 sources  -- by reference to another sequence
    4 = ref Num-ref, type 2 aligns  (SeqAlign in data.ptrvalue)
    5 = real Num-real     -- for maps etc
*
*****************************************************************************/
#define Numbering_cont 1
#define Numbering_enum 2
#define Numbering_ref_source 3
#define Numbering_ref_align 4
#define Numbering_real 5 Boolean NumberingAsnWrite PROTO((NumberingPtr anp, AsnIoPtr aip, AsnTypePtr atp)); NumberingPtr NumberingAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); NumberingPtr NumberingFree PROTO((NumberingPtr anp));
/*****************************************************************************
*
*   NumCont - continuous numbering system
*
*****************************************************************************/ typedef struct numcont {
    Int4 refnum;
    Boolean has_zero,
        ascending;
} NumCont, PNTR NumContPtr; NumContPtr NumContNew PROTO((void)); Boolean NumContAsnWrite PROTO((NumContPtr ncp, AsnIoPtr aip, AsnTypePtr atp)); NumContPtr NumContAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); NumContPtr NumContFree PROTO((NumContPtr ncp));
/*****************************************************************************
*
*   NumEnum - enumerated numbering system
*
*****************************************************************************/ typedef struct numenum {
    Int4 num;               /* number of names */
    CharPtr buf;            /* a buffer for the names */
    CharPtr PNTR names;     /* array of pointers to names */
} NumEnum, PNTR NumEnumPtr; NumEnumPtr NumEnumNew PROTO((void)); Boolean NumEnumAsnWrite PROTO((NumEnumPtr nep, AsnIoPtr aip, AsnTypePtr atp)); NumEnumPtr NumEnumAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); NumEnumPtr NumEnumFree PROTO((NumEnumPtr nep));
/*****************************************************************************
*
*   NumReal - float type numbering system
*
*****************************************************************************/ typedef struct numreal {
    FloatHi a, b;        /* number in "units" = ax + b */
    CharPtr units;
} NumReal, PNTR NumRealPtr; NumRealPtr NumRealNew PROTO((void)); Boolean NumRealAsnWrite PROTO((NumRealPtr ncp, AsnIoPtr aip, AsnTypePtr atp)); NumRealPtr NumRealAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); NumRealPtr NumRealFree PROTO((NumRealPtr ncp));
#ifdef __cplusplus
}
#endif
#endif

C Structures and Functions: objblock.h

top
/*  objblock.h

* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE                          
*               National Center for Biotechnology Information
*                                                                          
*  This software/database is a "United States Government Work" under the   
*  terms of the United States Copyright Act.  It was written as part of    
*  the author's official duties as a United States Government employee and 
*  thus cannot be copyrighted.  This software/database is freely available 
*  to the public for use. The National Library of Medicine and the U.S.    
*  Government have not placed any restriction on its use or reproduction.  
*                                                                          
*  Although all reasonable efforts have been taken to ensure the accuracy  
*  and reliability of the software and data, the NLM and the U.S.          
*  Government do not and cannot warrant the performance or results that    
*  may be obtained by using this software or data. The NLM and the U.S.    
*  Government disclaim all warranties, express or implied, including       
*  warranties of performance, merchantability or fitness for any particular
*  purpose.                                                                
*                                                                          
*  Please cite the author in any work or product based on this material.   
*
* ===========================================================================
*
* File Name:  objblock.h
*
* Author:  James Ostell
*   
* Version Creation Date: 4/1/91
*
* $Revision: 2.0 $
*
* File Description:  Object manager for module GenBank-General,
*   					EMBL-General, PIR-General, SWISSPROT-General
*
* Modifications:  
* --------------------------------------------------------------------------
* Date	   Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/
#ifndef _NCBI_SeqBlock_
#define _NCBI_SeqBlock_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifndef _NCBI_General_
#include <objgen.h>
#endif
#ifndef _NCBI_Seqloc_
#include <objloc.h>
#endif
#ifdef __cplusplus extern "C" {
#endif
/*****************************************************************************
*
*   loader
*
*****************************************************************************/ extern Boolean SeqBlockAsnLoad PROTO((void));
/*****************************************************************************
*
*   PirBlock - PIR specific data elements
*
*****************************************************************************/ typedef struct pirblock {
    Boolean had_punct;
    CharPtr host,
        source,
        summary,
        genetic,
        includes,
        placement,
        superfamily,
        cross_reference,
        date,
        seq_raw;
    ValNodePtr keywords;
    SeqIdPtr seqref;
} PirBlock, PNTR PirBlockPtr; PirBlockPtr PirBlockNew PROTO((void)); Boolean PirBlockAsnWrite PROTO((PirBlockPtr pbp, AsnIoPtr aip, AsnTypePtr atp)); PirBlockPtr PirBlockAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); PirBlockPtr PirBlockFree PROTO((PirBlockPtr pbp));
/*****************************************************************************
*
*   GBBlock - GenBank specific data elements
*
*****************************************************************************/ typedef struct gbblock {
    ValNodePtr extra_accessions,
        keywords;
    CharPtr source,
        origin,
        date,
        div,
        taxonomy;
    DatePtr entry_date;
} GBBlock, PNTR GBBlockPtr; GBBlockPtr GBBlockNew PROTO((void)); Boolean GBBlockAsnWrite PROTO((GBBlockPtr gbp, AsnIoPtr aip, AsnTypePtr atp)); GBBlockPtr GBBlockAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); GBBlockPtr GBBlockFree PROTO((GBBlockPtr gbp));
/*****************************************************************************
*
*   SPBlock - SWISSPROT specific data elements
*
*****************************************************************************/ typedef struct spblock {
    Uint1 _class;
    ValNodePtr extra_acc;
    Boolean imeth;
    ValNodePtr plasnm;
    SeqIdPtr seqref;
    ValNodePtr dbref;
    ValNodePtr keywords;
	NCBI_DatePtr created,
		sequpd,
		annotupd;
} SPBlock, PNTR SPBlockPtr; SPBlockPtr SPBlockNew PROTO((void)); Boolean SPBlockAsnWrite PROTO((SPBlockPtr sbp, AsnIoPtr aip, AsnTypePtr atp)); SPBlockPtr SPBlockAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SPBlockPtr SPBlockFree PROTO((SPBlockPtr sbp));
/*****************************************************************************
*
*   EMBLBlock - EMBL specific data elements
*
*****************************************************************************/ typedef struct emblxref {
	Uint1 _class;		
	CharPtr name;			/* NULL if class used */
	ValNodePtr id;          /* ValNode->data.ptrvalue is an ObjectIdPtr */
	struct emblxref PNTR next;
} EMBLXref, PNTR EMBLXrefPtr; typedef struct emblblock {
    Uint1 _class,
		div;               /* 255 = not set */
	NCBI_DatePtr creation_date ,
		update_date;
    ValNodePtr extra_acc;
    ValNodePtr keywords;
	EMBLXrefPtr xref;
} EMBLBlock, PNTR EMBLBlockPtr; EMBLBlockPtr EMBLBlockNew PROTO((void)); Boolean EMBLBlockAsnWrite PROTO((EMBLBlockPtr ebp, AsnIoPtr aip, AsnTypePtr atp)); EMBLBlockPtr EMBLBlockAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); EMBLBlockPtr EMBLBlockFree PROTO((EMBLBlockPtr ebp));
/*****************************************************************************
*
*	PRF-Block	- PRF specific data emements
*				by A.Ogiwara
*****************************************************************************/ typedef struct prfextsrc {
	CharPtr		host;
	CharPtr		part;
	CharPtr		state;
	CharPtr		strain;
	CharPtr		taxon;
} PrfExtSrc, PNTR PrfExtSrcPtr; typedef struct prfblock {
	PrfExtSrcPtr	extra_src;
	ValNodePtr	keywords;
} PrfBlock, PNTR PrfBlockPtr; PrfBlockPtr	PrfBlockNew	PROTO((void)); Boolean		PrfBlockAsnWrite	PROTO((PrfBlockPtr pbp, AsnIoPtr aip,
						AsnTypePtr atp)); PrfBlockPtr	PrfBlockAsnRead	PROTO((AsnIoPtr aip, AsnTypePtr atp)); PrfBlockPtr	PrfBlockFree	PROTO((PrfBlockPtr pbp));
/*****************************************************************************
*
*	PDB-Block	- PDB specific data emements
*
*****************************************************************************/ typedef struct pdbreplace {
	DatePtr date;
	ValNodePtr ids;
} PdbRep, PNTR PdbRepPtr; typedef struct pdbblock {
	DatePtr deposition ;
	CharPtr class ;
	ValNodePtr compound ;
	ValNodePtr source ;
	CharPtr exp_method ;
	PdbRepPtr replace;
} PdbBlock, PNTR PdbBlockPtr; PdbBlockPtr	PdbBlockNew	PROTO((void)); Boolean		PdbBlockAsnWrite	PROTO((PdbBlockPtr pdbp, AsnIoPtr aip,
						AsnTypePtr atp)); PdbBlockPtr	PdbBlockAsnRead	PROTO((AsnIoPtr aip, AsnTypePtr atp)); PdbBlockPtr	PdbBlockFree	PROTO((PdbBlockPtr pdbp));
#ifdef __cplusplus
}
#endif
#endif

C Structures and Functions: objcode.h

top
/*  objcode.h

* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE                          
*               National Center for Biotechnology Information
*                                                                          
*  This software/database is a "United States Government Work" under the   
*  terms of the United States Copyright Act.  It was written as part of    
*  the author's official duties as a United States Government employee and 
*  thus cannot be copyrighted.  This software/database is freely available 
*  to the public for use. The National Library of Medicine and the U.S.    
*  Government have not placed any restriction on its use or reproduction.  
*                                                                          
*  Although all reasonable efforts have been taken to ensure the accuracy  
*  and reliability of the software and data, the NLM and the U.S.          
*  Government do not and cannot warrant the performance or results that    
*  may be obtained by using this software or data. The NLM and the U.S.    
*  Government disclaim all warranties, express or implied, including       
*  warranties of performance, merchantability or fitness for any particular
*  purpose.                                                                
*                                                                          
*  Please cite the author in any work or product based on this material.   
*
* ===========================================================================
*
* File Name:  objcode.h
*
* Author:  James Ostell
*   
* Version Creation Date: 8/10/92
*
* $Revision: 2.0 $
*
* File Description:  Object manager interface for module NCBI-SeqCode
*
* Modifications:  
* --------------------------------------------------------------------------
* Date	   Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/
#ifndef _NCBI_SeqCode_
#define _NCBI_SeqCode_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifdef __cplusplus extern "C" {
#endif
/*****************************************************************************
*
*   loader
*
*****************************************************************************/ extern Boolean SeqCodeAsnLoad PROTO((void));
/*****************************************************************************
*
*   SeqMapTable - Table from mapping sequence codes to each other
*      Codes ALWAYS start from 0 and increase continuously
*      IUPAC has 65 empty rows
*
*****************************************************************************/ typedef struct seqmaptable {
    Uint1 from,                 /* as ENUMERATED in Seq-code-type */
        to;
    Uint1 num;
    Uint1 start_at;
    Uint1Ptr table;
    struct seqmaptable PNTR next;
} SeqMapTable, PNTR SeqMapTablePtr; SeqMapTablePtr SeqMapTableNew PROTO((void)); Boolean SeqMapTableAsnWrite PROTO((SeqMapTablePtr smp, AsnIoPtr aip, AsnTypePtr atp)); SeqMapTablePtr SeqMapTableAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SeqMapTablePtr SeqMapTableFree PROTO((SeqMapTablePtr smp)); SeqMapTablePtr SeqMapTableFind PROTO((Uint1 to, Uint1 from));
/*****************************************************************************
*
*   SeqCodeTable - Table of sequence codes
*       in code order, starting with 0 and increasing continuously
*
*****************************************************************************/ typedef struct seqcodetable {
    Uint1 code;                        /* as ENUMERATED in Seq-code-type */
    Uint1 num;                          /* number of codes */
    Boolean one_letter;                  /* one letter codes? */
    Uint1 start_at;                     /* index offset of first code */
    CharPtr letters;                   /* one letter codes */
    CharPtr PNTR symbols;              /* multi-length codes */
    CharPtr PNTR names;                /* explanatory names */
    Uint1Ptr comps;                    /* maps to complements */
    struct seqcodetable PNTR next;
} SeqCodeTable, PNTR SeqCodeTablePtr; SeqCodeTablePtr SeqCodeTableNew PROTO((void)); Boolean SeqCodeTableAsnWrite PROTO((SeqCodeTablePtr scp, AsnIoPtr aip, AsnTypePtr atp)); SeqCodeTablePtr SeqCodeTableAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SeqCodeTablePtr SeqCodeTableFree PROTO((SeqCodeTablePtr scp)); SeqCodeTablePtr SeqCodeTableFind PROTO((Uint1 code));
/*****************************************************************************
*
*   SeqCodeSet - Set of sequence codes and maps
*
*****************************************************************************/ typedef struct seqcodeset {
    SeqCodeTablePtr codes;
    SeqMapTablePtr maps;
} SeqCodeSet, PNTR SeqCodeSetPtr; SeqCodeSetPtr SeqCodeSetNew PROTO((void)); Boolean SeqCodeSetAsnWrite PROTO((SeqCodeSetPtr ssp, AsnIoPtr aip, AsnTypePtr atp)); SeqCodeSetPtr SeqCodeSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp)); SeqCodeSetPtr SeqCodeSetFree PROTO((SeqCodeSetPtr ssp));
/*****************************************************************************
*
*   SeqCodeSetLoad()
*       loads the standard sequence codes from "seqcode.val" in "data"
*
*****************************************************************************/ SeqCodeSetPtr SeqCodeSetLoad PROTO((void));
#ifdef __cplusplus
}
#endif
#endif