This chapter discusses a number of high-level sequence manipulation functions available in the api directory of the toolkit release. These include functions for exploring data structures, comparing them, and querying their attributes. Also included are routines to output Seq-entrys in GenBank format, report format, or FASTA format, and to output MEDLINE records in MEDLARS format.
Until the narrative text of this chapter can be written, we have simply included the relevant header files and some examples of function use, together with listings of some of the programs in the demo directory of the toolkit release.
Seqtest reads an ASN.1 formatted Seq-entry file into memory. It then uses BioseqCount and BioseqExplore to count and collect the "real" sequences. The file "example.prt" contains a segmented nucleic acid with two segments of DNA and with two proteins connected by a CdRegion. Three sequences are reported for it: one nucleic acid (the segmented one; the two segments are really part of one nucleic acid sequence) and 2 proteins. Seqtest then uses utility functions to determine their lengths, number of gaps (unspecified spaces, such as between the two segments of the segmented sequence), and number of segments. Note that the gap between the two DNA segments is itself reported as a segment of length 0. It prints:
A.variabilis nifD gene 5' recombination site.
len=497 gaps=1 segs=3
  len = 204
  len = 0
  len = 293

xisA peptide A (alt.)
len=44 gaps=0 segs=1

xisA peptide B (alt.)
len=5 gaps=0 segs=1
It then opens a SeqPort on the segmented nucleic acid sequence, specifying the plus strand and the IUPAC nucleic acid alphabet. A SeqPort allows you to treat a Bioseq, even a complicated segmented sequence, as if it were a disk file in the alphabet requested. You can seek to a location (like fseek()), get a residue at the current position and increment the current position (like fgetc()), or read a buffer of residues at a time (like fread()). It returns special values for end of sequence (SEQPORT_EOF), end of an internal segment (SEQPORT_EOS), and invalid residue (INVALID_RESIDUE), and supplies a small macro (IS_residue()) to filter these. This is generated from a SeqPort on the plus strand of the segmented nucleic acid sequence:
SeqPort: plus strand with SeqPortGetResidue

ATCGATAACGCCACCATCATTTATGATGACGTTACCGCCTACGAATTTGAAGAGTTCGTA
AAAGCTAAGAAGCCTGATTTAATCGCTTCTGGTATTAAAGAGAAGTATGTCTTCCAAAAG
ATGGCTCTTCCCTTCCGTCAAATGCACTCTTGGGATTACTCCGAACCTAGCGATGGGGTG
CAAATGTCAGATCAGATAAGGTTT
Segment
Segment
ACTTTTGTTCTCATGTGTTCTCTTTGCTGCTGTGCTTGCAGACTTGAGCCGAGAAAACTG
CCGTCGGTAGATGAAAGTGGCTCCAAGTCTGCAAAGGCTTGTTGATATTTGTCTTGACCC
TGATTTTGCATCGCTGTGGTATTAGCCTATATTTAGCCTAAAAATTAATGTGTTATCAGC
AAACAATGTTCATCACTAACACTGCTCAGTGCAAACATTAAGCTGTTGAAAGCTATTAAA
CCACAAAAAGGATTACTCCGGCCCTTATCACGGTTACGACGGATTTGCTATCT
EOF
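The file-like behavior described above can be sketched without the toolkit. The following is a minimal, self-contained illustration, not the SeqPort implementation: ToyPort, ToyPortInit, ToyPortGetResidue and ToyPortCollect are hypothetical names invented here, and the sequence is held as plain C strings. Only the sentinel values 253 and 252 and the idea of the IS_residue test are taken from seqport.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sentinel values as defined in seqport.h. */
#define SEQPORT_EOF 253            /* end of sequence data */
#define SEQPORT_EOS 252            /* end of segment */
#define IS_residue(x) ((x) <= 250) /* same test as seqport.h, argument parenthesized */

/* ToyPort is a hypothetical stand-in for a SeqPort: a read cursor over an
   array of segment strings. It does not modify the strings it views. */
typedef struct {
    const char **segs; /* segment strings */
    size_t nsegs;      /* number of segments */
    size_t seg;        /* index of the current segment */
    size_t pos;        /* offset within the current segment */
} ToyPort;

void ToyPortInit(ToyPort *p, const char **segs, size_t nsegs)
{
    assert(p != NULL && segs != NULL);
    p->segs = segs;
    p->nsegs = nsegs;
    p->seg = 0;
    p->pos = 0;
}

/* Return the next residue and advance, SEQPORT_EOS when crossing from one
   segment to the next, or SEQPORT_EOF when the data is exhausted. */
unsigned char ToyPortGetResidue(ToyPort *p)
{
    if (p->seg >= p->nsegs)
        return SEQPORT_EOF;
    if (p->segs[p->seg][p->pos] == '\0') {
        p->seg++;
        p->pos = 0;
        return (p->seg < p->nsegs) ? SEQPORT_EOS : SEQPORT_EOF;
    }
    return (unsigned char) p->segs[p->seg][p->pos++];
}

/* Copy only real residues into out (capacity outlen) and count the
   segment breaks seen along the way. Returns the number of residues. */
size_t ToyPortCollect(ToyPort *p, char *out, size_t outlen, int *breaks)
{
    size_t n = 0;
    unsigned char r;

    assert(out != NULL && breaks != NULL && outlen > 0);
    *breaks = 0;
    while ((r = ToyPortGetResidue(p)) != SEQPORT_EOF) {
        if (r == SEQPORT_EOS)
            (*breaks)++;
        else if (IS_residue(r) && n + 1 < outlen)
            out[n++] = (char) r;
    }
    out[n] = '\0';
    return n;
}
```

Given the three toy segments "ATCG", "" and "GG", the collector sees two segment breaks, which mirrors the two consecutive "Segment" lines in the listing above, where the middle segment has length 0.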
Next, it opens a SeqPort on the minus strand of the same segmented sequence. Note that a SeqPort is a "view" on the sequence data. It does not change the underlying sequence in any way. Thus it is possible to have many different SeqPorts open on the same sequence at the same time without interfering with each other. On the minus strand the SeqPort runs backwards and complements the residues. In this example, buffers of residues are read with SeqPortRead instead of one residue at a time with SeqPortGetResidue. Of course, either approach could have been used.
SeqPort on minus with SeqPortRead

AGATAGCAAATCCGTCGTAACCGTGATAAGGGCCGGAGTAATCCTTTTTGTGGTTTAATA
GCTTTCAACAGCTTAATGTTTGCACTGAGCAGTGTTAGTGATGAACATTGTTTGCTGATA
ACACATTAATTTTTAGGCTAAATATAGGCTAATACCACAGCGATGCAAAATCAGGGTCAA
GACAAATATCAACAAGCCTTTGCAGACTTGGAGCCACTTTCATCTACCGACGGCAGTTTT
CTCGGCTCAAGTCTGCAAGCACAGCAGCAAAGAGAACACATGAGAACAAAAGT
Segment
Segment
AAACCTTATCTGATCTGACATTTGCACCCCATCGCTAGGTTCGGAGTAATCCCAAGAGTG
CATTTGACGGAAGGGAAGAGCCATCTTTTGGAAGACATACTTCTCTTTAATACCAGAAGC
GATTAAATCAGGCTTCTTAGCTTTTACGAACTCTTCAAATTCGTAGGCGGTAACGTCATC
ATAAATGATGGTGGCGTTATCGAT
EOF
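What "runs backwards and complements the residues" means can be shown with a small stand-alone function. This is only a sketch under simplifying assumptions: ComplementBase and ReverseComplement are hypothetical helpers, not toolkit functions, and only the letters A, C, G and T are complemented (anything else becomes N). Like a SeqPort, it leaves the source sequence untouched and writes its view elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement one nucleotide letter. Only A, C, G and T are handled in
   this sketch; every other character is mapped to N. */
char ComplementBase(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Write the reverse complement of src into dst. src is not modified.
   dst must have room for strlen(src) + 1 bytes. */
void ReverseComplement(const char *src, char *dst)
{
    size_t len, i;

    assert(src != NULL && dst != NULL);
    len = strlen(src);
    for (i = 0; i < len; i++)
        dst[i] = ComplementBase(src[len - 1 - i]);
    dst[len] = '\0';
}
```

Applied to the first ten plus-strand residues, ATCGATAACG, this yields CGTTATCGAT, which is how the minus-strand listing above ends.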
Finally, seqtest.c calls the SeqEntryToFasta function (found in tofasta.h). This function traverses the SeqEntry, finds every raw Bioseq of the requested kind (nucleic acid or protein), and prints each one in FASTA format in the alphabet specified. This is useful for generating data for tools which require only a simple view of a sequence entry. First the nucleic acids are printed, then the proteins:
Nucleic Acids in FASTA format:

>gb|M28152|ANANIFDR1 A.variabilis nifD gene 5' recombination site.
ATCGATAACGCCACCATCATTTATGATGACGTTACCGCCTACGAATTTGAAGAGTTCGTA
AAAGCTAAGAAGCCTGATTTAATCGCTTCTGGTATTAAAGAGAAGTATGTCTTCCAAAAG
ATGGCTCTTCCCTTCCGTCAAATGCACTCTTGGGATTACTCCGAACCTAGCGATGGGGTG
CAAATGTCAGATCAGATAAGGTTT
>gb|M28153|ANANIFDR2 A.variabilis nifD gene 3' recombination site, and xisA gene, 5' end.
ACTTTTGTTCTCATGTGTTCTCTTTGCTGCTGTGCTTGCAGACTTGAGCCGAGAAAACTG
CCGTCGGTAGATGAAAGTGGCTCCAAGTCTGCAAAGGCTTGTTGATATTTGTCTTGACCC
TGATTTTGCATCGCTGTGGTATTAGCCTATATTTAGCCTAAAAATTAATGTGTTATCAGC
AAACAATGTTCATCACTAACACTGCTCAGTGCAAACATTAAGCTGTTGAAAGCTATTAAA
CCACAAAAAGGATTACTCCGGCCCTTATCACGGTTACGACGGATTTGCTATCT

Proteins in FASTA format:

>gim|86660 xisA peptide A (alt.)
MQNQGQDKYQQAFADLEPLSSTDGSFLGSSLQAQQQREHMRTKV
>gim|86661 xisA peptide B (alt.)
MRTKV
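The layout of a FASTA record as shown above (a ">" definition line followed by the sequence wrapped at a fixed width, 60 in this listing) can be sketched as follows. FormatFasta is a hypothetical helper written for this illustration; it is not SeqEntryToFasta and makes no claim about how the toolkit produces its output. It formats into a caller-supplied buffer so that the result is easy to inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Format one FASTA record into out: '>' plus defline on the first line,
   then seq broken into lines of at most width residues.
   Returns the number of bytes written (not counting the terminating
   '\0'), or 0 if out is too small to hold the whole record. */
size_t FormatFasta(char *out, size_t outlen, const char *defline,
                   const char *seq, size_t width)
{
    size_t deflen, seqlen, nlines, need, n = 0, i;

    assert(out != NULL && defline != NULL && seq != NULL && width > 0);
    deflen = strlen(defline);
    seqlen = strlen(seq);
    nlines = (seqlen + width - 1) / width;        /* sequence lines */
    need = 1 + deflen + 1 + seqlen + nlines + 1;  /* includes '\0' */
    if (need > outlen)
        return 0;

    out[n++] = '>';
    memcpy(out + n, defline, deflen);
    n += deflen;
    out[n++] = '\n';
    for (i = 0; i < seqlen; i++) {
        out[n++] = seq[i];
        /* Break after every full line and after the final residue. */
        if ((i + 1) % width == 0 || i + 1 == seqlen)
            out[n++] = '\n';
    }
    out[n] = '\0';
    return n;
}
```

Calling it with the definition line "gim|86661 xisA peptide B (alt.)", the sequence "MRTKV" and a width of 60 reproduces the last record of the listing.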
SeqEntryToFasta uses the SeqIdPrint function to print out the Seq-ids for these sequences. Since this data model supports Seq-ids from many different sources and in many different formats, it displays them in a variety of ways. First comes a code for the source database followed by a vertical bar ( | ). "gb" is GenBank, "gim" is for a temporary "import id" assigned to sequences lacking a source Seq-id (such as these translated coding regions). Following the source identifier are various numbers of fields depending on the source type. For GenBank, the fields are accession number and locus name.
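A program that consumes this output may need to take such an identifier apart again. The sketch below splits a string of the form shown above on the vertical bars. ParsedId and ParseBarId are hypothetical names introduced for this example; the toolkit's own parser is SeqIdParse (declared in sequtil.h), whose behavior is not reproduced here. The field limits are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ID_FIELDS 4   /* arbitrary limit for this sketch */
#define MAX_FIELD_LEN 32  /* arbitrary limit for this sketch */

/* ParsedId is a hypothetical container, not a toolkit type. */
typedef struct {
    char db[MAX_FIELD_LEN];                    /* source code, e.g. "gb" */
    char fields[MAX_ID_FIELDS][MAX_FIELD_LEN]; /* fields after the code */
    int nfields;
} ParsedId;

/* Split "db|field|field..." on '|'. The input string is not modified.
   Returns 1 on success, 0 if the text has no fields after the source
   code, has an empty source code, or exceeds the limits above. */
int ParseBarId(const char *text, ParsedId *out)
{
    const char *start = text;
    const char *bar;
    size_t len;
    int first = 1;

    assert(text != NULL && out != NULL);
    out->db[0] = '\0';
    out->nfields = 0;
    for (;;) {
        bar = strchr(start, '|');
        len = bar ? (size_t) (bar - start) : strlen(start);
        if (len >= MAX_FIELD_LEN)
            return 0;
        if (first) {
            if (len == 0)
                return 0;
            memcpy(out->db, start, len);
            out->db[len] = '\0';
            first = 0;
        } else {
            if (out->nfields == MAX_ID_FIELDS)
                return 0;
            memcpy(out->fields[out->nfields], start, len);
            out->fields[out->nfields][len] = '\0';
            out->nfields++;
        }
        if (bar == NULL)
            break;
        start = bar + 1;
    }
    return out->nfields > 0;
}
```

For "gb|M28152|ANANIFDR1" this gives the source code "gb" with two fields, the accession number and the locus name; for "gim|86660" it gives "gim" with a single field.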
This demo simply serves to show a few of the capabilities of the current function library while we are finishing the documentation. There are functions to print a Seq-entry in GenBank format: SeqEntryToGenBank (in togenbnk.h), which is used in the demo program asn2gnbk.c. There is also a human-readable report generator (SeqEntryToFile), which is used in asn2rpt.c. ProteinFromCdRegion (in seqport.h) will translate a CdRegion feature, taking into account alternate genetic codes, unusual start codons, and code-breaks. The reader is encouraged to look over the code in the demo directory.
/*****************************************************************************
*
*   seqtest.c
*       test program for sequence display, SeqPort and ToFasta
*
*****************************************************************************/
#include <seqport.h>
#include <tofasta.h>

void BuildList (SeqEntryPtr sep, Pointer data, Int4 index, Int2 indent);

Int2 Main()
{
    AsnIoPtr aip;
    SeqEntryPtr sep;
    BioseqPtr PNTR seqlist;
    Int4 seqnum, i, numseg, lens[10], j;
    Int2 ctr;
    SeqPortPtr spp;
    Uint1 residue;
    FILE* fp;
    CharPtr title;
    Char buffer[101];

    /*
    ** Load SeqEntry object loader and sequence alphabets
    */
    if (! SeqEntryLoad()) {
        Message(MSG_ERROR, "SeqEntryLoad failed");
        return 1;
    }

    /*
    ** Use the file "example.prt" as the ASN I/O stream.  This file
    ** can be found in ncbi/demo.  It is in ASN.1 Print Value format.
    */
    if ((aip = AsnIoOpen("example.prt", "r")) == NULL)
        return 1;

    /*
    ** Write the output to "seqtest.out".
    */
    fp = FileOpen("seqtest.out", "w");
    fprintf(fp, "Sequence summary:\n\n");

    /*
    ** Read in the whole entry into the Sequence Entry Pointer, sep.
    ** Close the ASN stream, which in turn closes the input file.
    */
    sep = SeqEntryAsnRead(aip, NULL);
    aip = AsnIoClose(aip);

    /*
    ** Determine how many Bioseqs are in this SeqEntry.  Allocate
    ** enough memory to hold a list of pointers to all of these
    ** Bioseqs.  Invoke an Explore function to "visit" each Bioseq.
    ** We are allowed to pass one pointer for use by the exploring
    ** function, in this case, "BuildList".
    */
    seqnum = BioseqCount(sep);
    seqlist = MemNew((size_t)(seqnum * sizeof(BioseqPtr)));
    BioseqExplore(sep, (Pointer) seqlist, BuildList);

    /*
    ** For each Bioseq in the SeqEntry write out its title,
    ** len, number of gaps, and number of segments.  Write out
    ** the length of each segment, up to 10.
    */
    for (i = 0; i < seqnum; i++) {
        numseg = BioseqCountSegs(seqlist[i]);
        title = BioseqGetTitle(seqlist[i]);
        FilePuts((VoidPtr)title, fp);
        FilePuts("\n", fp);
        /* Cast to long so the arguments match %ld whatever Int4 is. */
        fprintf(fp, "len=%ld gaps=%ld segs=%ld\n",
                (long) BioseqGetLen(seqlist[i]),
                (long) BioseqGetGaps(seqlist[i]),
                (long) numseg);
        if ((numseg > 1) && (numseg <= 10)) {
            BioseqGetSegLens (seqlist[i], lens);
            for (j = 0; j < numseg; j++)
                fprintf(fp, "  len = %ld\n", (long) lens[j]);
        }
        FilePuts("\n", fp);
    }

    /*
    ** Open a SeqPort on the plus strand of the first Bioseq and
    ** read it one residue at a time, 60 residues per output line.
    */
    spp = SeqPortNew(seqlist[0], 0, -1, 0, Seq_code_iupacna);
    if (spp == NULL) {
        Message(MSG_ERROR, "fail on SeqPortNew");
        return 1;
    }
    fprintf(fp, "SeqPort: plus strand with SeqPortGetResidue\n\n");
    i = 0;
    while ((residue = SeqPortGetResidue(spp)) != SEQPORT_EOF) {
        if (residue == SEQPORT_EOS) {
            buffer[i] = '\0';
            fprintf(fp, "%s\nSegment\n", buffer);
            i = 0;
        } else {
            buffer[i] = residue;
            i++;
            if (i == 60) {
                buffer[i] = '\0';
                fprintf(fp, "%s\n", buffer);
                i = 0;
            }
        }
    }
    if (i) {
        buffer[i] = '\0';
        fprintf(fp, "%s\n", buffer);
    }
    fprintf(fp, "EOF\n");
    SeqPortFree(spp);

    /*
    ** Open a SeqPort on the minus strand and read it a buffer at a
    ** time.  SeqPortRead returns the number of residues read, or the
    ** negated SEQPORT_EOS / SEQPORT_EOF value.
    */
    fprintf(fp, "\nSeqPort on minus with SeqPortRead\n\n");
    spp = SeqPortNew(seqlist[0], 0, -1, Seq_strand_minus, Seq_code_iupacna);
    if (spp == NULL) {
        Message(MSG_ERROR, "fail on SeqPortNew");
        return 1;
    }
    do {
        ctr = SeqPortRead(spp, (Uint1Ptr)buffer, 60);
        if (ctr > 0) {
            buffer[ctr] = '\0';
            fprintf(fp, "%s\n", buffer);
        } else {
            ctr *= -1;
            if (ctr == SEQPORT_EOS)
                fprintf(fp, "Segment\n");
            else if (ctr == SEQPORT_EOF)
                fprintf(fp, "EOF\n");
        }
    } while (ctr != SEQPORT_EOF);
    SeqPortFree(spp);

    /*
    ** Write out the nucleic acid sequences in this SeqEntry
    */
    fprintf(fp, "\nNucleic Acids in FASTA format:\n\n");
    SeqEntryToFasta(sep, fp, TRUE);

    /*
    ** Write out the protein sequences in this SeqEntry.
    */
    fprintf(fp, "\nProteins in FASTA format:\n\n");
    SeqEntryToFasta(sep, fp, FALSE);

    /*
    ** Close the output file and free up allocated space.
    */
    FileClose(fp);
    MemFree(seqlist);
    SeqEntryFree(sep);

    return 0;
}

/*
** This SeqEntry exploration function copies the current pointer position
** in the Bioseq entry to a list of Bioseq pointers
*/
void BuildList(SeqEntryPtr sep, Pointer data, Int4 index, Int2 indent)
{
    ((BioseqPtr PNTR) data)[index] = (BioseqPtr)sep->data.ptrvalue;
    return;
}
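The listing above uses a count-then-fill pattern: one traversal with no callback to learn how many sequences there are, then a second traversal whose callback stores each one at the index it is handed. The following stand-alone sketch shows that pattern on a toy tree. ToyEntry, ToyLeafList and ToyBuildList are hypothetical names that only imitate the shape of SeqEntryPtr, BioseqList and BuildList; the toolkit's real traversal also passes an indent level and, as noted in sequtil.h, does not descend into the parts of segmented sequences, neither of which is modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* ToyEntry is a hypothetical stand-in for a Seq-entry: a leaf carries one
   sequence name; a set carries a linked list of child entries. */
typedef struct ToyEntry {
    const char *name;          /* non-NULL for a leaf sequence */
    struct ToyEntry *children; /* first child, for a set */
    struct ToyEntry *next;     /* next sibling */
} ToyEntry;

typedef void (*ToyEntryFunc)(ToyEntry *entry, void *mydata, long index);

/* Visit every leaf depth first, calling callback (if not NULL) with a
   running index. Returns the next unused index, so a call that starts at
   index 0 returns the number of leaves. */
long ToyLeafList(ToyEntry *entry, void *mydata, ToyEntryFunc callback,
                 long index)
{
    ToyEntry *child;

    if (entry == NULL)
        return index;
    if (entry->name != NULL) {
        if (callback != NULL)
            callback(entry, mydata, index);
        return index + 1;
    }
    for (child = entry->children; child != NULL; child = child->next)
        index = ToyLeafList(child, mydata, callback, index);
    return index;
}

/* Callback in the style of BuildList: mydata is an array of ToyEntry
   pointers large enough for every leaf. */
void ToyBuildList(ToyEntry *entry, void *mydata, long index)
{
    ((ToyEntry **) mydata)[index] = entry;
}
```

A caller first invokes ToyLeafList with a NULL callback to size the array, then invokes it again with ToyBuildList, exactly as seqtest.c does with BioseqCount and BioseqExplore.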
The descriptors and features attached to one particular SeqEntry can be obtained as follows:
#include <objall.h>
#include <sequtil.h>

BioseqPtr     bsp;
BioseqSetPtr  bsetp;
ValNodePtr    descr;
SeqAnnotPtr   annot;
SeqFeatPtr    feat;

descr = NULL;
annot = NULL;
feat = NULL;
bsp = NULL;
bsetp = NULL;

/* A SeqEntry holds either a Bioseq or a Bioseq-set; both carry
   descriptor and annotation chains. */
if (IS_Bioseq (sep)) {
  bsp = (BioseqPtr) sep->data.ptrvalue;
  if (bsp != NULL) {
    descr = bsp->descr;
    annot = bsp->annot;
  }
} else if (IS_Bioseq_set (sep)) {
  bsetp = (BioseqSetPtr) sep->data.ptrvalue;
  if (bsetp != NULL) {
    descr = bsetp->descr;
    annot = bsetp->annot;
  }
}

while (descr != NULL) {
  /* Do something with the descr. */
  descr = descr->next;
}

while (annot != NULL) {
  if (annot->type == 1) {   /* 1 = feature table */
    feat = (SeqFeatPtr) annot->data;
    while (feat != NULL) {
      /* Do something with the feat. */
      feat = feat->next;
    }
  }
  annot = annot->next;
}
The features applicable to a given Bioseq, which may be attached at any point in the hierarchy, can be obtained using the BioseqContext functions as follows:
#include <objall.h>
#include <sequtil.h>

BioseqPtr         bsp;
BioseqSetPtr      bsetp;
ValNodePtr        descr;
SeqAnnotPtr       annot;
SeqFeatPtr        feat;
BioseqContextPtr  bcp;

if (IS_Bioseq (sep)) {
  /* First pass: every descriptor that applies to this Bioseq. */
  bcp = BioseqContextNew ((BioseqPtr) sep->data.ptrvalue);
  if (bcp != NULL) {
    descr = BioseqContextGetSeqDescr (bcp, 0, NULL, NULL);
    while (descr != NULL) {
      /* Do something with the descr. */
      descr = BioseqContextGetSeqDescr (bcp, 0, descr, NULL);
    }
    BioseqContextFree (bcp);
  }

  /* Second pass: every feature that applies to this Bioseq. */
  bcp = BioseqContextNew ((BioseqPtr) sep->data.ptrvalue);
  if (bcp != NULL) {
    feat = BioseqContextGetSeqFeat (bcp, 0, NULL, NULL, 0);
    while (feat != NULL) {
      /* Do something with the feat. */
      feat = BioseqContextGetSeqFeat (bcp, 0, feat, NULL, 0);
    }
    BioseqContextFree (bcp);
  }
}
An alternative method of obtaining features explores the object loaded in memory by means of the function which writes it to an ASN.1 stream. This method allows you to call one or more functions of your own design, with a data structure of your own, on any ASN.1 defined object by just giving the string defining its ASN.1 path. The callback function in this example prints the sequences referenced by coding region features (choice 3):
#include <seqport.h>
#include <asn.h>

static AsnExpOptPtr aeop;
static AsnIoPtr aip;
static Int2 charsPerLine = 50;
static FILE *fp;

static void GetSeqFeat (AsnExpOptStructPtr aeosp)
{
  Char buffer [101];
  Int2 i;
  Uint1 residue;
  SeqFeatPtr sfp;
  SeqPortPtr spp;

  /* Act only when the start of a Seq-feat structure is reached. */
  if (aeosp->dvp->intvalue != START_STRUCT) {
    return;
  }
  sfp = (SeqFeatPtr) aeosp->the_struct;
  if (sfp->data.choice == 3) {   /* 3 = coding region */
    spp = SeqPortNewByLoc (sfp->location, Seq_code_iupacna);
    if (spp == NULL) {
      return;
    }
    i = 0;
    while ((residue = SeqPortGetResidue(spp)) != SEQPORT_EOF) {
      if (residue == SEQPORT_EOS) {
        buffer [i] = '\0';
        fprintf (fp, "%s\n>Segment\n", buffer);
        i = 0;
      } else {
        buffer [i] = residue;
        i++;
        if (i >= charsPerLine) {
          buffer [i] = '\0';
          fprintf (fp, "%s\n", buffer);
          i = 0;
        }
      }
    }
    if (i != 0) {
      buffer [i] = '\0';
      fprintf (fp, "%s\n", buffer);
    }
    SeqPortFree(spp);
  }
}
The callback is attached to an AsnIoPtr stream with AsnExpOptNew. In this case we use AsnIoNullOpen to attach the AsnIo stream to a null device (will exhaustively traverse the object in memory, but not actually write out ASN.1). When the object loader function SeqEntryAsnWrite is called, passing the specified ASN.1 entity (Seq-feat in this example) triggers the callback:
fp = FileOpen ("test.out", "w");
aip = AsnIoNullOpen ();
aeop = AsnExpOptNew (aip, "Seq-feat", NULL, GetSeqFeat);
if (aeop != NULL) {
  SeqEntryAsnWrite (sep, aip, NULL);
}
AsnIoClose (aip);
FileClose (fp);
/*  sequtil.h
* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE
*            National Center for Biotechnology Information
*
*  This software/database is a "United States Government Work" under the
*  terms of the United States Copyright Act.  It was written as part of
*  the author's official duties as a United States Government employee and
*  thus cannot be copyrighted.  This software/database is freely available
*  to the public for use. The National Library of Medicine and the U.S.
*  Government have not placed any restriction on its use or reproduction.
*
*  Although all reasonable efforts have been taken to ensure the accuracy
*  and reliability of the software and data, the NLM and the U.S.
*  Government do not and cannot warrant the performance or results that
*  may be obtained by using this software or data. The NLM and the U.S.
*  Government disclaim all warranties, express or implied, including
*  warranties of performance, merchantability or fitness for any particular
*  purpose.
*
*  Please cite the author in any work or product based on this material.
*
* ===========================================================================
*
* File Name:  sequtil.h
*
* Author:  James Ostell
*
* Version Creation Date: 4/1/91
*
* $Revision: 2.11 $
*
* File Description:  Sequence Utilities for objseq and objsset
*
* Modifications:
* --------------------------------------------------------------------------
* Date     Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/

#ifndef _NCBI_SeqUtil_
#define _NCBI_SeqUtil_

#ifndef _NCBI_Seqset_
#include <objsset.h>    /* the object loader interface */
#endif

#ifdef __cplusplus
extern "C" {
#endif

/*****************************************************************************
*
*   What am I?
*
*****************************************************************************/
extern Uint1 Bioseq_repr PROTO((BioseqPtr bsp));
extern Uint1 BioseqGetCode PROTO((BioseqPtr bsp));

ValNodePtr BioseqGetSeqDescr PROTO((BioseqPtr bsp, Int2 type, ValNodePtr curr));
CharPtr BioseqGetTitle PROTO((BioseqPtr bsp));
NumberingPtr BioseqGetNumbering PROTO((BioseqPtr bsp));

extern Int4 BioseqGetLen PROTO((BioseqPtr bsp));
extern Int4 BioseqGetGaps PROTO((BioseqPtr bsp));
extern Int4 BioseqGetSegLens PROTO((BioseqPtr bsp, Int4Ptr lens));
#define BioseqCountSegs(x) BioseqGetSegLens(x, NULL)

extern Boolean BioseqRawConvert PROTO((BioseqPtr bsp, Uint1 newcode));
extern Boolean BioseqRawPack PROTO((BioseqPtr bsp));
extern ByteStorePtr BSConvertSeq PROTO((ByteStorePtr bsp, Uint1 newcode,
                                        Uint1 oldcode, Int4 seqlen));

BioseqPtr BioseqFind PROTO((SeqIdPtr sip));
Boolean BioseqMatch PROTO((BioseqPtr bsp, SeqIdPtr sip));

CharPtr StringForSeqMethod PROTO((Int2 method));

/*****************************************************************************
*
*   Context routines for Bioseqs in Seq-entrys
*      Context is the chain of Seqentries leading to the bioseq.
*      context[count-1] is SeqEntry for bsp itself
*      If Bioseq not in a Seqentry, count is 0
*
*****************************************************************************/
#define BIOSEQCONTEXTMAX 20

typedef struct bioseqcontxt {
    BioseqPtr bsp;          /* the Bioseq in question */
    Int2 count;             /* number of elements in context */
    Boolean hit;            /* used by BioseqContextNew and ..GetSeqFeat */
    SeqEntryPtr context[BIOSEQCONTEXTMAX];  /* array of SeqEntryPtr (last is count -1) */
    SeqFeatPtr sfp;         /* current sfp */
    SeqAnnotPtr sap;        /* current sap */
    Int2 sftype,            /* SeqFeat type to look for */
         in;                /* 0=location, 1=product, 2=either */
} BioseqContext, PNTR BioseqContextPtr;

BioseqContextPtr BioseqContextNew PROTO((BioseqPtr bsp));
BioseqContextPtr BioseqContextFree PROTO((BioseqContextPtr bcp));
ValNodePtr BioseqContextGetSeqDescr PROTO((BioseqContextPtr bcp, Int2 type,
                                           ValNodePtr curr, SeqEntryPtr PNTR the_sep));
CharPtr BioseqContextGetTitle PROTO((BioseqContextPtr bcp));
SeqFeatPtr BioseqContextGetSeqFeat PROTO((BioseqContextPtr bcp, Int2 type,
                                          SeqFeatPtr curr, SeqAnnotPtr PNTR sapp, Int2 in));

/*****************************************************************************
*
*   SeqCodeTable routines
*   SeqMapTable routines
*     both may return INVALID_RESIDUE when a residue is out of range
*
*****************************************************************************/
#define INVALID_RESIDUE 255

Uint1 SeqMapTableConvert PROTO((SeqMapTablePtr smtp, Uint1 residue));
Uint1 SeqCodeTableComp PROTO((SeqCodeTablePtr sctp, Uint1 residue));

/*****************************************************************************
*
*   Numbering routines
*
*****************************************************************************/
/* convert any numbering value to seq offset */
extern Int4 NumberingOffset PROTO((NumberingPtr np, DataValPtr avp));

/* convert seq offset to numbering value */
extern Int2 NumberingValue PROTO((NumberingPtr np, Int4 offset, DataValPtr avp));
extern Int2 NumberingValueBySeqId PROTO((SeqIdPtr sip, Int4 offset, DataValPtr avp));

extern void NumberingDefaultLoad PROTO((void));
extern NumberingPtr NumberingDefaultGet PROTO((void));

/*****************************************************************************
*
*   SeqEntry and BioseqSet stuff
*
*****************************************************************************/
Uint1 Bioseq_set_class PROTO((SeqEntryPtr sep));

/*****************************************************************************
*
*   traversal routines
*       SeqEntry - any type
*
*****************************************************************************/
typedef void (* SeqEntryFunc) PROTO((SeqEntryPtr sep, Pointer mydata,
                                     Int4 index, Int2 indent));

extern Int4 SeqEntryList PROTO((SeqEntryPtr sep, Pointer mydata,
                                SeqEntryFunc mycallback, Int4 index, Int2 indent));

#define SeqEntryCount( a ) SeqEntryList( a ,NULL,NULL,0,0);
#define SeqEntryExplore(a,b,c) SeqEntryList(a, b, c, 0L, 0);

/*****************************************************************************
*
*   traversal routines
*       Bioseq types only - "individual" sequences
*       do NOT traverse component parts of seqmented or constructed types
*
*****************************************************************************/
extern Int4 BioseqList PROTO((SeqEntryPtr sep, Pointer mydata,
                              SeqEntryFunc mycallback, Int4 index, Int2 indent));

#define BioseqCount( a ) BioseqList( a ,NULL,NULL,0,0);
#define BioseqExplore(a,b,c) BioseqList(a, b, c, 0L, 0);

/*****************************************************************************
*
*   Get parts routines
*
*****************************************************************************/
/* gets next Seqdescr after curr in sep of type type */
ValNodePtr SeqEntryGetSeqDescr PROTO((SeqEntryPtr sep, Int2 type, ValNodePtr curr));

/* gets first title from sep */
CharPtr SeqEntryGetTitle PROTO((SeqEntryPtr sep));

/*****************************************************************************
*
*   Manipulations
*
*****************************************************************************/
extern Boolean SeqEntryConvert PROTO((SeqEntryPtr sep, Uint1 newcode));
#define SeqEntryPack(x) SeqEntryConvert(x, (Uint1)0)

/*****************************************************************************
*
*   SeqLoc stuff
*
*****************************************************************************/
#define PRINTID_FASTA_SHORT ( (Uint1)1)
#define PRINTID_FASTA_LONG ( (Uint1)2)
#define PRINTID_TEXTID_LOCUS ( (Uint1)3)
#define PRINTID_TEXTID_ACCESSION ( (Uint1)4)
#define PRINTID_REPORT ((Uint1)5)

SeqIdPtr SeqIdSelect PROTO((SeqIdPtr sip, Uint1Ptr order, Int2 num));
Int2 SeqIdBestRank PROTO((Uint1Ptr buf, Int2 num));
SeqIdPtr SeqIdFindBest PROTO(( SeqIdPtr sip, Uint1 target));
CharPtr SeqIdPrint PROTO((SeqIdPtr sip, CharPtr buf, Uint1 format));
SeqIdPtr SeqIdParse PROTO((CharPtr buf));

/*****************************************************************************
*
*   Boolean SeqIdMatch(a,b)
*       returns TRUE if seqids match
*
*****************************************************************************/
Boolean SeqIdMatch PROTO((SeqIdPtr a, SeqIdPtr b));

/*************************
    SeqIdForSameBioseq(a,b)
    trys to locate all ids for a or b and determine
    if (a and b refer the the same Bioseq)
**************************/
Boolean SeqIdForSameBioseq PROTO((SeqIdPtr a, SeqIdPtr b));

/*************************
 *  Boolean SeqIdIn (a,b)
 *    returns TRUE if a in list of b
 ******************/
Boolean SeqIdIn PROTO((SeqIdPtr a, SeqIdPtr b));

SeqLocPtr SeqLocFindNext PROTO((SeqLocPtr seqlochead, SeqLocPtr currseqloc));
Boolean IS_one_loc PROTO((SeqLocPtr anp));    /* for SeqLoc */
Int4 SeqLocStart PROTO((SeqLocPtr seqloc));
Int4 SeqLocStop PROTO((SeqLocPtr seqloc));
Uint1 SeqLocStrand PROTO((SeqLocPtr seqloc));
Int4 SeqLocLen PROTO((SeqLocPtr seqloc));
Int4 SeqLocGetSegLens PROTO((SeqLocPtr slp, Int4Ptr lens, Int4 ctr, Boolean gaps));
#define SeqLocCountSegs(x) SeqLocGetSegLens(x, NULL,0,FALSE)
#define SeqLocGetGaps(x) SeqLocGetSegLens(x,NULL,0,TRUE)
SeqIdPtr SeqLocId PROTO((SeqLocPtr seqloc));
Uint1 StrandCmp PROTO((Uint1 strand));
Boolean SeqLocRevCmp PROTO((SeqLocPtr anp));
Int4 GetOffsetInLoc PROTO((SeqLocPtr of, SeqLocPtr in, Boolean start));
Int4 GetOffsetInBioseq PROTO((SeqLocPtr of, BioseqPtr in, Boolean start));
Int2 SeqLocOrder PROTO((SeqLocPtr a, SeqLocPtr b, BioseqPtr in));
Int2 SeqLocMol PROTO((SeqLocPtr seqloc));
CharPtr SeqLocPrint PROTO((SeqLocPtr slp));

/*****************************************************************************
*
*   SeqLocCompare(a, b)
*       returns
*          0 = no overlap
*          1 = a is completely contained in b
*          2 = b is completely contained in a
*          3 = a == b
*          4 = a and b overlap, but neither completely contained in the other
*
*****************************************************************************/
Int2 SeqLocCompare PROTO((SeqLocPtr a, SeqLocPtr b));

#define SLC_NO_MATCH 0
#define SLC_A_IN_B 1
#define SLC_B_IN_A 2
#define SLC_A_EQ_B 3
#define SLC_A_OVERLAP_B 4

Boolean SeqIntCheck PROTO((SeqIntPtr sip));   /* checks for valid interval */
Boolean SeqPntCheck PROTO((Int4 point, SeqIdPtr seq_id));   /* checks valid pnt */

CharPtr TaxNameFromCommon PROTO((CharPtr common));

#ifdef __cplusplus
}
#endif

#endif
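The five return codes documented for SeqLocCompare in sequtil.h can be illustrated on the simplest possible case, two single closed intervals on the same sequence. This is only a sketch of what the codes mean: ToyInterval and ToyIntervalCompare are hypothetical, and the real function works on SeqLocs, which may have several parts, strands and different Seq-ids, none of which is handled here. The SLC_ values are copied from the header.

```c
#include <assert.h>

/* Return codes as defined in sequtil.h. */
#define SLC_NO_MATCH 0
#define SLC_A_IN_B 1
#define SLC_B_IN_A 2
#define SLC_A_EQ_B 3
#define SLC_A_OVERLAP_B 4

/* ToyInterval is a hypothetical closed interval [from, to] on a single
   sequence. It is not a toolkit type. */
typedef struct {
    long from;
    long to;
} ToyInterval;

/* Classify the relationship of a to b using the SLC_ codes. */
int ToyIntervalCompare(ToyInterval a, ToyInterval b)
{
    assert(a.from <= a.to && b.from <= b.to);
    if (a.to < b.from || b.to < a.from)
        return SLC_NO_MATCH;        /* disjoint */
    if (a.from == b.from && a.to == b.to)
        return SLC_A_EQ_B;          /* identical */
    if (a.from >= b.from && a.to <= b.to)
        return SLC_A_IN_B;          /* a lies inside b */
    if (b.from >= a.from && b.to <= a.to)
        return SLC_B_IN_A;          /* b lies inside a */
    return SLC_A_OVERLAP_B;         /* partial overlap */
}
```

The equality test is made before the containment tests because two identical intervals also satisfy both containment conditions.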
/*  seqport.h
* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE
*            National Center for Biotechnology Information
*
*  This software/database is a "United States Government Work" under the
*  terms of the United States Copyright Act.  It was written as part of
*  the author's official duties as a United States Government employee and
*  thus cannot be copyrighted.  This software/database is freely available
*  to the public for use. The National Library of Medicine and the U.S.
*  Government have not placed any restriction on its use or reproduction.
*
*  Although all reasonable efforts have been taken to ensure the accuracy
*  and reliability of the software and data, the NLM and the U.S.
*  Government do not and cannot warrant the performance or results that
*  may be obtained by using this software or data. The NLM and the U.S.
*  Government disclaim all warranties, express or implied, including
*  warranties of performance, merchantability or fitness for any particular
*  purpose.
*
*  Please cite the author in any work or product based on this material.
*
* ===========================================================================
*
* File Name:  seqport.h
*
* Author:  James Ostell
*
* Version Creation Date: 7/13/91
*
* $Revision: 2.2 $
*
* File Description:  Ports onto Bioseqs
*
* Modifications:
* --------------------------------------------------------------------------
* Date     Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/

#ifndef _NCBI_Seqport_
#define _NCBI_Seqport_

#include <sequtil.h>

#ifdef __cplusplus
extern "C" {
#endif

/*****************************************************************************
*
*   SeqPort
*     will attach only to "individual" sequence
*     Get/Read return 253 for end of sequence,
*                     252 for end of segment
*                     251 when skipping a virtual sequence with length
*                     and INVALID_RESIDUE (255) on invalid residues
*
*****************************************************************************/
#define SEQPORT_EOF 253     /* end of sequence data */
#define SEQPORT_EOS 252     /* end of segment */
#define SEQPORT_VIRT 251    /* skipping virtual sequence with length */
#define IS_residue(x) (x <= 250)

typedef struct seqport {
    BioseqPtr bsp;            /* 1 seqentry per port */
    Int4 start, stop,         /* region of bsp covered */
         curpos,              /* current position 0-(totlen-1) */
         totlen,              /* total length of covered region */
         bytepos;             /* current byte position in bsp->data */
    NumberingPtr currnum;     /* current numbering info */
    Uint1 strand,             /* as in seqloc */
          lastmsg;            /* used by SeqPortRead() */
    Boolean is_circle ,       /* go around the end of a circle? */
            is_seg ,          /* return EOS at the end of segments? */
            do_virtual,       /* deliver '-' over virtual seqs */
            eos;              /* set when comp strand tries to back off */
    SeqMapTablePtr smtp;      /* for mapping to requested alphabet */
    SeqCodeTablePtr sctp;     /* for getting symbols */
    Uint1 newcode,            /* requested output code */
          oldcode;            /* current input seq code (0 if not raw) */
    Uint1 byte,               /* current byte in buf */
          bc,                 /* value to start bitctr */
          bitctr,             /* current shift */
          lshift,             /* amount to left shift on decompact */
          rshift,             /* amount to right shift residue value */
          mask;               /* mask for compact byte */
    struct seqport PNTR curr ,  /* current active seqport if seg or ref */
                   PNTR segs,   /* segments if seg or ref */
                   PNTR next;   /* if part of a segment chain */
} SeqPort, PNTR SeqPortPtr;

SeqPortPtr SeqPortNew PROTO((BioseqPtr bsp, Int4 start, Int4 stop,
                             Uint1 strand, Uint1 code));
SeqPortPtr SeqPortNewByLoc PROTO((SeqLocPtr seqloc, Uint1 code));
SeqPortPtr SeqPortFree PROTO((SeqPortPtr spp));
Int4 SeqPortTell PROTO((SeqPortPtr spp));
Int2 SeqPortSeek PROTO((SeqPortPtr spp, Int4 offset, Int2 origin));
Int4 SeqPortLen PROTO((SeqPortPtr spp));
Uint1 SeqPortGetResidue PROTO((SeqPortPtr spp));
Int2 SeqPortRead PROTO((SeqPortPtr spp, BytePtr buf, Int2 len));

/*****************************************************************************
*
*   ProteinFromCdRegion(sfp, include_stop)
*       produces a ByteStorePtr containing the protein sequence in
*   ncbieaa code for the CdRegion sfp.  If include_stop, will translate
*   through stop codons.  If NOT include_stop, will stop at first stop
*   codon and return the protein sequence NOT including the terminating
*   stop.  Supports reading frame, alternate genetic codes, and code breaks
*   in the CdRegion
*
*****************************************************************************/
ByteStorePtr ProteinFromCdRegion PROTO(( SeqFeatPtr sfp, Boolean include_stop));

#ifdef __cplusplus
}
#endif

#endif
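The effect of the include_stop argument described in the ProteinFromCdRegion comment can be seen in a much reduced, stand-alone translation routine. This is a sketch under strong simplifying assumptions and is not the toolkit's algorithm: ToyTranslate and BaseIndex are hypothetical names, only the standard genetic code is used, translation always starts in frame at the first base, and reading frames, alternate genetic codes, unusual start codons and code breaks are all ignored. The result is a plain C string rather than a ByteStore, and the DNA strings in the usage note are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code, one letter per codon, with codons ordered
   TTT, TTC, TTA, TTG, TCT, ... GGG (bases in T, C, A, G order).
   '*' marks a stop codon. */
static const char kStandardCode[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Position of a base in T, C, A, G order, or -1 if unrecognized. */
int BaseIndex(char base)
{
    switch (base) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translate dna codon by codon into protein (capacity outlen).
   If include_stop is nonzero, stop codons are written as '*' and
   translation continues. Otherwise translation ends at the first stop
   codon, which is not written. Codons containing an unrecognized base
   are written as 'X'. Returns the number of residues produced. */
size_t ToyTranslate(const char *dna, char *protein, size_t outlen,
                    int include_stop)
{
    size_t len, i, n = 0;

    assert(dna != NULL && protein != NULL && outlen > 0);
    len = strlen(dna);
    for (i = 0; i + 3 <= len && n + 1 < outlen; i += 3) {
        int b1 = BaseIndex(dna[i]);
        int b2 = BaseIndex(dna[i + 1]);
        int b3 = BaseIndex(dna[i + 2]);
        char aa = (b1 < 0 || b2 < 0 || b3 < 0)
                      ? 'X'
                      : kStandardCode[16 * b1 + 4 * b2 + b3];
        if (aa == '*' && !include_stop)
            break;
        protein[n++] = aa;
    }
    protein[n] = '\0';
    return n;
}
```

With the made-up input ATGTAAGTA, the routine yields "M*V" when include_stop is set and "M" when it is not.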