Intro_resolution


Resolving entity ID's

One of the tasks of the XML parser is to open entities. Entities can be external files, but also strings, channels, or anything else that can be read as a stream of bytes. Entities are identified by ID's. PXP knows four kinds of ID's: SYSTEM identifiers, PUBLIC identifiers, private identifiers, and anonymous identifiers.

Resolution means the following. The starting point is a SYSTEM or PUBLIC identifier found in the parsed XML text, or a private or anonymous identifier passed down by a user program. The second step is to make the identifier absolute. This step is only meaningful for SYSTEM identifiers, because they can be given as relative URL's, which are then made absolute. Finally, a lookup algorithm is run that returns the entity to open as a stream of bytes. The lookup algorithm is highly configurable in PXP, and this chapter of the PXP manual explains how to configure it.

Various types that are involved

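The simple form of an (external) entity ID is Pxp_types.ext_id. It enumerates the four cases: System (a SYSTEM URL), Public (a PUBLIC name together with a SYSTEM URL), Private (a private ID), and Anonymous.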

Tip: To create a URL from a filename, use

let file_url = Pxp_reader.make_file_url filename
let file_url_string = Neturl.string_of_url file_url

During resolution, a different representation of the ID is preferred - Pxp_types.resolver_id:

type resolver_id = 
      { rid_private: private_id option;
        rid_public:  string option;
        rid_system:  string option;
        rid_system_base: string option;
      }

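A value of resolver_id can be thought of as a matching criterion: each of the fields rid_private, rid_public, and rid_system that is set (i.e. not None) is compared with the corresponding kind of entity ID.

It is sufficient that one of the criteria matches for associating the resolver_id with a particular entity. Note that Anonymous is missing in this list: it simply matches any resolver_id.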
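
The matching rule can be sketched with a small, self-contained model. The types below only mirror the shape of Pxp_types.ext_id and Pxp_types.resolver_id; private ID's are modelled as plain integers, the function matches is a hypothetical helper and not part of PXP, and the SYSTEM part of a Public ID is ignored here for simplicity.

```ocaml
(* Simplified stand-ins for Pxp_types.ext_id and Pxp_types.resolver_id.
   Private IDs are plain integers here: an assumption of this sketch. *)
type private_id = int

type ext_id =
  | System of string
  | Public of string * string
  | Anonymous
  | Private of private_id

type resolver_id =
  { rid_private : private_id option;
    rid_public : string option;
    rid_system : string option;
    rid_system_base : string option;
  }

(* Hypothetical helper: does the entity ID [id] satisfy the criterion [rid]?
   Anonymous matches everything; the other cases compare one field each. *)
let matches (id : ext_id) (rid : resolver_id) : bool =
  match id with
  | Anonymous -> true
  | Private p -> rid.rid_private = Some p
  | System s -> rid.rid_system = Some s
  | Public (p, _) -> rid.rid_public = Some p

(* Convenience constructor for a criterion that only names a SYSTEM URL. *)
let rid_of_system (url : string) : resolver_id =
  { rid_private = None;
    rid_public = None;
    rid_system = Some url;
    rid_system_base = None;
  }
```

With this model, matches (System "http://foo.org/file.xml") (rid_of_system "http://foo.org/file.xml") holds, and so does matches Anonymous applied to any resolver_id.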

The resolver_id value can be modified during the resolution process, for example by rewriting. For example, one could rewrite all URL's beginning with http://sample.org to local file URL's when the contents of that web site are available locally.

There is no guarantee that rid_system is already an absolute URL when the resolution process starts. It is usually rewritten into an absolute URL during this process. For that reason, rid_system_base is also recorded: it is the base URL relative to which the URL in rid_system is to be interpreted.
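
How a base URL is used to interpret a relative one can be illustrated with a deliberately simplified sketch. The helpers has_scheme and absolutize are hypothetical and not part of PXP. The sketch only replaces the last path segment of the base, and handles neither "." and ".." segments nor references that start with "/". Full URL resolution needs more than this.

```ocaml
(* Hypothetical helper: true if [url] starts with a scheme such as "http:". *)
let has_scheme (url : string) : bool =
  match String.index_opt url ':' with
  | None -> false
  | Some i -> i > 0 && not (String.contains (String.sub url 0 i) '/')

(* Interpret [system] relative to [base] by replacing the last path
   segment of [base]. An absolute [system] is returned unchanged, and
   without a usable base the input is returned as it is. *)
let absolutize ~(base : string option) (system : string) : string =
  if has_scheme system then system
  else
    match base with
    | None -> system
    | Some b ->
        (match String.rindex_opt b '/' with
         | None -> system
         | Some i -> String.sub b 0 (i + 1) ^ system)
```

In terms of resolver_id, base plays the role of rid_system_base and system the role of rid_system.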

The resolution algorithm is expressed as Pxp_reader.resolver. This is an object providing a method open_rid (open by resolver ID) that takes a resolver_id as input, and returns the opened entity. There are a number of predefined classes in Pxp_reader for setting up resolver objects. Some classes can even be used to construct more complex resolvers from simpler ones, i.e. there is resolver composition.

Besides Pxp_reader.resolver, there are also sources, type Pxp_types.source. Sources are concrete applications of resolvers to external ID's, i.e. they represent the task of opening an entity with a certain algorithm, applied to a certain ID. There are several ways of constructing sources. First, one can directly use the source values Entity, ExtID or XExtID. Second, there are a number of functions for creating common cases of sources, e.g. Pxp_types.from_file.

For example, to open the ext_id value e with a resolver r, the source has to be

 let source = ExtID(e,r) 

There is also XExtID which allows one to set the base URL in the resolver_id, and for very advanced cases there is Entity (which is beyond an introduction).

How to use the following list of classes

For each resolver class, we give a short summary of the function it provides. Some classes provide quite low-level functionality, especially those named resolve_to_*. A beginner should avoid them.

Every resolver matches the ID to open against a criterion describing the ID's it is able to open. If this matching is successful, we also say the resolver accepts the ID. Once an ID has been accepted, the resolver is committed to it: a later failure is reported as an error (e.g. a non-existing file leads to a "file not found" error) instead of being passed on. Not accepting an ID means that, in a composed resolver, another part of the resolver gets the chance to open it.
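
The difference between not accepting an ID and failing after acceptance can be illustrated with a toy model. This is only a sketch: resolvers are modelled as plain functions from a SYSTEM identifier to entity text, and the type resolver, the exception File_not_found, and the functions file_resolver, catalog_resolver and combine are all hypothetical stand-ins, not the classes of Pxp_reader.

```ocaml
(* A toy resolver maps a SYSTEM identifier to entity text. Returning None
   means "ID not accepted"; raising means "accepted, but opening failed". *)
type resolver = string -> string option

exception File_not_found of string

(* Accepts only "file://" URLs and looks them up in a toy in-memory table. *)
let file_resolver (files : (string * string) list) : resolver =
  fun url ->
    let prefix = "file://" in
    let n = String.length prefix in
    if String.length url < n || String.sub url 0 n <> prefix then
      None                                    (* not accepted *)
    else begin
      let path = String.sub url n (String.length url - n) in
      match List.assoc_opt path files with
      | Some text -> Some text
      | None -> raise (File_not_found path)   (* accepted: no fallback *)
    end

(* Accepts exactly the IDs listed in the catalog. *)
let catalog_resolver (catalog : (string * string) list) : resolver =
  fun url -> List.assoc_opt url catalog

(* Try the resolvers in order; the first one that accepts the ID wins. *)
let rec combine (rs : resolver list) : resolver =
  fun url ->
    match rs with
    | [] -> None
    | r :: rest ->
        (match r url with
         | Some text -> Some text
         | None -> combine rest url)
```

In this model, a "file://" URL naming a missing file raises File_not_found even when a later resolver in the list could have accepted the same URL, because the first resolver has already accepted it.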

For each class, we mention in particular whether relative URL's receive special handling (i.e. are converted to absolute URL's). If they do not, but you would like to support relative URL's, you can always wrap the resolver in norm_system_id. This is generally recommended.

Some resolvers can only be used once because the entity is "consumed" after it has been opened and the XML text is read. Think of reading from a pipe.

Also note that you can combine all resolvers with the from_* functions in Pxp_types, e.g.

let source = Pxp_types.from_file 
               ~alt:r
               filename

The resolver given in alt is tried when the resolver built into from_file does not match the input ID. Here, from_file only matches file URL's, so everything else is passed down to alt, e.g. PUBLIC names.

List of base resolver classes

These classes open certain entities. Some also allow you to pass the resolution process over to a subresolver, but the resolver_id is not modified.

resolve_to_this_obj_channel

Example.

This example matches against the id argument, and reads from the object channel ch when the resolver matches:

let ch = new Netchannels.input_string "<foo></foo>"
let r = new Pxp_reader.resolve_to_this_obj_channel
              ~id:(Public("-//FOO//", ""))
              ()
              ch

This is a one-time resolver because the data of ch is consumed afterwards.

resolve_to_any_obj_channel

resolve_to_url_obj_channel

resolve_as_file

Example.

let r = new Pxp_reader.resolve_as_file ()

If the file "/data/foo.xml" exists and the user wants to open SYSTEM "file://localhost/data/foo.xml", this resolver will do it.

lookup_id

lookup_id_as_file

Example.

let r = new Pxp_reader.lookup_id_as_file
          [ System "http://foo.org/file.xml", "/data/download/foo.org/file.xml";
            Private p, "/data/private/secret.xml"
          ]

If the user opens SYSTEM "http://foo.org/file.xml", the file /data/download/foo.org/file.xml is opened. Note that relative URL's are not handled. To enable that, wrap r into a norm_system_id resolver.

If the user opens the private ID p, the file /data/private/secret.xml is opened.

lookup_id_as_string

Example. We want to parse a private ID whose corresponding entity is given as a constant string:

let p = alloc_private_id()
let r = new Pxp_reader.lookup_id_as_string
          [ Private p, "<foo>data</foo>" ]
let source = ExtID(Private p, r)

lookup_public_id

lookup_public_id_as_file

lookup_public_id_as_string

lookup_system_id

lookup_system_id_as_file

lookup_system_id_as_string

Example. See below at norm_system_id

List of rewriting resolver classes

These classes pass the resolution process over to a subresolver, and the resolver_id to open is rewritten before the subresolver is invoked. Note that the rewritten ID is only visible in the subresolver, e.g. in

let r = new Pxp_reader.combine
          [ new Pxp_reader.norm_system_id sub_r1;
            sub_r2
          ]

the class norm_system_id rewrites the ID, and this is only visible in sub_r1, but not in sub_r2.
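
The scoping of a rewritten ID can be illustrated with a toy model in which resolvers are plain functions. The type resolver and the functions rewriting, only and combine are hypothetical stand-ins for this sketch and are not the Pxp_reader classes.

```ocaml
(* A toy resolver: None means "ID not accepted". *)
type resolver = string -> string option

(* Toy rewriting resolver: rewrites the ID, then delegates to [sub].
   The rewritten ID exists only inside this call. *)
let rewriting (rewrite : string -> string) (sub : resolver) : resolver =
  fun id -> sub (rewrite id)

(* Toy base resolver: accepts exactly one ID. *)
let only (accepted : string) (text : string) : resolver =
  fun id -> if id = accepted then Some text else None

(* Try the resolvers in order, each with the ORIGINAL ID. *)
let rec combine (rs : resolver list) : resolver =
  fun id ->
    match rs with
    | [] -> None
    | r :: rest ->
        (match r id with
         | Some _ as hit -> hit
         | None -> combine rest id)

(* sub_r1 sits behind a rewriting step, sub_r2 does not. *)
let r =
  combine
    [ rewriting (fun id -> "http://foo.org/" ^ id)
        (only "http://foo.org/a.xml" "<a/>");
      only "http://foo.org/b.xml" "<b/>" ]
```

Here r "a.xml" is accepted through the rewriting step, while r "b.xml" is not accepted at all, because the second resolver still sees the unrewritten ID "b.xml".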

norm_system_id

Example.

let r = new Pxp_reader.norm_system_id
          (new Pxp_reader.lookup_system_id_as_string
             [ "http://foo.org/file1.xml", "<foo>&file2;</foo>";
               "http://foo.org/file2.xml", "<bar>data</bar>";
             ]
          )

We also assume here that the general entity file2 is declared as SYSTEM "file2.xml", i.e. with a relative URL. (The declaration should be added to the file1 XML text to make the example complete.) The resolver norm_system_id adds the support for relative URL's that is otherwise missing in lookup_system_id_as_string. The XML parser would read the text "<foo><bar>data</bar></foo>".

Without norm_system_id, the user can open the ID's only when they are given exactly as in the catalog list, e.g. as SYSTEM "http://foo.org/file1.xml".

rewrite_system_id

Example. All files of foo.org are locally available, and so foo.org URL's can be rewritten to file URL's:

let r =
  new Pxp_reader.rewrite_system_id
    [ "http://foo.org/", "file:///usr/share/foo.org/"
    ]
    (new Pxp_reader.resolve_as_file())
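
The effect of such a prefix rule on a single URL can be sketched as follows. The function rewrite_prefix is a hypothetical helper that models the idea with plain strings. It is not the implementation of rewrite_system_id, and letting the first matching rule win is an assumption of this sketch.

```ocaml
(* Hypothetical helper: replace the prefix of [url] according to the first
   rule (from_, to_) whose from_ is a prefix of [url]. URLs that match no
   rule are returned unchanged. *)
let rewrite_prefix (rules : (string * string) list) (url : string) : string =
  let starts_with p =
    String.length url >= String.length p
    && String.sub url 0 (String.length p) = p
  in
  match List.find_opt (fun (from_, _) -> starts_with from_) rules with
  | None -> url
  | Some (from_, to_) ->
      let n = String.length from_ in
      to_ ^ String.sub url n (String.length url - n)

(* The rule from the example above. *)
let rules = [ ("http://foo.org/", "file:///usr/share/foo.org/") ]
```

Under these assumptions, rewrite_prefix rules "http://foo.org/dir/file.xml" yields "file:///usr/share/foo.org/dir/file.xml", which a file-based resolver can then open.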

Alternation of resolvers

combine