Chapter 4. Programming interface

Table of Contents
4.1. Writing a document filter
4.2. Field data processing
4.3. API

Recoll has an Application Programming Interface, usable both for indexing and searching, currently accessible from the Python language.

Another less radical way to extend the application is to write filters for new types of documents.

The processing of metadata attributes for documents (fields) is highly configurable.

4.1. Writing a document filter

Recoll filters are executable programs which translate from a specific format (e.g. OpenOffice, Acrobat) to the Recoll indexing input format, which may be text/plain or text/html.

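As of Recoll 1.13, there are two kinds of filters: simple filters, which are run once for each file, and persistent filters, which are declared with the execm keyword and can handle files containing several documents.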

The rest of this section only describes the simple filters. If you can program and want to write one of the other kind, it should not be too difficult to make sense of one of the existing modules. For example, look at rclzip, which uses Zip file paths as internal identifiers (ipath), and rclinfo, which uses an integer index.

4.1.1. Simple filters

Recoll simple filters are usually shell scripts, but this is in no way necessary. Extracting the text from the native format is the difficult part. Outputting the format expected by Recoll is trivial. Fortunately, most document formats have translators or text extractors which can be called from the filter. In some cases the output of the translating program is completely appropriate, and no intermediate shell script is needed.

Filters are called with a single argument which is the source file name. They should output the result to stdout.
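As an illustration of this calling convention, here is a minimal sketch of a simple filter written in Python. The extract_text function is a hypothetical stand-in: it only reads the file as UTF-8 text, where a real filter would call a format-specific extractor.

```python
import sys


def extract_text(path):
    # Hypothetical stand-in for the real work: a genuine filter would call a
    # format-specific extractor here. This sketch just reads UTF-8 text.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


def main(argv):
    # The filter receives a single argument: the source file name.
    if len(argv) != 2:
        sys.stderr.write("usage: %s <file>\n" % argv[0])
        return 1
    # The result goes to stdout.
    sys.stdout.write(extract_text(argv[1]))
    return 0
```

A script built on this sketch would end with sys.exit(main(sys.argv)), so that a failure is reported through the exit status.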

When writing a filter, you should decide if it will output plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (this will be defined in a configuration file). Additionally, some formatting may be easier to preserve when previewing HTML. In practice, the deciding factor is metadata: Recoll has a way to extract metadata from the HTML header and use it for field searches.

The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the filter if the operation is for indexing or previewing. Some filters use this to output a slightly different format, for example stripping uninteresting repeated keywords (e.g. Subject: for email) when indexing. This is not essential.
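A filter written in Python could test the variable as sketched below. The function names are hypothetical, and treating an unset variable as "indexing" is an assumption of this sketch, not documented behaviour.

```python
import os


def is_preview(environ=os.environ):
    # The variable takes the values "yes" or "no". Anything else, including
    # an unset variable, is treated as indexing (an assumption of this sketch).
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def render_header(name, value, preview):
    # Hypothetical helper: keep the "Subject:" style label for previews and
    # drop it when indexing, so that the repeated keyword is not indexed.
    return "%s: %s" % (name, value) if preview else value
```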

A good starting point is to look at one of the existing simple filters, for example rclps.

Don't forget to make your filter executable before testing!

4.1.2. Telling Recoll about the filter

There are two elements that link a file to the filter which should process it: the association of file to mime type and the association of a mime type with a filter.

The association of files to mime types is mostly based on name suffixes. The types are defined inside the mimemap file. Example:


.doc = application/msword

If no suffix association is found for the file name, Recoll will try to execute the file -i command to determine a mime type.
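The two-step lookup can be pictured with the following sketch. It is only an illustration of the logic, not Recoll's implementation: the SUFFIX_TO_MIME table is a hypothetical stand-in for a couple of mimemap entries, and the parsing assumes that file -i prints a line of the form "name: type; charset=...".

```python
import os
import subprocess

# Hypothetical in-memory stand-in for a few mimemap entries.
SUFFIX_TO_MIME = {".doc": "application/msword", ".rtf": "text/rtf"}


def guess_with_file_command(path):
    # Fallback: ask the system "file -i" command and keep only the type,
    # dropping the file name prefix and the "; charset=..." part.
    out = subprocess.run(["file", "-i", path],
                         capture_output=True, text=True).stdout
    return out.rpartition(": ")[2].split(";")[0].strip() or None


def mime_type_for(path, fallback=guess_with_file_command):
    # First try the suffix table, then fall back to inspecting the file.
    suffix = os.path.splitext(path)[1]
    if suffix in SUFFIX_TO_MIME:
        return SUFFIX_TO_MIME[suffix]
    return fallback(path)
```

The fallback is passed as a parameter so that the suffix logic can be exercised without the file command being installed.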

The association of file types to filters is performed in the mimeconf file. A sample will probably be of better help than a long explanation:


[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
     mimetype = text/plain ; charset=utf-8

application/ogg = exec rclogg

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html

application/x-chm = execm rclchm

The fragment specifies that:

  • application/msword files are processed by executing the antiword program, which outputs text/plain encoded in utf-8.

  • application/ogg files are processed by the rclogg script, with default output type (text/html, with encoding specified in the header, or utf-8 by default).

  • text/rtf is processed by unrtf, which outputs text/html. The iso-8859-1 encoding is specified because it is not the utf-8 default, and not output by unrtf in the HTML header section.

  • application/x-chm is processed by a persistent filter. This is determined by the execm keyword.
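To make the structure of these definitions explicit, here is a rough sketch which splits a handler value into its keyword, command line and attributes. This is an illustrative parser, not the one used by Recoll; parse_handler and output_type are hypothetical names, and a value continued with a backslash is expected to be joined into one line first.

```python
import shlex


def parse_handler(value):
    # Split "exec cmd args; key = value; ..." into its parts.
    parts = [p.strip() for p in value.split(";")]
    words = shlex.split(parts[0])
    handler = {"mode": words[0], "command": words[1:], "attrs": {}}
    for part in parts[1:]:
        if part:
            key, _, val = part.partition("=")
            handler["attrs"][key.strip()] = val.strip()
    return handler


def output_type(handler):
    # When no mimetype attribute is given, the output is taken to be text/html.
    return handler["attrs"].get("mimetype", "text/html")
```

With the sample above, the application/msword value yields the mode exec, the antiword command with its arguments, and the attributes mimetype and charset, while "execm rclchm" yields the mode execm and no attributes.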

4.1.3. Filter HTML output

The output HTML could be very minimal like the following example:

<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>some text content</body></html>

You should take care to escape some characters inside the text by transforming them into appropriate entities. "&" should be transformed into "&amp;", "<" should be transformed into "&lt;". This is not always properly done by translating programs which output HTML, and of course never by those which output plain text.

The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.

Recoll will also make use of other header fields if they are present: title, description, keywords.

Filters can also "invent" field names. These should be output as meta tags:

<meta name="somefield" content="Some textual data" />

See the following section for details about configuring how field data is processed by the indexer.
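The points of this section (character set declaration, entity escaping, title and custom fields) can be combined as in the following sketch. The html_document function and its parameters are hypothetical; the escaping relies on the standard html.escape function, which also converts ">" and quote characters.

```python
from html import escape


def html_document(text, title="", fields=None):
    # Build the minimal HTML a filter might emit. "fields" maps custom field
    # names (hypothetical here) to values; each one becomes a <meta> tag.
    head = ['<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">']
    if title:
        head.append("<title>%s</title>" % escape(title))
    for name, value in (fields or {}).items():
        head.append('<meta name="%s" content="%s" />'
                    % (escape(name), escape(value)))
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head), escape(text))
```

Because the header declares UTF-8, the filter should write the result encoded accordingly, for example with sys.stdout.buffer.write(doc.encode("utf-8")).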

4.1.4. Page numbers

The indexer will interpret ^L (form feed) characters in the filter output as indicating page breaks, and will record them. At query time, this allows starting a viewer on the right page for a hit or a snippet. Currently, only the PDF, PostScript and DVI filters generate page breaks.
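A filter which knows the page boundaries of its input could mark them as sketched below. The helper names are hypothetical, and page_of_offset only illustrates how a position in the text relates to a page number; it does not describe how the indexer stores the breaks.

```python
FORM_FEED = "\f"  # the ^L character, ASCII 0x0C


def join_pages(pages):
    # Emit one form feed between consecutive pages of extracted text.
    return FORM_FEED.join(pages)


def page_of_offset(text, offset):
    # 1-based number of the page containing the character at "offset":
    # one more than the number of form feeds which precede it.
    return text.count(FORM_FEED, 0, offset) + 1
```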