Since rtf2xml is a command line utility, there is no graphical user interface. There are no windows, no buttons to click, and no user interaction once the script starts to run.
Instead, rtf2xml runs from a shell or terminal and writes its output either to the terminal or to a file.
A simple example illustrates this. If I want to convert the document "hello_world.rtf" to XML, I open a shell and type:
rtf2xml hello_world.rtf
The script converts the RTF file to XML and outputs it to the terminal.
The configuration file allows you to override the default options without having to use an option on the command line. See the installation instructions for details on putting the configuration file in the right location.
If you are unsure which configuration file the script uses, type:
rtf2xml --config
The script will print the path and quit.
The individual values in the configuration file are explained below, in the section for each option.
Like many other options, the --output option (which writes the XML to a named file) has a shortened version. The shortened version takes one "-" instead of "--". With the shortened version, the command looks like this:
rtf2xml -o hello_world.xml hello_world.rtf
If you are ever in doubt about the options, you can either access the man page, or type:
rtf2xml --help
If you want to output the XML file to a specific file rather than to the terminal, use the --output option:
rtf2xml --output hello_world.xml hello_world.rtf
When the script finishes, you will have a new file called hello_world.xml.
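If you call rtf2xml from another program rather than typing the command yourself, the same arguments apply. The following is a minimal Python sketch and is not part of rtf2xml: the helper names build_command and convert are hypothetical, and the only option it uses is --output as described above.

```python
import subprocess


def build_command(input_path, output_path=None):
    """Return the argument list for one rtf2xml run (hypothetical helper)."""
    command = ["rtf2xml"]
    if output_path is not None:
        # --output names the file that receives the XML.
        command += ["--output", output_path]
    command.append(input_path)
    return command


def convert(input_path, output_path=None):
    """Run rtf2xml; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(input_path, output_path), check=True)
```

Calling convert("hello_world.rtf", "hello_world.xml") would run the same command as the one typed above, provided rtf2xml is on your PATH.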
You might get tired of typing "--output" each time you invoke rtf2xml. In this case, you might want to set "smart-output" to true in the configuration file.
Open ".rtf2xml" in a text editor and scroll down until you see:
smart-output value = false
Change "false" to "true", and save the file.
Now from the command line, you can simply type:
rtf2xml hello_world.rtf
The script creates a file called "hello_world.xml" in the current directory. If the file "hello_world.xml" already exists, the script overwrites it. If the file you want to transform does not have the ".rtf" extension at the end, rtf2xml will not use smart output. Instead, it will tell you to use the --output option.
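The naming rule for smart output can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the script's own code: the function name smart_output_name is hypothetical, and treating the extension check as case-sensitive is an assumption.

```python
import os


def smart_output_name(input_path):
    """Return the XML file name implied by input_path, or None.

    None stands for the case where smart output does not apply and
    the --output option has to be used instead.
    """
    # The XML file is created in the current directory, so drop any
    # directory part of the input path.
    base = os.path.basename(input_path)
    root, ext = os.path.splitext(base)
    if ext != ".rtf":  # assumption: the comparison is case-sensitive
        return None
    return root + ".xml"
```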
Microsoft's RTF stores certain characters in strange ways. Let's say I type "title" in Microsoft Word, highlight this text, and choose to display it in all caps. On the screen, I see "TITLE". But the RTF actually stores the word as "title".
By default, rtf2xml corrects these faults. It changes any lower-case characters that should be capital to upper-case.
However, in some cases, you might want to have XML that looks exactly like the RTF characters. In this case, use the --no-caps option.
This option disables the script's ability to convert from regular characters to capitals.
Example:
rtf2xml --no-caps hello_world.rtf
rtf2xml converts decorative fonts such as Symbol and Zapf Dingbats to their correct Unicode representations when possible. It is best not to use these decorative fonts. If you have control over the input RTF file, use Unicode fonts.
The default configuration automatically tells the script to convert caps. If you want to have the script never do such conversions, open the configuration file in a text editor and scroll down until you see:
convert-caps = true
Change "true" to "false" and save the file.
Once this value is set to "false", you can still tell rtf2xml to override the setting from the command line. In this case, use the --caps option.
This option tells the script to convert caps.
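The relationship between the configuration file and the command line is the same for every on/off option in this document: the configuration file supplies the default, and a command line option overrides it for a single run. Here is a minimal Python sketch of that precedence. The function resolve_flag is hypothetical, and raising an error when both options are given is an assumption, since the behavior in that case is not described here.

```python
def resolve_flag(config_value, cli_on=False, cli_off=False):
    """Return the effective setting for one on/off option.

    config_value: the value from the configuration file (True or False).
    cli_on:  True if the enabling option (for example --caps) was given.
    cli_off: True if the disabling option (for example --no-caps) was given.
    """
    if cli_on and cli_off:
        # Assumption: contradictory options are treated as an error.
        raise ValueError("both the enabling and the disabling option were given")
    if cli_on:
        return True
    if cli_off:
        return False
    # No command line option: fall back to the configuration file.
    return config_value
```

With convert-caps set to false in the configuration file, resolve_flag(False, cli_on=True) models a run with --caps and returns True.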
If you want to find out what version of rtf2xml you are using, type:
rtf2xml --version
By default, rtf2xml creates a file with no newlines. This means that the resulting XML will be only one line long. Such output ensures that no arbitrary white space creeps into any further transformations. However, it also means that the XML will be very difficult to read.
If you want XML that is easier to read, use the --indent option. The --indent option takes a number. For now, only two possibilities exist: either "0", or any non-zero number.
Choosing "0" for the --indent option means the XML will be output on one line. Choosing any other number results in elements having newlines after the closing tag. To get readable XML, type:
rtf2xml --indent 1 hello_world.rtf
You can also change the default value by changing the value in the configuration file:
indent = 1
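To make the two output modes concrete, here is a small Python sketch that imitates them on an XML string. It is only an approximation of the description above: the function add_newlines is hypothetical, and it assumes that "newlines after the closing tag" means exactly one newline after every closing tag and no other added white space.

```python
import re


def add_newlines(xml, indent):
    """Imitate the two --indent modes on an XML string."""
    if indent == 0:
        # "0": leave everything on one line.
        return xml
    # Any non-zero value: put a newline after each closing tag.
    return re.sub(r"(</[^>]+>)", r"\1\n", xml)
```

The extra newlines are white space that a later transformation may or may not ignore, which is why one-line output is the default.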
The --level option controls the way the script handles errors. The lower the level, the more the script ignores errors. Here is a summary of each level:
0: Never quit. Never print out error messages.
1: Never quit. Print out error messages useful to the user, but not those that would only be useful to a programmer.
2: Never quit, but print out all messages.
3: Print out all messages. Make a debug directory and output debugging info to that directory.
4: Print out all messages. Quit with minor error.
5: Print out all messages. Quit with any error.
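The message-printing part of this summary can be written as a small decision function. The Python sketch below encodes only what the list states about which messages are printed; the function name should_print and the two audience labels are hypothetical, and the quitting and debug-directory behavior is left out.

```python
def should_print(level, audience):
    """Return True if a message is printed at the given --level.

    audience: "user" for messages useful to the user,
              "programmer" for messages only useful to a programmer.
    """
    if level not in range(6):
        raise ValueError("level must be between 0 and 5")
    if audience not in ("user", "programmer"):
        raise ValueError("audience must be 'user' or 'programmer'")
    if level == 0:
        return False               # level 0 prints nothing
    if level == 1:
        return audience == "user"  # level 1 hides programmer messages
    return True                    # levels 2 to 5 print all messages
```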
By default, rtf2xml gives structure to lists. This means that the script must make some guesses based on indentation, since an RTF document in itself gives no structure for lists.
If rtf2xml produces lists that you don't like, use the --no-lists option:
rtf2xml --no-lists hello_world.rtf
You can set the lists option in the configuration file as well. Scroll down to the line that looks like:
lists = true
Set this value appropriately and save the configuration file.
You can tell the script to form lists, regardless of your setting in the configuration file, by using the --lists option.
By default, rtf2xml writes an empty paragraph element (<para>) for paragraphs with no content.
If you do not want this empty element to be written, use the --no-empty-para option.
rtf2xml --no-empty-para hello_world.rtf
Be aware that if you eliminate empty paragraphs, you might also eliminate border elements that surround them. You probably would not want borders if you are converting an RTF document to DocBook, but you probably would if converting the RTF document to XHTML.
You can set the empty-para option in the configuration file as well. Scroll down to the line that looks like:
write-empty-paragraphs = true
Set this value appropriately and save the configuration file.
You can tell the script to write empty paragraphs, regardless of your setting in the configuration file, by using the --empty-para option.
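To see what leaving out empty paragraphs means for the resulting XML, here is a Python sketch that removes empty <para> elements from an already converted document. It is a post-processing illustration, not the script's own method. The function drop_empty_paras is hypothetical, and it assumes that "no content" means no text and no child elements.

```python
import xml.etree.ElementTree as ET


def drop_empty_paras(root):
    """Remove <para> elements that have no text and no children."""
    # Take a snapshot of the elements first so that removing children
    # does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "para" and len(child) == 0 and not child.text:
                parent.remove(child)
    return root


sample = ET.fromstring("<doc><para/><para>text</para></doc>")
cleaned = ET.tostring(drop_empty_paras(sample), encoding="unicode")
```

After this runs, cleaned holds the document without the empty paragraph. A paragraph that contains only white space is kept under the assumption above.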
Use the --group-styles option to provide a wrapper element for paragraphs that have the same style name.
For example, say you had an RTF file that you converted to XML that looked like this:
<paragraph-definition name="quote" num="s001" widow-control="false">
  ....
  <para/>
</paragraph-definition>
<paragraph-definition name="quote" num="s002" widow-control="true">
  ....
  <para/>
</paragraph-definition>
With the --group-styles option, this becomes:
<style-group name="quote">
  <paragraph-definition name="quote" num="s001" widow-control="false">
    ....
    <para/>
  </paragraph-definition>
  <paragraph-definition name="quote" num="s002" widow-control="true">
    ....
    <para/>
  </paragraph-definition>
</style-group>
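The grouping idea can be sketched in Python with itertools.groupby. This is an illustration only: the function group_by_style is hypothetical, paragraphs are reduced to (style name, content) pairs, and it assumes that only consecutive paragraphs with the same style name share a wrapper.

```python
from itertools import groupby


def group_by_style(paragraphs):
    """Group consecutive paragraphs that share a style name.

    paragraphs: list of (style_name, content) tuples in document order.
    Returns a list of (style_name, [content, ...]) tuples, one per group.
    """
    return [
        (name, [content for _, content in run])
        for name, run in groupby(paragraphs, key=lambda p: p[0])
    ]
```

Each returned tuple corresponds to one <style-group> element in the example above.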
You can set the group-styles option in the configuration file as well. Scroll down to the line that looks like:
group-styles = false
Set this value appropriately and save the configuration file.
You can tell the script not to group styles, regardless of your setting in the configuration file, by using the --no-group-styles option.
By default, rtf2xml wraps paragraphs that have borders in a <border-group> element. If you do not want paragraphs to be wrapped in this element, use the --no-group-borders option.
rtf2xml --no-group-borders hello_world.rtf
For example, say you had an RTF file that you converted to XML that looked like this:
<border-group border-paragraph-bottom-line-width="0.50" border-paragraph-bottom-padding="1.00" border-paragraph-bottom-style="hairline" num="s0001">
  <paragraph-definition name="Normal" style-number="s0002" font-style="Not Defined">
    <para>This paragraph has a border.</para>
  </paragraph-definition>
</border-group>
If you used the --no-group-borders option, the above fragment becomes:
<paragraph-definition name="Normal" style-number="s0002" font-style="Not Defined">
  <para>This paragraph has a border.</para>
</paragraph-definition>
You can set the group-borders option in the configuration file as well. Scroll down to the line that looks like:
group-borders = true
Set this value appropriately and save the configuration file.
You can tell the script to group borders, regardless of your setting in the configuration file, by using the --group-borders option.
You can provide further structure to the resulting XML document by using the --headings-to-sections option. This option creates section elements from the heading styles.
For example, if you used the style "heading 1" to create a heading in your RTF document, and you used the --headings-to-sections option when converting the document to XML, all text under heading 1 would be enclosed in a section like this:
<section num="1.1" num-in-level="1" level="1" type="heading 1"> ... </section>
If you used a style with the name of "heading 2" below "heading 1", rtf2xml nests a section inside the section made from "heading 1":
<section num="1.1" num-in-level="1" level="1" type="heading 1">
  ...
  <section num="1.1.1" num-in-level="1" level="2" type="heading 2">
    ...
  </section>
</section>
If you created a "heading 3" (instead of "heading 2") below "heading 1", rtf2xml nests this section as if it were "heading 2":
<section num="1.1" num-in-level="1" level="1" type="heading 1">
  ...
  <section num="1.1.1" num-in-level="1" level="2" type="heading 3">
    ...
  </section>
</section>
rtf2xml never creates ill-formed XML, regardless of how you use headings in the RTF document. It knows to nest a section when it finds a heading that ranks below the previous one (for example, "heading 2" after "heading 1"), and it knows to close open sections when it finds a heading that ranks above the previous one.
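One common way to get this kind of always well-formed nesting is to keep a stack of open sections. The Python sketch below shows that technique; it is not taken from the rtf2xml source. The function nest_headings is hypothetical, and how it treats sequences that are not shown in the examples above is an assumption.

```python
def nest_headings(heading_numbers):
    """Assign a nesting level to each heading.

    heading_numbers: the numbers from the style names in document order,
    for example [1, 3] for "heading 1" followed by "heading 3".
    Returns a list of (section_level, heading_number) tuples.
    """
    open_sections = []  # heading numbers of the sections still open
    result = []
    for number in heading_numbers:
        # Close every open section whose heading ranks the same as or
        # below the new heading.
        while open_sections and open_sections[-1] >= number:
            open_sections.pop()
        open_sections.append(number)
        # The nesting level is the depth of the stack, not the number
        # in the style name.
        result.append((len(open_sections), number))
    return result
```

Because the level comes from the stack depth, "heading 3" directly below "heading 1" is placed at level 2, as in the example above.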
To set the headings-to-sections option in the configuration file, scroll down to the line that looks like:
headings-to-sections = false
Set this value appropriately and save the configuration file.
You can tell the script not to turn headings into sections, regardless of your setting in the configuration file, by using the --no-headings-to-sections option.
rtf2xml converts RTF to a type of raw XML that represents the structure of the original document. Most likely you will want to convert this raw XML into another format. I have provided an XSL stylesheet that will transform this raw XML into simplified DocBook and TEI.2.
In order to use the scripts and stylesheets, you must first download them from the same site where you downloaded the rtf2xml script (https://sourceforge.net/projects/rtf2xml/).