RtfParser

This is a collection of classes for parsing RTF files in REALbasic applications.

The Rich Text Format (RTF) is the major interchange format between word processing programs on all major platforms. It is called rich text because it can hold not only text but also formatting information. It also provides an encoding for special characters and stores them in a 7-bit text file, so that the file can be sent safely over the Internet.

The parser reads a file, maintains the formatting information, and writes to the class interface RtfTarget, which any user-defined class can implement, for example an EditField or a Canvas subclass.

For the moment, an EditField class and an RtfWriter class are written as RtfTargets, but I plan to implement soon a styled string class (which should be faster than using the EditField directly) and a picture class which preserves complex formatting and can be displayed in a Canvas.

Also, for the moment, the RtfParser understands only a subset of the formatting commands. However, it should quietly skip the commands it does not understand, without crashing.

Download sample project

RtfParser works with REALbasic 4.0 or later. The collection of classes is self-contained and works on all compiled platforms (PowerPC, Carbon, Win32).

If you want to use it with REALbasic 2.1-3.5 and want to compile to 68k, you need to comment out the code which supports the selalignment property in the RtfEditField class (search for "comment this out if you are using RB 3").

Introduction

The project contains a folder with the following classes, module and class interface:

  • RtfParser is the main class your application interacts with when parsing.
  • RtfState is the class which holds the current formatting state.
  • RtfStylesheet, RtfFont and RtfColor are supporting classes to RtfParser (if REALbasic allowed it, I'd have declared them private).
  • RtFModule is a supporting module with global methods (again, if REALbasic allowed a hierarchical structure of privacy, it would have been private, too).
  • RtfTarget is a class interface; its implementations actually render the RTF.
  • RtfEditField is the first implementation of an RtfTarget.
  • RtfWriter is the second implementation of an RtfTarget. It exports an RTF file.

As a user, you will have to deal only with the RtfParser and the RtfEditField, and possibly the RtfWriter if you want to export styled text as RTF files.

As a developer, you may want to write other RtfTargets, in which case you will have to understand the structure of the RtfState class.

This is an alpha project and open source. It has not been widely tested, and there are a lot of very exotic RTF files around. If you experience bugs, you are kindly invited to report them to me at matti@belle-nuit.com, attaching the RTF file if possible. I am especially interested in files which

  • crash on parsing
  • have problems with character encoding
  • style differently than they should

This is very alpha also in the sense that the parsing is not optimized at all. I just wanted to see if it is possible. You will notice that parsing into the EditField is rather slow. This is mainly because it renders directly to a RectControl. I am aware of this, and my next step will be to create an abstract RtfTarget with a text string and a style string, which you can pass to the EditField with the method SetTextAndStyle. I also plan to write a picture class as an RtfTarget. WasteFields, HTML etc. are also possible. Feel free to suggest other RtfTargets or to implement them directly. I will add them to the project.

Also, the subset of interpreted commands is very small, mainly because a styled EditField has far fewer styles than an RTF file. Currently implemented are:

  • Fonttable
  • Colortable
  • basic styles: bold, italic, underline, expanded, condensed, caps, small caps
  • Stylesheets
  • Main text area
  • character sets, Unicode (buggy)

Currently ignored are:

  • paragraph formatting (alignment, line spacing)
  • absolute placing on the page
  • footnotes, headers, indexes, tables of content
  • embedded pictures
  • embedded binary data
  • info section

These parts will be implemented step by step, first by adding the properties to RtfState and then by implementing the properties in the RtfTargets.

How to use the classes as a user

  1. Import the classes in your project.
  2. Add an RtfEditField to your window.
  3. Create an instance of RtfParser.
  4. Let the user select a FolderItem and open it as a BinaryStream with OpenAsBinaryFile.
  5. Call the Parse method of the RtfParser, passing the BinaryStream and the RtfEditField as parameters.

The user can stop the parsing of a long file at almost any moment with Command-period. In this case, only the part of the RTF file parsed so far will be in the EditField.

You should create a new instance of RtfParser for every new text you import (or initialize by calling the RtfParser constructor method).

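Sample code:

dim f as FolderItem
dim b as BinaryStream
dim parser as RtfParser

// let the user choose the file ("text" is the file type filter)
f = GetOpenFolderItem("text")
if f = nil then
  return // the user cancelled
end if

// open the file read-only
b = f.OpenAsBinaryFile(false)
if b = nil then
  return // the file could not be opened
end if

outputfield.text = "" // outputfield is an RtfEditField in the window
parser = new RtfParser // a fresh parser for every import
parser.Parse b, outputfield
b.close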

How to develop other RtfTargets

You can render the RTF to any class which implements the class interface RtfTarget. The RtfParser already does the dirty work for you; you just need to provide the following methods.

SetState (st as RtfState)

Whenever the RtfParser encounters a command, it calls the SetState method. You can then examine the RtfState object to determine the current formatting. The current properties are:

  • backgroundcolor as color
  • bold as boolean
  • capitals as boolean
  • encoding as integer
  • expanded as double
  • font as string
  • foregroundcolor as color
  • italic as boolean
  • outline as boolean
  • shadow as boolean
  • smallcapitals as boolean
  • textsize as double
  • underline as boolean

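Textsize is a double because RTF allows half-point text sizes, and expanded is a double because RTF allows expansion in quarter points. In the RTF file itself both values are integers, counted in half points and quarter points respectively. Negative values for expanded stand for condensed text.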

You have no direct access to the stylesheets. In fact, the parser directly applies the stylesheets to do the formatting. Access to the stylesheet may be added later.

More properties will be added later.

You actually get a clone of the RtfParser's internal state, and the RtfParser will not change it afterwards, so you can safely keep a reference to it.

Write(t as string)

Writes text. The text is always in the Mac encoding, so you can use it as is.

At the moment, t is always one character long, but this may change when the parser code is optimised.

PreProcess() as boolean

PostProcess() as boolean

PreProcess and PostProcess are called before the parser starts parsing and after it has finished. They allow you to initialise your class before parsing and to clean up afterwards.
If a method does not return TRUE, the parser stops parsing, so if you don't have any init code, put at least "return true".

Export(target as RtfTarget)

Put here the code to re-export the content of your class to another RtfTarget, most of the time the RtfWriter class. You do not have to know any RTF code; all you have to do is write the following loop:

if not target.PreProcess then
  return
end if

// go through your text
// if the style has changed, copy your style properties
// into currentstyle (an RtfState) and pass it on
target.SetState currentstyle

// in either case, write the text
target.Write mytext

// finally
if not target.PostProcess then
  return
end if

You may have a look at the RtfEditField as a model for your own RtfTarget.

You may also notice that the RtfParser and the RtfEditField have a method called SetProgressBar. As parsing an RTF file always takes some time, we need to inform the user about the progress of the import. This method lets the application pass a window's ProgressBar, which is then updated during the import.

Technical comments about RTF parsing

The specification of the RTF format is at http://msdn.microsoft.com/library/specs/rtfspec.htm . Apparently it is available only as a collection of HTML files, so you may want to create an archive of it with Explorer (save with a link depth of 1).

You should not, however, expect all RTF writers to export exactly as specified. For example, the specification assumes that there is a header followed by the actual text. The RtfParser does not make this assumption, but, of course, it cannot apply formats to text which have not been defined before the occurrence of the text.

The RTF file is a text file using only ASCII (7 bit) characters.

It is organized in groups which are enclosed in opening and closing braces { }. Groups can be nested within other groups. Every new group opens a new state, which inherits the properties of the outer state; when the group closes, the outer state applies again. This is implemented in the parser as a stack of states.
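
The stack of states can be illustrated with a small standalone sketch. It is written in Python rather than REALbasic so that it runs outside the IDE, and it is not the parser's actual code: the State class, its two fields and the function styled_runs are hypothetical names, and the input is assumed to be already split into tokens.

```python
import copy
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical, much reduced stand-in for RtfState."""
    bold: bool = False
    italic: bool = False


def styled_runs(tokens):
    """Return (text, bold, italic) for every text token in a token list."""
    stack = [State()]  # the document-level state
    runs = []
    for token in tokens:
        if token == "{":
            # a new group starts with a copy of the enclosing state
            stack.append(copy.copy(stack[-1]))
        elif token == "}":
            # leaving the group restores the enclosing state
            if len(stack) > 1:
                stack.pop()
        elif token == "\\b":
            stack[-1].bold = True
        elif token == "\\i":
            stack[-1].italic = True
        else:
            # anything else is treated as text in this sketch
            runs.append((token, stack[-1].bold, stack[-1].italic))
    return runs
```

For the token list ["{", "plain ", "{", "\\b", "bold", "}", " plain again", "}"] only the run "bold" comes back with the bold flag set: the flag set inside the inner group disappears when that group closes.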

The file then consists of text and commands. Commands start with a backslash and contain a command word and often an integer parameter (in ASCII, not binary). There are also some special character commands consisting of a backslash and a single nonalphabetic character (mostly escapes for special characters). Some commands add a label at their end which is terminated by a semicolon ; and should not be interpreted as text.
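
As an illustration of this structure, here is a minimal tokenizer sketch in Python (chosen so that it runs standalone; it is not how the REALbasic classes are implemented). The names TOKEN and tokenize and the tuple layout are assumptions made for the example. The sketch does not handle labels terminated by a semicolon, embedded binary data, or the rule that raw line breaks in the file are not text.

```python
import re

# One alternative per kind of token; the order of the alternatives matters.
TOKEN = re.compile(
    r"""
      (?P<open>\{)                                  # start of a group
    | (?P<close>\})                                 # end of a group
    | \\'(?P<hex>[0-9a-fA-F]{2})                    # special character as two hex digits
    | \\(?P<word>[a-zA-Z]+)(?P<param>-?\d+)?[ ]?    # command word, optional integer parameter
    | \\(?P<symbol>[^a-zA-Z])                       # one-character command
    | (?P<text>[^\\{}]+)                            # plain text
    """,
    re.VERBOSE,
)


def tokenize(rtf):
    """Split an RTF string into (kind, value, parameter) tuples."""
    tokens = []
    for m in TOKEN.finditer(rtf):
        if m.group("open"):
            tokens.append(("open", "{", None))
        elif m.group("close"):
            tokens.append(("close", "}", None))
        elif m.group("hex"):
            tokens.append(("hexchar", m.group("hex"), None))
        elif m.group("word"):
            param = m.group("param")
            value = int(param) if param is not None else None
            tokens.append(("command", m.group("word"), value))
        elif m.group("symbol"):
            tokens.append(("symbol", m.group("symbol"), None))
        else:
            tokens.append(("text", m.group("text"), None))
    return tokens
```

A single space after a command word is consumed as its delimiter, which is why the text token after a command does not start with that space.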

The commands change the current state or add definitions for the document.

You may note the following commands

  • \rtf defines the RTF version, which is still 1.
  • \ansi and \mac define the character set, but they are complemented by \ansicpg and by font-level character sets.
  • \colortbl defines the RGB values of a color table (implemented as the RtfColor class), so that colors can be referred to by number (starting at one).
  • \fonttbl defines a font table, so that fonts can be referred to by \f and a number. The numbers of the font table are not consecutive; they can be any 16-bit integer. This is why I had to implement a GetFont method in the RtfFont class.
  • \stylesheet defines stylesheets, which are series of formatting commands. The RtfStyleSheet just stores the commands and applies them when they are used. The stylesheets are cascading, i.e. they are always based on another stylesheet (the most basic being Normal, which is based on id 222, the specification's value for no style). Also, they are incomplete, so they do not affect all aspects of an RtfState.
  • \* marks a group with nonstandard commands, so that our reader can ignore the group.
  • \' introduces a special character (upper half of the code page) as a hexadecimal code, depending on the encoding.
  • \bin introduces embedded binary data, which we also ignore.
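
The special character escape from the list above can be sketched in a few lines of Python (used here only because it runs standalone; the RtfParser itself does this in REALbasic and in its own way). The function name decode_hex_escapes and its codec parameter are hypothetical, and the mapping of the ansi character set to the Python codec "cp1252" and of the mac character set to "mac_roman" is an assumption of this sketch.

```python
import re

# a backslash, an apostrophe and exactly two hexadecimal digits
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_escapes(rtf_text, codec="cp1252"):
    """Replace every hex escape by the character it stands for in the given codec."""
    def replace(match):
        byte_value = int(match.group(1), 16)
        # a single byte is interpreted according to the document's encoding
        return bytes([byte_value]).decode(codec)
    return HEX_ESCAPE.sub(replace, rtf_text)
```

The same character has different codes depending on the encoding: an e with acute accent is written as e9 in a Windows code page 1252 file and as 8e in a Mac file, which is why the encoding has to be tracked in the state.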

A word about Carbon

As parsing an RB EditField is very slow in Carbon on OS 9, the RTF export is very slow, too. Unfortunately, we really have to parse, because the selalignment property is not accessible as an array. Report it with REALbug.

A word about Windows

Version 1.1 also produces valid code for Windows builds. In fact, this is only an academic exercise, because EditFields in Windows read and write RTF natively. Actually, if the EditField is styled, you can't even turn it off. Both text=s and seltext=s will parse RTF if s is valid RTF code. You will see that in the example project, where the left-hand field is already styled. The same applies to saving the fields with SaveStyledEditfield.

With this feature, REAL Software actually supports the Windows platform better than the Macintosh.

Terms of use / Disclaimer

© Belle Nuit Montage / Matthias Bürcher October 2002. All rights reserved. Written in Switzerland.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Comments please to matti@belle-nuit.com

The latest version is available at http://www.belle-nuit.com/

History

4.2.02 1.1 Added support for OS X: the nonstandard newline tag "\"+chr(10) is interpreted. Added support for the new selalignment property in RB4 EditFields. Added support for Windows EditFields.

9.11.00 1.0a2 Added methods PreProcess, PostProcess and Export to the RtfTarget. Added method SetProgressBar to the RtfParser and the RtfEditField. Added RtfWriter class. Fixed a bug (nil object exception) when the parser reads a non-RTF text file into the EditField. Fixed a bug in color import and export.

14.10.00 1.0a1 Released open source