Bloat War - Part 2


The weapon for defeating bloatware is to share a programming task across several different types of language.  Compiled, interpreted and graphic languages, when combined, provide a well-tested and stable programming environment - each type of language complements the others.

The example below shows that, simply by using different language families in combination, it is possible to produce compact and readable code that uses few system resources - code that is both flexible and powerful.

The Case of Punctuated Batch Processing - An Example

The Problem:
The need is to convert thousands of indifferently typed text files into XML (eXtensible Markup Language).  The files are collections of English literature (ranging from poems to three-volume novels) and include a bewildering number of unique structures.

Unfortunately the files have little dependable implied markup (sequences and patterns).  All the files have needless returns, yet some of these line-endings are important and must be preserved (e.g. plays and poems).  Even fairly straightforward data such as the title, publication details and authorship are not reliably placed in the files.

The Aim:
It may be possible to create a fully automatic system, but the most practical scheme is to build a program that will assist the operator to mark up and classify files quickly and accurately.  Ideally the program should combine batch processing techniques with manual assistance tools - i.e. a form of punctuated batch processing.

Efficient use of the application requires a flexible GUI front-end for both tools and batch processes.  It is a task that will frequently be interrupted over a considerable period of time.  Stray and alien texts will, from time to time, be added.  The operator needs to be able to use the application whenever the opportunity arises, so an informative and intuitive interface is a necessity.

Task automation is thus foreseeable even if the nature of this automation is not.  Once a few hundred files have been processed, hidden patterns may yet be recognized, in which case the program needs to be easily modified on-the-fly!

Ideally both the processing of text files and writing the application should go hand in hand, especially as the operator and author are one and the same person. Worse, the whole project must be completed within limited spare time.

The Combined Language Approach:
Our example uses a combined language strategy.  We will touch on the main operating features of the program and some of the general methods used to put these into effect.  The aim of this example is to convey the outlines rather than the details of design.

However, the reader should bear in mind that writing the same program in pure REXX would result in a great many enormous and awkward batch-processing scripts - it is simply not a practical option under the circumstances outlined.

Considering the GUI requirements and the operating system demands, even a compiled C version would be large and complex.  A compiled version would be easier to use, but the programming would be very time consuming, and very little actual file processing could be done until a substantial amount of code had been written and compiled.

In both cases the program would be bloatware.

In this scenario, the compiled language is used in the form of a free function library, RexxIO.dll (available from http://www.lestec.com.au).  The REXX being used is bog-standard Regina.  The graphic language is Modular And Integrated Design (MAID, available from the address above).

RexxIO.dll needs further explanation.  It is 215 kilobytes in size and contains nearly a hundred operating system commands, file manipulation functions and general REXX functions.  Written in C, the library is extremely generalised and naturally fast - many functions output to both REXX stem variables and to files.

The Example's Solution:
What is needed first is a GUI to pick the particular batch of files and place them into a list so that each can, in turn, be examined.  In MAID this is accomplished by dragging and dropping a list-box and placing a short script in the initialization event of the dialog (a dir_to_stem function places all the file names into a stem.name.n variable, and these are added to the listbox by a simple DO loop).
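As a rough illustration only, the "directory into a list" step can be sketched as follows.  This is Python, not the REXX and RexxIO.dll calls the application actually uses; the function name dir_to_list, the ".txt" suffix and the sorting are all assumptions made for the sketch, and the loop that feeds the listbox is left to the GUI toolkit.

```python
import os

def dir_to_list(folder, suffix=".txt"):
    """Return the sorted names of the regular files in `folder` ending in `suffix`.

    Hypothetical stand-in for the dir_to_stem step described in the text.
    """
    names = []
    for entry in os.scandir(folder):
        # Skip sub-directories and anything that is not a text file.
        if entry.is_file() and entry.name.lower().endswith(suffix):
            names.append(entry.name)
    return sorted(names)
```

Each name returned would then be added to the listbox in a simple loop, much as the DO loop does in the REXX version.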

Because the user needs to open the top part of the file in order to write an XML header, an MLE (Multi-Line-Entryfield) is added to the dialog (the MLE keeps track of all selections, number of lines and characters etc., via a series of stem variables based on its name - this becomes important later on).

Once a file is selected from the listbox a short script is needed to load the top of the text file into the MLE.  Because MAID takes care of messaging, only a few lines of script need to be added to the listbox-selection-event.
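The "load the top of the file" step is small in any language.  A minimal sketch, again in Python rather than REXX, might look like this; the function name load_top, the default of fifty lines and the latin-1 encoding are assumptions for illustration, not details of the actual application.

```python
from itertools import islice

def load_top(path, max_lines=50, encoding="latin-1"):
    """Return at most the first `max_lines` lines of the file at `path`.

    Line endings are kept exactly as they are in the file (newline=""),
    since some of them matter for plays and poems.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return list(islice(handle, max_lines))
```

Only the requested lines are read, so a three-volume novel opens as quickly as a sonnet.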

Having now got the first fifty lines or so of the text into the MLE, the user needs to get publication details and the descriptions which will be used in the XML markup.  The simplest method to achieve this uses a REXX function that lifts whatever is selected in the MLE.

Thus in the relevant Entryfields, such as AUTHOR and TITLE, the string between the cursor positions of the MLE is placed in the Entryfield when it is clicked.  Another Entryfield grabs the position of the cursor itself to indicate the point up to which text should be deleted from the file - a single function call!

Naturally, the OK button contains the script that deletes the top of the file and inserts the variables that will become the new XML header.
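What the OK button does can be sketched in a few lines.  The version below is a Python illustration under stated assumptions: the element names (header, author, title), the function names and the line-based cut point are all hypothetical, since the article does not show the real header layout or the REXX script behind the button.

```python
from xml.sax.saxutils import escape

def build_header(fields):
    """Build a small XML header from a mapping of element name to text.

    The <header> wrapper and the element names are illustrative only.
    """
    parts = ["<header>"]
    for name, value in fields.items():
        # escape() protects &, < and > lifted from the original text.
        parts.append(f"  <{name}>{escape(value.strip())}</{name}>")
    parts.append("</header>")
    return "\n".join(parts) + "\n"

def replace_top(text_lines, cut_line, fields):
    """Drop the first `cut_line` lines and put the XML header in their place."""
    return [build_header(fields)] + list(text_lines[cut_line:])
```

For example, with the AUTHOR and TITLE fields filled in and the cut point set after the third line, replace_top(lines, 3, {"author": ..., "title": ...}) returns the header followed by everything from the fourth line onward.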

There is of course much more to the application.  For instance, GREP-like find-and-replace functions that process the files in batch mode, drop down lists which allow various tags to be given values, and fail-safe copies of the files that are made and periodically destroyed.  The application is not finished nor is it perfect, but it does work and it has worked from the very first day.
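To give a feel for the batch side, here is a minimal Python sketch of a find-and-replace pass that keeps a fail-safe copy of every file it changes.  It is an illustration under assumptions, not the application's REXX code: the function name batch_replace, the ".bak" suffix and the latin-1 encoding are all invented for the example.

```python
import re
import shutil
from pathlib import Path

def batch_replace(paths, pattern, replacement, encoding="latin-1"):
    """Apply one regex substitution to each file, saving a .bak copy first.

    Returns a dict mapping each path to the number of substitutions made.
    Files with no match are left untouched and get no backup.
    """
    regex = re.compile(pattern)
    counts = {}
    for path in map(Path, paths):
        # newline="" keeps the file's own line endings intact.
        with open(path, encoding=encoding, newline="") as handle:
            text = handle.read()
        new_text, changed = regex.subn(replacement, text)
        if changed:
            shutil.copy2(path, path.with_name(path.name + ".bak"))  # fail-safe copy
            with open(path, "w", encoding=encoding, newline="") as handle:
                handle.write(new_text)
        counts[str(path)] = changed
    return counts
```

A call such as batch_replace(files, r"[ \t]+\n", "\n") would strip trailing blanks from every line in the batch; the backups can be deleted once the results have been checked.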

Excluding the 215-kilobyte RexxIO.dll, all the scripts (REXX and MAID together) come to less than 33 kilobytes, and that includes nine GUI dialogs.  Up to this point, substantially less than eight hours has been spent writing the application, and already within that time most of the collected works of Sir Arthur Conan Doyle (4.84 megabytes) have had preliminary markup.

By any measure 33 kilobytes is not a lot.  Yet even this does not reflect how much script has actually been written, as a good portion of it has been automatically generated by MAID in order to create the GUIs.

In future columns we will explore in more detail how this space saving is achieved, why readability increases, and the mystery of script shrinkage - why the scripts do more as they become smaller.

Greg Schofield, schofield@taunet.net.au, the Darwin correspondent for the RexxLA Newsletter