HTM2ASC Documentation

By Scott M. Baker (smbaker@primenet.com)

I'm not big on documentation, so don't expect to find a whole lot here. HTM2ASC is designed to perform a relatively straightforward task, so there's not a whole lot to say!

Purpose: To convert an HTML document to plain ASCII text while preserving as much of the original formatting as possible.

Syntax: HTM2ASC <filename.htm>

Optional Parameters:
Switch   Parameter
-lxxx    specify initial left margin (default is 0)
-rxxx    specify initial right margin (default is 79)
-t       dump HTML tree to stdout
-h       disable high ASCII use (i.e. tables)
-c       display HTML tag count statistics to stdout
-x       do not write output file
-u       dump all HTML tags found in anchors to stdout
-uf      dump only full HTML tags found in anchors to stdout (full tags start with http://)
-w       (Windows executable only) close the main window after the program terminates
-d       disable horizontal lines between table rows

Distribution: The distribution archive will be named SBH2Axxx.ZIP, where xxx is the current version number. The following files should be present in the archive:
htm2asc.exe    DOS executable file
htm2ascw.exe   Windows (3.1) executable file
htm2asc.htm    Documentation in HTML format
htm2asc.txt    Documentation in plain ASCII format

Some comments....

I wrote HTM2ASC because I wanted to shift my documentation efforts to HTML, due to its benefits over ASCII text. However, there was still a need to support users who did not have HTML support available. Rather than update two documents at the same time, I decided to do all my writing in HTML and then use a converter program of some sort to convert it to ASCII.

I was unable to find an existing utility that adequately handled the job, so I took on the project of writing my own. My understanding of HTML is still very basic, but at this point, HTM2ASC handles my documentation reasonably well.

I have the capability of producing both 16-bit DOS DPMI and 16-bit OS/2 character mode executables if needed. These would have the benefit of larger available memory and probably support larger HTML documents. If there is demand, I can do it...

The Windows Version...

The Windows version uses Borland Pascal's WinCRT unit, which is basically a simple emulation of the DOS interface. The only real advantages of the Windows version are additional memory availability and perhaps better cooperation with other applications.

By default, the Windows version will leave an "inactive" window onscreen with the program results displayed in it. If you want the window to automatically go away, use the -w switch. This should be good for batch operations.

How HTM2ASC deals with some common HTML constructs:

Lists. Bulleted and ordered lists (UL and OL) are easy to handle. I just indent each element six spaces and place the appropriate bullet within that six-space area. Definition lists are similar - <DT> items are all placed at the current left margin and <DD> items are placed 4 spaces in.
Tables. I have put forth a reasonable effort to handle basic tables. Tables are by far the most complex HTML element that I've had to deal with, and as such, are the most likely to not work correctly.
Anchors. Anchors have no use whatsoever in an ASCII document and are therefore ignored completely.
Type Styles: Bold (B), Italics (I), etc. Since there is no way to render these in ASCII, they are ignored.
Headings: <H1>, <H2>, etc. There's no way to do these in ASCII, so the formatting is ignored. Each heading will always start on an empty line.
Horizontal Rule: <HR>. This will be emitted as a series of dashes (-) from the left margin to the right margin.
Images. Images are completely ignored.
Preformatted text <PRE>. Any newlines (ascii #10) are translated as line breaks. Multiple spaces are preserved. <P>'s are ignored.
Blockquote <BLOCKQUOTE>. The left and right margins are each indented 4 spaces.
Escaped characters:
Escape Sequence   Translation
&lt;              <
&gt;              >
&amp;             &
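
A few of the rules above are simple enough to sketch in code. The following Python is only an illustration, not HTM2ASC's actual source (which is written in Borland Pascal). The function names are made up for this example, the escape sequences are assumed to be the standard &lt;, &gt; and &amp;, the right margin is assumed to be an inclusive column number, and the exact position of the bullet inside the six-space area is a guess.

```python
import re

# The three escape sequences and their translations (assumed standard entities).
ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&"}

def unescape(text):
    """Translate the escape sequences in one pass, so that the '&'
    produced from '&amp;' is never re-read as the start of another one."""
    return re.sub(r"&(?:lt|gt|amp);", lambda m: ENTITIES[m.group(0)], text)

def horizontal_rule(left=0, right=79):
    """<HR>: dashes from the left margin through the right margin.
    Assumption: both margins are inclusive column numbers."""
    return " " * left + "-" * (right - left + 1)

def bullet_item(text, left=0, bullet="*"):
    """UL/OL item: the text starts six spaces in from the current left
    margin and the bullet sits somewhere inside that six-space area."""
    gutter = ("  " + bullet).ljust(6)  # bullet placement is hypothetical
    return " " * left + gutter + text

def blockquote_margins(left, right):
    """<BLOCKQUOTE>: indent both margins by four columns."""
    return left + 4, right - 4
```

With the default margins, blockquote_margins(0, 79) gives (4, 75), and a horizontal rule drawn inside that blockquote would span columns 4 through 75.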

Some implementation details:

HTM2ASC works by reading the entire HTML file into memory and creating a tree structure to hold the data. This has the side effect of consuming a large amount of memory, and may cause trouble on extremely large HTML documents. None of my own documents are large enough to present a problem yet, so I'm not worrying about this at this time.

Each HTML tag will be a separate node in this tree. Each bundle of text will also be a separate node. When text is read in, duplicate spaces are eliminated and carriage returns and line feeds are completely ignored.
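
To make the tree idea concrete, here is a minimal Python sketch of that kind of structure. It is not the program's real code: the Node and TreeBuilder names are invented for the example, Python's standard html.parser does the tokenizing, and line breaks are treated like spaces before duplicate spaces are collapsed, which is only one reading of "ignored".

```python
from html.parser import HTMLParser
import re

class Node:
    """One tree node: either an HTML tag or a bundle of text."""
    def __init__(self, kind, value, parent=None):
        self.kind = kind          # "tag" or "text"
        self.value = value        # tag name, or the text itself
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    VOID = {"br", "hr", "img"}    # tags that never enclose content

    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#root")
        self.current = self.root  # tag node currently being filled

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, self.current)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.value != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Assumption: CR/LF count as spaces, then runs of spaces collapse.
        text = re.sub(r"[ \t\r\n]+", " ", data)
        if text.strip():
            Node("text", text, self.current)

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def dump(node, depth=0):
    """Return the tree as indented lines (the real -t output format
    is not documented here, so this layout is made up)."""
    label = node.value if node.kind == "tag" else repr(node.value)
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines
```

Calling print("\n".join(dump(parse(text)))) on a small document shows one line per node, which is roughly the kind of view the -t switch is meant to give.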

The largest contiguous text block (i.e. a text block with no HTML commands in it) that HTM2ASC can handle is 64k. This could present a problem in large documents.

When the tree is processed and written back out, a virtual page is used to hold the text as it is being formatted. This was the only way I could figure out how to handle tables. The virtual page is represented as a circular array/queue. When it fills up, old lines are dumped out to disk. Sometimes the page will be backtracked, as in the case of tables where table cells span multiple text lines.
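
The virtual page can be pictured with a short Python sketch. This is an illustration under assumptions, not the actual implementation: the VirtualPage class, its capacity and width parameters, and the put method are all hypothetical, and a deque stands in for the circular array.

```python
from collections import deque

class VirtualPage:
    """A bounded window of text lines. Rows still inside the window can
    be revisited (backtracked); the oldest row is written out when the
    window is full and a new row is needed."""

    def __init__(self, out, capacity=4, width=80):
        self.out = out            # any object with a write() method
        self.capacity = capacity  # maximum number of buffered rows
        self.width = width
        self.lines = deque()      # buffered rows, oldest first
        self.first_row = 0        # absolute row number of lines[0]

    def _flush_oldest(self):
        row = self.lines.popleft()
        self.out.write("".join(row).rstrip() + "\n")
        self.first_row += 1

    def put(self, row, col, text):
        """Write text at an absolute (row, col) position."""
        if row < self.first_row:
            raise ValueError("row has already been written out")
        # Add blank rows until the target exists, dumping old ones if full.
        while row >= self.first_row + len(self.lines):
            if len(self.lines) == self.capacity:
                self._flush_oldest()
            self.lines.append([" "] * self.width)
        line = self.lines[row - self.first_row]
        for i, ch in enumerate(text):
            if col + i < self.width:
                line[col + i] = ch

    def close(self):
        """Write out whatever is still buffered."""
        while self.lines:
            self._flush_oldest()
```

A table row with two cells would call put for the first cell's lines, then backtrack to the row's first line and call put again at a larger column for the second cell. That only works while the row's first line is still in the window, which is why the window has to be at least as tall as the tallest table row in this sketch.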

Revision History

Version 1.01:

Added -c (HTML tag count), -x (no output) switches
Support for <BLOCKQUOTE>
Better handling of blank lines between paragraphs, etc.

Version 1.02:

Windows version (HTM2ASCW.EXE)
Added -w to make the main window go away in the Windows version
Support for <CENTER> and paragraph alignment tags in <P>
Added horizontal lines between table rows
Added -D option to disable horizontal lines between table rows
Fixed problem with <TH> nodes
Rewrote wordwrap to better deal with leading and trailing spaces around character formatting tags.

Is this program freeware? Shareware? or what?

If there is enough positive response, I will probably make it into a shareware program. In the meantime, its primary purpose is to handle my own documentation, and all of my development efforts will be centered on making it work for my own needs, rather than what other people want.

How to contact me:

US-Mail:

Scott M. Baker
2241 W Labriego
Tucson, Az 85741

My Bulletin board:

The Not-Yet-Named BBS
(520) 544-4655 (USR Dual 14.4k)
(520) 797-8573 (USR Sportster 28.8k)

Email:

smbaker@primenet.com

My Homepage:

http://www.primenet.com/~smbaker
