HTM2ASC Documentation

By Scott M. Baker (smbaker@primenet.com)


I'm not big on documentation, so don't expect to find a whole lot here. HTM2ASC is designed to perform a relatively straightforward task, so there's not a whole lot to say!


Purpose: To convert an HTML document to plain ASCII text while preserving as much of the original formatting as possible.

Syntax: HTM2ASC <filename.htm>

Optional Parameters:
Switch  Description
-lxxx   specify initial left margin (default is 0)
-rxxx   specify initial right margin (default is 79)
-t      dump HTML tree to stdout
-h      disable use of high ASCII characters (i.e. in tables)
-c      display HTML tag count statistics to stdout
-x      do not write output file
-u      dump all HTML tags found in anchors to stdout
-uf     dump only full HTML tags found in anchors to stdout (full tags start with http://)
-w      (Windows executable only) close the main window after the program terminates
-d      disable horizontal lines between table rows
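
As a rough illustration of the difference between -u and -uf, here is a minimal Python sketch. It is not HTM2ASC's own code (which is written in Borland Pascal). It assumes that "tags found in anchors" means the href value of each <a> tag, and the names AnchorCollector and anchor_links are made up for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag (assumed meaning of -u)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def anchor_links(html, full_only=False):
    """Return anchor targets; full_only=True mimics -uf (http:// only)."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    if full_only:
        return [link for link in collector.links if link.startswith("http://")]
    return collector.links
```

Calling anchor_links(text) corresponds to the assumed behavior of -u, and anchor_links(text, full_only=True) to -uf.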

Distribution: The distribution archive will be named SBH2Axxx.ZIP, where xxx is the current version number. The following files should be present in the archive:
htm2asc.exe   DOS executable file
htm2ascw.exe  Windows (3.1) executable file
htm2asc.htm   Documentation in HTML format
htm2asc.txt   Documentation in plain ASCII format


Some comments....

I wrote HTM2ASC because I wanted to shift my documentation efforts to HTML, due to its benefits over ASCII text. However, there was still a need to support users who did not have HTML support available. Rather than update two documents at the same time, I decided to do all my writing in HTML and then use a converter program of some sort to convert it to ASCII.

I was unable to find an existing utility that adequately handled the job, so I took on the project of writing my own. My understanding of HTML is still very basic, but at this point, HTM2ASC handles my documentation reasonably well.

I have the capability of producing both 16-bit DOS DPMI and 16-bit OS/2 character mode executables if needed. These would have the benefit of larger available memory and probably support larger HTML documents. If there is demand, I can do it...


The Windows Version...

The Windows version uses Borland Pascal's WinCrt unit, which is basically a simple emulation of the DOS interface. The only real advantages of the Windows version are additional memory availability and perhaps better cooperation with other applications.

By default, the Windows version will leave an "inactive" window onscreen with the program results displayed in it. If you want the window to close automatically, use the -w switch. This should be good for batch operations.


How HTM2ASC deals with some common HTML constructs:


Some implementation details:

HTM2ASC works by reading the entire HTML file into memory and creating a tree structure to hold the data. This has the side effect of consuming a large amount of memory, and may cause trouble on extremely large HTML documents. None of my own documents are large enough to present a problem yet, so I'm not worrying about this at this time.

Each HTML tag will be a separate node in this tree. Each bundle of text will also be a separate node. When text is read in, duplicate spaces are eliminated and carriage returns and line feeds are completely ignored.
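
To make the tree idea concrete, here is a minimal Python sketch. It is not the actual HTM2ASC source. The names Node, TreeBuilder, build_tree and the VOID_TAGS set are assumptions for illustration, and "completely ignored" is read literally, so carriage returns and line feeds are deleted rather than turned into spaces.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img"}  # assumption: tags that never get a closing tag


class Node:
    def __init__(self, kind, value):
        self.kind = kind        # "tag" or "text"
        self.value = value      # tag name, or the normalized text
        self.children = []


def normalize(text):
    """Delete CR/LF entirely, then squeeze runs of spaces down to one."""
    text = text.replace("\r", "").replace("\n", "")
    return re.sub(r" {2,}", " ", text)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#root")   # artificial root, never closed
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].value == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = normalize(data)
        if text.strip():                   # skip whitespace-only bundles
            self.stack[-1].children.append(Node("text", text))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

In this sketch, build_tree("<p>Hello   big\r\n world<b>bold</b></p>") yields a "p" tag node whose children are a text node and a "b" tag node, in that order.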

The largest contiguous text block (i.e. a text block with no HTML commands in it) that HTM2ASC can handle is 64k. This could present a problem in large documents.

When the tree is processed and written back out, a virtual page is used to hold the text as it is being formatted. This was the only way I could figure out how to handle tables. The virtual page is represented as a circular array/queue. When it fills up, old lines are dumped out to disk. Sometimes the page will be backtracked, as in the case of tables where table cells span multiple text lines.
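
The virtual page can be pictured with a small Python sketch. Again, this is only an illustration under assumptions, not the real implementation: the class name VirtualPage, its methods, and the tiny capacity are all hypothetical. It shows a ring of lines where the oldest line is written out once the ring is full, and where a line that has not been flushed yet can still be revisited (the "backtracking" used for table cells).

```python
import io


class VirtualPage:
    """Fixed-size ring of text lines; the oldest line is flushed when full."""

    def __init__(self, out, capacity=4):
        self.out = out                  # any object with a write() method
        self.capacity = capacity        # hypothetical page size, in lines
        self.lines = [""] * capacity    # ring storage
        self.first = 0                  # absolute number of the oldest held line
        self.count = 0                  # number of lines currently held

    def _slot(self, lineno):
        return lineno % self.capacity

    def put(self, lineno, col, text):
        """Place text at (lineno, col); lineno must not be flushed yet."""
        if lineno < self.first:
            raise ValueError("line already flushed")
        # Extend the page until lineno is held, flushing old lines as needed.
        while lineno >= self.first + self.count:
            if self.count == self.capacity:
                self._flush_oldest()
            self.lines[self._slot(self.first + self.count)] = ""
            self.count += 1
        line = self.lines[self._slot(lineno)].ljust(col)
        self.lines[self._slot(lineno)] = line[:col] + text + line[col + len(text):]

    def _flush_oldest(self):
        self.out.write(self.lines[self._slot(self.first)].rstrip() + "\n")
        self.first += 1
        self.count -= 1

    def close(self):
        while self.count:
            self._flush_oldest()


def demo():
    out = io.StringIO()                 # stands in for the output file
    page = VirtualPage(out, capacity=2)
    page.put(0, 0, "a1")                # first cell, line 0
    page.put(1, 0, "a2")                # first cell wraps to line 1
    page.put(0, 5, "b1")                # backtrack to line 0 for the next cell
    page.put(2, 0, "c")                 # page is full, so line 0 is flushed
    page.close()
    return out.getvalue()
```

The demo function writes a two-line "cell", backtracks to the first line to add a second cell beside it, and then moves on, which forces the oldest line out of the page.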


Revision History

Version 1.01:

Version 1.02:


Is this program freeware? Shareware? or what?

If there is enough positive response, I will probably make it into a shareware program. In the meantime, its primary purpose is to handle my own documentation, and all of my development efforts will be centered on making it work for my own needs, rather than on what other people want.


How to contact me:

US-Mail:
Scott M. Baker
2241 W Labriego
Tucson, AZ 85741

My Bulletin board:
The Not-Yet-Named BBS
(520) 544-4655 (USR Dual 14.4k)
(520) 797-8573 (USR Sportster 28.8k)

Email:
smbaker@primenet.com

My Homepage:
http://www.primenet.com/~smbaker