ASORT Version 1.1 - Copyright 1989 Roger E. Donais


GENERAL:

ASORT is a generic extract and sort utility.  The input file is expected
to be a carriage return - line feed delimited ascii text file.  Source
lines may be of any length, but cannot exceed 4096 characters.

Type ASORT with no parameters to obtain an on-screen command/option
summary.

The sort will *NOT* be attempted if there is less than 48k of memory,
and will abort at the end of the first pass if there is insufficient
memory for the merge buffers. The elapsed time to reach this point will
depend on your disk through-put, and is typically 3% longer than the
time it takes to read and write the file in question. The 3% figure is
based on tests using a 25 mHz 486 with a 680 Meg ESDI Drive reporting a
16ms access and 1012 Meg transfer rate.


OUTPUT FILE:

The output file is defined as one or more segments that are to be
extracted from each source line.  Each extract segment is defined as a
starting column and ending column separated by a colon, or as a starting
column and segment length separated by the letter "x".  Any extract
segment that is beyond the end of a short source line will be space
filled. Portions of the source line that are not described in the
extract definition are discarded.  Each line of the completed extract
file will be the combined length of the concatenated extract segments
followed by an ASCII carriage-return and line-feed character sequence.
Multiple segment descriptions are comma separated with no intervening
spaces.  The resulting output file is a series of fixed length
carriage-return/line-feed delimited records.

        Don't forget to add 2 additional characters for the
        CR/LF to the declared record length when using random
        access techniques to access the resulting output file.
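
The extract rules above can be illustrated with a short Python sketch.
It is not ASORT's code; the function names parse_segments and extract
and the sample line are hypothetical, chosen only to show how the
"start:end" and "start x length" forms, space filling, and the CR/LF
allowance fit together.

```python
def parse_segments(spec):
    """Turn '1:1,6:13,53x2' into (start, length) pairs; start is 1-based."""
    segments = []
    for item in spec.split(","):
        if ":" in item:
            # start:end form, both columns inclusive
            start, end = (int(n) for n in item.split(":"))
            length = end - start + 1
        else:
            # start x length form (letter accepted in either case)
            start, length = (int(n) for n in item.lower().split("x"))
        segments.append((start, length))
    return segments


def extract(line, segments):
    """Concatenate the segments, space filling any part past end of line."""
    pieces = []
    for start, length in segments:
        piece = line[start - 1:start - 1 + length]
        pieces.append(piece.ljust(length))
    return "".join(pieces)


segments = parse_segments("1:1,6:13,53x2")
record = extract("A    SMITH", segments)   # hypothetical short source line

# Fixed record length on disk: extract length plus 2 for the CR/LF.
record_length = sum(length for _, length in segments) + 2
```

Here the sample line ends at column 10, so the 6:13 segment is padded
to eight characters and the 53x2 segment is entirely spaces.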


SORT KEY:

The sort key is defined as one or more key segments that are to be
constructed from each source line.  Each key segment is defined as a
starting column and ending column separated by a colon, or as a starting
column and field length separated by the letter "x".  Multiple sort key
segment descriptions are comma separated with no intervening spaces.

The key description list is read from left to right.  The leftmost
segment definition represents the major sort key, and the rightmost
segment definition represents the minor, or least significant, key.

Any key segment that is beyond the end of a short source line will be
padded with spaces.  This means that non-existent key segments will
have the same collating sequence as an existing space filled key
segment.

An optional letter suffix may be applied to each sort key segment to
control case sensitivity and specify the ascending or descending order
to be applied to that key segment.  The letter "A" or "a" specifies an
ascending sort sequence, and the letter "D" or "d" specifies a
descending sequence.  Use an uppercase "A" or "D" to force conversion to
upper case, thereby removing case sensitivity; and use a lowercase "a"
or "d" to retain case sensitivity.

A pseudo numeric key segment may be declared by appending a minus sign,
"-", either to any of the lettered options or directly to the physical
description.  If used, this option must be the last character of a
segment description.

Each segment of the sort key retains its ASCII characteristics,
therefore non-numeric characters may be contained in a right justified
numeric field.  Entries are determined to be negative when the first
non-space character is a minus sign, "-"; the last non-space character
is a minus sign, "-"; or when the first non-space character is a left
parenthesis, "(", and the last non-space character is a right
parenthesis, ")".  The minus sign or parentheses defining a negative
value are discarded, leading and trailing spaces are removed from the
remaining portion, and the result is then right justified in a space
filled field.
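
A minimal Python sketch of the normalization just described follows. It
is an illustration only, not ASORT's implementation: the function name
normalize_numeric is hypothetical, and returning the sign as a separate
flag is an assumption, since this document does not say how the sign is
stored or how negative values collate.

```python
def normalize_numeric(field):
    """Return (is_negative, text) for one pseudo numeric key segment.

    The text is right justified and space filled to the original width.
    """
    width = len(field)
    text = field.strip(" ")
    negative = False
    if len(text) >= 2 and text.startswith("(") and text.endswith(")"):
        # Parenthesized value, e.g. "(7)"
        negative, text = True, text[1:-1]
    elif text.startswith("-"):
        # Leading minus sign
        negative, text = True, text[1:]
    elif text.endswith("-"):
        # Trailing minus sign
        negative, text = True, text[:-1]
    # Remove spaces left around the remaining portion, then right justify.
    return negative, text.strip(" ").rjust(width)


samples = ["  -12.50 ", "(7)  ", "42-", " 3 "]
results = [normalize_numeric(s) for s in samples]
```

Because the field stays ASCII text, non-numeric characters pass through
unchanged; only the sign markers and the justification are altered.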

Justification and case conversion are applied only to the sort key. Each
section of the data extract is faithfully copied from the source file
retaining the same character case, justification, and/or centering that
occurred in the original file.

If the completed sort records are larger than the available memory,
ASORT will create an intermediate work file. Sufficient free disk space
must be available on the scratch drive to contain the work files, and
sufficient memory must exist to provide a minimal buffer for each
intermediate block flushed to disk.


USAGE:

ASORT is invoked using the following syntax -

  asort <infile >outfile {extract} {sort key} [temp path]

  where   <infile                  is a standard DOS redirected
                                   input specification.

          >outfile                 is a standard DOS redirected output
                                   specification.

          {extract}                is a comma separated field
                                   description list.

          {sort key}               is a comma separated field
                                   description list.

          [temp path]              is an optional DOS drive/path
                                   specification for the
                                   intermediate work file.

  and each extract and sort key field entry is specified as -

      {start column}:{end column} OR {start column}X{field width}

  and each sort key field may have one letter suffix and/or a
  justification suffix -

          a        case sensitive ascending sequence (default option)
          d        case sensitive descending sequence
          A        case insensitive ascending sequence
          D        case insensitive descending sequence
          -        right justified / pseudo numeric


EXAMPLE:    { Single line command, multiple lines for display only }

asort  <e:\test1.def  >tempdata.$$$
            1:1,6:13,53x2,65:68,55x2,92:98
                  1:1D,6:13A,92:98-,53x2A,65:68A,55x2-  D:\

Will extract data from TEST1.DEF located in the root directory of drive
E: and write the sorted result to TEMPDATA.$$$ located in the current
directory of the default drive. The root directory of drive D: will be
used to contain the intermediate work file.

The extract will consist of the concatenation of

                 column 1 through column 1
         and     column 6 through column 13
         and     two characters starting at column 53
         and     column 65 through column 68
         and     two characters starting at column 55
         and     column 92 through column 98

Records (carriage return - line feed delimited lines) will be sorted
from major to minor keys as -

         column 1 through column 1              upper case descending
         column 6 through column 13             upper case ascending
         column 92 through column 98            ascending numeric
         two characters starting at column 53   upper case ascending
         column 65 through column 68            upper case ascending
         two characters starting at column 55   ascending numeric
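
The key list in this example can be decoded mechanically. The Python
sketch below is a hypothetical parser, not ASORT's own: the name
parse_key_segment, the regular expression, and the dictionary layout
are assumptions made for illustration. It accepts "x" in either case,
since the syntax summary shows "X" and the example uses "x".

```python
import re

# start, separator, end-or-width, optional letter, optional trailing minus
KEY_SEGMENT = re.compile(r"^(\d+)([:xX])(\d+)([aAdD]?)(-?)$")


def parse_key_segment(text):
    """Decode one sort key segment such as '6:13A' or '55x2-'."""
    match = KEY_SEGMENT.match(text)
    if match is None:
        raise ValueError("bad key segment: " + text)
    first, sep, second, letter, minus = match.groups()
    start = int(first)
    length = int(second) - start + 1 if sep == ":" else int(second)
    letter = letter or "a"          # "a" is the documented default
    return {
        "start": start,
        "length": length,
        "descending": letter in "dD",
        "fold_case": letter in "AD",   # uppercase letter removes case sensitivity
        "numeric": minus == "-",
    }


spec = "1:1D,6:13A,92:98-,53x2A,65:68A,55x2-"
keys = [parse_key_segment(s) for s in spec.split(",")]  # major key first
```

The first entry decodes as column 1, one character, descending and case
insensitive; the third as columns 92 through 98, ascending and pseudo
numeric, matching the table above.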

MESSAGES:

All runtime messages are written to the stderr device, and will, by
default, be directed to the crt display.  When invoked with no
parameters, ASORT displays a full screen help message.  Additional
messages that may be displayed are -

         ASORT COMMAND ERROR:
                 indicates an error was detected in the command line.
                 The command line will be displayed with a caret (^)
                 marking the position where the error was detected.

         ** insufficient memory **
                 indicates that there is not enough memory available to
                 provide the necessary merge buffers.

         ** unable to create intermediate file **
                 indicates an error occurred when ASORT attempted to
                 create the intermediate work file (ASORT000.$$$).

         ** intermediate file write error **
                 indicates an error occurred when writing to the
                 intermediate work file.

         ** output file write error **
                 indicates an error occurred when writing to the
                 redirected stdout file handle.

SYSTEM REQUIREMENTS:

This release of ASORT was tested using a 9-Meg file consisting of
128,000 seventy (70) character carriage-return line-feed delimited
records.  The record extract was defined as the original seventy (70)
characters, and the sort key was defined as the first ten (10)
characters.  A TSR was used to consume all but 64k of available memory.
The sort was satisfactorily completed in less than forty-nine minutes
using a 16 mHz 386 equipped with a 16ms MiniScribe 9380E.  The same test
completed in less than five minutes when 490,000 bytes of memory was
reported available by Turbo Power's MAPMEM utility.

The largest known task to which ASORT has been applied was a 219 meg
file consisting of 153 byte records, sorted on a five part key totalling
96 bytes.  The result of this task was an appalling 7 hours on a 25 mHz
486 equipped with a 16ms MiniScribe.  (I'll do better next time boss...)

If you are concerned about really large files, memory per record and/or
minimal buffer requirement can be estimated as -

     9 + extract length + key length + number of numeric sort fields

Assuming 100k of memory was available for work space, approximately 600
merge buffers would be available to sort a file of 78 character records
using a 78 character key, which would be sufficient to allow sorting a
28-Meg file.
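
The arithmetic behind the 600 figure can be checked with a few lines of
Python. This is only a worked example of the estimate given above; the
function name bytes_per_record is hypothetical, and treating "100k" as
100 * 1024 bytes is an assumption.

```python
def bytes_per_record(extract_length, key_length, numeric_fields):
    """Estimate from this document: 9 + extract + key + numeric fields."""
    return 9 + extract_length + key_length + numeric_fields


# 78 character extract, 78 character key, no pseudo numeric fields
per_record = bytes_per_record(78, 78, 0)

work_space = 100 * 1024              # assumed meaning of "100k"
buffers = work_space // per_record   # whole buffers that fit
```

With these inputs each record needs 165 bytes and 620 buffers fit,
which agrees with the "approximately 600" quoted above.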

Naturally, the more memory, the faster the sort!

The size of the intermediate file can be computed as the number of
records times the sum of:

     (extract length + key length + number of numeric sort fields)

Thus, the above example will require approximately 56-meg in order to
produce a 28-meg output file.  Although memory is an important factor in
speed, it will usually not be the limiting factor for the size of file
that can be created.
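
As a rough check of that figure, the formula can be evaluated in
Python. The function name intermediate_file_bytes is hypothetical, and
the record count is an assumption: it takes "28-meg" as 28,000,000
bytes and each output record as 78 characters plus the 2 byte CR/LF.

```python
def intermediate_file_bytes(records, extract_length, key_length,
                            numeric_fields):
    """Work file size: records * (extract + key + numeric fields)."""
    return records * (extract_length + key_length + numeric_fields)


output_bytes = 28_000_000            # assumed reading of "28-meg"
records = output_bytes // (78 + 2)   # 78 characters plus CR/LF per record

work_bytes = intermediate_file_bytes(records, 78, 78, 0)
ratio = work_bytes / output_bytes    # close to 2 when key and extract match
```

Because the key is as long as the extract in this example, the work
file comes to roughly twice the size of the output file.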

You will know if the existing resources are sufficient to complete the
sort by the end of the first pass. The elapsed time to reach this point
will depend on your disk through-put, and will usually be only 3% longer
than the time it takes to read and write the file in question. The 3%
factor is based upon tests performed using a 25 mHz 486 and the 16ms
access and 1012 Meg transfer provided by a 680 Meg ESDI Drive.

SUMMARY:

The algorithm has migrated in its present form through Turbo-3,
Turbo-4, Turbo-5, and Turbo-C, and finally to the assembly language
version presented here.  Of the high level languages, Turbo-3 performed
best, sorting a 10-meg file of 80 character records using a 10-Mhz Wyze
in 35 minutes.  Turbo-4 and 5 took one minute longer.  Turbo Pascal 4
and 5 produced substantially smaller executable files and lost the race
with Turbo-C when sorting small files, but performed substantially
better when sorting larger files.  This improvement was probably due to
my ability to control the Pascal heap better than the Turbo-C heap.  For
all its inefficiencies, the assembly version is one fourth the size of
the smallest high level implementation, and six times faster than the
best of them.

-----------------------------------------------------------------------
                        --- The legal Stuff ---

ASORT includes trade secrets and confidential information, which is the
copyrighted intellectual property of Roger Donais. 

ASORT may be freely copied and distributed on a non-profit basis, but
may not be sold or traded for monetary value, nor used commercially
without the written consent of the author.

ASORT is provided without warranty, expressed or implied, including but
not limited to fitness for a particular purpose. The author does not
guarantee the accuracy of the program and accepts no responsibility for
its use.

                Roger Donais (Compuserve ID: 70414,525)