ASORT Version 1.1 - Copyright 1989 Roger E. Donais

GENERAL:

ASORT is a generic extract and sort utility. The input file is
expected to be a carriage return - line feed delimited ASCII text
file. Source lines may be of any length, but cannot exceed 4096
characters. Type ASORT with no parameters to obtain an on-screen
command/option summary.

The sort will *NOT* be attempted if there is less than 48k of memory,
and will abort at the end of the first pass if there is insufficient
memory for the merge buffers. The elapsed time to reach this point
depends on your disk throughput, and is typically 3% longer than the
time it takes to read and write the file in question. That figure was
measured using a 25 MHz 486 with a 680 Meg ESDI drive reporting a
16ms access and 1012 Meg transfer rate.

OUTPUT FILE:

The output file is defined as one or more segments that are to be
extracted from each source line. Each extract segment is defined as a
starting column and ending column separated by a colon, or as a
starting column and segment length separated by the letter "x".
Multiple segment descriptions are comma separated with no intervening
spaces.

Any extract segment that is beyond the end of a short source line
will be space filled. Portions of the source line that are not
described in the extract definition are discarded.

Each line of the completed extract file will be the combined length
of the concatenated extract segments, followed by an ASCII
carriage-return and line-feed character sequence. The resulting
output file is therefore a series of fixed length
carriage-return/line-feed delimited records. Don't forget to add 2
additional characters for the CR/LF to the declared record length
when using random access techniques to access the resulting output
file.

SORT KEY:

The sort key is defined as one or more key segments that are to be
constructed from each source line.
Each key segment is defined as a starting column and ending column
separated by a colon, or as a starting column and field length
separated by the letter "x". Multiple sort key segment descriptions
are comma separated with no intervening spaces.

The key description list is read from left to right. The leftmost
segment definition represents the major sort key and the rightmost
segment definition represents the minor, or least significant, key.

Any key segment that is beyond the end of a short source line will be
padded with spaces. This means that non-existent key segments will
have the same collating sequence as an existing space filled key
segment.

An optional letter suffix may be applied to each sort key segment to
control case sensitivity and to specify the ascending or descending
order to be applied to that key segment. The letter "A" or "a"
specifies an ascending sort sequence, and the letter "D" or "d"
specifies a descending sequence. Use an uppercase "A" or "D" to force
conversion to upper case, thereby removing case sensitivity; use a
lowercase "a" or "d" to retain case sensitivity.

A pseudo numeric key segment may be declared by appending a minus
sign, "-", to any of the lettered options or directly to the physical
description. If used, this option must be the last character of a
segment description. Each segment of the sort key retains its ASCII
characteristics; therefore non-numeric characters may be contained in
a right justified numeric field.

Entries are determined to be negative when the first non-space
character is a minus sign, "-"; when the last non-space character is
a minus sign, "-"; or when the first non-space character is a left
parenthesis, "(", and the last non-space character is a right
parenthesis, ")". The minus sign or parentheses defining a negative
value are discarded, leading and trailing spaces are removed from the
remaining portion, and the result is then right justified in a space
filled field.
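
The pseudo numeric rules just described can be sketched in Python as
follows. This is only an illustration of the stated rules, not ASORT's
own assembly code. The function name normalize_numeric_segment is
hypothetical, the order in which the three negative tests are tried is
an assumption, and since the text does not say how a negative entry is
folded into the collating order, the sketch simply returns the flag
alongside the justified text.

```python
def normalize_numeric_segment(raw):
    """Return (is_negative, justified_text) for one pseudo numeric key segment.

    raw is the key segment exactly as cut from the source line, already
    space padded to the declared segment width.
    """
    width = len(raw)
    text = raw.strip(" ")
    negative = False

    # Parenthesised entry such as "(45)": both markers are discarded.
    if len(text) >= 2 and text[0] == "(" and text[-1] == ")":
        negative = True
        text = text[1:-1]
    # Leading minus sign such as "-12".
    elif text.startswith("-"):
        negative = True
        text = text[1:]
    # Trailing minus sign such as "12-".
    elif text.endswith("-"):
        negative = True
        text = text[:-1]

    # Trim what remains and right justify it in a space filled field
    # of the original width.
    return negative, text.strip(" ").rjust(width)
```

For instance, normalize_numeric_segment("(45)   ") returns
(True, "     45"): the parentheses are dropped and the digits are
pushed to the right edge of the seven character field.
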
Justification and case conversion are applied only to the sort key.
Each section of the data extract is faithfully copied from the source
file, retaining the same character case, justification, and/or
centering that occurred in the original file.

If the completed sort records are larger than the available memory,
ASORT will create an intermediate work file. Sufficient free disk
space must be available on the scratch drive to contain the work
files, and sufficient memory must exist to provide a minimal buffer
for each intermediate block flushed to disk.

USAGE:

ASORT is invoked using the following syntax -

     asort <infile >outfile {extract} {sort key} [temp path]

where

     infile       is a standard DOS redirected input specification.
     outfile      is a standard DOS redirected output specification.
     {extract}    is a comma separated field description list.
     {sort key}   is a comma separated field description list.
     [temp path]  is an optional DOS drive/path specification for
                  the intermediate work file.

and each extract and sort key field entry is specified as -

     {start column}:{end column}
  OR {start column}X{field width}

and each sort key field may have one lettered suffix and/or a
justification suffix

     a    case sensitive ascending sequence (default option)
     d    case sensitive descending sequence
     A    case insensitive ascending sequence
     D    case insensitive descending sequence
     -    right justified / pseudo numeric

EXAMPLE:

     { Single line command, multiple lines for display only }

     asort <E:\TEST1.DEF >tempdata.$$$
           1:1,6:13,53x2,65:68,55x2,92:98
           1:1D,6:13A,92:98-,53x2A,65:68A,55x2-
           D:\

Will extract data from TEST1.DEF located in the root directory of
drive E: and write the sorted result to TEMPDATA.$$$ located in the
current directory of the default drive. The root directory of drive
D: will be used to contain the intermediate work file.
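
As a rough illustration of how the field description lists in the
example above are read, here is a minimal Python sketch. It is not
taken from ASORT itself: the names parse_segments and extract_line and
the dictionary layout are hypothetical, and for simplicity the same
parser accepts the optional suffixes on extract fields too, which
ASORT may well reject.

```python
import re

# start:end  OR  startXwidth, then an optional a/d/A/D letter and an
# optional trailing "-" marking a pseudo numeric key segment.
_SEGMENT = re.compile(r"^(\d+)(?::(\d+)|[xX](\d+))([aAdD]?)(-?)$")


def parse_segments(spec):
    """Parse a comma separated field description list into dictionaries."""
    segments = []
    for item in spec.split(","):
        m = _SEGMENT.match(item)
        if m is None:
            raise ValueError("bad segment description: %r" % item)
        start = int(m.group(1))
        if m.group(2) is not None:
            end = int(m.group(2))               # start:end form
        else:
            end = start + int(m.group(3)) - 1   # startXwidth form
        letter = m.group(4) or "a"              # "a" is the default option
        segments.append({
            "start": start,                     # 1-based, inclusive
            "end": end,                         # 1-based, inclusive
            "descending": letter in "dD",
            "fold_case": letter in "AD",        # uppercase letter folds case
            "numeric": m.group(5) == "-",
        })
    return segments


def extract_line(line, segments):
    """Concatenate the segments of one source line, space filling short lines."""
    parts = []
    for seg in segments:
        width = seg["end"] - seg["start"] + 1
        parts.append(line[seg["start"] - 1:seg["end"]].ljust(width))
    return "".join(parts)
```

With these definitions, parse_segments("53x2A") describes columns 53
through 54, ascending with case folding, and
extract_line("abcdefgh", parse_segments("1:1,6x5")) yields "afgh  ",
the last two positions being space fill for the short line.
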

The extract will consist of the concatenation of

     column 1 through column 1 and
     column 6 through column 13 and
     two characters starting at column 53 and
     column 65 through column 68 and
     two characters starting at column 55 and
     column 92 through column 98

Records (carriage return - line feed delimited lines) will be sorted
from major to minor keys as -

     column 1 through column 1              upper case descending
     column 6 through column 13             upper case ascending
     column 92 through column 98            ascending numeric
     two characters starting at column 53   upper case ascending
     column 65 through column 68            upper case ascending
     two characters starting at column 55   ascending numeric

MESSAGES:

All runtime messages are written to the stderr device, and will, by
default, be directed to the CRT display. When invoked with no
parameters, ASORT displays a full screen help message. Additional
messages that may be displayed are -

ASORT COMMAND ERROR:
     indicates an error was detected in the command line. The
     command line will be displayed with a caret (^) marking the
     position where the error was detected.

** insufficient memory **
     indicates that there is not enough memory available to provide
     the necessary merge buffers.

** unable to create intermediate file **
     indicates an error occurred when ASORT attempted to create the
     intermediate work file (ASORT000.$$$).

** intermediate file write error **
     indicates an error occurred when writing to the intermediate
     work file.

** output file write error **
     indicates an error occurred when writing to the redirected
     stdout file handle.

SYSTEM REQUIREMENTS:

This release of ASORT was tested using a 9-Meg file consisting of
128,000 seventy (70) character carriage-return line-feed delimited
records. The record extract was defined as the original seventy (70)
characters, and the sort key was defined as the first ten (10)
characters. A TSR was used to consume all but 64k of available memory.
The sort was satisfactorily completed in less than forty-nine minutes
using a 16 MHz 386 equipped with a 16ms MiniScribe 9380E. The same
test completed in less than five minutes when 490,000 bytes of memory
was reported available by Turbo Power's MAPMEM utility.

The largest known task to which ASORT has been applied was a 219 meg
file consisting of 153 byte records, sorted on a five part key
totalling 96 bytes. The result of this task was an appalling 7 hours
on a 25 MHz 486 equipped with a 16ms MiniScribe. (I'll do better next
time boss...)

If you are concerned about really large files, memory per record
and/or minimal buffer requirement can be estimated as -

     9 + extract length + key length + number of numeric sort fields

Assuming 100k of memory was available for work space, approximately
600 merge buffers would be available to sort a file of 78 character
records using a 78 character key, which would be sufficient to allow
sorting a 28-Meg file. Naturally, the more memory, the faster the
sort!

The size of the intermediate file can be computed as the number of
records times the sum of:

     (extract length + key length + number of numeric sort fields)

Thus, the above example will require approximately 56-meg of work
file space in order to produce a 28-meg output file.

Although memory is an important factor of speed, it will usually not
be the limiting factor for the size of file that can be created. You
will know if the existing resources are sufficient to complete the
sort by the end of the first pass. The elapsed time to reach this
point will depend on your disk throughput, and will usually be only
3% longer than the time it takes to read and write the file in
question. The 3% factor is based upon tests performed using a 25 MHz
486 and the 16ms access and 1012 Meg transfer provided by a 680 Meg
ESDI Drive.

SUMMARY:

The algorithm has migrated in its present form from Turbo-3, Turbo-4,
Turbo-5, Turbo-C, and finally to the assembly language version
presented here.
Of the high level languages, Turbo-3 performed best, sorting a 10-meg
file of 80 character records using a 10-MHz Wyze in 35 minutes.
Turbo-4 and 5 took one minute longer. Turbo Pascal 4 & 5 produced
substantially smaller executable files and lost the race with Turbo-C
when sorting small files, but performed substantially better when
sorting larger files. This improvement was probably due to my ability
to control the Pascal heap better than the Turbo-C heap.

For all its inefficiencies, the assembly version is one fourth the
size of the smallest high level implementation, and six times faster
than the best of them.

-----------------------------------------------------------------------
                       --- The legal Stuff ---

ASORT includes trade secrets and confidential information, which is
the copyrighted intellectual property of Roger Donais. ASORT may be
freely copied and distributed on a non-profit basis, but may not be
sold or traded for monetary value, nor used commercially without the
written consent of the author.

ASORT is provided without warranty, expressed or implied, including
but not limited to fitness for a particular purpose. The author does
not guarantee the accuracy of the program and accepts no
responsibility for its use.

Roger Donais (Compuserve ID: 70414,525)