FWKCS(TM) Contents_Signature System, Ver. 2.04, 1995 Aug 30.
(C)Copyright Frederick W. Kantor 1989, 1995. All rights reserved.

New or changed in FWKCS version 2.04:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the basis of experimental tests, the contents_signature ("cs")
originally introduced in this software in the 1980's, would appear to
carry a typical pairwise statistical error rate of less than, or of the
order of, one part in ten trillion (1/(10,000,000,000,000)) -- more
than 1000 times as good as the 32_bit CRC. This additional statistical
resolution has been important in serving the needs of electronic
bulletin boards, because they often contain more files than can be
reliably distinguished by the 32_bit CRC. This has played a role in
reducing the risk of accident during the automatic recognition of
duplicate files with changed names.

In the years since the original contents_signature was introduced, file
collections have grown until they contain so many files that the
original cs does not provide enough statistical resolution for the
level of reliability desired.

Starting with version 2.04, FWKCS supports a new, enhanced
contents_signature ("long cs"), and continues to support the original
contents_signature ("short cs"). The new, enhanced FWKCS cs has an
estimated typical statistical pairwise error rate of less than, or of
the order of, one part in 1.0 E+51; that is, less than, or of the order
of, one part in
  1,000,000,000,000,000,
    000,000,000,000,000,000,000,000,000,000,000,000.

The original cs and the new long cs are generated using
assembly_language code, in a conveniently short time. The new long cs
includes the original short cs. The two cs's can be used in the same
data base: FWKCS uses a set of rules to automatically handle the
combination.
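To see why the pairwise error rate matters as a collection grows, the expected number of accidental matches among n files can be roughly estimated as (number of pairs) x (pairwise error rate). The sketch below is only an illustration under that simplifying assumption (every pair behaves independently, with the same rate). The function name and the sample collection size are hypothetical. The rates are the order-of-magnitude figures quoted above, with the 32_bit CRC taken as 1/2^32.

```python
from fractions import Fraction


def expected_false_matches(n_files, pairwise_error_rate):
    """Rough expected count of accidental matches among n_files.

    Assumes each pair of distinct files matches by accident with the
    same independent probability (a simplification, not a measurement).
    """
    pairs = n_files * (n_files - 1) // 2
    return pairs * pairwise_error_rate


# Order-of-magnitude rates; the CRC figure is an assumption (1/2^32).
RATES = {
    "32_bit CRC": Fraction(1, 2**32),
    "short cs": Fraction(1, 10**13),
    "long cs": Fraction(1, 10**51),
}

if __name__ == "__main__":
    n = 1_000_000  # hypothetical collection size
    for name, rate in RATES.items():
        estimate = float(expected_false_matches(n, rate))
        print(f"{name:10s} {estimate:.3g}")
```

Under these assumptions, a collection of a million files would be expected to show many accidental CRC matches, while the same estimate for the short cs stays well below one.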
This enhanced statistical resolution permits FWKCS to serve the needs
of the bulletin board community, makes the FWKCS contents_signature a
more powerful tool for identifying and protecting intellectual
property, and can assist in using FWKCS in situations involving larger
statistical bases. With this enhanced statistical resolution, FWKCS
continues to provide split_second lookup for finding matching
contents_signatures.

Background
~~~~~~~~~~

FWKCS is the premier system for automatically recognizing duplicate
files and duplicate zipfiles, independent of filename. It is used on
major electronic bulletin board systems; quality control is backed by
more than 5,000,000 node_hours on giant systems, through which have
passed copies of a large fraction of all the different shareware
zipfile products that our civilization has seen.

When used on a network or in a multitasking environment, FWKCS can
provide 24_hour_a_day operation with no down_time for maintenance --
all normal maintenance operations, including consolidating the data
base etc, are transparent to users; they can be done while the system
is up and running and handling traffic. It is even possible to search
the system and rebuild the data base while the system is up and
running.

Series 2.nn includes features to help you protect your system from
becoming involved in software piracy. For ease in updating, the
anti-piracy resource material for use with FWKCS is being distributed
in a companion series, FWKCXnnn.ZIP; that series started with
FWKCX001.ZIP, issued 1995 Jan 16 (note: that series number is not tied
to the FWKCS version number). This resource material allows you to use
powerful FWKCS features for the automatic recognition and automatic
blocking of files, independent of filename. For where to get the most
recent release, see "Note 2:" near the end of README.TXT. The
executable code needed is provided in this package (FWKCS204.ZIP).
See especially XCLEANUP.BAT and FLAG_REV.BAT (both are automatically
installed in your \CS directory; for on_line help in your \CS
directory, while in that directory do CSM ).

------------------------

For current users, below is a summary of what is new or changed in
FWKCS(TM) Version 2.04.

This release includes a program (REPLACE.BAT) which lets you replace
your existing, working version of FWKCS, Ver. 1.12 or later, while
keeping your working CS lists, logs, special messages, and
configuration.

Changes in FWKCS.EXE:
~~~~~~~~~~~~~~~~~~~~~
 1. To control the generation and use of the original
    contents_signature and the new, longer contents_signature, the
    following commands have been added, where N = 0...5, M = 0...15 :

      internal variable setting          FWKCS N.M /dcs
                                 (see CS line on FWKCS /d! screen)
      environment variable               FWKCSL=N.M
                           (if not (0...5).(0...15), then ignored)
      preface option                     &N.M
      Auxiliary Function 2...6 option    &N.M

    In each case, for N.M

      N is bit mapped, 0...5 :
        0 - compatibility mode (default).
        1 - make long cs for plain files.
        2 - if long cs input, then require a long cs for match.
        4 - use only short cs for match even if long cs input.
      For example, "3" means do 1 and do 2.

      M is bit mapped, 0...15 :
        0 - compatibility mode (default).
        1 - if long_cs data found in zipfile, make long cs.
        2 - if missing, prepare long_cs data (+ measure filelengths).
        4 - prepare data whether or not missing.
        8 - revise central directory of zipfile.

    Notes:
      2+4 is allowed, but component 4 automatically overrides
        component 2; e.g., 6 has same effect as 4, 14 has same effect
        as 12;
      FWKCS /1 may call PKUNZIP (using whatever name is specified in
        line "6."
        on the FWKCS /d screen) to unzip files;
      if (2 or 4) and not 8, then 1 is automatically enabled;
      a zip_encrypted (PKZIP option -s) file is not processed to make
        a long cs;
      if a zipfile contains one or more entries for which a long cs is
        not provided, then its zipfile_contents_signature is short
        rather than long;
      if on analysis by FWKCS a zipfile appears defective, FWKCS may
        still prepare long cs's, but in the case of single zipfiles
        returns errorlevel = 7;
      for making the composite long_zipfile_contents_signature, each
        file with 32_bit_CRC=0 and uncompressed_filelength=0 is
        skipped;
      if every file in a zipfile has CRC = 0 and uncompressed
        filelength = 0, the long zipfile contents signature includes
        the MD5 hash for an empty file (this avoids treating the
        non_zero definition for the MD5 nul as cumulative);
      MD5 hash is written in the style set by its originators: the
        lowest byte is at the left, broken into its high hexadecimal
        "nibble" (a "nibble" is 4 bits) followed by its low
        hexadecimal "nibble", then the next higher byte is written in
        the same style (high hexadecimal nibble, then low hexadecimal
        nibble), etc;
      the long zcs uses the MD5 128_bit numbers summed mod 2^128
        before being converted to an MD5_style string of hexadecimal
        characters;
      when revising the zipfile central directory to insert MD5 data,
        FWKCS also inserts the measured uncompressed file length (to
        avoid tampering);
      when processing and revising zipfiles, FWKCS can process
        zipfiles which contain files which have DOS filenames, long
        filenames, filenames which contain gaps, and filenames with
        multiple "." (e.g., OS/2, Unix), including zipped paths each
        of which can contain up to 127 levels of subdirectories;
      if the zipfile has a Zipfile Authenticity Verification stamp,
        its AV stamp is preserved.
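The N.M bit maps and the override rules in the notes above can be sketched as a small decoder. This is only an illustration of the rules as stated in this document, not FWKCS code; the function name decode_nm and the returned structure are hypothetical, and returning None for an out-of-range value merely mirrors the "then ignored" wording given for FWKCSL.

```python
N_BITS = {
    1: "make long cs for plain files",
    2: "if long cs input, require a long cs for match",
    4: "use only short cs for match even if long cs input",
}
M_BITS = {
    1: "if long_cs data found in zipfile, make long cs",
    2: "if missing, prepare long_cs data",
    4: "prepare data whether or not missing",
    8: "revise central directory of zipfile",
}


def decode_nm(text):
    """Return (n_bits, m_bits) as lists of active bit values, or None.

    Hypothetical helper: None stands for "setting ignored".
    """
    parts = text.split(".")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return None
    n, m = int(parts[0]), int(parts[1])
    if not (0 <= n <= 5 and 0 <= m <= 15):
        return None
    if m & 4:
        m &= ~2  # component 4 overrides component 2
    if (m & 6) and not (m & 8):
        m |= 1   # (2 or 4) and not 8: 1 is automatically enabled
    n_bits = [bit for bit in N_BITS if n & bit]
    m_bits = [bit for bit in M_BITS if m & bit]
    return n_bits, m_bits


if __name__ == "__main__":
    n_bits, m_bits = decode_nm("1.11")
    for bit in n_bits:
        print("N", bit, "-", N_BITS[bit])
    for bit in m_bits:
        print("M", bit, "-", M_BITS[bit])
```

With this sketch, 0.6 decodes to the same M components as 0.4, and 0.14 to the same as 0.12, matching the notes.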
    For example, 1.11 means make long cs for plain files, make long
    cs's for zipfiles if missing and revise the zipfiles, use long
    cs's of plain files and of files in zipfiles, and allow match
    testing against long or short cs (if both types are compared,
    statistical resolution is limited by shorter cs).
 2. A long zipfile contents signature is supported, generated in a
    way which avoids treating the non_zero MD5 hash for a zero_length
    file as cumulative (see 'Zipfile_Contents_Signature ("zcs")' via
    the FWKCS204.REF Table of Contents).
 3. New function, /1z- , to remove MD5 data from zipfile central
    directory.
 4. When a long cs is present, the "Column_17" flags appear instead
    in column 50.
 5. Sorting CSLIST on flags, filenames, etc, with mixed long and
    short cs's: /s option
      A - adjust key pointers as needed to allow for different cs
          lengths: the key positions are defined as for the original
          cs's, and those which fall on column 17 or later are
          automatically shifted when a long cs is found.
    For example, to sort on filenames,
      FWKCS CSLIST1.SRT OUTFILE /sa18:12
 6. Revised code to support scanning for duplicates and making
    MULTCNT.RPT, /1sm , with mixed long and short cs's; can override
    with N=4 in /&N.M/ or in environment variable FWKCSL=N.M
 7. New function /c4 , to convert a contents_signature list
    containing long_cs's and short_cs's to all short_cs's, delete
    named outfile if no output:
      FWKCS CSLIST.SRT /c4 OLDCS
    Note that the output file should be sorted before being indexed.
    If you have multiple lines with different column_50 flags and the
    same contents_signature, you may want to separate out those lines
    which contain a j, k, l, or r flag in column_50 before using /c4,
    process them separately, append them to the main processed file
    using FWKQA, and then sort using FWKCS filename /s .
 8.
    New option c for use with function f (Find) or g (Get), to test
    only the 32_bit CRC (first 8 characters) for a match; the input
    can be as short as 8 characters (the hexadecimal representation
    for a 32_bit CRC).
 9. New option c under function /c2 : /c2c tests only the 32_bit CRC
    (first 8 characters) for a match; the input can be as short as 8
    characters (the hexadecimal representation for a 32_bit CRC).
10. FWKCS checks to see if the computer supports 32_bit code; if so,
    it uses 32_bit code where appropriate for generating
    contents_signatures.
11. New preface option /&p to suppress 32_bit code.
12. Modified function /A7.2, so that /A7.2!! can be used to divide a
    CSLIST containing both long and original contents_signatures into
    two separate files each containing only one kind of
    contents_signature. For any other character used as a flag, the
    search automatically adjusts for the presence of a long cs, so
    that flags in column 17 for original cs's and column 50 for long
    cs's are treated as equivalent.
13. When processing zipfile uploads and using the new enhanced cs,
    FWKCS can now provide virus testing for files which have DOS
    filenames, long filenames, filenames which contain gaps, and
    filenames with multiple "." (e.g., OS/2, Unix), including zipped
    paths each of which can contain up to 127 levels of
    subdirectories; multiple files with the same name but different
    zipped paths can be processed, without permitting one file to
    overwrite and block the virus testing of another file with the
    same name.
14. New format option for function /A7.8 :
      w1 option for single space listing of filenames, ELSE
      doublespace.
15. The "swap" commands
      #;      swap all,
      #luvz;  swap if List Unzip scanV Zip,
    long available under Auxiliary Functions 2-6, are now available
    as "preface options" which can be specified before any family of
    non_preface functions. Note that the swap command cannot appear
    as the first preface option, because the combination "/#" is
    reserved for other use.
    However, the default time value for tN is 3 seconds, so you can
    use the preface option combination "/t3#;" without affecting the
    t value.
16. New option 0 for the reVise function, v0 instead of v , to keep
    only the first entry, each, for an unflagged short cs with CRC=0
    Filelength=0, and for an unflagged long cs with CRC=0
    Filelength=0 MD5=MD5(nul); this also applies to Concurrent
    revision using /tNNNv0cNNNNN.
17. Added FWKCS header line, and registration, to messages prepared
    under Auxiliary Functions 5 and 6 (BBS); added new option
      &wN   N=0 1 2,
        0  put statement at top of FWKCS mid_message
        1  put at top of (composite) output
        2  put at bottom of (composite) output
      0 is default, other values are enabled in registered copy.
18. Modified code re registration key, to accept a registration name
    up to 64 characters long (spaces count as characters); it still
    accepts the prior keys.
19. Added ability to find zipfile central directory information in
    the presence of a variety of pointer errors; this permits
    processing of "VENDINFO.DIZ" files, which, as of this writing, do
    not comply with the published standards for zipfile central
    directory structure and contents.
20. Added option v7 under Auxiliary Functions 4 - 6, to not count
    VENDINFO.DIZ files found inside zipfiles as zipfile_in_zipfile.
21. Added option v7 under /1 commands, to treat VENDINFO.DIZ as
    nonzip.
22. Added z7 option under Auxiliary Functions 3 - 6, so that if a
    filename has the extension .ZIP, then the file is required to be
    a zipfile. For Auxiliary Functions 5, 6, added in &a; (how to
    treat files ATTACHed to messages (e.g., PCBoard 15.0 or later)) a
    corresponding option z7, to retain any z7 restriction used for
    non_ATTACHed files (default is to drop any such restriction when
    processing an ATTACHed file).
23. Added /1 options re committing output file:
      o  - commit Output of contents_signature lines, use default
           buffer size (networks)
      o1 - like o, but write cs output promptly for each processed
           file or zipfile.
24. Modified /1 option m to report overflow if more than 4157 matches
    are found in a single case; if that occurs, mNNN can be used to
    capture samples, with the ability to display a much larger number
    of matches at the end of the sample line.
25. Added option e for index command /i, to Evaluate
    contents_signature format for both short cs and long cs; when e
    not used, /i can make an index for a list containing long cs,
    short cs, and 32_bit CRC.
26. Added four exit errorlevels:
      90 - successive lines not in ascending ASCII order.
      91 - bad contents_signature.
      92 - line too long.
      93 - line too short.
    Note: if sorting using option A (see 5, above), and errorlevel 93
    is generated when a line is too short for a keyed sort, the value
    reported for the minimum required line length is that for a line
    using a short cs; add 33 (decimal) when the line starts with a
    long cs.
27. Added /1 display option:
      g  onGoing display of filecount + current d:\path\filename.ext.
28. Fixed a bug in keyed sorting when not_last key open or longer
    than line.
29. Fixed a bug which, apparently under certain circumstances under a
    certain network driver, could result in accidental deletion of
    files in the \CS directory.
30. Modified code to avoid a shift of file date when deleting
    spurious tails from .GIF files on HPFS drive under OS/2 Warp 3.
31. Various minor changes.

Changes in FWKCSC.COM:
~~~~~~~~~~~~~~~~~~~~~~
 1. Added test for UPLOAD at second position on command tail when
    FWKCSC is running as a bulletin board system client under an
    FWKCS host and there has not been a timely reply from host:
      IF UPLOAD
        AND a directory has been designated for storing unchecked
            uploads
        AND running as BBS client
      THEN
        move the input file to that target directory
        copy any file description to a companion target directory
          (with ".D" added to directory name) and use the same name
          of the file it describes for the copy of the description
 2. Various minor changes.

Changes in FWKDG.COM:
~~~~~~~~~~~~~~~~~~~~~
 1.
    Added v Verbose text, to capture full entry from text
    descriptions in directories DIR0...DIR999 made in same format
    used in Clark Development Company's PCBoard, including multiline
    file descriptions.
    (added to support new FWKFF.COM, below; for an example of how to
    make a search file for a BBS, see "Using CRLF0, FWKCS, FWKFF, and
    FWKDG together" (without the quotes) in FWKCS204.REF)
 2. Added support for text directory names listed on command line,
    wildcards OK; can use d:\path\.
    Format:
      FWKDG (drive (lastdriv)) /tNNNoption ((d:\path\)textdir) ...)
    The number of text directory identifiers (each of which can
    contain wild cards * ?) which can be listed on the command line
    is limited by the DOS command line length of 127 characters.
    If a text directory is specified on the command line, the default
    of searching text directories DIR0 - DIR999 in the current
    directory is suppressed; to also search the default text
    directories, use "," (without quotes) as an entry on the command
    line; to suppress search of any text directory, put " ." alone
    after the /option.
    To suppress searching drives, put a "." (without quotes) before
    the /option, instead of a drive letter.
 3. If on network, FWKDG opens text directory files "read_only
    deny_none"; tNNN option specifies how long FWKDG tries to access
    a text directory file not known to be missing.
 4. Various minor changes.

Changes in FWKFT.COM:
~~~~~~~~~~~~~~~~~~~~~
 1. Can now directly open file listed on command line for input, and
    in that mode can support network read_only deny_none (it can
    still accept redirected input).
 2. Various minor changes.

Changes in FWKQA.COM:
~~~~~~~~~~~~~~~~~~~~~
 1. Added network feature of opening files "deny_none".
 2. Various minor changes.

Changes in other .COM programs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Various minor changes in FWKHI.COM, DSA.COM, FWKCSS.COM, FWKCST.COM,
CRLF0.COM ("FWKCRLF0(TM)"), FWKEM.COM, FWKM.COM, and FWKLW.COM.

New .COM program: FWKFF.COM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
FWKFF(TM) provides conveniently fast, case sensitive search, starting
in the left column, of a sorted (ascending) ASCII file, where each
line ends with "carriage_return line_feed" (hexadecimal 0d,0a).
Although FWKFF does not provide the phenomenal speed of FWKCS's
contents_signature search, this small program fits well with
human_interface applications, and does not require any index.

Format: FWKFF (/options) "search string" (<) INFILE (>(>) OUTFILE )

  For use with sorted (ascending) ASCII text files (line terminator =
  0d,0a).
  To include quote " in search string, use "" .
  Comparison for match starts in left column.
  Comparison is case sensitive.
  When opening file directly in network or multitask environment,
  FWKFF opens INFILE 'read_only deny_none'.

  options:
    *   - show this help screen; set errorlevel = 99 decimal.
    c   - Capitalize search string before searching (won't find lower
          case item).
    f   - text block begins Flush left (indented lines included in
          text block).
    ver - set exit errorlevel per version number sans ".".

  exit errorlevel:
     0 - match found.
     1 - match not found.
    99 - see help screen.
    re system error: exit errorlevel = DOS error + 100 decimal.

FWKFF is especially convenient for use with DIRGUIDE.TXT, routinely
prepared using FWKDG and FWKCS. (DIRGUIDE.TXT provides a convenient
list of all the files on a system, their d:\path\, and the
identification of the respective text directories DIR0-DIR999, if
present and suitably formatted, in which they appear.)

FWKFF can also be used to search a sorted text file containing
multiline entries where each text block starts flush left and the
rest of the lines in that text block are indented; e.g., multiline
file descriptions used with Clark Development Company's PCBoard.
(for a detailed example, see "Using CRLF0, FWKCS, FWKFF, and FWKDG
together" (without the quotes) in FWKCS204.REF)

The new GT.BAT (below) uses FWKFF.COM.
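The kind of lookup described for FWKFF (case sensitive match starting in the left column of a sorted file, with no index) can be sketched with a binary search. This is not FWKFF's code and its internal method is not stated here. The function name find_lines, the choice to return every consecutive matching line, and the single_line-only input are assumptions; only the errorlevel values 0 and 1 and the effect of option c are taken from the help text above.

```python
from bisect import bisect_left


def find_lines(sorted_lines, search, capitalize=False):
    """Return (errorlevel, matches) for a prefix search of sorted lines.

    sorted_lines must be in ascending ASCII order, one entry per line.
    errorlevel follows the help text: 0 = match found, 1 = not found.
    """
    if capitalize:
        search = search.upper()  # like option c; lower case is not found
    # First line that is not less than the search string; any line that
    # starts with the search string sorts at or after this position.
    i = bisect_left(sorted_lines, search)
    matches = []
    while i < len(sorted_lines) and sorted_lines[i].startswith(search):
        matches.append(sorted_lines[i])
        i += 1
    return (0 if matches else 1), matches


if __name__ == "__main__":
    # Hypothetical sample in the spirit of DIRGUIDE.TXT.
    sample = [
        "ALPHA.ZIP  C:\\FILES\\",
        "BETA.ZIP  C:\\FILES\\",
        "BETA2.ZIP  D:\\MORE\\",
        "GAMMA.ZIP  D:\\MORE\\",
    ]
    level, found = find_lines(sample, "BETA")
    print(level, found)
```

Because the file is sorted, each lookup needs only about log2(number of lines) comparisons before the matching run is reached, which is why no separate index is needed in this sketch.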
Search time reported for finding a file in a 3.79 Meg copy of
DIRGUIDE.TXT, using a 486, was less than half a second; running the
same test with an 8088, the reported search time was less than 2
seconds.

This simple search program can be used with many different sorted
files.

New .COM program: FWKTLCSL.COM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Special utility to truncate a file at end of last " cs " line.
    (for S_REVCSL.BAT crash recovery)

New .COM program: FWK1D21.COM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Special utility to convert long_cs column_17 structure flag 1Dh
    to 21h, or vice versa; e.g., for sending long_cs data in email.

New .COM program: FWKFACC.COM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Utility to test file accessibility, e.g., for use in batch
    programs in a network or multitasking environment. If file is not
    accessible, FWKFACC returns the extended DOS errorlevel, which
    can be used to control branching in .BAT programs.

New or changed .BAS and .BAT programs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
See item 6, below, about change in calling S_REVCSL.BAT.

 1. New: SET_CS.BAT, to set internal defaults for contents_signature
    length and usage.
 2. New: BLOKSORT.BAT, to sort single_line or multiline text block
    files, where the first line of each multiline block of text
    starts flush left, and all other lines in each block are indented
    at least one space (every line must end with 0d,0a); shipped with
    maximum multiline block size = 2000 bytes; can be set to up to
    4000 bytes; maximum filesize is limited by various operating
    systems to slightly less than 2, or slightly less than 4,
    Gigabytes, drive space permitting.
 3. New: GT.BAT, template for batch program to get text re file
    location and/or text description. This template should be edited
    to set the correct d:\path for where the files DIRGUIDE.TXT,
    FILEDESC.SRT, and (optional file) ALL_F.SRT are stored. GT.BAT
    uses the new FWKFF.COM (above), and may use FWKCS and CRLF0.
 4.
    New: MISSED.BAT, to find and test zipfiles which lack long_cs
    data.
 5. Modified REVCSL.BAT and S_REVCSL.BAT, so that in transferring
    unique signatures from the prior data base, the zipfile contents
    signature ("zcs") of a zipfile which contained only one file is
    dropped in favor of the contents signature of the file it
    contained. Added support for long contents_signatures.
 6. Modified S_REVCSL.BAT, to support use of long
    contents_signatures; added crash recovery, to resume building new
    data base after interruption (uses new FWKTLCSL.COM).
    S_REVCSL.BAT now has two input digits on the command line. The
    new first digit specifies buffering for output, using the new
    FWKCS network commit option.
 7. Modified CSAMACS.BAT and SETFWKCS.BAT to automatically install
      #v; - (swap most of FWKCS.EXE out of memory when running virus
            test programs) as a default setting in macro [x]. Virus
            test programs have been getting bigger, and may not be
            able to use memory above 1 Meg because BBS software may
            already be using it.
      v7  - do not count VENDINFO.DIZ files as zip in zip.
      z7  - if file has .ZIP extension, the file is required to be a
            zipfile.
    If you wish to remove #v; , v7 , or z7 from macro [x], run
    GET_DFLT in your \CSA directory to create PUT_DFLT.BAT, search
    PUT_DFLT.BAT for the first line which ends with dX (it starts
    with "FWKCS"), remove the triplet "#V;" (without the quotation
    marks), and/or remove the pair "V7" (without the quotation
    marks), and/or remove the pair "Z7" (without the quotation
    marks), and run the modified copy of PUT_DFLT.BAT.
 8. Modified GET_DFLT.BAT, so that if it finds a prior PUT_DFLT.BAT
    in the current directory, it renames prior PUT_DFLT.BAT as
    PUT_DFLT.OLD ... PUT_DFLT.OL7.
 9. Modified CSM.BAT and CSAM.BAT on_line help menus, to add new
    material, etc.
10. Numerous changes, for compatibility with changes and new features
    in FWKCS.EXE described above.
11. Corrected a bug in FINISH.BAT which in some cases reduced
    execution speed.
12. Various other changes.
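The text block layout which BLOKSORT.BAT is described as handling (item 2 above: first line of each block flush left, all other lines of the block indented at least one space, every line ending with 0d,0a) can be illustrated with a short sketch. This is not the method used by BLOKSORT.BAT. The function name sort_blocks is hypothetical, ordering blocks by their full text is an assumption, and the 2000/4000 byte block size limits are not modeled.

```python
def sort_blocks(text):
    """Sort text blocks while keeping each block's lines together.

    A block starts with a flush_left line; following lines which begin
    with a space belong to the same block. Lines end with CR LF.
    """
    blocks = []
    for line in text.split("\r\n"):
        if line == "":
            continue  # skip the empty piece after the final CR LF
        if line.startswith(" ") and blocks:
            blocks[-1].append(line)  # indented: continue current block
        else:
            blocks.append([line])    # flush left: start a new block
    blocks.sort()                    # ascending order, block by block
    return "".join(line + "\r\n" for block in blocks for line in block)


if __name__ == "__main__":
    # Hypothetical multiline file descriptions.
    sample = (
        "ZETA.ZIP  second file\r\n"
        "  more about zeta\r\n"
        "ALPHA.ZIP  first file\r\n"
        "  more about alpha\r\n"
    )
    print(sort_blocks(sample), end="")
```

Sorting whole blocks rather than individual lines keeps indented description lines attached to the entry they describe.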

Changes in docs:
~~~~~~~~~~~~~~~~
 1. Various changes re the new long cs, new options, etc.

Notes:
~~~~~~
 1. The remote lookup functions, including Rcrosref, are available in
    a relatively small kit, FWKLU204.ZIP, released 1995 Aug 30. Most
    of the remote lookup functions (but without Rcrosref), are
    available in a special, even smaller kit, FWKLZ204.ZIP, released
    1995 Aug 30. FWKLZ204.ZIP does not require registration.
    If you run a BBS, you may wish to get FWKLU204.ZIP and
    FWKLZ204.ZIP for your users, especially if your BBS is a "feeder
    BBS" and many of your users are other BBS's. The kits come with
    instructions, and FWKLU204.ZIP contains a short bulletin,
    FWKLU204.BLT, suitable for posting.
 2. The longer form of FWKCS contents_signature includes the 32_bit
    CRC, the uncompressed file length, and the "MD5" hash:
    Thanks to R. Rivest, of MIT Laboratory for Computer Science and
    RSA Data Security, Inc., for introducing the MD5 algorithm and
    placing it in the public domain. (see RFC1321, April 1992,
    including the statement, "The MD5 algorithm is being placed in
    the public domain for review and possible adoption as a
    standard.").
    FWKCS uses an algorithm which generates the 128_bit "MD5" hash.
    Noting also the work of Colin Plumb (1993), there are at least
    four different logical sequences which satisfy the truth table
    for generating an MD5 hash; the one used here is different from,
    and faster than, the one provided by Rivest. Note is made also of
    work by Ray Gwinn (1995). The high_speed 32_bit and 16_bit
    embodiment for the algorithm used in this application is by Fred
    Kantor.
    To the extent that the code used in FWKCS.EXE may, directly or
    indirectly, be derivative of the C program copyrighted (1991) by
    RSA Data Security, Inc. ("RSA"), please note RSA's public
    statement,
    'License to copy and use this software is granted provided that
    it is identified as the "RSA Data Security, Inc.
    MD5 Message-Digest Algorithm" in all material mentioning or
    referencing this software or this function.'
    'License is also granted to make and use derivative works
    provided that such works are identified as "derived from the RSA
    Data Security, Inc. MD5 Message-Digest Algorithm" in all material
    mentioning or referencing the derived work.'
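The notes under "Changes in FWKCS.EXE" say that the long zcs uses the MD5 128_bit numbers summed mod 2^128, written lowest byte first, skipping entries with CRC=0 and filelength=0, and using the MD5 of an empty file when every entry is empty. The sketch below illustrates only that MD5 portion, using Python's standard hashlib rather than FWKCS's own code. The function names and the (crc, length, data) tuple layout are hypothetical, and the exact layout of the full long zcs (which also carries the 32_bit CRC and file length) is not reproduced.

```python
import hashlib

MASK128 = (1 << 128) - 1


def md5_style(value):
    """Write a 128_bit number lowest byte first, two hex digits each."""
    return value.to_bytes(16, "little").hex()


def md5_number(data):
    """MD5 digest of data, read as a 128_bit number, lowest byte first."""
    return int.from_bytes(hashlib.md5(data).digest(), "little")


def zcs_md5_part(entries):
    """entries: iterable of (crc32, uncompressed_length, data_bytes).

    Hypothetical helper; returns the summed MD5 as a hex string.
    """
    total = 0
    counted = 0
    for crc, length, data in entries:
        if crc == 0 and length == 0:
            continue  # zero_length entries are skipped
        total = (total + md5_number(data)) & MASK128  # sum mod 2^128
        counted += 1
    if counted == 0:
        # every entry was empty: use MD5(nul) once, not cumulatively
        total = md5_number(b"")
    return md5_style(total)


if __name__ == "__main__":
    # Hypothetical entries; the CRC values are placeholders.
    demo = [(1, 5, b"hello"), (0, 0, b""), (2, 5, b"world")]
    print(zcs_md5_part(demo))
```

Because addition mod 2^128 is commutative, the result in this sketch does not depend on the order of the entries, and a single non-empty entry yields the same string as the usual MD5 hexadecimal form of that file.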