WHAT?format
A file format recognition utility for IBM PCs and compatibles
Version 30.0
Boots & Pepper

WHAT?format is copyright (c) Boots & Pepper 1989-91. It may be freely distributed provided no changes are made to WHAT.EXE or the WHATDOC files. It may not be bundled or distributed with commercial software without the author's permission.

WHAT is shareware. If you find it useful, please register your copy by sending $15 (NOK 100) to:

    Boots & Pepper, Pilestredet 97, N-0358 Oslo, Norway.
    Giro: 1730.18.96921

Registration entitles you to a free copy of the next public release of the program.

Other feedback (bug reports, information about formats not currently supported, whatever) may be sent to the address above or to:

    CompuServe: 76057,246 (Steve Pepper)
    email: pepper@falch.no (from mid-May 1991)

Boots & Pepper is Steve Pepper and Dag (Boots) Hasvold. We run Computertext BBS, a bulletin board specialising in aspects of computing in the printing industry (DTP, PostScript, SGML, text conversion etc.), on +47-2-162650 (2400-8N1).

Introduction

WHAT?format started life as a simple utility whose purpose was to distinguish between text files created by WordStar and WordPerfect. It was originally written for people working in a typesetting house which received a lot of raw text on floppy disks, all too often without any information regarding the system that had created the files.

As time has gone by, the program's capabilities have been extended so that the present version can distinguish between the native formats of thirty of the most common word processors. In many cases it will also distinguish between files created by different versions of the same program, e.g. WordPerfect 4.x (4.2 and earlier), 5.0, 5.1, etc.

In addition to native word processor formats, version 30.0 also supports a number of text interchange formats (e.g.
Document Content Architecture and Rich Text Format), various text-only formats (ASCII, DOS, EBCDIC, etc.), and a few print-formatted/page description formats (PostScript, PCL, etc.).

Although it is primarily concerned with text files, WHAT will also recognise an assortment of other common data formats stemming from database, spreadsheet, graphics and other applications. (A complete list of supported formats is given in Appendix A.)

Other features include the ability to create straight hex dumps and character set maps. These are described in detail below under Options.

Usage

To use WHAT in its basic mode as a format recognition utility, simply type WHAT at the DOS prompt, followed by the name of the file to be analysed. The file name may include optional drive and path specifications, as well as standard DOS wildcards:

    C:\>WHAT myfile.doc      (analyse myfile.doc in current directory)
    C:\>WHAT a:*.*           (analyse all files on disk in drive A:)

WHAT takes each file matching the file specification, writes its name and size on the screen, analyses it and reports the result. Unrecognised files are reported as being of UNKNOWN FORMAT.

The file name and size are written to DOS' CON device; the result is written to the standard output device (normally the console), making it possible to send the result to a file using the usual DOS redirection techniques.

The DOS Errorlevel

Each major format supported by WHAT has its own format code that distinguishes it from all other major formats. For example, WordPerfect has the format code 32. Upon exiting to DOS, WHAT sets the DOS errorlevel variable to the format code corresponding to the result it arrived at for the last of the files that were analysed. Thus, if the last file analysed by WHAT turns out to be in WordPerfect format, the DOS errorlevel will be set to 32.

This feature can be used in batch files, both to automate various kinds of file processing (conversion, cataloguing, etc.)
and as a way of ensuring that files of the wrong type do not get sent through a particular process. An example of such a batch file, WPCONV.BAT, is given in Appendix B. (See the description of the /E and /F switches below for more information on using the errorlevel feature.)

NOTE: WHAT's format codes change from version to version, as new formats are added to the program, so be sure to update your batch files when you receive a new version of WHAT.

The %WHAT% variable

Where possible, WHAT attempts to distinguish between different versions of the same file format. Thus a WordPerfect file will be identified as version 4.x (meaning 4.2 or older), 5.0, 5.1 or whatever. In the case of bitmap files, WHAT will often report the size of the image, and possibly also the number of colours. With a file in MacBinary format, WHAT reports the file's TYPE and CREATOR. It is possible to test for all these kinds of information using the %WHAT% environment variable.

Before exiting to DOS, WHAT looks to see if the environment variable %WHAT% exists. If it does, WHAT sets it to the exact result shown on the screen for the last file analysed, e.g. "WordPerfect 5.1" or "PCX IV 640x480x256". (Note that this string can be up to 19 characters in length. An error is reported if there is not enough room in the environment.)

Appendix B gives an example (WPCONV.BAT) of a batch file that uses the %WHAT% variable to choose the right conversion program for each version of WordPerfect.

Options

NOTE: All options can be used in either upper or lower case, and may be preceded by either a slash or a hyphen. Those used in conjunction with file names may appear either before or after the file specification.

Character set

Usage: WHAT /C <filespec> [ >filename ]

Creates an on-screen map of all characters appearing in the first file that matches <filespec>. The user is then given the option of writing more detailed information (including the offset and context of the first occurrence of each character) to the standard output.
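To give an idea of the kind of information /C collects, here is a minimal Python sketch. It is an illustration only, not WHAT's own code: the function names (char_map, context, render_matrix) and the X/. rendering are assumptions made for this example, whereas the real program draws its matrix with the underline attribute.

```python
def char_map(data: bytes) -> dict:
    """Map each byte value that occurs in data to the offset of its first occurrence."""
    first_seen = {}
    for offset, value in enumerate(data):
        if value not in first_seen:
            first_seen[value] = offset
    return first_seen


def context(data: bytes, offset: int, width: int = 8) -> bytes:
    """Return up to `width` bytes on either side of offset, for display."""
    return data[max(0, offset - width): offset + width + 1]


def render_matrix(first_seen: dict) -> str:
    """Draw a 16x16 grid: 'X' where a byte value occurs, '.' where it does not."""
    rows = []
    for high in range(16):
        cells = ("X" if ((high << 4) | low) in first_seen else "." for low in range(16))
        rows.append(f"{high:X}x " + " ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical sample data: plain text with one stray control byte (17h).
    sample = b"Hello,\r\nworld\x17!"
    seen = char_map(sample)
    print(render_matrix(seen))
    for value, offset in sorted(seen.items()):
        print(f"{value:02X}h first at offset {offset}: {context(sample, offset)!r}")
```

Each row of the grid covers sixteen byte values (row 4x is 40h to 4Fh), and the detailed listing gives the offset and surrounding bytes for the first occurrence of every value found.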
If redirection has been specified on the command line, the result will be a text file suitable for viewing with Vern Buerg's LIST.

The character set option is particularly useful with plain text files that do not use one of the standard character sets. Note that /C uses the underline attribute in order to create a 16x16 character set matrix on the screen. This gives better results on monochrome than on colour monitors.

Errorlevel

Usage: WHAT /E [ >filename ]

Generates a list of format codes in a form which can easily be modified to create batch files like those shown in Appendix B. The output can be sent to a file using DOS redirection.

Format code

Usage: WHAT /F<string>

Presents a list of all supported major formats whose names contain the substring <string>, together with the corresponding format codes. The operation is case insensitive. For example:

    WHAT /fperfect

will give the following result:

    32 WordPerfect
    46 DataPerfect
    56 PlanPerfect
    62 DrawPerfect

Hex dump

Usage: WHAT /X <filespec> [ >filename ]

Creates a hex dump of the first file that matches <filespec>. The output contains only hex values, with no file offsets or character equivalents. The main purpose of this switch is to simplify the analysis of long and complicated formatting instructions contained within a text file. (The resulting file is easy to edit since it only contains hex values.) If you merely want to view the contents of a file in hex format, you will be better off using a file browsing utility like LIST, PC-Tools or Norton Utilities that also displays file offsets and ASCII equivalents.

Help

Usage: WHAT /H

Shows WHAT's help screen. The help screen is also shown when WHAT is invoked without any command line parameters.

List formats

Usage: WHAT /L

Presents an on-screen list of all file formats supported by the current version of WHAT.

Quiet mode

Usage: WHAT /Q

Suppresses screen output (for use in batch files).

Redirection

Usage: WHAT /R

Enables redirection of all three elements of WHAT's screen output (i.e.
the file's name, size and format). Normally only the format is written to DOS' standard output.

Commentary

WHAT is not foolproof, nor is it meant to be. It belongs to the venerable family of Q&D utilities, and its basic philosophy is to be right as often as possible, but without spending all day about it. It is not as Quick as it could be, and it is no doubt a good deal Dirtier than it would have been if I'd been a real programmer.

That said, it has been tested fairly thoroughly on a number of systems and performs as described in this documentation. No problems have been reported that would constitute a threat to your computer or data, but as always, no responsibility is taken for damage resulting from incorrect or careless use of the program.

How it works

WHAT works by scanning the beginning of a file and looking for specific formatting features that can identify its format. The precise features looked for vary. Some applications, especially newer ones, create files with headers containing an ID-tag, a kind of "thumbprint" consisting of a special sequence of bytes that the application itself uses to determine whether or not the file is in its native format. For example, all files created by WordPerfect 5.0 or later (and other WP Corp products) begin with the byte sequence FF 57 50 43 (-1,"WPC"). These kinds of files are an easy match, and WHAT will handle them quickly and flawlessly.

Problems

Other programs present greater problems, especially those with a native format closely akin to pure ASCII. PC-Write, for example, produces ASCII files if the document doesn't contain guide line font commands or text with attributes such as bold, underline etc. Such a file will be reported as being ASCII by WHAT.

If, on the other hand, the PC-Write document contains a few words that are underlined, the file will resemble an ASCII file interspersed with the odd 17h, a "non-ASCII" character.
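The two situations described so far (a file with an unmistakable header tag, and a near-ASCII file with the odd stray byte) can be illustrated with a short Python sketch. It is not WHAT's actual logic: the function names and the result strings are invented for the example. The only facts taken from this document are the FF 57 50 43 signature and the set of characters that Appendix A lists for an ASCII text file.

```python
# Header tag quoted in the text: FF followed by "WPC".
WPC_SIGNATURE = bytes([0xFF, 0x57, 0x50, 0x43])

# Control characters allowed in an ASCII text file according to Appendix A:
# TAB (09), LF (0A), FF (0C), CR (0D) and Control-Z (1A).
PLAIN_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D, 0x1A}


def has_wpc_header(data: bytes) -> bool:
    """True if the data starts with the WP Corp header tag."""
    return data.startswith(WPC_SIGNATURE)


def stray_bytes(data: bytes) -> dict:
    """Count bytes that are neither 20h..7Eh nor an allowed control character."""
    counts = {}
    for value in data:
        if 0x20 <= value <= 0x7E or value in PLAIN_CONTROLS:
            continue
        counts[value] = counts.get(value, 0) + 1
    return counts


def guess(data: bytes) -> str:
    """Hypothetical three-way verdict: tagged header, plain ASCII, or ambiguous."""
    if has_wpc_header(data):
        return "WP Corp header (unambiguous)"
    strays = stray_bytes(data)
    if not strays:
        return "plain ASCII"
    return "ambiguous: " + ", ".join(f"{v:02X}h x{n}" for v, n in sorted(strays.items()))


if __name__ == "__main__":
    print(guess(b"\xffWPC" + bytes(12)))
    print(guess(b"Just text.\r\n"))
    print(guess(b"a \x17b\x17 c"))
```

For the last sample, which imitates an underlined word, the sketch can only say that something other than plain ASCII is present. Deciding which program wrote the file takes the kind of judgement discussed in this section.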
A stray 17h like this will probably be enough for WHAT to reach a verdict of PC-Write, but it is not difficult to imagine that the file could have been produced by another program and that the 17h means something quite different. In such borderline cases a programming decision has been made based upon the assumed popularity of particular applications. (If you disagree with the decision, don't hesitate to let me know!) When WHAT makes a mistake, it is often in this kind of situation.

More problems

Another example will further illustrate the problems involved in differentiating between word processing systems that use similar formats. I recently downloaded an ARChive file containing a number of text files from a bulletin board system. These files looked like ASCII when I viewed them with LIST, but WHAT said they were WordPerfect 4.x. In actual fact they turned out to be UNIX-type ASCII files with line endings marked by a single LF instead of the CR/LF pair used under DOS. (The archive file seems to have been put together on an Amiga.) LF (0Ah) is the code used by WordPerfect to represent a hard return (hence WHAT's diagnosis), so the files could equally well have been prepared using WordPerfect (except that they also had hard returns where there should have been soft returns).

The question here is whether the result reached by WHAT was acceptable. My answer, based mainly upon pragmatic considerations, is yes: wherever the file might have come from, it is now on a PC (otherwise I wouldn't be using WHAT!), and if it is to be edited on a PC, the best program to use is WordPerfect. Most ASCII editors would complain bitterly about the missing CR at the end of each line, but WordPerfect is over the moon, and it will even allow me to regenerate most of the soft returns (by reading in the file, saving it as DOS text, and reading it in again, this time as DOS text, using the option of converting hard returns in the hyphenation zone to soft returns).
So in this case, WordPerfect is the best answer even though, strictly speaking, it is the wrong one.

Dirty tricks

If there is one thing that really slows WHAT down, it is a lot of files in unsupported formats. A couple of dirty tricks are used to minimise this problem.

Firstly, WHAT never reads more than the first 5 KB of a file, reasoning that if it hasn't made up its mind by then, it probably never will. This could in theory lead to problems. For example, a PC-Write document consisting of 2-3 pages of straight ASCII followed by a few pages of heavily formatted text will be judged to be ASCII, but you'll be in trouble if you try to import it to, say, WordPerfect as "DOS Text". Such situations occur so rarely in practice, however, that the speed advantages of just looking at the beginning of a document outweigh the potential disadvantages.

Secondly, WHAT doesn't bother to try to ascertain whether a COM-file really is executable: the present version quite simply ignores files with the extension .COM (except when the only files that match the file specification have this extension, in which case WHAT will attempt to analyse the last one, hopefully unsuccessfully).

ASCII and DOS files

The criterion for differentiating between what WHAT calls "ASCII text files" and "DOS text files" is whether or not characters from the Extended ASCII set appear in the file. An ASCII file can only contain 7-bit characters. This is an important distinction in certain European countries where accented characters may be represented by national versions of the (7-bit) ISO 646 character set, so English-speaking users will just have to live with it!

In neither format does WHAT expect to encounter any control characters other than TAB (09h), CR (0Dh), LF (0Ah), FF (0Ch) or a single Control-Z end-of-file marker (1Ah).

Feedback

The biggest problem with a program like WHAT is keeping it up to date.
New word processing programs are appearing all the time, and most of them use their own native format. Occasionally the format is described in the documentation that accompanies the application, but usually that is not the case. Some software publishers are willing to make the details of the format available to developers; others (like Microsoft and IBM) keep them a closely guarded secret.

Upgrades of existing programs also present problems. As new formatting features are added to the application, the native format changes in order to accommodate them. Sometimes these changes amount to no more than the addition of new codes to the old format (as when WordPerfect was upgraded from 4.1 to 4.2). More major revisions, on the other hand, can lead to a complete revamping of the native format (as was the case with WordPerfect 5.0).

WHAT has been designed as far as possible to be able to handle new versions of formats that are already supported, but no guarantees are made. (I am fairly certain that WHAT will recognise documents created by version 6.5 of WordPerfect, but what happens with 9.0 documents is anybody's guess!)

Keeping abreast of all these changes and additions is no easy matter (I have yet to find a company that runs a mailing list for people interested in this kind of information!). What that means is that WHAT can only be improved and kept up to date with the assistance of its users.

So if you find that WHAT makes a mistake when analysing a supported format, experience trouble with the latest version of a particular program, or can provide information on file formats not currently supported by WHAT, please do not hesitate to get in touch. The more example files and technical information you can provide for a particular format the better. Your efforts will be rewarded with an acknowledgement in the next version of WHATDOC and a typeset copy of this one.
(The "wish list" for the next version of WHAT includes support for CGM, CUT, DXF, GEM, and PIC graphics; Quattro, PFS, Q&A, PageMaker, and the latest versions of DisplayWrite, Framework and Lotus 1-2-3; more information on Word for Windows and Excel; and whatever else you and I can get our hands on!)

Thanks to...

Dag Hasvold, Aron Gurski, Gisle Hannemyr, Truls Meland, Tor Nordahl, Mike Robertson, Mats Tande and Chris Wolf for suggestions and help.

Send comments, files and format documentation to Steve Pepper, Pilestredet 97, N-0358 Oslo, Norway (email: pepper@falch.no), or log on to Computertext BBS (2400 8-N-1) +47-2-420825.

One final thing: don't bother suggesting that the next version of WHAT ought to be able to recognise non-DOS disk formats unless you are prepared to tell me how to implement such a feature. I know it would be enormously useful, but I am a typographer, not a programmer!

Steve Pepper
Oslo, 19 April 1991

Appendix A
Text and data formats supported by WHAT?format v. 30.0

Here is a complete list of all formats supported by version 30.0 of WHAT. Where WHAT reports extra information about a format (other than version number), that information is noted after the format name. Please support WHAT by helping to make this list more comprehensive!
Word processors

    Ability WP
    Acto WP
    Ami Professional
    ASCII text file (09,0A,0C,0D,1A and 20..7E)
    ASCII even parity
    Cicero
    DisplayWrite
    DOS text file (as ASCII, plus 80..FE)
    DSI Tekst
    EBCDIC file
    Enable WPF
    Framework
    Manuscript
    MASS-11
    Microsoft Word
    MicroWord
    Multimate
    Notis WP
    OfficeWriter
    Ordbehandling
    Palantir
    PC-Write
    Samna Word
    Sprint
    Super WP
    Symphony
    Ventura Publisher
    Volkswriter
    Windows Write
    Word for Windows
    WordPerfect
    WordStar
    WordStar 2000
    XPress tagged ASCII
    XyWrite

Formatted text

    PostScript              Structuring Conventions version
    DCA/RFT (DCA Revisable Form Text)
    DEC DX
    HP LaserJet (PCL)
    IBM DCF-GML (Generalised Markup Language)
    RTF (Microsoft Rich Text Format)

Data bases

    Ability
    DataPerfect
    dBASE
    Enable
    Reflex

Spreadsheets

    Ability
    DIF
    Enable
    Excel
    Lotus 1-2-3
    PlanPerfect
    SuperCalc
    SYLK (Microsoft Symbolic Link)

Graphics

    Ability
    Ami Metafile
    DrawPerfect
    EPSF (Encapsulated PostScript)
    GIF                     resolution and number of colours
    IFF                     resolution for ILBM files
    IMG                     width and height
    Lotus PIC
    MacPaint
    Microsoft Paint         width and height
    PCX                     version, size and number of colours
    TIFF                    version and type (Motorola or Intel)
    WPG                     version and type (bitmap/drawing)

Various

    Ability comms
    ARC archive
    DOS Code Page font
    EXE file
    LZH archive
    MacBinary               TYPE and CREATOR
    PostScript outline font
    StuffIt! archive
    Windows EXE file
    Windows font
    ZIP archive
    Miscellaneous file types from WordPerfect Corp.

Appendix B
Example batch files using the DOS errorlevel and %WHAT% variable

SORTPICS.BAT (using the DOS errorlevel)

    @echo off
    rem
    rem SORTPICS.BAT
    rem
    rem Sort your pics using WHAT?format!
    rem
    rem Change to a directory containing an
    rem assortment of graphics files and give
    rem the command:
    rem
    rem for %f in (*.*) do sortpics %f
    rem
    rem The files are copied to different
    rem directories depending on their format
    rem
    if not exist %1 goto :end
    what %1
    rem IF ERRORLEVEL n is true for n or higher,
    rem so the highest codes are tested first.
    if errorlevel 72 goto :end
    if errorlevel 71 goto :TIFF
    if errorlevel 70 goto :PCX
    if errorlevel 69 goto :end
    if errorlevel 66 goto :IMG
    if errorlevel 65 goto :end
    if errorlevel 64 goto :GIF
    goto :end
    :TIFF
    copy %1 c:\graphics\tiff
    del %1
    goto :end
    :PCX
    copy %1 c:\graphics\pcx
    del %1
    goto :end
    :IMG
    copy %1 c:\graphics\img
    del %1
    goto :end
    :GIF
    copy %1 c:\graphics\gif
    del %1
    goto :end
    :end

WPCONV.BAT (using the %WHAT% variable)

    @echo off
    rem
    rem WPCONV.BAT
    rem
    rem Automate conversion using WHAT?format!
    rem
    rem Change to a directory containing
    rem assorted WordPerfect files and give
    rem the command:
    rem
    rem for %f in (*.*) do wpconv %f
    rem
    rem The files are converted to DCA format
    rem using the correct version of WP's
    rem CONVERT.EXE.
    rem
    if not exist %1 goto :end
    rem Create the variable first: WHAT only
    rem fills in %WHAT% if it already exists.
    set what=what
    what %1
    if errorlevel 33 goto :notwp
    if errorlevel 32 goto :wp
    if errorlevel 1 goto :notwp
    goto :end
    :wp
    if "%what%"=="WordPerfect 5.1" goto :wp51
    if "%what%"=="WordPerfect 5.0" goto :wp50
    if "%what%"=="WordPerfect 4.x" goto :wp4x
    echo New version: %what%
    goto :end
    :wp51
    convwp51 %1 d:\DCAstuff\%1 1 1 std.crs
    goto :end
    :wp50
    convwp50 %1 d:\DCAstuff\%1 1 1
    goto :end
    :wp4x
    convwp42 %1 d:\DCAstuff\%1 1 1
    goto :end
    :notwp
    echo File is not WordPerfect format!
    :end
    set what=
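
Both batch files rely on the fact that IF ERRORLEVEL n is true when the errorlevel is n or higher, which is why the tests run from the highest code downwards. The following Python sketch mimics the ladder in SORTPICS.BAT so that the routing can be checked without running DOS. The function name dispatch and the table layout are assumptions made for this illustration; the thresholds and labels are copied from the SORTPICS.BAT listing.

```python
# Ladder copied from SORTPICS.BAT: (threshold, label), tested top to bottom.
SORTPICS_LADDER = [
    (72, "end"),
    (71, "TIFF"),
    (70, "PCX"),
    (69, "end"),
    (66, "IMG"),
    (65, "end"),
    (64, "GIF"),
]


def dispatch(errorlevel: int, ladder=SORTPICS_LADDER, default: str = "end") -> str:
    """Return the label that the first matching IF ERRORLEVEL line jumps to.

    IF ERRORLEVEL n succeeds when the exit code is n or greater, so the
    first threshold that does not exceed the code decides the outcome.
    """
    for threshold, label in ladder:
        if errorlevel >= threshold:
            return label
    return default  # corresponds to the bare "goto :end" after the ladder


if __name__ == "__main__":
    for code in (32, 64, 65, 66, 68, 70, 71, 72):
        print(code, "->", dispatch(code))
```

Note that under this ladder every code from 66 to 68 reaches :IMG, not just 66. Whether 67 and 68 are assigned to other formats in a given version can be checked with WHAT /E.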