PCI EIDE CONTROLLER FLAWS REV 18

Revision 18: 1995 October 5

SUMMARY OF RECENT CHANGES

1) EIDEtest 1.8 and CDtest 1.1 released. The only change is a warning to run your tests with background execution configured on.

2) Intel's CtrlTest to check for both the RZ-1000 and CMD-640 chips is now available under the name RZtest.exe. Beware! The MS Word documentation contains a macro virus. See http://www.intel.com/procs/support/rz1000/index.html

3) Fixpack 10 contains the necessary fixes for Warp. Beware! There are leaked, buggy copies of Fixpack 10 out on the net.

4) PJ19409.zip has been changed. It now contains all the fixes necessary for the RZ-1000 and for the CMD-640. Follow the installation instructions carefully. If you just follow your nose, chances are you will be worse off than you are now. This fix has been incorporated into Fixpack 10.

5) Intel contradicts itself on the performance hit from disabling prefetch to bypass the flaws. Robert Schultz (robert.schultz@execnet.com) reports a 50% performance hit after applying the CMD-640 fix. Marco Trunzer (ujjm@rzstud1.rz.uni-karlsruhe.de) reports a 15% slowdown. There are still no benchmarks on the effects on background bus-intensive processes.

6) Dell is upgrading its XPS 90 to avoid the flawed chips, but they are keeping the old kiss of death name.

7) Micron P5-90 M54Pi-N 11P has a flawed CMD-640 chip on the primary channel, but a working SMC chip on the secondary channel. By moving your EIDE devices to the secondary channel, you can avoid the flawed chip.

8) The precise mechanism of failure for both the RZ-1000 and CMD-640 is now understood. The RZ-1000 has two different flaws and the CMD-640 has five. In addition, most motherboard manufacturers using these two chips hooked them up improperly.

9) SMC 37650 controller is probably ok.

10) NT 3.5 not immune after all. It handles the RZ-1000 but not the CMD PCIO 640. Fix is available.

11) Software from IBM and Intel to detect both faulty chips directly.
12) Explanation of what "Intel Inside" means.

13) Dell offers upgrade BIOS to turn off the prefetch buffers.

14) List of safe and unsafe operating system software.

15) IBM hardware is clean.

16) Stonewall rebuilds. Intel recants on offer to replace defective motherboard.

17) Problem is showing up under Windows For WorkGroups in 32-bit mode.

INTRODUCTION

There are serious flaws affecting about 1/3 of all PCI motherboards. The flaws affect any motherboard or EIDE controller paddleboard containing the PC-Tech RZ-1000 PCI EIDE controller chip or the CMD PCIO 640 PCI EIDE controller chip. The flaws affect motherboards from ASUSTeK, AT&T, DEC, Dell, Gateway, Intel, Micron, NEC, Zeos and others. Since Intel makes so many of the motherboards sold under other brand names, the flaws affect many machines, both 486 and Pentium PCI.

The flaws show up most frequently when you run a true multitasking operating system such as OS/2 Warp or NT. They also show up under Windows For WorkGroups in 32-bit mode during tape or floppy backup and restore. In theory the flaws could do damage under DOS, DESQview, Windows and Windows For WorkGroups in 16-bit mode, but so far there have been no damage reports. Windows-95 contains code to bypass the flaws.

The RZ-1000 has two flaws. The CMD-640 has those same two flaws plus three others. To make matters worse, most motherboard manufacturers using these two flawed chips connected them up incorrectly.

There are software bypasses for these flaws. However, the Warp fix for the CMD-640 reduces disk performance by 15 to 50%. The RZ-1000 fix has negligible impact on disk I/O, though it can slow down background processes. I would advise buying new hardware to bypass the CMD-640 flaws, and living with software fixes to bypass the RZ-1000 flaws.

WHAT ARE THE SYMPTOMS?

When you are using an IDE or EIDE hard disk attached to the EIDE motherboard port, the flaws subtly corrupt your files by randomly changing bytes every once in a while.
The flaws introduce bugs into EXE files, subtle errors into your spreadsheets, stray characters into your word processing documents, changes to the deductions in last year's tax return files, and random changes to engineering design files.

This corruption happens when you are simultaneously using your EIDE or IDE hard disk and some other device, most commonly the floppy drive or mag tape backup. The same sorts of problem may occur on reading a CD-ROM drive attached to an EIDE port.

IS IT SERIOUS?

These flaws are nasty. They are causing hundreds of times more havoc than the infamous Pentium divide flaw ever did.

"I am Pentium of Borg. You will be approximated."

Not only does this corruption occur, but it occurs quietly, often going unnoticed. If the system crashes, you usually put the blame on the operating system software, or the application. It might actually be a faulty RZ-1000 or CMD-640 EIDE controller chip nailing you.

When a directory becomes corrupted, you may not notice it until the damage is irreparable. If a spreadsheet application reads a comma-delimited ASCII file, it may simply miss a few bytes in a number, an error that may go unnoticed, and that error could cascade through the rest of the spreadsheet.

If you have had unexplained crashes in OS/2, you have probably experienced the problem, and should make a thorough check for hidden corruption. Remember that the bug may only slightly alter your data, and the corruption may not be obvious.

Keep in mind that not every problem is the RZ-1000's or the CMD-640's fault. Overheating, unrelated hardware faults and design flaws, or software bugs can cause similar symptoms. DMA channel conflicts also cause similar symptoms. Happily, EIDEtest and CDtest can unmask all manner of simultaneous I/O faults.

Unfortunately, correcting the problem just stops further file corruption. It will not help to clean up the existing damage to your files. Right now, the focus is on bypassing the flaws.
Preventing further corruption is child's play compared with the nightmare of trying to track down all the existing random errors in files. Backups even from day one may be corrupt. If you have either of the flawed chips, you will probably never be able to completely eliminate the effects of past corruption.

HOW DO YOU TELL IF YOU HAVE THE FLAWED CHIPS?

There are four categories of motherboard:

1) Definitely safe. Motherboards may still have flaws, but all software in use bypasses them.

2) Probably safe. In theory there could be problems, but no one has reported any so far.

3) Possibly dangerous. You will have to run EIDEtest, CDtest, or IOtest to find out.

4) Probably dangerous. You will still have to run the tests to find out for sure.

Definitely Safe

Definitely safe includes older machines with ISA, EISA, or MCA buses. The flaws only affect machines with the new PCI bus or the VESA VL bus.

PCI machines that use the new Triton chipset from Intel do not have the flaws.

PCI machines with Intel BIOSes that run only DOS, DESQview, Windows 3.1 or Windows-95 are safe.

If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, Windows-95 and never use the "fast mode" simultaneous disk I/O feature on floppy or tape backup/restore, you are safe.

You still might want to test your machine. There are similar problems with other causes that the tests will unmask.

Probably Safe

If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, or Windows for WorkGroups 3.11 in 16-bit disk access mode, you probably will not see the problem, even though you may have one of the faulty chips.

Possibly Dangerous

Most auxiliary chipsets (e.g., OPTI Viper, SMC, Mercury and Neptune) used on PCI motherboards do not include a built-in EIDE controller. Such motherboards use a separate EIDE controller chip -- often the flawed RZ-1000 or CMD-640.

If you use a separate no-name EIDE paddleboard, it will likely use one of the flawed chips.
In theory, the flaws could affect DOS, Windows, and Windows For WorkGroups with 16-bit disk access during floppy/tape backup and restore, though no one has reported problems yet. Windows For WorkGroups with 32-bit disk access is dangerous if you have the flaws.

Probably Dangerous

PCI Motherboards (both 486 and Pentium) with the older Mercury and Neptune chipsets are likely to have the flawed chips. The Mercury chipset was popular in P60 and P66 systems, and the Neptune in P70, P90 and P100 systems. Mercury chipsets are labelled with an MX suffix and Neptune with NX.

If you are using NT, OS/2 Warp or Linux, you are likely to have already experienced extensive file corruption if either of the flawed chips is present.

Check the list later in the article for motherboards known to carry the flawed chips.

TESTING FOR THE FLAWS

Scot Llewelyn, one of the eight authors of PowerQuest's PartitionMagic, discovered one of the RZ-1000 flaws and made it public. Prior to that, only employees of PC-Tech, Intel and Microsoft were aware of how to bypass the flaws. In the process of tracking the RZ-1000 problems down, Internet comp.os.os2.bugs participants discovered a second flawed chip, the CMD-640.

Scot did most of the initial work documenting the first RZ-1000 flaw. He wrote a program called IOtest that can detect the flaws if:

1) You are using OS/2 Warp.

2) You are willing to go through the hassle of creating a separate small partition to run the test. You can use his program, PartitionMagic, to make room to create one.

3) You have an EIDE hard disk attached to your EIDE port. It cannot detect the problem if you only have an EIDE CD-ROM, or if the EIDE port is currently unused.

Scot originally called his test program DMAtest because he erroneously thought simultaneous DMA was the sole culprit. Do not confuse PowerQuest DMAtest with Gazelle's DMAtest, which only tests whether the floppy drive will work happily simultaneously with the hard disk.
The world needed an easier-to-use test that would run under DESQview, Windows, Windows For WorkGroups, Windows 95, NT and OS/2. So I wrote EIDEtest to test for the flaws without requiring you to create a special partition or buy Warp OS/2. I also wrote CDtest to test for the flaws when you have an EIDE CD-ROM drive. You can also get both programs from me by snail mail.

If these tests fail, it proves you have a serious problem, but not necessarily that you have the RZ-1000 or CMD-640 chip. If the tests pass, you still may have a problem since, especially under DOS, DESQview and Windows, the flaws may only show up very rarely. If you run the tests under Windows-95 they will always pass, even if you have the defective chip, because the operating system already bypasses the flaws. If you suspect trouble, run the tests several times.

VISUAL INSPECTION

You can also have a look at your motherboard. Between the PCI slots, at the edge of the motherboard, look for a rectangular chip about 1 by 2 cm (0.5" x 0.75") that says RZ-1000 near the top of the chip. There are variations on the chip name, e.g., "RZ-1000BP". Unfortunately, the markings are not always present, especially in ASUSTeK motherboards which may have the "CMD PCIO 640A" or "CMD PCIO 640B" chip.

DIRECT TESTS

The OS/2 Warp Bonus Pack Sysinfo version 3.02 utility will report on your EIDE controller. The signature for the RZ-1000 looks like this:

    manufacturer : PC TECHNOLOGY INC
    class code   : 0001
    Vendor ID    : 1042
    Device ID    : 1000
    Revision ID  : 0001

For the CMD-640B it will look like this:

    manufacturer : CMD TECHNOLOGY INC
    class code   : 0001
    Vendor ID    : 1095
    Device ID    : 0640
    Revision ID  : 0002

The Warp disk driver IBM1S506.ADD with the /V switch will tell you if you have the RZ-1000 or CMD-640 chip.

Intel has written a new test, Ctrltest.exe, that looks directly for either of the two faulty chips; however, it is filed under its old name, RZTest.exe.
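The two Sysinfo signatures boil down to a pair of PCI vendor and device IDs. As a minimal sketch only, and not the method used by Sysinfo, RZtest or EIDEtest, the following Python pulls those IDs out of a pasted report and looks them up. The function names are hypothetical, and the sketch assumes the report prints each ID as four hexadecimal digits, as in the samples.

```python
import re

# Vendor/device ID pairs quoted in the Sysinfo signatures (hexadecimal).
FLAWED_CHIPS = {
    (0x1042, 0x1000): "PC-Tech RZ-1000",
    (0x1095, 0x0640): "CMD PCIO 640",
}

# Tolerates uneven spacing around the colon in the report text.
_ID_FIELD = re.compile(r"(Vendor ID|Device ID)\s*:\s*([0-9A-Fa-f]{4})")


def parse_ids(report):
    """Return (vendor_id, device_id); a missing field comes back as None."""
    fields = {name: int(value, 16) for name, value in _ID_FIELD.findall(report)}
    return fields.get("Vendor ID"), fields.get("Device ID")


def identify_flawed_chip(report):
    """Return the name of a listed flawed chip, or None if the IDs are not listed."""
    return FLAWED_CHIPS.get(parse_ids(report))
```

A result of None only means the IDs are not on this short list. It does not prove the controller is sound, so the simultaneous I/O tests are still worth running.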
The Windows-95 Control panel will also report on the EIDE controller chip.

WHERE HAVE FLAWS BEEN FOUND?

Via email, on BIX and on the Internet and in comp.os.os2.bugs, people have reported finding flaws in the following specific motherboards.

Motherboard                  Chip      Reporters

Acculogic VL Paddleboard     CMD-640   Mark Lord (mlord@bnr.ca) tentative
ACMA P590                    ?         Bob Smith
AST Bravo MS-T P/75          CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
ASUSTeK PCI/I P54SP4         CMD-640   Marco Trunzer (ujjm@rzstud1.rz.uni-karlsruhe.de)
                                       Maurice Schekkerman (schekker@prl.philips.nl)
                                       Mike Coplien (kcoplien@facstaff.wisc.edu)
                                       Robert Schultz (robert.schultz@execnet.com)
                                       Thomas L. Kusterer (kustetl1@aplcomm.jhuapl.edu)
AT&T Globalyst 630           CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
DEC Celebris 590             CMD-640   Fred Thomsen (fthomsen@lexis.pop.upenn.edu)
DEC Starion 700I             CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
DEC Venturis 466             CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
DEC Venturis 560             CMD-640   Fred Thomsen (fthomsen@lexis.pop.upenn.edu)
Dell Dimension XPS P100      RZ-1000   Scot Llewelyn (scotl@itsnet.com)
Dell Dimension XPS P75       RZ-1000   Steve Ertman (sertman@ocean.fsu.edu)
Dell Dimension XPS P90       RZ-1000   Dong Chen (D_Chen@netcom.com)
                                       Larry Lai (lai@iastate.edu)
                                       Lawrence Rounds (ljrounds@netcom.com)
                                       Mike Griggs (mpg@iadfw.net)
                                       Mike Heath (heath@rohan.sdsu.edu)
                                       Moira Watson (watson6@uwindsor.ca)
                                       Nathaniel Beck @weber.ucsd.edu
                                       Pete (pag@interramp.com)
                                       Shallenberg (bobshall@subtone.wanet.com)
                                       Wijadi Jodi (r2nw@dax.cc.uakron.edu)
Dell Optiplex 575            CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
Dell Optiplex XM 590         CMD-640   Aron Eisenpress (afecu@cunyvm.cuny.edu)
Dell XPS-133c                neither   Blake Scholl (bscholl@one.net)
EliteGroup UM8810P-AIO       CMD-640   Bodo Huckestein (bh@thp.Uni-Koeln.DE)
                                       Guy Kapteijns (W.Kapteijns@kub.nl)
Escom P5/60                  CMD-640   Detlef Meier (detlef.meier@materna.de)
  (Intel Premiere ATLX)                Rogier van Wanroij (wanroij@cs.utwente.nl)
Escom P60I                   CMD-640   Tim Schofield (schofieldt@logica.com)
Escom P90                    RZ-1000   Karl Knoflach (151579kk@student.eur.nl)
                                       (Xav@mantra01.demon.co.uk)
Gateway 2000 P5-60,          RZ-1000   Angus Black (angus@spanner.hiway.co.uk)
  Intel Mercury Rev 3                  Gary Farr (garyfarr@ix.netcom.com)
                                       Daron Davis (daron_davis@dca.com)
                                       Jerry Lynch (lynch.94@osu.edu)
                                       Keith Patterson (dinosaur@buffnet.net)
                                       Rick Gregory (rfg@us.dynix.com)
                                       Roy L. Smith (smittyry@ix.netcom.com)
Gateway 2000 P5-66           RZ-1000   Randy Nerwick (nerwick@netcom.com)
Gateway 2000 P5-90           RZ-1000   Alan Murphy (alan@jac.co.uk)
                                       Roy L. Smith (smittyry@ix.netcom.com)
Intel Hendrix                CMD-640   Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Insight P5-60          RZ-1000   Jim Arnone (arnone@primenet.com)
  Premiere PCI II Baby AT,
  Neptune Chipset
Intel Plato 90               RZ-1000   Adrian Teo (adriant@singnet.com.sg)
                                       Alain Rassel (Alain.Rassel@restena.lu)
                                       Chris Norman (cnorman@oboe.aix.calpoly.edu)
                                       Clif Purkiser Intel Corp (support@cs.intel.com)
                                       Kevin Chua (chua@server.uwindsor.ca)
                                       Kevin T. Van Maren (vanmaren@cs.utah.edu)
                                       Kim Hvarre (kims@crash.ping.dk)
                                       Martin Kogelbauer (e8826847@student.tuwien.ac.at)
                                       Rick Nelson (rnelson2@ccmail.unl.edu)
                                       Richard Techmanski (richt@netcom.com)
Intel Premiere               RZ-1000   Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Premiere LPX           CMD-640   Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Premiere MM            CMD-640   Clif Purkiser Intel Corp (support@cs.intel.com)
Intel Robin LC               CMD-640   Clif Purkiser Intel Corp (support@cs.intel.com)
Knowledgebase P90 laptop     CMD-640   Andy Longton (alongton@clark.net)
Micron P5-90                 CMD-640   Primary fails, secondary is OK.
                                       Eric Johnson (johnson@scripps.edu)
                                       Jim Short (jdshort@primenet.com)
                                       Mike Coplien (kcoplien@facstaff.wisc.edu)
Micronics M54Pi              CMD-640   Adam Haar (s9406709@yallara.cs.rmit.edu.au)
Midwest Micro P90            CMD-640   (412d25$e8j@clarknet.clark.net)
NEC Image P90                CMD-640   Mike Coplien (kcoplien@facstaff.wisc.edu)
Packard Bell Legend 100CD    CMD-640   James Treworgy (jamie@access.digex.net)
PCI-EIDE local clone,        CMD-640   (whelk@ios.com)
  Phoenix BIOS 4.04,
  ALI chipset
Quantex P5/90 PM-2           RZ-1000   Jay Schamus (jaylord@rcinet.com)
Soyo SY-4SA2 486             ?         Jeffrey Hurwit (jhurwit@netcom.com)
  prior to B5
Unknown 486 DX               SMC37650  Eric Stephen Mountain (esm1@oak70.doc.ic.ac.uk)
Unknown 90 MHz               ?         Andreas (abenamou@galaxy.csc.calpoly.edu)
                                       Carol Lim (law30185@nus.sg)
Viglen P90 (Intel Plato)     RZ-1000   Phil Buckley (phil@starbug.swstyle.co.uk)
Vobis                        RZ-1000
Vobis 4886DX2-66             CMD-640   Guy Kapteijns (W.Kapteijns@kub.nl)
Zenon P90                    RZ-1000   Aria Novianto (novianap@cs.purdue.edu)
ZEOS Pantera                 RZ-1000   Paul Whitelock (paulw9DDFL3r.DDI@netcom.com)

KNOWN GOOD MOTHERBOARDS

The following motherboards have been tested with EIDEtest or CDtest and found to be ok. Not to worry, there are many more good boards than I have listed here:

Motherboard                  Chip        Reporters

Arsys P200-PCI/sis           Triton      Robert Aboud (raboud@pacific.telebyte.com)
ASUSTek PCI/I-P54TP4         Triton      Roedy Green (Roedy@bix.com)
Dell Dimension XPS P90c      ?           Note: older versions of this board were flawed.
                                         Dave Nuttall (dnuttall@texas.net)
Intel Zappa                  Triton      Ron McGlade (ronmc@primenet.com)
Micronics 486 VLB            ?           Bob Meredith (meredith@interactive.net)
Seanix                       Opti Viper  Bill Unruh (unruh@physics.ubc.ca)
Soyo SY-4SA2 486/B5          SYS         Jeffrey Hurwit (jhurwit@netcom.com)

WHAT CAN YOU DO IF YOU HAVE A FLAW?

1) Pester the manufacturer. Unfortunately, the EIDE controller chips are soldered in. The only way to repair a flaw is to replace the whole motherboard, recycling the socketed chips -- the CPU, DRAM and SRAM cache. It would be very expensive for computer and motherboard manufacturers to fix a flaw. After a month of stonewalling, Dell has announced it will offer a BIOS upgrade to turn off the prefetch buffers.
You can contact Dell at support@us.dell.com or (800) 624-9896. Intel is now acknowledging the problem. For a short while, Intel offered to replace defective motherboards, then they reneged. You can contact them at support@cs.intel.com or call their tech support line (800) 628-8686. Select options 1-3-1. You can find international contact numbers at: http://www.intel.com/intel/intelis/contact.html. You can call ASUSTeK at (408) 956-9077. Call PC-Tech at (612) 345-4555. Call CMD Technology at (714) 454-0800, (800) 426-3832 or (714) 455-1656 FAX.

2) Buy a new unpopulated Triton PCI motherboard and recycle the CPU, DRAM and SRAM cache chips from the old motherboard. Unfortunately, the Triton chipset has design shortcuts that hamper performance in simultaneous I/O situations. At least they don't corrupt data.

3) Run the controller in degraded mode. Some BIOSes have a feature to disable the EIDE prefetch buffer. Vendors may offer a BIOS upgrade to allow you to manually disable prefetch. The BIOS may also turn it off automatically if either of the defective chips is present. This will bypass both RZ-1000 flaws and two of the five CMD-640 flaws.

4) Buy a PCI EIDE paddleboard controller such as the DTC 2130S, the Tekram 290N/290S, the Promise 2300+ or the BusLogic BT-910 to replace the one on the motherboard. You must disable the EIDE controller on the motherboard. This fix will waste one of your precious slots. Be careful. You could be leaping out of the RZ-1000 frying pan into the CMD-640 fire, since paddleboards often use the CMD-640.

5) Buy a SCSI hard disk and CD-ROM, and avoid using the EIDE ports entirely. Under OS/2 and Linux, SCSI gives better performance, but costs more. DOS, Windows, Windows For WorkGroups and Windows-95 are unable to exploit the advanced features of SCSI, but at least avoid the EIDE flaws when you go pure SCSI.

6) Find a software work-around. There are fixes for Warp to bypass all the flaws in the RZ-1000 and CMD-640.
Fixpack 10 is the first fixpack to bypass the flaws. Now that Intel and IBM have finally revealed the technical details, all the operating system writers can patch their EIDE drivers to bypass the flaws. There are also fixes for NT 3.1 and 3.5. See below for details.

7) Get a BIOS upgrade. For DOS, DESQview, and Windows 3.1, to bypass the flaws you may need a new BIOS -- an EPROM chip. If you have a flash BIOS, you can update it simply by downloading a file. Most BIOSes already have code to bypass the flaws for DOS, DESQview and Windows. However, more advanced operating systems bypass the BIOS, so even a smart BIOS will not protect you. Even so, the BIOS CMOS settings may allow you to disable prefetch, which protects you even in true multitasking operating systems.

8) Cut the trace. Cut the trace on the motherboard from the floppy changeline to the EIDE controller. However, this just bypasses one of the CMD-640's five flaws and one of the RZ-1000's two flaws.

9) Use the Secondary EIDE Controller. Some motherboards such as the Micron P5-90 M54Pi-N 11P use different kinds of controller on the primary and secondary EIDE ports. The primary may be flawed, but the secondary OK.

Whatever method you use to bypass the flaws, retest with EIDEtest and CDtest afterwards to be sure your fix worked and you caught all the problems.

CLEANING UP THE MESS

Once you have bypassed the flaws, you can start working the problem of cleaning up your files. The first thing to do is to re-install your operating system and all your application programs. This will replace any damaged EXE and DLL files.

Catching errors in your data files is more difficult. Keep your eyes peeled for any improbable spreadsheet results. You may have to hire a programmer to write you some comb programs to sniff through your databases, looking for suspicious values.

If you routinely use the verify feature of Lotus Magellan, it can detect changes to files that should not have changed.
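For readers without Magellan, the same idea can be sketched in a few lines of Python. This is only an illustration of the general technique (record a checksum for every file, then look for files whose contents changed while their date stamp did not). It is not a description of how Magellan works internally, and the function names are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each file's path (relative to root) to (mtime in ns, digest)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            relative = os.path.relpath(full, root)
            result[relative] = (os.stat(full).st_mtime_ns, file_digest(full))
    return result


def silent_changes(old, new):
    """List files whose contents differ although the date stamp is unchanged."""
    return sorted(
        path
        for path in old
        if path in new
        and old[path][0] == new[path][0]   # same modification time
        and old[path][1] != new[path][1]   # different contents
    )
```

Take one snapshot while the files are believed good, store it somewhere safe, and compare it with a later snapshot. A snapshot taken after corruption has already occurred cannot reveal the earlier damage.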
This may help you uncover some of the damage. The flaws are not polite enough to redate the files they corrupt.

If you have backups from before the time you bought the faulty machine, you can restore them and re-key everything. Most people will not be so fortunate. All their backups will also be corrupt.

Most people with flaws will just have to put up with random errors dotting their data files ever after.

OPERATING SYSTEM SUMMARY

Operating System   Work Around

Netware            - No problems reported.
Unixware 1.1
NEXTSTEP
Banyan
Solaris 2.4+
SCO Unix 3.1+
Windows-95

DOS                - No problems reported so far. If you do have trouble:
DESQview           - Turn off EIDE prefetch in CMOS settings.
Windows 3.1        - Upgrade BIOS chip.
                   - Turn off simultaneous disk/floppy/tape I/O in your
                     backup programs.

Windows For        - Turn off 32-bit disk access mode.
WorkGroups         - Turn off EIDE prefetch in CMOS settings.
                   - Upgrade BIOS chip.
                   - Turn off simultaneous disk/floppy/tape I/O in your
                     backup programs.

Windows NT 3.1     - Turn off EIDE prefetch in CMOS settings.
                   - Apply ATDISK.SYS fix.

Windows NT 3.5     - Turn off EIDE prefetch in CMOS settings.
                   - Apply the 640XNT35.ZIP fix.

OS/2 2.1           - Disable prefetch buffer in CMOS settings.
                   - Load the IBMINT13.I13 driver instead of the
                     IBM1S506.ADD driver. This trick will only work if
                     your BIOS has flaw bypass code. It will be slow.
                   - Upgrade to Warp.

OS/2 Warp 3        - Apply Fixpack 10; it contains all the special fixes.
                     If, for some reason, you are unwilling to apply
                     Fixpack 10, you can do the following:
                   - Disable prefetch buffer in CMOS settings.
                   - Apply the RZ-1000 portion of pj19409.zip if you have
                     the RZ-1000.
                   - Apply the CMD portion of pj19409.zip including
                     IBMIDECD.FLT if you have the CMD-640.
                   - If that does not work, try
                     basedev=CMD640x.add /16BIT.
                   - In a pinch, if you cannot do either of the first two
                     things, add a line to config.sys
                     BASEDEV=IBMINT13.I13 and remove the line
                     BASEDEV=IBM1S506.ADD. The IBMINT13.I13 device driver
                     lives in C:\OS2\BOOT, on the first install diskette,
                     and on the CDROM in \OS2IMAGE\DISK_1. This trick
                     will work only if your BIOS has flaw-bypass code. It
                     will be slow.

Linux              - Disable prefetch buffer in CMOS settings.
                   - To bypass the CMD-640 flaws use the boot time kernel
                     parameter: hda=serialize.
                   - To bypass the prefetch flaws, leave the external Hard
                     Disk Parameter utility, hdparm, at its default
                     settings, which suppress interrupts during I/O.

REPORTING YOUR FINDINGS

Whether or not you find any flaws, please email me at Roedy@bix.com or post the following information in the Internet newsgroup comp.os.os2.bugs:

1) Test results. (I would like to hear about both machines with and without flaws.)

2) Brand and model of your motherboard.

3) Brand and model of your entire system.

4) Which chip did you find, the RZ-1000, the CMD-640, the SMC 37650? What did SYSINFO 3.02 report about your EIDE controller chip?

5) Have you noticed data file corruption?

6) Which tests and versions did you use? (IOtest, EIDEtest, CDtest, RZtest, Ctrltest or visual inspection)

7) What activities did you run in the background during the test?

8) Which operating system and version did you use to run the test? (e.g. Warp Connect blue spine)

9) Which fixpacks and patches did you apply before running the test?

10) Brand and model of EIDE hard disk.

11) Brand and model of EIDE CD-ROM.

12) Markings on the suspect chip, e.g., "RZ-1000BP", "CMD PCIO640B", "SMC 37650".

13) Vendor's name.

14) Vendor's response on informing him of your problem.

Please do not bother to report after 1995 October 31. The Internet is allowing the user community to rapidly sort this problem out, and all will be well-documented by then.

WHOSE FAULT IS IT?

The wags will have fun tormenting Intel for using the flawed RZ-1000 and CMD-640 in its motherboard designs, even though Intel did not manufacture either of the two faulty chips.
Intel is not the only company to manufacture motherboards with the faulty chips, but Intel will bear the brunt of the bad publicity.

PC-Tech manufactured the faulty RZ-1000 EIDE controller chip used in many PCI motherboards. PC-Tech is a subsidiary of ZEOS, the clonemaker. In turn Micron Electronics owns ZEOS. PC-Tech has offices just down the street from Zeos in Minnesota. Intel bought the chips from PC-Tech, and in turn many clone makers bought motherboards from Intel. Other motherboard manufacturers also used the faulty chips. In a similar way Intel and other companies also used the CMD-640 chip from the CMD Technology Corporation of Irvine, California.

PC-Tech, Intel and the clone makers all failed to test their designs properly. The software makers did not test their software on enough machines to show up the problem before releasing it.

Even worse, in some motherboard designs, Intel used the CMD-640 chip. This goof was inexcusable, since the chip, by deliberate design, is incapable of simultaneous I/O.

How did the flawed CMD-640 chip and the RZ-1000 slip through Quality Assurance testing? My guess is no one did real world testing; technicians only tested under laboratory conditions using only simple operating systems like DOS. They might have ignored flaws that happened only sporadically, blaming them on a faulty chip rather than a faulty design. It is very hard to catch a flaw that only manifests rarely.

CMD, PC-Tech, Intel, and Microsoft have known about how to bypass these problems for quite some time. IBM was aware there was a problem but was unaware of the solution. For obvious reasons, these companies were reluctant to inform the public of the danger of the ongoing subtle corruption.

No one who understood the RZ-1000 and CMD-640 flaws publicised their findings. If PC-Tech, Intel and Microsoft had not been so secretive, they could have averted the damage. Perhaps they were silent because the flaws primarily hurt the customers of a competitor, IBM.
The collective damage done by withholding information about the flaws is huge, certainly many millions of dollars for those large companies whose backups are corrupt as well.

It will be interesting to see if anyone launches a damage lawsuit against CMD, PC-Tech, Intel or Microsoft. If they do, it might make both hardware and software makers more careful about releasing improperly tested products.

There is potential here for some massive lawsuits. No wonder the companies who knew about the flaws have been so tight-lipped. Think of the damage if Boeing or GM had its plans for coming products stored on flawed machines. Literally, these flaws could cause plane crashes.

INTEL'S SPIN

There are three levels of "Intel Inside".

1. Weak. Your motherboard has an Intel CPU but a support chipset from another manufacturer.

2. Medium. Your motherboard has an Intel CPU and Intel support chipset such as the Neptune or Triton, but some other company built the BIOS and motherboard.

3. Strong. Your motherboard has an Intel CPU, Intel support chipset, Intel motherboard and Intel BIOS.

Intel literature on the RZ-1000 and CMD-640 only refers to (3). Intel cannot very well speak for (1) and (2) where the PCI EIDE controller design was out of their control, even though these machines bear the "Intel Inside" logo. Intel does not make this distinction clear in their literature.

According to Intel, "This problem is a consequence of the RZ-1000's inability to fully compensate for all the implications of running an IDE hard disk as an extension of the PCI bus, instead of running as an extension of the AT bus which it was originally designed to do."

Intel would have us believe the problems are not flaws per se, but rather a limitation that the programmers forgot to take into consideration.

The truth is grey. UART chips have similar flaws. Programmers have gradually learned to code around them. We don't insist that all COM port hardware be recalled.
We now tend to blame a programmer if he does not bypass the known UART flaws. Given that software work-arounds are now possible, the primary blame for any perpetuation of the problem shifts to the software authors.

However, there are many other EIDE chip designs that do not have this "limitation". Since the chips are supposedly generic implementations of the ATA interface standard, I cannot so lightly excuse these flaws.

SPECULATION

Because setting the flaws right would be so expensive, I suspect that clone makers and motherboard manufacturers will continue to refuse to replace the defective equipment. At best they may offer BIOS upgrades to bypass the flaws.

Microsoft has already added code to Windows-95 to bypass the flaws. Clone makers will rely on software vendors to write drivers that bypass the flaws for Warp, NT, Linux and the various UNIXes.

Now that the OS/2 fixes are out, the pressure to set things right will dwindle. Since DOS, Windows in 16-bit mode, and Windows-95 are immune, little pressure to correct the problem is likely to come from those camps.

The motherboard manufacturer has five options:

1) Replace the motherboard. Recalls on a mass scale would be extremely costly for the motherboard manufacturers, so you can count on them to fight. ($400 parts + $250 labour)

2) Provide a replacement paddleboard EIDE controller that takes up a PCI slot. ($75)

3) Provide a new BIOS chip that bypasses potential problems for DOS and Windows. The BIOS could also turn off prefetch, which would rescue multitasking operating systems that do not use the BIOS for I/O. ($10)

4) Tell the users to upgrade to software that bypasses the flaws, and to turn off simultaneous disk/tape/floppy I/O in any backup software run under DOS, DESQview or Windows. Users won't like the performance hit, however. ($0)

5) Stonewall and refuse to even acknowledge the problem. This will be more difficult now that Intel and Dell have publicly admitted the problem.
($0)

Intel has already set the precedent by offering to replace defective Pentiums, even though software can bypass its divide flaw. The RZ-1000 flaws are far more serious, and the CMD-640 flaws are more serious still.

Keeping this under wraps is going to be hard for the clone builders. Brooke Crothers of Infoworld did several stories based on my compilations. I have been in contact with Jerry Pournelle of Byte. I sent email to John Dvorak. Even Dean Takahashi of the San Jose Mercury Daily News did a story. A 1000-word abridged version of this essay is appearing in the October edition of The Computer Paper that goes across Canada. The stonewall is coming tumbling down. As one individual pointed out, "I read your postings on the Internet, and see them the next day quoted in my daily newspaper."

WHAT ARE THE FLAWS?

IBM confirmed the RZ-1000 has two different flaws:

1. In prefetch mode, multi-sector reads often fail.

2. The chip erroneously responds to floppy status commands and corrupts hard disk or CD-ROM I/O in the process.

IBM confirmed the CMD-640 has five different flaws:

1. It has the same prefetch problem as the RZ-1000.

2. It has the same floppy status problem as the RZ-1000.

3. It does not support simultaneous I/O on the primary and secondary EIDE ports.

4. Confusion over legacy and PCI mode.

5. Does not support 32-bit writes.

THE FLAWS UNDER A MICROSCOPE

After the manner of Ionesco, Roedy Green said, "All great programmers are paranoid." Programmers have to anticipate problems that could happen only once in a trillion machine cycles, since such problems would still show up on average every three hours. EIDE problems sometimes go days without manifesting. Sometimes they show up within seconds, depending on the unrelated I/O activity in the machine.

I have read about ten conflicting explanations from authorities on the cause of the problems. Much of the confusion comes because there are so many different flaws -- all generating similar symptoms.
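The once-in-a-trillion-cycles figure quoted in this section is easy to check. The following is a minimal Python sketch; the 90 MHz clock is an assumption chosen to match the P90 machines discussed in this article, and the function name is hypothetical.

```python
def hours_between_events(cycles_per_event, clock_hz):
    """Average hours between events that occur once per cycles_per_event cycles."""
    seconds = cycles_per_event / clock_hz
    return seconds / 3600.0


# Assumed clock: 90 MHz. One event per trillion (1e12) machine cycles.
print(round(hours_between_events(1e12, 90e6), 1))  # prints 3.1
```

At an assumed 100 MHz the same calculation gives about 2.8 hours, so "every three hours" is a fair round figure for machines of this class.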
I based the following explanations on postings from Sam Detweiler of IBM's Warp Device Driver section (sdetweil@vnet.ibm.com). The RZ-1000 and CMD-640 both have the prefetch flaw and the floppy status flaw. The CMD-640 has three additional flaws. I will focus on the three most important.

FLAW 1: PREFETCH BUFFER FLAW

The RZ-1000 and CMD-640 both have the prefetch flaw. Data moves from the hard disk to RAM via a bit bucket brigade. The RZ-1000 grabs data 16 bits at a time from a buffer in the integrated controller in the hard disk, and hands it off 32 bits at a time to the PCI bus. The CPU sits in a tight loop grabbing data from the PCI bus and storing it in RAM. In prefetch mode, the RZ-1000 keeps ahead of the CPU, requesting two 16-bit chunks from the hard disk, in order to have a 32-bit chunk ready when the CPU asks.

When you disable the prefetch buffer, you turn off the parallelism and run in a degraded lock-step mode. In degraded mode, the RZ-1000 waits until the CPU asks for a 32-bit chunk. Then it puts the CPU on hold while it asks the hard disk for two 16-bit chunks. It glues them together, puts them on the PCI bus and allows the CPU to continue.

I advise all but the most dedicated technophiles to skip the next paragraph.

If the RZ-1000 is running with prefetch enabled, it erroneously considers a sector read complete as soon as it has grabbed the last 16 bits from the hard disk and stuffed them into the prefetch FIFO buffer. It should not consider the read complete until the CPU has stuffed all the data into RAM. The RZ-1000 then starts to read the next sector. If the current read operation is interrupted, or delayed by simultaneous DMA from some unrelated device, before the last two bytes are read from the FIFO, and the next sector is prefetched into the FIFO before the current data transfer completes, then the chip will erroneously signal yet another Data Available Interrupt.
Because OS/2 has already signalled EOI (End Of Interrupt) to the PIC (Programmable Interrupt Controller) and enabled interrupts, it recurses into the disk driver interrupt handler. The driver then reads the status register. Unfortunately, because of a cheap design shortcut, the FIFO is used both for data and status. The CPU reads the data in front of the status as if it were the status. This causes the interrupted data transfer to later read the following status as if it were data, resulting in corruption. Both the RZ-1000 and CMD-640 fail in exactly the same way.

There are two software techniques to bypass this flaw:

1) Never schedule more than one I/O at a time. Use strict polled mode with no interrupts. Turn off all unrelated interrupts during I/O. This is the DOS/Windows approach. The disadvantage is poor performance and possible lost incoming modem characters.

2) Turn off the prefetch buffer. According to Intel and IBM, in a lightly loaded system, there is sufficient spare capacity on the PCI bus so running in degraded mode only slows the disk down by 1%. However, programs making extensive use of the PCI bus such as LANs or video bit-map painting will also slow down.

Both Intel and IBM tell us that turning off prefetch to bypass the flaw has negligible effect on performance. Yet in the Plato BIOS rev 12, Intel says that enabling the prefetch buffers will "significantly increase PCI IDE Hard Disk performance." They can't have it both ways.

FLAW 2: FLOPPY STATUS

The RZ-1000 and CMD-640 both have the floppy status flaw. This flaw is the result of an incredible chain of blunders.

The original MFM (the predecessor to IDE) interface design blunder was using different bits of the same I/O port, 3F7, for two unrelated purposes: detecting the floppy changeline and reporting hard disk status. Modern EIDE controllers are no longer supposed to do this, but some chips carry on in the old tradition and provide legacy logic.
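
To make the shared-port blunder concrete, here is a small sketch of how one byte read from port 3F7 splits between the two devices. The bit layout used (bit 7 for the floppy changeline, bits 0 to 6 for the hard disk) is the conventional AT assignment and is an assumption here; the text above says only that the two purposes share the port. The function name is hypothetical, and the code merely decodes a byte value; it performs no port I/O.

```python
# Decode one byte read from the shared port 3F7.
# ASSUMPTION: conventional AT layout, bit 7 = floppy changeline,
# bits 0-6 = hard disk controller. This only decodes a value; no port I/O.
CHANGELINE_MASK = 0x80
HARD_DISK_MASK = 0x7F


def split_port_3f7(value: int) -> tuple[bool, int]:
    """Return (floppy_changeline_set, hard_disk_bits) for a byte from port 3F7."""
    if not 0 <= value <= 0xFF:
        raise ValueError("port value must fit in one byte")
    return bool(value & CHANGELINE_MASK), value & HARD_DISK_MASK


if __name__ == "__main__":
    changed, disk_bits = split_port_3f7(0xA5)
    print(changed, hex(disk_bits))
```

The point of the sketch is that a single read is answered by two different pieces of hardware, each supposed to drive only its own bits. The scheme works only as long as each device stays out of the other's share of the byte.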
Motherboard manufacturers then often blunder by attaching the floppy changeline to the EIDE controller. This way both the EIDE controller and the floppy controller think they are in charge of reporting floppy changeline status.

On top of that, the designers of both the RZ-1000 and CMD-640 chips blundered by trying to save a little silicon by using the same registers to store both hard disk status and data.

For the insatiably curious, here is precisely how the corruption occurs. Simultaneous I/Os to both the hard disk and the floppy disk are running. The floppy controller generates an I/O complete interrupt. The floppy driver then checks the floppy status. Part of reading floppy status is checking the changeline bit -- contained in the ambiguous port 3F7. If the motherboard manufacturer goofed and hooked up the floppy changeline to the EIDE controller, the RZ-1000 erroneously responds to the floppy status request. It is in charge of the hard disk, not the floppy. It is the floppy controller's job to respond. The RZ-1000 feeds two data bytes from its FIFO out as floppy status. These data were supposed to go to the hard disk driver. Thus the chip loses two bytes from the hard disk transfer, corrupting data.

Turning off prefetch also solves this problem. Unlike the first flaw, only simultaneous floppy I/O can trigger this problem. Simultaneous I/O of any kind can trigger the first flaw.

FLAW 3: NO SIMULTANEOUS I/O

Only the CMD-640 has this flaw. The CMD-640 can't do more than one I/O at a time. This flaw was so obvious everyone found out about it long ago.

No EIDE controller (even a fully functioning one) can run master and slave simultaneously. However, two separate EIDE controllers are supposed to allow primary and secondary channels to run at once. The CMD-640 has dual controllers on one chip. However, because it does not have two register sets, the primary and secondary channels will not work simultaneously, unlike every other design.
For example, you can't run your EIDE hard disk and EIDE CD-ROM at the same time. Simultaneous I/O speed is the reason we put two EIDE devices on separate channels, both as masters, rather than making one a master and one a slave on the same channel.

IBM has a bypass for this blunder. When it detects a CMD-640, Warp never schedules more than one I/O at a time while the chip is active, reducing the operating system to DOS-like performance. Independent experiments show the degradation from using the CMD fix is 15 to 50%.

BACKGROUND

If you read the literature on this problem, you will see various daunting technical terms. Here is a rough explanation. There are six kinds of I/O used in PCs.

1) PIO - Programmed I/O. The CPU spoon-feeds each byte to the I/O port. The port can usually accept data as fast as the CPU can feed it. Typical IDE drives work this way under DOS. For slower devices, the CPU polls the status to see if the device is ready for yet another byte.

2) Scheduled I/O. This is a variant of PIO where the operating system feeds the I/O device some bytes, calculates how long it should take for the I/O device to digest them, goes away for a while to do something else, then comes back when it figures the I/O should be complete and feeds the device a few more bytes. This is how Warp usually controls parallel port printers.

3) Interrupt I/O. Every time the port is ready to eat another byte, it raises an interrupt and the CPU feeds it some more. This is the typical way COM ports work and how Warp uses printers with the /IRQ option. Warp EIDE drivers combine methods (1) and (3). The hard disk interrupts when it has completed the read into its on-board buffer. Then the CPU fetches data out of the buffer with PIO mode.

4) Third party DMA. The DMA controller on the motherboard copies data from RAM to the port and generates an interrupt when it is done with a block. Floppy drives and inexpensive mag tape backup drives use this method.
Because of the unfortunate original AT design compromises, this method is exceedingly slow. Third Party DMA is never used for PCI bus devices, though it is still used for ISA or motherboard-based floppy controllers on PCI motherboards.

5) First party DMA, sometimes called Bus Mastering. A DMA controller on the device copies data from RAM to the port and generates an interrupt when done. High end SCSI cards -- such as the Adaptec 2940 or 2940W -- use this ultimate way to fly.

6) Memory mapped I/O. The CPU copies data to a magic region of RAM which is actually on the I/O device. LAN cards or REGEN VRAM on video cards use this technique.

In a true multi-tasking system, such as OS/2, the CPU goes off and works on behalf of applications when the port is busy, and trusts an interrupt to bring it back when the device needs more service. It schedules several I/Os simultaneously. In contrast, DOS and Windows never do more than one I/O at a time. Further, under DOS/Windows the CPU idles while waiting for its single I/O to complete rather than working on applications.

LEARNING MORE

You can use the Internet to learn more about this problem. If you do not have Internet access, I can provide you these files on diskette. See below for details. When accessing files on the Internet, generally you must use lower case.

TEST PROGRAMS

Roedy Green's EIDEtest and CDtest programs for DOS, DESQview, Windows, Windows For WorkGroups, Windows 95, NT, OS/2 and Warp. They ensure your hard disk and CD-ROM will function without interference from background I/O activity. These indirectly detect the flawed RZ-1000 and CMD-640 chips. By the time you read this, I may have posted a newer version.
ftp://garbo.uwasa.fi/pc/diskutil/eidete18.zip
alternatively
ftp://ftp.cdrom.com/.4/os2/incoming/eidete18.zip
or
ftp://ftp.cdrom.com/.4/os2/sysutil/eidete18.zip

Intel's RZ-1000 and CMD-640 chip detection program. RZtest.exe expands to form CtrlTest.exe. Beware!
the CtrlTest.Doc documentation contains an MSWord macro virus.
http://www.intel.com/procs/support/rz1000/index.html

IOTest from PowerQuest, the makers of Partition Magic, a Warp test for the flaws.
http://www.powerquest.com/download/iotest.zip

Version 3.02 of the self-extracting Warp utility, which should be placed in OS2\APPS. SYSIGUI.EXE will emerge.
ftp://ftp.software.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/sitcsd/sysinfo.exe

FIXES

Warp Fixpack 10. This bypasses the flaws for both the RZ-1000 and CMD-640 faulty EIDE chips. It also fixes numerous other bugs in Warp. It comes as a set of six files -- totalling about 8 MB. Make sure you get it from an official IBM CSD site because there are leaked pre-release buggy copies floating about the net. Before applying it, verify that the readme.1st on the first fixpack disk is dated 9/21/95 at 17:40. The package as a whole should be dated 9/22/95 or later. This fixpack applies to all versions of Warp including Warp Connect. It includes all earlier fixpacks, so you don't need to apply any previous fixpacks first. If you have the CMD-640, it is especially important that you carefully read the installation instructions. You need to manually modify config.sys. DO A COMPLETE BACKUP FIRST. Many people are having a variety of troubles with Fixpack 10 -- often traced to failure to carefully follow the installation instructions, including a COMMIT step.
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.1dk
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.2dk
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.3dk
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.4dk
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.5dk
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/xr_w010/xr_w010.6dk
alternatively
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.1dk
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.2dk
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.3dk
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.4dk
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.5dk
ftp://ftp.pcco.ibm.com/pub/corrective_service/xr_w010.6dk

Microsoft Windows NT 3.1 ATDISK.SYS fix for the CMD-640 chip:
http://www.microsoft.com/KB/softlib/mslfiles/pciatdsk.exe

Microsoft Windows NT 3.5 fix for the CMD-640 chip: CMD's BBS at (714) 454-1134. File 640XNT35.ZIP

If you don't want to install the entire Fixpack 10, you can install these Warp bypasses for the RZ-1000 and the CMD flaws. Warning: this file has been updated several times without changing the name. Make sure you get the most recent. The installation instructions are tricky. Follow them carefully.
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/pj19409/pj19409.zip

Warp bypass for the early CMD-640 chip flaws. It has been superseded by pj19409.zip. You no longer need to install it before pj19409.zip.
ftp://ftp-os2.cdrom.com/pub/os2/drivers/cmd640x.zip

ESSAYS

Roedy Green's FAQ (Frequently Asked Questions), an unabridged copy of this article in both Winword and ASCII format. By the time you read this, I may have posted a newer version.
ftp://garbo.uwasa.fi/pc/diskutil/eidete18.zip
alternatively
ftp://ftp.cdrom.com/.4/os2/incoming/eidete18.zip
or
ftp://ftp.cdrom.com/.4/os2/sysutil/eidete18.zip

PowerQuest essay:
http://www.powerquest.com/

Intel's FAQ:
http://www.intel.com/procs/support/rz1000

PC-Tech's essay:
http://www.mei.micron.com/rz1000/rz1000.txt

Catch Pat Duffy's (duffy@theory.chem.ubc.ca) essays each Sunday in: comp.os.os2.misc, comp.os.os2.setup.misc, comp.os.os2.setup.storage and comp.sys.ibm.pc.hardware.misc

Check out Pat Duffy's Web site at:
http://warp.eecs.berkeley.edu/os2/workbench/work.htm
and
ftp://ftp.netcom.com/pub/ab/abe/

CONTACTING THE AUTHOR

The author, Roedy Green, is a computer consultant who prefers to work on Forth, C++, Delphi, DOS, OS/2 and Internet Web projects.

If you send me $5 (US or Canadian) to cover duplication, postage to anywhere in the world, and handling, I will send you a diskette containing the relevant test programs, fixes, Internet postings and essays. Sorry, but for various reasons I do not provide this package via EMAIL.

Please report any machines with flaws. Send email to: Roedy@bix.com or discuss this problem on the Internet newsgroup comp.os.os2.bugs.

You can also write via snail mail:

Roedy Green
Canadian Mind Products
#601 - 1330 Burrard Street
Vancouver, BC CANADA V6Z 2B8
(604) 685-8412

-30-