rev 15 faq EIDE controller flaws part 1 of 2 From: roedy@BIX.com (Roedy Green) Newsgroups: comp.os.os2.bugs Subject: rev 15 faq EIDE controller flaws part 1 of 2 Date: 1 Sep 1995 01:08:35 GMT Organization: Canadian Mind Products Lines: 545 Message-ID: <425mej$hgo@news2.delphi.com> NNTP-Posting-Host: bix.com X-Newsreader: Galahad 1.1f EIDE CONTROLLER FLAWS part 1 of 2 Revision 15: 1995 August 31 SUMMARY OF RECENT CHANGES 1) EIDEtest 1.5 and CDTest 1.0 released. 2) Yet another suspect EIDE controller chip: the SMC 37650. 3) Intel contradicts itself on the performance hit from disabling prefetch to bypass the flaw. 4) Software from IBM and Intel to detect both faulty chips directly. 5) The precise mechanism of failure for both the RZ-1000 and CMD 640B is now understood. The RZ-1000 and CMD 640B both have the prefetch flaw. The CMD 640B has two additional flaws. 6) Explanation of what "Intel Inside" means. 7) Dell offers upgrade BIOS to turn off the prefetch buffers. 8) RZ-1000 flaw bypass for APAR PJ19409 for Warp now available. 9) List of safe and unsafe operating system software. 10) IBM hardware is clean. 11) Stonewall rebuilds. Intel recants on offer to replace defective motherboard. 12) Problem is showing up under Windows For WorkGroups in 32 bit mode. 13) Cleaning up past damage is very difficult. 14) Assigning blame. 15) The Triton chipset is immune. These chips are marked with an FX suffix. 16) Windows-95, NT are immune. 17) DOS and Windows 3.1 are immune if you have an Intel BIOS. INTRODUCTION There are serious flaws affecting about 1/3 of all PCI motherboards. The flaws affect any motherboard or EIDE controller paddleboard containing the PC-Tech RZ-1000 PCI EIDE controller chip or the CMD PCIO 640B PCI EIDE controller chip. There are preliminary reports of yet a third flawed chip -- the SMC 37650. The flaws affect motherboards from ASUSTeK, AT&T, Dell, Gateway, Zeos and Intel. 
Since Intel makes so many of the motherboards sold under other brand names, the flaws affect many machines, both 486 and Pentium PCI. The flaw shows up most frequently when you run a true multitasking operating system such as OS/2 Warp. It also shows up under Windows For WorkGroups in 32 bit mode during tape or floppy backup and restore. In theory the flaw could do damage under DOS, DESQview, Windows and Windows For WorkGroups in 16 bit mode, but so far there have been no damage reports. Recent versions of Microsoft NT and Windows- 95 contain code to bypass the flaw. WHAT ARE THE SYMPTOMS? When you are using an IDE or EIDE hard disk attached to the EIDE motherboard port, the flaw subtly corrupts your files by randomly changing bytes every once in a while. The flaw introduces bugs into EXE files, subtle errors into your spreadsheets, stray characters into your word processing documents, changes to the deductions in last year's tax return files, and random changes to engineering design files. This corruption happens when you are simultaneously using your EIDE or IDE hard disk and some other device, most commonly the floppy drive or mag tape backup. The same sorts of problem may occur on reading a CD-ROM drive attached to an EIDE port. IS IT SERIOUS? These flaws are nasty. They are causing hundreds of times more havoc than the infamous Pentium divide flaw ever did. "I am Pentium of Borg. You will be approximated." Not only does this corruption occur, but it occurs quietly, often going unnoticed. If the system crashes, you usually put the blame on the operating system software, or the application. It might actually be a faulty RZ-1000 or CMD 640B EIDE controller chip nailing you. When a directory becomes corrupted, you may not notice it until the damage is irreparable. 
If a spreadsheet application reads a comma-delimited ASCII file, it may simply miss a few bytes in a number, an error that may go unnoticed, and that error could cascade through the rest of the spreadsheet.

If you have had unexplained crashes in OS/2, you have probably experienced the problem, and should make a thorough check for hidden corruption. Remember that the bug may only slightly alter your data, and the corruption may not be obvious.

Keep in mind that not every problem is the RZ-1000's or the CMD 640B's fault. Overheating, unrelated hardware faults and design flaws, or software bugs can cause similar symptoms. DMA channel conflicts also cause similar symptoms. Happily, EIDEtest and CDTest can unmask all manner of simultaneous I/O faults.

Unfortunately, correcting the problem just stops further file corruption. It will not help to clean up the existing damage to your files. Right now, the focus is on bypassing the flaw. Preventing further corruption is child's play compared with the nightmare of trying to track down all the existing random errors in files. Backups even from day one may be corrupted. If you have the flaw, you will probably never be able to completely eliminate the effects of past corruption.

HOW DO YOU TELL IF YOU HAVE THE FLAW?

There are four categories of motherboard:

1) Definitely safe. Motherboards may still have the flaw, but all software in use bypasses it.

2) Probably safe. In theory there could be problems, but no one has reported any so far.

3) Possibly dangerous. You will have to run EIDEtest, CDTest, or IOTest to find out.

4) Probably dangerous. You will still have to run the tests to find out for sure.

Definitely Safe

Definitely safe includes older machines with ISA, EISA, VESA VL or MCA buses. The flaw only affects machines with the new PCI bus. PCI machines that use the new Triton chipset from Intel do not have the flaw. PCI machines with Intel BIOSes that run only DOS, DESQview, Windows 3.1, Windows-95 or NT 3.5 are safe.
If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, Windows-95 or NT 3.5 and never use the "fast mode" simultaneous disk I/O feature on floppy or tape backup/restore, you are safe.

You still might want to test your machine. There are similar problems with other causes that the tests will unmask.

Probably Safe

If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, or Windows For WorkGroups in 16-bit disk access mode, you probably will not see the problem, even though you may have one of the faulty chips.

Possibly Dangerous

Most auxiliary chipsets (e.g., OPTI Viper, SMC, Mercury and Neptune) used on PCI motherboards do not include a built-in EIDE controller. Such motherboards use a separate EIDE controller chip -- often the flawed RZ-1000 or CMD 640B. If you use a separate EIDE paddleboard, it will likely use one of the flawed chips.

In theory, the flaw could affect DOS, Windows, and Windows For WorkGroups with 16 bit disk access during floppy/tape backup and restore, though no one has reported problems yet. Windows For WorkGroups with 32 bit disk access is dangerous if you have the flaw.

Probably Dangerous

PCI Motherboards (both 486 and Pentium) with the older Mercury and Neptune chipsets are likely to have the flaw. The Mercury chipset was popular in P60 and P66 systems, and the Neptune in P70, P90 and P100 systems. Mercury chipsets are labelled with an MX suffix and Neptune with NX.

If you are using NT 3.1, OS/2 Warp or Linux, you are likely to have already experienced extensive file corruption if the flaw is present.

TESTING FOR THE FLAW

Scot Llewelyn, one of the eight authors of PowerQuest's PartitionMagic, discovered the RZ-1000 flaw and made it public. Prior to that, only employees of PC-Tech, Intel and Microsoft were aware of how to bypass the flaw.
In the process of tracking the RZ-1000 problem down, Internet comp.os.os2.bugs participants discovered a second flawed chip, the CMD 640B, and are now suspicious about the SMC 37650.

Scot did most of the initial work documenting the RZ-1000 flaw. He wrote a program called IOtest that can detect the flaw if:

1) You are using OS/2 Warp.

2) You are willing to go through the hassle of creating a separate small partition to run the test. You can use his program, PartitionMagic, to make room to create one.

3) You have an EIDE hard disk attached to your EIDE port. It cannot detect the problem if you only have an EIDE CD-ROM, or if the EIDE port is currently unused.

You can find his test program on the Internet at:

http://www.powerquest.com/

Scot originally called his test program DMAtest because he erroneously thought simultaneous DMA was the culprit. Do not confuse PowerQuest DMAtest with Gazelle's DMAtest, which only tests whether the floppy drive will work happily simultaneously with the hard disk.

The world needed an easier-to-use test that would run under DESQview, Windows, Windows For WorkGroups, Windows 95, NT and OS/2. So I wrote EIDEtest to test for the flaw without requiring you to create a special partition or buy OS/2 Warp. I also wrote CDTest to test for the flaw when you have an EIDE CD-ROM drive. I posted them on the Internet at:

ftp://ftp.cdrom.com/.4/os2/incoming/EIDEte15.zip

The file also contains a 16-page unabridged copy of this article. You can also get both programs from me by snail mail.

If these tests fail, it proves you have a serious problem, but not necessarily that you have the RZ-1000 or CMD 640B chip. If the tests pass, you still may have a problem since, especially under DOS, DESQview and Windows, the flaw may only show its head very rarely. If you run the tests under NT or Windows-95 they will always pass, even if you have the defective chip, because the operating system already bypasses the flaw.
If you suspect trouble, run the tests several times.

VISUAL INSPECTION

You can also have a look at your motherboard. Between the PCI slots, at the edge of the motherboard, look for a rectangular chip about 1 by 2 cm (0.5" x 0.75") that says RZ-1000 near the top of the chip. There are variations on the chip name e.g., RZ-1000BP. Unfortunately, the markings are not always present, especially in ASUSTeK motherboards which may have the CMD PCIO 640B chip. The other suspect chip is the SMC 37650.

DIRECT TESTS

The OS/2 Warp Bonus Pack Sysinfo 3.02 utility will report on your EIDE controller. The signature for the RZ-1000 looks like this:

    manufacturer: PC TECHNOLOGY INC
    class code  : 0001
    Vendor ID   : 1042
    Device ID   : 1000
    Revision ID : 0001

For the CMD 640B it will look like this:

    manufacturer: CMD
    class code  : ???
    Vendor ID   : ???
    Device ID   : ???
    Revision ID : ???

The Warp disk driver IBM1S506.ADD with the /V switch will tell you if you have the RZ-1000 chip.

Intel posted a test that looks directly for either of the two faulty chips:

http://www.intel.com/procs/support/ctrltest/ctrltest.exe

The Windows-95 Control panel will also report on the EIDE controller chip.

WHERE HAS THE FLAW BEEN FOUND?

Via email, on BIX and on the Internet and in comp.os.os2.bugs, people have reported finding flaws in the following specific motherboards.

Motherboard               Chip       Reporters

ASUSTeK PCI/I P54SP4      CMD 640B   Maurice Schekkerman (schekker@prl.philips.nl)

Dell Dimension XPS P100   RZ-1000    Scot Llewelyn (scotl@itsnet.com)

Dell Dimension XPS P75    RZ-1000    Steve Ertman (sertman@ocean.fsu.edu)

Dell Dimension XPS P90    RZ-1000    Larry Lai (lai@iastate.edu)
                                     Mike Heath (heath@rohan.sdsu.edu)
                                     Moira Watson (watson6@uwindsor.ca)
                                     Pete (pag@interramp.com)
                                     Wijadi Jodi (r2nw@dax.cc.uakron.edu)

Dell Escom P60I           CMD 640B   Tim Schofield, schofieldt@logica.com

Dell Optiplex 575         CMD 640B   David W. Mittlefehldt (duck@snmail.jsc.nasa.gov)

Dell XPS-133c             neither    Blake Scholl (bscholl@one.net)

EliteGroup UM8810P-AIO    CMD 640B   Bodo Huckestein (bh@thp.Uni-Koeln.DE)

Escom P5/60               RZ-1000    Rogier van Wanroij (wanroij@cs.utwente.nl)

Escom P90                 RZ-1000    Karl Knoflach (151579kk@student.eur.nl)
                                     (Xav@mantra01.demon.co.uk)

Gateway 2000 P5-60,       RZ-1000    Angus Black (angus@spanner.hiway.co.uk)
Intel Mercury Rev 3                  Gary Farr (garyfarr@ix.netcom.com)
                                     Daron Davis (daron_davis@dca.com)
                                     Jerry Lynch (lynch.94@osu.edu)
                                     Keith Patterson (dinosaur@buffnet.net)
                                     Roy L. Smith (smittyry@ix.netcom.com)

Gateway 2000 P5-66        RZ-1000    Randy Nerwick (nerwick@netcom.com)

Gateway 2000 P5-90        RZ-1000    Roy L. Smith (smittyry@ix.netcom.com)

Intel Hendrix             CMD 640B   Clif Purkiser Intel Corp (support@cs.intel.com)

Intel Insight P5-60       ?          Jim Arnone (arnone@primenet.com)

Intel Plato 90            RZ-1000    Clif Purkiser Intel Corp (support@cs.intel.com)
                                     Adrian Teo (adriant@singnet.com.sg)
                                     Alain Rassel (Alain.Rassel@restena.lu)
                                     Chris Norman (cnorman@oboe.aix.calpoly.edu)
                                     Kevin Chua (chua@server.uwindsor.ca)
                                     Kevin T. Van Maren (vanmaren@cs.utah.edu)
                                     Kim Hvarre (kims@crash.ping.dk)
                                     Rick Nelson (rnelson2@ccmail.unl.edu)

Intel Premiere            RZ-1000    Clif Purkiser Intel Corp (support@cs.intel.com)

Intel Premiere LPX        CMD 640B   Clif Purkiser Intel Corp (support@cs.intel.com)

Intel Premiere MM         CMD 640B   Clif Purkiser Intel Corp (support@cs.intel.com)

Intel Robin LC            CMD 640B   Clif Purkiser Intel Corp (support@cs.intel.com)

Knowledgebase P90         RZ-1000    Andy Longton (alongton@clark.net)
laptop

Midwest Micro P90         RZ-1000    (412d25$e8j@clarknet.clark.net)

PCI-EIDE local clone,     CMD 640B   (whelk@ios.com)
Phoenix BIOS 4.04,
ALI chipset

Quantex P5/90 PM-2        RZ-1000    Jay Schamus (jaylord@rcinet.com)

Unknown 486 DX            SMC 37650  Eric Stephen Mountain (esm1@oak70.doc.ic.ac.uk)

Unknown 90 MHz            ?          Andreas (abenamou@galaxy.csc.calpoly.edu)
                                     Carol Lim (law30185@nus.sg)

Vobis                     RZ-1000    Thomas Wagner (twagner@bix.com)

ZEOS Pantera              RZ-1000    Paul Whitelock (paulw9DDFL3r.DDI@netcom.com)

WHAT CAN YOU DO IF YOU HAVE THE FLAW?

1) Pester the manufacturer. Unfortunately, the EIDE controller chips are soldered in. The only way to repair the flaw is to replace the whole motherboard, recycling the socketed chips -- the CPU, DRAM and SRAM cache. It would be very expensive for computer and motherboard manufacturers to fix the flaw.

After a month of stonewalling, Dell has announced it will offer a BIOS upgrade to turn off the prefetch buffers. You can contact Dell at support@us.dell.com or (800) 624-9896.

Intel is now acknowledging the problem. For a short while, Intel offered to replace defective motherboards, then they reneged. You can contact them at support@cs.intel.com or call their tech support line (800) 628-8686. Select options 1-3-1. You can find international contact numbers at: http://www.intel.com/intel/intelis/contact.html.

2) Buy a new unpopulated Triton PCI motherboard and recycle the CPU, DRAM and SRAM cache chips from the old motherboard.

3) Run the controller in degraded mode. Some BIOSes have a feature to turn off the EIDE prefetch buffer. Vendors may offer a BIOS upgrade to allow prefetch to be configured or to turn it off automatically if either of the defective chips is present.

4) Buy a PCI EIDE paddleboard controller such as the Promise 2300+ to replace the one on the motherboard. You must disable the controller on the motherboard. This fix will waste one of your precious slots. Be careful. You could be leaping out of the RZ-1000 frying pan into the CMD 640B fire since paddleboards often use the CMD 640B.

5) Buy a SCSI hard disk and CD-ROM, and avoid using the EIDE ports entirely. Under OS/2 and Linux, SCSI gives better performance, but costs more.
DOS, Windows, Windows For WorkGroups and Windows-95 are unable to exploit the advanced features of SCSI, but at least avoid the EIDE flaw when you go pure SCSI.

6) Switch to Windows-95 or NT 3.5. Microsoft has already modified its EIDE drivers to bypass the flaws.

7) Find a software work-around. The Warp fix for the RZ-1000 turns off the prefetch buffer. Fixpack 5 and pre-release Fixpack 9 do not bypass the flaw. Now that Intel and IBM have finally revealed the technical details, all the operating system writers can patch their EIDE drivers to bypass the flaw. The Warp fix for the CMD 640B should be available soon.

8) Get a BIOS upgrade. For DOS, DESQview, and Windows 3.1, to bypass the flaw you may need a new BIOS -- an EPROM chip. If you have a flash BIOS, you can update it simply by downloading a file. Most BIOSes already have code to bypass the flaw for DOS, DESQview and Windows. However, more advanced operating systems bypass the BIOS, so even a smart BIOS will not protect you. However, the BIOS CMOS settings may allow you to disable prefetch, which protects you in true multitasking operating systems as well.

Whatever method you use to bypass the flaw, retest with EIDEtest and CDTest afterwards to be sure your fix worked and you caught all the problems.

CLEANING UP THE MESS

Once you have bypassed the flaw, you can start working the problem of cleaning up your files. The first thing to do is to re-install your operating system and all your application programs. This will replace any damaged EXE and DLL files.

Catching errors in your data files is more difficult. Keep your eyes peeled for any improbable spreadsheet results. You may have to hire a programmer to write you some comb programs to sniff through your databases, looking for suspicious values.

If you routinely use the verify feature of Lotus Magellan, it can detect changes to files that should not have changed. This may help you uncover some of the damage.
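For readers without Magellan, the same kind of check can be approximated with a short script. What follows is only a sketch of the general idea, not a description of how Magellan's verify feature works. The choice of Python and of SHA-256 digests is an assumption, and the function names (file_digest, build_manifest, changed_files) are invented for illustration. It records a digest of the contents of every file under a directory, so that a later run can list the files whose contents differ.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its content digest."""
    manifest = {}
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def changed_files(old_manifest, new_manifest):
    """List files present in both manifests whose contents differ."""
    return sorted(
        path
        for path, digest in new_manifest.items()
        if path in old_manifest and old_manifest[path] != digest
    )
```

A manifest is only as trustworthy as the machine that built it. One taken after corruption has already occurred will faithfully record the damaged contents, so a check like this can only flag changes that happen after the manifest was made.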
The flaw is not polite enough to redate the files it corrupts.

If you have backups from before the time you bought the faulty machine, you can restore them and re-key everything. Most people will not be so fortunate. All their backups will also be corrupt. Most people with the flaw will just have to put up with random errors dotting their data files ever after.

continued ...

Roedy@bix.com

-30-

--------------------------------------------------------------------------------

From: roedy@BIX.com (Roedy Green)
Newsgroups: comp.os.os2.bugs
Subject: rev 15 faq EIDE controller flaws part 2 of 2
Date: 1 Sep 1995 01:09:00 GMT
Organization: Canadian Mind Products
Lines: 546
Message-ID: <425mfc$hgo@news2.delphi.com>
NNTP-Posting-Host: bix.com
X-Newsreader: Galahad 1.1f

EIDE CONTROLLER FLAWS part 2 of 2

SUMMARY

Operating System    Work Around

Netware             - no problems reported.
Unixware 1.1
NEXTSTEP
Banyan
Solaris 2.4+
SCO Unix 3.1+
Windows NT 3.5
Windows-95

DOS                 - no problems reported so far. If you do have
DESQview              trouble:
Windows 3.1         - turn off EIDE prefetch in CMOS settings.
                    - upgrade BIOS chip.
                    - turn off simultaneous disk/floppy/tape I/O in
                      your backup programs.

Windows For         - turn off 32 bit disk access mode.
WorkGroups          - turn off EIDE prefetch in CMOS settings.
                    - upgrade BIOS chip.
                    - turn off simultaneous disk/floppy/tape I/O in
                      your backup programs.

Windows NT 3.1      - turn off EIDE prefetch in CMOS settings.
                    - apply ATDISK.SYS patch available at
                      http://www.microsoft.com/KB/softlib

OS/2 2.1            - disable prefetch buffer in CMOS settings.
                    - load the IBMINT13.I13 driver instead of the
                      IBM1S506.ADD driver. This trick will only work
                      if your BIOS has flaw bypass code. It will be
                      slow.
                    - upgrade to Warp.

OS/2 Warp 3         - disable prefetch buffer in CMOS settings.
                    - apply fix for APAR PJ19409 from IBM at
                      ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/pj19409/pj19409.zip
                    - in a pinch, if you cannot do either of the
                      first two things, add a line to config.sys
                      BASEDEV=IBMINT13.I13 and remove the line
                      BASEDEV=IBM1S506.ADD. The IBMINT13.I13 device
                      driver lives in C:\OS2\BOOT, on the first
                      install diskette, and on the CD-ROM in
                      \OS2IMAGE\DISK_1. This trick will work only if
                      your BIOS has flaw-bypass code. It will be slow.

Linux               - disable prefetch buffer in CMOS settings.
                    - to bypass the original CMD 640B flaw use the
                      boot time kernel parameter: hda=serialize.
                    - use the default settings of the external Hard
                      Disk Parameter utility, hdparm, so that
                      interrupts are suppressed during I/O.

REPORTING YOUR FINDINGS

Whether or not you find the flaw, please email me at Roedy@bix.com or post the following information in the Internet newsgroup comp.os.os2.bugs:

1) Test results. (I need to hear about both machines with and without the flaw.)

2) Brand and model of your motherboard.

3) Brand and model of your entire system.

4) Which chip did you find, the RZ-1000, the CMD 640B, the SMC 37650? What did SYSINFO 3.02 report about your EIDE controller chip?

5) Have you noticed data file corruption?

6) Which tests and versions did you use? (IOtest, EIDEtest, CDtest, RZtest, Ctrltest or visual inspection)

7) What activities did you run in the background during the test?

8) Which operating system and version you used to run the test (e.g. Warp Connect blue spine)

9) Brand and model of EIDE hard disk

10) Brand and model of EIDE CD-ROM

11) Markings on the suspect chip, e.g., "RZ-1000BP", "CMD PCIO640B", "SMC 37650".

12) Vendor's name

13) Vendor's response on informing him of your problem.

Please do not bother to report after 1995 September 30. The Internet is allowing the user community to rapidly sort this problem out, and all will be well-documented by then.

WHOSE FAULT IS IT?
The wags will have fun tormenting Intel for using the flawed RZ-1000 chip and the triply flawed CMD 640B in its motherboard designs, even though Intel did not manufacture either of the two faulty chips. Intel is not the only company to manufacture motherboards with the faulty chips, but Intel will bear the brunt of the bad publicity. PC-Tech manufactured the faulty RZ-1000 EIDE controller chip used in many PCI motherboards. PC-Tech is a subsidiary of ZEOS, the clonemaker. In turn Micron Electronics owns ZEOS. PC-Tech has offices just down the street from Zeos in Minnesota. Intel bought the chips from PC-Tech, and in turn many clone makers bought motherboards from Intel. Other motherboard manufacturers also used the faulty chips. PC-Tech, Intel and the clone makers all failed to test their designs properly. The software makers did not test their software on enough machines to show up the problem before releasing it. Even worse, in some motherboard designs, Intel used the CMD 640B chip. This goof was inexcusable, since the chip, by deliberate design, is incapable of simultaneous I/O. How did the triply-flawed CMD 640B chip and the RZ-1000 slip through Quality Assurance testing? My guess is no one did real world testing; technicians only tested under laboratory conditions using only simple operating systems like DOS. They might have ignored flaws that happened only sporadically, blaming it on a faulty chip rather than a faulty design. It is very hard to catch a flaw that only manifests rarely. CMD, PC-Tech, Intel, and Microsoft have known about how to bypass these problems for quite some time. IBM was aware there was a problem but was unaware of the solution. For obvious reasons, these companies were reluctant to inform the public of the danger of the ongoing subtle corruption. The collective damage done by withholding information about the flaw is huge, certainly many millions of dollars for those large companies whose backups are corrupt as well. 
It will be interesting to see if anyone launches a damage lawsuit against CMD, PC-Tech, Intel or Microsoft. If they do, it might make both hardware and software makers more careful about releasing improperly tested products. There is potential here for some massive lawsuits. No wonder the companies who knew about the flaw have been so tight- lipped. Think of the damage if Boeing or GM had its plans for coming products stored on flawed machines. Literally, this flaw could cause plane crashes. INTEL'S SPIN There are three levels of "Intel Inside". 1. Your motherboard has an Intel CPU but a support chipset from another manufacturer. 2. Your motherboard has an Intel CPU and Intel support chipset such as the Neptune or Triton, but some other company built the BIOS and motherboard. 3. Your motherboard has an Intel CPU, Intel support chipset, Intel motherboard and Intel BIOS. Intel literature on the RZ-1000 and CMD 640B only refers to (3). Intel cannot very well speak for (1) and (2) where the PCI EIDE controller design was out of their control, even though these machines bear the "Intel Inside" logo. Intel does not make this distinction clear in their literature. According to Intel, "This problem is a consequence of the RZ- 1000's inability to fully compensate for all the implications of running an IDE hard disk as an extension of the PCI bus, instead of running as an extension of the AT bus which it was originally designed to do." Intel would have us believe the problem is not a flaw per se, but rather a limitation that the programmers forgot to take into consideration. The truth is grey. UART chips have similar flaws. Programmers have gradually learned to code around them. We don't insist that all COM port hardware be recalled. We now tend to blame a programmer if he does not bypass the known UART flaws. No one who understood the RZ-1000 and CMD 640B flaws publicised their findings. 
If PC-Tech, Intel and Microsoft had not been so secretive, the damage would have been averted. Perhaps they were silent because the flaw primarily hurt the customers of a competitor, IBM. Given that software work-arounds are now possible, the primary blame for any perpetuation of the problem shifts to the software authors.

However, there are many other EIDE chip designs that do not have this "limitation". Since the RZ-1000 chip was a supposedly generic implementation of the ATA interface standard, this flaw cannot be so lightly excused.

The CMD 640B is triply flawed:

1. It has the same prefetch problem as the RZ-1000.

2. It erroneously responds to floppy status commands, and even worse, in the process, corrupts hard disk data.

3. It does not support simultaneous I/O on the primary and secondary EIDE ports.

The CMD 640B chip should never have been used in any PC. I am unaware of any legitimate use for such a brain-damaged chip. Intel and ASUSTeK must take full blame here for using such an inappropriate part in their motherboards. In my eyes, Intel and ASUSTeK have irreparably ruined their reputations.

SPECULATION

Because setting the flaw right would be so expensive, I suspect that clone makers and motherboard manufacturers will continue to refuse to correct the flaw. At best they may offer BIOS upgrades to bypass the flaws.

Microsoft has already added code to Windows-95 and NT 3.5 to bypass the flaws. Clone makers will rely on software vendors to write drivers that bypass the flaws for Warp, Linux and the various UNIXes. Since the OS/2 patches will be out soon, the pressure to set things right will dwindle. Since DOS, Windows in 16 bit mode, Windows-95 and NT 3.5 are immune, little pressure to correct the problem is likely to come from those camps.

The motherboard manufacturer has five options:

1) Replace the motherboard. Recalls on a mass scale would be extremely costly for the motherboard manufacturers, so you can count on them to fight.
($400 parts + $250 labour)

2) Provide a replacement paddleboard EIDE controller that takes up a PCI slot. ($75)

3) Provide a new BIOS chip that bypasses potential problems for DOS and Windows. It could also turn off prefetch which would rescue multitasking operating systems that do not use the BIOS for I/O. ($10)

4) Tell the users to upgrade to software that bypasses the flaw, and to turn off simultaneous disk/tape/floppy I/O in any backup software run under DOS, DESQview or Windows. ($0)

5) Stonewall and refuse to even acknowledge the problem. This will be more difficult now that Intel and Dell have publicly admitted the problem. ($0)

Intel has already set the precedent by offering to replace defective Pentiums, even though software can bypass its divide flaw. The RZ-1000 flaw is far more serious, and the CMD 640B is more serious still.

Keeping this under wraps is going to be hard for the clone builders. Brooke Crothers of Infoworld did a story based on my compilation. I have been in contact with Jerry Pournelle of Byte. I sent email to John Dvorak. Even the San Jose Mercury Daily News did a story. A 1000-word abridged version of this essay is appearing in The Computer Paper that goes across Canada. The stonewall is coming tumbling down. As one man pointed out, "I read your postings on the Internet, and see them the next day quoted in my daily newspaper."

TECHNICALLY WHAT ARE THE FLAWS?

After the manner of Ionesco, Roedy Green said, "All great programmers are paranoid." Programmers have to anticipate problems that could happen only once in a trillion machine cycles since such a problem would still show up on average every three hours. The EIDE problem sometimes goes for days without manifesting. Sometimes it shows up within seconds, depending on the unrelated I/O activity in the machine.

I have read about ten conflicting explanations from authorities on the cause of the problems.
I based my explanations on postings from Sam Detweiler of IBM's Warp Device Driver section (sdetweil@vnet.ibm.com).

The RZ-1000 and CMD 640B both have the prefetch flaw. The CMD 640B has two additional flaws: lack of simultaneous I/O support and floppy controller interference.

Flaw 1: Prefetch Buffer Flaw

The fatal coincidence tends to happen when you have both the EIDE controller (hard disk or CD-ROM) and the floppy controller (floppy or tape backup) working at once.

Data moves from the hard disk to RAM via a bit bucket brigade. The RZ-1000 grabs data 16 bits at a time from a buffer in the integrated controller on the hard disk, and hands it off 32 bits at a time to the PCI bus. The CPU sits in a tight loop grabbing data from the PCI bus and storing it in RAM.

In prefetch mode, the RZ-1000 keeps ahead of the CPU, requesting two 16-bit chunks from the hard disk, in order to have a 32 bit chunk ready when the CPU asks.

When you disable the prefetch buffer, you turn off the parallelism and run in a degraded lock-step mode. In degraded mode, the RZ-1000 waits until the CPU asks for a 32 bit chunk. Then it puts the CPU on hold while it asks the hard disk for two 16-bit chunks. It glues them together, puts them on the PCI bus and allows the CPU to continue.

When there is a delay from some other unrelated device generating an interrupt or DMA bus cycles, the EIDE chip sometimes becomes confused and stores status instead of data into RAM, thus corrupting your data. This flaw is the result of a shortcut in the chip design -- using the same registers for both status and data.

There are two software techniques to bypass this flaw:

1) Never schedule more than one I/O at a time. Use strict polled mode with no interrupts. Turn off all unrelated interrupts during I/O. This is the DOS/Windows approach. The disadvantage is poor performance and possible lost incoming modem characters.

2) Turn off the prefetch buffer.
In a lightly loaded system, there is sufficient spare capacity on the PCI bus so running in degraded mode only slows the disk down by 1%. However, programs making extensive use of the PCI bus such as LANs or video bit-map painting will also slow down. No one has yet done benchmarks to measure the amount of degradation. Both Intel and IBM tell us that turning off prefetch to bypass the flaw has negligible effect on performance. Yet in the Plato BIOS rev 12, Intel says that enabling the prefetch buffers will "significantly increase PCI IDE Hard Disk performance." They can't have it both ways. Flaw 2: No Simultaneous I/O Only the CMD 640B has this flaw. The CMD 640B can't do more than one I/O at a time. This flaw was so obvious everyone found out about it long ago. All EIDE controllers (even fully functioning ones) cannot run master and slave simultaneously. However, two separate EIDE controllers are supposed to allow primary and secondary channels to run simultaneously. The CMD 640B has dual controllers on one chip. However, the primary and secondary channels will not work simultaneously unlike every other design. For example, you can't run your EIDE hard disk and EIDE CD-ROM at the same time. Simultaneous I/O speed is the reason we put two EIDE devices on separate channels, both as masters, rather than making one a master and one a slave on the same channel. IBM has a bypass for this blunder. When it detects a CMD 640B, Warp never schedules more than one I/O at a time when the CMD 640B is active, reducing the operating system to DOS- like performance. Flaw 3: Floppy Controller Interference This flaw only affects some CMD 640B designs, not all. The CMD 640B controller contains logic to have it act also as a floppy controller. This feature is never used. However, some motherboard manufacturers failed to hook the chip up properly to fully disable this function. The CMD 640B thinks it is in charge of floppy I/O when it is not. 
It erroneously responds to status commands directed to the real floppy controller. What is worse, when it responds, it becomes confused and corrupts any hard disk or CD-ROM I/O in progress. IBM is working on a Warp fix for this problem. Primitive operating systems are immune to this flaw since they never attempt to run the hard disk and floppy at the same time.

BACKGROUND

If you read the literature on this problem, you will see various daunting technical terms. Here is a rough explanation. There are six kinds of I/O used in PCs.

1. PIO - Programmed I/O. The CPU spoon-feeds each byte to the I/O port. The port can usually accept data as fast as the CPU can feed it. Typical IDE drives work this way under DOS. For slower devices, the CPU polls the status to see if the device is ready for yet another byte.

2. Scheduled I/O. This is a variant of PIO where the operating system feeds the I/O device some bytes, calculates how long the device should take to digest them, goes away for a while to do something else, then comes back when it figures the I/O should be complete and feeds the device a few more bytes. This is how Warp usually controls parallel port printers.

3. Interrupt I/O. Every time the port is ready to eat another byte, it raises an interrupt and the CPU feeds it some more. This is the typical way COM ports work and how Warp uses printers with the /IRQ option.

Warp EIDE drivers combine methods (1) and (3). The hard disk interrupts when it has completed the read into its on-board buffer. Then the CPU fetches data out of the buffer with PIO mode.

4. Third party DMA. The DMA controller on the motherboard copies data from RAM to the port and generates an interrupt when it is done with a block. Floppy drives and inexpensive mag tape backup drives use this method. Because of the unfortunate original AT design compromises, this method is exceedingly slow.
Third party DMA is never used for PCI bus devices, though it is still used for ISA or motherboard-based floppy controllers on PCI motherboards.

5. First party DMA, sometimes called Bus Mastering. A DMA controller on the device copies data from RAM to the port and generates an interrupt when done. High-end SCSI cards, such as the Adaptec 2940 or 2940W, use this ultimate way to fly.

6. Memory mapped I/O. The CPU copies data to a magic region of RAM which is actually on the I/O device. LAN cards or REGEN VRAM on video cards use this technique.

In a true multitasking system, such as OS/2, the CPU goes off and works on behalf of applications when the port is busy, and trusts an interrupt to bring it back when the device needs more service. It schedules several I/Os simultaneously. In contrast, DOS and Windows never do more than one I/O at a time. Further, under DOS/Windows the CPU idles while waiting for its single I/O to complete rather than working on applications.

LEARNING MORE

You can use the Internet to learn more about this problem. If you do not have Internet access, I can provide you these files on diskette.

Roedy Green's FAQ (Frequently Asked Questions), an unabridged version of this article, including the EIDEtest and CDTest programs:
ftp://ftp.cdrom.com/.4/os2/incoming/eidete15.zip

Warp bypass for the RZ-1000 chip:
ftp://service.boulder.ibm.com/ps/products/os2/fixes/v3.0warp/english-us/pj19409/pj19409.zip

Intel's FAQ:
http://www.intel.com/procs/support/rz1000

Intel's RZ-1000 chip detect program:
http://www.intel.com/procs/support/rz1000/rztest.exe

Intel's CMD 640B and RZ-1000 chip detect program:
http://www.intel.com/procs/support/ctrltest/ctrltest.exe

IBM's bypass for the first CMD 640B chip flaw. IBM will soon replace it with one that bypasses all three of the CMD 640B's faults:
ftp://ftp-os2.cdrom.com/pub/os2/drivers/cmd640x.zip

IOTest from PowerQuest, the makers of Partition Magic, a Warp test for the flaw:
http://www.powerquest.com/

PC-Tech's essay:
http://www.mei.micron.com/rz1000/rz1000.txt

CONTACTING THE AUTHOR

The author, Roedy Green, is a computer consultant who prefers to work on Forth, C++, Delphi, DOS, OS/2 and Internet Web projects.

If you send me $5 (US or Canadian) to cover duplication, shipping and handling, I will send you a diskette containing all the relevant test programs, patches and essays.

Please report which machines you find the flaw in, and which software and fixpacks you were using at the time. Send email to Roedy@bix.com or discuss this problem on the Internet newsgroup comp.os.os2.bugs. You can also write via snail mail:

Roedy Green
Canadian Mind Products
#601 - 1330 Burrard Street
Vancouver, BC CANADA V6Z 2B8
(604) 685-8412
Roedy@bix.com

-30-