SERIOUS PCI MOTHERBOARD FLAW

Revision 12: 1995 August 13

SUMMARY OF RECENT CHANGES

1) CMD PCI0640B, yet another EIDE controller chip, exhibits the flaw.
2) The fix for APAR PJ19409, due next week, will immunise Warp.
3) EIDEtest 1.3 is posted. CDtest 1.0 is late.
4) Cleaning up past damage is very difficult.
5) Assigning blame.
6) Intel divulged technical info about the flaw. It explains nearly all the observations. This information will soon allow all operating system authors to bypass the flaw.
7) The file corruption shows up when EIDE activity coincides with DMA from the floppy or SCSI controller.
8) The Triton chipset is immune.
9) Windows-95 and NT are immune.
10) DOS and Windows 3.1 are immune if you have an Intel BIOS.
11) Intel is still reluctant to share information on how to detect the RZ-1000 chip or concomitant chipsets.

INTRODUCTION

There is an extremely serious flaw affecting about 1/3 of all PCI motherboards. Any motherboard containing the PC-Tech RZ-1000 PCI EIDE controller chip is affected. This includes motherboards from AT&T, Dell, Gateway, IBM and Intel. Since Intel makes so many of the motherboards sold under other brand names, many machines are affected, both 486 and Pentium PCI.

This flaw shows up most frequently when you run a multitasking operating system such as OS/2, Linux, SCO XENIX or other flavours of UNIX. The flaw does less obvious harm under DOS, DESQview, Windows and Windows For WorkGroups. Recent versions of Microsoft NT and Windows-95 contain code to bypass the flaw.

WHAT ARE THE SYMPTOMS?

When you are using an EIDE hard disk attached to the EIDE motherboard port, the flaw subtly corrupts your files by changing or shifting bytes every once in a while. The flaw introduces bugs into EXE files, subtle errors into your spreadsheets, stray characters into your word processing documents, changes to the deductions in last year's tax return files, and random changes to engineering design files.
This corruption mainly happens when you are simultaneously using your EIDE hard disk and some DMA (Direct Memory Access) device such as a floppy drive, mag tape backup, SCSI CD-ROM, SCSI hard disk or SCSI scanner. The same sorts of problem may occur on reading a CD-ROM drive attached to the motherboard EIDE port.

IS IT SERIOUS?

This flaw is nasty. It is causing hundreds of times more havoc than the infamous Pentium divide flaw ever did.

"I am Pentium of Borg. You will be approximated."

Not only does this corruption occur, but it occurs quietly, often going unnoticed. Under DOS, DESQview, Windows or Windows For WorkGroups it is even harder to detect because the corruption happens rarely. If the system crashes, you usually put the blame on the operating system software, or the application. It might actually be this faulty RZ-1000 EIDE controller chip nailing you.

When a directory becomes corrupted, you may not notice it until the damage is irreparable. If a spreadsheet application reads a comma-delimited ASCII file, it may simply "miss" a few bytes in a number, an error that may go unnoticed, and that error could cascade through the rest of the spreadsheet.

If you have had unexplained crashes in OS/2, you have probably experienced the problem, and should make a thorough check of your data to make sure you don't have hidden corruption. Remember that the bug may only slightly alter your data, and the corruption may not be obvious.

Keep in mind that not every problem is the RZ-1000's fault. Overheating, unrelated hardware faults and design flaws, or software bugs can cause similar symptoms. Happily, EIDEtest can unmask unrelated problems with similar symptoms.

Unfortunately, correcting the problem just stops further file corruption. It won't do a thing to clean up the existing damage to your files. Right now, the focus is on bypassing the flaw.
Preventing further corruption is child's play compared with the nightmare of trying to track down all the existing random errors in files and corrupted backups from day 1. These errors will never be completely eliminated.

HOW DO YOU TELL IF YOU HAVE THE FLAW?

There are three categories of motherboard:

1) Definitely safe. Motherboards may still have the flaw, but all software in use bypasses it.
2) Possibly dangerous. You will have to run IOtest, EIDEtest or CDtest to find out.
3) Probably dangerous. You will still have to run the tests to find out for sure.

Definitely Safe

The definitely safe category includes older machines with ISA, EISA, VESA or MCA buses. The flaw only affects machines with the new PCI bus. PCI machines that use the new Triton chipset from Intel do not have the flaw. PCI machines with Intel BIOSes that run only DOS, DESQview, Windows 3.1, Windows-95 or NT 3.5 are safe. If you have a non-Intel BIOS and run only DOS, DESQview, Windows 3.1, Windows-95 or NT 3.5 and never use the "fast mode" simultaneous disk/io feature on floppy or tape backup/restore, you are safe. You still might want to test your machine. There are similar problems with other causes that the tests will unmask.

Possibly Dangerous

Nearly all PCI motherboard chipsets, e.g. Viper, SMC, Mercury and Neptune, use a separate EIDE controller chip, which is often the flawed RZ-1000. The flaw affects DOS, Windows, and Windows For WorkGroups with 16 bit disk access only during backup and restore. The safety of Windows For WorkGroups with 32 bit disk access is still unknown.

Probably Dangerous

PCI motherboards (both 486 and Pentium) with the older Mercury and Neptune chipsets are likely to have the flaw. The Mercury chipset was popular in P60 and P66 systems, and the Neptune in P70, P90 and P100 systems. If you are using NT 3.1, OS/2 Warp, Linux or any flavour of UNIX such as SCO XENIX, you are likely to have already experienced extensive file corruption if the flaw is present.
If you have both SCSI and EIDE devices the problem is even more likely to show up frequently.

TESTING FOR THE FLAW

Scott Llewelyn, the author of PowerQuest's PartitionMagic, discovered the flaw and made it public. Prior to that only employees of PC-Tech, Intel and Microsoft were aware of the cause of the flaw. He has done most of the work documenting it. He wrote a program called IOtest that can detect the flaw if:

1) You are using OS/2 Warp.
2) You are willing to go through the hassle of creating a separate small partition to run the test in. His program PartitionMagic can be used to make room to create one.
3) You have an EIDE hard disk attached to your EIDE port. It cannot detect the problem if you only have an EIDE CD-ROM, or if the EIDE port is currently unused.

You can find the test program on the Internet at:

http://www.powerquest.com/

The program used to be called DMAtest when it was erroneously thought the problem was caused by simultaneous DMA. This should not be confused with Gazelle's DMAtest, which ensures that the floppy drive will work happily simultaneously with the hard disk.

The world needed an easier-to-use test that would run under DESQview, Windows, Windows For WorkGroups, Windows 95, NT and OS/2. So I wrote EIDEtest 1.3 to test for the flaw without requiring you to create a special partition or buy OS/2 Warp. I posted it on the Internet at:

ftp://ftp.cdrom.com/.4/os2/incoming/EIDEte13.zip

I am working on CDtest 1.0 to test for the flaw if you have only an EIDE CD-ROM without a hard disk. It should be ready for beta test on August 14. I will post CDtest on the Internet at:

ftp://ftp.cdrom.com/.4/os2/incoming/CDtest10.zip

You can also get both programs from me by snail mail.

If these tests fail, it proves you have a problem, but not that you have an RZ-1000 chip. So far there is no direct test for the chip. If the tests pass, you still may have a problem since, especially under DOS, DESQview and Windows, the flaw may only show its head very rarely.
If you run the tests under NT or Windows-95 they will always pass, even if you have the defective chip, because the operating system already bypasses the flaw. If you suspect trouble, run the tests several times.

VISUAL INSPECTION

You can also have a look at your motherboard. Between the PCI slots, at the edge of the motherboard, look for a rectangular chip about 1 by 2 cm (0.5" x 0.75") that says RZ-1000 near the top of the chip. There are variations on the chip name, e.g. RZ-1000BP. Unfortunately, the markings are not always present, especially in ASUS motherboards.

WHERE HAS THE FLAW BEEN FOUND?

Via email, on BIX and on the Internet, in comp.os.os2.bugs, people have reported finding this flaw in the following specific motherboards that use the Mercury and Neptune chipsets:

ASUSTeK PCI/I P54SP4
Dell Dimension XPS P90 but not the Dell Dimension P120C
Escom P90
Gateway P90
Intel Premiere
Intel Plato 90
Knowledgebase P90 laptop
Midwest Micro P90
PCI-EIDE local clone, Phoenix BIOS 4.04, ALI chipset, CMD PCI0640B EIDE controller chip
Vobis

WHAT CAN YOU DO IF YOU HAVE THE FLAW?

1) Pester the manufacturer. Unfortunately, the RZ-1000 chip is soldered in. The only way to repair the flaw is to replace the whole motherboard, recycling the socketed chips, e.g. the CPU, DRAM and SRAM cache. It would be very expensive for computer and motherboard manufacturers to fix the flaw. Dell has so far refused to replace the defective motherboards or even acknowledge the problem. Intel is now acknowledging the problem.
2) Buy a new unpopulated Triton PCI motherboard and recycle the CPU, DRAM and SRAM cache yourself from the old motherboard.
3) Run the controller in degraded mode. Some BIOSes have a feature to turn off the prefetch buffer on the EIDE controller. This makes disk i/o bus traffic run 16 bits at a time rather than 32 bits, thus chewing up twice as many bus cycles to get the job done. However, it bypasses the flaw.
4) Buy a PCI EIDE paddleboard controller to replace the one on the motherboard. You must disable the one on the motherboard. This would waste one of your precious slots, however.
5) Buy a SCSI hard disk and CD-ROM, and avoid using the EIDE ports entirely. Under OS/2 and Linux, SCSI gives better performance, but costs more. DOS, Windows, Windows For WorkGroups and Windows-95 are unable to exploit the advanced features of SCSI, but at least avoid the EIDE flaw when you go pure SCSI.
6) Switch to Windows-95 or NT 3.5. Microsoft has already modified its EIDE drivers to bypass the flaw.
7) Wait for the software work-arounds. The emergency Warp fix either will run the chip in a degraded PIO (Programmed I/O) mode with the prefetch buffer turned off, or it may bypass the flaw in a more elaborate way (used by Windows-95 and NT) that does not hurt performance. However, a Warp fix still would not help the Linux, UNIX, DOS or Windows users. IBM's Warp fix for APAR PJ19409, expected around August 14, is as yet unnamed. Fixpack 5 and pre-release Fixpack 9 do not bypass the flaw. Now that Intel has finally revealed the technical details, all the operating system writers can patch their EIDE drivers to bypass the flaw. Mark Lord is working on a fix for the Linux EIDE drivers.
8) Get a BIOS upgrade. For DOS, DESQview, and Windows 3.1, to bypass the flaw you may need a new BIOS -- an EPROM chip. If you have a flash BIOS, it can be updated by downloading a file. Some BIOSes already have code to bypass the flaw for DOS, DESQview and Windows. However, more advanced operating systems don't even use the BIOS, so even if you have a smart BIOS, it won't protect you.

CLEANING UP THE MESS

Once you have stopped further corruption, you can start working on the problem of cleaning up your files. The first thing to do is to re-install your operating system and all your application programs. This will replace any damaged EXE and DLL files. Catching errors in your data files is more difficult.
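
For files that are not supposed to change, one way to catch silent damage from now on is to record a checksum for every file and later flag any file whose contents differ even though its date stamp has not moved. What follows is only a rough sketch in Python, not a tool described in this report; the function names (file_digest, snapshot, silently_changed) are invented for illustration. It is only worth running after the flaw has been bypassed, since a flawed controller could corrupt the very reads used to compute the checksums, and it cannot find damage that happened before the first snapshot was taken.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root):
    """Map every file under root to (modification time in ns, digest)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            result[path] = (os.stat(path).st_mtime_ns, file_digest(path))
    return result


def silently_changed(old, new):
    """List files whose contents changed while the date stamp did not.

    A normal edit updates the date stamp; a change with an untouched
    date stamp is the suspicious case.
    """
    return sorted(
        path
        for path, (mtime, digest) in new.items()
        if path in old and old[path][0] == mtime and old[path][1] != digest
    )
```

A typical use would be to call snapshot() on a data directory once, keep the result somewhere safe, call it again some weeks later, and pass both results to silently_changed().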
Keep your eyes peeled for any improbable spreadsheet results. You may have to hire a programmer to write you some comb programs to sniff through your databases, looking for suspicious values.

If you routinely use the verify feature of Lotus Magellan, it can detect changes to files that should not have changed. This may help you discover some of the damage. The flaw is not polite enough to redate the files it corrupts.

If you have backups from before the time you bought the faulty machine, you can restore them and re-key your data. Most people will not be so fortunate. All their backups will also be corrupted. Most people with the flaw will just have to put up with random errors dotting their data files ever after.

SUMMARY

Operating System    Work Around

DOS                 - Turn off EIDE prefetch in CMOS settings.
DESQview            - Upgrade BIOS chip.
Windows 3.1         - Turn off simultaneous disk/floppy/tape i/o
                      in your backup programs.

Windows For         - Turn on 32-bit disk access mode to bypass
WorkGroups            Windows BIOS use.
                    - Turn off EIDE prefetch in CMOS settings.
                    - Upgrade BIOS chip.
                    - Turn off simultaneous disk/floppy/tape i/o
                      in your backup programs.

Windows NT 3.1      - Turn off EIDE prefetch in CMOS settings.
                    - Apply ATDISK.SYS patch.

Windows NT 3.5      - No problem.
Windows-95

OS/2 2.1            - Disable prefetch buffer in CMOS settings.
                    - Load the IBMINT13.I13 driver instead of the
                      IBM1S506.ADD driver. This trick will work
                      only if your BIOS has flaw-bypass code. It
                      will be slow.

OS/2 Warp 3         - Disable prefetch buffer in CMOS settings.
                    - Apply fix for APAR PJ19409 from IBM.
                    - In a pinch, if you cannot do either of the
                      first two things, add the line
                      BASEDEV=IBMINT13.I13 to config.sys and
                      remove the line BASEDEV=IBM1S506.ADD. The
                      IBMINT13.I13 device driver lives in
                      C:\OS2\BOOT, on the first install diskette,
                      and on the CD-ROM in \OS2IMAGE\DISK_1. This
                      trick will work only if your BIOS has
                      flaw-bypass code. It will be slow.

Linux               - Disable prefetch buffer in CMOS settings.
SCO UNIX
                    - Rewrite the EIDE drivers to avoid checking
                      status inside the interrupt handlers.
                    - Keep your ears open for fixes.

REPORTING YOUR FINDINGS

Whether or not you find the flaw, please email me at Roedy@bix.com or post the following information in the Internet newsgroup comp.os.os2.bugs:

1) Test results. (I need to hear about machines both with and without the flaw.)
2) Have you noticed data file corruption?
3) Brand and model of your motherboard.
4) Which test and version you used (IOtest, DMAtest, EIDEtest, CDtest, or visual inspection).
5) Which operating system and version you used to run the test (e.g. Warp Connect blue spine).
6) Brand and model of EIDE hard disk.
7) Brand and model of EIDE CD-ROM.
8) Markings on the suspect chip, e.g. "RZ-1000BP".
9) Vendor's name.
10) Vendor's response on informing him of your problem.

Please don't bother to report after 1995 August 31. The Internet is allowing the user community to rapidly sort this problem out, and all will be well-documented by then.

I DON'T USE WARP OR LINUX. WHY SHOULD I CARE?

The corruption occurs when a certain co-incidence occurs. These co-incidences are rarer under Windows and DOS because these operating systems normally only do one I/O at a time. (Backup software is the exception.) Warp typically has many I/Os on the go at any one time, so the fatal constellation appears more frequently. Under DOS/Windows the corruption is less obvious, but it still occurs.

WHOSE FAULT IS IT?

The wags will have fun tormenting Intel for this second major booboo, even though Intel did not manufacture the faulty chip. Intel is not the only company to manufacture motherboards with the faulty chip, but Intel will bear the brunt of the bad publicity.

PC-Tech manufactured the faulty RZ-1000 EIDE controller chip used in many PCI motherboards. PC-Tech is a subsidiary of ZEOS, the clonemaker. PC-Tech has offices just down the street from ZEOS in Minnesota.
Intel bought the chips from PC-Tech, and in turn many clone makers bought motherboards from Intel. Other motherboard manufacturers besides Intel, e.g. ASUSTeK, also used the faulty RZ-1000 chips. PC-Tech, Intel and the clone makers all failed to test their designs properly. The software makers did not test their software on enough machines to show up the problem before releasing it.

How did this flaw slip through? My guess is no one did real world testing; technicians only tested under laboratory conditions using only simple operating systems like DOS. They might have ignored flaws that happened only sporadically, blaming them on a faulty chip rather than a faulty design. It is very hard to catch a flaw that only manifests rarely.

PC-Tech, Intel, and Microsoft have known about the cause of this problem for quite some time. IBM was aware there was a problem but was unaware of the solution. For obvious reasons, these companies were reluctant to inform the public of the danger of the ongoing subtle corruption. The collective damage done by withholding information about the flaw is huge, certainly many millions of dollars for those large companies whose backups are corrupt as well.

It will be interesting to see if anyone launches a damage lawsuit against PC-Tech, Intel or Microsoft. If they do, it might make both hardware and software makers more careful about releasing improperly tested products. There is potential here for some massive lawsuits. No wonder the companies who knew about the flaw have been so tight-lipped. Think of the damage if Boeing or GM had its plans for coming products stored on flawed machines. Literally, this flaw could cause aeroplane crashes.

INTEL'S SPIN

According to Intel, "this problem is a consequence of the RZ-1000's inability to fully compensate for all the implications of running an IDE hard disk as an extension of the PCI bus, instead of running as an extension of the AT bus which it was originally designed to do."
Intel would have us believe the problem is not a flaw per se, but rather a limitation that the programmers forgot to take into consideration.

The truth is grey. UART chips have similar flaws. Programmers have gradually learned to code around them. We don't insist that all COM port hardware be recalled. We now tend to blame a programmer if he does not bypass the known UART flaws. The problem was that the flaw with the RZ-1000 was not publicised. If PC-Tech, Intel and Microsoft had not been so secretive, the damage would have been averted. Perhaps they were silent because the flaw primarily hurt the customers of competitor IBM.

Given that a software work-around is doable, though awkward, the blame for any perpetuation of the problem now shifts primarily to the software authors.

However, there are many other EIDE chip designs that do not have this "limitation". Since the RZ-1000 chip was a supposedly generic implementation of the ATA interface standard, this flaw cannot be excused. The RZ-1000 used a shoddy design shortcut.

SPECULATION

Because setting the flaw right would be so expensive, I suspect that clone makers and motherboard manufacturers will continue to refuse to correct the flaw. At best they may offer BIOS upgrades to bypass the flaw in DOS, DESQview and Windows. Microsoft has already added code to Windows-95 and NT 3.5 to bypass the flaw. Clone makers will rely on software vendors to write drivers that bypass the bug in Warp, Linux and the various UNIXes. Once the OS/2 patch is out, the pressure to set things right will dwindle. Since the flaw only sporadically corrupts DOS and Windows 3.1, and since Windows-95 and NT are immune, little pressure to correct the problem is likely to come from those camps.

The motherboard manufacturer has five options:

1) Replace the motherboard. Recalls on a mass scale would be extremely costly for the motherboard manufacturers, so you can count on them to fight it.
($400)
2) Provide a replacement paddleboard EIDE controller that takes up a PCI slot. ($75)
3) Provide a new BIOS chip that bypasses the problem for DOS and Windows. ($10)
4) Tell the users to upgrade to software that bypasses the flaw, and to turn off simultaneous disk/tape/floppy i/o in any backup software run under DOS, DESQview or Windows. ($0)
5) Stonewall and refuse to even acknowledge the problem. This will be more difficult now that Intel has publicly admitted the problem. ($0)

Intel already set the precedent by offering to replace defective Pentiums, even though the divide flaw can be bypassed with software. The RZ-1000 flaw is far more serious. Intel has now admitted the problem. Intel is still reeling from the divide flaw. This second goof will seriously tarnish Intel's reputation.

Keeping this under wraps is going to be hard for the clone builders, especially now that Intel has admitted the problem. Brooke Crothers of Infoworld called me. I have been in contact with Jerry Pournelle of Byte. I sent email to John Dvorak. Even the San Jose Mercury Daily News is doing a story on it. This essay is appearing in The Computer Paper that goes across Canada. The stonewall is coming tumbling down.

TECHNICALLY WHAT IS THE FLAW?

In order for the bug to appear, a "rare" co-incidence must happen. Something that can happen only one time in a trillion, inside a computer, will happen on average every 3 hours. As the great Roedy Green said, after the manner of Ionesco, "All great programmers are paranoid". They have to anticipate problems that could happen only once in a trillion machine cycles since such a problem would still show up on average every three hours.

The EIDE problem sometimes goes for days without manifesting. Sometimes it shows up within seconds, depending on the unrelated I/O activity in the machine.
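
The "every three hours" figure can be checked with a little arithmetic. The sketch below, in Python, assumes one opportunity for the co-incidence per clock cycle and uses clock speeds of 60, 90 and 100 MHz, which are illustrative choices taken from the Pentium speeds mentioned elsewhere in this report rather than measurements. The function name hours_between_events is made up for the example.

```python
def hours_between_events(clock_hz, odds=1e12):
    """Average hours between events that occur once per `odds` cycles.

    Assumes one chance per clock cycle, which is a simplification.
    """
    seconds = odds / clock_hz
    return seconds / 3600.0


# Assumed clock rates, in MHz, for illustration only.
for mhz in (60, 90, 100):
    hours = hours_between_events(mhz * 1e6)
    print(f"{mhz} MHz: about {hours:.1f} hours between events")
```

At 90 MHz, a trillion cycles take roughly 11,000 seconds, which is a little over three hours, so the rule of thumb holds for machines of this class.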
The fatal co-incidence tends to happen when you have two or more I/O operations happening at once -- one on the EIDE controller and one on some other unrelated device that uses DMA (Direct Memory Access) -- an I/O technique that does not need the CPU's help. The most common DMA devices are the floppy drive controller, tape backup controller or high-end first party DMA SCSI controllers such as the Adaptec 2940 or 2940W.

Technophobes are invited to skip the next section.

The CPU uses polled mode to read the EIDE hard disk. The RZ-1000 grabs data 16 bits at a time from the hard disk, and hands it off 32 bits at a time to the PCI bus. The CPU sits in a tight loop grabbing data from the PCI bus and storing it in RAM. When the last word arrives from the hard disk into the RZ-1000, it generates an interrupt to signal the end of the transfer. Then there is a race. Nearly always the CPU finishes transferring the last 32 bits before the interrupt triggers. However, if some unrelated DMA device is using the PCI bus at just the wrong time, it holds the CPU up for a few cycles. Then the CPU loses the race and the interrupt happens before the I/O is truly complete.

This is not the end of the world. After the interrupt handler has finished its job, the CPU resumes its work and stuffs the last 32 bits properly into RAM. The problem comes when the interrupt handler routine asks the RZ-1000 for its status, i.e. were there any errors in that last transfer? At this point the deadly flaw hits. The RZ-1000 becomes confused and forgets the 32 bits of data still left to be transferred. Instead of later handing off the 32 bits it was supposed to, it hands off gibberish (mangled status actually).

There are two software techniques to bypass this flaw:

1) The interrupt handler must never probe for status. Status probing must be done later, after the CPU has completed copying all the data into RAM. This is the technique that Microsoft's NT 3.5 and Windows 95 use.
The big advantage is the low overhead. I/O is almost as fast on a flawed machine as on a perfect one.
2) Turn off the prefetch buffer. In this case the RZ-1000 talks 16 bits at a time to the hard disk, and 16 bits at a time to the PCI bus. Then the pathological co-incidence cannot happen. However, it takes twice as many PCI bus cycles to transfer the data. In a heavily loaded system, this is a stiff penalty to pay. In a lightly loaded system, there is sufficient spare capacity on the PCI bus so that this extra traffic only slows the disk down by 1%. The advantage of this method is simplicity. This is the method of choice for quick-and-dirty emergency patches.

There are two other unrelated flaws that show similar symptoms.

1) Older non-PCI AT machines often cannot handle more than one DMA transfer at a time. Gazelle's freeware DMAtest can detect this flaw. DOS and Windows tolerate a faulty DMA controller since they never do more than one I/O at a time except in floppy or tape backup programs. However, OS/2 and Linux will not work with a faulty DMA controller. Since the RZ-1000 flaw mimics that old problem, the RZ-1000 flaw is often erroneously referred to as the "DMA bug". The RZ-1000 flaw only indirectly concerns DMA. EIDE devices run in polled PIO mode and do not use DMA themselves.
2) Intel Premiere motherboards have a couple of known bugs. One of these was due to a bug in the early revision of Intel's Neptune PCI chipset, so it only affected early-revision boards with 90/100 MHz Pentiums. In contrast, the RZ-1000 flaw affects PCI motherboards at any speed.

FOR FURTHER RESEARCH

Here are the outstanding questions.

1) Is the CMD PCI0640B EIDE controller just another name for the RZ-1000, is it yet another chip with the same or similar flaw, or is something entirely different causing the trouble?
2) Are IDE disks (as opposed to EIDE) immune?
3) How do you tell if you have a Mercury, Neptune or Triton chipset either visually or with software?
Intel said they were unwilling at this time to release that information, though they would consider it.
4) How do you tell if you have the RZ-1000 using software? Intel is refusing to release this information. They are concerned people will focus on whether the flaw is present rather than on whether the flaw is successfully bypassed.
5) Are there any variants of the RZ-1000 that do not have a flaw?

CONTACTING THE AUTHOR

The author, Roedy Green, is a computer consultant who prefers to work on Forth, C++, Delphi, DOS, OS/2 and Internet Web projects.

If you send me $5 (US or Canadian) to cover duplication, shipping and handling I will send you a diskette containing four programs: PowerQuest IOtest, Gazelle DMAtest, and both EIDEtest and CDtest from Canadian Mind Products.

Please report which machines you find the flaw in, and which software and fixpacks you were using at the time. Send email to Roedy@bix.com or discuss this problem on the Internet newsgroup comp.os.os2.bugs.

You can also write via snail mail:

Roedy Green
Canadian Mind Products
#601 - 1330 Burrard Street
Vancouver, BC CANADA V6Z 2B8
(604) 685-8412

-30-