CBDIFF 0.2
Copyright Rob Weir, 1995-1996
CompuServe: 71165,2722
Internet: rweir@cybercom.net

This program is free for personal use.

============================================================================

WARNING: This program creates ChessBase data files, something quite
difficult, and quite undocumented. This program seems to work for me, but
don't you think it would be better if you made a backup of your BIG
ChessBase database before using me?!

============================================================================

Files you now have:

    CBDIFF.TXT    the file you are reading
    CBDIFF.EXE    the CBDIFF program

============================================================================

New in CBDIFF 0.2

This 32-bit version should be functionally identical to the previous 16-bit
release. The CRC files should also be interchangeable with those created
with the 16-bit CBDIFF 0.1.

I've rewritten much of the file access and sorting routines to take
advantage of the capabilities of WIN32, using virtual memory, memory-mapped
files, etc. On the same hardware, CBDIFF 0.2 is much faster than version
0.1.

Also, I've fixed a bug which caused CBDIFF 0.1 to miss some games,
especially when finding all games in a big database not in a smaller one.

============================================================================

The program CBDIFF is used to create a new ChessBase file "C" which contains
games from a set of databases "X" but which are not in a set of databases
"Y".

Read that last sentence over again and let it sink in. This sounds pretty
abstract, and at first, perhaps useless. So, before going into the details
of how this system works, let's show a quick scenario of how it can be used.

This afternoon I went on to CompuServe in the Chess Forum (GO CHESSFORUM)
and saw that there was a new file of 4161 Alapin Sicilian games (ECO B22).
This file looked familiar; in fact I had uploaded a smaller version (3111
games) of the file last year, and from member contributions, the file had
grown.

Ordinarily, I would download the file, import the games into a file with my
other Sicilian games and run the Nunn Utils (UTIL4.EXE) or CBDEDUPE to
remove duplicates. However, the Nunn Utils are slow, can use only
conventional memory, and are easily fooled by small differences in the
players' names, year, etc.

This is what I did instead:

1) I first created a CRC file (more about that later) for all my existing
   ChessBase files, like this:

       CBDIFF C:\CB\DATA\MAINBASE.CBF

   This took less than 10 minutes to go through 500,000 games. Luckily,
   this only needs to be done once to produce the CRC file, which can then
   be reused.

2) I then created a CRC file for these new Alapin games like this:

       CBDIFF C:\TEMP\B22.CBF

3) I then created a difference file which contained all the games in the
   Alapin file (B22.CBF) which were not in the CRC file for my main
   database:

       CBDIFF C:\TEMP\B22.CBF C:\CB\DATA\MAINBASE.CBF

CBDIFF created a new file, called CBDIFF.CBF, which contained only 254
games, which I then imported into my main Sicilian file using ChessBase.

So, the beauty of this system is that I detected and prevented duplicates
before importing rather than after.

============================================================================

Now that you have a rough idea what CBDIFF does, let's go into a bit more
detail.

When writing a program to compare a large number of games to detect
duplicates, there are essentially two ways to go: the conservative,
deterministic, memory and time intensive way, or the more liberal, fast,
probabilistic approach. Neither method is really better than the other.

The Nunn Utilities, for example, use the first approach, comparing the
moves, the game length, the players' names, game result, etc. If the match
is not exact, a duplicate is not detected.
The price for this approach is a relatively slow, memory intensive program.
If you want to de-dupe a 100,000 game database, you would do best to let it
run overnight!

I have chosen to take a complementary approach. My system finds more
duplicates faster, but with the small chance that an occasional pair of
games that are not duplicates will be mistaken as such. These "mistakes"
occur for two reasons:

1) CBDIFF only looks at the moves of the game, not the year, result, or
   players. So, if two games have the exact same sequence of moves, they
   are considered to be the same game. This happens rarely in chess, except
   with "grandmaster draws".

2) CBDIFF doesn't even compare each move. Instead, it computes a 32-bit
   Cyclic Redundancy Check (CRC) for each game and compares that. Now, a
   32-bit CRC has over 4 billion possible values, so the possibility that
   two different games would just happen to have the same CRC is very
   small.

So, these two "mistakes" are the price of the probabilistic design. It is a
trade-off between accuracy and performance. Overall, I think CBDIFF gives
vastly improved performance with a minimal loss of accuracy.

============================================================================

CBDIFF is easy to use. Remember, we want to create a new ChessBase file "C"
which contains games from a set of databases "X" but which are not in a set
of databases "Y".

So, first we need to record information about these games we already have,
the database sets "Y" and "X". We do this by running CBDIFF and passing the
name of the CBF file:

    CBDIFF big.cbf
    CBDIFF newgames.cbf

Now that you have created your CRC index files, you are ready to find the
games you don't already have. You do this by running CBDIFF with two
parameters, like this:

    CBDIFF newgames.cbf big.cbf

Any new games are written to a new data file called "CBDIFF.CBF".

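To make the idea concrete, here is a minimal Python sketch of a CRC-based
difference. It is not CBDIFF's actual code. The ChessBase file format is
undocumented, so each game is assumed to be already available as a byte
string of its moves. The function names and the sample move strings are
hypothetical, and zlib.crc32 stands in for whatever CRC CBDIFF really uses.
The last function gives a rough estimate of how many new games would be
wrongly dropped, assuming CRC values are spread evenly over all 2^32
possibilities.

```python
import zlib
from typing import Iterable, List, Set


def build_crc_index(games: Iterable[bytes]) -> Set[int]:
    """Return one CRC-32 value per game (the 'CRC file' of this sketch)."""
    return {zlib.crc32(moves) for moves in games}


def difference(new_games: Iterable[bytes], have_index: Set[int]) -> List[bytes]:
    """Keep only the games whose CRC does not appear in the index."""
    return [moves for moves in new_games if zlib.crc32(moves) not in have_index]


def expected_false_matches(new_count: int, index_size: int) -> float:
    """Rough expected number of genuinely new games dropped by a chance
    CRC collision, assuming uniformly distributed 32-bit CRC values."""
    return new_count * index_size / 2**32


# Hypothetical move data; real games would come from a CBF reader.
big = [b"e4 c5 c3 d5", b"e4 c5 c3 Nf6", b"d4 d5 c4 e6"]
new = [b"e4 c5 c3 Nf6", b"e4 c5 c3 e6"]

index = build_crc_index(big)           # like: CBDIFF big.cbf
print(difference(new, index))          # like: CBDIFF newgames.cbf big.cbf
print(expected_false_matches(4161, 500_000))
```

Under these assumptions, comparing 4161 new games against an index of
500,000 games gives an expected count of about 0.48 false matches, or
roughly one wrongly dropped game per 8,600 new games.
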
============================================================================

A few additional notes:

When comparing games, CBDIFF counts games with identical moves, but
different annotations and comments, to be different games.

CBDIFF leaves behind the CRC index files. This is intentional. Keeping these
CRC files saves time the next time you run CBDIFF. Also, it makes it easier
to exchange games with your friends. If you give them a copy of your CRC
file, they can give you a difference file of "all the games you don't
already have".

Enjoy!

============================================================================