CBDIFF 0.2 Copyright Rob Weir, 1995-1996

CompuServe: 71165,2722
Internet: rweir@cybercom.net

This program is free for personal use.

============================================================================
WARNING: This program creates ChessBase data files, something quite 
difficult, and quite undocumented.  This program seems to work for me, but 
don't you think it would be better if you made a backup of your BIG 
ChessBase database before using me?! 
============================================================================
Files you now have:

CBDIFF.TXT    the file you are reading
CBDIFF.EXE    the CBDIFF program

============================================================================

New in CBDIFF 0.2

This 32-bit version should be functionally identical to the
previous 16-bit release.  The CRC files should also be interchangeable with
those created with the 16-bit CBDIFF 0.1.  

I've rewritten much of the file access and sorting routines to take advantage 
of the capabilities of WIN32, using virtual memory, memory-mapped files, etc.  
On the same hardware, CBDIFF 0.2 is much faster than version 0.1.

Also, I've fixed a bug which caused CBDIFF 0.1 to miss some games, especially
when finding all games in a big database that are not in a smaller one.

============================================================================
The program CBDIFF is used to create a new ChessBase file "C" 
which contains the games from a set of databases "X" which are not in a set 
of databases "Y".  Read that last sentence over again and let it sink in.  
This sounds pretty abstract, and at first, perhaps useless.  So, before going 
into the details of how this system works, let's show a quick scenario of how 
it can be used.

This afternoon I went on to CompuServe in the Chess Forum (GO CHESSFORUM) and 
saw that there was a new file of 4161 Alapin Sicilian games (ECO B22).  This 
file looked familiar -- in fact I had uploaded a smaller version 
(3111 games) of the file last year, and from member contributions, the file 
had grown.  Ordinarily, I would download the file, import the games into a 
file with my other Sicilian games and run the Nunn Utils (UTIL4.EXE) or CBDEDUPE 
to remove duplicates.  However, the Nunn Utils are slow, can use only 
conventional memory, and are easily fooled by small differences in the 
players' names, year, etc.  This is what I did instead:

1) I first created a CRC file (more about that later) for all my existing 
ChessBase files, like this:

CBDIFF C:\CB\DATA\MAINBASE.CBF

This took less than 10 minutes to go through 500,000 games.  Luckily, this only 
needs to be done once to produce the CRC file, which can then be reused.

2) I then created a CRC file for these new Alapin games like this:

CBDIFF C:\TEMP\B22.CBF

3) I then created a difference file which contained all the games in the Alapin 
file (B22.CBF) which were not in my CRC file.

CBDIFF C:\TEMP\B22.CBF C:\CB\DATA\MAINBASE.CBF

CBDIFF created a new file, called CBDIFF.CBF, which contained only 254 
games, which I then imported into my main Sicilian file using ChessBase.

So, the beauty of this system is that I detected and prevented duplicates 
before importing rather than after.

============================================================================
Now that you have a rough idea what CBDIFF does, let's go into a bit 
more detail.

When writing a program to compare a large number of games to detect 
duplicates, there are essentially two ways to go:  the conservative, 
deterministic, memory and time intensive way, or the more liberal, fast, 
probabilistic approach.  Neither method is really better than the other.  


The Nunn Utilities, for example, use the first approach, comparing the moves, 
the game length, the players' names, game result, etc.  If the match is not 
exact, a duplicate is not detected.  The price for this approach is a 
relatively slow, memory intensive program.  If you want to de-dupe a 100,000 
game database, you would do best to let it run overnight!

I have chosen to take a complementary approach.  My system finds more 
duplicates faster, but with the small chance that an occasional pair of games 
that are not duplicates will be mistaken as such. 

These "mistakes" occur for two reasons:

1) CBDIFF only looks at the moves of the game, not the year, result, or 
players.  So, if two games have the exact same sequence of moves, they are 
considered to be the same game.  This happens rarely in chess, except with 
"grandmaster draws". 

2) CBDIFF doesn't even compare each move.  Instead, it computes a 32-bit Cyclic 
Redundancy Check (CRC) for each game and compares that.  Now, a 32-bit CRC 
has over 4 billion possible values, so the possibility that two different 
games would just happen to have the same CRC is very small.

So, these two "mistakes" are the price of the probabilistic design.  It is a 
trade-off between accuracy and performance.  Overall, I think CBDIFF gives vastly 
improved performance with a minimal loss of accuracy.
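
To make the two points above concrete, here is a rough sketch in Python.  It 
is not CBDIFF's own code:  the move encoding (moves joined as ASCII text), the 
sample games, and the names game_crc and expected_false_matches are all 
assumptions made for illustration, and the estimate assumes CRC values are 
spread evenly over all 2^32 possibilities.

```python
import zlib

def game_crc(moves):
    """32-bit CRC of the moves only; year, result and players are ignored."""
    # Hypothetical encoding: the moves joined by spaces as ASCII text.
    return zlib.crc32(" ".join(moves).encode("ascii")) & 0xFFFFFFFF

# Made-up records: A and B are the same game with different headers,
# C differs from A in its last move.
game_a = {"white": "Smith, J", "year": 1990, "moves": ["e4", "c5", "c3", "d5"]}
game_b = {"white": "Smith", "year": 1991, "moves": ["e4", "c5", "c3", "d5"]}
game_c = {"white": "Smith, J", "year": 1990, "moves": ["e4", "c5", "c3", "d6"]}

# Point 1: identical moves give an identical CRC, whatever the headers say.
same = game_crc(game_a["moves"]) == game_crc(game_b["moves"])
# Point 2: a changed move changes the CRC (collisions are possible but rare).
different = game_crc(game_a["moves"]) != game_crc(game_c["moves"])

def expected_false_matches(new_games, indexed_games, bits=32):
    """Expected number of new games whose CRC matches an unrelated indexed
    game, assuming CRC values are uniformly distributed."""
    return new_games * indexed_games / 2 ** bits
```

With the figures from the scenario earlier (4161 new games checked against 
about 500,000 indexed ones), expected_false_matches(4161, 500000) comes to 
roughly 0.48, i.e. on average less than one game per run wrongly treated as a 
duplicate, provided the uniformity assumption holds.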

============================================================================
CBDIFF is easy to use.  Remember, we want to create a new ChessBase file 
"C" which contains the games from a set of databases "X" which are not in a 
set of databases "Y".  

So, first we need to record information about the games in both database 
sets, "Y" (the games we already have) and "X" (the new games).  We do this by 
running CBDIFF and passing the name of the CBF file:

CBDIFF big.cbf
CBDIFF newgames.cbf


Now that you have created your CRC index files, you are ready to find the games 
you don't already have.  You do this by running CBDIFF with two parameters, like 
this:

CBDIFF newgames.cbf big.cbf

Any new games are written to a new data file called "CBDIFF.CBF".  
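
The two steps above amount to a set difference on CRC values.  The following 
Python sketch shows the idea under stated assumptions:  it is not how CBDIFF 
is actually implemented, the games are made-up move lists rather than real 
CBF records, and build_crc_index and difference are hypothetical names.

```python
import zlib

def game_crc(moves):
    # Hypothetical encoding: the moves joined by spaces as ASCII text.
    return zlib.crc32(" ".join(moves).encode("ascii")) & 0xFFFFFFFF

def build_crc_index(games):
    """Step 1: record one CRC per game (the role played by the CRC file)."""
    return {game_crc(g) for g in games}

def difference(new_games, known_index):
    """Step 2: keep the games from "X" whose CRC is not in the index of "Y"."""
    return [g for g in new_games if game_crc(g) not in known_index]

# Stand-ins for big.cbf ("Y") and newgames.cbf ("X").
big = [["e4", "c5", "c3", "d5"], ["e4", "c5", "c3", "e6"]]
newgames = [["e4", "c5", "c3", "d5"], ["e4", "c5", "c3", "d6"]]

index = build_crc_index(big)
cbdiff = difference(newgames, index)  # only the game ending in d6 remains
```

Because the index for "Y" is built once and only looked up afterwards, the 
cost of each later comparison depends mostly on the size of the new file.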

============================================================================         
A few additional notes:

When comparing games, CBDIFF treats games with identical moves but different
annotations and comments as different games.

CBDIFF leaves behind the CRC index files.  This is intentional.  Keeping these 
CRC files saves time the next time you run CBDIFF.  Also, it makes it easier 
to exchange games with your friends.  If you give them a copy of your CRC file,
they can give you a difference file of "all the games you don't already have".
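
The exchange described above can be pictured with the Python sketch below.  
The on-disk layout used here (sorted little-endian 32-bit values) is invented 
for the example and should not be taken as the real CRC file format;  the 
function names and sample games are likewise hypothetical.

```python
import os
import struct
import tempfile
import zlib

def game_crc(moves):
    # Hypothetical encoding: the moves joined by spaces as ASCII text.
    return zlib.crc32(" ".join(moves).encode("ascii")) & 0xFFFFFFFF

def save_crc_file(path, crcs):
    """Write CRCs as sorted little-endian 32-bit integers (made-up layout)."""
    values = sorted(set(crcs))
    with open(path, "wb") as f:
        f.write(struct.pack("<%dI" % len(values), *values))

def load_crc_file(path):
    """Read back the set of CRCs written by save_crc_file."""
    with open(path, "rb") as f:
        data = f.read()
    count = len(data) // 4
    return set(struct.unpack("<%dI" % count, data[:count * 4]))

# You: index your games once and hand the small CRC file to a friend.
my_games = [["e4", "c5", "c3", "d5"]]
path = os.path.join(tempfile.mkdtemp(), "mine.crc")
save_crc_file(path, [game_crc(g) for g in my_games])

# Friend: load your CRC file and keep only the games it does not cover.
friend_games = [["e4", "c5", "c3", "d5"], ["e4", "c5", "c3", "d6"]]
known = load_crc_file(path)
for_you = [g for g in friend_games if game_crc(g) not in known]
```

The point of the design is that only the small CRC file has to travel, not 
the database itself.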

Enjoy!

============================================================================