CBDEDUPE 0.2 32-bit version
Copyright Rob Weir, 1996
CompuServe: 71165,2722
INTERNET: rweir@cybercom.net

This program is free for personal use.

=======================================================================
WARNING: This program produces modified ChessBase data files, something
quite difficult, and quite undocumented. This program seems to work for
me, but don't you think it would be better if you made a backup of your
BIG ChessBase database before using me?!
=======================================================================

Files you now have:

   CBDEDUPE.TXT    the file you are reading
   CBDEDUPE.EXE    the CBDEDUPE program
   CBDEDUPE.CFG    the weights used for game scoring

The program CBDEDUPE takes a ChessBase data file and searches for games
which are duplicates, marking the duplicates as deleted. By "mark as
deleted" I mean that CBDEDUPE does not actually remove the game from the
CB datafile, but instead sets the delete flag on the game, which makes
the game appear grayed-out in ChessBase. ChessBase calls this a "virtual
deletion". The game can be "physically deleted" using the ChessBase
utility CBFRESH, the built-in option in CBWIN, or my CBSTRIP Utility.

=======================================================================

New in version 0.2 32-bit

This 32-bit version should be functionally identical to the previous
16-bit release. What I have done is rewrite much of the file access and
sorting routines to take advantage of the capabilities of WIN32, using
virtual memory, memory-mapped files, etc. Testing shows that this has
resulted in a 10x improvement in speed over the 16-bit version.

Also, I've added a "-p" "practice mode" flag. If you run like this:

   CBDEDUPE radical mainbase.cbf -p

then CBDEDUPE will not delete the games from the database, but will
instead give a list of the games it would have deleted.
=======================================================================

ChessBase has the Nunn Utilities and CBWIN, both of which allow the user
to remove duplicate games, so why would you want something else? I'm
glad you asked!

   1) CBDEDUPE is faster than the Nunn Utilities or CBWIN in finding
      duplicates.
   2) CBDEDUPE finds more duplicates than the Nunn Utilities or CBWIN.
   3) CBDEDUPE is free.

To be fair to ChessBase, their products have the following advantages:

   1) They have a better user interface.
   2) With CBWIN, duplicate removal is integrated with the product.
   3) They have the support and reliability which only commercial
      products can offer.

All in all, each program has advantages. I've seen duplicate games which
the Nunn Utilities misses and CBDEDUPE finds, and I've seen it the other
way around. If you want, use both. It never hurts to have several tools
in your collection!

=======================================================================

CBDEDUPE is easy to run. You just pass in a "Search Level" option and
the name of a ChessBase file as arguments and let it run. For example:

   CBDEDUPE RADICAL C:\CB\DATA\DUTCH.CBF

The Search Level option lets you choose how close games have to be in
order for CBDEDUPE to consider them to be duplicates. There are three
levels:

"CONSERVATIVE" in which two games are duplicates if the moves, comments,
variations and year of the games are identical. The game with the lower
"Score" is deleted. The Score of a game is based on several factors,
including the presence or absence of a valid year, Elo rating, and the
length of comments and variations. The idea is that if we have two games
which are otherwise identical, we should keep the game which has more
information. There is a text file, CBDEDUPE.CFG, which allows you to
adjust the weights used in calculating the Score. So, if having complete
player data (longer name, Elo rating, etc.) is more important to you
than annotations, you can adjust this file to your taste.
"LIBERAL" is like "CONSERVATIVE" except that when given two games, one of which has comments and/or variations while the other has none, CBDEDUPE will delete the unannotated game. "RADICAL" in which two games are duplicates if the moves are identical. Like the "LIBERAL" approach, the game with the lower score is deleted. The main point of the "RADICAL" approach is that if you have twenty copies of the same game, but with different annotations, CBDEDUPE will find the one with the most comments/variations and delete the rest. Also, with the "RADICAL" method, the games don't need to have the same year, so if you have a game with year=1987 and others with year=2024 and year=0, CBDEDUPE will delete all but year=1987 (assuming they have the same moves). Optionally, you can start CBDEDUPE with a "-v" parameter, and run in "Verbose mode". For example: CBDEDUPE radical big.cbf -v In verbose mode, CBDEDUPE writes a file called CBDEDUPE.OUT which shows how the program decided which games to delete. The output of the file looks like this: 4366 (1040) vs 4615 (1595) Delete 4366 4615 (1595) vs 4372 (1624) Delete 4615 ------------------------------- 2834 (94) vs 2758 (72) Delete 2758 ------------------------------- 137 (383) vs 136 (449) Delete 137 136 (449) vs 138 (275) Delete 138 ------------------------------- 1548 (53) vs 1597 (59) Delete 1548 ------------------------------- 2139 (176) vs 2141 (252) Delete 2139 2141 (252) vs 2137 (452) Delete 2141 2137 (452) vs 2140 (302) Delete 2140 2137 (452) vs 2138 (326) Delete 2138 2137 (452) vs 2249 (410) Delete 2249 ------------------------------- Each line gives the game number and score (in parentheses) for each game along with the number of the game which was marked to be deleted. So, the first line says that game 4366 (with score 1040) had the same moves as game 4615 (with score 1595) and that game 4366 (with the lower score) was deleted. A row of dashes seperates groups of games with identical moves. 

As a benchmark, I ran CBDEDUPE 16-bit, CBDEDUPE 32-bit, the Nunn
Utilities and CBWIN against a test database of 76,866 games. I got the
following results:

   PROGRAM             DUPES FOUND     TIME TO RUN
   =====================================================
   CBWIN                   131            3' 56"
   Nunn Utilities          526           24' 02"
   CBDEDUPE 16-bit         324           14' 11"
   CBDEDUPE 32-bit         324            1' 00"

One important point to note in all of this is that CBDEDUPE has to make
two decisions when it thinks it has found a duplicate:

   1) Are the two games really duplicates? The criterion for this
      varies with mode (conservative, liberal or radical).
   2) If they really are duplicates, which game should be deleted?
      CBDEDUPE always deletes the lower scoring game, based on the
      weights in CBDEDUPE.CFG.

=======================================================================

Now that you have a rough idea what CBDEDUPE does, let's go into a bit
more detail.

When writing a program to compare a large number of games to detect
duplicates, there are essentially two ways to go: the conservative,
deterministic, memory and time intensive way, or the more liberal, fast,
probabilistic approach. Each method has its advantages.

The Nunn Utilities, for example, seems to use the first approach,
comparing the moves, the game length, the players' names, the game
result, etc. If the match is not exact, a duplicate is not detected. The
price for this approach is a relatively slow, memory intensive program.
If you want to de-dupe a 100,000 game database, you would be best to let
it run overnight!

I have chosen to take a complementary approach. My system finds more
duplicates faster, but with the small chance that an occasional pair of
games that are not duplicates will be mistaken as such. These "mistakes"
occur for two reasons:

   1) CBDEDUPE only looks at the moves of the game and the year, not
      the result or the players' names. So, if two games in the same
      year have the exact same sequence of moves, they are considered
      to be the same game.
      This happens rarely in chess, except with "grandmaster draws".
   2) CBDEDUPE doesn't even compare each move. Instead, it calculates
      a 32-bit Cyclic Redundancy Check (CRC) for each game and compares
      that. Now, a 32-bit CRC has over 4 billion possible values, so
      the possibility that two different games would just happen to
      have the same CRC is very small. Based on my measurements this
      leads to around one error every 20,000 duplicates. Typical
      databases have around 10% duplicates, so in a database of
      500,000 games, you might incorrectly delete around 3 games.

So, these two "mistakes" are the price of the probabilistic design. It
is a trade-off between accuracy and performance. Overall, I think
CBDEDUPE gives vastly improved performance with a minimal loss of
accuracy.

=======================================================================
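
For the curious, the CRC idea described above can be sketched in a few
lines of Python. This is an illustration only, not the actual CBDEDUPE
code: the game records (number, moves, year, score) are made-up
stand-ins for the undocumented ChessBase format, and zlib's CRC-32 is
assumed here in place of whatever CRC routine CBDEDUPE really uses. The
function name find_duplicates and the use_year switch are likewise
hypothetical.

```python
import zlib
from collections import defaultdict


def find_duplicates(games, use_year=False):
    """Return the set of game numbers that would be marked as deleted.

    Games whose move bytes have the same CRC-32 (and, optionally, the
    same year) are treated as duplicates; the highest-scoring game in
    each group is kept.
    """
    buckets = defaultdict(list)
    for game in games:
        key = zlib.crc32(game["moves"])
        if use_year:
            key = (key, game["year"])
        buckets[key].append(game)

    to_delete = set()
    for group in buckets.values():
        if len(group) > 1:
            keeper = max(group, key=lambda g: g["score"])
            to_delete.update(g["number"] for g in group if g is not keeper)
    return to_delete


# Hypothetical records; real ChessBase moves are not stored as text.
games = [
    {"number": 1, "moves": b"e4 e5 Nf3 Nc6", "year": 1987, "score": 120},
    {"number": 2, "moves": b"e4 e5 Nf3 Nc6", "year": 0, "score": 40},
    {"number": 3, "moves": b"e4 e5 Nf3 Nf6", "year": 1990, "score": 75},
]

print(find_duplicates(games))                 # {2}
print(find_duplicates(games, use_year=True))  # set()

# The estimate from the text: 10% duplicates in 500,000 games, with
# about one error per 20,000 duplicates.
duplicates = 500_000 * 10 // 100
expected_errors = duplicates / 20_000
print(expected_errors)                        # 2.5, i.e. around 3 games
```

Comparing one 32-bit number per game instead of every move is what makes
this approach fast; the cost is the rare case where two different games
produce the same CRC.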