The problem with engine testing is that it is almost impossible to play enough games to get a statistically significant change in the score, at least when you try small changes in the evaluation function.
What if we could measure the quality of every move instead? That would immediately enlarge the sample by a factor of 40 to 60 (roughly the number of moves per game).
What I'm thinking about is the following test:
1. Grab some 100,000 positions from master games.
2. For each position, compute the best move and its value with a strong engine at sufficient time. This will take some weeks, but it has to be done only once.
3. For each position, compute a move with your engine at a shorter time. If it is the same move as the one computed in step 2, score zero minus points. Otherwise, compute the value of your engine's move with the strong engine at sufficient time, remember the result for future tests, and score the difference between the two values as minus points (see the sketch below).
Now try to minimize your minus points.
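To make the procedure concrete, here is a minimal sketch of steps 2 and 3 using the python-chess library. The engine paths, time limits, and the mate_score constant are placeholders of my own, not part of the proposal; the cache of reference evaluations is what keeps repeated test runs cheap.

```python
import chess
import chess.engine

REFERENCE_TIME = 60.0   # "sufficient time" for the strong engine (assumption)
TEST_TIME = 1.0         # shorter time for the engine under test (assumption)

def centipawns(info, turn):
    """Score from the side-to-move's point of view, in centipawns."""
    return info["score"].pov(turn).score(mate_score=100000)

def minus_points(fens, strong_path="./stockfish", test_path="./myengine"):
    strong = chess.engine.SimpleEngine.popen_uci(strong_path)
    test = chess.engine.SimpleEngine.popen_uci(test_path)
    cache = {}   # (fen, move) -> reference value; persist to disk in a real setup
    total = 0
    try:
        for fen in fens:
            board = chess.Board(fen)

            # Step 2: reference move and value, computed once and cached.
            if (fen, None) not in cache:
                info = strong.analyse(board, chess.engine.Limit(time=REFERENCE_TIME))
                cache[(fen, None)] = (info["pv"][0], centipawns(info, board.turn))
            best_move, best_value = cache[(fen, None)]

            # Step 3: the tested engine's move at the short time control.
            played = test.play(board, chess.engine.Limit(time=TEST_TIME)).move
            if played == best_move:
                continue   # same move as the reference: zero minus points

            # Evaluate the tested move with the strong engine; cache for reuse.
            if (fen, played) not in cache:
                info = strong.analyse(board, chess.engine.Limit(time=REFERENCE_TIME),
                                      root_moves=[played])
                cache[(fen, played)] = centipawns(info, board.turn)

            # Minus points = value lost relative to the reference move
            # (clamped at zero to absorb search noise).
            total += max(0, best_value - cache[(fen, played)])
    finally:
        strong.quit()
        test.quit()
    return total
```

Persisting the cache between runs is what gives the amortized cost described next: every move already seen in an earlier test costs nothing to re-score.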
The time spent computing move values with the strong engine is large in the beginning, but asymptotically approaches zero, since more and more of the tested moves will already have a stored value.

Has anybody tried this? Do you think this will help to improve an engine's results in actual gameplay?
Some problems that might arise:
- If engine A plays 40 OK moves and engine B plays 39 excellent moves and one terrible blunder, A might be better at gameplay, but B might score better in the test above.
- The strong engine becomes the mother of all evaluations: your engine starts to imitate its style, disregarding its own strengths and weaknesses.