Chess opening explorer API


This is an API that returns the most popular moves played from a given position. It is a nice way to study openings.


The API usage is simple:


This will return a JSON data structure with the following fields:

cnt - the number of times the move was played
nextmove - the SAN notation of the move
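For the starting position, the response might look something like this (the counts and the top-level array shape here are illustrative, not values from the real service; only the `cnt` and `nextmove` field names come from the description above):

```json
[
  { "cnt": 1250341, "nextmove": "e4" },
  { "cnt": 934882,  "nextmove": "d4" }
]
```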

The database was generated by analyzing around 190M positions from grandmaster games and FICS games played by players with a rating over 2000.

The database is also downloadable as an SQL file [[!AvhnLreYwL-P_GvnOSGdgFMDhDxA | here ]]

How I generated this data

First get a nice big list of games in PGN format.

Next, use this Groovy script to create a CSV file that will contain the following data:


fen is the current state of the board and SAN is the move made from that position.
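For example, the moves 1.e4 e5 would contribute rows like these (the board part of the FEN, then the move played from it):

```
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR,e4
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR,e5
```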

```groovy
@GrabResolver(name='', root='')
@Grab(group='org.apache.commons', module='commons-lang3', version='3.4')
// note: the chesspresso library must also be on the classpath
import static groovyx.gpars.GParsPool.withPool
import static groovy.io.FileType.FILES
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream
import chesspresso.move.IllegalMoveException;
import chesspresso.move.Move;
import chesspresso.pgn.PGNReader;
import org.apache.commons.lang3.RandomStringUtils

class OpeningExtractor {
        def pgnFiles = [];

        static void main(String[] args) {
                def cli = new CliBuilder(usage: "opening <pgn directory>")
                def options = cli.parse(args)
                if (options.arguments().size() < 1) {
                        cli.usage()
                        return
                }
                def start = new Date().getTime();
                new OpeningExtractor().run(options.arguments().first());
                def end = new Date().getTime();
                println ("took " + (end - start))
        }

        void run(dir) {
                println "processing directory " + dir;
                def path = new File(dir)
                if (!path.exists()) {
                        println dir + " does not exist"
                        return
                }
                path.eachFileRecurse(FILES) {
                        if ('.pgn.gz')) {
                                pgnFiles << it
                                println it
                        }
                }
                println "Found " + pgnFiles.size() + " files to process"
                def fc = 1;
                pgnFiles.each { file ->
                        println ("processing file " + file + " " + fc++ + "/" + pgnFiles.size())
                        def fis = new GZIPInputStream(new FileInputStream(file))
                        def reader = new PGNReader(fis, "");
                        int count = 0;
                        def rand = RandomStringUtils.randomAlphabetic(6).toLowerCase();
                        def csvdir = new File(dir + "/out")
                        if (!csvdir.exists()) csvdir.mkdir()
                        def csvFile = new File(csvdir.getAbsolutePath() + "/fens-" + rand + ".csv.gz")
                        def csvWriter = new BufferedWriter(new OutputStreamWriter(
                                new GZIPOutputStream(new FileOutputStream(csvFile)), "UTF-8"))
                        while (true) {
                                if (count % 10000 == 0)
                                        println("parsing game " + count)
                                count++
                                def game
                                try {
                                        game = reader.parseGame();
                                        if (game == null) break;
                                        def header = game.getModel().getHeaderModel()
                                        // all FICS games have a TimeControl tag;
                                        // no TimeControl tag means it is probably a master game
                                        def tc = header.getTag("TimeControl")
                                        if (tc != null) {
                                                // skip blitz/bullet games faster than one minute
                                                def (time, inc) = tc.tokenize("+")
                                                if (time.toInteger() < 60) continue;
                                        }
                                        def mainLine = game.getMainLine();
                                        game.gotoStart()
                                        def fens = []
                                        mainLine.each { move ->
                                                // board part of the FEN, then the move played from it
                                                fens << game.getPosition().getFEN().split(" ")[0];
                                                fens << ","
                                                fens << move
                                                fens << "\n"
                                                game.goForward() // advance to the position after the move
                                        }
                                        csvWriter.write(fens.join(""))
                                } catch (Exception e) {
                                        // skip games that fail to parse
                                }
                        }
                        csvWriter.close()
                }
        }
}
```

The script expects a source directory as a parameter and will output a compressed CSV file for each input file into a folder called out. This will take some time if there is a large number of games to process.

Now that we have every move from every position, we need to combine and count them. There are numerous ways of doing this; for example, you could use a Spark cluster if you want to get fancy, but I went the really simple way and used the tools at hand: standard Unix command-line tools. If you use the games I used, the total number of rows in your CSV files will be around 570 million, so you need fast tools if you don't want this to take forever. These tools are written in C and sort supports parallel execution, so they are relatively fast. The script to combine and count these moves looks like this:

```shell
find <outdir> -name "*.csv.gz" -exec zcat {} \; | sort -S20G --parallel=8 \
  | uniq -c | sort -S20G --parallel=8 > counts.csv
```

The second sort is optional; I just used it to check the head and tail of the file to see if it looked correct. The parameters to sort are tuned for my machine with 32GB of RAM and 8 cores, so you may need to adjust them for your environment. This script took around 6-7 hours to complete on my machine, so leave it overnight or do something else; it will consume a lot of CPU and RAM, so the machine may not be very usable in the meantime.

After the script ends you will get counts.csv, whose rows each contain a count followed by a FEN,SAN pair.


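To make the row format concrete, here is a toy run of the same sort | uniq -c pipeline on three fake rows (fenA is a placeholder, not a real FEN):

```shell
# three fen,san rows, two of them identical
printf 'fenA,e4\nfenA,e4\nfenA,d4\n' | sort | uniq -c | sort -rn
# output rows have the shape "COUNT FEN,SAN":
#   2 fenA,e4
#   1 fenA,d4
```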
COUNT is the number of times SAN was played from FEN, which is exactly what we were looking for. You may stop here if this suits your needs, but this file in its current state will take up a lot of disk space when you import it into a database. So I went one step further and added some space optimizations. The FEN takes an average of 40 bytes per position, but if you store the position differently, for example by only storing the squares of the 32 pieces, you get a better average. So here is a Groovy script that converts a FEN string to piece coordinates:

```groovy
// reads count,fen,san rows from stdin and prints one row per line
// with a square index for every piece
println "cnt,wp1,wp2,wp3,wp4,wp5,wp6,wp7,wp8,wr1,wr2,wn1,wn2,wb1,wb2,wq,wk," +
        "bp1,bp2,bp3,bp4,bp5,bp6,bp7,bp8,br1,br2,bn1,bn2,bb1,bb2,bq,bk"
System.in.eachLine { line ->
  // accept either "count,fen,san" or raw `uniq -c` output ("  count fen,san")
  def tokens = line.trim().replaceFirst(/\s+/, ",").split(",")
  def fen = tokens[1].toCharArray()
  def wp = 1, wr = 1, wn = 1, wb = 1, bp = 1, br = 1, bn = 1, bb = 1;
  def square = 0;
  // -1 marks a piece that is no longer on the board
  def res = [
    "wp1": -1, "wp2": -1, "wp3": -1, "wp4": -1,
    "wp5": -1, "wp6": -1, "wp7": -1, "wp8": -1,
    "wr1": -1, "wr2": -1, "wn1": -1, "wn2": -1,
    "wb1": -1, "wb2": -1, "wq": -1, "wk": -1,
    "bp1": -1, "bp2": -1, "bp3": -1, "bp4": -1,
    "bp5": -1, "bp6": -1, "bp7": -1, "bp8": -1,
    "br1": -1, "br2": -1, "bn1": -1, "bn2": -1,
    "bb1": -1, "bb2": -1, "bq": -1, "bk": -1
  ]
  fen.each {
    if (it == '/') return
    else if (it.toString().isNumber()) square += it.toString().toInteger();
    else {
      // string concatenation avoids the GString-vs-String map-key gotcha
      if (it == 'P') res["wp" + wp++] = square
      if (it == 'R') res["wr" + wr++] = square
      if (it == 'N') res["wn" + wn++] = square
      if (it == 'B') res["wb" + wb++] = square
      if (it == 'Q') res["wq"] = square
      if (it == 'K') res["wk"] = square
      if (it == 'p') res["bp" + bp++] = square
      if (it == 'r') res["br" + br++] = square
      if (it == 'n') res["bn" + bn++] = square
      if (it == 'b') res["bb" + bb++] = square
      if (it == 'q') res["bq"] = square
      if (it == 'k') res["bk"] = square
      square++ // each piece occupies one square
    }
  }
  // empty string for absent pieces keeps the CSV columns aligned
  println tokens[0] + "," + res.values().collect { it == -1 ? "" : it }.join(",") + "," + tokens[2]
}
```

This script reads lines from the counts.csv file and outputs lines with the piece indexes on the board. I also filtered out the positions and moves that only occurred once and imported both versions into a MySQL database. The unoptimized version takes around 5GB (without indexes) and the optimized version around 2.1GB for around 30M rows.
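As an illustration of where the saving comes from, a table for the optimized rows could look something like this (a hypothetical schema; the table name and column types are my own, with the column names following the CSV header produced by the script above). Thirty-two one-byte square indexes replace a roughly 40-byte FEN string:

```sql
-- hypothetical schema: one TINYINT square index (0-63) per piece,
-- NULL when the piece is no longer on the board
CREATE TABLE openings (
  cnt INT UNSIGNED NOT NULL,
  wp1 TINYINT, wp2 TINYINT, wp3 TINYINT, wp4 TINYINT,
  wp5 TINYINT, wp6 TINYINT, wp7 TINYINT, wp8 TINYINT,
  wr1 TINYINT, wr2 TINYINT, wn1 TINYINT, wn2 TINYINT,
  wb1 TINYINT, wb2 TINYINT, wq TINYINT, wk TINYINT,
  bp1 TINYINT, bp2 TINYINT, bp3 TINYINT, bp4 TINYINT,
  bp5 TINYINT, bp6 TINYINT, bp7 TINYINT, bp8 TINYINT,
  br1 TINYINT, br2 TINYINT, bn1 TINYINT, bn2 TINYINT,
  bb1 TINYINT, bb2 TINYINT, bq TINYINT, bk TINYINT,
  nextmove VARCHAR(10) NOT NULL
);
```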