Solr – DIH Batchsize | datafireball

I have written a post that the DIH’s import performance was horrible 27 minutes to index 1 million rows while CSV post took only 22 seconds.

We realized that there is an attribute for the datasource tag called “batchsize” which is supposed to change something.

Here is a page that contains a lot of DIH FAQ which covers a lot. However, here is a quote that worth mentioning:

“DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.”

If you think about batchsize, in the extreme scenario, when you set batchsize to be super small, like 1, it basically means SQL will fetch one record at a time and the back and forth, commit and network will be amplified linearly. In the other extreme scenario, if you turn batchsize to be a huge number, it will try to load everything from database into a big batch first, hold into memory and then send over the network to Solr. This could easily go wrong where either runs into memory issue or network issue. Is there supposed to be a sweet spot somewhere in the middle, ideally yes, however, it doesn’t looks that obvious considering this dummy experiment.

I have a MySQL table that has 1 millions rows where I change the batch size to be different values and run a full-import and I am trying to analyze indexing speed. Here are a few screenshots.

batchsize_10 batchsize_10000 batchsize_100000 batchsize_n1

Clearly, change the batchsize to -1 completely did the magic and load 1 million records in half an minute, however, changing it to any other numbers really didn’t make a huge difference and regardless of the number that I put there, 10, 1E4, 10E5 all ended up loading the in the time of like half an hour.

If we really want to come up with some sort of conclusion, I will still say increase the batchsize will increase the number of index rates, by really not that much.

In the end, set the batchsize to be negative one (batchsize=”-1″) 🙂