Solr – DIH Batchsize

I have written a post about how the DIH's import performance was horrible: 27 minutes to index 1 million rows, while posting a CSV took only 22 seconds.

We realized that there is an attribute of the dataSource tag called "batchSize" which is supposed to make a difference.

Here is a page containing the DIH FAQ, which covers a lot of ground. One quote is worth mentioning:

"DataImportHandler is designed to stream rows one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables."
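In practice that means adding batchSize to the dataSource element in data-config.xml. Here is a minimal sketch; the driver, URL, and credentials are just placeholders for whatever your setup uses:

<!-- for MySQL, batchSize="-1" passes Integer.MIN_VALUE as the fetch size, enabling row-by-row streaming -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            user="myuser"
            password="mypassword"
            batchSize="-1"/>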

If you think about batchSize, in one extreme scenario, setting it to something super small, like 1, basically means the driver will fetch one record at a time, and the back-and-forth of network round trips will be amplified linearly. In the other extreme, setting batchSize to a huge number means it will try to load everything from the database into one big batch first, hold it in memory, and then send it over the network to Solr, which could easily run into memory or network issues. Is there supposed to be a sweet spot somewhere in the middle? Ideally yes, but it does not look that obvious considering this dummy experiment.

I have a MySQL table with 1 million rows, where I changed the batch size to different values, ran a full-import each time, and analyzed the indexing speed. Here are a few screenshots.

[Screenshots: full-import runs with batchSize = 10, 10,000, 100,000, and -1]

Clearly, changing the batchSize to -1 completely did the magic and loaded 1 million records in half a minute. Changing it to other numbers really did not make a huge difference: regardless of whether I put 10, 10,000, or 100,000 there, the load ended up taking around half an hour.

If we really want to come up with some sort of conclusion, I would still say that increasing the batchSize increases the indexing rate, but really not by that much.

In the end, just set the batchSize to negative one (batchSize="-1") 🙂

 

SolrCloud – Get Started

SolrCloud is a cluster of Solr servers that combines fault tolerance and high availability. It is coordinated by ZooKeeper and can store all the index files in HDFS, which makes it feasible to index almost any data, big and even huge.

You can run Solr in cloud mode just by downloading plain Apache Solr, and if you are already using some sort of big data platform, like Cloudera Hadoop (CDH), SolrCloud is already preinstalled and has even been packaged/integrated into Cloudera Search.

In CDH, they use a command line utility called solrctl. Here is a dummy example that shows the idea of how to use solrctl to get something into Solr.

solrctl instancedir --generate $HOME/<solr_config_dir>

solrctl instancedir manipulates the instance directory, which contains the configuration files for a given Solr collection, something like the conf folder. This command will create a directory called <solr_config_dir> in your home directory.

This is how the generated <solr_config_dir> looks. Then we need to upload the instance directory to ZooKeeper and create a collection based on it.

[Screenshot: contents of the generated instance directory (solr_configs)]

$ solrctl instancedir --create <collection_name> $HOME/<solr_config_dir> 
Uploading configs from /home/<user>/<solr_config_dir>/conf to node1.datafireball.com:2181,node2.datafireball.com:2181,node3.datafireball.com:2181,node4.datafireball.com:2181,node5.datafireball.com:2181/solr. This may take up to a minute.

Next, you need to create a collection using the instance directory you uploaded. Here I am going to create a collection with the number of shards set to two.

solrctl collection --create <collection_name> -s 2

Then you can run the list command to see the created collections:

$ solrctl collection --list
collection1 (2)
collection3 (2)

After you have created the collection, you can upload a few documents using the built-in post tool:

cd /opt/cloudera/parcels/CDH/share/doc/solr-doc-4.10.3+cdh5.5.1+325/example/exampledocs
java -Durl=http://$SOLRHOST:8983/solr/collection1/update -jar post.jar *.xml

In the end, you should see something similar to this:

[Screenshot: SolrCloud console showing the indexed documents]

 

 

Solr – Data Import Request Handler

If the data that you are trying to search/index is already in some sort of database that supports JDBC, then you are in good shape.

I was once thinking that I could export the data into a CSV-ish file and then use the CSV uploader or even SimplePostTool to upload that data into Solr. Clearly, this is such a common use case that they have a feature that connects Solr directly to the database to avoid this exporting and uploading process, and this functionality is called the "Data Import Request Handler".

This post is a quick and dirty tutorial where I load 1 million records into MySQL and then use the Data Import Request Handler to feed them to Solr for indexing.

Believe it or not, I have not quite fully understood how to create a brand new core from scratch. In this tutorial, I completely forked the example project, techproducts, where all the project structure has been well laid out. Then it is just a matter of changing a few parameters here and there, adding a few dependencies, and that is it.

In the end, there are three files that I need to change:

First is schema.xml: since we are going to index our own data, I have to add all the fields to schema.xml to make sure every necessary SQL table column has a corresponding field in the schema.
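For example, a few field definitions might look like this; the field names here are just stand-ins for my table's columns, and string/text_general are field types that already exist in the techproducts schema:

<!-- one field per SQL column that we want to index/store -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>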

Then there is solrconfig.xml: there are two places in it that we need to modify. The first is to add the data import request handler.

[Screenshot: the /dataimport request handler added to solrconfig.xml]
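The handler definition is roughly the following; it simply points at the data-config.xml file described below:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <!-- the DIH configuration file, placed next to solrconfig.xml -->
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>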

The second is to include the necessary dependencies so Solr knows where the JDBC driver (MySQL in this case) is and where the DataImportHandler jar is.

The DataImportHandler jar ships by default under the dist directory; the MySQL JDBC driver I had to download myself into the dist directory and reference in solrconfig.xml.

[Screenshot: the lib directives for the DataImportHandler and MySQL JDBC driver in solrconfig.xml]
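The lib directives look something like this; the exact MySQL jar name depends on the connector version you downloaded:

<!-- DataImportHandler jars shipped under dist/ -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- MySQL JDBC driver, downloaded manually into dist/ -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="mysql-connector-java-.*\.jar" />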

Finally, there is the file data-config.xml, where the MySQL credentials, the SQL query, and the column-to-field name mapping are defined.
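Here is a minimal sketch of what data-config.xml can look like; the connection details, table, and column names are placeholders for my setup:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="myuser"
              password="mypassword"/>
  <document>
    <!-- one entity per SQL query; each column maps to a schema field -->
    <entity name="mytable" query="SELECT id, name, description FROM mytable">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>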

In the end, restart Solr and you are good to go. You can trigger the data import either by issuing an HTTP call or by logging in to the Solr dashboard and clicking the "Execute" button there; the screenshot below shows how it looks in the end, including the data-config.xml content.
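The HTTP call for a full import looks something like this, assuming the default port and whatever core name you created:

http://localhost:8983/solr/<core_name>/dataimport?command=full-import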

[Screenshot: the dataimport page in the Solr dashboard, showing the data-config.xml content]

In the end, I have to say that the process was pretty easy; however, the out-of-the-box performance was not as satisfactory as I expected. It took almost half an hour to index only 1 million rows, while the bin/post command can upload 1 million rows in 24 seconds!

More research needs to be done, but this quick experiment is very meaningful, since this solution could potentially tie different types of relational databases together seamlessly (MySQL, SQL Server, Teradata, and even Hive, Impala…) without too much hassle. 🙂

 

Lucene – Run Indexer and Searcher

You have to buy Lucene in Action! I personally think Lucene in Action and Solr in Action are the two best technology books that I have read. I am trying to follow the sample application example in section 1.4.

You can download and run the example code here. I do not have that much experience working with Java; most of the projects that I have worked with use the build tool Maven, where the POM file takes care of all the dependencies and the build process. This time, the authors of Lucene in Action decided to use Ant.

At the beginning, I thought it might be another headache; however, the tutorial turned out to be very easy to follow: simply download the source code, navigate to the project root directory where build.xml is located, and run the command:

 ant Indexer

You just type in a few input arguments on the command line, the program locates the example data, and everything finishes smoothly.

I was super surprised at how easy this was, and looking at build.xml, we can see how this happens:

[Screenshot: the Indexer target in build.xml]
The name attribute of the target is why ant Indexer locates this build block. Then we have an info tag. Following the info, there are two input blocks that define the prompt message and the property name that stores the answer. Finally, the run-main block is the interesting part: its class attribute defines where to look for the main class in the package, and it passes the two arguments to the program.
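The info and run-main elements are custom macros from the book's build files, but a simplified sketch using standard Ant tasks (input and java) shows the same idea; the class name and paths below are my guesses, not the book's exact values:

<target name="Indexer" description="Index a directory of text files">
  <echo>Runs the chapter 1 Indexer example</echo>
  <!-- prompt for the two command line arguments -->
  <input message="Directory to write the index to:" addproperty="index.dir"/>
  <input message="Directory containing the files to index:" addproperty="data.dir"/>
  <!-- run the main class, passing the two answers as program arguments -->
  <java classname="lia.meetlucene.Indexer" fork="true">
    <classpath>
      <pathelement location="build/classes"/>  <!-- compiled example classes -->
      <fileset dir="lib" includes="**/*.jar"/> <!-- Lucene jars -->
    </classpath>
    <arg value="${index.dir}"/>
    <arg value="${data.dir}"/>
  </java>
</target>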

After running the Indexer and Searcher, you will see that the folder we specified, indexes/MeetLucene, has been generated. That is the folder where the whole index is stored.

$ ls
_0.cfs       segments.gen segments_2

There are three files under the index folder, and they are all binary files that are not straightforward to interpret. I did not even find a good place in the book explaining what those files are for.

After a quick Google search, you can find Lucene Index File formats here.

CFS (_0.cfs): a compound file that bundles the other per-segment files into one

segments_N / segments.gen: track the active segments (the current commit point)

Here is a screenshot of the first few lines of the cfs file, and we can clearly see that it is a compound of a few different smaller files.

[Screenshot: dump of the beginning of the _0.cfs file, showing the embedded file names]

  • _0.tii The index into the Term Infos file
  • _0.tis Part of the term dictionary, stores term info
  • _0.fdx Contains pointers to field data
  • _0.nrm Encodes length and boost factors for docs and fields
  • _0.fdt The stored fields for documents
  • _0.prx Stores position information about where a term occurs in the index
  • _0.frq Contains the list of docs which contain each term along with frequency
  • _0.fnm Stores information about the fields

Solr – Suggester – FileDictionaryFactory

Autocomplete is a feature that dramatically improves the user's search experience. So what does autocomplete, or a suggester, look like if you have never seen these terms before?

Here is a screenshot of how autocomplete/suggest looks in Google.

[Screenshot: Google autocomplete suggestions]

If you want to learn more about how the suggester works in Solr, here is a Confluence article I found very helpful.

There are two definitions that I want to highlight: the dictionary implementation and the lookup implementation.

Lookup-Implementation: “The lookupImpl parameter defines the algorithms used to look up terms in the suggest index. There are several possible implementations to choose from, and some require additional parameters to be configured.”

Dictionary-Implementation: “The dictionary implementation to use. There are several possible implementations, described below in the section Dictionary Implementations. If not set, the default dictionary implementation is HighFrequencyDictionaryFactory unless a sourceLocation is used, in which case, the dictionary implementation will be FileDictionaryFactory”

Here I am going to share how to get the dictionary factory up and running. In this case, I will use the FileDictionaryFactory, where Solr reads an external file in which the suggestions are specified.

FileDictionaryFactory

“This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can also be used.

If using a dictionary file, it should be a plain text file in UTF-8 encoding. Blank lines and lines that start with a ‘#’ are ignored. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those should be separated from terms using the delimiter defined with the fieldDelimiter property (the default is ‘\t’, the tab representation).

This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:”

I first created a suggestion file under the collection folder (techproducts). Here is how the file looks:

$ cat ./example/techproducts/solr/techproducts/bwsuggester.txt
# This is a sample dictionary file.
acquire
accidentally\t2.0
accommodate\t3.0
alex\t4.0
test
test1\t1.0
test2\t2.0
test3\t3.0
ceshi
ceshi1\t1.0
ceshi2\t2.0
ceshi3\t3.0

Then you need to modify solrconfig.xml to use the FileDictionaryFactory:

[Screenshot: the suggest search component in solrconfig.xml configured with FileDictionaryFactory]
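The relevant part of solrconfig.xml looks roughly like this. In the techproducts example, a suggest component and a /suggest handler already exist, so it is mostly a matter of switching dictionaryImpl to FileDictionaryFactory and pointing sourceLocation at the file I created above:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <!-- read suggestions (and optional weights/payloads) from an external file -->
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="sourceLocation">bwsuggester.txt</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>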
In the end, you can restart solr by issuing the commands:

bin/solr stop -all
bin/solr start -e techproducts

This should restart Solr and pick up the latest configuration.

Then let's take a look at how to query the suggester using the Solr API.

http://localhost:8983
/solr/techproducts/suggest?
suggest=true&
suggest.build=true&
suggest.dictionary=mySuggester&
wt=xml&
suggest.q=ceshi2

And here is how the result looks:

[Screenshot: the suggester response for suggest.q=ceshi2]

The good news is that Solr is already using the external suggestion dictionary that we provided. The bad news is that it did not parse out the weights correctly, so more work needs to be done.

Solr – Search Relevance

Relevance is the degree to which a query response satisfies a user who is searching for information.

There are two terms we probably want to highlight here.

Precision: The percentage of documents in the returned result that are relevant

Recall: the percentage of relevant documents returned out of all the relevant documents in the whole system.
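In formula form:

Precision = (# relevant documents returned) / (# documents returned)
Recall    = (# relevant documents returned) / (# relevant documents in the whole system)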

For example, say we have 100 documents in the whole system; a user runs a query for which 30 out of the 100 are relevant, and the rest, 70 documents, are irrelevant.

Say a search algorithm returns only 10 records in total and 8 out of the 10 are relevant. In this case, let's do the calculation:

Precision = 8 / 10 ~ 80%
Recall = 8 / 30 ~ 26%

Now assume the software engineer is lazy and simply returns all the results, all 100 documents. In that case, the recall is 30/30 ~ 100% because all the relevant documents have been returned, but the precision is 30/100 ~ 30%, which is really low. 😦

“Once the application is up and running, you can employ a series of testing methodologies, such as focus groups, in-house testing, TREC tests and A/B testing to fine tune the configuration of the application to best meet the needs of its users.”

 

Solr – Simple Post Tool

Following the quick start tutorial, I realized that they use the bin/post command a lot for testing, and the terminal returns this type of response:

$ bin/post -c gettingstarted example/exampledocs/books.csv
/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home//bin/java 
-classpath /Users/datafireball/Downloads/solr-5.4.1/dist/solr-core-5.4.1.jar 
-Dauto=yes 
-Dc=gettingstarted 
-Ddata=files 
org.apache.solr.util.SimplePostTool 
example/exampledocs/books.csv
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.077

This is an example of how Solr indexes a CSV file out of the box; in the quickstart tutorial, Solr indexes all sorts of input files, including but not restricted to CSV. Then I started to feel curious: what else can this magical SimplePostTool do?

Here is the Javadoc for the class SimplePostTool, which has a perfect one-line description:

“A simple utility class for posting raw updates to a Solr server, has a main method so it can be run on the command line. View this not as a best-practice code example, but as a standalone example built with an explicit purpose of not having external jar dependencies.”

There is a very interesting method called webCrawl:

protected int webCrawl(int level, OutputStream out)

A very simple crawler, pulling URLs to fetch from a backlog and then recurses N levels deep if recursive>0

Here is the help manual of SimplePostTool:

$ /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home//bin/java -classpath /Users/myuser/Downloads/solr-5.4.1/dist/solr-core-5.4.1.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool -h

SimplePostTool version 5.0.0
Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]
Supported System Properties and their defaults:
  -Dc=<core/collection>
  -Durl=<base Solr update URL> (overrides -Dc option if specified)
  -Ddata=files|web|args|stdin (default=files) 
  -Dtype=<content-type> (default=application/xml)
  -Dhost=<host> (default: localhost)
  -Dport=<port> (default: 8983)
  -Dauto=yes|no (default=no)
  -Drecursive=yes|no|<depth> (default=0)
  -Ddelay=<seconds> (default=0 for files, 10 for web)
  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
  -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)
  -Dcommit=yes|no (default=yes)
  -Doptimize=yes|no (default=no)
  -Dout=yes|no (default=no)
This is a simple command line tool for POSTing raw data to a Solr port.
NOTE: Specifying the url/core/collection name is mandatory.
Data can be read from files specified as commandline args,
URLs specified as args, as raw commandline arg strings or via STDIN.
Examples:
  java -Dc=gettingstarted -jar post.jar *.xml
  java -Ddata=args -Dc=gettingstarted -jar post.jar '<delete><id>42</id></delete>'
  java -Ddata=stdin -Dc=gettingstarted -jar post.jar < hd.xml
  java -Ddata=web -Dc=gettingstarted -jar post.jar http://example.com/
  java -Dtype=text/csv -Dc=gettingstarted -jar post.jar *.csv
  java -Dtype=application/json -Dc=gettingstarted -jar post.jar *.json
  java -Durl=http://localhost:8983/solr/techproducts/update/extract -Dparams=literal.id=pdf1 -jar post.jar solr-word.pdf
  java -Dauto -Dc=gettingstarted -jar post.jar *
  java -Dauto -Dc=gettingstarted -Drecursive -jar post.jar afolder
  java -Dauto -Dc=gettingstarted -Dfiletypes=ppt,html -jar post.jar afolder
The options controlled by System Properties include the Solr
URL to POST to, the Content-Type of the data, whether a commit
or optimize should be executed, and whether the response should
be written to STDOUT. If auto=yes the tool will try to set type
automatically from file name. When posting rich documents the
file name will be propagated as "resource.name" and also used
as "literal.id". You may override these or any other request parameter
through the -Dparams property. To do a commit only, use "-" as argument.
The web mode is a simple crawler following links within domain, default delay=10s.

In the end, you can get the crawler working by entering the command:

java
-classpath /Users/myuser/Downloads/solr-5.4.1/dist/solr-core-5.4.1.jar 
-Dauto=yes 
-Dc=gettingstarted 
-Ddata=web 
-Drecursive=3 
-Ddelay=0 
org.apache.solr.util.SimplePostTool 
https://datafireball.com/

SimplePostTool version 5.0.0

Here is the log of crawling datafireball.com at a depth of 3 (attached log: solrwebcrawl_datafireball).

For more information about how the web crawler was written, click here.

[Screenshot: the webCrawl code]

In the end, here is how the data is indexed:

[Screenshot: the crawled datafireball.com pages indexed in Solr]

HDFS

Here is a fantastic tutorial from Pramod Narayana explaining how HDFS works.

I found the last one super helpful for understanding how to export the metadata in the NameNode into a human-readable format.