Friday, 9 December 2011

Apache Solr: DisMax and multiValued

At maxHeap, we try our best to speed up data delivery to the end user. MySQL just isn't fast enough, so we use Apache Solr. Solr uses Lucene at its core (both are Apache projects) to deliver lightning-fast data. So where does Solr fit into CommonFloor? Well, the auto-suggest and all searching on CommonFloor are powered by Solr.

In its recent versions, Solr has introduced a different search handler called Disjunction Maximum, or simply DisMax. The default search handler is pretty stupid, pardon my strong language, but that's really the case. There is no way to run one query across multiple fields! Rather, you have to repeat the same query for each field that you want to search. Developers came up with an answer to the problem: the copyField directive, which appends the source field to a destination field. Using this method, developers could run all queries against a single catch-all field.

But then data grew complex, and there was a new requirement! Not all fields carry the same weight: some are very important and some are less so. For example, the title would be very important, while the URL is not so much. With the copyField method developers could not express this, because every source field ends up in the same destination field. They needed something smarter and more robust, and that's where DisMax and eDisMax (extended DisMax) come in.

DisMax allows you to execute one query across multiple fields while giving a different weight to each field (they call this a boost). This allows complex searches with a robust method of boosting and selecting results. DisMax parameters like mm (minimum should match), pf and ps (phrase fields and phrase slop) and of course qf (query fields, where the fields and their boosts are specified) allow advanced matching criteria and give you a great deal of control over how the results are ranked. I could go on forever about DisMax and its uses, but I'll leave it to you to explore!
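
To make the parameters above concrete, here is a minimal sketch of how a DisMax request could be put together. The helper name build_dismax_params, the boost values, the sample query and the "75%" setting are all made up for illustration; only the parameter names (defType, q, qf, pf, ps, mm) come from Solr itself.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, field_boosts, phrase_fields=None,
                        phrase_slop=0, min_should_match=None):
    """Build the request parameters for a DisMax query (hypothetical helper).

    field_boosts and phrase_fields map a field name to its boost.
    """
    params = {
        "defType": "dismax",  # ask Solr to use the DisMax query parser
        "q": user_query,      # the raw user query, searched across all qf fields
        # qf: space-separated "field^boost" pairs
        "qf": " ".join(f"{name}^{boost}" for name, boost in field_boosts.items()),
    }
    if phrase_fields:
        # pf: fields where the whole query matching as a phrase earns extra score
        params["pf"] = " ".join(
            f"{name}^{boost}" for name, boost in phrase_fields.items()
        )
        # ps: how many positions apart the phrase terms may be
        params["ps"] = str(phrase_slop)
    if min_should_match is not None:
        # mm: how many of the optional query terms must match
        params["mm"] = min_should_match
    return params


# Illustrative values only: title matters a lot, url much less.
params = build_dismax_params(
    "2 bhk apartment",
    field_boosts={"title": 10, "url": 0.5},
    phrase_fields={"title": 20},
    phrase_slop=2,
    min_should_match="75%",
)
query_string = urlencode(params)  # append this to your select URL
```

The resulting query string can be appended to whatever select URL your Solr core exposes. Keeping the boosts in a dictionary makes it easy to tune the weight of each field without touching the query itself.
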
More about DisMax:

The second, and I'd say the more important, thing is upgrading the schema (Solr has schemas just like databases, though they differ significantly). While upgrading from Solr version 1.3 to the latest version, we ran into a lot of trouble. Due to the lack of full documentation, we were on our own to solve the issues.

The problem: Solr can have multi-valued fields (they are like arrays), which are different from normal fields. Apart from holding several values instead of one, the main practical difference is that Solr cannot sort on a multi-valued field. The Solr schema specifies all the details of the fields: how they should be processed, what filters to apply, how to parse the values and so on, along with a very important attribute, the schema version number.

Since our Solr was pretty outdated, the schema version was set to '0.1', which directed Solr to use the rules of the oldest schema version, '1.0'. Version '1.0' specifies that all fields are multiValued by nature, so even after setting multiValued to false on every field, Solr still treated each one as multiValued! Even after a lot of errors and time spent on the problem, we couldn't figure it out. Then it clicked! A very simple change, setting the schema version to '1.1', solved the problem, because version '1.1' directs Solr to treat multiValued as false by default. We could not find this documented anywhere, which is why we had such a hard time figuring out such a small fix!
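
The version rule described above can be modelled in a few lines. This is only a simplified sketch of the behaviour we observed, not Solr's actual schema-parsing code; the sample schema, the field named tags and the function effective_multivalued are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down schema; {version} is filled in below.
SCHEMA_TEMPLATE = """
<schema name="example" version="{version}">
  <fields>
    <field name="title" type="string" multiValued="false"/>
    <field name="tags" type="string" multiValued="true"/>
    <field name="url" type="string"/>
  </fields>
</schema>
"""


def effective_multivalued(schema_xml):
    """Return {field name: treated as multiValued?} under the rule described above.

    Assumption: any schema version below 1.1 falls back to the 1.0 rules,
    where every field is multiValued regardless of its attribute.
    """
    root = ET.fromstring(schema_xml.strip())
    version = float(root.get("version", "1.0"))
    result = {}
    for field in root.iter("field"):
        if version < 1.1:
            # 1.0 rules: the multiValued attribute is ignored
            result[field.get("name")] = True
        else:
            # 1.1 rules: false unless explicitly set to "true"
            result[field.get("name")] = field.get("multiValued", "false") == "true"
    return result


old_rules = effective_multivalued(SCHEMA_TEMPLATE.format(version="0.1"))
new_rules = effective_multivalued(SCHEMA_TEMPLATE.format(version="1.1"))
```

With version '0.1', every field comes out as multiValued, including title, which explicitly says false. With version '1.1', only tags is multiValued, so title and url become sortable again.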