Friday, 9 December 2011
In its recent versions, Solr has introduced a different search handler called Disjunction Maximum, or simply DisMax. The default search handler is pretty stupid, pardon my strong language, but that's really the case: there is no way to search across multiple fields! Instead, you would have to repeat the same query for each field you wanted to search. Developers came up with an answer to the problem using the copyField directive, which appends the content of a source field into a destination field; with this method they would search that one combined field for all queries.

But then data grew complex, and there was a new requirement! Not all fields carry the same weight: some are less important, and some are very important. For example, the title would be very important, while the URL not so much. The copyField method cannot express that, so something smarter and more robust was needed, and that's where DisMax and eDisMax (extended DisMax) come in. DisMax allows you to execute one query across multiple fields while giving a different weight to each field (they call this a boost). This enables complex searches with a robust method of boosting and selecting results. DisMax features like mm (minimum should match), pf and ps (phrase fields and phrase slop) and of course qf (query fields, where the fields and their boosts are specified) allow advanced matching criteria and a great way of ranking the results exactly how you want. I could go on forever about DisMax and its uses, but I'll leave it to you to explore!
More about DisMax: http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/
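As a rough sketch of how this looks in practice, a DisMax handler can be configured in solrconfig.xml along these lines (the field names and boost values here are purely illustrative, not from our actual setup):

```xml
<!-- Hypothetical DisMax handler; field names and boosts are illustrative -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- qf: the fields to search, each with its own boost -->
    <str name="qf">title^10.0 description^2.0 url^0.5</str>
    <!-- pf/ps: extra boost when the query terms appear as a phrase -->
    <str name="pf">title^20.0</str>
    <str name="ps">2</str>
    <!-- mm: minimum number of query clauses that must match -->
    <str name="mm">2</str>
  </lst>
</requestHandler>
```

With this in place a search is just a single q=... request, and Solr spreads the query across every field listed in qf, weighting matches by the boosts.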
The second, and I'd say more important, thing is upgrading the schema (Solr has schemas just like databases, though they differ significantly). While upgrading from Solr version 1.3 to the latest version we experienced a lot of trouble, and due to the lack of full documentation we were on our own to solve the issues.

The problem: Solr can have multi-valued fields (they are like arrays), which are different from normal single-valued fields. The main difference, apart from holding one value versus many, is that Solr cannot sort on a multi-valued field. The Solr schema specifies all the details of the fields: how they should be processed, what filters to apply, how to parse the values and so on, along with a very important attribute, the schema version number. Since our Solr was pretty outdated, the schema version was set to '0.1', which directed Solr to fall back to the rules of the oldest schema version, '1.0'. Version '1.0' treats all fields as multiValued by default, due to which, even after specifying multiValued as false on each field, Solr still treated them as multiValued! After a lot of errors and time spent on the problem we couldn't figure it out. Then it clicked! A very simple change, setting the schema version to '1.1', solved the problem, as version '1.1' directs Solr to use multiValued as false by default. This is essentially undocumented, which is why we had such a hard time figuring out such a small fix!
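For reference, the fix amounts to a single attribute in schema.xml. A minimal sketch (the field name and type here are illustrative, not from our schema):

```xml
<!-- version="1.1" changes the defaults Solr applies:
     fields are no longer multiValued unless explicitly declared so -->
<schema name="example" version="1.1">
  <fields>
    <!-- Under version 1.1 this field is genuinely single-valued,
         so Solr can sort on it -->
    <field name="price" type="sint" indexed="true" stored="true"
           multiValued="false"/>
  </fields>
</schema>
```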
Wednesday, 9 November 2011
So, here we go, sharing the practices we like to stick to. Would love to have your thoughts.
12 Product Management Commandments at CommonFloor
- Look for the unmet need. When talking to users and customers, always keep a fierce desire to capture that unmet need, amid all the suggestions, feature requests and problems you might be listening to.
- Put your best brains (the engineers, developers) closer to users. This can drive through-the-roof innovation.
- You are not the user. Meet a new user every week, let her use your existing product or prototype, and take observations.
- Create lifelike user personas after meeting/studying real users. Understand these personas so well that you could very likely predict your user's actions when she misses an airport bus and has no cash left to book a cab.
- Be absolutely clear and thorough with the "One Minute Product Value Proposition", which you validate each time you meet users.
- Be focused on getting the "Right Product" and let the engineers handle how to get the product right.
- Don't just get stuff done. Delight the user!
- New features are not the answer to existing problems. Observe, Analyse and Iterate to get existing things right.
"Remove a feature to add a feature" - Deep, Linkedin.com
- Fail faster. Use quick prototypes (paper sketches, mockups) to learn what's not working and what might.
- Don't ever confuse product completion/launch with the success of the product. Strive to create a rapidly growing community of inspired, enthusiastic and loyal customers.
- No Black Box Nonsense. Let data clear the darkness.
- Have product guiding principles (aka motto/mission of your product) to continue innovating.
Wednesday, 2 November 2011
Say inside dataCollection we have a nested array attribs containing value => 1200.
* To find, we need to query using a '.' in between: attribs.value => 1200. This is not the same behaviour as in the Mongo shell.
* Results come back as a nested array with no dots in between, which is how results show up in the Mongo shell.
* To update, the first argument needs to be specified as a nested array, which is how it's done in the Mongo shell.
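A minimal sketch of the points above, written as plain Python dictionaries of the kind a MongoDB driver sends over the wire (the collection and field names are from the example; the updated value 1500 is made up):

```python
# Document as stored in dataCollection: "attribs" is a nested array
document = {"_id": 1, "attribs": [{"value": 1200}]}

# Find: dot notation reaches inside the nested array
find_filter = {"attribs.value": 1200}

# Update: "$" is the positional operator, targeting whichever
# array element the filter matched
update_filter = {"attribs.value": 1200}
update_spec = {"$set": {"attribs.$.value": 1500}}

# Results come back as nested structures, not dotted keys
assert document["attribs"][0]["value"] == 1200
```

The dotted key lives only in the query; documents themselves always round-trip as nested structures.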
In our BI framework, we've developed our own caching. It sits in a position where it can be used not just by the business intelligence framework but by anything inside the company. Of course, MongoDB's quick querying and its ability to store and visualize complex data help, and it has been a huge shift, at least for the tools team, in how applications could now behave and be developed.
So with a very fast cache in place, it was only a matter of time before we started capturing what the users of the BI framework were doing. All that information was otherwise going to waste.
We capture that information now. And not just that, we learn from it. It takes a clever mathematical function to give a piece of information its value. For instance, we now know which city and which area is important internally by evaluating not only how many times it has been requested but also the volume of data it yields. Once this was implemented, we started learning. Within a day, we knew exactly what we were looking at most often and what was returning the most promising results. And with that, we knew exactly what was more important from a business standpoint, without the need of a guy with a calculator.
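We won't reproduce the real function here, but an illustrative Python sketch of that kind of scoring (the weighting itself is made up) might combine request frequency with the volume of data each query yields:

```python
import math

def importance(request_count, rows_returned):
    """Illustrative score: frequent queries matter, but so does how
    much data each one yields. The exact weighting is invented."""
    return request_count * math.log1p(rows_returned)

# A city requested 50 times yielding 10,000 rows outranks one
# requested 80 times yielding only 20 rows
cities = {"A": importance(50, 10000), "B": importance(80, 20)}
top = max(cities, key=cities.get)
```

The log term keeps a single huge result set from completely drowning out request frequency; any monotone damping function would serve the same purpose.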
Our data is not just growing. It's also getting more clever.
Saturday, 15 October 2011
The eventual need of every company is business intelligence, and a term synonymous with that is large volumes of data. Like any company, we generate hundreds of thousands of lines of logs a day. And since we call it a log, the data we record in there remains trapped for all eternity unless someone decides to do something about it.
So the tools team rose to the occasion and, at the suggestion of our catalyst and founder, Lalit, we decided to do something cool here. Lalit did an excellent job of analysing what a typical BI system would need. He was spot on about having room to twist malleable data, clean it, record it and use it as we'd want to. So we needed something that could do a little more than a normal storage backend like a typical SQL database. He suggested MongoDB, and that is when I fell in love with this wonderful database. We had been aiming at a simplistic system earlier, but as I dived deeper into Mongo and NoSQL territory, the horizons started widening instantly, expansively. I knew that a lot more was possible with this database, and that was the need of the hour: a system we could scale massively.
So I decided to go much further than a business analytics system. I decided to deliver a framework itself.
The initial phases of development were not taxing. The transition from SQL to noSQL was seamless. And within a couple of weeks, we were dealing with humongous amounts of data, parsing it, cleaning it and feeding it to the DB. Just like that.
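The parsing step was nothing exotic. A sketch of the idea, assuming a simple tab-separated log format (the field layout here is invented for illustration, not our actual log schema):

```python
from datetime import datetime

def parse_line(line):
    """Turn one tab-separated log line into a document ready for
    bulk insertion into MongoDB. Field layout is illustrative."""
    ts, user, action, target = line.rstrip("\n").split("\t")
    return {
        "ts": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "user": user,
        "action": action,
        "target": target,
    }

docs = [parse_line(l) for l in [
    "2011-10-15 09:30:00\tu42\tsearch\tbangalore",
]]
# docs can now be handed to the collection's bulk insert
```

Because MongoDB is schemaless, the documents can carry whatever fields a given log line happens to have; no ALTER TABLE ceremony between log format changes.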
We then proceeded to build a report scheduling system that again worked on top of MongoDB. Not just that, we pushed out a caching system that could now be used not just by the BI framework but by anyone in any team in need of some really fast cache on the web.
All of this didn't need genius coding. It only needed the tools team and MongoDB.
- Zero Hassle Maintenance
- Auto Scaling
- Better Security
- Most mature offerings
- Reliable and easy to use 3rd party tools
- We personally like Jeff Bezos and his ideas.