Thursday, 8 November 2012

Anti-spam framework - making sense of spam

Anti-Spam System used in CF - Real Estate

Hello Everybody.

In this post, I will explain the framework we have implemented to handle spam on our site. As you know, commonfloor allows owners, seekers and real estate agents to find and communicate with each other. The side effect of this wonderful idea is spamming. We needed a system to monitor and detect spam, and to protect genuine users from spam messages.

We have four categories of users:

  • Seeker - looking to buy/rent a property 
  • Owner - looking to sell/rent a property
  • Real Estate Agent
  • Builder

At commonfloor, we allow only registered users (users with verified mobile numbers) to communicate with each other. When a seeker is interested in a project/property, he/she sends a message to the owner of that project/property via SMS and email.

All communications are processed by our anti-spam system before being sent. When the system receives a message to be checked, it passes the message to the algorithm, which returns a 'spam score' (the probability of the message being spam). We have internally set a threshold for this spam score. If the score is above the threshold, the message is categorized as spam and won't be sent. If the score is below the threshold (i.e. the algorithm is not sure whether the message is spam), it is sent to a moderation console where our team decides whether it is spam or not. This decision is fed back to the algorithm for self-learning, so the algorithm gets better as it processes more messages and receives more feedback. Based on the decision made in the moderation console, the message is either sent or discarded. All spam messages are stored in a separate DB and used to train the algorithm and for future analysis.
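To make the flow concrete, here is a minimal sketch in JavaScript. Everything in it is illustrative: the names (scoreMessage, queueForModeration, storeSpam, deliver, train) and the 0.9 threshold are assumptions for the example, not the real internals of our system.

var SPAM_THRESHOLD = 0.9; // illustrative value; the real threshold is internal

function processMessage(message, handlers) {
  // Probability that the message is spam, as returned by the classifier.
  var score = handlers.scoreMessage(message);

  if (score >= SPAM_THRESHOLD) {
    // Confident spam: never delivered; stored for training and analysis.
    handlers.storeSpam(message, score);
    return;
  }

  // Not confidently spam: a human moderator decides, and the verdict
  // is fed back to the classifier for self-learning.
  handlers.queueForModeration(message, score, function (isSpam) {
    handlers.train(message, isSpam);
    if (isSpam) {
      handlers.storeSpam(message, score);
    } else {
      handlers.deliver(message); // genuine message goes out via SMS and email
    }
  });
}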

With this system in place, I can assure you that users receive messages only from genuine senders and are protected from spam.

Monday, 3 September 2012

Tips for debugging JavaScript and CSS issues in Internet Explorer

1. When a UI element, such as an input box, is displayed smaller than expected, the problem may be that the parent element's size has not been defined, or that the parent itself is too small.

2. In IE tables, when one of the columns has no value, keep an extra &nbsp; in the cell as a placeholder; otherwise IE will show broken cell borders. We can also add the frame attribute to the table if the above doesn't work.
 
3. Prototype's getElementsByClassName does not work in IE7. Use the $$ selector instead, e.g. $$('div.className').

4. The most common cause of JavaScript failing in IE is an extra comma after the last array element (see the sketch after this list).

5. Including the same JS file twice can cause "element not present" errors in IE7 and IE8.

6. In jQuery, adding options to a select dynamically with new Option doesn't work in IE7; the workaround is to use .append (also shown in the sketch below).

7. IE7/IE8: iframes get a default border. To remove it, set the attribute explicitly on the iframe: frameborder="0".

8. Putting JavaScript code inside an element that has not yet been closed causes problems in IE7.
   Workaround: add a dummy element as the first child of the parent, and remove that first child after the JavaScript has been evaluated.

9. Directly selecting an option by its id does not work in IE7, 8 or 9.

10. The border attribute doesn't work in IE7 and IE8. The workaround is to use images as borders.

11. The CSS written for the page should be verified with the W3C CSS validator.

12. The JS written for the page should be run through a JavaScript validator/linter (e.g. JSLint).

13. The HTML should be well-formed, e.g. with the head and body tags properly defined.
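To make tips 4 and 6 concrete, here is a small JavaScript sketch. It assumes jQuery is loaded and that the page has a select element with id mySelect; both are illustrative.

// Tip 4: a trailing comma after the last element is tolerated by most
// browsers but trips up IE7/IE8 (phantom extra element, or a script error
// in object literals).
var broken = ['Bangalore', 'Mumbai', 'Delhi',]; // bad: trailing comma
var safe = ['Bangalore', 'Mumbai', 'Delhi'];    // good

// Tip 6: instead of creating options with new Option (which fails in
// IE7 with jQuery), append the option markup directly.
$('#mySelect').append('<option value="blr">Bangalore</option>');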

Friday, 27 April 2012

Strange PHP/Zend/Apache2 issue

Setting up a new laptop for development on a complex system hardly ever goes according to plan. The same thing happened to me, and I ended up getting my entire office involved in figuring out the problem.

I make sure I back up every single thing when I switch systems, and I did the same here. I've even contemplated writing a small shell script that can install everything I'll need. Anyway, all of that went perfectly. Everything was installed and we were ready to go.

I fired up a browser, hit the URL and lo, the scary 403. At commonfloor.com, we use a pretty cool .htaccess file, and for some reason apache2 was looking for that file four directory levels above where it should have. It should have looked for it at /path_to_home/prog/php/commonfloor.com/another_dir/another_dir/ but it was looking for it at /path_to_home/prog/. That wasn't even the DocumentRoot in apache, and the error being thrown was:

(13)Permission denied: /home/ashesh/prog/.htaccess pcfg_openfile: unable to check htaccess file, ensure it is readable
Gibberish. The error message wasn't going to lead us anywhere.

What fixed the problem:


  • Set AllowOverride None (see the sketch below)
  • Ensure libapache2-mod-php5 is installed/loaded [install it using your package manager if not; enable it using sudo a2enmod php5 if your package manager didn't do it already]
  • Restart apache2. Hit the URL and groove away in joy!
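Putting that together, the steps look something like this. It's a sketch for Apache 2.2-era Debian/Ubuntu; the Directory path is the illustrative one from above, and your vhost will differ.

# In the relevant vhost's <Directory> block:
<Directory /path_to_home/prog/php/commonfloor.com/another_dir/another_dir>
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

# Make sure mod_php is present and enabled, then restart:
sudo apt-get install libapache2-mod-php5
sudo a2enmod php5
sudo service apache2 restart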


[I know that the fix is simple but the point is that it's pretty easy to overlook these problems when faced with the task of setting up a new computer.]

The king of fixes: A really cool systems guy like ours. All credits to Goutham ji.



Tuesday, 20 March 2012

Keeping PHP Sessions on Memcached

This happens to be one of the areas where the production/stage setup differs from our local setups. On prod/stage, PHP sessions are stored in memcached, which makes it impossible to store medium-to-large chunks of data in the session (doing that is not advisable anyway).
When an attempt is made to write a large piece of data to the session, the write from PHP to memcached fails, and a read failure from memcached immediately follows.
In cases where session information is needed, this results in irregular behavior. In our case, it resulted in content console users being logged out. The failure can look different in different cases, but the pattern will be the same.
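For reference, pointing PHP's session handler at memcached is a php.ini setting along these lines. The host and port are illustrative, and the exact directives depend on whether the memcache or the memcached PECL extension is in use:

; with the "memcache" extension
session.save_handler = memcache
session.save_path = "tcp://localhost:11211"

; or, with the "memcached" extension
; session.save_handler = memcached
; session.save_path = "localhost:11211"

Memcached rejects values larger than its item size limit (1 MB by default), which is exactly the write failure described above.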

Check: PHP error logs, SELinux context (if SELinux is enabled) etc.
Clues: failures occur more irregularly when the cause is the write failure discussed above; they are more regular with SELinux issues.

The fix for the SELinux issue is simple. Just check:
# getsebool httpd_can_network_memcache

That should output:
httpd_can_network_memcache --> on

If it outputs off, just do:
# setsebool -P httpd_can_network_memcache 1

As simple as that.

Saturday, 11 February 2012

Going Bananas over Mangoes and Oranges

What would you do if you launched a few features in the span of a few days and saw pageviews go down with them? The natural instinct is to analyse the features that went in and figure out what could have caused the pageview dip. After all, the features were supposed to help users explore more, so pageviews should have increased.

A similar situation arose at CF some time back.
  • A brief look at the features didn't suggest that they were worsening the user experience.
  • Some of the features, though, let the user see more information on one page without actually having to go to more pages (e.g. not having to click on page-2, with search results auto-loading on scrolling down). This meant that pageviews could decrease while the overall user experience got better.
  • This was strengthened by the observation that average time spent on the pages had increased. So there was a sigh of relief: people were probably engaging more and finding relevant information without visiting more pages.
  • It was strengthened by a couple more observations: unique pageviews had increased over this period, and the key business metrics showed growth as well.
But these were not completely convincing, for the following reasons:
  • The dip was sudden, starting on a particular date
  • The dip continued from that date onwards
  • The pageview dip was much more significant than what the above-mentioned features could cause through easier information availability; e.g. the number of people going to page-2 was not very high anyway.
All this sounded like a very interesting puzzle, but it clearly required more time to analyse, and for a while I could not find that time.

But this Republic Day I sat down at my system and thought, let's look into it deeper.

Various possibilities ran through my mind:
  • Did our SEO take a hit because of recent changes in the Google algorithm?
  • Did we do some marketing earlier which was stopped suddenly?
  • Did something break in Google Analytics?
  • Did our servers become slow?
  • If it was some feature, then probably a particular kind of page should see the dip and not others

I checked SEO: things had been affected by the recent Google algorithm change, but not by much, and not from one particular date anyway.

No particular marketing activity emerged either.

A few days back we had fixed an issue where the Google Analytics code was repeated on pages, so that sounded like a possibility. But that fix had happened well after this dip began.

Also, a few days back there was a Facebook API bug which was distorting pageviews. But the date on which it was fixed didn't show any particular change related to this dip.

Checking which kinds of pages saw the dip narrowed it down to a few: project pages, listing pages, the property search page, and subpages of property pages. (For the sake of readability, let's call them the Mango, Orange, Banana, and Grapes pages.)

Now this was interesting! Other pages had not shown a significant dip; these pages had.

My thoughts went back towards "maybe it is some feature that is causing it, as only a few types of pages have been affected".

So came the natural next step: let's see what exactly went live on this particular black date. (Trivia: the date was 6th December, which is considered a black day in India because of the Babri Mosque demolition and the consequent riots that took place on 6th December 1992.)
I checked my emails; nothing came up that I didn't already know.
I checked with Vinay; nothing specific came out.

Then I went about analysing each page type individually (while Vinay continued to analyse the SVN records to see if anything suspicious could be found). All these pages showed a similar dip, but the very important learning was: TRAFFIC TO THESE PAGES FROM GOOGLE HAD INCREASED.

Now, that's something!! So only the traffic coming to these pages from other pages on CF had gone down. It had to be some feature causing the dip. This was also strengthened by the observation that the Mango pages had not even been touched by the recent features.

So I went about analysing traffic from one page to another for these pages. As expected, there was a dip in the traversals from one page to another; e.g. Mango -> Orange had about 5.5k traversals per week before this, and now about 3.6k per week.

BUT, from this we can't conclude that those pages were behaving badly, because traversals from Mango to Orange would go down simply because overall pageviews had gone down as well. So we needed to look at the percentage of pageviews that result in traversals, rather than absolute numbers. (To illustrate with made-up totals: if Mango had 100k weekly pageviews before and 60k after, then 5.5k/100k = 5.5% of views traversed before, versus 3.6k/60k = 6% after, an improvement despite the absolute drop.)

Alas, all these pages showed an improvement in performance in terms of % traversals, except for a minor drop from Orange to the Mango and Banana pages.

This was not helping. It was not a feature, because the % performance of the pages had improved.

I went back to the unique pageviews and avg time spent (probably to make myself feel good).

Important observations here were:
  • Unique pageviews had increased. (More people looking at each page)
  • The ratio of pageviews to unique pageviews had decreased. (People reaching a decision on a page in fewer views than earlier)
  • Avg time spent per page had increased
  • Total time spent had increased (avg time spent per page x pageviews)
I almost concluded that nothing was wrong; things had actually improved.

Meanwhile, Vinay was digging through SVN. He couldn't find anything, but since he was curious, I shared some graphs with him. Nothing specific came out of that either.

Just to see an example of the dip on a particular page, I went to some of the Mango pages and looked at the behaviour of a few specific top-performing ones. There was a clear dip, but without any pattern.

I was racking my brains in various other directions and was almost about to give up. There was a sense that things were good, as unique pageviews had increased and total time spent had increased. But somewhere in my mind there was a sense of not having solved the problem completely, as some loose ends were still there.

In this brain-racking, I felt there was more to this. There might be some further pattern to it. Maybe there was a bug. Maybe this behaviour was specific to a particular type of user, or to a particular type of source, e.g. a particular connection network, a particular browser, etc.

I tried segmenting the pageviews by browser, starting with IE traffic. BINGO! THE PAGEVIEW DIP WAS ONLY FOR IE, AND NOT FOR OTHER BROWSERS. IN FACT, OTHER BROWSER TRAFFIC WAS ALMOST UNAFFECTED OR HAD INCREASED MILDLY.

THIS WAS A BIG DISCOVERY.

I had renewed energy. I thought that within IE it might be specific to some version. I checked. Yes, it was IE7!

My mind went through all the recent IE7 bugs. There was one important bug, but that had been fixed a while back and didn't seem related to traffic anyway. Also, if there were some such big bug, QA would have caught it.

I thought of downloading IE7 and trying it myself anyway. I downloaded it and tried. But my system has very little memory, IE7 demanded a lot of it, and the machine started hanging. So I let it be for some time and thought I would try it from the office.

Meanwhile the analysis continued.

My mind went "Maybe in IE some pages don't behave the way the are expected to. Probably navigation from one page to another is not good in IE". I repeated my above analysis for IE7 traffic. Traversal data was same as above, so no big discovery there. Features were working fine on those pages.

But there was a striking observation. Earlier, the ratio of pageviews to unique pageviews used to be in the range of 15-20 in IE7, while now it was just above 2!!

15-20 as the ratio! That basically means a person was earlier viewing each of those pages 15-20 times, and that's the average! It sounded ridiculous; an average of about 2 still sounded okay. Another important stat: avg time spent on these pages in IE7 was about 10-15 seconds before this dip, and after the dip it was about a minute or more.

It looked as if something had been wrong earlier, and things had now returned to normalcy. But what could it be? As mentioned earlier, the bug of including the Google Analytics code twice had been fixed some time back, but not when this issue occurred.

Anyway, I concluded that things were fine. It was just that some pageview and time-on-site calculations had been haywire earlier (for some unknown reason), and now everything was fine.

But of course there was a curiosity to know what change had caused these numbers to become normal.

So the next day, when I went to the office, I started taking Vinay's help again in determining what all had changed on 6th Dec (or the dates surrounding it).

Nothing worthwhile could be seen. But we thought, let's just try IE7 and check if everything was fine. We got IE7 downloaded and tried it.

Alas, the behaviour in IE7 was BAD! The Mango, Banana and one more page were taking forever to load in IE7, almost like hanging. So the behaviour I had noticed on my home PC was not because of my PC's low memory; it was a real issue indeed!

How did QA not catch it? Well, because we used to test on IE9 by switching its compatibility modes to IE7 and IE8. I had always doubted whether that would work correctly. Now we had a live case demonstrating that it's not advisable to test that way.

It was quickly realised that this was a big issue. It was not very clear whether this was what had caused the dip, but it was surely very important, as IE7 still brings us a lot of visitors.

Now started the task of debugging what was going wrong here. Vinay, Ankush, Deep, Abinash and I sat till late night and tried all possible ways of debugging, but couldn't catch the bug. We tried everything we could think of at that hour: disabling JavaScript, partially disabling it, loading the page from disk rather than from the server, etc., but none of it helped. We also suspected the G+ API might have some bug. Some more effort was put in over a couple of days, but to no avail (of course, there were other issues demanding people's time too).

After a few days of failure, Manickam took it up. He tried for a couple of days himself but couldn't figure it out. We explored whether there were debuggers that could help us locate JS issues in IE. Meanwhile, Manickam sat tight with some more ideas for debugging it, and HE FOUND IT! It was the iepngfix.htc file that was causing the problem; removing its reference made the pages load much faster. Clap clap! We later found that this file was a fix for IE6. It shouldn't have any effect on IE7, but on IE7 and IE8 it makes pages slow if they contain a lot of images.
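For context, iepngfix.htc is the well-known workaround for IE6's lack of alpha transparency in PNGs, typically wired in through IE's proprietary behavior CSS property. The selector and path below are illustrative, but the fix really is a one-line removal:

/* Before: IE6 PNG fix, but it also executes on IE7/IE8 and slows
   down image-heavy pages there. */
img, .png { behavior: url(/css/iepngfix.htc); }

/* After: drop the reference, or serve it to IE6 only (e.g. inside
   IE conditional comments). */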

QA was done for this, and the fix was made live!

I was like "I can now marry in peace".

It went live, and I was curious to see the effect of the fix. Pageviews had increased for sure, but not by as much as they had dipped. I had kind of guessed that, though, because some questions were still unanswered:
  • iepngfix.htc was there even before 6th Dec. How did the issue start only on 6th Dec?
  • How do we explain the ridiculous pageviews/unique-pageviews ratio that existed earlier?
  • What was causing average times of <15 sec on those pages?
  • Was Google Analytics getting multiple requests at short intervals? Was the page refreshing at those short intervals?
  • Is it just a case of statistics gone wrong, or a real issue where the numbers were genuinely impacted?
Writing this has taken long as well. :) It's 4 AM now, and I am tired and sleepy, but still craving the answers.

More later.