Saturday, 11 February 2012

Going Bananas over Mangoes and Oranges

What would you do if you launched a few features in the span of a few days and saw pageviews go down with them? The natural instinct is to analyse the features that went in and figure out what could have caused the dip. After all, the features were supposed to help users explore more, so pageviews should have increased.

Such a situation arose at CF some time back.
  • A brief look at the features didn't suggest that they were worsening the user experience.
  • Some of the features, though, let the user see more information on one page without actually having to go to more pages (e.g. not having to click on page 2, because search results auto-load on scrolling down). This meant that pageviews could decrease while the overall user experience got better.
  • This was strengthened by the observation that the avg time spent on the pages had increased. So there was a sigh of relief that people were probably engaging more and hence finding relevant information without visiting more pages.
  • It was strengthened by a couple more observations: unique pageviews had increased in this period, and the key business metrics showed growth as well.
But these were not completely convincing, for the following reasons:
  • The dip was sudden, on a particular date
  • The dip continued from that date onwards
  • The pageviews dip was much bigger than what the features mentioned above could cause through easier access to information, e.g. not many people were going to page 2 anyway.
All this sounded like a very interesting puzzle, but clearly it required more time to analyse. Somehow I could not get the time required for this analysis for a while.

But this Republic Day I got on my system and thought, let's look into it deeper.

Various possibilities ran through my mind:
  • Did our SEO go bust because of the recent changes in Google's algorithm?
  • Did we do some marketing earlier which was stopped suddenly?
  • Did something break in Google Analytics?
  • Did our servers become slow?
  • If it is some feature, then probably only particular kinds of pages should see the dip and not others

I checked SEO: things were affected by the recent Google algorithm change, but not by much, and not on a particular date anyway.

No particular marketing activity emerged either.

A few days back we had fixed an issue with the Google Analytics code being repeated on pages, so that sounded like a possibility. But that had happened much after this dip.

Also, a few days back there was a Facebook API bug which was distorting pageviews. But the date on which it was fixed didn't show any particular change related to this dip.

Checking which kinds of pages saw the dip revealed a few: Project pages, Listing pages, the Property Search page, and subpages of Property pages. (To make it a tad more readable, let's call them Mango, Orange, Banana, and Grapes pages.)

Now this was interesting! Other pages had not shown significant dip. These pages did.

Now the thought went back towards "maybe it is some feature that is causing it, as only a few types of pages have been affected".

So came the natural thought: let's see what exactly went live on this particular black date (Trivia: this date was 6th December, which is considered a black day in India because of the Babri Mosque demolition, and the consequent riots, on 6th December 1992).
Checked my emails. Nothing turned up that I didn't already know.
Checked with Vinay. Nothing specific came out.

Then I went about analysing each page type individually (while Vinay continued to analyse SVN records to see if anything suspicious was to be found). All these pages showed a similar dip, but the very important learning was that TRAFFIC TO THESE PAGES FROM GOOGLE HAD INCREASED.

Now, that's something!! So only the traffic coming from other pages in CF to these pages had gone down. It had to be some feature causing the dip. This was also strengthened by the observation that Mango pages had not even been touched by the recent features.

So I went about analysing traffic from one page to another for these pages. As expected, there was a dip in the traversals from one page to another, e.g. Mango -> Orange had about 5.5k traversals per week before this, and now it was about 3.6k traversals per week.

BUT, from this we can't conclude that those pages are behaving badly, because traversals from Mango to Orange would go down anyway when overall pageviews have gone down. So we need to look at the percentage of pageviews which result in traversals, rather than absolute numbers.
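The normalisation described above can be sketched in a few lines of Python. The traversal counts are the Mango -> Orange figures quoted earlier; the weekly Mango pageview totals are made-up placeholders (the real ones are not given here), and the function and variable names are hypothetical too.

```python
def traversal_rate(traversals, source_pageviews):
    """Share of pageviews on the source page that led on to the target page."""
    if source_pageviews == 0:
        return 0.0
    return traversals / source_pageviews

# Mango -> Orange traversals per week, as quoted in the post.
traversals_before = 5500
traversals_after = 3600

# HYPOTHETICAL weekly Mango pageviews, for illustration only.
mango_pageviews_before = 50000
mango_pageviews_after = 30000

rate_before = traversal_rate(traversals_before, mango_pageviews_before)
rate_after = traversal_rate(traversals_after, mango_pageviews_after)

print(f"before: {rate_before:.1%}, after: {rate_after:.1%}")
```

With these placeholder totals the absolute traversals fall while the rate rises slightly, which is the kind of picture where the absolute numbers alone would mislead.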

Alas, all these pages showed an improvement in terms of % traversals, except for a minor drop from Orange to Mango and Banana pages.

This was not helping. It was not a feature, because the % performance of the pages had improved.

I went back to the unique pageviews, and avg time spent (Probably to make myself feel good).

Important observations here were:
  • Unique pageviews had increased. (More people looking at one page)
  • Ratio of pageviews to unique pageviews had decreased (A person taking a decision on a page in fewer views than earlier)
  • Avg time spent per page had increased
  • Total time spent had increased (Avg time spent per page X Pageviews)
I almost concluded that nothing was wrong. Things had actually improved.
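For reference, the two derived numbers in the list above (the pageviews to unique-pageviews ratio, and total time spent) are simple to compute from an analytics export. A minimal sketch follows; every figure in it is invented for illustration, and the `PageStats` name is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    pageviews: int
    unique_pageviews: int
    avg_time_on_page_sec: float

    @property
    def views_per_unique(self) -> float:
        """Pageviews divided by unique pageviews."""
        return self.pageviews / self.unique_pageviews

    @property
    def total_time_sec(self) -> float:
        """Avg time spent per page X pageviews."""
        return self.avg_time_on_page_sec * self.pageviews

# HYPOTHETICAL numbers, chosen only to mirror the direction of the observations.
before = PageStats(pageviews=100_000, unique_pageviews=20_000, avg_time_on_page_sec=20.0)
after = PageStats(pageviews=60_000, unique_pageviews=24_000, avg_time_on_page_sec=45.0)

for label, stats in (("before", before), ("after", after)):
    print(label, stats.views_per_unique, stats.total_time_sec)
```

In this made-up example pageviews drop, yet unique pageviews and total time both rise, so all four observations can hold at once.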

Meanwhile, Vinay was digging through SVN. He couldn't find anything, but since he was curious, I shared some graphs with him. Nothing specific came out of it.

Just to see an example of the dip on a particular page, I went to some of the Mango pages and looked at the behaviour on a few specific top-performing pages. There was a clear dip, but without any pattern.

I was racking my brains in various other directions, and was almost about to give up. There was a sense that things were good, as unique pageviews and total time spent had both increased. But somewhere in my mind there was a sense of not having solved the problem completely, as some loose ends were still there.

In this brain-racking, I felt there was more to this. There might be some more pattern to it. Maybe there is a bug. Maybe this behaviour is for a particular type of people. Or from a particular type of source, e.g. a particular connection network, a particular browser etc.

I tried segmenting the pageviews for IE traffic. BINGO! THE PAGEVIEW DIP WAS ONLY FOR IE TRAFFIC, AND NOT FOR OTHER BROWSERS. IN FACT, TRAFFIC FROM OTHER BROWSERS WAS ALMOST UNAFFECTED OR HAD INCREASED MILDLY.
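The same kind of segmentation can be done outside the analytics tool on an exported table. Below is a rough sketch in plain Python; the rows, the browser names other than IE, and the `change_by_segment` helper are all hypothetical and only illustrate the idea of comparing before/after per segment.

```python
from collections import defaultdict

# HYPOTHETICAL export rows: (period, browser, pageviews).
rows = [
    ("before", "IE", 60000), ("after", "IE", 25000),
    ("before", "Firefox", 30000), ("after", "Firefox", 31000),
    ("before", "Chrome", 25000), ("after", "Chrome", 26500),
]

def change_by_segment(rows):
    """Return {segment: relative change in pageviews from 'before' to 'after'}."""
    totals = defaultdict(lambda: {"before": 0, "after": 0})
    for period, segment, pageviews in rows:
        totals[segment][period] += pageviews
    return {
        segment: (t["after"] - t["before"]) / t["before"]
        for segment, t in totals.items()
        if t["before"] > 0  # skip segments with no baseline traffic
    }

changes = change_by_segment(rows)
# Print the worst-hit segment first.
for segment, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{segment:8s} {change:+.1%}")
```

The same helper works for any other dimension (browser version, network, traffic source) by putting that dimension in the second column.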

THIS WAS A BIG DISCOVERY.

I had renewed energy. I thought maybe within IE it might be specific to some version. Checked it. Yes, it was IE7!

My mind went through all the recent bugs of IE7. There was one important bug, but that was fixed a while back, and it didn't seem related to traffic either. Also, if there were some such big bug, QA would have caught it.

I anyway thought of downloading IE7 and trying it myself. Downloaded it, and tried. But my system has very little memory, and it required a lot of memory and started hanging. So I let it be for some time, and thought I would try it from the office.

Meanwhile the analysis continued.

My mind went "Maybe in IE some pages don't behave the way they are expected to. Probably navigation from one page to another is not good in IE". I repeated my above analysis for IE7 traffic. The traversal data showed the same picture as above, so no big discovery there. Features were working fine on those pages.

But there was a striking observation there. Earlier the ratio of pageviews to unique pageviews used to be in the range of 15-20 in IE7, while now it was just above 2!!

15-20 as the ratio! Basically that means a person was earlier viewing each of those pages 15-20 times! And that is the average! This sounded ridiculous. An average of about 2 still sounded okay. Another important stat here: the avg. time spent on these pages in IE7 was about 10-15 seconds before this dip, and after the dip it was about a minute or more.
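One back-of-the-envelope check, using only the IE7 figures quoted above: multiplying the pageviews/unique-pageviews ratio by the avg. time per pageview gives a rough time per unique pageview. This is only rough arithmetic on rounded numbers ("just above 2" taken as 2, "a minute or more" taken as 60 seconds), not something read off the analytics reports, and the function name is hypothetical.

```python
def time_per_unique_view(views_per_unique, avg_sec_per_view):
    """Rough seconds spent per unique pageview."""
    return views_per_unique * avg_sec_per_view

# Before the dip: ratio 15-20, avg. time 10-15 seconds per pageview.
before_low = time_per_unique_view(15, 10)
before_high = time_per_unique_view(20, 15)

# After the dip: ratio just above 2, avg. time about a minute or more.
after_floor = time_per_unique_view(2, 60)

print(f"before: {before_low}-{before_high} s per unique pageview")
print(f"after:  roughly {after_floor} s or more per unique pageview")
```

On these rounded figures the time per unique pageview is of the same order before and after (a couple of minutes or more), which would be consistent with each visit having earlier been counted as many short pageviews. It does not prove it.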

It looked as if something was wrong earlier, and things had now come back to normal. But what could it be? As mentioned earlier, the bug of including the Google Analytics code twice was something that was fixed some time back, but not when this issue occurred.

Anyway, I concluded that things were fine. It's just that some pageviews and time-on-site calculation had been haywire earlier (for some unknown reason), and now everything was fine.

But of course there was a curiosity to know what change had caused these numbers to become normal.

So next day when I went to office, I started taking Vinay's help again in determining what all had changed on 6th Dec (or the dates around it).

Nothing worthwhile could be seen. But we anyway thought, let's just check on IE7 that everything was fine. Got IE7 downloaded and tried it.

Alas, the behaviour in IE7 was BAD! Mango, Banana and one more page were taking forever to load in IE7. Almost like hanging. Basically, the behaviour that I had noticed on my home PC was not because of my PC having low memory. It was a real issue indeed!

How did QA not catch it? Well, because we used to test on IE9 by changing the compatibility mode to IE7 and IE8. I had always doubted whether that would work correctly. Now we had a live case demonstrating that it's not advisable to test that way.

It was quickly realised that this was a big issue. It was not very clear if this was the one which had caused the dip. But it surely was very important, as IE7 still brings a lot of visitors.

Now started the task of debugging what was going wrong here. Vinay, Ankush, Deep, Abinash and I sat till late night and tried all possible ways of debugging, but couldn't catch the bug. We tried everything that we could think of at that hour: disabling JavaScript, partially disabling it, loading the page from disk rather than from the server etc., but none helped. We also suspected that this was related to the G+ API having some bug. Some more effort was put into it over a couple of days, but to no avail (of course there were other issues too which demanded people's time).

After a few days of failure, Manickam took it up. He tried for a couple of days himself but couldn't figure it out. We explored whether there were debuggers which could help us locate JS issues in IE. Meanwhile Manickam sat tight, with some more things in his mind to try, and HE FOUND IT! It was an iepngfix.htc file which was causing the problem. Removing the reference to it made the page load much faster. Clap Clap! We later found that this file was there for a fix in IE6. It shouldn't have any effect on IE7, but on IE7 and IE8 it makes pages slow if there are a lot of images in them.

QA then took place for this, and the fix went live!

I was like "I can now marry in peace".

It went live, and I was curious to see the effect of the fix. The pageviews had increased for sure, but not by as much as they had dipped. I had kind of guessed it though, because some questions were still unanswered:
  • iepngfix.htc was there even before 6th Dec. How did the issue start only on 6th Dec?
  • How do we explain the ridiculous pageviews/unique-pageviews ratio that was there earlier?
  • What was causing avg times of <15 sec. on those pages?
  • Is it that Google Analytics was getting multiple requests at small intervals? Was the page refreshing at those short intervals?
  • Is it just a statistics-gone-wrong issue, or a real issue where numbers have been impacted?
Writing this has taken long as well. :) It's 4 AM now, and I am tired and sleepy, but still craving to know the answers.

More later.
