I know this post is no rocket science and might just appear to be too silly! :)
Five years ago I joined my first social networking website (Orkut). It has a feature that is unusual for such sites: it displays the number of visits to your profile each day. Some related features of the website are as follows:
1. When you sign in to your account, you appear on the home page of your friends’ profiles (something like “recently online”). As other people sign in after you, you gradually stop appearing there. This is because the home page can show only 9 people at a time, so once 9 people have signed in after you, you are no longer visible on the home page and can be reached only through the friends list, which people generally don’t look at unless they are looking for someone specific.
2. Each time anyone visits your page or refreshes it, that counts as one hit.
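The “recently online” behaviour in point 1 can be sketched in a few lines. This is only my guess at the mechanism, assuming a fixed strip of 9 slots where the newest sign-in pushes out the oldest; the names `VISIBLE_SLOTS`, `sign_in` and the user names are made up for illustration and are not anything the website exposes.

```python
from collections import deque

VISIBLE_SLOTS = 9  # the home page shows 9 people at a time (from the post)


def sign_in(recent, user):
    """Put `user` at the front of the hypothetical recently-online strip."""
    if user in recent:
        recent.remove(user)   # a repeat sign-in only refreshes the position
    recent.appendleft(user)   # newest first; a full deque drops the oldest


recent = deque(maxlen=VISIBLE_SLOTS)
sign_in(recent, "me")
for i in range(1, 10):        # nine friends sign in after me
    sign_in(recent, f"friend{i}")

print("me" in recent)         # False: nine later sign-ins pushed me off
```

Signing in again would put “me” back at the front, which is why frequent sign-ins keep a profile visible.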
Now, it is obvious that if you sign in very often, you are more likely to be seen on the home pages of your friends, who are then more likely to click through to your page. (I don’t have any research to support this, but I see it as common sense and it matches my experience. If you see someone’s profile on your home page, you are more likely to visit it casually than to search for it when it is out of sight. The latter you would do only if you had to communicate with the person, had some work with them, OR had to spy on the person in question ;-).) Thus, in conclusion: when you sign in regularly, you are likely to get more hits on your profile.
Now, there are tricks to avoid appearing online on such websites. With them, even when you sign in you do not appear online on your friends’ home pages. Since you do not appear there, your profile is also less likely to be visited by people who would have clicked through only after seeing you online. In short, your profile would be visited only by people like these:
1. People who wanted to talk to you for something.
2. People who randomly searched for your name, or for somebody else sharing your name (this case disappears if you don’t keep a name), and landed on your profile.
3. People who saw your posts/messages on some community or group and then got curious and visited your page.
4. Somebody searches for his/her favorite cult movie, you have that movie on your profile so you show up in the search results, and that somebody then checks out your page (this extends to artists, music etc. too).
5. Somebody randomly remembers you and checks your page out to see what’s up with you.
Thus we can say that if certain conditions are satisfied, the number of hits to your page would be a random number, at least approximately.
I generally satisfied the condition of not appearing online, and I noticed over the years that other than the occasional spike, the number of hits was in a way distributed around a central value. I however never paid any further attention.
Some weeks ago, while writing somewhere, I thought it was time to try to model the same thing for a blog or a website and see for myself whether that number could indeed be considered random. Please note that if that number is random in this sense, then the distribution of page visits over a period of time should be a Gaussian curve (I’ll come back to this at the end of the post for those who aren’t sure).
Now, the conditions I mentioned for a social networking website are difficult to satisfy for a blog or website on the open web. I looked for the following and made the following assumptions (please question their wisdom if you don’t agree, and suggest new ones):
1. Suppose you start writing a blog and it starts off rather well. You are enthusiastic and advertise your page and ask people to pay a visit. Such hits can’t be considered random hits. The total number of visits to the page would be non-random plus random hits (from search engines, random visitors etc).
2. You are active on your blog in a big way for, say, the first year. Suppose the feed/email subscribers keep visiting your blog whenever they are notified of a new post. This number wouldn’t be random either, since the number of hits also depends on the number and frequency of posts. In short, the more you post, the more page visits you are likely to get.
3. After that one year you decide to quit your page. For some time the subscribers keep visiting. And since you have stopped blogging, you stop advertising too: you no longer give the link to people/friends and ask them to pay a visit.
4. After a sufficient period of time, say another year, the “excitement” about the blog has died down and there are no new posts at all. The hits that you still get can be:
(a) Random hits from people searching for some stuff on search engines.
(b) People (mostly friends, former stalkers etc ;-) who randomly think of paying your blog a visit, just hoping there might be something new.
(c) People keep looking for some tutorials (or similar post) on your page. I have noticed that no matter how old a tutorial on your page gets the old crowd is mostly replaced by new people and the overall number of visits to that page remains roughly around a mean value.
I believe that this sum total would roughly be a random number, unless a bot attack or a similar event causes a spike in the number of visits for a day. I also believe that this scenario for a website is equivalent to that of the profile I mentioned earlier.
I actually thought it was pretty straightforward. I asked some people for their opinion on it, and it appeared to me that either I lacked the communication skills to convey what I meant or maybe it was not so straightforward.
I decided to take it up. I collected the webstats for two websites.
I would like to thank Dr Jonathan Yedidia of the MERL for providing me with the stats of his website over the past three-four months or so. This website has been inactive for over a year and satisfies the other conditions that I spoke about, so it was an ideal candidate.
One more observation was this: the number of hits on weekends is visibly lower than on weekdays. So it is not a good idea to use both together. It would be a better idea to use two classes:
1. Only weekdays
2. Only weekends or other holidays.
That is, it is a good idea to model only weekends, or only weekdays. Modelling both together would not be a good idea, as they seem to have different distributions.
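Splitting the daily counts into these two classes is easy to sketch. The function name `split_by_day_type` and the sample counts below are made up for illustration (they are not the real stats), and holidays are not handled: a real version would need a holiday calendar for class 2.

```python
from datetime import date


def split_by_day_type(daily_hits):
    """Split {date: hits} into (weekday_hits, weekend_hits) lists."""
    weekdays, weekends = [], []
    for day, hits in sorted(daily_hits.items()):
        # date.weekday(): Monday is 0 ... Saturday is 5, Sunday is 6
        if day.weekday() >= 5:
            weekends.append(hits)
        else:
            weekdays.append(hits)
    return weekdays, weekends


# Hypothetical counts, only to show the call.
sample = {
    date(2024, 1, 5): 101,  # Friday
    date(2024, 1, 6): 60,   # Saturday
    date(2024, 1, 7): 55,   # Sunday
    date(2024, 1, 8): 98,   # Monday
}
weekday_hits, weekend_hits = split_by_day_type(sample)
print(weekday_hits, weekend_hits)
```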
Let’s consider only the weekdays class. As I mentioned, I collected data for some months for Dr Jonathan Yedidia’s website, and the data plots actually turned out to be roughly Gaussian. That’s interesting.
The staircase plot for the number of visitors to the website is given below. The X axis represents the days and the Y axis represents the number of hits.
[Staircase Plot for the number of visitors]
I plotted a simple histogram of this data with 40 bins and with 80 bins, and the plot is what I expected it to be: roughly normal. There is an outlier though (a day when the number of hits was 265, which I believe was due to a bot attack or something similar).
[Histogram Plot for the data with 40 bins]
Just for the sake of visual convenience let’s consider the same data for 80 bins.
[Histogram Plot for the data with 80 bins]
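For anyone who wants to repeat the binning without a plotting tool, here is a minimal sketch of an equal-width histogram. The data is synthetic: I draw 90 values from a normal distribution with mean 97.5 and spread 28 (the figures quoted in this post, treating 28 as the standard deviation, which is my assumption) and add one outlier of 265. It is not the real webstats, and the function `histogram` is my own helper.

```python
import random


def histogram(values, bins):
    """Count values into `bins` equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1      # avoid a zero width if all values match
    counts = [0] * bins
    for v in values:
        # the maximum value would land one past the end, so clamp it
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


random.seed(0)
# Synthetic daily hits, NOT the real data from the website.
hits = [max(0, round(random.gauss(97.5, 28))) for _ in range(90)]
hits.append(265)                        # an outlier like the one mentioned above

for bins in (40, 80):
    counts = histogram(hits, bins)
    print(bins, "bins, tallest bin holds", max(counts), "days")
```

With more bins each bar covers a narrower range of hits, so the bars get shorter and the outlier sits further out on its own.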
The normal fits for the above plots (both for 40 and 80 bins) are given below:
[Normal fit on the data with 40 bins]
[Normal fit on the data with 80 bins]
The estimated values for the mean and the standard deviation are as follows:
Meaning: The average number of hits on each day is about 97, and on any given day the number of hits would most likely lie in the bracket 97.5 +/- 28.
I found the above confirmation of what I had thought somewhat interesting. This does confirm that when the things I mentioned are taken care of, the number of hits on a particular day can be considered a random number. I now plan to collect data for a longer period of about one year and repeat the same for four websites (2 that satisfy those assumptions and 2 that do not; the latter two would be the control).
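To put a number on “most likely”, here is a small sketch using Python’s `statistics.NormalDist`. I am assuming the 28 above is the standard deviation of the fit; `fit_normal` is a hypothetical helper showing how the mean and standard deviation could be estimated from a list of daily hits, and no real data is used here.

```python
from statistics import NormalDist


def fit_normal(hits):
    """Estimate mean and standard deviation from a list of daily hits."""
    return NormalDist.from_samples(hits)


# Figures quoted in the post; 28 is assumed to be the standard deviation.
fitted = NormalDist(mu=97.5, sigma=28)

low, high = fitted.mean - fitted.stdev, fitted.mean + fitted.stdev
share = fitted.cdf(high) - fitted.cdf(low)
print(f"{low} to {high}: {share:.0%}")   # 69.5 to 125.5: 68%

# How far out is the 265-hit day, in standard deviations?
z = (265 - fitted.mean) / fitted.stdev
print(round(z, 2))                        # 5.98
```

So under this reading, roughly two days out of three should fall between about 70 and 125 hits, and a 265-hit day is almost six standard deviations out, which fits the guess that it was not ordinary traffic.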
The normal distribution never ceases to amaze me. It is one of those things that signify an underlying order in chaos, which is fundamental. You take a set of random people, take measurements of their foreheads, waists and heights, and they would lie along a normal distribution. You take the various shots taken by an archer at a target, and the distribution of arrows about the center is along a normal distribution.
In fact, this can be used to detect irregularities in data. Such patterns of randomness are reliable enough to make a guess about whether there has been any foul play with the data. For example, suppose you were a recruiter in the military and had data on the heights of thousands of men. If the data did not lie along a normal distribution, that would indicate some fudging or faking of heights by individuals. Such statistical analysis has given rise to a new area called Forensic Economics, which has some extremely enjoyable aspects to it such as Benford’s law, which too deals with checking the statistical fingerprint that data leaves to detect foul play.
Such interesting patterns underlying the chaos of social data led to many interesting philosophical questions in the 19th century. Many of them were simply whimsical, but exploring them led to much progress in understanding randomness.
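For the curious, Benford’s law says that in many naturally occurring collections of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of the values start with a 1. Here is a minimal sketch that compares observed leading digits with that expectation. The helper names are my own, and the powers of two are just a standard illustration of a sequence that follows the law closely, not data from this post.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Observed share of each leading digit among positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


expected = benford_expected()
observed = leading_digit_shares([2 ** n for n in range(1, 101)])

for d in range(1, 10):
    print(d, f"expected {expected[d]:.3f}", f"observed {observed[d]:.3f}")
```

A data set that is supposed to follow the law but shows leading-digit shares far from these values is worth a second look, though a mismatch alone is not proof of foul play.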