Add to that my standard traffic wastes: the Make Your Own Virus page, which has evolved into a twisted subculture of its own by now, and the Religion Sucks page, which is similar but not quite as bad. With all that, we’re pretty close to our bandwidth limit for the month… with about a week left to go.
To keep extra costs from piling up, we have been redirecting some traffic, like this week’s UnicodeChecker update. It not only causes a non-trivial number of downloads, but those downloads are also larger than for previous versions, because the updated Help in Apple’s Help Book format takes about 100KB more space than the RTF version we had previously. Images on these pages haven’t been served from our own server for ages for the same reason.
But it’s still going to be a close one and I’m not sure we’ll manage to stay under our bandwidth limit. Thankfully, G pointed me to a page that tells you a magic PHP one-liner to turn on gzip compression for the text files you serve. At an estimated 70% saving, a very welcome hint at this stage! I had heard about gzip and its up/downsides before but the state of web-tool documentation being the sorry one that it is, I never investigated further. (In fact, I still think I should be able to have everything on the site gzipped by a single line in a .htaccess file, but that may require additional magic so I’m quite happy with having just all these pages compressed for the time being.)
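For a rough feel of what gzip does to text, here is a minimal Python sketch (not the PHP one-liner mentioned above). The `gzip_saving` helper and the sample page are made up for illustration, and the sample is far more repetitive than a real page, so its saving will come out higher than the estimate quoted above.

```python
import gzip


def gzip_saving(text: str) -> float:
    """Return the fraction of bytes saved by gzip-compressing `text`."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    return 1 - len(packed) / len(raw)


# Hypothetical sample: repetitive HTML, only loosely similar to a blog page.
sample_page = (
    "<html><body>"
    + "<p>Some blog paragraph with markup.</p>\n" * 200
    + "</body></html>"
)

saving = gzip_saving(sample_page)
print(f"saved {saving:.0%} of {len(sample_page.encode('utf-8'))} bytes")
```

Running the same function over the real HTML of a few pages would give a more honest estimate than this artificial sample.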
The funny thing is that once you start running out of bandwidth, your server logs start being interesting. Usually I find those logs merely unhelpful or confusing and they very rarely seem to contain information I’m actually interested in. While it may be amusing to see things like this
i.e. server usage information by the hour of the day, shown as pages, accesses or bytes served, the practical use of that information to me is negligible. I may take guesses about visitors’ time zones by looking at it and I may be thrilled to know how the number of bytes per access changes throughout the day. But none of that information is particularly relevant for me. Perhaps it could help the guy running the server to see when the system has the highest load, but that’s not my job. On the other hand, obvious questions like whether some pages have unexpectedly high access numbers or how new files are doing remain unanswered – particularly towards the end of the month in these systems, which accumulate the server usage one month at a time.
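The question of which pages eat the bandwidth can be answered from the raw access log directly. The following is a small Python sketch under the assumption that the server writes Apache-style Common Log Format lines; the `bytes_per_path` helper, the regular expression and the sample lines are all hypothetical.

```python
import re
from collections import Counter

# Picks the request path, status code and response size out of a
# Common Log Format line (assumed format, see lead-in).
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def bytes_per_path(lines):
    """Sum the bytes served for each requested path."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        size = match.group("size")
        # A size of "-" means no body was sent (e.g. a 304 response).
        totals[match.group("path")] += 0 if size == "-" else int(size)
    return totals


# Hypothetical log lines, for illustration only.
sample_log = [
    '192.0.2.1 - - [01/Jan/2000:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '192.0.2.2 - - [01/Jan/2000:10:00:05 +0000] "GET /downloads/example.dmg HTTP/1.1" 200 912000',
    '192.0.2.3 - - [01/Jan/2000:10:01:00 +0000] "GET /index.html HTTP/1.1" 304 -',
    '192.0.2.4 - - [01/Jan/2000:10:02:00 +0000] "GET /downloads/example.dmg HTTP/1.1" 200 912000',
]

totals = bytes_per_path(sample_log)
for path, total in totals.most_common():
    print(f"{total:>10} {path}")
```

Sorting by total bytes puts the heaviest pages at the top, which is exactly the view the monthly summaries fail to give.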
Being the good guys that they are, our provider even gives us two of these useless analysis tools (webalizer and awstats) to play with. awstats has somewhat more interesting information and manages to give you an access count for *.rdf files, but it must be extremely buggy: its numbers tend to be 20-30% too low. That just sucks. Severely. It means you can’t use its numbers to judge whether you’re close to your bandwidth limit. webalizer, on the other hand, seems to give the correct numbers but presents even less interesting information.
One thing that does count as ‘interesting’ information concerns the search engines coming to your site and the people coming from search engines. This month alone, Googlebot hit our site 20000 times, transferring 360MB of data and getting 160 copies of the robots.txt file. While I appreciate search indices, I wonder whether that isn’t a bit too much. It means they downloaded all of our site more than twice in the past month (actually it doesn’t, because there aren’t that many files on the server, but bandwidth-wise it’s about right). And only a tiny percentage of those files changed.
Well, that isn’t quite true as the blog pages technically change all the time because of the ‘live’ listing of new comments but shouldn’t a smart search engine be able to figure that out? If they can’t, how can I tell PHP to send the file’s modification date to the browser as the file’s last changed date rather than sending the current date? Perhaps doing that will ease the load generated by Googlebot.
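The general mechanism behind that question is the HTTP conditional GET: the server sends a `Last-Modified` header, the client returns it later as `If-Modified-Since`, and the server answers `304 Not Modified` without a body if nothing changed. Below is a minimal Python sketch of that decision, not PHP and not tied to any particular blog software; `conditional_response` is a hypothetical helper. In PHP the same idea would presumably be built from `filemtime()` and `header()`.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional


def conditional_response(file_mtime: float, if_modified_since: Optional[str]):
    """Return (status, headers) for a file last changed at `file_mtime` (Unix time)."""
    headers = {"Last-Modified": formatdate(file_mtime, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                # Treat dates without a zone as UTC so the comparison works.
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop fractions.
            modified = datetime.fromtimestamp(int(file_mtime), tz=timezone.utc)
            if modified <= since:
                return 304, headers  # client copy is still current
    return 200, headers


# Hypothetical modification time, for illustration only.
mtime = 1_000_000_000
first_status, first_headers = conditional_response(mtime, None)
repeat_status, _ = conditional_response(mtime, first_headers["Last-Modified"])
print(first_status, repeat_status)
```

Whether a given crawler actually sends `If-Modified-Since` is a separate question; the sketch only shows what the server side of the exchange would need to do.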
Oh, and it’s not Googlebot alone. There’s MSNbot as well, which reads our robots.txt file ten times a day and downloads even more pages. And there’s Inktomi, which downloaded a mere 10000 pages but got itself 1500 copies of robots.txt. Congratulations! In total, Google wins, because their service is actually used and drives traffic to the site. Seeing my numbers of search engine referrals, I do wonder how people can deny that Google is extremely powerful in the search engine market:
Those are just the most important ones, but they make up more than 98% of the referred searchers and all but 3000 of the hits by crawlers. And while these numbers come from the tool that misses a sizeable bit of our traffic, it really looks like the world of search engines is quite small at the end of the day and that the engines themselves aren’t terribly efficient.
And while I’m sifting through the stats, let’s also look at the browsers that come to visit. Naturally I expect a bias towards the Mac here. Both because of my own interests and because earthlingsoft means Mac software. A rough look gives:
|IE 6.0 (Windows)|40.5%|
|IE 5 (Mac)|0.5%|
|Firefox < 1.0.7|5.8%|
|Firefox > 1.0.7|1.8%|
… and so on. There’s still an 8% chunk which the software couldn’t identify (other aggregators, anonymous browsers…?). And I was surprised by the strength of Firefox. Not bad at all. Special credit goes to the ~200 hits by PDA and mobile phone users, making them ‘win’ over the iCab users.