The thoughts of a web 2.0 research fellow on all things in the technological sphere that capture his interest.

Tuesday, 10 June 2008

Is the web linguistically on the left or right?

I am currently in the middle of reading David Crystal's (2006) 'Language and the Internet', an interesting book that, when it started mentioning style guides, got me wondering whether they could be used to determine whether the UK web space is politically on the left or on the right. Leading broadsheets from both sides of the political debate (e.g., The Telegraph and The Guardian) have publicly available style guides, and the differences between them could be used as the basis for such a linguistic-webometric investigation.

My personal favourite style guide section is The Telegraph's Banned Words. Whilst the banning of terms such as 'Europhobe' has obvious political motivations, you have to wonder whether it was really necessary to explicitly ban referring to 'perverted Scout leaders' (whilst Google Trends does not show the phrase to be endemic, that may be because of the Telegraph's quick action). It is interesting to note, however, that despite the Telegraph's authoritarian values, they seem to be very lax with their own language: the supposedly banned 'mass exodus' was used only a few days ago. Surely there will be letters to the editor!

Unfortunately, these days search engines try to be helpful and ignore many of the differences. For example, 'Yahoo' and 'Yahoo!' are treated as the same, when any fool would know that the exclamation mark reflects a search for more conservative opinions on the search engine. It would be nice to be able to turn a search engine's 'helpful' features off occasionally.

posted by David at | 0 Comments Links to this post

Friday, 23 May 2008

The Strange Case of the Webometrician's Fan

It sometimes feels as though there is no piece of information, or opinion, that cannot be found online. If people have something in a digital format it seems natural for a large proportion of the population to publish it on the web, with little thought as to whether they would really want people looking at it for years to come.

Today my attention was drawn by the head of my research group, who had been engaging in some google-self-abuse (although he claims he was looking to see if his latest paper had been published yet), to one particular term paper: '“Webometrics”: Through the eyes of Mike Thelwall'. After working with Mike for four years I can assure readers that his appalling t-shirts are testament to the fact he is not head of the 'Sense of Humor Diagnosis Service'.

Friday, 18 April 2008

Alexa changed its ranking system: I'm a winner!

Alexa is an often criticised ranking of web sites, with the criticism largely based on the use of the Alexa toolbar as a source of data. The use of the toolbar data skewed the ranking in favour of those sites visited by internet marketers and search engine optimisers, those who installed the toolbar, rather than the average user. Alexa's big news, which everyone reported yesterday (e.g., Mashable, TechCrunch), is that they are now using additional data sources, although what they are is not very clear in their announcement.

Obviously with any change in the ranking system there will be winners and losers, and those who win are less likely to complain than those who lose. Personally I think the new Alexa rankings are a HUGE step forward. This conclusion is based solely on the increase in my own personal ranking. Back in January I noted that the Alexa ranking for Webometric Thoughts was 3,816,072. Today my Alexa ranking is 1,607,649 (1,389,032 for the 1 week average). Breaking into the top one million suddenly seems much easier.

Saturday, 12 April 2008

Suicide and the Internet: Some flaws in the study

Webometric investigations rarely gain mainstream interest. Yesterday, however, one did: a content analysis of the top 10 sites, on the four major search engines, for 12 searches relating to suicide. This highlighted the large number of hits that were to 'dedicated suicide sites' (e.g., pro-suicide, encouraging, describing methods, or portraying suicide in fashionable terms): 90 out of 480 hits. Unsurprisingly this gained the interest of numerous news sites including the BBC. There are, however, a number of problems with the study: not all search terms are equal, and not all search engines are equal. Whilst we all make sweeping statements about web phenomena, we should really save them for our blogs rather than publication in the likes of the British Medical Journal (BMJ).

The main problem of the investigation is a focus on the information that is retrievable rather than the information that is actually being retrieved, which quickly muddies the water. Whilst the combining of search engines would initially seem to underestimate the scale of the problem, the propensity of users to use certain search terms would seem to indicate that the article has overestimated the scale of the problem.

The Google Effect
The majority of the statistics provided in the paper are based on the combined results of Google, Yahoo, MSN, and Ask:
-90/480 were dedicated suicide sites
-62/480 were sites forbidding suicide
-59/480 were sites discouraging suicide
However, almost 70% of searches use Google, which as the results show has the highest number of dedicated suicide sites in the results. The combined figures would therefore seem to underestimate the problem: whereas just under a fifth of the hits were dedicated suicide sites overall, for the most influential search engine this rises to just under a quarter. However, when looking at the search terms used, we soon realise that the problem has been overstated.
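As a rough check on the proportions quoted above, the shares can be reproduced in a few lines of Python. This is only a sketch: it uses the three counts listed from the paper, and the dictionary labels are illustrative names rather than the paper's own category headings.

```python
# Total hits examined: top 10 results x 4 search engines x 12 search terms.
TOTAL_HITS = 10 * 4 * 12

# Counts as quoted in the post (labels are illustrative).
counts = {
    "dedicated suicide sites": 90,
    "sites forbidding suicide": 62,
    "sites discouraging suicide": 59,
}

# Share of all hits that fall into each category.
shares = {label: n / TOTAL_HITS for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The 'dedicated suicide sites' share comes out a little below 20%, which matches the 'just under a fifth' above. Nothing here weights the hits by how often each search engine or search term is actually used, which is the gap the rest of this post is about.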

Search Term Analysis
Whilst the BMJ lists the 12 search terms used, gathered partly from interview data and search suggestions used by search engines, a quick investigation shows that they are by no means used in equal measure. Of the twelve terms only four were used often enough to generate search graphs in Google Trends:
-suicide
-suicide methods
-how to commit suicide
-how to kill yourself
And even amongst these four there was a wide variation in usage, with the overwhelming majority of queries being generated by the term suicide:


A content analysis of Google's 'suicide' results
Below are the top ten links I received when looking at the global results from google.co.uk for the term 'suicide', and how I would classify them. Whilst the BMJ study emphasises that it doesn't restrict the results to the UK, it does not mention whether it uses google.com or google.co.uk. I have used google.co.uk as, unless you ask it otherwise, google.com will redirect British users to google.co.uk.
-Miscellaneous - Wikipedia's suicide page
-Against suicide - Suicide...read this first
-Against suicide - Suicide.com
-Academic or policy site - Mind fact sheet
-Academic or policy site - Stanford encyclopaedia of philosophy
-Prevention or support site - Kids Health answers and advice -suicide
-Prevention or support site - Problems of life: Suicide
-Not relevant - Facebook suicide: the end of a virtual life
-Prevention or support site - Depression and suicide in men
-Prevention or support site - BBC: Health conditions: Suicide
Whilst classification is notoriously difficult to get agreement on, none of these sites could be considered the sort of 'dedicated suicide sites' that will spread panic through middle-England.

I have no doubt that there are plenty of sites on the web that encourage suicide, but before we start a panic we need to have a greater understanding of how people are searching on the topic of suicide when they are feeling suicidal. We can't just lump together the findings of different searches on different search engines and say that statistically we have a problem.

The most popular search on the most popular search engine on the topic of suicide does not find any 'dedicated suicide sites'.

The original BMJ article:
Biddle, L., Donovan, J., Hawton, K., Kapur, N., & Gunnell, D. (2008). Suicide and the internet. British Medical Journal, 336(12 April 2008), p. 800-802.
Can be found here.

Friday, 28 March 2008

Research v. Internet

I have just come across a picture that perfectly sums up why I never get as much work done as I should:

How can webometrics compete with dinosaurs?

(Asher Sarlin's original picture can be found HERE).

Thursday, 13 March 2008

Classifying the web: Herding ADHD cats

When it comes to boring jobs I like to think I have had some of the worst: taking the shells off of hard boiled eggs, taking the green bits off of tomatoes, and, most recently, classifying web links. Yes, I can classify the links at home with a constant supply of coffee and the music of my choice, but it is still one of the most boring jobs. The reason: web pages come in every imaginable form, mostly with no discernible purpose, with links placed just because the web owner can. Classifying the web is like herding ADHD cats.

The good and interesting sites that we visit every day are surrounded by a web of crap that we only usually trip across if we are unlucky. These are not necessarily offensive sites, just sites that are absolute rubbish: spam, half-formed, badly written, orphaned. Classifying the web means that we have to wallow in this web of crap. It's not like classifying a library of books, but rather like classifying a whole world of which 90% is the council rubbish tip.

Monday, 10 March 2008

Not all links are equal!

Thanks to a single link on the BBC's delicious roll on Saturday night, yesterday saw Webometric Thoughts get its highest number of hits ever. Whilst for many sites 121 absolute unique visitors in a day (according to Google Analytics) wouldn't be worthy of note, the webometric blogging community has fairly low aspirations.

What is interesting, from the perspective of a Google Analytics junkie, is the difference between the amount of traffic this link drove in comparison to a similar link on the BBC's delicious roll on the 16th January. Whilst the January link only drove 17 unique users to my site, Saturday's link drove 102 users over a three-day period!

Was the extra traffic all due to the extra time the link was visible on the BBC? It was visible a lot longer, but weekend traffic is often slower. Or was it the topic of the posts? The first was about ISPs, whilst the second was about the iPhone. It seems equally likely that the difference in the traffic is due to the link's anchor text. Whereas the first text referred to 'David Stuart research fellow', the second link merely referenced the blog 'Webometric Thoughts' (AC seems to have done much more digging than NR).

Not all links are equal, however equal they may seem.

Sunday, 17 February 2008

What's Everyone Twittering About?

Whilst I am not personally a big Twitter fan, I am interested in discovering what people are Twittering about and how the posts differ from other forms of communication. With such thoughts in mind I started my first tentative Twitter steps this evening.

Adapting an open source RSS feed reader I set about downloading the public timeline (http://twitter.com/statuses/public_timeline.rss), for which Twitter has no restrictions on the number of requests that you can send. Whilst the original plan was to download an hour's worth of data for a small pilot investigation, unfortunately I had to stop after about 45 minutes when I received an HTTP 502 status code ('Twitter is down or being upgraded' rather than 'exceeded the rate limit').

The first post that was downloaded was numbered 723435732 (just after 7pm), whilst the last was numbered 723547592 (about 45 mins later). As the last digit of each number seems to be superfluous, there were a potential 11,186 posts to be downloaded, of which 6,422 posts were successfully downloaded. Many of the 'missing posts' will have been private, whilst others may have been missed due to delays in sending and receiving the RSS feed.
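The arithmetic behind the 11,186 figure can be sketched in Python as follows. It rests entirely on the assumption made above, that the final digit of each post number carries no information; the variable names are just illustrative.

```python
# Post numbers of the first and last downloaded posts, as reported above.
first_id = 723435732
last_id = 723547592

# Assumption: the final digit is superfluous, so drop it with integer
# division before taking the difference.
potential_posts = last_id // 10 - first_id // 10

# Number of posts actually retrieved during the session.
downloaded = 6422
coverage = downloaded / potential_posts

print(potential_posts)
print(f"{coverage:.1%}")
```

On these numbers a little over half of the potential posts were captured, with the remainder being some mix of private posts and posts missed between feed requests.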

I have not, as yet, had time to do anything more interesting with the collected data than look at the frequency of terms using Text-Stat. So in true informetric style, here is the log-log graph of word frequency in rank-order:

Most noticeable in the frequency data is:
-Over 58% of twitter links are via tinyurl: 'http' appeared 588 times, 'tinyurl' 343 times.
-Twitterers are generally a polite bunch. The more 'popular' swear-words don't appear that often, in 6,422 posts: shit (11), fuck (6), & cunt (zero). Admittedly a large proportion are not in English and there are a few variations on the words, but nonetheless I probably swear more than all these people in my average email.
-And they are not celeb-obsessed: Britney only gets three mentions, whilst there is no mention of Winehouse. Instead they err on the side of the geek: windows (19), Mac (25), iPhone (20).
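The counts above came from Text-Stat, but the same kind of term-frequency table can be sketched in Python with the standard library. The tokenising rule and the three sample posts below are invented for illustration and are not the collected data; only the 'http' and 'tinyurl' counts at the end are taken from the list above.

```python
import re
from collections import Counter

def term_frequencies(posts):
    """Count lower-cased word tokens across a list of post strings."""
    counter = Counter()
    for post in posts:
        # Crude tokeniser (an assumption): runs of letters, digits and apostrophes.
        counter.update(re.findall(r"[a-z0-9']+", post.lower()))
    return counter

def rank_frequency(counter):
    """Return (rank, term, count) tuples, most frequent term first."""
    ranked = counter.most_common()
    return [(rank, term, count) for rank, (term, count) in enumerate(ranked, start=1)]

# Hypothetical sample posts, made up purely to exercise the functions.
sample_posts = [
    "Reading about the iPhone http://tinyurl.com/abc",
    "New Mac today, loving the Mac",
    "Is the iPhone worth it? http://example.com/review",
]

freqs = term_frequencies(sample_posts)
for rank, term, count in rank_frequency(freqs)[:3]:
    print(rank, term, count)

# Share of links that go via tinyurl, using the counts reported above.
tinyurl_share = 343 / 588
print(f"{tinyurl_share:.1%}")
```

Plotting the logarithm of rank against the logarithm of count from `rank_frequency` would give the usual log-log view of the distribution.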

As the analysis shows, these are early (childish) days. But hopefully I will have the opportunity, later in the week, to create the tools to investigate the data more thoroughly before downloading a larger sample.

Thursday, 14 February 2008

Web Impact Factors for Blogs

Oh what a tangled web we weave... has just posted an interesting article on the problems of calculating the impact of a blog. In summary: whilst a web site's impact has traditionally been measured by dividing the number of inlinks by the number of web pages (the Web Impact Factor), the feed aggregators are having such a disproportionately large effect that the results are useless. Whilst this is true, it is by no means the only reason for dismissing the use of the traditional WIF in determining the impact that a blog is having.
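To make the aggregator problem concrete, here is a minimal Python sketch of the WIF as described above (inlinks divided by pages). The function name and all of the figures are hypothetical, chosen only to show how a large number of aggregator links can swamp the result.

```python
def web_impact_factor(inlinks, pages):
    """Web Impact Factor as described above: inlinks divided by pages."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return inlinks / pages

# Hypothetical blog: 200 pages, 50 inlinks from other sites,
# and 950 inlinks that come from feed aggregators.
pages = 200
other_inlinks = 50
aggregator_inlinks = 950

wif_all = web_impact_factor(other_inlinks + aggregator_inlinks, pages)
wif_filtered = web_impact_factor(other_inlinks, pages)

print(wif_all, wif_filtered)
```

With these made-up numbers the unfiltered WIF is twenty times the filtered one, so the ranking of blogs would largely reflect which aggregators have picked them up.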

Other important factors that need to be discussed are:
1. The use of 'number of pages' as a denominator.
The number of pages has been used as a denominator to normalise for the size of organisations, whereas in this case it is normalising for the quantity of output as each of the blogs only has one author. Do we want to assess the value of individual posts, or the value of the blog/blogger?
2. The effect of the blogger posting comments on other people's web sites.
Analysis of the links to my blog from external sites (after dismissing the feed aggregators) would find that I am the author of most of them. Commenting on other people's blogs often provides a link back to your own blog, although these tend to be to the blog's homepage rather than a specific post. Do we need to dismiss these links, or do they provide a useful indicator of a blogger's contribution to the blogosphere?

Any method we use to judge the value of a blog will have its promoters and detractors, often depending on how well it portrays their own work. Therefore I think we should stick with the WIF, at least until such time as my own webometric thoughts slip down the table.

nb. as an aside(-ish) I have noticed that for the first time Webometric Thoughts has leapt above both of the other webometricians' blogs on a Google.com search for 'webometrics'. Maybe Google rank is the only important indicator as it has such a disproportionately high effect on the success of a web site that all other indicators are now merely a reflection of it.

Friday, 1 February 2008

Webometrician out of retirement: When is a blog ever dead?

After almost a year without a single post the original webometrics blogger has decided to start blogging again. Whilst a more suspicious blogger would put the sudden revival down to the next generation of bloggers starting to encroach upon the top search engine results for the webometrics field, we will choose to believe the "loss of login details" excuse.

The question of when a blog can be officially pronounced 'dead' arises a lot in webometric investigations of the blogosphere. Many studies ignore blogs that have not been updated in the last week, the last two weeks, or the last month. Maybe it is more appropriate to consider all blogs active, unless they state otherwise.
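The cut-off rules mentioned above are easy to express in code, which also shows how blunt they are. The following Python sketch uses hypothetical dates for a blog that has been silent for almost a year; the function name and the 7, 14, and 30 day cut-offs are illustrative stand-ins for 'a week, two weeks, or a month'.

```python
from datetime import date, timedelta

def is_active(last_post, as_of, cutoff_days):
    """True if the blog was updated within cutoff_days of the as_of date."""
    return as_of - last_post <= timedelta(days=cutoff_days)

# Hypothetical dates: last post almost a year before the day of the check.
last_post = date(2007, 2, 15)
as_of = date(2008, 1, 31)

# Apply each cut-off in turn.
verdicts = {days: is_active(last_post, as_of, days) for days in (7, 14, 30)}
print(verdicts)
```

Under every one of these cut-offs such a blog would be dropped from a study as 'dead' the day before it started posting again.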

Whilst I welcome any academic back to the blogosphere, maybe future posts will be a bit more accurate than the last:
"Live Search withdrawal of all link search operators"
Whilst Live Search has severely limited the operators it offers, it still offers a linkfromdomain operator, which provides results that are not available via Yahoo or Google.

Tuesday, 18 December 2007

Google retaliates: reported collateral damage

According to Mashable some Google users are reporting receiving a large number of messages claiming their searches are looking like automated requests. If Google continues with a tightened security system there will be repercussions for those webometricians who use scrapers rather than the Google API, but more importantly, Google may use it as an opportunity to encourage/force users to log on: surely if you log on, there is less chance of receiving 'automated request' accusations.

Thursday, 13 December 2007

Record numbers visit Webometric Thoughts!

Whilst it wouldn't be much of a record in comparison to Google, MSN, or even the local corner shop's web site, Webometric Thoughts finally got the 30 unique visitors in a day that have eluded it for so long. In fact it got 31 on Tuesday, and then 40 yesterday.

If the number of visitors continues to grow at 29% each day, in forty days (and forty nights) I will get a million visitors in a day for the first time. Although I may wait before I upgrade my server data package.
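For anyone wanting to check the forty days, the compound growth can be sketched in Python. The function name is made up for illustration, and the calculation assumes the count starts from yesterday's 40 visitors and grows by exactly 29% every day.

```python
def days_until(target, start, daily_growth):
    """Count the days of compound growth needed for start to reach target."""
    visitors = float(start)
    days = 0
    while visitors < target:
        visitors *= 1 + daily_growth
        days += 1
    return days

# Growth implied by going from 31 visitors to 40 the following day.
growth = 40 / 31 - 1
print(f"{growth:.0%}")

# Days until a million visitors in a day, starting from 40 at 29% per day.
days_to_million = days_until(1_000_000, start=40, daily_growth=0.29)
print(days_to_million)
```

The loop simply multiplies by 1.29 until the target is passed, which avoids any rounding questions that a logarithm-based shortcut might raise.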

Friday, 30 November 2007

Webometric Thoughts: For the more discerning webometrician

In the small field of webometrics there are few blogs, but after finding the blog readability test (via Halavais), I have discovered that mine is the most mature of the three I follow (or is it just that mine is more incomprehensible??).

Thelwall's rarely updated Webometrics:


Holmberg's original Webometrics.fi:


Whereas Webometric Thoughts comes in with a relatively respectable:


Maybe quantitative methodologies do have a few limitations.

Nb. Holmberg's latest blog incarnation webometrics.fi/blog receives a more respectable 'junior high school' ranking, but I live with the expectation that future posts will drag it back down :-).

Monday, 29 October 2007

Dear Technorati, what is wrong with my authority?

I must admit to having an unhealthy interest in web statistics, especially when they relate to my own web site. It is therefore annoying to note that my Technorati authority doesn't seem to be worth as much as everyone else's. Whilst their filtering system allows users to filter the results according to whether hits have any authority, a little authority, some authority, or a lot of authority, my authority seems to count for little, and my results (for the term webometrics anyway) only seem to appear for people not interested in the authority of the posts.

I am a reasonable person, and wouldn't expect my hard-earned authority of 5 to appear under 'a lot of authority', and maybe not even 'some authority', but surely under 'a little authority'! Especially as others are appearing under 'a little authority' with an authority of 1.

Web statistics are nothing but trouble.

Wednesday, 17 October 2007

Week One of Google Analytics

Last Tuesday (at about lunchtime) I started utilising Google Analytics so that I could see whether anyone was accidentally stumbling across my blog. Up until then the only indication I received was if someone left a comment, as traffic data from my web host is considered an extra and costs £15 per year! Today I can see, for the first time, a week's worth of data, although as a webometrician it is not surprising that I have been looking at the data numerous times over the last week.

Since the introduction of Google Analytics I have had 78 unique users from 11 different countries, and whilst that is not exactly setting the world on fire, I can at least rest in the knowledge that I wouldn't be doubling my audience by sending my mother the URL.

The most curious finding was the amount of traffic driven to my site by Google for a post I wrote on the 16th of September about a rather idiotic Facebook group called 'Leamington Spa Celebrity Mental Spotting'. The traffic emphasises that it is not necessarily the topic that is important, but rather the uniqueness of the topic. Whilst there are millions of people searching for 'iPhone' and 'Facebook', there are millions of posts on those subjects; whereas there are only a few people searching for 'leamington spa celebrity mental spotting' but the small number of posts means that mine is likely to be near the top of the pile.

The statistics also point out the necessity of making the blog more engaging: most users only viewed the one page. Whilst a pre-defined template is never going to be very exciting, the ease of use makes them very appealing.

Who knows, maybe with the help of Google Analytics I will have over 100 unique users next week!

Friday, 12 October 2007

Yahoo Site Explorer serves conflicting results

Search engines play an important role in webometric studies, as most researchers have neither the processing power, the bandwidth, nor the inclination to attempt to crawl and index the whole of the web themselves. However, search engine data is very imprecise: the numbers of results are estimates, varying according to which server is being searched, how deep into the results the user is digging, and, it now emerges, whether you are logged in or not... at least according to Yahoo Site Explorer.

The volatility of the results can be seen through looking at the results for www.seroundtable.com, the site that just published a story about this particular cause of search result variety (I'm not sure if I have come across it before). In fact there seems to be much more variation than the site first mentioned. Results depend on: whether you are logged in; how deep you dig into the results; and whether you are looking at the same page as the results (the number of inlinks can be seen when viewing the pages indexed, and equally the number of pages indexed can be seen when viewing the inlinks).

When looking at the first page (or the tenth page) of results for pages indexed, not logged in:
Inlinks = 35,171
When looking at the first page of results for inlinks, not logged in:
Inlinks = 56,200
When looking at the tenth page of results for inlinks, not logged in:
Inlinks = 55,288
When looking at the first page of results for pages indexed, logged in:
Inlinks = 187,124
When looking at the tenth page of results for pages indexed, logged in:
Inlinks = 223,242
When looking at the first page of results for inlinks, logged in:
Inlinks = 195,239
When looking at the tenth page of results for inlinks, logged in:
Inlinks = 222,681
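The spread in these figures can be summarised with a short Python sketch. The tuple layout and variable names are illustrative choices; the numbers are the seven inlink counts listed above, with the first row standing for the result that was the same on the first and tenth pages.

```python
from statistics import mean

# (view, results page, logged in?, reported inlinks) as listed above.
observations = [
    ("pages indexed", 1, False, 35_171),
    ("inlinks", 1, False, 56_200),
    ("inlinks", 10, False, 55_288),
    ("pages indexed", 1, True, 187_124),
    ("pages indexed", 10, True, 223_242),
    ("inlinks", 1, True, 195_239),
    ("inlinks", 10, True, 222_681),
]

counts = [n for _, _, _, n in observations]
lowest, highest = min(counts), max(counts)
ratio = highest / lowest

# Average reported count by login state.
mean_logged_out = mean(n for _, _, logged_in, n in observations if not logged_in)
mean_logged_in = mean(n for _, _, logged_in, n in observations if logged_in)

print(lowest, highest, f"{ratio:.1f}x")
print(round(mean_logged_out), round(mean_logged_in))
```

The highest reported count is more than six times the lowest for the same site, and logging in accounts for most of that difference.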

Whilst appreciating that search engines can't know everything, they could at least have the decency to reflect this by not giving such specific results... obviously what an academic really wants is access to the data itself, but we may as well wish for the moon on a stick.

Friday, 28 September 2007

Christmas search term analysis

Hitwise have just published a list of the top hot Christmas gadgets based on search term analysis, which provides "great insight into people's habits and desires". However, when the iPhone fails to make the top 10 mobile phones you have to question the methodology.

Hitwise analysed:
the top 2,000 search terms that sent traffic to a Hitwise Custom Category consisting of the top 100 online retail websites in the UK during the four weeks ending 22nd September 2007.

Rather than listing the gadgets that people are after, it may be that the list shows those gadgets that: people are after AND online retail websites dominate the search results.

Hitwise's excuse that: "The new iPod Touch and the UK release of the iPhone were announced too late to have a significant impact on the retail search data", doesn't seem to hold much water, as we can see from Google Trends that searches for the iPhone in the UK are up with the N95, whilst the Nokia 5300 doesn't even register.

There is a lot of interesting data held in the logs of web servers, but it is important that we don't get carried away with how much we read into them.

Thursday, 13 September 2007

Webometrics is addictive!

Despite knowing the meaninglessness of many of the simple web metrics that can be calculated online and the inaccuracies that are inherent in the different tools available, for some reason I find that I am compelled to look at them.

The lack of inlinks or comments is not very surprising for a new blog. Many of the early posts are feeling one's way, determining what sort of areas are going to be discussed; 'finding one's voice' as the more pretentious may say. Nonetheless there are already things of note for the addicted webometrician, albeit mostly about the tools themselves:
-Why does Blogpulse claim that I enthusiastically posted 16 posts on the 10th of September when looking at the blog I see I posted twice?
-Why has Technorati failed to index my post on Facebook metrics whilst seemingly indexing every other post?

And most importantly:
-Who is the lone Alexa user who visited three of my pages?


Although Alexa statistics are notoriously hit or miss, as relatively few web users have the software installed and, once installed, it is often labelled spyware, they do allow comparisons between web sites. As an addicted webometrician the ability to compare my own blog with a fellow webometrician's is too hard to turn down. Webometrics.fi:

Unfortunately I lose this time, but it is still early days....and surely this is the smallest margin possible?
