Webometric Thoughts

May 23, 2008

The Strange Case of the Webometrician’s Fan

Filed under: Mike Thelwall,webometrics — admin @ 9:01 am

It sometimes feels as though there is no piece of information, or opinion, that cannot be found online. If people have something in a digital format it seems natural for a large proportion of the population to publish it on the web, with little thought as to whether they would really want people looking at it for years to come.

Today my attention was drawn by the head of my research group, who had been engaging in some google-self-abuse (although he claims he was looking to see if his latest paper had been published yet), to one particular term paper: ‘“Webometrics”: Through the eyes of Mike Thelwall’. After working with Mike for four years I can assure readers that his appalling t-shirts are testament to the fact that he is not head of the ‘Sense of Humor Diagnosis Service’.

April 18, 2008

Alexa changed its ranking system: I’m a winner!

Filed under: alexa,webometric thoughts,webometrics — admin @ 10:07 am

Alexa is an often criticised ranking of web sites, with the criticism largely based on the use of the Alexa toolbar as a source of data. The use of the toolbar data skewed the ranking in favour of those sites visited by internet marketers and search engine optimisers, those who installed the toolbar, rather than the average user. Alexa’s big news, which everyone reported yesterday (e.g., Mashable, TechCrunch), is that they are now using additional data sources, although what they are is not very clear in their announcement.

Obviously with any change in the ranking system there will be winners and losers, and those who win are less likely to complain than those who lose. Personally I think the new Alexa rankings are a HUGE step forward. This conclusion is based solely on the increase in my own personal ranking. Back in January I noted that the Alexa ranking for Webometric Thoughts was 3,816,072. Today my Alexa ranking is 1,607,649 (1,389,032 for the 1 week average). Breaking into the top one million suddenly seems much easier.

April 12, 2008

Suicide and the Internet: Some flaws in the study

Filed under: BMJ,search engine,suicide,webometrics — admin @ 10:59 am

Webometric investigations rarely gain mainstream interest. Yesterday, however, one did: a content analysis of the top 10 sites, on the four major search engines, for 12 searches relating to suicide. This highlighted the large number of hits that were to ‘dedicated suicide sites’ (e.g., pro-suicide, encouraging, describing methods, or portraying suicide in fashionable terms): 90 out of 480 hits. Unsurprisingly this gained the interest of numerous news sites including the BBC. There are, however, a number of problems with the study: not all search terms are equal, and not all search engines are equal. Whilst we all make sweeping statements about web phenomena, we should really save them for our blogs rather than publication in the likes of the British Medical Journal (BMJ).

The main problem of the investigation is a focus on the information that is retrievable rather than the information that is actually being retrieved, which quickly muddies the water. Whilst the combining of search engines would initially seem to underestimate the scale of the problem, the propensity of users to use certain search terms would seem to indicate that the article has overestimated the scale of the problem.

The Google Effect
The majority of the statistics provided in the paper are based on the combined results of Google, Yahoo, MSN, and Ask:
-90/480 were dedicated suicide sites
-62/480 were sites forbidding suicide
-59/480 were sites discouraging suicide
However, almost 70% of searches use Google, which, as the results show, has the highest number of dedicated suicide sites in its results. The combined figure would therefore seem to underestimate the problem: whereas just under a fifth of the hits were dedicated suicide sites overall, for the most influential search engine this rises to just under a quarter. However, when looking at the search terms used, we soon realise that the problem has been overstated.
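The combined proportions are easy enough to check. The sketch below only uses the counts quoted above; the `share` helper is my own naming, and the per-engine figures are left out because I have not reproduced them here.

```python
TOTAL_HITS = 480  # 12 searches x 4 search engines x top 10 results

# Counts as quoted above from the BMJ paper (all four engines combined).
combined_counts = {
    "dedicated suicide sites": 90,
    "sites forbidding suicide": 62,
    "sites discouraging suicide": 59,
}

def share(count, total=TOTAL_HITS):
    """Return a count as a percentage of all hits."""
    return 100 * count / total

shares = {label: share(n) for label, n in combined_counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

The dedicated suicide sites come out at 18.75% of the combined hits, which is the ‘just under a fifth’ referred to above.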

Search Term Analysis
Whilst the BMJ lists the 12 search terms used, gathered partly from interview data and search suggestions used by search engines, a quick investigation shows that they are by no means used in equal measure. Of the twelve terms only four were used often enough to generate search graphs in Google Trends:
-suicide
-suicide methods
-how to commit suicide
-how to kill yourself
And even amongst these four there was a wide variation in usage, with the overwhelming majority of queries being generated by the term suicide:

A content analysis of Google’s ‘suicide’ results
Below are the top ten links I received when looking at the global results from google.co.uk for the term ‘suicide’, and how I would classify them. Whilst the BMJ study emphasises that it doesn’t restrict the results to the UK, it does not mention whether it uses google.com or google.co.uk. I have used google.co.uk as, unless you ask it otherwise, google.com will redirect British users to google.co.uk.
-Miscellaneous – Wikipedia’s suicide page
-Against suicide – Suicide…read this first
-Against suicide – Suicide.com
-Academic or policy site – Mind fact sheet
-Academic or policy site – Stanford encyclopaedia of philosophy
-Prevention or support site – Kids Health answers and advice -suicide
-Prevention or support site – Problems of life: Suicide
-Not relevant – Facebook suicide: the end of a virtual life
-Prevention or support site – Depression and suicide in men
-Prevention or support site – BBC: Health conditions: Suicide
Whilst classification is notoriously difficult to get agreement on, none of these sites could be considered the sort of ‘dedicated suicide sites’ that will spread panic through middle-England.
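For anyone who wants to repeat the exercise with their own results, the tally can be sketched in a few lines. The categories are my own classifications from the list above, and the label ‘Dedicated suicide site’ is simply the name I have given to the category the BMJ study is worried about.

```python
from collections import Counter

# (category, result) pairs, in the order they appear in the list above.
results = [
    ("Miscellaneous", "Wikipedia's suicide page"),
    ("Against suicide", "Suicide...read this first"),
    ("Against suicide", "Suicide.com"),
    ("Academic or policy site", "Mind fact sheet"),
    ("Academic or policy site", "Stanford encyclopaedia of philosophy"),
    ("Prevention or support site", "Kids Health answers and advice"),
    ("Prevention or support site", "Problems of life: Suicide"),
    ("Not relevant", "Facebook suicide: the end of a virtual life"),
    ("Prevention or support site", "Depression and suicide in men"),
    ("Prevention or support site", "BBC: Health conditions: Suicide"),
]

tally = Counter(category for category, _ in results)

# No result in this particular top ten was given this label.
dedicated = tally.get("Dedicated suicide site", 0)

for category, count in tally.most_common():
    print(f"{category}: {count}")
print(f"Dedicated suicide site: {dedicated}")
```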

I have no doubt that there are plenty of sites on the web that encourage suicide, but before we start a panic we need to have a greater understanding of how people are searching on the topic of suicide when they are feeling suicidal. We can’t just lump together the findings of different searches on different search engines and say that statistically we have a problem.

The most popular search on the most popular search engine on the topic of suicide does not find any ‘dedicated suicide sites’.

The original BMJ article:
Biddle, L., Donovan, J., Hawton, K., Kapur, N., & Gunnell, D. (2008). Suicide and the internet. British Medical Journal, 336(12 April 2008), p. 800-802.
Can be found here.

March 28, 2008

Research v. Internet

Filed under: research,webometrics — admin @ 1:36 pm

I have just come across a picture that perfectly sums up why I never get as much work done as I should:

How can webometrics compete with dinosaurs?

(Asher Sarlin’s original picture can be found HERE).

March 13, 2008

Classifying the web: Herding ADHD cats

Filed under: classifying,link analysis,links,webometrics — admin @ 10:24 am

When it comes to boring jobs I like to think I have had some of the worst: taking the shells off of hard boiled eggs, taking the green bits off of tomatoes, and, most recently, classifying web links. Yes, I can classify the links at home with a constant supply of coffee and the music of my choice, but it is still one of the most boring jobs. The reason: web pages come in every imaginable form, mostly with no discernible purpose, with links placed just because the web owner can. Classifying the web is like herding ADHD cats.

The good and interesting sites that we visit every day are surrounded by a web of crap that we only usually trip across if we are unlucky. These are not necessarily offensive sites, just sites that are absolute rubbish: spam, half-formed, badly written, orphaned. Classifying the web means that we have to wallow in this web of crap. It’s not like classifying a library of books, but rather like classifying a whole world of which 90% is the council rubbish tip.

March 10, 2008

Not all links are equal!

Filed under: BBC,link analysis,webometrics — admin @ 9:45 pm

Thanks to a single link on the BBC’s delicious roll on Saturday night, yesterday saw Webometric Thoughts get its highest number of hits ever. Whilst for many sites 121 absolute unique visitors in a day (according to Google Analytics) wouldn’t be worthy of note, the webometric blogging community have fairly low aspirations.

What is interesting, from the perspective of a Google Analytics junkie, is the difference between the amount of traffic this link drove in comparison to a similar link on the BBC’s delicious roll on the 16th January. Whilst the January link only drove 17 unique users to my site, Saturday’s link drove 102 users over a three-day period!

Was the extra traffic all due to the extra time the link was visible on the BBC? It was visible a lot longer, but weekend traffic is often slower. Or was it the topic of the posts? The first was about ISPs, whilst the second was about the iPhone. It seems equally likely that the difference in the traffic is due to the link’s anchor text. Whereas the first text referred to ‘David Stuart research fellow’, the second link merely referenced the blog ‘Webometric Thoughts’ (AC seems to have done much more digging than NR).

Not all links are equal, however equal they may seem.

February 17, 2008

What’s Everyone Twittering About?

Filed under: Twitter,Zipf,tinyurl,webometrics — admin @ 10:09 pm

Whilst I am not personally a big Twitter fan, I am interested in discovering what people are Twittering about and how the posts differ from other forms of communication. With such thoughts in mind I started my first tentative Twitter steps this evening.

Adapting an open source RSS feed reader I set about downloading the public timeline (http://twitter.com/statuses/public_timeline.rss), for which Twitter has no restrictions on the number of requests that you can send. Whilst the original plan was to download an hour’s worth of data for a small pilot investigation, unfortunately I had to stop after about 45 minutes when I received an HTTP 502 status code (‘Twitter is down or being upgraded’ rather than ‘exceeded the rate limit’).
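The core of such a collector is small. What follows is not the feed reader I adapted, just a minimal sketch of the parse-and-deduplicate step in Python. It assumes the feed is ordinary RSS 2.0 with a `guid` and a `title` for each item, the sample document and function names are made up for illustration, and the fetching loop (and what to do on a 502) is left out.

```python
import xml.etree.ElementTree as ET

def parse_timeline(rss_text):
    """Return (guid, title) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("guid", default=""), item.findtext("title", default=""))
        for item in root.iter("item")
    ]

def merge_new_posts(seen, posts):
    """Add unseen posts to `seen` (a dict keyed by guid); return how many were new."""
    new = 0
    for guid, title in posts:
        if guid not in seen:
            seen[guid] = title
            new += 1
    return new

# A made-up two-item feed standing in for one download of the public timeline.
SAMPLE = """<rss version="2.0"><channel>
<item><guid>http://twitter.com/example/statuses/1</guid><title>example: first post</title></item>
<item><guid>http://twitter.com/example/statuses/2</guid><title>example: second post</title></item>
</channel></rss>"""

seen = {}
first_pass = merge_new_posts(seen, parse_timeline(SAMPLE))
second_pass = merge_new_posts(seen, parse_timeline(SAMPLE))  # same feed again
print(first_pass, second_pass, len(seen))
```

Keying on the guid means that polling the same timeline repeatedly only ever adds posts that have not been seen before, however much successive downloads overlap.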

The first post that was downloaded was numbered 723435732 (just after 7pm), whilst the last was numbered 723547592 (about 45 mins later). As the last digit seems to be superfluous, there were a potential 11,186 posts to be downloaded, of which 6,422 posts were successfully downloaded. Many of the ‘missing posts’ will have been private, whilst others may have been missed due to delays in sending and receiving the RSS feed.
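The arithmetic behind those figures, for anyone who wants to check it, is sketched below. The only assumption is the one made above: that the final digit of each post number can be dropped.

```python
FIRST_ID = 723435732   # first post downloaded
LAST_ID = 723547592    # last post downloaded
DOWNLOADED = 6422      # posts actually collected

# Drop the final digit of each ID before taking the difference.
potential = LAST_ID // 10 - FIRST_ID // 10

# Fraction of the potential posts that were actually collected.
coverage = DOWNLOADED / potential

print(potential, f"{coverage:.1%}")
```

So a little over half of the potential posts in the 45 minutes were collected.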

I have not, as yet, had time to do anything more interesting with the collected data than look at the frequency of terms using Text-Stat. So in true informetric style, here is the log-log graph of word frequency in rank-order:

Most noticeable in the frequency data is:
-Over 58% of twitter links are via tinyurl: ‘http’ appeared 588 times, ‘tinyurl’ 343 times.
-Twitterers are generally a polite bunch. The more ‘popular’ swear-words don’t appear that often, in 6,422 posts: shit (11), fuck (6), & cunt (zero). Admittedly a large proportion are not in English and there are a few variations on the words, but nonetheless I probably swear more than all these people in my average email.
-And they are not celeb-obsessed: Britney only gets three mentions, whilst there is no mention of Winehouse. Instead they err on the side of the geek: windows (19), Mac (25), iPhone (20).
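The counts above came from Text-Stat, but the same rank-frequency data can be produced with a short script. This is a rough stand-in rather than what Text-Stat does: the tokenising rule is my own simplification and the three sample posts are made up.

```python
import math
import re
from collections import Counter

def rank_frequency(texts):
    """Return (rank, term, count) tuples, most frequent term first."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z0-9']+", text.lower())
    )
    return [
        (rank, term, n)
        for rank, (term, n) in enumerate(counts.most_common(), start=1)
    ]

def log_log_points(ranked):
    """Convert ranked counts to (log10 rank, log10 frequency) pairs for plotting."""
    return [(math.log10(rank), math.log10(n)) for rank, _, n in ranked]

# Made-up posts, purely to show the shape of the output.
sample_posts = [
    "the cat and the dog",
    "the iphone",
    "a mac and a dog",
]

ranked = rank_frequency(sample_posts)
points = log_log_points(ranked)

# The tinyurl share quoted above: 'tinyurl' 343 times against 'http' 588 times.
tinyurl_share = 343 / 588

print(ranked[:3])
print(f"{tinyurl_share:.1%}")
```

Plotting the `points` list gives the log-log graph, and the tinyurl share works out at just over 58%, as stated in the first bullet.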

As the analysis shows, these are early (childish) days. But hopefully I will have the opportunity, later in the week, to create the tools to investigate the data more thoroughly before downloading a larger sample.

February 14, 2008

Web Impact Factors for Blogs

Filed under: web impact factor,webometrics,wif — admin @ 9:52 am

Oh what a tangled web we weave… has just posted an interesting article on the problems of calculating the impact of a blog. In summary: whilst a web site’s impact has traditionally been measured by dividing the number of inlinks by the number of web pages (the Web Impact Factor), the feed aggregators are having such a disproportionately large effect that the results are useless. Whilst this is true, it is by no means the only reason for dismissing the use of the traditional WIF in determining the impact that a blog is having.

Other important factors that need to be discussed are:
1. The use of ‘number of pages’ as a denominator.
The number of pages has been used as a denominator to normalise for the size of organisations, whereas in this case it is normalising for the quantity of output as each of the blogs only has one author. Do we want to assess the value of individual posts, or the value of the blog/blogger?
2. The effect of the blogger posting comments on other people’s web sites.
Analysis of the links to my blog from external sites (after dismissing the feed aggregators) would find that I am the author of most of them. Commenting on other people’s blogs often provides a link back to your own blog, although these tend to be to the blog’s homepage rather than a specific post. Do we need to dismiss these links, or do they provide a useful indicator of a blogger’s contribution to the blogosphere?
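To make the calculation concrete, here is a minimal sketch of a WIF that can leave out links from chosen hosts, whether those are feed aggregators or sites where the blogger has been leaving comments. The function name, the host names, and the link and page counts are all hypothetical; which hosts to exclude is exactly the judgement call discussed above.

```python
from urllib.parse import urlparse

def web_impact_factor(inlinks, page_count, excluded_hosts=()):
    """Inlinks divided by pages, after dropping links from excluded hosts."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    kept = [url for url in inlinks if urlparse(url).hostname not in excluded_hosts]
    return len(kept) / page_count

# Hypothetical inlinks to a blog with 20 pages.
inlinks = [
    "http://aggregator.example/feed/1",
    "http://aggregator.example/feed/2",
    "http://aggregator.example/feed/3",
    "http://blog-a.example/post/webometrics",
    "http://blog-b.example/comments/42",
]

raw_wif = web_impact_factor(inlinks, page_count=20)
filtered_wif = web_impact_factor(
    inlinks, page_count=20, excluded_hosts={"aggregator.example"}
)
print(raw_wif, filtered_wif)
```

With these made-up numbers the aggregator accounts for more than half of the raw score, which is the distortion the original article complains about.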

Any method we use to judge the value of a blog will have its promoters and detractors, often depending on how well it portrays their own work. Therefore I think we should stick with the WIF, at least until such time as my own webometric thoughts slip down the table.

nb. as an aside(-ish) I have noticed that for the first time Webometric Thoughts has leapt above both of the other webometricians’ blogs on a Google.com search for ‘webometrics’. Maybe Google rank is the only important indicator as it has such a disproportionately high effect on the success of a web site that all other indicators are now merely a reflection of it.

February 1, 2008

Webometrician out of retirement: When is a blog ever dead?

Filed under: webometrics — admin @ 9:21 am

After almost a year without a single post the original webometrics blogger has decided to start blogging again. Whilst a more suspicious blogger would put the sudden revival down to the next generation of bloggers starting to encroach upon the top search engine results for the webometrics field, we will choose to believe the “loss of login details” excuse.

The question of when a blog can be officially pronounced ‘dead’ arises a lot in webometric investigations of the blogosphere. Many studies ignore blogs that have not been updated in the last week, 2 weeks, or the last month. Maybe it is more appropriate to consider all blogs active, unless they state otherwise.
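The cut-off rule is trivial to express, which is perhaps why it is so widely used. In the sketch below the function name and the date of the last post are hypothetical; the three cut-offs are the week, two weeks and month mentioned above.

```python
from datetime import date, timedelta

def is_active(last_post, today, cutoff_days):
    """A blog counts as active if its last post falls within the cut-off window."""
    return (today - last_post) <= timedelta(days=cutoff_days)

# Hypothetical: a blog whose previous post was almost a year before today.
last_post = date(2007, 2, 10)
today = date(2008, 2, 1)

status = {days: is_active(last_post, today, days) for days in (7, 14, 30)}
print(status)
```

A blog that returns after almost a year would have been written off under every one of these cut-offs, however the study chose between them.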

Whilst I welcome any academic back to the blogosphere, maybe future posts will be a bit more accurate than the last:
Live Search withdrawal of all link search operators
Whilst Live Search have severely limited the operators it offers, they still offer a linkfromdomain operator, which provides results that are not available via Yahoo or Google.

December 18, 2007

Google retaliates: reported collateral damage

Filed under: Google,webometrics — admin @ 7:29 am

According to Mashable some Google users are reporting receiving a large number of messages claiming their searches are looking like automated requests. If Google continues with a tightened security system there will be repercussions for those webometricians who use scrapers rather than the Google API, but more importantly, Google may use it as an opportunity to encourage/force users to log on: surely if you log on, there is less chance of receiving ‘automated request’ accusations.
