The thoughts of a web 2.0 research fellow on all things in the technological sphere that capture his interest.

Friday, 22 January 2010

Semantic Webometrics - A few thoughts

The other day an academic colleague asked what I was working on at the moment. In my answer I included semantic webometrics, and unsurprisingly he wanted some more detail. However, 'working on' would be a bit of an exaggeration; 'have a few ideas but nothing on paper yet' would have been more appropriate. As such I thought I'd write down some of my rough thoughts on semantic webometrics.

Webometrics
For those who may have stumbled upon this blog from a non-webometric background, Webometrics as defined by Björneborn (2004), and as used by most of the webometrics community, means the:
...study of the quantitative aspects of the construction and use of information resources, structures and technologies on the Web drawing on bibliometric and informetric approaches.
Many of these quantitative studies have focused on hyperlinks. For example, investigating whether there is a correlation between a university's inlinks (a.k.a. backlinks) and a university's research ranking, or whether the interconnectedness of organisations in a region (as seen through interlinking web sites) can give an indication of a region's level of innovation [outrageous self-citation].

One of the problems with many of these link-analyses is that they include a lot of noise. For example, when counting a university's inlinks you will be counting both those from an academic highlighting a university's quality research, and those from the disgruntled student highlighting his most hated tutor. Traditionally we have tried to understand the extent of this noise through large scale content analysis - the extremely tedious manual classification of web links and web pages.

The semantic web
A semantic web is one where information on the web is structured so that it is meaningful to computers. Well known examples of the semantic web include the FOAF ontology, which allows people to express their relationships with one another (e.g., the FOAF of Tim Berners-Lee), and the use of microformats for certain types of structured content including contact details (as included at www.davidstuart.co.uk) and reviews (which are now indexed by Google as Rich Snippets). This extra information can be used to reduce the amount of noise and enable meaningful webometric studies.

Semantic webometrics
So when I say semantic webometrics I mean webometric studies that make use of the additional information included in an increasingly semantic web.

For example, a semantic webometric study of the connection between an institution's inlinks and research ranking would take into consideration who had placed the links and the attributes that they had associated with them. A semantic webometric study of the relationships between organisations would look at the explicit relationships contained in FOAF files as well as the implicit information on web pages.

Conclusions
Unfortunately there is relatively little semantic information embedded in the majority of web pages/sites, and where it is widespread, e.g., with the nofollow link attribute, webometricians have yet to develop the tools to make use of it.
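
As a rough illustration of what such a tool might look like, the sketch below separates a page's links into those with and without the nofollow attribute. It uses only Python's standard html.parser module; the LinkCollector class, the count_links function, and the sample HTML are hypothetical names and data invented for this example, not an existing webometric tool.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. "external nofollow"
        rel_values = (attributes.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


def count_links(html):
    """Return (followed, nofollow) link counts for a page's HTML."""
    collector = LinkCollector()
    collector.feed(html)
    nofollow = sum(1 for _, flagged in collector.links if flagged)
    return len(collector.links) - nofollow, nofollow


# Hypothetical page fragment, invented for illustration only.
sample = (
    '<p><a href="http://example.ac.uk/research">research</a> '
    '<a rel="nofollow" href="http://example.ac.uk/tutor">rant</a> '
    '<a rel="external nofollow" href="http://example.org/">other</a></p>'
)
print(count_links(sample))
```

A real study would of course need a crawler and some thought about what a nofollow link actually signifies; this only shows that the attribute itself is easy to read.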

As such we need to take an information-centred approach to semantic webometric research rather than a problem-centred approach. Whilst still small, there is an increasing amount of semantic data being embedded in the web all the time; webometricians need to investigate what is available and how they can use it.

posted by David at | 0 Comments Links to this post

Saturday, 30 May 2009

From Webometrician to Web Analyst?

On 22nd July 2009 my job as web 2.0 research fellow at the University of Wolverhampton finishes. As the only other webometrics research post currently available is in South Korea, and I'm not really a 9-5 office type person, I will [probably] be going into business for myself: Commercialising webometrics. Unfortunately, as there are only a handful of people who know what webometrics is and what a webometrician would do, the hunt is on for a new job title.

The most obvious job title is 'web analyst', although the slightly wordier 'web analytics consultant' would probably give a better indication of the services I can offer. Neither, however, sounds particularly cutting edge, exciting, or (like webometrician) rhymes with magician! Even after I have decided on a job title I will have to select names for the services I offer. Is 'web impact analysis' catchy enough? Naming children seems like a piece of cake in comparison.

One thing I am sure about: I will not be a search engine optimizer offering search engine optimization! Any other suggestions welcomed.

Monday, 27 April 2009

A Wolverhampton Network Diagram: It's a local affair

A couple of posts ago I was complaining about how annoying my job was as I tried to draw conclusions from the jumbled mess of environmental technology websites. Today's post points out that it isn't always such a jumbled mess.

I have just done a far smaller (and less scientific) data collection for a presentation I am doing in Wolverhampton tomorrow:

It is a link diagram of a few web sites in Wolverhampton and the surrounding area to illustrate the sort of work my research group does.

What is noticeable from a webometric perspective is how many of the web sites included in the study are actually connected: you can link anywhere in the world, but the web is primarily a local affair.

Wednesday, 22 April 2009

I Hate My Job: The Web is Just a Jumbled Mess!

At the moment I am investigating the linking between 1337 environmental technology web sites. Of the 1337 sites, 751 nodes create one large network:

You spend days sorting a list of URLs, collecting data, finding errors, starting again...and at the end you just have a big ball of string.

A webometrician's job is to draw conclusions from such a jumbled mess: I hate my job.
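
For anyone wondering where a figure like '751 nodes in one large network' comes from, here is a minimal sketch of the calculation. It assumes the links have already been collected as (source, target) pairs and ignores link direction; the function name and the toy data are hypothetical, not the software or data used in the study.

```python
from collections import defaultdict, deque


def largest_component_size(sites, links):
    """Size of the largest group of sites joined by links (direction ignored)."""
    neighbours = defaultdict(set)
    for source, target in links:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    largest = 0
    for site in sites:
        if site in seen:
            continue
        # Breadth-first search outwards from each site not yet visited.
        queue = deque([site])
        seen.add(site)
        size = 0
        while queue:
            current = queue.popleft()
            size += 1
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        largest = max(largest, size)
    return largest


# Hypothetical toy data: five sites, three of which are linked together.
sites = ["a.example", "b.example", "c.example", "d.example", "e.example"]
links = [("a.example", "b.example"), ("c.example", "b.example")]
print(largest_component_size(sites, links))
```

The sketch assumes every site mentioned in a link also appears in the list of sites. Finding the size of the ball of string is the easy part; working out what it means is still the job.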

Friday, 20 February 2009

A Philosophy of Linking: Does The Pirate Bay need a webometrician?

As members of The Pirate Bay stand trial, Bill Thompson points out the need for a philosophy of linking:
The Pirate Bay case hinges on what counts as infringement, and whether simply linking to a site is enough to make someone liable, treating a hypertext link to a third-party URL as an endorsement, as something that makes a connection between two web pages or information sources that has real legal significance and weight.

Yet it is nothing of the sort. Ever since Tim Berners-Lee defined the Hypertext Markup Language and its Uniform Resource Locators one fundamental thing has applied - a link is just a link....

Perhaps we need a 'philosophy of linkage' to explore what the use of a link can signify, before the lawyers decide it for us and limit the creative potential of the web through their lack of imagination and understanding.

The theory of linking often comes up as a topic of conversation in webometrics, in much the same way as a theory of citation is discussed in bibliometrics. Unfortunately it often takes a back seat to those webometric areas with more obvious real-world applications, e.g., the creation of web indicators.

Only a couple of months ago a colleague and I started working on a 'Theory of Linking', but other work got in the way and the paper remains unfinished. Who knows, maybe if we had written the paper we could have been the first webometricians to be expert witnesses!

Sunday, 15 February 2009

Twitter, Politics, and Looking for Meaningful Metrics

As Twitter seems to be the latest shiny web site that has everyone interested, and with a general election on its way (well, June 2010 at the latest), I decided to see how the political parties have taken to Twitter.

The simplest comparison is between the raw numbers of the parties:
Obviously these numbers don't look good for the Labour Party: not listening and not many followers. They don't even have a single account, but rather two different streams with the same information.

Whilst such comparisons will be made with increasing regularity as the election approaches, for example:
..., we quickly realise we need to take into consideration a far wider variety of Twitter accounts and other metrics.

@DowningStreet, the official Twitter channel for the office of the Prime Minister, provides a totally different perspective on the Labour Party's fortunes.
If @DowningStreet's Twitter friends were an indication of support, Gordon could expect a landslide victory at the next general election. Unfortunately things are not that simple. As one comment to @DowningStreet shows, people follow for many different reasons:
any chance next week i can have a pic taken outside No.10? im visiting for a few days? i know its cheeky but i had to ask!
Obviously @DowningStreet is not the only other UK political Twitterer; many individuals, groups and departments have accounts, all contributing to the complex picture of the UK political landscape.

Twitter potentially offers a lot of useful information about both the attitude of the parties to the electorate, and the electorate to the parties. Unfortunately, as with all webometric studies, for meaningful answers to be arrived at there need to be distinct methodical steps rather than just a grabbing of raw data:
1) Select appropriate Twitter accounts to answer the research question.
2) Investigate Twitter interactions:
Not only 'do they follow and have followers', but are they ReTweeting comments and responding to questions directed at them?
3) Investigate the nature of the interactions:
Unfortunately the simplest way of finding out the nature of many of the connections is to analyse the comments, a very long and tedious process.
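
Step 2 at least lends itself to a crude first pass. The sketch below assumes an account's tweets are already available as plain text and relies on the conventions that a retweet starts with 'RT @' and a reply starts with '@'; the function name and the sample tweets are hypothetical and invented for illustration, and nothing here touches the Twitter API.

```python
def interaction_profile(tweets):
    """Classify an account's tweets as retweets, replies, or broadcasts."""
    profile = {"retweet": 0, "reply": 0, "broadcast": 0}
    for text in tweets:
        stripped = text.strip()
        if stripped.upper().startswith("RT @"):
            # Assumed convention: manual retweets begin with "RT @user"
            profile["retweet"] += 1
        elif stripped.startswith("@"):
            # Assumed convention: replies begin with the other user's name
            profile["reply"] += 1
        else:
            profile["broadcast"] += 1
    return profile


# Hypothetical tweets, invented for illustration only.
tweets = [
    "RT @example_user: Budget statement at 12:30",
    "@another_user Thanks for the question, details to follow",
    "New press release published today",
    "rt @third_user: interesting",
]
print(interaction_profile(tweets))
```

An account that only ever broadcasts is telling you something, but the counts say nothing about the nature of the interactions, which is where the tedious step 3 comes in.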

As with so many things on the web, it would be interesting to investigate, if only one had the time.

Thursday, 5 February 2009

An Unimpressive EThOS from the British Library

One of the hundreds of posts in my feed-reader this morning was about the British Library electronic theses service (via SCIT blog). As my own thesis should be included I decided to indulge in a bit of vanity searching. Result: EThOS has a long way to go.

I would expect my thesis to turn up for the term 'webometrics', in fact it is about the only term for which someone might actually want to read it. Unfortunately the only webometric thesis belongs to Xuemei Li:

My thesis does however turn up for the wholly inappropriate 'bibliometrics':

Seemingly the reason for my appearance under 'bibliometrics' and not 'webometrics' is that 'bibliometrics' appears in my abstract whereas 'webometrics' does not. Whilst this may seem reasonable at first, theoretically the University of Wolverhampton are taking part in the project and their record includes a number of keywords carefully selected by me, including 'webometrics'. The British Library also fails to provide a link to my thesis, despite it being scattered over the web like confetti: "Not yet available for download".

Young academics brought up on Google Scholar, with full text searching and links to the numerous copies on the web, are unlikely to see the value in EThOS and its traditional OPAC style. Whilst I'd like to see an electronic thesis online service that separates the wheat from the chaff, with full text searching and links to the documents, and believe that librarians could aid in retrieval with classification of such documents, this is not what EThOS is currently offering. It's still in Beta, and likely to improve, but it has a frighteningly long way to go and you do wonder whether they should have buddied up with one of the big search engines to produce a more user friendly version.

Saturday, 3 January 2009

Webometric Word Clouds: an unscientific comparison

Whilst contemplating creating word clouds from search engine results (what else do people think about on a Saturday afternoon?) I started to wonder what my thesis would look like as a word cloud. More specifically, would it end up looking like the autobiography for Mike Thelwall? A quick copy and paste of 163 pages of text into Wordle later:

Maybe articles and theses should have a word cloud before the abstract to help users decide at a glance whether it is even worth reading the abstract.

How does my word cloud compare with other recent webometric theses?

Friday, 21 November 2008

Google SearchWiki: Cleaning up the Webometric results

For some reason Google always saves its big releases for those days when I am busy. Could it be that they are fearful of my criticism? Or merely coincidence? Whatever the reason I couldn't help but push other things to one side and comment on Google's new SearchWiki. Basically, when you are logged into your Google account at google.com (not currently google.co.uk) you can change the results you find on your home page: promoting results, hiding results, commenting on results. Whilst it only affects your results page, you can see how other people have ranked/commented on items, and it seems highly likely that Google will eventually incorporate the findings in its general search results.

SearchWiki is by no means a new idea; sites such as Aftervote (now Scour) have done it all before. The difference this time is the number of people Google can put to work on the idea. At the time of writing this blog a search for 'Google' had already had 908 people make notes; it would probably have taken Aftervote weeks if not months to get that many comments on a single search term. So what is the collective wisdom regarding the best search result for the term 'google' entered into Google.com...that'll be Google.com. Personally I would have thought that people are more likely to be searching for one of Google's other services or information about Google rather than the page they are already on, but no one ever accused the public of being overly bright.

As someone who likes to do his bit for collective wisdom, I have taken steps to clean up one of my most regular searches, 'webometrics':

Just the three adjustments: promotion of the most important site, questioning the validity of a colleague's page, and the removal of a character who has no right to call himself a webometrician. But I am sure everyone would agree that such amendments improve the page astronomically.

Whilst I am sure that sheer weight of numbers will prevent the spamming of the top searches, it will be interesting to monitor the spam on the fringes. Will people be looking at the notes others have made? I will. SearchWiki seems as though it will give great insight into what people think of different sites; I just hope Google adds it to their API.

UPDATE: Whilst I initially said it was only available on Google.com, it's seemingly not as simple as that. When I log into Google.com with my webometrics account I get SearchWiki, when I log in with my gmail account I don't get SearchWiki! It seems as though they are taking steps to restrict access geographically.

Tuesday, 28 October 2008

Does Bibliometrics need a Blogger?

Whilst searching on Google Blog Search for 'webometrics' I noticed that the usual webometric blogs are listed as 'Related Blogs':

As I had just been blogging on the subject of bibliometrics, I decided to see which blogs are listed as related for that topic. Surprisingly there aren't any:

[Although two blogs are 'related' to Scientometrics].

If blogs are a useful way for sharing the latest news and information in a particular discipline, as well as the promotion of a discipline, then surely bibliometrics would benefit from the odd bibliometrician blogging occasionally [...for the sake of inter-disciplinary relations I will eschew the joke about bibliometricians being odd]. Admittedly the webometric blogs are not the best example of academic blogging, but it is a burgeoning online community of sorts.

Friday, 3 October 2008

It's Porn Friday!!!

It's not that today has been designated the official porn day of the year, merely that Friday is the day when adult web sites get most of their traffic. That's just one of the facts scattered throughout Bill Tancer's Click: What Millions of People are Doing Online and Why It Matters, albeit the most memorable:

Whilst very much a popular book, rather than an academic book, it's a worthwhile read from a webometric perspective. If nothing else you can curse the limited amount of data we have access to in comparison to our commercial counterparts: Whereas we have to count links, they get to follow click-streams; following the mood and reactions of people around the world.

Whilst there is obviously big money to be made with the Hitwise data, as well as with the data of their competitors, maybe they would find the data even easier to sell if it had been shown to stand up to the rigour of the academic community and the peer review process. My door is always open :-)

Thursday, 2 October 2008

Google 2001 v. Google 2008

In honour of their 10th birthday Google brought back their oldest available index a couple of days ago: Google 2001. This provides a great opportunity for looking at how the web has changed, especially the growth of certain terms in comparison to others.

As a webometrician, the obvious choice is to see how 'webometrics' has grown. However with changes in the index size the results are only meaningful in comparison to another result. In this case I have decided on 'Mike Thelwall', the hyper-productive author of over 100 papers in the field, who, luckily, also has an unusual name.


Whilst there were a similar number of documents at the start, and both have grown at an extremely fast rate, webometrics has grown at the faster rate. Scientific proof that there is more to webometrics than Mike Thelwall!
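
The comparison boils down to a little arithmetic: work out how many times larger each term's hit count is in the 2008 index than in the 2001 index, then divide one growth factor by the other so that the change in index size cancels out. The sketch below does just that; the function names are hypothetical and the hit counts are invented placeholders, not the figures from the graph.

```python
def growth_factor(hits_then, hits_now):
    """How many times larger the hit count is now than it was then."""
    return hits_now / hits_then


def relative_growth(term_then, term_now, baseline_then, baseline_now):
    """Growth of a term relative to the growth of a baseline term.

    A value above 1 means the term has grown faster than the baseline.
    """
    return growth_factor(term_then, term_now) / growth_factor(
        baseline_then, baseline_now
    )


# Invented placeholder counts (NOT the real Google figures).
webometrics_2001, webometrics_2008 = 100, 50_000
thelwall_2001, thelwall_2008 = 100, 20_000

print(relative_growth(webometrics_2001, webometrics_2008,
                      thelwall_2001, thelwall_2008))
```

With only two points per term this is hardly a trend, and search engine hit counts are estimates at best.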

It would be nice if Google opened up some other indexes so that more points could be added to the graph.

Wednesday, 24 September 2008

Google Insights for Search: Term order is all important!

Unfortunately most poor academics don't have access to the same data as Bill Tancer; instead we generally have to make do with the crumbs from Google and the other search engines. This morning, however, I was reminded about how careful we need to be when using the tools the search engines offer us.

Today I was using Google Insights for Search to compare the terms cybermetrics and webometrics. Whilst I am part of the Statistical Cybermetrics Research Group, as a group we tend to discuss 'webometrics'. Google Insights for Search clearly shows that whilst there was once a time when cybermetrics ruled supreme, webometrics is now far more popular.

More importantly, however, I also noticed that Iran wasn't highlighted on the map for the term 'webometrics', despite Iran having a (relatively) strong webometrics community.

Basically, because Iran does not appear in the results for 'cybermetrics' (which was my first search term), it is not calculated for 'webometrics'. If I had added the term 'webometrics' first, and then the term 'cybermetrics', the map would have looked very different:

The solution would seem to be to include a universal search term first, but those that immediately spring to mind are not necessarily the sort that you would want appearing on a corporate slide-show.

Friday, 5 September 2008

Webometrician v. Webometrician: Who will conquer the world first?

One of the joys of Google Analytics is watching the map slowly filling up as you get traffic from different parts of the world. However, whilst North America and Western Europe quickly fill up, other parts of the world have been more reluctant to visit my Webometric Thoughts. Almost a year after I started using Google Analytics there has still been no traffic from many countries in Africa.

Oh, what a tangled web we weave... is wondering how to start filling his map, hoping to attract visitors from Ukraine, Belarus, Georgia, Armenia, and Moldova. Whilst I am also waiting for some traffic from Belarus and Georgia, at least I can sleep comfortably in the knowledge of 28 visits from the Ukraine, 2 from Armenia, and 1 from Moldova.

Whilst the gauntlet has been thrown down by Kim at Oh, what a tangled web we weave..., I would expect the Belarusian, Georgian, and Armenian traffic to arrive by the end of the week (especially as I have sensibly included the demonyms as well as country names). And whilst Kim has decided to include the terms Google and Facebook in his post to increase the likelihood of traffic, I'm going with the Google Insights for Search suggestions of Minsk, Tbilisi, and Yerevan.

Update: Ooops...just realised I was chasing Armenian traffic after already having had Armenian traffic. So it should really say "I would expect the Belarusian, Georgian, and EXTRA Armenian traffic to arrive by the end of the week"

Thursday, 21 August 2008

Iterasi: Create your own archive!

The UK's web archive is pretty rubbish, therefore Iterasi (highlighted by TechCrunch) is a great addition to the web.

Rather than merely bookmarking a URL, you can archive the actual page, and can continue archiving the page on a regular basis if you so wish. The only downsides to the site are that it only allows you to archive on a daily basis (for the front pages of news sites you may want to archive more regularly), and it only archives when your computer, with its list of scheduled saves, is turned on.

The potential for webometric studies is obvious; it would seem as though even the most technologically incompetent of us can now simply collect longitudinal data. For example, Google searches may be collected on a daily basis to see how the results or the number of hits changes...and once you have archived a page, it's very simple to then embed the page:

It also has potential for bloggers; when they discuss a page or story bloggers can now be sure that their readers will have access to the page that they saw rather than an updated version. How content providers will react to the archiving of their content is yet to be seen.

Thursday, 14 August 2008

Happy Blog-iversary!!


Today is the one year anniversary of my Webometric Thoughts blog! Unfortunately, despite having a Google anniversary logo commissioned especially for the event (way back in January), Google have decided to give preference to another Olympic logo today instead.

Over the last year I have managed to blog fairly regularly (this is my 286th post), and this has been reflected in a steady increase in traffic. Since I started using Google Analytics in October I have had 15,484 absolute unique visitors:

Most importantly, the number of unique visitors can be seen to be increasing month on month. This increase can also be seen in my Alexa ranking:

When checking my Alexa ranking back in September my ranking was 8,926,204, whilst in January it was 3,816,072. Whilst Alexa changed its ranking algorithm in April, today's results show an improvement on the 1,607,649 I got then. Even Technorati shows an improvement, as I am now in the top half a million blogs!

So, what are the aims of Webometric Thoughts over the next year:
-Break into the top 100,000 web sites (according to Alexa)
-Break into the top 100,000 blogs (according to Technorati)
-Make the blog self-financing (since starting to use Google Ads in March I have earned $14.05...I need to earn approximately $50 a year).
-And, obviously, write higher quality posts.

Wednesday, 6 August 2008

Google Insights for Search: What next?

In addition to Google Trends, Google are now offering Google Insights for Search (http://google.com/insights/search/#) (via TechCrunch). Not only can you filter the terms by category, for example helping to distinguish between Apple (Computers & Electronics) and apple (Food & Drink), but it will also give a nice visual representation of the geographic data.

We can now quickly see that Iran is the country most interested in webometrics:

The maps also offer a whole new type of vanity searching. The "David Stuart" brand has yet to make major inroads in Africa, Asia or South America. I was grateful, however, to find that my own vanity searches had not overly affected the results (at a city level London is the hub rather than Wolverhampton).

Some bloke called Barack Obama, on the other hand, seems to have made inroads all over, with the exception of the Middle East.

The obvious question, based on the directory structure of the Insights for Search URL (http://google.com/insights/search/#), is what other insight services are Google going to offer? Insights for Maps? Insights for Shopping? Insights for News?

Webometricians are NOT Web Celebrities!!

When it comes to being a web celebrity, it is not surprising to find that webometricians are near the bottom of the pile; a fact I blame on our spending too much time counting other people's links rather than creating content worth linking to. Anyway, Wired have created a nifty little application (highlighted by Media Futurist) that can help you determine your 'web celebrity' score by using data from Google's Social Graph.

At the moment it only bases your score on MySpace, Twitter, and your blog/web site, so your score depends a lot on how much you use these sites; my thousands of Facebook friends and hundreds of delicious bookmark followers mean nothing. Nonetheless, true to Webometric Thoughts fashion, a comparison of the three main webometrics blogs/bloggers (only using their twitter and blog addresses):

Holmberg's Oh what a tangled web we weave... :
2 (twitter) + 4 (blog) = 6
Thelwall's Webometrics Blog :
10 (twitter) + 15 (blog) = 25
My Webometric Thoughts:
6 (twitter) + 7 (blog) = 13

To give these numbers a bit of perspective, Barack Obama's current ranking is 9,069 (4,509 without MySpace). Thelwall may have won this battle, but we are all losing the war. It would be interesting to see, however, how the Celebrity Meter compares with a qualitative evaluation of web celebrity, such as Forbes' list of the top 25 web celebrities.

Whilst 'web celebrity' is just a bit of fun, it does show the potential of the Google Social Graph data, and as far as I am aware no webometrician has used it to any practical purpose yet.

Friday, 25 July 2008

A Webometric Thesis

The finishing of a PhD is more of a whimper than a bang. It has been seven months since I handed in my thesis, and despite having had only the most minor of revisions (total time approximately 4hrs), I have only just received the certificate for my masterpiece:

Whilst there are often complaints about the inability of government to work as effectively as 'the marketplace', we should all be grateful that academia is not in charge of the country; nothing would happen for years on end.

As many weeks have also passed since I sent my thesis to the University's electronic repository, and it still hasn't appeared online, I have decided to put it online myself.

Title:
Web Manifestations of Knowledge-based Innovation Systems
Abstract:
Innovation is widely recognised as essential to the modern economy. The term knowledge-based innovation system has been used to refer to innovation systems which recognise the importance of an economy’s knowledge base and the efficient interactions between important actors from the different sectors of society. Such interactions are thought to enable greater innovation by the system as a whole. Whilst it may not be possible to fully understand all the complex relationships involved within knowledge-based innovation systems, within the field of informetrics bibliometric methodologies have emerged that allow us to analyse some of the relationships that contribute to the innovation process. However, due to the limitations in traditional bibliometric sources it is important to investigate new potential sources of information. The web is one such source. This thesis documents an investigation into the potential of the web to provide information about knowledge-based innovation systems in the United Kingdom.

Within this thesis the link analysis methodologies that have previously been successfully applied to investigations of the academic community (Thelwall, 2004a) are applied to organisations from different sections of society to determine whether link analysis of the web can provide a new source of information about knowledge-based innovation systems in the UK. This study makes the case that data may be collected ethically to provide information about the interconnections between web sites of various different sizes and from within different sectors of society, that there are significant differences in the linking practices of web sites within different sectors, and that reciprocal links provide a better indication of collaboration than uni-directional web links. Most importantly the study shows that the web provides new information about the relationships between organisations, rather than just a repetition of the same information from an alternative source. Whilst the study has shown that there is a lot of potential for the web as a source of information on knowledge-based innovation systems, the same richness that makes it such a potentially useful source makes applications of large scale studies very labour intensive.

Obviously the above abstract will have all but the greatest dullard champing at the bit, and I have therefore made it available in both PDF and Word Document formats.
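
For those who would rather see the reciprocal versus uni-directional distinction from the abstract than read about it, here is a minimal sketch. It assumes the links have already been reduced to site-level (source, target) pairs; the function name and the toy data are hypothetical and are not the software or data used in the thesis.

```python
def split_links(links):
    """Split directed (source, target) links into reciprocal pairs and one-way links."""
    link_set = set(links)
    reciprocal = set()
    one_way = set()
    for source, target in link_set:
        if source == target:
            continue  # ignore a site linking to itself
        if (target, source) in link_set:
            # Store each reciprocal pair once, regardless of direction.
            reciprocal.add(frozenset((source, target)))
        else:
            one_way.add((source, target))
    return reciprocal, one_way


# Hypothetical toy data, invented for illustration only.
links = [
    ("university.example", "company.example"),
    ("company.example", "university.example"),
    ("university.example", "council.example"),
]
reciprocal, one_way = split_links(links)
print(len(reciprocal), len(one_way))
```

In this toy data the university and the company link to each other, whilst the council never links back.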

Tuesday, 10 June 2008

Is the web linguistically on the left or right?

I am currently in the middle of reading David Crystal's (2006) 'Language and the Internet', an interesting book that, when it started mentioning style guides, got me wondering about whether style guides could be used to determine whether the UK web space was politically on the left, or on the right. The leading broadsheets from both sides of the political debate have publicly available style guides (i.e., The Telegraph and The Guardian), and the differences could be used for the basis of such a linguistic-webometric investigation.

My personal favourite style guide section is The Telegraph's Banned Words. Whilst the banning of terms such as 'Europhobe' has obvious political motivations, you have to wonder whether it was really necessary to explicitly ban referring to 'perverted Scout leaders' (whilst Google Trends does not show the phrase to be endemic, that may be because of the Telegraph's quick action). It is interesting to note, however, that despite the Telegraph's authoritarian values, they seem to be very lax with their own language: the supposedly banned 'mass exodus' was used only a few days ago. Surely there will be letters to the editor!

Unfortunately these days search engines try to be helpful, and ignore many of the differences. For example, 'Yahoo' and 'Yahoo!' are both treated as the same, when any fool would know that the exclamation mark reflects the searching for more conservative opinions on the search engine. It would be nice to be able to turn a search engine's 'helpful' features off occasionally.

posted by David at | 0 Comments Links to this post

Friday, 23 May 2008

The Strange Case of the Webometrician's Fan

It sometimes feels as though there is no piece of information, or opinion, that cannot be found online. If people have something in a digital format it seems natural for a large proportion of the population to publish it on the web, with little thought as to whether they would really want people looking at it for years to come.

Today my attention was drawn by the head of my research group, who had been engaging in some google-self-abuse (although he claims he was looking to see if his latest paper had been published yet), to one particular term paper: '“Webometrics”: Through the eyes of Mike Thelwall'. After working with Mike for four years I can assure readers that his appalling t-shirts are testament to the fact he is not head of the 'Sense of Humor Diagnosis Service'.

posted by David at | 0 Comments Links to this post

Friday, 18 April 2008

Alexa changed its ranking system: I'm a winner!

Alexa is an often criticised ranking of web sites, with the criticism largely based on the use of the Alexa toolbar as a source of data. The use of the toolbar data skewed the ranking in favour of those sites visited by internet marketers and search engine optimisers, those who installed the toolbar, rather than the average user. Alexa's big news, which everyone reported yesterday (e.g., Mashable, TechCrunch), is that they are now using additional data sources, although what they are is not very clear in their announcement.

Obviously with any change in the ranking system there will be winners and losers, and those who win are less likely to complain than those who lose. Personally I think the new Alexa rankings are a HUGE step forward. This conclusion is based solely on the increase in my own personal ranking. Back in January I noted that the Alexa ranking for Webometric Thoughts was 3,816,072. Today my Alexa ranking is 1,607,649 (1,389,032 for the 1 week average). Breaking into the top one million suddenly seems much easier.

posted by David at | 2 Comments Links to this post

Saturday, 12 April 2008

Suicide and the Internet: Some flaws in the study

Webometric investigations rarely gain mainstream interest; yesterday, however, one did: a content analysis of the top 10 sites, on the four major search engines, for 12 searches relating to suicide. This highlighted the large number of hits that were to 'dedicated suicide sites' (e.g., pro-suicide, encouraging, describing methods, or portraying suicide in fashionable terms): 90 out of 480 hits. Unsurprisingly this gained the interest of numerous news sites including the BBC. There are, however, a number of problems with the study: not all search terms are equal, and not all search engines are equal. Whilst we all make sweeping statements about web phenomena, we should really save them for our blogs rather than publication in the likes of the British Medical Journal (BMJ).

The main problem of the investigation is a focus on the information that is retrievable rather than the information that is actually being retrieved, which quickly muddies the water. Whilst the combining of search engines would initially seem to underestimate the scale of the problem, the propensity of users to use certain search terms would seem to indicate that the article has overestimated the scale of the problem.

The Google Effect
The majority of the statistics provided in the paper are based on the combined results of Google, Yahoo, MSN, and Ask:
-90/480 were dedicated suicide sites
-62/480 were sites forbidding suicide
-59/480 were sites discouraging suicide
However, almost 70% of searches use Google, which as the results show has the highest number of dedicated suicide sites in the results. This would seem to underestimate the problem: whereas just under a fifth of the hits were dedicated suicide sites overall, for the most influential search engine this rises to just under a quarter. However, when looking at the search terms used, we soon realise that the problem has been overstated.
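The proportions quoted above are easy to check. The following is only a minimal sketch of the arithmetic, using the three counts listed in this post; the dictionary labels and the `share` helper are my own names for illustration, not anything taken from the BMJ paper.

```python
# Counts quoted in this post for the combined top-10 results (480 hits in total).
TOTAL_HITS = 480
counts = {
    "dedicated suicide sites": 90,
    "sites forbidding suicide": 62,
    "sites discouraging suicide": 59,
}


def share(count, total=TOTAL_HITS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Percentage of all hits falling into each category.
shares = {label: share(n) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The 90 dedicated sites work out at 18.75% of the 480 hits, which is the 'just under a fifth' mentioned above.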

Search Term Analysis
Whilst the BMJ lists the 12 search terms used, gathered partly from interview data and search suggestions used by search engines, a quick investigation shows that they are by no means used in equal measure. Of the twelve terms only four were used often enough to generate search graphs in Google Trends:
-suicide
-suicide methods
-how to commit suicide
-how to kill yourself
And even amongst these four there was a wide variation in usage, with the overwhelming majority of queries being generated by the term suicide:


A content analysis of Google's 'suicide' results
Below are the top ten links I received when looking at the global results from google.co.uk for the term 'suicide', and how I would classify them. Whilst the BMJ study emphasises that it doesn't restrict the results to the UK, it does not mention whether it uses google.com or google.co.uk. I have used google.co.uk as, unless you ask it otherwise, google.com will redirect British users to google.co.uk.
-Miscellaneous - Wikipedia's suicide page
-Against suicide - Suicide...read this first
-Against suicide - Suicide.com
-Academic or policy site - Mind fact sheet
-Academic or policy site - Stanford encyclopaedia of philosophy
-Prevention or support site - Kids Health answers and advice -suicide
-Prevention or support site - Problems of life: Suicide
-Not relevant - Facebook suicide: the end of a virtual life
-Prevention or support site - Depression and suicide in men
-Prevention or support site - BBC: Health conditions: Suicide
Whilst classification is notoriously difficult to get agreement on, none of these sites could be considered the sort of 'dedicated suicide sites' that will spread panic through middle-England.

I have no doubt that there are plenty of sites on the web that encourage suicide, but before we start a panic we need to have a greater understanding of how people are searching on the topic of suicide when they are feeling suicidal. We can't just lump together the findings of different searches on different search engines and say that statistically we have a problem.

The most popular search on the most popular search engine on the topic of suicide does not find any 'dedicated suicide sites'.

The original BMJ article:
Biddle, L., Donovan, J., Hawton, K., Kapur, N., & Gunnell, D. (2008). Suicide and the internet. British Medical Journal, 336(12 April 2008), p. 800-802.
Can be found here.

posted by David at | 0 Comments Links to this post

Friday, 28 March 2008

Research v. Internet

I have just come across a picture that perfectly sums up why I never get as much work done as I should:

How can webometrics compete with dinosaurs?

(Asher Sarlin's original picture can be found HERE).

posted by David at | 0 Comments Links to this post

Thursday, 13 March 2008

Classifying the web: Herding ADHD cats

When it comes to boring jobs I like to think I have had some of the worst: taking the shells off of hard boiled eggs, taking the green bits off of tomatoes, and, most recently, classifying web links. Yes, I can classify the links at home with a constant supply of coffee and the music of my choice, but it is still one of the most boring jobs. The reason: web pages come in every imaginable form, mostly with no discernible purpose, with links placed just because the web owner can. Classifying the web is like herding ADHD cats.

The good and interesting sites that we visit every day are surrounded by a web of crap that we only usually trip across if we are unlucky. These are not necessarily offensive sites, just sites that are absolute rubbish: spam, half-formed, badly written, orphaned. Classifying the web means that we have to wallow in this web of crap. It's not like classifying a library of books, but rather like classifying a whole world of which 90% is the council rubbish tip.

posted by David at | 1 Comments Links to this post

Monday, 10 March 2008

Not all links are equal!

Thanks to a single link on the BBC's delicious roll on Saturday night, yesterday saw Webometric Thoughts get its highest number of hits ever. Whilst for many sites 121 absolute unique visitors in a day (according to Google Analytics) wouldn't be worthy of note, the webometric blogging community have fairly low aspirations.

What is interesting, from the perspective of a Google Analytics junkie, is the difference between the amount of traffic this link drove in comparison to a similar link on the BBC's delicious roll on the 16th January. Whilst the January link only drove 17 unique users to my site, Saturday's link drove 102 users over a three day period!

Was the extra traffic all due to the extra time the link was visible on the BBC? It was visible a lot longer, but weekend traffic is often slower. Or was it the topic of the posts? The first was about ISPs, whilst the second was about the iPhone. It seems equally likely that the difference in the traffic is due to the link's anchor text. Whereas the first text referred to 'David Stuart research fellow', the second link merely referenced the blog 'Webometric Thoughts' (AC seems to have done much more digging than NR).

Not all links are equal, however equal they may seem.

posted by David at | 0 Comments Links to this post

Sunday, 17 February 2008

What's Everyone Twittering About?

Whilst I am not personally a big Twitter fan, I am interested in discovering what people are Twittering about and how the posts differ from other forms of communication. With such thoughts in mind I started my first tentative Twitter steps this evening.

Adapting an open source RSS feed reader I set about downloading the public timeline (http://twitter.com/statuses/public_timeline.rss), for which Twitter has no restrictions on the number of requests that you can send. Whilst the original plan was to download an hour's worth of data for a small pilot investigation, unfortunately I had to stop after about 45 minutes when I received an HTTP 502 status code ('Twitter is down or being upgraded' rather than 'exceeded the rate limit').

The first post that was downloaded was numbered 723435732 (just after 7pm), whilst the last was numbered 723547592 (about 45 mins later). As the last digit of each number seems to be superfluous, there were a potential 11,186 posts to be downloaded, of which 6,422 posts were successfully downloaded. Many of the 'missing posts' will have been private, whilst others may have been missed due to delays in sending and receiving the RSS feed.
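For anyone wanting to check the sums, here is a rough sketch of how the figure of 11,186 falls out of the two post numbers. It rests entirely on my assumption that the final digit of each number is superfluous; the variable names are just for illustration.

```python
# Post numbers and download count quoted in the paragraph above.
FIRST_ID = 723435732
LAST_ID = 723547592
DOWNLOADED = 6422

# Assumption: the last digit carries no information, so the gap between the
# two numbers is divided by ten to estimate how many posts were made.
potential_posts = (LAST_ID - FIRST_ID) // 10

# Fraction of the potential posts that actually ended up in the sample.
coverage = DOWNLOADED / potential_posts

print(f"potential posts: {potential_posts:,}")
print(f"coverage: {coverage:.1%}")
```

On that assumption the sample covers a little over 57% of the posts made during the 45 minutes.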

I have not, as yet, had time to do anything more interesting with the collected data than look at the frequency of terms using Text-Stat. So in true informetric style, here is the log-log graph of word frequency in rank-order:

Most noticeable in the frequency data is:
-Over 58% of twitter links are via tinyurl: 'http' appeared 588 times, 'tinyurl' 343 times.
-Twitterers are generally a polite bunch. The more 'popular' swear-words don't appear that often, in 6,422 posts: shit (11), fuck (6), & cunt (zero). Admittedly a large proportion are not in English and there are a few variations on the words, but nonetheless I probably swear more than all these people in my average email.
-And they are not celeb-obsessed: Britney only gets three mentions, whilst there is no mention of Winehouse. Instead they err on the side of the geek: Windows (19), Mac (25), iPhone (20).
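The counts above came from Text-Stat, but the same sort of term count is simple to sketch in code. What follows is only an illustration and not how Text-Stat works internally: the tokenising rule is an assumption of mine, and the three sample posts are invented rather than taken from the downloaded data.

```python
import re
from collections import Counter


def term_frequencies(posts):
    """Count lower-cased alphanumeric tokens across an iterable of post strings."""
    counts = Counter()
    for post in posts:
        # Assumed tokenising rule: runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", post.lower()))
    return counts


# Hypothetical posts, made up purely to exercise the function.
sample_posts = [
    "Reading http://tinyurl.com/abc123 on my iPhone",
    "New Mac today, see http://example.com/mac",
    "iPhone or Mac? http://tinyurl.com/xyz789",
]

freq = term_frequencies(sample_posts)

# Terms in rank order, the ordering behind a rank-frequency plot.
ranked = freq.most_common()

# Share of links going via tinyurl, using the counts reported in this post.
reported_share = 343 / 588
print(f"tinyurl share of links: {reported_share:.1%}")
```

With the real counts ('http' 588 times, 'tinyurl' 343 times) the share comes out at 58.3%, hence the 'over 58%' above.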

As the analysis shows, these are early (childish) days. But hopefully I will have the opportunity, later in the week, to create the tools to investigate the data more thoroughly before downloading a larger sample.

posted by David at | 0 Comments Links to this post

Thursday, 14 February 2008

Web Impact Factors for Blogs

Oh what a tangled web we weave... has just posted an interesting article on the problems of calculating the impact of a blog. In summary: whilst a web site's impact has traditionally been measured by dividing the number of inlinks by the number of web pages (the Web Impact Factor), the feed aggregators are having such a disproportionately large effect on the results that they are useless. Whilst this is true, it is by no means the only reason for dismissing the use of the traditional WIF in determining the impact that a blog is having.

Other important factors that need to be discussed are:
1. The use of 'number of pages' as a denominator.
The number of pages has been used as a denominator to normalise for the size of organisations, whereas in this case it is normalising for the quantity of output as each of the blogs only has one author. Do we want to assess the value of individual posts, or the value of the blog/blogger?
2. The effect of the blogger posting comments on other people's web sites.
Analysis of the links to my blog from external sites (after dismissing the feed aggregators) would find that I am the author of most of them. Commenting on other people's blogs often provides a link back to your own blog, although these tend to be to the blog's homepage rather than a specific post. Do we need to dismiss these links, or do they provide a useful indicator of a blogger's contribution to the blogosphere?
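To make the calculation concrete, here is a minimal sketch of the WIF as described above (inlinks divided by pages), with the option of dismissing links from certain sources such as feed aggregators or the blogger's own comments. The function name, the source labels and all of the numbers are hypothetical, chosen only to show how much the choice of exclusions can move the result.

```python
def web_impact_factor(inlink_sources, page_count, excluded_sources=frozenset()):
    """Inlinks divided by page count, ignoring links from excluded sources."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    counted = [s for s in inlink_sources if s not in excluded_sources]
    return len(counted) / page_count


# Hypothetical inlinks to a blog, one entry per link, labelled by source site.
inlinks = [
    "aggregator.example",
    "aggregator.example",
    "blog-a.example",
    "blog-b.example",
    "blog-b.example",
]
pages = 50  # hypothetical number of pages on the blog

raw_wif = web_impact_factor(inlinks, pages)
filtered_wif = web_impact_factor(inlinks, pages, {"aggregator.example"})

print(f"raw WIF: {raw_wif:.2f}, without aggregators: {filtered_wif:.2f}")
```

Whether the blogger's own comment links belong in the excluded set is exactly the open question raised in point 2.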

Any method we use to judge the value of a blog will have its promoters and detractors, often depending on how well it portrays their own work. Therefore I think we should stick with the WIF, at least until such time as my own webometric thoughts slip down the table.

nb. as an aside(-ish) I have noticed that for the first time Webometric Thoughts has leapt above both of the other webometricians' blogs on a Google.com search for 'webometrics'. Maybe Google rank is the only important indicator as it has such a disproportionately high effect on the success of a web site that all other indicators are now merely a reflection of it.

posted by David at | 0 Comments Links to this post

Friday, 1 February 2008

Webometrician out of retirement: When is a blog ever dead?

After almost a year without a single post the original webometrics blogger has decided to start blogging again. Whilst a more suspicious blogger would put the sudden revival down to the next generation of bloggers starting to encroach upon the top search engine results for the webometrics field, we will choose to believe the "loss of login details" excuse.

The question of when a blog can be officially pronounced 'dead' arises a lot in webometric investigations of the blogosphere. Many studies ignore blogs that have not been updated in the last week, the last two weeks, or the last month. Maybe it is more appropriate to consider all blogs active, unless they state otherwise.
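The cut-offs mentioned above amount to a simple date comparison. As a rough sketch only (the function name and both dates are hypothetical, standing in for a blog that has been silent for almost a year), it might look like this:

```python
from datetime import date, timedelta


def is_active(last_post, today, window_days):
    """Treat a blog as active if its last post falls within the window."""
    return (today - last_post) <= timedelta(days=window_days)


# Hypothetical dates: a blog whose last post was almost a year ago.
today = date(2008, 2, 1)
last_post = date(2007, 2, 10)

# The three windows mentioned above: a week, two weeks, a month.
status = {days: is_active(last_post, today, days) for days in (7, 14, 30)}
print(status)
```

Under all three windows such a blog would have been discarded as dead, even though it later came back to life.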

Whilst I welcome any academic back to the blogosphere, maybe future posts will be a bit more accurate than the last:
"Live Search withdrawal of all link search operators"
Whilst Live Search has severely limited the operators it offers, it still offers a linkfromdomain operator, which provides results that are not available via Yahoo or Google.

posted by David at | 0 Comments Links to this post

Tuesday, 18 December 2007

Google retaliates: reported collateral damage

According to Mashable some Google users are reporting receiving a large number of messages claiming their searches are looking like automated requests. If Google continues with a tightened security system there will be repercussions for those webometricians who use scrapers rather than the Google API, but more importantly, Google may use it as an opportunity to encourage/force users to log on: surely if you log on, there is less chance of receiving 'automated request' accusations.

posted by David at | 0 Comments Links to this post

Thursday, 13 December 2007

Record numbers visit Webometric Thoughts!

Whilst it wouldn't be much of a record in comparison to Google, MSN, or even the local corner shop's web site, Webometric Thoughts finally got the 30 unique visitors in a day that have eluded it for so long. In fact it got 31 on Tuesday, and then 40 yesterday.

If the number of visitors continues to grow at 29% each day, in forty days (and forty nights) I will get a million visitors in a day for the first time. Although I may wait before I upgrade my server data package.
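For the sceptical, the forty days can be checked with a quick sketch of the compound-growth arithmetic. It assumes growth starts from yesterday's 40 visitors and continues at a constant 29% a day; the function name is made up for the purpose.

```python
import math


def days_to_reach(target, current, daily_growth):
    """Whole days of compound growth needed for current to reach target."""
    if current <= 0 or daily_growth <= 0:
        raise ValueError("current and daily_growth must be positive")
    return math.ceil(math.log(target / current) / math.log(1 + daily_growth))


# Growth implied by going from 31 visitors on Tuesday to 40 yesterday.
growth = 40 / 31 - 1

# Days until a million visitors in a day, starting from 40 at 29% a day.
days = days_to_reach(1_000_000, 40, 0.29)

print(f"daily growth: {growth:.0%}, days to a million: {days}")
```

The same function can be used to see how quickly the estimate changes if the growth rate proves less generous.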

posted by David at | 0 Comments Links to this post

Friday, 30 November 2007

Webometric Thoughts: For the more discerning webometrician

In the small field of webometrics there are few blogs, but after finding the blog readability test (via Halavais), I have discovered that mine is the most mature of the three I follow (or is it just that mine is more incomprehensible??).

Thelwall's rarely updated Webometrics:


Holmberg's original Webometrics.fi:


Whereas Webometric Thoughts comes in with a relatively respectable:


Maybe quantitative methodologies do have a few limitations.

Nb. Holmberg's latest blog incarnation webometrics.fi/blog receives a more respectable 'junior high school' ranking, but I live with the expectation that future posts will drag it back down :-).

posted by David at | 1 Comments Links to this post

Monday, 29 October 2007

Dear Technorati, what is wrong with my authority?

I must admit to having an unhealthy interest in web statistics, especially when they relate to my own web site. It is therefore annoying to note that my Technorati authority doesn't seem to be worth as much as everyone else's. Whilst their filtering system allows users to filter the results according to whether hits have: any authority, a little authority, some authority, or a lot of authority; my authority seems to account for little, and my results (for the term webometrics anyway) only seem to appear for people not interested in the authority of the posts.

I am a reasonable person, and wouldn't expect my hard-earned authority of 5 to appear under 'a lot of authority', and maybe not even 'some authority', but surely under 'a little authority'! Especially as others are appearing under 'a little authority' with an authority of 1.

Web statistics are nothing but trouble.

posted by David at | 0 Comments Links to this post

Wednesday, 17 October 2007

Week One of Google Analytics

Last Tuesday (at about lunchtime) I started utilising Google Analytics so that I could see whether anyone was accidentally stumbling across my blog. Up until then the only indication I received was if someone left a comment, as traffic data from my web host is considered an extra and costs £15 per year! Today I can see, for the first time, a week's worth of data. Although as a webometrician it is not surprising that I have been looking at the data numerous times over the last week.

Since the introduction of Google Analytics I have had 78 unique users from 11 different countries, and whilst that is not exactly setting the world on fire, I can at least rest in the knowledge that I wouldn't be doubling my audience by sending my mother the URL.

The most curious finding was the amount of traffic I had driven to my site by Google for a post I wrote on the 16th of September about a rather idiotic Facebook group called 'Leamington Spa Celebrity Mental Spotting'. The traffic emphasises that it is not necessarily the topic that is important, but rather the uniqueness of the topic. Whilst there are millions of people searching for 'iPhone' and 'Facebook', there are millions of posts on those subjects; whereas there are only a few people searching for 'leamington spa celebrity mental spotting' but the small number of posts means that mine is likely to be near the top of the pile.

The statistics also point out the necessity of making the blog more engaging: most users only viewed the one page. Whilst a pre-defined template is never going to be very exciting, the ease of use makes them very appealing.

Who knows, maybe with the help of Google Analytics I will have over 100 unique users next week!

posted by David at | 0 Comments Links to this post

Friday, 12 October 2007

Yahoo Site Explorer serves conflicting results

Search engines play an important role in webometric studies as most researchers have neither the processing power, the bandwidth, nor the inclination to attempt to crawl and index the whole of the web themselves. However, search engine data is very imprecise: the numbers of results are estimates, varying according to which server is being searched, how deep into the results the user is digging, and now, it emerges, according to whether you are logged in or not...at least according to Yahoo Site Explorer.

The volatility of the results can be seen through looking at the results for www.seroundtable.com, the site that just published a story about this particular cause of search result variety (I'm not sure if I have come across it before). In fact there seems to be much more variation than the site first mentioned. Results depend on: whether you are logged in; how deep you dig into the results; and whether you are looking at the same page as the results (the number of inlinks can be seen when viewing the pages indexed, and equally the number of pages indexed can be seen when viewing the inlinks).

When looking at the first page (or the tenth page) of results for pages indexed, not logged in:
Inlinks = 35,171
When looking at the first page of results for inlinks, not logged in:
Inlinks = 56,200
When looking at the tenth page of results for inlinks, not logged in:
Inlinks = 55,288
When looking at the first page of results for pages indexed, logged in:
Inlinks = 187,124
When looking at the tenth page of results for pages indexed, logged in:
Inlinks = 223,242
When looking at the first page of results for inlinks, logged in:
Inlinks = 195,239
When looking at the tenth page of results for inlinks, logged in:
Inlinks = 222,681
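To put a number on the volatility, here is a small sketch that takes the seven figures listed above and compares the lowest with the highest. The way the observations are keyed (view, results page, logged in or not) is just my own arrangement for illustration; the first entry stands for both the first and the tenth page, which gave the same figure.

```python
# Inlink estimates listed above, keyed by (view, results page, logged in).
inlink_estimates = {
    ("pages indexed", 1, False): 35171,
    ("inlinks", 1, False): 56200,
    ("inlinks", 10, False): 55288,
    ("pages indexed", 1, True): 187124,
    ("pages indexed", 10, True): 223242,
    ("inlinks", 1, True): 195239,
    ("inlinks", 10, True): 222681,
}

lowest = min(inlink_estimates.values())
highest = max(inlink_estimates.values())

# How many times larger the highest estimate is than the lowest.
ratio = highest / lowest

print(f"lowest: {lowest:,}  highest: {highest:,}  ratio: {ratio:.1f}x")
```

The highest estimate is more than six times the lowest, for what is supposedly the same number.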

Whilst appreciating that search engines can't know everything, they could at least have the decency to reflect this by not giving such specific results...obviously what an academic really wants is access to the data itself, but we may as well wish for the moon on a stick.

posted by David at | 0 Comments Links to this post

Friday, 28 September 2007

Christmas search term analysis

Hitwise have just published a list of the top hot Christmas gadgets based on search term analysis, which provides "great insight into people's habits and desires". However, when the iPhone fails to make the top 10 mobile phones you have to question the methodology.

Hitwise analysed:
the top 2,000 search terms that sent traffic to a Hitwise Custom Category consisting of the top 100 online retail websites in the UK during the four weeks ending 22nd September 2007.

Rather than listing the gadgets that people are after, it may be that the list shows those gadgets that: people are after AND online retail websites dominate the search results.

Hitwise's excuse that: "The new iPod Touch and the UK release of the iPhone were announced too late to have a significant impact on the retail search data", doesn't seem to hold much water, as we can see from Google Trends that searches for the iPhone in the UK are up with the N95, whilst the Nokia 5300 doesn't even register.

There is a lot of interesting data held in the logs of web servers, but it is important that we don't get carried away with how much we read into them.

posted by David at | 0 Comments Links to this post

Thursday, 13 September 2007

Webometrics is addictive!

Despite knowing the meaninglessness of many of the simple web metrics that can be calculated online and the inaccuracies that are inherent in the different tools available, for some reason I find that I am compelled to look at them.

The lack of inlinks or comments is not very surprising for a new blog. Many of the early posts are feeling one's way, determining what sort of areas are going to be discussed; 'finding one's voice' as the more pretentious may say. Nonetheless there are already things of note for the addicted webometrician, albeit mostly about the tools themselves:
-Why does Blogpulse claim that I enthusiastically posted 16 posts on the 10th of September when looking at the blog I see I posted twice?
-Why has Technorati failed to index my post on Facebook metrics whilst seemingly indexing every other post?

And most importantly:
-Who is the lone Alexa user who visited three of my pages?


Although Alexa statistics are notoriously hit or miss, as relatively few web users have the software installed and, once installed, it is often labelled spyware, it does allow comparisons between web sites. As an addicted webometrician the ability to compare my own blog with a fellow webometrician's is too hard to turn down. Webometrics.fi:

Unfortunately I lose this time, but it is still early days....and surely this is the smallest margin possible?

posted by David at | 1 Comments Links to this post