Webometric Thoughts

July 29, 2008

Cuil: You can’t out Google Google

Filed under: Cuil,Google,search engine — admin @ 8:17 am

Cuil.com is the new search engine that, as ReadWriteWeb point out, got rather a lot of publicity for its launch. Its publicity seems to be based on the fact that it is run by some ex-Googlers, and that it makes some big claims about the size of the index. However, I seem to be missing the feature that will make it a Google killer.

Whilst wanting to be part of the next generation of search engines, it seems to be playing a rather old-fashioned game by going for the simple interface and bragging about the size of its index.

Most search engines gave up index-bragging years ago. Beyond a certain number of pages the size of an index becomes quite meaningless for all but the most obscure of queries. If anything, a larger index may hamper the results as more low-quality pages will be included. It is best to focus on a quality crawl rather than the biggest possible crawl.

Whilst the public love a simple interface (they are simple creatures), it brings nothing new to the market. Whatever way you try to rank the data, whether PageRank or BrowseRank, there is only so much you can do with a simple keyword search: people will continue to use homographs and fail to use appropriate search terms.

Whilst you can only really tell how good or bad a ranking algorithm is by using it regularly, first impressions of Cuil are not good. A simple search for webometrics fails to find any of the three main webometrics blogs, whilst the Statistical Cybermetrics Research Group at the University of Wolverhampton is coupled with a photo of a guy in a turban. No one in the group wears a turban.

Cuil is definitely no Google killer. These days there are a million and one reasons to go to the Google site besides search, and any new entrant into the market needs to offer something outstanding to break the monopoly. Cuil has nothing.

April 12, 2008

Suicide and the Internet: Some flaws in the study

Filed under: BMJ,search engine,suicide,webometrics — admin @ 10:59 am

Webometric investigations rarely gain mainstream interest. Yesterday, however, one did: a content analysis of the top 10 sites, on the four major search engines, for 12 searches relating to suicide. This highlighted the large number of hits that were to ‘dedicated suicide sites’ (e.g. pro-suicide, encouraging, describing methods, or portraying suicide in fashionable terms): 90 out of 480 hits. Unsurprisingly this gained the interest of numerous news sites including the BBC. There are, however, a number of problems with the study: not all search terms are equal, and not all search engines are equal. Whilst we all make sweeping statements about web phenomena, we should really save them for our blogs rather than publication in the likes of the British Medical Journal (BMJ).

The main problem of the investigation is a focus on the information that is retrievable rather than the information that is actually being retrieved, which quickly muddies the water. Whilst the combining of search engines would initially seem to underestimate the scale of the problem, the propensity of users to use certain search terms would seem to indicate that the article has overestimated the scale of the problem.

The Google Effect
The majority of the statistics provided in the paper are based on the combined results of Google, Yahoo, MSN, and Ask:
-90/480 were dedicated suicide sites
-62/480 were sites forbidding suicide
-59/480 were sites discouraging suicide
However, almost 70% of searches use Google, which as the results show has the highest number of dedicated suicide sites in the results. Weighting the four engines equally would therefore seem to underestimate the problem: whereas just under a fifth of the hits were dedicated suicide sites overall, for the most influential search engine the figure rises to just under a quarter. However, when looking at the search terms used, we soon realise that the problem has been overstated.
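The arithmetic behind this argument can be sketched in a few lines of Python. The overall counts (90, 62 and 59 out of 480) come from the paper. The per-engine rates and market shares passed to `share_weighted_rate` are hypothetical placeholders, chosen only to be roughly in line with ‘just under a quarter’ for Google and ‘almost 70%’ of searches. They are not figures from the study.

```python
# Overall counts reported in the paper (combined top-10 results, 480 hits).
TOTAL_HITS = 480
counts = {"dedicated": 90, "forbidding": 62, "discouraging": 59}


def proportion(count, total=TOTAL_HITS):
    """Unweighted share of hits, treating every engine and query equally."""
    return count / total


def share_weighted_rate(rates, shares):
    """Average per-engine rates, weighting each engine by its share of searches."""
    total_share = sum(shares.values())
    return sum(rates[engine] * shares[engine] for engine in rates) / total_share


for label, count in counts.items():
    print(f"{label}: {proportion(count):.1%}")

# HYPOTHETICAL inputs, for illustration only: not taken from the study.
example_rates = {"google": 0.24, "others": 0.17}
example_shares = {"google": 0.70, "others": 0.30}
weighted = share_weighted_rate(example_rates, example_shares)
print(f"share-weighted dedicated rate: {weighted:.1%}")
```

With any inputs of this shape, the share-weighted rate sits closer to Google's own rate than the simple 90/480 average does, which is the sense in which combining the engines equally understates what a typical searcher sees.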

Search Term Analysis
Whilst the BMJ lists the 12 search terms used, gathered partly from interview data and search suggestions used by search engines, a quick investigation shows that they are by no means used in equal measure. Of the twelve terms only four were used often enough to generate search graphs in Google Trends:
-suicide
-suicide methods
-how to commit suicide
-how to kill yourself
And even amongst these four there was a wide variation in usage, with the overwhelming majority of queries being generated by the term ‘suicide’.

A content analysis of Google’s ‘suicide’ results
Below are the top ten links I received when looking at the global results from google.co.uk for the term ‘suicide’, and how I would classify them. Whilst the BMJ study emphasises that it doesn’t restrict the results to the UK, it does not mention whether it uses google.com or google.co.uk. I have used google.co.uk as, unless you ask it otherwise, google.com will redirect British users to google.co.uk.
-Miscellaneous – Wikipedia’s suicide page
-Against suicide – Suicide…read this first
-Against suicide – Suicide.com
-Academic or policy site – Mind fact sheet
-Academic or policy site – Stanford encyclopaedia of philosophy
-Prevention or support site – Kids Health answers and advice -suicide
-Prevention or support site – Problems of life: Suicide
-Not relevant – Facebook suicide: the end of a virtual life
-Prevention or support site – Depression and suicide in men
-Prevention or support site – BBC: Health conditions: Suicide
Whilst classification is notoriously difficult to get agreement on, none of these sites could be considered the sort of ‘dedicated suicide sites’ that will spread panic through Middle England.
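For anyone wanting to repeat the exercise with their own results, the tally is easy to reproduce. The sketch below simply counts the ten classifications listed above. The category label ‘Dedicated suicide site’ is my own shorthand for the study's category, and the classifications themselves are my subjective judgement rather than the study's coding.

```python
from collections import Counter

# My classifications of the top ten google.co.uk results for 'suicide',
# in the order they are listed in this post.
classifications = [
    "Miscellaneous",
    "Against suicide",
    "Against suicide",
    "Academic or policy site",
    "Academic or policy site",
    "Prevention or support site",
    "Prevention or support site",
    "Not relevant",
    "Prevention or support site",
    "Prevention or support site",
]

tally = Counter(classifications)
for category, n in tally.most_common():
    print(f"{category}: {n}/{len(classifications)}")

# None of the ten results was classified as a dedicated suicide site.
dedicated = tally.get("Dedicated suicide site", 0)
print(f"Dedicated suicide site: {dedicated}/{len(classifications)}")
```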

I have no doubt that there are plenty of sites on the web that encourage suicide, but before we start a panic we need to have a greater understanding of how people are searching on the topic of suicide when they are feeling suicidal. We can’t just lump together the findings of different searches on different search engines and say that statistically we have a problem.

The most popular search on the most popular search engine on the topic of suicide does not find any ‘dedicated suicide sites’.

The original BMJ article:
Biddle, L., Donovan, J., Hawton, K., Kapur, N., & Gunnell, D. (2008). Suicide and the internet. British Medical Journal, 336(12 April 2008), p. 800-802.
Can be found here.

March 21, 2008

Giga-blast from the past

Filed under: API,Gigablast,Google,search engine — admin @ 9:32 am

It is all too easy to forget about some of the alternative search engines out there, and I must admit that I can’t remember the last time I used Gigablast. It was therefore good to read on ResearchBuzz that Gigablast are now offering site search, which I have now added to the right-hand frame of my blog (too often people overlook the blog search in the blogger toolbar/banner).

Gigablast seems to have had a bit of a make-over since I last visited (when it looked something like THIS), and now it even has a very limited API. Personally I would like to see the API extended and a few advanced operators added; surely that’s an easy way of getting a competitive advantage over the other search engines.

Personally I hate the growth of Google search, and love any opportunity to support other search engines.

February 1, 2008

There is only one story: Microsoft offer $44.6bn for Yahoo!

Filed under: Google,Microsoft,Yahoo,live search,search engine — admin @ 1:45 pm

I have not had a chance to check out the blogosphere today, but I am guessing that there is only one big story, Microsoft’s offer of $44.6bn in cash and shares for Yahoo. Whilst the rumour has been spreading for a while that either Google or Microsoft would buy out Yahoo, it is still a shock to read that the offer has been made. Personally I would like to see Yahoo compete successfully as an independent company, but if it has to be bought I would rather see it bought by Microsoft than Google.

Google is too powerful a web presence, especially in the realm of search. Having a single organisation that provides access to all the information on the web doesn’t really bear thinking about, but that is the situation we seem to be (sleep)walking towards. Surely we have passed the point at which Google’s success in web search means that it has broken its “don’t be evil” philosophy.

This is the wake-up call that we need to start breaking the Google addiction. Personally I will be heading to Ask.com for my searching…as long as I remember.

January 9, 2008

Wikia: More than Wiki-Mahalo?

Filed under: Jimmy Wales,mahalo,search engine,wikia — admin @ 10:59 am

The most anticipated launch of the year so far was the Wikia Search Alpha on Monday (although with the PhD calling I had to force myself to ignore it until today). Whilst I welcome the idea of a transparent search engine, and hate the dominance of Google in the English-speaking world (such a monopoly is not in the public interest), I am unsure how well the Wikia Search project will work, or even if I want it to succeed.

Search engine transparency appeals to me both as a regular search engine user, and as a webometrician. I broadly agree with Wales’s awfully worded sentiment that “sunlight is the best disinfectant”, and access to crawling and ranking data could allow webometricians to have access to larger crawls of the web without compromising our understanding of where the data comes from. However, I am not a big fan of the human-edited directory approach.

Wikia, like Mahalo before it, have decided to mix up what we used to label directories and search engines, and create human-edited search engines, where for certain searches the search engine results include a human-edited article on the topic. Whilst Mahalo pays people to create these articles, Wikia will allow any user to add to an article, in the true wiki spirit. But whilst you may be willing to contribute to a wiki article for the good of the community, would you be willing to continue doing this when the money starts to roll into Jimmy Wales’s pockets? Wikia Search is a commercial venture.

Whilst human-edited search engines may have their place in the world of search, they appeal to the generalist rather than the specialist. Pages for the iPhone and Heroes will quickly appear, but I can’t imagine people will still be jumping on board with enthusiasm if they start seeing Wales get rich, and such a search engine needs mass participation for it to be of any use for those interested in more than the latest Paris Hilton gossip.

In a perfect world an individual would have the opportunity to personalise a search engine’s ranking system to their own requirements, preferably for each specific query, varying the effects of different features according to the sort of data that was required, e.g., links, keywords, anchor text, domains. However, such personalisation is still a distant dream.
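To make the idea concrete, here is a minimal sketch in Python of what per-query personalisation might look like: each page carries a score for each feature, and the user supplies their own weights for the query in hand. Everything here is hypothetical (the `Page` class, the `rank` function, the feature names, the URLs and the scores are all made up for illustration), and a simple weighted sum is just one way the features could be combined. It is not how any existing search engine ranks pages.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    features: dict  # feature name -> score between 0 and 1


def rank(pages, weights):
    """Order pages by a weighted sum of their feature scores, highest first."""
    def score(page):
        return sum(weights.get(name, 0.0) * value
                   for name, value in page.features.items())
    return sorted(pages, key=score, reverse=True)


# Made-up pages and scores, purely for illustration.
pages = [
    Page("example.org/a", {"links": 0.9, "keywords": 0.2, "anchor_text": 0.4, "domain": 0.5}),
    Page("example.org/b", {"links": 0.3, "keywords": 0.9, "anchor_text": 0.6, "domain": 0.5}),
]

# Two different user preferences for the same query.
link_heavy = {"links": 1.0, "keywords": 0.1}
keyword_heavy = {"links": 0.1, "keywords": 1.0}

print([p.url for p in rank(pages, link_heavy)])
print([p.url for p in rank(pages, keyword_heavy)])
```

The same two pages swap places depending on whether the user favours links or keywords, which is exactly the sort of control a fixed, opaque ranking does not offer.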

October 18, 2007

China redirects: Could my ISP do the same?

Filed under: Baidu,China,Dalai Lama,George Bush,Google,search engine — admin @ 5:30 pm

The big news in the blogosphere this afternoon seems to be Chinese surfers being redirected from the US search engines to Baidu, with many suggesting that it is a reaction to George Bush recognising the Dalai Lama. Whilst the blogosphere is unsurprisingly outraged, personally I quite like the idea of having my ISP stop me going to Google.

We all have certain URLs we type into the address bar automatically. If I am searching for something I find myself typing ‘google.com’ without a thought for Ask, MSN, Yahoo, or any of the thousand other search engines available. If I am momentarily at a loss as to what to do next I find myself returning to my emails for the umpteenth time, or checking my bloglines for the zillionth time. If my ISP forced me to use another search engine every now and again, or forced me to have reasonable periods of time elapse before returning to the same web site again and again, I am sure I would utilise the web much more productively.

Yes, I know, civil liberties, blah blah blah…I’m just saying that there is an up side.

September 27, 2007

Major Live Search updates

Filed under: Microsoft,live search,search engine — admin @ 9:07 am

Microsoft have announced their biggest update to Live Search since its debut. Unfortunately, whilst everyone seems to be talking about it, no one is raving about it; whilst it is accepted as an important piece of news, no one seems to think it is a particularly exciting bit of news. The general belief seems to be that the search wars are over (at least in the U.S. and the U.K.) and that Google has won. Personally I live in the hope that the existing players manage to take back some of Google’s excessive portion of the search market, and that there will be serious new entrants in the market.

I hate the fact that Google currently deals with over 60% of all searches, and feel ashamed every time I find myself typing in ‘www.google.com’ in a zombie-like trance; no single organisation should have such powerful influence over access to information on the web. When Google entered the search market they raised the bar of expectations for search engines, and as yet (many years later) the other search engines have failed to successfully reply. That is not to say they won’t, but rather that it is going to take something truly new and innovative. The new search engines at the moment seem to just be rehashing old ideas, with some being a repackaging of a directory and others going for the conversational English that failed in the original Ask Jeeves.

As more users start creating on the web, rather than just consuming, there are many new sources of information for a search engine to tap into: rich, formatted information. The successful search engines are likely to be those that find the best ways of making use of this new information.
