Final proof of my being middle-aged came on Tuesday when I found myself filling in a form on the OED site complaining about the lack of a mobile interface for the dictionary and the term ‘webometrics’ missing from the dictionary. The reply came this morning: they have no plans for a mobile interface…“However, there are plans to provide APIs which would enable third parties to develop different interfaces for querying the OED.” I couldn’t have asked for more!
I am a heavy user of the OED; in fact it is the only subscription service that I inevitably use every day. It is not only that I am an appalling speller (which I am and have always been), but the dictionary is an essential tool for any academic. Unfortunately the lack of a simple mobile interface has meant that I don’t consult the dictionary as often as I should. Despite my being provided with a subscription by two different universities and two public libraries, the lack of a mobile interface means a long and awkward signing-in process before you even start to look at specific entries.
Whilst a decent mobile application/interface will be of greatest interest to me, an API will enable a wide range of novel applications to be built around the world’s greatest dictionary.
nb. Webometrics has now been added to their files as a ‘hint’ to the new words team.
Twitter have now released two Social Graph Methods (via Scripting News) that enable you to call a list of a user’s friends and followers. Whilst this is far quicker than checking the links between each pair of users, it unfortunately means I will have to write my program again from scratch.
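As a rough sketch of why the friends-list approach is quicker: with one list per member, the links inside a group of n people need n calls rather than n*(n-1) pairwise checks. The function names and the sample data below are made up for illustration, and no Twitter calls are made; the lists are assumed to have been downloaded already.

```python
def follow_edges(friends_by_member):
    """Return directed (follower, followed) pairs within the group.

    friends_by_member maps each member's ID to the set of IDs that
    member follows (one friends-list download per member).
    """
    members = set(friends_by_member)
    return {
        (member, friend)
        for member, friends in friends_by_member.items()
        for friend in friends & members  # ignore people outside the group
        if friend != member
    }


def pairwise_requests(n):
    """Requests needed when every ordered pair is checked separately."""
    return n * (n - 1)


# Hypothetical sample data, not real accounts.
sample = {
    "alice": {"bob", "carol", "someone_else"},
    "bob": {"alice"},
    "carol": set(),
}
edges = follow_edges(sample)
print(sorted(edges))
print(len(sample), "list calls instead of", pairwise_requests(len(sample)))
```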
Whilst I welcome the new methods, they still do not address the fact that Twitter gives preferential treatment to commercial applications whilst lumping researchers together with hobbyists.
Nb. Obviously it is a coincidence, but my site did gain a bit of a stalker yesterday…
I have never particularly seen the point of Twitter; it’s the background noise I can do without. However, as an all-thing-web-2.0 researcher I recognise that there needs to be further investigation of how people are using it. Unfortunately Twitter doesn’t seem to think so; yesterday they turned down my request to be upgraded from the extremely limited 100 API requests per hour.
After visiting Birmingham Social Media Cafe last Friday, and noticing the prevalence of Twitter names, I thought it would be interesting to get an overview of the Birmingham Social Media Cafe Twitter network: Had clusters formed? Did these clusters reflect different interests?…the usual sort of academic questions. In no time I had collected the Twitter IDs of 50 members of BSMC, and written a program that would check which of the members were following one another (using Twitter’s ‘friendship exist’ method).
Unfortunately, to test every combination of names requires the sending of 50*49=2,450 requests. So even this extremely small scale study would require the program to run over 24hrs!! Last time I had collected data using Twitter’s API there seemed to be no such limits. Whilst Twitter do offer the opportunity to be placed on a ‘whitelist’ that allows you 20,000 requests per hour, “…we only approve developers for the whitelist”, and seemingly by their negative response they mean the distinctly commercial type of developer. As the explanation link suggests, I was turned down because, as a researcher, I should be asking for the second-class data-mining feed:
It returns 600 recent public statuses, cached for a minute at a time. You can request it up to once per minute to get a representative sample of the public statuses on Twitter
This is the service Twitter have decided is most appropriate for “researchers and hobbyists”, albeit one that would fail to provide the sort of network information that I am interested in. A distinctly second-class service in comparison to the one offered to commercial developers.
I can understand if online services such as Twitter don’t want to go out of their way to help academics, but it is rather disappointing that we are penalised for doing public research rather than chasing money. Whilst I will eventually be able to find a 24hr slot to run this particular program, it’s a shame that I won’t be able to run more large scale studies.
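The 24-hour figure is simple arithmetic, and a few lines of Python make it easy to try other group sizes. This is only a back-of-the-envelope sketch: it assumes one request per ordered pair of members and a program that uses its full hourly allowance, and the function name is my own.

```python
def hours_needed(members, requests_per_hour):
    """Return (total requests, hours) for checking every ordered pair."""
    requests = members * (members - 1)
    return requests, requests / requests_per_hour


# The standard limit of 100 requests per hour.
requests, hours = hours_needed(50, 100)
print(requests, "requests,", hours, "hours")

# The whitelist limit of 20,000 requests per hour, shown in minutes.
_, whitelist_hours = hours_needed(50, 20000)
print(round(whitelist_hours * 60, 2), "minutes on the whitelist")
```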
Application Programming Interfaces (APIs) are a brilliant way for researchers (as well as commercial developers) to use the data of the big web organisations in new and innovative ways in a controlled and ethical manner. Whilst there are usually limitations, we find ways of working within the boundaries we are set. What is annoying, however, is if you find that the service isn’t being particularly honest about the boundaries. This post’s wrath is aimed at Flickr’s API.
Whilst many API services will limit the number of results you can view, this is usually clearly set out in the documentation. For example, most search engines only allow you to view the first thousand results. Flickr however allows you to keep calling results, only to start sending back repeated pages of results for anything over 4,500. This can be clearly seen in the two pictures below from the Flickr API Explorer for flickr.photos.search. The first shows a partial screenshot of the results for the ninth page of 500 results for the tag ‘web’: The second shows a partial screenshot of the results for the tenth page of 500 results for the tag ‘web’: Basically the same results with a different page number.
I wouldn’t mind the restrictions if they were clear. Whilst it may be stated in the small print somewhere, which I still haven’t seen, why would you send the same data again and again and claim it as different pages of results? It is still possible to collect all the results by using some of the other arguments, e.g., min and max upload dates, it just means that I had to waste numerous hours collecting data again when the problem came to light. Flickr now owes me one Saturday.
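A minimal sketch of the two precautions, assuming you already have some way of fetching a page of photo IDs: stop as soon as a page contains nothing new, and split the collection into upload-date windows so that no single query runs into the repeated pages. Here fetch_page is a hypothetical stand-in for whatever makes the real request, and fake_fetch_page only imitates the behaviour described above on a small scale.

```python
from datetime import date, timedelta


def collect_pages(fetch_page, max_pages=100):
    """Collect IDs page by page; stop when a page adds nothing new.

    Returns (set of IDs, page number where repeats began or None).
    """
    seen = set()
    for page in range(1, max_pages + 1):
        ids = fetch_page(page)
        new_ids = [photo_id for photo_id in ids if photo_id not in seen]
        if not new_ids:
            return seen, page  # empty page, or one already seen
        seen.update(new_ids)
    return seen, None


def date_windows(start, end, days):
    """Yield (lower, upper) date pairs covering start..end in steps."""
    step = timedelta(days=days)
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper


def fake_fetch_page(page, per_page=5, last_real_page=9):
    """Imitation service: every page after the ninth repeats the ninth."""
    page = min(page, last_real_page)
    first = (page - 1) * per_page
    return list(range(first, first + per_page))


ids, stopped_at = collect_pages(fake_fetch_page)
print(len(ids), "unique IDs; repeats began at page", stopped_at)

windows = list(date_windows(date(2008, 1, 1), date(2008, 1, 10), 4))
print(windows)
```

Each window would then be passed to the search as the minimum and maximum upload dates, with collect_pages run once per window.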
This serves as a useful reminder to all web researchers: Make sure the API is giving you the data it is claiming to give you.
For some reason Google always saves its big releases for those days when I am busy. Could it be that they are fearful of my criticism? Or merely coincidence? Whatever the reason I couldn’t help but push other things to one side and comment on Google’s new SearchWiki. Basically, when you are logged into your Google account at google.com (not currently google.co.uk) you can change the results you find on your home page: promoting results, hiding results, commenting on results. Whilst it only affects your results page, you can see how other people have ranked/commented on items, and it seems highly likely that Google will eventually incorporate the findings in its general search results.
SearchWiki is by no means a new idea; sites such as Aftervote (now Scour) have done it all before. The difference this time is the number of people Google can put to work on the idea. At the time of writing this blog a search for ‘Google’ had already had 908 people make notes; it would probably have taken Aftervote weeks if not months to get that many comments on a single search term. So what is the collective wisdom regarding the best search result for the term ‘google’ entered into Google.com…that’ll be Google.com. Personally I would have thought that people are more likely to be searching for one of Google’s other services or information about Google rather than the page they are already on, but no one ever accused the public of being overly bright.
As someone who likes to do his bit for collective wisdom, I have made steps to clean up one of my most regular searches, ‘webometrics’: Just the three adjustments: promotion of the most important site, questioning the validity of a colleague’s page, and the removal of a character who has no right to call himself a webometrician. But I am sure everyone would agree that such amendments improve the page astronomically.
Whilst I am sure that sheer weight of numbers will prevent the spamming of the top searches, it will be interesting to monitor the spam on the fringes. Will people be looking at the notes others have made? I will. SearchWiki seems as though it will give great insight into what people think of different sites; I just hope Google adds it to their API.
UPDATE: Whilst I initially said it was only available on Google.com, it’s seemingly not as simple as that. When I log into Google.com with my webometrics account I get SearchWiki, when I log in with my gmail account I don’t get SearchWiki! It seems as though they are taking steps to restrict access geographically.
Python is a really simple programming language for the novice programmer. As such I held an afternoon’s “workshop” for a couple of PhD students in my front room: The aim of the workshop was to provide sufficient information about programming in Python so that at the end of the afternoon the user could:
- Install Python libraries
- Download information through various APIs
- Manipulate the downloaded information
As it was necessary to create an extensive slide show, covering everything from installing Python to getting data from the Yahoo API, I thought it may potentially be of interest to other novice users who don’t know where to start.
It doesn’t necessarily include the quickest or most efficient way of doing things, but it is simple and does the job.
If you have any questions about specific points, feel free to ask…the questions can’t be more stupid than the questions the PhD students asked…and some of the slides could probably benefit from further explanation.
Already there seems to be an embarrassment of riches: a neighbourhood statistics API from the Office for National Statistics; Transport information from Transport Direct; Health care services and information from the NHS…and the list just keeps going on. Despite messing about with APIs for a number of years, the quantity of data available means that I have no idea where to start. The good news is that the Power of Information Task Force are offering up to £20,000 to develop any ideas that you may have.
Whilst I have not had a chance to play about with any of the data yet, I do have one criticism: The use of a .co.uk domain name (i.e., www.showusabetterway.co.uk). As the Power of Information Task Force has a government email address (i.e., email@example.com), why didn’t they use a government domain name? Such domain names are restricted, and therefore provide an indication of legitimacy.
67% of Flickr members have no photos! Whilst Lotka’s law teaches us that the majority of contributors to a community make very few contributions, I was still surprised at the number of members with no photos; after all, I am not talking about visitors to the site, but those who have taken the trouble to join. What is the point of joining Flickr if you are not going to put photos on the site?
Data was collected about the number of photos for 324 randomly selected users. 216 had no photos, an additional 58 had fewer than 20, with only 50 having over 20: Really I should have a look at whether these missing users are active in other ways (e.g., members of groups, leavers of comments), but this was little more than an aside as I spend my time messing about with Python. I have now loaded Python on my main computer as well as my Eee PC, and can barely believe how easy it is!
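For anyone wanting to check the headline figure, the percentages fall straight out of the three counts. The dictionary keys are just my own labels for the three groups.

```python
# Counts from the sample of 324 randomly selected users.
counts = {"no photos": 216, "fewer than 20": 58, "over 20": 50}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {counts[label]} users ({share}%)")
```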
Already I have been writing code in Python that uses the Twitter, Flickr and Digg APIs, programs that can form the basis of numerous articles that I will never get around to writing…it’s SO easy (with the possible exception of installing the simplejson library that the Twitter library relies on). Just wish some other sites would roll out APIs (e.g., Stumbleupon and Reddit).
So, do we all need to become top-class programmers? No. But if you can program, even to a basic level, the web becomes a much more exciting and interactive place.
It is all too easy to forget about some of the alternative search engines out there, and I must admit that I can’t remember the last time I used Gigablast. It was therefore good to read on ResearchBuzz that Gigablast are now offering site search, which I have now added to the right-hand frame of my blog (too often people overlook the blog search in the blogger toolbar/banner).
Gigablast seems to have had a bit of a make-over since I last visited (when it looked something like THIS), and now it even has a very limited API. Personally I would like to see the API extended and a few advanced operators added; surely that’s an easy way of getting a competitive advantage over the other search engines.
Personally I hate the growth of Google search, and love any opportunity to support other search engines.