Application Programming Interfaces (APIs) are a brilliant way for researchers (as well as commercial developers) to use the data of the big web organisations in new and innovative ways in a controlled and ethical manner. Whilst there are usually limitations, we find ways of working within the boundaries we are set. What is annoying, however, is finding that a service isn’t being particularly honest about those boundaries. This post’s wrath is aimed at Flickr’s API.
Many API services limit the number of results you can view, but this is usually set out clearly in the documentation. For example, most search engines only allow you to view the first thousand results. Flickr, however, lets you keep requesting further pages, only to start sending back repeated pages of results for anything over 4,500. This can be seen clearly in the two pictures below from the Flickr API Explorer for flickr.photos.search. The first shows a partial screenshot of the results for the ninth page of 500 results for the tag ‘web’:
The second shows a partial screenshot of the results for the tenth page of 500 results for the tag ‘web’:
Basically the same results with a different page number.
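Had I run a check like the following from the start, the repetition would have shown up on the first pass. It is only a minimal sketch: `fetch_ids` is a hypothetical function standing in for whatever code calls the API and returns the photo IDs for a given page, and `fake_fetch_ids` simply imitates the behaviour described above rather than contacting Flickr.

```python
from typing import Callable, List, Optional, Set


def first_repeated_page(
    fetch_ids: Callable[[int], List[str]],
    max_pages: int,
) -> Optional[int]:
    """Return the first page whose IDs have all been seen on earlier pages.

    Returns None if no such page turns up within max_pages.
    """
    seen: Set[str] = set()
    for page in range(1, max_pages + 1):
        ids = fetch_ids(page)
        # A non-empty page made up entirely of known IDs is a repeat.
        if ids and all(photo_id in seen for photo_id in ids):
            return page
        seen.update(ids)
    return None


def fake_fetch_ids(page: int) -> List[str]:
    """Hypothetical stand-in for an API call: pages after the ninth repeat it."""
    effective_page = min(page, 9)
    return [f"photo-{effective_page}-{i}" for i in range(500)]


repeat_at = first_repeated_page(fake_fetch_ids, max_pages=12)
```

With the stand-in above, `repeat_at` comes out as 10, the tenth page being the first to contain nothing new.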
I wouldn’t mind the restrictions if they were clear. The limit may be stated in the small print somewhere, although I still haven’t seen it, but why would you send the same data again and again and claim it as different pages of results? It is still possible to collect all the results by using some of the other arguments, e.g., minimum and maximum upload dates; it just means that I had to waste numerous hours collecting data again when the problem came to light. Flickr now owes me one Saturday.
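For anyone facing the same problem, the upload-date workaround can be sketched roughly as follows: keep halving the date range until every window holds few enough results to page through safely. This is an illustration under assumptions rather than the exact code I used. `count_in_window` is a hypothetical function that would ask the API how many results fall between two Unix timestamps (I take the relevant flickr.photos.search arguments to be `min_upload_date` and `max_upload_date`), the cap of 4,500 is simply the figure observed above, and the windows are treated as half-open, which would need checking against how the API handles its boundaries.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple

Window = Tuple[int, int]  # (start, end) Unix timestamps, end exclusive


def split_windows(
    count_in_window: Callable[[int, int], int],
    start: int,
    end: int,
    cap: int = 4500,
) -> List[Window]:
    """Halve [start, end) until each window reports at most cap results."""
    total = count_in_window(start, end)
    # Stop when the window is small enough, or cannot be split any further.
    if total <= cap or end - start <= 1:
        return [(start, end)]
    mid = (start + end) // 2
    return (
        split_windows(count_in_window, start, mid, cap)
        + split_windows(count_in_window, mid, end, cap)
    )


# Hypothetical data: one upload per second for 10,000 seconds.
FAKE_UPLOADS = list(range(10_000))


def fake_count(start: int, end: int) -> int:
    """Stand-in for an API call: count fake uploads in [start, end)."""
    return bisect_left(FAKE_UPLOADS, end) - bisect_left(FAKE_UPLOADS, start)


windows = split_windows(fake_count, 0, 10_000)
```

Each resulting window can then be paged through separately, with the photo IDs merged into a set in case any results straddle a boundary.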
This serves as a useful reminder to all web researchers: Make sure the API is giving you the data it is claiming to give you.