Playing with OpenURL Router Data

Tony Hirst recently posted on experimenting with processing a reasonably large data set with *NIX command line tools. The data set is the recently published OpenURL Router Data. Inspired by this post I wondered what I could hack up in Ruby to process the same data, and if I could do this processing without a database. The answer is that it is pretty simple to process…

First, what is the OpenURL Router, and what is its data? What we need to know here is that the Router effectively enables library and other services to find the URLs for online bibliographic resources (more detail). As a simplification, the Router supplies a translation from bibliographic data to the URL in question. The OpenURL Router is funded by JISC and administered by EDINA in association with UKOLN.

Suitably anonymised OpenURL Router Data has been published by a JISC-funded University of Edinburgh/EDINA project, the Using OpenURL Activity Data Project. This project is participating in JISC’s Activity Data Programme, where Hedtek is collaborating in the synthesis of the outputs of the projects participating in this programme. Thus my interest in the data and what can be done with it.

My initial interest was in who has what proportion of referrals. Tony computed this, and I wanted to replicate his results. In the end I had a slightly different set of results.

Downloading and decompressing this CSV data was pretty easy, as was homing in on the one field of interest, the source of the data being referred to. Tony’s post and the OpenURL Router Data documentation both pointed to the 40th field in each line of this CSV formatted data.

My first attempts used a Ruby gem, CSV, from the Ruby interpreter irb. This went well enough, but I soon discovered that CSV wouldn’t handle fields with a double quote in them. Resorting to my OS X command line

  tr \" \'   < L2_2011.csv   > nice.csv

soon sorted that out.
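The same quote swap and field lookup can also be done per line inside Ruby. What follows is only a minimal sketch, not the script used for this post: `referrer_field` is a hypothetical helper name, the sample line is made up, and the comma separator is an assumption (pass `col_sep: "\t"` if the file turns out to be tab-separated).

```ruby
require "csv"

# Hypothetical helper: return one field of a delimited line, by default the
# 40th (index 39). Double quotes are swapped for single quotes first, as in
# the tr command, so stray quotes cannot upset the CSV parser.
# col_sep is an assumption about the file's delimiter.
def referrer_field(line, index: 39, col_sep: ",")
  fields = CSV.parse_line(line.tr('"', "'"), col_sep: col_sep)
  fields && fields[index]
end

# Made-up line: 39 filler fields, one with a stray double quote, then the source.
filler = (1..39).map { |i| "f#{i}" }
filler[4] = 'a 5" disk'
line = (filler + ["EBSCO:CINAHL with Full Text"]).join(",")
puts referrer_field(line)
```

For a whole file, the helper could be called from inside `File.foreach`, which reads one line at a time and so keeps memory use flat on a large data set.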

It soon emerged that I needed to write a method, so I flipped to the excellent RubyMine, and soon hacked up a little script. Interestingly, I found that the representation of the site with the requested resource often had a major component and a minor component, separated by a colon, thus

 EBSCO:CINAHL with Full Text

Having been excited by previous mention of Mendeley by Tony and wanting to find out the percentage of references to Mendeley’s data for another piece of work I am doing, I stripped out the minor component, and came up with the following code
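The tally described above can be sketched roughly as below. This is an illustration of the idea rather than the original listing: the method names `major_component` and `tally_sources` are hypothetical, and apart from the EBSCO example the sample values are made up.

```ruby
# Keep only the major component: the text before the first colon.
def major_component(source)
  source.to_s.split(":", 2).first.to_s.strip
end

# Count how many times each major component occurs.
# Hash.new(0) gives every unseen key a starting count of zero.
def tally_sources(sources)
  counts = Hash.new(0)
  sources.each { |source| counts[major_component(source)] += 1 }
  counts
end

# Made-up sample input; the real values come from the 40th field of each line.
sample = ["EBSCO:CINAHL with Full Text", "EBSCO:Some Other Database", "OVID"]
tally_sources(sample).each { |name, count| puts "#{count}\t#{name}" }
```

Printing the count first, followed by a tab and the name, is what lets a numeric command line sort order the output.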

While it’s open to a good refactoring, it did the job well enough, producing an unsorted list of results. A quick refactor resulted in the following, which also coalesced both and into one result.

To sort the output I used a command line sort after the script invocation

 ruby totals.rb | sort -nr

and obtained the following, here only listing those sites with more than 1000 references

44870 	EBSCO
34186 	undefined
27545 	OVID
9446 	Elsevier
6938 	CSA
6180 	EI
4353 	Ovid
2558 	jstor
2070 	Dialog
1034 	Refworks

The rest, working out percentages, is easy thanks to Excel, see the middle column
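The percentage step could equally be done in Ruby. A minimal sketch, assuming the counts are held in a hash of name to count; `percentages` is a hypothetical helper, and the total must include every source, including those with fewer than 1000 references, for the shares to match a spreadsheet built from the full output.

```ruby
# Percentage share of each source, largest first, rounded to one decimal place.
def percentages(counts)
  total = counts.values.sum.to_f
  return {} if total.zero?

  counts.sort_by { |_name, count| -count }
        .map { |name, count| [name, (100.0 * count / total).round(1)] }
        .to_h
end

# Two of the counts listed above, used only to show the call. With only these
# two rows in the total, the shares will not match those for the full data set.
percentages("EBSCO" => 44870, "OVID" => 27545).each do |name, share|
  puts "#{share}%\t#{name}"
end
```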













  1. Posted June 8, 2011 at 9:26 am | Permalink

    Great comparative technique:-)
    By the by, you said: “In the end I had a slightly different set of results.”
    I took this to mean you got different summary data (which would be a Bad Thing) but the numbers look the same to me? Or do you mean you took a slightly different route to the solution (to use the Perl mantra, I guess there’s always more than one way to do it 😉)

  2. Posted June 8, 2011 at 10:16 am | Permalink

    @Tony We do get the same results! That’s satisfying 🙂
    Just shows how it is a good idea to cross check results in the morning

One Trackback

  • […] ranged from those aimed squarely at folks with technical expertise, such as Mark van Harmelen’s ‘Command Line Ruby Database-free Processor’ through to applications aimed at end users with no technical or specialist knowledge, such as Alex […]
