Friday 30 July 2010

Test page for the PDF metadata extraction pipeline

This week we released the test web page for the PDF metadata extraction pipeline. We haven't yet advertised it widely (this blog entry can be regarded as the first public announcement!). The accuracy is similar to what we announced at the beginning of the month at a meeting at CARET, but at that time we were unaware of several issues that were potential sources of trouble.

The most important one related to validation against the CiteXplore dataset. We had been validating against the whole content of CiteXplore, but this was a potential source of errors, as only two of its six subsets can be regarded as "ground truth" sources: the PubMed and Agricola subsets.
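In code terms the fix amounts to a simple filter on each candidate record's source subset. A minimal sketch in Python follows; the record structure and the "subset" field name are assumptions, not the pipeline's actual internals:

    # Minimal sketch: keep only CiteXplore records from the ground-truth
    # subsets. The "subset" field name is hypothetical.
    GROUND_TRUTH_SUBSETS = {"PubMed", "Agricola"}

    def ground_truth_only(records):
        """Yield only records whose source subset is a ground-truth one."""
        for record in records:
            if record.get("subset") in GROUND_TRUTH_SUBSETS:
                yield record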

Once we eliminated the other four subsets, the number of accurate matches in the test set of 300 PDFs dropped to 44%, which was embarrassing and triggered a frantic search for acceptable alternatives. The fastest solution was to introduce a new source of ground-truth metadata. After contemplating two sources related to mathematics and statistics (Zentralblatt, AMS), we settled on DBLP, which covers computer science and engineering. The problem was that the DBLP mirrors we tried at first were either too slow or unresponsive, and others lacked advanced search capability. Fortunately, we stumbled upon the faceted DBLP site, which fit the bill both in terms of speed and advanced search functionality.
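For illustration only, here is roughly what a title lookup against a DBLP search service looks like. The faceted DBLP site we settled on is not shown here; as a stand-in, the sketch uses dblp.org's public publication search API, which accepts a q query parameter and format=json:

    import json
    import urllib.parse
    import urllib.request

    def dblp_lookup(title):
        """Return the JSON search hits for a title query against dblp.org
        (a stand-in endpoint, not the faceted DBLP site used here)."""
        url = ("https://dblp.org/search/publ/api?format=json&q="
               + urllib.parse.quote(title))
        with urllib.request.urlopen(url) as response:
            return json.load(response)["result"]["hits"]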

Another change was to filter out words that contained special characters and were not found in a standard UNIX English dictionary, and to reformulate the validation queries with the title as a bag of words rather than a phrase. This helped deal with cases where artefacts from the PDF conversion were making validation impossible.
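A rough sketch of that filter and query reformulation, assuming the usual /usr/share/dict/words location for the UNIX word list (the tokenisation details here are our own simplification):

    import re

    # Load a standard UNIX English word list (path varies by system).
    with open("/usr/share/dict/words") as f:
        DICTIONARY = {line.strip().lower() for line in f}

    def title_bag_of_words(title):
        """Drop tokens that contain special characters and are not
        dictionary words, then return the survivors as an unordered
        query rather than an exact phrase."""
        kept = []
        for token in title.split():
            word = token.strip(".,:;!?()\"'")  # shed surrounding punctuation
            if not word:
                continue
            if re.fullmatch(r"[A-Za-z-]+", word) or word.lower() in DICTIONARY:
                kept.append(word)
        return " ".join(kept)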

A third change introduced a heuristic based on the likely publication year. This helped resolve some of the cases where validation returned multiple matches.
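A minimal sketch of that kind of tie-breaking; the function name, record shape and tolerance are hypothetical, not the pipeline's actual parameters:

    def pick_by_year(candidates, likely_year, tolerance=1):
        """Among multiple validation matches, prefer the candidate whose
        publication year is closest to the likely year; return None if
        no candidate falls within the tolerance."""
        in_range = [c for c in candidates
                    if c.get("year") is not None
                    and abs(c["year"] - likely_year) <= tolerance]
        if not in_range:
            return None
        return min(in_range, key=lambda c: abs(c["year"] - likely_year))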

These changes and the hacking they involved delayed us by a week, but once we were done we computed the stats and found that we are, again, successfully validating more than 50% of the metadata records (and even approaching 60% if PDFs requiring OCR are not counted). Phew!

Tuesday 27 July 2010

Interviews and biscuits

I've been out and about round the University chatting to computer officers and administrators who are responsible for maintaining staff profiles. It seems there are as many ways of updating these pages as there are departments. Some have annual updates of profiles, others update on request. Some provide a set of guidelines with advice on tone, profile length and a limit on the number of publications one can list, while other departments leave it to individual choice.
It's hard to draw the strands together and come up with a unified idea of what a profile is and what it should contain. Most likely, the variety in the production and content of staff profiles reflects the myriad departments and sub-departments of the University, the influence of early adopters of the web, the personal preferences of individuals and the gradual evolution of a process to suit each group.

Thanks to everyone who donated their time to talk with me, and I hope you liked your biscuits!