Friday 30 July 2010

Test page for the PDF metadata extraction pipeline

This week we released the test web page for the PDF metadata extraction pipeline. We haven't advertised it widely yet (this blog entry can be regarded as the first public announcement!). The accuracy is similar to the figure we announced at the beginning of the month at a meeting at CARET, but at that time we were not aware of some issues that were potential sources of trouble.

The most important one was related to the validation against the CiteXplore dataset. We were validating against the whole content of CiteXplore, but this was a potential source of errors, as only two of its six subsets can be regarded as "ground truth" sources: the PubMed and the Agricola subsets.

Once we eliminated the other four subsets, the number of accurate matches in the test set of 300 PDFs dropped to 44%, which was embarrassing and triggered a frantic search for acceptable alternatives. The fastest solution was to introduce a new source of ground truth metadata. After contemplating two sources related to mathematics and statistics (Zentralblatt, AMS), we settled on DBLP, which covers computer science and engineering. The problem was that the DBLP mirrors we tried at first were either too slow or unresponsive, and others did not have an advanced search capability. Fortunately, we stumbled upon the faceted DBLP site, which fit the bill in terms of both speed and advanced search functionality.

Another change we implemented was to filter out words that contained special characters and were not found in a standard UNIX English dictionary, and to reformulate the validation queries with the title as a bag of words rather than as a phrase. This helped deal with cases where artefacts from the PDF conversion were making the validation impossible.
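To make the filtering step concrete, here is a minimal sketch of the idea in Python. It is not the pipeline's actual code: the function names, the definition of a "special character" (anything other than ASCII letters) and the dictionary path `/usr/share/dict/words` are all assumptions made for illustration.

```python
import re

# Assumption: a word is "plain" if it consists only of ASCII letters.
# The post does not say how special characters were defined.
_PLAIN_WORD = re.compile(r"^[A-Za-z]+$")


def load_dictionary(path="/usr/share/dict/words"):
    """Load a word list (one word per line) into a lower-cased set.

    The default path is the usual location on many UNIX systems, but it
    is an assumption and may differ on yours.
    """
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_title_words(title, dictionary):
    """Drop words that contain special characters AND are not in the dictionary."""
    kept = []
    for word in title.split():
        has_special = _PLAIN_WORD.match(word) is None
        in_dictionary = word.lower() in dictionary
        if has_special and not in_dictionary:
            continue  # likely a PDF conversion artefact
        kept.append(word)
    return kept


def bag_of_words_query(title, dictionary):
    """Build an unquoted query so the words are matched independently,
    rather than as an exact phrase."""
    return " ".join(filter_title_words(title, dictionary))
```

With a toy dictionary, `bag_of_words_query("Metadata ex#traction from PDF", {"metadata", "from", "pdf"})` drops the damaged word and keeps the rest. Note that in this sketch a word with trailing punctuation (such as "PDF:") is also dropped unless it appears in the dictionary in that form; a real implementation would probably strip punctuation first.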

A third change was related to using a heuristic involving the likely publication year. This helped to resolve some of the cases of multiple validation matches.
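The post does not spell out the heuristic, so the following is only one plausible reading of it, sketched in Python: when a validation query returns several candidate records, keep the one whose publication year is close to the year we believe the PDF was published. The record layout (dicts with a `year` key), the function name and the one-year tolerance are all assumptions.

```python
def resolve_by_year(candidates, likely_year, tolerance=1):
    """Return the single candidate whose year is within `tolerance` of
    `likely_year`, or None if zero or several candidates qualify.

    `candidates` is assumed to be a list of dicts with an integer "year"
    key; records without a year are ignored.
    """
    close = [
        record
        for record in candidates
        if record.get("year") is not None
        and abs(record["year"] - likely_year) <= tolerance
    ]
    return close[0] if len(close) == 1 else None
```

Returning `None` when more than one candidate survives keeps the ambiguous cases unresolved instead of guessing, which matches the remark that the heuristic resolved only some of the multiple-match cases.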

These changes, and the hacking they involved, delayed us by a week, but once we were done we computed the stats and found that we are, again, successfully validating more than 50% of the metadata records (and even approaching 60% if PDFs requiring OCR are not counted). Phew!
