Tuesday 28 September 2010

Full review of publication metadata extraction techniques

We have completed a review of existing methods for metadata extraction from publications, specifically PDFs. These techniques allow the title, authors, and other metadata to be derived automatically from an academic publication in PDF form. This document should prove a useful reference for others looking into this area.
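To give a flavour of the sort of heuristic these tools rely on, here is a toy sketch (not taken from the report) of one common approach: treat the largest-font text on the first page as the title. The parsed-line representation is an assumption made purely for illustration.

    def guess_title(first_page_lines):
        """first_page_lines: list of (text, font_size) tuples for the first page."""
        # Take the line set in the largest type as the title.
        text, _ = max(first_page_lines, key=lambda line: line[1])
        return text.strip()

    lines = [("Journal of Examples, Vol. 3", 9.0),
             ("A Study of Metadata Extraction", 18.0),
             ("A. Author and B. Author", 11.0)]
    print(guess_title(lines))  # -> "A Study of Metadata Extraction"

Real tools combine many such cues (position, fonts, lexical patterns), which is exactly the variety the report surveys.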

You can view our report online here; this builds upon Dr. Amyas Phillip's earlier work [PDF] in this area.

Many thanks to Richard Easty for the bulk of the research in this area, and to Verity Allan for refining the final text.

Recommendations from Science Online London 2010

At Science Online London this year there was a great session on Discovery, given by Kevin Emamy of CiteULike and Jason Hoyt of Mendeley. Discovery is finding something you needed; it is not quite the same as Search, although you can discover something whilst searching or browsing.


From CiteULike, we heard about ways of creating recommendations. You can start from a person, an article, or a tag, and recommend a person, an article, or a tag.

Since CiteULike have many users, they’ve been able to test their recommender system in anger and in volume. They can show users recommendations and let the user click accept or reject, so success can be measured.

The first tests were about recommending an article, given a person. So the system would look at who you are and the papers you already have, pick another user with similar papers, and recommend one of their papers that you don’t yet have. This method achieved about 14.6% acceptance over a large dataset.
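As a rough sketch of the approach described (my reconstruction, not CiteULike's actual code), "pick another user with similar papers" could be as simple as the following:

    def recommend_simple(my_papers, all_libraries):
        """my_papers: set of paper ids; all_libraries: dict mapping user -> set of paper ids."""
        # Find the user whose library overlaps most with mine.
        best_user = max(all_libraries, key=lambda u: len(my_papers & all_libraries[u]))
        # Recommend a paper they have that I don't.
        candidates = all_libraries[best_user] - my_papers
        return next(iter(candidates), None)

    libraries = {
        "alice": {"p1", "p2", "p3", "p7"},
        "bob":   {"p2", "p4"},
    }
    print(recommend_simple({"p1", "p2", "p3"}, libraries))  # -> "p7"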

Next, CiteULike tried user-based collaborative filtering. This added an adjustment for relative library sizes, whilst keeping the idea of selecting another user who has papers similar to yours. This achieved 28% acceptance (on a slightly smaller dataset).
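One plausible way to "adjust for relative library sizes" is cosine-style similarity over the paper sets, summed across all overlapping users. The sketch below is an assumption about the general technique, not CiteULike's actual algorithm:

    import math

    def recommend_cf(my_papers, all_libraries, top_n=3):
        """User-based collaborative filtering with a size-adjusted (cosine) similarity."""
        scores = {}
        for user, lib in all_libraries.items():
            overlap = len(my_papers & lib)
            if overlap == 0:
                continue
            # Dividing by sqrt(|mine| * |theirs|) stops very large libraries dominating.
            similarity = overlap / math.sqrt(len(my_papers) * len(lib))
            for paper in lib - my_papers:
                scores[paper] = scores.get(paper, 0.0) + similarity
        # Highest-scoring unseen papers first.
        return sorted(scores, key=scores.get, reverse=True)[:top_n]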

It is notable that the top two accepted recommendations happen to be amongst the most popular papers overall - “How to choose a good problem” and “How to write boring scientific literature”, if my notes are correct... Further down the list, though, the recommendations look better and more useful.

New algorithms are in the works, as is the possibility of offering users a choice of systems, as opposed to the blind tests reported on above. None of these systems looks at the text itself, just the sets of papers for each user; and the recommenders don’t consider all the papers in the system, just the ones with a DOI included, which is in fact a tiny proportion of the overall set.


Jason Hoyt from Mendeley then talked about idealism in discovery.

He defined “agnostic” algorithms as those which rank only on the content of the papers. We should note that search engines are not agnostic, because Google considers other factors to identify high-ranking search results; neither is Mendeley’s discovery tool (which incorporates ranking based on content plus readership information).

PubMed, Mendeley and Google Scholar use different algorithms, but all include subjective measures of freshness, impact and so on - it’s editorial control at the algorithm level.  So why not exploit that editorial control to increase the ratings of open access and open data papers?
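As a toy illustration of editorial control at the algorithm level (entirely invented, not any of these services' real scoring), a ranking function might blend relevance with freshness, impact, and an explicit open-access boost:

    from datetime import date

    def editorial_score(paper, relevance, w_fresh=0.2, w_impact=0.3, w_open=0.15):
        """Blend content relevance with 'editorial' signals, including an open-access boost.
        paper: dict with 'published' (date), 'citations' (int) and 'open_access' (bool).
        The weights and fields are invented for illustration."""
        age_years = (date.today() - paper["published"]).days / 365.0
        freshness = 1.0 / (1.0 + age_years)            # newer papers score higher
        impact = min(paper["citations"] / 100.0, 1.0)  # crude, capped impact proxy
        openness = 1.0 if paper["open_access"] else 0.0
        return relevance + w_fresh * freshness + w_impact * impact + w_open * openness

Tuning w_open upwards is exactly the kind of deliberate editorial choice Hoyt was suggesting.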


The session then opened into a discussion of risks and concerns from the audience around recommenders and related discovery systems, in search and reference management.

Academics wondered which services would still exist in 5 years or more - without long-term availability, they are less likely to adopt new tools.

There were concerns around spam, and an increase in annoying interruptions and information overload.

If the output rankings or recommendations meant anything significant, such as likelihood of citation, and the method was at all transparent, then there would be a motivation for people to try to game the system. This would mean that diligent academics who published good work might not be rewarded as much as those who instead invested time and effort in improving their “findability”. Notably, total transparency of the algorithm may not be required for this to be an issue; in other sessions, the question was raised of whether academics should be expected to undertake search engine optimisation of their output, and what support they might need for this. Academic SEO applies just as much today, in the era of Google Scholar, as it will in a future world where more papers are found through other means.

The audience touched briefly on the fascinating question of the desired level of openness of personal reference collections. Mendeley offer private collections by default, but you can make them public if you wish; CiteULike take the opposite approach.

There was also mention of how Google Scholar and similar tools can allow a group of scholars or a school of thought to dominate results in some disciplines, because they create a collection of heavily interlinked papers, whereas authors who do not crosslink so heavily can be pushed down the rankings and become hard to find. The idea of limiting the weight of impact factor in ranking algorithms, to allow “hidden gem” papers to be brought out, was suggested; the highest-impact papers are ones a scholar is likely to have already come across anyway, and valuable serendipitous discovery is perhaps more likely to involve more minor works. An alternative of adding some randomness to the algorithms or to the result presentation also seemed popular.
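As a hedged sketch of those two suggestions (the data layout and weights are invented for illustration, not anything proposed in the session), a re-ranking step might damp the impact term and add a little jitter:

    import random

    def rerank(results, impact_weight=0.2, jitter=0.1, seed=None):
        """results: list of (relevance, impact, paper_id) triples, each component in [0, 1].
        Down-weight the impact term and add a small random jitter so lesser-known
        papers occasionally surface."""
        rng = random.Random(seed)
        scored = [(rel + impact_weight * imp + rng.uniform(0.0, jitter), pid)
                  for rel, imp, pid in results]
        return [pid for _, pid in sorted(scored, reverse=True)]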

Overall, though, one key point remained: metadata sources are “bloody useless”, and all the algorithms in the world - even if flawless themselves - will be hampered until the metadata gets better.