Monday 13 December 2010

The scholarly discovery curve

We've been discussing recommenders and discovery and serendipity in the scholarly research process of late, and have started to put together some partially-formed ideas. One such idea is described below - please let us know your thoughts in the comments!  This is not a formal view from the project, just some - possibly provocative - ideas we're knocking around.

Information overload

Most of the time, researchers are struggling with information overload, combined with heavy workloads and time pressure. A key factor here is email - too many messages coming in, demanding responses, consideration of papers, reviews, meetings, and so on. Scholars working with "Web2.0" tools may also have RSS feeds, tweets, scholarly networking feeds, and more, adding to their daily information burden. Some of this information is valuable, some less so, and some cannot be evaluated without absorbing a great deal of time.

It is hard to imagine that yet more information would be welcomed at this point, unless it was of high quality and could be seen to be of high quality without investigation; for example, a paper suggested by a renowned scholar who understands one's field of work is likely to be interesting, and reading it fruitful. However, a paper suggested by a less trustworthy source may turn out to be poor quality or irrelevant, or to add nothing to the existing discourse, and so adds to the feeling of overload.

A researcher will be keeping an eye on new and emerging research in his field, perhaps by watching newly issued journals, attending conferences, or following RSS feeds and newsletters. Only new material is of interest, as a rule (we assume that our researcher has been working in his field for some time, and therefore has a good knowledge of what is relevant from previously published material).


This changes when a scholar is mugging up on a new subject, or setting out to explore a particular angle of investigation. At this point, new information is sorely needed! Let us consider the phases of research here...

The three phases of researching a new field

First of all, the researcher knows nothing (or hardly anything) about the area he is starting to study. Assuming he begins by searching a catalogue or the web, he can try some keywords and will get many results back. Each search gives the researcher many new papers he has not yet seen; because the field is new to him, almost anything is somewhat useful, as a source of new references from the bibliography, or for some new facts (for almost all facts are novel at this stage). The list of papers to review grows rapidly. There is no shortage of new information or new papers.

Next, the scholar reduces the effort put into finding more papers, and concentrates on parsing and organising the information he has to hand. New papers might come to light, but until he has had time to review what he has already found, it is hard to tell whether these are useful. So there is a levelling off in the growth of the list of papers yet to be read.

Finally, the scholar has read many papers in the field, and has identified what seems important and what less so. Now is a time of filling in the gaps - trying to see if anyone has published on one very specific topic, making sure that all the papers cited by a major review paper have been looked at, and so on. Now, most of the papers that the scholar comes across have already been read or seen; most of the new papers that might be found aren't relevant to the specific topic of investigation and can be dismissed. The scholar is fussier: only publications which add to the body of knowledge built up in the previous phase of research are useful - things which fill in the gaps, or which reassure the researcher that there aren't any gaps and that the search has "bottomed out".

Recommenders

Now, in ConnectedWorks we've been thinking about recommenders. Recommenders in the field of scholarly work are still somewhat new and experimental, but most efforts focus on suggesting, in some way, a few more papers based on the papers already read by the researcher. This might mean suggesting other papers written by the author of a paper one has already read, or papers whose keywords match those of a paper already read, or papers read by scholars who also read a paper one has read, and so on.
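To make the last of these templates concrete ("scholars who read X also read Y"), here is a minimal sketch of a co-occurrence recommender over users' libraries. This is our own toy illustration, not the algorithm of any particular service; the user names and paper identifiers are invented.

    from collections import Counter

    # Each user's library is a set of paper identifiers (e.g. DOIs) - toy data.
    libraries = {
        "alice": {"doi:10.1/a", "doi:10.1/b", "doi:10.1/c"},
        "bob":   {"doi:10.1/b", "doi:10.1/c", "doi:10.1/d"},
        "carol": {"doi:10.1/c", "doi:10.1/e"},
    }

    def co_read_recommendations(user, libraries, top_n=5):
        """Recommend papers held by users who share at least one paper with `user`."""
        own = libraries[user]
        votes = Counter()
        for other, papers in libraries.items():
            if other == user or not (own & papers):
                continue  # skip the user themselves, and users with no overlap
            for paper in papers - own:
                votes[paper] += 1  # each overlapping co-reader counts as one vote
        return [paper for paper, _ in votes.most_common(top_n)]

    print(co_read_recommendations("alice", libraries))
    # -> ['doi:10.1/d', 'doi:10.1/e']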

Let's consider how effective this might be for the phases of research we just discussed - starting with the time when the scholar is not exploring a new field, but just doing day-to-day work. During this phase, new papers are of interest, but high-reputation sources are prized over those which might just add to information overload without adding research value. A quality recommendation from a trusted recommender system would be valued, but can any existing system deliver this? Also, how well do current systems do at recommending new papers, as opposed to old ones an experienced researcher has probably already seen?

Then, we have a time of intense study into a new area, where many papers are found through every search or in every bibliography and are added to the list to read. It is hard to imagine that a recommender could be more effective than a keyword search here - and a small number of additional papers brings little additional value to the process. The researcher's basic search techniques are very effective and need little supplement.

After this, the researcher is reading and digesting papers, and occasionally coming across a new one which he hasn't read before. Unless the recommender knew exactly what was being read, and was entirely up to date, it would most likely be suggesting papers which were "already on the pile", and this would be more of an annoyance than a gain. As such, a recommender built into a reference management tool which also tracked reading might well be helpful - although this might become less effective if the researcher's colleagues hand him printouts of work they have found useful!
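To illustrate the "already on the pile" point: at a minimum, such a recommender would need to filter its candidate suggestions against whatever record of the library it has, perhaps keyed on DOI. A trivial sketch (the identifiers are invented); note that papers handed over as printouts never make it into the tracked library, which is exactly the failure mode above.

    def filter_unseen(candidates, library_dois):
        """Drop any recommendation whose DOI is already in the tracked library."""
        return [c for c in candidates if c["doi"] not in library_dois]

    library_dois = {"10.1000/182", "10.1000/183"}  # what the tool knows we have
    candidates = [
        {"doi": "10.1000/182", "title": "Already on the pile"},
        {"doi": "10.1000/999", "title": "Genuinely new"},
    ]
    print(filter_unseen(candidates, library_dois))
    # -> only the "Genuinely new" paper remains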

Finally, the scholar is filling in the gaps. Here, again, a useful recommendation system would need to consider exactly what had been read so far - repeat papers are a nuisance at this phase. The researcher's field of interest is now very narrow indeed; things outside the scope of the study, and even outside the scope of the gap-filling, are not of interest. It would be a very high quality recommender indeed which could deliver papers to these exacting standards.

Tentative conclusions

As such, it seems to us that many recommenders fitting the common template of suggesting papers based on papers already read may not provide much scholarly benefit for researchers already established in their main field of study.

What do you think? Please let us know in the comments. 

(Many thanks to Dan Sheppard from the JISC Library Widgets project for cross-fertilization of ideas in this area!)

Friday 15 October 2010

Scholarly Networking outside the institution

Most of ConnectedWorks has been considering how scholarly networking applies within a university. In particular, how researchers create and maintain their online profiles, and how they find and connect to each other.

But scholars work with each other within disciplines across institutions too - these ties may be as meaningful as, or more meaningful than, the connections within a university. We can imagine a vision (which is nearly here) where a researcher is a member of his university's scholarly network, so he can connect to other local scholars, and also a member of a discipline-oriented scholarly network, perhaps provided by his learned society. This means that the researcher can connect to others who work in his university, as well as to scholars nationally or internationally in his field. Nonetheless, this isn't a perfect setup: the researcher has to maintain two online profiles, log on to two separate systems, and so on. For an early career researcher, who might be changing institutions every couple of years (or more frequently) through Masters and PhD degrees and short-term postdoc contracts, the network with scholars worldwide may be the more important of the two.

So ConnectedWorks has been looking into the prospects for scholarly networks organised by discipline, potentially by learned or professional societies. We've been lucky enough to have the chance to learn from the American Academy of Religion, who have been looking into scholarly networking, and we're now publishing our research report [PDF] - Scholarly Networking in the Learned Society, by Helen Burchmore, Anne-Sophie de Baets, and Laura James.

Tuesday 5 October 2010

World of reference managers


We are preparing the next phase of project research work, which is investigating what reference managers are used here in Cambridge, and the attitudes of researchers and academics to various reference manager features or potential future features. 

Part of this research will be a survey, and we've been thinking about questions over the last couple of weeks. There's a lot of things we want to find out about, whilst keeping the questionnaire short enough that academics are happy to complete it.

One obvious question is: what reference managers do you use? We've got a list of ones we know about to put in as options. If you know of any more, please let us know in the comments...
  • Papers
  • EndNote
  • Mendeley
  • Zotero
  • qiqqa
  • iCite
  • JabRef
  • RefWorks 
  • Paperpile
  • BibTeX
  • Own computer-based system, such as an Excel spreadsheet
  • Own paper system
  • Other (please specify)
As well as finding out which of these people use (and it might well be more than one!), we have questions in other categories too - at least, in our draft questionnaire; we might need to slim it down. Again, if there are other things you think we should ask the good scholars of Cambridge, please post in the comments!
  1. About you - career stage, subject area, level of collaboration in your research
  2. Level of usage of "web2" tools and other technology
  3. Level of comfort and practice sharing aspects of research
  4. What reference managers you use
  5. What you use reference managers for (e.g. formatting citations, storing papers to read later, finding out what others are reading)
  6. Whether you use any of the "social" aspects of reference managers, such as sharing papers with a group
  7. How you choose reference managers (feature lists, price, provision by institution, use by peers, training or support availability, long term sustainability of the tool, etc)

Tuesday 28 September 2010

Full review of publication metadata extraction techniques

We have completed a review of existing methods for metadata extraction from publications, specifically PDFs. This sort of technique allows automated derivation of title, authors, and so on, from an academic publication in PDF form. This document should prove a useful reference for others looking into this area.
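For contrast with the analysis techniques surveyed in the report, the naive baseline is simply to read the PDF's embedded document-information dictionary - which for academic papers is very often empty or wrong, hence the need for the cleverer methods. A minimal sketch using the pypdf Python library (our choice of library here is purely illustrative, not one evaluated in the report):

    from pypdf import PdfReader

    reader = PdfReader("paper.pdf")  # the path is illustrative
    info = reader.metadata           # embedded document information; may be None
    if info and info.title:
        print("Title: ", info.title)
        print("Author:", info.author)
    else:
        # The common case for scholarly PDFs: no usable embedded metadata,
        # so fall back to analysing the document text itself.
        print("No embedded metadata found")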

You can view our report online here; this builds upon Dr. Amyas Phillip's earlier work [PDF] in this area.

Many thanks to Richard Easty for the bulk of the research in this area, and to Verity Allan for refining the final text.

Recommendations from Science Online London 2010

At Science Online London this year there was a great session on Discovery, by Kevin Emamy of CiteULike, and Jason Hoyt of Mendeley. Discovery is finding something you needed. This is not quite the same as Search, although you can discover something whilst searching or browsing.


From CiteULike, we heard about ways of creating recommendations.  You can start from a person or an article or a tag, and recommend a person, an article, or a tag.

Since CiteULike have many users, they've been able to test their recommender system in anger and in volume. They can show users recommendations and let the user click accept or reject, so success can be measured.

The first tests were about recommending an article, given a person. So, the system would look at who you are and the papers you already have, and would then pick another user with similar papers, and recommend one of the papers they have that you don’t yet have. This method provided about 14.6% acceptance over a large dataset.

Next, CiteULike tried user-based collaborative filtering. This added in the idea of adjustment for relative library sizes, whilst keeping the idea of selecting another user who has similar papers to you. This secured 28% acceptance (on a slightly smaller dataset).
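The session didn't give the exact similarity measure, so here is a minimal sketch of the idea under one common assumption: Jaccard similarity (overlap divided by union size), which naturally adjusts for relative library sizes, with each candidate paper scored by the summed similarity of the users who hold it. It reuses the same toy `libraries` mapping of user to set of paper identifiers as the sketch further up this page.

    def jaccard(a, b):
        """Overlap normalised by combined size, so owning a huge library
        doesn't make a user look similar to everyone."""
        return len(a & b) / len(a | b)

    def recommend(user, libraries, top_n=5):
        own = libraries[user]
        scores = {}
        for other, papers in libraries.items():
            if other == user:
                continue
            weight = jaccard(own, papers)  # more similar users get a bigger say
            for paper in papers - own:
                scores[paper] = scores.get(paper, 0.0) + weight
        return sorted(scores, key=scores.get, reverse=True)[:top_n]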

It is notable that the top two accepted recommendations happen to be amongst the most popular papers overall - “How to choose a good problem” and “How to write boring scientific literature”, if my notes are correct... Further down the list, though, the recommendations look better and more useful.

New algorithms are in the works, as is the possibility of offering users a choice of systems, as opposed to the blind tests reported on above. None of these systems looks at the text itself, just the sets of papers for each user; and the recommenders don't consider all the papers in the system, just the ones with a DOI included, which is in fact a tiny proportion of the overall set.


Jason Hoyt from Mendeley then talked about idealism in discovery.

He defined “agnostic” to mean algorithms which rank only on the content of the papers. We should note that search engines are not agnostic, because Google considers other factors to identify high-ranking search results; and neither is Mendeley’s discovery tool (which incorporates ranking based on content plus readership information).

PubMed, Mendeley and Google Scholar use different algorithms, but all include subjective measures of freshness, impact and so on - it’s editorial control at the algorithm level.  So why not exploit that editorial control to increase the ratings of open access and open data papers?
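As a toy illustration of what “editorial control at the algorithm level” might look like - the field names and boost factors here are entirely invented:

    def editorial_score(paper, content_relevance):
        """Rank by content relevance, then apply editorial boosts."""
        score = content_relevance
        if paper.get("open_access"):
            score *= 1.2   # editorial choice: promote open access papers
        if paper.get("open_data"):
            score *= 1.1   # ...and papers with open data
        return score

    print(editorial_score({"open_access": True}, content_relevance=0.5))  # -> 0.6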


The session then opened into a discussion of risks and concerns from the audience around recommenders and related discovery systems, in search and reference management.

Academics wondered which services would still exist in five years or more - without long-term availability, they are less likely to adopt new tools.

There were concerns around spam, and an increase in annoying interruptions and information overload.

If the output rankings or recommendations meant anything significant, such as likelihood of citation, and the method was at all transparent, then there would be a motivation for people to try to game the system. This would mean that diligent academics who published good work might not be rewarded as much as those who instead invested time and effort in improving their “findability”. Notably, total transparency of the algorithm may not be required for this to be an issue; in other sessions, the question of whether academics should be expected to undertake search engine optimisation of their output, and what support they might need for this, was raised. Academic SEO applies just as much today, in the era of Google Scholar, as it does in a future world where more papers are found through other means.

The audience touched briefly on the fascinating question of the desired level of openness of personal reference collections. Mendeley offer private collections by default, but you can open them to be public if you wish; CiteULike take the opposite approach.

There was also mention of how Google Scholar and similar tools can cause a group of scholars or a school of thought to dominate results in some disciplines, because they create a collection of heavily interlinked papers, whereas other authors who do not crosslink so heavily can be pushed down the rankings and become hard to find. The idea of limiting impact factor in ranking algorithms, to allow “hidden gem” papers to be brought out, was suggested; the highest-impact papers are ones that a scholar is likely to have already come across in some way, anyway, and valuable serendipitous discovery is perhaps more likely to be of more minor works. An alternative method of adding some randomness to the algorithms or result presentation also seemed popular.

Overall, though, one key point remained: metadata sources are “bloody useless”, and all the algorithms in the world - even if flawless themselves - will be hampered until the metadata gets better.

Friday 27 August 2010

Empowering Administrators

I have to admit that before I started the interviews with people who update profile pages, I had a few deep-rooted preconceptions. One was that most senior admin staff in the University are ladies in their 50s or 60s. The other was that they would be, at best, unwilling users of technology. The former turned out to be a reasonable assumption, but on the latter I was completely wrong!

These ladies are not only proficient in shorthand; it turns out that they are equally adept with html. The majority of University websites aren't using any kind of Content Management System, so when a new administrator arrives, one of their first tasks is to get up to speed on how to update the website - most do this by attending a course at the Computing Service. Those who have taken up the challenge are all happy to be given this task, and most seemed reluctant to hand over control of profiles to academics themselves. Some felt it was unfair to ask academics to spend time on something which the administrators could easily do themselves, and others felt that it would lead to half of the profiles being left empty or never updated.

On the subject of Content Management Systems and visual editors, opinion was divided. While some saw that a CMS would be beneficial for those who are daunted by the prospect of html, others felt it was easy to learn the tasks needed to update a simple html page. The most extreme view I came across was that a CMS was an illusion: that they "throw a mask over the technology" and give people the impression that they can edit the web without knowing what's underneath.

It will be interesting to see this issue from the perspective of profile holders, and how they feel about current editing and updating processes. It's clear from the admin side that this isn't seen as a burdensome task, with few spending more than a few hours a week updating websites. The current system in most departments also gives administrators the opportunity to learn and practise new skills, which they value highly.

Friday 30 July 2010

Testpage for the PDF metadata extraction pipeline

This week we released the test web page for the PDF metadata extraction pipeline. We haven't yet advertised it widely (this blog entry can be regarded as the first public announcement!). The accuracy is similar to what we announced at the beginning of the month at a meeting at CARET, but at that time we were not aware of some issues that were potential sources of trouble.

The most important one related to validation against the CiteXplore dataset. We were validating against the whole content of CiteXplore, but this was a potential source of errors, as only two of the six subsets can be regarded as "ground truth" sources - the PubMed and Agricola subsets.

Once we eliminated the other four subsets, the number of accurate matches in the test set of 300 PDFs dropped to 44%, which was embarrassing and triggered a frantic search for acceptable alternatives. The fastest solution was to introduce a new source of ground-truth metadata. After contemplating two sources related to mathematics and statistics (Zentralblatt, AMS), we settled on DBLP, which covers computer science and engineering. The problem was that the DBLP mirrors we tried at first were either too slow or unresponsive, and others did not have advanced search capability. Fortunately, we stumbled upon the faceted DBLP site, which fitted the bill both in terms of speed and of advanced search functionality.

Another change we implemented was to filter out words that contained special characters and were not contained in a standard UNIX English dictionary, and to reformulate the validation queries with the title as a bag of words rather than a phrase. This helped deal with cases where artefacts from the PDF conversion were making validation impossible.
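A rough sketch of that cleaning step, under our assumptions that the dictionary lives at the usual /usr/share/dict/words location and that a word survives only if it is purely alphabetic and found in the dictionary (one strict reading of the rule above):

    import re

    # Load the standard UNIX English word list (location varies by system).
    with open("/usr/share/dict/words") as f:
        dictionary = {line.strip().lower() for line in f}

    def title_to_query_words(title):
        """Reduce a possibly garbled title to a bag of dictionary words,
        discarding PDF-conversion artefacts (broken ligatures and the like)."""
        words = re.findall(r"[A-Za-z]+", title)
        return [w for w in words if w.lower() in dictionary]

    # A ligature lost in conversion turns "effectiveness" into fragments;
    # the dictionary check drops the junk, leaving a usable bag-of-words query.
    print(title_to_query_words("On the e ectiveness of metadata extraction"))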

A third change was to use a heuristic involving the likely publication year, which helped to resolve some cases of multiple validation matches.
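A sketch of one such heuristic - when validation returns several matching records, prefer the one whose publication year is closest to a year guessed from the PDF (the record fields here are invented):

    def pick_by_year(matches, likely_year):
        """Resolve multiple validation matches by closeness of publication year."""
        return min(matches, key=lambda m: abs(m["year"] - likely_year))

    matches = [{"source": "DBLP", "year": 1998},
               {"source": "DBLP", "year": 2009}]
    print(pick_by_year(matches, likely_year=2008))  # -> the 2009 record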

These changes and the hacking they involved delayed us by a week but once we were done we computed the stats and found that we are, again, successfully validating more than 50% of the metadata records (and even approaching 60% if cases of PDFs requiring OCR are not counted). Phew!

Tuesday 27 July 2010

interviews and biscuits


I've been out and about round the University, chatting to computer officers and administrators who are responsible for maintaining staff profiles. It seems there are as many ways of updating these pages as there are departments. Some have annual updates of profiles; others update on request. Some provide a set of guidelines with advice on tone, length of profiles, and a limit to the number of publications one can list, while other departments leave it to individual choice.

It's hard to draw the strands together and come up with a unified idea of what a profile is and what it should contain. It seems most likely that the variety in the production and content of staff profiles is reflective of the myriad departments and subdepartments of the University, the influence of early adopters of the web, the personal preferences of individuals and the gradual evolution of a process to suit each group.

Thanks to everyone who donated their time to talk with me, and I hope you liked your biscuits!

Saturday 19 June 2010

profiles and repositories elsewhere

There's a terrific thread getting going on the JISC Repositories mailing list, where people are describing repository use in their institutions. It has also diverged into a discussion of how profiles and repositories interact. As well as local efforts, there is mention of some major players:

  • Vivo - a "research-focussed discovery tool" from Cornell, which lets you browse or search for information about Cornell faculty and staff
  • Catalyst - "brings together the intellectual force, technologies, and clinical expertise of Harvard University and its affiliates and partners to reduce the burden of human illness"
  • BibApp - a "Campus Research Gateway and Expert Finder"
  • OpenScholar - "a paradigm shift in how the personal academic and research web sites are created and maintained"
I don't think our project would yet lay claim to any of these exotic descriptions, but it's great to know that other work is going on in this area - it's an exciting space!

Thursday 17 June 2010

Summary of Initial Research

There is no single standard for the presentation or content of University profiles: just as departmental websites differ widely in their design and layout, so there is no common practice in representing staff. We did, however, find some universal basic categories of information which are used on all profiles.

It's interesting that the take-up of profiles is high, but there are still people who have nothing but a name, and perhaps the name of their college, listed on their profile. This leads us to wonder about the level of concern among staff about putting their contact details and information about themselves online. We didn't find any evidence to suggest a bias by age, gender or seniority in the completion or non-completion of profiles. Further research with staff members, through interviews and a survey, will hopefully uncover the reasons why some feel less inclined to provide details on their departmental websites.

Throughout the course of our research we kept asking questions about how users were writing and updating their profiles. In the next stage of our research we will look at how departmental profiles are managed, and who they are managed by. We'd also like to learn more about the attitudes of staff towards their University profiles, and whether they feel satisfied with how their profiles represent them to the wider world.

A more detailed analysis of our initial research can be found here.

Wednesday 28 April 2010

Templates vs. Free Pages

We've finished our initial research on the current status of University profiles. Issues of templates and standardisation versus a free hand in creating one's own web page have raised some interesting questions. Templates, as you would expect, produce a standardised and uniform look which has the stamp of authority, but they often fall short in terms of interesting and personal content. Web pages where users are free to communicate what they value feel more friendly and open. These pages invite dialogue and present academics as people with passions and interests that a list of publications can't communicate.

There is an underlying problem, though: templates offer a way onto the web for those who don't have the technical expertise to write their own html. It's no surprise that most departments using the 'free page' system are science and technology focused.

So what's the answer for more engaging and professional profiles? I think it's probably a synthesis of the accessibility and professionalism of templates with the freedom offered by free pages (how we achieve this is another matter). We need to allow and encourage people to post more often, with more options for personalisation and the space to write more creatively. One idea might be to replace the titles of boxes on templates with questions. So instead of a box called 'research interests' we would write 'describe your research interests and your current work', or even 'imagine that you've met a new person at a conference - they're not familiar with your area of work; how would you describe it?' Titles on text boxes encourage users to write lists of keywords separated by commas. This provides a useful string of metadata, but it tells us nothing of that person's motivation, their journey through their work, or their passion for their subject.

Profiles need to be easy to use and quick to update to encourage academics to use them, but the more fundamental problem is in making people care about their digital identity - why would they care about their online profile? Who is the profile for, anyway? These are questions that we hope to address through surveys and more in-depth interviews with academic staff. Esther Dingley, co-founder of Graduate Junction and current Arcadia fellow at the UL, has a really great post on her blog about the importance of legitimising the time spent on creating and maintaining digital identities.

Monday 19 April 2010

challenges of understanding social network data

A great article from danah boyd about the challenges when computer scientists meet social scientists in the study of networks and relationships, based on large data sets.