What is PubFetcher?
A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages), plus extract content from general web pages.
PubFetcher used to be part of EDAMmap until its functionality was judged to be potentially useful on its own, so PubFetcher is now an independently usable application. However, its features and structure are still influenced by EDAMmap: for example, the supported publication resources are mainly from the biomedical and life sciences fields, and getting the list of authors of a publication is currently not supported (as it’s not needed in EDAMmap). Also, the extraction of content from general web pages is geared towards pages containing software tool descriptions and documentation (GitHub, BioConductor, etc.), as PubFetcher has built-in rules to extract from these pages and has fields to store the software license and programming language.
Ideally, all scientific literature would be open and easily accessible through one interface for text mining and other purposes. One interface for getting publications is Europe PMC, which PubFetcher uses as its main resource. In the middle of 2018, Europe PMC was able to provide almost all of the titles, around 95% of abstracts, 50% of full texts and only 10% of user-assigned keywords for the publications present in the bio.tools registry at that time. While some articles don’t have keywords and some full texts can’t be obtained, many of the gaps can be filled by other resources. Sometimes the maximum amount of content about each publication is needed for better results, hence the need for PubFetcher, which extracts and combines data from these different resources.
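The idea of filling gaps from several resources can be illustrated with a minimal Java sketch. It is not PubFetcher's actual filling logic (that is described in the Fetching logic section) and it does not use the PubFetcher library: the `Part` enum, the `merge` method and the sample values are all hypothetical. It only assumes that each resource yields some publication parts and that resources are consulted in a fixed priority order.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class GapFillingSketch {
    // Hypothetical part names, chosen for illustration only
    enum Part { TITLE, ABSTRACT, KEYWORDS, FULLTEXT }

    // Merges records given in priority order: each part is taken from the
    // first record that provides a non-blank value for it.
    static Map<Part, String> merge(List<Map<Part, String>> recordsInPriorityOrder) {
        Map<Part, String> merged = new EnumMap<>(Part.class);
        for (Map<Part, String> record : recordsInPriorityOrder) {
            for (Map.Entry<Part, String> entry : record.entrySet()) {
                String value = entry.getValue();
                if (value != null && !value.isBlank()) {
                    // Keep the value from the higher-priority record if already present
                    merged.putIfAbsent(entry.getKey(), value);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for what two resources might return
        Map<Part, String> mainResource = Map.of(
                Part.TITLE, "Example title",
                Part.ABSTRACT, "Example abstract");
        Map<Part, String> otherResource = Map.of(
                Part.TITLE, "Example title (other source)",
                Part.KEYWORDS, "text mining");
        System.out.println(merge(List.of(mainResource, otherResource)));
    }
}
```

In this sketch the title and abstract come from the first record, while the keywords, which the first record lacks, are filled in from the second.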
The speed of downloading, when multithreading is enabled, is roughly one publication per second. This limitation, along with the desire not to overburden the APIs and publisher sites used, means that PubFetcher is best suited to medium-scale processing of publications, where the number of entries is in the thousands rather than the millions, but where each of these few thousand publications should be as complete as possible. If millions of publications are required, then it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.
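To make the scale argument concrete, here is a small back-of-the-envelope calculation in Java. It assumes a constant rate of exactly one publication per second, which is only the rough figure quoted above; real throughput will vary. The class and method names are made up for this example.

```java
import java.time.Duration;

class FetchTimeEstimate {
    // Assumed average rate, taken from the "roughly one publication per second" figure
    static final double PUBLICATIONS_PER_SECOND = 1.0;

    // Estimated wall-clock time to fetch the given number of publications
    static Duration estimate(long publications) {
        long seconds = Math.round(publications / PUBLICATIONS_PER_SECOND);
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        for (long n : new long[] {5_000L, 1_000_000L}) {
            Duration d = estimate(n);
            System.out.printf("%d publications: about %d days %d h %d min%n",
                    n, d.toDays(), d.toHoursPart(), d.toMinutesPart());
        }
    }
}
```

Under this assumption, 5,000 publications take about 83 minutes, whereas one million publications would take more than 11 days of continuous fetching.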
PubFetcher has an extensive command-line tool that exposes all of its functionality. It contains a few helper operations, but its main use is the construction of a simple pipeline for querying, fetching and outputting publications, general web pages and documentation web pages: first, IDs of interest are specified or loaded and then filtered; next, the corresponding content is fetched or loaded and then filtered; finally, the results can be output or stored to a database. Among other functionality, content and all the metadata can be output in HTML or plain text, but also exported to JSON. All fetching operations can be influenced by a few general parameters. Progress, along with error messages, is logged to the console and, if specified, to a log file. The command-line tool can be extended, for example to add new ways of loading IDs.
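The stages of such a pipeline can be sketched in plain Java. This is only an illustration of the load, filter, fetch, filter, output sequence described above; it does not call the PubFetcher library or command-line tool. The `Publication` class, the `run` method and the fake fetcher are hypothetical, and no network access takes place.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PipelineSketch {
    // Hypothetical stand-in for a fetched publication
    static final class Publication {
        final String id;
        final String title;

        Publication(String id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    // Filter IDs, fetch content, filter content, then format the results as text lines
    static List<String> run(List<String> ids,
                            Predicate<String> idFilter,
                            Function<String, Publication> fetcher,
                            Predicate<Publication> contentFilter) {
        return ids.stream()
                .filter(idFilter)
                .map(fetcher)
                .filter(contentFilter)
                .map(p -> p.id + "\t" + p.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The two DOIs come from the quick example in this document; the third entry is junk
        List<String> ids = List.of("10.1093/nar/gkz369", "10.1101/692905", "not-a-doi");
        // Fake fetcher: returns a placeholder title instead of contacting any resource
        Function<String, Publication> fakeFetcher = id -> new Publication(id, "Placeholder title");
        List<String> lines = run(ids, id -> id.startsWith("10."), fakeFetcher,
                p -> !p.title.isEmpty());
        lines.forEach(System.out::println);
    }
}
```

Passing the filters and the fetcher in as parameters mirrors the way each stage of the command-line pipeline can be configured separately.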
- Command-line interface manual documents all parameters of the command-line interface, accompanied by many examples
- Output describes different outputs: the database, the log file and the JSON output, through which the structure of publications, webpages and docs is also explained
- Fetching logic describes how content is fetched, for example the content fetching methods and the resources and filling logic of publication parts
- Scraping rules explains the scraping rules and how to define and test them
- Programming reference gives a short overview of the source code for those wanting to use the PubFetcher library
- Ideas for future contains ideas on how to improve PubFetcher
    # Create a new empty database
    $ java -jar pubfetcher-cli-<version>.jar -db-init database.db
    # Fetch two publications and store them to the database
    $ java -jar pubfetcher-cli-<version>.jar -pub 10.1093/nar/gkz369 10.1101/692905 -db-fetch-end database.db
    # Print the fetched publications
    $ java -jar pubfetcher-cli-<version>.jar -pub-db database.db -db database.db -out
For many more examples, see Examples.
Should you need help installing or using PubFetcher, please get in touch with Erik Jaaniso (the lead developer) directly via the tracker.