###################
What is PubFetcher?
###################

A Java command-line tool and library to download and store publications with metadata by combining content from various online resources (Europe PMC, PubMed, PubMed Central, Unpaywall, journal web pages), plus extract content from general web pages.

********
Overview
********

PubFetcher used to be part of `EDAMmap `_ until its functionality was judged to be potentially useful on its own; PubFetcher is now an independently usable application. However, its features and structure are still influenced by EDAMmap. For example, the supported :ref:`publication resources ` are mainly from the biomedical and life sciences fields, and getting the list of authors of a publication is currently not supported (as it is not needed in EDAMmap). Also, the functionality of extracting content from :ref:`general web pages ` is geared towards web pages containing descriptions and documentation of software tools (GitHub, BioConductor, etc.), as PubFetcher has built-in rules to extract from these pages and it has fields to store the :ref:`software license ` and :ref:`programming language `.

Ideally, all scientific literature would be open and easily accessible through one interface for text mining and other purposes. One interface for getting publications is `Europe PMC `_, which PubFetcher uses as its main resource. In mid-2018, Europe PMC was able to provide almost all of the titles, around 95% of abstracts, 50% of full texts and only 10% of user-assigned keywords for the publications present in the `bio.tools `_ registry at that time. While some articles don't have keywords and some full texts can't be obtained, many of the gaps can be filled by other :ref:`resources `. And sometimes the maximum amount of content about each publication is needed for better results, hence the need for PubFetcher, which extracts and combines data from these different resources.

The speed of downloading, when :ref:`multithreading ` is enabled, is roughly one publication per second. This limitation, along with the desire not to overburden the APIs and publisher sites used, means that PubFetcher is best suited to medium-scale processing of publications, where the number of entries is in the thousands rather than the millions, but where maximum completeness for these few thousand publications is desired. If millions of publications are required, then it is better to restrict oneself to the Open Access subset, which can be downloaded in bulk: https://europepmc.org/downloads.

In addition to the main content of a publication (:ref:`title `, :ref:`abstract ` and :ref:`full text `), PubFetcher supports getting different keywords about the publication: the :ref:`user-assigned keywords `, the :ref:`MeSH terms ` as assigned in PubMed and :ref:`EFO terms ` and :ref:`GO terms ` as mined from the full text by Europe PMC. Each publication has up to three identifiers: a :ref:`PMID `, a :ref:`PMCID ` and a :ref:`DOI `. In addition, different metadata (found from the different :ref:`resources `) about a publication is saved, like whether the article is :ref:`Open Access `, the :ref:`journal ` where it was published, the :ref:`publication date `, etc.

The :ref:`source ` of each :ref:`publication part ` is remembered, with content from a higher confidence resource potentially overwriting the current content. It is possible to fetch only some :ref:`publication parts ` (thus avoiding querying some :ref:`resources `) and there is :ref:`an algorithm ` to determine whether an already existing entry should be refetched or is complete enough. Fetching and :ref:`extracting ` of content is done using various Java libraries with support for :ref:`JavaScript ` and :ref:`PDF ` files. The downloaded publications can be persisted to disk in a :ref:`key-value store ` for later analysis.

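
The idea of remembering the source of each publication part and letting a higher confidence resource overwrite existing content can be illustrated with a minimal sketch. It is not PubFetcher's actual implementation: the class ``PartOverwriteSketch``, the ``Source`` enum with its ordering, the ``Part`` record and the ``merge`` method are all hypothetical names chosen for illustration, and the real decision logic may take more factors into account.

.. code-block:: java

  public class PartOverwriteSketch {

      // Hypothetical sources, listed from lowest to highest confidence.
      // The ordering is an assumption made for this example only.
      enum Source { WEB_PAGE, PUBMED, EUROPE_PMC }

      // A publication part together with the source it came from.
      record Part(String content, Source source) {
          boolean isEmpty() {
              return content == null || content.isBlank();
          }
      }

      // Keep the current part unless the candidate is non-empty and either
      // the current part is empty or the candidate's source ranks higher.
      static Part merge(Part current, Part candidate) {
          if (candidate == null || candidate.isEmpty()) {
              return current;
          }
          if (current == null || current.isEmpty()) {
              return candidate;
          }
          return candidate.source().compareTo(current.source()) > 0 ? candidate : current;
      }

      public static void main(String[] args) {
          Part fromWeb = new Part("Title scraped from a journal page", Source.WEB_PAGE);
          Part fromEpmc = new Part("Title from Europe PMC", Source.EUROPE_PMC);
          // The higher confidence source wins regardless of arrival order.
          System.out.println(merge(fromWeb, fromEpmc).source()); // prints EUROPE_PMC
          System.out.println(merge(fromEpmc, fromWeb).source()); // prints EUROPE_PMC
      }
  }

Because the source is stored next to the content, a later fetch from a lower confidence resource cannot silently replace content that is already better.
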
A number of :ref:`built-in rules ` are included (along with :ref:`tests `) for :ref:`scraping ` publication parts from publisher sites, but additional rules can also be defined. Currently, there is support for over 50 publishers of journals and around 25 repositories of tools and tools' metadata and documentation. Around 750 test cases for the rules have been defined. If no rules are defined for a given site, then :ref:`automatic cleaning ` is applied to get the main content of the page.

PubFetcher has an extensive :ref:`command-line tool ` to use all of its functionality. It contains a few :ref:`helper operations `, but the main use is the construction of a simple :ref:`pipeline ` for querying, fetching and outputting of publications and general and documentation web pages: first, IDs of interest are specified/loaded and filtered; then the corresponding content is fetched/loaded and filtered; and finally the results can be output or stored to a database. Among other functionality, content and all the metadata can be output in :ref:`HTML or plain text `, but also :ref:`exported ` to :ref:`JSON `. All fetching operations can be influenced by a few :ref:`general parameters `. Progress along with error messages is logged to the console and to a :ref:`log file `, if specified. The command-line tool can be :ref:`extended `, for example to add new ways of loading IDs.

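
The three pipeline stages described above (IDs, then content, then output) can be pictured with a small self-contained sketch. It only mimics the shape of the pipeline and does not use the PubFetcher library: ``PipelineSketch``, the ``Publication`` record, the ``run`` method, the DOI-prefix filter and the in-memory map standing in for fetching are all assumptions made for illustration, and the titles are placeholders.

.. code-block:: java

  import java.util.List;
  import java.util.Map;
  import java.util.Objects;

  public class PipelineSketch {

      // Stand-in for a fetched publication; only two fields for brevity.
      record Publication(String id, String title) {}

      // Filter IDs, look up content, filter content, then format output lines.
      static List<String> run(List<String> ids, Map<String, Publication> store) {
          return ids.stream()
                  .filter(id -> id.startsWith("10."))   // ID filter: keep DOI-like IDs
                  .map(store::get)                      // "fetching" is a map lookup here
                  .filter(Objects::nonNull)             // drop IDs without content
                  .filter(p -> !p.title().isBlank())    // content filter: need a title
                  .map(p -> p.id() + "\t" + p.title())  // plain-text output line
                  .toList();
      }

      public static void main(String[] args) {
          // Placeholder content keyed by one of the DOIs from the Quickstart.
          Map<String, Publication> store = Map.of(
                  "10.1093/nar/gkz369",
                  new Publication("10.1093/nar/gkz369", "Example title A"));
          List<String> ids = List.of("10.1093/nar/gkz369", "12345", "10.1101/692905");
          run(ids, store).forEach(System.out::println);
      }
  }

In the real tool each stage is driven by command-line parameters instead of code, but the order of the stages is the same.
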
*******
Outline
*******

* :ref:`cli` documents all parameters of the command-line interface, accompanied by many examples
* :ref:`output` describes different outputs: the database, the log file and the JSON output, through which the structure of publications, webpages and docs is also explained
* :ref:`fetcher` deals with fetching logic, describing for example the content fetching methods and the resources and filling logic of publication parts
* :ref:`scraping` is about scraping rules and how to define and test them
* :ref:`api` gives a short overview of the source code for those wanting to use the PubFetcher library
* :ref:`future` contains ideas on how to improve PubFetcher

*******
Install
*******

Installation instructions can be found in the project's GitHub repo at `INSTALL `_.

**********
Quickstart
**********

.. code-block:: bash

  # Create a new empty database
  $ java -jar pubfetcher-cli-.jar -db-init database.db
  # Fetch two publications and store them to the database
  $ java -jar pubfetcher-cli-.jar -pub 10.1093/nar/gkz369 10.1101/692905 -db-fetch-end database.db
  # Print the fetched publications
  $ java -jar pubfetcher-cli-.jar -pub-db database.db -db database.db -out

For many more examples, see :ref:`Examples `.

****
Repo
****

PubFetcher is hosted at https://github.com/edamontology/pubfetcher.

*******
Support
*******

Should you need help installing or using PubFetcher, please get in touch with Erik Jaaniso (the lead developer) directly via the `tracker `_.

*******
License
*******

PubFetcher is free and open-source software licensed under the GNU General Public License v3.0, as seen in `COPYING `_.