Programming reference

In addition to command-line usage, documented in the section Command-line interface manual, PubFetcher can be used as a library. This section gives a short overview of the public interface of the PubFetcher source code. Documentation in the code itself is currently sparse.

Package pubfetcher.core.common

BasicArgs is the abstract base class of FetcherArgs, FetcherPrivateArgs and the other command-line argument classes in “org.edamontology” packages that use JCommander for command-line argument parsing and Log4J2 for logging. It provides the -h/--help and -l/--log keys and their functionality.

FetcherArgs and FetcherPrivateArgs are classes encapsulating the parameters described in Fetching and Fetching private. Arg and Args are used to store the properties of each parameter, such as its default value or description string (this is useful in EDAMmap, where parameters, including fetching parameters, are displayed to the user and can be changed by the user).

IllegalRequestException is a custom Java runtime exception thrown if there are problems with the user’s request. The exception message can be output back to the user, for example over a web API.
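The pattern described above can be sketched as follows. This is a minimal illustration, not PubFetcher's actual code: the class BadRequestException and the methods requireId and respond are hypothetical stand-ins, used only to show an unchecked exception whose message is written for the end user and returned to them unchanged.

```java
// Hypothetical stand-in classes; none of these names come from PubFetcher itself.
class RequestCheckSketch {

    /** Unchecked exception signalling a problem with the user's request. */
    static class BadRequestException extends RuntimeException {
        BadRequestException(String message) {
            super(message);
        }
    }

    /** Hypothetical validation step: rejects a missing or blank identifier. */
    static String requireId(String id) {
        if (id == null || id.trim().isEmpty()) {
            throw new BadRequestException("Publication ID must not be empty");
        }
        return id.trim();
    }

    /** Builds a response body, roughly as a web API handler might. */
    static String respond(String id) {
        try {
            return "OK: " + requireId(id);
        } catch (BadRequestException e) {
            // The message was written with the end user in mind,
            // so it is passed back without exposing a stack trace.
            return "Error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(respond(" some-id "));
        System.out.println(respond(""));
    }
}
```

Because the exception is unchecked, the validation code can throw it from any depth without declaring it, and only the outermost request handler needs to catch it.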

Version contains the name, URL and version of the program. These are read from the project’s properties file, found at the absolute resource /project.properties.
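Reading such values from a properties resource could look roughly like the sketch below. It uses only the standard java.util.Properties API. The class VersionSketch, its methods and the property keys “name”, “version” and “url” are assumptions made for illustration; the actual keys in project.properties and the way the Version class reads them may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch; not the actual Version class of PubFetcher.
class VersionSketch {

    /** Loads properties from an absolute classpath resource such as "/project.properties". */
    static Properties fromResource(String resource) {
        try (InputStream in = VersionSketch.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IllegalStateException("Resource not found: " + resource);
            }
            Properties properties = new Properties();
            properties.load(in);
            return properties;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses properties from text; handy for trying the sketch without a resource file. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Formats the assumed keys "name", "version" and "url" into one line. */
    static String describe(Properties properties) {
        return properties.getProperty("name", "unknown") + " "
                + properties.getProperty("version", "unknown") + " ("
                + properties.getProperty("url", "unknown") + ")";
    }

    public static void main(String[] args) {
        // Placeholder values, not the real contents of project.properties
        Properties demo = parse("name=Demo\nversion=1.0\nurl=https://example.org\n");
        System.out.println(describe(demo));
    }
}
```

The leading slash in the resource name makes getResourceAsStream resolve it from the root of the classpath instead of relative to the package of the calling class.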

However, the main class of interest for a potential library user is PubFetcher. This class contains most of the public methods making up the PubFetcher API. Currently, it is also the only class documented using Javadoc. Some of the methods (those described in Publication IDs and Miscellaneous) can be called from PubFetcher-CLI.

Package pubfetcher.core.db (and subpackages)

The Database class can be used to initialise a database file, to put content into it, get content from it or remove content from it, to list the IDs it contains, to check whether it contains a given ID, and to compact it. The class abstracts away the underlying database system currently in use (MapDB). The structure of the database is described in the Database section of the output documentation. Some methods can be called from PubFetcher-CLI; these are described in the corresponding Database section.

DatabaseEntry is the base class for Publication and Webpage. It contains the methods “canFetch” and “updateCounters” whose logic is explained in Can fetch. DatabaseEntryType specifies whether a given DatabaseEntry is a publication, webpage or doc.

Publication, Webpage and most other classes in the “pubfetcher.core.db” packages are the entities stored in the database. These classes contain methods to get and set the value of their fields and methods to output content fields in plain text, HTML or JSON, with or without metadata fields. Their structure is explained in Contents.

The PublicationIds class encapsulates publication IDs that can be stored in the database. Its structure is explained in IDs of publications.

The PublicationPartType enumeration of possible publication types is explained in Publication types.

Package pubfetcher.core.fetching

Fetcher is the main class dealing with fetching. Its logic is explained in Fetching logic.

Fetcher contains the public method “getDoc”, which is described in Getting a HTML document. The “getDoc”, “getWebpage” and “updateCitationsCount” methods can be called from PubFetcher-CLI, as seen in Print a web page and Update citations count.

The Fetcher methods “initPublication” and “initWebpage” must be used to construct a Publication and a Webpage, respectively. The methods “getPublication” and “getWebpage” can then be used to fetch the Publication and the Webpage. However, when possible, the “getPublication”, “getWebpage” and “getDoc” methods of class PubFetcher should be used instead of these “init” and “get” methods.

Because executing JavaScript in the HtmlUnit library is prone to serious bugs, fetching an HTML document with JavaScript support turned on is done in a separate JavaScriptThread, which can be killed if it gets stuck.
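The general idea of isolating a potentially hanging task in its own thread can be sketched as below. This is not the implementation of JavaScriptThread: the class TimeLimitedTaskSketch and its method runWithTimeout are hypothetical, and the sketch swaps in a different technique, namely a standard ExecutorService with a timeout and thread interruption instead of forcibly killing the thread. Interruption only stops tasks that respond to it, so a truly stuck library call might need harsher measures than shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not how PubFetcher's JavaScriptThread is written.
class TimeLimitedTaskSketch {

    /** Runs the task in a separate thread and gives up after the timeout. */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "time-limited-task");
            thread.setDaemon(true); // a stuck worker must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes in time
        System.out.println(runWithTimeout(() -> "fetched", 2000, "timed out"));
        // A task that takes too long and is abandoned
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "fetched";
        }, 100, "timed out"));
    }
}
```

Running the risky work in its own thread means the caller always regains control after the timeout, whatever happens to the worker.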

The HtmlMeta class is explained in Meta and the Links class in Links.

Automatic cleaning and formatting of web pages without scraping rules has been implemented in the CleanWebpage class.

The “pubfetcher.core.fetching” package also contains the classes related to testing: FetcherTest and FetcherTestArgs. These are explained in Testing of rules.

Package pubfetcher.core.scrape

Classes in this package deal with scraping, as explained in the Scraping rules section.

The public methods of the Scrape class can be called from PubFetcher-CLI using the parameters shown in Scrape rules.

Package pubfetcher.cli

The command-line interface of PubFetcher, PubFetcher-CLI, is implemented in the package “pubfetcher.cli”. Its usage is the topic of the first section, Command-line interface manual.

The functionality of PubFetcher-CLI can be extended by implementing new operations in a new command-line tool, which can then call the public “run” method of the PubFetcherMethods class to pull in all the functionality of PubFetcher-CLI. One of the main reasons to do this is to implement a new way of getting publication IDs and webpage/doc URLs. These IDs and URLs can then be passed to the “run” method of PubFetcherMethods as the lists “externalPublicationIds”, “externalWebpageUrls” and “externalDocUrls”. One example of such an extension is the EDAMmap-Util tool (see its UtilMain class).

Configuration resources/log4j2.xml

The PubFetcher-CLI Logging configuration file log4j2.xml specifies how logging is done and what the Log file will look like.