Command-line interface manual

The CLI of PubFetcher is provided by a Java executable packaged in a .jar file. If a Java Runtime Environment (JRE) capable of running version 8 of Java is installed on the system, then this .jar file can be executed using the java command. For example, executing PubFetcher-CLI with the parameter -h or --help outputs a list of all possible parameters:

$ java -jar path/to/pubfetcher-cli-<version>.jar --help

Parsing of command line parameters is provided by JCommander.

Logging

Parameter Description
-l or --log The path of the log file

PubFetcher-CLI will output its log to the console (to stderr). With the --log parameter we can specify a text file location where this same log will be output. Unlike the console output, it will not be coloured, but it will include a few DEBUG level messages omitted from the console (including the very first line listing all parameters the program was run with).

If the specified file already exists, then new log messages will be appended to its end. In case of a new log file creation, any missing parent directories will be created as necessary.
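For example (a minimal sketch; the log and database file names are arbitrary), a one-off operation can be run with its log appended to a file:

$ java -jar pubfetcher-cli-<version>.jar \
--log pubfetcher.log -db-init database.db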

General parameters

Parameters affecting many other operations specified below. These parameters can be supplied to PubFetcher externally through programmatic means. When supplied on the command line (of PubFetcher-CLI), two dashes (--) must be added in front of the parameter names given in the two following tables.

Fetching

Parameters that affect many of the operations specified further below, for example --timeout changes the timeouts of all attempted network connections. The cooldown and retryLimit parameters affect whether we can fetch (or rather, refetch) a publication or webpage. The minimum length and size parameters affect whether an entry is usable and final.

Parameter Default Min Description
emptyCooldown 720   If that many minutes have passed since the last fetching attempt of an empty publication or empty webpage, then fetching can be attempted again, resetting the retryCounter. Setting to 0 means fetching of empty database entries will always be attempted again. Setting to a negative value means refetching will never be done (and retryCounter never reset) only because the entry is empty.
nonFinalCooldown 10080   If that many minutes have passed since the last fetching attempt of a non-final publication or non-final webpage (which are not empty), then fetching can be attempted again, resetting the retryCounter. Setting to 0 means fetching of non-final database entries will always be attempted again. Setting to a negative value means refetching will never be done (and retryCounter never reset) only because the entry is non-final.
fetchExceptionCooldown 1440   If that many minutes have passed since the last fetching attempt of a publication or webpage with a fetchException, then fetching can be attempted again, resetting the retryCounter. Setting to 0 means fetching of database entries with fetchException will always be attempted again. Setting to a negative value means refetching will never be done (and retryCounter never reset) only because the fetchException of the entry is true.
retryLimit 3   How many times can fetching be retried for an entry that is still empty, non-final or has a fetchException after the initial attempt. Setting to 0 will disable retrying, unless the retryCounter is reset by a cooldown in which case one initial attempt is allowed again. Setting to a negative value will disable this upper limit.
titleMinLength 4 0 Minimum length of a usable publication title
keywordsMinSize 2 0 Minimum size of a usable publication keywords/MeSH list
minedTermsMinSize 1 0 Minimum size of a usable publication EFO/GO terms list
abstractMinLength 200 0 Minimum length of a usable publication abstract
fulltextMinLength 2000 0 Minimum length of a usable publication fulltext
webpageMinLength 50 0 Minimum length of a usable webpage combined title and content
webpageMinLengthJavascript 200 0 If the length of the whole web page text fetched without JavaScript is below the specified limit and no scraping rules are found for the corresponding URL, then refetching using JavaScript support will be attempted
timeout 15000 0 Connect and read timeout of connections, in milliseconds
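For example (an illustrative sketch; the PMID and parameter values are arbitrary, and -pub, -fetch and -out are pipeline operations described further below), Fetching parameters get two dashes on the PubFetcher-CLI command line:

$ java -jar pubfetcher-cli-<version>.jar --timeout 30000 --retryLimit 5 \
-pub 12345678 -fetch -out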

Fetching private

These are like Fetching parameters in that they have a general effect, e.g. setting --userAgent changes the HTTP User-Agent of all HTTP connections. However, Fetching parameters are ones we might want to expose via a web API so that a client can change them (when extending or using the PubFetcher library), whereas the parameters below should probably only be configured locally and are therefore kept separate in the code.

Parameter Description
europepmcEmail E-mail to send to the Europe PMC API
oadoiEmail E-mail to send to the oaDOI (Unpaywall) API
userAgent HTTP User-Agent
journalsYaml YAML file containing custom journals scrape rules to add to default ones
webpagesYaml YAML file containing custom webpages scrape rules to add to default ones
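For example (a sketch with placeholder values; the e-mail address and YAML file name are arbitrary, and the trailing pipeline operations are described further below), these parameters are likewise given with two dashes:

$ java -jar pubfetcher-cli-<version>.jar --oadoiEmail user@example.com \
--journalsYaml custom_journals.yaml -pub-file pub.txt -fetch -out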

Simple one-off operations

Some simple operations (represented by the parameters with one dash (-) below) that, when used, should mostly be the sole parameter supplied to PubFetcher.

Database

A collection of one-off database operations on a single database file.

Parameter Parameter args Description
-db-init <database file> Create an empty database file. This is the only way to make new databases.
-db-commit <database file> Commit all pending changes by merging all WAL files to the main database file. This has only an effect if WAL files are present beside the database file after an abrupt termination of the program, as normally committing is done in code where required.
-db-compact <database file> Compaction reclaims space by removing deprecated records (left over after database updates)
-db-publications-size <database file> Output the number of publications stored in the database to stdout
-db-webpages-size <database file> Output the number of webpages stored in the database to stdout
-db-docs-size <database file> Output the number of docs stored in the database to stdout
-db-publications-map <database file> Output all PMID to primary ID, PMCID to primary ID and DOI to primary ID mapping pairs stored in the database to stdout
-db-publications-map-reverse <database file> Output all mappings from primary ID to the triple [PMID, PMCID, DOI] stored in the database to stdout
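For instance (database.db being an arbitrary file name), the number of stored publications can be printed and unused space reclaimed with:

$ java -jar pubfetcher-cli-<version>.jar -db-publications-size database.db

$ java -jar pubfetcher-cli-<version>.jar -db-compact database.db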

Scrape rules

Print requested parts of currently effective scraping rules loaded from default or custom scrape rules YAML files.

Parameter Parameter args Description
-scrape-site <url> Output found journal site name for the given URL to stdout (or null if not found or URL invalid)
-scrape-selector <url> <ScrapeSiteKey> Output the CSS selector used for extracting the publication part represented by ScrapeSiteKey from the given URL
-scrape-javascript <url> Output true or false depending on whether JavaScript will be used or not for fetching the given publication URL
-scrape-webpage <url> Output all CSS selectors used for extracting webpage content and metadata from the given URL (or null if not found or URL invalid)
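For example (the URL being merely an illustrative publication URL), the journal site name and the JavaScript requirement for a URL can be queried with:

$ java -jar pubfetcher-cli-<version>.jar \
-scrape-site https://www.nature.com/articles/nmeth.4468

$ java -jar pubfetcher-cli-<version>.jar \
-scrape-javascript https://www.nature.com/articles/nmeth.4468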

Publication IDs

Simple operations on publication IDs, with result output to stdout.

Parameter Parameter args Description
-is-pmid <string> Output true or false depending on whether the given string is a valid PMID or not
-is-pmcid <string> Output true or false depending on whether the given string is a valid PMCID or not
-extract-pmcid <pmcid> Remove the prefix “PMC” from a PMCID and output the rest. Output an empty string if the given string is not a valid PMCID.
-is-doi <string> Output true or false depending on whether the given string is a valid DOI or not
-normalise-doi <doi> Remove any valid prefix (e.g. “https://doi.org/”, “doi:”) from a DOI and output the rest, converting letters from the 7-bit ASCII set to uppercase. The validity of the input DOI is not checked.
-extract-doi-registrant <doi> Output the registrant ID of a DOI (the substring after “10.” and before “/”). Output an empty string if the given string is not a valid DOI.
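For instance (the ID values being arbitrary examples), a PMCID can be validated and a DOI normalised with:

$ java -jar pubfetcher-cli-<version>.jar -is-pmcid PMC3245300

$ java -jar pubfetcher-cli-<version>.jar -normalise-doi 'doi:10.1093/nar/gkw199'

The first command outputs true or false; per the description above, the second should output the DOI with its prefix removed and its ASCII letters uppercased (10.1093/NAR/GKW199).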

Miscellaneous

Methods to test the escaping of HTML entities as done by PubFetcher (necessary when outputting raw input to HTML format) and test the validity of publication IDs and webpage URLs.

Parameter Parameter args Description
-escape-html <string> Output the result of escaping necessary characters in the given string such that it can safely be used as text in an HTML document (without the string interacting with the document’s markup)
-escape-html-attribute <string> Output the result of escaping necessary characters in the given string such that it can safely be used as an HTML attribute value (without the string interacting with the document’s markup)
-check-publication-id <string> Given one publication ID, output it in publication IDs form (<pmid>\t<pmcid>\t<doi>) if it is a valid PMID, PMCID or DOI, or throw an exception if it is an invalid publication ID
-check-publication-ids <pmid> <pmcid> <doi> Given a PMID, a PMCID and a DOI, output them in publication IDs form (<pmid>\t<pmcid>\t<doi>) if given IDs are a valid PMID, PMCID and DOI, or throw an exception if at least one is invalid
-check-url <string> Given a webpage ID (i.e. a URL), output the parsed URL, or throw an exception if it is an invalid URL
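A minimal sketch (reusing the same arbitrary example IDs) of checking a full ID triple and escaping a string for HTML:

$ java -jar pubfetcher-cli-<version>.jar \
-check-publication-ids 12345678 PMC3245300 10.1093/nar/gkw199

$ java -jar pubfetcher-cli-<version>.jar -escape-html '<b>AT&T</b>'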

Pipeline of operations

A simple pipeline that allows for more complex querying, fetching and outputting of publications, webpages and docs: first, IDs of interest are specified/loaded and filtered; then, the corresponding content is fetched/loaded and filtered; and last, the results can be output or stored. Component operations of the pipeline are specified as command-line parameters with one dash (-). In addition, there are some parameters modifying some aspect of the pipeline; these have two dashes (--). The Fetching and Fetching private parameters will also have an effect (on fetching and on determining the finality of content).

Add IDs

publication IDs, webpage URLs and doc URLs can be specified on the command-line and can be loaded from text and database files. The resultant list of IDs is actually a set, meaning that if duplicate IDs are encountered, they’ll be ignored and not added to the list.

Parameter Parameter args Description
-pub <string> <string> … A space-separated list of publication IDs (either PMID, PMCID or DOI) to add
-web <string> <string> … A space-separated list of webpage URLs to add
-doc <string> <string> … A space-separated list of doc URLs to add
-pub-file <text file> … Load all publication IDs from the specified list of text files containing publication IDs in the form <pmid>\t<pmcid>\t<doi>, one per line. Empty lines and lines beginning with # are ignored.
-web-file <text file> … Load all webpage URLs from the specified list of text files containing webpage URLs, one per line. Empty lines and lines beginning with # are ignored.
-doc-file <text file> … Load all doc URLs from the specified list of text files containing doc URLs, one per line. Empty lines and lines beginning with # are ignored.
-pub-db <database file> … Load all publication IDs found in the specified database files
-web-db <database file> … Load all webpage URLs found in the specified database files
-doc-db <database file> … Load all doc URLs found in the specified database files

Filter IDs

Conditions that publication IDs, webpage URLs and doc URLs must meet to be retained in the pipeline.

Parameter Parameter args Description
-has-pmid   Only keep publication IDs whose PMID is present
-not-has-pmid   Only keep publication IDs whose PMID is empty
-pmid <regex> Only keep publication IDs whose PMID has a match with the given regular expression
-not-pmid <regex> Only keep publication IDs whose PMID does not have a match with the given regular expression
-pmid-url <regex> Only keep publication IDs whose PMID provenance URL has a match with the given regular expression
-not-pmid-url <regex> Only keep publication IDs whose PMID provenance URL does not have a match with the given regular expression
-has-pmcid   Only keep publication IDs whose PMCID is present
-not-has-pmcid   Only keep publication IDs whose PMCID is empty
-pmcid <regex> Only keep publication IDs whose PMCID has a match with the given regular expression
-not-pmcid <regex> Only keep publication IDs whose PMCID does not have a match with the given regular expression
-pmcid-url <regex> Only keep publication IDs whose PMCID provenance URL has a match with the given regular expression
-not-pmcid-url <regex> Only keep publication IDs whose PMCID provenance URL does not have a match with the given regular expression
-has-doi   Only keep publication IDs whose DOI is present
-not-has-doi   Only keep publication IDs whose DOI is empty
-doi <regex> Only keep publication IDs whose DOI has a match with the given regular expression
-not-doi <regex> Only keep publication IDs whose DOI does not have a match with the given regular expression
-doi-url <regex> Only keep publication IDs whose DOI provenance URL has a match with the given regular expression
-not-doi-url <regex> Only keep publication IDs whose DOI provenance URL does not have a match with the given regular expression
-doi-registrant <string> <string> … Only keep publication IDs whose DOI registrant code (the bit after “10.” and before “/”) is present in the given list of strings
-not-doi-registrant <string> <string> … Only keep publication IDs whose DOI registrant code (the bit after “10.” and before “/”) is not present in the given list of strings
-url <regex> Only keep webpage URLs and doc URLs that have a match with the given regular expression
-not-url <regex> Only keep webpage URLs and doc URLs that don’t have a match with the given regular expression
-url-host <string> <string> … Only keep webpage URLs and doc URLs whose host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-url-host <string> <string> … Only keep webpage URLs and doc URLs whose host part is not present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-in-db <database file> Only keep publication IDs, webpage URLs and doc URLs that are present in the given database file
-not-in-db <database file> Only keep publication IDs, webpage URLs and doc URLs that are not present in the given database file

Sort IDs

Sorting of added and filtered IDs. publication IDs are first sorted by PMID, then by PMCID (if PMID is absent), then by DOI (if PMID and PMCID are absent). Internally, the PMID, the PMCID and the DOI registrant are sorted numerically, DOIs within the same registrant alphabetically. webpage URLs and doc URLs are sorted alphabetically.

Parameter Parameter args Description
-asc-ids   Sort publication IDs, webpage URLs and doc URLs in ascending order
-desc-ids   Sort publication IDs, webpage URLs and doc URLs in descending order

Limit IDs

Added, filtered and sorted IDs can be limited to a given number of IDs either in the front or back.

Parameter Parameter args Description
-head-ids <positive integer> Only keep the first given number of publication IDs, webpage URLs and doc URLs
-tail-ids <positive integer> Only keep the last given number of publication IDs, webpage URLs and doc URLs

Remove from database by IDs

The resulting list of IDs can be used to remove corresponding entries from a database.

Parameter Parameter args Description
-remove-ids <database file> From the given database, remove content corresponding to publication IDs, webpage URLs and doc URLs

Output IDs

Outputs the final list of loaded IDs to stdout or the specified text files in the format specified by the Output modifiers --plain and --format. Without --plain publication IDs are output with their corresponding provenance URLs, with --plain these are omitted. webpage URLs and doc URLs are not affected by --plain. Specifying --format as text (the default) and using --plain will output publication IDs in the form <pmid>\t<pmcid>\t<doi>.

Parameter Parameter args Description
-out-ids   Output publication IDs, webpage URLs and doc URLs to stdout in the format specified by the Output modifiers --plain and --format
-txt-ids-pub <file> Output publication IDs to the given file in the format specified by the Output modifiers --plain and --format
-txt-ids-web <file> Output webpage URLs to the given file in the format specified by --format
-txt-ids-doc <file> Output doc URLs to the given file in the format specified by --format
-count-ids   Output count numbers for publication IDs, webpage URLs and doc URLs to stdout

Get content

Operations to get publications, webpages and docs corresponding to the final list of loaded publication IDs, webpage URLs and doc URLs. Content will be fetched from the Internet, loaded from a database file, or both, with updated content possibly saved back to the database. In case multiple content getting operations are used, first everything with -db is got, then -fetch, -fetch-put, -db-fetch and last -db-fetch-end. The list of entries will be in the order in which they were got; duplicates are allowed. When saved to a database file, duplicates will be merged; in other cases (e.g. when outputting content) duplicates will be present.

Parameter Parameter args Description
-db <database file> Get publications, webpages and docs from the given database
-fetch   Fetch publications, webpages and docs from the Internet. All entries for which some fetchException happens are fetched again in the end (this is done only once).
-fetch-put <database file> Fetch publications, webpages and docs from the Internet and put each entry in the given database right after it has been fetched, ignoring any filters and overwriting any existing entries with equal IDs/URLs. All entries for which some fetchException happens are fetched and put to the database again in the end (this is done only once).
-db-fetch <database file> First, get an entry from the given database (if found), then fetch the entry (if the entry can be fetched), then put the entry back to the database while ignoring any filters (if the entry was updated). All entries which have the fetchException set are got again in the end (this is done only once). This operation is multithreaded (in contrast to -fetch and -fetch-put), with --threads number of threads, thus it should be preferred for larger amounts of content.
-db-fetch-end <database file> Like -db-fetch, except no content is kept in memory (saving back to the given database still happens), thus no further processing down the pipeline is possible. This is useful for avoiding large memory usage if only fetching and saving of content to the database is to be done and no further operations on content (like outputting it) are required.

Get content modifiers

Some parameters to influence the behaviour of content getting operations.

Parameter Parameter args Default Description
--fetch-part <PublicationPartName> …   List of publication parts that will be fetched from the Internet. All other parts will be empty (except the publication IDs which will be filled whenever possible). Fetching of resources not containing any specified parts will be skipped. If used, then --not-fetch-part must not be used. If neither of --fetch-part and --not-fetch-part is used, then all parts will be fetched.
--not-fetch-part <PublicationPartName> …   List of publication parts that will not be fetched from the Internet. All other parts will be fetched. Fetching of resources not containing any not specified parts will be skipped. If used, then --fetch-part must not be used.
--pre-filter     Normally, all content is loaded into memory before the filtering specified in Filter content is applied. This option ties the filtering step to the loading/fetching step for each individual entry, discarding entries not passing the filter right away, thus reducing memory usage. As a tradeoff, in case multiple filters are used, it won’t be possible to see in the log how many entries were discarded by each filter.
--limit <positive integer> 0 Maximum number of publications, webpages and docs that can be loaded/fetched. In case the limit is applied, the concrete returned content depends on the order it is loaded/fetched, which depends on the order of content getting operations, then on whether there was a fetchException and last on the ordering of received IDs. If the multithreaded -db-fetch is used or a fetchException happens, then the concrete returned content can vary slightly between equal applications of limit. If --pre-filter is also used, then the filters of Filter content will be applied before the limit, otherwise the limit is applied beforehand and the filters can reduce the number of entries further. Set to 0 to disable.
--threads <positive integer> 8 Number of threads used for getting content with -db-fetch and -db-fetch-end. Should not be bound by actual processor core count, as mostly threads sit idle, waiting for an answer from a remote host or waiting behind another thread to finish communicating with the same host.

Filter content

Conditions that publications, webpages and docs must meet to be retained in the pipeline. All filters will be ANDed together.

Parameter Parameter args Description
-fetch-time-more <ISO-8601 time> Only keep publications, webpages and docs whose fetchTime is more than or equal to the given time
-fetch-time-less <ISO-8601 time> Only keep publications, webpages and docs whose fetchTime is less than or equal to the given time
-retry-counter <positive integer> … Only keep publications, webpages and docs whose retryCounter is equal to one of given counts
-not-retry-counter <positive integer> … Only keep publications, webpages and docs whose retryCounter is not equal to any of given counts
-retry-counter-more <positive integer> Only keep publications, webpages and docs whose retryCounter is more than the given count
-retry-counter-less <positive integer> Only keep publications, webpages and docs whose retryCounter is less than the given count
-fetch-exception   Only keep publications, webpages and docs with a fetchException
-not-fetch-exception   Only keep publications, webpages and docs without a fetchException
-empty   Only keep empty publications, empty webpages and empty docs
-not-empty   Only keep non-empty publications, non-empty webpages and non-empty docs
-usable   Only keep usable publications, usable webpages and usable docs
-not-usable   Only keep non-usable publications, non-usable webpages and non-usable docs
-final   Only keep final publications, final webpages and final docs
-not-final   Only keep non-final publications, non-final webpages and non-final docs
-grep <regex> Only keep publications, webpages and docs whose whole content (as output using --plain) has a match with the given regular expression
-not-grep <regex> Only keep publications, webpages and docs whose whole content (as output using --plain) does not have a match with the given regular expression

Filter publications

Conditions that publications must meet to be retained in the pipeline.

Parameter Parameter args Description
-totally-final   Only keep publications whose content is totally final
-not-totally-final   Only keep publications whose content is not totally final
-oa   Only keep publications that are Open Access
-not-oa   Only keep publications that are not Open Access
-journal-title <regex> Only keep publications whose journal title has a match with the given regular expression
-not-journal-title <regex> Only keep publications whose journal title does not have a match with the given regular expression
-journal-title-empty   Only keep publications whose journal title is empty
-not-journal-title-empty   Only keep publications whose journal title is not empty
-pub-date-more <ISO-8601 time> Only keep publications whose publication date is more than or equal to given time (add “T00:00:00Z” to the end to get an ISO-8601 time from a date)
-pub-date-less <ISO-8601 time> Only keep publications whose publication date is less than or equal to given time (add “T00:00:00Z” to the end to get an ISO-8601 time from a date)
-citations-count <positive integer> … Only keep publications whose citations count is equal to one of given counts
-not-citations-count <positive integer> … Only keep publications whose citations count is not equal to any of given counts
-citations-count-more <positive integer> Only keep publications whose citations count is more than the given count
-citations-count-less <positive integer> Only keep publications whose citations count is less than the given count
-citations-timestamp-more <ISO-8601 time> Only keep publications whose citations count last update timestamp is more than or equal to the given time
-citations-timestamp-less <ISO-8601 time> Only keep publications whose citations count last update timestamp is less than or equal to the given time
-corresp-author-name <regex> Only keep publications with a corresponding author name having a match with the given regular expression
-not-corresp-author-name <regex> Only keep publications with no corresponding authors names having a match with the given regular expression
-corresp-author-name-empty   Only keep publications whose corresponding authors names are empty
-not-corresp-author-name-empty   Only keep publications with a corresponding author name that is not empty
-corresp-author-orcid <regex> Only keep publications with a corresponding author ORCID iD having a match with the given regular expression
-not-corresp-author-orcid <regex> Only keep publications with no corresponding authors ORCID iDs having a match with the given regular expression
-corresp-author-orcid-empty   Only keep publications whose corresponding authors ORCID iDs are empty
-not-corresp-author-orcid-empty   Only keep publications with a corresponding author ORCID iD that is not empty
-corresp-author-email <regex> Only keep publications with a corresponding author e-mail address having a match with the given regular expression
-not-corresp-author-email <regex> Only keep publications with no corresponding authors e-mail addresses having a match with the given regular expression
-corresp-author-email-empty   Only keep publications whose corresponding authors e-mail addresses are empty
-not-corresp-author-email-empty   Only keep publications with a corresponding author e-mail address that is not empty
-corresp-author-phone <regex> Only keep publications with a corresponding author telephone number having a match with the given regular expression
-not-corresp-author-phone <regex> Only keep publications with no corresponding authors telephone numbers having a match with the given regular expression
-corresp-author-phone-empty   Only keep publications whose corresponding authors telephone numbers are empty
-not-corresp-author-phone-empty   Only keep publications with a corresponding author telephone number that is not empty
-corresp-author-uri <regex> Only keep publications with a corresponding author web page address having a match with the given regular expression
-not-corresp-author-uri <regex> Only keep publications with no corresponding authors web page addresses having a match with the given regular expression
-corresp-author-uri-empty   Only keep publications whose corresponding authors web page addresses are empty
-not-corresp-author-uri-empty   Only keep publications with a corresponding author web page address that is not empty
-corresp-author-size <positive integer> … Only keep publications whose corresponding authors size is equal to one of given sizes
-not-corresp-author-size <positive integer> … Only keep publications whose corresponding authors size is not equal to any of given sizes
-corresp-author-size-more <positive integer> Only keep publications whose corresponding authors size is more than given size
-corresp-author-size-less <positive integer> Only keep publications whose corresponding authors size is less than given size
-visited <regex> Only keep publications with a visited site whose URL has a match with the given regular expression
-not-visited <regex> Only keep publications with no visited sites whose URL has a match with the given regular expression
-visited-host <string> <string> … Only keep publications with a visited site whose URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-visited-host <string> <string> … Only keep publications with no visited sites whose URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-visited-type <PublicationPartType> … Only keep publications with a visited site of type equal to one of given types
-not-visited-type <PublicationPartType> … Only keep publications with no visited sites of type equal to any of given types
-visited-type-more <PublicationPartType> Only keep publications with a visited site of better type than the given type
-visited-type-less <PublicationPartType> Only keep publications with a visited site of lesser type than the given type
-visited-type-final   Only keep publications with a visited site of final type
-not-visited-type-final   Only keep publications with no visited sites of final type
-visited-type-pdf   Only keep publications with a visited site of PDF type
-not-visited-type-pdf   Only keep publications with no visited sites of PDF type
-visited-from <regex> Only keep publications with a visited site whose provenance URL has a match with the given regular expression
-not-visited-from <regex> Only keep publications with no visited sites whose provenance URL has a match with the given regular expression
-visited-from-host <string> <string> … Only keep publications with a visited site whose provenance URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-visited-from-host <string> <string> … Only keep publications with no visited sites whose provenance URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-visited-time-more <ISO-8601 time> Only keep publications with a visited site whose visit time is more than or equal to the given time
-visited-time-less <ISO-8601 time> Only keep publications with a visited site whose visit time is less than or equal to the given time
-visited-size <positive integer> … Only keep publications whose visited sites size is equal to one of given sizes
-not-visited-size <positive integer> … Only keep publications whose visited sites size is not equal to any of given sizes
-visited-size-more <positive integer> Only keep publications whose visited sites size is more than the given size
-visited-size-less <positive integer> Only keep publications whose visited sites size is less than the given size

Filter publication parts

Conditions that publication parts must meet for the publication to be retained in the pipeline.

Each parameter (except -part-empty, -not-part-empty, -part-usable, -not-part-usable, -part-final, -not-part-final) has a corresponding parameter specifying the publication parts that need to meet the condition given by the parameter. For example, -part-content gives a regular expression and -part-content-part lists all publication parts that must have a match with the given regular expression. If -part-content is specified, then -part-content-part must also be specified (and vice versa).

A publication part is any of: the pmid, the pmcid, the doi, title, keywords, MeSH, EFO, GO, theAbstract, fulltext.
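As a minimal sketch of this pairing (database.db and the regular expression are arbitrary), the following keeps only publications whose title and theAbstract both have a case-insensitive match with “sequencing” and outputs their IDs:

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db -db database.db \
-part-content '(?i)sequencing' -part-content-part title theAbstract \
-out-ids --plain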

Parameter Parameter args Description
-part-empty <PublicationPartName> … Only keep publications with specified parts being empty
-not-part-empty <PublicationPartName> … Only keep publications with specified parts not being empty
-part-usable <PublicationPartName> … Only keep publications with specified parts being usable
-not-part-usable <PublicationPartName> … Only keep publications with specified parts not being usable
-part-final <PublicationPartName> … Only keep publications with specified parts being final
-not-part-final <PublicationPartName> … Only keep publications with specified parts not being final
-part-content <regex> Only keep publications where the contents of all parts specified with -part-content-part have a match with the given regular expression
-not-part-content <regex> Only keep publications where the contents of all parts specified with -not-part-content-part do not have a match with the given regular expression
-part-size <positive integer> … Only keep publications where the sizes of all parts specified with -part-size-part are equal to any of given sizes
-not-part-size <positive integer> … Only keep publications where the sizes of all parts specified with -not-part-size-part are not equal to any of given sizes
-part-size-more <positive integer> Only keep publications where the sizes of all parts specified with -part-size-more-part are more than the given size
-part-size-less <positive integer> Only keep publications where the sizes of all parts specified with -part-size-less-part are less than the given size
-part-type <PublicationPartType> … Only keep publications where the types of all parts specified with -part-type-part are equal to any of given types
-not-part-type <PublicationPartType> … Only keep publications where the types of all parts specified with -not-part-type-part are not equal to any of given types
-part-type-more <PublicationPartType> Only keep publications where the types of all parts specified with -part-type-more-type are better than the given type
-part-type-less <PublicationPartType> Only keep publications where the types of all parts specified with -part-type-less-type are lesser than the given type
-part-type-final <PublicationPartType> Only keep publications where the types of all parts specified with -part-type-final are of final type
-not-part-type-final <PublicationPartType> Only keep publications where the types of all parts specified with -not-part-type-final are not of final type
-part-type-pdf <PublicationPartType> Only keep publications where the types of all parts specified with -part-type-pdf-part are of PDF type
-not-part-type-pdf <PublicationPartType> Only keep publications where the types of all parts specified with -not-part-type-pdf-part are not of PDF type
-part-url <regex> Only keep publications where the URLs of all parts specified with -part-url-part have a match with the given regular expression
-not-part-url <regex> Only keep publications where the URLs of all parts specified with -not-part-url-part do not have a match with the given regular expression
-part-url-host <string> <string> … Only keep publications where the URL host parts of all parts specified with -part-url-host-part are present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-part-url-host <string> <string> … Only keep publications where the URL host parts of all parts specified with -not-part-url-host-part are not present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-part-time-more <ISO-8601 time> Only keep publications where the timestamps of all parts specified with -part-time-more-part are more than or equal to the given time
-part-time-less <ISO-8601 time> Only keep publications where the timestamps of all parts specified with -part-time-less-part are less than or equal to the given time

Filter webpages and docs

Conditions that webpages and docs must meet to be retained in the pipeline.

Parameter Parameter args Description
-broken   Only keep webpages and docs that are broken
-not-broken   Only keep webpages and docs that are not broken
-start-url <regex> Only keep webpages and docs whose start URL has a match with the given regular expression
-not-start-url <regex> Only keep webpages and docs whose start URL does not have a match with the given regular expression
-start-url-host <string> <string> … Only keep webpages and docs whose start URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-start-url-host <string> <string> … Only keep webpages and docs whose start URL host part is not present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-final-url <regex> Only keep webpages and docs whose final URL has a match with the given regular expression
-not-final-url <regex> Only keep webpages and docs whose final URL does not have a match with the given regular expression
-final-url-host <string> <string> … Only keep webpages and docs whose final URL host part is present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-not-final-url-host <string> <string> … Only keep webpages and docs whose final URL host part is not present in the given list of strings (comparison is done case-insensitively and “www.” is removed)
-final-url-empty   Only keep webpages and docs whose final URL is empty
-not-final-url-empty   Only keep webpages and docs whose final URL is not empty
-content-type <regex> Only keep webpages and docs whose HTTP Content-Type has a match with the given regular expression
-not-content-type <regex> Only keep webpages and docs whose HTTP Content-Type does not have a match with the given regular expression
-content-type-empty   Only keep webpages and docs whose HTTP Content-Type is empty
-not-content-type-empty   Only keep webpages and docs whose HTTP Content-Type is not empty
-status-code <integer> <integer> … Only keep webpages and docs whose HTTP status code is equal to one of given codes
-not-status-code <integer> <integer> … Only keep webpages and docs whose HTTP status code is not equal to any of given codes
-status-code-more <integer> Only keep webpages and docs whose HTTP status code is bigger than the given code
-status-code-less <integer> Only keep webpages and docs whose HTTP status code is smaller than the given code
-title <regex> Only keep webpages and docs whose page title has a match with the given regular expression
-not-title <regex> Only keep webpages and docs whose page title does not have a match with the given regular expression
-title-size <positive integer> … Only keep webpages and docs whose title length is equal to one of given lengths
-not-title-size <positive integer> … Only keep webpages and docs whose title length is not equal to any of given lengths
-title-size-more <positive integer> Only keep webpages and docs whose title length is more than the given length
-title-size-less <positive integer> Only keep webpages and docs whose title length is less than the given length
-content <regex> Only keep webpages and docs whose content has a match with the given regular expression
-not-content <regex> Only keep webpages and docs whose content does not have a match with the given regular expression
-content-size <positive integer> … Only keep webpages and docs whose content length is equal to one of given lengths
-not-content-size <positive integer> … Only keep webpages and docs whose content length is not equal to any of given lengths
-content-size-more <positive integer> Only keep webpages and docs whose content length is more than the given length
-content-size-less <positive integer> Only keep webpages and docs whose content length is less than the given length
-content-time-more <ISO-8601 time> Only keep webpages and docs whose content time is more than or equal to the given time
-content-time-less <ISO-8601 time> Only keep webpages and docs whose content time is less than or equal to the given time
-license <regex> Only keep webpages and docs whose software license has a match with the given regular expression
-not-license <regex> Only keep webpages and docs whose software license does not have a match with the given regular expression
-license-empty   Only keep webpages and docs whose software license is empty
-not-license-empty   Only keep webpages and docs whose software license is not empty
-language <regex> Only keep webpages and docs whose programming language has a match with the given regular expression
-not-language <regex> Only keep webpages and docs whose programming language does not have a match with the given regular expression
-language-empty   Only keep webpages and docs whose programming language is empty
-not-language-empty   Only keep webpages and docs whose programming language is not empty
-has-scrape   Only keep webpages and docs that have scraping rules (based on final URL)
-not-has-scrape   Only keep webpages and docs that do not have scraping rules (based on final URL)

Sort content

Sorting of fetched/loaded and filtered content. If sorted by their ID, then publications are first sorted by the PMID, then by the PMCID (if PMID is absent), then by the DOI (if PMID and PMCID are absent). Internally, the PMID, the PMCID and the DOI registrant are sorted numerically, DOIs within the same registrant alphabetically. If sorted by their URL, then webpages and docs are sorted alphabetically according to their startUrl.

Parameter Parameter args Description
-asc   Sort publications, webpages and docs by their ID/URL in ascending order
-desc   Sort publications, webpages and docs by their ID/URL in descending order
-asc-time   Sort publications, webpages and docs by their fetchTime in ascending order
-desc-time   Sort publications, webpages and docs by their fetchTime in descending order

Limit content

Fetched/loaded, filtered and sorted content can be limited to a given number of entries either in the front or back. The list of top hosts will also be limited.

Parameter Parameter args Description
-head <positive integer> Only keep the first given number of publications, webpages and docs (same for top hosts from publications, webpages and docs)
-tail <positive integer> Only keep the last given number of publications, webpages and docs (same for top hosts from publications, webpages and docs)

Update citations count

Parameter Parameter args Description
-update-citations-count <database file> Fetch and update the citations count and citations count last update timestamp of all publications resulting from the pipeline and put successfully updated publications to the given database
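A minimal sketch (database.db being an arbitrary file name) that refreshes the citations counts of all publications already present in a database:

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -update-citations-count database.db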

Put to database

Parameter Parameter args Description
-put <database file> Put all publications, webpages and docs resulting from the pipeline to the given database, overwriting any existing entries that have equal IDs/URLs

Remove from database

Parameter Parameter args Description
-remove <database file> From the given database, remove all publications, webpages and docs with IDs corresponding to IDs of publications, webpages and docs resulting from the pipeline

Output

Output final list of publications (or publication parts specified by --out-part), webpages and docs resulting from the pipeline to stdout or the specified text files in the format specified by the Output modifiers --plain and --format.

If --format text (the default) and --plain are specified and --out-part specifies only publication IDs, then publications will be output in the form <pmid>\t<pmcid>\t<doi>, one per line. Also in case of --format text --plain, if --out-part specifies only one publication part (that is not theAbstract or fulltext), then for each publication there will be only one line in the output, containing the plain text output of that publication part. Otherwise, there will be separator lines separating different publications in the output.

If --format html and --plain are specified and --out-part specifies only publication IDs, then the output will be an HTML table of publication IDs, with one row corresponding to one publication.

The full output format of --format json is specified later in JSON format. There is also a short description about the HTML and plain text outputs.

Additionally, there are operations to get the so-called top hosts: all host parts of URLs of visited sites of publications, of URLs of webpages and of URLs of docs, starting from the most common and including count numbers. This can be useful for example for finding hosts to write scraping rules for. When counting different hosts, comparison of hosts is done case-insensitively and “www.” is removed. Parameter -has-scrape can be added to only output hosts for which scraping rules could be found and parameter -not-has-scrape added to only output hosts for which no scraping rules could be found. Parameters -head and -tail can be used to limit the size of top hosts output.
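For example (an illustrative sketch using the -out-top-hosts operation from the table below; database.db is an arbitrary file name), the 20 most common hosts lacking scraping rules could be listed with:

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db -web-db database.db \
-db database.db -out-top-hosts -not-has-scrape -head 20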

For analysing the different sources of publication part content, there is an option to print a PublicationPartType vs PublicationPartName table in CSV format.

Parameter Parameter args Description
-out   Output publications (or publication parts specified by --out-part), webpages and docs to stdout in the format specified by the Output modifiers --plain and --format
-txt-pub <file> Output publications (or publication parts specified by --out-part) to the given file in the format specified by the Output modifiers --plain and --format
-txt-web <file> Output webpages to the given file in the format specified by the Output modifiers --plain and --format
-txt-doc <file> Output docs to the given file in the format specified by the Output modifiers --plain and --format
-count   Output count numbers for publications, webpages and docs to stdout
-out-top-hosts   Output all host parts of URLs of visited sites of publications, of URLs of webpages and of URLs of docs to stdout, starting from most common and including count number
-txt-top-hosts-pub <file> Output all host parts of URLs of visited sites of publications to the given file, starting from the most common and including count numbers
-txt-top-hosts-web <file> Output all host parts of URLs of webpages to the given file, starting from the most common and including count numbers
-txt-top-hosts-doc <file> Output all host parts of URLs of docs to the given file, starting from the most common and including count numbers
-count-top-hosts   Output number of different host parts of URLs of visited sites of publications, of URLs of webpages and of URLs of docs to stdout
-part-table   Output a PublicationPartType vs PublicationPartName table in CSV format to stdout, i.e. how many publications have content for the given publication part fetched from the given resource type

Output modifiers

Some parameters to influence the behaviour of outputting operations.

Parameter Parameter args Default Description
--plain     If specified, then any potential metadata will be omitted from the output
--format <Format> text Can choose between plain text output format (text), HTML format (html) and JSON format (json)
--out-part <PublicationPartName> …   If specified, then only the specified publication parts will be output (webpages and docs are not affected). Independent from the --fetch-part parameter.

Test

Operations for testing built-in and configurable scraping rules (e.g., -print-europepmc-xml and -test-europepmc-xml; -print-site and -test-site) are described in the scraping rules section.

Examples

Operations with IDs

As a first step in the pipeline of operations, some publication IDs, webpage URLs or doc URLs must be loaded (and possibly filtered). How to create and populate the database files used in this section is explained in the next section.


$ java -jar pubfetcher-cli-<version>.jar \
-pub 12345678 10.1093/nar/gkw199 -pub-file pub1.txt pub2.txt \
-pub-db database.db new.db \
-has-pmcid -doi '(?i)nmeth' \
-doi-url '^https://www.ebi.ac.uk/europepmc/' -doi-registrant 1038 \
-out-ids --plain

First, add two publication IDs from the command-line: a publication ID where the PMID is 12345678 and a publication ID where the DOI is 10.1093/nar/gkw199. Then add publication IDs from the text files pub1.txt and pub2.txt, where each line must be in the form <pmid>\t<pmcid>\t<doi> (except empty lines and lines beginning with #, which are ignored). Lastly, add all publication IDs found in the database files database.db and new.db. The resulting list of publication IDs is actually a set, meaning duplicate IDs will be merged.

Then, the publication IDs will be filtered. Parameter -has-pmcid means that only publication IDs that have a non-empty PMCID (probably meaning that the fulltext is available in PubMed Central) will be kept. Specifying -doi '(?i)nmeth' means that, in addition, the DOI part of the ID must have a match with “nmeth” (Nature Methods) case-insensitively (we specify case-insensitivity with “(?i)” because we are converting the DOIs to upper-case). With -doi-url we specify that the DOI was found first from the Europe PMC API and with -doi-registrant we specify that the DOI registrant code must be 1038 (Nature).

The resultant list of filtered publication IDs will be output to standard output as plain text with the parameter -out-ids. Specifying the modifier --plain means that the ID provenance URLs will not be output and the output of IDs will be in the form <pmid>\t<pmcid>\t<doi>.


$ java -jar pubfetcher-cli-<version>.jar \
-pub-db new.db -web-db new.db -not-in-db database.db \
-url '^https' -not-url-host bioconductor.org github.com \
-txt-ids-pub pub.json -txt-ids-web web.json --format json

First, add all publication IDs and all webpage URLs from the database file new.db. With -not-in-db we remove all publication IDs and webpage URLs that are already present in the database file database.db. With the regex ^https specified using the -url parameter only webpage URLs whose schema is HTTPS are kept. And with -not-url-host we remove all webpage URLs whose host part is bioconductor.org or github.com (or “www.bioconductor.org” or “www.github.com”) case-insensitively. The resultant list of publication IDs will be output to the file pub.json and the resultant list of webpage URLs will be output to the file web.json. The output will be in JSON format because it was specified using the --format modifier. By using --format html or --format html --plain we would get an HTML file instead, which when opened in a web browser would list the IDs and URLs as clickable links.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-has-pmid -asc-ids -head-ids 10 -txt-ids-pub oldpmid.txt --plain

Add all publication IDs from the database file database.db, only keep publication IDs that have a non-empty PMID part, order the publication IDs (smallest PMID first) and only keep the 10 first IDs. The resultant 10 publication IDs will be output to the file oldpmid.txt, where each line is in the form <pmid>\t<pmcid>\t<doi>.


$ java -jar pubfetcher-cli-<version>.jar \
-pub-file oldpmid.txt -pub 12345678 -remove-ids database.db

publications that have a small PMID are possibly in the database by mistake. So we can review the file oldpmid.txt generated in the previous step, keeping in it only the entries we want to remove from the database. Then, with the last command, we add publication IDs from the file oldpmid.txt, manually add an extra publication ID with PMID 12345678 from the command-line and with -remove-ids remove all publications corresponding to the resultant list of publication IDs from the database file database.db.


Get content

Next, we’ll see how content can be fetched/loaded and how database files (such as those used in the previous section) can be populated with content.


$ java -jar pubfetcher-cli-<version>.jar -db-init database.db

This creates a new empty database file called database.db.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --timeout 30000 -usable -put database.db

Add all publication IDs from the file pub.txt (where each line is in the form <pmid>\t<pmcid>\t<doi>) and for each ID put together a publication with content fetched from different resources, thus getting a list of publications. The connect and read timeout is changed from the default value of 15 seconds to 30 seconds with the general Fetching parameter timeout. Filter out non-usable publications from the list with parameter -usable and put all publications from the resultant list to the database file database.db. Any existing publication with an ID equal to an ID of a new publication will be overwritten.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch-put database.db --timeout 30000

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -not-usable -remove database.db

If parameters -fetch and -put are used, then first all publications are fetched and loaded into memory, and only then are all publications saved to the database file at once. This is not optimal if there are a lot of publications to fetch, because if a severe error occurs all content will be lost. Using the parameter -fetch-put, each publication will be put to the database right after it has been fetched. This has the downside of not being able to filter publications before they are put to the database. One way around this is to put all content to the database while fetching and then remove some of the entries from the database based on the required filters, as illustrated by the second command.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-db-fetch database.db --threads 16 -usable -count

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -count

With parameter -db-fetch the following happens for each publication: first the publication is looked for in the database; if found, it will be updated with fetched content, if possible and required, and saved back to the database file; if not found, a new publication will be put together with fetched content and put to the database file. This potentially enables less fetching in the future and enables progressive improvement of some publications over time. Additionally, in contrast to -fetch and -fetch-put, operation -db-fetch is multithreaded (with the number of threads specified using --threads), thus much quicker.

Like with -fetch-put, publications can’t be filtered before they are put to the database. Any specified filter parameters will only have an effect on which content is retained in memory for further processing (like outputting) down the pipeline. For example, with -usable -count, the number of usable publications is output to stdout after fetching is done, but both usable and non-usable publications were saved to the database file, as can be seen with the -count of the second command.


$ java -jar pubfetcher-cli-<version>.jar -db-init new.db

$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-db-fetch new.db --threads 16 -usable -count

$ java -jar pubfetcher-cli-<version>.jar -pub-db new.db \
-db new.db -not-usable -remove new.db

$ java -jar pubfetcher-cli-<version>.jar -pub-db new.db \
-db new.db -put database.db

Sometimes, we may want only “fresh” entries (fetched only once and not updated), like -fetch and -fetch-put provide, but with multithreading support, like -db-fetch provides, and with filtering support, like -fetch provides. Then, the above sequence of commands can be used: make a new database file called new.db; fetch entries to new.db using 16 threads; filter out non-usable entries from new.db; and put content from new.db to our main database file, overwriting any existing entries there.

Another similar option would be to disable updating of entries by setting the retryLimit to 0 and emptyCooldown, nonFinalCooldown, fetchExceptionCooldown to a negative number.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-db-fetch-end database.db --threads 16

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -usable -count

Parameter -db-fetch will, in addition to saving entries to the database file, load all entries into memory while fetching for further processing (like outputting) down the pipeline. This might cause excessive memory usage if a lot of entries are fetched. Thus, parameter -db-fetch-end is provided, which is like -db-fetch except that it does not retain any of the entries in memory. Any further filtering, outputting, etc. can be done on the database file after fetching with -db-fetch-end is done, as shown with the provided second command.


$ java -jar pubfetcher-cli-<version>.jar \
-pub-file pub.txt -web-file web.txt -doc-file doc.txt \
-db-fetch-end database.db --threads 16 --log database.log

An example of a comprehensive and quick fetching command: add all provided publication IDs, webpage URLs and doc URLs, fetch all corresponding publications, webpages and docs, using 16 threads for this process and saving the content to the database file database.db, and append all log messages to the file database.log for possible future reference and analysis.


Loading content

After content has been fetched, e.g. using one of commands in the previous section, it can be loaded and explored.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db --pre-filter -oa -journal-title 'Nature' \
-not-part-empty fulltext -out | less

From the database file database.db, load all publications that are Open Access, that are from a journal whose title has a match with the regular expression Nature and whose fulltext part is not empty, and output these publications with metadata and in plain text to stdout, from where output is piped to the pager less. Specifying --pre-filter means that content is filtered while being loaded from the database, so that entries not passing the filter are never retained in memory. If --pre-filter were not specified, then all entries corresponding to the added publication IDs would first be loaded into memory at once and only then would entries start to be removed by the specified filters. This has the advantage that the log messages show how many entries pass each filter; however, if the number of added and filtered publication IDs is very big, it could be better to use --pre-filter to avoid excessive memory usage.


Limit fetching/loading

For testing or memory reduction purposes the number of fetched/loaded entries can be limited with --limit.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --limit 3 -out | less

Only fetch and output the first 3 publications listed in pub.txt.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --limit 3 --pre-filter -oa -out | less

Only fetch and output the first 3 Open Access publications listed in pub.txt. Using --pre-filter means that filtering is done before limiting the entries, meaning that more than 3 entries might be fetched, because fetching happens until a third Open Access publication is encountered, but exactly 3 entries are output (if there are enough publications listed in pub.txt). If --pre-filter were not used, then exactly 3 entries would be fetched (if there are enough publications listed in pub.txt), meaning that fewer than 3 entries might be output, because not all of the publications might be Open Access.


Fetch only some publication parts

If we are only interested in some publication parts, it might be advantageous to list them explicitly. This might make fetching faster, because we can skip Internet resources that can’t provide any of the missing parts we are interested in, or stop fetching new resources altogether once all parts we are interested in are final.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --fetch-part title theAbstract -out | less

Only fetch the title and theAbstract for the added publication IDs; all other publication parts (except IDs) will be empty in the output.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --fetch-part title theAbstract \
-out --out-part title theAbstract --plain | less

If only title and theAbstract are fetched, then all other publication parts (except IDs) will be empty, so we might not want to output these empty parts. This can be done by specifying the title and theAbstract parts with --out-part. Additionally specifying --plain means no metadata is output either, so the output will consist of only plain text publication titles and abstracts with separating characters between different publications.


Converting IDs

As a special case of the ability to only fetch some publication parts, PubFetcher can be used as an ID converter between PMID/PMCID/DOI.


$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-fetch --fetch-part pmid pmcid doi --out-part pmid pmcid doi \
-txt-pub newpub.txt --plain

Take all publication IDs from pub.txt (where each line is in the form <pmid>\t<pmcid>\t<doi>) and for each ID fetch only publication parts the PMID, the PMCID and the DOI and output only these parts to the file newpub.txt. In the output file each line will be in the form <pmid>\t<pmcid>\t<doi>, because ID provenance URLs are excluded with --plain and no other publication parts are output. If the goal is to convert only DOI to PMID and PMCID, for example, then each line in pub.txt could be in the form \t\t<doi> and parameters specified as --fetch-part pmid pmcid --out-part pmid pmcid.
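
For instance, a DOI-only conversion could look something like the following sketch, where doi.txt (with each line of the form \t\t<doi>) and newpub-pmid-pmcid.txt are example file names; each output line would then be of the form <pmid>\t<pmcid>:

$ java -jar pubfetcher-cli-<version>.jar -pub-file doi.txt \
-fetch --fetch-part pmid pmcid --out-part pmid pmcid \
-txt-pub newpub-pmid-pmcid.txt --plain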


$ java -jar pubfetcher-cli-<version>.jar -db-init newpub.db

$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-db-fetch-end newpub.db --threads 16 --fetch-part pmid pmcid doi

$ java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt \
-db newpub.db --out-part pmid pmcid doi -txt-pub newpub.txt --plain

If a lot of publication IDs are to be converted, it would be better to first fetch all publications to a resumable temporary database file, using the multithreaded -db-fetch-end, and only then output the parts the PMID, the PMCID and the DOI to the file newpub.txt.


$ java -jar pubfetcher-cli-<version>.jar -pub-db newpub.db \
-db newpub.db -part-table

We can output a PublicationPartType vs PublicationPartName table in CSV format to see from which resources the converted IDs were obtained. Most likely the large majority will be from Europe PMC (e.g., https://www.ebi.ac.uk/europepmc/webservices/rest/search?resulttype=core&format=xml&query=ext_id:17478515%20src:med). DOIs with a type other than “europepmc”, “pubmed” or “pmc” were not converted to DOI by the corresponding resource but just confirmed by it (as fetching that resource required the knowledge of a DOI in the first place). Type “external” means that the supplied ID was not found and confirmed in any resource.


In one instance of around 10000 publications, the usefulness of PubFetcher for ID conversion alone manifested itself mostly in finding PMCIDs. But even then, around 97% of PMCIDs were present in Europe PMC. As to the rest, around 2% were of type “link_oadoi” (i.e., found using Unpaywall) and around 1% were of type “pubmed_xml” (i.e., present in PubMed, but not Europe PMC, although these were mostly articles which had been assigned a PMCID but were not yet actually available due to delayed release (embargo)). In the case of PMIDs the usefulness is even smaller, consisting mostly of finding a corresponding PMID (if missing) for a PMCID found using a source other than Europe PMC. And in the case of DOIs, only a couple (out of 10000) were found from resources other than Europe PMC (mostly because initially only a PMCID was supplied and that PMCID was not present in Europe PMC).

So in conclusion, PubFetcher gives an advantage of a few percent over simply using an XML returned by the Europe PMC API when finding PMCIDs for articles (but also when converting from DOI to PMID), but gives almost no advantage when converting from PMID to DOI.

Filtering content

There are many possible filters, all of which are defined above in the section Filter content.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db -web-db \
database.db -doc-db database.db -db database.db -usable -grep 'DNA' \
-oa -pub-date-more 2018-08-15T00:00:00Z -citations-count-more 9 \
-corresp-author-size 1 2 -part-size-more 2 -part-size-more-part \
keywords mesh -part-type europepmc_xml pmc_xml doi -part-type-part \
fulltext -part-time-more 2018-08-15T12:00:00Z -part-time-more-part \
fulltext -title '(?i)software|database' -status-code-more 199 \
-status-code-less 300 -not-license-empty -has-scrape -asc -out | less

This example will load all content (publications, webpages and docs) from the database file database.db and apply the following filters (ANDed together) to remove content before it is sorted in ascending order and output:

Parameter Description
-usable Only usable publications, usable webpages and usable docs will be kept
-grep 'DNA' Only publications, webpages and docs whose whole content (excluding metadata) has a match with the regular expression “DNA” (i.e., contains the string “DNA”)
-oa Only keep publications that are Open Access
-pub-date-more 2018-08-15T00:00:00Z Only keep publications whose publication date is 2018-08-15 or later
-citations-count-more 9 Only keep publications that are cited more than 9 times
-corresp-author-size 1 2 Only keep publications for which 1 or 2 corresponding authors were found (i.e., publications with no found corresponding authors or with more than 2 corresponding authors are discarded)
-part-size-more 2 -part-size-more-part keywords mesh Only keep publications that have more than 2 keywords and more than 2 MeSH terms
-part-type europepmc_xml pmc_xml doi -part-type-part fulltext Only keep publications whose fulltext part is of type “europepmc_xml”, “pmc_xml” or “doi”
-part-time-more 2018-08-15T12:00:00Z -part-time-more-part fulltext Only keep publications whose fulltext part has been obtained at 2018-08-15 noon (UTC) or later
-title '(?i)software|database' Only keep webpages and docs whose page title has a match with the regular expression (?i)software|database (i.e., contains case-insensitively “software” or “database”)
-status-code-more 199 -status-code-less 300 Only keep webpages and docs whose status code is 2xx
-not-license-empty Only keep webpages and docs that have a non-empty software license name present
-has-scrape Only keep webpages and docs for which scraping rules are present

Terminal operations

Operations that are done on the final list of entries. If multiple such operations are specified in one command, then they will be performed in the order they are defined in this reference.
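
For example, more than one terminal operation can be given in a single command. The sketch below (with oa.db being an example database file assumed to have been created beforehand with -db-init) would both update the citations count of Open Access publications in database.db and copy those publications to oa.db, with the relative order of the two operations being the order in which they are defined in this reference:

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -oa -update-citations-count database.db -put oa.db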


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -oa -update-citations-count database.db

Load all publications from the database file database.db, update the citations count of all Open Access publications and save successfully updated publications back to the database file database.db.


$ java -jar pubfetcher-cli-<version>.jar -db-init oapub.db

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -oa -put oapub.db

Copy all Open Access publications from the database file database.db to the new database file oapub.db.


$ java -jar pubfetcher-cli-<version>.jar -pub-db new.db -db new.db \
-put database.db

Copy all publications from the database file new.db to the database file database.db, overwriting any existing entries in database.db.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -not-oa -remove database.db

Remove all publications that are not Open Access from the database file database.db.


$ java -jar pubfetcher-cli-<version>.jar -pub-db other.db \
-remove-ids database.db

Remove from the database file database.db all publications that are also present in the database file other.db. As removal is done based on all IDs found in other.db and no filtering based on the content of entries needs to be done, content from the database file other.db is not loaded, and -remove-ids must be used instead of -remove for removal from the database file database.db.


Output

Output can happen to stdout or text files in plain text, HTML or JSON, with or without metadata.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -out | less

Output all publications from the database file database.db to stdout in plain text and with metadata and pipe stdout to the pager less.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-web-db database.db -db database.db \
-txt-pub pub.html -txt-web web.html --format html

Output all publications and webpages from the database file database.db in HTML format and with metadata to the files pub.html and web.html respectively.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db \
-txt-pub pubids.html --out-part pmid pmcid doi --format html --plain

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-txt-ids-pub pubids.html --format html --plain

Both commands will output all publication IDs from the database file database.db as an HTML table to the file pubids.html.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -out --out-part mesh --format text --plain

Output the MeSH terms of all publications from the database file database.db to stdout in plain text and without metadata. As only one publication part (that is not theAbstract or fulltext) is output without metadata, there will be one line of output (a list of MeSH terms) for each publication.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-web-db database.db -db database.db \
-out-top-hosts -head 10 -not-has-scrape

From the database file database.db, output the host parts of URLs of visited sites of publications and of URLs of webpages for which no scraping rules could be found, starting from the most common, including count numbers, and limiting output to the first 10 hosts in both cases. This could be useful for finding hosts to add scraping rules for.


$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -part-table > part-table.csv

From all publications in the database file database.db, generate a PublicationPartType vs PublicationPartName table in CSV format and output it to the file part-table.csv.


Export to JSON

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-web-db database.db -doc-db database.db -db database.db \
-txt-pub pub.json -txt-web web.json -txt-doc doc.json --format json

Output all publications, webpages and docs from the database file database.db in JSON format and with metadata to the files pub.json, web.json and doc.json respectively. That is, export all content in JSON, so that the database file and PubFetcher itself would not be needed again for further work with the data.


Notes

The syntax of regular expressions is as defined in Java, see documentation of the Pattern class: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.

The ISO-8601 times must be specified like “2018-08-31T13:37:51Z” or “2018-08-31T13:37:51.123Z”.

All publication DOIs are normalised; this effect can be tested with the -normalise-doi method.
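
For example, assuming -normalise-doi accepts a DOI string as its argument (an assumption here, as the exact syntax is defined elsewhere in this reference), the normalised form of a DOI could be checked like this:

$ java -jar pubfetcher-cli-<version>.jar -normalise-doi '<doi>'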

webpages and docs have the same structure, equivalent methods and common scraping rules; they just provide separate stores for saving general web pages and documentation web pages respectively.

If an entry is final (and without a fetching exception) in a database, then it can never be refetched again (only the citations count can be updated). If that entry needs to be refreshed for some reason, then -fetch or -fetch-put must be used to fetch a completely new entry and overwrite the old one in the database.

On the other hand, -db-fetch or -db-fetch-end could be used multiple times after some interval to try to complete non-final entries, e.g. web servers that were offline might be up again, some resources have been updated with extra content or we have updated some scraping rules. For example, the command java -jar pubfetcher-cli-<version>.jar -pub-file pub.txt -db-fetch-end database.db could be run a week after the same command was initially run.

Limitations

The querying capabilities of PubFetcher are rather rudimentary (compared to SQL), but hopefully enough for most use cases.


For example, different filters are ANDed together and there is no support for OR. As a workaround, different conditions can be output to temporary files of IDs/URLs that can then be put together. For example, output all publications from the database file database.db that are cited more than 9 times or that have been published on 2018-01-01 or later to stdout:

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -citations-count-more 9 \
-txt-pub pub_citations.txt --out-part pmid pmcid doi --plain

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -pub-date-more 2018-01-01T00:00:00Z \
-txt-pub pub_pubdate.txt --out-part pmid pmcid doi --plain

$ java -jar pubfetcher-cli-<version>.jar -pub-file pub_citations.txt \
pub_pubdate.txt -db database.db -out | less

Some advanced filtering might not be possible, because some command-line switches can’t be specified twice. For example, the filter -part-size-more 2 -part-size-more-part keywords -part-size-more 999 -part-size-more-part theAbstract will not keep only entries that have more than 2 keywords and whose theAbstract is longer than 999 characters, but will instead result in an error. As a workaround, the filter can be broken down and the results of the different conditions saved in temporary database files that can then be ANDed together:

$ java -jar pubfetcher-cli-<version>.jar -db-init pub_keywords.db

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -part-size-more 2 -part-size-more-part keywords \
-put pub_keywords.db

$ java -jar pubfetcher-cli-<version>.jar -db-init pub_abstract.db

$ java -jar pubfetcher-cli-<version>.jar -pub-db database.db \
-db database.db -part-size-more 999 -part-size-more-part theAbstract \
-put pub_abstract.db

$ java -jar pubfetcher-cli-<version>.jar -pub-db pub_keywords.db \
-in-db pub_abstract.db -db database.db -out | less

In the pipeline the operations are done in the order they are defined in this reference, and with one command the pipeline is run only once. This means, for example, that it is not possible to filter some content and then refetch the filtered entries using only one command, because content loading/fetching happens before content filtering. In such cases, intermediate results can be saved to temporary files, which can be used by the next command to get the desired outcome. For example, get all publications from the database file database.db that have a visited site whose URL has a match with the regular expression academic\.oup\.com|[a-zA-Z0-9.-]*sciencemag\.org and refetch those publications from scratch, overwriting the corresponding old publications in database.db:

$ java -jar pubfetcher-cli-<version>.jar \
-pub-db database.db -db database.db \
-visited 'academic\.oup\.com|[a-zA-Z0-9.-]*sciencemag\.org' \
-txt-pub oup_science.txt --out-part pmid pmcid doi --plain

$ java -jar pubfetcher-cli-<version>.jar -pub-file oup_science.txt \
-fetch-put database.db