Output

Database

The database file that is used by PubFetcher to save publications, webpages and docs on disk is a simple key-value store generated by the MapDB library.

In the webpages and docs stores, a key is simply the string representing the startUrl, i.e. the URL given to PubFetcher to fetch content from. The resolved finalUrl might differ from the startUrl (for example, a redirection from HTTP to HTTPS might happen), so the database can contain webpages and docs with equal final URLs that had different start URLs. Note also that webpages and docs have the same structure; they are simply two entirely separate stores, one for general web pages and one for documentation web pages.

Publications can be identified by three separate IDs: a PMID, a PMCID or a DOI. They are therefore stored as follows. A key in the publications store (which can be called the primary ID of the publication) is either a PMID, a PMCID or a DOI, depending on which of them was non-empty when the publication was first saved to the database. If more than one of them was available, then the PMID is preferred over the PMCID and the PMCID is preferred over the DOI.

Then, there is an extra store called “publicationsMap”, where a key is an ID (PMID/PMCID/DOI) of a publication and the corresponding value is the primary ID (PMID/PMCID/DOI) of that publication. So, for example, if a publication is to be loaded from the database, publicationsMap is first consulted to find the primary ID, and the found primary ID is then used to find the publication in the publications store. All the mappings in publicationsMap can be dumped to stdout with -db-publications-map.

There is also a store called “publicationsMapReverse”, which has mappings that are the reverse of the publicationsMap mappings, that is, from primary ID to the triplet PMID, PMCID, DOI. In addition, publicationsMapReverse stores the URLs where these PMID, PMCID and DOI were found. This reverse mapping can be useful, for example, for quickly listing all publication IDs (as the triplet PMID, PMCID, DOI) found in a database file. All the mappings in publicationsMapReverse can be dumped to stdout with -db-publications-map-reverse.

The stores publicationsMapReverse and publicationsMap and the publications store are all kept coherent and in sync with each other. Note also that all stored DOIs are normalised, i.e. any valid prefix is removed (e.g. “https://doi.org/”, “doi:”) and letters from the 7-bit ASCII set are converted to uppercase.
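The lookup scheme described above can be illustrated with a minimal Python sketch. It is not PubFetcher's implementation (which is written in Java on top of MapDB): plain dictionaries stand in for the stores, the function names are hypothetical, the list of DOI prefixes is an assumed and incomplete sample, and the provenance URLs kept in publicationsMapReverse are left out.

```python
# Illustrative model only: dictionaries stand in for the MapDB stores.
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")  # assumed sample of valid prefixes

publications = {}              # primary ID -> publication
publications_map = {}          # any ID (PMID/PMCID/DOI) -> primary ID
publications_map_reverse = {}  # primary ID -> (PMID, PMCID, DOI)


def normalise_doi(doi):
    """Remove a known prefix, then uppercase ASCII letters only."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi[:len(prefix)].lower() == prefix:
            doi = doi[len(prefix):]
            break
    # str.upper() would also change non-ASCII letters, so restrict to a-z.
    return "".join(c.upper() if "a" <= c <= "z" else c for c in doi)


def primary_id(pmid, pmcid, doi):
    """PMID is preferred over PMCID, and PMCID over DOI."""
    for candidate in (pmid, pmcid, doi):
        if candidate:
            return candidate
    raise ValueError("at least one ID must be non-empty")


def save(pmid, pmcid, doi, publication):
    """Store a publication and keep all three stores in sync."""
    doi = normalise_doi(doi) if doi else ""
    key = primary_id(pmid, pmcid, doi)
    publications[key] = publication
    for some_id in (pmid, pmcid, doi):
        if some_id:
            publications_map[some_id] = key
    publications_map_reverse[key] = (pmid, pmcid, doi)
    return key


def load(some_id):
    """Two-step lookup: any ID -> primary ID -> publication."""
    key = publications_map.get(some_id)
    return None if key is None else publications.get(key)
```

With made-up IDs, save("", "PMC123", "doi:10.1000/abc", {...}) would use "PMC123" as the primary ID, and the publication could afterwards be loaded through either "PMC123" or the normalised DOI "10.1000/ABC".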

The structure of the values in the publications, webpages and docs stores, i.e. the actual contents stored in the database, is best described by the next section JSON output, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the “empty”, “usable”, “final”, “totallyFinal” and “broken” fields present in the JSON output are not stored in the database; their values are inferred from actual database values and depend on some fetching parameters. Additionally, the fields “version” and “argv” are specific to JSON.

With a new release of PubFetcher, the structure of the database content might change (this involves code in the package org.edamontology.pubfetcher.core.db). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).

JSON output

The output of PubFetcher will be in JSON format if the option --format json is specified. If the option --plain is additionally specified, then fields about metadata will be omitted from the output. JSON support is implemented using libraries from the Jackson project.

Common

All JSON output will contain the fields “version” and “argv”.

version

Information about the application that generated this JSON file

name: Name of the application
url: Homepage of the application
version: Version of the application

argv

Array of all command-line parameters that were supplied to the application that generated this JSON file

IDs

JSON output of IDs/URLs, output using -out-ids, -txt-ids-pub, -txt-ids-web or -txt-ids-doc.

IDs of publications

Publications are identified by the triplet PMID, PMCID and DOI.

publicationIds

Array of publication IDs

pmid: The PubMed ID of the publication. Only articles available in PubMed can have this.
pmcid: The PubMed Central ID of the publication. Only articles available in PMC can have this.
doi: The Digital Object Identifier of the publication
pmidUrl: Provenance URL of the PMID
pmcidUrl: Provenance URL of the PMCID
doiUrl: Provenance URL of the DOI

If --plain is specified, then the provenance URLs are not output.
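As a rough illustration of consuming this output, the following Python sketch extracts the ID triplets from a JSON document shaped like the fields listed above. The sample values are made-up placeholders, the function name is hypothetical, and how an absent ID is actually encoded (empty string, null or a missing key) is an assumption that the sketch tries to tolerate.

```python
import json

# Made-up sample following the field names above; the second entry has no
# provenance URLs, as would be the case with --plain.
SAMPLE = """
{
  "publicationIds": [
    {"pmid": "1", "pmcid": "", "doi": "10.1000/EXAMPLE",
     "pmidUrl": "", "pmcidUrl": "", "doiUrl": ""},
    {"pmid": "", "pmcid": "", "doi": "10.1000/OTHER"}
  ]
}
"""


def id_triplets(document):
    """Return (PMID, PMCID, DOI) tuples, using "" for any absent ID."""
    return [
        (entry.get("pmid") or "", entry.get("pmcid") or "", entry.get("doi") or "")
        for entry in document.get("publicationIds", [])
    ]


for triplet in id_triplets(json.loads(SAMPLE)):
    print(triplet)
```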

URLs of webpages

Webpages are identified by a URL.

webpageUrls: Array of webpage URLs

URLs of docs

Docs are identified by a URL.

docUrls: Array of doc URLs

Contents

JSON output of the entire content of publications, webpages and docs, output using -out, -txt-pub, -txt-web and -txt-doc.

Content of publications

A publication represents one publication (most often a research paper) and contains its ID (a PMID, a PMCID and/or a DOI), content (title, abstract, full text), keywords (user-assigned, MeSH and mined EFO and GO terms) and various metadata (Open Access flag, journal title, publication date, etc).

publications

Array of publications

fetchTime

Time of initial fetch or last retryCounter reset as UNIX time (in milliseconds)

fetchTimeHuman

Time of initial fetch or last retryCounter reset as ISO 8601 combined date and time

retryCounter

A refetch can occur if the value of retryCounter is less than retryLimit; or if any of the cooldown times (in fetching parameters) of a currently true condition have passed since fetchTime, in which case retryCounter is also reset
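The refetch condition can be sketched as a small Python function. This is only an interpretation of the sentence above, not PubFetcher's code: the function and parameter names are hypothetical, cooldowns are assumed to be given in minutes, and a cooldown is treated as passed once the elapsed time is at least as long as the cooldown.

```python
def can_refetch(retry_counter, retry_limit, fetch_time_ms, now_ms, active_cooldowns_min):
    """Return (refetch_allowed, retry_counter_after_check).

    active_cooldowns_min holds the cooldown times (assumed to be in minutes)
    of only those conditions that are currently true for the entry.
    """
    if retry_counter < retry_limit:
        return True, retry_counter
    elapsed_ms = now_ms - fetch_time_ms
    for cooldown_min in active_cooldowns_min:
        if elapsed_ms >= cooldown_min * 60_000:
            # A cooldown has passed: refetching is allowed and the counter is reset.
            return True, 0
    return False, retry_counter
```

For example, with a retry limit of 3 and a counter already at 3, a refetch would only be allowed once one of the active cooldowns has elapsed since fetchTime.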

fetchException

true if there was a fetching exception during the last fetch; false otherwise

oa

true if the article is Open Access; false otherwise

preprint

true if the article is a preprint; false otherwise

journalTitle

Title of the journal the article was published in

pubDate

Publication date of the article as UNIX time (in milliseconds); negative, if unknown

pubDateHuman

Publication date of the article as ISO 8601 date; before 1970-01-01, if unknown

citationsCount

Number of times the article has been cited (according to Europe PMC); negative, if unknown

citationsTimestamp

Time when citationsCount was last updated as UNIX time (in milliseconds); negative, if citationsCount has not yet been updated

citationsTimestampHuman

Time when citationsCount was last updated as ISO 8601 combined date and time; before 1970-01-01T00:00:00.000Z, if citationsCount has not yet been updated
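The relation between the millisecond fields and their “Human” counterparts can be illustrated with a short Python sketch. The helper names are hypothetical, and the output format (UTC with millisecond precision and a trailing “Z”) is assumed from the example value “1970-01-01T00:00:00.000Z” given above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_iso(ms):
    """Convert UNIX time in milliseconds to an ISO 8601 combined date and time (UTC)."""
    moment = EPOCH + timedelta(milliseconds=ms)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"


def is_known(ms):
    """Negative values are used to mean "unknown" or "not yet updated"."""
    return ms >= 0


print(ms_to_iso(0))    # 1970-01-01T00:00:00.000Z
print(ms_to_iso(-1))   # 1969-12-31T23:59:59.999Z, i.e. before 1970-01-01
print(is_known(-1))    # False
```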

correspAuthor

Array of objects representing corresponding authors of the article

name: Name of the corresponding author
orcid: ORCID iD of the corresponding author
email: E-mail of the corresponding author
phone: Telephone number of the corresponding author
uri: Web page of the corresponding author

visitedSites

Array of objects representing sites visited for getting content (outside of standard Europe PMC, PubMed and oaDOI resources and also excluding PDFs)

url: URL of the visited site
type: The type of the site (as resource)
from: URL where the link of the site was picked up
timestamp: Time when the link of the site was picked up as UNIX time (in milliseconds)
timestampHuman: Time when the link of the site was picked up as ISO 8601 combined date and time

empty

true, if all publication parts (except IDs) are empty; false otherwise

usable

true, if at least one publication part (apart from IDs) is usable; false otherwise

final

true, if title, abstract and fulltext are final; false otherwise

totallyFinal

true, if all publication parts are final; false otherwise

pmid

A publication part (like the following pmcid, doi, title, etc), in this case representing the publication PMID

content: Content of the publication part (in this case, the publication PMID as a string)
type: The type of the publication part content source
url: URL of the publication part content source
timestamp: Time when the publication part content was set as UNIX time (in milliseconds)
timestampHuman: Time when the publication part content was set as ISO 8601 combined date and time
size: Number of characters in the content
empty: true, if the content is empty (size is 0); false otherwise
usable: true, if the content is long enough (the threshold can be influenced by fetching parameters), in other words, if the publication part content can be used as input for other applications; false otherwise
final: true, if the content is from a reliable source and is long enough, in other words, if there is no need to try fetching the publication part content from another source; false otherwise
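How the inferred fields of a publication part relate to its stored fields can be sketched as follows. This is an illustration under assumptions, not the actual logic: the function name, the min_length threshold and the set of reliable source types are hypothetical parameters, since the real values depend on the part and on fetching parameters.

```python
def part_flags(content, source_type, min_length, reliable_types):
    """Derive size, empty, usable and final for one publication part.

    min_length and reliable_types are hypothetical stand-ins for the
    thresholds and source types that PubFetcher actually uses.
    """
    size = len(content)
    usable = size >= min_length
    return {
        "size": size,
        "empty": size == 0,
        "usable": usable,
        # Final: reliable source and long enough, so no other source needs trying.
        "final": usable and source_type in reliable_types,
    }


flags = part_flags("A made-up title", "europepmc", min_length=4, reliable_types={"europepmc"})
print(flags)
```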

pmcid

Publication part representing the publication PMCID. Structure same as in pmid.

doi

Publication part representing the publication DOI. Structure same as in pmid.

title

Publication part representing the publication title. Structure same as in pmid.

keywords

Publication part representing publication keywords. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list: Array of strings representing publication keywords

mesh

Publication part representing publication MeSH terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list

Array of objects representing publication MeSH terms

term: Term name
majorTopic: true, if the term is a major topic of the article
uniqueId: MeSH Unique Identifier

efo

Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list

Array of objects representing publication EFO terms

term: Term name
count: Number of times the term was mined from full text by Europe PMC
uri: Unique URI to the ontology term

go

Publication part representing publication GO terms. Structure same as in efo.

abstract

Publication part representing the publication abstract. Structure same as in pmid.

fulltext

Publication part representing the publication fulltext. Structure same as in pmid.

If --plain is specified, then metadata is omitted from the output (everything from fetchTime to totallyFinal) and the value corresponding to publication part keys (pmid to fulltext) will be the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go) as specified above for each corresponding part.
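The effect of --plain on a publication can be illustrated by a Python sketch that converts a full publication object into its plain form. The function name and the sample data are made up; only the key names are taken from the description above.

```python
# Publication part keys in document order, with the key holding their value.
PART_VALUE_KEY = {
    "pmid": "content", "pmcid": "content", "doi": "content", "title": "content",
    "keywords": "list", "mesh": "list", "efo": "list", "go": "list",
    "abstract": "content", "fulltext": "content",
}


def to_plain(publication):
    """Drop metadata and replace each part object with its content or list."""
    return {
        key: publication[key][value_key]
        for key, value_key in PART_VALUE_KEY.items()
        if key in publication
    }


# Made-up, heavily abbreviated publication object.
full = {
    "fetchTime": 0,
    "oa": False,
    "pmid": {"content": "1", "type": "example", "size": 1},
    "keywords": {"list": ["example keyword"], "type": "example", "size": 1},
}
print(to_plain(full))  # {'pmid': '1', 'keywords': ['example keyword']}
```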

If --out-part is specified, then everything from fetchTime to totallyFinal will be omitted from the output and only publication parts specified by --out-part will be output (with structure as specified above). If --plain is specified along with --out-part, then output parts will only have as value the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go).

Content of webpages

A webpage represents a general web page from where relevant content has been extracted, along with some metadata. If the web page is about a software tool, then the software license and programming language can be stored separately, if found (this feature has been added to support EDAMmap).

webpages

Array of webpages

fetchTime: Same as fetchTime of publications
fetchTimeHuman: Same as fetchTimeHuman of publications
retryCounter: Same as retryCounter of publications
fetchException: Same as fetchException of publications
startUrl: URL given as webpage identifier, same as listed by webpageUrls
finalUrl: Final URL after potential redirections
contentType: HTTP Content-Type header
statusCode: HTTP status code
contentTime: Time when current webpage content was last set as UNIX time (in milliseconds)
contentTimeHuman: Time when current webpage content was last set as ISO 8601 combined date and time
license: Software license of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
language: Programming language of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
titleLength: Number of characters in the webpage title
contentLength: Number of characters in the webpage content
title: The webpage title (as extracted by the corresponding scraping rule; or text from the HTML <title> element if scraping rules were not found)
empty: true, if webpage title and webpage content are empty; false otherwise
usable: true, if the length of webpage title plus the length of webpage content is large enough (at least webpageMinLength characters), that is, the webpage can be used as input for other applications; false otherwise
final: true, if the webpage is not broken and the webpage is usable and the length of the webpage content is larger than 0; false otherwise
broken: true, if the webpage with the given URL could not be fetched (based on the values of statusCode and finalUrl); false otherwise
content: The webpage content (as extracted by the corresponding scraping rule; or the automatically cleaned content from the entire HTML of the page if scraping rules were not found)

If --plain is specified, then only startUrl, webpage title and webpage content will be present.
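The inferred webpage fields can be expressed compactly in Python. This is a sketch of the definitions above rather than the real implementation: the function name is hypothetical, and broken is taken as an input because the exact rule based on statusCode and finalUrl is not spelled out here.

```python
def webpage_flags(title, content, broken, webpage_min_length):
    """Derive empty, usable and final from title, content and the broken flag."""
    title_length = len(title)
    content_length = len(content)
    usable = title_length + content_length >= webpage_min_length
    return {
        "titleLength": title_length,
        "contentLength": content_length,
        "empty": title_length == 0 and content_length == 0,
        "usable": usable,
        "final": (not broken) and usable and content_length > 0,
    }


print(webpage_flags("Tool", "Some extracted text", broken=False, webpage_min_length=10))
```

Note that under these definitions a webpage with a long title but no content can be usable without being final.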

Content of docs

Like Content of webpages, except it allows for a separate store for documentation web pages.

docs

Array of docs

Structure is same as in webpages

HTML and plain text output

Output will be in HTML format, if --format html is specified, and in plain text, if --format text is specified or --format is omitted (as text is the default).

The HTML output is meant to be formatted and viewed in a web browser. Links to external resources (such as the different URL fields) are clickable in the browser.

The plain text output is formatted for viewing in the console or in a text editor.

Both the HTML output and the plain text output will contain the same information as the JSON output specified above and will behave analogously in respect to the --plain and --out-part parameters. There are however a few fields that are missing in HTML and plain text compared to JSON: “empty”, “usable”, “final”, “totallyFinal”, “broken” (these values are inferred from the values of some other fields and depend on some fetching parameters) and the JSON specific “version” and “argv”.

Log file

PubFetcher-CLI will log to stderr using the Apache Log4j 2 library. With the --log parameter (described in Logging), a text file can be specified to which the same log will also be written.

Each log line will consist of the following: the date and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format “2018-08-24 11:37:20,187”. The log level can be DEBUG, INFO, WARN or ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only a few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message fits on exactly one line. The name of the logger is just the fully qualified name of the Java class the logging event is called from, with the prefix “org.edamontology” removed and prepended with “@” in the log file, e.g. “@pubfetcher.cli.Cli”. The name of the thread will be “main” if the logging event was generated by the main thread; any subsequent thread will be named “Thread-2”, “Thread-3”, etc. In the log file the thread name will be in square brackets, e.g. “[Thread-2]”. Some Java exceptions can also be logged; these will be output with the stack trace on subsequent lines after the logged exception message.
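A log file with these components could be split into fields with a regular expression, as in the Python sketch below. The exact layout of a line (field order and separators) is an assumption based on the order in which the components are listed above, and the sample line is made up, so the pattern may need adjusting against a real log file.

```python
import re
from datetime import datetime

# Assumed layout: "<date time> <LEVEL> <message> @<logger> [<thread>]"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR)\s+"
    r"(?P<message>.*)\s+"
    r"@(?P<logger>\S+)\s+"
    r"\[(?P<thread>[^\]]+)\]$"
)


def parse_log_line(line):
    """Return the fields of one log line, or None (e.g. for stack trace lines)."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


# Made-up line following the assumed layout.
sample = "2018-08-24 11:37:20,187 INFO Example message @pubfetcher.cli.Cli [main]"
print(parse_log_line(sample))
```

Lines that do not match, such as stack trace lines following an exception message, yield None and can be attached to the preceding parsed entry if needed.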

Analysing logs

Log level ERROR is used for erroneous conditions which mostly occur on the side of the PubFetcher user (like problems in provided input), so searching for “ERROR” in log files can potentially help in finding problems that can be fixed by the user. Some problems might be caused by issues in the used resources, like Europe PMC and PubMed, and some reported problems are not problems at all, like failing to find a publication part which is actually supposed to be missing, but these messages will usually have the log level WARN. One example of WARN level messages that can indicate inconsistencies in the used resources is the messages beginning with “Old ID”.

Some examples of issues found by analysing logs:

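If multiple threads are writing to a log file, then the messages of different threads will be interwoven. To get the sequence of messages of only one thread, grep could be used (matching the bracketed thread name as a fixed string avoids also selecting “Thread-20”, “Thread-21”, etc.):

$ grep -F '[Thread-2]' database.log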

In addition to analysing logs, the output of -part-table (described in Output) could be checked for possible problems. For example, title being “na” is a good indicator of an invalid ID. To list all such publications the filter -part-type na -part-type-part title could be used. Other things of interest might be for example parts which are from other sources than the main ones (the europepmc, pubmed, pmc types and doi) or parts missing in Europe PMC, but present in PubMed or PMC.