Output

Database

The database file that is used by PubFetcher to save publications, webpages and docs on disk is a simple key-value store generated by the MapDB library.

In the webpages and docs stores, a key is simply the string representing the startUrl, i.e. the URL given to PubFetcher to fetch content from. The resolved finalUrl might be different from the startUrl (for example, a redirection from HTTP to HTTPS might happen), meaning there might be webpages and docs with equal final URLs (that had different start URLs) stored in the database. Note that webpages and docs have the same structure; they are simply two entirely separate stores, one for general web pages and one for documentation web pages.

Publications can be identified by three separate IDs: a PMID, a PMCID or a DOI. For this reason, the publications store is keyed by what can be called the primary ID of the publication. The primary ID is either a PMID, a PMCID or a DOI, depending on which of them was non-empty when the publication was first saved to the database. If more than one of them was available, then the PMID is preferred over the PMCID and the PMCID is preferred over the DOI.

In addition, there is an extra store called “publicationsMap”, where a key is an ID (PMID/PMCID/DOI) of a publication and the corresponding value is the primary ID (PMID/PMCID/DOI) of that publication. So, for example, if a publication is to be loaded from the database, publicationsMap is consulted first to find the primary ID, and the found primary ID is then used to find the publication in the publications store. All the mappings in publicationsMap can be dumped to stdout with -db-publications-map.

There is also a store called “publicationsMapReverse”, which has mappings that are the reverse of the publicationsMap mappings, that is, from primary ID to the triplet PMID, PMCID, DOI. In addition, publicationsMapReverse stores the URLs where these PMID, PMCID and DOI were found. This reverse mapping can be useful, for example, for quickly listing all publication IDs (as the triplet PMID, PMCID, DOI) found in a database file. All the mappings in publicationsMapReverse can be dumped to stdout with -db-publications-map-reverse.

The stores publicationsMapReverse and publicationsMap and the publications store are all kept coherent and in sync with each other. Note that all stored DOIs are normalised, i.e. any valid prefix is removed (e.g. “https://doi.org/”, “doi:”) and letters from the 7-bit ASCII set are converted to uppercase.
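The lookup scheme described above can be sketched with plain Python dictionaries standing in for the MapDB stores. This is only an illustration under stated assumptions, not PubFetcher's actual code: the function names are hypothetical, the prefix list contains only the two examples named above, prefix matching is assumed to be case-insensitive, and the provenance URLs kept in publicationsMapReverse are left out.

```python
# Dictionaries standing in for the three stores (illustration only).
publications = {}              # primary ID -> publication record
publications_map = {}          # any ID -> primary ID
publications_map_reverse = {}  # primary ID -> (PMID, PMCID, DOI)

# Only the two prefixes named in the text; the real list may be longer.
DOI_PREFIXES = ("https://doi.org/", "doi:")


def normalise_doi(doi):
    """Strip a known prefix and uppercase the letters a-z only."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        # Case-insensitive prefix match is an assumption of this sketch.
        if doi[:len(prefix)].lower() == prefix:
            doi = doi[len(prefix):]
            break
    # Non-ASCII letters are deliberately left untouched.
    return "".join(c.upper() if "a" <= c <= "z" else c for c in doi)


def primary_id(pmid, pmcid, doi):
    """Pick the first non-empty ID in the order PMID, PMCID, DOI."""
    for candidate in (pmid, pmcid, doi):
        if candidate:
            return candidate
    raise ValueError("at least one ID must be non-empty")


def save(pmid, pmcid, doi, record):
    """Store a record and keep all three stores in sync."""
    doi = normalise_doi(doi) if doi else ""
    key = primary_id(pmid, pmcid, doi)
    publications[key] = record
    for any_id in (pmid, pmcid, doi):
        if any_id:
            publications_map[any_id] = key
    publications_map_reverse[key] = (pmid, pmcid, doi)
    return key


def load(any_id):
    """Resolve any ID to the primary ID first, then fetch the record."""
    key = publications_map.get(any_id)
    return publications.get(key) if key is not None else None
```

For instance, a record saved with an empty PMID, the PMCID "PMC123" and the DOI "https://doi.org/10.1000/abc" would be keyed by "PMC123" and could afterwards be loaded with either "PMC123" or the normalised DOI "10.1000/ABC".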

The structure of the values in the publications, webpages and docs stores, i.e. the actual contents stored in the database, is best described by the next section, JSON output, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the “empty”, “usable”, “final”, “totallyFinal” and “broken” fields present in the JSON output are not stored in the database; their values are inferred from actual database values and depend on some fetching parameters. Additionally, the fields “version” and “argv” are specific to JSON.

With a new release of PubFetcher, the structure of the database content might change (this involves code in the package org.edammap.pubfetcher.core.db). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).

JSON output

The output of PubFetcher will be in JSON format if the option --format json is specified. If the option --plain is additionally specified, then fields about metadata will be omitted from the output. JSON support is implemented using libraries from the Jackson project.

Common

All JSON output will contain the fields “version” and “argv”.

version

Information about the application that generated this JSON file

name
Name of the application
url
Homepage of the application
version
Version of the application
argv
Array of all command-line parameters that were supplied to the application that generated this JSON file
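As a small illustration of the two common fields, the following sketch summarises where a JSON file came from. The helper name describe_origin and all sample values are hypothetical placeholders, not real PubFetcher output; only the field names are taken from the description above.

```python
import json

# Placeholder sample: the values are invented for illustration only.
SAMPLE = """
{
  "version": {
    "name": "example-app",
    "url": "https://example.org",
    "version": "0.0.0"
  },
  "argv": ["--format", "json", "--plain"]
}
"""


def describe_origin(doc):
    """Return a one-line summary of the generating application and its arguments."""
    version = doc.get("version", {})
    argv = doc.get("argv", [])
    return "{} {} ({}), run with: {}".format(
        version.get("name", "?"),
        version.get("version", "?"),
        version.get("url", "?"),
        " ".join(argv),
    )


print(describe_origin(json.loads(SAMPLE)))
```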

IDs

JSON output of IDs/URLs, output using -out-ids, -txt-ids-pub, -txt-ids-web or -txt-ids-doc.

IDs of publications

Publications are identified by the triplet PMID, PMCID and DOI.

publicationIds

Array of publication IDs

pmid
The PubMed ID of the publication. Only articles available in PubMed can have this.
pmcid
The PubMed Central ID of the publication. Only articles available in PMC can have this.
doi
The Digital Object Identifier of the publication
pmidUrl
Provenance URL of the PMID
pmcidUrl
Provenance URL of the PMCID
doiUrl
Provenance URL of the DOI

If --plain is specified, then the provenance URLs are not output.
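A minimal sketch of reading the publicationIds array is given below. It assumes that an ID which is not known appears either as an empty string or is absent altogether (the text above does not say which), and it tolerates missing provenance URLs so that the same code also works on --plain output. The function name id_triplets is hypothetical.

```python
ID_KEYS = ("pmid", "pmcid", "doi")


def id_triplets(doc):
    """Return a list of ((pmid, pmcid, doi), (pmidUrl, pmcidUrl, doiUrl)) pairs."""
    result = []
    for entry in doc.get("publicationIds", []):
        # Missing or null values are mapped to empty strings.
        ids = tuple(entry.get(key) or "" for key in ID_KEYS)
        # Provenance URLs are absent when --plain was used.
        urls = tuple(entry.get(key + "Url") or "" for key in ID_KEYS)
        result.append((ids, urls))
    return result
```

It could be applied to a document loaded with json.load, for example to collect all DOIs with `[ids[2] for ids, _ in id_triplets(doc) if ids[2]]`.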

URLs of webpages

Webpages are identified by a URL.

webpageUrls
Array of webpage URLs

URLs of docs

Docs are identified by a URL.

docUrls
Array of doc URLs

Contents

JSON output of the entire content of publications, webpages and docs, output using -out, -txt-pub, -txt-web and -txt-doc.

Content of publications

A publication represents one publication (most often a research paper) and contains its ID (a PMID, a PMCID and/or a DOI), content (title, abstract, full text), keywords (user-assigned, MeSH and mined EFO and GO terms) and various metadata (Open Access flag, journal title, publication date, etc).

publications

Array of publications

fetchTime
Time of initial fetch or last retryCounter reset as UNIX time (in milliseconds)
fetchTimeHuman
Time of initial fetch or last retryCounter reset as ISO 8601 combined date and time
retryCounter
A refetch can occur if the value of retryCounter is less than retryLimit; or if any of the cooldown times (in fetching parameters) of a currently true condition have passed since fetchTime, in which case retryCounter is also reset
fetchException
true if there was a fetching exception during the last fetch; false otherwise
oa
true if the article is Open Access; false otherwise
journalTitle
Title of the journal the article was published in
pubDate
Publication date of the article as UNIX time (in milliseconds); negative, if unknown
pubDateHuman
Publication date of the article as ISO 8601 date; before 1970-01-01, if unknown
citationsCount
Number of times the article has been cited (according to Europe PMC); negative, if unknown
citationsTimestamp
Time when citationsCount was last updated as UNIX time (in milliseconds); negative, if citationsCount has not yet been updated
citationsTimestampHuman
Time when citationsCount was last updated as ISO 8601 combined date and time; before 1970-01-01T00:00:00.000Z, if citationsCount has not yet been updated
correspAuthor

Array of objects representing corresponding authors of the article

name
Name of the corresponding author
orcid
ORCID iD of the corresponding author
email
E-mail of the corresponding author
phone
Telephone number of the corresponding author
uri
Web page of the corresponding author
visitedSites

Array of objects representing sites visited for getting content (outside of standard Europe PMC, PubMed and oaDOI resources and also excluding PDFs)

url
URL of the visited site
type
The type of the site (as resource)
from
URL where the link of the site was picked up
timestamp
Time when the link of the site was picked up as UNIX time (in milliseconds)
timestampHuman
Time when the link of the site was picked up as ISO 8601 combined date and time
empty
true, if all publication parts (except IDs) are empty; false otherwise
usable
true, if at least one publication part (apart from IDs) is usable; false otherwise
final
true, if title, abstract and fulltext are final; false otherwise
totallyFinal
true, if all publication parts are final; false otherwise
pmid

A publication part (like the following pmcid, doi, title, etc), in this case representing the publication PMID

content
Content of the publication part (in this case, the publication PMID as a string)
type
The type of the publication part content source
url
URL of the publication part content source
timestamp
Time when the publication part content was set as UNIX time (in milliseconds)
timestampHuman
Time when the publication part content was set as ISO 8601 combined date and time
size
Number of characters in the content
empty
true, if the content is empty (size is 0); false otherwise
usable
true, if the content is long enough (the threshold can be influenced by fetching parameters), in other words, if the publication part content can be used as input for other applications; false otherwise
final
true, if the content is from a reliable source and is long enough, in other words, if there is no need to try fetching the publication part content from another source; false otherwise
pmcid
Publication part representing the publication PMCID. Structure same as in pmid.
doi
Publication part representing the publication DOI. Structure same as in pmid.
title
Publication part representing the publication title. Structure same as in pmid.
keywords

Publication part representing publication keywords. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list
Array of strings representing publication keywords
mesh

Publication part representing publication MeSH terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list

Array of objects representing publication MeSH terms

term
Term name
majorTopic
true, if the term is a major topic of the article
uniqueId
MeSH Unique Identifier
efo

Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.

list

Array of objects representing publication EFO terms

term
Term name
count
Number of times the term was mined from full text by Europe PMC
uri
Unique URI to the ontology term
go
Publication part representing publication GO terms. Structure same as in efo.
abstract
Publication part representing the publication abstract. Structure same as in pmid.
fulltext
Publication part representing the publication fulltext. Structure same as in pmid.

If --plain is specified, then metadata is omitted from the output (everything from fetchTime to totallyFinal) and the value corresponding to publication part keys (pmid to fulltext) will be the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go) as specified above for each corresponding part.

If --out-part is specified, then everything from fetchTime to totallyFinal will be omitted from the output and only publication parts specified by --out-part will be output (with structure as specified above). If --plain is specified along with --out-part, then output parts will only have as value the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go).
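The following sketch shows one way a consumer might handle a publication object regardless of whether --plain was used, and how the millisecond timestamps relate to their ISO 8601 counterparts. The helper names are hypothetical, and the exact string formatting of the “Human” fields produced by PubFetcher may differ from what Python's isoformat emits.

```python
from datetime import datetime, timedelta, timezone

PART_KEYS = ("pmid", "pmcid", "doi", "title", "keywords",
             "mesh", "efo", "go", "abstract", "fulltext")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_iso(ms):
    """Convert UNIX time in milliseconds to an ISO 8601 string; None if negative (unknown)."""
    if ms < 0:
        return None
    return (EPOCH + timedelta(milliseconds=ms)).isoformat(timespec="milliseconds")


def part_value(pub, key):
    """Return the content (or list) of a part, for both full and --plain output."""
    part = pub.get(key)
    if isinstance(part, dict):
        # Full output: the part is an object with metadata around the value.
        return part["list"] if "list" in part else part.get("content")
    # Plain output: the part already is the bare content or list.
    return part


def usable_parts(pub):
    """Names of the parts flagged usable; always empty for --plain output."""
    return [key for key in PART_KEYS
            if isinstance(pub.get(key), dict) and pub[key].get("usable")]
```

Because part_value checks the type of the value rather than a command-line flag, the same code can read files exported with or without --plain.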

Content of webpages

A webpage represents a general web page from where relevant content has been extracted, along with some metadata. If the web page is about a software tool, then the software license and programming language can be stored separately, if found (this feature has been added to support EDAMmap).

webpages

Array of webpages

fetchTime
Same as fetchTime of publications
fetchTimeHuman
Same as fetchTimeHuman of publications
retryCounter
Same as retryCounter of publications
fetchException
Same as fetchException of publications
startUrl
URL given as webpage identifier, same as listed by webpageUrls
finalUrl
Final URL after potential redirections
contentType
HTTP Content-Type header
statusCode
HTTP status code
contentTime
Time when current webpage content was last set as UNIX time (in milliseconds)
contentTimeHuman
Time when current webpage content was last set as ISO 8601 combined date and time
license
Software license of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
language
Programming language of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
titleLength
Number of characters in the webpage title
contentLength
Number of characters in the webpage content
title
The webpage title (as extracted by the corresponding scraping rule; or text from the HTML <title> element if scraping rules were not found)
empty
true, if webpage title and webpage content are empty; false otherwise
usable
true, if the length of webpage title plus the length of webpage content is large enough (at least webpageMinLength characters), that is, the webpage can be used as input for other applications; false otherwise
final
true, if the webpage is not broken and the webpage is usable and the length of the webpage content is larger than 0; false otherwise
broken
true, if the webpage with the given URL could not be fetched (based on the values of statusCode and finalUrl); false otherwise
content
The webpage content (as extracted by the corresponding scraping rule; or the automatically cleaned content from the entire HTML of the page if scraping rules were not found)

If --plain is specified, then only startUrl, webpage title and webpage content will be present.
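The inferred fields empty, usable and final can be expressed compactly, as in the sketch below. It follows the field descriptions above but is not PubFetcher's own code: broken is taken as an input because the exact rule based on statusCode and finalUrl is not spelled out here, webpage_min_length has to be supplied by the caller, and Python's len counts Unicode code points, which may differ from the character counts PubFetcher reports.

```python
def webpage_flags(title, content, broken, webpage_min_length):
    """Infer the empty, usable and final flags of a webpage or doc."""
    empty = len(title) == 0 and len(content) == 0
    # Usable: title and content together are long enough.
    usable = len(title) + len(content) >= webpage_min_length
    # Final: fetched successfully, usable, and with some actual content.
    final = (not broken) and usable and len(content) > 0
    return {"empty": empty, "usable": usable, "final": final}
```

For example, a page with a long title but no content can be usable without being final, so a later refetch may still be attempted for it.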

Content of docs

Like Content of webpages, except it allows for a separate store for documentation web pages.

docs

Array of docs

Structure is same as in webpages

HTML and plain text output

Output will be in HTML format, if --format html is specified, and in plain text, if --format text is specified or --format is omitted (as text is the default).

The HTML output is meant to be formatted and viewed in a web browser. Links to external resources (such as the different URL fields) are clickable in the browser.

The plain text output is formatted for viewing in the console or in a text editor.

Both the HTML output and the plain text output will contain the same information as the JSON output specified above and will behave analogously with respect to the --plain and --out-part parameters. There are, however, a few fields that are missing in HTML and plain text compared to JSON: “empty”, “usable”, “final”, “totallyFinal”, “broken” (these values are inferred from the values of some other fields and depend on some fetching parameters) and the JSON-specific “version” and “argv”.

Log file

PubFetcher-CLI will log to stderr using the Apache Log4j 2 library. With the --log parameter (described in Logging), a text file can be specified to which the same log will also be written.

Each log line will consist of the following: the date and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format “2018-08-24 11:37:20,187”. Log level can be DEBUG, INFO, WARN or ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only a few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message fits on exactly one line. The name of the logger is just the fully qualified Java class (with the prefix “org.edamontology” removed) the logging event is called from (prepended with “@” in the log file), e.g. “@pubfetcher.cli.Cli”. The name of the thread will be “main” if the logging event was generated by the main thread; any subsequent thread will be named “Thread-2”, “Thread-3”, etc. In the log file, the thread name will be in square brackets, e.g. “[Thread-2]”. Some Java exceptions can also be logged; these will be output with the stack trace on subsequent lines after the logged exception message.
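A log line with these components could be split into its fields roughly as follows. The exact layout of a line (whitespace-separated fields in the order listed above) is an assumption of this sketch and should be checked against a real log file; the function name parse_log_line and the sample message used below are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<date time> <LEVEL> <message> @<logger> [<thread>]"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR)\s+"
    r"(?P<message>.*?)\s+"
    r"@(?P<logger>\S+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s*$"
)


def parse_log_line(line):
    """Return a dict of the fields of a log line, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        # For example a stack trace line following a logged exception.
        return None
    fields = match.groupdict()
    # The timestamp is local time with millisecond precision.
    fields["time"] = datetime.strptime(fields["time"], "%Y-%m-%d %H:%M:%S,%f")
    return fields
```

Lines that do not match, such as stack trace lines, yield None and can be attached to the preceding event by the caller if needed.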

Analysing logs

Log level ERROR is used for erroneous conditions which mostly occur on the side of the PubFetcher user (like problems in provided input), so searching for “ERROR” in log files can potentially help in finding problems that can be fixed by the user. Some problems might be caused by issues in the used resources, like Europe PMC and PubMed, and some reported problems are not problems at all, like failing to find a publication part which is actually supposed to be missing, but these messages will usually have the log level WARN. One example of WARN level messages that can indicate inconsistencies in a used resource are the messages beginning with “Old ID”.

Some examples of issues found by analysing logs:

If multiple threads are writing to a log file, then the messages of different threads will be interwoven. To get the sequence of messages of only one thread, grep could be used:

$ grep Thread-2 database.log
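Note that a plain substring search for Thread-2 also matches Thread-20, Thread-21 and so on, and drops stack trace lines, which carry no thread name. A sketch that avoids both is given below; it assumes that every log event starts with a timestamp in the format described earlier and that the thread name appears in square brackets. The function name thread_lines is hypothetical.

```python
import re

# A new log event is assumed to start with "YYYY-MM-DD HH:MM:SS,mmm".
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def thread_lines(lines, thread):
    """Yield the lines of one thread, including its stack trace continuation lines."""
    marker = "[" + thread + "]"
    keep = False
    for line in lines:
        if EVENT_START.match(line):
            # Decide once per event; continuation lines inherit the decision.
            keep = marker in line
        if keep:
            yield line
```

With a file, it could be used as `with open("database.log") as log: print("".join(thread_lines(log, "Thread-2")), end="")`.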

In addition to analysing logs, the output of -part-table (described in Output) could be checked for possible problems. For example, title being “na” is a good indicator of an invalid ID. To list all such publications, the filter -part-type na -part-type-part title could be used. Other things of interest might be, for example, parts which are from sources other than the main ones (the europepmc, pubmed, pmc types and doi) or parts missing in Europe PMC but present in PubMed or PMC.