Output
Database
The database file that is used by PubFetcher to save publications, webpages and docs on disk is a simple key-value store generated by the MapDB library.
In the webpages and docs stores, a key is simply the string representing the startUrl, i.e. the URL given to PubFetcher to fetch content for. The resolved finalUrl might differ from the startUrl (for example, a redirection from HTTP to HTTPS might happen), meaning there might be webpages and docs with equal final URLs (that had different start URLs) stored in the database. Note also that webpages and docs have the same structure; they just provide two entirely separate stores for saving general web pages and documentation web pages respectively.
Publications can be identified by 3 separate IDs: a PMID, a PMCID or a DOI. A key in the publications store, which can be called the primary ID of the publication, is therefore either a PMID, a PMCID or a DOI, depending on which of them was non-empty when the publication was first saved to the database. If more than one of them was available, then the PMID is preferred over the PMCID and the PMCID is preferred over the DOI.
Then, there is an extra store called “publicationsMap”, where a key is an ID (PMID/PMCID/DOI) of a publication and the corresponding value is the primary ID (PMID/PMCID/DOI) of that publication. So, for example, if a publication is to be loaded from the database, first publicationsMap is consulted to find the primary ID and then the found primary ID is used to find the publication in the publications store. All the mappings in publicationsMap can be dumped to stdout with -db-publications-map.
There is also a store called “publicationsMapReverse”, which has mappings that are the reverse of the publicationsMap mappings, that is, from primary ID to the triplet PMID, PMCID, DOI. In addition, publicationsMapReverse stores the URLs where these PMID, PMCID and DOI were found. This reverse mapping can be useful, for example, for quickly listing all publication IDs (as the triplet PMID, PMCID, DOI) found in a database file. All the mappings in publicationsMapReverse can be dumped to stdout with -db-publications-map-reverse.
The stores publicationsMapReverse and publicationsMap and the publications store are all kept coherent and in sync with each other. Note also that all stored DOIs are normalised, i.e. any valid prefix is removed (e.g. “https://doi.org/”, “doi:”) and letters from the 7-bit ASCII set are converted to uppercase.
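The ID handling described above can be illustrated with a small Python sketch. It is not PubFetcher's actual (Java, MapDB-backed) implementation: the class PublicationStore, the plain dictionaries standing in for the stores, and the case-insensitive prefix matching are assumptions made for illustration, and only the two DOI prefixes named in the text are handled.

```python
from typing import Optional

# Only the two prefixes named in the text; the real list of valid prefixes may be longer.
DOI_PREFIXES = ("https://doi.org/", "doi:")


def normalise_doi(doi: str) -> str:
    """Strip one known prefix and uppercase 7-bit ASCII letters only."""
    doi = doi.strip()
    lowered = doi.lower()  # assumption: prefixes are matched case-insensitively
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Uppercase a-z only; letters outside the ASCII set are left untouched.
    return "".join(c.upper() if "a" <= c <= "z" else c for c in doi)


def primary_id(pmid: str, pmcid: str, doi: str) -> Optional[str]:
    """The PMID is preferred over the PMCID, and the PMCID over the DOI."""
    for candidate in (pmid, pmcid, doi):
        if candidate:
            return candidate
    return None


class PublicationStore:
    """Hypothetical in-memory stand-in for the three stores."""

    def __init__(self) -> None:
        self.publications = {}              # primary ID -> publication
        self.publications_map = {}          # any ID -> primary ID
        self.publications_map_reverse = {}  # primary ID -> (PMID, PMCID, DOI)

    def save(self, pmid: str, pmcid: str, doi: str, record: dict) -> str:
        doi = normalise_doi(doi) if doi else ""
        key = primary_id(pmid, pmcid, doi)
        if key is None:
            raise ValueError("a publication needs at least one ID")
        self.publications[key] = record
        for some_id in (pmid, pmcid, doi):
            if some_id:
                self.publications_map[some_id] = key
        self.publications_map_reverse[key] = (pmid, pmcid, doi)
        return key

    def load(self, some_id: str) -> Optional[dict]:
        # Two-step lookup: first find the primary ID, then the publication.
        key = self.publications_map.get(some_id)
        return None if key is None else self.publications.get(key)
```

In this sketch, a publication saved with both a PMID and a DOI is stored under the PMID, but can afterwards be loaded by either ID, as long as the DOI is given in its normalised form.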
The structure of the values in the publications, webpages and docs stores, i.e. the actual contents stored in the database, is best described by the next section JSON output, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the “empty”, “usable”, “final”, “totallyFinal” and “broken” fields present in the JSON output are not stored in the database; these values are inferred from actual database values and depend on some fetching parameters. Additionally, the fields “version” and “argv” are specific to JSON.
With a new release of PubFetcher, the structure of the database content might change (this involves code in the package org.edammap.pubfetcher.core.db). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).
JSON output
The output of PubFetcher will be in JSON format if the option --format json is specified. If the option --plain is additionally specified, then fields about metadata will be omitted from the output. JSON support is implemented using libraries from the Jackson project.
Common
All JSON output will contain the fields “version” and “argv”.
- version
Information about the application that generated this JSON file
- name
Name of the application
- url
Homepage of the application
- version
Version of the application
- argv
Array of all command-line parameters that were supplied to the application that generated this JSON file
IDs
JSON output of IDs/URLs, output using -out-ids, -txt-ids-pub, -txt-ids-web or -txt-ids-doc.
IDs of publications
Publications are identified by the triplet PMID, PMCID and DOI.
- publicationIds
Array of publication IDs
- pmid
The PubMed ID of the publication. Only articles available in PubMed can have this.
- pmcid
The PubMed Central ID of the publication. Only articles available in PMC can have this.
- doi
The Digital Object Identifier of the publication
- pmidUrl
Provenance URL of the PMID
- pmcidUrl
Provenance URL of the PMCID
- doiUrl
Provenance URL of the DOI
If --plain is specified, then the provenance URLs are not output.
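As an illustration of the structure above, the following Python sketch reads the ID triplets from such a JSON document. The sample document is made up (all values are placeholders, not real PubFetcher output), and representing a missing ID as an empty string is an assumption; the helper id_triplets is hypothetical and uses .get so that it also tolerates absent keys, such as the provenance URLs omitted under --plain.

```python
import json

# Made-up sample in the shape described above; every value is a placeholder.
SAMPLE = """
{
  "version": {"name": "PubFetcher", "url": "https://example.org/", "version": "0.0.0"},
  "argv": ["-out-ids"],
  "publicationIds": [
    {"pmid": "1", "pmcid": "", "doi": "10.1000/EXAMPLE",
     "pmidUrl": "https://example.org/a", "pmcidUrl": "", "doiUrl": "https://example.org/b"},
    {"pmid": "", "pmcid": "PMC2", "doi": "",
     "pmidUrl": "", "pmcidUrl": "https://example.org/c", "doiUrl": ""}
  ]
}
"""


def id_triplets(document: dict) -> list:
    """Return (PMID, PMCID, DOI) tuples, with "" for any ID that is not present."""
    triplets = []
    for entry in document.get("publicationIds", []):
        triplets.append(
            (entry.get("pmid", ""), entry.get("pmcid", ""), entry.get("doi", ""))
        )
    return triplets


doc = json.loads(SAMPLE)
triplets = id_triplets(doc)
for pmid, pmcid, doi in triplets:
    print(pmid or "-", pmcid or "-", doi or "-")
```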
URLs of webpages
Webpages are identified by a URL.
- webpageUrls
Array of webpage URLs
URLs of docs
Docs are identified by a URL.
- docUrls
Array of doc URLs
Contents
JSON output of the entire content of publications, webpages and docs, output using -out, -txt-pub, -txt-web and -txt-doc.
Content of publications
A publication represents one publication (most often a research paper) and contains its ID (a PMID, a PMCID and/or a DOI), content (title, abstract, full text), keywords (user-assigned, MeSH and mined EFO and GO terms) and various metadata (Open Access flag, journal title, publication date, etc).
- publications
Array of publications
- fetchTime
Time of initial fetch or last retryCounter reset as UNIX time (in milliseconds)
- fetchTimeHuman
Time of initial fetch or last retryCounter reset as ISO 8601 combined date and time
- retryCounter
A refetch can occur if the value of retryCounter is less than retryLimit; or if any of the cooldown times (in fetching parameters) of a currently true condition have passed since fetchTime, in which case retryCounter is also reset
- fetchException
true if there was a fetching exception during the last fetch; false otherwise
- oa
true if the article is Open Access; false otherwise
- preprint
true if the article is a preprint; false otherwise
- journalTitle
Title of the journal the article was published in
- pubDate
Publication date of the article as UNIX time (in milliseconds); negative, if unknown
- pubDateHuman
Publication date of the article as ISO 8601 date; before 1970-01-01, if unknown
- citationsCount
Number of times the article has been cited (according to Europe PMC); negative, if unknown
- citationsTimestamp
Time when citationsCount was last updated as UNIX time (in milliseconds); negative, if citationsCount has not yet been updated
- citationsTimestampHuman
Time when citationsCount was last updated as ISO 8601 combined date and time; before 1970-01-01T00:00:00.000Z, if citationsCount has not yet been updated
- correspAuthor
Array of objects representing corresponding authors of the article
- name
Name of the corresponding author
- orcid
ORCID iD of the corresponding author
- email
E-mail of the corresponding author
- phone
Telephone number of the corresponding author
- uri
Web page of the corresponding author
- visitedSites
Array of objects representing sites visited for getting content (outside of standard Europe PMC, PubMed and oaDOI resources and also excluding PDFs)
- empty
true, if all publication parts (except IDs) are empty; false otherwise
- usable
true, if at least one publication part (apart from IDs) is usable; false otherwise
- final
true, if title, abstract and fulltext are final; false otherwise
- totallyFinal
true, if all publication parts are final; false otherwise
- pmid
A publication part (like the following pmcid, doi, title, etc), in this case representing the publication PMID
- content
Content of the publication part (in this case, the publication PMID as a string)
- type
The type of the publication part content source
- url
URL of the publication part content source
- timestamp
Time when the publication part content was set as UNIX time (in milliseconds)
- timestampHuman
Time when the publication part content was set as ISO 8601 combined date and time
- size
Number of characters in the content
- empty
true, if the content is empty (size is 0); false otherwise
- usable
true, if the content is long enough (the threshold can be influenced by fetching parameters), in other words, if the publication part content can be used as input for other applications; false otherwise
- final
true, if the content is from a reliable source and is long enough, in other words, if there is no need to try fetching the publication part content from another source; false otherwise
- pmcid
Publication part representing the publication PMCID. Structure same as in pmid.
- doi
Publication part representing the publication DOI. Structure same as in pmid.
- title
Publication part representing the publication title. Structure same as in pmid.
- keywords
Publication part representing publication keywords. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
- list
Array of strings representing publication keywords
- mesh
Publication part representing publication MeSH terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
- list
Array of objects representing publication MeSH terms
- term
Term name
- majorTopic
true, if the term is a major topic of the article
- uniqueId
MeSH Unique Identifier
- efo
Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
- list
Array of objects representing publication EFO terms
- term
Term name
- count
Number of times the term was mined from full text by Europe PMC
- uri
Unique URI to the ontology term
- go
Publication part representing publication GO terms. Structure same as in efo.
- abstract
Publication part representing the publication abstract. Structure same as in pmid.
- fulltext
Publication part representing the publication fulltext. Structure same as in pmid.
If --plain is specified, then metadata is omitted from the output (everything from fetchTime to totallyFinal) and the value corresponding to publication part keys (pmid to fulltext) will be the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go) as specified above for each corresponding part.
If --out-part is specified, then everything from fetchTime to totallyFinal will be omitted from the output and only publication parts specified by --out-part will be output (with structure as specified above). If --plain is specified along with --out-part, then output parts will only have as value the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go).
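The relation between the full and the --plain structure of a publication can be sketched in Python as follows. The function to_plain and its parts argument (standing in for --out-part combined with --plain) are hypothetical helpers operating on already parsed JSON; they only mirror the description above and are not part of PubFetcher.

```python
from typing import Iterable, Optional

# For each publication part, the key that holds its value in the full JSON output,
# listed in the order the parts are documented above.
PART_VALUE_KEY = {
    "pmid": "content",
    "pmcid": "content",
    "doi": "content",
    "title": "content",
    "keywords": "list",
    "mesh": "list",
    "efo": "list",
    "go": "list",
    "abstract": "content",
    "fulltext": "content",
}


def to_plain(publication: dict, parts: Optional[Iterable[str]] = None) -> dict:
    """Reduce a full publication object to its plain form.

    Metadata (fetchTime to totallyFinal) is dropped and every part is replaced
    by the value of its "content" or "list". If parts is given, only those
    parts are kept.
    """
    wanted = set(PART_VALUE_KEY) if parts is None else set(parts)
    plain = {}
    for name, value_key in PART_VALUE_KEY.items():
        if name in wanted and name in publication:
            plain[name] = publication[name][value_key]
    return plain
```

For example, a publication whose title part has "content" set to some string would, in plain form, have that string directly as the value of "title".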
Content of webpages
A webpage represents a general web page from where relevant content has been extracted, along with some metadata. If the web page is about a software tool, then the software license and programming language can be stored separately, if found (this feature has been added to support EDAMmap).
- webpages
Array of webpages
- fetchTime
Same as fetchTime of publications
- fetchTimeHuman
Same as fetchTimeHuman of publications
- retryCounter
Same as retryCounter of publications
- fetchException
Same as fetchException of publications
- startUrl
URL given as webpage identifier, same as listed by webpageUrls
- finalUrl
Final URL after potential redirections
- contentType
HTTP Content-Type header
- statusCode
HTTP status code
- contentTime
Time when current webpage content was last set as UNIX time (in milliseconds)
- contentTimeHuman
Time when current webpage content was last set as ISO 8601 combined date and time
- license
Software license of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
- language
Programming language of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
- titleLength
Number of characters in the webpage title
- contentLength
Number of characters in the webpage content
- title
The webpage title (as extracted by the corresponding scraping rule; or text from the HTML <title> element if scraping rules were not found)
- empty
true, if webpage title and webpage content are empty; false otherwise
- usable
true, if the length of webpage title plus the length of webpage content is large enough (at least webpageMinLength characters), that is, the webpage can be used as input for other applications; false otherwise
- final
true, if the webpage is not broken and the webpage is usable and the length of the webpage content is larger than 0; false otherwise
- broken
true, if the webpage with the given URL could not be fetched (based on the values of statusCode and finalUrl); false otherwise
- content
The webpage content (as extracted by the corresponding scraping rule; or the automatically cleaned content from the entire HTML of the page if scraping rules were not found)
If --plain is specified, then only startUrl, webpage title and webpage content will be present.
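How the inferred fields “empty”, “usable” and “final” of a webpage follow from the stored values can be sketched in Python as below. The function infer_status is hypothetical; the value of webpageMinLength is passed in as a parameter, and “broken” is taken as an input because the exact rule based on statusCode and finalUrl is not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class WebpageStatus:
    empty: bool
    usable: bool
    final: bool


def infer_status(title_length: int, content_length: int,
                 broken: bool, webpage_min_length: int) -> WebpageStatus:
    """Derive the inferred webpage fields from titleLength and contentLength."""
    # empty: both the title and the content are empty
    empty = title_length == 0 and content_length == 0
    # usable: title plus content is at least webpageMinLength characters long
    usable = title_length + content_length >= webpage_min_length
    # final: not broken, usable and with non-empty content
    final = (not broken) and usable and content_length > 0
    return WebpageStatus(empty=empty, usable=usable, final=final)
```

Because the result depends on webpageMinLength, the same stored webpage can be reported as usable under one set of fetching parameters and as not usable under another, which is why these fields are not saved in the database.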
Content of docs
Like Content of webpages, except it allows for a separate store for documentation web pages.
- docs
Array of docs
Structure is same as in webpages
HTML and plain text output
Output will be in HTML format, if --format html is specified, and in plain text, if --format text is specified or --format is omitted (as text is the default).
The HTML output is meant to be formatted and viewed in a web browser. Links to external resources (such as the different URL fields) are clickable in the browser.
The plain text output is formatted for viewing in the console or in a text editor.
Both the HTML output and the plain text output will contain the same information as the JSON output specified above and will behave analogously with respect to the --plain and --out-part parameters. There are, however, a few fields that are missing in HTML and plain text compared to JSON: “empty”, “usable”, “final”, “totallyFinal”, “broken” (these values are inferred from the values of some other fields and depend on some fetching parameters) and the JSON-specific “version” and “argv”.
Log file
PubFetcher-CLI will log to stderr using the Apache Log4j 2 library. With the --log parameter (described in Logging), a text file can be specified to which the same log will also be written.
Each log line will consist of the following: the date and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format “2018-08-24 11:37:20,187”. Log level can be DEBUG, INFO, WARN and ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only a few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message fits on exactly one line. The name of the logger is just the fully qualified name of the Java class (with the prefix “org.edamontology” removed) the logging event is called from (prepended with “@” in the log file), e.g. “@pubfetcher.cli.Cli”. The name of the thread will be “main” if the logging event was generated by the main thread; any subsequent thread will be named “Thread-2”, “Thread-3”, etc. In the log file the thread name will be in square brackets, e.g. “[Thread-2]”. Some Java exceptions can also be logged; these will be output with the stack trace on subsequent lines after the logged exception message.
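A log line of this kind could be split into its fields with a Python sketch like the one below. The exact layout is an assumption: the fields are taken to appear in the order listed above and to be separated by whitespace. The names LogRecord and parse_line are hypothetical. Lines that do not match, such as stack trace lines following a logged exception, yield None.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    level: str
    message: str
    logger: str
    thread: str


# Assumed layout: "<date time> <LEVEL> <message> @<logger> [<thread>]"
LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR)\s+"
    r"(?P<message>.*)\s+"
    r"@(?P<logger>\S+)\s+"
    r"\[(?P<thread>[^\]]+)\]\s*$"
)


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None for lines in another shape."""
    match = LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return LogRecord(**match.groupdict())
```

Since line breaks in messages are escaped, matching line by line is sufficient; only stack traces need separate handling.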
Analysing logs
Log level ERROR is used for erroneous conditions which mostly occur on the side of the PubFetcher user (like problems in provided input), so searching for “ERROR” in log files can potentially help in finding problems that can be fixed by the user. Some problems might be caused by issues in the used resources, like Europe PMC and PubMed, and some reported problems are not problems at all, like failing to find a publication part which is actually supposed to be missing, but these messages will usually have the log level WARN. One example of WARN level messages that can indicate inconsistencies in the used resources are the messages beginning with “Old ID”.
Some examples of issues found by analysing logs:
If multiple threads are writing to a log file, then the messages of different threads will be interwoven. To get the sequence of messages of only one thread, grep could be used:
$ grep Thread-2 database.log
In addition to analysing logs, the output of -part-table (described in Output) could be checked for possible problems. For example, title being “na” is a good indicator of an invalid ID. To list all such publications the filter -part-type na -part-type-part title could be used. Other things of interest might be for example parts which are from other sources than the main ones (the europepmc, pubmed, pmc types and doi) or parts missing in Europe PMC, but present in PubMed or PMC.
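A comparable check can be run on exported JSON output (without --plain, so that the “type” of each part is present). The Python sketch below lists publications whose given part has a given type; the helper publications_with_part_type is hypothetical, and comparing the type case-insensitively is an assumption, as the exact spelling of type values in JSON is not specified here.

```python
def publications_with_part_type(document: dict, part: str, part_type: str) -> list:
    """Return the publications whose given part has the given content source type."""
    matches = []
    for publication in document.get("publications", []):
        part_value = publication.get(part)
        if not isinstance(part_value, dict):
            continue  # part missing, or output was produced with --plain
        if str(part_value.get("type", "")).lower() == part_type.lower():
            matches.append(publication)
    return matches


def describe(publication: dict) -> str:
    """Label a publication by the contents of its pmid, pmcid and doi parts."""
    ids = []
    for name in ("pmid", "pmcid", "doi"):
        part_value = publication.get(name)
        if isinstance(part_value, dict) and part_value.get("content"):
            ids.append(f"{name}={part_value['content']}")
    return ", ".join(ids) if ids else "(no IDs)"
```

Calling publications_with_part_type(document, "title", "na") on a parsed JSON document would then correspond to the filter -part-type na -part-type-part title mentioned above.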