The database file that is used by PubFetcher to save publications, webpages and docs on disk is a simple key-value store generated by the MapDB library.
For the webpages and docs stores, a key is simply the string representing the startUrl, i.e. the URL given to PubFetcher to fetch content from. The resolved finalUrl might differ from the startUrl (for example, a redirection from HTTP to HTTPS might happen), so the database may contain webpages and docs with equal final URLs but different start URLs. Note also that webpages and docs have the same structure; they are simply two entirely separate stores, for general web pages and documentation web pages respectively.
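As a rough illustration of keying by startUrl, the sketch below models the webpages store as a plain Python dict (this is not the MapDB API) and lists final URLs that are reached from more than one start URL. The example.org URLs and the helper name are made up for illustration; only the field names startUrl and finalUrl come from the text above.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the webpages store: startUrl -> webpage.
webpages = {
    "http://example.org/tool": {"startUrl": "http://example.org/tool",
                                "finalUrl": "https://example.org/tool"},
    "https://example.org/tool": {"startUrl": "https://example.org/tool",
                                 "finalUrl": "https://example.org/tool"},
    "https://example.org/other": {"startUrl": "https://example.org/other",
                                  "finalUrl": "https://example.org/other"},
}


def shared_final_urls(store):
    """Return {finalUrl: [startUrl, ...]} for final URLs reached from several start URLs."""
    by_final = defaultdict(list)
    for start_url, page in store.items():
        by_final[page["finalUrl"]].append(start_url)
    return {final: starts for final, starts in by_final.items() if len(starts) > 1}
```

Here the HTTP and HTTPS start URLs are two separate keys, even though both entries resolve to the same final URL.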
Publications can be identified by 3 separate IDs: a PMID, a PMCID or a DOI. For this reason, a key in the publications store (which can be called the primary ID of the publication) is either a PMID, a PMCID or a DOI, depending on which of them was non-empty when the publication was first saved to the database. If more than one of them was available, then the PMID is preferred over the PMCID and the PMCID is preferred over the DOI. In addition, there is an extra store called “publicationsMap”, where a key is an ID (PMID/PMCID/DOI) of a publication and the corresponding value is the primary ID (PMID/PMCID/DOI) of that publication. So, for example, if a publication is to be loaded from the database, first publicationsMap is consulted to find the primary ID and then the found primary ID is used to find the publication in the publications store. All the mappings in publicationsMap can be dumped to stdout with -db-publications-map.
There is also a store called “publicationsMapReverse”, which has mappings that are the reverse of the publicationsMap mappings, that is, from primary ID to the triplet PMID, PMCID, DOI. In addition, publicationsMapReverse stores the URLs where these PMID, PMCID and DOI were found. This reverse mapping can be useful, for example, for quickly listing all publication IDs (as the triplet PMID, PMCID, DOI) found in a database file. All the mappings in publicationsMapReverse can be dumped to stdout with -db-publications-map-reverse.
The stores publicationsMapReverse and publicationsMap and the publications store are all kept coherent and in sync with each other. Note also that all stored DOIs are normalised, i.e. any valid prefix is removed (e.g. “https://doi.org/”, “doi:”) and letters from the 7-bit ASCII set are converted to uppercase.
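The interplay of the three publication stores can be sketched with plain Python dicts. This is only a model of the behaviour described above, not PubFetcher's Java code or the MapDB API: the class and method names are hypothetical, only the two DOI prefixes mentioned in the text are handled, matching the prefix case-insensitively is an assumption, and the provenance URLs kept in publicationsMapReverse are left out.

```python
import string

# Only the two prefixes named in the text; the real list is presumably longer.
DOI_PREFIXES = ("https://doi.org/", "doi:")
# Uppercase 7-bit ASCII letters only; other characters are left untouched.
ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def normalise_doi(doi):
    """Strip a known prefix (case-insensitively, an assumption) and uppercase ASCII letters."""
    doi = doi.strip()
    for prefix in DOI_PREFIXES:
        if doi[:len(prefix)].lower() == prefix:
            doi = doi[len(prefix):]
            break
    return doi.translate(ASCII_UPPER)


class PublicationStores:
    """Hypothetical stand-in for the publications-related stores."""

    def __init__(self):
        self.publications = {}              # primary ID -> publication
        self.publications_map = {}          # any ID -> primary ID
        self.publications_map_reverse = {}  # primary ID -> (PMID, PMCID, DOI)

    def put(self, pmid, pmcid, doi, publication):
        triplet = (pmid, pmcid, normalise_doi(doi) if doi else "")
        ids = [i for i in triplet if i]
        if not ids:
            raise ValueError("at least one of PMID, PMCID, DOI is required")
        # The primary ID is fixed at first save; otherwise prefer PMID > PMCID > DOI.
        known = [self.publications_map[i] for i in ids if i in self.publications_map]
        primary = known[0] if known else ids[0]
        previous = self.publications_map_reverse.get(primary, ("", "", ""))
        merged = tuple(new or old for new, old in zip(triplet, previous))
        # Update all three stores together so that they stay in sync.
        self.publications[primary] = publication
        self.publications_map_reverse[primary] = merged
        for i in merged:
            if i:
                self.publications_map[i] = primary
        return primary

    def get(self, any_id):
        """Look up the primary ID first, then the publication itself."""
        primary = self.publications_map.get(any_id)
        return None if primary is None else self.publications.get(primary)
```

In this model, a publication first saved with only a PMCID and a DOI keeps the PMCID as its primary ID even if a PMID becomes known later; the PMID is then just one more key in publications_map pointing to that primary ID.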
The structure of the values in the publications, webpages and docs stores, i.e. the actual contents stored in the database, is best described by the next section, JSON output, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the “empty”, “usable”, “final”, “totallyFinal” and “broken” fields present in the JSON output are not stored in the database; these values are inferred from actual database values and depend on some fetching parameters. Additionally, the fields “version” and “argv” are specific to JSON.
With a new release of PubFetcher, the structure of the database content might change (this involves code in the package org.edammap.pubfetcher.core.db). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).
The output of PubFetcher will be in JSON format if the option --format json is specified. If the option --plain is additionally specified, then fields about metadata will be omitted from the output. JSON support is implemented using libraries from the Jackson project.
All JSON output will contain the fields “version” and “argv”.
Information about the application that generated this JSON file
- Name of the application
- Homepage of the application
- Version of the application
- Array of all command-line parameters that were supplied to the application that generated this JSON file
JSON output of IDs/URLs, output using
IDs of publications
Publications are identified by the triplet PMID, PMCID and DOI.
Array of publication IDs
- The PubMed ID of the publication. Only articles available in PubMed can have this.
- The PubMed Central ID of the publication. Only articles available in PMC can have this.
- The Digital Object Identifier of the publication
- Provenance URL of the PMID
- Provenance URL of the PMCID
- Provenance URL of the DOI
If --plain is specified, then the provenance URLs are not output.
JSON output of the entire content of publications, webpages and docs, output using
Content of publications
A publication represents one publication (most often a research paper) and contains its ID (a PMID, a PMCID and/or a DOI), content (title, abstract, full text), keywords (user-assigned, MeSH and mined EFO and GO terms) and various metadata (Open Access flag, journal title, publication date, etc).
Array of publications
- Time of initial fetch or last retryCounter reset as UNIX time (in milliseconds)
- Time of initial fetch or last retryCounter reset as ISO 8601 combined date and time
- A refetch can occur if the value of retryCounter is less than retryLimit; or if any of the cooldown times (in fetching parameters) of a currently true condition have passed since fetchTime, in which case retryCounter is also reset
true if there was a fetching exception during the last fetch;
true if the article is Open Access;
- Title of the journal the article was published in
- Publication date of the article as UNIX time (in milliseconds); negative, if unknown
- Publication date of the article as ISO 8601 date; before 1970-01-01, if unknown
- Number of times the article has been cited (according to Europe PMC); negative, if unknown
- Time when citationsCount was last updated as UNIX time (in milliseconds); negative, if citationsCount has not yet been updated
- Time when citationsCount was last updated as ISO 8601 combined date and time; before 1970-01-01T00:00:00.000Z, if citationsCount has not yet been updated
Array of objects representing corresponding authors of the article
- Name of the corresponding author
- ORCID iD of the corresponding author
- E-mail of the corresponding author
- Telephone number of the corresponding author
- Web page of the corresponding author
Array of objects representing sites visited for getting content (outside of standard Europe PMC, PubMed and oaDOI resources and also excluding PDFs)
true, if all publication parts (except IDs) are empty;
true, if at least one publication part (apart from IDs) is usable;
true, if title, abstract and fulltext are final;
true, if all publication parts are final;
A publication part (like the following pmcid, doi, title, etc), in this case representing the publication PMID
- Content of the publication part (in this case, the publication PMID as a string)
- The type of the publication part content source
- URL of the publication part content source
- Time when the publication part content was set as UNIX time (in milliseconds)
- Time when the publication part content was set as ISO 8601 combined date and time
- Number of characters in the content
true, if the content is empty (size is
true, if the content is long enough (the threshold can be influenced by fetching parameters), in other words, if the publication part content can be used as input for other applications;
true, if the content is from a reliable source and is long enough, in other words, if there is no need to try fetching the publication part content from another source;
- Publication part representing the publication PMCID. Structure same as in pmid.
- Publication part representing the publication DOI. Structure same as in pmid.
- Publication part representing the publication title. Structure same as in pmid.
Publication part representing publication keywords. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
- Array of strings representing publication keywords
Publication part representing publication MeSH terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
Array of objects representing publication MeSH terms
- Term name
true, if the term is a major topic of the article
- MeSH Unique Identifier
Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid, except content is replaced with “list” and size is number of elements in “list”.
Array of objects representing publication EFO terms
- Term name
- Number of times the term was mined from full text by Europe PMC
- Unique URI to the ontology term
- Publication part representing publication GO terms. Structure same as in efo.
- Publication part representing the publication abstract. Structure same as in pmid.
- Publication part representing the publication fulltext. Structure same as in pmid.
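Two of the metadata rules above lend themselves to a small sketch: converting a UNIX time in milliseconds to the ISO 8601 form, and deciding whether a refetch can occur. The function names are hypothetical, the human-readable form is assumed to be in UTC (suggested by the “Z” suffix), and the cooldown times are assumed to be given in milliseconds and compared with “greater than or equal”; the reset of retryCounter is not modelled.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_human(millis):
    """Format UNIX time in milliseconds as ISO 8601 combined date and time (UTC assumed)."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"


def can_refetch(fetch_time, retry_counter, retry_limit, cooldowns, now):
    """Decide whether a refetch can occur.

    cooldowns holds the cooldown times (assumed here to be in milliseconds)
    of those conditions that are currently true; fetch_time and now are
    UNIX times in milliseconds.
    """
    if retry_counter < retry_limit:
        return True
    return any(now - fetch_time >= cooldown for cooldown in cooldowns)
```

A negative value, used above to mark an unknown date, maps to a moment before 1970-01-01T00:00:00.000Z, which matches the description of the human-readable fields.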
If --plain is specified, then metadata is omitted from the output (everything from fetchTime to totallyFinal) and the value corresponding to publication part keys (pmid to fulltext) will be the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go) as specified above for each corresponding part.
If --out-part is specified, then everything from fetchTime to totallyFinal will be omitted from the output and only publication parts specified by --out-part will be output (with structure as specified above). If --plain is specified along with --out-part, then output parts will only have as value the value of content (for pmid, pmcid, doi, title, abstract, fulltext) or the value of “list” (for keywords, mesh, efo, go).
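The effect of --plain and --out-part on one publication object can be mimicked on already parsed JSON. The sketch below is not how PubFetcher produces its output; it only post-processes a dict that has the full structure described above. The function names are hypothetical, and parts absent from the input are simply skipped.

```python
# Publication part keys, in the order used above.
CONTENT_PARTS = ("pmid", "pmcid", "doi", "title", "abstract", "fulltext")
LIST_PARTS = ("keywords", "mesh", "efo", "go")
ALL_PARTS = ("pmid", "pmcid", "doi", "title",
             "keywords", "mesh", "efo", "go", "abstract", "fulltext")


def select_parts(publication, out_parts):
    """Keep only the requested parts, with their full structure (like --out-part)."""
    return {name: publication[name]
            for name in ALL_PARTS
            if name in out_parts and name in publication}


def to_plain(publication, out_parts=None):
    """Drop metadata and reduce each part to its content or list (like --plain)."""
    wanted = ALL_PARTS if out_parts is None else out_parts
    plain = {}
    for name, part in select_parts(publication, wanted).items():
        plain[name] = part["list"] if name in LIST_PARTS else part["content"]
    return plain
```

Calling to_plain(publication, out_parts=["title", "keywords"]) corresponds to combining both options: the result maps title to its content string and keywords to its list.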
Content of webpages
A webpage represents a general web page from where relevant content has been extracted, along with some metadata. If the web page is about a software tool, then the software license and programming language can be stored separately, if found (this feature has been added to support EDAMmap).
Array of webpages
- Same as fetchTime of publications
- Same as fetchTimeHuman of publications
- Same as retryCounter of publications
- Same as fetchException of publications
- URL given as webpage identifier, same as listed by webpageUrls
- Final URL after potential redirections
- HTTP Content-Type header
- HTTP status code
- Time when current webpage content was last set as UNIX time (in milliseconds)
- Time when current webpage content was last set as ISO 8601 combined date and time
- Software license of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
- Programming language of the tool the webpage is about (empty if not found or missing corresponding scraping rule)
- Number of characters in the webpage title
- Number of characters in the webpage content
- The webpage title (as extracted by the corresponding scraping rule; or text from the HTML <title> element if scraping rules were not found)
true, if webpage title and webpage content are empty;
true, if the length of webpage title plus the length of webpage content is large enough (at least webpageMinLength characters), that is, the webpage can be used as input for other applications;
true, if the webpage is not broken and the webpage is usable and the length of the webpage content is larger than 0;
true, if the webpage with the given URL could not be fetched (based on the values of statusCode and finalUrl);
- The webpage content (as extracted by the corresponding scraping rule; or the automatically cleaned content from the entire HTML of the page if scraping rules were not found)
If --plain is specified, then only startUrl, webpage title and webpage content will be present.
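The inferred webpage fields described above can be expressed as a short sketch. The function name is hypothetical. Because the exact rule for “broken” (based on statusCode and finalUrl) is not spelled out here, it is passed in as an argument rather than computed, and webpageMinLength is a fetching parameter whose value the caller must supply.

```python
def webpage_flags(title, content, broken, webpage_min_length):
    """Infer the empty, usable and final flags of a webpage.

    broken is taken as given, since its derivation from statusCode and
    finalUrl is not specified in this section.
    """
    empty = not title and not content
    usable = len(title) + len(content) >= webpage_min_length
    final = (not broken) and usable and len(content) > 0
    return {"empty": empty, "usable": usable, "final": final, "broken": broken}
```

Note that a page with a long title but no content can be usable without being final, because final additionally requires non-empty content.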
Content of docs
Like Content of webpages, except it allows for a separate store for documentation web pages.
Array of docs
Structure is same as in webpages
HTML and plain text output
Output will be in HTML format, if --format html is specified, and in plain text, if --format text is specified or --format is omitted (as text is the default).
The HTML output is meant to be formatted and viewed in a web browser. Links to external resources (such as the different URL fields) are clickable in the browser.
The plain text output is formatted for viewing in the console or in a text editor.
Both the HTML output and the plain text output will contain the same information as the JSON output specified above and will behave analogously with respect to the --out-part parameters. There are, however, a few fields that are missing in HTML and plain text compared to JSON: “empty”, “usable”, “final”, “totallyFinal”, “broken” (these values are inferred from the values of some other fields and depend on some fetching parameters) and the JSON specific “version” and “argv”.
PubFetcher-CLI will log to stderr using the Apache Log4j 2 library. With the --log parameter (described in Logging), a text file to which the same log is also written can be specified.
Each log line will consist of the following: the date and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format “2018-08-24 11:37:20,187”. Log level can be DEBUG, INFO, WARN or ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only a few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message can fit on exactly one line. The name of the logger is just the fully qualified Java class (with the prefix “org.edamontology” removed) the logging event is called from (prepended with “@” in the log file), e.g. “@pubfetcher.cli.Cli”. The name of the thread will be “main” if the logging event was generated by the main thread; any subsequent thread will be named “Thread-2”, “Thread-3”, etc. In the log file the thread name will be in square brackets, e.g. “[Thread-2]”. Some Java exceptions can also be logged; these will be output with the stack trace on subsequent lines after the logged exception message.
Log level ERROR is used for erroneous conditions which mostly occur on the side of the PubFetcher user (like problems in provided input), so searching for “ERROR” in log files can potentially help in finding problems that can be fixed by the user. Some problems might be caused by issues in the used resources, like Europe PMC and PubMed, and some reported problems are not problems at all, like failing to find a publication part which is actually supposed to be missing, but these messages will usually have the log level WARN. One example of WARN level messages that can indicate inconsistencies in a used resource are the messages beginning with “Old ID”.
Some examples of issues found by analysing logs:
If multiple threads are writing to a log file, then the messages of different threads will be interwoven. To get the sequence of messages of only one thread, grep could be used:
$ grep Thread-2 database.log
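Where stack traces matter, a small script can do the same filtering while keeping the stack trace lines attached to the message they follow. This is a minimal sketch with hypothetical function names; it assumes that every log message starts with a timestamp in the format shown above, that the fields are separated by single spaces with the log level as the third field, and that lines without a leading timestamp are continuation lines (such as stack traces).

```python
import re

# Timestamp at the start of a log message, e.g. "2018-08-24 11:37:20,187 ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")


def records(lines):
    """Group raw lines into records; lines without a timestamp join the previous record."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP.match(line) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def records_of_thread(lines, thread, level=None):
    """Yield the records of one thread, optionally restricted to one log level."""
    marker = f"[{thread}]"
    for record in records(lines):
        head = record[0]
        if marker not in head:
            continue
        # Assumption: the level is the third space-separated field of the line.
        if level is not None and head.split(" ", 3)[2:3] != [level]:
            continue
        yield record

# Possible usage (file name taken from the grep example above):
#     with open("database.log") as log:
#         for record in records_of_thread(log, "Thread-2", level="ERROR"):
#             print("\n".join(record))
```

Matching on the bracketed form “[Thread-2]” also avoids picking up “Thread-20” and similar names, which a plain substring search for Thread-2 would include.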
In addition to analysing logs, the output of -part-table (described in Output) could be checked for possible problems. For example, title being “na” is a good indicator of an invalid ID. To list all such publications, the filter -part-type na -part-type-part title could be used. Other things of interest might be, for example, parts which are from other sources than the main ones (the europepmc, pubmed, pmc types and doi) or parts missing in Europe PMC, but present in PubMed or PMC.