.. _output:

######
Output
######

.. _database:

********
Database
********

The database file that is used by PubFetcher to save publications_, webpages_ and docs_ on disk is a simple key-value store generated by the `MapDB <http://www.mapdb.org/>`_ library.

In the webpages and docs stores, a key is simply the string representing the startUrl_, i.e. the URL given to PubFetcher to fetch content from. The resolved finalUrl_ might differ from the startUrl_ (for example, a redirection from HTTP to HTTPS might happen), meaning there might be webpages and docs with equal final URLs (but different start URLs) stored in the database. Note also that webpages and docs have the same structure; they are simply two entirely separate stores, for general web pages and documentation web pages respectively.

Publications can be identified by 3 separate IDs: :ref:`a PMID <id_pmid>`, :ref:`a PMCID <id_pmcid>` or :ref:`a DOI <id_doi>`. To accommodate this, a key in the publications store (which can be called the primary ID of the publication) is either a PMID, a PMCID or a DOI, depending on which of them was non-empty when the publication was first saved to the database. If more than one of them was available, then the PMID is preferred over the PMCID and the PMCID is preferred over the DOI.

In addition, there is an extra store called "publicationsMap", where a key is an ID (PMID/PMCID/DOI) of a publication and the corresponding value is the primary ID (PMID/PMCID/DOI) of that publication. So, for example, when a publication is loaded from the database, publicationsMap is consulted first to find the primary ID, and the primary ID is then used to find the publication in the publications store. All the mappings in publicationsMap can be dumped to stdout with ``-db-publications-map``.

There is also a store called "publicationsMapReverse", which holds the reverse of the publicationsMap mappings, that is, from primary ID to the triplet PMID, PMCID, DOI. In addition, publicationsMapReverse stores the URLs where these PMID, PMCID and DOI were found. This reverse mapping can be useful, for example, for quickly listing all publication IDs (as the triplet PMID, PMCID, DOI) found in a database file. All the mappings in publicationsMapReverse can be dumped to stdout with ``-db-publications-map-reverse``.

The publications, publicationsMap and publicationsMapReverse stores are all kept coherent and in sync with each other. Note also that all stored DOIs are normalised, i.e. any valid prefix is removed (e.g. "https://doi.org/", "doi:") and letters from the 7-bit ASCII set are converted to uppercase.
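
The lookup scheme described above can be sketched in Python, with plain dictionaries standing in for the MapDB stores. This is only an illustration of the described logic, not PubFetcher's actual (Java) implementation: the function names and the dictionary layout are made up for the example, and the prefix list contains only the two prefixes named above.

.. code-block:: python

  import string

  # Only the two prefixes named in the text; the real list of valid
  # prefixes may be longer (assumption).
  DOI_PREFIXES = ("https://doi.org/", "doi:")

  # Translation table that uppercases only the 7-bit ASCII letters.
  ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


  def normalise_doi(doi):
      """Remove a known prefix and uppercase ASCII letters only."""
      doi = doi.strip()
      for prefix in DOI_PREFIXES:
          if doi[:len(prefix)].lower() == prefix:
              doi = doi[len(prefix):]
              break
      return doi.translate(ASCII_UPPER)


  def primary_id(pmid, pmcid, doi):
      """Pick the publications store key: PMID, else PMCID, else DOI."""
      return pmid or pmcid or doi


  def save(publications, publications_map, pmid, pmcid, doi, publication):
      """Store a publication and map each non-empty ID to the primary ID."""
      doi = normalise_doi(doi) if doi else ""
      key = primary_id(pmid, pmcid, doi)
      publications[key] = publication
      for any_id in (pmid, pmcid, doi):
          if any_id:
              publications_map[any_id] = key
      return key


  def load(publications, publications_map, any_id):
      """Two-step lookup: any ID -> primary ID -> publication."""
      key = publications_map.get(any_id)
      return publications.get(key) if key is not None else None

In this sketch, a DOI used as a lookup key would have to be passed through ``normalise_doi`` first, because only normalised DOIs are stored.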

The structure of the values in the publications, webpages and docs stores, i.e. the actual contents_ stored in the database, is best described by the next section `JSON output`_, as the entire content of the database can be exported to an equivalently structured JSON file. Note that the "empty", "usable", "final", "totallyFinal" and "broken" fields present in the JSON output are not stored in the database; their values are inferred from actual database values and depend on some :ref:`fetching <fetching>` parameters. Additionally, the fields "version" and "argv" are specific to JSON.

With a new release of PubFetcher, the structure of the database content might change (this involves code in the package `org.edamontology.pubfetcher.core.db <https://github.com/edamontology/pubfetcher/tree/master/core/src/main/java/org/edamontology/pubfetcher/core/db>`_). Currently, there is no database migration support, which means that the content of existing database files will become unreadable in case of structure updates. If that content is still required, it would need to be refetched to a new database file (created with the new version of PubFetcher).

.. _json_output:

***********
JSON output
***********

The output of PubFetcher will be in JSON format if the option ``--format json`` is specified. If the option ``--plain`` is additionally specified, then fields about metadata will be omitted from the output. JSON support is implemented using libraries from the `Jackson project <https://github.com/FasterXML/jackson>`_.

Common
======

All JSON output will contain the fields "version" and "argv".

version
  Information about the application that generated this JSON file

  name
    Name of the application
  url
    Homepage of the application
  version
    Version of the application
argv
  Array of all command-line parameters that were supplied to the application that generated this JSON file

IDs
===

JSON output of IDs/URLs, output using ``-out-ids``, ``-txt-ids-pub``, ``-txt-ids-web`` or ``-txt-ids-doc``.

.. _ids_of_publications:

IDs of publications
-------------------

Publications_ are identified by the triplet PMID, PMCID and DOI.

publicationIds
  Array of publication IDs

    .. _id_pmid:

  pmid
    The PubMed ID of the publication. Only articles available in `PubMed <https://www.ncbi.nlm.nih.gov/pubmed>`_ can have this.

    .. _id_pmcid:
  pmcid
    The PubMed Central ID of the publication. Only articles available in `PMC <https://www.ncbi.nlm.nih.gov/pmc/>`_ can have this.

    .. _id_doi:
  doi
    The `Digital Object Identifier <https://www.doi.org/>`_ of the publication

    .. _pmidUrl:
  pmidUrl
    Provenance URL of the PMID

    .. _pmcidUrl:
  pmcidUrl
    Provenance URL of the PMCID

    .. _doiUrl:
  doiUrl
    Provenance URL of the DOI

If ``--plain`` is specified, then the provenance URLs are not output.
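
As an illustration, the structure above can be read back with Python's standard ``json`` module. The sketch below assumes that an ID which is not known is either missing, null or an empty string in the JSON; the function name is made up for this example.

.. code-block:: python

  import json


  def read_publication_ids(text):
      """Return (pmid, pmcid, doi) triplets from JSON output of publication IDs."""
      document = json.loads(text)
      triplets = []
      for entry in document.get("publicationIds", []):
          # Treat a missing or null ID the same as an empty one (assumption).
          triplets.append(
              tuple(entry.get(key) or "" for key in ("pmid", "pmcid", "doi"))
          )
      return triplets

Because the provenance URLs are simply ignored here, the same function can be applied to output generated with or without ``--plain``.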

.. _urls_of_webpages:

URLs of webpages
----------------

Webpages_ are identified by a URL.

  .. _webpageUrls:

webpageUrls
  Array of webpage URLs

.. _urls_of_docs:

URLs of docs
------------

Docs_ are identified by a URL.

docUrls
  Array of doc URLs

.. _contents:

Contents
========

JSON output of the entire content of publications, webpages and docs, output using ``-out``, ``-txt-pub``, ``-txt-web`` and ``-txt-doc``.

.. _content_of_publications:

Content of publications
-----------------------

A publication represents one publication (most often a research paper) and contains its ID (:ref:`a PMID <id_pmid>`, :ref:`a PMCID <id_pmcid>` and/or :ref:`a DOI <id_doi>`), content (title, abstract, full text), keywords (user-assigned, MeSH and mined EFO and GO terms) and various metadata (Open Access flag, journal title, publication date, etc).

    .. _publications:

publications
  Array of publications

  _`fetchTime`
    Time of initial fetch or last retryCounter_ reset as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds)
  _`fetchTimeHuman`
    Time of initial fetch or last retryCounter_ reset as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ combined date and time
  _`retryCounter`
    A refetch can occur if the value of retryCounter_ is less than :ref:`retryLimit <retrylimit>`; or if any of the cooldown times (in :ref:`fetching <fetching>` parameters) of a currently ``true`` condition have passed since fetchTime_, in which case retryCounter_ is also reset
  _`fetchException`
    ``true`` if there was a fetching exception during the last fetch; ``false`` otherwise
  _`oa`
    ``true`` if the article is Open Access; ``false`` otherwise
  _`preprint`
    ``true`` if the article is a preprint; ``false`` otherwise
  _`journalTitle`
    Title of the journal the article was published in
  _`pubDate`
    Publication date of the article as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds); negative, if unknown
  _`pubDateHuman`
    Publication date of the article as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ date; before ``1970-01-01``, if unknown
  _`citationsCount`
    Number of times the article has been cited (according to :ref:`Europe PMC <europe_pmc>`); negative, if unknown
  _`citationsTimestamp`
    Time when citationsCount_ was last updated as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds); negative, if citationsCount_ has not yet been updated
  _`citationsTimestampHuman`
    Time when citationsCount_ was last updated as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ combined date and time; before ``1970-01-01T00:00:00.000Z``, if citationsCount_ has not yet been updated
  _`correspAuthor`
    Array of objects representing corresponding authors of the article

    name
      Name of the corresponding author
    orcid
      `ORCID iD <https://en.wikipedia.org/wiki/ORCID>`_ of the corresponding author
    email
      E-mail of the corresponding author
    phone
      Telephone number of the corresponding author
    uri
      Web page of the corresponding author
  _`visitedSites`
    Array of objects representing sites visited for getting content (outside of standard Europe PMC, PubMed and oaDOI :ref:`resources <resources>` and also excluding PDFs)

    url
      URL of the visited site
    type
      :ref:`The type <publication_types>` of the site (as resource)
    from
      URL where the link of the site was picked up
    timestamp
      Time when the link of the site was picked up as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds)
    timestampHuman
      Time when the link of the site was picked up as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ combined date and time

    .. _publication_empty:

  empty
    ``true``, if all :ref:`publication part <publication_parts>`\ s (except IDs) are empty_; ``false`` otherwise

    .. _publication_usable:

  usable
    ``true``, if at least one :ref:`publication part <publication_parts>` (apart from IDs) is usable_; ``false`` otherwise

    .. _publication_final:

  final
    ``true``, if title_, abstract_ and fulltext_ are final_; ``false`` otherwise
  _`totallyFinal`
    ``true``, if all :ref:`publication part <publication_parts>`\ s are final_; ``false`` otherwise
  _`pmid`
    A :ref:`publication part <publication_parts>` (like the following pmcid_, doi_, title_, etc), in this case representing the publication PMID

    _`content`
      Content of the publication part (in this case, the publication PMID as a string)
    _`type`
      :ref:`The type <publication_types>` of the publication part content source
    _`url`
      URL of the publication part content source
    _`timestamp`
      Time when the publication part content was set as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds)
    _`timestampHuman`
      Time when the publication part content was set as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ combined date and time
    _`size`
      Number of characters in the content
    _`empty`
      ``true``, if the content is empty (size is ``0``); ``false`` otherwise
    _`usable`
      ``true``, if the content is long enough (the threshold can be influenced by :ref:`fetching <fetching>` parameters), in other words, if the publication part content can be used as input for other applications; ``false`` otherwise
    _`final`
      ``true``, if the content is from a reliable source and is long enough, in other words, if there is no need to try fetching the publication part content from another source; ``false`` otherwise
  _`pmcid`
    Publication part representing the publication PMCID. Structure same as in pmid_.
  _`doi`
    Publication part representing the publication DOI. Structure same as in pmid_.
  _`title`
    Publication part representing the publication title. Structure same as in pmid_.
  _`keywords`
    Publication part representing publication keywords. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".

    list
      Array of strings representing publication keywords
  _`mesh`
    Publication part representing publication MeSH terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".

    list
      Array of objects representing publication MeSH terms

      term
        Term name
      majorTopic
        ``true``, if the term is a major topic of the article
      uniqueId
        MeSH Unique Identifier
  _`efo`
    Publication part representing publication EFO and other experimental methods terms. Structure same as in pmid_, except content_ is replaced with "list" and size_ is number of elements in "list".

    list
      Array of objects representing publication EFO terms

      term
        Term name
      count
        Number of times the term was mined from full text by :ref:`Europe PMC <europe_pmc>`
      uri
        Unique URI to the ontology term
  _`go`
    Publication part representing publication GO terms. Structure same as in efo_.
  _`abstract`
    Publication part representing the publication abstract. Structure same as in pmid_.
  _`fulltext`
    Publication part representing the publication fulltext. Structure same as in pmid_.
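
The time fields above come in pairs: a UNIX time in milliseconds and a human-readable ISO 8601 counterpart. A minimal sketch of converting the former to the latter is given below. It assumes that the human-readable form is in UTC with millisecond precision (as suggested by the ``1970-01-01T00:00:00.000Z`` value mentioned above); the function name is made up for this example.

.. code-block:: python

  from datetime import datetime, timedelta, timezone

  EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


  def millis_to_iso(millis):
      """Convert UNIX time in milliseconds to an ISO 8601 UTC string.

      Returns None for negative values, which mark an unknown time.
      """
      if millis < 0:
          return None
      # Integer arithmetic via timedelta avoids float rounding of milliseconds.
      moment = EPOCH + timedelta(milliseconds=millis)
      return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")

For example, ``millis_to_iso(0)`` gives ``1970-01-01T00:00:00.000Z``. Note that "pubDateHuman" is a date only, so just the part before the ``T`` would be relevant for it.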

If ``--plain`` is specified, then metadata is omitted from the output (everything from fetchTime_ to totallyFinal_) and the value corresponding to :ref:`publication part <publication_parts>` keys (pmid_ to fulltext_) will be the value of content_ (for pmid_, pmcid_, doi_, title_, abstract_, fulltext_) or the value of "list" (for keywords_, mesh_, efo_, go_) as specified above for each corresponding part.

If ``--out-part`` is specified, then everything from fetchTime_ to totallyFinal_ will be omitted from the output and only :ref:`publication part <publication_parts>`\ s specified by ``--out-part`` will be output (with structure as specified above). If ``--plain`` is specified along with ``--out-part``, then output parts will only have as value the value of content_ (for pmid_, pmcid_, doi_, title_, abstract_, fulltext_) or the value of "list" (for keywords_, mesh_, efo_, go_).
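
Since the value of a publication part key differs between full and ``--plain`` output, code consuming the JSON might want to handle both shapes. The following is a minimal sketch of such a helper; the function name is made up for this example and ``publication`` is assumed to be one already parsed element of the "publications" array.

.. code-block:: python

  def part_value(publication, part):
      """Return the content (or list) of a publication part, or None if absent."""
      value = publication.get(part)
      if isinstance(value, dict):
          # Full output: an object holding "content" or "list" plus metadata.
          return value["list"] if "list" in value else value.get("content")
      # Plain output: the part is already the bare string or array.
      return value

With this helper, ``part_value(publication, "title")`` gives the title string and ``part_value(publication, "keywords")`` gives the array of keywords, regardless of whether ``--plain`` was used.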

.. _content_of_webpages:

Content of webpages
-------------------

A webpage represents a general web page from where relevant content has been extracted, along with some metadata. If the web page is about a software tool, then the software license and programming language can be stored separately, if found (this feature has been added to support `EDAMmap <https://github.com/edamontology/edammap>`_).

  .. _webpages:

webpages
  Array of webpages

  fetchTime
    Same as fetchTime_ of publications
  fetchTimeHuman
    Same as fetchTimeHuman_ of publications
  retryCounter
    Same as retryCounter_ of publications
  fetchException
    Same as fetchException_ of publications
  _`startUrl`
    URL given as webpage identifier, same as listed by webpageUrls_
  _`finalUrl`
    Final URL after potential redirections
  _`contentType`
    `HTTP Content-Type <https://en.wikipedia.org/wiki/Media_type>`_ header
  _`statusCode`
    `HTTP status code <https://en.wikipedia.org/wiki/List_of_HTTP_status_codes>`_
  _`contentTime`
    Time when current webpage content was last set as `UNIX time <https://en.wikipedia.org/wiki/Unix_time>`_ (in milliseconds)
  contentTimeHuman
    Time when current webpage content was last set as `ISO 8601 <https://en.wikipedia.org/wiki/ISO_8601>`_ combined date and time
  _`license`
    Software license of the tool the webpage is about (empty if not found or missing corresponding :ref:`scraping rule <scraping>`)
  _`language`
    Programming language of the tool the webpage is about (empty if not found or missing corresponding :ref:`scraping rule <scraping>`)
  _`titleLength`
    Number of characters in the `webpage title`_
  _`contentLength`
    Number of characters in the `webpage content`_

    .. _webpage_title:

  title
    The webpage title (as extracted by the corresponding :ref:`scraping rule <scraping>`; or text from the HTML ``<title>`` element if scraping rules were not found)

    .. _webpage_empty:

  empty
    ``true``, if `webpage title`_ and `webpage content`_ are empty; ``false`` otherwise

    .. _webpage_usable:

  usable
    ``true``, if the length of `webpage title`_ plus the length of `webpage content`_ is large enough (at least :ref:`webpageMinLength <webpageminlength>` characters), that is, the webpage can be used as input for other applications; ``false`` otherwise

    .. _webpage_final:

  final
    ``true``, if the webpage is not broken_ and the webpage is usable_ and the length of the `webpage content`_ is larger than 0; ``false`` otherwise

  _`broken`
    ``true``, if the webpage with the given URL could not be fetched (based on the values of statusCode_ and finalUrl_); ``false`` otherwise

    .. _webpage_content:

  content
    The webpage content (as extracted by the corresponding :ref:`scraping rule <scraping>`; or the :ref:`automatically cleaned <cleaning>` content from the entire HTML of the page if scraping rules were not found)

If ``--plain`` is specified, then only startUrl_, `webpage title`_ and `webpage content`_ will be present.
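
The inferred fields "empty", "usable" and "final" of a webpage follow directly from the definitions above, which can be sketched as below. The function name is made up for this example; the "broken" value and the minimum length are taken as parameters, because how they are determined depends on details (status code, final URL, fetching parameters) not reproduced here.

.. code-block:: python

  def webpage_flags(title, content, broken, webpage_min_length):
      """Infer the empty/usable/final flags of a webpage from its text."""
      empty = len(title) == 0 and len(content) == 0
      # Usable: title and content together are long enough.
      usable = len(title) + len(content) >= webpage_min_length
      # Final: not broken, usable, and the content itself is non-empty.
      final = (not broken) and usable and len(content) > 0
      return {"empty": empty, "usable": usable, "final": final}

Such a calculation can be handy for ``--plain`` output, where only the title and content strings are available.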

.. _content_of_docs:

Content of docs
---------------

Like `Content of webpages`_, except it allows for a separate store for documentation web pages.

  .. _docs:

docs
  Array of docs

  Structure is same as in webpages_

.. _html_and_plain_text_output:

**************************
HTML and plain text output
**************************

Output will be in HTML format, if ``--format html`` is specified, and in plain text, if ``--format text`` is specified or ``--format`` is omitted (as ``text`` is the default).

The HTML output is meant to be formatted and viewed in a web browser. Links to external resources (such as the different URL fields) are clickable in the browser.

The plain text output is formatted for viewing in the console or in a text editor.

Both the HTML output and the plain text output will contain the same information as the `JSON output`_ specified above and will behave analogously with respect to the ``--plain`` and ``--out-part`` parameters. There are, however, a few fields that are missing in HTML and plain text compared to JSON: "empty", "usable", "final", "totallyFinal", "broken" (these values are inferred from the values of some other fields and depend on some :ref:`fetching <fetching>` parameters) and the JSON specific "version" and "argv".

.. _log_file:

********
Log file
********

PubFetcher-CLI will log to stderr using the `Apache Log4j 2 <https://logging.apache.org/log4j/2.x/>`_ library. With the ``--log`` parameter (described in :ref:`Logging <logging>`), a text file can be specified to which the same log will also be written.

Each log line will consist of the following: the date and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format "2018-08-24 11:37:20,187". Log level can be DEBUG, INFO, WARN or ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only a few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message fits on exactly one line.

The name of the logger is just the fully qualified name of the Java class the logging event is called from, with the prefix "org.edamontology" removed and prepended with "@" in the log file, e.g. "@pubfetcher.cli.Cli". The name of the thread will be "main" if the logging event was generated by the main thread; any subsequent thread will be named "Thread-2", "Thread-3", etc. In the log file the thread name will be in square brackets, e.g. "[Thread-2]". Some Java exceptions can also be logged; these will be output with the stack trace on subsequent lines after the logged exception message.

Analysing logs
==============

Log level ERROR is used for erroneous conditions which mostly occur on the side of the PubFetcher user (like problems in provided input), so searching for "ERROR" in log files can potentially help in finding problems that can be fixed by the user. Some problems might be caused by issues in the used :ref:`resources <resources>`, like :ref:`Europe PMC <europe_pmc>` and PubMed, and some reported problems are not problems at all, like failing to find a :ref:`publication part <publication_parts>` which is actually supposed to be missing, but these messages will usually have the log level WARN. One example of WARN level messages that can indicate inconsistencies in a used resource is the messages beginning with "Old ID".

Some examples of issues found by analysing logs:

* https://github.com/bio-tools/biotoolsRegistry/issues/281
* https://github.com/bio-tools/biotoolsRegistry/issues/331
* https://github.com/bio-tools/biotoolsRegistry/issues/332

If multiple threads are writing to a log file, then the messages of different threads will be interwoven. To get the sequence of messages of only one thread, ``grep`` could be used:

.. code-block:: bash

  $ grep Thread-2 database.log
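
The same kind of filtering can be sketched in Python. The functions below are made up for this example and rely only on two points from the log description: that the thread name appears in square brackets and that the log level precedes the message. Matching the bracketed form also avoids a plain search for "Thread-2" picking up lines of "Thread-20" and the like.

.. code-block:: python

  import re
  from collections import Counter

  LEVEL = re.compile(r"\b(DEBUG|INFO|WARN|ERROR)\b")


  def lines_for_thread(lines, thread):
      """Keep only lines that carry the bracketed name of the given thread."""
      marker = "[" + thread + "]"
      return [line for line in lines if marker in line]


  def count_levels(lines):
      """Count lines per log level, using the first level keyword on each line."""
      counts = Counter()
      for line in lines:
          match = LEVEL.search(line)
          if match:
              counts[match.group(1)] += 1
      return counts

Both functions accept any iterable of lines, for example an open log file. Stack trace lines following a logged exception carry no thread name of their own, so they are dropped by ``lines_for_thread``, just as they would be by the ``grep`` command.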

In addition to analysing logs, the output of ``-part-table`` (described in :ref:`Output <cli_output>`) could be checked for possible problems. For example, title_ being "na" is a good indicator of an invalid ID. To list all such publications_ the filter ``-part-type na -part-type-part title`` could be used. Other things of interest might be for example parts which are from other sources than the main ones (the europepmc, pubmed, pmc types and doi) or parts missing in :ref:`Europe PMC <europe_pmc>`, but present in PubMed or PMC.

..

.. _`webpage title`: webpage_title_
.. _`webpage content`: webpage_content_