Fetching logic

The functionality explained in this section is mostly implemented in the source code file Fetcher.java.

Low-level methods

Getting a HTML document

Fetching HTML (or XML) resources for both publications and webpages/docs is done in the same method, where either the jsoup or the HtmlUnit library is used for getting the document. The HtmlUnit library has the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and it also works for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why jsoup is the default and HtmlUnit is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions). Also, fetching with JavaScript can get stuck for a few rare sites, in which case the misbehaving HtmlUnit code is terminated.

The supplied fetching parameters timeout and userAgent are used to set the connect timeout, the read timeout and the User-Agent HTTP header of connections. If getting the HTML document for a publication is successful and a list of already fetched links is supplied, then the current URL is added to that list so that it is not tried again for the current publication. The successfully fetched document is returned to the caller for further processing.
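
For illustration, the core of such a method could look roughly as follows (a minimal sketch, assuming jsoup and HtmlUnit 2.x on the classpath; names are illustrative and not the actual ones in Fetcher.java):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // Illustrative sketch only, not the actual code of Fetcher.java
    static Document getDocument(String url, int timeout, String userAgent, boolean javascript) throws Exception {
        if (!javascript) {
            // jsoup path (the default): timeout covers both connecting and reading
            return Jsoup.connect(url).timeout(timeout).userAgent(userAgent).get();
        }
        try (WebClient webClient = new WebClient()) { // HtmlUnit path, with JavaScript
            webClient.getOptions().setTimeout(timeout);
            webClient.addRequestHeader("User-Agent", userAgent);
            HtmlPage page = webClient.getPage(url);
            // hand the rendered page over to jsoup, so further processing is uniform
            return Jsoup.parse(page.asXml(), url);
        }
    }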

A number of exceptions can occur, in which case getting the HTML document has failed and the following is done:

MalformedURLException
The protocol is not HTTP or HTTPS or the URL is malformed. Getting the URL is tried again as a PDF document, as a few of these exceptions are caused by URLs that point to PDFs accessible through the FTP protocol.
HttpStatusException or FailingHttpStatusCodeException
The fetchException of the publication, webpage or doc is set to true in case the HTTP status code in the response is 503 Service Unavailable. Setting fetchException to true means the URL can be tried again in the future (depending on retryCounter or fetchExceptionCooldown) as the 503 code is usually a temporary condition. Additionally, in case of publications, fetchException is set to true for all failing HTTP status codes if the URL is not from “doi.org” and it is not a URL pointing to a PDF, PS or GZIP file.
ConnectException or NoRouteToHostException
An error occurred while attempting to connect a socket to a remote address and port. fetchException is set to true.
UnsupportedMimeTypeException
The response MIME type is not supported. If the MIME type is determined to be a PDF type, then getting the URL is tried again, but as a PDF document.
SocketTimeoutException
A timeout has occurred on a socket read or accept. A new attempt is made right away and if that also fails with a timeout, then fetchException is set to true.
SSLHandshakeException or SSLProtocolException
Problem with SSL. If fetching was attempted with jsoup, then it is attempted once more, but with HtmlUnit.
IOException
A connection or read error occurred, just issue a warning to the log.
Exception
Some other checked exception has occurred; fetchException is set to true.
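
The overall shape of this error handling could be sketched like this (simplified; tryPdf and the publication setter are hypothetical stand-ins):

    try {
        document = getDocument(url, timeout, userAgent, javascript);
    } catch (MalformedURLException e) {
        tryPdf(url); // hypothetical helper: some such URLs are FTP links to PDFs
    } catch (HttpStatusException e) { // thrown by jsoup
        if (e.getStatusCode() == 503) {
            publication.setFetchException(true); // usually temporary, can retry later
        }
    } catch (SocketTimeoutException e) {
        // retry once right away; if that fails with a timeout too, set fetchException
    } catch (SSLHandshakeException | SSLProtocolException e) {
        // if jsoup was used, try once more with HtmlUnit
    } catch (IOException e) {
        // connection or read error: just issue a warning to the log
    }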

The HTML document fetching method can be tested with the CLI commands -fetch-document or -fetch-document-javascript (but without publications, webpages, docs and PDF support).

Getting a PDF document

Analogous to getting a HTML document. The Apache PDFBox library is used for extracting content and metadata from the PDF. The method for getting a PDF document is called upon if the URL is known in advance to point to a PDF file or if this fact is found out during the fetching of the URL as a HTML document.

Nothing is returned to the caller, as the supplied publication, webpage or doc is filled directly. For webpages and docs, all the text extracted from the PDF is set as their content, and if a title is found among the PDF metadata, it is set as their title. For publications, the text extracted from the PDF is set to be the fulltext. Also, title, keywords or theAbstract are filled with content found among the PDF metadata, but as this happens very rarely, fetching of the PDF is not done at all if the fulltext is already final.
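
For illustration, extraction with Apache PDFBox 2.x could look roughly like this (a sketch, with illustrative names):

    import java.io.InputStream;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;
    import org.apache.pdfbox.text.PDFTextStripper;

    static void fillFromPdf(InputStream in) throws Exception {
        try (PDDocument pdf = PDDocument.load(in)) {
            String text = new PDFTextStripper().getText(pdf); // all text of the PDF
            PDDocumentInformation info = pdf.getDocumentInformation();
            String title = info.getTitle();       // often missing from the metadata
            String keywords = info.getKeywords(); // filled very rarely
            // fill the corresponding publication, webpage or doc fields from these
        }
    }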

Selecting from the returned HTML document

The fetched HTML is parsed to a jsoup Document and returned to the caller.

Then, parts of the document can be selected to fill the corresponding fields of publications, webpages and docs using the jsoup CSS-like element Selector. This is explained in more detail in the Scraping rules section.
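
For example, with a fetched jsoup Document doc (the selector strings below are invented for illustration):

    // selectors like these would come from the scraping rules
    String title = doc.select("h1.article-title").text();
    String theAbstract = doc.select("div.abstract p").text();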

Testing fetching of HTML (and PDF) documents and selecting from them can be done with the CLI operation -fetch-webpage-selector.

Cleaning the returned HTML document

If no selectors are specified for the given HTML document, then automatic cleaning and formatting of the document will be done instead.

The purpose of cleaning is to extract only the main content, while discarding auxiliary content, like menus and other navigational elements, footers, search and login forms, social links, contents of <noscript>, publication references, etc. We clean the document by deleting such elements and their children. The elements are found by tag name (for example <nav> or <footer>), but their IDs, class names and ARIA roles are also matched against combinations of keywords. Some words (like “menu” or “navbar”) are good enough to outright delete the matched element, whether they match by themselves, with a specifier (like “left” or “main”) or in combination with another word (like “tab” or “links”). Other words (like the mentioned “tab” and “links”, but also “bar”, “search”, “hidden”, etc), either by themselves or combined with specifiers, are not specific enough to delete the matched element without some extra confidence. So, for these words and combinations there is the extra condition that no child or parent of the matched element may be an element that we determine to be about the main content (<main>, <article>, <h1>, “content”, “heading”, etc).
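
A heavily simplified sketch of such cleaning with jsoup (assuming org.jsoup.nodes.Element is imported; the real keyword lists and conditions are considerably longer and more nuanced):

    // tags that are auxiliary by themselves
    doc.select("nav, footer, aside, noscript, form").remove();
    // elements whose id, class or ARIA role matches a "good enough" keyword
    for (Element e : doc.select("[id], [class], [role]")) {
        if (!e.hasParent()) continue; // already removed with an ancestor
        String hint = (e.id() + " " + e.className() + " " + e.attr("role")).toLowerCase();
        if (hint.contains("menu") || hint.contains("navbar")) {
            e.remove();
        }
        // weaker keywords ("tab", "links", "bar", "search", ...) would additionally
        // require that neither the element nor its parents/children look like main
        // content (<main>, <article>, <h1>, "content", "heading", ...)
    }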

After this cleaning has been done, the remaining text will be extracted from the document and formatted. Paragraphs and other blocks of text will be separated by empty lines in the output. If any text is found in the description <meta> tag, then it will be prepended to the output.

Multithreaded fetching

Only one thread should be filling one publication or one webpage or one doc. But many threads can be filling different publications, webpages and docs in parallel. If many of these threads depend on the same resources, then what can happen is many parallel connections to the same host. To avoid such hammering, locking is implemented around each connection such that only one connection to one host is allowed at once (comparison of hosts is done case-insensitively and “www.” is removed). Other threads wanting to connect to the same host will have to wait until the resource is free again.
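
One possible shape of such per-host locking (a sketch, not the actual implementation):

    import java.net.URI;
    import java.util.Locale;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class HostLock {
        private static final ConcurrentMap<String, Object> LOCKS = new ConcurrentHashMap<>();

        static String normalise(String url) throws Exception {
            String host = new URI(url).getHost().toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        }

        static void fetchWithLock(String url, Runnable fetch) throws Exception {
            Object lock = LOCKS.computeIfAbsent(normalise(url), k -> new Object());
            synchronized (lock) { // only one connection per host at a time
                fetch.run();
            }
        }
    }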

Fetching publications

Resources

Unfortunately, not all content pertaining to a publication is available from a single Internet resource. Therefore, a number of resources are consulted, and the final publication might contain content from different resources, for example an abstract from one place and the full text from another.

What follows is a list of these resources. They are defined in the order they are tried: if, after fetching a given resource, all required publication parts become final, or none of the subsequent resources can fill the still missing parts, then the resources below the given resource are not fetched from.

But if, after going through all the necessary resources, more IDs of the publication are known than before consulting them, then another run through all the resources is done, starting from the first (as knowing a new ID might enable us to query a resource that couldn’t be queried before). In doing this we keep track of resources that have already been fetched successfully, so as not to fetch them a second time, and of course, for each resource, we still evaluate whether it can provide us with anything useful before fetching is attempted.

Sometimes, publication IDs can change, e.g. when we find from a resource with a better type (see Publication types) that the DOI of the publication is different from what we currently have. In such a case, all publication content (except IDs) is emptied and fetching is restarted from scratch.
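
As a pseudocode-like Java sketch of this loop (all type and method names are hypothetical):

    Set<Resource> alreadyFetched = new HashSet<>();
    boolean idsChanged = true;
    while (idsChanged) {
        Ids before = publication.getIds();
        for (Resource resource : RESOURCES) { // in the order listed below
            if (requiredPartsFinal(publication)) break;          // nothing left to fill
            if (alreadyFetched.contains(resource)) continue;     // don't fetch twice
            if (!resource.canProvideAnythingUseful(publication)) continue;
            if (resource.fetch(publication)) alreadyFetched.add(resource);
        }
        // a newly found ID might enable querying a resource that was skipped before
        idsChanged = !publication.getIds().equals(before);
    }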

Europe PMC

Europe PubMed Central is a repository containing, among other things, abstracts, full text and preprints of biomedical and life sciences articles. It is the primary resource used by PubFetcher and a majority of content can be obtained from there.

The endpoint of the API is https://www.ebi.ac.uk/europepmc/webservices/rest/search, documentation is at https://europepmc.org/RestfulWebService. The API accepts any of the publication IDs: either a PMID, a PMCID or a DOI. With parameter europepmcEmail an e-mail address can be supplied to the API.

We can possibly get all publication parts from the Europe PMC API, except for fulltext, efo and go for which we get a Y or N indicating if the corresponding part is available at the Europe PMC fulltext or Europe PMC mined resource. In addition, we can possibly get values for the publication fields oa, journalTitle, pubDate and citationsCount. Europe PMC is currently the only resource we can get the citationsCount value from.

Europe PMC itself has content from multiple sources (see https://europepmc.org/Help#contentsources) and in some cases multiple results are returned for a query (each from a different source). In that case the MED (MEDLINE) source is preferred, then PMC (PubMed Central), then PPR (preprints) and then whichever source is first in the list of results.
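
For illustration, a query for a single article might be built like this (ext_id/src, resultType and format are documented parameters of the search API; the exact query that PubFetcher sends may differ):

    String pmid = "23144668"; // example PMID
    String url = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
        + "?query=" + URLEncoder.encode("ext_id:" + pmid + " src:med", "UTF-8")
        + "&resultType=core&format=json"
        + "&email=" + europepmcEmail; // from the fetching parameter europepmcEmail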

Europe PMC fulltext

Full text from the Europe PMC API is obtained from a separate endpoint: https://www.ebi.ac.uk/europepmc/webservices/rest/{PMCID}/fullTextXML. The PMCID of the publication must be known to query the API.

The API is primarily meant for getting the fulltext, but it can also be used to get the parts pmid, pmcid, doi, title, keywords, theAbstract if these were requested and are still non-final (for some reason not obtained from the main resource of Europe PMC). In addition, journalTitle and correspAuthor can be obtained.

Europe PMC mined

Europe PMC has text-mined terms from publication full texts. These can be fetched from the API endpoint https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds, documentation of the Annotations API is at https://europepmc.org/AnnotationsApi. These resources are the only way to fill the publication parts efo and go and only those publication parts can be obtained from these resources (type “Gene Ontology” is used for GO and type “Experimental Methods” for EFO). Either a PMID or a PMCID is required to query these resources.
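
For illustration, a request for the GO annotations of an article might be built like this (parameters as documented for the Annotations API; the exact request PubFetcher sends may differ):

    String url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
        + "?articleIds=" + URLEncoder.encode("MED:23144668", "UTF-8") // source prefix plus PMID
        + "&type=" + URLEncoder.encode("Gene Ontology", "UTF-8")      // or "Experimental Methods" for EFO
        + "&format=JSON";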

PubMed XML

The PubMed resource is used to access abstracts of biomedical and life sciences literature from the MEDLINE database.

The following URL is used for retrieving data in XML format for an article: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pubmed&id={PMID}. As seen, a PMID is required to query the resource. Documentation is at https://www.ncbi.nlm.nih.gov/books/NBK25500/.

In addition to theAbstract, the publication parts pmid, pmcid, doi, title and mesh can possibly be obtained from PubMed. Also, the publication part keywords can seldom be obtained, but if keywords is the only still missing publication part, then the resource is not fetched (instead, PubMed Central is relied upon for keywords). In addition, we can possibly get values for the publication fields journalTitle and pubDate.

PubMed HTML

Information from PubMed can be output in different formats, including HTML (to be viewed in the browser) from the URL: https://www.ncbi.nlm.nih.gov/pubmed/?term={PMID}. By scraping the resulting page we can get the same publication parts as from the XML obtained through PubMed E-utilities; however, the HTML version of PubMed is only fetched if by that point title or theAbstract is still non-final (i.e., PubMed XML, but also Europe PMC, failed to fetch these for some reason). So this is more of a redundant resource that is rarely used and even more rarely useful.

PubMed Central

PubMed Central contains full text articles, which can be obtained in XML format from the URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pmc&id={PMCID}, where {PMCID} is the PMCID of the publication with the “PMC” prefix removed.

It is analogous to Europe PMC fulltext and used as a backup to that resource for getting content for articles available in the PMC system.

DOI resource

Sometimes, some publication parts must be fetched directly from the publisher. A DOI (Digital Object Identifier) of a publication is a persistent identifier which, when resolved, should point to the correct URL of the journal article.

First, the DOI is resolved to the URL it redirects to and this URL is fed to the Getting a HTML document method. If the URL has a match in the JavaScript section of the Journals YAML scraping configuration, then the HTML document will be fetched using JavaScript support. The publication parts that can possibly be scraped from the article’s page are doi, title, keywords, theAbstract, fulltext and possibly (but very rarely) pmid and pmcid. These publication parts are extracted from the web page using corresponding scraping rules. If no scraping rules are found, then the content of the HTML <title> element will be set as the value of the publication part title (if title is still non-final) and the whole text of the HTML set as the value of fulltext (if fulltext is still non-final). Additionally, a link to the web page containing the full text of the article and a link pointing to the article PDF might be added to Links, if specified by the scraping rules, and in addition names and e-mails for correspAuthor can be found.
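
Resolving the DOI can be done by following HTTP redirects from doi.org, sketched here with plain HttpURLConnection (not necessarily how Fetcher.java does it):

    import java.net.HttpURLConnection;
    import java.net.URL;

    static String resolveDoi(String doi) throws Exception {
        String location = "https://doi.org/" + doi;
        while (true) {
            HttpURLConnection con = (HttpURLConnection) new URL(location).openConnection();
            con.setInstanceFollowRedirects(false); // inspect each redirect manually
            int code = con.getResponseCode();
            if (code < 300 || code >= 400) {
                return location; // no further redirect: this is the final URL
            }
            // note: a relative Location would need resolving against the current URL
            location = con.getHeaderField("Location");
        }
    }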

In contrast to the other resources, <meta> elements are looked for in the HTML as these might contain the publication parts pmid, pmcid, doi, title, keywords and also theAbstract, plus Links to additional web pages or PDFs containing the article and sometimes also e-mail addresses for correspAuthor. More about these meta tags is described in Meta.

Also, in contrast to other resources, the final URL resolved from the DOI is added to visitedSites.

Unpaywall

The Unpaywall service helps to find Open Access content. It is mainly useful for finding PDFs of some articles for which no full text content was found using the above resources, but it can also help in filling a few other publication parts and fields, such as oa. The service was formerly known as oaDOI.

The API is queried as follows: https://api.unpaywall.org/v2/{DOI}?email={oadoiEmail}, documentation is at https://unpaywall.org/products/api. As seen, the DOI of the publication must be known to query the service.

The response will be in JSON format, which is why the method of Getting a HTML document is not used (but the process of obtaining the resource is analogous). Unpaywall will be called if title, theAbstract or fulltext is non-final (or if pmid, pmcid or doi is non-final, but only if these are the only publication parts requested). From the response we can possibly directly fill the publication part title and the fields oa and journalTitle. But in addition we can find Links to web pages containing the article or to PDFs of the article.
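
A sketch of building the request (the response field names in the comments are as documented by Unpaywall and should be double-checked there):

    String url = "https://api.unpaywall.org/v2/" + doi + "?email=" + oadoiEmail;
    // The JSON response contains fields such as "is_oa" (for oa), "journal_name"
    // (for journalTitle) and "title", plus "oa_locations" entries whose
    // "url_for_landing_page" and "url_for_pdf" values can become Links.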

Meta

The web pages of journal articles can have metadata embedded in the HTML in <meta> elements. Sometimes this can be used to fill publication parts which have not been found elsewhere.

There are a few standard meta tag formats; those supported by PubFetcher are: HighWire, EPrints, bepress, Dublin Core, Open Graph, Twitter and the generic tag (without any prefix). An example of a HighWire tag: <meta name="citation_keyword" content="foo">. An example of an Open Graph tag: <meta property="og:title" content="bar" />.

Publication parts potentially found in <meta> elements (depending on format) are: pmid, pmcid, doi, title, keywords, theAbstract. Additionally, Links to web pages containing the article or to PDFs of the article can be found in some meta tags.

On the web pages of articles of some journals, the standard <meta> tags are filled with content that is not entirely correct (for our purposes), so exceptions have been defined to not use these tags for those journals.

<meta> elements are searched for only in web pages resolved from DOI and in web pages added to Links.
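
For example, picking a few of these tags out of a fetched jsoup Document doc (exact attribute names vary between formats and sites):

    String pmid = doc.select("meta[name=citation_pmid]").attr("content");       // HighWire
    String title = doc.select("meta[property=og:title]").attr("content");       // Open Graph
    String pdfUrl = doc.select("meta[name=citation_pdf_url]").attr("content");  // PDF Link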

Publication types

Publication part types are the following, ordered from best to worst:

europepmc
Type given to parts obtained from the Europe PMC and Europe PMC mined resources
europepmc_xml
From Europe PMC fulltext resource
europepmc_html
Currently disabled
pubmed_xml
From PubMed XML resource
pubmed_html
From PubMed HTML resource
pmc_xml
From PubMed Central resource
pmc_html
Currently disabled
doi
From DOI resource (excluding PDF links)
link
Link to publication. Not used in PubFetcher itself. Meant as an option in applications extending or using PubFetcher.
link_oadoi
Given to Links found in Unpaywall resource (excluding PDF links)
citation
From HighWire Meta tags (excluding links)
eprints
From EPrints Meta tags (excluding links)
bepress
From bepress Meta tags (excluding PDF links)
link_citation
Links from HighWire Meta tags (excluding PDF links)
link_eprints
Links from EPrints Meta tags (excluding PDF links)
dc
From Dublin Core Meta tags
og
From Open Graph Meta tags
twitter
From Twitter Meta tags
meta
From generic Meta tags (excluding links)
link_meta
Links from generic Meta tags (excluding PDF links)
external
Type given to externally supplied pmid, pmcid or doi
oadoi
From Unpaywall resource (excluding links, currently only title)
pdf_europepmc
Currently disabled
pdf_pmc
Currently disabled
pdf_doi
Type given to PDF Links extracted from a DOI resource or if the DOI itself resolves to a PDF file (which is fetched as described in Getting a PDF document)
pdf_link
PDF from link to publication. Not used in PubFetcher itself. Meant as an option in applications extending or using PubFetcher.
pdf_oadoi
PDF Links from Unpaywall resource
pdf_citation
PDF Links from HighWire Meta tags
pdf_eprints
PDF Links from EPrints Meta tags
pdf_bepress
PDF Links from bepress Meta tags
pdf_meta
PDF Links from generic Meta tags
webpage
Type given to title and fulltext set from an article web page with no scraping rules
na
Initial type of a publication part

Types “europepmc”, “europepmc_xml”, “europepmc_html”, “pubmed_xml”, “pubmed_html”, “pmc_xml”, “pmc_html”, “doi”, “link” and “link_oadoi” are final types. Final types are the best types and they are equivalent to each other (meaning that one final type is not better than another final type and their ordering does not matter).

The type of the publication part being final is a necessary condition for the publication part to be final. The other condition is for the publication part to be large enough (as specified by titleMinLength, keywordsMinSize, minedTermsMinSize, abstractMinLength or fulltextMinLength in fetching parameters). The fulltext part has the additional requirement of being better than “webpage” type to be considered final.

When filling a publication part, the type of the new content must be better than the type of the old content. Or, if both types are final but the publication part itself is not yet final (because the content is not large enough), then the new content will override the old content if the new content is larger. Publication parts which are final can’t be overwritten. Also, the publication fields (these are not publication parts) journalTitle, pubDate and correspAuthor can only be set once with non-empty content, after which they can’t be overwritten anymore.
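
The overriding decision can thus be summarised in a sketch like this (all names hypothetical):

    boolean override =
        newType.isBetterThan(oldType)
        || (newType.isFinal() && oldType.isFinal() && !part.isFinal() // content too small so far
            && newContent.length() > oldContent.length());
    if (part.isFinal()) {
        override = false; // a final publication part can never be overwritten
    }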

Publication parts

Publication parts have content and contain the fields type, url and timestamp, as described in the JSON output of the publication part pmid. The publication fields oa, journalTitle, pubDate, etc. do not contain extra information besides content and are not publication parts.

The publication parts are as follows:

pmid

The PubMed ID of the publication. Only articles available in PubMed can have this. Only a valid PMID can be set to the part. The pmid structure.

pmcid

The PubMed Central ID of the publication. Only articles available in PMC can have this. Only a valid PMCID can be set to the part. The pmcid structure.

doi

The Digital Object Identifier of the publication. Only a valid DOI can be set to the part. The DOI will be normalised in the process, i.e. any valid prefix (e.g. “https://doi.org/”, “doi:”) is removed and letters from the 7-bit ASCII set are converted to uppercase. The doi structure.

title

The title of the publication. The title structure.

keywords

Author-assigned keywords of the publication. Often missing or not found. Empty and duplicate keywords are removed. The keywords structure.

mesh

Medical Subject Headings terms of the publication. Assigned to articles in PubMed (with some delay after publication). The mesh structure.

efo

Experimental factor ontology terms of the publication (but also experimental methods terms from other ontologies like Molecular Interactions Controlled Vocabulary and Ontology for Biomedical Investigations). Text-mined by the Europe PMC project from the full text of the article. The efo structure.

go

Gene ontology terms of the publication. Text-mined by the Europe PMC project from the full text of the article. The go structure.

theAbstract

The abstract of the publication. The part is called “theAbstract” instead of just “abstract”, because “abstract” is a reserved keyword in the Java programming language. The abstract structure.

fulltext

The full text of the publication. The part includes the title and abstract of the publication in the beginning of the content string. All the main content of the article’s full text is included, from introduction to conclusions. Captions of figures and tables and descriptions of supplementary materials are also included. From back matter, the glossary, notes and misc sections are usually included. But acknowledgements, appendices, biographies, footnotes, copyrights and, most importantly, references are excluded, whenever possible. If fulltext is obtained from a PDF, then everything is included. In the future, it could be useful to include all these parts of full text, like references, but in a structured way. The fulltext structure.

Fetching webpages and docs

A webpage or doc is also obtained using the method described in Getting a HTML document (or Getting a PDF document if the webpage or doc URL turns out to be a link to a PDF file). Webpage and doc fields that can be filled from the fetched content using scraping rules are the webpage title, the webpage content, license and language. Other fields are filled with metadata during the fetching process; the whole structure can be seen in the webpages section of the output documentation. If no scraping rules are present for the webpage or doc, then the webpage content will be the entire string parsed from the fetched HTML and the webpage title will be the content inside the <title> tag. Whether the webpage or doc is fetched with JavaScript support can also be influenced with scraping rules. A webpage or doc can also be fetched using rules specified on the command line with the command -fetch-webpage-selector (see Print a web page).

The same publication can be fetched multiple times, with each fetching potentially adding some missing content to the existing publication. In contrast, a webpage or doc is always fetched from scratch. If the resulting webpage or doc is final and a corresponding webpage or doc already exists, then this existing entry will be overwritten. An existing webpage or doc will also be overwritten if the new entry is non-final (but not empty) and the old entry is non-final (and potentially empty), or if both the new and old entries are empty.

Can fetch

The methods for fetching publications, webpages and docs are always given a publication, webpage or doc as a parameter. If a publication, webpage or doc is fetched from scratch, then an initial empty entry is supplied. Each time, these methods have to determine whether the publication, webpage or doc can be fetched or whether fetching should be skipped this time. The fetching will happen if any of the following conditions is met:

If it was determined that fetching happens, then fetchTime is set to the current time and retryCounter is reset to 0 if any condition except the last is met. If only the last condition (about retryCounter and retryLimit) is met, then retryCounter is incremented by 1 (and fetchTime is left as is, meaning that fetchTime does not necessarily show the time of the last fetching, but only the time of the initial fetching or the time when fetching happened because one of the cooldown timers expired).

The fetchException is set to false at the beginning of each fetching and set to true if certain types of errors happen during fetching; some such error conditions are described in Getting a HTML document. fetchException can also be set to true by the method described in Getting a PDF document and by the custom method getting the Unpaywall resource.
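
The bookkeeping described above could be sketched roughly as follows (an inferred sketch with hypothetical names; the full condition list is defined by the fetching parameters):

    if (onlyRetryConditionMet) { // only retryCounter < retryLimit was met
        entry.setRetryCounter(entry.getRetryCounter() + 1); // fetchTime left as is
    } else { // first fetch, or one of the cooldown timers expired
        entry.setFetchTime(System.currentTimeMillis());
        entry.setRetryCounter(0);
    }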