Fetching logic

The functionality explained in this section is mostly implemented in the source code file Fetcher.java.

Low-level methods

Getting a HTML document

Fetching HTML (or XML) resources for both publications and webpages/docs is done in the same method, where either the jsoup, HtmlUnit or Selenium WebDriver libraries are used for getting the document. The HtmlUnit and Selenium libraries have the advantage of supporting JavaScript, which needs to be executed to get the proper output for many sites, and they also work for some sites with problematic SSL certificates. As a disadvantage, it is a lot slower than jsoup, which is why using jsoup is the default and HtmlUnit/Selenium is used only if JavaScript support is requested (or switched to automatically in case of some SSL exceptions or if a webpage without scraping rules seems to require JavaScript based on --webpageMinLengthJavascript or the presence of <noscript> in the HTML source). Among HtmlUnit and Selenium, using Selenium is the default and switching to HtmlUnit can be done by supplying --selenium false. Selenium is faster and works correctly with more sites, furthermore HtmlUnit can sometimes get stuck for some sites (in which case the misbehaving HtmlUnit code is terminated). However, Selenium requires the Firefox browser to be installed (or available and pointed to with --seleniumFirefox) and is somewhat more heavyweight. Also, with Selenium the HTTP status code can’t be obtained and getting HTTP Content-Type is tricky (some more popular ones are guessed). Up to 4 Selenium WebDrivers are started, if all 4 are occupied fetching a resource, then further parallel requests will have to wait for one of them to finish.

Supplied fetching parameters timeout and userAgent are used for setting the connect timeout and the read timeout and the User-Agent HTTP header of connections. If getting the HTML document for a publication is successful and a list of already fetched links is supplied, then the current URL will be added to that list so that it is not tried again for the current publication. The successfully fetched document is returned to the caller for further processing.

A number of exceptions can occur, in which case getting the HTML document has failed and the following is done:

MalformedURLException or ClientProtocolException: The protocol is not HTTP or HTTPS or the URL is malformed. Getting the URL is tried again as a PDF document, as a few of these exception are caused by URLs that point to PDFs accessible through the FTP protocol.
HttpStatusException or FailingHttpStatusCodeException: The fetchException of the publication, webpage or doc is set to true in case the HTTP status code in the response is 503 Service Unavailable. Setting fetchException to true means the URL can be tried again in the future (depending on retryCounter or fetchExceptionCooldown) as the 503 code is usually a temporary condition. Additionally, in case of publications, fetchException is set to true for all failing HTTP status codes if the URL is not from “doi.org” and it is not a URL pointing to a PDF, PS or GZIP file.
ConnectException or NoRouteToHostException: An error occurred while attempting to connect a socket to a remote address and port. Set fetchException to true.
UnsupportedMimeTypeException: The response MIME type is not supported. If the MIME type is determined to be a PDF type, then getting the URL is tried again, but as a PDF document.
SocketTimeoutException or TimeoutException: A timeout has occurred on a socket read or accept. A new attempt is made right away and if that also fails with a timeout, then fetchException is set to true.
SSLHandshakeException or SSLProtocolException: Problem with SSL. If fetching was attempted with jsoup, then it is attempted once more, but with HtmlUnit or Selenium.
IOException: A connection or read error occurred, just issue a warning to the log.
Exception: Some other checked exception has occurred, set fetchException to true.

The HTML document fetching method can be tested with the CLI commands -fetch-document or -fetch-document-javascript (but without publications, webpages, docs and PDF support).

Getting a PDF document

Analogous to getting a HTML document. The Apache PDFBox library is used for extracting content and metadata from the PDF. The method for getting a PDF document is called upon if the URL is known in advance to point to a PDF file or if this fact is found out during the fetching of the URL as a HTML document.

Nothing is returned to the caller, as the supplied publication, webpage or doc is filled directly. For webpages and docs, all the text extracted from the PDF is set as their content, and if a title is found among the PDF metadata, it is set as their title. For publications, the text extracted from the PDF is set to be the fulltext. Also, title, keywords or theAbstract are filled with content found among the PDF metadata, but as this happens very rarely, fetching of the PDF is not done at all if the fulltext is already final.

Selecting from the returned HTML document

The fetched HTML is parsed to a jsoup Document and returned to the caller.

Then, parts of the document can be selected to fill the corresponding fields of publications, webpages and docs using the jsoup CSS-like element Selector. This is explained in more detail in the Scraping rules section.

Testing fetching of HTML (and PDF) documents and selecting from them can be done with the CLI operation -fetch-webpage-selector.

Cleaning the returned HTML document

If no selectors are specified for the given HTML document, then automatic cleaning and formatting of the document will be done instead.

The purpose of cleaning is to only extract the main content, while discarding auxiliary content, like menus and other navigational elements, footers, search and login forms, social links, contents of <noscript>, publication references, etc. We clean the document by deleting such elements and their children. The elements are found by tag names (for example <nav> or <footer>), but also their IDs, class names and ARIA roles are matched with combinations of keywords. Some words (like “menu” or “navbar”) are good enough to outright delete the matched element, either matching it by itself or with a specifier (like “left” or “main”) or in combination with another word (like “tab” or “links”). Other words (like the mentioned “tab” and “links”, but also “bar”, “search”, “hidden”, etc), either by themselves or combined with specifiers, are not specific enough to delete the matched element without some extra confidence. So, for these words and combinations there is the extra condition that no children or parents of the matched element can be an element that we determine to be about the main content (<main>, <article>, <h1>, “content”, “heading”, etc).

After this cleaning has been done, the remaining text will be extracted from the document and formatted. Paragraphs and other blocks of text will be separated by empty lines in the output. If any text is found in the description <meta> tag, then it will be prepended to the output.

Multithreaded fetching

Only one thread should be filling one publication or one webpage or one doc. But many threads can be filling different publications, webpages and docs in parallel. If many of these threads depend on the same resources, then what can happen is many parallel connections to the same host. To avoid such hammering, locking is implemented around each connection such that only up to 4 connections to the same host are allowed at once (comparison of hosts is done case-insensitively and “www.” is removed). Other threads wanting to connect to the same host will have to wait until one of these 4 threads has finished fetching from the host.

Fetching publications

Resources

Unfortunately, all content pertaining to a publication is not available from one sole Internet resource. Therefore, a number of resources are consulted and the final publication might contain content from different resources, for example an abstract from one place and the full text from another.

What follows is a list of these resources. They are defined in the order they are tried: if after fetching a given resource all required publication parts become final, or none of the subsequent resources can fill the missing parts, then the resources below the given resource are not fetched from.

But, if after going through all the resources below (as necessary) more IDs about the publication are known than before consulting the resources, then another run through all the resources is done, starting from the first (as knowing a new ID might enable us to query a resource that couldn’t be queried before). In doing this we are keeping track of resources that have successfully been fetched to not fetch these a second time and of course, for each resource, we are still evaluating if the resource can provide us with anything useful before fetching is attempted.

Sometimes, publication IDs can change, e.g., when we find from a resource with better type (see Publication types) that the DOI of the publication is different than what we currently have. In such cases all publication content (except IDs) is emptied and fetching restarted from scratch.

Europe PMC

Europe PubMed Central is a repository containing, among other things, abstracts, full text and preprints of biomedical and life sciences articles. It is the primary resource used by PubFetcher and a majority of content can be obtained from there.

The endpoint of the API is https://www.ebi.ac.uk/europepmc/webservices/rest/search, documentation is at https://europepmc.org/RestfulWebService. The API accepts any of the publication IDs: either a PMID, a PMCID or a DOI. With parameter europepmcEmail an e-mail address can be supplied to the API.

We can possibly get all publication parts from the Europe PMC API, except for fulltext, efo and go for which we get a Y or N indicating if the corresponding part is available at the Europe PMC fulltext or Europe PMC mined resource. In addition, we can possibly get values for the publication fields oa, preprint, journalTitle, pubDate, citationsCount and correspAuthor. Europe PMC is currently the only resource we can get the preprint and citationsCount values from.

Europe PMC itself has content from multiple sources (see https://europepmc.org/Help#contentsources) and in some cases multiple results are returned for a query (each from a different source). In that case the MED (MEDLINE) source is preferred, then PMC (PubMed Central), then PPR (preprints) and then whichever source is first in the list of results.

Europe PMC fulltext

Full text from the `Europe PMC`_ API is obtained from a separate endpoint: https://www.ebi.ac.uk/europepmc/webservices/rest/{PMCID}/fullTextXML. The PMCID of the publication must be known to query the API.

The API is primarily meant for getting the fulltext, but it can also be used to get the parts pmid, pmcid, doi, title, keywords, theAbstract if these were requested and are still non-final (for some reason not obtained from the main resource of `Europe PMC`_). In addition, journalTitle and correspAuthor can be obtained.

Europe PMC mined

Europe PMC has text-mined terms from publication full texts. These can be fetched from the API endpoint https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds, documentation of the Annotations API is at https://europepmc.org/AnnotationsApi. These resources are the only way to fill the publication parts efo and go and only those publication parts can be obtained from these resources (type “Gene Ontology” is used for GO and type “Experimental Methods” for EFO). Either a PMID or a PMCID is required to query these resources.

PubMed XML

The PubMed resource is used to access abstracts of biomedical and life sciences literature from the MEDLINE database.

The following URL is used for retrieving data in XML format for an article: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pubmed&id={PMID}. As seen, a PMID is required to query the resource. Documentation is at https://www.ncbi.nlm.nih.gov/books/NBK25500/.

In addition to theAbstract, the publication parts pmid, pmcid, doi, title and mesh can possibly be obtained from PubMed. Also, the publication part keywords can seldom be obtained, but if keywords is the only still missing publication part, then the resource is not fetched (instead, `PubMed Central`_ is relied upon for keywords). In addition, we can possibly get values for the publication fields journalTitle, pubDate and correspAuthor.

PubMed HTML

Information from PubMed can be ouput in different formats, including in HTML (to be viewed in the browser) from the URL: https://www.ncbi.nlm.nih.gov/pubmed/?term={PMID}. By scraping the resultant page we can get the same publication parts as from the XML obtained through PubMed E-utilities, however the HTML version of PubMed is only fetched if by that point title or theAbstract are still non-final (i.e., PubMed XML, but also `Europe PMC`_, failed to fetch these for some reason). So this is more of a redundant resource, that is rarely used and even more rarely useful.

PubMed Central

PubMed Central contains full text articles, which can be obtained in XML format from the URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=xml&db=pmc&id={PMCID}, where {PMCID} is the PMCID of the publication with the “PMC” prefix removed.

It is analogous to Europe PMC fulltext and used as a backup to that resource for getting content for articles available in the PMC system.

PMC Author Manuscripts can also be fetched from the corresponding HTML page.

DOI resource

Sometimes, some publication parts must be fetched directly from the publisher. A DOI (Digital Object Identifier) of a publication is a persistent identifier which, when resolved, should point to the correct URL of the journal article.

First, the DOI is resolved to the URL it redirects to and this URL is fed to the Getting a HTML document method. If the URL has a match in the JavaScript section of the Journals YAML scraping configuration, then the HTML document will be fetched using JavaScript support. The publication parts that can possibly be scraped from the article’s page are doi, title, keywords, theAbstract, fulltext and possibly (but very rarely) pmid and pmcid. These publication parts are extracted from the web page using corresponding scraping rules. If no scraping rules are found, then the content of the HTML <title> element will be set as the value of the publication part title (if title is still non-final) and the whole text of the HTML set as the value of fulltext (if fulltext is still non-final). Additionally, a link to the web page containing the full text of the article and a link pointing to the article PDF might be added to Links, if specified by the scraping rules, and in addition names and e-mails for correspAuthor can be found.

In contrast to the other resources, <meta> elements are looked for in the HTML as these might contain the publication parts pmid, pmcid, doi, title, keywords and also theAbstract, plus Links to additional web pages or PDFs containing the article and sometimes also e-mail addresses for correspAuthor. More about these meta tags is described in Meta.

Also, in contrast to other resources, the final URL resolved from the DOI is added to visitedSites.

Unpaywall

The Unpaywall service helps to find Open Access content. It us mainly useful for finding PDFs for some articles for which no full text content was found using the above resources, but it can help in filling a few other publication parts and fields also, such as oa. The service was recently called oaDOI.

The API is queried as follows: https://api.unpaywall.org/v2/{DOI}?email={oadoiEmail}, documentation is at https://unpaywall.org/products/api. As seen, the DOI of the publication must be known to query the service.

The response will be in JSON format, which is why the method of Getting a HTML document is not used (but the process of obtaining the resource is analogous). Unpaywall will be called if title, theAbstract of fulltext are non-final (or pmid, pmcid, doi are non-final, but only if these are the only publication parts requested). From the response we can possibly directly fill the publication part title and the fields oa and journalTitle. But in addition we can find Links to web pages containing the article or to PDFs of the article.

Meta

The web pages of journal articles can have metadata embedded in the HTML in <meta> elements. Sometimes this can be used to fill publication parts which have not been found elsewhere.

There are a few standard meta tag formats, those supported by PubFetcher are: HighWire, EPrints, bepress, Dublin Core, Open Graph, Twitter and generic tag (without any prefix). An example of a HighWire tag: <meta name="citation_keyword" content="foo">. An example of a Open Graph tag: <meta property="og:title" content="bar" />.

Publication parts potentially found in <meta> elements (depending on format) are: pmid, pmcid, doi, title, keywords, theAbstract. Additionally, Links to web pages containing the article or to PDFs of the article can be found in some meta tags.

In web pages of articles of some journals the standard <meta> tags are filled with content that is not entirely correct (for our purposes), so some exceptions to not use these tags for these journals have been defined.

<meta> elements are only searched for in all web pages resolved from DOI and also in all web pages added to Links.

Links

Links to web pages containing an article or to PDFs of the article can be found in the Unpaywall resource, in some Meta tags and in web pages (resolved from DOI or from Links) that have scraping rules specifying how to extract links. In addition to its URL, a publication type (see Publication types) corresponding to the resource the link was found from, the URL of the web page the link was found from and a timestamp, are saved for each link.

These links are collected in a list that will be looked through only after all other resources above have been exhausted. DOI links (with host “doi.org” or “dx.doi.org”) and links to web pages of articles in the PMC system (either Europe PMC or PubMed Central) are not added to this list. But, in case of PMC links, a missing PMCID (or PMID) of the publication can sometimes be extracted from the URL string itself. In addition, links that have already been tried or links already present in the list are not added to the list a second time.

Links are sorted according to publication types in the list they are collected to, with links of final type on top. Which means, that once fetching of resources has reached this list of links then links of higher types are visited first. If publication parts title, keywords, theAbstract and fulltext are final or with types that are better or equal to types of any of the remaining links in the list, then the remaining links are discarded.

In case of links to web pages the content is fetched and the publication is filled the same way as in the DOI resource (including the addition of the link to visitedSites), except the resolving of the DOI to URL step is not done (the supplied URL of the link is treated the same as a URL resolved from a DOI). In case of links to PDFs the content is fetched and the publication is filled as described in Getting a PDF document.

Publication types

Publication part types are the following, ordered from better to lower type:

europepmc: Type given to parts got from `Europe PMC`_ and Europe PMC mined resources
europepmc_xml: From Europe PMC fulltext resource
europepmc_html: Currently disabled
pubmed_xml: From PubMed XML resource
pubmed_html: From PubMed HTML resource
pmc_xml: From `PubMed Central`_ resource
pmc_html: From `PubMed Central`_ HTML resource
doi: From DOI resource (excluding PDF links)
link: Link to publication. Not used in PubFetcher itself. Meant as an option in applications extending or using PubFetcher.
link_oadoi: Given to Links found in Unpaywall resource (excluding PDF links)
citation: From HighWire Meta tags (excluding links)
eprints: From EPrints Meta tags (excluding links)
bepress: From bepress Meta tags (excluding PDF links)
link_citation: Links from Highwire Meta tags (excluding PDF links)
link_eprints: Links from EPrints Meta tags (excluding PDF links)
dc: From Dublin Core Meta tags
og: From Open Graph Meta tags
twitter: From Twitter Meta tags
meta: From generic Meta tags (excluding links)
link_meta: Links from generic Meta tags (excluding PDF links)
external: Type given to externally supplied pmid, pmcid or doi
oadoi: From Unpaywall resource (excluding links, currently only title)
pdf_europepmc: Currently disabled
pdf_pmc: Currently disabled
pdf_doi: Type given to PDF Links extracted from a DOI resource or if the DOI itself resolves to a PDF file (which is fetched as described in Getting a PDF document)
pdf_link: PDF from link to publication. Not used in PubFetcher itself. Meant as an option in applications extending or using PubFetcher.
pdf_oadoi: PDF Links from Unpaywall resource
pdf_citation: PDF Links from HighWire Meta tags
pdf_eprints: PDF Links from EPrints Meta tags
pdf_bepress: PDF Links from bepress Meta tags
pdf_meta: PDF Links from generic Meta tags
webpage: Type given to title and fulltext set from an article web page with no scraping rules
na: Initial type of a publication part

Types “europepmc”, “europepmc_xml”, “europepmc_html”, “pubmed_xml”, “pubmed_html”, “pmc_xml”, “pmc_html”, “doi”, “link” and “link_oadoi” are final types. Final types are the best type and they are equivalent with each other (meaning that one final type is not better than some other final type and their ordering does not matter).

The type of the publication part being final is a necessary condition for the publication part to be final. The other condition is for the publication part to be large enough (as specified by titleMinLength, keywordsMinSize, minedTermsMinSize, abstractMinLength or fulltextMinLength in fetching parameters). The fulltext part has the additional requirement of being better than “webpage” type to be considered final.

When filling a publication part then the type of the new content must be better than the type of the old content. Or, if both types are final but the publication part itself is not yet final (because the content is not large enough), then new content will override old content if new content is larger. Publication parts which are final can’t be overwritten. Also, the publication fields (these are not publication parts) journalTitle, pubDate and correspAuthor can only be set once with non-empty content, after which they can’t be overwritten anymore.

Publication parts

publication parts have content and contain the fields type, url and timestamp as described in the JSON output of the publication part pmid. The publication fields oa, journalTitle, pubDate, etc do not contain extra information besides content and are not publication parts.

The publication parts are as follows:

pmid: The PubMed ID of the publication. Only articles available in PubMed can have this. Only a valid PMID can be set to the part. The pmid structure.
pmcid: The PubMed Central ID of the publication. Only articles available in PMC can have this. Only a valid PMCID can be set to the part. The pmcid structure.
doi: The Digital Object Identifier of the publication. Only a valid DOI can be set to the part. The DOI will be normalised in the process, i.e. any valid prefix (e.g. “https://doi.org/”, “doi:”) is removed and letters from the 7-bit ASCII set are converted to uppercase. The doi structure.
title: The title of the publication. The title structure.
keywords: Author-assigned keywords of the publication. Often missing or not found. Empty and duplicate keywords are removed. The keywords structure.
mesh: Medical Subject Headings terms of the publication. Assigned to articles in PubMed (with some delay after publication). The mesh structure.
efo: Experimental factor ontology terms of the publication (but also experimental methods terms from other ontologies like Molecular Interactions Controlled Vocabulary and Ontology for Biomedical Investigations). Text-mined by the Europe PMC project from the full text of the article. The efo structure.
go: Gene ontology terms of the publication. Text-mined by the Europe PMC project from the full text of the article. The go structure.
theAbstract: The abstract of the publication. The part is called “theAbstract” instead of just “abstract”, because “abstract” is a reserved keyword in the Java programming language. The abstract structure.
fulltext: The full text of the publication. The part includes the title and abstract of the publication in the beginning of the content string. All the main content of the article’s full text is included, from introduction to conclusions. Captions of figures and tables and descriptions of supplementary materials are also included. From back matter, the glossary, notes and misc sections are usually included. But acknowledgements, appendices, biographies, footnotes, copyrights and, most importantly, references are excluded, whenever possible. If fulltext is obtained from a PDF, then everything is included. In the future, it could be useful to include all these parts of full text, like references, but in a structured way. The fulltext structure.

Fetching webpages and docs

A webpage or doc is also got using the method described in Getting a HTML document (or Getting a PDF document if the webpage or doc URL turns out to be a link to a PDF file). Webpage and doc fields that can be filled from the fetched content using scraping rules are the webpage title, the webpage content, license and language. Other fields are filled with metadata during the fetching process, the whole structure can be seen in webpages section of the output documentation. If no scraping rules are present for the webpage or doc then the webpage content will be the entire string parsed from the fetched HTML and the webpage title will be the content inside the <title> tag. Whether the webpage or doc is fetched with JavaScript support or not can also be influenced with scraping rules. A webpage or doc can also be fetch using rules specified on the command line with the command -fetch-webpage-selector (see Print a web page).

The same publication can be fetched multiple times, with each fetching potentially adding some missing content to the existing publication. In contrast, a webpage or doc is always fetched from scratch. If the resulting webpage or doc is final and a corresponding webpage or doc already exists, then this existing entry will be overwritten. An existing webpage or doc will also be overwritten, if the new entry is non-final (but not empty) and the old entry is non-final (and potentially empty) and if both new and old entries are empty.

Can fetch

The methods for fetching publications, webpages and docs are always given a publication, webpage or doc as parameter. If a publication, webpage or doc is fetched from scratch, then an initial empty entry is supplied. Each time, these methods have to determine if a publication, webpage or doc can be fetched or should the fetching be skipped this time. The fetching will happen if any of the following conditions is met:

fetchTime is 0, this is only true for initial empty entries;
the publication is empty or the webpage or doc is empty and emptyCooldown is not negative and at least emptyCooldown minutes have passed since fetchTime;
the publication is final or the webpage or doc is final (and they are not empty) and nonFinalCooldown is not negative and at least nonFinalCooldown minutes have passed since fetchTime;
the entry has a fetchException and fetchExceptionCooldown is not negative and at least fetchExceptionCooldown minutes have passed since fetchTime;
the entry is empty or non-final or has a fetchException and either retryCounter is less than retryLimit or retryLimit is negative.

If it was determined that fetching happens, then fetchTime is set to the current time and retryCounter is reset to 0 if any condition except the last is met. If only the last condition (about retryCounter and retryLimit) is met, then retryCounter is incremented by 1 (and fetchTime is left as is, meaning that fetchTime does not necessarily show the time of the last fetching, but only the time of the initial fetching or the time when fetching happened because one of the cooldown timers expired).

The fetchException is set to false in the beginning of each fetching and it is set to true if some certain types of errors happen during fetching, some such error conditions are described in Getting a HTML document. fetchException can be set to true also by the method described in Getting a PDF document and the custom method getting the Unpaywall resource.