Highlight for Query

The web service method /highlight-for-query takes the user's query string and finds hits to highlight using an internal search engine.

A typical use case is highlighting the documents listed on a search results page. (The first-level search may be provided by any search solution, e.g. Elasticsearch, Apache Solr, or a database full-text search.)

This highlighting method requires the following parameters:

  • uri - PDF document location. The value can be combined with other options on the server too.
  • query - Search string containing words, phrases, etc.

There are also a number of parameters that affect document delivery. Check the service API documentation for details.
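As a rough illustration of the two required parameters, the sketch below builds a request URL for /highlight-for-query. It assumes the simple form of the call accepts uri and query as URL query-string parameters on a GET request, which this page does not state explicitly; the helper name buildHighlightUrl is hypothetical.

```javascript
// Minimal sketch: build a /highlight-for-query URL from the two required parameters.
// Assumption: uri and query are passed as URL query-string parameters.
// buildHighlightUrl is a hypothetical helper, not part of the Highlighter API.
function buildHighlightUrl(highlighterUrl, documentUri, query) {
  // Strip trailing slashes so a self-hosted base path is preserved.
  const base = highlighterUrl.replace(/\/+$/, "");
  const url = new URL(base + "/highlight-for-query");
  // URLSearchParams takes care of percent-encoding quotes, spaces, etc.
  url.searchParams.set("uri", documentUri);
  url.searchParams.set("query", query);
  return url.toString();
}

const exampleUrl = buildHighlightUrl(
  "https://demo.highlight4.me",
  "https://www.example.com/document.pdf",
  '"tax return"'
);
console.log(exampleUrl);
```

Letting URLSearchParams do the encoding avoids hand-escaping quotes and spaces in phrase queries.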

note

Because the search engine used to find the document may differ in features from the one used for highlighting, the highlighted PDF may mark more or fewer words than the search engine matched. (Internally, the highlighter uses the Apache Solr search engine, which has highly customizable text analysis and search options.)

The simplest way to integrate Highlighter is to use our jQuery plugin. The approach involves:

  1. Adding data attributes to the results page HTML.
  2. Including and initializing the pdfHighlighter plugin.

Adding data attributes to page HTML

To a common ancestor element of document links, add data attributes for the query and, optionally, language:

...
<div id="results">
  <ul data-query="rabbit" data-language="en">
    <li><a href="url/to/result1.pdf">Document title</a></li>
    <li><a href="url/to/result2.pdf">Document title</a></li>
    ...
  </ul>
</div>
...

The query attribute may contain the complete search string including phrases (in quotes) and Boolean operators.

note

When rendering the HTML page server side, make sure to HTML-encode the query string. Otherwise, quotes from a phrase search could break your markup.
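A minimal sketch of such encoding, assuming the page is rendered by server-side JavaScript without a templating engine that escapes attributes automatically. The helper names escapeHtmlAttribute and renderResultsListOpenTag are hypothetical.

```javascript
// Minimal sketch: HTML-encode a query string before placing it in a data attribute.
// Both helper names are hypothetical; most templating engines offer an equivalent.
function escapeHtmlAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;") // must run first so later entities are not double-encoded
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Render the opening tag of the results list with safely encoded attributes.
function renderResultsListOpenTag(query, language) {
  return '<ul data-query="' + escapeHtmlAttribute(query) +
    '" data-language="' + escapeHtmlAttribute(language) + '">';
}

console.log(renderResultsListOpenTag('"tax return" AND rabbit', "en"));
```

The browser decodes the entities when it parses the attribute, so the data-query value seen by scripts on the page is the original query, quotes included.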

Initialize PDF Highlighter plugin

In the scripts section of your results page, include jQuery (if you don't use it already), the plugin script jquery.pdf-highlighter.js, and initialize it:

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="https://demo.highlight4.me/js/jquery.pdf-highlighter.js"></script>
<script>
  // jQuery passes itself to the ready handler, so $ is safe to use even in noConflict mode.
  jQuery(document).ready(function ($) {
    var hlConfig = {
      highlighterUrl: "https://demo.highlight4.me", // update for self-hosted Highlighter
      resolveDocumentBase: true
    };
    // Attach the highlighter to every link under #results whose href contains ".pdf".
    $('#results a[href*=".pdf"]').pdfHighlighter(hlConfig);
  });
</script>

In the example above, a jQuery selector attaches the highlighter to all PDF links inside the results element.

Search Syntax

The basic syntax for finding keywords and phrases is similar to searching on Google. The advanced query syntax supported by PDF Highlighter is most similar to that of the Apache Solr and Elasticsearch search engines.

Simple search

mortgage
tax return

Search for an exact phrase

To search for a phrase, enclose multiple words in quotation marks.

"tax return"

Proximity search

Proximity search allows you to find words that occur within a specified distance of each other.

"acknowledgment message"~5

This finds the words acknowledgment and message within 5 words of each other.

Search using wildcards

Use * to match any group of characters, or ? to match a single character.

qual*

Searches for documents containing any word starting with the letters qual, such as qualify, quality, qualification, qualifier, and so forth.
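To make the wildcard semantics concrete, the sketch below translates a wildcard term into a JavaScript regular expression and tests it against single words. It only illustrates what * and ? mean; it is not how PDF Highlighter evaluates wildcards internally. It assumes that matching is case-insensitive and that * may also match zero characters. The helper name wildcardToRegExp is hypothetical.

```javascript
// Illustration only: translate a wildcard term into an equivalent RegExp.
// Assumptions: case-insensitive matching, and * may match zero characters.
function wildcardToRegExp(pattern) {
  // Escape regex metacharacters other than * and ?.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  // * stands for any run of characters, ? for exactly one character.
  const source = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp("^" + source + "$", "i");
}

const qual = wildcardToRegExp("qual*");
["qualify", "quality", "qualification", "equal"].forEach(function (word) {
  console.log(word, qual.test(word));
});
```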

Fuzzy search

To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term.

roam~

This search will match terms like roams, foam, and foams. It will also match the word "roam" itself.

An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2. For example:

roam~1

This will match terms like roams and foam, but not foams, since it has an edit distance of 2.
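The edit distances quoted above can be checked with a short sketch. It computes the classic Levenshtein distance (single-character insertions, deletions, and substitutions). This is an assumption about the metric: the search engine may count edits slightly differently, for example by treating a swap of two adjacent characters as one edit. The function name editDistance is hypothetical.

```javascript
// Illustration only: classic Levenshtein edit distance between two words.
// Assumption: the engine's fuzzy metric may differ (e.g. in how it counts transpositions).
function editDistance(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, function (_, j) { return j; });
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution, or a free match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

["roam", "roams", "foam", "foams"].forEach(function (word) {
  console.log(word, editDistance("roam", word));
});
```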

NOTE: If fuzzy search is enabled globally, it will apply to all keyword searches without need to use the tilde symbol.

Boolean expressions, positive and negative terms

PDF Highlighter supports Boolean expressions, as well as positive and negative terms, with some specifics. Because a typical use case for PDF Highlighter is highlighting documents that have already matched the user's search request, it is generally safe to pass the same query to PDF Highlighter for document processing. All keywords that are not specifically excluded (with "NOT" or with "-") will be highlighted.

Stemming

If a language is defined, PDF Highlighter automatically enables stemming. Stemming is a text analysis technique that allows you to find word variations. For example, if you search for the term qualification, the search engine will also find documents containing the terms qualify, qualifier, etc.

Regular Expressions

To search using a regular expression, prefix your query with regex:

Regular expression search works with page text as is, including white space.

Multi Query Highlighting

Highlighter's /highlight-for-query web service endpoint allows passing multiple queries (even thousands) for PDF processing, while also allowing a greater level of control over the highlighting process.

note

To annotate multiple PDF documents for a predefined set of phrases, check the batch highlighting tool.

To send multiple queries to Highlighter, send a POST request to /highlight-for-query with a payload such as:

{
  "uri": "https://www.example.com/document.pdf",
  "language": "en",
  "query": [
    {
      "query": "booking"
    },
    {
      "query": "message or \"lorem ipsum\"",
      "color": "FFFF00"
    },
    {
      "query": "international airport",
      "type": "phrase",
      "tag": "test1"
    }
  ]
}

The query array may contain one or more items where:

  • The only required field of a query item is the query string.
  • Unless the type of the item is phrase, the query can be any search string. There's no need to put query terms in quotes if the query type is set to phrase.
  • The color is the desired highlighting color for the query item, specified as a hexadecimal RGB value (for example, FFFF00).
  • The optional tag is the client's identifier for the query item. If the color is not specified but the tag is defined, all queries with the same tag are assigned the same color.

The payload object can also contain other parameters accepted by the PDF highlighting service.
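The multi-query request described above could be sent from JavaScript roughly as follows. This is a sketch under assumptions: it uses the standard fetch API (browsers, or Node.js 18 and later), assumes the service expects a JSON body with a Content-Type of application/json, and leaves handling of the response to the caller because the response format is not described here. The function names buildMultiQueryPayload and highlightForQueries are hypothetical.

```javascript
// Minimal sketch: build and send a multi-query payload to /highlight-for-query.
// Both function names are hypothetical helpers, not part of the Highlighter API.
function buildMultiQueryPayload(uri, language, queries) {
  const items = queries.map(function (item) {
    // The query string is the only required field of a query item.
    if (typeof item.query !== "string" || item.query.length === 0) {
      throw new Error("Each query item needs a non-empty query string");
    }
    const out = { query: item.query };
    if (item.type) out.type = item.type;    // e.g. "phrase"
    if (item.color) out.color = item.color; // RGB value such as "FFFF00"
    if (item.tag) out.tag = item.tag;       // client-side identifier
    return out;
  });
  return { uri: uri, language: language, query: items };
}

async function highlightForQueries(highlighterUrl, payload) {
  const endpoint = highlighterUrl.replace(/\/+$/, "") + "/highlight-for-query";
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" }, // assumed content type
    body: JSON.stringify(payload)
  });
  if (!response.ok) {
    throw new Error("Highlighter returned HTTP " + response.status);
  }
  return response; // response format is not covered here; the caller decides how to read it
}

const payload = buildMultiQueryPayload("https://www.example.com/document.pdf", "en", [
  { query: "booking" },
  { query: 'message or "lorem ipsum"', color: "FFFF00" },
  { query: "international airport", type: "phrase", tag: "test1" }
]);
console.log(JSON.stringify(payload, null, 2));
```

To send the request, call highlightForQueries("https://demo.highlight4.me", payload) and handle the returned promise. Building the payload separately from sending it keeps the validation testable without network access.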