Getting Started With Highlighting Search Terms in PDF

JObjects Highlighter is a web service for highlighting search terms in PDF documents. Depending on the received HTTP request, it will either:

  1. create a new PDF with highlights included ("burnt in"), or
  2. show the original PDF document in a specialized web-based viewer, with highlights on top of it.

Which method will be used for delivering the results depends on the Accept HTTP header. If the header is not provided, JObjects Highlighter determines the method using server preferences.

In this guide, we'll show you how to consume the PDF highlighting service and walk you through the integration steps and the obstacles you may run into. For simplicity, most of the examples here use the JObjects Highlighter Cloud Service so that you can try them right away. The same principles apply to JObjects Highlighter Server, which you can download and install on your own server; you only need to use your instance URL instead.

In the examples below we reference a PDF file named alice.pdf that you can download here, or you can run tests using your own PDF document.

Burning Highlights Into a PDF From the Command Line

If you're on a Linux system with curl available, you can run the following command to upload the local file alice.pdf to Highlight4me (JObjects' hosted highlighting service), highlight the word rabbit, and save the received output as alice-rabbit-highlighted.pdf.

curl \
-H "Accept: application/pdf" \
-F "file=@alice.pdf" \
-F "query=rabbit" \
-F "language=en" \
https://cloud.highlight4.me/api/highlight-for-query > alice-rabbit-highlighted.pdf

We've used the highlight-for-query service method, which highlights a PDF for the provided keywords. JObjects Highlighter also has a highlight-for-xml method that accepts the Adobe Highlight File format.

Notice that by using the language=en parameter, we have instructed JObjects Highlighter to use English language rules. As a result, it highlights not only rabbit (singular) but also rabbits (plural).
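The same request can also be made from JavaScript. The following is a minimal sketch rather than an official client: it assumes a runtime with global fetch, FormData, and Blob (for example Node.js 18 or later, or a modern browser), and buildHighlightForm and burnHighlights are hypothetical helper names. The form fields and the Accept header mirror the curl command above.

```javascript
// Sketch only: assumes global fetch, FormData and Blob are available.
// buildHighlightForm and burnHighlights are hypothetical helper names.
const HIGHLIGHT_URL = 'https://cloud.highlight4.me/api/highlight-for-query';

// Build the same multipart form fields that the curl command sends with -F.
function buildHighlightForm(pdfBytes, fileName, query, language) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), fileName);
  form.append('query', query);
  if (language) {
    form.append('language', language); // optional, e.g. 'en'
  }
  return form;
}

// Upload the PDF bytes and resolve with the bytes of the highlighted PDF.
async function burnHighlights(pdfBytes, fileName, query, language) {
  const response = await fetch(HIGHLIGHT_URL, {
    method: 'POST',
    headers: { Accept: 'application/pdf' }, // ask for a PDF with burnt-in highlights
    body: buildHighlightForm(pdfBytes, fileName, query, language),
  });
  if (!response.ok) {
    throw new Error('Highlighting failed with HTTP status ' + response.status);
  }
  return new Uint8Array(await response.arrayBuffer());
}
```

Calling burnHighlights(bytes, 'alice.pdf', 'rabbit', 'en') performs a real network request; reading alice.pdf and saving the result is left to the caller so that the sketch stays independent of any particular file API.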

note

For demo purposes, the Highlight4me service allows highlighting of PDF files of up to 1MB in size. To try it with larger documents, you can start a trial, get your API key, and include it with your requests as either the apiKey parameter or the X-Api-Key HTTP header:

curl \
-H "Accept: application/pdf" \
-H "X-Api-Key: YOUR_API_KEY_HERE" \
-F "file=@test.pdf" \
-F "query=account" \
https://cloud.highlight4.me/api/highlight-for-query > test-account-highlighted.pdf

Burning highlights into PDF documents is a rather specific use case. If you need to do this at scale, check out our batch highlighting tool.

Getting Highlighting Results as JSON

Most of the time, you'd probably be interested in integrating JObjects Highlighter with web-based solutions, invoking the service using JavaScript. So, let's try highlighting the same document but requesting a response in JSON format:

curl \
-H "Accept: application/json" \
-F "file=@alice.pdf" \
-F "query=rabbit" \
-F "language=en" \
https://cloud.highlight4.me/api/highlight-for-query

We'll get a response similar to:

{
  "success": true,
  "highlightedTerms": 52,
  "highlightedPages": 20,
  "pagesWithMatches": [3, 5, 10, 11, 21, 22, 23, 24, 25, 50, 51, 52, 66, 70, 71, 74, 75, 77, 78, 81],
  "cacheKey": "3eef6edabc1a659f8fe6570cfe1c1d53",
  "foundInIndex": false,
  "navigationStrategy": "hit-to-hit",
  "documentId": "37a3241026f5faba5976ce2bdcbfeaf9"
}

This response provides only basic information about the highlighted items. However, the cacheKey field contains a key that we can pass to the hits method of the web service to get the position of each match:

curl https://cloud.highlight4.me/api/hits/3eef6edabc1a659f8fe6570cfe1c1d53

That will return something like:

{
  "success": true,
  "highlightedTerms": 52,
  "highlightedPages": 20,
  "pagesWithMatches": [3, 5, 10, 11, 21, 22, 23, 24, 25, 50, 51, 52, 66, 70, 71, 74, 75, 77, 78, 81],
  "cacheKey": "3eef6edabc1a659f8fe6570cfe1c1d53",
  "foundInIndex": false,
  "navigationStrategy": "hit-to-hit",
  "documentId": "37a3241026f5faba5976ce2bdcbfeaf9",
  "matches": [
    {
      "area": [172, 623, 239, 643],
      "color": [1, 1, 0],
      "index": 0,
      "page": 3
    },
    {
      "area": [173, 439, 220, 455],
      "color": [1, 1, 0],
      "index": 1,
      "page": 3
    },
    ... more items here...
    {
      "area": [442, 554, 489, 570],
      "color": [1, 1, 0],
      "index": 51,
      "page": 81
    }
  ]
}
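Purely as an illustration of how this structure can be read, here is a small sketch that builds the hits URL from a cacheKey and counts matches per page. The helpers hitsUrlFor and countMatchesByPage are hypothetical names, and the sample object is a trimmed copy of the response above, not real output for a full document.

```javascript
// Hypothetical helper: URL of the hits method for a given cacheKey.
function hitsUrlFor(apiBase, cacheKey) {
  return apiBase + '/hits/' + encodeURIComponent(cacheKey);
}

// Hypothetical helper: count how many matches fall on each page.
function countMatchesByPage(hitsResponse) {
  const counts = new Map();
  for (const match of hitsResponse.matches || []) {
    counts.set(match.page, (counts.get(match.page) || 0) + 1);
  }
  return counts;
}

// Trimmed copy of the sample response shown above.
const sample = {
  success: true,
  matches: [
    { area: [172, 623, 239, 643], color: [1, 1, 0], index: 0, page: 3 },
    { area: [173, 439, 220, 455], color: [1, 1, 0], index: 1, page: 3 },
    { area: [442, 554, 489, 570], color: [1, 1, 0], index: 51, page: 81 },
  ],
};

const perPage = countMatchesByPage(sample);
console.log(perPage.get(3), perPage.get(81)); // prints "2 1"
```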

You don't need to deal with this response directly; our Highlighting PDF Viewer consumes this data instead. Let's see how to use this on a website...

Web Page Integration

Integration of JObjects Highlighter with a website or application comes down to:

  1. Creating and sending a request to a highlighting method (e.g. highlight-for-query),
  2. Opening the Highlighting PDF Viewer in a frame or a window, and passing both the PDF and the hits URL to it.

A simple jQuery-based solution could look like this:

var highlighterServer = 'https://cloud.highlight4.me/api';
var viewerUrl = 'https://cdn.highlight4.me/highlighter/4.4.1/viewer/index.html';
var pdfUrl = 'https://cdn.highlight4.me/examples/alice.pdf';
// ... (the omitted code is assumed to define viewerFrame, a jQuery object wrapping the viewer frame)
function showHighlighted(pdfUrl, queryString) {
    $.post({
        url: highlighterServer + '/highlight-for-query',
        data: {
            uri: pdfUrl,
            query: queryString,
            viewer: viewerUrl
        },
        dataType: 'json',
        //headers: {'X-Api-Key': 'YOUR_API_KEY_HERE'}, // if you got an API key
        success: function(data) {
            var openUrl;
            if (data.success) {
                // create URL to the Highlighting PDF Viewer
                openUrl = viewerUrl +
                    '?file=' + encodeURIComponent(pdfUrl) +
                    '&highlightsFile=' + encodeURIComponent(highlighterServer + '/hits/' + data.cacheKey) +
                    '&powerSearch=1' + // open power search panel (e.g. for user to modify the query)
                    '&q=' + encodeURIComponent(queryString); // init search panel with the query string
                viewerFrame.attr('src', openUrl); // show in a frame
                // window.open(openUrl, '_blank'); // or open in a new window
            }
            else {
                // as a fallback in case of any error, open PDF in the viewer without highlights
                console.error('Something wrong happened.', data);
                openUrl = viewerUrl + '?file=' + encodeURIComponent(pdfUrl);
                viewerFrame.attr('src', openUrl);
            }
        }
    });
}

You can try a live example at https://jsfiddle.net/jobjects/quzf7dp1/28/
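The viewer URL is assembled in two places in the example (with and without highlights), so it can be convenient to move that logic into one function. The sketch below is one possible way to do it; buildViewerUrl and its options argument are hypothetical names, while the query parameters file, highlightsFile, powerSearch, and q are the ones used in the example above.

```javascript
// Hypothetical helper: build the Highlighting PDF Viewer URL.
// options.hitsUrl and options.query are both optional.
function buildViewerUrl(viewerUrl, pdfUrl, options) {
  const params = ['file=' + encodeURIComponent(pdfUrl)];
  if (options && options.hitsUrl) {
    // URL of the hits method for this document and query
    params.push('highlightsFile=' + encodeURIComponent(options.hitsUrl));
  }
  if (options && options.query) {
    params.push('powerSearch=1'); // open the power search panel
    params.push('q=' + encodeURIComponent(options.query)); // pre-fill the query
  }
  return viewerUrl + '?' + params.join('&');
}
```

With such a helper, the fallback case becomes buildViewerUrl(viewerUrl, pdfUrl), and the success case passes the hits URL and the query string in options.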

The above example shows the simplest case, in which both the PDF and the viewer come from the same origin (i.e., they have the same protocol, hostname, and port): our CDN. In the real world, you would probably have to handle the case where they have different origins.

To see what happens then, run the above example after changing pdfUrl to https://jobjects.com/examples/alice.pdf, or open https://jsfiddle.net/jobjects/quzf7dp1/29/

In the PDF viewer you will see the error message "An error occurred while loading the PDF" and in the web browser console you can see something like:

Access to fetch at 'https://jobjects.com/examples/alice.pdf' from origin 'https://cdn.highlight4.me'
has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on
the requested resource. If an opaque response serves your needs, set the request's mode to
'no-cors' to fetch the resource with CORS disabled.

If you don't have prior experience with CORS, this requires an explanation...

Dealing With CORS

Cross-origin resource sharing (CORS) is a mechanism that allows restricted resources on a web page to be requested from a domain other than the one from which the first resource was served (see the explanations by Wikipedia or Mozilla).

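In short, web browsers restrict cross-origin requests made by scripts as a security measure, and CORS is how a server signals that such requests are allowed. In our case, the JavaScript that is an integral part of the PDF Viewer, hosted on one domain, is requesting a PDF file that's hosted on another domain.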
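To check in advance whether a given PDF will run into this restriction, you can compare the origins of the two URLs. The following is a minimal sketch using the standard URL class; isSameOrigin is a hypothetical helper name, and the URLs are the ones from the examples above.

```javascript
// Hypothetical helper: two URLs share an origin when protocol, host and port all match.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

const viewer = 'https://cdn.highlight4.me/highlighter/4.4.1/viewer/index.html';

console.log(isSameOrigin(viewer, 'https://cdn.highlight4.me/examples/alice.pdf')); // true
console.log(isSameOrigin(viewer, 'https://jobjects.com/examples/alice.pdf')); // false
```

When the result is false, the server hosting the PDF has to cooperate, using one of the approaches described next.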

Resolve by Copying the Viewer

A simple workaround for the CORS issue is to copy the Highlighting PDF Viewer to the server hosting the PDF documents. If you opt for this approach, you can download JObjects Highlighter asset files for use on your web server.

However, this may not always be possible or desirable, so let's see how to fix this at the web server configuration level.

Resolve by Sending Access-Control Headers

To use a PDF viewer hosted on one domain but pull PDFs from another, you need to configure the web server that serves the PDFs to send the Access-Control-Allow-Origin HTTP header.

For example, you could return Access-Control-Allow-Origin: *, which would allow any website to access your documents using JavaScript.

To try it live, open the updated demo at https://jsfiddle.net/jobjects/quzf7dp1/31/.

Here we've changed pdfUrl to https://jobjects.com/examples/cors/all/alice.pdf. If you inspect the HTTP headers in your browser's developer tools, you'll notice that we're returning Access-Control-Allow-Origin: * for this file path.

Of course, you may want to limit such access to a specific host only. In fact, we highly recommend it. In the live example at https://jsfiddle.net/jobjects/quzf7dp1/32/ we're fetching the PDF from a path that returns the Access-Control-Allow-Origin: https://cdn.highlight4.me header.
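How you send the header depends on your web server. If the PDFs are served by your own application code, the decision could look roughly like the sketch below. It is an illustration under assumptions, not a description of how our servers are configured: ALLOWED_ORIGINS and corsHeadersFor are hypothetical names, and the function only computes the headers to add to a response.

```javascript
// Hypothetical allow-list: origins that may read the PDFs from JavaScript.
const ALLOWED_ORIGINS = ['https://cdn.highlight4.me'];

// Compute the CORS-related response headers for a request's Origin header value.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  // The response depends on the Origin request header, so tell caches about it.
  const headers = { Vary: 'Origin' };
  if (requestOrigin && allowedOrigins.includes(requestOrigin)) {
    headers['Access-Control-Allow-Origin'] = requestOrigin;
  }
  return headers;
}

console.log(corsHeadersFor('https://cdn.highlight4.me', ALLOWED_ORIGINS));
console.log(corsHeadersFor('https://other.example', ALLOWED_ORIGINS));
```

The first call includes Access-Control-Allow-Origin for the viewer's origin; the second returns only the Vary header, so a browser would block scripts from that origin from reading the PDF.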

note

Note that adding the Access-Control-Allow-Origin header does not circumvent any user access controls you may have on the website. It's just a signal to the web browser that it's OK for JavaScript from the allowed origin to access the given URL. Without it, the browser will not let the script read the response.

Resolve by Using Reverse Proxying

Another way to work around CORS is to use reverse proxying. With this approach, you set up virtual paths on your web server that the web server internally forwards to the remote service.

For example, you could configure your web server to proxy two paths:

  1. /highlighter/api/ to https://cloud.highlight4.me/api/
  2. /highlighter/assets/ to https://cdn.highlight4.me/highlighter/4.4.1/

Then, you could reference the highlighting service in your scripts at /highlighter/api/highlight-for-query, and use the PDF Viewer at /highlighter/assets/viewer/index.html.
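The actual proxy rules are written in your web server's configuration, but the path mapping itself can be illustrated with a few lines of JavaScript. This is only a sketch of what the proxy does with a request path; PROXY_RULES and resolveProxyTarget are hypothetical names, and the two rules are the ones listed above.

```javascript
// The two example proxy rules: local path prefix -> remote base URL.
const PROXY_RULES = [
  { prefix: '/highlighter/api/', target: 'https://cloud.highlight4.me/api/' },
  { prefix: '/highlighter/assets/', target: 'https://cdn.highlight4.me/highlighter/4.4.1/' },
];

// Hypothetical helper: return the remote URL a local path is forwarded to,
// or null when no rule matches.
function resolveProxyTarget(path) {
  for (const rule of PROXY_RULES) {
    if (path.startsWith(rule.prefix)) {
      return rule.target + path.slice(rule.prefix.length);
    }
  }
  return null;
}

console.log(resolveProxyTarget('/highlighter/api/highlight-for-query'));
// prints "https://cloud.highlight4.me/api/highlight-for-query"
console.log(resolveProxyTarget('/highlighter/assets/viewer/index.html'));
// prints "https://cdn.highlight4.me/highlighter/4.4.1/viewer/index.html"
```

Because the browser only ever talks to your own host, the viewer, the highlighting service, and your PDFs all appear to share one origin.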

note

If you have installed JObjects Highlighter Server on your server, you will need to set up proxying on the user-facing web server to make it accessible to your users. Typically, you would proxy path /highlighter/ to http://localhost:8998/. For details, see reverse proxying.