Highlighter Data Files

Setting Data Folder

By default, PDF Highlighter stores all of its data on the file system. If you don't explicitly set a data directory in your application.conf, Highlighter creates and uses a highlighter-cache folder in the system's tmp directory.

To change where Highlighter keeps its data files, set the highlighter.dataDir property in application.conf:

highlighter {
  dataDir = "D:/highlighter-data"
}

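For maintenance scripts that need to find the data folder, the lookup rule above can be sketched in Python. This is an illustration only: `resolve_data_dir` is a hypothetical helper, not part of Highlighter, and it assumes the "system's tmp directory" is the one Python's `tempfile.gettempdir()` reports, which may differ from what the Highlighter process itself sees.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Default folder name taken from the text above.
DEFAULT_FOLDER_NAME = "highlighter-cache"


def resolve_data_dir(configured: Optional[str] = None) -> Path:
    """Return the configured data dir, or the default folder in the tmp dir.

    `configured` stands for the value of highlighter.dataDir, if any.
    Assumption: the tmp dir seen by Python matches the one Highlighter uses.
    """
    if configured:
        return Path(configured)
    return Path(tempfile.gettempdir()) / DEFAULT_FOLDER_NAME
```

Calling `resolve_data_dir("D:/highlighter-data")` returns that path unchanged, and calling it with no argument returns the highlighter-cache folder under the tmp directory.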
Types of Data Files


The details below are for in-depth understanding and are not needed for an occasional server setup.

Basically, there are three types of persisted data:

  1. Full text search index.
  2. Text positions cache.
  3. Results cache.

Full Text Search Index

One might ask why Highlighter needs its own search engine when it deals with one document at a time and, in the most common use case, is integrated with an external search solution. The reason is that an external search engine usually doesn't provide the type of data Highlighter needs. Highlighter indexes each PDF page individually so that it can quickly locate the pages that need to be highlighted.

Full text search index files are located in the index/solr data directory and are managed by the Apache Solr instance embedded in Highlighter.

The total index size depends on the set of PDF documents and is roughly 5-10% of the total document size.

If you are going to use only the highlight-for-xml highlighting method, the full text search indexing module can be disabled.

Text Positions Cache

This cache stores the position of each word on each page of a document. When the cache exists, Highlighter can handle highlighting requests without reading the PDF each time. For performance reasons, multiple files are created per indexed document.

Text position cache files are located below the index/text data directory. The total size depends on the set of PDF documents and is roughly 15-20% of the total document size.
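The rough percentages quoted for the search index (5-10%) and the text positions cache (15-20%) can be turned into a quick disk-space estimate. The sketch below is plain arithmetic on those figures; `estimate_storage` is a hypothetical helper, and the results cache is left out because its size depends on request volume rather than on document size.

```python
from typing import Dict, Tuple

# Rough ranges quoted in this document, as percent of total PDF size.
INDEX_PERCENT = (5, 10)        # full text search index
TEXT_CACHE_PERCENT = (15, 20)  # text positions cache


def estimate_storage(total_pdf_bytes: int) -> Dict[str, Tuple[int, int]]:
    """Return (low, high) byte estimates for each kind of data file."""

    def scale(percent_range: Tuple[int, int]) -> Tuple[int, int]:
        low, high = percent_range
        # Integer arithmetic avoids floating-point rounding surprises.
        return total_pdf_bytes * low // 100, total_pdf_bytes * high // 100

    index = scale(INDEX_PERCENT)
    text_cache = scale(TEXT_CACHE_PERCENT)
    total = (index[0] + text_cache[0], index[1] + text_cache[1])
    return {"index": index, "text_cache": text_cache, "total": total}
```

With both caches enabled, the combined estimate is 20-30% of the document size, so 10 GiB of PDFs would need roughly 2 to 3 GiB of additional storage.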

Results Cache

One file is created per handled highlight request, usually containing less than 1 KB of persisted data.

The results cache is cleaned up automatically according to the configured settings.
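To check how much space the data files actually take, the folder sizes can be summed with a short script. This is a minimal sketch, assuming that index/solr and index/text are subfolders of the configured data directory; `dir_size` and `report_usage` are hypothetical helpers, not part of Highlighter.

```python
import os
from pathlib import Path
from typing import Dict


def dir_size(path: Path) -> int:
    """Sum the sizes, in bytes, of all regular files below path (0 if missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if file_path.is_file():
                total += file_path.stat().st_size
    return total


def report_usage(data_dir: str) -> Dict[str, int]:
    """Report bytes used by the search index, the text cache, and the whole folder."""
    base = Path(data_dir)
    return {
        # Assumption: both paths are relative to the data directory.
        "index/solr": dir_size(base / "index" / "solr"),
        "index/text": dir_size(base / "index" / "text"),
        "total": dir_size(base),
    }
```

Comparing the reported numbers with the total size of the indexed PDFs shows whether a given document set falls within the rough percentages quoted above.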

Optimizing Storage

By default, PDF Highlighter caches data to boost the performance of both highlighting methods (highlight-for-query and highlight-for-xml). However, in most use cases, only one of these methods is used.

Once you know which highlighting method you need, you can set up Highlighter to store data only for that method.

If you use only the highlight-for-query method, enable the following option in your application.conf:

highlighter.indexing {
  storeDataForHighlightForQueryMethodOnly = true
}

If you highlight PDF documents only by using highlight files and do not want to use Power Search in the viewer, use:

highlighter.indexing {
  storeDataForHighlightForXmlMethodOnly = true
  solr.disabled = true
}

After modifying the storage options, clear Highlighter's cache folder and start fresh.
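Clearing the folder can be scripted. The sketch below is one possible approach, not an official procedure: it assumes that the "cache folder" is the configured highlighter.dataDir (or the default highlighter-cache folder) and that Highlighter has been stopped before the files are removed. `clear_data_dir` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def clear_data_dir(data_dir: str) -> int:
    """Delete everything inside data_dir but keep the folder itself.

    Returns the number of top-level entries removed.
    Assumption: Highlighter is not running while this executes.
    """
    base = Path(data_dir)
    if not base.is_dir():
        return 0
    removed = 0
    for entry in base.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and all its contents
        else:
            entry.unlink()        # remove a file or a symbolic link
        removed += 1
    return removed
```

Keeping the top-level folder in place preserves its permissions and any mount point, so the next start finds an empty but valid data directory.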