Setting Data Folder
By default, PDF Highlighter uses the file system for all data storage. If you don't explicitly set a data directory in application.conf, Highlighter will create and use a highlighter-cache folder in the system's tmp directory.
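The fallback described above can be sketched in Python as follows. This only illustrates the lookup order and is not Highlighter's actual code: the function name `resolve_data_dir` is hypothetical, and `tempfile.gettempdir()` stands in for whatever the server's runtime reports as the system tmp directory, which may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_data_dir(configured: Optional[str] = None) -> Path:
    """Return the folder where data files would be kept (illustrative only)."""
    if configured:
        # An explicitly configured data directory always wins.
        return Path(configured)
    # Otherwise fall back to a highlighter-cache folder in the tmp directory.
    return Path(tempfile.gettempdir()) / "highlighter-cache"
```

Calling `resolve_data_dir()` with no argument returns the tmp-based default; passing a path returns that path unchanged.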
To change where Highlighter keeps data files, set the highlighter.dataDir property in application.conf.
Types of Data Files
The details below are for in-depth understanding and are not needed for an occasional server setup.
Basically, there are three types of persisted data:
- Full text search index.
- Text positions cache.
- Results cache.
Full Text Search Index
One might ask why Highlighter needs its own search engine when it deals with one document at a time and, in the most common use case, is integrated with an external search solution. The reason is that an external search engine usually doesn't provide the type of data Highlighter needs. Highlighter indexes each PDF page individually so that it can quickly locate which pages of a document need to be highlighted.
Full text search index files are located in the index/solr data directory and are managed by the Apache Solr instance embedded in Highlighter.
The total index size depends on the set of PDF documents and is roughly 5%-10% of the total document size.
If you're going to use only the highlight-for-xml highlighting method, the full text search indexing module can be disabled.
Text Positions Cache
This cache stores the position on the page of each word in a document. When the cache exists, Highlighter can handle highlighting requests without reading the PDF each time. For performance reasons, multiple files are created per indexed document.
Text positions cache files are located below the index/text data directory. The total size depends on the set of PDF documents and is roughly 15%-20% of the total document size.
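For capacity planning, the percentages quoted for the search index and the text positions cache can be turned into a rough disk estimate. The sketch below is a back-of-the-envelope calculation under stated assumptions: the function name and the example document set size are made up for illustration, and the results cache is left out because its files are small.

```python
def estimate_storage_bytes(documents_bytes: int) -> dict:
    """Rough (low, high) disk estimates derived from the documented percentages."""
    def pct_range(low_pct: int, high_pct: int) -> tuple:
        # Integer arithmetic keeps the results exact for whole-byte inputs.
        return (documents_bytes * low_pct // 100, documents_bytes * high_pct // 100)

    index = pct_range(5, 10)    # full text search index: roughly 5%-10%
    text = pct_range(15, 20)    # text positions cache: roughly 15%-20%
    return {
        "search_index": index,
        "text_positions": text,
        "total": (index[0] + text[0], index[1] + text[1]),
    }


# Example: a hypothetical 10 GB document set.
estimate = estimate_storage_bytes(10 * 1000**3)
print(estimate["total"])  # (2000000000, 3000000000)
```

By this estimate, a 10 GB document set would need roughly 2 GB to 3 GB for the two stores combined. Actual sizes depend on the documents themselves.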
Results Cache
One file is created per handled highlight request, usually containing less than 1 KB of persisted data.
The results cache is cleaned up automatically according to the configured settings.
By default, PDF Highlighter caches data to boost the performance of both highlighting methods (highlight for query and highlight-for-xml).
However, in most use cases, only one of these methods is used.
Once you know which highlighting method you need, you can set up Highlighter to store data only for that method.
If you use only the highlight for query method, enable the following option in your application.conf:
If you only highlight PDF documents using highlight files, and do not want to use Power Search in the viewer, use:
After modifying storage options, you should clear Highlighter's cache folder and start fresh.
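Clearing the cache folder can be scripted. The sketch below empties a data folder while keeping the folder itself. It assumes Highlighter is stopped first and that the path you pass really is the data folder, since everything inside it is deleted. The function name `clear_data_dir` is hypothetical and not part of Highlighter.

```python
import shutil
from pathlib import Path


def clear_data_dir(data_dir: str) -> int:
    """Delete everything inside data_dir, keep the folder, return entries removed."""
    root = Path(data_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # sub-folders such as index/
        else:
            entry.unlink()         # plain files and symlinks
        removed += 1
    return removed
```

The function refuses to run on a path that is not an existing directory, which guards against a mistyped location.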