HTDig setup for Document Indexing

 

How to set up htdig for documentation indexing

This is getting somewhat dated, but the process is still useful.

This configuration allows indexing of text files, HTML documents, MS Word documents, and PDF documents.

RHEL 3 and RHEL 4 both ship with htdig. Install this package.

Download, build and install catdoc (numerous mirrors; I used http://www.45.free.net/~vitus/ice/catdoc/catdoc.1.html).

Download http://www.htdig.org/files/contrib/parsers/doc2html_31.zip and uncompress it in /usr/local/bin, so that doc2html.pl and pdf2html.pl are in /usr/local/bin. REMEMBER to make sure these scripts are executable!
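As a minimal sketch of the "make them executable" step, the helper below sets the execute bit on both parser scripts and reports any that are missing. The function name make_parsers_executable is made up for illustration, and it assumes the scripts were unpacked directly into the directory passed as its argument.

```shell
#!/bin/bash
# Hypothetical helper: mark the two doc2html parser scripts as executable.
# $1 is the directory the zip was unpacked into (e.g. /usr/local/bin).
make_parsers_executable() {
    local dir="$1" script status=0
    for script in doc2html.pl pdf2html.pl; do
        if [ -f "$dir/$script" ]; then
            chmod 755 "$dir/$script"
        else
            # Report the missing script but keep checking the other one.
            echo "not found: $dir/$script" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the layout described here, it would be called as: make_parsers_executable /usr/local/bin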

Download and install xpdf, which supplies pdftotext and pdfinfo.

Now, in /etc/htdig/htdig.conf:
1. Change start_url to the HTTP location of your docs, e.g. http://www.example.com/internal_docs
2. Set max_doc_size to 500000000 (500 MB; yes, we might have some really large docs to index).
3. Set maintainer to the appropriate email address.
4. Add:
external_parsers: application/pdf->text/html /usr/local/bin/pdf2html.pl \
application/msword->text/html /usr/local/bin/doc2html.pl
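A mistyped parser path in external_parsers is easy to miss, so a quick check can help. The sketch below pulls each parser path out of the config file and reports whether it exists and is executable. The function name check_parsers is hypothetical, and the parsing is an assumption: it treats whatever field follows a "type->type" mapping as the parser path, and it does not handle paths that contain spaces.

```shell
#!/bin/bash
# Hypothetical helper: list every parser path named after a "type->type"
# mapping in the config file and report whether it is executable.
# $1 is the config file (e.g. /etc/htdig/htdig.conf).
check_parsers() {
    local conf="$1" parser status=0
    # Skip comment lines; the field after a "->" mapping is the parser path.
    for parser in $(awk '!/^[ \t]*#/ { for (i = 1; i < NF; i++) if ($i ~ /->/) print $(i + 1) }' "$conf"); do
        if [ -x "$parser" ]; then
            echo "ok: $parser"
        else
            echo "missing or not executable: $parser"
            status=1
        fi
    done
    return "$status"
}
```

Running check_parsers /etc/htdig/htdig.conf before rundig gives one line per parser and a non-zero exit status if any of them cannot be run.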


In pdf2html.pl and doc2html.pl, you'll need to fix the interpreter line so it reads
#!/usr/bin/perl -w
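If you would rather not edit the files by hand, the interpreter line can be rewritten with sed. This is only a sketch: the function name fix_interpreter_line is made up, and it changes the first line only when that line already starts with "#!".

```shell
#!/bin/bash
# Hypothetical helper: rewrite the first line of a Perl script so it reads
# "#!/usr/bin/perl -w". The result is copied back into the original file
# (rather than moved over it) so the script keeps its permissions.
fix_interpreter_line() {
    local script="$1"
    sed '1s|^#!.*$|#!/usr/bin/perl -w|' "$script" > "$script.new" &&
        cat "$script.new" > "$script" &&
        rm -f "$script.new"
}
```

Example: fix_interpreter_line /usr/local/bin/pdf2html.pl, then the same for doc2html.pl.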

In pdf2html.pl, define
my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";


In doc2html.pl, define
my $CATDOC = '/usr/local/bin/catdoc';
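Since the paths defined in the two scripts must point at real binaries, it may be worth confirming them before indexing. A minimal sketch, with a made-up function name check_tools, that accepts either command names or full paths:

```shell
#!/bin/bash
# Hypothetical helper: report which of the given commands or paths cannot
# be run. Returns non-zero if any are missing.
check_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}
```

For the paths used here: check_tools /usr/bin/pdftotext /usr/bin/pdfinfo /usr/local/bin/catdoc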


Copy /usr/bin/htsearch to /var/www/cgi-bin.
Per the install docs that come with htdig, create search.html, header.html, footer.html, wrapper.html, nomatch.html, and syntax.html. These are described in the Configuration section of the htdig docs, which even include HTML that can be used for each. search.html should go in the root of your indexed directory; the others go in common_dir as defined in the htdig configuration. (Actually, these are all created for you under RHEL 3, but not under RHEL 4.) Create a symlink named search.html in the indexed directory that points back to /usr/share/htdig/search.html.
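The symlink step can be sketched as a small function. The name link_search_page is hypothetical, and the indexed directory in the usage line is an assumption taken from the paths used elsewhere in this page; substitute your own.

```shell
#!/bin/bash
# Hypothetical helper: create search.html in the indexed directory as a
# symlink to the copy shipped with htdig.
# $1 is the packaged page, $2 is the indexed directory.
link_search_page() {
    local target="$1" docroot="$2"
    [ -f "$target" ] || { echo "not found: $target" >&2; return 1; }
    # -f replaces an existing search.html link if one is already there.
    ln -sf "$target" "$docroot/search.html"
}
```

Example: link_search_page /usr/share/htdig/search.html /var/www/html/internal_info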

Finally, you're ready to run rundig, which will create databases and index your docs for fast searching.

NOTES:
1. Had to comment out the lines in /usr/bin/rundig related to "$verbose metaphone" and "$verbose soundex". These were causing segfaults, and removing them doesn't seem to have any adverse effects. Update: if you run "htfuzzy soundex" and "htfuzzy metaphone" first, you don't have to comment out these lines. That said, I ran into this again later, and it may just be easier to comment them out.
2. Make sure the directory to be indexed has Apache directory indexes (i.e. +Indexes) turned on. Also, be sure to limit who has access to this documentation, since it might contain sensitive info. You can limit access using Apache's built-in Order deny,allow...
3. Upgraded to the version that shipped with RHEL 4 (3.2.0b6).
4. To clean up search results, modify the htdig.conf file with these directives:

external_parsers: application/pdf->text/html /usr/local/bin/doc2html_31/pdf2html.pl \
application/msword->text/html /usr/local/bin/doc2html_31/doc2html.pl \
application/vnd.sun.xml.writer->text/html /usr/local/bin/doc2html_31/doc2html-new.pl

# The third line allows indexing OpenOffice docs. doc2html-new.pl is from the latest version of htdig in CVS; I just copied it into /usr/local and configured its OpenOffice line to use /usr/bin/unzip to extract OpenOffice docs.

start_url: http://www.example.com/internal_info

common_url_parts: ${limit_urls_to} .html .htm .shtml .doc .pdf .xls .sxw .sxc

exclude_urls: /cgi-bin/ .cgi /Pwd C=D C=M C=N C=S O=A O=D

maintainer: webmaster@example.com

max_doc_size: 200000000

5. The doc2html.pl script requires the LANG environment variable to be set to C instead of the Red Hat system default of UTF-8. Thus, any shell script that invokes rundig should first run "export LANG=C". Without this, doc2html.pl fails a lot.
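An alternative to exporting LANG for the whole script is to set it for a single command. A minimal sketch, with a made-up wrapper name run_in_c_locale:

```shell
#!/bin/bash
# Hypothetical wrapper: run one command with LANG=C in its environment,
# leaving the calling shell's own LANG untouched.
run_in_c_locale() {
    LANG=C "$@"
}
```

Example: run_in_c_locale rundig. Note that if LC_ALL is set in the environment it takes precedence over LANG, so it may need to be unset or overridden as well.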

6. In this scenario, documents are kept on a NAS, and therefore need to be sync'd to the web server first. Here's an example shell script put in /etc/cron.daily:

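#!/bin/bash
#
# Only run when the output of mount mentions /var/www.
if mount | grep -q /var/www; then
    # doc2html.pl needs LANG=C (see note 5).
    export LANG=C

    # Mount the NAS share, mirror the docs to the web server, then unmount.
    mount -t smbfs -o username=docuser,password=docpass //docnas/share /mnt/tmp
    rsync -a --delete --exclude="/Pwd" /mnt/tmp/docs/ /var/www/html/internal_info/
    umount /mnt/tmp

    # Rebuild the htdig databases, discarding all output.
    rundig &>/dev/null
fi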
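The cron script does not check whether the NAS mount succeeded before it runs rsync with --delete. One possible safeguard, sketched here with a hypothetical function name source_ready, is to confirm that the source directory exists and is not empty before syncing:

```shell
#!/bin/bash
# Hypothetical guard: succeed only if the directory exists and contains at
# least one entry, so a failed mount does not lead to syncing from nothing.
source_ready() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

In the cron script this could guard the sync step, for example: source_ready /mnt/tmp/docs && rsync -a --delete --exclude="/Pwd" /mnt/tmp/docs/ /var/www/html/internal_info/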