::Susa.NET
Home News Reviews Links G-Portal

About SusaNET

::DB2 for Linux: Full Text Searching using SWISH++

::A rough guide by Kevin Sangeelee (C)2000.

Introduction

At the time of writing, there is no Text Extender facility for DB2 on the Linux platform. The IBM text extender allows text columns of tables to be indexed, and searches can be performed using SQL.

This document describes way to provide this functionality using the text indexing package SWISH++, a User Defined Function (in this case, written in C), and a Perl script. It should be taken as an overview guide rather than as detailed instructions, since the solution may require some coercing to fit with other applications.

Many pitfalls and unnecessary complexities were avoided in my first implementation of this thanks to the postings from various people in the SWISH++ mailing list archives, and also thanks to the comprehensive documentation and well written & commented code in SWISH++ itself.

The ideas here can probably be applied to different indexers and database systems. Any database that allows you to define your own table functions (i.e. functions that return tables rather that scalars) should be suitable.

Overview

When searching for files or web addresses, an index-searcher locates filenames or URI's according to the given search criteria, and returns a list of filenames that match. In a similar way, if the indexer was given primary-keys and record-text to index instead of filenames and their contents, it could return the keys that match the search criteria.

The first part of the solution therefore involves giving records to the indexer one at a time. More precisely, we need to give the relevant text fields of each record to the indexer one at a time, and we need a way to tell it the associated primary-key to associate with it. However we accomplish this, the end result is an index that can be queried to return the associated primary keys.

The next problem then is how to integrate this with SQL, so that queries can include the results of a full text search.

The solution proposed is to use a User Defined Function that takes a query string as its parameter and calls the index-searcher with the provided query. It then reads the result of the query (the list of matching primary keys) and presents the list to the calling database system as a table.

This should allow us to perform simple queries like this: -

SELECT mykey FROM TABLE( TEXTSEARCH( 'red and (green or blue)' ) ) AS mytable

and more complex queries like this: -

SELECT stock_code FROM inventory, TABLE( TEXTSEARCH('red and (green or blue)')) AS catalog WHERE inventory.item_id = catalog.id

The sections below will describe the techniques used to achieve this and the last part of this document describes the actual implementation and includes the required code.

Creating the Index

The SWISH++ indexer can be persuaded to index database records by writing each record to a temporary file where the filename is the value of the primary key, and telling the indexer to index it. This can be accomplished using a scripting language like Perl.

The script connects to the database and executes a query that extracts the primary key and the relevant text columns. It then iterates through the set of records and, for each, it creates a new file with the name of the primary key, writes the associated text to this file, closes the file, and then tells the indexer to process it. Before continuing with the next record, the script waits for the indexer to finish its work and deletes the indexed file.

In order for this method of indexing to work efficiently and effectively, the SWISH++ indexer is capable of accepting the list of files to index via the standard input stream (stdin). This is used to good effect by launching the indexer as a background process that waits for input on stdin. It's launched prior to us iterating over the records, and it terminates when all records are finished. The indexer is launched using the Perl IPC::Open2() function which launches the specified program, and provides us with stdin and stdout handles for the new process. To instruct the indexer to index the next file, we simply print the name of the file (the primary key) to the returned stdin. We wait for the indexer to process the file by reading it's response from the returned stdout (i.e. a blocking read).

A further speed enhancement would be to run the script in a ram-disk working directory. The resulting index would still be written to the hard disk, but all temporary files would be created and deleted more quickly. However, it should be noted that the whole process is pretty quick anyway. For example, on a P300 with 128MB RAM I indexed over 100000 records in (much) less than ten minutes, and that included a sub-query within the main record fetching loop.

The end product here is an index file that can be used with the SWISH++ search command.

Creating the User Defined Function

DB2 allows user defined functions (UDFs) that return either scalar or table values. We can use a table UDF to provide a suitable interface between the SQL query engine of DB2 and the SWISH++ query engine.

Functions in DB2 can be written in various languages, however I've chosen C since it's perhaps the most widely documented. Regardless of language, a table function is called by DB2 at least 3 times for each SQL query that uses it. The first call that DB2 makes is an 'open' call. Then DB2 calls the function repeatedly until the function sets a 'no more records' return code. At this point DB2 makes a 'close' call. DB2 identifies the type of call it's making via a passed parameter. It also passes us a scratch-pad of memory to help us store values between function calls. The C program that defines our UDF is described below.

In the Open call, we make a call to the stdlib C function 'popen(cmd, "r")' with a string that calls the SWISH++ search command. The command itself is something like '/usr/local/packages/swish++/search <query>' where query is the search query that we passed in from the SQL. The popen() function returns us with a FILE pointer from which we can read the output of the command, which will look something like this: -

# results: 3
100 23456 1000 23456
87 23457 550 23457
60 45675 900 45675

The first line, predictably, tells us how many matches there were. Then, for each match, SWISH++ returns the grade, file, size of file, name of file.

It's here that we observe how the true purpose of SWISH++ was to index files, not records. The 'file' item would normally show the full path of the file, and the 'name of file' would show either the file name or, in the case of HTML files, the value between the <TITLE></TITLE> tags. Let's not worry about this and move on.

We're only really interested in the first line and the last field of each result line (we don't use the second field because this could potentially return a directory path before the filename, depending on how we do our indexing).

The first line is read, and the number of results extracted. We then malloc() an appropriately sized array to hold our keys, and read each line in turn adding the value of the last column to the array (converted to a long integer using atol()).

When all results have been read, we close the file handle and place the pointer to the array into the UDF's scratch-pad, along with the record count and a 'current record' counter initialised to zero. We're now ready to accept the 'fetch' calls to the function.

For each Fetch call that DB2 makes to the function, we simply extract the next element from the array held in the scratch-pad and place it at the pointer of our table function return value and return, setting the return status to 'more to come'. When the last result item has been given to DB2 we return, setting the return status to 'no more records'.

At this point DB2 will make a call to our UDF with a Close call type. We simply free() our malloc() array and return.

Installing the User Defined Function

When the UDF has been written and the C code compiled, the only remaining step is to install the UDF in a database. This simply involves connecting to your database and issuing a CREATE FUNCTION... statement with the appropriate parameters, The actual command is given below. Note that the function definition also defines the names of the columns returned. In this example, there is only one column returned, that is the id (key) of the records that match the supplied text. This ties in with the C code of the user defined function. If you want to return different key types, then you'll need to modify the C code and the function definition below.

-- First drop any existing definition

DROP FUNCTION tsearch;

-- Now create the function (these might not be the optimal set of parameters...)

CREATE FUNCTION tsearch(varchar(800))
		RETURNS TABLE(d_id integer)
	    	EXTERNAL NAME 'tsearch!tsearch'
		DETERMINISTIC
	    	EXTERNAL ACTION
		LANGUAGE C
		FENCED
		PARAMETER STYLE DB2SQL
	    	NO SQL
	    	SCRATCHPAD
		FINAL CALL
		DISALLOW PARALLEL;
			

Perl source code to build an index file (buildindex.pl)

Here's some sample code to build an index. It should be taken as a guide only, since you'll have to customise it to suit your database. It will require the Perl DBD driver for DB2 and of course the DBI module (see http://www.cpan.org, or the DB2 web site).

I'm not sure yet how feasible it would be to perform this process using only SQL and UDF's. However, since it's a 'do once, access many times' kind of operation, this will do fine for now.

#!/usr/bin/perl
use IPC::Open2;
use Shell qw(rm mv);
use DBI;
use DBD::DB2;

# **** First, process parameters.

# I hacked this in because indexing takes up a massive
# amount of memory on large sets. Doing it in chunks 
# using incremental indexing helps.

$pass = $ARGV[0];
if($pass == 1) {
	$fromDate = "1980-01-01"; $toDate = "1996-12-31";
	$flag = "";
} elsif($pass == 2) {
	$fromDate = "1997-01-01"; $toDate = "1998-12-31";
	$flag = "-I";
} else {
	die "Specify as a parameter either 1 or 2 (pass 1 or pass 2)\n";
}

$INDEX_HOME = '/home/db2inst1/indexing';
$ENV{DB2_HOME} = '/home/db2inst1';
$ENV{DB2INSTANCE} = 'db2inst1';

# *** Make the connection to your database
# ***
$dbh = DBI->connect('DBI:DB2:MYDBASE','db2inst1','< somepassword >',{AutoCommit => 1});

$inh = $dbh->prepare("select id from deals_map_gen where announce_date >= '$fromDate' and announce_date < '$toDate'");
$inh->execute;

$synopsish = $dbh->prepare("select text from deal_synopsis where id = ? order by pos");
$summaryh = $dbh->prepare("select type,text from deal_summary_txt where id = ? order by type,pos");

my ($fid,$fname);
my $counter = 0;

print "Starting\n";

open2( \*INDEX_STDOUT, \*INDEX_STDIN, "/usr/local/bin/index $flag -i$INDEX_HOME/swish++.index -v4 -");

while(($inid) = $inh->fetchrow_array) {

        $synopsish->execute($inid) || die $synopsish->errstr;

        $fname = $inid;
        $fname =~ tr/-/M/;

        open DEALFILE, ">$fname" or die "Can't create the deal file";

        # Write all text records for this 'id'
        while(($txt) = $synopsish->fetchrow_array) {
                print DEALFILE "$txt";
        }
        close DEALFILE;
        
        # Generated a file: give instruction to 'index' (by telling it the filename)
        print INDEX_STDIN "$fname\n";

        # Wait for 'index' to index the file by doing a blocking read.
        $_ = <INDEX_STDOUT>;

		# Now delete the file
        rm($fname);

        if((++$counter) % 250 == 0) {
                print "Done: $counter deals\n";
        }
} # end while

close INDEX_STDIN;

while ( <INDEX_STDOUT> ) {
}

$synopsish->finish;
$inh->finish;
$dbh->disconnect;

if($pass > 1) {
        print "Renaming $INDEX_HOME/swish++.index.new to swish++.index\n";
        mv("$INDEX_HOME/swish++.index.new", "$INDEX_HOME/swish++.index");
} else {
        print "Generated $INDEX_HOME/swish++.index from pass 1.\n";
}

User Defined Function source code (tsearch.c)

Examples and explanations on UDFs can be found in Don Chamberlain's book 'Complete Guide to DB2' (a well thumbed book), but I think the on-line docs go into some detail too.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <sql.h>
#include <sqludf.h> /* for use in compiling
User Defined Function */
 
/* structure scr defines the passed scratchpad
for the function "ctr" */

struct scr {
   long len;
   long *keybuf;
   int numitems;
   int currentitem;
   char not_used[88];
} ;
 
#ifdef __cplusplus
extern "C"
#endif

void SQL_API_FN tsearch (
		char *terms,
		long *key,
		short *termsnull,                   
		short *keynull,                   
		char *sqlstate,                      
		char *funcname,                      
		char *specname,                     
		char *mesgtext,                      
		struct scr *scratchptr,            /* scratch pad */
		SQLUDF_CALL_TYPE *calltype)
{

	FILE *output;
	char command[1024], line[1024];
	int i,j,k,l;
	char *escs = "!#~[]()*&<>";
	switch(*calltype) {
	
	 case SQL_TF_OPEN:
		scratchptr->numitems = scratchptr->currentitem = 0;
		scratchptr->keybuf = NULL;
		for(i=0, k=0; terms[i] != '\0'; i++) {
			for(j=0; escs[j] != '\0'; j++) {
				if(terms[i] == escs[j]) {
					line[k++] = '\\';
					break;
				}
			}
			if(terms[i] == '\r' || terms[i] == '\n' || terms[i] == '\t')
				line[k++] = ' ';
			else
				line[k++] = terms[i];
		}
		line[k] = '\0';
		sprintf(command,
			"/usr/local/bin/search -m999999 -i/home/db2inst1/indexing/swish++.index %s",
				line);
		output = popen(command, "r");
		if(output == NULL) {
			return;
		}
		while(!feof(output)) {
			if( fscanf(output, "%[^\n]\n", &line[0]) < 1 ) {
				break;
			}
			if(strstr(line, "# results:") != NULL) {
				if( (scratchptr->numitems = atol(&line[11])) > 0 )
					scratchptr->keybuf = 
						(long *)malloc(scratchptr->numitems * sizeof(long));
				else
					break;
			} else {
				if(line[0] == '#')
					continue;
				for(i=0,j=0; line[i] && i < 1000 && j < 3; i++, j+= line[i] == ' ');
				if(i == 1000 || line[i] == '\0')
					continue;
				if(line[++i] == 'M')
					line[i] = '-';
				scratchptr->keybuf[scratchptr->currentitem++] = atol(&line[i]);
			}
		}
		pclose(output);
		scratchptr->currentitem = 0;
		break;

	 case SQL_TF_FETCH:
	 	if(scratchptr->currentitem < scratchptr->numitems) {
	 		*key = scratchptr->keybuf[scratchptr->currentitem++];
	 		*keynull = 0;
			strcpy(sqlstate,"00000");
		} else {
	 		*key = 0;
	 		*keynull = 1;
			strcpy(sqlstate,"02000");
		}
	 	break;

	 case SQL_TF_CLOSE:
	 	if(scratchptr->keybuf != NULL)
	 		free(scratchptr->keybuf);
	 	break;
	}
}
/* end of UDF : tsearch */

Shell script to build the 'tsearch.c' UDF

Whipped straight out of the DB2 distribution, but it was missing on one installation of mine, so I thought I'd better include it just in case. You should be logged on as user db2inst1 (or whoever the instance owner is) when running this.

	#! /bin/ksh
	# bldccudf script file
	# Builds a C user-defined function library.
	# Usage: bldccudf <prog_name>
	 
	# Set DB2PATH to where DB2 will be accessed.
	
	# The default is the instance path.
	DB2PATH=~/sqllib
	 
	# Compile the program.
	cc -I$DB2PATH/include -c $1.c
	# Link the program and create a shared library.
	cc -o $1 $1.o -shared -L$DB2PATH/lib -ldb2 -ldb2apie
	# Copy the shared library to the function subdirectory.
	# The user must have write permission to this directory.
	rm -f $DB2PATH/function/$1
	cp $1 $DB2PATH/function
	

Putting it all together

In short, the steps required to implement this solution are as follows (after, of course, installing the Swish++ package and the Perl DBI & DBD for DB2) : -

1. Build and install tsearch.c after making any necessary changes (such as the location of the file 'swish++.index').

2. Modify buildindex.pl to connect to your own database and to query your own table(s). Then run it to generate your index file (swish++.index).

3. Run some SQL against your new index, such as SELECT d_id FROM TABLE( TSEARCH('some and text or exampl*') ), and you should get a list of keys that match your search.

Notes

If you create a RAM disk to speed up writing/indexing/deleting of the temporary files, simply run buildindex.pl from the ram-disk directory. The code as it stands will just generate and index in whatever the current directory is.

There's not a lot of code to this technique, so try to read and understand exactly what's going on, since there are a number of ways that the technique can be made more powerful (for example, see the META tag handling of Swish++ and think how it can be used in a database context).

The function deals with only one index at a time. It should be fairly easy to extend it to accept a second parameter to allow you to specify which index to query. Otherwise you'll have to compile a separate UDF (i.e. tsearch2.c) to handle your second index.

References

IBM DB2 for Linux Home
Swish++ Home
DB2 for Linux HOWTO
DB2 Perl Database Interface


::design © eleusis 2002 --> ::updated: 8th Sept 02 -->