Tuesday, March 2, 2010

DOCX Part II: How to Index Document Content With Oracle Text

Here I will demonstrate how to index content of CLOB data using Oracle Text. Other than the database objects used in the example, the post addresses the use of Oracle Text in a general way, and will work for any CLOB column. There is no ground breaking code here, but it completes the example on how to handle Microsoft Open Office XML Format (DOCX) documents. The examples and content here are based on my previous post.

The code here is tested with Oracle database release, but should work with releases down to 10g. When I say "tested", remember I am a developer, so it basically means it does compile and it did give the expected result on one run trough...

See it in Action
I have included a search page in my demo application. Upload a DOCX document, and try it for yourself: Go to demo application.

About Oracle Text
Oracle Text (formerly Oracle interMedia) is Oracles full-text retrieval technology, and part of Standard, Enterprise and even XE editions of the database. See Oracle Text on oracle.com for more information.

In my case I will use it to index text content stored in a CLOB column, and utilize it's search capabilities to enable users to retrieve relevant results based on search phrases.

Indexing CLOBs
Or BLOBs, or BFILEs, for that matter. It is very easy to get up and running with a text index. The following code is all that is needed to get a text index up and running against the TEXT_CONTENT column of the SAMPLES_DOCX table:
create index samples_docx_ctx on samples_docx
indextype is ctxsys.context
That's it! If you want to get fancy, there are a number of options and ways to enhance the indexing, look at the documentation for more information on the subject (lexers, sections, stopword lists, etc.).

Searching with Oracle Text
With the text index in place, the next step is to use the index to retrieve search results. In my example I will use a technique called "Progressive Relaxation" as described by Roger Ford here. It uses a query template to implement a progressive relaxation of the search tokens.

The following script creates a package to handle and use the new text index:
create or replace package samples_docx_search_p
   -- result types
   type t_result_rec is record (
      score         number,
      filename      samples_docx.filename%type,
      snippet       varchar2(4000));
   type t_result_tab is table of t_result_rec;
   -- synchronize search index
   procedure sync_search_index (
      p_ctx_index_name     in          varchar2 default 'SAMPLES_DOCX_CTX'
   -- search docx content
   function search_text (
      p_tokens    in          varchar2
   ) return t_result_tab pipelined;

create or replace package body samples_docx_search_p
   -- synchronize search index
   procedure sync_search_index (
      p_ctx_index_name     in          varchar2 default 'SAMPLES_DOCX_CTX'
   ) is
      ctx_ddl.sync_index(p_ctx_index_name, '2M');
   -- search docx content
   function search_text (
      p_tokens    in          varchar2
   ) return t_result_tab pipelined
      l_max_rows         integer := 10;
      l_counter          integer := 0;
      l_ret_rec          t_result_rec;
      l_tokens           varchar2(4000) := lower(p_tokens); 
      l_query_template   varchar2(32000):='<query>'
                              ||chr(10)||'   <textquery> heregoesthetokens'
                              ||chr(10)||'     <progression>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>/seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>'
                              ||chr(10)||'       <seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>'
                              ||chr(10)||'     </progression>'
                              ||chr(10)||'   </textquery>'
      l_query            varchar2(32000);
      if l_tokens is null
      end if;
      -- create query temlpate based on tokens
      l_query_template := replace(l_query_template, 'heregoesthetokens', l_tokens);
      -- restructure tokens for use with snippet
      l_tokens := replace(l_tokens, ' ', ' | ');
      for c in (select rowid
                     , filename 
                  from samples_docx 
                 where contains (text_content, l_query_template, 1) > 0)
         -- crude custom score, just listing the results as they come
         l_ret_rec.score := l_counter;
         l_ret_rec.filename := c.filename;
         -- create snippet (with tokenbased highlight) of content to return to the user
         l_ret_rec.snippet :=  ctx_doc.snippet('SAMPLES_DOCX_CTX', c.rowid, l_tokens, '<b>', '</b>', false);
         l_counter := l_counter + 1;
         -- return row
         pipe row (l_ret_rec);
         -- exit when max number of rows exceeded
         exit when l_counter >= l_max_rows;
      end loop;
Some parts of the code are more cryptic than others.

First up is the "SYNC_SEARCH_INDEX" procedure. It's sole purpose is to synchronize the search index with the TEXT_CONTENT column. When you perform DML on the TEXT_CONTENT column, changes to the text index does not automatically propagate to the search index (unless the index is explicitly told to do so). So, in this example, if you insert a new row, the TEXT_CONTENT will not show up through Oracle Text searches until you have told it to synchronize the index. Indexing can be a resource demanding operation, so it makes sense to separate table DML from the indexing job.

In short: Make sure you synchronize the text index after DML on the indexed column has been performed.

Next is the "SEARCH_TEXT" function, particularly the query template. In the past I have used Oracle Text in its most simple form, but query templates adds a new dimension to controlling your searches. Basically you define how you want to use the search tokens (search phrase) to find matches in the text index. It is written in XML-form, and the <seq>-tag states the order of which you want to process your criteria. For more information on query templates, take a look at the documentation.

A short explanation on the query template:
  • <rewrite>transform((TOKENS, "{", "}", " "))</rewrite>: Matches when whole phrase (tokens in order) is present within document
  • <rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>: Matches when each token is present within document
  • <rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>: Matches when each token is present within docoument, allowing typos (fussy search)
  • <rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document
  • <rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document, allowing typos (fussy search)

Although it will return matches according to the sequence, the "score" within the sequence will be random (as far as I could see at least), so the "score" column in the example above needs a bit of work. I also learned the hard way that structure of the XML in the query template is not validated, and it swallows typos without a word.

Test the Code
You can test the code directly from SQL*Plus or your favorite weapon of choice. The following code will search for "test":
select filename
,      snippet
from   table(samples_docx_search_p.search_text('test'))
order  by score

My query template does not support searching for parts of words, there are some reasons for this behavior.
  • Wildcard searches requires more resources
  • The results returned will be unpredictable. Imagine searching for "ora*", both "oracle" and "orange" will match.
  • Wildcard searches are more suitable for look-ups in structured data
What did I mean by that last statement? As I see it, users today are using a wide range of search engines and site searches every day, and wildcard matching are in most cases available only through the "advanced search"-link (that is rarely used, if ever). Let alone that people do not expect this kind of behavior anymore. If you want to help the user, add ajax autocomplete to your search field, and use the Oracle Text index. In the traditional back office application, it is another matter (look-ups and such).

Create a Search Page in Oracle APEX
I will not do a detailed explanation here, but creating a report based on a query will do the trick. If you choose "Yes" for "Enable search" in the wizard it will save you some work by creating the search dialog on the page. Clean up the query in the report region so that tour query looks something like this:
select filename
,      snippet
from   table(samples_docx_search_p.search_text(:PXX_REPORT_SEARCH))
order  by score
When keeping the order by, I assume that you will let the score column determine the result order. For good measure you can create a row template for the result representation in the APEX report so you can have heading and snippet spread over two rows.

Update Sample Upload Page
If you created the upload page described in my previous post, include a new page process (After Submit, and after the "extract_text"-process). The process should be of PL/SQL-type, and the code should look something like this:
It does not necessary have to be a page process, but it is important to synchronize the text index after DML.

How to Handle DOC and DOCX
...And PDF, XLS, PPT, etc. Well, the DOCX is the exception here. Oracle Text does not understand the format yet (unless you are running on release or higher). If you create an Oracle Text index on a BLOB column containing other file types which it does understand, they will be indexed automagically. If you want to index the content of a DOCX document in the same index, use the technique described in my last post, convert the output (plain text) into a BLOB, and insert the BLOB into the table with the other documents. Or said in a different way: you except and handle DOCX documents differently than the known document types.

Querying an Oracle Text index works the same for BLOB and CLOB columns.

There is definitely room for improvement in the code, but it serves well enough for demonstration purposes.

Enjoy :-)

1 comment: