The code here is tested with Oracle database release 11.1.0.6, but should work with releases down to 10g. When I say "tested", remember I am a developer, so it basically means it does compile and it did give the expected result on one run trough...
See it in Action
I have included a search page in my demo application. Upload a DOCX document, and try it for yourself: Go to demo application.
About Oracle Text
Oracle Text (formerly Oracle interMedia) is Oracles full-text retrieval technology, and part of Standard, Enterprise and even XE editions of the database. See Oracle Text on oracle.com for more information.
In my case I will use it to index text content stored in a CLOB column, and utilize it's search capabilities to enable users to retrieve relevant results based on search phrases.
Indexing CLOBs
Or BLOBs, or BFILEs, for that matter. It is very easy to get up and running with a text index. The following code is all that is needed to get a text index up and running against the TEXT_CONTENT column of the SAMPLES_DOCX table:
create index samples_docx_ctx on samples_docx (text_content) indextype is ctxsys.context /That's it! If you want to get fancy, there are a number of options and ways to enhance the indexing, look at the documentation for more information on the subject (lexers, sections, stopword lists, etc.).
Searching with Oracle Text
With the text index in place, the next step is to use the index to retrieve search results. In my example I will use a technique called "Progressive Relaxation" as described by Roger Ford here. It uses a query template to implement a progressive relaxation of the search tokens.
The following script creates a package to handle and use the new text index:
create or replace package samples_docx_search_p as -- result types type t_result_rec is record ( score number, filename samples_docx.filename%type, snippet varchar2(4000)); type t_result_tab is table of t_result_rec; -- synchronize search index procedure sync_search_index ( p_ctx_index_name in varchar2 default 'SAMPLES_DOCX_CTX' ); -- search docx content function search_text ( p_tokens in varchar2 ) return t_result_tab pipelined; end; / create or replace package body samples_docx_search_p as -- synchronize search index procedure sync_search_index ( p_ctx_index_name in varchar2 default 'SAMPLES_DOCX_CTX' ) is begin ctx_ddl.sync_index(p_ctx_index_name, '2M'); end; -- search docx content function search_text ( p_tokens in varchar2 ) return t_result_tab pipelined as l_max_rows integer := 10; l_counter integer := 0; l_ret_rec t_result_rec; l_tokens varchar2(4000) := lower(p_tokens); l_query_template varchar2(32000):='<query>' ||chr(10)||' <textquery> heregoesthetokens' ||chr(10)||' <progression>' ||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>' ||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>/seq>' ||chr(10)||' <seq><rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite></seq>' ||chr(10)||' <seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>' ||chr(10)||' <seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>' ||chr(10)||' </progression>' ||chr(10)||' </textquery>' ||chr(10)||'</query>'; l_query varchar2(32000); begin if l_tokens is null then return; end if; -- create query temlpate based on tokens l_query_template := replace(l_query_template, 'heregoesthetokens', l_tokens); -- restructure tokens for use with snippet l_tokens := replace(l_tokens, ' ', ' | '); for c in (select rowid , filename from samples_docx where contains (text_content, l_query_template, 1) > 0) loop -- crude custom score, just listing the results as they come l_ret_rec.score := l_counter; l_ret_rec.filename := c.filename; -- create snippet (with tokenbased highlight) of content to return to the user ctx_doc.set_key_type('ROWID'); l_ret_rec.snippet := ctx_doc.snippet('SAMPLES_DOCX_CTX', c.rowid, l_tokens, '<b>', '</b>', false); l_counter := l_counter + 1; -- return row pipe row (l_ret_rec); -- exit when max number of rows exceeded exit when l_counter >= l_max_rows; end loop; end; end; /Some parts of the code are more cryptic than others.
First up is the "SYNC_SEARCH_INDEX" procedure. It's sole purpose is to synchronize the search index with the TEXT_CONTENT column. When you perform DML on the TEXT_CONTENT column, changes to the text index does not automatically propagate to the search index (unless the index is explicitly told to do so). So, in this example, if you insert a new row, the TEXT_CONTENT will not show up through Oracle Text searches until you have told it to synchronize the index. Indexing can be a resource demanding operation, so it makes sense to separate table DML from the indexing job.
In short: Make sure you synchronize the text index after DML on the indexed column has been performed.
Next is the "SEARCH_TEXT" function, particularly the query template. In the past I have used Oracle Text in its most simple form, but query templates adds a new dimension to controlling your searches. Basically you define how you want to use the search tokens (search phrase) to find matches in the text index. It is written in XML-form, and the <seq>-tag states the order of which you want to process your criteria. For more information on query templates, take a look at the documentation.
A short explanation on the query template:
- <rewrite>transform((TOKENS, "{", "}", " "))</rewrite>: Matches when whole phrase (tokens in order) is present within document
- <rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite>: Matches when each token is present within document
- <rewrite>transform((TOKENS, "?{", "}", "AND"))</rewrite>: Matches when each token is present within docoument, allowing typos (fussy search)
- <rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document
- <rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite>: Matches when any of the tokens are present within document, allowing typos (fussy search)
Although it will return matches according to the sequence, the "score" within the sequence will be random (as far as I could see at least), so the "score" column in the example above needs a bit of work. I also learned the hard way that structure of the XML in the query template is not validated, and it swallows typos without a word.
Test the Code
You can test the code directly from SQL*Plus or your favorite weapon of choice. The following code will search for "test":
select filename , snippet from table(samples_docx_search_p.search_text('test')) order by score /
Wildcards
My query template does not support searching for parts of words, there are some reasons for this behavior.
- Wildcard searches requires more resources
- The results returned will be unpredictable. Imagine searching for "ora*", both "oracle" and "orange" will match.
- Wildcard searches are more suitable for look-ups in structured data
Create a Search Page in Oracle APEX
I will not do a detailed explanation here, but creating a report based on a query will do the trick. If you choose "Yes" for "Enable search" in the wizard it will save you some work by creating the search dialog on the page. Clean up the query in the report region so that tour query looks something like this:
select filename , snippet from table(samples_docx_search_p.search_text(:PXX_REPORT_SEARCH)) order by scoreWhen keeping the order by, I assume that you will let the score column determine the result order. For good measure you can create a row template for the result representation in the APEX report so you can have heading and snippet spread over two rows.
Update Sample Upload Page
If you created the upload page described in my previous post, include a new page process (After Submit, and after the "extract_text"-process). The process should be of PL/SQL-type, and the code should look something like this:
samples_docx_search_p.sync_search_index;It does not necessary have to be a page process, but it is important to synchronize the text index after DML.
How to Handle DOC and DOCX
...And PDF, XLS, PPT, etc. Well, the DOCX is the exception here. Oracle Text does not understand the format yet (unless you are running on release 11.1.0.7 or higher). If you create an Oracle Text index on a BLOB column containing other file types which it does understand, they will be indexed automagically. If you want to index the content of a DOCX document in the same index, use the technique described in my last post, convert the output (plain text) into a BLOB, and insert the BLOB into the table with the other documents. Or said in a different way: you except and handle DOCX documents differently than the known document types.
Querying an Oracle Text index works the same for BLOB and CLOB columns.
There is definitely room for improvement in the code, but it serves well enough for demonstration purposes.
Enjoy :-)
No comments:
Post a Comment