Monday, March 1, 2010

DOCX Part I: How to Extract Document Content as Plain Text

In this post I will demonstrate how to extract plain text from Microsoft Open Office XML Format (DOCX) documents stored as a BLOB data type in an Oracle database. It is part one of two posts concerning how to extract and index unstructured content stored in DOCX files. I found snippets of solutions several places, but no complete copy-paste solution, so perhaps this will save some of you a bit of work.

The solution is tested on Oracle database Enterprise Edition release 11.1.0.6 with Apex 3.2.1 running on Windows XP, but should work down to 10g Standard Edition (not Oracle XE, since part of the code is in Java).

See it In Action
I have updated my demo application with a couple of pages so you can try it out yourselves: Go to demo application. There is a limited space available at apex.oracle.com, so please take care when uploading (and it is very easy to test in your own environment).


When will You Ever Need This?
If you have an Oracle database version prior to 11.1.0.7, Oracle Text is not able to index DOCX documents (as it does DOC documents). From version 11.1.0.7 and on, Oracle uses technology from Stellent, and DOCX is indexed as the other formats. So if you want to index DOCX text content in Oracle Text in a version prior to 11.1.0.7, then this could be a way to do it.

Oracle is working on a back-port to 10.x, but I have no status when (if) this will be available. Microsoft Office 2007 has been around since.. Take a wild guess! So these things obviously takes some time.

About Microsoft Open Office XML Format
As seen from the eyes of a PL/SQL developer, that is. First of all, XML sounds promising, you can do a ton of things with a valid XML in an Oracle database. You get the first nasty surprise when opening a DOCX-document in your favorite text editor; it is a zip archive! The next is when you realize that utl_compress can't help you uncompressing it either.

So Google next, and realizing this cannot be easily done in a pure PL/SQL solution, Google yields this gem from Ted Neward. It is a DOCX (Open XML) walk through as seen from the eyes of a Java developer. Very educational.

How to Unzip in PL/SQL
You probably can, but that would probably also involve a lot of work. The easy way is to take hold of a Java method that already does what you want. After searching the net I came up with this post from peterv6i which does exactly what I want (and more, it can also add files (BLOB) to an existing zip archive).

My short version of this (as I only need to extract content), follows below. The script creates a java class with a method to extract a file from a zip archive, and a PL/SQL wrapper to the getFileFromZip method. The database user must have JAVA_DEPLOY and JAVAUSERPRIV (I think) roles.
create or replace java source named "ZIPImpl" 
AS 
import java.io.*; 
import java.util.zip.*; 
import java.sql.*; 
import oracle.sql.*; 
public class ZIPImpl 
{ 
public static void getFileFromZip(oracle.sql.BLOB srcBlob, oracle.sql.BLOB dstBlob[], java.lang.String name) { 
try { 
   OutputStream outBuffer = dstBlob[0].getBinaryOutputStream(); 
   InputStream inBuffer = srcBlob.getBinaryStream(); 
   ZipInputStream zip = new ZipInputStream(inBuffer); 
   ZipEntry entry; 
   byte[] tmpBuffer = new byte[2048]; 
   while((entry = zip.getNextEntry()) != null) { 
      if (entry.getName().compareTo(name)==0) { 
         int n; 
         while ((n = zip.read(tmpBuffer)) >= 0) 
           outBuffer.write(tmpBuffer, 0, n); 
      } 
   } 
   outBuffer.close(); 
 } 
 catch (SQLException e) { 
   System.err.println(e); 
 } 
 catch (IOException e) { 
   System.err.println(e); 
 } 
} 
}; 
/

alter java source "ZIPImpl" compile 
/      

create or replace package zip as 
   procedure unzip( 
      p_src       in          blob
   ,  p_dst       in out      blob
   ,  p_filename  in          varchar2); 
end; 
/

create or replace package body zip as 
   procedure unzip( 
      p_src       in          blob
   ,  p_dst       in out      blob
   ,  p_filename  in          varchar2) 
   as language java 
   name 'ZIPImpl.getFileFromZip(oracle.sql.BLOB, oracle.sql.BLOB[], java.lang.String)'; 
end; 
/ 
How to Extract Plain Text Content from DOCX
So, now you have a procedure to extract a file from a zip archive, next is to attack the content of the file. The DOCX file consists of several files, but the text content of the document resides in "word/document.xml". As the file is an XML document, I will perform an XML/XSL transformation of the file to be left with only plain text.

The script below creates a table SAMPLES_DOCX which will hold the original DOCX file stored as a BLOB, and a CLOB containing the plain text. It also creates a package with three procedures:
  • Store the DOCX (as uploaded into APEX_APPLICATION_FILES)
  • Download the original file (strictly not necessary for this exercise, but nice to have all the same)
  • Extract and store plain text from DOCX

The XSL used to transform the document.xml file is an (extremely) abbreviated version of the XSL posted here in the Oracle Technical Forums.
create table samples_docx (
   filename       varchar2(255) not null
,  mime_type      varchar2(255)
,  orig_file      blob
,  text_content   clob)
/

alter table samples_docx add constraint samples_docx_pk primary key (filename)
/

create or replace package sample_docx_p as
   -- get file from apex_application_files and store it in samples_docx table
   procedure store_docx (
      p_file_name    in       varchar2);
   -- download procedure for original docx file
   procedure retrieve_docx (
      p_file_name    in       varchar2);
   -- extract plain text from docx file (this is the meat)
   procedure store_text (
      p_file_name    in       varchar2);
end;
/

create or replace package body sample_docx_p as
   -- get file from apex_application_files and store it in samples_docx table
   procedure store_docx (
      p_file_name    in       varchar2)
   is
      l_file         blob;
      l_mime_type    apex_application_files.mime_type%type;
      l_name         apex_application_files.name%type;
   begin
      -- get file from apex files
      select  name
            , mime_type
            , blob_content
       into   l_name
            , l_mime_type
            , l_file
       from   apex_application_files
      where   name = p_file_name;
      -- insert record into samples table
      insert into samples_docx ( filename
                               , mime_type
                               , orig_file)
       values   (   l_name
                  , l_mime_type
                  , l_file);
      -- delete file from apex files when done
      delete from   apex_application_files
           where   name = p_file_name;
   end store_docx;
   -- download procedure for original docx file
   procedure retrieve_docx (
      p_file_name    in       varchar2)
   is
      l_file         blob;
      l_mime_type    apex_application_files.mime_type%type;
      l_name         apex_application_files.name%type;
      l_size         number;
   begin
      -- get file from apex files
      select  filename
            , mime_type
            , orig_file
            , dbms_lob.getlength(orig_file)
       into   l_name
            , l_mime_type
            , l_file
            , l_size
       from samples_docx   
      where   filename = p_file_name;
      -- return file
      owa_util.mime_header( nvl(l_mime_type,'application/octet'), false);
      htp.p('Content-length: ' || l_size);
      htp.p('Content-Disposition:  attachment; filename="'||replace(replace(substr(l_name,instr(l_name,'/')+1),chr(10),null),chr(13),null)|| '"');
      owa_util.http_header_close;
      wpg_docload.download_file(l_file);
   end retrieve_docx;
   -- perform xsl tranformation of document.xml
   function get_text (
      p_docx_xml     in       xmltype
   ) return clob
   is
   l_clob   clob;
   -- xsl monkeyed from http://forums.oracle.com/forums/thread.jspa?messageID=3368284
   -- abbreviated quite a bit, check out original posting by "user304344" for the original
   l_xsl    xmltype := xmltype('<?xml version="1.0" encoding="utf-8"?>'
                     ||chr(10)||'<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
                     ||chr(10)||'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
                     ||chr(10)||'xmlns:v="urn:schemas-microsoft-com:vml"'
                     ||chr(10)||'exclude-result-prefixes="w v">'
                     ||chr(10)||'<xsl:output method="text" indent="no" encoding="UTF-8" version="1.0"/>'
                     ||chr(10)||'<!-- document root -->'
                     ||chr(10)||'<xsl:template match="/">'
                     ||chr(10)||'<!-- root element in document --> '
                     ||chr(10)||'<xsl:apply-templates select="w:document"/> '
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- ****************************start document**************************** -->'
                     ||chr(10)||'<xsl:template match="w:document">'
                     ||chr(10)||'<xsl:for-each select="//w:p">'
                     ||chr(10)||'<xsl:apply-templates select="*/w:t"/> '
                     ||chr(10)||'<xsl:text>|¤¤</xsl:text> '
                     ||chr(10)||'</xsl:for-each> '
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- get all text nodes within a para -->'
                     ||chr(10)||'<xsl:template match="*/w:t">'
                     ||chr(10)||'<xsl:value-of select="."/>'
                     ||chr(10)||'</xsl:template>'
                     ||chr(10)||'<!-- **************************** end document**************************** -->'
                     ||chr(10)||'</xsl:stylesheet>');
   begin
      -- "|¤¤" is just a hack to get linebreaks, should be an easier way to achieve this
      select replace(xmltransform(p_docx_xml, l_xsl).GetClobVal(), '|¤¤', chr(10))
        into l_clob
        from dual;
      return l_clob;
   end;   
   -- extract plain text from docx file (this is the meat)
   procedure store_text (
      p_file_name    in       varchar2)
   is
      l_docx            blob;
      l_docx_unzip      blob;
      l_doc             clob;
      l_dest_offset     integer := 1;
      l_src_offset      integer := 1;
      l_lang_context    integer := dbms_lob.default_lang_ctx;
      l_warning         integer;
   begin
      -- get original file
      select orig_file
        into l_docx
        from samples_docx
       where filename = p_file_name;
      -- create lob locators
      dbms_lob.createtemporary(l_docx_unzip,false);
      dbms_lob.createtemporary(l_doc,false);
      -- use java to unzip the docx file and retrieve document.xml
      zip.unzip(l_docx, l_docx_unzip, 'word/document.xml');
      -- convert blob to clob
      dbms_lob.converttoclob
               ( dest_lob => l_doc
               , src_blob => l_docx_unzip
               , amount => dbms_lob.lobmaxsize
               , dest_offset => l_dest_offset
               , src_offset => l_src_offset
               , blob_csid => nls_charset_id('AL32UTF8') --in my case, it is stored as UTF8
               , lang_context => l_lang_context
               , warning => l_warning
               );
      -- transform clob via xsl to get clean text
      l_doc := get_text(xmltype(l_doc));
      -- update the column containing document text
      update samples_docx
         set text_content = l_doc
       where filename = p_file_name;
      -- clean lob locators, should be repeated in exception block
      dbms_lob.freetemporary(l_docx_unzip);
      dbms_lob.freetemporary(l_doc);
   end;
end;
/
Bringing it Together in Oracle APEX
To test the code, you can create a simple upload page in APEX:
  • Create new page
  • Form
  • Form on a procedure
  • <schema name> where you installed the application
  • Choose "sample_docx_p.store_docx" as stored procedure name
  • Accept defaults (or change to your liking)
  • Choose desired tab options
  • Leave Invoking Page/Button Label blank if you do not want an invoking page
  • Choose branch pages (can be the same as the one you are creating)
  • Accept defaults
  • Click Finish
  • Click Edit Page
  • Choose PXX_FILE_NAME, and change "Display as" from "Text item" to "File Browse..."
  • Apply changes

To add post processing to extract DOCX file content:
  • Click Create under Processes
  • Choose PL/SQL
  • Give the process a name like "extract_text"
  • In the "Enter PL/SQL Page Process" dialog, paste "sample_docx_p.store_text(:PXX_FILE_NAME);" (XX being your page number)
  • Create process (and ensure the process comes after "Run Stored Procedure" process created by the page wizard

The last process could be implemented as a trigger on the samples_docx table.

Run the page and test by uploading a DOCX document (or test it in my demo application).

Building More
If you have a need to manipulate DOCX documents, this could probably be achieved by using the complete "ZIPImpl" as posted in this article, and using substitution variables in the same manner as I have described with RTF documents in an earlier post.

The zip/unzip may also come in handy when handling other office files (like Open Office ODF).

This is only a proof of concept, and the code is not production ready. Really! I'm not just saying that out of habit.

The next (and final) part of the DOCX "series" will address how to index the plain text content using Oracle Text.

3 comments:

  1. Great post, and very useful! :-)

    ReplyDelete
  2. @morten

    Thanks! And I know just the project for it! Coming to a server near you ;-)

    ReplyDelete
  3. See my latest blog: http://technology.amis.nl/blog/8090/parsing-a-microsoft-word-docx-and-unzip-zipfiles-with-plsql

    ReplyDelete