The solution is tested on Oracle database Enterprise Edition release 11.1.0.6 with Apex 3.2.1 running on Windows XP, but should work down to 10g Standard Edition (not Oracle XE, since part of the code is in Java).
See it In Action
I have updated my demo application with a couple of pages so you can try it out yourselves: Go to demo application. There is a limited space available at apex.oracle.com, so please take care when uploading (and it is very easy to test in your own environment).
When will You Ever Need This?
If you have an Oracle database version prior to 11.1.0.7, Oracle Text is not able to index DOCX documents (as it does DOC documents). From version 11.1.0.7 and on, Oracle uses technology from Stellent, and DOCX is indexed as the other formats. So if you want to index DOCX text content in Oracle Text in a version prior to 11.1.0.7, then this could be a way to do it.
Oracle is working on a back-port to 10.x, but I have no status when (if) this will be available. Microsoft Office 2007 has been around since.. Take a wild guess! So these things obviously takes some time.
About Microsoft Open Office XML Format
As seen from the eyes of a PL/SQL developer, that is. First of all, XML sounds promising, you can do a ton of things with a valid XML in an Oracle database. You get the first nasty surprise when opening a DOCX-document in your favorite text editor; it is a zip archive! The next is when you realize that utl_compress can't help you uncompressing it either.
So Google next, and realizing this cannot be easily done in a pure PL/SQL solution, Google yields this gem from Ted Neward. It is a DOCX (Open XML) walk through as seen from the eyes of a Java developer. Very educational.
How to Unzip in PL/SQL
You probably can, but that would probably also involve a lot of work. The easy way is to take hold of a Java method that already does what you want. After searching the net I came up with this post from peterv6i which does exactly what I want (and more, it can also add files (BLOB) to an existing zip archive).
My short version of this (as I only need to extract content), follows below. The script creates a java class with a method to extract a file from a zip archive, and a PL/SQL wrapper to the getFileFromZip method. The database user must have JAVA_DEPLOY and JAVAUSERPRIV (I think) roles.
create or replace java source named "ZIPImpl" AS import java.io.*; import java.util.zip.*; import java.sql.*; import oracle.sql.*; public class ZIPImpl { public static void getFileFromZip(oracle.sql.BLOB srcBlob, oracle.sql.BLOB dstBlob[], java.lang.String name) { try { OutputStream outBuffer = dstBlob[0].getBinaryOutputStream(); InputStream inBuffer = srcBlob.getBinaryStream(); ZipInputStream zip = new ZipInputStream(inBuffer); ZipEntry entry; byte[] tmpBuffer = new byte[2048]; while((entry = zip.getNextEntry()) != null) { if (entry.getName().compareTo(name)==0) { int n; while ((n = zip.read(tmpBuffer)) >= 0) outBuffer.write(tmpBuffer, 0, n); } } outBuffer.close(); } catch (SQLException e) { System.err.println(e); } catch (IOException e) { System.err.println(e); } } }; / alter java source "ZIPImpl" compile / create or replace package zip as procedure unzip( p_src in blob , p_dst in out blob , p_filename in varchar2); end; / create or replace package body zip as procedure unzip( p_src in blob , p_dst in out blob , p_filename in varchar2) as language java name 'ZIPImpl.getFileFromZip(oracle.sql.BLOB, oracle.sql.BLOB[], java.lang.String)'; end; /How to Extract Plain Text Content from DOCX
So, now you have a procedure to extract a file from a zip archive, next is to attack the content of the file. The DOCX file consists of several files, but the text content of the document resides in "word/document.xml". As the file is an XML document, I will perform an XML/XSL transformation of the file to be left with only plain text.
The script below creates a table SAMPLES_DOCX which will hold the original DOCX file stored as a BLOB, and a CLOB containing the plain text. It also creates a package with three procedures:
- Store the DOCX (as uploaded into APEX_APPLICATION_FILES)
- Download the original file (strictly not necessary for this exercise, but nice to have all the same)
- Extract and store plain text from DOCX
The XSL used to transform the document.xml file is an (extremely) abbreviated version of the XSL posted here in the Oracle Technical Forums.
create table samples_docx ( filename varchar2(255) not null , mime_type varchar2(255) , orig_file blob , text_content clob) / alter table samples_docx add constraint samples_docx_pk primary key (filename) / create or replace package sample_docx_p as -- get file from apex_application_files and store it in samples_docx table procedure store_docx ( p_file_name in varchar2); -- download procedure for original docx file procedure retrieve_docx ( p_file_name in varchar2); -- extract plain text from docx file (this is the meat) procedure store_text ( p_file_name in varchar2); end; / create or replace package body sample_docx_p as -- get file from apex_application_files and store it in samples_docx table procedure store_docx ( p_file_name in varchar2) is l_file blob; l_mime_type apex_application_files.mime_type%type; l_name apex_application_files.name%type; begin -- get file from apex files select name , mime_type , blob_content into l_name , l_mime_type , l_file from apex_application_files where name = p_file_name; -- insert record into samples table insert into samples_docx ( filename , mime_type , orig_file) values ( l_name , l_mime_type , l_file); -- delete file from apex files when done delete from apex_application_files where name = p_file_name; end store_docx; -- download procedure for original docx file procedure retrieve_docx ( p_file_name in varchar2) is l_file blob; l_mime_type apex_application_files.mime_type%type; l_name apex_application_files.name%type; l_size number; begin -- get file from apex files select filename , mime_type , orig_file , dbms_lob.getlength(orig_file) into l_name , l_mime_type , l_file , l_size from samples_docx where filename = p_file_name; -- return file owa_util.mime_header( nvl(l_mime_type,'application/octet'), false); htp.p('Content-length: ' || l_size); htp.p('Content-Disposition: attachment; filename="'||replace(replace(substr(l_name,instr(l_name,'/')+1),chr(10),null),chr(13),null)|| '"'); owa_util.http_header_close; wpg_docload.download_file(l_file); end retrieve_docx; -- perform xsl tranformation of document.xml function get_text ( p_docx_xml in xmltype ) return clob is l_clob clob; -- xsl monkeyed from http://forums.oracle.com/forums/thread.jspa?messageID=3368284 -- abbreviated quite a bit, check out original posting by "user304344" for the original l_xsl xmltype := xmltype('<?xml version="1.0" encoding="utf-8"?>' ||chr(10)||'<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" ' ||chr(10)||'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"' ||chr(10)||'xmlns:v="urn:schemas-microsoft-com:vml"' ||chr(10)||'exclude-result-prefixes="w v">' ||chr(10)||'<xsl:output method="text" indent="no" encoding="UTF-8" version="1.0"/>' ||chr(10)||'<!-- document root -->' ||chr(10)||'<xsl:template match="/">' ||chr(10)||'<!-- root element in document --> ' ||chr(10)||'<xsl:apply-templates select="w:document"/> ' ||chr(10)||'</xsl:template>' ||chr(10)||'<!-- ****************************start document**************************** -->' ||chr(10)||'<xsl:template match="w:document">' ||chr(10)||'<xsl:for-each select="//w:p">' ||chr(10)||'<xsl:apply-templates select="*/w:t"/> ' ||chr(10)||'<xsl:text>|¤¤</xsl:text> ' ||chr(10)||'</xsl:for-each> ' ||chr(10)||'</xsl:template>' ||chr(10)||'<!-- get all text nodes within a para -->' ||chr(10)||'<xsl:template match="*/w:t">' ||chr(10)||'<xsl:value-of select="."/>' ||chr(10)||'</xsl:template>' ||chr(10)||'<!-- **************************** end document**************************** -->' ||chr(10)||'</xsl:stylesheet>'); begin -- "|¤¤" is just a hack to get linebreaks, should be an easier way to achieve this select replace(xmltransform(p_docx_xml, l_xsl).GetClobVal(), '|¤¤', chr(10)) into l_clob from dual; return l_clob; end; -- extract plain text from docx file (this is the meat) procedure store_text ( p_file_name in varchar2) is l_docx blob; l_docx_unzip blob; l_doc clob; l_dest_offset integer := 1; l_src_offset integer := 1; l_lang_context integer := dbms_lob.default_lang_ctx; l_warning integer; begin -- get original file select orig_file into l_docx from samples_docx where filename = p_file_name; -- create lob locators dbms_lob.createtemporary(l_docx_unzip,false); dbms_lob.createtemporary(l_doc,false); -- use java to unzip the docx file and retrieve document.xml zip.unzip(l_docx, l_docx_unzip, 'word/document.xml'); -- convert blob to clob dbms_lob.converttoclob ( dest_lob => l_doc , src_blob => l_docx_unzip , amount => dbms_lob.lobmaxsize , dest_offset => l_dest_offset , src_offset => l_src_offset , blob_csid => nls_charset_id('AL32UTF8') --in my case, it is stored as UTF8 , lang_context => l_lang_context , warning => l_warning ); -- transform clob via xsl to get clean text l_doc := get_text(xmltype(l_doc)); -- update the column containing document text update samples_docx set text_content = l_doc where filename = p_file_name; -- clean lob locators, should be repeated in exception block dbms_lob.freetemporary(l_docx_unzip); dbms_lob.freetemporary(l_doc); end; end; /Bringing it Together in Oracle APEX
To test the code, you can create a simple upload page in APEX:
- Create new page
- Form
- Form on a procedure
- <schema name> where you installed the application
- Choose "sample_docx_p.store_docx" as stored procedure name
- Accept defaults (or change to your liking)
- Choose desired tab options
- Leave Invoking Page/Button Label blank if you do not want an invoking page
- Choose branch pages (can be the same as the one you are creating)
- Accept defaults
- Click Finish
- Click Edit Page
- Choose PXX_FILE_NAME, and change "Display as" from "Text item" to "File Browse..."
- Apply changes
To add post processing to extract DOCX file content:
- Click Create under Processes
- Choose PL/SQL
- Give the process a name like "extract_text"
- In the "Enter PL/SQL Page Process" dialog, paste "sample_docx_p.store_text(:PXX_FILE_NAME);" (XX being your page number)
- Create process (and ensure the process comes after "Run Stored Procedure" process created by the page wizard
The last process could be implemented as a trigger on the samples_docx table.
Run the page and test by uploading a DOCX document (or test it in my demo application).
Building More
If you have a need to manipulate DOCX documents, this could probably be achieved by using the complete "ZIPImpl" as posted in this article, and using substitution variables in the same manner as I have described with RTF documents in an earlier post.
The zip/unzip may also come in handy when handling other office files (like Open Office ODF).
This is only a proof of concept, and the code is not production ready. Really! I'm not just saying that out of habit.
The next (and final) part of the DOCX "series" will address how to index the plain text content using Oracle Text.
Great post, and very useful! :-)
ReplyDeleteMonkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download Now
Delete>>>>> Download Full
Monkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download LINK
>>>>> Download Now
Monkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download Full
>>>>> Download LINK c5
@morten
ReplyDeleteThanks! And I know just the project for it! Coming to a server near you ;-)
See my latest blog: http://technology.amis.nl/blog/8090/parsing-a-microsoft-word-docx-and-unzip-zipfiles-with-plsql
ReplyDeleteA Plain Text Editor
ReplyDeletePlain Text files
That's right, if you're writer on a budget, you don't need to spend any money buying expensive writing software or apps. Instead, you can use the text editor that comes free with your operating system.
Just open up Notepad on Windows or TextEdit on a Mac. I like plain text editors for writing something short quickly and easily, without thinking much about it. I wrote a blog post about the benefits of using plain text editors as writing software.
Use for: writing whatever, wherever
A Plain Text Editor
ReplyDeletePlain Text files
That's right, if you're writer on a budget, you don't need to spend any money buying expensive writing software or apps. Instead, you can use the text editor that comes free with your operating system.
Just open up Notepad on Windows or TextEdit on a Mac. I like plain text editors for writing something short quickly and easily, without thinking much about it. I wrote a blog post about the benefits of using plain text editors as writing software.
Use for: writing whatever, wherever
I felt very happy while reading this site. This was really very informative site for me. I really liked it. This was really a cordial post. Thanks a lot!. https://techsng.net/save-pictures-embedded-microsoft-word-document/
ReplyDeleteMonkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download Now
ReplyDelete>>>>> Download Full
Monkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download LINK
>>>>> Download Now
Monkey On Oracle: Docx Part I: How To Extract Document Content As Plain Text >>>>> Download Full
>>>>> Download LINK Ny