Thursday, May 24, 2012

Manupulating pdf Files Using Oracle Text.

This is also based on one of my post in OTN. In this post, I will demonstrate various thing you can do with pdf files with the help of Oracle Text.
Also you can see these other related posts in OTN : OTN1 OTN2 and MYBLOG
Oracle Text can manipulate pdf files to a greate extent. For more information about Oracle Text Please visit
Online Documentation
http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm
Oracle Text Forum
https://forums.oracle.com/forums/forum.jspa?forumID=71
In the forum, I have gained valuable insights by reading the posts from Barbara Boehmer . You may also try those.
Here is some example of pdf file manipulation using Oracle Text:

Here is my pdf files and OS related information.

[oracle@localhost ~]$ cd Saubhik/ 
[oracle@localhost Saubhik]$ ls -l *.pdf
-rw------- 1 oracle oracle 1358211 Sep 18  2011 Graphical_Plans_NoCOUG_Article.pdf
-rw------- 1 oracle oracle  374150 May 22 20:21 mypdf2.pdf
[oracle@localhost Saubhik]$ uname -a
Linux localhost.localdomain 2.6.18-194.17.1.0.1.el5 #1 SMP Wed Sep 29 15:40:03 EDT 2010 i686 i686 i386 GNU/Linux
[oracle@localhost Saubhik]$ pwd
/home/oracle/Saubhik
[oracle@localhost Saubhik]$

My database information.

[oracle@localhost ~]$ sqlplus scott/tiger
 
SQL*Plus: Release 11.2.0.2.0 Production on Thu May 24 13:50:46 2012
 
Copyright (c) 1982, 2010, Oracle.  All rights reserved.
 
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
 
SQL> SELECT * FROM v$version;
 
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
PL/SQL Release 11.2.0.2.0 - Production
CORE    11.2.0.2.0      Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
 
SQL> SELECT directory_name,directory_path
  2  FROM dba_directories
  3  WHERE directory_name='TEMP';
 
DIRECTORY_NAME
------------------------------
DIRECTORY_PATH
--------------------------------------------------------------------------------
TEMP
/home/oracle/Saubhik
 
 
SQL>

Now some demos. This can be make better and efficient by reading the above mentioned links.
SQL> /* Creating my demo table */
SQL> CREATE TABLE pdm(id_pk INTEGER PRIMARY KEY,fname VARCHAR2(50),pdf_file BLOB);
 
Table created.
 
SQL> /* Creating a procedure which will load the BLOBs (pdf files) into the table */
SQL> ed
Wrote file afiedt.buf
 
  1  CREATE OR REPLACE
  2  PROCEDURE load_file(
  3      pi_id  IN INTEGER,
  4      pfname IN VARCHAR2)
  5  IS
  6    src_file BFILE;
  7    dst_file BLOB;
  8    lgh_file BINARY_INTEGER;
  9  BEGIN
 10    src_file := bfilename('TEMP', pfname);
 11    INSERT
 12    INTO pdm
 13      (
 14        id_pk,
 15        fname,
 16        pdf_file
 17      )
 18      VALUES
 19      (
 20        pi_id,
 21        pfname,
 22        EMPTY_BLOB()
 23      )
 24    RETURNING pdf_file
 25    INTO dst_file;
 26    DBMS_LOB.OPEN(src_file, dbms_lob.file_readonly);
 27    lgh_file := dbms_lob.getlength(src_file);
 28    DBMS_LOB.LOADFROMFILE(dst_file, src_file, lgh_file);
 29    dbms_lob.close(src_file);
 30    COMMIT;
 31* END load_file;
SQL> / 
 
Procedure created.
 
SQL> EXECUTE load_file(1,'mypdf2.pdf');
 
PL/SQL procedure successfully completed.
 
SQL> EXECUTE load_file(2,'Graphical_Plans_NoCOUG_Article.pdf');
 
PL/SQL procedure successfully completed.
 
SQL> /* Checking the rows just inserted in the table */
SQL> SELECT id_pk,fname,DBMS_LOB.getlength(pdf_file) file_size
  2  FROM pdm;
 
     ID_PK FNAME                                               FILE_SIZE
---------- -------------------------------------------------- ----------
         1 mypdf2.pdf                                             374150
         2 Graphical_Plans_NoCOUG_Article.pdf                    1358211
 
SQL> /* Creating the index */
SQL> CREATE INDEX pdm_pdf_idx ON pdm(pdf_file) INDEXTYPE IS CTXSYS.CONTEXT;
 
Index created.
 
SQL> /* Checking for any errors in index creation */
SQL> SELECT * FROM ctx_user_index_errors;
 
no rows selected
 
SQL> /* Checking the index */
SQL> SELECT idx_name,idx_table_owner,idx_table,idx_status,idx_type
  2  FROM ctx_user_indexes
  3  WHERE idx_name='PDM_PDF_IDX';
 
IDX_NAME                       IDX_TABLE_OWNER
------------------------------ ------------------------------
IDX_TABLE                      IDX_STATUS   IDX_TYP
------------------------------ ------------ -------
PDM_PDF_IDX                    SCOTT
PDM                            INDEXED      CONTEXT
 
 
SQL> /* Just gathering some statistics */
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'PDM', cascade=>TRUE);
 
PL/SQL procedure successfully completed.
 
SQL> /* Now, you can search words and find in which pdf file that exists!  
SQL> Here are some examples */
SQL> SELECT ID_PK,FNAME
FROM PDM
WHERE CONTAINS(PDF_FILE,'Chris Lawson')>0;  2    3  
 
     ID_PK FNAME
---------- --------------------------------------------------
         2 Graphical_Plans_NoCOUG_Article.pdf
 
SQL> ed  
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'SAUBHIK')>0
SQL> / 
 
     ID_PK FNAME
---------- --------------------------------------------------
         1 mypdf2.pdf
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'MEGHDEEP')>0
SQL> / 
 
     ID_PK FNAME
---------- --------------------------------------------------
         1 mypdf2.pdf
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN')>0
SQL> / 
 
no rows selected
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'www.graphviz.org')>0
SQL> / 
 
     ID_PK FNAME
---------- --------------------------------------------------
         2 Graphical_Plans_NoCOUG_Article.pdf
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'Bla Bla')>0
SQL> / 
 
no rows selected
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN-')>0
SQL> / 
SELECT ID_PK,FNAME
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 12
 
 
SQL> ed
Wrote file afiedt.buf
 
  1  SELECT ID_PK,FNAME
  2  FROM PDM
  3* WHERE CONTAINS(PDF_FILE,'DBMS')>0
SQL> / 
 
     ID_PK FNAME
---------- --------------------------------------------------
         2 Graphical_Plans_NoCOUG_Article.pdf
 
SQL> /* Now say, You want to get some portions of the words from your pdf document */
SQL> /* For this purpose, I have created a function. I know, this function can be 
SQL>    very well written. Also you may try to use some other features like 
SQL>    CTX_DOC.SNIPPET and Regular expressions.  
SQL> */
SQL> ed
Wrote file afiedt.buf
 
  1  CREATE OR REPLACE
  2    FUNCTION search_pdf(
  3        pi_index_name  VARCHAR2,
  4        pi_search_word VARCHAR2,
  5        pi_pk_val      VARCHAR2 )
  6      RETURN VARCHAR2
  7    IS
  8      v_clob CLOB;
  9      v_amt_cut INTEGER :=250;
 10      v_buffer CLOB;
 11    BEGIN
 12      CTX_DOC.markup(pi_index_name,pi_pk_val,pi_search_word, v_clob);
 13      v_buffer:=DBMS_LOB.substr(v_clob,v_amt_cut,dbms_lob.instr(v_clob,pi_search_word,1,1));
 14      DBMS_LOB.FREETEMPORARY(v_clob);
 15      RETURN REGEXP_REPLACE(v_buffer,'<[^>]+>');
 16*   END search_pdf;
SQL> / 
 
Function created.
 
SQL> SELECT search_pdf('PDM_PDF_IDX','SAUBHIK','1') FROM dual;
 
SEARCH_PDF('PDM_PDF_IDX','SAUBHIK','1')
--------------------------------------------------------------------------------
SAUBHIK>>>  BANERJEE
***: M
Age: 36
<
 
 
SQL> ed
Wrote file afiedt.buf
 
  1* SELECT search_pdf('PDM_PDF_IDX','MEGHDEEP','1') FROM dual
SQL> / 
 
SEARCH_PDF('PDM_PDF_IDX','MEGHDEEP','1')
--------------------------------------------------------------------------------
MEGHDEEP>>> BANERJEE
***: M
Age: 4
<d
 
 
SQL> ed
Wrote file afiedt.buf
 
  1* SELECT search_pdf('PDM_PDF_IDX','DBMS','2') FROM dual
SQL> / 
 
SEARCH_PDF('PDM_PDF_IDX','DBMS','2')
--------------------------------------------------------------------------------
DBMS>>>_XPLAN?not very
easy to read, especially when many tables are involved. 
Today I found another interesting post :
https://forums.oracle.com/forums/thread.jspa?threadID=2394143&#10359156