This is also based on one of my post in OTN. In this post, I will demonstrate various thing you can do with pdf files with the help of Oracle Text.
Also you can see these other related posts in OTN : OTN1 OTN2 and MYBLOG
Oracle Text can manipulate pdf files to a greate extent. For more information about Oracle Text Please visit
Online Documentation
http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm
Oracle Text Forum
https://forums.oracle.com/forums/forum.jspa?forumID=71
In the forum, I have gained valuable insights by reading the posts from Barbara Boehmer . You may also try those.
Here is some example of pdf file manipulation using Oracle Text:
Here is my pdf files and OS related information.
My database information.
https://forums.oracle.com/forums/thread.jspa?threadID=2394143�
Also you can see these other related posts in OTN : OTN1 OTN2 and MYBLOG
Oracle Text can manipulate pdf files to a greate extent. For more information about Oracle Text Please visit
Online Documentation
http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm
Oracle Text Forum
https://forums.oracle.com/forums/forum.jspa?forumID=71
In the forum, I have gained valuable insights by reading the posts from Barbara Boehmer . You may also try those.
Here is some example of pdf file manipulation using Oracle Text:
Here is my pdf files and OS related information.
[oracle@localhost ~]$ cd Saubhik/ [oracle@localhost Saubhik]$ ls -l *.pdf -rw------- 1 oracle oracle 1358211 Sep 18 2011 Graphical_Plans_NoCOUG_Article.pdf -rw------- 1 oracle oracle 374150 May 22 20:21 mypdf2.pdf [oracle@localhost Saubhik]$ uname -a Linux localhost.localdomain 2.6.18-194.17.1.0.1.el5 #1 SMP Wed Sep 29 15:40:03 EDT 2010 i686 i686 i386 GNU/Linux [oracle@localhost Saubhik]$ pwd /home/oracle/Saubhik [oracle@localhost Saubhik]$
My database information.
[oracle@localhost ~]$ sqlplus scott/tiger SQL*Plus: Release 11.2.0.2.0 Production on Thu May 24 13:50:46 2012 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> SELECT * FROM v$version; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production PL/SQL Release 11.2.0.2.0 - Production CORE 11.2.0.2.0 Production TNS for Linux: Version 11.2.0.2.0 - Production NLSRTL Version 11.2.0.2.0 - Production SQL> SELECT directory_name,directory_path 2 FROM dba_directories 3 WHERE directory_name='TEMP'; DIRECTORY_NAME ------------------------------ DIRECTORY_PATH -------------------------------------------------------------------------------- TEMP /home/oracle/Saubhik SQL>Now some demos. This can be make better and efficient by reading the above mentioned links.
SQL> /* Creating my demo table */ SQL> CREATE TABLE pdm(id_pk INTEGER PRIMARY KEY,fname VARCHAR2(50),pdf_file BLOB); Table created. SQL> /* Creating a procedure which will load the BLOBs (pdf files) into the table */ SQL> ed Wrote file afiedt.buf 1 CREATE OR REPLACE 2 PROCEDURE load_file( 3 pi_id IN INTEGER, 4 pfname IN VARCHAR2) 5 IS 6 src_file BFILE; 7 dst_file BLOB; 8 lgh_file BINARY_INTEGER; 9 BEGIN 10 src_file := bfilename('TEMP', pfname); 11 INSERT 12 INTO pdm 13 ( 14 id_pk, 15 fname, 16 pdf_file 17 ) 18 VALUES 19 ( 20 pi_id, 21 pfname, 22 EMPTY_BLOB() 23 ) 24 RETURNING pdf_file 25 INTO dst_file; 26 DBMS_LOB.OPEN(src_file, dbms_lob.file_readonly); 27 lgh_file := dbms_lob.getlength(src_file); 28 DBMS_LOB.LOADFROMFILE(dst_file, src_file, lgh_file); 29 dbms_lob.close(src_file); 30 COMMIT; 31* END load_file; SQL> / Procedure created. SQL> EXECUTE load_file(1,'mypdf2.pdf'); PL/SQL procedure successfully completed. SQL> EXECUTE load_file(2,'Graphical_Plans_NoCOUG_Article.pdf'); PL/SQL procedure successfully completed. SQL> /* Checking the rows just inserted in the table */ SQL> SELECT id_pk,fname,DBMS_LOB.getlength(pdf_file) file_size 2 FROM pdm; ID_PK FNAME FILE_SIZE ---------- -------------------------------------------------- ---------- 1 mypdf2.pdf 374150 2 Graphical_Plans_NoCOUG_Article.pdf 1358211 SQL> /* Creating the index */ SQL> CREATE INDEX pdm_pdf_idx ON pdm(pdf_file) INDEXTYPE IS CTXSYS.CONTEXT; Index created. SQL> /* Checking for any errors in index creation */ SQL> SELECT * FROM ctx_user_index_errors; no rows selected SQL> /* Checking the index */ SQL> SELECT idx_name,idx_table_owner,idx_table,idx_status,idx_type 2 FROM ctx_user_indexes 3 WHERE idx_name='PDM_PDF_IDX'; IDX_NAME IDX_TABLE_OWNER ------------------------------ ------------------------------ IDX_TABLE IDX_STATUS IDX_TYP ------------------------------ ------------ ------- PDM_PDF_IDX SCOTT PDM INDEXED CONTEXT SQL> /* Just gathering some statistics */ SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'PDM', cascade=>TRUE); PL/SQL procedure successfully completed. SQL> /* Now, you can search words and find in which pdf file that exists! SQL> Here are some examples */ SQL> SELECT ID_PK,FNAME FROM PDM WHERE CONTAINS(PDF_FILE,'Chris Lawson')>0; 2 3 ID_PK FNAME ---------- -------------------------------------------------- 2 Graphical_Plans_NoCOUG_Article.pdf SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'SAUBHIK')>0 SQL> / ID_PK FNAME ---------- -------------------------------------------------- 1 mypdf2.pdf SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'MEGHDEEP')>0 SQL> / ID_PK FNAME ---------- -------------------------------------------------- 1 mypdf2.pdf SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN')>0 SQL> / no rows selected SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'www.graphviz.org')>0 SQL> / ID_PK FNAME ---------- -------------------------------------------------- 2 Graphical_Plans_NoCOUG_Article.pdf SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'Bla Bla')>0 SQL> / no rows selected SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN-')>0 SQL> / SELECT ID_PK,FNAME * ERROR at line 1: ORA-29902: error in executing ODCIIndexStart() routine ORA-20000: Oracle Text error: DRG-50901: text query parser syntax error on line 1, column 12 SQL> ed Wrote file afiedt.buf 1 SELECT ID_PK,FNAME 2 FROM PDM 3* WHERE CONTAINS(PDF_FILE,'DBMS')>0 SQL> / ID_PK FNAME ---------- -------------------------------------------------- 2 Graphical_Plans_NoCOUG_Article.pdf SQL> /* Now say, You want to get some portions of the words from your pdf document */ SQL> /* For this purpose, I have created a function. I know, this function can be SQL> very well written. Also you may try to use some other features like SQL> CTX_DOC.SNIPPET and Regular expressions. SQL> */ SQL> ed Wrote file afiedt.buf 1 CREATE OR REPLACE 2 FUNCTION search_pdf( 3 pi_index_name VARCHAR2, 4 pi_search_word VARCHAR2, 5 pi_pk_val VARCHAR2 ) 6 RETURN VARCHAR2 7 IS 8 v_clob CLOB; 9 v_amt_cut INTEGER :=250; 10 v_buffer CLOB; 11 BEGIN 12 CTX_DOC.markup(pi_index_name,pi_pk_val,pi_search_word, v_clob); 13 v_buffer:=DBMS_LOB.substr(v_clob,v_amt_cut,dbms_lob.instr(v_clob,pi_search_word,1,1)); 14 DBMS_LOB.FREETEMPORARY(v_clob); 15 RETURN REGEXP_REPLACE(v_buffer,'<[^>]+>'); 16* END search_pdf; SQL> / Function created. SQL> SELECT search_pdf('PDM_PDF_IDX','SAUBHIK','1') FROM dual; SEARCH_PDF('PDM_PDF_IDX','SAUBHIK','1') -------------------------------------------------------------------------------- SAUBHIK>>> BANERJEE ***: M Age: 36 < SQL> ed Wrote file afiedt.buf 1* SELECT search_pdf('PDM_PDF_IDX','MEGHDEEP','1') FROM dual SQL> / SEARCH_PDF('PDM_PDF_IDX','MEGHDEEP','1') -------------------------------------------------------------------------------- MEGHDEEP>>> BANERJEE ***: M Age: 4 <d SQL> ed Wrote file afiedt.buf 1* SELECT search_pdf('PDM_PDF_IDX','DBMS','2') FROM dual SQL> / SEARCH_PDF('PDM_PDF_IDX','DBMS','2') -------------------------------------------------------------------------------- DBMS>>>_XPLAN?not very easy to read, especially when many tables are involved.Today I found another interesting post :