This is also based on one of my post in OTN. In this post, I will demonstrate various thing you can do with pdf files with the help of Oracle Text.
Also you can see these other related posts in OTN : OTN1 OTN2 and MYBLOG
Oracle Text can manipulate pdf files to a greate extent. For more information about Oracle Text Please visit
Online Documentation
http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm
Oracle Text Forum
https://forums.oracle.com/forums/forum.jspa?forumID=71
In the forum, I have gained valuable insights by reading the posts from Barbara Boehmer . You may also try those.
Here is some example of pdf file manipulation using Oracle Text:
Here is my pdf files and OS related information.
My database information.
https://forums.oracle.com/forums/thread.jspa?threadID=2394143�
Also you can see these other related posts in OTN : OTN1 OTN2 and MYBLOG
Oracle Text can manipulate pdf files to a greate extent. For more information about Oracle Text Please visit
Online Documentation
http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm
Oracle Text Forum
https://forums.oracle.com/forums/forum.jspa?forumID=71
In the forum, I have gained valuable insights by reading the posts from Barbara Boehmer . You may also try those.
Here is some example of pdf file manipulation using Oracle Text:
Here is my pdf files and OS related information.
[oracle@localhost ~]$ cd Saubhik/ [oracle@localhost Saubhik]$ ls -l *.pdf -rw------- 1 oracle oracle 1358211 Sep 18 2011 Graphical_Plans_NoCOUG_Article.pdf -rw------- 1 oracle oracle 374150 May 22 20:21 mypdf2.pdf [oracle@localhost Saubhik]$ uname -a Linux localhost.localdomain 2.6.18-194.17.1.0.1.el5 #1 SMP Wed Sep 29 15:40:03 EDT 2010 i686 i686 i386 GNU/Linux [oracle@localhost Saubhik]$ pwd /home/oracle/Saubhik [oracle@localhost Saubhik]$
My database information.
[oracle@localhost ~]$ sqlplus scott/tiger SQL*Plus: Release 11.2.0.2.0 Production on Thu May 24 13:50:46 2012 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> SELECT * FROM v$version; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production PL/SQL Release 11.2.0.2.0 - Production CORE 11.2.0.2.0 Production TNS for Linux: Version 11.2.0.2.0 - Production NLSRTL Version 11.2.0.2.0 - Production SQL> SELECT directory_name,directory_path 2 FROM dba_directories 3 WHERE directory_name='TEMP'; DIRECTORY_NAME ------------------------------ DIRECTORY_PATH -------------------------------------------------------------------------------- TEMP /home/oracle/Saubhik SQL>Now some demos. This can be make better and efficient by reading the above mentioned links.
SQL> /* Creating my demo table */
SQL> CREATE TABLE pdm(id_pk INTEGER PRIMARY KEY,fname VARCHAR2(50),pdf_file BLOB);
Table created.
SQL> /* Creating a procedure which will load the BLOBs (pdf files) into the table */
SQL> ed
Wrote file afiedt.buf
1 CREATE OR REPLACE
2 PROCEDURE load_file(
3 pi_id IN INTEGER,
4 pfname IN VARCHAR2)
5 IS
6 src_file BFILE;
7 dst_file BLOB;
8 lgh_file BINARY_INTEGER;
9 BEGIN
10 src_file := bfilename('TEMP', pfname);
11 INSERT
12 INTO pdm
13 (
14 id_pk,
15 fname,
16 pdf_file
17 )
18 VALUES
19 (
20 pi_id,
21 pfname,
22 EMPTY_BLOB()
23 )
24 RETURNING pdf_file
25 INTO dst_file;
26 DBMS_LOB.OPEN(src_file, dbms_lob.file_readonly);
27 lgh_file := dbms_lob.getlength(src_file);
28 DBMS_LOB.LOADFROMFILE(dst_file, src_file, lgh_file);
29 dbms_lob.close(src_file);
30 COMMIT;
31* END load_file;
SQL> /
Procedure created.
SQL> EXECUTE load_file(1,'mypdf2.pdf');
PL/SQL procedure successfully completed.
SQL> EXECUTE load_file(2,'Graphical_Plans_NoCOUG_Article.pdf');
PL/SQL procedure successfully completed.
SQL> /* Checking the rows just inserted in the table */
SQL> SELECT id_pk,fname,DBMS_LOB.getlength(pdf_file) file_size
2 FROM pdm;
ID_PK FNAME FILE_SIZE
---------- -------------------------------------------------- ----------
1 mypdf2.pdf 374150
2 Graphical_Plans_NoCOUG_Article.pdf 1358211
SQL> /* Creating the index */
SQL> CREATE INDEX pdm_pdf_idx ON pdm(pdf_file) INDEXTYPE IS CTXSYS.CONTEXT;
Index created.
SQL> /* Checking for any errors in index creation */
SQL> SELECT * FROM ctx_user_index_errors;
no rows selected
SQL> /* Checking the index */
SQL> SELECT idx_name,idx_table_owner,idx_table,idx_status,idx_type
2 FROM ctx_user_indexes
3 WHERE idx_name='PDM_PDF_IDX';
IDX_NAME IDX_TABLE_OWNER
------------------------------ ------------------------------
IDX_TABLE IDX_STATUS IDX_TYP
------------------------------ ------------ -------
PDM_PDF_IDX SCOTT
PDM INDEXED CONTEXT
SQL> /* Just gathering some statistics */
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'PDM', cascade=>TRUE);
PL/SQL procedure successfully completed.
SQL> /* Now, you can search words and find in which pdf file that exists!
SQL> Here are some examples */
SQL> SELECT ID_PK,FNAME
FROM PDM
WHERE CONTAINS(PDF_FILE,'Chris Lawson')>0; 2 3
ID_PK FNAME
---------- --------------------------------------------------
2 Graphical_Plans_NoCOUG_Article.pdf
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'SAUBHIK')>0
SQL> /
ID_PK FNAME
---------- --------------------------------------------------
1 mypdf2.pdf
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'MEGHDEEP')>0
SQL> /
ID_PK FNAME
---------- --------------------------------------------------
1 mypdf2.pdf
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN')>0
SQL> /
no rows selected
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'www.graphviz.org')>0
SQL> /
ID_PK FNAME
---------- --------------------------------------------------
2 Graphical_Plans_NoCOUG_Article.pdf
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'Bla Bla')>0
SQL> /
no rows selected
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'DBMS_XPLAN-')>0
SQL> /
SELECT ID_PK,FNAME
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 12
SQL> ed
Wrote file afiedt.buf
1 SELECT ID_PK,FNAME
2 FROM PDM
3* WHERE CONTAINS(PDF_FILE,'DBMS')>0
SQL> /
ID_PK FNAME
---------- --------------------------------------------------
2 Graphical_Plans_NoCOUG_Article.pdf
SQL> /* Now say, You want to get some portions of the words from your pdf document */
SQL> /* For this purpose, I have created a function. I know, this function can be
SQL> very well written. Also you may try to use some other features like
SQL> CTX_DOC.SNIPPET and Regular expressions.
SQL> */
SQL> ed
Wrote file afiedt.buf
1 CREATE OR REPLACE
2 FUNCTION search_pdf(
3 pi_index_name VARCHAR2,
4 pi_search_word VARCHAR2,
5 pi_pk_val VARCHAR2 )
6 RETURN VARCHAR2
7 IS
8 v_clob CLOB;
9 v_amt_cut INTEGER :=250;
10 v_buffer CLOB;
11 BEGIN
12 CTX_DOC.markup(pi_index_name,pi_pk_val,pi_search_word, v_clob);
13 v_buffer:=DBMS_LOB.substr(v_clob,v_amt_cut,dbms_lob.instr(v_clob,pi_search_word,1,1));
14 DBMS_LOB.FREETEMPORARY(v_clob);
15 RETURN REGEXP_REPLACE(v_buffer,'<[^>]+>');
16* END search_pdf;
SQL> /
Function created.
SQL> SELECT search_pdf('PDM_PDF_IDX','SAUBHIK','1') FROM dual;
SEARCH_PDF('PDM_PDF_IDX','SAUBHIK','1')
--------------------------------------------------------------------------------
SAUBHIK>>> BANERJEE
***: M
Age: 36
<
SQL> ed
Wrote file afiedt.buf
1* SELECT search_pdf('PDM_PDF_IDX','MEGHDEEP','1') FROM dual
SQL> /
SEARCH_PDF('PDM_PDF_IDX','MEGHDEEP','1')
--------------------------------------------------------------------------------
MEGHDEEP>>> BANERJEE
***: M
Age: 4
<d
SQL> ed
Wrote file afiedt.buf
1* SELECT search_pdf('PDM_PDF_IDX','DBMS','2') FROM dual
SQL> /
SEARCH_PDF('PDM_PDF_IDX','DBMS','2')
--------------------------------------------------------------------------------
DBMS>>>_XPLAN?not very
easy to read, especially when many tables are involved.
Today I found another interesting post :