
Issue LocationTextExtractionStrategy and Excel CutePDF print


Issue LocationTextExtractionStrategy and Excel CutePDF print

Cyrille Bartholomee
Dear,


I'm using iText to extract specific text from a PDF. However, when
extracting from some PDFs, the coordinates I specify are ignored: when I
select a rectangle of only 1 PostScript point, a much larger block of text
is extracted. I found that the problem occurs when an Excel sheet is
exported with some PDF generation tools (e.g. CutePDF): the whole table
cell in which the coordinates lie gets extracted.

An example PDF file is attached to this e-mail.
This is the code I use for the extraction:

import com.google.common.io.Closeables;
import com.google.common.io.Files;
import com.google.common.io.InputSupplier;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.*;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

public class itext {
     public static void main(String[] args) throws IOException {
         InputSupplier<? extends InputStream> pdf =
                 Files.newInputStreamSupplier(new File("src/main/resources/test.pdf"));

         int pageNumber = 1;
         int llx = 200;
         int lly = 776;
         int urx = 201;
         int ury = 777;
         Rectangle rect = new Rectangle(llx, lly, urx, ury);

         InputStream in = pdf.getInput();
         PdfReader reader = null;
         try {
             reader = new PdfReader(in);
             RenderFilter filter = new RegionTextRenderFilter(rect);
             TextExtractionStrategy strategy = new FilteredTextRenderListener(
                     new LocationTextExtractionStrategy(), filter);

             StringWriter out = new StringWriter();
             out.write(PdfTextExtractor.getTextFromPage(reader, pageNumber, strategy));
             System.out.println(">" + out + "<");
         } finally {
             if (reader != null) {
                 reader.close();
             }
             Closeables.closeQuietly(in);
         }
     }
}

What is going wrong here, and how can I force iText to stick to the
correct coordinates?
Switching to a different PDF generation tool is not an option, because
that is beyond my control.


Kind regards,
Cyrille Bartholomee


_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php

Attachment: test.pdf (12K)

Re: Issue LocationTextExtractionStrategy and Excel CutePDF print

Kevin Day
Your question comes down to how RegionTextRenderFilter interacts with the specific TextExtractionStrategy you have selected.

LocationTextExtractionStrategy returns text chunks that render from a specific x/y location (generally the lower-left corner of the text draw operation that produced the chunk). If the chunk extends past the bounds of the region, the full chunk is still passed through the filter.

Long and short: LocationTextExtractionStrategy does not provide sufficient granularity to filter individual glyphs. You could write your own TextExtractionStrategy that does this, but I'll warn you now that it will be really hard: you'd effectively have to implement clipping path support in the render listening pipeline, extract partial chunks based on overlapping regions, and so on. It's a tough nut to crack.
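
To make the difference between chunk-level and glyph-level filtering concrete, here is a minimal sketch that uses only the JDK. It is a simplified model, not iText's API: the Chunk class, the fixed advance width per glyph, and both filter methods are hypothetical names introduced for illustration. A chunk is treated as a string drawn along a horizontal baseline, and a filter accepts whatever baseline segment intersects the region.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Simplified model of region filtering; none of these types are iText classes. */
final class ChunkFilterDemo {

    /** One text draw operation: text starting at (x, y), fixed advance per glyph (an assumption). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;
        final double glyphWidth;

        Chunk(String text, double x, double y, double glyphWidth) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.glyphWidth = glyphWidth;
        }

        double endX() {
            return x + text.length() * glyphWidth;
        }
    }

    /** Chunk-level filter: the whole chunk passes if its baseline intersects the region. */
    static String filterByChunk(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.intersectsLine(c.x, c.y, c.endX(), c.y)) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    /** Glyph-level filter: each glyph's own baseline segment is tested separately. */
    static String filterByGlyph(List<Chunk> chunks, Rectangle2D region) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                double glyphStart = c.x + i * c.glyphWidth;
                if (region.intersectsLine(glyphStart, c.y, glyphStart + c.glyphWidth, c.y)) {
                    sb.append(c.text.charAt(i));
                }
            }
        }
        return sb.toString();
    }
}
```

With a single 13-character chunk starting at x = 180 and a 1 x 1 point region at (200, 776), the chunk-level filter returns the entire string, while the glyph-level filter returns only the one glyph whose segment crosses the region. A real implementation would also need actual glyph metrics and the text matrix, which this sketch ignores.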

Cheers,

Re: Issue LocationTextExtractionStrategy and Excel CutePDF print

Cyrille Bartholomee
Dear Kevin,


Thank you for your help and your quick response.

There is one thing I still don't understand. When I use a different PDF renderer (like PDFCreator) to generate the PDF, everything works fine, with exactly the same code and exactly the same Excel file. Somehow, in that case, LocationTextExtractionStrategy does have enough granularity to filter individual glyphs. Or could it be that PDFCreator renders each glyph as its own text chunk?


Kind regards,
Cyrille Bartholomee


(In reply to Kevin Day's message of 19-06-12 17:56.)


Re: Issue LocationTextExtractionStrategy and Excel CutePDF print

Kevin Day
You'd have to look at the actual text draw operations in the PDF, but yes, my guess is that CutePDF is creating larger chunks than the other generators you have tried.

You can use PdfContentReaderTool to get at the underlying content streams and page dictionaries, or use RUPS.
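
Once a content stream has been dumped as text, comparing chunk sizes between two generators can be roughly sketched as follows. This is a JDK-only illustration, not part of iText: the TextShowScanner class is a hypothetical helper, and the two content stream fragments in main are made up for the example rather than taken from CutePDF or PDFCreator output. The scanner is deliberately naive: it only recognises literal strings followed by the Tj operator and ignores TJ arrays and hex strings, both of which real files commonly use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive scanner for "(string) Tj" operations in a decoded content stream (hypothetical helper). */
final class TextShowScanner {

    // A literal string "( ... )", allowing backslash escapes, directly followed by Tj.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\)])*)\\)\\s*Tj\\b");

    /** Returns the raw string length of each Tj operation, in stream order. */
    static List<Integer> chunkLengths(String contentStream) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            lengths.add(m.group(1).length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Made-up fragments: one generator draws a whole cell at once, another draws glyph by glyph.
        String oneChunk = "BT /F1 10 Tf 180 776 Td (Invoice total) Tj ET";
        String perGlyph = "BT /F1 10 Tf 180 776 Td (I) Tj 6 0 Td (n) Tj 6 0 Td (v) Tj ET";
        System.out.println(chunkLengths(oneChunk));
        System.out.println(chunkLengths(perGlyph));
    }
}
```

A stream dominated by long strings per draw operation would be consistent with whole table cells passing the region filter, while a stream of one-character strings would explain why the same code appears to filter individual glyphs.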