|
Dear,
I'm using iText to extract specific text from a PDF. However, when extracting from some PDFs, the coordinates I specify are ignored. When I select a rectangle of only 1 PostScript point, a much larger text block is extracted. I found out that the problem occurs when an Excel sheet is exported with certain PDF generation tools (e.g. CutePDF): the whole table cell that contains those coordinates gets extracted. An example PDF file is attached to this e-mail.

This is the code I use for the extraction:

    import com.google.common.io.Closeables;
    import com.google.common.io.Files;
    import com.google.common.io.InputSupplier;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.Rectangle;
    import com.itextpdf.text.pdf.parser.*;

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.StringWriter;

    public class itext {
        public static void main(String[] args) throws IOException {
            InputSupplier<? extends InputStream> pdf =
                    Files.newInputStreamSupplier(new File("src/main/resources/test.pdf"));
            int pageNumber = 1;

            // Selection rectangle of 1 x 1 point (lower left and upper right corners)
            int llx = 200;
            int lly = 776;
            int urx = 201;
            int ury = 777;
            Rectangle rect = new Rectangle(llx, lly, urx, ury);

            InputStream in = pdf.getInput();
            PdfReader reader = null;
            try {
                reader = new PdfReader(in);
                // Only text that passes the region filter reaches the extraction strategy
                RenderFilter filter = new RegionTextRenderFilter(rect);
                TextExtractionStrategy strategy =
                        new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                StringWriter out = new StringWriter();
                out.write(PdfTextExtractor.getTextFromPage(reader, pageNumber, strategy));
                System.out.println(">" + out + "<");
            } finally {
                if (reader != null) {
                    reader.close();
                }
                Closeables.closeQuietly(in);
            }
        }
    }

What is going wrong here? And how can I force iText to stick to the correct coordinates? An answer telling me to change the PDF generation tool does not help, because that is beyond my control.
Kind regards,
Cyrille Bartholomee

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php
|
Your question has to do with how RegionTextRenderFilter interacts with the specific TextExtractionStrategy you have selected.
LocationTextExtractionStrategy returns text chunks that render from a specific x/y location (generally the lower left-hand corner of the text draw operation that resulted in the chunk). If the chunk extends past the bounds of the region, the full chunk is still going to get passed through the filter.

Long and short: LocationTextExtractionStrategy does not provide sufficient granularity to filter individual glyphs.

You could come up with your own TextExtractionStrategy that could do this, but I'll warn you now that it will be really hard to do: you'll have to effectively implement clipping path support in the render listening pipeline, extract partial chunks based on overlapping regions, and so on. It's a tough nut to crack.

Cheers,
Kevin
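To illustrate the granularity point, here is a toy model in plain Java. It does not use iText at all: the `Region` and `Chunk` classes, the fixed glyph width, and the sample text are all assumptions made up for this sketch, and only the horizontal axis is considered. It contrasts a filter that accepts or rejects a whole chunk with one that tests each glyph box separately.

```java
import java.util.Collections;
import java.util.List;

/** Toy model only; none of these types are iText classes. */
final class RegionFilterSketch {

    /** Horizontal extent of the selection rectangle (y is ignored to keep the sketch short). */
    static final class Region {
        final double x0;
        final double x1;

        Region(double x0, double x1) {
            this.x0 = x0;
            this.x1 = x1;
        }

        /** True when the interval [start, end] overlaps the region; touching edges do not count. */
        boolean overlaps(double start, double end) {
            return start < x1 && end > x0;
        }
    }

    /** One text draw operation: a string starting at startX, assumed to use fixed-width glyphs. */
    static final class Chunk {
        final String text;
        final double startX;
        final double glyphWidth;

        Chunk(String text, double startX, double glyphWidth) {
            this.text = text;
            this.startX = startX;
            this.glyphWidth = glyphWidth;
        }

        double endX() {
            return startX + text.length() * glyphWidth;
        }
    }

    /** Chunk-level filtering: a chunk that overlaps the region is passed through whole. */
    static String filterByChunk(List<Chunk> chunks, Region region) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            if (region.overlaps(c.startX, c.endX())) {
                out.append(c.text);
            }
        }
        return out.toString();
    }

    /** Glyph-level filtering: every glyph box is tested against the region on its own. */
    static String filterByGlyph(List<Chunk> chunks, Region region) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                double glyphStart = c.startX + i * c.glyphWidth;
                if (region.overlaps(glyphStart, glyphStart + c.glyphWidth)) {
                    out.append(c.text.charAt(i));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Region region = new Region(200, 201);
        // Hypothetical table cell drawn by a single text operation.
        List<Chunk> cell = Collections.singletonList(new Chunk("Invoice total", 180, 6));
        System.out.println(">" + filterByChunk(cell, region) + "<");
        System.out.println(">" + filterByGlyph(cell, region) + "<");
    }
}
```

With this made-up data, the chunk-level filter returns the whole string "Invoice total", while the glyph-level filter returns only "o", the one glyph whose box overlaps the 1-point region. A real glyph-level strategy would also need per-glyph widths, the vertical axis, and the current transformation matrix, which is where the difficulty described above comes from.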
|
Dear Kevin,
Thank you for your help and your quick response. There is one thing I still don't understand. When I use a different PDF renderer (such as PDFCreator) to generate the PDF, everything works fine, with exactly the same code and exactly the same Excel file. Somehow, in this case, LocationTextExtractionStrategy has enough granularity to filter individual glyphs. Or could it be that PDFCreator renders every glyph as its own text chunk?

Kind regards,
Cyrille Bartholomee

On 19-06-12 17:56, Kevin Day wrote:
> Your question has to do with how RegionTextRenderFilter interacts with the specific TextExtractionStrategy you have selected. [...]
> View this message in context: http://itext-general.2136553.n4.nabble.com/Issue-LocationTextExtractionStrategy-and-Excel-CutePDF-print-tp4655369p4655373.html
|
You'd have to look at the actual text draw operations in the PDF, but yes, my guess is that CutePDF is creating larger chunks than the other generators you have tried.
You can use PdfContentReaderTool to get at the underlying content streams and page dictionaries, or use RUPS.
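As a rough idea of what to look for once the content stream has been dumped, here is a small self-contained Java sketch that lists the string operand of each `Tj` (show text) operator. The two content streams are hypothetical examples written for this sketch, not output from CutePDF or PDFCreator, and the regular expression is a deliberate simplification: it ignores `TJ` arrays, hex strings, and nested parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified inspection of text-showing operators; not a real PDF parser. */
final class TextOperatorSketch {

    // A literal string operand "( ... )" followed by the Tj operator.
    // Escaped characters such as \( are accepted; nested parentheses are not.
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    /** Returns the raw (undecoded) string operand of every Tj operator, in stream order. */
    static List<String> shownStrings(String contentStream) {
        List<String> chunks = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical stream: the whole cell text drawn by one operator.
        String coarse = "BT /F1 10 Tf 180 770 Td (Total) Tj ET";
        // Hypothetical stream: one operator per glyph.
        String fine = "BT /F1 10 Tf 180 770 Td (T) Tj 6 0 Td (o) Tj 6 0 Td (t) Tj "
                + "6 0 Td (a) Tj 6 0 Td (l) Tj ET";

        System.out.println(shownStrings(coarse).size() + " chunk(s): " + shownStrings(coarse));
        System.out.println(shownStrings(fine).size() + " chunk(s): " + shownStrings(fine));
    }
}
```

The first stream yields one chunk covering the whole cell and the second yields five single-glyph chunks. If the two generators differ in this way, that would explain why the same region filter appears to behave differently on their output.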
