|
Hi,
I've seen a few threads on stackoverflow.com with the same problem: an "Index was outside the bounds of the array" exception when parsing certain PDFs. I've attached the smallest sample PDF I could find that reproduces the problem. I had the same issue with a few other large PDFs (electronics owner's manuals).

Thanks!

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions
iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php |
|
If you can provide the full stack trace, it would be a big help. Links to the SO articles would also be useful if you still have them handy.
I did find a problem with memory mapped files this morning - will be committing a fix in a few minutes, but I can't tell you for sure if it's related. |
|
On Mon, Jan 30, 2012 at 10:56 PM, Kevin Day <[hidden email]> wrote:
> If you can provide the full stack trace, it would be a big help. Links to
> the SO articles would also be useful if you still have them handy.

================
[IndexOutOfRangeException: Index was outside the bounds of the array.]
   iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.GetResultantText() +505
   iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy) +52
   iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber) +40
================

Here are a couple of links to the SO questions:

http://stackoverflow.com/questions/8951408/index-was-outside-the-bounds-of-the-array-while-reading-a-pdf-using-itextsharp
http://stackoverflow.com/questions/8578793/itextsharp-v5-gettextfrompage-throws-indexoutofrangeexception

Here is the most interesting one; it also has the most information on what the user tried in order to extract images from a PDF:

http://stackoverflow.com/questions/8493559/why-is-my-image-distorted-when-decoding-as-flatedecode-using-itextsharp/8511314#8511314

None of them have links to example PDFs, however...

Thanks! |
|
ok - I tested your PDF in the Java version of iText (latest code from HEAD) and it does *not* fail. Given the stack trace, I'm pretty sure that this is an issue that has been fixed - basically, if the text render operation had an empty string, we were winding up with an index out of bounds exception. Latest Java code definitely fixes that issue - I'm not sure where things are at with rolling that into the C# code base.
|
|
I suspect that this is also fixed in the iTextSharp HEAD. In any case, the Java and C# versions will be synchronized this weekend.
Paulo |
|
On Tue, Jan 31, 2012 at 4:22 PM, Paulo Soares <[hidden email]> wrote:
> I suspect that this is also fixed in the iTextSharp HEAD. In any case, the Java and C# versions will be synchronized this weekend.

Yes, there's no problem when building from the latest SVN source code with the test file I had attached, thank you!

While browsing sourceforge to get the code from SVN, I noticed someone else had submitted a similar bug report, ID #3474281. I tested with that file. Here's the stacktrace (using latest build):

========================================
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at iTextSharp.text.pdf.CMapAwareDocumentFont.GetWidth(Int32 char1)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetStringWidth(String str)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetUnscaledBaselineWithOffset(Single yOffset)
   at iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.RenderText(TextRenderInfo renderInfo)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayPdfString(PdfString str)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ShowTextArray.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber)
   at PdfTextExtractorTest.Main(String[] args)
========================================

Thanks Paulo! |
|
In reply to this post by Kevin Day
On Tue, Jan 31, 2012 at 4:16 PM, Kevin Day <[hidden email]> wrote:
> ok - I tested your PDF in the Java version of iText (latest code from HEAD)
> and it does *not* fail. Given the stack trace, I'm pretty sure that this is
> an issue that has been fixed - basically, if the text render operation had
> an empty string, we were winding up with an index out of bounds exception.
> Latest Java code definitely fixes that issue - I'm not sure where things are
> at with rolling that into the C# code base.

Thank you - I downloaded the latest from SVN, and indeed the test file I submitted previously works without problem too.

Another file I tested but didn't submit due to its large size still doesn't work with the latest C# build, though. If you have time, it's located here:

http://www.navigon.com/export/sites/default/common/Download/Manual/PNA/NAVIGON70/English_manual.pdf

Stacktrace from that file with the latest SVN C# build:

=============================================
Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at iTextSharp.text.pdf.CMapAwareDocumentFont.GetWidth(Int32 char1)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetStringWidth(String str)
   at iTextSharp.text.pdf.parser.TextRenderInfo.GetUnscaledBaselineWithOffset(Single yOffset)
   at iTextSharp.text.pdf.parser.LocationTextExtractionStrategy.RenderText(TextRenderInfo renderInfo)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayPdfString(PdfString str)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber)
   at PdfTextExtractorTest.Main(String[] args)
=============================================

Thank you! |
|
ok - that one is going to be caused by something deeper in the unicode, font metrics, etc... area of iText - I'll need to rely on the others for digging into that.
Are you able to isolate which page is causing this issue? It really should be possible to get a single page that causes the problem, and having that will help quite a bit in getting a fix. |
|
On Tue, Jan 31, 2012 at 5:42 PM, Kevin Day <[hidden email]> wrote:
> Are you able to isolate which page is causing this issue? It really should
> be possible to get a single page that causes the problem, and having that
> will help quite a bit in getting a fix.

For the link posted earlier (136 pages):

http://www.navigon.com/export/sites/default/common/Download/Manual/PNA/NAVIGON70/English_manual.pdf

The following pages throw an exception:

9,15,17,18,19,21,23,24,25,26,27,28,29,31,32,34,35,37,38,39,40,41,42,
43,44,45,46,47,48,49,50,51,52,53,54,58,60,61,62,63,64,65,66,67,68,69,
70,71,72,73,77,78,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,99,
100,101,102,104,105,106,107,108,109,110,111,112,113,114,115,117,118,
119,120,121,122,123,129,130,131,132

For the other post in reply to Paulo (sourceforge bug report), all six pages throw an exception. |
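For anyone who wants to produce a similar list of failing pages, here is a minimal Java sketch of the page-by-page check. It is not taken from the thread: `FailingPagesSketch`, `findFailingPages` and the `extractPage` callback are hypothetical names, and the callback stands in for whatever per-page extraction call is being tested (with iText that would presumably wrap the library's per-page text extraction). The demo in `main` uses a fake extractor so the sketch runs without any PDF library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of iText: tries every page and records the ones that throw.
final class FailingPagesSketch {

    // extractPage is an assumed callback that returns the text of a 1-based page number.
    static List<Integer> findFailingPages(int pageCount, IntFunction<String> extractPage) {
        List<Integer> failing = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                extractPage.apply(page);
            } catch (RuntimeException e) {
                // Keep going so that one bad page does not hide the others.
                failing.add(page);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Fake extractor for demonstration only: pretends that pages 2 and 4 are broken.
        IntFunction<String> fake = page -> {
            if (page == 2 || page == 4) {
                throw new ArrayIndexOutOfBoundsException(page);
            }
            return "text of page " + page;
        };
        System.out.println(findFailingPages(5, fake));
    }
}
```

Catching the exception per page, rather than around the whole loop, is what makes it possible to report every failing page in a single run.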
|
In reply to this post by Kevin Day
I have the same problem with index outside the bounds...
I was converting the following PDF:

http://zbierka.sk/ov/kapitoly/default.aspx?KapitolaID=64396&FileName=ov2012-018-01&Rocnik=2012&TypKapitolyID=1

I downloaded your source code and found two errors.

The FIRST ERROR was in CMapAwareDocumentFont.GetWidth(int char1). The input was 327 (the Slovak character Ň), which char1 = uni2cid[char1]; transformed into 277. Accessing widths[char1] then threw the exception, because the widths variable was initialized with only 256 items. I had to increase the size of that variable to avoid the error.

The SECOND ERROR was in LocationTextExtractionStrategy.GetResultantText(), where the check !string.IsNullOrEmpty(lastChunk.text) was missing from this condition:

else if (dist > chunk.charSpaceWidth / 2.0f && chunk.text[0] != ' ' && lastChunk.text[lastChunk.text.Length - 1] != ' ')
    sb.Append(' ');

After fixing these errors and rebuilding the DLL, the conversion worked perfectly. Would you be so kind as to take a look at our Slovak diacritics and fix this in your official release as well? If you already have, just ignore my message.

Thanks a lot... |
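The first error can be reproduced outside the library with a small Java sketch. This is not iText's code and not the fix that was eventually committed. `WidthLookupSketch` is a hypothetical class, the fill value 500 and the fallback width are made up, and `uni2cid` is modelled as a plain `Map`. Only the 256-entry array and the 327 -> 277 mapping come from the report above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the font class discussed above; not iText's actual code.
final class WidthLookupSketch {
    private final int[] widths = new int[256];              // same size as in the report
    private final Map<Integer, Integer> uni2cid = new HashMap<>();
    private final int defaultWidth;                         // assumed fallback value

    WidthLookupSketch(int defaultWidth) {
        this.defaultWidth = defaultWidth;
        Arrays.fill(widths, 500);                           // made-up width for every known code
        uni2cid.put(327, 277);                              // U+0147 (N with caron) -> 277, as reported
    }

    // Unguarded lookup: fails the same way as the reported bug when the mapped index is >= 256.
    int getWidthUnchecked(int char1) {
        Integer cid = uni2cid.get(char1);
        int index = (cid != null) ? cid : char1;
        return widths[index];
    }

    // Guarded lookup: falls back to a default width instead of indexing past the array.
    int getWidth(int char1) {
        Integer cid = uni2cid.get(char1);
        int index = (cid != null) ? cid : char1;
        if (index < 0 || index >= widths.length) {
            return defaultWidth;
        }
        return widths[index];
    }

    public static void main(String[] args) {
        WidthLookupSketch font = new WidthLookupSketch(1000);
        System.out.println(font.getWidth(327));             // guarded: no exception
        try {
            font.getWidthUnchecked(327);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }
    }
}
```

A bounds check with a fallback is only one option. Enlarging the array, as described above, hides the problem for this document but not for fonts whose mapped indices are larger still.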
|
I believe that the bug in LocationTextExtractionStrategy.GetResultantText() was fixed some time ago - did you experience this problem with the latest code in HEAD?

For reference, the line in question in SVN has the following (and startsWithSpace and endsWithSpace have the null and empty conditions covered):

else if (dist > chunk.charSpaceWidth/2.0f && !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text))

I will need to ask Paulo to look at the other problem - I don't quite know the full implications of the uni2cid array. Trying to maintain an array that is the length of the full unicode set isn't practical. Increasing the array to 512 or something may address the current situation you find yourself in, but this seems to me like something that needs a more robust fix, and the whole unicode/cid transformation stuff is outside of my expertise. |
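To illustrate why moving the checks into helpers removes the crash, here is a minimal Java sketch. The helper names startsWithSpace and endsWithSpace come from the line quoted above, but their bodies here are an assumption (returning false for null or empty input), and `ChunkJoinSketch` and `needsSpace` are hypothetical names used only for this example.

```java
// Hypothetical illustration of the guarded condition; not the actual iText source.
final class ChunkJoinSketch {

    // Assumed behaviour: null or empty text does not start with a space.
    static boolean startsWithSpace(String s) {
        return s != null && !s.isEmpty() && s.charAt(0) == ' ';
    }

    // Assumed behaviour: null or empty text does not end with a space.
    static boolean endsWithSpace(String s) {
        return s != null && !s.isEmpty() && s.charAt(s.length() - 1) == ' ';
    }

    // Mirrors the quoted condition: insert a space when the gap is wide enough
    // and neither neighbouring chunk already supplies one.
    static boolean needsSpace(float dist, float charSpaceWidth, String lastText, String text) {
        return dist > charSpaceWidth / 2.0f
                && !startsWithSpace(text)
                && !endsWithSpace(lastText);
    }

    public static void main(String[] args) {
        // An empty previous chunk no longer causes an out-of-bounds access.
        System.out.println(needsSpace(5f, 4f, "", "abc"));
        System.out.println(needsSpace(5f, 4f, "abc ", "def"));
    }
}
```

Indexing the first or last character directly, as in the older condition, fails on an empty string. The helpers check the length first, so the empty-string case is handled before any character is read.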
|
In reply to this post by newton
Fixed in the SVN.
Paulo
|
