|
I have an existing PDF that I'm trying to parse text out of, and am winding up with a null pointer exception when reading an array in the content stream.
I have narrowed the problem down to a particular line in the content stream (if I run this one line through PdfContentParser.parse() it fails).

Here is the line (sorry this is so ugly - I'll describe the exact location of the problem in a second):

[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)(21/7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)] TJ

The problem is that there appears to be an open bracket [ in the middle of this line. If you search for -20(/>)[(*)-15 you will land on that open bracket. It makes the parser think it's reading an array inside the array. The ending ] then closes the inner array, the outer array is never closed, and the whole thing blows up.

At first blush, this looks like it's just a bad PDF. But the trick is that Acrobat parses and renders this thing just fine.

So my question is: Is it possible that the above is actually valid per the PDF spec, and we are just missing something with the tokeniser or parser? It doesn't seem like it would be valid. But if it were invalid, you'd really think that Acrobat wouldn't be able to parse it, either.

Are we missing something in our parser, or is Acrobat doing some sort of intense logic to reconstruct the TJ operation if the array doesn't terminate properly? I've done some thinking on this, and I see no reasonable strategy for determining where in the content stream to insert an artificial ] |
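To make the failure concrete, here is a minimal sketch in plain Java (no iText) that counts array delimiters outside literal strings. The class name ArrayDepthCheck and the method scan are hypothetical, made up for this illustration. The sketch assumes the fragment contains only literal strings, numbers and operators (it ignores hex strings, comments and inline images), so it is a diagnostic aid under those assumptions and not a substitute for PRTokeniser.

```java
/** Hypothetical helper, not part of iText: tracks '[' / ']' nesting outside literal strings. */
final class ArrayDepthCheck {

    /** Returns {maxDepth, finalDepth} for the array delimiters in one content stream fragment. */
    static int[] scan(String content) {
        int depth = 0;      // current array nesting level
        int maxDepth = 0;   // deepest nesting level seen so far
        int parens = 0;     // nesting level inside a literal string "( ... )"
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (parens > 0) {
                if (c == '\\') {
                    i++;                 // skip the escaped character
                } else if (c == '(') {
                    parens++;            // balanced parentheses may nest inside a string
                } else if (c == ')') {
                    parens--;
                }
            } else if (c == '(') {
                parens = 1;              // a literal string starts here
            } else if (c == '[') {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
            } else if (c == ']') {
                depth--;
            }
        }
        return new int[] { maxDepth, depth };
    }

    public static void main(String[] args) {
        // Shortened fragment with the same shape as the failing line (assumption: same bracket layout).
        String fragment = "[(*)-15(&+,)(*)-20(/>)[(*)-15(&+,)(*)-20(/>)(21/7,5)(*)-20(34*)] TJ";
        int[] result = scan(fragment);
        System.out.println("maxDepth=" + result[0] + " finalDepth=" + result[1]);
    }
}
```

A final depth other than 0 means the brackets do not balance, which is the situation described above: the single ] closes the inner array and the outer one stays open. As far as I can tell from the PDF reference, nested arrays are legal as objects in general, but the operand of TJ is an array whose elements are strings or numbers, so even a balanced inner array would not be a well-formed TJ operand.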
|
Can we see the actual PDF?
--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

Autonomy Corp., an HP Company

> -----Original Message-----
> From: Kevin Day [mailto:[hidden email]]
> Sent: Thursday, October 27, 2011 3:57 PM
> To: [hidden email]
> Subject: [iText-questions] Content stream question

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA. Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/ Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php |
|
Unfortunately, no - at least not something that I could post here. I could email it to you directly if it would help. Also, I have about 100 of these files that exhibit the same behavior - I'll try to track down a copy that doesn't contain sensitive info.
But the following unit test causes the failure (I haven't committed this b/c I'm still not clear on whether it's a problem with the content stream or not - if you'd like me to commit it, just let me know):

    import java.util.ArrayList;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    import com.itextpdf.text.pdf.PRTokeniser;
    import com.itextpdf.text.pdf.PdfContentParser;
    import com.itextpdf.text.pdf.PdfObject;

    public class PdfContentParserTest {

        @Before
        public void setUp() throws Exception { }

        @After
        public void tearDown() throws Exception { }

        @Test
        public void test() throws Exception {
            // The failing content stream line, copied verbatim from the PDF.
            String line = "[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)(21/7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)] TJ";
            PRTokeniser tokeniser = new PRTokeniser(line.getBytes());
            PdfContentParser p = new PdfContentParser(tokeniser);
            ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
            // parse() refills operands with the next operator and its operands; an empty list means end of input.
            while (p.parse(operands).size() > 0) {
                System.out.println(operands); // remove for production release
            }
        }
    } |
|
Kevin,
Acrobat is quite lax about errors; thus, it need not be your aim to emulate it.

To me it looks like the beginning of the line

[(*)-15(*)-15(&+,)(*)-15(-./0,123/45)(*)-15(/.)(*)-15(/2+,.)(*)-15(346/.7823/4)(*)-15(9,4,.82,:)(*)-15(;<)(*)-15(&+,)(*)-15(!==>52.823/4)(*)-15(?,4,.82/.)(*)-20(.,98.:349)(*)-20(2+,)(*)-20(=3@,=3+//:)(*)-20(/6)(*)-20(A8.3/>5)(*)-20(34A,527,42)(*)-20(/>)

is doubled, and most likely unintentionally so. Does Acrobat actually display these contents twice? If it doesn't, maybe it just ignores the first occurrence. If it does, maybe (not expecting inner arrays) it ignores the extra '['.

Either way, I think it most likely that Acrobat simply ignores something, either the doubled beginning of the line or the extra '['. I doubt that behaviour is required by the spec; most likely Acrobat is simply lax about its input. And I doubt you want to be as lax in automated processes.

Regards,
Michael |
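The doubling can be checked mechanically. Below is a minimal sketch in plain Java with hypothetical names (DoubledPrefixCheck, nthBracket, prefixIsDoubled, dropFirstCopy, dropStrayBracket), assuming the fragment contains only literal strings and numbers. It locates the second '[' outside strings, tests whether the text before it is repeated right after it, and builds the two candidate readings mentioned above. Neither reading is sanctioned by the spec, and which one (if either) Acrobat applies is not known here.

```java
/** Hypothetical helper, not part of iText: tests whether the text before a stray '[' is repeated after it. */
final class DoubledPrefixCheck {

    /** Index of the n-th '[' (1-based) found outside literal strings, or -1 if there is none. */
    static int nthBracket(String s, int n) {
        int parens = 0;   // nesting level inside a literal string "( ... )"
        int seen = 0;     // number of '[' seen outside strings so far
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (parens > 0) {
                if (c == '\\') {
                    i++;             // skip the escaped character
                } else if (c == '(') {
                    parens++;
                } else if (c == ')') {
                    parens--;
                }
            } else if (c == '(') {
                parens = 1;
            } else if (c == '[' && ++seen == n) {
                return i;
            }
        }
        return -1;
    }

    /** True when everything between the first '[' and the second '[' reappears right after the second '['. */
    static boolean prefixIsDoubled(String s) {
        int first = nthBracket(s, 1);
        int stray = nthBracket(s, 2);
        if (first < 0 || stray < 0) {
            return false;
        }
        String prefix = s.substring(first + 1, stray);
        return !prefix.isEmpty() && s.startsWith(prefix, stray + 1);
    }

    /** Reading 1: ignore the doubled beginning, so the contents appear once. */
    static String dropFirstCopy(String s) {
        if (!prefixIsDoubled(s)) {
            return s;
        }
        return s.substring(0, nthBracket(s, 1)) + s.substring(nthBracket(s, 2));
    }

    /** Reading 2: ignore only the extra '[', so the doubled contents appear twice. */
    static String dropStrayBracket(String s) {
        if (!prefixIsDoubled(s)) {
            return s;
        }
        int stray = nthBracket(s, 2);
        return s.substring(0, stray) + s.substring(stray + 1);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the failing line (assumption: same structure, much less text).
        String fragment = "[(a)-15(b)-20[(a)-15(b)-20(c)] TJ";
        System.out.println(prefixIsDoubled(fragment));
        System.out.println(dropFirstCopy(fragment));
        System.out.println(dropStrayBracket(fragment));
    }
}
```

The two readings produce different text, which is why the question of whether Acrobat shows the contents once or twice matters: it would hint at which kind of recovery is going on. Applying either one automatically would be a guess about the producer's intent.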
|
Awesome catch - you are right - the array contents are actually duplicated except for the following little bit at the end:

(21/7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)

Very interesting...

And I agree - at this point, we need to just call the PDF bad (even though Acrobat somehow parses it). Thanks for noticing that. |
|
Acrobat will try to handle even the worst garbage - since that's what our users expect :(.

On 10/28/11 3:07 PM, "Kevin Day" <[hidden email]> wrote:
>Awesome catch - you are right - the array contents are actually duplicated
>except for the following little bit at the end:
>
>(21/7,5)(*)-20(8.,)(*)-20(+<-/2+,2318=)(*)-20(34*)
>
>Very interesting...
>
>And I agree - at this point, we need to just call the PDF bad (even though
>Acrobat somehow parses it). Thanks for noticing that. |
|
Leonard: I hear you... It's funny b/c if, from day one, it hadn't accepted garbage, then the developers who were using Acrobat as a litmus test to ensure compliance would have known they were making mistakes and would have fixed them. Now, of course, the cat is out of the bag, the cow has left the barn, etc... - now if Adobe changes their stance on this, the users blame Acrobat for not reading files. It's the whole IE situation all over again. I don't envy the position you are in.
|
|
I suggest you run the PDF Syntax Check on the file, just to be sure. Stuff that Acrobat glosses over when displaying a form still gets flagged by the check. I'm pretty sure it's only available in the "pro" versions.

--Mark Storer
Senior Software Engineer
Cardiff.com

import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;

Autonomy Corp., an HP Company

> -----Original Message-----
> From: Kevin Day [mailto:[hidden email]]
> Sent: Friday, October 28, 2011 9:34 AM
> To: [hidden email]
> Subject: Re: [iText-questions] Content stream question
>
> Leonard: I hear you... It's funny b/c if from day one, it
> hadn't accepted garbage, then the developers that were using
> Acrobat as a litmus test to ensure compliance would have
> known they were making mistakes and would have fixed them.
> Now, of course, the cat is out of the bag, the cow has left
> the barn, etc... - now if Adobe changes their stance on this,
> the users blame Acrobat for not reading files. It's the
> whole IE situation all over again.
> I don't envy the position you are in. |
|
Kevin,
Unfortunately, in the beginning the PDF reference was not normative in nature (or so Leonard told us once). Thus, it wasn't as clear what was garbage and what was not. Regards, Michael |
