
PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

RIchard Hammond
Hello,

Attached are the C# program (as a .txt file) and the PDF file it fails on.

I have, in fact, run this PDF through another text converter, and it works fine.

So my question is: any idea why iText can't or won't parse it?

Thanks in advance,

Richard

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php

bill04December2011.zip (92K) Download Attachment
PdfTextExtractor gives exception.zip (806 bytes) Download Attachment

Kevin Day
Next time, post the stack trace!

For everyone's reference, here is the stack trace:

 java.lang.RuntimeException: - is not a valid number - java.lang.NumberFormatException: For input string: "-"
        at com.itextpdf.text.pdf.PdfNumber.<init>(PdfNumber.java:83)
        at com.itextpdf.text.pdf.PdfContentParser.readPRObject(PdfContentParser.java:180)
        at com.itextpdf.text.pdf.PdfContentParser.parse(PdfContentParser.java:89)
        at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:358)
        at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
        at com.itextpdf.text.pdf.parser.PdfTextExtractor.getTextFromPage(PdfTextExtractor.java:73)
        at com.itextpdf.text.pdf.parser.PdfContentReaderTool.listContentStreamForPage(PdfContentReaderTool.java:181)
        at com.itextpdf.text.pdf.parser.PdfContentReaderTool.listContentStream(PdfContentReaderTool.java:204)
        at com.itextpdf.text.pdf.parser.PdfContentReaderTool.main(PdfContentReaderTool.java:248)


After a little more digging, I isolated the problem to this chunk of content:

10 w
1 J
0.0 G
7820 --240 m
7857 --233 l
S
Q


That sure doesn't look like valid PDF to me.  So who created this PDF, and why did they include two negative signs?
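
For anyone who wants to spot this kind of operand without iText, here is a minimal sketch. It is not iText's tokenizer; the class name `OperandCheck` and its methods are hypothetical, made up for illustration. It assumes operands are separated by whitespace, and it flags tokens that start like a number but do not match the basic PDF number syntax (an optional sign, digits, and at most one decimal point).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: flags operands that start like a
// number but are not valid PDF numbers.
class OperandCheck {

    // True if the token begins with a sign, a digit, or a decimal point,
    // i.e. it is meant to be a number rather than an operator such as "m".
    static boolean looksNumeric(String token) {
        if (token.isEmpty()) {
            return false;
        }
        char c = token.charAt(0);
        return c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    // True if the token is an optional sign followed by digits with at most
    // one decimal point (leading, trailing, or embedded).
    static boolean isValidNumber(String token) {
        return token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    // Returns every whitespace-separated token that looks numeric but is not.
    static List<String> badOperands(String content) {
        List<String> bad = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (looksNumeric(token) && !isValidNumber(token)) {
                bad.add(token);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String chunk = "10 w\n1 J\n0.0 G\n7820 --240 m\n7857 --233 l\nS\nQ";
        System.out.println(badOperands(chunk)); // prints [--240, --233]
    }
}
```

Run against the chunk above, only the two doubled-sign operands are reported; everything else in the chunk is an ordinary number or operator.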
RIchard Hammond
Kevin, the stack trace wasn't posted because the problem is easily reproducible.

If you use Adobe Acrobat Reader to view the PDF, it looks fine; I tested this before I posted the query. The original file was produced by a power company (who should know what they are doing). Also, I have used another PDF-to-text converter on the file, and that worked fine too.

Regards,

Richard
 
> Date: Sat, 4 Feb 2012 14:23:13 -0800
> Subject: Re: [iText-questions] PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file
>
> Next time, post the stack trace!
> [...]

Leonard Rosenthol-3
In reply to this post by Kevin Day
It's not valid PDF.

Leonard


Leonard Rosenthol-3
In reply to this post by RIchard Hammond

Just because Adobe Reader processes your file does NOT make it valid PDF. Reader is EXTREMELY lenient because the average user would have no way to fix such crappy PDFs.

If you ran this file through the PDF validator in Acrobat, I would bet it would fail.

Leonard
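
To make the strict-versus-lenient difference concrete, here is a small sketch. What a lenient reader actually does with `--240` is not stated in this thread; the code below assumes, purely for illustration, that a run of leading minus signs is treated as a single sign. The class `LenientNumber` and its methods are hypothetical and are not Adobe's or iText's code.

```java
// Hypothetical sketch, not Adobe's or iText's code.
class LenientNumber {

    // Strict: rejects anything Java's parser cannot read, including "--240".
    static double parseStrict(String token) {
        return Double.parseDouble(token);
    }

    // Lenient (assumed behaviour): collapse a run of leading '-' into one sign.
    static double parseLenient(String token) {
        int i = 0;
        while (i < token.length() && token.charAt(i) == '-') {
            i++;
        }
        // Still throws NumberFormatException if nothing usable follows the signs.
        double magnitude = Double.parseDouble(token.substring(i));
        return i > 0 ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(parseLenient("--240")); // prints -240.0
        try {
            parseStrict("--240");
        } catch (NumberFormatException e) {
            System.out.println("strict parser rejected the token");
        }
    }
}
```

A strict parser fails loudly on the malformed operand, while the lenient one silently guesses what the producer meant. That trade-off is why a file can display fine in one tool and throw an exception in another.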

 

From: RIchard Hammond [mailto:[hidden email]]
Sent: Saturday, February 04, 2012 6:45 PM

Kevin - the stack trace wasn't posted because the problem is easily reproducible.
[...]

RIchard Hammond
You are correct, Leonard. I ran my file through a PDF validator against all the 5 or 6 variations of PDF (PDF/A1, PDF/A2 etc) and it failed every one, but so did all the other 30 or 40 different PDFs from all the other energy suppliers! In fact, every PDF document I have on my laptop fails to validate, so on the basis of my (very basic) validation exercise it would appear that nothing is ever 100% PDF compatible.

I've even downloaded Adobe Reader X (10.1.2) from your company and looked at some of the PDFs supplied with it, and they fail to validate too.

Btw, iText was happy to convert these other 30 to 40 PDFs to text, but not the one I posted. I was just curious as to why.

Richard
 

From: [hidden email]
Date: Sat, 4 Feb 2012 15:51:35 -0800

Just because Adobe Reader processes your file does NOT make it valid PDF.
[...]

mkl
Richard,

RIchard Hammond wrote
> You are correct, Leonard - I ran my file through a PDF validator against all the 5 or 6 variations of PDF (PDF/A1, PDF/A2 etc) and it failed every one - - but so did all the other 30 or 40 different PDFs from all the other energy suppliers!

Maybe you should not test for PDF/A, PDF/X, ... compliance but, more basically, for PDF correctness. If your PDFs do not claim to be PDF/A or PDF/X, testing them against those profiles is a futile waste of time.

> I've even downloaded Adobe Reader X (10.1.2) from your company

I think Leonard had Adobe Acrobat in mind, a software package that includes Preflight, a tool for checking PDFs against numerous sets of criteria.

Regards, Michael
RIchard Hammond
Hello Michael,

I was simply performing the validation checking that Leonard suggested, and I only used Adobe Reader to find some PDFs from Adobe to validate.

I am not embarking on an iText, PDF or Adobe bashing exercise here; I was just curious as to why iText wouldn't convert the 'PDF' file I was originally working on.

Regards,

Richard
 
> Date: Sun, 5 Feb 2012 03:25:20 -0800
> Subject: Re: [iText-questions] PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file
>
> Richard,
> [...]

Leonard Rosenthol-3

As Michael said, I meant PDF validation, which checks compliance with ISO 32000-1:2008 (the PDF standard). That is VERY DIFFERENT from PDF/A (ISO 19005-1 or 19005-2) validation or PDF/X (ISO 15930-1 through 15930-8) validation. Each is a separate standard with different requirements.

 

Leonard

 

From: RIchard Hammond [mailto:[hidden email]]
Sent: Sunday, February 05, 2012 8:58 AM

Hello Michael
[...]

Paulo Soares-3
In reply to this post by RIchard Hammond
Fixed in the SVN.
 
Paulo
----- Original Message -----
Sent: Saturday, February 04, 2012 5:13 PM
Subject: [iText-questions] PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

Hello
[...]

RIchard Hammond
Thank you Paulo, much appreciated
 
Regards
 
Richard
 


Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

Kevin Day
In reply to this post by RIchard Hammond
Just to help you with future inquiries:  I can review a stack trace and 75% of the time figure out what's going on.  This takes me about 3 seconds, and doesn't require me to stop what I'm doing, download a file, set up a test, etc... (which takes about 5 minutes).  I know that doesn't sound like a lot, but for me, it's a context switch, and that has a huge impact on my overall productivity.

When people make it really easy for me to fix things, they get fixed quickly.  Otherwise, they go to the bottom of the pile where I get to them when I have time.  I'm not trying to be rude on this - just giving you some pointers on how to best interact with the dev team of any open source project.  I do this work because I believe in open source - I don't get paid to do it - so I have to give the bulk of my time and energy to things that I do get paid for.  Hope you understand!

Cheerio,

- K

RIchard Hammond wrote
Kevin - the stack trace wasn't posted because the problem is easily reproducible. If you use Adobe Acrobat Reader to view the PDF it looks fine; I tested this before I posted the query. The original file was produced by a power company (who should know what they are doing). Also, I have used another PDF to text converter on the file, and that worked fine too ... Regards Richard

Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

Kevin Day
In reply to this post by RIchard Hammond
The answer to your question (which actually *is* a good question - but I did answer it in my original response!) is that you have a non-conformant PDF that specifies several numbers in part of its content stream using two negative signs instead of just one.  That is clearly not right - can you imagine writing code like this:

int x = --5;

and expecting it to work (well, outside of the semantics of auto-decrement...  I guess that's a bad example, eh?).

Anyway, yes - this can (and will) be tweaked in the content parser so if it sees a negative sign, it just ignores all other negative signs until it hits a numeral.  But really, someone needs to go back to whoever created this PDF and tell them they bought a bad PDF generation library.  You would be absolutely astounded at the junk that gets into PDFs...
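
[Editor's sketch] The tweak described above, where every minus sign after the first is ignored, could look roughly like the following. This is a minimal illustration and not the actual iText content parser: the class name SingleMinusNumber and its parse method are hypothetical, and the sketch works on one already-isolated token rather than on a content stream.

```java
// Hypothetical helper, not part of iText: treats any run of leading
// minus signs as a single negative sign, as proposed in the post above.
final class SingleMinusNumber {

    static double parse(String token) {
        int i = 0;
        // Skip every leading '-' and remember only that at least one was seen.
        while (i < token.length() && token.charAt(i) == '-') {
            i++;
        }
        if (i == token.length()) {
            // A token such as "-" has no digits at all; keep failing loudly.
            throw new NumberFormatException("no digits in token: " + token);
        }
        double magnitude = Double.parseDouble(token.substring(i));
        return i > 0 ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(parse("--240")); // prints -240.0
        System.out.println(parse("-240"));  // prints -240.0
        System.out.println(parse("7820"));  // prints 7820.0
    }
}
```

Under this rule "--240" and "-240" both come out negative; whether that matches what the generating software intended is a separate question.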

<rant: on>
I think that the general lesson learned by the entire development community over the past 15 years (based on the fiasco that was Internet Explorer's permissive - and often incorrect - handling of HTML) is that it's better to fail fast and force the developer to fix their mistake.  Otherwise, we wind up with a juggernaut of bad syntax out in the world that takes a massive, massive effort to fix.

Long and short, Adobe may decide at some point in the future to stop being so permissive, at which point all of those PDFs will suddenly be non-readable.  The users will all blame Adobe, but really the only mistake that Adobe made was allowing bad syntax for so long.
</rant>

Cheerio,

- K

RIchard Hammond wrote
I am not embarking on an Itext, PDF or Adobe bashing exercise here, I was just curious as to why Itext wouldn't/didn't convert the 'PDF' file I was originally working on ... Regards Richard

Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

mkl
Kevin,
Kevin Day wrote
Anyway, yes - this can (and will) be tweaked in the content parser so if it sees a negative sign, it just ignores all other negative signs until it hits a numeral.
I'm just trying to complicate everything here: Are you sure the double minus is meant as a single one? Or could "7820 --240 m" be meant to be interpreted as "7820 240 m"? That would be possible if some library has a coordinate system with an inverted y axis; it lazily inverts by prepending a minus; in case of already negative coordinate values, a double minus would result.
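
[Editor's sketch] To make that hypothesis concrete, here is how a generator could end up writing a double minus. Everything in it is assumed for illustration: the class FlippedAxisWriter and its methods are hypothetical and do not describe whatever software actually produced the PDF in question.

```java
// Hypothetical generator code illustrating the "lazy inversion" theory.
final class FlippedAxisWriter {

    // Buggy: flips the y axis by gluing "-" onto the formatted text,
    // which yields "--240" when y is already negative.
    static String naiveFlip(int y) {
        return "-" + y;
    }

    // Correct: negate the value first, then format it.
    static String correctFlip(int y) {
        return Integer.toString(-y);
    }

    // Formats a PDF "m" (moveto) operator with its two operands.
    static String moveTo(int x, String y) {
        return x + " " + y + " m";
    }

    public static void main(String[] args) {
        System.out.println(moveTo(7820, naiveFlip(-240)));   // prints 7820 --240 m
        System.out.println(moveTo(7820, correctFlip(-240))); // prints 7820 240 m
    }
}
```

If the file was produced this way, the intended value of "--240" would be 240, not -240.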
But really, someone needs to go back to whoever created this PDF and tell them they bought a bad PDF generation library.
Yes, yes, yes, yes, yes.

Regards,   Michael

Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

mkl
In reply to this post by Leonard Rosenthol-3
Leonard Rosenthol-3 wrote
As Michael said, I said PDF validation - which checks compliance with ISO 32000-1:2008 (the PDF Standard).
BTW, I just ran the PDF supplied by Richard through Preflight --- and the result was a Preflight failure message "Beim Analysieren der Seitenbeschreibung ist ein Fehler auftreten, die PDF-Datei kann nicht überprüft werden", i.e. an error occurred while analyzing the page description, the PDF file could not be checked.

So the PDF is broken in a way not even expected by the Preflight developers... ;)

Regards,   Michael

PS: I only have Preflight from Acrobat 9.4.7; maybe the 10ish Preflights show different results.

Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

RIchard Hammond
In reply to this post by Kevin Day
Hello Kevin
 
First of all - thank you for all your help with this.
 
Second of all - I didn't generate the PDF myself, as you can see it was generated by an energy company (German-owned in fact) which has such massive resources it managed to choose an extremely bad/non-compliant 'generator' to do its dirty work; I was just on the receiving end of it!
 
(I used to write   int  x = --5   code until I turned 5 and went to school where I learned that a minus times another minus makes a plus (!) so I stopped doing that kind of thing)

I expect the syntax of Adobe to be in almost permanent evolution; there is no obligation on them to ever make anything backwards compatible - we developers will just have to take our chances ...
 
I agree with your 'rant' btw ... just don't get ME started on Mucrosoft's 'products' or we'll be here for the next 50 years
 
Thanks & regards
 
 
Richard
 

Re: PDFTextExtractor returns an exception - 'Input string was not in a correct format" when parsing this file

Paulo Soares-3
FYI this is fixed in the SVN. Acrobat interprets --234 as 234 and so do we.
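
[Editor's sketch] One reading consistent with "--234 is 234" is that each minus sign flips the sign, so a pair cancels out. The code below illustrates that reading only; it is not the change committed to SVN. The class name SignFoldingNumber is hypothetical, and the handling of three or more minus signs is an assumption that this thread does not confirm.

```java
// Hypothetical helper, not the actual iText fix: each leading '-'
// toggles the sign, so "--234" parses as 234 and "-234" as -234.
final class SignFoldingNumber {

    static double parse(String token) {
        boolean negative = false;
        int i = 0;
        while (i < token.length() && token.charAt(i) == '-') {
            negative = !negative; // assumption: signs cancel pairwise
            i++;
        }
        if (i == token.length()) {
            // Nothing but minus signs (or an empty token) is still an error.
            throw new NumberFormatException("no digits in token: " + token);
        }
        double magnitude = Double.parseDouble(token.substring(i));
        return negative ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(parse("--234")); // prints 234.0
        System.out.println(parse("-234"));  // prints -234.0
    }
}
```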
 
Paulo

