Text encoding problem?

Text encoding problem?

jmheras

I need the XML output produced by TaggedPdfReaderTool.ConvertToXml.

I found a problem with words that contain accented characters.

For example:

'Número' becomes 'N*famero'

'Página' becomes 'p*e1gina'

...

On the other hand, if the same PDF is processed with an iTextSharp.text.pdf.parser.ITextExtractionStrategy, GetResultantText returns the whole text correctly.

What do I need to do to get the same text in the XML that strategy.GetResultantText returns?

Thank you in advance.

Josep Maria Heras




ADHOC SYNECTIC SYSTEMS, S.A. - LEGAL NOTICE
The information included in this e-mail is CONFIDENTIAL and is for the exclusive use of the addressee named above. If you read this message and are not the intended addressee, please be informed that any use, disclosure, distribution and/or reproduction of this communication, in whole or in part, is strictly prohibited without express authorization under current legislation. If you have received this message in error, please notify us immediately by this means and delete it, together with its attachments, without reading or saving it.

------------------------------------------------------------------------------
Start uncovering the many advantages of virtual appliances
and start using them to simplify application deployment and
accelerate your shift to cloud computing.
http://p.sf.net/sfu/novell-sfdev2dev
_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.itextpdf.com/book/
Check the site with examples before you ask questions: http://www.1t3xt.info/examples/
You can also search the keywords list: http://1t3xt.info/tutorials/keywords/

Text encoding problem?

jmheras

Hello!

I need the XML that I get from TaggedPdfReaderTool.ConvertToXml.
I found a problem with words that have accents: the XML comes out without the accents and with some wrong words.
For example:
'Número' becomes 'N#famero'
'Página' becomes 'p#e1gina'
...

On the other hand, if the same PDF is processed with an iTextSharp.text.pdf.parser.ITextExtractionStrategy,
GetResultantText returns the whole text with accents correctly.
What do I need to do to get the same text in the XML (TaggedPdfReaderTool.ConvertToXml) as strategy.GetResultantText returns?

Thank you in advance.
Josep Maria Heras
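
A note on the garbled fragments: 'fa' and 'e1' are the hexadecimal code points of 'ú' (U+00FA) and 'á' (U+00E1), so the converter appears to be writing the character code in place of the character. The following is a minimal sketch that only illustrates this mapping; it is not part of iTextSharp and not a fix for ConvertToXml. The names HexEscapeDemo and DecodeHexEscapes are hypothetical, and the pattern assumes every escape is a '#' or '*' followed by exactly two hex digits.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HexEscapeDemo

    ' Hypothetical helper: replaces each "#xx" or "*xx" (two hex digits)
    ' with the character whose code point is that hex value,
    ' for example "#fa" -> U+00FA.
    Function DecodeHexEscapes(ByVal garbled As String) As String
        Return Regex.Replace(garbled, "[#*]([0-9A-Fa-f]{2})",
            Function(m As Match) Convert.ToChar(Convert.ToInt32(m.Groups(1).Value, 16)).ToString())
    End Function

    Sub Main()
        ' Fragments taken from the messages above
        Console.WriteLine(DecodeHexEscapes("N#famero"))
        Console.WriteLine(DecodeHexEscapes("p*e1gina"))
    End Sub

End Module
```

Decoding blindly like this would also alter legitimate text such as "#ff", so it is useful for confirming the diagnosis rather than for repairing the XML.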





Re: Text encoding problem?

Paulo Soares-3
In reply to this post by jmheras
Please post the pdf and the code you are using so that we can reproduce the problem.
 
Paulo


From: Josep Maria Heras [mailto:[hidden email]]
Sent: Thursday, September 30, 2010 4:01 PM
To: [hidden email]
Subject: [iText-questions] Text encoding problem?

Disclaimer:
This message is destined exclusively to the intended receiver. It may contain confidential or legally protected information. The incorrect transmission of this message does not mean the loss of its confidentiality. If this message is received by mistake, please send it back to the sender and delete it from your system immediately. It is forbidden to any person who is not the intended receiver to use, distribute or copy any part of this message.



Re: Text encoding problem?

jmheras
Hi!

I've uploaded the PDF "problem.pdf": http://itext-general.2136553.n4.nabble.com/file/n2848929/problem.pdf

- My code to get the whole PDF text, with accents and the letter "ñ", is:
        Dim reader As PdfReader = New PdfReader(pdfByte)
        ' Parser that feeds each page's content to a render listener
        Dim contentParser As New iTextSharp.text.pdf.parser.PdfReaderContentParser(reader)
        Dim strategy As iTextSharp.text.pdf.parser.ITextExtractionStrategy
        For i As Integer = 1 To reader.NumberOfPages
            ' Extract the text of page i
            strategy = contentParser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
            Dim sResult As String = strategy.GetResultantText()
        Next

- My code to get the XML that has incorrect accented words is:
        Dim reader As PdfReader = New PdfReader(pdfByte)
        Dim ms As New System.IO.MemoryStream()
        Dim info As New iTextSharp.text.pdf.parser.TaggedPdfReaderTool()
        ' Write the tagged structure of the PDF as XML into the memory stream
        info.ConvertToXml(reader, ms)
        IO.File.WriteAllBytes("Z:\usr\JM\bustia\pdf\problem.pdf.xml", ms.ToArray())

Thanks!
Josep Maria

Re: Text encoding problem?

Paulo Soares-3
It's fixed now in the SVN trunk.

Paulo

-----Original Message-----
From: jmheras [mailto:[hidden email]]
Sent: Friday, October 01, 2010 7:44 AM
To: [hidden email]
Subject: Re: [iText-questions] Text encoding problem?


Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php
