
Euro symbol in PdfDoc encoding vs. non breaking space in unicode...

Ludger Buenger

Hello everyone,

I believe I have found a bug in iText, but since it is easy to blame
others while overlooking some crucial information, I leave it to the
community to judge whether this is actually a feature and I am simply
mistaken.

Consider the following code:

outlines.set(bookmarkLevel, new PdfOutline(
    (PdfOutline) outlines.get(parentBookmarkLevel),
    destination,
    // outlineTitle
    "1.\u00a0Sammanfattning"
));


This creates a bookmark in the PDF with the following appearance:
"1.€Sammanfattning".

Please note the euro symbol displayed in the bookmark, while the original
string contains \u00a0, which is the Unicode non-breaking space.


After a little digging in the PDF Reference, I found the following in
Appendix D, "Encoding":

1. In PDF 1.3, the euro character was added to the Adobe standard Latin
character set. It is encoded as 200 in WinAnsiEncoding and 240 in
PDFDocEncoding, assigning codes that were previously unused. Apple
changed the Mac OS Latin-text encoding for code 333 from the currency
character to the euro character. However, this incompatible change has
not been reflected in PDF’s MacRomanEncoding, which continues to map
code 333 to currency. If the euro character is desired, an encoding
dictionary can be used to specify this single difference from
MacRomanEncoding.
[...]
6. The space character is also encoded as 312 in MacRomanEncoding and as
240 in WinAnsiEncoding. The meaning of this duplicate code is
“nonbreaking space,” but it is typographically the same as space.

As we can see, PDFDocEncoding maps octal 240 (hex A0, decimal 160) to the
euro symbol, unlike WinAnsiEncoding, which follows Unicode and maps the
same code to the non-breaking space.
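
The codes in the quoted appendix are octal, which is easy to misread. Here is a minimal sketch to check the arithmetic, using only the JDK. The class and method names are made up for illustration and are not part of iText. ISO-8859-1 is used as a stand-in because its code points coincide with Unicode U+0000 to U+00FF.

```java
import java.nio.charset.StandardCharsets;

final class OctalCodeCheck {
    // Parses an octal code as printed in the PDF Reference encoding tables.
    static int fromOctal(String octal) {
        return Integer.parseInt(octal, 8);
    }

    // Decodes one byte as ISO-8859-1, whose values match the first 256 Unicode code points.
    static char latin1(int code) {
        return new String(new byte[] {(byte) code}, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        int code = fromOctal("240");
        System.out.printf("octal 240 = decimal %d = hex %X%n", code, code);
        System.out.println("is no-break space in Latin-1: " + (latin1(code) == '\u00a0'));
    }
}
```

So the same byte value, 160, means "non-breaking space" when read the Unicode or WinAnsi way and "euro" when read as PDFDocEncoding.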

Forcing Unicode did not help either: new
PdfString("1.\u00a0Sammanfattning", PdfObject.TEXT_UNICODE) made no
difference.


After some debugging I found the following lines in PdfString.getBytes():

if (encoding != null && encoding.equals(TEXT_UNICODE)
        && PdfEncodings.isPdfDocEncoding(value))
    bytes = PdfEncodings.convertToBytes(value, TEXT_PDFDOCENCODING);
else
    bytes = PdfEncodings.convertToBytes(value, encoding);

I made two observations:
1) As long as PdfEncodings.isPdfDocEncoding() believes that all
characters belong to PdfDocEncoding, the PdfString will never use a
Unicode-encoded text string but always a PdfDocEncoded one.
But since there is no non-breaking space in PdfDocEncoding,
isPdfDocEncoding() should not match in the first place.

2) Nonetheless, even when converting to PdfDocEncoding, the convert
method should not create a € symbol out of thin air.
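
Observation 1 can be modeled in a few lines. This is only a sketch under stated assumptions: the two string constants are placeholder labels standing in for iText's TEXT_UNICODE and TEXT_PDFDOCENCODING, and looksLikePdfDoc() is a simplified, hypothetical stand-in for isPdfDocEncoding() that keeps just the range test.

```java
final class EncodingChoiceSketch {
    // Placeholder labels, not the actual values of iText's constants.
    static final String UNICODE = "unicode";
    static final String PDFDOC = "pdfdoc";

    // The range test with 160 still inside the pass-through range.
    // Simplification: characters outside the range are rejected outright here,
    // whereas the real method goes on to consult a lookup table.
    static boolean looksLikePdfDoc(String value) {
        for (int k = 0; k < value.length(); k++) {
            char c = value.charAt(k);
            if (c < 128 || (c >= 160 && c <= 255)) {
                continue;
            }
            return false;
        }
        return true;
    }

    // Mirrors the branch in getBytes(): a Unicode request is downgraded
    // whenever the text passes the PdfDocEncoding check.
    static String chosenEncoding(String value, String requested) {
        if (UNICODE.equals(requested) && looksLikePdfDoc(value)) {
            return PDFDOC;
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(chosenEncoding("1.\u00a0Sammanfattning", UNICODE));
    }
}
```

In this model the bookmark title is downgraded even though Unicode was requested, which matches the behavior described above.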


Regarding issue 1:
The following lines are found in PdfEncodings.isPdfDocEncoding():

    // wrongly treats the non-breaking space (dec 160) as part of PdfDocEncoding
    if (char1 < 128 || (char1 >= 160 && char1 <= 255))
        continue;
    if (!pdfEncoding.containsKey(char1))
        return false;

If we replace the first condition with the following:

    // correctly treats the non-breaking space (dec 160) as not part of PdfDocEncoding
    if (char1 < 128 || (char1 > 160 && char1 <= 255))
        continue;
    if (!pdfEncoding.containsKey(char1))
        return false;

then isPdfDocEncoding() detects non-breaking spaces correctly.
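
To try the proposed boundary in isolation, here is a self-contained sketch. It is not iText's code: the class name is hypothetical, and TABLE is a two-entry stand-in for iText's lookup table, holding only the bullet and the euro sign for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class PdfDocCheckSketch {
    // Tiny stand-in for the lookup table (Unicode character -> PdfDocEncoding byte).
    static final Map<Character, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2022', 0x80); // bullet
        TABLE.put('\u20ac', 0xA0); // euro sign
    }

    // Same shape as the check discussed above, with 160 excluded from the pass-through range.
    static boolean isPdfDocEncoding(String text) {
        for (int k = 0; k < text.length(); k++) {
            char c = text.charAt(k);
            if (c < 128 || (c > 160 && c <= 255)) {
                continue;
            }
            if (!TABLE.containsKey(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPdfDocEncoding("1.\u00a0Sammanfattning"));
        System.out.println(isPdfDocEncoding("Pris: 10 \u20ac"));
    }
}
```

With the exclusive bound, the non-breaking space falls through to the table lookup, is not found there, and the string is reported as not representable in PdfDocEncoding.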

Regarding issue 2:

The following code is found in PdfEncodings.convertToBytes(String, String):

    char char1 = cc[k];
    if (char1 < 128 || (char1 >= 160 && char1 <= 255))
        c = char1;
    else
        c = hash.get(char1);

As we can see, char 160 is taken "as is". In PdfDocEncoding, byte 160 is
the euro symbol, so the conversion is wrong.
If 160 is excluded from the range and the character is looked up in the
given hash instead, then iText will at least not convert wrongly but
omit the non-convertible character.
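
A minimal sketch of the conversion with 160 excluded from the range follows. The assumptions are labeled: the class and method names are hypothetical, TABLE is a one-entry stand-in for the hash, and unmapped characters are simply skipped, as described above.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

final class PdfDocConvertSketch {
    // One-entry stand-in for the hash (Unicode character -> PdfDocEncoding byte).
    static final Map<Character, Integer> TABLE = new HashMap<>();
    static {
        TABLE.put('\u20ac', 0xA0); // euro sign
    }

    static byte[] convert(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : text.toCharArray()) {
            if (c < 128 || (c > 160 && c <= 255)) {
                out.write(c); // pass-through range: byte value equals code point
            } else {
                Integer mapped = TABLE.get(c);
                if (mapped != null) {
                    out.write(mapped);
                }
                // Unmapped characters, including the no-break space, are dropped.
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(convert("1.\u00a0Sammanfattning").length);
    }
}
```

Dropping the character loses the space in the bookmark title, so in practice the first fix matters more: once isPdfDocEncoding() rejects the string, a Unicode request is no longer downgraded and this branch is not reached for such text.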

So, having found two places where iText (IMHO) wrongly checks for
>= 160 where it should check for > 160, I risk the assumption that the
third place where it compares >= 160 should likely also be > 160.

Any comments/remarks/flames?

Best regards,

Ludger

--
Dipl-Inf. Ludger Bünger
Product Development
Team Martha
- - - - - - - - - - - - - - - -
RealObjects GmbH
Altenkesseler Str. 17/B4
66115 Saarbrücken, Germany
Tel +49 (0)681 98579 0
Fax +49 (0)681 98579 29
http://www.realobjects.com
[hidden email]
- - - - - - - - - - - - - - - -
Commercial Register: Amtsgericht Saarbrücken, HRB 12016
Managing Directors: Michael Jung, Markus Neurohr
VAT-ID: DE210373115


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://itext.ugent.be/itext-in-action/
Re: Euro symbol in PdfDoc encoding vs. non breaking space in unicode...

Paulo Soares
Looks like you found a bug.

Paulo

----- Original Message -----
From: "Ludger Bünger" <[hidden email]>
To: <[hidden email]>
Sent: Tuesday, November 27, 2007 6:22 PM
Subject: [iText-questions] Euro symbol in PdfDoc encoding vs. non
breaking space in unicode...



