
How to read underline text from TextRenderInfo


How to read underline text from TextRenderInfo

Barathvaj
Hi

I want to identify whether a piece of text is underlined or not. Is there any way I can find that out, the way I can read the font or the fill color? Please help me find an approach.

Thanks
Barath

Re: How to read underline text from TextRenderInfo

mkl
Barath,
Barathvaj wrote
I want to identify whether a piece of text is underlined or not. Is there any way I can find that out, the way I can read the font or the fill color?
In contrast to the font, the color, the rendering mode, etc., being underlined is not a property of the drawn glyphs in PDF. Instead, a line (or, fairly often, a very thin rectangle) merely happens to be drawn somewhere, and character glyphs happen to sit above it. Thus, for generic documents, identifying underlined text (and differentiating it from text merely drawn near some line, e.g. background material, table or text box frames) can only be done by heuristics.
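
To illustrate what such a heuristic might look like, here is a minimal sketch in plain Java. It does not use any iText classes: TextBox, HLine, UnderlineHeuristic, and all three tolerance values are assumptions made up for this example, and real documents would certainly need different tuning.

```java
/** Hypothetical text chunk: horizontal extent, baseline, and font size in user space units. */
final class TextBox {
    final double xStart, xEnd, baselineY, fontSize;

    TextBox(double xStart, double xEnd, double baselineY, double fontSize) {
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.baselineY = baselineY;
        this.fontSize = fontSize;
    }
}

/** Hypothetical horizontal line: horizontal extent, vertical position, and thickness. */
final class HLine {
    final double xStart, xEnd, y, thickness;

    HLine(double xStart, double xEnd, double y, double thickness) {
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
        this.thickness = thickness;
    }
}

final class UnderlineHeuristic {
    // Assumed tolerances, all relative to the font size or text width.
    static final double MAX_DROP = 0.25;      // how far below the baseline the line may sit
    static final double MAX_THICKNESS = 0.15; // thicker lines are more likely frames or bars
    static final double MIN_COVERAGE = 0.8;   // fraction of the text width the line must span

    /** Guesses whether the line is an underline for the text; a heuristic, not a certainty. */
    static boolean looksUnderlined(TextBox t, HLine l) {
        // PDF y coordinates grow upward, so a line below the baseline has a positive drop.
        double drop = t.baselineY - l.y;
        if (drop < 0 || drop > MAX_DROP * t.fontSize) {
            return false;
        }
        if (l.thickness > MAX_THICKNESS * t.fontSize) {
            return false;
        }
        double overlap = Math.min(t.xEnd, l.xEnd) - Math.max(t.xStart, l.xStart);
        double width = t.xEnd - t.xStart;
        return width > 0 && overlap >= MIN_COVERAGE * width;
    }
}
```

With these assumed tolerances, a thin line 1.5 units below 12 pt text that spans its full width is accepted, while a table border 20 units further down or a line covering only a third of the text is rejected.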

Furthermore, underlines are usually created using vector graphics (lines or slim rectangles), and the classes in the iText parser package currently ignore such vector graphics.
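
Since an underline may arrive either as a stroked line or as a filled slim rectangle, any extension would probably want to reduce both to one common form. The following plain Java sketch shows one way to do that for rectangles; Stroke, SlimRectangles, and the two thresholds are assumptions for illustration, not part of iText.

```java
/** Hypothetical normalized form: a horizontal stroke with a thickness. */
final class Stroke {
    final double xStart, xEnd, y, thickness;

    Stroke(double xStart, double xEnd, double y, double thickness) {
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
        this.thickness = thickness;
    }
}

final class SlimRectangles {
    static final double MAX_HEIGHT = 2.0; // assumed upper bound for an underline's height
    static final double MIN_ASPECT = 5.0; // assumed minimum width-to-height ratio

    /**
     * Converts a rectangle given as x, y, width, height (the operand order of the
     * PDF "re" operator) into a horizontal stroke, or returns null if it is not slim.
     */
    static Stroke toStroke(double x, double y, double w, double h) {
        // Width and height may be negative, so normalize the corners first.
        double left = Math.min(x, x + w), right = Math.max(x, x + w);
        double bottom = Math.min(y, y + h), top = Math.max(y, y + h);
        double width = right - left, height = top - bottom;
        if (width <= 0 || height > MAX_HEIGHT || width < MIN_ASPECT * height) {
            return null;
        }
        return new Stroke(left, right, (bottom + top) / 2, height);
    }
}
```

A stroked line can be mapped to the same Stroke type using its line width as the thickness, so later matching code only has to deal with one shape.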

Thus, iText does not allow you to recognize underlined text out of the box.

If you want to extend iText to recognize underlined text, you have to

a) extend the iText parser package classes to recognize and forward vector graphics to the render listeners; and
b) add heuristics to the standard render listeners (text extraction strategies) to recognize underlines and associate them with the text in question.

Regards,   Michael


Re: How to read underline text from TextRenderInfo

Barathvaj
Hi,

Thanks for the information, Michael. Which parser classes do I need to extend to achieve this functionality? I implemented IRenderListener and used GlyphRenderListener, but I am not able to get the vector graphics. Can you please help me make progress on this?

Regards
Barath

Re: How to read underline text from TextRenderInfo

mkl
Barath,
Barathvaj wrote
Which parser classes do I need to extend to achieve this functionality? I implemented IRenderListener and used GlyphRenderListener, but I am not able to get the vector graphics.
First and foremost, you have to enhance PdfContentStreamProcessor so that it also processes the operators that construct paths (chiefly move-to, line-to, and rectangle) and the operators that then paint them. It's up to you whether you first collect all operations building a path and forward that path together with the painting operator, or forward each operation individually.
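
To make the first of these two options concrete, here is a toy processor in plain Java that collects path-construction operations and forwards the finished path together with the painting operator. It is not the iText PdfContentStreamProcessor: PaintedPath, ToyPathProcessor, and the whitespace tokenizer are assumptions for this sketch, which only understands the operators m, l, re, S, f, and n and ignores the graphics state, the transformation matrix, curves, and everything else.

```java
import java.util.ArrayList;
import java.util.List;

/** A finished path plus the operator that painted it (hypothetical type). */
final class PaintedPath {
    final List<double[]> segments; // each entry is {x1, y1, x2, y2}
    final String paintOperator;    // "S" (stroke) or "f" (fill)

    PaintedPath(List<double[]> segments, String paintOperator) {
        this.segments = segments;
        this.paintOperator = paintOperator;
    }
}

final class ToyPathProcessor {
    /** Collects straight-line paths from a whitespace-separated content stream fragment. */
    static List<PaintedPath> process(String content) {
        List<PaintedPath> painted = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        List<double[]> current = new ArrayList<>();
        double curX = 0, curY = 0;

        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            switch (token) {
                case "m": // move-to: set the current point
                    if (operands.size() >= 2) {
                        curX = operands.get(0);
                        curY = operands.get(1);
                    }
                    break;
                case "l": // line-to: add a segment from the current point
                    if (operands.size() >= 2) {
                        double x = operands.get(0), y = operands.get(1);
                        current.add(new double[] {curX, curY, x, y});
                        curX = x;
                        curY = y;
                    }
                    break;
                case "re": // rectangle: x y width height, added as four segments
                    if (operands.size() >= 4) {
                        double x = operands.get(0), y = operands.get(1);
                        double w = operands.get(2), h = operands.get(3);
                        current.add(new double[] {x, y, x + w, y});
                        current.add(new double[] {x + w, y, x + w, y + h});
                        current.add(new double[] {x + w, y + h, x, y + h});
                        current.add(new double[] {x, y + h, x, y});
                        curX = x;
                        curY = y;
                    }
                    break;
                case "S": // stroke
                case "f": // fill
                    painted.add(new PaintedPath(new ArrayList<>(current), token));
                    current.clear();
                    break;
                case "n": // end the path without painting it
                    current.clear();
                    break;
                default:
                    try {
                        operands.add(Double.parseDouble(token));
                        continue; // keep collecting operands
                    } catch (NumberFormatException e) {
                        // Unknown operator or non-numeric operand: ignored in this sketch.
                    }
            }
            operands.clear(); // operands are consumed by each operator
        }
        return painted;
    }
}
```

For example, ToyPathProcessor.process("100 498 m 200 498 l S") yields a single stroked path with one segment from (100, 498) to (200, 498), which is the kind of object a listener could later test against nearby text.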

Then you should either enhance the render listener interface or create a new interface to receive this drawing information. If the listener registered with the PdfContentStreamProcessor also implements that interface, you can forward the drawing instructions you collect in PdfContentStreamProcessor to it.
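
The second variant, a separate interface that the processor checks for, could be sketched as follows in plain Java. All names here (TextListener, PathListener, ToyProcessor, and the two sample listeners) are stand-ins invented for this example, not the actual iText interfaces, and paths are reduced to a plain description string to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the existing text-only listener interface (hypothetical). */
interface TextListener {
    void renderText(String text);
}

/** Hypothetical additional interface for listeners that also want drawing information. */
interface PathListener {
    void renderPath(String description);
}

final class ToyProcessor {
    private final TextListener listener;

    ToyProcessor(TextListener listener) {
        this.listener = listener;
    }

    void emitText(String text) {
        listener.renderText(text);
    }

    /** Forwards the path only if the registered listener opted in via PathListener. */
    void emitPath(String description) {
        if (listener instanceof PathListener) {
            ((PathListener) listener).renderPath(description);
        }
    }
}

/** An existing-style listener that knows nothing about paths. */
final class TextOnlyListener implements TextListener {
    final List<String> events = new ArrayList<>();

    public void renderText(String text) {
        events.add("text:" + text);
    }
}

/** A listener that implements both interfaces and so receives both kinds of event. */
final class TextAndPathListener implements TextListener, PathListener {
    final List<String> events = new ArrayList<>();

    public void renderText(String text) {
        events.add("text:" + text);
    }

    public void renderPath(String description) {
        events.add("path:" + description);
    }
}
```

The appeal of this design is that listeners written against the old interface keep compiling and behaving as before, while new listeners opt in simply by implementing the second interface.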

Then your render listener implementation should implement that new interface, too, so that it receives the drawing information and collects it. Once all information for a page has been processed, it should compare the positions of the paths with those of the text and mark text as underlined accordingly.
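
The reason for waiting until the page is complete is that the underline may be drawn before or after the text it belongs to. A minimal plain Java sketch of such a collecting listener follows; Chunk, Line, and UnderlineCollector are hypothetical names, and the actual matching rule is passed in as a predicate because it is a heuristic that has to be chosen separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical text chunk with a flag that is set once the page has been resolved. */
final class Chunk {
    final String text;
    final double xStart, xEnd, baselineY;
    boolean underlined;

    Chunk(String text, double xStart, double xEnd, double baselineY) {
        this.text = text;
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.baselineY = baselineY;
    }
}

/** Hypothetical horizontal line taken from the collected path information. */
final class Line {
    final double xStart, xEnd, y;

    Line(double xStart, double xEnd, double y) {
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
    }
}

final class UnderlineCollector {
    private final List<Chunk> chunks = new ArrayList<>();
    private final List<Line> lines = new ArrayList<>();
    private final BiPredicate<Chunk, Line> matcher;

    /** The matcher decides whether a line underlines a chunk; it is supplied by the caller. */
    UnderlineCollector(BiPredicate<Chunk, Line> matcher) {
        this.matcher = matcher;
    }

    // During processing, events are only stored, in whatever order they arrive.
    void onText(Chunk chunk) {
        chunks.add(chunk);
    }

    void onLine(Line line) {
        lines.add(line);
    }

    /** Compares all collected lines with all collected text, then resets for the next page. */
    List<Chunk> finishPage() {
        for (Chunk chunk : chunks) {
            for (Line line : lines) {
                if (matcher.test(chunk, line)) {
                    chunk.underlined = true;
                    break;
                }
            }
        }
        List<Chunk> result = new ArrayList<>(chunks);
        chunks.clear();
        lines.clear();
        return result;
    }
}
```

Comparing every chunk with every line is quadratic in the worst case; for pages with many paths, sorting the lines by vertical position first would be an obvious refinement.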

Regards,   Michael

PS: This is quite some work, and you really should know chapters 8 and 9 of the PDF specification ISO 32000-1 <http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf>

Re: How to read underline text from TextRenderInfo

Kevin Day
mkl wrote
PS: This is quite some work, and you really should know chapters 8 and 9 of the PDF specification ISO 32000-1 <http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf>
Very true.

I've had this on my "to do eventually" list for a while now, and it's something that could be investigated - but there would need to be some very serious design discussions to figure out what the render listener should receive.

I'm open to having those discussions if there are enough folks interested. I'm also insanely busy with work right now, so while I can partake in design and architectural discussions, actual coding may be a bit off into the future - but coding will *never* happen if the design isn't figured out.