|
Hi guys;
Here's what I am trying to do; I would appreciate knowing whether this is possible in iText. I'm not interested in constructing PDFs, only in deconstructing existing PDFs to analyze the content and positions of words on the page.

Rather than the boundary of all text on the page, I want the boundary info for each word, in order to generate some XML for another program I wrote.

Something like this...
<word id="0" x="0" y="0" width="8" height="4">The</word>
<word id="1" x="12" y="0" width="7" height="4">fox</word>
<word id="2" x="22" y="0" width="7" height="4">was</word>

I know I can do it for a region of text, as shown in Chapter 15 of the iText in Action book, but I really do want it for each individual word so I can generate invisible yet clickable hotspots over what will end up being just a plain image.

Is this possible with iText? How would I accomplish something like this?

Thanks guys,
kb

_______________________________________________
iText-questions mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php |
|
iText does not have any lexical analysis tools, so it does not know what a "word" is. It only sees the drawing instructions.

So you will need to obtain all of the text and coordinates for the page, then perform your own analysis to determine "words". Don't forget that the definition of a "word" differs across languages...

Leonard

On 7/15/12 8:46 PM, "Kalani Bright" <[hidden email]> wrote:
> Rather than boundary of all text on the page I want the boundary info
> for each word in order to generate some xml for another program I wrote. |
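A minimal sketch of the "perform your own analysis" step, assuming the text and coordinates of one line have already been obtained as a string plus a left edge and width for every character. It uses plain Java only; the class `WordBoxes` and its methods are hypothetical names for illustration, not iText API, and treating a "word" as a run of non-whitespace characters is an assumption that will not suit every language.

```java
import java.util.List;
import java.util.ArrayList;
import java.util.Locale;

/** Hypothetical helper (not iText API): splits one line of positioned characters into word boxes. */
final class WordBoxes {

    /** One word with its bounding box, in whatever units the input used. */
    static final class Word {
        final String text;
        final float x, y, width, height;

        Word(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        /** Formats the word in the XML shape used in the question above. */
        String toXml(int id) {
            String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            return String.format(Locale.ROOT,
                    "<word id=\"%d\" x=\"%.0f\" y=\"%.0f\" width=\"%.0f\" height=\"%.0f\">%s</word>",
                    id, x, y, width, height, escaped);
        }
    }

    /**
     * left[i] and width[i] describe the box of text.charAt(i).
     * Assumption: a word is a maximal run of non-whitespace characters.
     */
    static List<Word> split(String text, float[] left, float[] width, float y, float height) {
        List<Word> words = new ArrayList<>();
        int start = -1; // index of the first character of the current word, or -1 if none
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                float x0 = left[start];
                float x1 = left[i - 1] + width[i - 1];
                words.add(new Word(text.substring(start, i), x0, y, x1 - x0, height));
                start = -1;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        float[] left = {0, 3, 6, 8, 12, 14, 17};
        float[] width = {3, 3, 2, 4, 2, 3, 2};
        List<Word> words = split("The fox", left, width, 0f, 4f);
        for (int i = 0; i < words.size(); i++) {
            System.out.println(words.get(i).toXml(i));
        }
    }
}
```

The word box runs from the left edge of its first character to the right edge of its last one, so differently sized characters within one word are handled as long as the per-character input is accurate.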
|
Thanks. Is there any way to get individual character sizes, or anything smaller than a large block of text?

I could calculate what I want if I knew how to get the rect bounds for a specific character within a larger block of text, or if I knew how to iterate over the text pieces within a PDF text object and then gather further information about those pieces. Given that the text formatting could differ, I can't assume characters are the same size just because they are the same letter.

Thanks again,
kb

On 7/15/12 11:14 PM, Leonard Rosenthol wrote:
> iText does not have any lexical analysis tools, so it does not know what a
> "word" is. It only sees the drawing instructions. |
|
The text parser would be the best place for you to start looking. It determines the rectangles for each text draw operation (which is not the same as what you are asking, but it's a starting point at least). What you are asking for is very difficult to do with PDF in the general case because PDF doesn't have a concept of words, but that will get you a starting point so you'll at least understand what is involved. Pay particular attention to the part of the algorithm that figures out spaces between words.
Good luck! |
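To illustrate the kind of space detection mentioned above, here is a minimal sketch in plain Java. It assumes each text draw operation has already been reduced to a string, a start and end x coordinate, and the width of a space in its font; the names `GapSpaces` and `Chunk` are hypothetical, and the threshold of half a space width is an assumption chosen for illustration rather than a statement of what iText's parser does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not iText API): joins the chunks of one line, adding spaces at wide gaps. */
final class GapSpaces {

    /** One text draw operation: its string, horizontal extent, and the font's space width. */
    static final class Chunk {
        final String text;
        final float startX, endX, spaceWidth;

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Expects the chunks of a single line, already sorted left to right. */
    static String join(List<Chunk> line) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : line) {
            if (prev != null) {
                float gap = c.startX - prev.endX;
                boolean alreadySpaced = prev.text.endsWith(" ") || c.text.startsWith(" ");
                // Assumed rule: a gap wider than half a space separates two words.
                if (gap > c.spaceWidth / 2f && !alreadySpaced) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("The", 0f, 8f, 4f));
        line.add(new Chunk("fox", 12f, 19f, 4f));
        line.add(new Chunk("es", 19.2f, 24f, 4f)); // tiny gap: same word as "fox"
        System.out.println(join(line)); // prints "The foxes"
    }
}
```

A fixed fraction of the space width is only a heuristic; tightly kerned or justified text may need a different threshold.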
|
Yes, it does seem I started with the most difficult problem while just picking up iText and the book. I had the .jar, but I think it's time for me to download the iText source.

Thank you for pointing me there. It does seem that's the best place to start, and it seems I would need to rewrite or extend PRTokeniser or something similar.

I am aware of some of what is involved. I wrote a program that found rects where words were probably hiding in a drag rectangle. It didn't do such a good job; Adobe Acrobat Pro does a much better job, and it'll figure out some text for me too. If I can only crack this puzzle with the detection, I think I'll be fine.

Thanks for the help!
kb

On 7/16/12 10:04 AM, Kevin Day wrote:
> The text parser would be the best place for you to start looking. It
> determines the rectangles for each text draw operation (which is not the
> same as what you are asking, but it's a starting point at least). |
|
Kalani,
IMO the PRTokeniser is the wrong place to look, as there already is the PDF parser API. All you have to do is write a more intelligent RenderListener (or TextExtractionStrategy, if you prefer to additionally implement a getResultantText method) than LocationTextExtractionStrategy.

RenderListeners receive the smallest text fragments directly available from the PDF content streams (the string arguments of the text-showing operators) together with the relevant transformation matrix information. The TextRenderInfo object wrapping this information offers some methods to analyze that fragment; you might need some more functionality inspired by those methods, though. Your RenderListener implementation merely has to process this information to collect the words, their locations and their widths.

Be aware, though, that the text fragments presented to the RenderListener may contain multiple words, or a part of a word, or even multiple parts of multiple words. E.g. you might receive "w i", "rd stuff", and "e", the first two positioned one after the other and the last positioned to fit in the double space gap in "w i", and you would have to build "weird" and "stuff" from that. This fragmentation might be done in the PDF to position the 'i' and 'r' nearer to each other than proposed by their font and to display the 'e' in a different font.

Regards,
Michael |
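The "weird stuff" case above can be sketched in plain Java, without iText, to show why fragments cannot simply be concatenated in arrival order. The sketch assumes each fragment has been expanded into per-character positions; `FragmentMerger`, `Glyph`, and the uniform advance per character are hypothetical simplifications, and in a real listener the positions would come from the render information instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not iText API): rebuilds words from fragments that interleave on the page. */
final class FragmentMerger {

    /** A single character with its left edge and width along the baseline. */
    static final class Glyph {
        final char c;
        final float x, width;

        Glyph(char c, float x, float width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    /** Expands a fragment into glyphs, assuming every character advances by the same amount. */
    static List<Glyph> glyphs(String text, float startX, float advance) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(new Glyph(text.charAt(i), startX + i * advance, advance));
        }
        return out;
    }

    /** Drops whitespace glyphs, orders the rest by position, and splits wherever the gap exceeds maxGap. */
    static List<String> words(List<Glyph> glyphs, float maxGap) {
        List<Glyph> visible = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!Character.isWhitespace(g.c)) {
                visible.add(g);
            }
        }
        visible.sort((a, b) -> Float.compare(a.x, b.x));

        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float prevEnd = 0f;
        for (Glyph g : visible) {
            if (current.length() > 0 && g.x - prevEnd > maxGap) {
                out.add(current.toString());
                current.setLength(0);
            }
            current.append(g.c);
            prevEnd = g.x + g.width;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Glyph> all = new ArrayList<>();
        all.addAll(glyphs("w i", 0f, 5f));       // arrives first
        all.addAll(glyphs("rd stuff", 15f, 5f)); // follows directly after "w i"
        all.addAll(glyphs("e", 5f, 5f));         // drawn last, placed into the gap
        System.out.println(words(all, 2.5f));    // prints [weird, stuff]
    }
}
```

Ignoring the space characters and relying on geometry alone is a design choice for this sketch: the space inside "w i" occupies the spot where the 'e' is later drawn, so trusting it would split "weird" in two.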
|
mkl is right - you shouldn't have to do anything with PRTokeniser. I've already done the heavy lifting with the render listener interface. If you find that there is information you need that isn't exposed by the render listener, let me know and I'll get it added.
Please let me know about your progress on this. It would be good to have additional render listener implementations in iText, and what you are looking for will certainly have utility for other users of the library. |
|
Thanks, mkl and Kevin, for your responses and direction. I'll start playing with the TextExtractionStrategy or RenderListener as I learn iText for this purpose. I'll keep you posted, share my hopefully successful result, and put it back into the library.

It's not a work project (it's a personal project), so progress might be slow; generally I only have part of the weekends and a night here and there.

Thanks for your help. I'll post further questions and results here.
kb

On 7/18/12 7:28 AM, Kevin Day wrote:
> Please let me know about your progress on this. It would be good to have
> additional render listener implementations in iText, and what you are
> looking for will certainly have utility for other users of the library. |
