Java - PDFBox - Text Extraction -

- March 15, 2011

I am using PDFBox to extract text information from a PDF. I have successfully parsed all the properties of the text, such as font name, font face, size, position, etc.

Problem: I am using PDFBox 1.2.1 (the latest version). In the TextPosition class, getCharacter() returns the full string except the last character; the last character is parsed as a separate string.

For example: "How are you" is parsed as "how are" and "u" (two separate strings).

I do not want it to behave like that.

Has anyone come across this? Am I doing something wrong? Waiting for an answer.

Thanks and Regards, Maggie

This problem has been resolved.

In PDFStreamEngine.java, the following line (string is a byte[] in the following code)

  if (spacingText == 0 && (i + codeLength) < (string.length - 1)) { continue; }

should be changed to

  if (spacingText == 0 && (i + codeLength) < string.length) { continue; }
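To see why the upper bound matters, here is a minimal, self-contained sketch of a buffer-and-flush loop using the same condition. It is a simplified model, not the PDFBox source: the class FlushBoundaryDemo, the method chunks, and the useFixedBound flag are hypothetical names, and it assumes one byte per character and zero extra spacing after every character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a buffer-and-flush loop; not PDFBox source code.
class FlushBoundaryDemo {

    // Returns the chunks the loop would emit for the given bytes.
    // useFixedBound selects the corrected bound (string.length)
    // instead of the original one (string.length - 1).
    static List<String> chunks(byte[] string, boolean useFixedBound) {
        List<String> out = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int codeLength = 1;     // assumption: single-byte character codes only
        float spacingText = 0;  // assumption: no character or word spacing
        int bound = useFixedBound ? string.length : string.length - 1;

        for (int i = 0; i < string.length; i += codeLength) {
            buffer.append((char) string[i]);
            if (spacingText == 0 && (i + codeLength) < bound) {
                continue; // more characters follow, keep accumulating
            }
            // Otherwise emit what has been buffered so far as one chunk.
            out.add(buffer.toString());
            buffer.setLength(0);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] text = "How are you".getBytes(StandardCharsets.US_ASCII);
        System.out.println("original bound:  " + chunks(text, false));
        System.out.println("corrected bound: " + chunks(text, true));
    }
}
```

Under these assumptions, the original bound stops accumulating one character early, so the final character is emitted as its own chunk; with the corrected bound the whole string comes out as a single chunk.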

Regards, Maggie
