Java - PDFBox - Text Extraction
I am using PDFBox to extract text information from a PDF. I have successfully parsed all the properties of the text, such as font name, font face, size, position, etc.
Problem: I am using PDFBox 1.2.1 (the latest version). In the TextPosition class, getCharacter() returns the full string except the last character; the last character is parsed as a separate string.
For example: "How are you" is parsed as "How are yo" and "u" (2 separate strings).
I do not want this behavior.
Has anyone come across this? Am I doing something wrong? Waiting for an answer.
Thanks and Regards, Maggie
This problem has been resolved.
In PDFStreamEngine.java, in the code that processes the text (string is a byte[] in the following code), the line
if (spacingText == 0 && (i + codeLength) < (string.length - 1)) { continue; }
should be changed to
if (spacingText == 0 && (i + codeLength) < (string.length)) { continue; }
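To illustrate why the bound matters, here is a minimal, self-contained sketch. It is not the PDFBox source: the class FlushBoundSketch and the method group are hypothetical, and the sketch assumes single-byte character codes (codeLength = 1), a spacingText that is always 0, and an engine that keeps buffering characters while the condition holds and emits the buffer as one string once it fails.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the buffering loop; not the real PDFStreamEngine.
class FlushBoundSketch {

    // Groups the bytes of 'string' into emitted strings, using either the
    // original bound (string.length - 1) or the patched bound (string.length).
    static List<String> group(byte[] string, boolean patched) {
        List<String> emitted = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int spacingText = 0; // assumption: no extra spacing between glyphs
        int codeLength = 1;  // assumption: single-byte character codes
        int limit = patched ? string.length : string.length - 1;

        for (int i = 0; i < string.length; i += codeLength) {
            buffer.append((char) (string[i] & 0xFF));
            if (spacingText == 0 && (i + codeLength) < limit) {
                continue; // more characters follow, keep buffering
            }
            // Condition failed: emit what has been buffered so far.
            emitted.add(buffer.toString());
            buffer.setLength(0);
        }
        return emitted;
    }

    public static void main(String[] args) {
        byte[] text = "How are you".getBytes(StandardCharsets.US_ASCII);
        System.out.println(group(text, false)); // [How are yo, u]
        System.out.println(group(text, true));  // [How are you]
    }
}
```

With the original bound, the condition already fails one character before the end, so the buffer is emitted early and the final character ends up in a string of its own. With the patched bound, the buffer is only emitted after the last character has been appended.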
Regards, Maggie