Java - PDFBox - Text Extraction


I am using PDFBox to extract text information from a PDF. I have successfully parsed all the properties of the text, such as font name, font face, size, position, etc.

Problem: I am using PDFBox 1.2.1 (the latest version). In the TextPosition class, getCharacter() returns the full string except the last character; the last character is parsed as a separate string.

Example: "How are you" is parsed as "How are yo" and "u" (2 separate strings).

I do not want it to behave like that.

Has anyone come across this? Am I doing something wrong? Waiting for an answer.

Thanks and Regards, Maggie

This problem has been resolved.

The following condition in PDFStreamEngine.java (where string is a byte[])

  if (spacingText == 0 && (i + codeLength) < (string.length - 1)) { continue; }

should be changed to

  if (spacingText == 0 && (i + codeLength) < string.length) { continue; }
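
To see why the bound matters, here is a minimal sketch of a buffering loop that uses the same condition. It is a simplified stand-in, not the actual PDFBox source: the class name FlushBoundaryDemo, the method chunks and the flag useOriginalBound are made up for illustration, and it assumes spacingText is always 0 and codeLength is always 1 (a single-byte encoding).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a buffering loop; NOT the actual PDFBox source.
class FlushBoundaryDemo {

    // Splits the bytes of one string into chunks. Characters are accumulated
    // while the "continue" condition holds and flushed as one chunk otherwise.
    // Assumption: spacingText == 0 and codeLength == 1 for every character.
    static List<String> chunks(byte[] string, boolean useOriginalBound) {
        List<String> out = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int spacingText = 0;
        int codeLength = 1;
        // Original condition compares against length - 1, the fix against length.
        int bound = useOriginalBound ? string.length - 1 : string.length;

        for (int i = 0; i < string.length; i += codeLength) {
            buffer.append((char) string[i]);
            if (spacingText == 0 && (i + codeLength) < bound) {
                continue; // more characters follow: keep accumulating
            }
            out.add(buffer.toString()); // flush what has been collected so far
            buffer.setLength(0);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] text = "How are you".getBytes(StandardCharsets.US_ASCII);
        System.out.println(chunks(text, true));  // original bound
        System.out.println(chunks(text, false)); // corrected bound
    }
}
```

With the original bound the loop stops accumulating one character early, so the buffer is flushed before the last character and that character ends up in its own chunk. With the corrected bound the flush happens only after the final character.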

Regards, Maggi

