How to extract pdf text from BT and ET section line by line and how to decode Tj block to unicode in pdfbox - extract

As the title, I have the special requirement to extract text line by line or block by block of BT and ET.
below is the pdf content, I tried PDFTextstripper class, but it is not what I want,
so any one has the solution to resolve the problem?
I wanna parse this : [ (=\324Z\016) ] TJ.......
this is my pdf: https://dl.dropboxusercontent.com/u/63353043/docu.pdf
below is my code:
enter coList<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s->{
if(s instanceof COSString){
System.out.print(s.toString());
}
but, I get thoes:
COSString{&Ð}COSString{O6}COSString{&³}COSString{p»}COSString{6±}COSString{˛¨}COSString{+^}COSString{+·}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{˛¨}COSString{Fœ}COSString{0ł}COSString{Q¯}COSString{#”}COSString{˛¨}COSString{+^}COSString{(Ï}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{Zo}COSString{#°}COSString{˜}COSString{p»}COSString{#Š}COSString{5×}COSString{,
}COSString{:É}COSString{(Ù}COSString{4ÿ}COSString{ä}COSString{_Á}COSString{˛¨}COSString{:É}COSString{p»}COSString{O©}COSString{en}COSString{#p}COSString{/F}COSString{O©}COSString{en}COSString{F,}COSString{_N}COSString{!}COSString{9»}COSString{]˘}COSString{!¢}COSString{˜.}COSString{p»}COSString{#°}COSString{˜}COSString{#p}COSString{<:}COSString{Zo}COSString{1¸}COSString{ä}COSString{˚~}COSString{F³}COSString{!Ø}COSString{]Š}COSString{2}COSString{6±}COSString{˛¨}COSString{gî}COSString{+·}COSString{9á}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{˚.}COSString{p»}COSString{5d}COSString{5×}COSString{]˘}COSString{_ö}COSString{#c}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{#b}COSString{˚.}COSString{eÚ}COSString{;
}COSString{!5}COSString{:É}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{p»}COSString{B!}COSString{&Ø}COSString{,}COSString{/F}COSString{^r}COSString{˛²}COSString{2&}COSString{˜.}COSString{N}COSString{*ø}COSString{˜.}COSString{&¢}COSString{+B}COSString{+·}COSString{9á}COSString{ä}COSString{ZX}COSString{˚}COSString{˛µ}COSString{6«}COSString{0ł}COSString{!Ø}COSString{p»}COSString{O©}COSString{en}COSString{F,}COSString{Q±}COSString{"}}COSString{XS}COSString{&Ð}COSString{_N}COSString{!}COSString{9»}COSString{˛²}COSString{F,}COSString{(Ï}COSString{O©}COSString{en}COSString{F³}COSString{/?}COSString{˛¨}COSString{=­}COSString{˚4}COSString{9}COSString{p»}COSString{;}COSString{5×}COSString{_Á}COSString{˜³}COSString{#É}COSString{0·}COSString{F,}COSString{Q±}COSString{"}}COSString{p»}COSString{1û}COSString{"}}COSString{˚.}COSString{O©}COSString{en}COSString{p»}COSString{#°}COSString{˜}COSString{Lê}COSString{5d}COSString{Zo}COSString{1¸}COSString{"G}COSString{˚.}COSString{ä}COSString{&Ð}COSString{Kå}COSString{Yª}COSString{#°}COSString{#´}COSString{5ê}COSString{p»}COSString{(Ï}COSString{O©}COSString{en}COSString{ZR}COSString{pÉ}COSString{p„}COSString{˜}COSString{G“}COSString{_û}COSString{%v}COSString{pÎ}COSString{1û}COSString{"}}COSString{1¹}COSString{F,}COSString{˛µ}COSString{5×}COSString{˜}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{O´}COSString{5×}COSString{#b}COSString{˚.}COSString{+·}COSString{9á}COSString{p»}COSString{˜}COSString{]y}COSString{/0}COSString{}COSString{2§}COSString{˚.}COSString{˛¨}COSString{8E}COSString{N}COSString{*ø}COSString{22}COSString{++}COSString{&¢}COSString{+B}COSString{)%}COSString{ä}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{f¨}COSString{Y)}COSString{.}COSString{"Q}COSString{5ê}COSString{p»}COSString{)*}COSString{7D}COSString{˛¨}COSString{˜³}COSString{˚b}COSString{P¥}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{˛µ}COSString{G“}COSString{0m}COSString{F,}COSString{0m}COSString{<i}COSString{˛³}COSString{p»}COSString{;fi}COSString{˛µ}COSString{BÞ}COSString{\}COSString{F,}COSString{BO}COSString{Bˆ}COSString{Q™}COSString{-Ž}COSString{F,}COSString{!Ñ}COSString{Fr}COSString{p»}COSString{$™}COSString{/½}COSString{BO}COSString{Bˆ}COSString{F,}COSString{#™}COSString{5×}COSString{˛¨}COSString{nƒ}COSString{nƒ}COSString{p»}COSString{˚}COSString{5×}COSString{f‰}COSString{P¥}COSString{#Š}COSString{\\}COSString{F,}COSString{p°}COSString{1¹}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Q¯}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{#°}COSString{˜}COSString{p»}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{˚}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Cˆ}COSString{/?}COSString{1¸}COSString{"G}COSString{p°}COSString{p“}COSString{/4}COSString{˜.}COSString{p»}COSString{+·}COSString{O©}COSString{en}COSString{F,}COSString{˚3}COSString{9}COSString{7D}COSString{#Þ}COSString{DÙ}COSString{;}COSString{T}COSString{T`}COSString{5“}COSString{˛²}COSString{p»}COSString{]2}COSString{ }COSString{]2}COSString{(Ï}COSString{p°}COSString{:}COSString{6«}COSString{Må}COSString{&Ð}COSString{˛µ}COSString{M;}COSString{0·}COSString{e;}COSString{O«}COSString{aw}COSString{˚b}COSString{F,}COSString{FÇ}COSString{ZH}

(If this wasn't so long, it would have been more appropriate as a comment to the question instead of as an answer. But comments are too limited.)
The OP in his question shows that he essentially wants to parse the content stream of e.g. a page and extract the strings drawn in a legible form. He attempts this by simply taking the tokens in the content stream and looking at the COSString instances in there:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s->{
if (s instanceof COSString) {
System.out.print(s.toString());
}
});
Unfortunately the output looks like a mess.
Why do the string values in the content stream look so messy?
The reason for this is that those COSString instances represent the PDF string objects as they are, and that there is no single encoding of PDF string objects in content streams, not even a limitation to a few standardized ones.
The encoding of a string completely depends on the definition of the font currently active when the string drawing instruction in question is executed.
Fonts in PDFs can be defined to use either some standard encoding or a custom one and it is very common, in particular in case of embedded font subsets, to use custom encodings mapping constructed similar to this:
the code 1 to the glyph of this font which is first used on the page,
the code 2 on the second glyph of this font on the page which is not identical to the first,
the code three to the third glyph of this font on the page not identical to either of the first two,
etc...
Obviously there is no good to conjecture the meaning of string bytes for such encodings.
Thus, when parsing the content stream, you have to keep track of the current font and look up the meaning of each byte (or multi byte sequence!) of a COSString in the definition of that current font in the resource dictionary of the current page.
How to map those messy bytes using the current font definition?
The encoding of a PDF font might have to be determined in different ways.
There may be a ToUnicode map in the font definition which shows you which bytes to map to which character.
Otherwise the encoding may be a standard encoding like MacRomanEncoding or WinAnsiEncoding in which case one has to bring along the mapping table oneself (they are printed in the PDF specification).
Otherwise the encoding might be based on such a standard encoding but deviations are given by a mapping from codes to names of glyphs. If certain standard names are used, the character can be derived from that name. These names are listed in another document.
Otherwise some CIDSystemInfo entry may point to yet another standard Registry and Ordering from which to derive a mapping table specified in other documents.
Otherwise the font program itself may include usable mappings to Unicode.
Otherwise ???
Any pitfalls to evade?
The current font is a PDF graphics state attribute. Thus, one does not only have to remember the most recently set font but also consider the effects of operations changing the whole graphics state, in particular the save-graphics-state and restore-graphics-state operations which push the current graphics state onto a stack or pop it there-from.
How can PDF libraries help you?
PDF libraries which support you in text extraction can do so by doing all the heavy lifting for you,
parsing the content stream,
keeping track of the graphics state,
determining the encoding of the current font,
translating any drawn PDF strings using that encoding,
and only forward the resulting characters and some extra data (current position on the page, text drawing orientation, font and font size, colors, and other effects) to you.
In case of PDFBox 2.0.x, this is what the PDFTextStreamEngine class does for you: you only have to override its processTextPosition method to which PDFBox forwards those enriched character information in the TextPosition parameter. As you also want to know the starts and ends of the BT ... ET text object envelopes, you also have to override beginText and endText.
The class PDFTextStripper is based on that class and collects and sorts those character information bits to build a string containing the page text which it eventually returns.
In PDFBox 1.8.x there was a very similar PDFTextStripper class but the base methods were not that properly separated in a base class, everything was somewhat more intermingled and it was harder to implement one's own extraction ways.
(In other PDF libraries there are similar constructs, sometimes event based like in PDFBox, sometimes as a collected sequence of Textposition-like objects.)

Related

write_html() method in fpdf not using font/encoding specified

I'm creating a PDF with a large collection of quotes that I've imported into python with docx2python, using html=True so that they have some tags. I've done some processing to them so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF to UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method at the very bottom of the left hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote) #[![a section of quote giving trouble with quotations][2]][2]
The list of quotes that I have going into the pdf all appear with their special characters and the html tags (<u> or <i>) in the debugger, but after the .write_html() step they then show up in the pdf file with mojibake, even before being saved, as seen through debugger. An example being "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italicize tags into the PDF as text. The .write() method does preserve the special characters. I'm not trying to change the font really, but make sure that what's written onto the PDF preserves the special characters instead of mojibake them.
I've also attempted some .encode/.decode action before going into the .write_html(), as well as attempted some methods from the ftfy library. And tried adding '' to the start of each quote to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or if I'm at a dead-end and should just split each quote on '<', use if statements to detect underlines, italicize, etc., and use the .write() method after all.
Extract docx to html works really bad with docx2python. I do this few month ago. I recommend PyDocX. docx2python are good for docx file content extracting, not converting it into a html.

Displaying UTF-8 codes from JSON file as Emoticons

I am loading a JSON file that contains some UTF-8 codes, that represent emoticons.
The JSON content looks as follows:
"Studying! \uf4d6"
"Winning \uf40e\uf3c1 #4mile"
"Cheer me on \uf603 #werunamsterdam"
These UTF-8 codes are displayed as blocks in the browser. But when I look at this Unicode reference in Firefox, the codes are actually recognized!
(for example, UF4D6 is a book)
How do I convert the code from my json so that a browser can display them?
The code points from \uE000 to \uF8FF are in a private use area, so there aren't any standard glyphs associated with them.
You can, however, create your own font with suitable icons at these code points. This can be done quite easily using online tools like IcoMoon. Alternatively, use a string replacement routine to swap these characters with suitable markup (e.g., replace \uf4d6 with <img src="/icons/book.png" alt="[Book]" />)
These emoticons are encoded as regular characters as defined in Unicode, i.e. they're no different from the letter "A" or "%". All you need is a font that has glyphs for these "characters". Since not everyone can be expected to have such fonts installed (apparently you don't), if you want maximum compatibility, there are libraries for most languages that replace these characters with equivalent images. Google for one that suits your needs.

Mapping plain text back into HTML document

Situation: I have a group of strings that represent Named Entities that were extracted from something that used to be an HTML doc. I also have both the original HTML doc, the stripped-of-all-markup plain text that was fed to the NER engine, and the offset/length of the strings in the stripped file.
I need to annotate the original HTML doc with highlighted instances of the NEs. To do that I need to do the following:
Find the start / end points of the NE strings in the HTML doc. Something that resulted in a DOM Range Object would probably be ideal.
Given that Range object, apply a styling (probably using something like <span class="ne-person" data-ne="123">...</span>) to the range. This is tricky because there is no guarantee that the range won't include multiple DOM elements (<a>, <strong>, etc.) and the span needs to start/stop correctly within each containing element so I don't end up with totally bogus HTML.
Any solutions (full or partial) are welcome. The back-end is mostly Python/Django, and the front-end is using jQuery. We would rather do this on the back-end, but I'm open to anything.
(I was a bit iffy on how to tag this question, so feel free to re-tag it.)
Use a range utility method plus an annotation library such as one of the following:
artisan.js
annotator.js
vie.js
The free software Rangy JavaScript library is your friend. Regarding your two tasks:
Find the start / end points of the […] strings in the HTML doc. You can use Range#findText() from the TextRange extension. It indeed results in a DOM Level 2 Range compatible object [source].
Given that Range object, apply a styling […] to the range. This can be handled with the Rangy Highlighter module. If necessary, it will use multiple DOM elements for the highlighting to keep up a DOM tree structure.
Discussion: Rangy is a cross-browser implementation of the DOM Level 2 range utility methods proposed by #Paul Sweatte. Using an annotation library would be a further extension on range library functionality; for example, Rangy will be the basis of Annotator 2.0 [source]. It's just not required in your case, since you only want to render highlights, not allow users to add them.

What characters are allowed in the HTML Name attribute inside input tag?

I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.
I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?
Note, that not all characters are submitted for name attributes of form fields (even when using POST)!
White-space characters are trimmed and inner white-space characters as well the character . are replaced by _.
(Tested in Chrome 23, Firefox 13 and Internet Explorer 9, all Win7.)
Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
Allain quoted W3 from the HTML4 spec:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.
However this isn't really true in practice.
The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.
Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)
And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.
So the reality of the situation is there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers will do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: it encodes them using the encoding the page containing the form used. Non-ASCII GET form names are no more broken than everything else.
DLH:
So name has a different data type for than it does for other elements?
Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.
However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).
The only real restriction on what characters can appear in form control names is when a form is submitted with GET
"The "get" method restricts form data set values to ASCII characters." reference
There's a good thread on it here.
While Allain's comment did answer OP's direct question and bobince provided some brilliant in-depth information, I believe many people come here seeking answer to more specific question: "Can I use a dot character in form's input name attribute?"
As this thread came up as first result when I searched for this knowledge I guessed I may as well share what I found.
Firstly, Matthias' claimed that:
character . are replaced by _
This is untrue. I don't know if browser's actually did this kind of operation back in 2013 - though, I doubt that. Browsers send dot characters as they are(talking about POST data)! You can check it in developer tools of any decent browser.
Please, notice that tiny little comment by abluejelly, that probably is missed by many:
I'd like to note that this is a server-specific thing, not a browser thing. Tested on Win7 FF3/3.5/31, IE5/7/8/9/10/Edge, Chrome39, and Safari Windows 5, and all of them sent " test this.stuff" (four leading spaces) as the name in POST to the ASP.NET dev server bundled with VS2012.
I checked it with Apache HTTP server(v2.4.25) and indeed input name like "foo.bar" is changed to "foo_bar". But in a name like "foo[foo.bar]" that dot is not replaced by _!
My conclusion: You can use dots but I wouldn't use it as this may lead to some unexpected behaviours depending on HTTP server used.
Do you mean the id and name attributes of the HTML input tag?
If so, I'd be very tempted to restrict (or convert) allowed "input" name characters into only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.
Additionally, why let the user control any aspect of the input tag? (Might it not ultimately be easier from a validation perspective to keep the input tag names are 'custom_1', 'custom_2', etc. and then map these as required.)

Data URIs in GWT

Is it possible to create data URI's in GWT?
I want to inject a byte array image as an actual image using a data URI.
You should checkout ClientBundle in GWT's trunk. It will create data urls automatically for browsers that support them and fallbacks for that other browser: http://code.google.com/p/google-web-toolkit/wiki/ClientBundle
The feature won't ship until GWT 2.0, but it's in heavy use now.
Yes. It is completely possible to do this. I'd done it for an application until I realized IE6 doesn't handle binary data streams this way. You can do it in several ways. For the purposes of my example, I'm already assuming that you've converted the byte array to a string somewhere, and that it is properly encoded and of the proper type for your data URI. I'm also assuming you know the basic format (or can find it) of your chosen data scheme.
I've taken these examples from the Wikipedia article on data URI scheme.
The first is to just use raw HTML to make the image reference as you normally would and have it inserted into the page.
HTML html = new HTML("<img src=\"data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAABGdBTUEAALGP
C/xhBQAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9YGARc5KB0XV+IA
AAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAF1J
REFUGNO9zL0NglAAxPEfdLTs4BZM4DIO4C7OwQg2JoQ9LE1exdlYvBBeZ7jq
ch9//q1uH4TLzw4d6+ErXMMcXuHWxId3KOETnnXXV6MJpcq2MLaI97CER3N0
vr4MkhoXe0rZigAAAABJRU5ErkJggg==\" alt=\"Red dot\">");
You can also just use an image. (Which should produce roughly the same output HTML/JS.)
Image image = new Image("data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAABGdBTUEAALGP
C/xhBQAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9YGARc5KB0XV+IA
AAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAF1J
REFUGNO9zL0NglAAxPEfdLTs4BZM4DIO4C7OwQg2JoQ9LE1exdlYvBBeZ7jq
ch9//q1uH4TLzw4d6+ErXMMcXuHWxId3KOETnnXXV6MJpcq2MLaI97CER3N0
vr4MkhoXe0rZigAAAABJRU5ErkJggg==");
This allows you to use the full power of the Image abstraction on top of your loaded image.
I'm still thinking that you may want to expand on this solution and use GWT's deferred binding mechanism to deal with browsers that do not support data URIs. (IE6,IE7)