I am loading a JSON file that contains some Unicode escape sequences that represent emoticons.
The JSON content looks as follows:
"Studying! \uf4d6"
"Winning \uf40e\uf3c1 #4mile"
"Cheer me on \uf603 #werunamsterdam"
These escape sequences are displayed as blocks in the browser. But when I look at this Unicode reference in Firefox, the codes are actually recognized!
(For example, U+F4D6 is a book.)
How do I convert the codes from my JSON so that a browser can display them?
The code points from \uE000 to \uF8FF are in a private use area, so there aren't any standard glyphs associated with them.
You can, however, create your own font with suitable icons at these code points. This can be done quite easily using online tools like IcoMoon. Alternatively, use a string replacement routine to swap these characters with suitable markup (e.g., replace \uf4d6 with <img src="/icons/book.png" alt="[Book]" />)
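If you preprocess the JSON text server-side, a replacement routine might look like the following Python sketch (the icon file names and the set of mapped code points are made up for illustration):

# Hypothetical mapping from private-use code points to markup.
ICON_MARKUP = {
    "\uf4d6": '<img src="/icons/book.png" alt="[Book]" />',
    "\uf603": '<img src="/icons/smile.png" alt="[Smile]" />',
}

def replace_pua_icons(text):
    # Swap each known private-use character for its <img> tag.
    for codepoint, markup in ICON_MARKUP.items():
        text = text.replace(codepoint, markup)
    return text

print(replace_pua_icons("Studying! \uf4d6"))
# Studying! <img src="/icons/book.png" alt="[Book]" />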
These emoticons are encoded as regular characters as defined in Unicode, i.e. they're no different from the letter "A" or the "%" sign. All you need is a font that has glyphs for these "characters". Since not everyone can be expected to have such fonts installed (apparently you don't), if you want maximum compatibility, there are libraries for most languages that replace these characters with equivalent images. Google for one that suits your needs.
I'm creating a PDF with a large collection of quotes that I've imported into Python with docx2python, using html=True so that they have some tags. I've done some processing so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF as UTF-8. To write with .write_html(), I had to create a new class as shown in their Read the Docs under the .write_html() method, at the very bottom of the left-hand side:
from fpdf import FPDF, HTMLMixin

class htmlFPDF(FPDF, HTMLMixin):
    pass

pdf = htmlFPDF()
pdf.add_page()

# set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')

pdf.write_html(quote)  # (screenshot in original: a section of quote giving trouble with quotations)
The list of quotes that I have going into the PDF all appear with their special characters and the HTML tags (<u> or <i>) in the debugger, but after the .write_html() step they show up in the PDF file as mojibake, even before being saved, as seen through the debugger. An example is "dayâ€™s demands", when it should be "day's demands" (the apostrophe is a curly right single quote in the original quote, but this text box doesn't support it).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix the mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italic tags into the PDF as literal text. The .write() method does preserve the special characters. I'm not really trying to change the font, but to make sure that what's written onto the PDF preserves the special characters instead of mojibaking them.
I've also attempted some .encode()/.decode() calls before going into .write_html(), as well as some methods from the ftfy library, and I tried adding '' to the start of each quote, to no effect.
If anyone has ideas for a way to iterate through each line on the PDF, that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally it would be some other HTML tag at the start of each quote, or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or maybe I'm at a dead end and should just split each quote on '<', use if statements to detect underline, italics, etc., and use the .write() method after all.
Extracting docx to HTML works really badly with docx2python. I did this a few months ago. I recommend PyDocX instead. docx2python is good for extracting content from a docx file, not for converting it into HTML.
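For what it's worth, a minimal PyDocX sketch (the file name quotes.docx is made up; check that the tags it emits match what write_html expects):

from pydocx import PyDocX

# Convert the whole document to an HTML string; the exact tags produced
# may differ from docx2python's output, so inspect them before feeding
# the result to the PDF writer.
html = PyDocX.to_html("quotes.docx")

with open("quotes.html", "w", encoding="utf-8") as f:
    f.write(html)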
As the title says, I have a special requirement to extract text line by line, or block by block between BT and ET.
Below is the PDF content. I tried the PDFTextStripper class, but it is not what I want,
so does anyone have a solution to this problem?
I want to parse content like this: [ (=\324Z\016) ] TJ.......
this is my pdf: https://dl.dropboxusercontent.com/u/63353043/docu.pdf
below is my code:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s -> {
    if (s instanceof COSString) {
        System.out.print(s.toString());
    }
});
But I get this:
COSString{&Ð}COSString{O6}COSString{&³}COSString{p»}COSString{6±}COSString{˛¨}COSString{+^}COSString{+·}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{˛¨}COSString{Fœ}COSString{0ł}COSString{Q¯}COSString{#”}COSString{˛¨}COSString{+^}COSString{(Ï}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{Zo}COSString{#°}COSString{˜}COSString{p»}COSString{#Š}COSString{5×}COSString{,
}COSString{:É}COSString{(Ù}COSString{4ÿ}COSString{ä}COSString{_Á}COSString{˛¨}COSString{:É}COSString{p»}COSString{O©}COSString{en}COSString{#p}COSString{/F}COSString{O©}COSString{en}COSString{F,}COSString{_N}COSString{!}COSString{9»}COSString{]˘}COSString{!¢}COSString{˜.}COSString{p»}COSString{#°}COSString{˜}COSString{#p}COSString{<:}COSString{Zo}COSString{1¸}COSString{ä}COSString{˚~}COSString{F³}COSString{!Ø}COSString{]Š}COSString{2}COSString{6±}COSString{˛¨}COSString{gî}COSString{+·}COSString{9á}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{˚.}COSString{p»}COSString{5d}COSString{5×}COSString{]˘}COSString{_ö}COSString{#c}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{#b}COSString{˚.}COSString{eÚ}COSString{;
}COSString{!5}COSString{:É}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{p»}COSString{B!}COSString{&Ø}COSString{,}COSString{/F}COSString{^r}COSString{˛²}COSString{2&}COSString{˜.}COSString{N}COSString{*ø}COSString{˜.}COSString{&¢}COSString{+B}COSString{+·}COSString{9á}COSString{ä}COSString{ZX}COSString{˚}COSString{˛µ}COSString{6«}COSString{0ł}COSString{!Ø}COSString{p»}COSString{O©}COSString{en}COSString{F,}COSString{Q±}COSString{"}}COSString{XS}COSString{&Ð}COSString{_N}COSString{!}COSString{9»}COSString{˛²}COSString{F,}COSString{(Ï}COSString{O©}COSString{en}COSString{F³}COSString{/?}COSString{˛¨}COSString{=}COSString{˚4}COSString{9}COSString{p»}COSString{;}COSString{5×}COSString{_Á}COSString{˜³}COSString{#É}COSString{0·}COSString{F,}COSString{Q±}COSString{"}}COSString{p»}COSString{1û}COSString{"}}COSString{˚.}COSString{O©}COSString{en}COSString{p»}COSString{#°}COSString{˜}COSString{Lê}COSString{5d}COSString{Zo}COSString{1¸}COSString{"G}COSString{˚.}COSString{ä}COSString{&Ð}COSString{Kå}COSString{Yª}COSString{#°}COSString{#´}COSString{5ê}COSString{p»}COSString{(Ï}COSString{O©}COSString{en}COSString{ZR}COSString{pÉ}COSString{p„}COSString{˜}COSString{G“}COSString{_û}COSString{%v}COSString{pÎ}COSString{1û}COSString{"}}COSString{1¹}COSString{F,}COSString{˛µ}COSString{5×}COSString{˜}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{O´}COSString{5×}COSString{#b}COSString{˚.}COSString{+·}COSString{9á}COSString{p»}COSString{˜}COSString{]y}COSString{/0}COSString{}COSString{2§}COSString{˚.}COSString{˛¨}COSString{8E}COSString{N}COSString{*ø}COSString{22}COSString{++}COSString{&¢}COSString{+B}COSString{)%}COSString{ä}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{f¨}COSString{Y)}COSString{.}COSString{"Q}COSString{5ê}COSString{p»}COSString{)*}COSString{7D}COSString{˛¨}COSString{˜³}COSString{˚b}COSString{P¥}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{˛µ}COSString{G“}COSString{0m}COSString{F,}COSString{0m}COSString{<i}COSString{˛³}COSString{p»}COSString{;fi}COSString{˛µ}COSString{BÞ}COSString{\}COSString{F,}COSString{BO}COSString{Bˆ}COSString{Q™}COSString{-Ž}COSString{F,}COSString{!Ñ}COSString{Fr}COSString{p»}COSString{$™}COSString{/½}COSString{BO}COSString{Bˆ}COSString{F,}COSString{#™}COSString{5×}COSString{˛¨}COSString{nƒ}COSString{nƒ}COSString{p»}COSString{˚}COSString{5×}COSString{f‰}COSString{P¥}COSString{#Š}COSString{\\}COSString{F,}COSString{p°}COSString{1¹}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Q¯}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{#°}COSString{˜}COSString{p»}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{˚}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Cˆ}COSString{/?}COSString{1¸}COSString{"G}COSString{p°}COSString{p“}COSString{/4}COSString{˜.}COSString{p»}COSString{+·}COSString{O©}COSString{en}COSString{F,}COSString{˚3}COSString{9}COSString{7D}COSString{#Þ}COSString{DÙ}COSString{;}COSString{T}COSString{T`}COSString{5“}COSString{˛²}COSString{p»}COSString{]2}COSString{ }COSString{]2}COSString{(Ï}COSString{p°}COSString{:}COSString{6«}COSString{Må}COSString{&Ð}COSString{˛µ}COSString{M;}COSString{0·}COSString{e;}COSString{O«}COSString{aw}COSString{˚b}COSString{F,}COSString{FÇ}COSString{ZH}
(If this wasn't so long, it would have been more appropriate as a comment to the question instead of as an answer. But comments are too limited.)
The OP in his question shows that he essentially wants to parse the content stream of e.g. a page and extract the strings drawn in a legible form. He attempts this by simply taking the tokens in the content stream and looking at the COSString instances in there:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s -> {
    if (s instanceof COSString) {
        System.out.print(s.toString());
    }
});
Unfortunately the output looks like a mess.
Why do the string values in the content stream look so messy?
The reason for this is that those COSString instances represent the PDF string objects as they are, and that there is no single encoding of PDF string objects in content streams, not even a limitation to a few standardized ones.
The encoding of a string completely depends on the definition of the font currently active when the string drawing instruction in question is executed.
Fonts in PDFs can be defined to use either some standard encoding or a custom one, and it is very common, in particular in the case of embedded font subsets, to use custom encodings constructed similar to this:
code 1 maps to the glyph of this font which is used first on the page,
code 2 maps to the second glyph of this font used on the page which is not identical to the first,
code 3 maps to the third glyph of this font used on the page which is not identical to either of the first two,
etc.
Obviously there is no good way to conjecture the meaning of string bytes for such encodings.
Thus, when parsing the content stream, you have to keep track of the current font and look up the meaning of each byte (or multi-byte sequence!) of a COSString in the definition of that current font in the resource dictionary of the current page.
How to map those messy bytes using the current font definition?
The encoding of a PDF font might have to be determined in different ways.
There may be a ToUnicode map in the font definition which shows you which bytes to map to which character.
Otherwise the encoding may be a standard encoding like MacRomanEncoding or WinAnsiEncoding in which case one has to bring along the mapping table oneself (they are printed in the PDF specification).
Otherwise the encoding might be based on such a standard encoding but with deviations given by a mapping from codes to glyph names. If certain standard names are used, the character can be derived from that name; these standard names are listed in another document, the Adobe Glyph List.
Otherwise some CIDSystemInfo entry may point to yet another standard Registry and Ordering from which to derive a mapping table specified in other documents.
Otherwise the font program itself may include usable mappings to Unicode.
Otherwise ???
Any pitfalls to evade?
The current font is a PDF graphics state attribute. Thus, one not only has to remember the most recently set font but also consider the effects of operations changing the whole graphics state, in particular the save-graphics-state and restore-graphics-state operations, which push the current graphics state onto a stack or pop it back from there.
How can PDF libraries help you?
PDF libraries that support text extraction help you by doing all the heavy lifting for you:
parsing the content stream,
keeping track of the graphics state,
determining the encoding of the current font,
translating any drawn PDF strings using that encoding,
and forwarding only the resulting characters plus some extra data (current position on the page, text drawing orientation, font and font size, colors, and other effects) to you.
In the case of PDFBox 2.0.x, this is what the PDFTextStreamEngine class does for you: you only have to override its processTextPosition method, to which PDFBox forwards that enriched character information in the TextPosition parameter. As you also want to know the starts and ends of the BT ... ET text object envelopes, you also have to override beginText and endText.
The PDFTextStripper class is based on that class; it collects and sorts those bits of character information to build a string containing the page text, which it eventually returns.
In PDFBox 1.8.x there was a very similar PDFTextStripper class, but the base functionality was not as cleanly separated into a base class; everything was somewhat more intermingled, and it was harder to implement one's own extraction logic.
(In other PDF libraries there are similar constructs, sometimes event-based as in PDFBox, sometimes as a collected sequence of TextPosition-like objects.)
How can I use, for example, the glyph name "rcaron.terminal", which has no Unicode value, in HTML? Or any other such case? Is it even possible? I think it surely must be, but I have no clue. It's easy for regular letters like the glyph "ß", where I would just type "ß" or "&szlig;" (same result) and get that character, but for glyphs without any Unicode value I don't know what I'm supposed to do. I've also tried "&rcaron.terminal" but got nothing, whereas something like "&hearts;" would work, giving a heart glyph of god knows what font, probably Arial, I dunno.
Do I need to state some specific encoding aside from ANSI in my HTML document?
i.e. <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-8"> or something... I'm really lost, lol.
All I found on the net was this: http://text-symbols.com/html/unicode/ but I can't find any more info, so I came here.
Please help! Thanks! :)
There is no way in HTML to reference a glyph that does not have a Unicode code point.
If you really need to have a glyph which is not representable using regular Unicode, you might want to create a font of your own and define the glyphs you need in the private use area; but obviously, then, your HTML will be impossible to use without that particular font.
Background links:
http://arstechnica.com/information-technology/2008/10/embedded-web-fonts/
http://www.font-face.com/
Practical guides:
http://blog.fogcreek.com/trello-uses-an-icon-font-and-so-can-you/
http://blogs.atlassian.com/2013/07/how-to-make-an-icon-font-the-8-step-guide/
First, navigate to this site: https://fontdrop.info/#/?darkmode=true
Upload your font file.
Click on the Ligatures tab.
Every glyph should have a Components field.
Copy the components for the character you want to use.
Paste that string into your HTML.
You don't need any & or #; the string is simply detected and converted to the glyph.
I pretty much built this website in Firebug; then, when I copied the code into a text document and tried loading it, Firefox wouldn't interpret the "ΧΨ" in the source. However, it does a fantastic job using them while I'm typing this.
What's up with that?
You can't just type any character into an HTML document; it must be valid in the document's character encoding, and if it isn't, you should use the proper character code. See this list:
http://htmlhelp.com/reference/html40/entities/symbols.html
You can use the entity, decimal, or hex form to represent your character, like this:
<p>&Chi;&Psi;</p>
That's the HTML representation of "ΧΨ".
Cheers
When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, ® as I do not have control over the input editor's generated markup)
%AE is not valid HTML-safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Windows encoding (windows-1252, or something like that); it really will NOT convert to something HTML-safe unless you do some sort of translation in the middle.
The byte 0xAE is the ISO-8859-1 representation of the registered trademark symbol. If you don't see anything, then apparently the URL decoder is using another charset to URL-decode it. In UTF-8, for example, this byte on its own does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or convert the existing data so that it is URL-encoded using UTF-8.
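In Python, for instance, the difference is easy to demonstrate with urllib.parse.unquote (a sketch, not the OP's actual server-side code):

from urllib.parse import unquote

# Decoding %AE as ISO-8859-1 yields the registered trademark sign.
print(unquote("Acme%AE", encoding="iso-8859-1"))  # Acme®

# Decoding it as UTF-8 fails, because the single byte 0xAE is not a
# valid UTF-8 sequence; the default errors='replace' gives U+FFFD.
print(unquote("Acme%AE", encoding="utf-8"))       # Acme�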
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the ® character in an HTML page, you have two options: either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use entity codes, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities -- i.e., use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
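If the page is generated in Python, for example, the standard library can do the whole conversion in one pass (a sketch; adapt it to wherever your output text is produced):

text = "Registered® and café"

# xmlcharrefreplace turns every non-ASCII character into a numeric
# character reference such as &#174;, leaving plain ASCII untouched.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # Registered&#174; and caf&#233;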
What server side language are you using? Check for a URL Decode function.
If you are using PHP you can use urldecode(), but you should be careful about + characters.