write_html() method in fpdf not using font/encoding specified - html

I'm creating a PDF with a large collection of quotes that I've imported into Python with docx2python, using html=True so that they have some tags. I've done some processing so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF as UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method, at the very bottom of the left-hand side:
from fpdf import FPDF, HTMLMixin

class htmlFPDF(FPDF, HTMLMixin):
    pass

pdf = htmlFPDF()
pdf.add_page()
# set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote)  # (screenshot in original post: a section of quote giving trouble with quotations)
The list of quotes going into the PDF all appear with their special characters and the HTML tags (<u> or <i>) in the debugger, but after the .write_html() step they show up in the PDF file as mojibake, even before being saved, as seen through the debugger. An example is "dayâ€™s demands" when it should be "day's demands" (the apostrophe is a curly right single quote in the quote, but this textbox doesn't render it).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix the mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italics tags into the PDF as literal text. The .write() method does preserve the special characters. I'm not really trying to change the font, but to make sure that what's written onto the PDF preserves the special characters instead of mojibaking them.
I've also attempted some .encode/.decode action before going into .write_html(), as well as some methods from the ftfy library, and I tried adding '' to the start of each quote, all to no effect.
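For reference, this is roughly the kind of round-trip repair I attempted before calling .write_html() (a sketch; it assumes the mojibake is UTF-8 bytes that were mis-decoded as cp1252):

import ftfy

broken = "dayâ€™s demands"  # what shows up after .write_html()
# re-encode the mis-decoded text as cp1252, then decode the bytes as UTF-8
fixed = broken.encode('cp1252').decode('utf-8')  # "day's demands" with a curly apostrophe
# ftfy detects and undoes the same mangling automatically
fixed_ftfy = ftfy.fix_text(broken)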
If anyone has ideas for a way to iterate through each line on the PDF, that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally there would be some other HTML tag to put at the start of each quote, or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or am I at a dead end, and should I just split each quote on '<', use if statements to detect underline, italics, etc., and use the .write() method after all?

Extracting docx to HTML works really badly with docx2python; I did this a few months ago. I recommend PyDocX. docx2python is good for extracting docx file content, not for converting it into HTML.

Related

Strange symbol shows up on website (L SEP)?

I noticed on my website, http://www.cscc.org.sg/, there's this odd symbol that shows up.
It says L SEP. In the HTML code, it displays the same thing.
Can someone show me how to remove them?
That character is U+2028, or HTML entity code &#8232;, which is a kind of newline character. It's not actually supposed to be displayed. I'm guessing that either your server-side scripts failed to translate it into a new line, or you are using a font that displays it.
But, since we know the HTML and Unicode values for the character, we can add a few lines of jQuery that should get rid of the character. Right now, I'm just replacing it with a space in the code below. Just add this:
$(document).ready(function() {
    $("body").children().each(function() {
        $(this).html($(this).html().replace(/\u2028/g, " "));
    });
});
This should work, though please note that I have not tested it, and it may not work, as none of my browsers will display the character.
But if it doesn't, you can always try pasting your text block onto http://www.nousphere.net/cleanspecial.php which will remove any special characters.
Some fonts render LS as L SEP. Such a glyph is designed for unformatted presentations of the character, such as when viewing the raw characters of a file in a binary editor. In a formatted presentation, actual line spacing should be displayed instead of the glyph.
The problem is that neither the web server nor web browser are interpreting the LS as a newline. The web server could detect the LS and replace it with <br>. Such a feature would fit well with a web server that dynamically generates HTML anyway, but would add overhead and complexity to a web server that serves file contents without modification.
If a LS makes its way to the web browser, the web browser doesn't interpret it as formatting. Page formatting is based only on HTML tags. For example, LF and CR just affect formatting of the HTML source code, not the web page's formatting (except in <pre> sections). The browser could in principle interpret LS and PS (paragraph separator) as <br> and <p>, but the HTML standard doesn't tell browsers to do that. (It seems to me like it would be a good addition.)
To replace the raw LS character with the line separation that the content creator likely intended, you'll need to replace the LS characters with HTML markup such as <br>.
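If the server generates pages dynamically anyway, that replacement can be a one-line preprocessing step. A sketch (in Python here, but the same replace works in any language):

def replace_line_separators(markup):
    # swap the invisible separators for the markup the author likely intended
    return markup.replace('\u2028', '<br>').replace('\u2029', '<p>')

print(replace_line_separators('first line\u2028second line'))
# -> first line<br>second line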
This is the solution for the 'strange symbol' issue.
$(document).ready(function () {
    // Replace every U+2028 line separator in the page with a space
    document.body.innerHTML = document.body.innerHTML.replace(/\u2028/g, ' ');
});
The jQuery/JS solutions here work to remove the character, but they broke my Revolution Slider. I ended up doing a search-and-replace for the character on the wp_posts table with the Better Search Replace plugin: https://wordpress.org/plugins/better-search-replace/
When you copy-paste the character from a page into the plugin box, it is invisible, but it does work. Before doing DB replaces, always have a database (or full) backup ready! And be sure to uncheck the bottom checkbox so the plugin does a real replace instead of a dry run.

Sending an entire html page with phpmailer

I purchased an email template and customized it to what I want, but now I'm stuck on how to send it.
The template file is an entire HTML page with included fonts and even a stylesheet at the top (style tags actually, not an attached file) with media queries and some font settings.
Do I send all of that? Including header tags and more? It's basically an entire html page.
Here's how part of my email script is built now:
$mail->isHTML(true);
$texts = 'html template';
$mail->msgHTML($texts);
But I haven't tested my own code yet, because the template file is full of single quotes and double quotes, so the string assignment breaks. I tried replacing all single quotes with double quotes in the template string, but then my fonts stopped working (I tested remotely using this site: https://putsmail.com/).
PHPMailer itself doesn't do anything special with HTML content (though msgHTML reads your content to make a rough plain-text version). It sounds like you're just having trouble quoting inline content. You might like to try a different form of quoting called nowdoc, which tolerates all kinds of quotes without a problem:
$texts = <<<'EOT'
html template
EOT;
$mail->msgHTML($texts);
You could also save it in an external file and read it in when you need it, which also avoids quoting issues:
$mail->msgHTML(file_get_contents('template.html'));

Is there a way on include non-breaking space in csv import to InDesign?

I'm using InDesign's data merge to make playing cards. Is there a way to include non-breaking space in the data? I would like some words to be kept on same lines.
I tried copy-pasting a non-breaking space from InDesign and a web browser into the CSV file, without success. &nbsp; doesn't work either.
I would recommend using a dummy string in your source and doing a later replacement within InDesign (i.e. ##NBSP## > non-breaking space character).
Try using the Unicode character for nbsp in InDesign: U+00A0. If it is anything like InDesign scripting, it should be entered as \u00A0 or "\u00A0".
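If you build the CSV from a script rather than by hand, you can embed the literal U+00A0 character and save the file as UTF-8. A sketch in Python (the file and column names are invented):

import csv

# write the CSV as UTF-8 so the non-breaking space survives the round trip
with open('cards.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'RuleText'])
    # '\u00a0' is a literal non-breaking space, so "3 damage" stays on one line
    writer.writerow(['Fireball', 'Deal 3\u00a0damage'])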
You should combine your copy-and-paste with Einar's Unicode method. Meaning: copy a non-breaking space from InDesign and paste it into your CSV file. Then, when you read the CSV file, either via a script or using InDesign's Place option, make sure you read it as Unicode. That will make sure your Unicode characters get preserved, because the non-breaking space is a Unicode character.
I copied and pasted the non-breaking space from InDesign into the CSV file using Brackets. When I placed the file in InDesign using the Place option, I made sure I used UTF-8 encoding (in the import options dialog box), and it preserved the space for me.
Is that what you are trying to achieve? Does that help?
Thanks,
Abdul

Displaying UTF-8 codes from JSON file as Emoticons

I am loading a JSON file that contains some UTF-8 codes, that represent emoticons.
The JSON content looks as follows:
"Studying! \uf4d6"
"Winning \uf40e\uf3c1 #4mile"
"Cheer me on \uf603 #werunamsterdam"
These UTF-8 codes are displayed as blocks in the browser. But when I look at this Unicode reference in Firefox, the codes are actually recognized!
(for example, U+F4D6 is a book)
How do I convert the code from my json so that a browser can display them?
The code points from \uE000 to \uF8FF are in a private use area, so there aren't any standard glyphs associated with them.
You can, however, create your own font with suitable icons at these code points. This can be done quite easily using online tools like IcoMoon. Alternatively, use a string replacement routine to swap these characters with suitable markup (e.g., replace \uf4d6 with <img src="/icons/book.png" alt="[Book]" />)
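The string-replacement route could look something like this (a sketch in Python; the icon paths and the mapping are placeholders, not any standard):

# map private-use code points to markup (paths are invented)
ICONS = {
    '\uf4d6': '<img src="/icons/book.png" alt="[Book]" />',
    '\uf603': '<img src="/icons/grin.png" alt="[Grin]" />',
}

def replace_pua(text):
    # swap each private-use character for its image markup
    for char, markup in ICONS.items():
        text = text.replace(char, markup)
    return text

print(replace_pua('Studying! \uf4d6'))
# -> Studying! <img src="/icons/book.png" alt="[Book]" />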
These emoticons are encoded as regular characters as defined in Unicode, i.e. they're no different from the letter "A" or "%". All you need is a font that has glyphs for these "characters". Since not everyone can be expected to have such fonts installed (apparently you don't), if you want maximum compatibility, there are libraries for most languages that replace these characters with equivalent images. Google for one that suits your needs.

HTML encoding issues - "Â" character showing up instead of " "

I've got a legacy app just starting to misbehave, for whatever reason I'm not sure. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF.
The process works like this:
Pull an HTML template from a DB with tokens in it to be replaced (e.g. "~CompanyName~", "~CustomerName~", etc.)
Replace the tokens with real data
Tidy the HTML with a simple regex function that properly formats HTML tag attribute values (ensures quotation marks, etc., since ActivePDF's rendering engine hates anything but single quotes around attribute values)
Send off the HTML to a web service that creates the PDF.
Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character when viewing the document in a browser (Firefox). ActivePDF pukes on these non-UTF-8 characters.
My question: since I don't know where the problem stems from and don't have time to investigate it, is there an easy way to re-encode or find-and-replace the bad characters? I've tried sending it through this little function I threw together, but it doesn't change anything.
Private Shared Function ConvertToUTF8(ByVal html As String) As String
    ' Reinterpret the string's bytes as ISO-8859-1, then convert those bytes to UTF-8
    Dim isoEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Dim source As Byte() = isoEncoding.GetBytes(html)
    Return Encoding.UTF8.GetString(Encoding.Convert(isoEncoding, Encoding.UTF8, source))
End Function
Any ideas?
EDIT:
I'm getting by with this for now, though it hardly seems like a good solution:
Private Shared Function ReplaceNonASCIIChars(ByVal html As String) As String
    ' Blunt workaround: replace every non-ASCII character with a plain space
    Return Regex.Replace(html, "[^\u0000-\u007F]", " ")
End Function
Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an "Â" character
That'd be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it'd be 0xC2,0xA0, which, if you (incorrectly) view it as ISO-8859-1, comes out as "Â ". That includes a trailing nbsp which you might not be noticing; if that byte isn't there, then something else has mauled your document and we need to see further up to find out what.
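You can check that byte arithmetic in a couple of lines (a quick sketch in Python, just to illustrate the point):

nbsp = '\u00a0'                            # NON-BREAKING SPACE
utf8_bytes = nbsp.encode('utf-8')          # b'\xc2\xa0'
misread = utf8_bytes.decode('iso-8859-1')  # 'Â' followed by a nbsp, i.e. "Â "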
What's the regexp, and how does the templating work? There would seem to be a proper HTML parser involved somewhere if your &nbsp; strings are (correctly) being turned into U+00A0 NON-BREAKING SPACE characters. If so, you could just process your template natively in the DOM, and ask it to serialise using the ASCII encoding to keep non-ASCII characters as character references. That would also stop you having to do regex post-processing on the HTML itself, which is always a highly dodgy business.
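For illustration, a DOM-based version of that pipeline might look like this (a sketch using Python's lxml; template_html is a stand-in for the real template, and I'm assuming the template parses as HTML):

from lxml import html

template_html = '<p>Dear ~CustomerName~,&nbsp;your report is attached.</p>'  # placeholder
doc = html.fromstring(template_html)
# ... do the token replacement on the parsed tree instead of on raw strings ...
# serialising to ASCII keeps non-ASCII characters as references like &#160;
result = html.tostring(doc, encoding='ascii')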
Well anyway, for now you can add one of the following to your document's <head> and see if that makes it look right in the browser:
for HTML4: <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
for HTML5: <meta charset="utf-8">
If you've done that, then any remaining problem is ActivePDF's fault.
If anyone has the same problem as me and the charset was already correct, simply do this:
Copy all the code inside the .html file.
Open Notepad (or any basic text editor) and paste the code.
Go to "File -> Save As".
Enter your file name, e.g. "example.html" (select "Save as type: All Files (*.*)").
Select Encoding as UTF-8.
Hit Save. You can now delete your old .html file, and the encoding should be fixed.
Problem:
I was facing a problem where we were sending '£' with some string content in a POST request to a CRM system, but when we did the GET call from the CRM, it returned the string with 'Â£'. So what we analysed is that '£' was getting converted to 'Â£'.
Analysis:
The glitch we found after doing some research is that in the POST call we had set the HttpWebRequest ContentType to "text/xml", while in the GET call it was "text/xml; charset=utf-8".
Solution:
So as part of the solution we included charset=utf-8 in the POST request, and it works.
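For illustration, the same fix expressed with Python's requests library (a sketch; the URL and payload are invented):

import requests

payload = '<amount>£100</amount>'
# declare the charset on the POST, matching what the GET already sends
headers = {'Content-Type': 'text/xml; charset=utf-8'}
resp = requests.post('https://crm.example.com/api', data=payload.encode('utf-8'), headers=headers)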
In my case this "Â" (A with circumflex) occurred in code I generated from Visual Studio using my own code-generation tool. It was easy to solve:
Select a single space ( ) in the document. You should see that lots of single spaces look different from the other single spaces and are not selected. These other single spaces are the ones responsible for the unwanted characters in the browser. Use Find and Replace to replace them with a normal single space ( ). Done.
PS: It's easier to see all similar characters when you place the cursor on one, or if you select it in VS2017+; I hope other IDEs have similar features.
In my case I was getting a Latin cross sign instead of the nbsp, even though the page was correctly encoded as UTF-8. Nothing of the above helped in resolving the issue, and I tried it all.
In the end, changing the font for IE (with browser-specific CSS) helped. I was using Helvetica Neue as the body font; changing to Arial resolved the issue.
I was having the same sort of problem. Apparently it's simply because PHP doesn't recognise utf-8.
I was tearing my hair out at first when a '£' sign kept showing up as 'Â£', despite it appearing OK in Dreamweaver. Eventually I remembered I had been having problems with links relative to the index file, where pages viewed directly would work with slideshows but not when used with an include (but that's beside the point). Anyway, I wondered if this might be a similar problem, so instead of putting the charset declaration into the page I was having problems with, I simply put it into the index.php file - problem fixed throughout.
The reason for this is PHP doesn't recognise utf-8.
Here you can check all the special characters in HTML:
http://www.degraeve.com/reference/specialcharacters.php
Well, I got this issue too on a few of my websites, and all I needed to do was customize the content filter for HTML entities. Before that, the more of them I deleted, the more I got, so just change your HTML filter or parsing function for the page and it worked. It's mainly due to the HTML editors in most CMSs; the way they store and parse the data caused this issue (in my case). Maybe this will help in your case too.