I'm fetching a Website, but all the Special Characters in the String from .getContentText() or .getContentText("UTF-8") are encoded as ’ and such.
I've really run out of ideas, and to be honest I don't quite understand at which point this encoding happens. Thanks a lot for your help. I could solve it by "manually" replacing all the occurrences, but that doesn't seem very clean.
var response = UrlFetchApp.fetch("https://podtail.com/de/top-podcasts/de/");
var html = response.getContentText();
Your sample code suggests that you are retrieving the HTML source code of a specific page. That HTML source code uses ’ and friends, so the data will be in that format. It is unclear why you would need to decode those HTML entities.
If you really need to decode the HTML fully in Google Apps Script, you will need a parser of fairly respectable complexity. There are some shortcuts that you can try if your app has an HTML user interface of its own, but it would probably make more sense to use a library like the one by mathiasbynens.
If you only want to replace some HTML entities with their non-encoded equivalents, you may want to just use String.replace().
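A minimal sketch of the String.replace() approach, assuming you only care about numeric entities and a handful of named ones. The NAMED table and the function name decodeSomeEntities are made up for illustration, and String.fromCodePoint assumes a modern runtime (such as the Apps Script V8 runtime). It is not a full HTML entity decoder.

```javascript
// Hypothetical lookup table: list only the named entities you expect to meet.
const NAMED = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  rsquo: "\u2019",
  lsquo: "\u2018",
};

// Decodes decimal (&#8217;), hexadecimal (&#x2019;) and the named entities above
// in a single pass, so that "&amp;#8217;" is not decoded twice.
function decodeSomeEntities(text) {
  return text.replace(/&(#\d+|#x[0-9a-f]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1].toLowerCase() === "x";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown named entities are left as they are.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}
```

In the question's code this would be applied to the fetched string, for example decodeSomeEntities(response.getContentText()).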
I am trying to copy the source code of a 3rd party email provider I'm using to match their look and feel.
I am viewing the raw source. To format it I had to remove all the =\n sequences (the raw source ends each broken line with them), but I still can't figure out how to copy the styling from there, since I can't find a parser that correctly handles this email HTML. Any recommendations on formatting an email, or on just grabbing its styling?
PS: I'm using nodemailer for sending emails
It was a bit of a pain, but here are the steps in Visual Studio Code:
Remove all =\n ("=" followed by a new line).
Rename the file from .eml to .html if you haven't already.
Replace all instances of =3D" with =".
Search for any remaining occurrences of 3D (left over from =3D) and delete each one you can.
Now you have good HTML that just needs to be formatted. I recommend using a formatter like https://www.freeformatter.com/html-formatter.html
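The manual find-and-replace steps above amount to undoing quoted-printable encoding, where "=" at the end of a line is a soft line break and =3D is an encoded "=". A rough sketch of doing that in Node.js instead of by hand; the function name decodeQuotedPrintable is made up here, and the code assumes the raw source is plain ASCII and the encoded bytes are UTF-8.

```javascript
// Sketch of a quoted-printable decoder (assumes ASCII input, UTF-8 content).
function decodeQuotedPrintable(input) {
  // Step 1: drop soft line breaks ("=" directly followed by a line ending).
  const joined = input.replace(/=\r?\n/g, "");

  // Step 2: turn every =XX escape back into the byte it stands for.
  const bytes = [];
  for (let i = 0; i < joined.length; i++) {
    const hex = joined.slice(i + 1, i + 3);
    if (joined[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits
    } else {
      // Assumption: everything that is not an escape is a single ASCII byte.
      bytes.push(joined.charCodeAt(i) & 0xff);
    }
  }

  // Step 3: interpret the collected bytes as UTF-8 text.
  return Buffer.from(bytes).toString("utf8");
}

console.log(decodeQuotedPrintable('<a href=3D"x">caf=C3=A9 =\nlatte</a>'));
// <a href="x">café latte</a>
```

Unlike deleting every 3D by hand, this also restores other escaped characters such as accented letters.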
You now have your template. Use template strings to insert variables as needed, and use the result as the html value in nodemailer.
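As a sketch of that last step: the markup below is a placeholder for the cleaned-up HTML, and renderWelcomeEmail, escapeHtml, the addresses and the variable names are all invented for illustration. Only the html field of the mail options and transporter.sendMail are nodemailer's own.

```javascript
// Escape user-supplied values so they cannot break the markup.
function escapeHtml(value) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Hypothetical template: paste the cleaned-up HTML here and mark the variable parts.
function renderWelcomeEmail({ name, link }) {
  return `<!DOCTYPE html>
<html>
  <body style="font-family: Arial, sans-serif;">
    <p>Hello ${escapeHtml(name)},</p>
    <p><a href="${escapeHtml(link)}">Confirm your address</a></p>
  </body>
</html>`;
}

const mailOptions = {
  from: "sender@example.com", // placeholder address
  to: "recipient@example.com", // placeholder address
  subject: "Welcome",
  html: renderWelcomeEmail({ name: "Ada", link: "https://example.com/confirm" }),
};

// With a configured nodemailer transporter, this would send the message:
// transporter.sendMail(mailOptions);
```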
Currently I am scraping Instagram comments for a sentiment analysis project, and am using an Instagram scraper. It is supposed to output a comment file but it doesn't, so a workaround is to find the query URL in the log file and paste it into a browser.
An example URL would be this https://www.instagram.com/graphql/query/?query_hash=33ba35852cb50da46f5b5e889df7d159&variables={%22shortcode%22:%22CMex-IGn1G-%22,%22first%22:50,%22after%22:%22QVFCaERkTm84aWF3T1Exbmw5V0xhb05haVBEY2JaYmxhSTNGWVZ4M2RQWi0yVzVUSExlUlRYOUtsOVEtM0trRzBmSGxyYjdJV094a1hlYm1aLXZjdkVpZQ==%22}.
On Firefox I am able to view the JSON response and am also able to download it through two ways:
CTRL + A to select all and paste into a JSON file.
Download webpage as a JSON file.
The issue with these methods is that neither of them retains the emoji data. The first loses the emojis: they are stored not as Unicode but as question marks (???). I assumed this was related to the encoding, so I tried pasting the raw response into Unicode files. Instead, those contain the emojis as literal emoji characters (🙌👏😍), but not as Unicode.
The second method either saves it with only the message {"message":"rate limited","status":"fail"} or another incorrect format.
The thing is that a few months ago I scraped some pages and managed to save the comments with the emojis stored in the Unicode format. This is frustrating, as I know it can be done, but I can't remember how I did it; I would have tried something basic, as I have outlined.
I am out of ideas and would greatly appreciate any help. Thank you.
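One possible approach, assuming "the Unicode format" means \uXXXX escapes inside the JSON and that the response has already been parsed from text read as UTF-8: re-serialize it and escape every non-ASCII UTF-16 code unit. The function name toAsciiJson is made up for this sketch.

```javascript
// Serializes a value to JSON that contains only ASCII characters.
// Every non-ASCII UTF-16 code unit becomes a \uXXXX escape, so an emoji
// outside the Basic Multilingual Plane turns into a surrogate pair of escapes.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

const comment = { text: "Nice 🙌" };
console.log(toAsciiJson(comment)); // {"text":"Nice \ud83d\ude4c"}
```

The escaped output is still valid JSON, so JSON.parse gives back the original emoji.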
I'm still a bit of a newbie in the code game, and I would like some advice from a senpai.
Context:
I'm making an Angular 5 app with a form that also uses QuillJS, a rich text editor, for only one question (the previous questions are simple input fields for strings or numbers). My goal is to let my users download the form and the QuillJS text they completed as a .docx file (Word). And of course I'm doing this because I want to keep the formatted text from QuillJS; otherwise I would have just taken a good ol' string.
Issue:
The point is, I'm already building a docx file for the first questions of the form, and the only method I have found so far to put my HTML string from QuillJS into a Word-readable data type is to use the html-docx-js library.
This post even explains how. But, BUT, I don't want to use the saveAs function (see the post), which creates a file and puts the content in it. I want to put the content in the docx file I'm already creating.
So here is my question: how would you, senpai, do it?
The thing is that I've got a Blob file (cf. the post), but I don't know how to put it in my docx file. I tried to see if the FileReader function could do the job, but well... I don't get how to integrate this special Blob file type (which is: application/vnd.openxmlformats-officedocument.wordprocessingml.document) into the docx file.
Maybe there is another way. I'm open to any suggestions, and I don't mind at all changing my way of doing it.
Thank you. Save internet, give me a tip.
The official documentation for html-docx-js does not mention any option other than the asBlob method. I suggest two options:
Decoding the DOCX:
The Blob file type is not special. The blob is just a binary representation of the docx. I found in an SE question that a docx is in fact a zipped XML document. You could unzip it using JSZip or another JS solution, then read it using FileReader and try to deal with it in a DOM manner. I'm not qualified to go into details on how that could work.
Adding HTML to the user input first and then outputting it as a whole:
This changes the way you want to do it. With this approach, I would first create formatted HTML from the data you collected in the other parts of the questionnaire. Then you append the rich data from the rich editor. Finally, you take this HTML and save it into a single file using the asBlob function.
The second solution may strip some customization from your original approach, but it seems much faster to implement.
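A rough sketch of the second option. The shape of the answers array, the function names and the sample values are assumptions for illustration; the only library call is asBlob, shown in a comment because it needs html-docx-js loaded in the browser.

```javascript
// Escape plain-text answers so they are safe to embed in HTML.
function escapeHtml(value) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// answers: array of { label, value } taken from the simple inputs (hypothetical shape).
// richHtml: the HTML string produced by the rich text editor, inserted as-is.
function buildDocumentHtml(answers, richHtml) {
  const rows = answers
    .map(({ label, value }) => `<p><strong>${escapeHtml(label)}:</strong> ${escapeHtml(value)}</p>`)
    .join("\n");
  return `<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>
${rows}
${richHtml}
</body>
</html>`;
}

const html = buildDocumentHtml(
  [{ label: "Name", value: "Ada & Co" }, { label: "Age", value: 36 }],
  "<p>Some <em>formatted</em> text</p>"
);

// In the browser, assuming html-docx-js is loaded as the global htmlDocx:
// const blob = htmlDocx.asBlob(html);
```

The whole document then comes out of a single asBlob call, so nothing has to be merged into an existing docx file.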
I know very little HTML. I have a backend application that does a MongoDB lookup. I am building a simple HTML screen with forms that pass values to a web service, which will run the Mongo query and reply on the screen.
When I pass a filename path field in my form like this
\\test.server.com\filetest\test
in my web service app, I see the value coming in as
%5c%5Ctest.server.com%5cfiletest%5ctest
How can I get the value without this translation?
As a matter of fact, I was hoping it would come in like this
\\\\test.server.com\\filetest\\test
as that is how things got stored in mongo.
You cannot pass a backslash directly as it is. That's because URLs can only contain a limited set of ASCII characters. This means that special characters like Ü, as well as characters that need to be escaped in URLs (such as spaces and backslashes), have to be represented with ASCII symbols.
In your case the URL is getting encoded and backslashes are converted to %5c. To have them revert to '\' you need to either:
Decode them back in your server-side code. This is your best bet. How this is done depends on the technology your backend uses. In PHP, for example, you can use the urldecode function.
Decode the characters before querying in MongoDB itself. You will need to work this out yourself, because I'm not aware of functionality that does this for you out of the box.
More info on URL encoding can be found here.
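For a JavaScript backend, the server-side decoding could look like the sketch below, which uses the value from the question. The doubling step is only needed if the stored value really contains doubled backslashes, which is an assumption here; often the doubled form is just how a single backslash is displayed in escaped output.

```javascript
// The value as it arrives at the web service (taken from the question).
const received = "%5c%5Ctest.server.com%5cfiletest%5ctest";

// Percent-decoding restores the original backslashes.
const decoded = decodeURIComponent(received);
console.log(decoded); // \\test.server.com\filetest\test

// Only if the database really stores each backslash doubled:
const doubled = decoded.replace(/\\/g, "\\\\");
console.log(doubled); // \\\\test.server.com\\filetest\\test
```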
Hope this helps!
I'm saving scraped data to a web app, and here's a sample param:
400\xB0F.
This is the 'degree' character from a website, but when I put that into my model I get the dreaded invalid byte sequence in UTF-8 error.
Since it's coming from the web I thought I might try some client-side encoding, so JavaScript turns that into: 400%B0F. This can at least get saved by ActiveRecord with no issue, but Rails seems to escape it again on the way out, so the browser doesn't decode those entities and my show method displays the entire encoded string.
Where should I be cleaning up my input data, and what methods might be the best to use for unpredictable input?
Thanks!
Years ago I had, and solved, this very same problem in builder. Take a look at the to_xs method: http://builder.rubyforge.org/classes/String.html#M000007
You can require builder, and use it directly (you might want to pass false to escaping or you will get entity escaped output). Either that, or simply steal and adapt the source.
Update: here is the original, standalone, library:
http://intertwingly.net/stories/2005/09/28/xchar.rb
Perhaps you can use a binary form (like for a file upload) with enctype="multipart/form-data" in the form tag. That way, you can treat this data as binary data.
It depends, perhaps, on what you do with this data.
URI.unescape was the trick, after I encoded it client-side
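To illustrate why the original byte fails, here is a small sketch in JavaScript (Node.js) rather than Ruby. It assumes the scraped page was encoded as ISO-8859-1, where the single byte 0xB0 is the degree sign; on its own that byte is not a valid UTF-8 sequence.

```javascript
// The bytes of the sample param: "400", then 0xB0, then "F".
const bytes = Buffer.from([0x34, 0x30, 0x30, 0xb0, 0x46]);

// Read as UTF-8, the lone 0xB0 is invalid and becomes the replacement character U+FFFD.
const asUtf8 = bytes.toString("utf8");

// Read as ISO-8859-1 (an assumption about the source page), 0xB0 is the degree sign.
const asLatin1 = bytes.toString("latin1");

console.log(asLatin1); // 400°F
```

So for unpredictable input, the cleanup belongs at the point where the scraped bytes are first turned into a string, using the encoding the source page declares.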