How do you save a JSON response with Emojis as Unicode?

Currently I am scraping Instagram comments for a sentiment analysis project, using an Instagram scraper. It is supposed to output a comment file but it doesn't, so a workaround is to find the query URL in the log file and paste it into a browser.
An example URL would be this https://www.instagram.com/graphql/query/?query_hash=33ba35852cb50da46f5b5e889df7d159&variables={%22shortcode%22:%22CMex-IGn1G-%22,%22first%22:50,%22after%22:%22QVFCaERkTm84aWF3T1Exbmw5V0xhb05haVBEY2JaYmxhSTNGWVZ4M2RQWi0yVzVUSExlUlRYOUtsOVEtM0trRzBmSGxyYjdJV094a1hlYm1aLXZjdkVpZQ==%22}.
In Firefox I can view the JSON response and download it in two ways:
CTRL + A to select all and paste into a JSON file.
Download webpage as a JSON file.
The issue is that neither of these methods retains the emoji data. The first loses the emojis: they are saved as question marks (???) rather than as Unicode escapes. I assumed this was related to the encoding, so I tried pasting the raw response into UTF-8 files instead; there the emojis appear as actual emoji characters (🙌👏😍), but still not as Unicode escapes.
The second method either saves only the message {"message":"rate limited","status":"fail"} or some other incorrect format.
The thing is that a few months ago I scraped some pages and managed to save the comments with the emojis stored in Unicode format. This is frustrating because I know it can be done, but I can't remember how I did it; it would have been something basic, like the approaches I have outlined.
I am out of ideas and would greatly appreciate any help. Thank you.
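(For reference, the "emojis stored as Unicode" format described above is what Python's json module produces by default: ensure_ascii=True escapes every non-ASCII character as \uXXXX. A minimal sketch, using made-up sample data in place of the real GraphQL response:)

```python
import json

# Sample comment payload standing in for the real GraphQL response
data = {"text": "nice shot \U0001F64C\U0001F44F"}

# ensure_ascii=True (the default) escapes non-ASCII as \uXXXX sequences,
# i.e. the "emoji as Unicode" format described in the question
escaped = json.dumps(data, ensure_ascii=True)

# ensure_ascii=False keeps the raw emoji characters instead
raw = json.dumps(data, ensure_ascii=False)

# Writing with an explicit encoding avoids the "???" mojibake
with open("comments.json", "w", encoding="utf-8") as f:
    f.write(escaped)
```

So fetching the query URL with a script and re-serializing it this way would give the escaped output regardless of how the browser renders it.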

Related

Error when using non ASCII characters in MS Office URI Schemes

I want to open a file using ms-excel. I read the method in the URI Schemes documentation and am using the following code.
<a href='ms-excel:ofe|u|file:C:/Users/*********/Downloads/testätür.xlsx'>link</a>
This works fine if my file path contains only English characters, but it gives an error with non-English characters like ö, ä, ß. The link works fine without the 'ms-excel:ofe|u|' prefix and downloads the file, so I can only assume the problem is here.
file:C:/Users/***********/Downloads/testätür.xlsx
If I try to open the file link above, the following error occurs
Can someone help me understand the issue here?
You need to encode the file path properly before using it in a hyperlink in the document. See What is the proper way to URL encode Unicode characters? for more information on that.
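A minimal sketch of that encoding step in Python (the path here is a made-up stand-in for the one in the question): percent-encoding turns each non-ASCII character into its UTF-8 byte sequence, which URI handlers can parse reliably.

```python
from urllib.parse import quote

# Hypothetical path; "ä" and "ü" are the characters that break the raw URI
path = "C:/Users/me/Downloads/testätür.xlsx"

# Percent-encode everything except the path separators and drive colon;
# "ä" (U+00E4) becomes "%C3%A4" under UTF-8 percent-encoding
encoded = quote(path, safe="/:")

href = f"ms-excel:ofe|u|file:{encoded}"
```

The same transformation is available in the browser as encodeURI / encodeURIComponent.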

CSV displayed Arabic letters as ?

I had a CSV file that initially had Arabic letters in it. I did not know this, made changes to it (such as adding formulas), and saved it. Later, when I opened it, I found all the Arabic characters displayed as ?.
I browsed the internet and tried every way of importing data from this CSV, but the Arabic characters that were saved as ? still appear as ?. I badly want to retrieve them, as those were my leads.
Is there any way I can extract the Arabic characters from the file as it was saved, or restore an earlier version that still has the Arabic text? I do not have version history in Windows, so that option is not available.
Really, any help is appreciated.
You have to distinguish between the file's content and how the file is displayed.
The file does not actually contain "Arabic" but a series of bytes. How these are displayed is entirely up to the program that reads the file. Simplified, it's like the string 1234, which you can read as either "12" and "34" or as "1234".
The display is an interpretation of the file content, so it is not directly possible to say whether it is wrong or not.
The offending program MAY have written the data back incorrectly, either as literal "?" characters or as meaningless data. In that case the file data is changed forever and can't be brought back in that file.
If the offending program wrote the data correctly but always displays "?", the data may still be there; you need to check in a Unicode-aware program.
Make a simple test with a sample dummy file. See this sample file
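One such test, sketched in Python with made-up sample bytes: if the file still contains bytes above 0x7F, the Arabic text may be recoverable by trying the right encoding; if it is pure ASCII full of literal '?' bytes, the data was overwritten and is gone.

```python
def arabic_recoverable(raw: bytes) -> bool:
    """True if the file still holds non-ASCII bytes (text may be
    recoverable with the right decoding); False if it is pure ASCII,
    meaning the Arabic was replaced by literal '?' bytes."""
    return any(b > 0x7F for b in raw)

# Sample data: Arabic saved correctly as UTF-8 vs. destroyed by a rewrite
good = "مرحبا".encode("utf-8")
bad = b"?????"
```

For a file that passes this check, decoding attempts with utf-8, cp1256, or iso-8859-6 (common Arabic encodings) would be the next step.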

Freemarker CSV generation - CSV with Chinese text truncates the csv contents

I have this very weird problem. I'm using Java 8, Struts2 and Freemarker 2.3.23 to generate reports in CSV and HTML formats (via .csv.ftl and .html.ftl templates, both saved in UTF-8 encoding), with data coming from a Postgres database.
The data contains Chinese characters. When I generate the report in HTML format, it is complete and the Chinese characters are displayed properly. But when the report is generated as CSV, I have observed that:
If I run the app with the -Dfile.encoding=UTF-8 VM option, the Chinese characters are generated properly but the report is incomplete (the text is truncated near the end).
If I run the app without the -Dfile.encoding=UTF-8 VM option, the Chinese characters are displayed as question marks (?????) but the report is complete.
Also, the app uses StringWriter to write the data to the csv and html templates.
So, what could be the problem? Am I hitting a Java character limit? I do not see errors in the logs either. I appreciate your help. Thanks in advance.
UPDATE:
The StringWriter returns the data in whole; it is when writing the data to the OutputStream that some of it gets lost.
ANOTHER UPDATE:
It looks like the issue is with contentLength (the app is a webapp and the CSV is generated as a file download), which was computed from the data as a String using String.length(). String.length() counts UTF-16 code units, not bytes, so for Chinese characters it returns a smaller value than the actual number of bytes to be sent.
I was able to resolve the issue by computing contentLength with String.getBytes("UTF-8").length instead.
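The mismatch is easy to demonstrate (sketched in Python for brevity: len() counts code points here much as String.length() counts UTF-16 code units in Java, while the encoded byte length is what Content-Length actually needs):

```python
s = "中文报告"  # sample Chinese text, 4 characters

char_len = len(s)                  # 4 code points
byte_len = len(s.encode("utf-8"))  # 12 bytes: each of these characters
                                   # takes 3 bytes in UTF-8

# Declaring char_len as the Content-Length would tell the client to stop
# reading after 4 bytes, truncating the response, exactly the symptom above
```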

Angular 5: How to integrate HTML data (which is formatted text) in a .docx file?

I'm still a bit of a newbie in the code game, and I would like some advice from the senpai.
Context :
I'm making an Angular 5 app which has a form that also uses QuillJS, a rich text editor, for one question only (the previous questions are simple input fields for strings or numbers). My goal is to let my users download the form, together with the QuillJS text they completed, as a .docx (Word) file. And of course I'm doing this because I want to keep the formatted text from QuillJS; otherwise I would have just used a good ol' string.
Issue :
The point is, I'm already building a docx file for the first questions of the form, and the only method I have found so far to put my HTML string from QuillJS into a Word-readable data type is the html-docx-js library.
This post even explains how. But, BUT, I don't want to use the saveAs function (see the post), which creates a file and puts the content in it. I want to put the content into the docx file I'm already creating.
So here is my question: how would you, senpai, do it?
The thing is that I've got a Blob (cf. the post), but I don't know how to put it into my docx file. I tried to see if the FileReader API could do the job, but... I don't get how to integrate this particular Blob type (which is: application/vnd.openxmlformats-officedocument.wordprocessingml.document) into the docx file.
Maybe there is another way; I'm open to any suggestions and don't mind at all changing my approach.
Thank you. Save the internet, give me a tip.
The official documentation for html-docx-js does not state any options other than the asBlob method. I suggest two options:
Decoding the DOCX:
The Blob type is not special; the blob is just a binary representation of the docx. As an SE question explains, a docx is in fact a zipped XML document. You could unzip it using JSZip or another JS solution, then read it with FileReader and try to work with it in a DOM manner. I'm not qualified to go into details of how that could work.
Adding the HTML to the user input first and then outputting it as a whole:
This changes the way you want to do it. Here, I would first create formatted HTML from the data you collected in the other parts of the questionnaire, then append the rich data from the rich editor. Finally, you take this combined HTML and save it into a single file using the asBlob function.
The second solution may strip some customization from your original approach, but it seems much faster to implement.

Can URL #anchor contain binary data?

I'm trying to encode web page state in the #anchor. Right now I am base64-encoding a JSON string, but it sometimes gets too long (10K+ characters). Apparently I hit some kind of URL length limitation and it just doesn't work right (it gets cut off and the JSON data structure can't be reconstructed).
I talked with some of my buddies and they suggested bzip2 or gzip compression. I tried that, but now my #anchor is binary data.
I haven't been able to decode it properly, and I'm not sure it even gets sent correctly as part of the URL.
Does anyone know how to put binary data in the #anchor, whether that's a good idea, or an alternative working solution for my problem?
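(If you do stay with compression, the usual fix for the "binary in the fragment" problem is to compress first and then apply URL-safe Base64, so the fragment never contains raw binary. A minimal round-trip sketch in Python with made-up page state; in the browser the same steps would use a zlib library such as pako plus btoa with the +/ characters swapped:)

```python
import base64
import gzip
import json

# Made-up page state; repetitive data compresses well
state = {"filters": ["a"] * 200, "page": 3}
raw = json.dumps(state).encode("utf-8")

# Compress, then encode with the URL-safe Base64 alphabet (- and _
# instead of + and /), so the result is safe to place after '#'
packed = base64.urlsafe_b64encode(gzip.compress(raw)).decode("ascii")

# Reverse the steps when loading the page
restored = json.loads(gzip.decompress(base64.urlsafe_b64decode(packed)))
```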
I would not bother with any of this.
Use Local Storage for your large data, and put only a short reference to it in your anchor.