CSV displayed Arabic letters as ?

I had a CSV file that contained Arabic letters, which I did not know at first. I made changes to it, such as adding formulas, and saved it. When I opened it later, I found that all the Arabic characters were displayed as ?.
I browsed the internet and tried every way of importing data from this CSV, but the Arabic characters that were saved as ? still appear as ?. I badly want to retrieve them, as those were my leads.
Is there any way I can extract the Arabic characters from the file as it was saved, or any way to restore an earlier version that still has the Arabic text? I do not have file history enabled in Windows, so that option is not available.
Really, any help is appreciated.

You have to distinguish between the content of the file and how the file is displayed.
The file does not actually contain "Arabic" but a series of bytes. How these are displayed is entirely up to the program that reads the file. Simplified, it's like the string 1234, which you can read either as "12" and "34" or as "1234".
The display is an interpretation of the file content, so it is not directly possible to say whether it is wrong or not.
The offending program MAY have written the data back in an incorrect way, either directly as "?" characters or as meaningless bytes. In that case the file data is changed forever and can't be brought back in that file.
If the offending program wrote the data correctly but always displays "?", the data may still be there; you need to check in a Unicode-enabled program.
Make a simple test with a sample dummy file. See this sample file.
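For a concrete check, here is a rough Java sketch (the file name leads.csv is a placeholder) that dumps the first bytes of the file in hex. If you see literal 3F bytes where the Arabic text was, the "?" characters were physically written to disk and the data is gone; if you see multi-byte sequences instead (UTF-8 Arabic letters have lead bytes in the D8-DB range), the data survived and only the display is wrong.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteCheck {
    public static void main(String[] args) throws IOException {
        // "leads.csv" is a placeholder for the damaged file.
        byte[] data = Files.readAllBytes(Paths.get("leads.csv"));
        // Literal '?' shows up as 3F; UTF-8 Arabic letters show up as
        // two-byte sequences whose first byte is in the D8..DB range.
        for (int i = 0; i < Math.min(64, data.length); i++) {
            System.out.printf("%02X ", data[i] & 0xFF);
        }
        System.out.println();
    }
}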

Related

How to retrieve original pdf stored as MySQL mediumblob?

A table containing almost four thousand records includes a mediumblob field for each record that contains the record's associated PDF report. Under both MySQL Workbench and phpMyAdmin the relevant DOCUMENT column displays the data as a BLOB button or link. In phpMyAdmin's case the link also indicates the size of the data the BLOB contains.
The issue is that when the BLOB button/link is clicked, MySQL Workbench's SQL Editor only displays the raw BLOB data, and in phpMyAdmin the link only allows the BLOB data to be saved as a .bin file instead of displaying or saving it as a viewable PDF file. All previous attempts to retrieve the original PDFs using PHP have failed; see the related earlier thread: Extract Pdf from MySql Dump Saved as Text.
The filename field in the table shows that all the stored files are PDF files. Further research and tests indicate that the mediumblob data has been stored as application/octet-stream.
My question is how can the original PDFs be retrieved as readable PDFs? Is it possible for a .bin file saved from the database to be converted or used to recover the original PDF file?
Any assistance would be greatly appreciated.
In line with my assumption and Isaac's suggestion, the only solution was to speak to one of the software developers. It transpires that the documents were zipped using a third-party library, and the header was removed, before they were stored in the database.
The third-party library used is version 2.0.50727 of Chilkat, available from www.chilkatsoft.com. That version no longer appears to be available, but hopefully at least one of the later versions may do the job.
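Since the exact storage format depends on that library and the removed header, any recovery sketch is speculative. One plausible reading of "zipped with the header removed" is a raw DEFLATE stream, which java.util.zip.Inflater can decompress in nowrap mode; a hedged Java sketch, with placeholder file names:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class BlobInflate {
    public static void main(String[] args) throws IOException, DataFormatException {
        // "document.bin" and "document.pdf" are placeholder names.
        byte[] blob = Files.readAllBytes(Paths.get("document.bin"));
        // nowrap=true expects a raw DEFLATE stream with no zlib/gzip header --
        // an assumption about what "zipped with the header removed" means.
        Inflater inflater = new Inflater(true);
        inflater.setInput(blob);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // not a raw DEFLATE stream, or truncated data
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        Files.write(Paths.get("document.pdf"), out.toByteArray());
    }
}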
Thanks again for everyone's input and assistance.
Based on the discussion in the comments, it sounds like you'll need to either refer to the original source code or consult with the original developer to determine exactly how the data was stored.
Using phpMyAdmin to download the mediumblob data as a file will produce a .bin file in many cases. I don't recall exactly how it determines the content type (a PNG file, for instance, downloads with a .png extension, but most other binary files, PDF included, simply download as .bin when phpMyAdmin isn't sure what the extension should be). So the behavior you're seeing from phpMyAdmin is expected and correct; but since the .bin file doesn't work when it's renamed to .pdf, something has probably gone wrong with the import and upload.
BLOB data is often stored in a pretty standardized way, but it seems your data doesn't follow that method.
Without seeing the code directly, we can't tell exactly what happened when the data was stored; anything further would only be guessing.
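That said, a cheap first diagnostic is to inspect the magic number at the start of a saved .bin file, which usually reveals what was actually stored. A small Java sketch (the file name is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MagicSniff {
    public static void main(String[] args) throws IOException {
        // "document.bin" is a placeholder for a file saved via phpMyAdmin.
        byte[] b = Files.readAllBytes(Paths.get("document.bin"));
        if (startsWith(b, new byte[] {'%', 'P', 'D', 'F'})) {
            System.out.println("Intact PDF: renaming to .pdf should work");
        } else if (startsWith(b, new byte[] {'P', 'K', 3, 4})) {
            System.out.println("ZIP archive: rename to .zip and extract");
        } else if (startsWith(b, new byte[] {0x1F, (byte) 0x8B})) {
            System.out.println("gzip stream: decompress it first");
        } else {
            System.out.println("No known signature: the writer transformed the data");
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}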

How do you save a JSON response with Emojis as Unicode?

Currently I am scraping Instagram comments for a sentiment analysis project, and am using an Instagram scraper. It is supposed to output a comment file but it doesn't, so a workaround is to find the query URL in the log file and paste it into a browser.
An example URL would be this https://www.instagram.com/graphql/query/?query_hash=33ba35852cb50da46f5b5e889df7d159&variables={%22shortcode%22:%22CMex-IGn1G-%22,%22first%22:50,%22after%22:%22QVFCaERkTm84aWF3T1Exbmw5V0xhb05haVBEY2JaYmxhSTNGWVZ4M2RQWi0yVzVUSExlUlRYOUtsOVEtM0trRzBmSGxyYjdJV094a1hlYm1aLXZjdkVpZQ==%22}.
On Firefox I am able to view the JSON response and am also able to download it in two ways:
CTRL + A to select all and paste into a JSON file.
Download webpage as a JSON file.
The issue with these methods is that neither retains the emoji data. The first loses the emojis: they are not stored as Unicode escape sequences but as question marks (???). I assumed this was related to encoding, so I tried pasting the raw response into Unicode-encoded files. There the emojis are kept as literal emoji characters (🙌👏😍), but not as Unicode escapes.
The second method either saves only the message {"message":"rate limited","status":"fail"} or saves in another incorrect format.
The thing is, a few months ago I scraped some pages and managed to save the comments with the emojis stored as Unicode escapes. This is frustrating because I know it can be done, but I can't remember how I did it; it would have been something basic, like the methods outlined above.
I am out of ideas and would greatly appreciate any help. Thank you.
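For reference, one way this can be done, as a minimal Java (11+) sketch: fetch the response, decode it explicitly as UTF-8, and rewrite every non-ASCII character as a \uXXXX escape when saving, so the emojis end up in the escaped Unicode form regardless of the output file's encoding. The URL and output file name are placeholders.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EmojiEscaper {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder: paste the query URL found in the scraper's log file.
        String url = "https://www.instagram.com/graphql/query/?query_hash=...";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        // Decode the body explicitly as UTF-8 so the emojis survive the download.
        String json = client.send(request,
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8)).body();

        // Rewrite every non-ASCII character as a \uXXXX escape. Emojis outside
        // the Basic Multilingual Plane become surrogate pairs, e.g. the raised
        // hands emoji U+1F64C comes out as \ud83d\ude4c.
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c < 0x80) {
                escaped.append(c);
            } else {
                escaped.append(String.format("\\u%04x", (int) c));
            }
        }
        Files.write(Paths.get("comments.json"),
                escaped.toString().getBytes(StandardCharsets.US_ASCII));
    }
}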

Freemarker CSV generation - CSV with Chinese text truncates the csv contents

I have this very weird problem. I'm using Java 8, Struts2, and Freemarker 2.3.23 to generate reports in CSV and HTML formats (via .csv.ftl and .html.ftl templates, both saved with UTF-8 encoding), with data coming from a Postgres database.
The data contains Chinese characters. When I generate the report in HTML format, it is fine and complete, and the Chinese characters are displayed properly. But when the report is generated as CSV, I have observed that:
If I run the app with the -Dfile.encoding=UTF-8 VM option, the Chinese characters are generated properly but the report is incomplete (the text is truncated near the end).
If I run the app without the -Dfile.encoding=UTF-8 VM option, the Chinese characters are displayed as question marks (?????) but the report is complete.
Also, the app uses a StringWriter to write the data to the CSV and HTML templates.
So, what could be the problem? Am I hitting some Java character limit? I do not see any errors in the logs either. Appreciate your help. Thanks in advance.
UPDATE:
The StringWriter returns the data in full; it is when the data is written to the OutputStream that some of it gets lost.
ANOTHER UPDATE:
Looks like the issue is with contentLength (the app is a webapp and the CSV is generated as a file download), which was being computed from the data as a String using String.length(). String.length() counts UTF-16 chars rather than bytes, so for Chinese characters, which take more than one byte in UTF-8, it reports a smaller value than the output actually needs.
I was able to resolve the contentLength issue by using String.getBytes("UTF-8").length instead.
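For illustration, a minimal sketch of that fix in a servlet-style download handler; the method and variable names are assumptions, not code from the actual app:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServletResponse;

public class CsvDownload {
    // 'csv' is the full report text from the StringWriter; the servlet
    // plumbing around this method is assumed, not taken from the question.
    static void writeCsv(HttpServletResponse response, String csv) throws IOException {
        byte[] bytes = csv.getBytes(StandardCharsets.UTF_8);
        response.setContentType("text/csv; charset=UTF-8");
        // Content-Length must be the byte count. csv.length() counts chars,
        // and each Chinese character is 3 bytes in UTF-8, so using the char
        // count here truncates the download near the end.
        response.setContentLength(bytes.length);
        response.getOutputStream().write(bytes);
    }
}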

CodedUI test does not read data from CSV input file

I am having difficulty mapping a CSV file with the Coded UI test method. This is most likely a stupid question but I cannot seem to find a solution for my problem, at least not one that works. I have made sure to set the property of the CSV file to Copy always.
I have also imported the CSV file by writing the following line above the test method.
[DataSource("Microsoft.VisualStudio.TestTools.DataSource.CSV", "|DataDirectory|\\Data\\login.csv", "login#csv", DataAccessMethod.Sequential), DeploymentItem("login.csv"), TestMethod]
The file name is login.csv and it resides in the Data folder.
The test compiles without any problem, but once the test executes, the fields that should receive input from the CSV file are left empty and the execution is interrupted. I've tried replacing the data from the CSV file with hard-coded Strings, and that works perfectly fine. The piece of code I am using to read each parameter is:
TestContext.DataRow["Username"].ToString()
Also, the CSV file contains something along the following lines:
Username,Password,Fullname
admin#mail.com,password,Admin
Is there anyone who can point out what I am forgetting?
Update: I pinpointed the issue; it only affects the first column in the CSV file. When I read any of the other columns, it works perfectly fine.
Some text files start with a Byte Order Mark (BOM). The CSV reader within Coded UI does not handle the BOM and treats it as part of the first field name. The screenshot below shows the debug trace of a CSV file with a BOM and that same file shown in Notepad++. The DataRow.ItemArray[...] values are as expected. The DataRow.Table.Columns.ResultsView[...] shows the field names, but the first field name includes the BOM.
This CSV file with a BOM was created in Visual Studio using Solution Explorer => Add => New item => C# => General => Text file. Previously I have created a spreadsheet with Microsoft Excel and saved it as a CSV file; that file did not have a BOM. I have also created files with Notepad++ and saved them as CSV, and they did not have a BOM. It appears that Visual Studio creates files with a BOM, but when editing CSV files it does not add one.
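As an editor-independent workaround, the BOM can also be stripped from the file itself before the test runs. A small standalone sketch (in Java, as a generic utility; adapt the path as needed):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomStripper {
    public static void main(String[] args) throws IOException {
        // "login.csv" follows the question above; adjust the path as needed.
        byte[] data = Files.readAllBytes(Paths.get("login.csv"));
        // A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            Files.write(Paths.get("login.csv"),
                    Arrays.copyOfRange(data, 3, data.length));
        }
    }
}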
Visual Studio can create files with the correct encoding. Within "Step 2 - Create a data set" of this Microsoft page it states the text below (thanks also to Holistic Developer for providing very similar details in a comment):
It is important to save the .csv file using the correct encoding. On the FILE menu, choose Advanced Save Options and choose Unicode (UTF-8 without signature) – Codepage 65001 as the encoding.
For Visual Studio 2010, I could solve the issue by selecting "Western European (Windows) - Codepage 1252" encoding for CSV files.
Summary of steps:
In Visual Studio 2010: open the CSV file > go to the File menu > select "Advanced Save Options" > select "Western European (Windows) - Codepage 1252" > Save.
This should help.
This is not the best solution, but it's kind of a workaround: I simply set the first column to something throwaway, and since I don't need access to the first column it doesn't matter that I can't read it.
If anyone finds a correct way to solve this problem, I'd be grateful for your solution.

MS Office no longer works as BLOB

Hi, does anyone know why MS Office files such as .doc, .docx, and .xls can no longer be viewed when retrieved from a MySQL DB where they are stored as a BLOB?
The .doc and .docx files used to download and open without any problem, but now Office no longer recognises the file format.
I'd like to ditto your problem. Images and plain-text files upload/download fine from a MySQL BLOB field, but .doc and .docx files seem to be corrupted. I've read somewhere a rumor of MySQL truncating the last 4 bits, but I can't verify that.
I have used xvi32 (a hex editor) to compare local originals of files with versions downloaded from BLOB/LONGBLOB fields. It seems that extra bytes, which I think represent a CRLF appended by Windows when the file is written, are added at the end. This doesn't seem to be a problem for some graphics formats, which are to some extent fault-tolerant, but the Office XML format files are corrupted by this extra data.
I have tried using ob_clean() and ob_flush() (that is, in PHP) before printing/echoing the file contents, but the files are still corrupted as far as Office is concerned.
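As a diagnostic, it may help to check the downloaded file's signature and trim any trailing CR/LF bytes; a sketch (in Java as a standalone utility; file names are placeholders). A .docx is a ZIP archive and must begin with PK, while a legacy .doc begins with the OLE2 signature D0 CF 11 E0.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BlobDoctor {
    public static void main(String[] args) throws IOException {
        // "report.docx" is a placeholder for a file downloaded from the BLOB.
        byte[] data = Files.readAllBytes(Paths.get("report.docx"));
        // .docx must start with 'P','K' (50 4B); .doc with D0 CF 11 E0.
        System.out.printf("first bytes: %02X %02X %02X %02X%n",
                data[0] & 0xFF, data[1] & 0xFF, data[2] & 0xFF, data[3] & 0xFF);
        // Trim trailing CR/LF bytes that output buffering may have appended.
        int end = data.length;
        while (end > 0 && (data[end - 1] == '\r' || data[end - 1] == '\n')) {
            end--;
        }
        if (end < data.length) {
            Files.write(Paths.get("report.trimmed.docx"),
                    Arrays.copyOfRange(data, 0, end));
        }
    }
}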
I know this is an old thread but I would appreciate any solutions anyone might have found since it was last updated.
Did you try with a short .txt file instead of a .doc to see if the contents come back different from what you expected?