Downloaded CSV starts with BACKSPACE and other weird characters - csv

I downloaded a CSV (encoded in UTF-8) from an FTP server (using some VB6 code which has always worked in the past) and found it started with 08 00 50 9e (BACKSPACE NULL P ž in ASCII).
I've downloaded the same file (a different version) before and never had a problem, so I don't believe the FTP client is at fault here.
Is there some meaning to those characters?
I've tried searching for that string on Google, but (obviously?) did not succeed in the search.

I found the answer: it was an issue with the VB6 code. Instead of Print #iFileNumber, sFileContents, it used Put #iFileNumber, , sFileContents, with the file opened in Binary mode instead of Output mode (no idea why it worked before, but perhaps I changed something without realising it).
Put adds a four-byte header when it writes a Variant string, hence 08 00 50 9e: the first two bytes are the VarType (8 = String) and the next two are the string length.
Problematic code
Open App.Path & "\Temp.csv" For Binary As #iFileNumber
Put #iFileNumber, , StrConv(x.Value, vbUnicode)
Close
Working code
Open App.Path & "\Temp.csv" For Output As #iFileNumber
Print #iFileNumber, StrConv(x.Value, vbUnicode)
Close
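
As a rough cross-check of those four bytes, here is a minimal Python sketch that decodes them. It assumes the header is two little-endian 16-bit values (a VarType followed by a length), which is how VB6 documents Put for Variant strings. The function name is made up for illustration.

```python
import struct

def parse_put_header(header: bytes):
    """Split a 4-byte prefix into (var_type, length), both little-endian uint16.

    Assumption: the layout is VarType followed by string length.
    """
    var_type, length = struct.unpack("<HH", header[:4])
    return var_type, length

var_type, length = parse_put_header(bytes.fromhex("0800509e"))
print(var_type, length)  # 8 40528
```

If the second number matches the size of the rest of the file in bytes, the prefix is almost certainly this header and not part of the CSV data.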

Related

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a csv file containing Chinese characters in Microsoft Excel, TextWrangler or Sublime Text, some Chinese characters are not displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly is shown here:
As you can see, a ? appears in place of the character.
Using the Mac file command as suggested by
http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/ tells me that the csv encoding is utf-16le.
I am wondering what the problem is: why can't I read that specific text?
Is it related to encoding? Or is it related to my laptop settings? Neither macOS nor Windows 10 on the Mac (via Parallels Desktop) displays the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with U+E05E  Private Use Area Character.
For PUA cases that you're sure are the result of HKSCS-compatibility-bodge, you can convert back to proper Unicode using this table.
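
A minimal Python sketch of that repair, assuming the text has already been decoded from UTF-16LE. The table here is hypothetical and holds only the single U+E05E to U+6ED9 pair mentioned above; a real table would cover every HKSCS Private Use Area code point.

```python
# Hypothetical one-entry table: only the pair named in the answer above.
HKSCS_PUA_TO_UNICODE = {0xE05E: 0x6ED9}

def is_private_use(ch: str) -> bool:
    """True for code points in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF

def repair_hkscs(text: str) -> str:
    """Replace known HKSCS PUA code points with their proper Unicode characters."""
    return text.translate(HKSCS_PUA_TO_UNICODE)

mangled = "\ue05e\u8c50\u91d1\u878d"  # PUA character followed by 豐金融
print([f"U+{ord(c):04X}" for c in mangled if is_private_use(c)])
print(repair_hkscs(mangled))
```

Listing the PUA code points first is a cheap way to see which characters in a file still need a table entry.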

I'd like to understand how a Windows smart quote turns into "â€™"

Here's the workflow:
user types in Word; Word changes a single apostrophe to a "smart quote"
user pastes the text from Word into a form on a web page; the page the form is in is encoded in UTF-8
the data gets saved into a MySQL database with the encoding latin1
when retrieved from the database by a PHP app (which assumes the database encoding is UTF-8) and displayed in a UTF-8 web page, the quote displays as â€™
I realise there's a mismatch between the encoding of the input and output pages and the database. That I'm going to fix.
Shouldn't the character survive the trip to and from the database anyway?
And how does a single character (0x92 if I'm not confused) go through that process and come out the other end as three characters?
Can someone talk me through what's happening to the bytes at each stage of the process?
Step 1:
Word converts ' to ’ (Unicode codepoint U+2019, RIGHT SINGLE QUOTATION MARK).
Step 2:
’ is encoded into UTF-8 as E2 80 99
Step 3:
This appears to be where the problem occurs. It looks like the UTF-8 bytes are stored without conversion in the latin-1-encoded MySQL field:
E2 80 99 read as latin-1 (which in MySQL is really Windows-1252) is â€™.
Step 4:
Either here or in the previous step, that falsely interpreted latin-1 string is converted to UTF-8.
â€™ in UTF-8 is C3 A2 E2 82 AC E2 84 A2.
This will display on a UTF-8-encoded website as â€™.
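
The steps above can be reproduced in a few lines of Python. This is only a sketch of the byte-level round trip, not of what MySQL or PHP do internally. It uses Python's cp1252 codec on the assumption that MySQL's latin1 behaves like Windows-1252 for these bytes.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

quote = "\u2019"                       # step 1: RIGHT SINGLE QUOTATION MARK
utf8 = quote.encode("utf-8")           # step 2: E2 80 99
misread = utf8.decode("cp1252")        # step 3: bytes taken as Windows-1252
double_encoded = misread.encode("utf-8")  # step 4: re-encoded as UTF-8

print(hex_bytes(utf8))
print(len(misread))
print(hex_bytes(double_encoded))
```

Reversing the last two steps (encode as cp1252, then decode as UTF-8) recovers the original quote, which is the usual way to repair rows that were already stored this way.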

Line Feeds and Carriage Returns in Data: 0D 0A

I am writing a data clean-up script (MS Smart Quotes, etc.) that will operate on MySQL tables encoded in Latin1. While scanning the data I noticed a ton of 0D 0A where the line breaks are.
Since I am cleaning the data, should I also address all of the 0D, too, by removing them? Is there ever a good reason to keep 0D (carriage return) anymore?
Thanks!
0D 0A (\r\n) and 0A (\n) are line terminators; \r\n is mostly used on Windows, \n on Unix systems.
Is there ever a good reason to keep 0D anymore?
I think you should answer this question yourself.
You could remove '\r' from the data, but make sure that the programs that will use this data treat '\n' alone as the end of a line. In most cases they do, but check just in case.
The CR/LF combination is a Windows thing. *NIX operating systems just use LF. So based on the application that uses your data, you'll need to make the decision on whether you want/need to filter out CR's. See the Wikipedia entry on newline for more info.
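
If you do decide to filter out the CRs, one conservative approach is to convert only the CR LF pairs and leave any lone CR alone. A minimal Python sketch, with a made-up function name:

```python
def normalize_newlines(text: str) -> str:
    """Convert Windows CR LF pairs to LF; lone CRs are left untouched."""
    return text.replace("\r\n", "\n")

sample = "line one\r\nline two\r\n"
print(repr(normalize_newlines(sample)))
```

Leaving lone CRs in place avoids silently merging fields in data where a bare 0D was used on purpose.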
Python's readline() returns a line followed by \012. The \012 is an octal escape: 12 in octal is 10 in decimal. You can see in the ASCII table that Dec 10 is NL or LF: newline or line feed.
This is the standard end-of-line in a Unix text or script file.
http://www.asciitable.com/
So be aware that len() will include the NL; unless you try to read past EOF, len() will never be zero.
Therefore, if you INSERT any line of text obtained from Python's readline() into a MySQL table, it will include the NL character at the end by default.
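
A small Python sketch of that behaviour, using an in-memory stream as a stand-in for a real file. The helper name is made up. It stops when readline() returns an empty string (EOF) and strips the line terminator before the value would be handed to an INSERT.

```python
import io

def read_clean_lines(stream):
    """Read lines with readline() and drop the trailing CR/LF from each one."""
    lines = []
    while True:
        line = stream.readline()
        if line == "":          # only EOF gives a zero-length result
            break
        lines.append(line.rstrip("\r\n"))
    return lines

stream = io.StringIO("first\nsecond\r\n\nlast")
print(read_clean_lines(stream))
```

Note that a blank line comes back from readline() as "\n" with length 1, which is why an empty string can safely be used as the EOF test.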

Weird character at start of json content type

I'm trying to return json content read from MySQL server. This is supposed to be easy but, there is a 'weird' character that keeps appearing at start of the content.
I have two pages for returning content:
kcb433.sytes.net/as/test.php?json=true&limit=6&input=d
this test.php is from a script written by Timothy Groves, which converts an array to json output
http://kcb433.sytes.net/k.php?k=4
this one is supposed to do the same
I tried to validate it here jsonformatter.curiousconcept.com but just page 1 gets validated, page 2 says that it does not contain JSON data.
If accessed directly, both pages have no problems. So what is the difference, and why don't both get validated?
Then I found this page jsonformat.com and tried the same thing. Page 1 was ok and page 2 wasn't but, surprisingly the data could be read. At a glance,
{"a":"b"}
may look good but there is a character in front.
According to a hex editor online, this is the value of the string above (instead of 9 values, there are 10):
-- 7B 22 61 22 3A 22 62 22 7D
The code to echo json in page 2 is:
header("Content-Type: application/json");
echo "{\"a\":\"b\"}";
Your k.php file has a BOM signature at the start; save k.php again as UTF-8 without BOM.
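
To check a response for this from the client side, a minimal Python sketch follows. The sample body is constructed by hand, not fetched from the pages above. It relies on the UTF-8 BOM being the three bytes EF BB BF, which decode to the single character U+FEFF; that would be consistent with seeing 10 values instead of 9.

```python
import codecs
import json

def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

body = codecs.BOM_UTF8 + b'{"a":"b"}'       # hand-built stand-in for the response
print(len(body.decode("utf-8")))            # 10 characters: U+FEFF plus 9
print(json.loads(strip_utf8_bom(body).decode("utf-8")))
```

Decoding with the "utf-8-sig" codec has the same effect as stripping the BOM by hand, but the cleaner fix is still to remove it from the PHP source file.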

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except one thing: Some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them to curly left-quote and right-quote characters, and translates a single quote into a curly apostrophe. The characters in question belong to an MS code page (Windows-1252, bytes 0x91 to 0x94) which matches ISO-8859-1 except in the 0x80 to 0x9F range, where the silly quote characters live.
There is a perl script called 'demoroniser.pl' which undoes all this malarkey and converts the quotes back to plain ASCII.
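
The same idea can be sketched in Python. This is not the demoroniser script itself, only a small illustration that handles the four curly quote characters and assumes the text has already been decoded to Unicode correctly.

```python
# Curly quotes and their plain ASCII replacements.
SMART_TO_ASCII = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
}

def straighten_quotes(text: str) -> str:
    """Replace curly quotes with their plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)

print(straighten_quotes("He wasn\u2019t ready"))
```

This only helps while the original characters are still present; once they have been exported as literal question marks, the information is gone.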
It's most likely due to the fact that the data in the Access file is Unicode, and MDB Tools is trying to convert it to ASCII, Latin-1 (ISO-8859-1) or some other encoding. Since these encodings can't represent all Unicode characters, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.