Importing foreign-language text from a CSV file into Stata

I am using Stata 12 and importing a number of .csv files with the insheet command. The datasets may include Russian, Croatian, Turkish, and other languages, and I believe they are encoded in UTF-8. The text looks correct in the .csv files, but after I import them into Stata the original strings turn into strange characters. Could you please help me with that? Can Stat/Transfer solve the problem? Does it support the .csv format?
For example,
the original file is like:
My code is like:
insheet using name.csv, c n
save name.dta,replace
The result is like:
I have also tried changing the script in the font options, but that did not work.

As @Nick Cox commented earlier, the problem is that Stata 12 simply does not support Unicode/UTF-8 encoding.
No, Stat/Transfer would not resolve the problem (please refer to this explanation).
You can work around this using an online converter or MS Word. Let's do it with one language first, say Russian, as in your screenshots; look up the correct encodings for Croatian, Turkish, and the other languages you have.
Save the string variable from your .csv file as plain text (.txt), choosing the UTF-8 encoding option.
Encoding conversion:
Use iconv, as suggested by @Dimitriy V. Masterov, or
Use an online tool, such as this one: upload the .txt file, choose UTF-8 as the source encoding and the output encoding that matches the language of interest (for Russian, CP1251), click the "convert" button, and save the output file, or
If you have MS Office, you can also use MS Word for the same purpose. Right-click the .txt file, choose "Open with...", and open it with MS Word. In the dialog that appears, confirm that the file encoding is "Unicode (UTF-8)" and open the file. Then click "Save as..." and save as plain text. In the next dialog, choose "Cyrillic (Windows)" and check "Insert line breaks". Save.
Check your new .txt file: it should still show strange characters (like ÌßÑÎÊÎÌÁÈÍÀÒ), but now Stata can display them properly.
Copy and paste the new string variable into the Stata Data Editor, right-click the variable, choose "Font...", and then select the script "Cyrillic". You should see the correct names on screen in both the Data Editor and the Results window (even though the stored string itself is unchanged).
Depending on your OS, you might need to install all appropriate languages first.
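If neither iconv nor an online tool is at hand, the conversion step can also be scripted. The following is only a minimal Python sketch of the same UTF-8 to CP1251 re-encoding; the function names reencode and convert_file and the file paths you would pass in are illustrative assumptions, not part of the original answer.

```python
def reencode(data: bytes, source: str = "utf-8", target: str = "cp1251") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    # Strict error handling raises an exception if a character has no slot
    # in the target code page, instead of silently dropping it.
    return data.decode(source, errors="strict").encode(target, errors="strict")


def convert_file(src_path: str, dst_path: str,
                 source: str = "utf-8", target: str = "cp1251") -> None:
    """Re-encode a whole file; both paths are placeholders for your own files."""
    with open(src_path, "rb") as src:
        converted = reencode(src.read(), source, target)
    with open(dst_path, "wb") as dst:
        dst.write(converted)


sample = "МЯСОКОМБИНАТ".encode("utf-8")
converted = reencode(sample)
# Viewed through a Western code page, the CP1251 bytes show up as accented
# Latin letters, which is why the converted .txt file still looks "strange".
print(converted.decode("latin-1"))
```

Each Cyrillic letter takes one byte in CP1251, which is what a pre-Unicode Stata expects once the "Cyrillic" script is selected for the font.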
Hope it helps.

Updated answer: As of version 14, all of Stata is Unicode-aware: results, help files, do-files, ado-files, data labels, and so on.
This does not help users limited to versions of Stata before 14, but it is one kind of solution. Using the OP's example:
. insheet using "/home/Alexis/Desktop/data.csv"
(3 vars, 4 obs)
. ed
. list
+------------------------------------------------------------------------------+
| v1 v2 v3 |
|------------------------------------------------------------------------------|
1. | RU00040778 RUS ПРAЙCBOTEРXAУCKУПEРC AУДИT |
2. | RU00044434 RUS КПMГ |
3. | RU00044428 RUS Эрнст энд Янг |
4. | RU00044428 RUS Аудиторско-консулбтационная группа Раэвитие Биэнес-систем |
+------------------------------------------------------------------------------+

Related

ERP ( IFS ) export into CSV - coding problem

I'm exporting some data from the ERP system (IFS) into a CSV file. From that CSV it is uploaded to another tool.
I have a problem with character encoding. Until now we were pulling only Danish and Finnish data and used WE8MSWIN1252. Now we also need to include Polish characters. Unfortunately, the encoding we have does not cover the special characters in Polish. I've already tried AL16UTF16, AL32UTF8, and EEC8EUROASCI, and none of them gave me the expected result (having all of the Danish, Finnish, and Polish special characters visible correctly in the CSV). Is there any encoding that would cover all of those characters in the CSV? When I opened the AL32UTF8 output in Notepad it looked fine, but we have to use the CSV because of the integration that is the next step in the puzzle.
Please note that changing the CSV to anything else is really the last resort. We don't want to touch the integration further down the line.

import CSV in another language to SAS

I am attempting to import a CSV file which is in French to my US based analysis. I have noticed several issues in the import related to the use of accents. I put the csv file into a text reader and found that the data look like this
I am unsure how to get rid of the [sub] pieces and format this properly.
I am on SAS 9.3 and am unable to edit the CSV as it is a shared CSV with French researchers. I am also limited to what I can do in terms of additional languages within SAS because of admin rights.
I have tried the following fixes:
data want(encoding=asciiany);
set have;
comment= Compress(comment,'0D0A'x);
comment= TRANWRD(comment,'0D0A'x,'');
comment= TRANWRD(comment,'0D'x,'');
comment= TRANWRD(comment,"\u001a",'');
How can I resolve these issues?
While this would have been a major issue a few decades ago, nowadays, it's very simple to determine the encoding and then run your SAS in the right mode.
First, open the CSV in a text editor, not the basic Notepad but almost any other; Notepad++ is free, for example, or Ultraedit or Textpad, on Windows, or on the Mac, BBEdit, or several others will do. I'll assume Notepad++ for the rest of this answer, but all of them have some way of doing this. If you're in a restricted no-admin-rights environment, good news: Notepad++ can be installed in your user folder with no admin rights (or even on a USB!). (Also, an advanced text editor is a vital data science tool, so you should have one anyway.)
In Notepad++, once you open the file there will be an encoding in the bottom right: "UTF-8", "WLATIN1", "ASCII", etc., depending on the encoding of the file. Look and see what that is, and write it down.
Once you have that, you can try starting SAS in that encoding. For the rest of this answer I assume it is UTF-8, as that is fairly standard, but replace UTF-8 with whatever encoding you determined earlier.
See this article for more details; the instructions are for 9.4, but they have been the same for years. If this doesn't work, you'll need to talk to your SAS administrator, and they may need to modify your SAS installation.
You can either:
Make a new shortcut (a copy of the one you run SAS with) and add -encoding UTF-8 to the command line
Create a new configuration file, point SAS to it, and include ENCODING=UTF-8 in the configuration file.
Note that this will have some other impacts - the datasets you create will be encoded in UTF-8, and while SAS is capable of handling that, it will add some extra notes to the log and some extra time if you later do work in non-UTF8 SAS with this, or if you use non-UTF8 SAS datasets in this mode.
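As a rough cross-check on what the text editor reports, the encoding can also be guessed from the raw bytes. This is only a heuristic Python sketch: the function name guess_encoding is made up for illustration, and the final fallback to cp1252 (the Windows code page that SAS calls WLATIN1) is an assumption, since a single-byte code page cannot be identified from the bytes alone.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Return a rough guess at the encoding of some raw file bytes."""
    # A byte order mark at the start is the strongest hint.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Pure ASCII decodes the same way in nearly every encoding.
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Text with accents that decodes cleanly as UTF-8 is very likely UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is treated as Windows-1252 (WLATIN1).
        return "cp1252"


print(guess_encoding("résumé".encode("utf-8")))
print(guess_encoding("résumé".encode("cp1252")))
```

To try it on a real file, read the file in binary mode and pass the bytes to guess_encoding; the result is a hint to confirm in the editor, not a guarantee.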
This worked:
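data want;
/* Accented characters and their ASCII transliterations */
array f[8] $4 _temporary_ ('ä' 'ö' 'ü' 'ß' 'Ä' 'Ö' 'Ü' 'É');
array t[8] $4 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue' 'E');
set have;
newvar=oldvar;
/* Strip carriage returns and line feeds */
newvar = Compress(newvar,'0D0A'x);
newvar = TRANWRD(newvar,'0D0A'x,'');
newvar = TRANWRD(newvar,'0D'x,'');
newvar = TRANWRD(newvar,"\u001a",'');
/* Keep only printable ("writable") characters */
newvar = compress(newvar, , 'kw');
/* Transliterate the remaining accented characters */
do _n_=1 to dim(f);
newvar=tranwrd(newvar, trim(f[_n_]), trim(t[_n_]));
end;
run;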

Octave - dlmread and csvread convert the first value to zero

When I try to read a csv file in Octave I realize that the very first value from it is converted to zero. I tried both csvread and dlmread and I'm receiving no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. Files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
(with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your csv file contains a Byte Order Mark (BOM) right before the first number. You can confirm this by opening the file in a hex editor: you will see the byte sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers in 'front' of the string sequence, it is parsed as the number zero (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file and press the right arrow key once, the cursor does not visibly move (meaning it has just stepped over the invisible byte order mark, which takes up no visible space). Pressing backspace at this point deletes the byte order mark and allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper ASCII file (or UTF-8 without the byte order mark).
Also, it may be worth checking how this file was produced; if you have any control in that process, perhaps you can find why this mark was placed in the first place and prevent it. E.g., if this was exported from Excel, you can choose plain 'csv' format instead of 'utf-8 csv'.
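If editing the file by hand is impractical, the byte order mark can also be removed with a short script. Here is a minimal Python sketch, assuming the file is small enough to read into memory; the function names strip_utf8_bom and remove_bom_from_file are illustrative only.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark (EF BB BF)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def remove_bom_from_file(path: str) -> None:
    """Rewrite a file in place without its BOM; the path is a placeholder."""
    with open(path, "rb") as handle:
        raw = handle.read()
    with open(path, "wb") as handle:
        handle.write(strip_utf8_bom(raw))


with_bom = codecs.BOM_UTF8 + b"1.1,2.1,3.1,4.1,5.1\n"
print(strip_utf8_bom(with_bom))
```

Working on raw bytes leaves the rest of the file untouched, so the numbers and line endings reach csvread or dlmread exactly as they were written.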
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of Octave. See #58813 :)

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a csv file containing Chinese characters in Microsoft Excel, TextWrangler, or Sublime Text, some Chinese characters are not displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly is shown here:
As you can see, a ? appears in its place.
The Mac file command (suggested by http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/) tells me that the csv file is encoded as utf-16le.
I am wondering what the problem is. Why can't I read that specific text?
Is it related to the encoding, or to my laptop's settings? Neither macOS nor Windows 10 on the Mac (via Parallels Desktop) displays the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with U+E05E  Private Use Area Character.
For PUA cases that you're sure are the result of HKSCS-compatibility-bodge, you can convert back to proper Unicode using this table.
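To see how many such characters a file contains before reaching for the table, the Private Use Area code points can be located programmatically. The sketch below is Python and rests on assumptions: the function names are invented for illustration, and the mapping holds only the single pair mentioned in this answer (U+E05E to U+6ED9), so a real repair would need the full table.

```python
import unicodedata

# Only the one pair named in the answer; the complete HKSCS table is not reproduced here.
HKSCS_PUA_FIXES = {"\ue05e": "\u6ed9"}


def find_private_use(text: str) -> list:
    """List (index, code point) pairs for every private-use character in text."""
    # Unicode general category "Co" marks private-use code points.
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Co"]


def repair(text: str, table: dict = HKSCS_PUA_FIXES) -> str:
    """Replace known private-use characters with their proper Unicode forms."""
    return "".join(table.get(ch, ch) for ch in text)


mangled = "\ue05e\u8c50"  # the first two characters of the mangled company name
print(find_private_use(mangled))
print(find_private_use(repair(mangled)))
```

Characters that are reported but missing from the table are left as they are, which makes it easy to see what still needs a manual lookup.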

Using EmEditor saving a Unicode file to another format distorts/changes the format. Solution?

There is a MySQL backup file which is a huge file - about 3 GB. There is one table that has a LONGBLOB column that stores JPEG image data.
The file imports successfully if done from MySQL Workbench - Data Import/Restore.
I need to open this file and extract the first few lines (about two rows of INSERTs of the table with the image data) so that I can test if another program can import this data into another MySQL database.
I tried opening the file with EmEditor (which is good at opening large files), copying everything up to one INSERT statement of the script (up to about line 25, because the table in question is the first table in the backup script), and pasting the selection into a new file.
Here comes the problem:
this messes up the encoding (even though I save as utf8). I realized this when I imported (restored) the new file into a MySQL database, again using MySQL Workbench: the restore went ahead without errors, but the JPEG images in the blob column were destroyed/corrupted.
My guess is that the encoding is different between the original file and the new file.
EmEditor does not show the encoding of the original file. There is an option to detect it, and it detects 'UTF8 Unsigned', but when saving I save as UTF8. I also tried saving as ANSI, ISO8859 (Windows default), and so on, but every time I get the same result.
Do you have any solution for this particular problem? That is, I want to cut only the first few lines of the huge backup file and save them to a new file with the same encoding, so that the images (blobs) are not changed. Is there any way this can be done with EmEditor (i.e., is cut-and-paste the wrong approach)? Is there any specialized software that can do this? How can I diagnose what is going wrong here?
Thanks for any responses.
"this messes up the encoding (even though I save as utf8)"
UTF-8 is not a good choice for arbitrary binary data. There are many sequences of high-bytes which are not valid in UTF-8, so you will mangle them at some point during the load-alter-save process.
If you load the file using an encoding that maps every single byte to a unique character, and re-save the file using that same encoding, you should preserve the original content(*). ISO-8859-1 is the encoding usually chosen for this purpose, since it simply maps each byte 0..0xFF to the Unicode code point with the same number.
(*: assuming the editor is binary-safe with regard to other tricky points like nulls, \n/\r and other control characters... I believe EmEditor can be.)
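The difference between the two encodings can be checked directly. This small Python sketch (the helper name round_trips is illustrative) decodes and re-encodes every possible byte value, plus the first bytes of a typical JPEG file, to show which encoding preserves them.

```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Return True if decoding and re-encoding gives back the same bytes."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding at all.
        return False


every_byte = bytes(range(256))
print(round_trips(every_byte, "iso-8859-1"))  # True: each byte maps to one code point
print(round_trips(every_byte, "utf-8"))       # False: many high-byte sequences are invalid

# A lenient decoder swaps invalid bytes for U+FFFD, so the saved bytes differ.
jpeg_header = b"\xff\xd8\xff\xe0"
mangled = jpeg_header.decode("utf-8", errors="replace").encode("utf-8")
print(mangled == jpeg_header)  # False
```

An editor that replaces invalid sequences on load behaves like the last three lines, which would explain images that import without errors but no longer open.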
When opening the original file in EmEditor, try selecting Binary (ASCII View) as the encoding. Binary (ASCII View) will, as bobince said, map each byte to a unique character and preserve it when you save the file. I think this should fix your problem.