Import a CSV in another language into SAS

I am attempting to import a CSV file which is in French to my US based analysis. I have noticed several issues in the import related to the use of accents. I put the csv file into a text reader and found that the data look like this
I am unsure how to get rid of the [sub] pieces and format this properly.
I am on SAS 9.3 and am unable to edit the CSV as it is a shared CSV with French researchers. I am also limited to what I can do in terms of additional languages within SAS because of admin rights.
I have tried the following fixes:
data want(encoding=asciiany);
set have;
comment= Compress(comment,'0D0A'x);
comment= TRANWRD(comment,'0D0A'x,'');
comment= TRANWRD(comment,'0D'x,'');
comment= TRANWRD(comment,"\u001a",'');
How can I resolve these issues?

While this would have been a major issue a few decades ago, nowadays, it's very simple to determine the encoding and then run your SAS in the right mode.
First, open the CSV in a text editor, not the basic Notepad but almost any other; Notepad++ is free, for example, or Ultraedit or Textpad, on Windows, or on the Mac, BBEdit, or several others will do. I'll assume Notepad++ for the rest of this answer, but all of them have some way of doing this. If you're in a restricted no-admin-rights environment, good news: Notepad++ can be installed in your user folder with no admin rights (or even on a USB!). (Also, an advanced text editor is a vital data science tool, so you should have one anyway.)
In Notepad++, once you open the file there will be an encoding in the bottom right: "UTF-8", "WLATIN1", "ASCII", etc., depending on the encoding of the file. Look and see what that is, and write it down.
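If you cannot install an editor at all, the same check can be approximated in a few lines of Python. This is only a rough sketch, not what Notepad++ does internally: the function name `guess_encoding` and the candidate list are my own assumptions, and because cp1252 accepts almost any byte sequence it only makes sense as the last fallback.

```python
def guess_encoding(data: bytes, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate encoding that decodes `data` without error.

    Hypothetical helper: the order matters, strictest encoding first.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# A French sample stored as UTF-8 is not valid ASCII but is valid UTF-8.
sample = "Qualité très élevée".encode("utf-8")
print(guess_encoding(sample))
```

On a real file you would pass in the bytes read with `open(path, "rb").read()`.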
Once you have that, you can try starting SAS in that encoding. For the rest of this, I assume it is UTF-8, as that is fairly standard, but replace UTF-8 with whatever encoding you determined earlier.
See this article for more details; the instructions are for 9.4, but they have been the same for years. If this doesn't work, you'll need to talk to your SAS administrator, and they may need to modify your SAS installation.
You can either:
Make a new shortcut (a copy of the one you run SAS with) and add -encoding UTF-8 to the command line
Create a new configuration file, point SAS to it, and include ENCODING=UTF-8 in the configuration file.
Note that this will have some other impacts - the datasets you create will be encoded in UTF-8, and while SAS is capable of handling that, it will add some extra notes to the log and some extra time if you later do work in non-UTF8 SAS with this, or if you use non-UTF8 SAS datasets in this mode.

This worked:
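data want;
/* accented characters and their plain ASCII replacements */
array f[8] $4 _temporary_ ('ä' 'ö' 'ü' 'ß' 'Ä' 'Ö' 'Ü' 'É');
array t[8] $4 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue' 'E');
set have;
newvar=oldvar;
/* remove carriage returns and line feeds */
newvar = Compress(newvar,'0D0A'x);
newvar = TRANWRD(newvar,'0D0A'x,'');
newvar = TRANWRD(newvar,'0D'x,'');
/* '1A'x is the SUB control character; SAS does not interpret "\u001a" as an escape */
newvar = TRANWRD(newvar,'1A'x,'');
/* keep only printable ("writable") characters */
newvar = compress(newvar, , 'kw');
do _n_=1 to dim(f);
newvar=tranwrd(newvar, trim(f[_n_]), trim(t[_n_]));
end;
run;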
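For comparison, here is roughly the same cleanup as a Python sketch, in case the file can be pre-processed outside SAS. The names `REPLACEMENTS` and `clean` are made up for illustration; the mapping is copied from the arrays above, and `str.isprintable` is only an approximation of what the 'kw' modifier keeps.

```python
# Same transliteration table as the f/t arrays in the SAS step.
REPLACEMENTS = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "É": "E",
}
TABLE = str.maketrans(REPLACEMENTS)


def clean(text: str) -> str:
    """Drop control characters (CR, LF, SUB, ...) and transliterate accents."""
    printable = "".join(ch for ch in text if ch.isprintable())
    return printable.translate(TABLE)


print(clean("Straße\r\n\x1a"))
```

Characters that are not in the table, such as a lowercase é, pass through unchanged, just as they do in the SAS version.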

How do you fix the following error? I am trying to use the Table Data Import Wizard to load a CSV file into Workbench

I am trying to upload a .csv file into Workbench using the Table Data Import Wizard.
I receive the following error whenever attempting to load it:
Unhandled exception: 'ascii' codec can't decode byte 0xc3 in position 1253: ordinal not in range(128)
I have tried previous solutions that suggested I encode the .csv file as an MS-DOS csv and as a UTF-8 csv. Neither has worked for me.
Changing the data in the file is not feasible, since it is made up of thousands of cells. Is there anything that can be done to resolve this?
What was after the C3? What should have been there?
C3, when interpreted as "latin1", is Ã, an unlikely character.
More likely is a 2-byte UTF-8 code that starts with C3. This includes the accented letters of Western European languages. Example é, hex C3A9.
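To see the three interpretations side by side, here is a small Python sketch. It only illustrates the byte-level behaviour; it is not what the Workbench wizard runs, and the variable names are mine.

```python
raw = bytes.fromhex("c3a9")  # the two bytes C3 A9

as_utf8 = raw.decode("utf-8")      # one character: e with acute accent
as_latin1 = raw.decode("latin-1")  # two characters: A with tilde, copyright sign

# ASCII only covers bytes 0x00-0x7F, so C3 cannot be decoded at all.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

print(as_utf8, as_latin1)
print(ascii_error)
```

The last line prints the same kind of message as the wizard's error, which suggests that something in the import path is decoding the file as ASCII rather than UTF-8.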
You tried "UTF-8 csv" -- Please provide the specifics of how you tried it. What settings in the Wizard, etc.
Probably you should state that the data is "UTF-8" or utf8mb4, depending on whether you are referring to outside or inside MySQL.
Meanwhile, if you are loading the data into an existing "table", let's see SHOW CREATE TABLE. It should probably not say "ascii" anywhere; instead, it should probably say "utf8mb4".

Importing foreign languages from csv file to Stata

I am using Stata 12 and have run into the following problem. I am importing a bunch of .csv files into Stata using the insheet command. The datasets may include Russian, Croatian, Turkish, etc. I think they are encoded in "UTF-8". In the .csv files, the strings are correct. After I import them into Stata, the original strings are wrong and turn into strange characters. Would you please help me with that? Can Stat-Transfer solve the problem? Does it support the .csv format?
For example,
the original file is like:
My code is like:
insheet using name.csv, c n
save name.dta,replace
The result is like:
And I have tried to adjust the script in the fonts option, which does not work.
As @Nick Cox commented earlier, the problem is that Stata just doesn't support Unicode/UTF-8 encoding.
No, StatTransfer wouldn't resolve the problem (please refer to this explanation).
You can do the trick using an online decoder or MS Word. Let's do it with one language first, say, Russian as in your screenshots. Check out the correct encodings for Croatian, Turkish, and other languages you have.
Save the string variable from your .csv file as plain text (.txt), choosing the UTF-8 encoding option.
Encoding conversion:
Use iconv, suggested by @Dimitriy V. Masterov, or
Use an online tool, such as this: upload .txt file, choose source encoding as UTF-8 and output encoding according to the language of interest (for Russian, it must be CP1251), click "convert" button and save the output file, or
If you have MS Office, you can also use MS Word for the same purpose. Right click on the .txt file, choose "Open with...", and open it with MS Word. In the dialog that appears, confirm that the file encoding is "Unicode (UTF-8)" and open the file. Then click "Save as..." and save as plain text. In the next dialog, choose "Cyrillic (Windows)" and mark "Insert line breaks". Save.
Check out your new .txt file - it still should have some strange characters (like ÌßÑÎÊÎÌÁÈÍÀÒ) but now Stata can display them properly.
Copy-paste the new string variable into the Stata Data Editor, right click on the variable, choose "Font...", and then set the script to "Cyrillic". You should see the correct names on the screen both in the Data Editor and in the Results window (even though the string itself is unchanged).
Depending on your OS, you might need to install all appropriate languages first.
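If Python is available, the conversion step can also be scripted instead of using an online tool or MS Word. This is a minimal sketch that assumes the source really is UTF-8 and the target is CP1251 (Russian); the helper name `recode` is hypothetical.

```python
def recode(data: bytes, source: str = "utf-8", target: str = "cp1251") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


utf8_bytes = "МЯСОКОМБИНАТ".encode("utf-8")
cp1251_bytes = recode(utf8_bytes)

# Each Cyrillic letter takes 2 bytes in UTF-8 but only 1 byte in CP1251.
print(len(utf8_bytes), len(cp1251_bytes))

# Viewed with a Western European code page, the same bytes look "strange".
print(cp1251_bytes.decode("cp1252"))
```

The last line shows why the converted file still looks wrong in an editor that assumes a Western code page: the bytes are fine, only the display font or script is not Cyrillic.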
Hope it helps.
Updated answer: As of version 14, all of Stata is Unicode aware: results, help files, do-files, ado-files, data labels, etc.
This does not help users limited to versions of Stata before 14, but it is one kind of solution. Using the OP's example:
. insheet using "/home/Alexis/Desktop/data.csv"
(3 vars, 4 obs)
. ed
. list
+------------------------------------------------------------------------------+
| v1 v2 v3 |
|------------------------------------------------------------------------------|
1. | RU00040778 RUS ПРAЙCBOTEРXAУCKУПEРC AУДИT |
2. | RU00044434 RUS КПMГ |
3. | RU00044428 RUS Эрнст энд Янг |
4. | RU00044428 RUS Аудиторско-консулбтационная группа Раэвитие Биэнес-систем |
+------------------------------------------------------------------------------+

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using Cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, and I need it to be an extended ASCII character. I tried various approaches, but they all fail.
The following work, but I need to use an extended ASCII character for my purpose:
copy <tablename> (<columnnames>) from <filename> with delimiter='|' and quote='"';
copy <tablename> (<columnnames>) from <filename> with delimiter='|' and quote='~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
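To make that concrete, here is a small self-contained sketch that uses those defaults to write and re-read a value containing a double quote. It runs on Python 3 with in-memory buffers, whereas the cqlsh of that era is a Python 2 script, so treat it as an illustration of the dialect rather than of cqlsh itself. The last line is my guess at why 'ß' is rejected: it is one character but two bytes in UTF-8.

```python
import csv
import io

# Copied from the cqlsh source quoted above.
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')

rows = [['1', 'He said "hi"'], ['2', 'plain']]

# Write the rows; embedded quotes are escaped with a backslash.
buffer = io.StringIO()
csv.writer(buffer, **csv_dialect_defaults).writerows(rows)
text = buffer.getvalue()

# Read them back with the same dialect.
parsed = list(csv.reader(io.StringIO(text), **csv_dialect_defaults))
print(parsed == rows)

# Assumption: a byte-oriented csv module sees 'ß' in UTF-8 as 2 "characters".
eszett_utf8_len = len('ß'.encode('utf-8'))
print(eszett_utf8_len)
```

With this dialect the data can keep its double quotes, so a different quote character may not be needed at all.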

Migrating MS Access data to MySQL: character encoding issues

We have an MS Access .mdb file produced, I think, by an Access 2000 database. I am trying to export a table to SQL with mdbtools, using this command:
mdb-export -S -X \\ -I orig.mdb Reviewer > Reviewer.sql
That produces the file I expect, except one thing: Some of the characters are represented as question marks. This: "He wasn't ready" shows up like this: "He wasn?t ready", only in some cases (primarily single/double curly quotes), where maybe the content was pasted into the DB from MS Word. Otherwise, the data look great.
I have tried various values for "export MDB_ICONV=". I've tried using iconv on the resulting file, with ISO-8859-1 in the from/to, with UTF-8 in the from/to, with WINDOWS-1250 and WINDOWS-1252 and WINDOWS-1256 in the from, in various combinations. But I haven't succeeded in getting those curly quotes back.
Frankly, based on the way the resulting file looks, I suspect the issue is either in the original .mdb file, or in mdbtools. The malformed characters are all single question marks, but it is clear that they are not malformed versions of the same thing; so (my gut says) there's not enough data in the resulting file; so (my gut says) the issue can't be fixed in the resulting file.
Has anyone run into this one before? Any tips for moving forward? FWIW, I don't have and never have had MS Access -- the file is coming from a 3rd party -- so this could be as simple as changing something on the database, and I would be very glad to hear that.
Thanks.
Looks like "smart quotes" have claimed yet another victim.
MS Word takes plain ASCII quotes and translates them into curly left and right quotation marks, and translates a plain single quote into a curly apostrophe. The characters in question belong to an MS code page (Windows-1252) that is roughly compatible with ISO-8859-1 except for a block that includes the silly quote characters.
There is a Perl script called 'demoroniser' which undoes all this malarkey and converts the quotes back to plain ASCII.
It's most likely due to the fact that the data in the Access file is UTF, and MDB Tools is trying to convert it to ASCII/Latin/ISO-8859-1 or some other encoding. Since these encodings don't map all the UTF characters properly, you end up with question marks. The information here may help you fix your encoding issues by getting MDB Tools to use the correct encoding.
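The effect is easy to reproduce in Python. This sketch only illustrates the encoding behaviour and says nothing about how MDB Tools converts internally; the variable names are mine.

```python
text = "He wasn\u2019t ready"  # U+2019 is the curly apostrophe Word inserts

# ISO-8859-1 has no curly apostrophe, so a lossy conversion substitutes '?'.
lossy = text.encode("iso-8859-1", errors="replace")

# Windows-1252 and UTF-8 can both represent it.
cp1252 = text.encode("cp1252")
utf8 = text.encode("utf-8")

print(lossy)
print(cp1252)
print(utf8)
```

Once the character has become a literal '?', the original cannot be recovered from the output file, which matches the asker's suspicion; the conversion has to be fixed at export time.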

Can SAS convert CSV files into Binary Format?

The output we need to produce is a standard delimited file but instead of ascii content we need binary. Is this possible using SAS?
Is there a specific Binary Format you need? Or just something non-ascii? If you're using proc export, you're probably limited to whatever formats are available. However, you can always create the csv manually.
If anything will do, you could simply zip the csv file.
Running on a *nix system, for example, you'd use something like:
filename outfile pipe "gzip -c > myfile.csv.gz";
Then create the csv manually:
data _null_;
set mydata;
/* DSD writes comma-delimited values and quotes any value that contains a comma */
file outfile dsd;
put var1 var2 var3;
run;
If this is PC/Windows SAS, I'm not as familiar, but you'll probably need to install a command-line zip utility.
This link from SAS suggests using winzip, which has a freely downloadable version. Otherwise, the code is similar.
http://support.sas.com/kb/26/011.html
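If the zipping can happen outside SAS, the same "write a CSV, then gzip it" idea looks like this in Python. This is a rough sketch with made-up sample rows and a hypothetical helper name, working in memory rather than on files.

```python
import csv
import gzip
import io


def rows_to_gzipped_csv(rows) -> bytes:
    """Serialise rows as CSV text, then gzip the UTF-8 bytes."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


rows = [["name", "age"], ["Alfred", "14"]]
blob = rows_to_gzipped_csv(rows)

# The result is binary: gzip streams start with the bytes 1F 8B.
print(blob[:2])

# Decompressing gives back the original delimited text.
restored = gzip.decompress(blob).decode("utf-8")
print(restored)
```

Whether a gzipped CSV counts as the "binary" format the recipient expects is something to confirm with them first.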
You can actually make a CSV file as a SAS catalog entry; CSV is a valid SAS Catalog entry type.
Here's an example:
filename of catalog "sasuser.test.class.csv";
proc export data=sashelp.class
outfile=of
dbms=dlm;
delimiter=',';
run;
filename of clear;
This little piece of code exports SASHELP.CLASS to a SAS Catalog entry of entry type CSV.
This way you get a binary format you can move between SAS installations on different platforms with PROC CPORT/CIMPORT, not having to worry if the used binary package format is available to your SAS session, since it's an internal SAS format.
Are you saying you have binary data that you want to output to csv?
If so, I don't think there is necessarily a defined standard for how this should be handled.
I suggest trying it (proc export comes to mind) and seeing if the results match your expectations.
Using SAS, output a .csv file; Open it in Excel and Save As whichever format your client wants. You can automate this process with a little bit of scripting in ### as well. (Substitute ### with your favorite scripting language.)