Change encoding of many files - CSV

I have more than 500 CSV files, some of them with UTF-8 encoding and others with ASCII encoding.
Also, I don't have the exact name of each file (the names change weekly), so I have to open them using glob.
When I try to open the files, the ones with ASCII encoding raise an error: 'UnicodeDecodeError: utf-8 codec can't decode byte.....'
I have read about this and have tried the latin-1 encoding, changing the file extensions, etc., but I couldn't solve it.
How can I open those files using glob without errors, or how can I change the encoding of all files in a folder?
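For reference, the kind of fallback I have been attempting looks roughly like this (the weekly_exports pattern is just a placeholder for the real folder):
import csv
import glob

def read_rows(path):
    # Try UTF-8 first; fall back to latin-1, which can decode any byte sequence.
    try:
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f))
    except UnicodeDecodeError:
        with open(path, newline="", encoding="latin-1") as f:
            return list(csv.reader(f))

for path in glob.glob("weekly_exports/*.csv"):
    rows = read_rows(path)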
Thanks

Related

Trying to load a UTF-8 CSV file with a flat file source in SSIS, keep getting errors saying it is an ANSI file format

I have an SSIS data flow task that reads from a CSV file and stores the results in a table.
I am simply loading the CSV file by rows (not even separating the columns) and dumping the entire row to the database; a very simple process.
The file contains UTF-8 characters, and it also has a UTF-8 BOM, which I have verified.
Now when I load the file using a flat file connection, I have the following settings currently:
Unicode checked
Advanced editor shows the column as "Unicode text stream DT_NTEXT".
When I run the package, I get this error:
[Flat File Source [16]] Error: The data type for "Flat File Source.Outputs[Flat File Source Output].Columns[DataRow]" is DT_NTEXT, which is not supported with ANSI files. Use DT_TEXT instead and convert the data to DT_NTEXT using the data conversion component.
[Flat File Source [16]] Error: Unable to retrieve column information from the flat file connection manager.
It is telling me to use DT_TEXT, but my file is UTF-8 and it will lose its encoding, right? Makes no sense to me.
I have also tried unchecking the Unicode checkbox and setting the code page to "65001 UTF-8", but I still get an error like the one above.
Why does it say my file is an ANSI file?
I have opened my file in Sublime Text and saved it as UTF-8 with BOM. My preview of the flat file does show other languages correctly, such as Chinese and English combined.
When I leave Unicode unchecked, I also get an error saying the flat file's error output column is DT_TEXT, and when I try to change it to Unicode text stream, it gives me a popup error and doesn't allow the change.
I have faced this same issue for years, and to me it seems like it could be a bug with the Flat File Connection provider in SQL Server Integration Services (SSIS). I don't have a direct answer to your question, but I do have a workaround. Before I load data, I convert all UTF-8 encoded text files to UTF-16LE (Little Endian). It's a hassle, and the files take up about twice the amount of space uncompressed, but when it comes to loading Unicode into MS-SQL, UTF-16LE just works!
As for the actual conversion step, it is up to you to decide what works best in your workflow. When I have just a few files, I convert them one by one in a text editor; when I have a lot of files, I use PowerShell. For example:
Powershell -c "Get-Content -Encoding UTF8 'C:\Source.csv' | Set-Content -Encoding Unicode 'C:\UTF16\Source.csv'"
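If PowerShell is not an option, a rough Python sketch of the same folder-wide conversion might look like this (the C:\Source and C:\UTF16 folder names are assumptions based on the example above, and the input files are assumed to be UTF-8):
import glob
import os

src_dir, dst_dir = r"C:\Source", r"C:\UTF16"   # assumed folder names, based on the paths above
os.makedirs(dst_dir, exist_ok=True)
for src in glob.glob(os.path.join(src_dir, "*.csv")):
    dst = os.path.join(dst_dir, os.path.basename(src))
    # utf-8-sig tolerates an optional BOM on input; utf-16 writes a BOM plus
    # native-endian UTF-16 (UTF-16LE on typical Windows machines).
    with open(src, encoding="utf-8-sig", newline="") as fin, \
         open(dst, "w", encoding="utf-16", newline="") as fout:
        fout.write(fin.read())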

ï»¿ or ? character is prepended to first column when reading csv file from s3 by using camel

The csv file is located in S3 bucket, and I am using camel aws to consume the csv file.
However, whenever the CSV file is loaded locally, a ï»¿ or ? character is prepended to the first column.
For example,
original file
firstname, lastname
brian,xi
after load to local
ï»¿firstname,lastname
brian,xi
I have done research on this link: R's read.csv prepending 1st column name with junk text.
However, it does not seem to work for Camel.
How do you read the CSV file from S3?
I use aws-s3 to consume the CSV file from the S3 bucket, such as "Exchange s3File = consumer.receive(s3Endpoint)" where s3Endpoint = "aws-s3://keys&secret?prefix=%s&deleteAfterRead=false&amazonS3Client=#awsS3client"
The characters ï»¿ are a UTF-8 BOM (hex EF BB BF). So this is metadata about the file content that is placed at the beginning of the file (because there is no "header" or similar place where it could be stored).
If you read a file that begins with this sequence, but you read it with the Windows standard encoding (CP1252) or ISO-8859-1, you get exactly these three strange characters at the beginning of the file content.
To avoid that, you have to read the file as UTF-8 and BOM-aware, as suggested in jws's comment. He also provided this link with an example of how to use a BOMInputStream to read such files correctly.
If the file is correctly read, and you write it back into a file with a different encoding like CP1252, the BOM should be removed.
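As a quick byte-level illustration (in Python rather than Camel/Java), decoding those three bytes with different encodings shows both effects described above:
bom = b"\xef\xbb\xbf"                # the UTF-8 BOM bytes (EF BB BF)
data = bom + b"firstname,lastname"
print(data.decode("cp1252"))         # 'ï»¿firstname,lastname' - the strange prefix
print(data.decode("utf-8"))          # the BOM survives as an invisible U+FEFF character
print(data.decode("utf-8-sig"))      # 'firstname,lastname' - the BOM is stripped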
So, now the question is how exactly do you read the file with Camel? If you (or a library) read it (perhaps by default) with a non-UTF-8 encoding, that explains why you get these characters in the file content.

Is there any annotation / comment I can put in a file for PhpStorm to force file encoding?

We are using the Windows-1252 character set in one of our files. I have set the proper file encoding in PhpStorm > Settings for this particular PHP file; the rest of the project is UTF-8. This works for me.
The problem comes with other developers in my organization. They have UTF-8 encoding set in their settings and don't have this file-specific custom setting. When they save anything in this file, it converts the special characters.
Is there any doc block OR annotation like
// #FILE_ENCODING Windows-1252
that I can put in my PHP file so PhpStorm auto detects it?

Encoding Issue in Talend Open Studio

I am working on a Talend project where we are transforming data from thousands of XML files to CSV, and we are creating the CSV files with UTF-8 encoding from Talend itself.
But the issue is that some of the files are created as UTF-8 and some of them as ASCII. I am not sure why this is happening; the files should always be created as UTF-8.
As mentioned in the comments, UTF-8 is a superset of ASCII. This means that the encoded bytes for any ASCII character are the same in UTF-8 as in ASCII.
Any program identifying a file containing only ASCII characters will then simply assume it is ASCII-encoded. It is only when you include characters outside of the ASCII character set that the file may be recognised as UTF-8 by whatever heuristic the reading program uses.
The only exception to this is for file types that specifically state their encoding. This includes things like (X)HTML and XML which typically start with an encoding declaration.
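A quick way to see this (a Python illustration, independent of Talend) is that an ASCII-only string encodes to exactly the same bytes either way, so a detector has nothing to go on until a non-ASCII character appears:
text = "firstname,lastname"                          # ASCII-only sample content
assert text.encode("ascii") == text.encode("utf-8")  # identical bytes, nothing to detect
print("Hôpital".encode("utf-8"))                     # b'H\xc3\xb4pital' - the multi-byte sequence marks it as UTF-8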
You can go to the Advanced tab of the tFileOutputDelimited (or other kind of tFileOutxxx) you are using and select UTF-8 encoding.
I am quite sure the Unix file utility makes assumptions based on the content of the file being in some range and/or having a specific start (magic numbers). In your case, if you generate a perfectly valid UTF-8 file but only use the ASCII subset, the file utility will probably flag it as ASCII. In that event you are fine, as you have a valid UTF-8 file. :)
To force Talend to output the file as you wish, you can add an additional column to your file (for example in a tMap) and put a UTF-8 (non-ASCII) character in this column. The generated file will then be detected as UTF-8, as the other repliers mentioned.

Reading a CSV w/ CFFile & Non-Roman Characters

Update: The original CSV was created in Excel; when I copied the data into a Google Spreadsheet and downloaded a CSV from Drive, it worked fine. I'm guessing there's an encoding issue w/ the Excel CSV? Is there any way to work around this w/ Excel, or do we need to tell our clients to use Google Docs?
I've got a CSV w/ non-Roman characters (my example is in French, but we also support entirely non-Roman languages such as Arabic and Thai) that I'm reading via ColdFusion's cffile. The problem is that the output from the read converts all the accented characters into a weird ? symbol (�). There was originally no charset specified on the cffile, so I tried adding utf-8 (no change) and utf-16 (everything is converted to sort-of Chinese?).
Anyone know how I can get this data out of the CSV without losing/messing up the characters?
CSV Example:
Smith,Joan,joan.smith#test.com,Hôpital Jésus
Original cffile:
<cffile action="read" file="#expandedFilePath#" variable="strCSV">
cffile w/ charset added:
<cffile action="read" file="#expandedFilePath#" variable="strCSV" charset="utf-8">
cfdump of strCSV (no charset/utf-8 charset):
Smith,Joan,joan.smith#test.com,H�pital J�sus
cfdump of strCSV (utf-16 charset):
卭楴栬䩯慮ⱪ潡渮獭楴桀瑥獴⹣潭ⱈ楴慬⁊畳ഊ
Excel, like most Windows programs, uses the CP-1252 encoding (not UTF-8; and, importantly, also NOT ISO-8859-1, which is what most encoding guessers report). Have you already tried:
<cffile action="read" file="#expandedFilePath#"
variable="strCSV"
charset="windows-1252" />
If this works, can you rely on your inputs to always be default Windows files?
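To see the difference outside ColdFusion, here is a small Python sketch (the sample text comes from the question): the same CP-1252 bytes decode cleanly as windows-1252 but come out garbled when forced through UTF-8:
raw = "Hôpital Jésus".encode("cp1252")        # bytes as Excel typically writes them
print(raw.decode("windows-1252"))             # 'Hôpital Jésus' - correct
print(raw.decode("utf-8", errors="replace"))  # 'H�pital J�sus' - the garbled output from the question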