How to read data from a CSV file of two possible encodings? - csv

I want to read data from CSV files that may come in one of two encodings (UTF-8 or ISO-8859-15). These are different files with different encodings, not the same file in two encodings.
Right now I can only read data correctly from a UTF-8 encoded file. Can I handle the other case just by adding an extra option, e.g. encoding: 'ISO-8859-15'?
What I have:

def csv
  file = File.open(file.tempfile)
  CSV.open(file, csv_options)
end

private

def csv_options
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true
  }
end

Once you know which encoding a given file has, you can pass it inside the CSV options, e.g.:

external_encoding: Encoding::ISO_8859_15,
internal_encoding: Encoding::UTF_8

(This declares that the file is ISO-8859-15, but that you want the strings as UTF-8 internally.)
So the strategy is to decide first, before opening the file, which encoding it has, and then use the appropriate options Hash.
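The same decide-or-fall-back idea can be sketched in Python for illustration (the semicolon separator mirrors the question; the helper name is made up): try UTF-8 first and retry with ISO-8859-15 when decoding fails.

```python
import csv

def read_semicolon_csv(path):
    # Try UTF-8 first; if the bytes don't decode, retry as ISO-8859-15.
    for encoding in ('utf-8', 'iso-8859-15'):
        try:
            with open(path, newline='', encoding=encoding) as f:
                return list(csv.reader(f, delimiter=';'))
        except UnicodeDecodeError:
            continue
    raise ValueError('neither UTF-8 nor ISO-8859-15: %s' % path)
```

Note that some ISO-8859-15 byte sequences happen to also be valid UTF-8, so an explicit per-file decision, as the answer above suggests, is more robust than guessing.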

Related

Write a CSV based on another CSV file creating an additional empty row? [duplicate]

import csv
import collections

with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))

counter = collections.defaultdict(int)
for row in data:
    counter[row[10]] += 1

with open('/pythonwork/thefile_subset11.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in data:
        if counter[row[10]] >= 504:
            writer.writerow(row)
This code reads thefile.csv, makes changes, and writes the results to thefile_subset11.csv.
However, when I open the resulting csv in Microsoft Excel, there is an extra blank line after each record!
Is there a way to make it not put an extra blank line?
The csv module directly controls line endings and writes \r\n into the file itself. In Python 3 the file must be opened in untranslated text mode with the parameters 'w', newline='' (empty string), or it will write \r\r\n on Windows, where the default text mode translates each \n into \r\n.
#!python3
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
In Python 2, open outfile in binary mode with 'wb' instead of 'w' to prevent the Windows newline translation. Python 2 also has problems with Unicode and requires other workarounds to write non-ASCII text. See the Python 2 link below, and the UnicodeReader and UnicodeWriter examples at the end of that page, if you have to write Unicode strings to CSVs on Python 2, or look into the third-party unicodecsv module:
#!python2
with open('/pythonwork/thefile_subset11.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
Documentation Links
https://docs.python.org/3/library/csv.html#csv.writer
https://docs.python.org/2/library/csv.html#csv.writer
Opening the file in binary mode "wb" will not work in Python 3+. Or rather, you'd have to convert your data to bytes before writing it, and that's just a hassle.
Instead, you should keep it in text mode but override the newline as empty, like so:
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
Note: It seems this is not the preferred solution, because of how the extra line was being added on a Windows system. As stated in the Python 2 documentation:

If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.

Windows is one such platform where that makes a difference. While changing the line terminator, as I describe below, may have fixed the problem, the problem can be avoided altogether by opening the file in binary mode. One might say this solution is more "elegant": fiddling with the line terminator would likely have produced code that is not portable between systems, whereas opening a file in binary mode has no effect on a Unix system, i.e. it results in cross-system compatible code.
From the Python docs:

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Original:
As part of the optional parameters for csv.writer, if you are getting extra blank lines you may have to change the lineterminator (info here). The example below is adapted from the csv docs on the Python site; change it from '\n' to whatever it should be. As this is just a stab in the dark at the problem, it may or may not work, but it's my best guess.
>>> import csv
>>> spamWriter = csv.writer(open('eggs.csv', 'w'), lineterminator='\n')
>>> spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
>>> spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
The simple answer is that CSV files should always be opened in binary mode, whether for input or output, as otherwise on Windows there are problems with the line ending. Specifically, on output the csv module will write \r\n (the standard CSV row terminator), and then (in text mode) the runtime will replace the \n with \r\n (the Windows standard line terminator), giving a result of \r\r\n.
Fiddling with the lineterminator is NOT the solution.
A lot of the other answers have become outdated in the ten years since the original question. For Python 3, the answer is right in the documentation:

If csvfile is a file object, it should be opened with newline=''.

The footnote explains in more detail:

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
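A minimal Python 3 sketch of that advice (file name and rows are arbitrary):

```python
import csv

# Write with newline='' so the csv module fully controls line endings.
with open('thefile_subset11.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b'])
    writer.writerow(['c', 'd'])

# Read it back the same way: two rows, no blank lines in between.
with open('thefile_subset11.csv', newline='') as f:
    rows = list(csv.reader(f))
```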
Use the method defined below to write data to the CSV file; just add a newline='' parameter to the open() call:

def writePhoneSpecsToCSV():
    rowData = ["field1", "field2"]
    with open('outputFile.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(rowData)

This will write CSV rows without creating additional rows!
I'm writing this answer with respect to Python 3, as I initially ran into the same problem.
I was supposed to get data from an Arduino using PySerial and write it to a .csv file. Each reading in my case ended with '\r\n', so a newline was always separating each line.
In my case the newline option didn't work, because I had written newline=' ' (a space) rather than newline='' (an empty string), and a space is not a legal value:

with open('op.csv', 'a', newline=' ') as csv_file:
ValueError: illegal newline value: ' '

So it seemed that this form of the option wasn't accepted here.
Following one of the other answers, I set the line terminator on the writer object instead:

writer = csv.writer(csv_file, delimiter=' ', lineterminator='\r')

and that worked for me for skipping the extra newlines.
with open(destPath + '\\' + csvXML, 'a+') as csvFile:
    writer = csv.writer(csvFile, delimiter=';', lineterminator='\r')
    writer.writerows(xmlList)

The lineterminator='\r' lets the writer move to the next row without an empty row in between.
Borrowing from this answer, it seems like the cleanest solution is to use io.TextIOWrapper. I managed to solve this problem for myself as follows:
from io import TextIOWrapper
...
with open(filename, 'wb') as csvfile, TextIOWrapper(csvfile, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    for data_row in data:
        csvwriter.writerow(data_row)
The above answer is not compatible with Python 2. For compatibility, I suppose one would simply need to wrap all the writing logic in an if block:

if sys.version_info < (3,):
    ...  # Python 2 way of handling CSVs
else:
    ...  # the logic above
I used writerow:

import csv
from itertools import permutations

def write_csv(writer, var1, var2, var3, var4):
    """Write four variables into a csv file."""
    writer.writerow([var1, var2, var3, var4])

numbers = set([1, 2, 3, 4, 5, 6, 7, 2, 4, 6, 8, 10, 12, 14, 16])
rules = list(permutations(numbers, 4))
selection = []
with open("count.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for rule in rules:
        number1, number2, number3, number4 = rule
        if (number1 + number2 + number3 + number4) % 5 == 0:
            selection.append(rule)
            write_csv(writer, number1, number2, number3, number4)
When using Python 3, the empty lines can be avoided by using the codecs module. As stated in its documentation, files are opened in binary mode, so no change to the newline kwarg is necessary. I ran into the same issue recently, and this worked for me:

with codecs.open(csv_file, mode='w', encoding='utf-8') as out_csv:
    csv_out_file = csv.DictWriter(out_csv, fieldnames=field_names)  # DictWriter requires the fieldnames argument

Dump Chinese data into a json file

I am running into a problem while dumping Chinese data (non-Latin-language data) into a JSON file.
I am trying to store a list in a JSON file with the following code:

with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file)

It dumps without any errors.
When I view the file, it looks like this:

[{sentence: \u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d}...]

I also tried it without the encoding option:

with open("file_name.json", "w") as file:
    json.dump(edits, file)

My question is: why does my JSON file look like this, and how can I dump it with the Chinese strings instead of Unicode escapes?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a switch ensure_ascii that, if set to False, makes the function not escape the characters.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
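A short sketch of the fix, using the sentence from the question:

```python
import json

edits = [{"sentence": "她細小聲對我說：「我借你一枝鉛筆。」"}]

# ensure_ascii=False writes the characters themselves instead of \uXXXX escapes.
with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file, ensure_ascii=False)
```

The file then contains the readable Chinese text, and json.load() still round-trips it.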

How to write a list of string as double quotes, so that json can load?

Consider the following code: the JSON can't be loaded back because, after the manipulation, the list is written with single quotes instead of double quotes. How can I write the list to the file with double quotes so that json can load it back?
import configparser
import json
config = configparser.ConfigParser()
config.read("config.ini")
l = json.loads(config.get('Basic', 'simple_list'))
new_config = configparser.ConfigParser()
new_config.add_section("Basic")
new_config.set('Basic', 'simple_list', str(l))
with open("config1.ini", 'w') as f:
    new_config.write(f)
config = configparser.ConfigParser()
config.read("config1.ini")
l = json.loads(config.get('Basic', 'simple_list'))
The settings.ini file content is like this:
[Basic]
simple_list = ["a", "b"]
As already mentioned by @L3viathan, the purely technical answer is "use json.dumps() instead of str()" (and yes, it works for dicts too).
BUT: storing JSON in an ini file is a very bad idea. "ini" is a file format of its own (even if not as strictly specified as JSON or YAML), and it was designed to be user-editable with any text editor. FWIW, the simple canonical way to store "lists" in an ini file is to store them as comma-separated values, i.e.:
[Basic]
simple_list = a,b
and parse this back when reading the config with:

values = config.get('Basic', 'simple_list').split(",")
Regarding "storing dicts": an ini file IS already a (kind of) dict, since it's based on key:value pairs. It's restricted to two levels (sections and keys), but here again that's by design; it's a format designed for end users, not for programmers.
Now if the ini format doesn't suit your needs, nothing prevents you from just using a JSON (or YAML) file for the whole config instead.
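For reference, a sketch of the json.dumps() fix applied to the code from the question:

```python
import configparser
import json

l = ["a", "b"]

new_config = configparser.ConfigParser()
new_config.add_section("Basic")
# json.dumps() emits '["a", "b"]' (double quotes), which json.loads() accepts;
# str(l) would emit "['a', 'b']" (single quotes), which it rejects.
new_config.set('Basic', 'simple_list', json.dumps(l))
with open("config1.ini", 'w') as f:
    new_config.write(f)

config = configparser.ConfigParser()
config.read("config1.ini")
roundtripped = json.loads(config.get('Basic', 'simple_list'))
```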

How to read a file where one column data is present in other column using Talend Data Integration

I get data from a CSV format daily.
Example data looks like:
Emp_ID emp_leave_id EMP_LEAVE_reason Emp_LEAVE_Status Emp_lev_apprv_cnt
E121 E121- 21 Head ache, fever, stomach-ache Approved 16
E139 E139_ 5 Attending a marraige of my cousin Approved 03
Here you can see that emp_leave_id and EMP_LEAVE_reason column data is shifted/scattered into the next columns.
So the problem is that with tFileInputDelimited and various reading patterns I couldn't load the data correctly into my target database; mainly, I'm not able to read the data correctly with that component in Talend.
Is there a way to properly parse this CSV to get the data in the format I want?
This is probably a TSV file. Not sure about Talend, but uniVocity can parse these files for you:
TsvDataStoreConfiguration tsv = new TsvDataStoreConfiguration("my_TSV_datastore");
tsv.setLimitOfRowsLoadedInMemory(10000);
tsv.addEntities("/some/dir/with/your_files", "ISO-8859-1"); // all files in the given directory path will be accessible entities

JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("my_Database", myDataSource);
database.setLimitOfRowsLoadedInMemory(10000);

Univocity.registerEngine(new EngineConfiguration("My_ETL_Engine", tsv, database));
DataIntegrationEngine engine = Univocity.getEngine("My_ETL_Engine");

DataStoreMapping dataStoreMapping = engine.map("my_TSV_datastore", "my_Database");
EntityMapping entityMapping = dataStoreMapping.map("your_TSV_filename", "some_database_table");
entityMapping.identity().associate("Emp_ID", "emp_leave_id").toGeneratedId("pk_leave"); // assumes your database does not keep the original ids
entityMapping.value().copy("EMP_LEAVE_reason", "Emp_LEAVE_Status").to("reason", "status"); // just copies whatever you need

engine.executeCycle(); // executes the mapping
Do not use a CSV parser to parse TSV input. It won't handle escape sequences properly (for a \t inside a value you will get the escape sequence instead of a tab character), and it will surely break if a value contains a quote: a CSV parser will take it as the opening quote of a field and keep reading characters until it finds another quote.
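The quoting pitfall can be demonstrated in a few lines (shown in Python, whose csv module behaves like any CSV-style parser here): pointed at tab-separated data, it swallows a tab inside a quoted field, while a literal tab split yields four fields.

```python
import csv
import io

line = 'a\t"b\tc"\td\n'

# CSV-style parsing: the quotes absorb the embedded tab -> 3 fields.
csv_fields = next(csv.reader(io.StringIO(line), delimiter='\t'))

# Literal TSV split: quotes are ordinary characters -> 4 fields.
tsv_fields = line.rstrip('\n').split('\t')

print(csv_fields)  # ['a', 'b\tc', 'd']
print(tsv_fields)  # ['a', '"b', 'c"', 'd']
```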
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Invalid byte sequence in UTF-8, CSV import, Rails 4

I have a rake task that populates my database from a CSV file:
require 'csv'

namespace :import_data_csv do
  desc "Import teams from csv file"
  task import_data: :environment do
    CSV.foreach(file, :headers => true) do |row|
      # various import tasks
    end
  end
end
This had been working properly, but with a new CSV file, I'm getting the following error on the 6th row of the CSV file:
Invalid byte sequence in UTF-8
I have looked through the row and can't seem to find any irregular characters.
I've also attempted a couple other fixes recommended on stackoverflow:
- Changing the CSV.foreach to:

reader = CSV.open(file, "r")
reader.each do |row|

And changing:

CSV.foreach(file, :headers => true) do |row|

to:

CSV.foreach(file, encoding: "r:ISO-8859-1", :headers => true) do |row|
None of these seem to correct the issue.
Suggestions?
I solved this by saving the file as an MS-DOS CSV, instead of the standard CSV format or the Windows CSV format.
The answer for me was to take the CSV file and save it as a text file, then replace the tabs with commas, then save the file as UTF-8 encoded. Finally, change the .txt extension to .csv and make sure it opens correctly in Excel, BUT DON'T save it in Excel; just close it once it looks correct. Then upload it.
A long, non-programmatic solution, but for my purposes it's sufficient.
Source is here: https://help.salesforce.com/apex/HTViewSolution?id=000003837&language=en_US
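For illustration of what triggers the error (a Python sketch; the byte value is just an example): a lone 0xE9 byte is 'é' in ISO-8859-1/-15, but it is not a valid UTF-8 sequence, which is exactly what an "Invalid byte sequence in UTF-8" error is complaining about.

```python
raw = 'café'.encode('iso-8859-1')    # b'caf\xe9' -- typical of Excel/Windows exports

try:
    raw.decode('utf-8')              # fails: 0xE9 is not valid UTF-8 here
    text = None
except UnicodeDecodeError:
    text = raw.decode('iso-8859-1')  # succeeds once the right encoding is named

print(text)  # café
```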