Invalid byte sequence in UTF-8, CSV import, Rails 4

I have a rake task that populates my database from a CSV file:
require 'csv'

namespace :import_data_csv do
  desc "Import teams from csv file"
  task import_data: :environment do
    CSV.foreach(file, :headers => true) do |row|
      # various import tasks
    end
  end
end

(file is the CSV path, set earlier in the task.)
This had been working properly, but with a new CSV file, I'm getting the following error on the 6th row of the CSV file:
Invalid byte sequence in UTF-8
I have looked through the row and can't seem to find any irregular characters.
I've also attempted a couple other fixes recommended on stackoverflow:
- Changing the CSV.foreach to:
reader = CSV.open(file, "r")
reader.each do |row|
And changing:
CSV.foreach(file, :headers => true) do |row|
to:
CSV.foreach(file, encoding: "r:ISO-8859-1", :headers => true) do |row|
None of these seem to correct the issue.
Suggestions?

I solved this by saving the file in Excel as CSV (MS-DOS), instead of the standard CSV or Windows CSV format.

The answer for me was to take the CSV file and save it to a text file. Then replace the tabs with commas. Then save the file as UTF-8 encoded. Finally, change the .txt to .csv and make sure it works in Excel BUT DON'T save it in Excel. Just close it when you see it looks correct. Then upload it.
A long, non-programmatic solution, but for my purposes it's sufficient.
Source is here: https://help.salesforce.com/apex/HTViewSolution?id=000003837&language=en_US

Related

PyCharm does not display Cyrillic

Help fix the problem:
The task was to save the database model as a fixture. I did it from the terminal (python manage.py dumpdata ...).
The JSON file was created but does not display Cyrillic correctly. The encoding is utf-8. Please help.
I tried to change the encoding type in the settings, and I tried to manually rewrite the json file:

import json

file = r'C:\Programming\python_django\store\products\fixtures\category.json'
file_2 = r'C:\Programming\python_django\store\products\fixtures\category.json'

with open(file, 'rb') as open_file:
    data = json.load(open_file)
with open(file_2, 'w') as write_file:
    write_file.write(json.dumps(data, ensure_ascii=False,
                                indent=0,
                                separators=(',', ': '))
                     .encode('cp866').decode('cp1251'))
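For reference, the usual fix is to skip the encode/decode round-trip entirely: load and re-dump the fixture with an explicit utf-8 encoding and ensure_ascii=False. A minimal sketch (file names and content here are stand-ins, not the original paths):

```python
import json

# Write a sample fixture with escaped Cyrillic, as dumpdata
# produces by default (content here is made up).
with open('category.json', 'w', encoding='utf-8') as f:
    f.write('[{"model": "products.category", '
            '"fields": {"name": "\\u041a\\u043d\\u0438\\u0433\\u0438"}}]')

with open('category.json', encoding='utf-8') as f:
    data = json.load(f)

# ensure_ascii=False writes Cyrillic as real characters
# instead of \uXXXX escapes.
with open('category_utf8.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=0)
```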

Write a CSV based on another CSV file creating an additional empty row? [duplicate]

import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))

import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[10]] += 1

with open('/pythonwork/thefile_subset11.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in data:
        if counter[row[10]] >= 504:
            writer.writerow(row)
This code reads thefile.csv, makes changes, and writes the results to thefile_subset11.csv.
However, when I open the resulting csv in Microsoft Excel, there is an extra blank line after each record!
Is there a way to make it not put an extra blank line?
The csv module directly controls line endings and writes \r\n (the CSV row terminator) into the file. In Python 3 the file must be opened in untranslated text mode with the parameters 'w', newline='' (empty string), or it will write \r\r\n on Windows, where the default text mode translates each \n into \r\n.
#!python3
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
writer = csv.writer(outfile)
In Python 2, use binary mode to open outfile with mode 'wb' instead of 'w' to prevent Windows newline translation. Python 2 also has problems with Unicode and requires other workarounds to write non-ASCII text. See the Python 2 link below and the UnicodeReader and UnicodeWriter examples at the end of the page if you have to deal with writing Unicode strings to CSVs on Python 2, or look into the 3rd party unicodecsv module:
#!python2
with open('/pythonwork/thefile_subset11.csv', 'wb') as outfile:
writer = csv.writer(outfile)
Documentation Links
https://docs.python.org/3/library/csv.html#csv.writer
https://docs.python.org/2/library/csv.html#csv.writer
Opening the file in binary mode "wb" will not work in Python 3+. Or rather, you'd have to convert your data to binary before writing it. That's just a hassle.
Instead, you should keep it in text mode, but override the newline as empty. Like so:
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
Note: It seems this is not the preferred solution because of how the extra line was being added on a Windows system. As stated in the Python documentation:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
Windows is one such platform. While changing the line terminator as I describe below may have fixed the problem, it could have been avoided altogether by opening the file in binary mode. One might say that solution is more "elegant": fiddling with the line terminator would likely have produced code that is not portable between systems, whereas opening a file in binary mode has no effect on a Unix system, i.e. it results in cross-system-compatible code.
From Python Docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Original:
As part of the optional parameters for csv.writer, if you are getting extra blank lines you may have to change the lineterminator (info here). Example below adapted from the Python csv docs. Change it from '\n' to whatever it should be. As this is just a stab in the dark at the problem, this may or may not work, but it's my best guess.
>>> import csv
>>> spamWriter = csv.writer(open('eggs.csv', 'w'), lineterminator='\n')
>>> spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
>>> spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
The simple answer is that csv files should always be opened in binary mode whether for input or output, as otherwise on Windows there are problems with the line ending. Specifically on output the csv module will write \r\n (the standard CSV row terminator) and then (in text mode) the runtime will replace the \n by \r\n (the Windows standard line terminator) giving a result of \r\r\n.
Fiddling with the lineterminator is NOT the solution.
A lot of the other answers have become out of date in the ten years since the original question. For Python3, the answer is right in the documentation:
If csvfile is a file object, it should be opened with newline=''
The footnote explains in more detail:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
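To make the footnote concrete, here is a small sketch (file name arbitrary) showing that csv.writer itself already emits \r\n row terminators, which is why a second translation layer on Windows yields \r\r\n:

```python
import csv

# With newline='', the '\r\n' that csv.writer emits reaches the
# file untranslated. Without it, Windows text mode would turn
# each '\n' into '\r\n', producing '\r\r\n' (blank rows in Excel).
with open('demo.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b'])
    writer.writerow(['c', 'd'])

with open('demo.csv', 'rb') as f:
    print(f.read())  # b'a,b\r\nc,d\r\n'
```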
Use open('outputFile.csv', 'a', newline='') when opening the file; just add the newline='' parameter to the open call:

import csv

def writePhoneSpecsToCSV():
    rowData = ["field1", "field2"]
    with open('outputFile.csv', 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(rowData)

This will write CSV rows without creating additional rows!
I'm writing this answer with respect to Python 3, as I initially had the same problem.
I was supposed to get data from an Arduino using PySerial and write it to a .csv file. Each reading in my case ended with '\r\n', so a newline was always separating each line.
In my case, the newline='' option didn't work; it showed an error like:

with open('op.csv', 'a', newline=' ') as csv_file:
ValueError: illegal newline value: ' '

(Note the stray space: only None, '', '\n', '\r' and '\r\n' are legal newline values.) So it seemed that this newline value was not accepted here.
Following one of the answers here, I set the line terminator on the writer object:

writer = csv.writer(csv_file, delimiter=' ', lineterminator='\r')

and that worked for me for skipping the extra newlines.
with open(destPath + '\\' + csvXML, 'a+') as csvFile:
    writer = csv.writer(csvFile, delimiter=';', lineterminator='\r')
    writer.writerows(xmlList)

Setting lineterminator='\r' moves on to the next row without an empty row between two.
Borrowing from this answer, it seems like the cleanest solution is to use io.TextIOWrapper. I managed to solve this problem for myself as follows:
from io import TextIOWrapper
...
with open(filename, 'wb') as csvfile, TextIOWrapper(csvfile, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    for data_row in data:
        csvwriter.writerow(data_row)
The above answer is not compatible with Python 2. To have compatibility, I suppose one would simply need to wrap all the writing logic in an if block:
if sys.version_info < (3,):
    # Python 2 way of handling CSVs
else:
    # The above logic
I used writerow:

from itertools import permutations
import csv

def write_csv(writer, var1, var2, var3, var4):
    """Write four variables into a csv file."""
    writer.writerow([var1, var2, var3, var4])

numbers = set([1, 2, 3, 4, 5, 6, 7, 2, 4, 6, 8, 10, 12, 14, 16])
rules = list(permutations(numbers, 4))
selection = []
with open("count.csv", 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for rule in rules:
        number1, number2, number3, number4 = rule
        if (number1 + number2 + number3 + number4) % 5 == 0:
            selection.append(rule)
            write_csv(writer, number1, number2, number3, number4)
When using Python 3, the empty lines can also be avoided by using the codecs module. As stated in the documentation, files are opened in binary mode, so no change of the newline kwarg is necessary. I was running into the same issue recently and that worked for me:

with codecs.open(csv_file, mode='w', encoding='utf-8') as out_csv:
    csv_out_file = csv.DictWriter(out_csv, fieldnames=fieldnames)
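A simpler sketch of the same idea without codecs (the data and column names below are made up): DictWriter plus open(..., newline='') also produces one line per record:

```python
import csv

rows = [{'name': 'spam', 'qty': 5}, {'name': 'eggs', 'qty': 12}]

# newline='' leaves csv's own '\r\n' terminators untranslated.
with open('orders.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'qty'])
    writer.writeheader()
    writer.writerows(rows)
```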

How to read data from a CSV file of two possible encodings?

I want to read data from csv files with two possible encodings (UTF-8 and ISO-8859-15). I mean different files with different encodings. Not the same file with two encodings.
Now I can only read data correctly from a UTF-8 encoded file. Can I implement this just by adding an extra option, for example encoding: 'ISO-8859-15'?
What I have:
def csv
  file = File.open(file.tempfile)
  CSV.open(file, csv_options)
end

private

def csv_options
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true
  }
end
Once you know what encoding your file has, you can pass inside the CSV options i.e.
external_encoding: Encoding::ISO_8859_15,
internal_encoding: Encoding::UTF_8
(This would establish, that the file is ISO-8859-15, but you want the strings internally as UTF-8).
So the strategy is to decide first (before opening the file) what encoding you want, and then use the appropriate options Hash.
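In Python terms, the same per-file strategy can be sketched as a strict UTF-8 attempt with an ISO-8859-15 fallback (the helper name is made up); this works because every byte sequence decodes under ISO-8859-15, while invalid UTF-8 raises an error:

```python
def read_text_any(path):
    """Read a file as UTF-8, falling back to ISO-8859-15."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-15')
```

You can then feed the decoded string to the CSV parser regardless of which encoding the file arrived in.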

convert CSV to JSON using Python

I need to convert a CSV file to JSON file using Python. I used this,
variable = csv.DictReader(open('file.csv'))
It throws this ERROR
csv.Error: line contains NULL byte
I checked the CSV file in Excel and it shows no NUL chars, but when I printed the data in the CSV file using Python, there is data like SOH NUL NUL H G (here the last two letters, HG, are the data displayed in Excel). I need to remove these control characters from the CSV file while converting to JSON (i.e. I need only HG from the above string).
I just ran into the same issue. I converted my csv file to csv UTF-8 and ran it again without any errors. That seemed to fix the ASCII char issue. Hope that helps.
To convert the csv type, I just opened my file up in Excel, did save as, then selected CSV UTF-8(Comma delimited)(*.csv) in the Save as type.
Hope that helps.
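For what it's worth, NUL bytes interleaved with the visible characters (SOH NUL NUL ...) usually mean the file is UTF-16 rather than UTF-8; opening it with the right encoding removes them without any manual cleanup. A sketch with a made-up file name:

```python
import csv
import json

# Simulate a UTF-16 export; reading these raw bytes with the
# csv module would trigger "line contains NUL byte".
with open('export.csv', 'w', encoding='utf-16') as f:
    f.write('code,name\nHG,mercury\n')

# Decoding as UTF-16 makes the NUL bytes disappear.
with open('export.csv', encoding='utf-16') as f:
    rows = list(csv.DictReader(f))

print(json.dumps(rows))  # [{"code": "HG", "name": "mercury"}]
```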

How to load CSV dataset with corrupted columns?

I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("table.csv")
After doing some validations, I find out that some ids were null because a column sometimes has a carriage return. And that dislocated all next columns, with a domino effect, corrupting all the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?
You seem to have been lucky that inferSchema worked fine (since it only reads a few records to infer the schema), which is why printSchema gives you a correct result.
Since the CSV export file is broken and assuming you want to process the file using Spark (given its size for example) read it using textFile and fix the ids. Save it as CSV format and load it back.
I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe) there is a 'multiLine' option that can be used to keep together fields that have line breaks in them. From some other things I've read, you may need to apply some quoting and/or escape-character options to get it working just how you want. Note that the options must be set before the load:

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")
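The underlying problem, a quoted field containing a line break, can be reproduced outside Spark; Python's csv module, for instance, keeps such a field in a single logical row as long as it is quoted (the data below is made up):

```python
import csv
import io

# One record whose second field contains an embedded newline.
raw = 'id,address\n1,"12 Main St\nSpringfield"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['id', 'address'], ['1', '12 Main St\nSpringfield']]
```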