How does Pyarrow read_csv handle different file encodings? - csv

I have a .dat file that I had been reading with pd.read_csv, and I always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file, but it still works without issue (which is great! but I don't understand why, or whether it only auto-handles certain encodings). The only parameters I'm using are delimiter="|" (with ParseOptions) and auto_dict_encode=True (with ConvertOptions).
How is pyarrow handling different encoding types?

pyarrow currently has no functionality to deal with different encodings, and assumes UTF8 for string/text data.
But the reason it doesn't raise an error is that pyarrow will read any non-UTF8 strings as a "binary" type column, instead of "string" type.
A small example:
# writing a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])
Reading with pyarrow gives string for the first column (which only contains ASCII characters, thus also valid UTF8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
With pandas you indeed get an error by default (because pandas has no binary data type, and will try to read all text columns as python strings, thus UTF8):
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
  col1 col2
0    u    ù
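If you want real strings back on the pyarrow side as well, you can decode the binary column yourself after reading. A minimal sketch, assuming a recent pyarrow, a Latin-1 encoded file, and no nulls in the column (the decoding step is plain Python, not a pyarrow built-in):
import pyarrow as pa
from pyarrow import csv

table = csv.read_csv("test.csv")
# pull the raw bytes out, decode them, and swap the column back in
raw = table.column("col2").to_pylist()               # list of bytes objects
decoded = pa.array([b.decode("latin1") for b in raw])
table = table.set_column(1, "col2", decoded)         # col2 is now string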

It's now possible to specify encodings with pyarrow.csv.read_csv.
According to the pyarrow docs for read_csv:
The encoding can be changed using the ReadOptions class.
A minimal example follows:
from pyarrow import csv
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', options)
From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.

Related

\x00 is coming between each char in Pyspark dataframe

I am reading a .CSV UTF-8 file into a Pyspark dataframe. In the dataframe I am getting \x00 between each character in every column.
For example
In csv-
Username
Xyz
In the dataframe it shows up as square boxes. When I collect() it shows like below, so each square box is \x00:
\x00U\x00S\x00E....
\x00X\x00y\x00Z\x00
Can you please help?
Issue
Your issue may be that you are not reading the file using the correct encoding.
Recommendation
You may achieve this using the encoding option when reading from your csv file. @JosefZ suggested using utf_16_BE and utf_16_LE, which is a good start for determining the real encoding of your file. However, while those names may work in Python, pyspark expects one of the following string encoding values:
US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16
You may learn more about string encodings here.
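If you are not sure what encoding your file actually uses, one quick sanity check outside Spark is a byte-level guess. A minimal sketch, assuming the third-party chardet package is installed (the file name matches the example below):
import chardet

# read the raw bytes and let chardet guess the encoding
with open("testfile.d", "rb") as fp:
    guess = chardet.detect(fp.read())
print(guess)  # e.g. {'encoding': 'UTF-16', 'confidence': ..., 'language': ...}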
For your specific use case, you could try using .option("encoding","UTF-16"), or simply substitute and test each of the encodings listed above, e.g.:
df = (
    spark.read
    .format("csv")
    .option("header", False)       # optional
    .option("encoding", "UTF-16")
    .schema("username string")     # optional
    .load("testfile.d")            # replace with your actual file name
)
Full Reproducible Example
# Creating test csv file
with open("testfile.d", "wb") as fp:
    fp.write(b'\x00X\x00y\x00Z')  # writing encoded content

# reading and displaying content without specifying encoding
spark.read.format("csv").option("header", False).schema("username string").load("testfile.d").show()
+--------+
|username|
+--------+
| �X�y�Z|
+--------+
# reading and displaying content by specifying encoding
spark.read.format("csv").option("header",False).option("encoding","UTF-16").schema("username string").load("testfile.d").show()
+--------+
|username|
+--------+
| XyZ|
+--------+

Creating a Pandas DataFrame from a CSV file with JSON in it

I have a Postgres database where two columns are jsonb data. I used this command to get a CSV copy of the database: \copy (SELECT * FROM articles) TO articles.csv CSV DELIMITER ',' HEADER
I am using Python 3.6. When I load this CSV file into a Pandas dataframe with read_csv I get what appears to be a doubly encoded string for all the json data:
e.g. articles.iloc[0]['word_count'] gives me:
'"{\\"he\\":8,\\"is\\":8,\\"a\\":26,\\"wealthy\\":1,\\"international\\":2,\\"entrepreneur\\":1,\\"known\\":3,\\"for\\":9,\\"generous\\":1,\\"donations\\":2,\\"to\\":17,\\"his\\":6,\\"alma\\":1,\\"mater\\":1,\\"harvard\\":11,\\"now\\":2,\\"court\\":12,\\"says\\":1,\\"the\\":51,\\"university\\":3,\\"must\\":2,\\"cooperate\\":1,\\"in\\":21,\\"hunt\\":1,\\"assets\\":3,\\"federal\\":2,\\"judge\\":2,\\"boston\\":3,\\"has\\":4,\\"ruled\\":2,\\"that\\":10,\\"provide\\":1,\\"testimony\\":1,\\"and\\":11,\\"produce\\":1,\\"documents\\":3,\\"disclosing\\":1,\\"bank\\":1,\\"accounts\\":1,\\"routing\\":1,\\"numbers\\":1,\\"wire\\":1,\\"transfers\\":1,\\"other\\":2,\\"interbank\\":1,\\"messages\\":1,\\"used\\":1,\\"by\\":11,\\"an\\":6,\\"alumnus\\":1,\\"charles\\":1,\\"c\\":2,\\"spackman\\":19,\\"send\\":1,\\"money\\":2,\\"mr\\":19,\\"hong\\":5,\\"kongbased\\":1,\\"businessman\\":1,\\"leads\\":2,\\"group\\":4,\\"global\\":1,\\"investment\\":1,\\"holding\\":1,\\"company\\":10,\\"with\\":3,\\"billion\\":1,\\"under\\":1,\\"management\\":1,\\"ruling\\":4,\\"places\\":1,\\"ivy\\":1,\\"league\\":1,\\"college\\":1,\\"uncomfortable\\":1,\\"predicament\\":1,\\"of\\":19,\\"revealing\\":1,\\"confidential\\":1,\\"financial\\":1,\\"information\\":3,\\"gleaned\\":1,\\"from\\":2,\\"influential\\":1,\\"benefactor\\":1,\\"no\\":2,\\"small\\":1,\\"donor\\":1,\\"according\\":2,\\"website\\":1,\\"sponsors\\":1,\\"scholarship\\":2,\\"fund\\":1,\\"asian\\":1,\\"students\\":1,\\"at\\":2,\\"harvardasia\\":1,\\"council\\":1,\\"served\\":1,\\"as\\":2,\\"cochairman\\":1,\\"reunion\\":1,\\"gifts\\":1,\\"class\\":1,\\"year\\":1,\\"also\\":2,\\"korean\\":6,\\"name\\":1,\\"yoo\\":1,\\"shin\\":1,\\"choi\\":1,\\"obtained\\":1,\\"undergraduate\\":1,\\"degree\\":1,\\"economics\\":1,\\"spokeswoman\\":1,\\"melodie\\":1,\\"jackson\\":1,\\"said\\":8,\\"would\\":2,\\"not\\":5,\\"comment\\":2,\\"on\\":4,\\"order\\":1,\\"part\\":1,\\"longfought\\":1,\\"quest\\":1,\\"aggrieved\\":1,\\"investor\\":2,\\"sang\\":1,\\"cheol\\":1,\\"woo\\":3,\\"collect\\":2,\\"judgment\\":4,\\"against\\":1,\\"involving\\":1,\\"south\\":4,\\"business\\":3,\\"deal\\":1,\\"case\\":3,\\"could\\":2,\\"have\\":3,\\"furtherreaching\\":1,\\"implications\\":1,\\"douglas\\":1,\\"kellner\\":2,\\"manhattan\\":1,\\"lawyer\\":2,\\"who\\":2,\\"specializes\\":1,\\"recovering\\":1,\\"hidden\\":1,\\"worldwide\\":1,\\"if\\":2,\\"diverted\\":1,\\"funds\\":1,\\"when\\":1,\\"should\\":1,\\"been\\":3,\\"paying\\":1,\\"thats\\":1,\\"fraudulent\\":1,\\"transfer\\":1,\\"they\\":2,\\"sue\\":2,\\"get\\":2,\\"back\\":2,\\"theyd\\":1,\\"be\\":1,\\"entitled\\":1,\\"it\\":4,\\"can\\":1,\\"show\\":1,\\"was\\":6,\\"fraudulently\\":1,\\"transferred\\":1,\\"john\\":1,\\"han\\":1,\\"firm\\":2,\\"kobre\\":1,\\"kim\\":1,\\"which\\":5,\\"handling\\":1,\\"investors\\":1,\\"had\\":3,\\"plans\\":1,\\"unwittingly\\":1,\\"entangled\\":1,\\"dispute\\":1,\\"collection\\":1,\\"effort\\":1,\\"dates\\":1,\\"stock\\":2,\\"collapse\\":2,\\"littauer\\":2,\\"technologies\\":1,\\"ltd\\":1,\\"technology\\":1,\\"seoul\\":1,\\"high\\":2,\\"major\\":1,\\"fled\\":1,\\"korea\\":3,\\"amid\\":1,\\"claims\\":1,\\"price\\":1,\\"manipulation\\":1,\\"departing\\":1,\\"before\\":3,\\"authorities\\":2,\\"arrested\\":1,\\"partner\\":1,\\"later\\":3,\\"insiders\\":1,\\"profited\\":1,\\"selling\\":1,\\"their\\":1,\\"shares\\":1,\\"while\\":2,\\"minority\\":1,\\"shareholders\\":1,\\"including\\":1,\\"suffered\\":1,\\"enormous\\":1,\\"losses\\":1,\\"ordered\\":1,\\"pay\\":1,\\"million\\":2,\\"mushroomed\\":1,\\"because\\":5,\\"accumulating\\
":1,\\"interest\\":1,\\"managing\\":1,\\"director\\":1,\\"richard\\":1,\\"lee\\":1,\\"related\\":1,\\"lawsuit\\":1,\\"pending\\":1,\\"kong\\":4,\\"filed\\":1,\\"appeared\\":1,\\"unaware\\":1,\\"until\\":2,\\"just\\":1,\\"overturned\\":1,\\"supreme\\":1,\\"all\\":1,\\"defendants\\":1,\\"except\\":1,\\"upheld\\":1,\\"him\\":1,\\"did\\":2,\\"appear\\":1,\\"defend\\":1,\\"himself\\":1,\\"acknowledging\\":1,\\"fined\\":1,\\"connection\\":1,\\"matter\\":1,\\"maintains\\":1,\\"commit\\":1,\\"offenses\\":1,\\"woos\\":1,\\"lawyers\\":1,\\"argue\\":1,\\"efforts\\":1,\\"hampered\\":1,\\"what\\":1,\\"papers\\":2,\\"called\\":1,\\"mazelike\\":1,\\"network\\":1,\\"offshore\\":1,\\"nominees\\":1,\\"trusts\\":1,\\"many\\":1,\\"are\\":1,\\"managed\\":1,\\"close\\":1,\\"family\\":1,\\"members\\":1,\\"classmates\\":1,\\"example\\":1,\\"estate\\":1,\\"where\\":1,\\"lives\\":1,\\"section\\":1,\\"forbes\\":1,\\"described\\":1,\\"wealthiest\\":1,\\"neighborhood\\":1,\\"earth\\":1,\\"owned\\":2,\\"through\\":1,\\"series\\":1,\\"shell\\":1,\\"companies\\":1,\\"turn\\":3,\\"british\\":1,\\"virgin\\":1,\\"islands\\":1,\\"say\\":1,\\"entered\\":1,\\"feb\\":1,\\"william\\":1,\\"g\\":1,\\"young\\":1,\\"district\\":1,\\"gives\\":1,\\"march\\":1,\\"over\\":2,\\"banking\\":1,\\"orders\\":1,\\"spackmans\\":2,\\"daughter\\":1,\\"claire\\":1,\\"sophomore\\":1,\\"testify\\":1,\\"records\\":1,\\"about\\":1,\\"her\\":1,\\"fathers\\":1,\\"american\\":1,\\"citizen\\":1,\\"permanent\\":1,\\"resident\\":1,\\"well\\":1,\\"partly\\":1,\\"son\\":1,\\"james\\":1,\\"adopted\\":1,\\"americans\\":1,\\"after\\":1,\\"biological\\":1,\\"parents\\":1,\\"died\\":1,\\"during\\":1,\\"war\\":1,\\"advanced\\":1,\\"world\\":1,\\"become\\":1,\\"chief\\":1,\\"prudentials\\":1,\\"insurance\\":1,\\"holdings\\":1,\\"younger\\":1,\\"include\\":1,\\"entertainment\\":1,\\"produced\\":1,\\"science\\":1,\\"fiction\\":1,\\"movie\\":1,\\"snowpiercer\\":1,\\"starring\\":1,\\"tilda\\":1,\\"swinton\\":1,\\"octavia\\":1,\\"spencer\\":1}"'
In order to get a python dictionary from the above string I have to call json.loads(json.loads()) on it. Since I want to convert the whole column to dictionaries I tried articles['word_count'].apply( lambda x: json.loads(json.loads(x)) ) but this gives me an error:
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
How do I fix this? OR am I missing a command when I export to CSV from my database? OR am I missing a command when I call read_csv in Pandas?
Note: I have tried the converters option with read_csv and I get this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0). My function is:
def dec(s):
    return json.loads(json.loads(s))
Use pd.io.json.json_normalize() to convert an entire column of JSON data into a separate DataFrame with the same number of rows:
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.io.json.json_normalize.html
For your case it'd be something like this:
pd.io.json.json_normalize(articles.word_count)
You might have to preprocess it if Pandas doesn't understand the escaping in your input data.
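As for the TypeError mentioning 'float': that usually means some cells are empty, and pandas reads empty cells as NaN, which is a float. A minimal sketch of a NaN-safe version of the question's converter (illustrative, not the only fix):
import json

def dec(s):
    # empty cells come through as NaN (a float); pass them through untouched
    if not isinstance(s, str):
        return s
    return json.loads(json.loads(s))

articles['word_count'] = articles['word_count'].apply(dec)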
Beyond all that, since your data comes from a database, you should consider just loading it directly, without the CSV intermediary. Pandas has functions for this, such as read_sql_query() and read_sql_table().
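For example, a minimal sketch with SQLAlchemy (the connection string is a placeholder, and this assumes a Postgres driver such as psycopg2 is installed); with a jsonb column, the driver typically hands you Python dicts directly, so no json.loads round-trip is needed:
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string - adjust user/password/host/database
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
articles = pd.read_sql_query('SELECT * FROM articles', engine)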

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am trying to read twitter data from a json file using Python 2.7.12.
The code I used is this:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
def get_tweets_from_file(file_name):
    tweets = []
    with open(file_name, 'rw') as twitter_file:
        for line in twitter_file:
            if line != '\r\n':
                line = line.encode('ascii', 'ignore')
                tweet = json.loads(line)
                if u'info' not in tweet.keys():
                    tweets.append(tweet)
    return tweets
Result I got:
Traceback (most recent call last):
  File "twitter_project.py", line 100, in <module>
    main()
  File "twitter_project.py", line 95, in main
    tweets = get_tweets_from_dir(src_dir, dest_dir)
  File "twitter_project.py", line 59, in get_tweets_from_dir
    new_tweets = get_tweets_from_file(file_name)
  File "twitter_project.py", line 71, in get_tweets_from_file
    line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.
In my case (macOS), there was a .DS_Store file in my data folder, which is a hidden, auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further - it's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information
The error is happening because line is a byte string and you're calling encode(). encode() only makes sense if the string is Unicode, so Python tries to convert it to Unicode first using the default encoding - in your case UTF-8 because of the setdefaultencoding() hack, though normally it would be ASCII. Either way, 0x80 is not valid ASCII or UTF-8, so it fails.
0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
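You can see this in a Python 2 session (a quick illustration):
>>> '\x80'.decode('windows-1252')
u'\u20ac'   # the euro sign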
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that allow you to decode encoded strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252:
import io

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)
The io module also provides universal newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.
This was the original code:
import pickle as pkl
with open(pkl_path, 'r') as f:
    obj = pkl.load(f)
And this fixed the error:
import pickle as pkl
with open(pkl_path, 'rb') as f:
    obj = pkl.load(f)
I got a similar error by accidentally trying to read a parquet file as a csv:
pd.read_csv(file.parquet)      # raises the UnicodeDecodeError
pd.read_parquet(file.parquet)  # reads the file correctly
The error occurs when you try to read a tweet containing a sentence like
"#Mike http:\www.google.com \A8&^)((&() how are&^%()( you "
which cannot be read as an ordinary string; you would have to read it as a raw string. But converting to a raw string still gives an error, so I suggest you read the json file like this instead:
import codecs
import json

keys, fulldata = [], []  # collect ids and tweets as we go
with codecs.open('tweetfile', 'rU', 'utf-8') as f:
    for line in f:
        data = json.loads(line)
        print data["tweet"]
        keys.append(data["id"])
        fulldata.append(data["tweet"])
which will load the data from the json file.
You can also write it to a csv using Pandas.
import pandas as pd

output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
output.to_csv("tweets.csv", index=False, quoting=1)
Then read from the csv to avoid the encoding and decoding problems.
Hope this helps you solve your problem.
Midhun

How to read .csv file that contains utf-8 values by pandas dataframe

I'm trying to read a .csv file that contains utf-8 data in some of its columns into a pandas dataframe. The code is as follows:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14: invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3: invalid continuation byte
Could you please share your ideas and experience with this problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
A sample of the raw data: I cannot post a full record because of the legal restrictions on the medical data.
I had this problem for no apparent reason, and I managed to get it to work using this:
df = pd.read_csv('file', encoding="ISO-8859-1")
I'm not sure why, though - likely because ISO-8859-1 assigns a character to every possible byte value, so decoding can never fail (accented characters may still come out wrong if the file's true encoding is something else).
I also did what Irh09 proposed, but the second file it read was wrongly decoded, and I couldn't find a column with accented characters (á, é, í, ó, ú).
So I recommend wrapping the read in a try/except like this:
try:
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv('file', encoding="ISO-8859-1")

Python 2.7 writing strings elements (character) to a binary file

I am using Python 2.7 to access an API that returns JSON with a single key="ringtone_file" and an associated value that is an mp3 file encoded for transport via HTTP. I created a bogus mp3 file consisting of 256 bytes in order from 0x00 through 0xff and the returned file appears below.
{"ringtone_file":"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"}
I accessed the API using the following code minus exception handing code
import requests
response = requests.get(url)
dict = response.json()
print dict
This yields the following output
{u'ringtone_file': u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'}
What I want to do is write each character or hex value of this string to a file in binary format. The result should be a file of size 256 bytes where the first byte in the file has value 0 and the last byte has value 255. I can't change the API. Can someone suggest a reasonable way of accomplishing this with Python 2.7?
I attempted to do what was obvious to me, which was to open a file for writing in binary mode and then write the unicode string to the file. The error message from the codec indicates I can't write values between 128 and 255 inclusive.
Since the string value is Unicode, you have to encode the string to write it to a file. The latin1 codec directly maps to the first 256 Unicode characters, so use .encode('latin1') on the string.
Example:
>>> s=u'\x00\x01\x02\xfd\xfe\xff'
>>> s
u'\x00\x01\x02\xfd\xfe\xff' # Unicode string
>>> s.encode('latin1')
'\x00\x01\x02\xfd\xfe\xff' # Now a byte string.
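Putting it together for the question's case, a minimal sketch (the output file name is an assumption, and dict is the variable name from the question, though shadowing the built-in is best avoided):
# Python 2.7: write the 256 decoded bytes out in binary mode
with open('ringtone.mp3', 'wb') as out:
    out.write(dict['ringtone_file'].encode('latin1'))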