So, I'm using the following code to get pandas to read my JSON text file-
f = open('C:/Users/stans/WFH Project/data.json')
data = json.load(f)
df = pd.DataFrame(data, index=[0])
f.close()
Once I execute the cell, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1535: character maps to <undefined>
I used the above code on a smaller sample of JSON data and it worked. But since I updated the file to include a much larger sample, I get that error.
I verified that the JSON format is correct, and I also tried in the open statement-
encoding='utf-8'
and
errors='ignore'
Both produced value errors. Any ideas? Thanks in advance for your help!
I am converting a JSON string into a Python dictionary object and I get the following error for the code below:
import json
path = 'data2012-03-16.txt'
records = [json.loads(line) for line in open(path)]
Error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte
A few suggestions:
Maybe the file is not in the encoding you expect? Try opening it in Notepad++ to check the encoding and, if needed, convert it.
Are you sure your JSON file is well formed? Try opening it in a JSON parser to check.
Why do you get an error for byte 0x92 at position 6? Look at what is at that offset in your file. Maybe the problem is with the \/ escapes; try replacing them with other characters and check whether it works. Besides that, you can work by elimination: try opening a different file with the same code, and once that works, try a cut-down version of this file, and so on.
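To see what actually sits at the reported offset, you can read the file in binary mode and look at the bytes around it. This is only a sketch: show_bytes_around is a hypothetical helper, and the sample bytes stand in for the contents of your own file (which you would get from open(path, 'rb').read()).

```python
def show_bytes_around(data: bytes, position: int, width: int = 20) -> bytes:
    """Return the raw bytes surrounding the offset reported in the error."""
    start = max(0, position - width)
    return data[start:position + width]


# Stand-in for open(path, 'rb').read(); the apostrophe here is byte 0x92.
sample = b'{"a": "it\x92s"}'

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that could not be decoded.
    context = show_bytes_around(sample, err.start)
    print(err.start, context)

# In Windows-1252, byte 0x92 is a right single quotation mark.
print(sample.decode("cp1252"))
```

If the bytes around the offset look like ordinary text with a "smart" quote or dash in it, the file was most likely saved in a Windows code page rather than UTF-8.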
I have a CSV file and I wish to understand its encoding. Is there a menu option in Microsoft Excel that can help me detect it, or do I need to make use of a programming language like C# or PHP to deduce it?
You can use Notepad++ to evaluate a file's encoding without needing to write code. The evaluated encoding of the open file will display on the bottom bar, far right side. The encodings supported can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop down.
On Linux systems, you can use the file command. It reports the encoding it detects.
Sample:
file blah.csv
Output:
blah.csv: ISO-8859 text, with very long lines
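If you use Python, you can print() the file object to see which encoding Python uses to read a csv file. Note that this is the default encoding Python picked for opening the file (or the one you passed in), not one detected from the file's contents. For example:
with open('file_name.csv') as f:
    print(f)
The output is something like this:
<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>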
You can also use python chardet library
# install the chardet library
!pip install chardet
# import the chardet library
import chardet
# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
Use chardet https://github.com/chardet/chardet (documentation is short and easy to read).
Install Python, then run pip install chardet, and finally use the chardetect command-line tool that comes with it.
I tested it on GB2312 and it's pretty accurate. (Make sure you have at least a few characters; a sample with only 1 character may easily fail.)
The file command is not reliable, as you can see.
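Or you can run this in a Python console or in a Jupyter Notebook:
data = open("file.csv", "r")
data
You will see information about the data object like this:
<_io.TextIOWrapper name='file.csv' mode='r' encoding='cp1250'>
As you can see, it contains encoding information. This is the encoding Python will use to read the file (the platform default unless you pass one), so treat it as a starting point rather than proof of how the file was written.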
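One caveat worth checking for yourself: the encoding shown on the file object comes from how the file was opened, not from its contents. A small sketch, using a temporary file whose real encoding is known (the file contents are made up for the demonstration):

```python
import os
import tempfile

# Write bytes that are definitely UTF-16, so the file's real encoding is known.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as raw:
    raw.write("name,city\nZoë,Kraków\n".encode("utf-16"))

# Opening in text mode does not inspect the contents at all.
with open(path, encoding="ascii") as f:
    reported = f.encoding  # echoes the argument that was passed in

print(reported)
os.remove(path)
```

The file object reports ascii even though the file is UTF-16. Without an explicit encoding argument you would see the platform default instead.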
CSV files have no headers indicating the encoding.
You can only guess by looking at:
the platform / application the file was created on
the bytes in the file
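One thing the bytes can tell you for certain is whether the file starts with a byte order mark (BOM). A minimal sketch, assuming you pass in the first few bytes of the file; encoding_from_bom is a hypothetical helper name, and a file without a BOM simply gives None, which says nothing about its encoding.

```python
import codecs

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(head: bytes):
    """Return a codec name if `head` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


# Usage: with open('blah.csv', 'rb') as f: print(encoding_from_bom(f.read(4)))
print(encoding_from_bom("a,b".encode("utf-8-sig")))
```

The returned names are Python codecs that consume the BOM while decoding, so they can be passed straight to open() or pandas.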
In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.
icecream = '🍦'
import csv
with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])
import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))
{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}
This gives UnicodeDecodeError while trying to read the file with this encoding.
The default encoding on Mac is UTF-8. It's included explicitly here but that wasn't even necessary... but on Windows it might be.
with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())
ice cream,🍦
The file command also picked this up
file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators
My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
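The "UTF-8 first" idea can be sketched without any third-party library. decode_utf8_first is a hypothetical helper, and the fallback list is my own assumption; you could just as well call chardet at the point where the fallbacks start.

```python
def decode_utf8_first(data: bytes, fallbacks=("cp1252", "latin-1")):
    """Decode bytes, trying strict UTF-8 before any fallback encoding."""
    for encoding in ("utf-8", *fallbacks):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 (including emoji) is accepted on the first attempt.
text, used = decode_utf8_first("ice cream,🍦".encode("utf-8"))
print(used, text)
```

Strict UTF-8 decoding rarely succeeds by accident on text written in another encoding, which is why it makes a good first check.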
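In Python, you can try every encoding Python knows and see which ones load without an error (several may succeed; a successful load does not prove the encoding is the right one):
import pandas as pd
from encodings.aliases import aliases
alias_values = set(aliases.values())
for encoding in sorted(alias_values):
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        # not a text encoding, or it cannot decode this file
        pass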
As mentioned by #3724913 (Jitender Kumar), you can use the file command (it also works in WSL on Windows). On my system, file blah.csv doesn't show the encoding information, but using the information in man file I was able to get it by executing file --exclude encoding blah.csv.
import os
import chardet
import pandas as pd
def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
    """
    Reads the CSV file located at path and returns it as a Pandas DataFrame,
    using chardet to guess the encoding from the first part of the file.
    Args:
        path (str): The path to the CSV file.
        size (float): The fraction of the file to be used for detecting the
            encoding. Defaults to 0.10.
    Returns:
        pd.DataFrame: The CSV file as a Pandas DataFrame.
    Raises:
        UnicodeError: If the file still cannot be decoded after the whole
            file has been used for detection. Before that, the function
            retries with a larger size (increased by 0.20).
    """
    try:
        byte_size = int(os.path.getsize(path) * min(size, 1.0))
        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))
        return pd.read_csv(path, encoding=result["encoding"])
    except UnicodeError:
        if size >= 1.0:
            raise  # the whole file was already sampled; give up
        return read_csv(path=path, size=size + 0.20)
Hi, I just added a function to find the correct encoding and read the csv in the given file path. Thought it would be useful
Just add the encoding argument that matches the file you're trying to upload.
open('example.csv', encoding='UTF8')
Utilizing Python3, BeautifulSoup and very minimal regex, I'm trying to scrape the text off of this webpage:
http://www.presidency.ucsb.edu/ws/?pid=2921
I have already successfully extracted its HTML into a file. In fact, I've done this with almost all of the presidential speeches available on this website; I have the HTML of 247 (out of 258 possible) speeches saved locally on my computer.
My code for extracting just the text off of each page looks like this:
import re
from bs4 import BeautifulSoup
with open('scan_here.txt') as reference: #'scan_here.txt' is a file containing all the pages whose html I have downloaded successfully
    for line in reference:
        line_unclean = reference.readline() #each file's name is just a random string of 5-6 integers
        line = str(re.sub(r'\n', '', line_unclean)) #for removing '\n' from each file name
        f = open(('local_path_to_folder_containing_all_the_html_files\\') + line)
        doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)
Utilizing this code, I get the following error message:
Traceback (most recent call last):
File "extract_individual_speeches.py", line 11, in <module>
doc = f.read()
File "/usr/lib/python3.4/codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte
I understand this is a decode error and have tried to run this code on other html files, not just the first one which appears in the list of file names in 'scan_here.txt'. I get the same error, so I think it's an encoding issue local to the html files.
I think the problem might lie with this third line of the html, which has the same encoding for all my html files:
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
What is 'windows-1251'? I assume it's the problem here. I've looked it up and seen there are some windows-1251 to UTF-8 converters, but I didn't see one which works well with Python.
I found this SO thread which seems to deal with this issue of conversion, but I'm not sure how to integrate it with my existing code.
Any help on this issue is much appreciated, TIA.
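'windows-1251' is a standard Windows encoding (the code page for Cyrillic text). What you need is UTF-8. You can define the encoding when you open a file, so that Python decodes it correctly.
Try something like this:
with open(file, 'r', encoding='windows-1251') as f:
    text = f.read()
or, if text holds raw bytes (for example, read in 'rb' mode):
text = text.decode('windows-1251')
You can also use codecs to rewrite the file as UTF-8:
import codecs
with codecs.open(file, 'r', 'windows-1251') as src:
    text = src.read()
with codecs.open(file, 'w', 'UTF-8') as dst:
    dst.write(text)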
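If you would rather keep the original files untouched, the same conversion can write to a separate file. This is a sketch using only the standard library; convert_to_utf8 and the file names are hypothetical, and the source encoding is assumed from the META tag in the question.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "windows-1251") -> None:
    """Re-encode a text file as UTF-8, writing the result to a separate file."""
    # Work on bytes so that line endings pass through untouched.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Usage with a throwaway file; byte 0x97 is an em dash in windows-1251.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "speech.html"        # hypothetical file name
    converted = Path(folder) / "speech_utf8.html"  # hypothetical file name
    original.write_bytes(b"<span>one \x97 two</span>")
    convert_to_utf8(original, converted)
    result = converted.read_text(encoding="utf-8")

print(result)
```

Note that the META tag inside a converted file would still claim windows-1251, so pass the decoded string (not the raw bytes) to the parser.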
I'm trying to read .csv file that contains utf-8 data in some of its columns. The method of reading is by using pandas dataframe. The code is as following:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14: invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3: invalid continuation byte
Could you please share your ideas and experience with such problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
sample of the raw data, I cannot put full record because of the legal restrictions of the medical data:
I had this problem for no apparent reason. I managed to get it to work using this:
df = pd.read_csv('file', encoding="ISO-8859-1")
I'm not sure why it works, though.
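A likely reason this never raises: ISO-8859-1 assigns a character to every possible byte value, so decoding cannot fail. That does not mean the result is right, as this small sketch shows (the sample character is just an illustration):

```python
# Every one of the 256 possible byte values decodes under ISO-8859-1.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
print(len(decoded))

# The price: UTF-8 data read as ISO-8859-1 loads without error but is garbled.
garbled = "é".encode("utf-8").decode("iso-8859-1")
print(garbled)
```

So if the file is really UTF-8 (or a Windows code page), this setting makes the error go away while quietly corrupting accented characters.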
I also did what Irh09 proposed, but the second file I read was decoded wrongly, and I couldn't find a column whose name contains accented characters (á, é, í, ó, ú).
So I recommend wrapping the read in a fallback like this:
try:
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv('file', encoding="ISO-8859-1")