PyCharm does not display Cyrillic - json

Help fix the problem:
The task was the following - to save the database model in the fixed structure. I did it using the terminal (python manage.py dumpdata ...)
Json file was created but does not display Cyrillic. utf-8 encoding, please help.
enter image description here
enter image description here
I tried to change the encoding type in the settings, I tried to manually rewrite the json file

import json
file =
r'C:\Programming\python_django\store\products\fixtures\category.json'
file_2 =
r'C:\Programming\python_django\store\products\fixtures\category.json'
with open(file, 'rb') as open_file:
data = json.load(open_file)
with open(file_2, 'w') as write_file:
write_file.write(json.dumps(data, ensure_ascii=False,
indent=0,
separators=(',', ':
')).encode('cp866').decode('cp1251'))

Related

How to get CSV to work in in RevitPythonShell?

Has anyone figured out how to get CSV (or any other package) to work in RevitPythonShell? I've only been able to get Excel from Interop to work.
When I try running csv in RPS, the terminal executes and shows no error or any kind of feed back, and the file is not created either.
This is the basic code I'm trying to run which comes from a tutorial on CSV I believe.
with open('mycsv2.csv', 'w') as f:
fieldnames = ['column1', 'column2', 'column3']
thewriter = csv.DictWriter(f, fieldnames=fieldnames)
thewriter.writeheader()
for i in range(1, 10):
thewriter.writerow({'column1':'one', 'column2':'two', 'column3':'three'})
I find CSV much more user friendly and easier to understand than Interop Excel. I believe I've read its doable somewhere but of course I cant find the source now.
All help, tips, or tricks are appreciated.
I can get it to work by supplying the full path name to the open function, so it looks like (showing full path to my Documents Folder):
import csv
with open(r'C:\Users\callum\Documents\mycsv2.csv', 'w') as f:
fieldnames = ['column1', 'column2', 'column3']
thewriter = csv.DictWriter(f, fieldnames=fieldnames)
thewriter.writeheader()
for i in range(1, 10):
thewriter.writerow({'column1':'one', 'column2':'two', 'column3':'three'})
Let me know if that does the trick!

I am trying to parse 350 files on my local disk and store the data into database as json objects

I am parsing 350 txt files having json data using python. I am able to retrieve 62 of those object and store them on mysql database, but after that I am getting an error saying JSONDecodeError: ExtraData
Python:
import os
import ast
import json
import mysql.connector as mariadb
from mysql.connector.constants import ClientFlag
mariadb_connection = mariadb.connect(user='root', password='137800000', database='shaproject',client_flags=[ClientFlag.LOCAL_FILES])
cursor = mariadb_connection.cursor()
sql3 = """INSERT INTO shaproject.alttwo (alttwo_id,responses) VALUES """
os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')
current_list_dir=os.listdir()
print(current_list_dir)
cur_cwd=os.getcwd()
cur_cwd=cur_cwd.replace('\\','/')
twoid=1
for every_file in current_list_dir:
file=open(cur_cwd + "/" + every_file)
utffile=file.read()
data=json.loads(utffile)
for i in range(0,len(data['response'])):
data123 = json.dumps(data['response'][i])
tup=(twoid,data123)
print(sql3+str(tup))
twoid+=1
cursor.execute(sql3+str(tup)+";")
print(tup)
mariadb_connection.commit()
I have searched online and found that multiple dump statements are resulting in this error. But I am unable to resolve it.
You want to use glob.
Rather than os.listdir(), which is too permissive,
use glob to focus on just the *.json files.
Print out the name of the file before asking .loads() to parse it.
Rename any badly formatted files to .txt rather than .json, in order to skip them.
Note that you can pass the open file directly to .load(), if you wish.
Closing open files would be a good thing.
Rather than a direct assignment (with no close()!)
you would be better off with with:
with open(cur_cwd + "/" + every_file) as file:
data = json.load(file)
Talking about current current working directory seems
both repetitive and redundant.
It would suffice to call it cwd.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am trying to read twitter data from json file using python 2.7.12.
Code I used is such:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
def get_tweets_from_file(file_name):
tweets = []
with open(file_name, 'rw') as twitter_file:
for line in twitter_file:
if line != '\r\n':
line = line.encode('ascii', 'ignore')
tweet = json.loads(line)
if u'info' not in tweet.keys():
tweets.append(tweet)
return tweets
Result I got:
Traceback (most recent call last):
File "twitter_project.py", line 100, in <module>
main()
File "twitter_project.py", line 95, in main
tweets = get_tweets_from_dir(src_dir, dest_dir)
File "twitter_project.py", line 59, in get_tweets_from_dir
new_tweets = get_tweets_from_file(file_name)
File "twitter_project.py", line 71, in get_tweets_from_file
line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.
In my case(mac os), there was .DS_store file in my data folder which was a hidden and auto generated file and it caused the issue. I was able to fix the problem after removing it.
It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further - It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information
The error is happening because line is a string and you're calling encode(). encode() only makes sense if the string is a Unicode, so Python tries to convert it Unicode first using the default encoding, which in your case is UTF-8, but should be ASCII. Either way, 0x80 is not valid ASCII or UTF-8 so fails.
0x80 is valid in some characters sets. In windows-1252/cp1252 it's €.
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode String types are a handy Python feature that allows you to decode encoded Strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here's I've set the encoding to windows-1252.
with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
for line in twitter_file:
# line is now a <type 'unicode'>
tweet = json.loads(line)
The io module also provide Universal Newlines. This means \r\n are detected as newlines, so you don't have to watch for them.
For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.
This was the original code:
import pickle as pkl
with open(pkl_path, 'r') as f:
obj = pkl.load(f)
And this fixed the error:
import pickle as pkl
with open(pkl_path, 'rb') as f:
obj = pkl.load(f)
I got a similar error by accidentally trying to read a parquet file as a csv
pd.read_csv(file.parquet)
pd.read_parquet(file.parquet)
The error occurs when you are trying to read a tweet containing sentence like
"#Mike http:\www.google.com \A8&^)((&() how are&^%()( you ". Which cannot be read as a String instead you are suppose to read it as raw String .
but Converting to raw String Still gives error so i better i suggest you to
read a json file something like this:
import codecs
import json
with codecs.open('tweetfile','rU','utf-8') as f:
for line in f:
data=json.loads(line)
print data["tweet"]
keys.append(data["id"])
fulldata.append(data["tweet"])
which will get you the data load from json file .
You can also write it to a csv using Pandas.
import pandas as pd
output = pd.DataFrame( data={ "tweet":fulldata,"id":keys} )
output.to_csv( "tweets.csv", index=False, quoting=1 )
Then read from csv to avoid the encoding and decoding problem
hope this will help you solving you problem.
Midhun

How to check encoding of a CSV file

I have a CSV file and I wish to understand its encoding. Is there a menu option in Microsoft Excel that can help me detect it
OR do I need to make use of programming languages like C# or PHP to deduce it.
You can use Notepad++ to evaluate a file's encoding without needing to write code. The evaluated encoding of the open file will display on the bottom bar, far right side. The encodings supported can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop down.
In Linux systems, you can use file command. It will give the correct encoding
Sample:
file blah.csv
Output:
blah.csv: ISO-8859 text, with very long lines
If you use Python, just use a print() function to check the encoding of a csv file. For example:
with open('file_name.csv') as f:
print(f)
The output is something like this:
<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
You can also use python chardet library
# install the chardet library
!pip install chardet
# import the chardet library
import chardet
# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
print(chardet.detect(file.read()))
Use chardet https://github.com/chardet/chardet (documentation is short and easy to read).
Install python, then pip install chardet, at last use the command line command.
I tested under GB2312 and it's pretty accurate. (Make sure you have at least a few characters, sample with only 1 character may fail easily).
file is not reliable as you can see.
Or you can execute in python console or in Jupyter Notebook:
import csv
data = open("file.csv","r")
data
You will see information about the data object like this:
<_io.TextIOWrapper name='arch.csv' mode='r' encoding='cp1250'>
As you can see it contains encoding infotmation.
CSV files have no headers indicating the encoding.
You can only guess by looking at:
the platform / application the file was created on
the bytes in the file
In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.
icecream = '🍦'
import csv
with open('test.csv', 'w') as f:
wf = csv.writer(f)
wf.writerow(['ice cream', icecream])
import chardet
with open('test.csv', 'rb') as f:
print(chardet.detect(f.read()))
{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}
This gives UnicodeDecodeError while trying to read the file with this encoding.
The default encoding on Mac is UTF-8. It's included explicitly here but that wasn't even necessary... but on Windows it might be.
with open('test.csv', 'r', encoding='utf-8') as f:
print(f.read())
ice cream,🍦
The file command also picked this up
file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators
My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
In Python, You can Try...
from encodings.aliases import aliases
alias_values = set(aliases.values())
for encoding in set(aliases.values()):
try:
df=pd.read_csv("test.csv", encoding=encoding)
print('successful', encoding)
except:
pass
As it is mentioned by #3724913 (Jitender Kumar) to use file command (it also works in WSL on Windows), I was able to get encoding information of a csv file by executing file --exclude encoding blah.csv using info available on man file as file blah.csv won't show the encoding info on my system.
import pandas as pd
import chardet
def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
"""
Reads a CSV file located at path and returns it as a Pandas DataFrame. If
nrows is provided, only the first nrows rows of the CSV file will be
read. Otherwise, all rows will be read.
Args:
path (str): The path to the CSV file.
size (float): The fraction of the file to be used for detecting the
encoding. Defaults to 0.10.
Returns:
pd.DataFrame: The CSV file as a Pandas DataFrame.
Raises:
UnicodeError: If the encoding of the file cannot be detected with the
initial size, the function will retry with a larger size (increased by
0.20) until the encoding can be detected or an error is raised.
"""
try:
byte_size = int(os.path.getsize(path) * size)
with open(path, "rb") as rawdata:
result = chardet.detect(rawdata.read(byte_size))
return pd.read_csv(path, encoding=result["encoding"])
except UnicodeError:
return read_csv(path=path, size=size + 0.20)
Hi, I just added a function to find the correct encoding and read the csv in the given file path. Thought it would be useful
Just add the encoding argument that matches the file you`re trying to upload.
open('example.csv', encoding='UTF8')

Invalid byte sequence in UTF-8, CSV import, Rails 4

I have a rake task that populates my database from a CSV file:
require 'csv'
namespace :import_data_csv do
desc "Import teams from csv file"
task import_data: :environment do
CSV.foreach(file, :headers => true) do |row|
#various import tasks
This had been working properly, but with a new CSV file, I'm getting the following error on the 6th row of the CSV file:
Invalid byte sequence in UTF-8
I have looked through the row and can't seem to find any irregular characters.
I've also attempted a couple other fixes recommended on stackoverflow:
- Changing the CSV.foreach to:
reader = CSV.open(file, "r")
reader.each do |row|
And changing:
CSV.foreach(file, headers => true) do |row|
to:
CSV.foreach(file, encoding: "r:ISO-8859-1", :headers => true) do |row|
None of these seem to correct the issue.
Suggestions?
I solved this by saving the file as a MDOS CSV, instead of the standard CSV file or the Windows CSV format.
The answer for me was to take the CSV file and save it to a text file. Then replace the tabs with commas. Then save the file as UTF-8 encoded. Finally, change the .txt to .csv and make sure it works in Excel BUT DON'T save it in Excel. Just close it when you see it looks correct. Then upload it.
A long non-programatic solution, but for my purposes it's sufficient.
Source is here: https://help.salesforce.com/apex/HTViewSolution?id=000003837&language=en_US