How can a read csv.deflate hdfs files in a dask datfarame? - csv

I am trying to read csv.deflate files from hdfs path and put them in dask dataframe. I tried read_csv and I am getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte" error. Then, I set engine='python' and encoding='utf-8' but I am still getting the same error.

Perhaps the compression= keyword would help? How would you read this data locally with Pandas? I suspect that you need the same keyword arguments that you would need in that case.

Related

I'm getting UnicodeDecodeError when trying to load a JSON file into a dataframe

So, I'm using the following code to get pandas to read my JSON text file-
f = open('C:/Users/stans/WFH Project/data.json')
data = json.load(f)
df = pd.DataFrame(data, index=[0])
f.close()
Once I execute the cell, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position
1535: character maps to
I used the above coding for a smaller sample of JSON data and it worked. But, since I updated the file to include a much larger sample, I get that error.
I verified that the JSON format is correct, and I also tried in the open statement-
encoding='utf-8'
and
errors='ignore'
Both produced value errors. Any ideas? Thanks in advance for your help!

Premature end of line Weka error

I'm new to Weka and I have to use it for a University project. So, I created a .csv file and when I try to upload it to Weka, it says: "not recognised as a CSV data file. Reason: 1 problem encountered on line 2".
Then, if I open the .csv file with Notepad and then save as .arff file, when I try to open it again with Weka, in this case I get another error message: "not recognised as an arff data file. Reason: premature end of line, read Token[EOL], line 8".
Please help, I don't know much about working with Weka and really don't know what could be the problem, even though I did a lot of research about this problem.
This is the file: https://app.box.com/s/adfpf1zatgpl5mo20u5hdd1gnqihnq40
#Relation "PIB_Rata inflatiei"
#Attribute "PIB" NUMERIC
#Attribute "Rata_inflatiei" NUMERIC
#Data
30624.3,20780.9,27980.4,31920.3,37657.0,37168.3,35838.9,41978.0,36183.4,37439.0,40717.1,46174.0,59867.6,76217.6,99699.2,123533.7,171540.2,208185.1,167421.6,167998.1,185362.3,171664.6,191548.1,199325.9,177956.0
128.0,211.2,255.2,136.8,32.2,38.8,154.8,59.1,45.8,45.7,34.5,22.5,15.3,11.3,9.0,6.6,4.8,7.8,5.6,6.1,5.8,3.3,4.0,1.1,-0.6
In the ARFF format (as well as CSV) instances are rows, and attributes are columns.
Your file thus has too many columns, ever row must have exactly.two values.

How to load OSM (GeoJSON) data to ArangoDB?

How I can load OSM data to ArangoDB?
I loaded data sed named luxembourg-latest.osm.pbf from OSM, than converted it to JSON with OSMTOGEOJSON, after I tried to load result geojson to ArangoDB with next command: arangoimp --file out.json --collection lux1 --server.database geodb and got hude list of errors:
...
2017-03-17T12:44:28Z [7712] WARNING at position 719386: invalid JSON type (expecting object, probably parse error), offending context: ],
2017-03-17T12:44:28Z [7712] WARNING at position 719387: invalid JSON type (expecting object, probably parse error), offending context: [
2017-03-17T12:44:28Z [7712] WARNING at position 719388: invalid JSON type (expecting object, probably parse error), offending context: 5.867441,
...
What I am doing wrong?
upd: it's seems that converter osm2json converter should be run with option osmtogeojson --ndjson that produce items not as single Json, but in line by line mode.
As #dmitry-bubnenkov already found out, --ndjson is required to produce the right input for ArangoImp.
One has to know here, that ArangoImp expects a JSON-Subset (since it doesn't parse the json on its own) dubbed as JSONL.
Thus, Each line of the JSON-File is expected to become one json document in the collection after the import. To maximize performance and simplify the implementation, The json is not completely parsed before sending it to the server.
It tries to chop the JSON into chunks with the maximum request size that the server permits. It leans on the JSONL-line endings to isolate possible chunks.
However, the server expects valid JSON for sure. Sending the chopped part to the server with possibly incomplete JSON documents will lead to parse errors on the server, which is the error message you saw in your output.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am trying to read twitter data from json file using python 2.7.12.
Code I used is such:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
def get_tweets_from_file(file_name):
tweets = []
with open(file_name, 'rw') as twitter_file:
for line in twitter_file:
if line != '\r\n':
line = line.encode('ascii', 'ignore')
tweet = json.loads(line)
if u'info' not in tweet.keys():
tweets.append(tweet)
return tweets
Result I got:
Traceback (most recent call last):
File "twitter_project.py", line 100, in <module>
main()
File "twitter_project.py", line 95, in main
tweets = get_tweets_from_dir(src_dir, dest_dir)
File "twitter_project.py", line 59, in get_tweets_from_dir
new_tweets = get_tweets_from_file(file_name)
File "twitter_project.py", line 71, in get_tweets_from_file
line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.
In my case(mac os), there was .DS_store file in my data folder which was a hidden and auto generated file and it caused the issue. I was able to fix the problem after removing it.
It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further - It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information
The error is happening because line is a string and you're calling encode(). encode() only makes sense if the string is a Unicode, so Python tries to convert it Unicode first using the default encoding, which in your case is UTF-8, but should be ASCII. Either way, 0x80 is not valid ASCII or UTF-8 so fails.
0x80 is valid in some characters sets. In windows-1252/cp1252 it's €.
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode String types are a handy Python feature that allows you to decode encoded Strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here's I've set the encoding to windows-1252.
with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
for line in twitter_file:
# line is now a <type 'unicode'>
tweet = json.loads(line)
The io module also provide Universal Newlines. This means \r\n are detected as newlines, so you don't have to watch for them.
For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.
This was the original code:
import pickle as pkl
with open(pkl_path, 'r') as f:
obj = pkl.load(f)
And this fixed the error:
import pickle as pkl
with open(pkl_path, 'rb') as f:
obj = pkl.load(f)
I got a similar error by accidentally trying to read a parquet file as a csv
pd.read_csv(file.parquet)
pd.read_parquet(file.parquet)
The error occurs when you are trying to read a tweet containing sentence like
"#Mike http:\www.google.com \A8&^)((&() how are&^%()( you ". Which cannot be read as a String instead you are suppose to read it as raw String .
but Converting to raw String Still gives error so i better i suggest you to
read a json file something like this:
import codecs
import json
with codecs.open('tweetfile','rU','utf-8') as f:
for line in f:
data=json.loads(line)
print data["tweet"]
keys.append(data["id"])
fulldata.append(data["tweet"])
which will get you the data load from json file .
You can also write it to a csv using Pandas.
import pandas as pd
output = pd.DataFrame( data={ "tweet":fulldata,"id":keys} )
output.to_csv( "tweets.csv", index=False, quoting=1 )
Then read from csv to avoid the encoding and decoding problem
hope this will help you solving you problem.
Midhun

Error converting string to dictionary object

I am converting the Json string into a Python dictionary object and I get the following error for the below code:
import json
path = 'data2012-03-16.txt'
records = [json.loads(line) for line in open(path)]
Error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte
few suggestion-
Maybe the encoding of the file is not valid? try to open it in notepad++ and change the encoding.
Are you sure your json file is well formatted? try open it in json parser and check it.
Why you got error with byte 0x92 in position 6 what is in this index of your file? maybe you have problem with all the \/ issue, try to replace it with other letters and check if it is working. beside, you can use the elimination way and try to open other file with the same code.After that work open thin version of this file etc.