I'm getting the following errors when trying to decode this data; the second error appeared after I tried to compensate for the unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the second part (as below), I have no problem running this. I think your first problem is a consequence of the second error being incorrect.
I am on Python 3 and can run your code with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
I personally would use requests.
import requests
import csv
data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error, looking at the documentation for the open function, you need to use the correct argument names and name the mode argument if it is not matched positionally.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
I've got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):
CREATE EXTERNAL TABLE IF NOT EXISTS test (
  timestamp double,
  stats array<struct<time:double, mean:double, var:double>>,
  dets array<struct<coords:array<double>, header:struct<frame:int, seq:int, name:string>>>,
  pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'
Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query and then they will give me the CSV results.
If we just do a straight select * we get back the struct/array columns in a format that isn't quite JSON.
Here's a sample input file entry:
{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}
And example output:
select * from test
"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.
Unfortunately it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:
SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"
"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"
Is there a good way to convert to JSON (or another easy-to-import format) or should I just go ahead and do a custom parsing function?
I have skimmed through all the documentation and unfortunately there seems to be no way to do this as of now. The only possible workaround is
converting a struct to JSON when querying Athena by selecting its fields individually:
SELECT
  my_field,
  my_field.a,
  my_field.b,
  my_field.c.d,
  my_field.c.e
FROM
  my_table
Alternatively, I would convert the data to JSON using post-processing. The script below shows how:
#!/usr/bin/env python
import io
import re
pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)
with io.open("test.csv") as f:
headers = list(map(lambda f: f.strip(), f.readline().split(",")))
for line in f.readlines():
orig_line = line
data = []
for i, l in enumerate(line.split('","')):
data.append(headers[i] + ":" + re.sub('^"|"$', "", l))
line = "{" + ','.join(data) + "}"
line = pattern1.sub(r'"\1":', line)
line = pattern2.sub(r':"\1"', line)
print(line)
The output on your input data is
{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}
}
Which is valid JSON.
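A quick way to sanity-check the script's output is to feed each printed line back through json.loads; the script and output file names below are my assumptions about how the output was redirected:
import json

# assumes the script above was saved as convert.py and run as: python convert.py > converted.jsonl
with open("converted.jsonl") as f:
    for line in f:
        record = json.loads(line)  # raises ValueError if the line is not valid JSON
        print(record["timestamp"], len(record["stats"]))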
The Python code from @tarun almost got me there, but I had to modify it in several ways due to my data. In particular, I have:
JSON structures saved in Athena as strings
Strings that contain multiple words and therefore need to be wrapped in double quotes; some of them contain "[]" and "{}" symbols.
Here is the code that worked for me; hopefully it will be useful for others:
#!/usr/bin/env python
import io
import re, sys
pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)
with io.open(sys.argv[1]) as f:
    headers = list(map(lambda f: f.strip(), f.readline().split(",")))
    print(headers)
    for line in f.readlines():
        orig_line = line
        # save the double quote cases, which mean there is a string with quotes inside
        line = re.sub('""', "#", orig_line)
        data = []
        for i, l in enumerate(line.split('","')):
            item = re.sub('^"|"$', "", l.rstrip())
            if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):
                data.append(headers[i] + ":" + item)
            else:  # we have a string
                data.append(headers[i] + ": \"" + item + "\"")
        line = "{" + ','.join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        # restate the double quotes to single ones, once inside the json
        line = re.sub("#", '"', line)
        print(line)
This method does not modify the query; it works by post-processing. For JavaScript/Node.js we can use the npm package athena-struct-parser.
Detailed answer with example: https://stackoverflow.com/a/67899845/6662952
Reference - https://www.npmjs.com/package/athena-struct-parser
I used a simple approach to get around the struct -> JSON Athena limitation: I created a second table where the JSON columns were saved as raw strings. Using Presto JSON and array functions I was able to query the data and return the valid JSON string to my program:
-- Array transform functions too
select
    json_extract_scalar(dd, '$.timestamp') as timestamp,
    transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.time')) as arr_stats_time,
    transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.mean')) as arr_stats_mean,
    transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.var')) as arr_stats_var
from
    (select '{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}' as dd);
I know the query will take longer to execute but there are ways to optimize.
I worked around this by creating a second table using the same S3 location, but changed the field's data type to string. The resulting CSV then had the string that Athena pulled from the object in the JSON file and I was able to parse the result.
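As a rough sketch of how the export from such a string-typed table can be consumed (the file and column names below are my assumptions, not from the actual table), the string columns can be parsed directly because they hold the raw JSON fragments from the source records:
import csv
import json

# read the CSV exported from the string-typed table; column names are assumed
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        stats = json.loads(row["stats"])  # raw JSON text for the former struct column
        pos = json.loads(row["pos"])
        print(stats[0]["mean"], pos["x"])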
I also had to adjust the @tarun code because I had more complex data and nested structures. Here is the solution I've got; I hope it helps:
import re
import json
import pandas as pd
import numpy as np

pattern1 = re.compile(r'(?<=[{,\[])\s*([^{}\[\],"=]+)=')
pattern2 = re.compile(r':([^{}\[\],"]+|()(?![{\[]))')
pattern3 = re.compile(r'"null"')

def convert_metadata_to_json(value):
    if type(value) is str:
        value = pattern1.sub('"\\1":', value)
        value = pattern2.sub(': "\\1"', value)
        value = pattern3.sub('null', value)
    elif np.isnan(value):
        return None
    return json.loads(value)

df = pd.read_csv('test.csv')
df['metadata_json'] = df.metadata.apply(convert_metadata_to_json)
I am getting the following error when reading a large 6 GB single-line JSON file:
Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648
Spark does not read JSON records that span multiple lines, hence the entire 6 GB JSON file is on a single line:
jf = sqlContext.read.json("jlrn2.json")
configuration:
spark.driver.memory 20g
Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.
Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
So if your JSON document is in the form...
[
{ [record] },
{ [record] }
]
You'll want to change it to
{ [record] }
{ [record] }
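If the source file is small enough to rewrite locally, a minimal sketch of that conversion could look like this (file names are assumptions; note that json.load pulls the whole array into memory, so a 6 GB file would need a streaming parser instead):
import json

# rewrite a single JSON array as JSON Lines: one self-contained object per line
with open("records_array.json") as src, open("records.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")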
I stumbled upon this while reading a huge JSON file in PySpark and getting the same error. So, if anyone else is also wondering how to save a JSON file in the format that PySpark can read properly, here is a quick example using pandas:
import pandas as pd
from collections import defaultdict

# create some dict you want to dump
list_of_things_to_dump = [1, 2, 3, 4, 5]
dump_dict = defaultdict(list)
for number in list_of_things_to_dump:
    dump_dict["my_number"].append(number)

# save data like this using pandas, will work off the bat with PySpark
output_df = pd.DataFrame.from_dict(dump_dict)
with open('my_fancy_json.json', 'w') as f:
    f.write(output_df.to_json(orient='records', lines=True))
After that loading JSON in PySpark is as easy as:
df = spark.read.json("hdfs:///user/best_user/my_fancy_json.json", schema=schema)