How to read a gzipped JSONL file from a byte offset to debug a BigQuery error?

I am exporting newline-delimited JSON (JSONL) files into BigQuery, and the BigQuery errors give me a byte offset into the original gzipped JSONL file, such as:
JSON parsing error in row starting at position 727720: Repeated field must be imported as a JSON array. Field: named_entities.alt_form.
I have tried using the Python package indexed_gzip to read from the offset, but indexed_gzip sadly mangles the lines. I have also tried, unsuccessfully, to get the relevant line with the built-in Python gzip package:
import gzip
import ujson as json

f = open('myfile.json.gz', 'rb')
g = gzip.GzipFile(fileobj=f)
fasz = g.read()

byte_offset_to_line = {}
for line in g:
    byte_offset = f.tell()
    byte_offset_to_line[byte_offset] = line

target = 727720
ls = sorted([(abs(target - k), k) for k in byte_offset_to_line.keys() if k < target])
line_of_interest = byte_offset_to_line[ls[0]]
text = str(line_of_interest)
malformed_json = json.loads(text[2:-3])
With the above snippet I can get the nearest line's byte offset. But when I upload just that line to a test table in BQ, it loads fine, so I think I am not getting the correct line.
Is there a better approach to this problem? To be honest, I am not sure why my snippet above doesn't work.
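A minimal sketch of an alternative, assuming the position reported by BigQuery is an offset into the uncompressed data (which f.tell() on the compressed file object does not measure): walk the decompressed stream line by line and accumulate line lengths until the target offset falls inside a line.
import gzip

target = 727720  # byte offset reported by BigQuery, assumed to refer to the uncompressed data

offset = 0
with gzip.open('myfile.json.gz', 'rb') as g:
    for line in g:
        # The offending row starts somewhere inside this line
        if offset <= target < offset + len(line):
            print(offset, line.decode('utf-8'))
            break
        offset += len(line)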

Related

dataframe results are not returned while reading csv file

I'm trying to read a CSV file; below is the code I used, but it's not returning any results, even though the CSV file at the specified path has data in it. I had an issue when I used ValidFile = spark.read.csv(ValidationFileDest, header = True): that did return a result, but the column data was interchanged and nulls were assigned, which is why I applied mode DROPMALFORMED in my code. But now it is not returning any result.
parquetextension = ".parquet"
BronzeStage_Path = "dbfs:/mnt/bronze/stage/" + parentname + "/" + filename
#validated_path="dbfs:/mnt/bronze/landing/ClaimDenialsSouce/"+parentname+"/"+"current/"+"Valid/"+todayDate+"_"+"CDAValidFile"+extension
# df_sourcefilevalid.repartition(1).write.format(write_format).option("header", "true").save(BronzeStagePath)
# ValidFileSrc_BS= get_csv_files(exception_path)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("parquet_example") \
    .getOrCreate()

spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", False)

ValidFile = spark.read.format('csv').option("mode", "DROPMALFORMED").options(header='true', inferSchema='true').load(ValidationFileDest)
display(ValidFile)
Make sure you are providing the correct file path (or the correct variable holding it) for your CSV file. I have reproduced this in our environment and was able to read the CSV file without any issue.
Reading the CSV file:
filepath="dbfs:/FileStore/test11-1.csv"
df11 = spark.read.format("csv").option("mode", "DROPMALFORMED").option("header", "true").load(filepath)
display(df11)
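One quick way to confirm the path actually resolves before reading it (a sketch that assumes a Databricks notebook, where dbutils and the parentname variable from the question are available):
# List the parent folder to confirm the expected file is really there
display(dbutils.fs.ls("dbfs:/mnt/bronze/stage/" + parentname))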

Unable to print output of JSON code into a .csv file

I'm getting the following errors when trying to decode this data; the second error appeared after trying to compensate for the Unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))

subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])

with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the second part (as below), I have no problem running this. I think your first problem is a consequence of the second error (of that line being incorrect).
I am on Python 3 and can run your code with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
I personally would use requests.
import requests
import csv

data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()

subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])

with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error, looking at the documentation for the open function, you need to use the right argument names and name the mode argument if it is not matched positionally.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:

Serialise and deserialise a pandas PeriodIndex series

The pandas Series.to_json() function is creating unreadable JSON when using a PeriodIndex.
The error that occurs is:
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 5 (char 4)
I've tried changing the orient, but in all of these combinations of serialising and deserialising the index is lost.
import json
import pandas as pd

idx = pd.PeriodIndex(['2019', '2020'], freq='A')
series = pd.Series([1, 2], index=idx)
json_series = series.to_json()  # This is a demo - in reality I'm storing this in a database, but this code throws the same error
value = json.loads(json_series)
A link to the pandas to_json docs
A link to the python json lib docs
The reason I'm not using json.dumps is that the pandas series object is not serialisable.
Python 3.7.3 Pandas 0.24.2
A workaround is to convert PeriodIndex to regular Index before dump and convert it back to PeriodIndex after load:
regular_idx = period_idx.astype(str)
# then dump
# after load
period_idx = pd.to_datetime(regular_idx).to_period()
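For completeness, a minimal round trip along those lines (passing an explicit freq to to_period to match the example above, since it cannot be inferred from only two values):
import json
import pandas as pd

idx = pd.PeriodIndex(['2019', '2020'], freq='A')
series = pd.Series([1, 2], index=idx)

# Dump with a plain string index instead of the PeriodIndex
series.index = series.index.astype(str)
json_series = series.to_json()

# Load and restore the PeriodIndex
value = pd.Series(json.loads(json_series))
value.index = pd.to_datetime(value.index).to_period('A')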

python 3 read csv UnicodeDecodeError

I have a very simple bit of code that takes in a CSV and puts it into a 2D array. It runs fine on Python 2, but in Python 3 I get the error below. Looking through the documentation, I think I need to use .decode(). Could someone please explain how to use it in the context of my code, and why I don't need to do anything in Python 2?
Error:
line 21, in
for row in datareader:
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 5002: invalid start byte
import csv
import sys

fullTable = sys.argv[1]
datareader = csv.reader(open(fullTable, 'r'), delimiter=',')
full_table = []
for row in datareader:
    full_table.append(row)
print(full_table)
open(argv[1], encoding='ISO-8859-1')
The CSV contained characters that were not UTF-8, which seems to be the default. I am, however, surprised that Python 2 dealt with this issue without any problems.
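Applied to the snippet above, that becomes (a sketch; ISO-8859-1 is a guess that accepts any byte value, so check that the decoded text looks right):
import csv
import sys

fullTable = sys.argv[1]
# Open with an explicit non-UTF-8 encoding instead of the default
datareader = csv.reader(open(fullTable, 'r', encoding='ISO-8859-1'), delimiter=',')
full_table = []
for row in datareader:
    full_table.append(row)
print(full_table)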

Reading the data written to s3 by Amazon Kinesis Firehose stream

I am writing records to a Kinesis Firehose stream that are eventually written to an S3 file by Amazon Kinesis Firehose.
My record object looks like
ItemPurchase {
String personId,
String itemId
}
The data written to S3 looks like:
{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}
NO COMMA SEPARATION.
NO STARTING BRACKET as in a JSON array
[
NO ENDING BRACKET as in a JSON array
]
I want to read this data and get a list of ItemPurchase objects.
List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent))
What is the correct way to read this data?
It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.
Ultimately, the trick I found to deal with the problem was to process the text file using the JSON raw_decode method
This will allow you to read a bunch of concatenated JSON records without any delimiters between them.
Python code:

import json

decoder = json.JSONDecoder()
with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
    content = content_file.read()
    content_length = len(content)
    decode_index = 0
    while decode_index < content_length:
        try:
            obj, decode_index = decoder.raw_decode(content, decode_index)
            print("File index:", decode_index)
            print(obj)
        except json.JSONDecodeError as e:
            print("JSONDecodeError:", e)
            # Scan forward and keep trying to decode
            decode_index += 1
I also had the same problem; here is how I solved it:
Replace "}{" with "}\n{", then split the lines on "\n".
input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE)) \
              .flatMap(lambda line: line.split("\n"))
A nested JSON object has several "}"s, so splitting the line on "}" doesn't solve the problem.
I've had the same issue.
It would have been better if AWS allowed us to set a delimiter but we can do it on our own.
In my use case, I've been listening on a stream of tweets, and once receiving a new tweet I immediately put it to Firehose.
This, of course, resulted in a 1-line file which could not be parsed.
So, to solve this, I have concatenated the tweet's JSON with a \n.
This, in turn, let me use some packages that can output lines when reading stream contents, and parse the file easily.
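The producer side of that approach might look roughly like this (a boto3 sketch; the stream name is made up):
import json
import boto3

firehose = boto3.client('firehose')

def put_tweet(tweet):
    # The trailing newline is what makes the resulting S3 objects line-delimited
    firehose.put_record(
        DeliveryStreamName='my-tweets-stream',  # hypothetical stream name
        Record={'Data': (json.dumps(tweet) + '\n').encode('utf-8')}
    )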
Hope this helps you.
I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case I added ',' to the events that were pushed into the Firehose. Then, after a file is saved in S3, all the files contain JSON objects separated by some delimiter (a comma, in our case). Another thing that must be added is '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing them becomes possible.
If the input source for the Firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue, as cited here. You should have a Lambda function, as here, that outputs JSON objects on multiple lines.
I used a transformation Lambda to add a line break at the end of every record:

import base64
import copy

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Decode from base64 (Firehose records are base64 encoded)
        payload = base64.b64decode(record['data'])
        # Read json as utf-8
        json_string = payload.decode("utf-8")
        # Add a line break
        output_json_with_line_break = json_string + "\n"
        # Encode the data
        encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
        encoded_string = str(encoded_bytes, 'utf-8')
        # Create a deep copy of the record and append to output with transformed data
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output.append(output_record)
    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
Use this simple Python code:

import json

input_str = '''{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}'''
data_str = "[{}]".format(input_str.replace("}{", "},{"))
data_json = json.loads(data_str)
And then (if you want) convert to Pandas.
import pandas as pd
df = pd.DataFrame().from_records(data_json)
print(df)
And this is the result:
itemId personId
0 i-111 p-111
1 i-222 p-222
2 i-333 p-333
If there's a way to change how the data is written, please separate all the records with a newline. That way you can read the data simply, line by line. If not, then build a scanner object that takes "}" as a delimiter and use the scanner to read. That would do the job.
You can find each valid JSON object by counting the brackets. Assuming the file starts with a {, this Python snippet should work:
import json

def read_block(stream):
    open_brackets = 0
    block = ''
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == '{':
            open_brackets += 1
        elif c == '}':
            open_brackets -= 1
        block += c
        if open_brackets == 0:
            yield block
            block = ''

if __name__ == "__main__":
    c = 0
    with open('firehose_json_blob', 'r') as f:
        for block in read_block(f):
            record = json.loads(block)
            print(record)
This problem can be solved with a JSON parser that consumes objects one at a time from a stream. The raw_decode method of the JSONDecoder exposes just such a parser, but I've written a library that makes it straightforward to do this with a one-liner.
from firehose_sipper import sip

for entry in sip(bucket=..., key=...):
    do_something_with(entry)
I've added some more details in this blog post
In Spark, we had the same problem. We're using the following:
from pyspark.sql.functions import *
import re

@udf
def concatenated_json_to_array(text):
    final = "["
    separator = ""
    for part in text.split("}{"):
        final += separator + part
        separator = "}{" if re.search(r':\s*"([^"]|(\\"))*$', final) else "},{"
    return final + "]"

def read_concatenated_json(path, schema):
    return (spark.read
            .option("lineSep", None)
            .text(path)
            .withColumn("value", concatenated_json_to_array("value"))
            .withColumn("value", from_json("value", schema))
            .withColumn("value", explode("value"))
            .select("value.*"))
It works as follows:
Read the data as one string per file (no delimiters!)
Use a UDF to introduce the JSON array and split the JSON objects by introducing a comma. Note: be careful not to break any strings with }{ in them!
Parse the JSON with a schema into DataFrame fields.
Explode the array into separate rows
Expand the value object into columns.
Use it like this:
from pyspark.sql.types import *

schema = ArrayType(
    StructType([
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("id", IntegerType(), True),
            StructField("joke", StringType(), True),
            StructField("categories", ArrayType(StringType()), True)
        ]), True)
    ])
)

path = '/mnt/my_bucket_name/messages/*/*/*/*/'
df = read_concatenated_json(path, schema)
I've written more details and considerations here: Parsing JSON data from S3 (Kinesis) with Spark. Do not just split by }{, as it can mess up your string data! For example: { "line": "a\"r}{t" }.
You can use the script below.
As long as the streamed data does not exceed the buffer size you set, each S3 file ends up with one pair of brackets ([]) and comma-separated records.
import base64

print('Loading function')

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        print(record['recordId'])
        payload = base64.b64decode(record['data']).decode('utf-8') + ',\n'
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8'))
        }
        output.append(output_record)
    last = len(event['records']) - 1
    print('Successfully processed {} records.'.format(len(event['records'])))
    start = '[' + base64.b64decode(output[0]['data']).decode('utf-8')
    end = base64.b64decode(output[last]['data']).decode('utf-8') + ']'
    output[0]['data'] = base64.b64encode(start.encode('utf-8'))
    output[last]['data'] = base64.b64encode(end.encode('utf-8'))
    return {'records': output}
Using a JavaScript regex:
JSON.parse(`[${item.replace(/}\s*{/g, '},{')}]`);