Importing JSON file into Firebase error - json

I keep getting an error when uploading/importing my JSON file into Firebase. I initially had an Excel spreadsheet that I saved as a CSV file, then used a CSV-to-JSON converter.
I validated the JSON file (which has the .json extension) with a couple of online tools.
Still, I'm getting an error.
Here is an example of my JSON:
{
"Rk": 1,
"Tm": "SEA",
"H/A": "H",
"DOW": "Sun",
"Opp": "CLE",
"QB": "Russell Wilson",
"Grade": "BLUE",
"Def mu pts": 4,
"Inj status": 0,
"Notes": "Got to wonder if not having a proven power RB under center will negatively impact Wilson's production.",
"TFS $50K": "$8,300",
"Init sal": "$8,300",
"Var": "$0",
"WC": 0
}

The issue is your keys.
Firebase keys must be UTF-8 encoded and cannot contain . $ # [ ] / or ASCII control characters
0-31 or 127.
Your "TFS $50K" key and the "H/A" key are the problems.
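If renaming the columns in the source spreadsheet isn't convenient, here is a minimal sketch of a key sanitizer (plain Python, not tied to any particular Firebase client; replacing the forbidden characters with underscores is an assumption about what's acceptable for your data):
import re

FORBIDDEN = re.compile(r'[.$#\[\]/\x00-\x1f\x7f]')

def sanitize_keys(value):
    # Recursively rename keys so they contain none of . $ # [ ] /
    # or ASCII control characters, which Firebase rejects.
    if isinstance(value, dict):
        return {FORBIDDEN.sub('_', k): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(v) for v in value]
    return value

# e.g. {"H/A": "H", "TFS $50K": "$8,300"} becomes {"H_A": "H", "TFS _50K": "$8,300"}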

Related

Python Lambda actioning CSV file once but not the second time

I am experiencing a strange issue with my Python code. Its objective is the following:
Retrieve a .csv from S3
Convert that .csv into JSON (it's an array of objects)
Add a few key/value pairs to each object in the array, and change the original key values
Validate the JSON
Send the JSON to an /output S3 bucket
Load the JSON into Dynamo
Here's what the .csv looks like:
Prefix,Provider
ABCDE,Provider A
QWERT,Provider B
ASDFG,Provider C
ZXCVB,Provider D
POIUY,Provider E
And here's my python script:
import json
import boto3
import ast
import csv
import os
import datetime as dt
from datetime import datetime
import jsonschema
from jsonschema import validate

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

providerCodesSchema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "providerCode": {"type": "string", "maxLength": 5},
            "providerName": {"type": "string"},
            "activeFrom": {"type": "string", "format": "date"},
            "activeTo": {"type": "string"},
            "apiActiveFrom": {"type": "string"},
            "apiActiveTo": {"type": "string"},
            "countThreshold": {"type": "string"}
        },
        "required": ["providerCode", "providerName"]
    }
}

datestamp = dt.datetime.now().strftime("%Y/%m/%d")
timestamp = dt.datetime.now().strftime("%s")
updateTime = dt.datetime.now().strftime("%Y/%m/%d/%H:%M:%S")
nowdatetime = dt.datetime.now()
yesterday = nowdatetime - dt.timedelta(days=1)
nintydaysfromnow = nowdatetime + dt.timedelta(days=90)

def lambda_handler(event, context):
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    keyname_s3 = "newloader-ptv/output/{ds}/{ts}.json".format(ds=datestamp, ts=timestamp)
    json_data = []

    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        key_name = record['s3']['object']['key']
        s3_object = s3.get_object(Bucket=bucket_name, Key=key_name)
        data = s3_object['Body'].read()
        contents = data.decode('latin')
        with open(filename_csv, 'a', encoding='utf-8') as csv_data:
            csv_data.write(contents)

    with open(filename_csv, encoding='utf-8-sig') as csv_data:
        csv_reader = csv.DictReader(csv_data)
        for csv_row in csv_reader:
            json_data.append(csv_row)

    for elem in json_data:
        elem['providerCode'] = elem.pop('Prefix')
        elem['providerName'] = elem.pop('Provider')

    for element in json_data:
        element['activeFrom'] = yesterday.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['activeTo'] = nintydaysfromnow.strftime("%Y-%m-%dT%H:%M:%S.00-00:00")
        element['apiActiveFrom'] = " "
        element['apiActiveTo'] = " "
        element['countThreshold'] = "3"
        element['updateDate'] = updateTime

    try:
        validate(instance=json_data, schema=providerCodesSchema)
    except jsonschema.exceptions.ValidationError as err:
        print(err)
        err = "Given JSON data is InValid"
        return None

    with open(filename_json, 'w', encoding='utf-8-sig') as json_file:
        json_file.write(json.dumps(json_data, default=str))

    with open(filename_json, 'r', encoding='utf-8-sig') as json_file_contents:
        response = s3.put_object(Bucket=bucket_name, Key=keyname_s3, Body=json_file_contents.read())

    for jsonElement in json_data:
        table = dynamodb.Table('privateProviders-loader')
        table.put_item(Item=jsonElement)
    print("finished enriching JSON")

    os.remove(filename_csv)
    os.remove(filename_json)
    return None
I'm new to Python, so please forgive any amateur mistakes in the code.
Here's my issue:
When I deploy the code, and add a valid .csv into my S3 bucket, everything works.
When I then add an invalid .csv into my S3 bucket, it also works as intended: the import fails as the validation kicks in and tells me the problem.
However, when I add the valid .csv back into the S3 bucket, I get the same cloudwatch log as I did for the invalid .csv, and my Dynamo isn't updated, nor is an output JSON file sent to /output in S3.
With some troubleshooting I've noticed the following behaviour:
When I first deploy the code, the first .csv loads as expected (Dynamo table updated + JSON file sent to S3 + CloudWatch logs documenting the process)
If I enter the same valid .csv into the S3 bucket, it gives me the same nice-looking CloudWatch logs, but none of the other actions take place (Dynamo not updated etc.)
If I add the invalid .csv, that seems to break the cycle, and I get a nice CloudWatch log showing the validation has kicked in, but if I reload the valid .csv, which just previously resulted in good CloudWatch logs (but no actual outputs), I now get a repeat of the validation error log.
In short, the first time the function is invoked it seems to work; the second time it doesn't.
It seems as though the Python function is caching something or not closing out properly when finished. I've played about with the return statement etc., but nothing I've tried works. I've sunk many hours into moving parts of the code around, thinking the structure or order of events was the problem, and the code above gives me the closest behaviour to what's expected, given that it seems to work completely the first and only time I load the .csv into S3.
Any help or general pointers would be massively appreciated.
Thanks
P.S. Here's an example of the CloudWatch log when validation kicks in and stops an invalid .csv from being processed. If I then add a valid .csv to S3, the function is triggered, but I get this same error, even though the file is actually good.
2021-06-29T22:12:27.709+01:00 'ABCDEE' is too long
2021-06-29T22:12:27.709+01:00 Failed validating 'maxLength' in schema['items']['properties']['providerCode']:
2021-06-29T22:12:27.709+01:00 {'maxLength': 5, 'type': 'string'}
2021-06-29T22:12:27.709+01:00 On instance[2]['providerCode']:
2021-06-29T22:12:27.709+01:00 'ABCDEE'
2021-06-29T22:12:27.710+01:00 END RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9
2021-06-29T22:12:27.710+01:00 REPORT RequestId: 81dd6a2d-130b-4c8f-ad08-39307841adf9 Duration: 482.43 ms Billed Duration: 483
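One thing that can produce exactly this works-once pattern is Lambda's reuse of the execution environment between invocations: anything computed at module level (the timestamp above, and therefore the /tmp filenames) is only evaluated on the cold start. Whether that is the cause here is an assumption, but a minimal sketch of the alternative looks like this:
import datetime as dt

def lambda_handler(event, context):
    # Compute per invocation, not at import time, so a warm container
    # doesn't reuse the first invocation's timestamp and /tmp filenames.
    timestamp = dt.datetime.now().strftime("%s")
    filename_csv = "/tmp/file_{ts}.csv".format(ts=timestamp)
    filename_json = "/tmp/file_{ts}.json".format(ts=timestamp)
    # Opening the CSV with 'w' instead of 'a' would also avoid appending to a
    # leftover file if an earlier invocation returned before os.remove ran.
    ...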

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a DataFrame as follows:
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but with the following code I'm getting an error:
_meta = pd.DataFrame(
    columns=list(["id", "owner", "pet_id"])
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json(f"mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like
[
{
"id": 1,
"owner": "Charlie",
"pet_id": "pet_1"
},
{
"id": 2,
"owner": "Charlie",
"pet_id": "pet_2"
}
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PD:
I also tried doing it in the following way
{
"id": [1, 2],
"owner": ["Charlie", "Charlie"],
"pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
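Putting it together, a minimal self-contained sketch (reusing the meta and file pattern from the question; the file contents are assumed to look like the array shown above):
import dask.dataframe as dd
import pandas as pd

# Empty frame describing the expected columns and dtypes.
_meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"}
)

# Each file holds one whole JSON array of records, so don't split on newlines.
ddf = dd.read_json("mypets/*.json", meta=_meta,
                   blocksize=None, orient="records", lines=False)
print(ddf.compute())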

Cleaning of JSON Objects using Spark

I have been trying to clean my JSON file. I used an RDD to read the JSON file and then tried to clean it using the replace function, but I'm still not getting a correct JSON file because of the escape sequences present in the JSON values.
Here is the code with which I am trying to clean the JSON file of the various disturbances.
The cleaned JSON shows errors. Please review and tell me the issue.
val readjson = sparkSession
  .sparkContext.textFile("dev.json")
val json = readjson.map(element => element
    .replace("\"\":\"\"", "\":\"")
    .replace("\"\",\"\"", "\",\"")
    .replace("\"\":", "\":")
    .replace(",\"\"", ",\"")
    .replace("\"{\"\"", "{\"")
    .replace("\"\"}\"", "\"}"))
  .saveAsTextFile("JSON")
HERE IS MY JSON FILE
"{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":"""",""GENDR_CD"":"""",""PROV_NPI"":"""",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":"""",""PROFPROFL_DESGTN"":"""",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"
I tried cleaning the above JSON and got the following result:
{
"SEQ_NO": 596514,
"PROV_DEMOG_SK": 596514,
"PROV_ID": "QMP000003370581",
"FRST_NM": "",
"LAST_NM": "RICHARD WHITTINGTON BUTCHER",
"FUL_NM": "",
"GENDR_CD": "",
"PROV_NPI": "",
"PROV_STAT": "Incomplete",
"PROV_TY": "03",
"DT_OF_BRTH": "",
"PROFPROFL_DESGTN": "",
"ETL_LAST_UPDT_DT_TM": "2020-04-28 11:43:31.000000",
"PROV_CLSFTN_CD": "A",
"SRC_DATA_KEY": 50,
"OPRN_CD": "I",
"REC_SET": "F"
}
Online JSON validators show that it is still incorrect.
Looks like your JSON has one or a few control characters (\u0009); try replacing them with
.replaceAll("\\u0009"," ")
You can do it in the following sequence:
val replacedVal = """{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":\"\"\"",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":\"\"\"",""GENDR_CD"":\"\"\"",""PROV_NPI"":\"\"\"",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":\"\"\"",""PROFPROFL_DESGTN"":\"\"\"",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"""
.replace("""\"""",""""""")
.replace("""""""",""""""")
.replaceAll("\\u0009"," ")
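Since the lines look like CSV-quoted JSON (the whole object wrapped in quotes, with embedded quotes doubled), an equivalent way to think about the cleanup is to strip the outer quotes and collapse the doubled quotes instead of enumerating patterns. A rough PySpark sketch of that idea (Python here rather than Scala, and it assumes every line follows that quoting convention):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def unwrap(line):
    line = line.strip().replace("\t", " ")   # drop stray tab control characters
    if line.startswith('"') and line.endswith('"'):
        line = line[1:-1]                    # remove the outer CSV-style quotes
    return line.replace('""', '"')           # "" -> " inside the object

cleaned = spark.sparkContext.textFile("dev.json").map(unwrap)
cleaned.saveAsTextFile("JSON_cleaned")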

How to evaluate JSON Path with fields that contain quotes inside a value?

I have a NiFi flow that takes JSON files and evaluates a JSON Path argument against them. It works perfectly except when dealing with records that contain Korean text. The Jayway JSONPath evaluator does not seem to recognize the escape "\" in the headline field before the double quote character. Here is an example (newlines added to help with formatting):
{"data": {"body": "[이데일리 김관용 기자] 우리 군이 2018 남북정상회담을 앞두고 남
북간 군사적 긴장\r\n완화와 평화로운 회담 분위기 조성을 위해 23일 0시를 기해 군사분계선
(MDL)\r\n일대에서의 대북확성기 방송을 중단했다.\r\n\r\n국방부는 이날 남북정상회담 계기
대북확성기 방송 중단 관련 내용을 발표하며\r\n“이번 조치가 남북간 상호 비방과 선전활동을
중단하고 ‘평화, 새로운 시작’을\r\n만들어 나가는 성과로 이어지기를 기대한다”고 밝혔
다.\r\n\r\n전방부대 우리 군 장병이 대북확성기 방송을 위한 장비를 점검하고 있다.\r\n[사
진=국방부공동취재단]\r\n\r\n\r\n\r\n▶ 당신의 생활 속 언제 어디서나 이데일리 \r\n▶
스마트 경제종합방송 ‘이데일리 TV’ | 모바일 투자정보 ‘투자플러스’\r\n▶ 실시간 뉴스와
속보 ‘모바일 뉴스 앱’ | 모바일 주식 매매 ‘MP트래블러Ⅱ’\r\n▶ 전문가를 위한 국내 최상의
금융정보단말기 ‘이데일리 마켓포인트 3.0’ | ‘이데일리 본드웹 2.0’\r\n▶ 증권전문가방송
‘이데일리 ON’ 1666-2200 | ‘ON스탁론’ 1599-2203\n<ⓒ종합 경제정보 미디어 이데일리 -
무단전재 & 재배포 금지> \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 대북확성기 방송, 23일 0시부터 중단\"",
"id": "EDYM00251_1804232beO/5WAUgdlYbHS853hYOGrIL+Tj7oUjwSYwT"}}
If this object is in my file, the JSON path evaluates blanks for all the path arguments. Is there a way to force the Jayway engine to recognize the "\"? It appears to function correctly in other languages.
I was able to resolve this after understanding the difference between definite and indefinite paths. The Jayway github README points out the following will make a path indefinite and return a list:
When evaluating a path you need to understand the concept of when a path is definite. A path is indefinite if it contains:
.. - a deep scan operator
?(<expression>) - an expression
[<number>, <number> (, <number>)] - multiple array indexes
Indefinite paths always returns a list (as represented by current JsonProvider).
My JSON looked like the following:
{
"Version":"14",
"Items":[
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "1"}
},
{"data": {"body": "[이데일리 ... \r\n",
"mimeType": "text/plain",
"language": "ko",
"headline": "국방부 \"軍 ... 중단\"",
"id": "2"}
...
}
]
}
The JSON path selector I was using ($.data.headline) did not grab the values as I expected; it instead returned null values.
Changing it to $.Items[*].data.headline or $..data.headline returns a list of each headline.
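To see the definite/indefinite distinction outside of NiFi, here is a small sketch using the Python jsonpath-ng library (a stand-in for illustration only, not the Jayway engine the flow actually uses): the wildcard makes the path indefinite, so it returns a list of every matching headline.
from jsonpath_ng import parse

doc = {"Version": "14",
       "Items": [{"data": {"headline": "headline 1", "id": "1"}},
                 {"data": {"headline": "headline 2", "id": "2"}}]}

# Equivalent to $.Items[*].data.headline -- indefinite because of the wildcard.
expr = parse("Items[*].data.headline")
print([match.value for match in expr.find(doc)])   # ['headline 1', 'headline 2']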

Reading a json file into a RDD (not dataFrame) using pyspark

I have the following file, test.json:
{
"id": 1,
"name": "A green door",
"price": 12.50,
"tags": ["home", "green"]
}
I want to load this file into an RDD. This is what I tried:
rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))
I got an error:
Expecting object: line 1 column 1 (char 0)
I don't completely understand what json.loads does.
How can I resolve this problem?
textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
Just use the JSON reader:
spark.read.json("test.json", multiLine=True)
or (not recommended) whole text files
sc.wholeTextFiles("test.json").values().map(json.loads)
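If the end goal really is an RDD of Python dicts rather than a DataFrame, a small sketch combining the two approaches (assuming an existing SparkSession named spark and the test.json shown above):
import json

# The DataFrame reader handles the multi-line JSON; then drop down to an RDD of dicts.
df = spark.read.json("test.json", multiLine=True)
rdd = df.rdd.map(lambda row: row.asDict())
print(rdd.first())   # e.g. {'id': 1, 'name': 'A green door', 'price': 12.5, 'tags': ['home', 'green']}

# Or stay with RDDs from the start: one (path, content) pair per file.
rdd2 = spark.sparkContext.wholeTextFiles("test.json").values().map(json.loads)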