Reading a JSON file into an RDD (not a DataFrame) using PySpark

I have the following file, test.json:
{
"id": 1,
"name": "A green door",
"price": 12.50,
"tags": ["home", "green"]
}
I want to load this file into an RDD. This is what I tried:
import json

rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))
I got an error:
Expecting object: line 1 column 1 (char 0)
I don't completely understand what json.loads does.
How can I resolve this problem?

textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
Just use the JSON reader:
spark.read.json("test.json", multiLine=True)
or (not recommended) wholeTextFiles:
sc.wholeTextFiles("test.json").values().map(json.loads)
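If you specifically want an RDD rather than a DataFrame, here is a minimal sketch of both routes (assuming a SparkSession named spark and its SparkContext sc, as in the question):
import json

# Option 1: use the DataFrame reader, then drop down to an RDD of Row objects
df = spark.read.json("test.json", multiLine=True)
rdd_rows = df.rdd

# Option 2: read each file whole and parse it yourself into an RDD of dicts
rdd_dicts = sc.wholeTextFiles("test.json").values().map(json.loads)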

Python Regex: How to match the string and then modify that string by adding something at the end

UPDATED CODE: It is working, but now the problem is that the code attaches the same random_value to every Path.
Following is my code with a sample chunk of text. I want to read each Path and its value, then append a unique random letter-and-digit combination to the end of every Path value without changing the value that is already there. For example, I want the Path to become
"Path": "already existing value/1A" or something like that.
I am unable to work out the exact regex pattern for the replacement.
Any help would be appreciated.
It could be done with a JSON parser, but the requirement of the task is to do it via regex.
from io import StringIO
import re
import string
import random
reader = StringIO("""{
    "Bounds": [
        {
            "HasClip": true,
            "Lang": "no",
            "Page": 0,
            "Path": "//Document/Sect[2]/Aside/P",
            "Text": "Potsdam, den 9. Juni 2021 ",
            "TextSize": 12.0
        }
    ],
},
{
    "Bounds": [
        {
            "HasClip": true,
            "Lang": "de",
            "Page": 0,
            "Path": "//Document/Sect[3]/P[4]",
            "Text": "this is some text ",
            "TextSize": 9.0,
        }
    ],
}""")
def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/'+random_value+'"'
text = re.sub(pattern, replacement, text)
# This works, but it attaches the same random_value to every Path
print(text)
Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
The replacement regex \1 puts back what was captured in group 1 of the match via (.*?).
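To also answer the updated problem (a different random value per Path): re.sub accepts a function as the replacement, and that function is called once per match. A minimal sketch reusing text and id_generator from the question:
import re

# The lambda runs once per match, so every Path gets its own fresh suffix.
text = re.sub(r'"Path": "(.*?)"',
              lambda m: '"Path": "%s/%s"' % (m.group(1), id_generator()),
              text)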
Since you already have a JSON structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.
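For example, assuming the chunk were valid JSON (the sample above has trailing commas and two top-level objects, which json.loads would reject), the loop-and-edit approach could look like this sketch, reusing text and id_generator from the question:
import json

data = json.loads(text)                 # parse the whole document
for bound in data["Bounds"]:            # structure taken from the sample above
    bound["Path"] = bound["Path"] + "/" + id_generator()
text = json.dumps(data, indent=2)       # export back to JSON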

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a DataFrame like the following:
ddf.compute()
   id      owner   pet_id
0   1  "Charlie"  "pet_1"
1   2  "Charlie"  "pet_2"
3   4    "Buddy"  "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like:
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PS:
I also tried doing it in the following way
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask interprets the data wrongly:
ddf.compute()
       id                   owner              pet_id
0  [1, 2]  ["Charlie", "Charlie"]  ["pet_1", "pet_2"]
1     [4]               ["Buddy"]           ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
             blocksize=None, orient="records",
             lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
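Putting the pieces together, a sketch (assuming the files live under mypets/ as in the question):
import dask.dataframe as dd
import pandas as pd

# An empty frame that describes the expected schema
meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"})

ddf = dd.read_json("mypets/*.json",
                   meta=meta,
                   blocksize=None,    # one whole JSON document per file
                   orient="records",  # each file holds a list of objects
                   lines=False)       # not newline-delimited JSON
print(ddf.compute())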

Cleaning of JSON Objects using Spark

I have been trying to clean my JSON file. I used an RDD to read the JSON file and then tried to clean it with replace calls, but I am still not getting correct JSON because of the escape sequences present in the JSON values.
Here is the code with which I am trying to clean the JSON file of various disturbances.
The cleaned JSON shows errors. Please review and point out the issue.
val readjson = sparkSession
  .sparkContext.textFile("dev.json")
val json = readjson.map(element => element
    .replace("\"\":\"\"", "\":\"")
    .replace("\"\",\"\"", "\",\"")
    .replace("\"\":", "\":")
    .replace(",\"\"", ",\"")
    .replace("\"{\"\"", "{\"")
    .replace("\"\"}\"", "\"}"))
  .saveAsTextFile("JSON")
Here is my JSON file:
"{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":"""",""GENDR_CD"":"""",""PROV_NPI"":"""",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":"""",""PROFPROFL_DESGTN"":"""",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"
I tried cleaning the above JSON and got the following result:
{
    "SEQ_NO": 596514,
    "PROV_DEMOG_SK": 596514,
    "PROV_ID": "QMP000003370581",
    "FRST_NM": "",
    "LAST_NM": "RICHARD WHITTINGTON BUTCHER",
    "FUL_NM": "",
    "GENDR_CD": "",
    "PROV_NPI": "",
    "PROV_STAT": "Incomplete",
    "PROV_TY": "03",
    "DT_OF_BRTH": "",
    "PROFPROFL_DESGTN": "",
    "ETL_LAST_UPDT_DT_TM": "2020-04-28 11:43:31.000000",
    "PROV_CLSFTN_CD": "A",
    "SRC_DATA_KEY": 50,
    "OPRN_CD": "I",
    "REC_SET": "F"
}
Online JSON validators show that it is still incorrect.
It looks like your JSON has one or a few control characters (\u0009, i.e. tab); try replacing them with
.replaceAll("\\u0009", " ")
You can do it in the below sequence:
val replacedVal = """{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":\"\"\"",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":\"\"\"",""GENDR_CD"":\"\"\"",""PROV_NPI"":\"\"\"",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":\"\"\"",""PROFPROFL_DESGTN"":\"\"\"",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"""
.replace("""\"""",""""""")
.replace("""""""",""""""")
.replaceAll("\\u0009"," ")
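Looking at the input, the record appears to be a CSV-quoted JSON string: the whole object is wrapped in outer quotes and every inner quote is doubled. Undoing that quoting directly is simpler than a chain of targeted replaces; a toy sketch in Python, with a shortened record:
import json

# A CSV-quoted JSON record: outer quotes, inner quotes doubled (shortened)
raw = '"{""SEQ_NO"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":""""}"'

# Drop the outer quotes, then collapse the doubled quotes
cleaned = raw[1:-1].replace('""', '"')

record = json.loads(cleaned)
print(record["SEQ_NO"])  # 596514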

Importing JSON file into Firebase error

I keep getting an error when uploading/importing my JSON file into Firebase. I initially had an Excel spreadsheet that I saved as a CSV file, then I used a CSV-to-JSON converter.
I validated the JSON file (which has the .json extension) with a couple of online tools.
Still, I'm getting an error.
Here is an example of my JSON:
{
    "Rk": 1,
    "Tm": "SEA",
    "H/A": "H",
    "DOW": "Sun",
    "Opp": "CLE",
    "QB": "Russell Wilson",
    "Grade": "BLUE",
    "Def mu pts": 4,
    "Inj status": 0,
    "Notes": "Got to wonder if not having a proven power RB under center will negatively impact Wilson's production.",
    "TFS $50K": "$8,300",
    "Init sal": "$8,300",
    "Var": "$0",
    "WC": 0
}
The issue is your keys.
Firebase keys must be UTF-8 encoded and cannot contain . $ # [ ] / or ASCII control characters (0-31 or 127).
Your "TFS $50K" and "H/A" keys are the issues.
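A quick way to pre-clean the export before importing is to rewrite the offending keys; a sketch (the file names and the '_' replacement character are placeholders):
import json
import re

# Characters Firebase forbids in keys: . $ # [ ] / and ASCII control chars
INVALID = re.compile(r'[.$#\[\]/\x00-\x1f\x7f]')

def sanitize_keys(obj):
    # Recursively replace forbidden characters in dict keys with '_'
    if isinstance(obj, dict):
        return {INVALID.sub('_', k): sanitize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(v) for v in obj]
    return obj

with open('data.json') as f:
    data = json.load(f)
with open('data_clean.json', 'w') as f:
    json.dump(sanitize_keys(data), f, indent=2)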

Create fixtures with custom manager methods, json dumps, and ways to avoid TypeError: xxx is not JSON serializable

I'm trying to create a test fixture using custom manager methods, as my app uses a subset of db tables and fewer records, so I dropped the idea of using initial_data. In the manager I'm doing something like this, in Managers.py:
sitedict = Site.objects.filter(pk=1234).values()[0]
custdict = Customer.objects.filter(custid=123456).values()[0]
customer = {"pk":123456,"model":"myapp.customer","fields":custdict}
site = {"pk":0001,"model":"myapp.site","fields":sitedict}
csvfile = open('shoppingcart/bsofttestdata.csv','wb')
csv_writer = csv.writer(csvfile)
csv_writer.writerow([customer,site])
Then I modified my CSV file to replace single quotes with double quotes, etc., and saved that file as JSON. Sorry if it's too dumb a way, but this is the first time I'm creating test data; I'd love to learn a better way. Sample data from the file myapp/fixtures/testdata.json looks like this:
[{"pk": 123456, "model": "myapp.customer", "fields": {"city": "abc", "maritalstatus": None, "zipcode": "12345", "lname": "fdfdf", "state": "AZ", "agentid": 1111, "fname": "sdadsad", "email": "abcd#xxx.com", "phone": "0000000000", "custid":123456,"datecreate": datetime.datetime(2011, 3, 29, 11, 40, 18, 157612)}},{"pk":0001, "model": "myapp.site", "fields": {"url": "http://google.com", "websitecode": "", "notify": True, "fee": 210.0, "id":0001}}]
I used this to run my tests, but I got the following error:
Problem installing fixture '/var/lib/django/myproject/myapp/fixtures/testdata.json':
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/django/core/management/commands/loaddata.py", line 150, in handle
    for obj in objects:
  File "/usr/lib/pymodules/python2.6/django/core/serializers/json.py", line 41, in Deserializer
    for obj in PythonDeserializer(simplejson.load(stream)):
  File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib/pymodules/python2.6/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 335, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/pymodules/python2.6/simplejson/decoder.py", line 353, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Instead of using raw find-and-replace, it is better to encode the datatypes that JSON doesn't support explicitly; that is what gets rid of TypeError: xxxxxxx is not JSON serializable, and specifically of the datetime problem.
EDIT:
Instead of writing to CSV and then manually modifying it, I did the following:
with open('myapp/fixtures/customer_testdata.json', mode='w') as f:
    json.dump(customer, f, indent=2)
Here is the small piece of code I used to get out of the TypeError: xxxx is not JSON serializable problem:
import datetime
import time

def encode_datetimes(cust):
    for key in cust.keys():
        value = cust[key]
        if isinstance(value, datetime.datetime):
            # datetime.datetime -> time.struct_time -> readable string
            temp = value.timetuple()
            cust.update({key: {'__class__': 'time.asctime',
                               '__value__': time.asctime(temp)}})
    return cust
If we convert datetime.datetime to some other type instead, then we have to change the __class__ entry accordingly, e.g. timestamp -> float.
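A shorter route for the datetime problem is to give json.dump a default callable, which is invoked for any object the encoder can't serialize natively (a sketch, reusing customer from the EDIT above; Django's DjangoJSONEncoder from django.core.serializers.json does much the same for datetimes and Decimals):
import datetime
import json

def default(obj):
    # Only called for objects json can't handle natively
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("%r is not JSON serializable" % obj)

with open('myapp/fixtures/customer_testdata.json', 'w') as f:
    json.dump(customer, f, indent=2, default=default)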
Hope this is helpful.