How to remove unwanted delimiter in json using python - json

import json
jsonValue={'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}']}
with open("F://IDP Umesh//Data Transformation//test.json", 'w') as jsonFile:
    jsonFile.write(json.dumps(jsonValue))
Output from test.json:
{"Employee": ["{\"userId\":\"rirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Romin\",\"lastName\":\"Irani\",\"preferredFullName\":\"Romin Irani\",\"employeeCode\":\"E1\",\"region\":\"CA\",\"phoneNumber\":\"408-1234567\",\"emailAddress\":\"romin.k.irani#gmail.com\"}", "{\"userId\":\"nirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Neil\",\"lastName\":\"Irani\",\"preferredFullName\":\"Neil Irani\",\"employeeCode\":\"E2\",\"region\":\"CA\",\"phoneNumber\":\"408-1111111\",\"emailAddress\":\"neilrirani#gmail.com\"}", "{\"userId\":\"thanks\",\"jobTitleName\":\"Program Directory\",\"firstName\":\"Tom\",\"lastName\":\"Hanks\",\"preferredFullName\":\"Tom Hanks\",\"employeeCode\":\"E3\",\"region\":\"CA\",\"phoneNumber\":\"408-2222222\",\"emailAddress\":\"tomhanks#gmail.com\"}"]}
How can I remove the '\' from the JSON content and produce valid JSON?
I would appreciate any help with this.
Thanks

Try this.
import json
jsonValue={'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}']}
# Decode each JSON string in the list into a dict before serializing.
jsonValue['Employee'] = [json.loads(i) for i in jsonValue['Employee']]
print(jsonValue)
with open("test.json", 'w') as jsonFile:
    jsonFile.write(json.dumps(jsonValue))
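The problem with your code is that each element of the Employee list is already a JSON-formatted string, so json.dumps escapes the quotes inside it. dumps is for converting a dict (or list) to a JSON-formatted string, so decode the inner strings with json.loads first.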
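A minimal sketch of the difference, using a shortened, hypothetical record rather than the full data above: while the list elements are strings, json.dumps has to escape their inner quotes; once each element is decoded with json.loads, the same call emits nested objects.

```python
import json

# Hypothetical, shortened record; each list element is a JSON *string*.
raw = {"Employee": ['{"userId":"rirani","region":"CA"}']}

# Serializing as-is escapes the inner quotes, which produces the backslashes.
escaped = json.dumps(raw)

# Decode every element first so the list holds dicts instead of strings.
decoded = {"Employee": [json.loads(item) for item in raw["Employee"]]}
clean = json.dumps(decoded)

print(escaped)
print(clean)
```

The first printed line contains backslash-escaped quotes; the second contains none.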

Related

Spark can't get delimiter for CSV file

I have a CSV file that pandas reads like this:
[image: CSV read by pandas]
But when I read it with PySpark, it turns out like this:
[image: CSV read by PySpark]
What's wrong with the delimiter in Spark and how can I fix it?
From the posted images, %2C, which is the URL-encoded equivalent of a comma (,), seems to be your delimiter.
Set the delimiter to %2C and also use the header option:
df = spark.read.option("header",True).option("delimiter", "%2C").csv(path)
Input CSV File:
date%2Copening%2Chigh%2Clow%2Cclose%2Cadjclose%2Cvolume
2022-12-09%2C100%2C101%2C99%2C99.5%2C99.5%2C10000000
2022-12-09%2C200%2C202%2C199%2C199%2C199.1%2C20000000
2022-12-09%2C300%2C303%2C299%2C299%2C299.2%2C30000000
Output dataframe:
+----------+-------+----+---+-----+--------+--------+
|date      |opening|high|low|close|adjclose|volume  |
+----------+-------+----+---+-----+--------+--------+
|2022-12-09|100    |101 |99 |99.5 |99.5    |10000000|
|2022-12-09|200    |202 |199|199  |199.1   |20000000|
|2022-12-09|300    |303 |299|299  |299.2   |30000000|
+----------+-------+----+---+-----+--------+--------+
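To check the delimiter guess without Spark, a small standard-library sketch can confirm that %2C decodes to a comma and that the sample rows split cleanly on it. The two lines are copied from the input file above; nothing here is a PySpark API, and the variable names are illustrative.

```python
from urllib.parse import unquote

# Sample rows copied from the input file shown above.
lines = [
    "date%2Copening%2Chigh%2Clow%2Cclose%2Cadjclose%2Cvolume",
    "2022-12-09%2C100%2C101%2C99%2C99.5%2C99.5%2C10000000",
]

# %2C is the percent-encoded form of a comma.
delimiter = unquote("%2C")

# Splitting on the literal "%2C" mirrors what the delimiter option does.
header, *rows = [line.split("%2C") for line in lines]
records = [dict(zip(header, row)) for row in rows]

print(repr(delimiter))
print(records)
```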

python 3 - how to clean json string with double backslashes and u00

I have several ugly json strings like the following:
test_string = '{\\"test_key\\": \\"Testing tilde \\u00E1\\u00F3\\u00ED\\"}'
that I need to transform into a more visually friendly dictionary and then save to a file:
{'test_key': 'Testing tilde áóí'}
So for that I am doing:
test_string = test_string.replace("\\\"", "\"")  # I suppose there is a safer way to do this
print(test_string)
#{"test_key": "Testing tilde \u00E1\u00F3\u00ED"}
test_dict = json.loads(test_string, strict=False)
print(test_dict)
#{'test_key': 'Testing tilde áóí'}
At this point test_dict seems correct. Then I save it to a file:
with open('test.json', "w") as json_w_file:
    json.dump(test_dict, json_w_file)
At this point the content of test.json is the ugly version of the json:
{"test_key": "Testing tilde \u00E1\u00F3\u00ED"}
Is there a safer way to transform my ugly json to a dictionary?
Then how could I save the visually friendly version of my dictionary to a file?
Python 3
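The string looks like double-encoded JSON to me. This decodes it and writes a UTF-8 JSON file.
import json
test_string = '{\\"test_key\\": \\"Testing tilde \\u00E1\\u00F3\\u00ED\\"}'
# Wrap the text in quotes so the first loads() undoes the outer string encoding;
# the second loads() parses the resulting JSON object.
test_dict = json.loads(json.loads(f'"{test_string}"'))
with open('test.json', "w", encoding="utf-8") as json_w_file:
    json.dump(test_dict, json_w_file, ensure_ascii=False)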
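A sketch of why ensure_ascii matters here, using json.dumps so nothing is written to disk: by default the json module writes non-ASCII characters as \uXXXX escapes, while ensure_ascii=False keeps them as they are. Both forms decode to the same dictionary.

```python
import json

test_dict = {"test_key": "Testing tilde áóí"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(test_dict)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(test_dict, ensure_ascii=False)

print(escaped)
# Either form parses back to the same data.
print(json.loads(escaped) == json.loads(readable))
```

When writing the readable form to a file, passing an explicit encoding such as encoding="utf-8" to open() avoids depending on the platform default.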

Add a new line in front of each line before writing to JSON format using Spark in Scala

I'd like to add one new line in front of each of my JSON documents before Spark writes them into my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select LAST_MODIFIED_BY, LAST_MODIFIED_DATE, NVL(CLASS_NAME, className) as CLASS_NAME, DECISION, TASK_TYPE_ID from ParquetTable")
parkSQL.show(false)
parkSQL.count()
parkSQL.write.json("s3://test-bucket/json-output-7/")
With only this command, it produces files with the contents below:
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
But what I'd like to achieve is something like this:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
Any insight on how to achieve this result would be greatly appreciated!
The code below puts {"index":{}} in one column and the row converted to JSON in another, joins the two with a newline, and saves the result using the text format.
import org.apache.spark.sql.functions.{concat_ws, lit, struct, to_json}
import spark.implicits._ // enables the $"column" syntax
df
  .select(
    lit("""{"index":{}}""").as("index"),
    to_json(struct($"*")).as("json_data")
  )
  .select(
    concat_ws(
      "\n", // This will split index column & other column data into two lines.
      $"index",
      $"json_data"
    ).as("data")
  )
  .write
  .format("text") // This is required.
  .save("s3://test-bucket/json-output-7/")
Final output:
cat part-00000-24619b28-6501-4763-b3de-1a2f72a5a4ec-c000.txt
{"index":{}}
{"CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
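The same interleaving can be sketched outside Spark in plain Python, which may help when checking the expected file layout on a small sample. The records below are shortened, hypothetical rows, and with_action_lines is an illustrative helper, not part of any library.

```python
import json

def with_action_lines(records, action='{"index":{}}'):
    """Return text where each JSON document is preceded by the action line."""
    lines = []
    for record in records:
        lines.append(action)
        # Compact separators match the no-space style of the output above.
        lines.append(json.dumps(record, separators=(",", ":")))
    return "\n".join(lines)

# Hypothetical, shortened rows for illustration.
sample = [
    {"LAST_MODIFIED_BY": "david", "DECISION": "AGREE"},
    {"LAST_MODIFIED_BY": "sarah", "DECISION": "DISAGREE"},
]
payload = with_action_lines(sample)
print(payload)
```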

Formatting json file in Python

import json
counter={"a":1,"b":2}
with open('egg.json' , 'w') as json_file:
    json.dump(counter, json_file)
So when I review my json file, it shows this:
{a:1 , b:2}
But I need it to be something like this:
[ [a:1], [b:2] ]
I've already tried adding
json.dump(counter, json_file, separator (' [ ', ' ] ')
But nothing will do the trick...
Is there a way to format the json file like the way you can format a CSV file?
I'd really like to know..... Thanks.
[a:1], [b:2] isn't valid json, so using the json module won't help you here.
If for some reason you want a formatted string output, you could instead do the following (don't call the file egg.json since it won't be valid json!):
counter = {'a':1, 'b':2}
output = []
# Build one "[key:value]" fragment per entry, in sorted key order.
for k, v in sorted(counter.items()):
    output.append('[{}:{}]'.format(k, v))
with open('egg.txt', 'w') as txt_f:
    txt_f.write(', '.join(output))
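If the goal is one bracketed entry per key but the file should stay valid JSON, one option (an assumption about what would be acceptable, not what the question literally asks for) is a list of single-key objects:

```python
import json

counter = {"a": 1, "b": 2}

# One single-key object per entry, in sorted key order.
pairs = [{key: value} for key, value in sorted(counter.items())]

text = json.dumps(pairs)
print(text)
```

This output can be read back with json.loads, unlike the hand-formatted text version.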

Get a JSON text

I have this JSON text:
data = {"one":"number","two":"string","three":"number","four":[{"five":"number","six":"string"},{"five":"number","six":"string"}]}
How can I get the number for "five" and the string for "six" using Python 3.3 and the json module?
P.S.: If I do print(data['five']) it doesn't work and fails with this error:
print(data['five'])
KeyError: 'five'
Thanks,
Marco
Try this:
data = {"one":"number","two":"string","three":"number","four":[{"five":"number","six":"string"},{"five":"number","six":"string"}]}
print(data['four'][0]['five']) # number
print(data['four'][0]['six']) # string
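The lookup above reads only the first element of the list. A sketch for collecting the values from every element under "four", assuming each element has both keys:

```python
data = {"one": "number", "two": "string", "three": "number",
        "four": [{"five": "number", "six": "string"},
                 {"five": "number", "six": "string"}]}

# "five" and "six" live inside each dict of the list stored under "four",
# which is why data['five'] raises KeyError at the top level.
pairs = [(item["five"], item["six"]) for item in data["four"]]
for five, six in pairs:
    print(five, six)
```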