Formatting a JSON file in Python

import json

counter = {"a": 1, "b": 2}
with open('egg.json', 'w') as json_file:
    json.dump(counter, json_file)
So when I review my json file, it shows this:
{"a": 1, "b": 2}
But I need it to be something like this:
[ [a:1], [b:2] ]
I've already tried adding
json.dump(counter, json_file, separators=(' [ ', ' ] '))
But nothing does the trick...
Is there a way to format a JSON file the way you can format a CSV file?
Thanks.

[a:1], [b:2] isn't valid json, so using the json module won't help you here.
If for some reason you want a formatted string output, you could instead do the following (don't call the file egg.json since it won't be valid json!):
counter = {'a': 1, 'b': 2}
output = []
for k, v in sorted(counter.items()):
    output.append('[{}:{}]'.format(k, v))
with open('egg.txt', 'w') as txt_f:
    txt_f.write(', '.join(output))
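If you do want output that is still valid JSON and close to the shape you describe, a minimal sketch (an addition for illustration, not part of the original answer) is to dump a list of [key, value] pairs instead:

import json

counter = {'a': 1, 'b': 2}
# A list of pairs is valid JSON, unlike [a:1], [b:2].
with open('egg.json', 'w') as json_file:
    json.dump(sorted(counter.items()), json_file)
# egg.json now contains: [["a", 1], ["b", 2]]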

Related

Add a new line in front of each line before writing to JSON format using Spark in Scala

I'd like to add one new line in front of each of my JSON documents before Spark writes them into my S3 bucket:
df.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select LAST_MODIFIED_BY, LAST_MODIFIED_DATE, NVL(CLASS_NAME, className) as CLASS_NAME, DECISION, TASK_TYPE_ID from ParquetTable")
parkSQL.show(false)
parkSQL.count()
parkSQL.write.json("s3://test-bucket/json-output-7/")
With only this command, it produces files with the contents below:
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
but, what I'd like to achieve is something like below:
{"index":{}}
{"LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
Any insight on how to achieve this result would be greatly appreciated!
The code below concatenates {"index":{}} with the existing row data in the DataFrame: it converts each row to JSON and then saves the JSON data using the text format.
df
  .select(
    lit("""{"index":{}}""").as("index"),
    to_json(struct($"*")).as("json_data")
  )
  .select(
    concat_ws(
      "\n", // This splits the index column and the row data into two lines.
      $"index",
      $"json_data"
    ).as("data")
  )
  .write
  .format("text") // This is required.
  .save("s3://test-bucket/json-output-7/")
Final Output
cat part-00000-24619b28-6501-4763-b3de-1a2f72a5a4ec-c000.txt
{"index":{}}
{"CLASS_NAME":"/SC/Trade/HTS_CA/1234abcd","DECISION":"AGREE","LAST_MODIFIED_BY":"david","LAST_MODIFIED_DATE":"2018-06-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}
{"index":{}}
{"CLASS_NAME":"/SC/Import/HTS_US/9876abcd","DECISION":"DISAGREE","LAST_MODIFIED_BY":"sarah","LAST_MODIFIED_DATE":"2018-08-26 12:02:03.0","TASK_TYPE_ID":"abcd1234-832b-43b6-afa6-361253ffe1d5"}

Is there a way to programmatically set a dataset's schema from a .csv

As an example, I have a .csv in the Excel dialect, which escapes quotes by doubling them, like Python's csv module's doublequote option.
For example, consider the row below:
"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50
I would want the schema to then become:
[
    'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
    'JJJJ "MMMM", RRRR "TTTT"',
    1234,
    'RRRR',
    60,
    50
]
Is there a way to set the schema of a dataset in a programmatic/automated fashion?
While you can do this in code, Foundry's dataset app can also do it natively. This means you can skip writing the code (which is nice), and it can also save a step in your pipeline (which might save you runtime).
After uploading the files to a dataset, press "edit schema" on the dataset, then apply the parsing settings (header, quote, and escape) that match your file, which would produce the desired outcome in your case. Then press "save and validate", and the dataset should end up with the correct schema.
Starting with this example:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
Add the header, quote, and escape options, like so:
Dataset<Row> dataset = files
    .sparkSession()
    .read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(csvDataset);
output.getDataFrameWriter(dataset).write();
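Outside Foundry, the same quote and escape handling works with a plain Spark reader. A hedged PySpark sketch (the input path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read
    .option("inferSchema", "true")
    .option("header", "true")
    .option("quote", '"')   # fields are wrapped in double quotes
    .option("escape", '"')  # an embedded quote is doubled ("" -> ") inside a field
    .csv("path/to/input.csv"))  # placeholder path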

How to remove unwanted delimiters in JSON using Python

jsonValue="{'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}']}
"
with open("F://IDP Umesh//Data Transformation//test.json", 'w') as jsonFile:
jsonFile.write(json.dumps(jsonValue))
Output from test.json:
{"Employee": ["{\"userId\":\"rirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Romin\",\"lastName\":\"Irani\",\"preferredFullName\":\"Romin Irani\",\"employeeCode\":\"E1\",\"region\":\"CA\",\"phoneNumber\":\"408-1234567\",\"emailAddress\":\"romin.k.irani#gmail.com\"}", "{\"userId\":\"nirani\",\"jobTitleName\":\"Developer\",\"firstName\":\"Neil\",\"lastName\":\"Irani\",\"preferredFullName\":\"Neil Irani\",\"employeeCode\":\"E2\",\"region\":\"CA\",\"phoneNumber\":\"408-1111111\",\"emailAddress\":\"neilrirani#gmail.com\"}", "{\"userId\":\"thanks\",\"jobTitleName\":\"Program Directory\",\"firstName\":\"Tom\",\"lastName\":\"Hanks\",\"preferredFullName\":\"Tom Hanks\",\"employeeCode\":\"E3\",\"region\":\"CA\",\"phoneNumber\":\"408-2222222\",\"emailAddress\":\"tomhanks#gmail.com\"}"]}
How can I remove the '\' characters from the JSON content and make it valid JSON?
I'd appreciate any help on this. Thanks.
Try this.
import json

jsonValue = {'Employee': ['{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani@gmail.com"}', '{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani@gmail.com"}', '{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks@gmail.com"}']}
jsonValue['Employee'] = [json.loads(i) for i in jsonValue['Employee']]
print(jsonValue)
with open("test.json", 'w') as jsonFile:
    jsonFile.write(json.dumps(jsonValue))
The problem with your code is that you're dumping a string that is already formatted as JSON; dumps is for converting a dict to a JSON-formatted string.
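A minimal sketch of what is going on (an addition for illustration, not part of the original answer): dumping an already-serialized string escapes its quotes a second time, which is where the backslashes come from.

import json

s = json.dumps({"userId": "rirani"})  # s is already a JSON-formatted string
print(json.dumps(s))                  # "{\"userId\": \"rirani\"}"  <- stray backslashes
print(json.dumps(json.loads(s)))      # {"userId": "rirani"}        <- parse first, then dump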

Julia - Rewriting a CSV

Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source("/dir/in.csv")
out_file = CSV.Sink("/dir/out.csv")
for line in CSV.eachline(in_file)
    replace!(line, "None", "")
    CSV.writeline(out_file, line)
end
This is pseudocode; those aren't existing functions.
Idiomatically, should I iterate on 1:CSV.countlines(in_file)? Do a while and check something?
If all you want to do is replace a string in the line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w+")
for line in readlines(infile)
    newline = replace(line, "a", "b")
    write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the CSV field by field, use the readcsv function in Base:
data = readcsv(infile)
typeof(data)  # Array{Any,2}
This will return the data in the file as a 2 dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data, 1)  # iterate over rows
    data[i, 1] = "This is " * data[i, 1]  # add text to the first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv

Select (ignore if it does not exist) for JSON logs in Spark SQL

I am new to Apache Spark and trying out a few POCs around it. I am trying to read JSON logs which are structured, but a few fields are not always guaranteed; for example:
{
    "item": "A",
    "customerId": 123,
    "hasCustomerId": true,
    ...
},
{
    "item": "B",
    "hasCustomerId": false,
    ...
}
Assume I want to transform these JSON logs into CSV. I was trying out Spark SQL to get hold of all the fields with simple select statements, but since the second record is missing a field (although it does have an identifier), I am not sure how I can handle this.
I want to transform the above JSON logs into:
item, customerId, ....
A , 123 , ....
B , null/0 , ....
You should use SQLContext to read the JSON file: sqlContext.read.json("file/path"). But if you want to convert it into CSV first and then read it back with missing values, your CSV file should look like:
item,customerId,hasCustomerId, ....
A,123,, .... // hasCustomerId is null
B,,888, .... // customerId is null
i.e. an empty field for the missing value. Then you have to read it like this:
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")      // Use the first line of all files as the header
    .option("inferSchema", "true") // Automatically infer data types
    .load("file/path")