My problem is that I cannot see the data I pulled from the MongoDB database as separate columns. The data comes back as a dictionary, and when I try to read it with pandas, the nested sub-dictionary is returned as a single value.
import pandas

dic = {
    "value1": "a",
    "value2": {
        "subvalue1": "sub-a",
        "subvalue2": "sub-b"
    },
    "value3": "c"
}

df = pandas.DataFrame(dic)
df = pandas.DataFrame(list(dic.items()), columns=["value1", "subvalue1"])
print(df)
When I run the code, the output I get is as follows:
   value1                                      subvalue1
0  value1                                              a
1  value2  {'subvalue1': 'sub-a', 'subvalue2': 'sub-b'}
2  value3                                              c
What I want is to produce output whose columns are the values in the "columns" list, by writing code like the one below.
import pandas

dic = {
    "value1": "a",
    "value2": {
        "subvalue1": "sub-a",
        "subvalue2": "sub-b"
    },
    "value3": "c"
}

df = pandas.DataFrame(dic)
df = pandas.DataFrame(list(dic.items()), columns=["value1", "subvalue1", "subvalue2", "value3"])
print(df)
The sample output I want:

  value1 subvalue1 subvalue2 value3
0      a     sub-a     sub-b      c
How can I do this?
Thank you all.
You can do that by flattening the dictionary like this:
def flatten_dict(d):
    flattened = {}
    for key, val in d.items():
        if isinstance(val, dict):
            # recurse into nested dictionaries, keeping only the leaf keys
            flattened.update(flatten_dict(val))
        else:
            flattened[key] = val
    return flattened
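For example, applied to the dictionary from the question (a quick sketch; the flattened dict is wrapped in a list so pandas builds a single row):

import pandas

dic = {
    "value1": "a",
    "value2": {"subvalue1": "sub-a", "subvalue2": "sub-b"},
    "value3": "c"
}

flattened = flatten_dict(dic)
# {'value1': 'a', 'subvalue1': 'sub-a', 'subvalue2': 'sub-b', 'value3': 'c'}

df = pandas.DataFrame([flattened])
print(df)
#   value1 subvalue1 subvalue2 value3
# 0      a     sub-a     sub-b      c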
I don't recommend this way, however. If you happen to have a dictionary of the form {"a": {"same_key_name": "value_a"}, "b": {"same_key_name": "value_b"}}, then the flattened dictionary would be {'same_key_name': 'value_b'}, because the second value silently overwrites the first.
A safer and more canonical way to do it is to flatten the dictionary but keep the key names in a concatenated form:
df = pandas.json_normalize(dic, sep="_")
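With the sample dictionary from the question this produces a single row, with the nested keys prefixed by their parent key (column order may vary slightly between pandas versions):

print(df)
#   value1 value3 value2_subvalue1 value2_subvalue2
# 0      a      c            sub-a            sub-b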
I am trying to write a small Python script that parses JSON files. I need to include multiple variables in the code, but I'm currently stuck since the f-string does not seem to work as I expected. Here is an example:
import json
test = 10
json_data = f'[{"ID": {test},"Name":"Pankaj","Role":"CEO"}]'
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent=2)
print(json_formatted_str)
The above code returns an error:
json_data = f'[{"ID": { {test} },"Name":"Pankaj","Role":"CEO"}]'
ValueError: Invalid format specifier
Could you please let me know how I can add variables to the JSON?
Thank you.
You can escape the literal braces by doubling them, i.e. put an extra { and } in your string:
import json
test = 10
json_data = f'[{{"ID": {test},"Name":"Pankaj","Role":"CEO"}}]'
json_object = json.loads(json_data)
json_formatted_str = json.dumps(json_object, indent=2)
print(json_formatted_str)
Prints:
[
  {
    "ID": 10,
    "Name": "Pankaj",
    "Role": "CEO"
  }
]
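An alternative that avoids brace escaping entirely is to build the structure as Python objects and let json.dumps produce the JSON string; a minimal sketch:

import json

test = 10

# build the data as plain Python objects, then serialize
data = [{"ID": test, "Name": "Pankaj", "Role": "CEO"}]
print(json.dumps(data, indent=2))  # prints the same output as above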
I'm traversing a directory tree, which contains directories and files. I know I could use os.walk for this, but this is just an example of what I'm doing, and the end result has to be recursive.
The function to get the data out is below:
import os

def walkfn(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            print(name)
            walkfn(path)
        elif os.path.isfile(path):
            print(name)
Assuming we had a directory structure such as this:
testDir/
    a/
        1/
        2/
            testa2.txt
        testa.txt
    b/
        3/
            testb3.txt
        4/
The code above would return the following:
a
testa.txt
1
2
testa2.txt
b
4
3
testb3.txt
It's doing what I would expect at this point, and the values are all correct, but I'm trying to get this data into a JSON object. I've seen that I can add these into nested dictionaries, and then convert it to JSON, but I've failed miserably at getting them into nested dictionaries using this recursive method.
The JSON I'm expecting out would be something like:
{
  "testDir": {
    "b": {
      "4": {},
      "3": {
        "testb3.txt": null
      }
    },
    "a": {
      "testa.txt": null,
      "1": {},
      "2": {
        "testa2.txt": null
      }
    }
  }
}
You should pass json_data through your recursive function:
import os
from pprint import pprint
from typing import Dict

def walkfn(dirname: str, json_data: Dict = None):
    if not json_data:
        json_data = dict()
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            # directories become nested dicts, filled in by the recursive call
            json_data[name] = dict()
            json_data[name] = walkfn(path, json_data=json_data[name])
        elif os.path.isfile(path):
            # files become keys with a None value
            json_data.update({name: None})
    return json_data

json_data = walkfn(dirname="your_dir_name")
pprint(json_data)
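Since the goal is a JSON object, the returned dictionary can then be serialized with the standard library; files appear as null values, matching the expected output:

import json

print(json.dumps(json_data, indent=2))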
Input CSV data:

userid, Code, Status
1234, 1, final
1287, 2, notfinal
# Applied PySpark script
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# create the Spark session
spark = SparkSession.builder.master("yarn").appName("csv-to-mongo") \
    .enableHiveSupport().config("spark.some.config.option", "some-value").getOrCreate()

# read the CSV data into a dataframe
df = spark.read.load("Book3.csv", format="csv", sep=",", inferSchema="true", header="true")

# define the schema for the JSON dataframe
newschema = StructType([
    StructField("userid", StringType()),
    StructField("report", StringType(), metadata={"maxlength": 6000})
])

jsondf = df.rdd.map(lambda row: (row[0], {"Code": row[1], "status": row[2]})) \
    .map(lambda row: (row[0], json.dumps(row[1]))) \
    .toDF(newschema)

jsondf.write.format("mongo").mode("append") \
    .option("uri", "mongodb://gcp.mongodb.net/") \
    .option("database", "dbname") \
    .option("collection", "testcollection").save()
Resulting Mongo data:
{
  "userid": "1234",
  "report": "{\"Code\": \"1\", \"status\": \"final\"}"
}
{
  "userid": "1287",
  "report": "{\"Code\": \"2\", \"status\": \"notfinal\"}"
}
In Mongo I get a complete JSON-encoded string in "report", which is not a surprise given that I declared the report field as StringType().
This effectively makes any search on nested fields in Mongo impossible, and the whole pipeline is then useless.
How can I make it a proper nested JSON document so that Mongo can search on nested fields as well?
When I try to change the field to properly structured JSON using the code below:

>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()

I get the error "AttributeError: json".
Please help with some code tips.
I am OK with using groupBy as well, but I am struggling with what to put in the aggregate functions, and I need a dataframe as the result so I can write it to Mongo.
The solution is to properly define the schema in PySpark ("df_schema") and then map your base df into a new df ("df_mongo"), making sure that the df.rdd.map call follows the pattern defined in df_schema.
df = spark.read.load("sourcelocation", format="csv", sep="|", inferSchema="true", header="true")

df_schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", StringType(), True)
])

df_mongo = df.rdd.map(lambda row: [row[15], row[12]]).toDF(df_schema)

df_mongo.write.format("mongo").mode("append").option("uri", mongodb_uri) \
    .option("database", dbname).option("collection", collection_name).save()
def json = '{"book": [{"id": "01","language": "Java","edition": "third","author": "Herbert Schildt"},{"id": "07","language": "C++","edition": "second","author": "E.Balagurusamy"}]}'
Using Groovy, how can I print the "id" values of the "book" array?
Output:
[01, 07]
Here is a working example using your input JSON:
import groovy.json.*
def json = '''{"book": [
    {"id": "01","language": "Java","edition": "third","author": "Herbert Schildt"},
    {"id": "07","language": "C++","edition": "second","author": "E.Balagurusamy"}
]}'''
def jsonObj = new JsonSlurper().parseText(json)
println jsonObj.book.id // returns the list of "id" values collected across the "book" array
Demo on the Groovy console: https://groovyconsole.appspot.com/script/5178866532352000
1. The input is a JSON file that contains multiple records. Example:

[
    {"user": "user1", "page": 1, "field": "some"},
    {"user": "user2", "page": 2, "field": "some2"},
    ...
]
2. I need to load each record from the file as a Document into a MongoDB collection.
Using Casbah to interact with Mongo, inserting the data may look like this:
def saveCollection(inputListOfDbObjects: List[DBObject]) = {
    val xs = inputListOfDbObjects
    xs foreach (obj => {
        Collection.save(obj)
    })
}
Question: what is the correct way (using Scala) to parse the JSON and get the data as a List[DBObject]?
Any help is appreciated.
You could use the parser combinator library in Scala.
Here's some code I found that does this for JSON: http://booksites.artima.com/programming_in_scala_2ed/examples/html/ch33.html#sec4
Step 1. Create a class named JSON that contains your parser rules:
import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
    def value: Parser[Any] = obj | arr |
        stringLiteral |
        floatingPointNumber |
        "null" | "true" | "false"
    def obj: Parser[Any] = "{" ~ repsep(member, ",") ~ "}"
    def arr: Parser[Any] = "[" ~ repsep(value, ",") ~ "]"
    def member: Parser[Any] = stringLiteral ~ ":" ~ value
}
Step 2. In your main function, read in your JSON file, passing the contents of the file to your parser.
import java.io.FileReader

object ParseJSON extends JSON {
    def main(args: Array[String]) {
        val reader = new FileReader(args(0))
        println(parseAll(value, reader))
    }
}