I have multiple JSON files that need to be converted into one CSV file.
These are the example JSON files:
tryout1.json
{
  "Product": {
    "one": "Desktop Computer",
    "two": "Tablet",
    "three": "Printer",
    "four": "Laptop"
  },
  "Price": {
    "five": 700,
    "six": 250,
    "seven": 100,
    "eight": 1200
  }
}
tryout2.json
{
  "Product": {
    "one": "dell xps tower",
    "two": "ipad",
    "three": "hp office jet",
    "four": "macbook"
  },
  "Price": {
    "five": 500,
    "six": 200,
    "seven": 50,
    "eight": 1000
  }
}
This is the Python code I wrote to convert those two JSON files:
import pandas as pd

df1 = pd.read_json('/home/mich/Documents/tryout1.json')
print(df1)

df2 = pd.read_json('/home/mich/Documents/tryout2.json')
print(df2)

df = pd.concat([df1, df2])
df.to_csv('/home/mich/Documents/tryout.csv', index=None)

result = pd.read_csv('/home/mich/Documents/tryout.csv')
print(result)
But I didn't get the result I need. How can I put the first JSON file in one column (for both product and price) and the second in the next column?
You can first create a combined column of products and prices, then concat them.
I am using axis=1 since I want them combined side by side (as columns); axis=0 would combine them by rows.
import pandas as pd

df1 = pd.read_json('/home/mich/Documents/tryout1.json')
# 'Product' is NaN on the price rows; fill those gaps with the 'Price' values
df1['product_price'] = df1['Product'].fillna(df1['Price'])

df2 = pd.read_json('/home/mich/Documents/tryout2.json')
df2['product_price'] = df2['Product'].fillna(df2['Price'])

# combine the two columns side by side
result = pd.concat([df1['product_price'], df2['product_price']], axis=1)
print(result)
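To get the single CSV file the question asks for, you can then write the combined frame out the same way as in the original script (a sketch, assuming the same output path):

result.to_csv('/home/mich/Documents/tryout.csv', index=False)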
I am trying to bring JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO, and there is a column called response. The response column has data that looks like this:
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Because this imports as AVRO, the Foundry documentation on converting this kind of data doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T


@transform_df(
    Output("json output"),
    json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):
    sqlContext = SQLContext(sc)
    source = json_raw.select('response').collect()  # noqa
    # Read the list into a data frame
    df = sqlContext.read.json(sc.parallelize(source))
    json_schema = T.StructType([
        T.StructField("id", T.StringType(), False),
        T.StructField("name", T.StringType(), False),
        T.StructField("custom", T.StringType(), False),
        T.StructField("orderable", T.StringType(), False),
        T.StructField("navigable", T.StringType(), False),
        T.StructField("searchable", T.StringType(), False),
        T.StructField("clauseNames", T.StringType(), False),
        T.StructField("schema", T.StringType(), False)
    ])
    udf_parse_json = udf(lambda str: parse_json(str), json_schema)
    df_new = df.select(udf_parse_json(df.response).alias("response"))
    return df_new


# Function to convert a JSON array string to a list
def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["a"], item["b"])
Parsing JSON in a string column into a struct column (and then into separate columns) can be done easily using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.select("response_parsed.*")
However, this won't work as-is because your schema is incorrect: you actually have a list of JSON structs in each row, not just one. You need a T.ArrayType(your_schema) wrapped around the whole thing, and you'll also need to do an F.explode before selecting, to get each array element into its own row.
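Put together, a minimal sketch of that flow (assuming json_schema is the per-object struct defined in the question):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# each row holds a JSON array, so wrap the per-object schema in an array
array_schema = T.ArrayType(json_schema)

df = df.withColumn("response_parsed", F.from_json("response", array_schema))
# one row per array element
df = df.withColumn("response_parsed", F.explode("response_parsed"))
# one column per struct field
df = df.select("response_parsed.*")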
An additional useful function is F.get_json_object, which allows you to extract a single JSON object from a JSON string.
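For example, a sketch against the response column above (the JSONPath is an assumption about the array layout):

import pyspark.sql.functions as F

# pull the 'name' field of the first element in the JSON array
df = df.withColumn("first_name", F.get_json_object("response", "$[0].name"))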
Using a UDF like you've done could work, but UDFs are generally much less performant than native Spark functions.
Additionally, all the AVRO file format does in this case is merge multiple JSON files into one big file, with each file in its own row. So the example under "Rest API Plugin" - "Processing JSON in Foundry" should work, as long as you skip the 'put this schema on the raw dataset' step.
I used the magritte-rest connector to walk through the paged results from search:
type: rest-source-adapter2
restCalls:
  - type: magritte-paging-inc-param-call
    method: GET
    path: search
    paramToIncrease: startAt
    increaseBy: 50
    initValue: 0
    parameters:
      startAt: '{%startAt%}'
    extractor:
      - type: json
        assign:
          issues: /issues
        allowNull: true
    condition:
      type: magritte-rest-non-empty-condition
      var: issues
    maxIterationsAllowed: 4096
cacheToDisk: false
oneFilePerResponse: false
That yielded a dataset with one raw JSON response per row. Once I had that, the code below expanded and parsed the returned JSON issues into a properly-typed DataFrame, with fields holding the inner structure of each issue as a (very complex) struct:
import json

from pyspark.sql import Row
from pyspark.sql.functions import explode


def issues_enumerated(All_Issues_Paged):
    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a Row holding each response's issue array as a single array record.
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe ('spark' is the active SparkSession, assumed in scope)
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
Input CSV data:
userid, Code, Status
1234, 1, final
1287, 2, notfinal
# Applied PySpark script
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create Spark session (appName requires a string; placeholder name added)
spark = SparkSession.builder.master("yarn").appName("csv-to-mongo") \
    .enableHiveSupport().config("spark.some.config.option", "some-value").getOrCreate()

# Read CSV data into a dataframe
df = spark.read.load("Book3.csv", format="csv", sep=",", inferSchema="true", header="true")

# Define schema for the JSON dataframe
newschema = StructType([
    StructField("userid", StringType()),
    StructField("report", StringType(), metadata={"maxlength": 6000})
])

# Build (userid, json-string) pairs and convert to a dataframe
jsondf = df.rdd.map(lambda row: (row[0], {"Code": row[1], "status": row[2]})) \
    .map(lambda row: (row[0], json.dumps(row[1]))) \
    .toDF(newschema)

# Write to MongoDB
jsondf.write.format("mongo").mode("append") \
    .option("uri", "mongodb://gcp.mongodb.net/").option("database", "dbname") \
    .option("collection", "testcollection").save()
Resultant Mongo Data
{
  "userid": "1234",
  "report": "{\"Code\": \"1\", \"status\": \"final\"}"
}
{
  "userid": "1287",
  "report": "{\"Code\": \"2\", \"status\": \"notfinal\"}"
}
In Mongo I get a complete JSON-encoded string in "report", which is no surprise given that I declared the report field as StringType().
This effectively makes any nested-field search in Mongo impossible, and the whole pipeline is then useless.
How can I make it a proper nested JSON document so that Mongo can search on nested fields as well?
When I try to change the field to properly structured JSON using the code below,
>>> new_df = sql_context.read.json(df.rdd.map(lambda r: r.json))
>>> new_df.printSchema()
I get the error "AttributeError: json".
Please help with some code tips.
I am OK with using groupBy as well, but I am struggling with what to put in the aggregate functions, and I need a dataframe as the result to write to Mongo.
The solution is to properly define the schema in PySpark ("df_schema") and then map your base df into a new df ("df_mongo"), making sure that the df.rdd.map output follows the pattern defined in df_schema.
from pyspark.sql.types import StructType, StructField, StringType

df = spark.read.load("sourcelocation", format="csv", sep="|", inferSchema="true", header="true")
df_schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", StringType(), True)
])
df_mongo = df.rdd.map(lambda row: [row[15], row[12]]).toDF(df_schema)
df_mongo.write.format("mongo").mode("append").option("uri", mongodb_uri) \
    .option("database", dbname).option("collection", collection_name).save()
The format in the file looks like this:
{ 'match' : 'a', 'score' : '2'},{......}
I've tried pd.DataFrame and I've also tried reading it line by line, but that gives me everything in one cell.
I'm new to Python.
Thanks in advance.
The expected result is a pandas dataframe.
Try the json_normalize() function.
Example:
from pandas.io.json import json_normalize
values = [{'match': 'a', 'score': '2'}, {'match': 'b', 'score': '3'}, {'match': 'c', 'score': '4'}]
df = json_normalize(values)
print(df)
Output:
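With pandas' default formatting, that prints something like:

  match score
0     a     2
1     b     3
2     c     4

(In newer pandas versions, json_normalize is also exposed directly as pd.json_normalize.)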
If one line of your file corresponds to one JSON object, you can do the following:
# import libraries for working with JSON and pandas
import json
import pandas as pd

# make an empty list
data = []

# open your file and add every row as a dict to the list
with open("/path/to/your/file", "r") as file:
    for line in file:
        data.append(json.loads(line))

# make a pandas data frame
df = pd.DataFrame(data)
If there is more than one JSON object on a row of your file, then you first need to find those JSON objects. One way is a small extractor built on json.JSONDecoder.raw_decode; that solution looks like this:
# import all you will need
import pandas as pd
import json
from json import JSONDecoder

# define a function that yields every JSON object found in a string
def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1

# make an empty list
data = []

# open your file and add every JSON object as a dict to the list
with open("/path/to/your/file", "r") as file:
    for line in file:
        for item in extract_json_objects(line):
            data.append(item)

# make a pandas data frame
df = pd.DataFrame(data)
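For example, feeding the function a line containing two JSON objects (hypothetical data, using double quotes as the json module requires) yields two dicts:

line = '{"match": "a", "score": "2"},{"match": "b", "score": "3"}'
print(list(extract_json_objects(line)))
# [{'match': 'a', 'score': '2'}, {'match': 'b', 'score': '3'}]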
Looking for a way to modify the script below to produce a single CSV from multiple JSON files. It should include multiple rows, each row returning values for the same fields, but tied to a single JSON file (row 1 = JSON 1, row 2 = JSON 2, etc.). The following produces a CSV with one row of data.
import pandas as pd
df = pd.read_json("pywu.cache.json")
df = df.loc[["station_id", "observation_time", "weather", "temperature_string", "display_location"],"current_observation"].T
df = df.append(pd.Series([df["display_location"]["latitude"], df["display_location"]["longitude"]], index=["latitude", "longitude"]))
df = df.drop("display_location")
print(df['latitude'], df['longitude'])
df = pd.to_numeric(df, errors="ignore")
pd.DataFrame(df).T.to_csv("CurrentObs.csv", index=False, header=False, sep=",")
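One way to extend this is to wrap the existing per-file steps in a function and loop over the files, collecting one row per file (a sketch; the glob pattern is an assumption, and Series.append requires pandas < 2.0, as in the original script):

import glob
import pandas as pd

def process_file(path):
    # same per-file processing as the original script
    df = pd.read_json(path)
    df = df.loc[["station_id", "observation_time", "weather",
                 "temperature_string", "display_location"], "current_observation"].T
    df = df.append(pd.Series(
        [df["display_location"]["latitude"], df["display_location"]["longitude"]],
        index=["latitude", "longitude"]))
    df = df.drop("display_location")
    return pd.to_numeric(df, errors="ignore")

# one row per JSON file
rows = [process_file(path) for path in sorted(glob.glob("*.cache.json"))]
pd.DataFrame(rows).to_csv("CurrentObs.csv", index=False, header=False, sep=",")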
I'm trying to grab some numbers from this JSON file, but I don't know how to do it correctly. This is the JSON file I am trying to gather information from:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
I've been trying to get this code to work, but I can't figure it out:
import json
from pprint import pprint

with open('data.json') as data_file:
    data = json.load(data_file)

data["rowSet"]["1610612737"]["Atlanta Hawks"]
I'm trying to get the statistics from each team.
The following Python script should do it.
#!/usr/bin/env python
import json

with open('leaguedashteamstats.json') as data_file:
    data = json.load(data_file)

# extract header names
headers = data['resultSets'][0]['headers']
# extract raw json rows
raw_rows = data['resultSets'][0]['rowSet']

team_stats = []
for row in raw_rows:
    print(row[1])  # prints team name
    # pair header names with values and print them out
    for (header, value) in zip(headers, row):
        print(header, value)
    print('\n')
Both data and code can be seen here:
https://gist.github.com/cevaris/24d0b7d97677667aedb14059a6959da1#file-1-team-stats-output
Disclaimer: this code doesn't contain any validation, but it should lead you in the right direction:
import json

with open('data.json') as data_file:
    data = json.load(data_file)

for rs in data.get('resultSets'):
    for r_ in [r for r in rs.get('rowSet') if r[1] == 'Atlanta Hawks']:
        print(r_)
You basically need to determine the specific keys that you are going to loop through or obtain.
This should hopefully get you where you need to be.