I have a python dictionary that looks like this:
{'id': 5677240, 'name': 'Conjunto de Panelas antiaderentes com 05 Peças Paris', 'quantity': 21, 'price': '192.84', 'category': 'Panelas'}
But when I try to write it to a JSON file, there's a little mess on the encoding:
{"id": 5677240, "name": "Conjunto de Panelas antiaderentes com 05 Pe\u00e7as Paris", "quantity": 21, "price": 192.84, "category": "Panelas"}
I've already tried putting encoding='utf-8 and using locale as false, but neither helped me.
You will have to disable ensure_ascii while writing to a file.
import json
data = {'id': 5677240, 'name': 'Conjunto de Panelas antiaderentes com 05 Peças Paris', 'quantity': 21, 'price': '192.84', 'category': 'Panelas'}
with open("a.json", "w") as f:
json.dump(data, f, ensure_ascii=False)
Output in file:
{"id": 5677240, "name": "Conjunto de Panelas antiaderentes com 05 Peças Paris", "quantity": 21, "price": "192.84", "category": "Panelas"}
Related
I have a df_movies and col of geners that look like json format.
|genres |
[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 37, 'name': 'Western'}]
How can I extract the first field of 'name': val?
way #1
df_movies.withColumn
("genres_extract",regexp_extract(col("genres"),
""" 'name': (\w+)""",1)).show(false)
way #2
df_movies.withColumn
("genres_extract",regexp_extract(col("genres"),
"""[{'id':\s\d,\s 'name':\s(\w+)""",1))
Excepted: Action
You can use get_json_object function:
Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
.toDF("genres")
.withColumn("genres_extract", get_json_object(col("genres"), "$[0].name" ))
.show()
+--------------------+--------------+
| genres|genres_extract|
+--------------------+--------------+
|[{"id": 28, "name...| Action|
+--------------------+--------------+
Another possibility is using the from_json function together with a self defined schema. This allows you to "unwrap" the json structure into a dataframe with all of the data in there, so that you can use it however you want!
Something like the following:
import org.apache.spark.sql.types._
Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
.toDF("genres")
// Creating the necessary schema for the from_json function
val moviesSchema = ArrayType(
new StructType()
.add("id", StringType)
.add("name", StringType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedActions column into separate columns
val parsedDf = df
.withColumn("parsedMovies", explode(from_json(col("genres"), moviesSchema)))
.select("parsedMovies.*")
parsedDf.show(false)
+---+---------+
| id| name|
+---+---------+
| 28| Action|
| 12|Adventure|
| 37| Western|
+---+---------+
I'm getting information from an external api. I'm getting an error when I try to update the database as the string formatting is incorrect.
There is a mixture of Double quotes and single quotes caused by the single quote O' in the first line which must tell python to use Double quotes.
To get over this I tried json.dumps() and that gives me the result below and causes a syntax error when inserting to the Database, because of the single quotes around every object.
[
'{"author_name": "Michael P O\'Shaughnessy", "rating": 4, "text": "Stayed for a midweek"}',
'{"author_name": "camille williams", "rating": 5, "text": "We looked around and found"}',
'{"author_name": "natasha sevrugina", "rating": 5, "text": "Stayed at the "}',
'{"author_name": "niamh kelly", "rating": 5, "text": "Great hotel in a central location"}',
'{"author_name": "Janette Wade", "rating": 5, "text": "Excellent staff
\\ud83d\\udc4f\\n\\nJanette\\nSpiritual Ceremonies"}'
]
It should look like this to achieve, valid Json
[
{"author_name": "Michael P O\' Shaughnessy", "rating": 4, "text": "Stayed for a midweek night"},
{"author_name": "camille williams", "rating": 5, "text": "We looked around and found"}
]
With out jsondumps the return from the api is, notice only the first line is changed
[
{'author_name': "Michael P O'Shaughnessy", 'rating': 4, 'text': "Stayed for a midweek night"}
{'author_name': 'camille williams', 'rating': 5, 'text': 'We looked around and found that was great."},
{'author_name': 'natasha sevrugina', 'rating': 5, 'text': 'Stayed at the Hotel '},
{'author_name': 'niamh kelly', 'rating': 5, 'text': 'Great hotel in a central location. '},
{'author_name': 'Janette Wade', 'rating': 5, 'text': 'Modern rooms. Great room service. '}
]
Which also gives a syntax error because of the mixture of single, double and single quotes.
Here are my calls to the api:
review_author = dictionary['result']['reviews'][i]['author_name']
review_rating = dictionary['result']['reviews'][i]['rating']
review_text = dictionary['result']['reviews'][i]['text']
dict_keys = ["author_name", "rating", "text"]
res_dict = {dict_keys[0]: review_author, dict_keys[1]: review_rating, dict_keys[2]: review_text}
bus_reviews.append(json.dumps(res_dict))
How can I remove the single quotes around json.dump()
I changed this
bus_reviews.append(json.dumps(res_dict))
to
bus_reviews.append(res_dict)
and I put the return statement to include the json dumps
return json.dumps(bus_reviews)
I am very new to Json files. I scraped a txt file with some million json objects such as:
{
"created_at":"Mon Oct 14 21:04:25 +0000 2013",
"default_profile":true,
"default_profile_image":true,
"description":"...",
"followers_count":5,
"friends_count":560,
"geo_enabled":true,
"id":1961287134,
"lang":"de",
"name":"Peter Schmitz",
"profile_background_color":"C0DEED",
"profile_background_image_url":"http://abs.twimg.com/images/themes",
"utc_offset":-28800,
...
}
{
"created_at":"Fri Oct 17 20:04:25 +0000 2015",
...
}
I want to extract the columns into a data frame in R:
Variable Value
created_at X
default_profile Y
…
In general, similar to how done here(multiple Json objects in one file extract by python) in Python. If anyone has an idea or a suggestion, help would be much appreciated! Thank you!
Here is an example on how you could approach it with two objects. I assume you were able to read the JSON from a file, otherwise see here.
myjson = '{"created_at": "Mon Oct 14 21:04:25 +0000 2013", "default_profile": true,
"default_profile_image": true, "description": "...", "followers_count":
5, "friends_count": 560, "geo_enabled": true, "id": 1961287134, "lang":
"de", "name": "Peter Schmitz", "profile_background_color": "C0DEED",
"profile_background_image_url": "http://abs.twimg.com/images/themes", "utc_offset": -28800}
{"created_at": "Mon Oct 15 21:04:25 +0000 2013", "default_profile": true,
"default_profile_image": true, "description": "...", "followers_count":
5, "friends_count": 560, "geo_enabled": true, "id": 1961287134, "lang":
"de", "name": "Peter Schmitz", "profile_background_color": "C0DEED",
"profile_background_image_url": "http://abs.twimg.com/images/themes", "utc_offset": -28800}
'
library("rjson")
# Split the text into a list of all JSON objects. I chose '!x!x!' pretty randomly.. There may be better ways of keeping the brackets wile splitting.
my_json_objects = head(strsplit(gsub('\\}','\\}!x!x!', myjson),'!x!x!')[[1]],-1)
# read the text as JSON objects
json_data <- lapply(my_json_objects, function(x) {fromJSON(x)})
# Transform to dataframes
json_data <- lapply(json_data, function(x) {data.frame(val=unlist(x))})
Output:
[[1]]
val
created_at Mon Oct 14 21:04:25 +0000 2013
default_profile TRUE
default_profile_image TRUE
description ...
followers_count 5
friends_count 560
geo_enabled TRUE
id 1961287134
lang de
name Peter Schmitz
profile_background_color C0DEED
profile_background_image_url http://abs.twimg.com/images/themes
utc_offset -28800
[[2]]
val
created_at Mon Oct 15 21:04:25 +0000 2013
default_profile TRUE
default_profile_image TRUE
description ...
followers_count 5
friends_count 560
geo_enabled TRUE
id 1961287134
lang de
name Peter Schmitz
profile_background_color C0DEED
profile_background_image_url http://abs.twimg.com/images/themes
utc_offset -28800
Hope this helps!
I am getting this error while trying to import this JSON into google bigquery table
file-00000000: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. (error code: invalid)
JSON parsing error in row starting at position 0 at file: file-00000000. Start of array encountered without start of object. (error code: invalid)
This is the JSON
[{'instrument_token': 11192834, 'average_price': 8463.45, 'last_price': 8471.1, 'last_quantity': 75, 'buy_quantity': 1065150, 'volume': 5545950, 'depth': {'buy': [{'price': 8471.1, 'quantity': 300, 'orders': 131072}, {'price': 8471.0, 'quantity': 300, 'orders': 65536}, {'price': 8470.95, 'quantity': 150, 'orders': 65536}, {'price': 8470.85, 'quantity': 75, 'orders': 65536}, {'price': 8470.7, 'quantity': 225, 'orders': 65536}], 'sell': [{'price': 8471.5, 'quantity': 150, 'orders': 131072}, {'price': 8471.55, 'quantity': 375, 'orders': 327680}, {'price': 8471.8, 'quantity': 1050, 'orders': 65536}, {'price': 8472.0, 'quantity': 1050, 'orders': 327680}, {'price': 8472.1, 'quantity': 150, 'orders': 65536}]}, 'ohlc': {'high': 8484.1, 'close': 8336.45, 'low': 8422.35, 'open': 8432.75}, 'mode': 'quote', 'sell_quantity': 998475, 'tradeable': True, 'change': 1.6151959167271395}]
http://jsonformatter.org/ also gives parse error for this JSON block. Need help understanding where the formatting is wrong - this is the JSON from a rest API
This is not valid JSON. JSON uses double quotes, not single quotes. Also, True should be true.
If I had to guess, I would guess that this is Python code being passed off as JSON. :-)
I suspect that even once this is made into correct JSON, it's not the format Google BigQuery is expecting. From https://cloud.google.com/bigquery/data-formats#json_format, it looks like you should have a text file with one JSON object per line. Try just this:
{"mode": "quote", "tradeable": true, "last_quantity": 75, "buy_quantity": 1065150, "depth": {"buy": [{"quantity": 300, "orders": 131072, "price": 8471.1}, {"quantity": 300, "orders": 65536, "price": 8471.0}, {"quantity": 150, "orders": 65536, "price": 8470.95}, {"quantity": 75, "orders": 65536, "price": 8470.85}, {"quantity": 225, "orders": 65536, "price": 8470.7}], "sell": [{"quantity": 150, "orders": 131072, "price": 8471.5}, {"quantity": 375, "orders": 327680, "price": 8471.55}, {"quantity": 1050, "orders": 65536, "price": 8471.8}, {"quantity": 1050, "orders": 327680, "price": 8472.0}, {"quantity": 150, "orders": 65536, "price": 8472.1}]}, "change": 1.6151959167271395, "average_price": 8463.45, "ohlc": {"close": 8336.45, "high": 8484.1, "open": 8432.75, "low": 8422.35}, "instrument_token": 11192834, "last_price": 8471.1, "sell_quantity": 998475, "volume": 5545950}
OP has a valid JSON record but that wouldn't work with Biq Query, and here's why:
Google Big Query supports, JSON objects {}, one object per line. Check this out.
This basically means that you cannot supply list [] as json records and expect Big Query to detect it. You must always have one json object per line.
Here's a quick reference to what I am saying.
and there are more.
at last,
I highly recommend you read up the below and check out the link for more information on different forms of JSON structures, read this from the json.org
I have a json file with the following format
[
{
"id": 2,
"createdBy": 0,
"status": 0,
"utcTime": "Oct 14, 2014 4:49:47 PM",
"placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia",
"longitude": 77.5983817,
"latitude": 12.9832418,
"createdDate": "Sep 16, 2014 2:59:03 PM",
"accuracy": 5,
"loginType": 1,
"mobileNo": "0000005567"
},
{
"id": 4,
"createdBy": 0,
"status": 0,
"utcTime": "Oct 14, 2014 4:52:48 PM",
"placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia",
"longitude": 77.5983817,
"latitude": 12.9832418,
"createdDate": "Oct 8, 2014 5:24:42 PM",
"accuracy": 5,
"loginType": 1,
"mobileNo": "0000005566"
}
]
when i try to load the data to pig using JsonLoader class I am getting a error like Unexpected end-of-input: expected close marker for OBJECT
a = LOAD '/user/root/jsoneg/exp.json' USING JsonLoader('id:int,createdBy:int,status:int,utcTime:chararray,placeName:chararray,longitude:double,latitude:double,createdDate:chararray,accuracy:double,loginType:double,mobileNo:chararray');
b = foreach a generate $0,$1,$2;
dump b;
I am also faced similar kind of problem sometime back, later i came to know that Pig JSON will not support multiline json format. It will always expect the json input must be in single line.
Instead of native Jsonloader, i suggest you to use elephantbird json loader. It is pretty good for Jsons formats.
You can download the jars from the below link
http://www.java2s.com/Code/Jar/e/elephant.htm
I changed your input format to single line and loaded through elephantbird as below
input.json
{"test":[{"id": 2,"createdBy": 0,"status": 0,"utcTime": "Oct 14, 2014 4:49:47 PM","placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia","longitude": 77.5983817,"latitude": 12.9832418,"createdDate": "Sep 16, 2014 2:59:03 PM","accuracy": 5,"loginType": 1,"mobileNo": "0000005567"},{"id": 4,"createdBy": 0,"status": 0,"utcTime": "Oct 14, 2014 4:52:48 PM","placeName": "21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia","longitude": 77.5983817,"latitude": 12.9832418,"createdDate": "Oct 8, 2014 5:24:42 PM","accuracy": 5,"loginType": 1,"mobileNo": "0000005566"}]}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'test');
C = FOREACH B GENERATE FLATTEN($0) AS mymap;
D = FOREACH C GENERATE mymap#'id',mymap#'placeName',mymap#'status';
DUMP D;
Output:
(2,21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia,0)
(4,21/F, Cunningham Main Rd, Sampangi Rama NagarBengaluruKarnatakaIndia,0)