Basically, I'm trying to convert the query response from Solr server into a json object that can be passed to a third party api. However, as per my code below, I'm not able to do it:
import solr
import json

if __name__ == '__main__':
    s = solr.SolrConnection('http://localhost:8983/solr')
    op = open('output.json', 'w')
    for term in ['searchstring1', 'searchstring2', 'searcstring']:
        t = s.query('title:%s' % term, rows=100, wt='json')
        for news in t.results:
            op.write(news)
Output:
Traceback (most recent call last):
  File "querying.py", line 11, in <module>
    op.write(news)
TypeError: expected a character buffer object
I have very briefly read about Solr and just found this solrpy library to store the query result in a json format. Any help in this regard will be much appreciated.
You're writing the Python dictionary itself to the file, not a JSON representation of it. The error message is telling you that the object you passed in cannot be used as a character buffer, i.e. it is not a string. Use json.dumps to create a JSON string from each result before writing it to the file.
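A minimal sketch of that fix, using hard-coded sample dictionaries in place of the Solr results. The `results` list, its fields, and the `write_json_lines` helper are made up for illustration; with solrpy you would iterate over `t.results` instead.

```python
import io
import json

# Hypothetical stand-ins for the documents returned by the Solr query.
results = [
    {"id": "1", "title": "searchstring1 in the news"},
    {"id": "2", "title": "more about searchstring2"},
]


def write_json_lines(docs, fileobj):
    """Write each document as one JSON string per line."""
    for doc in docs:
        fileobj.write(json.dumps(doc) + "\n")


# An in-memory buffer keeps the sketch self-contained; a real file
# opened with open('output.json', 'w') is written the same way.
buffer = io.StringIO()
write_json_lines(results, buffer)
output = buffer.getvalue()
print(output)
```

If a document contains values the json module cannot encode (datetime objects, for example), passing `default=str` to `json.dumps` is one way to handle them.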
I hate json files. They are unwieldy and hard to handle :( Please tell me why the following doesn't work:
with open('data.json', 'r+') as file_object:
    data = json.load(file_object)[user_1]['balance']
    am_nt = 5
    data += int(am_nt['amount'])
    print(data)
    file_object[user_1]['balance'] = data
Through trial and error (and many print statements), I have discovered that it opens the file, goes to the correct place, and then actually adds the am_nt, but I can't make the original json file update. Please help me :( :( . I get:
2000
TypeError: '_io.TextIOWrapper' object is not subscriptable
JSON is pleasant to work with because it maps closely onto Python data structures.
The error is: object is not subscriptable
It is raised by this line:
file_object[user_1]['balance'] = data
file_object is a file handle, not JSON/dictionary data, so it cannot be indexed or updated like that. Hence the error.
Instead, read the whole JSON document first:
data = json.load(file_object)
Then manipulate data as a regular Python dictionary and write it back to the file with json.dump.
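The read, update, and save steps can be sketched as follows. This is one possible approach, not the only one; the value of `user_1`, the sample file contents, and the temporary path are assumptions, since the question does not show them.

```python
import json
import os
import tempfile

user_1 = "alice"  # hypothetical key; the question does not show its value
amount = 5

# Create a small sample file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    json.dump({user_1: {"balance": 2000}}, f)

with open(path, "r+") as file_object:
    data = json.load(file_object)        # parse the whole file into a dict
    data[user_1]["balance"] += amount    # update the dict, not the file object
    file_object.seek(0)                  # go back to the start of the file
    json.dump(data, file_object)         # write the updated JSON
    file_object.truncate()               # drop leftover text if the new JSON is shorter

with open(path) as f:
    updated = json.load(f)
print(updated)
```

Rewinding with `seek(0)` and calling `truncate()` matters in 'r+' mode: without them the new JSON would be appended after the old content or leave stale characters at the end.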
I'm validating my response from a GET call through a .json file
match response == read('match_response.json')
Now I want to reuse this file for various other features as only one field in the .json varies. Let's say this param in the json file is "varyingField"
I'm trying to pass this field every time I match the response, but I'm not able to:
def varyingField = 'VARIATION1'
match response == read('match_response.json') {'varyingField' : '#(varyingField)'}}
In the json file I have
"varyingField": "#(varyingField)"
You are trying to pass an argument to read() for a JSON file? Sorry, that is not supported in Karate; please read the docs.
Use this pattern:
create a JSON file that has all your "happy path" values set
use the read() syntax to load the file (which means this is re-usable across multiple tests)
use the set keyword to update only the field for your scenario or negative test
For more details, refer to this answer: https://stackoverflow.com/a/51896522/143475
Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support the JSON array file format for input, so I'm stuck. Also, I have many such files with different schemas, each containing hundreds of columns, so creating the schemas by hand is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page:
from Databricks that supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading the file in as text, stripping the array characters, splitting on the JSON object boundaries, and converting to JSON as above, but that kept giving errors about being unable to convert unicode and/or str values.
I found a way through the above and converted to a dataframe containing one column whose rows were strings holding the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to strip the Unicode byte order mark and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and end up with a distributed dataset for more efficient processing. The output dataframe now has columns named after the JSON elements, with the schema inferred, and the approach adapts dynamically to other file formats.
Here is the code - hope it helps!:
#...Spark considers arrays of Json objects to be an invalid format
# and unicode files are prefixed with a byteorder marker
#
thanksMoiraRDD = sc.textFile( '/a/valid/file/path', partitions ).map(
    lambda x: x.encode('utf-8','ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
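To see what the strip() call in the map step does to each line, here is a plain-Python sketch that runs without Spark. It assumes, as the "fortunate delimiter" above implies, that every JSON object sits on its own line; the sample lines and the `to_json_line` helper are made up for illustration.

```python
import json

# Hypothetical sample: a JSON array file with one object per line,
# starting with a byte order mark.
raw_lines = [
    u'\ufeff[{"id": 1, "name": "a"},',
    u'{"id": 2, "name": "b"},\r',
    u'{"id": 3, "tags": ["x", "y"]}]',
]

# Same character set as in the Spark snippet above.
STRIP_CHARS = u",\r\n[]\ufeff"


def to_json_line(line):
    """Trim array brackets, separators, and the BOM from both ends of a line."""
    return line.strip(STRIP_CHARS)


# Each stripped line is now a standalone JSON object (JSON Lines style).
stripped = [to_json_line(line) for line in raw_lines]
records = [json.loads(line) for line in stripped if line]
print(records)
```

Because strip() only trims the ends of a line, brackets inside an object (such as the "tags" array) are left alone. The trick would not work if a single object spanned several lines.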
I am using the tweetscores package for R to get a list of tweets from Twitter. The tweets are stored in JSON format. While converting them to a data frame I get a lexical error:
Error: lexical error: inside a string, '\' occurs before a character which it may not.
Is there any solution to this error?
A part of the json file text
":[{"text":["MUFC"],"indices":[[83],[88]]}],"symbols":[],"user_mentions":[],"urls":[]},"metadata":{"iso_language_code":["en"],"result_type":["recent"]},"source":["http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"],"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":[7.32108114527322e+017],"id_str":["732108114527322112"],"name":["wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww(^o^)/"],"screen_name":["SukiSukinal"],"location":["+6222"],"description":["Alliansi osaosi ngevote kagak. katanya sih fans a.k.a + "],"url":null,"entities":{"description":{"urls":[]}},"protected":[false],"followers_count":[163],"friends_count":[107],"listed_count":[4],"created_at":["Mon May 16 07:19:11 +0000
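JSON does not allow a bare backslash inside a string: a backslash must start a valid escape sequence (such as \" or \n), so a literal backslash has to be written as '\\'. Replace any '\' character that does not start a valid escape with '\\'. See http://www.json.org/ for more info.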
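The question uses R, but the idea is language-independent. The Python sketch below doubles only those backslashes that do not already start a valid JSON escape; the helper name `escape_stray_backslashes` and the sample string are hypothetical, and the repair is a guess at what the sender meant rather than a guaranteed fix.

```python
import json
import re

VALID_ESCAPES = '"\\/bfnrtu'  # characters JSON allows after a backslash


def escape_stray_backslashes(text):
    """Double every backslash that does not start a valid JSON escape."""
    def fix(match):
        if match.group(1) in VALID_ESCAPES:
            return match.group(0)      # already a valid escape, keep it
        return "\\" + match.group(0)   # prepend one backslash
    # Consume backslash + next character as a pair so that an
    # already-escaped backslash is never split in two.
    return re.sub(r"\\(.)", fix, text, flags=re.DOTALL)


# Hypothetical input: \q is not a valid escape, \" is.
bad = r'{"path": "C:\qux", "quote": "say \"hi\""}'

try:
    json.loads(bad)
    parsed_bad = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parsed_bad = False

fixed = escape_stray_backslashes(bad)
parsed = json.loads(fixed)
print(parsed)
```

The sketch does not check that \u is followed by four hex digits. In the excerpt shown in the question the \" sequences are themselves valid escapes, so the offending backslash may sit elsewhere in the file.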
You likely have an incomplete json string, which may be caused by the package or by an interrupted connection to Twitter's API. A complete json string returned from Twitter should look something like the following:
which I got using rtweet's stream_tweets() function. With a complete string returned by Twitter's REST or stream API, you should be able to convert the data using basically any json parser (e.g., jsonlite::fromJSON()).
I am using MongoDB 3.4 and Python 2.7. I have retrieved a document from the database and I can print it and the structure indicates it is a Python dictionary. I would like to write out the content of this document as a JSON file. When I create a simple dictionary like d = {"one": 1, "two": 2} I can then write it to a file using json.dump(d, open("text.txt", 'w'))
However, if I replace d in the above code with the document I retrieve from MongoDB, I get the error
ObjectId is not JSON serializable
Suggestions?
As you have found out, the issue is that the value of _id is an ObjectId.
The default JSON encoder does not know how to serialise the ObjectId class. You would get a similar error for ANY Python object that the default JSONEncoder does not understand.
One alternative is to write your own custom encoder to serialise ObjectId. However, you should avoid reinventing the wheel and use the utility module that ships with PyMongo, bson.json_util.
For example:
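from bson import json_util

# json_util.dumps already returns a JSON string, so write it directly;
# passing it through json.dump again would encode it a second time.
with open("text.json", "w") as f:
    f.write(json_util.dumps(d))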
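For comparison, the custom-encoder alternative mentioned above can be sketched with only the standard library. `FakeObjectId` is a hypothetical stand-in for bson's ObjectId so that the example runs without PyMongo, and `StringifyingEncoder` is an illustrative name, not part of any library.

```python
import json


class FakeObjectId(object):
    """Hypothetical stand-in for bson's ObjectId so the sketch needs no MongoDB."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


class StringifyingEncoder(json.JSONEncoder):
    """Encode FakeObjectId values as strings; defer everything else."""

    def default(self, obj):
        if isinstance(obj, FakeObjectId):
            return str(obj)
        # Fall back to the base class, which raises TypeError as usual.
        return json.JSONEncoder.default(self, obj)


d = {"_id": FakeObjectId("5a1f0c2e9b1d4c0012345678"), "one": 1}
encoded = json.dumps(d, cls=StringifyingEncoder, sort_keys=True)
print(encoded)
```

Unlike this sketch, which flattens the id to a plain string, json_util writes ObjectId values in MongoDB's extended JSON form ({"$oid": "..."}), so the document can be loaded back with the original type.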
The issue is that "_id" is actually an object and is not natively serializable. Replacing the _id with a string, as in mydocument['_id'] = '123', fixed the issue.