String not valid UTF-8 when inserting JSON file into MongoDB

I am storing a JSON file produced by the Cucumber JSON report into MongoDB, but I get "String not valid UTF-8" when inserting the JSON into MongoDB.
Because the feature file contains 年/月/日, these characters are represented as "\u5e74/\u6708/\u65e5" in the JSON file.
This is what I am doing to store it in Mongo:
json_string = File.read(file_path)
data = JSON.parse(json_string)
col.insert(data)   # this is where the exception is raised
I can see that after JSON.parse, the already-encoded string changes further, to "\x90\u0013s/\x90\u0013s/\x90\u0013s".
The exception happens at the insert statement.
Any help is appreciated.
I tried the following, but it still does not work:
json_string = File.read(file_path, :encoding => 'UTF-8')
data = JSON.parse(json_string.force_encoding("UTF-8"))
Using Ruby 1.9.3

Related

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
    format = 'csv',
    fieldterminator = '0x0b',
    fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why format = 'csv' is used. Is it trying to convert the JSON to CSV to flatten the file?
Why didn't they just read the file as a SINGLE_CLOB?
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file is not well formed as a single JSON document. Using SINGLE_CLOB would therefore mean more work after the openrowset before we could use the content as JSON (since the whole file is not valid JSON, we would need to split and parse the value ourselves). It can be done, but it would probably require more work.
The file is actually multiple JSON-like strings, each on a separate line: "line-delimited JSON", as the document calls it.
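A minimal Python sketch of why line-delimited JSON has to be handled per line (the two records below are hypothetical, not the actual contents of the ECDC file): the file as a whole is not one valid JSON value, but each individual line is, which is exactly what the openrowset query above exposes as one doc value per row.
import json

# Two hypothetical JSONL lines: each line is a complete JSON document
jsonl_text = '{"country": "A", "cases": 1}\n{"country": "B", "cases": 2}\n'

# Parsing the whole file at once fails: it is not a single JSON value
try:
    json.loads(jsonl_text)
except json.JSONDecodeError:
    pass  # expected: "Extra data" after the first document

# Parsing line by line works; each line stands on its own as JSON
records = [json.loads(line) for line in jsonl_text.splitlines() if line]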
By the way, if you check the history of the document on GitHub, you will find that originally this was not the case. As far as I remember, the file originally contained a single JSON document with an array of objects (it was wrapped with [] once loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server; there is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in a JSON-like format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated-values file. You can (and should) use it whenever your file is well formatted as comma-separated values (commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which is a valid JSON document. Only after the file has been read with openrowset does the parsing of the text as JSON begin.
Notice that only after the heading "Parse JSON documents" does the document start to speak about parsing the text as JSON.

Is it possible to write JSON data to a file in Azure Blob Storage without converting it to a string?

I'm working on a project in Azure Databricks where I need to write my transformed data, which is in JSON format, to a file (.json), which is then further written to a DB.
I've tried the dataframe and RDD options. Here are some snippets of the things I've tried:
val processedList = df.collect.map( line => {
  // transformation logic to create the json string
  (field1, field2, json)
})
val dataframe = processedList.toList.toDF("f1", "f2", "json")
dataframe.repartition(1).write.mode("overwrite").json(path)
This code works fine, but the 'json' value, which is JSON data, is treated/written as a String, so it ends up containing all the escape characters, etc. I cannot directly use a JsonObject, as a dataframe doesn't support it.
So is there a way to write to the file without the JSON being converted to a String?
What's the type of the json column? It's probably string, so Spark treats it as a string literal. Try parsing it back into a struct before writing, for example:
df.withColumn("json", from_json(col("json"), schema)).write.json(path)
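A minimal PySpark sketch of that idea, as an illustration rather than the exact code for this project: the payload schema, column names, and output path below are assumptions. from_json turns the string column into a struct, so write.json emits nested JSON instead of an escaped string literal.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: f1, f2 and a JSON payload that is currently a plain string
df = spark.createDataFrame(
    [("a", "b", '{"field1": "x", "field2": "y"}')],
    ["f1", "f2", "json"],
)

# Assumed schema of the JSON payload
payload_schema = StructType([
    StructField("field1", StringType()),
    StructField("field2", StringType()),
])

# Parse the string into a struct, then write; the json column is now nested JSON
out = df.withColumn("json", from_json(col("json"), payload_schema))
out.repartition(1).write.mode("overwrite").json("/tmp/json-output")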

'JSON.parse' replaces ':' with '=>'

I have a string:
{"name":"hector","time":"1522379137221"}
I want to parse the string into JSON and expect to get:
{"name":"hector","time":"1522379137221"}
I am doing:
require 'json'
JSON.parse(string)
which produces this:
{"name"=>"hector","time"=>"1522379137221"}
Can someone tell me how I can keep :? I don't understand why it adds =>.
After you parse the JSON data, you see it in terms of the programming language that you are using.
Ruby uses => to separate a key from its value in a hash (while JSON uses :).
So the Ruby output is correct, and the data is ready for you to manipulate in your code. When you convert your hash back to JSON, the JSON library will convert the => back to :.
JSON does not have a symbol class. Hence, nothing in JSON data corresponds to a Ruby symbol. Under a trivial conversion from JSON to Ruby like JSON.parse, you cannot get symbols in the output.

Create JSON file from MongoDB document using Python

I am using MongoDB 3.4 and Python 2.7. I have retrieved a document from the database; I can print it, and its structure indicates it is a Python dictionary. I would like to write out the content of this document as a JSON file. When I create a simple dictionary like d = {"one": 1, "two": 2}, I can write it to a file using json.dump(d, open("text.txt", 'w'))
However, if I replace d in the above code with the document I retrieve from MongoDB, I get the error
ObjectId is not JSON serializable
Suggestions?
As you have found out, the issue is that the value of _id is an ObjectId.
The ObjectId class is not understood by the default JSON encoder, so it cannot be serialised. You would get a similar error for ANY Python object that is not understood by the default JSONEncoder.
One alternative is to write your own custom encoder to serialise ObjectId. However, you should avoid reinventing the wheel and use the provided PyMongo/bson utility module bson.json_util.
For example:
from bson import json_util

# json_util.dumps already returns a JSON string, so write it to the file directly
with open("text.json", "w") as f:
    f.write(json_util.dumps(d))
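An equivalent sketch, if you prefer to keep calling json.dump directly: the standard encoder can delegate unknown types such as ObjectId to json_util.default (the file name is just an example).
import json
from bson import json_util

# json.dump falls back to json_util.default for types the stdlib encoder rejects
with open("text.json", "w") as f:
    json.dump(d, f, default=json_util.default)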
The issue is that _id is actually an object and is not natively serializable. Replacing the _id with a string, as in mydocument['_id'] = '123', fixed the issue.

Does a JSON file need to be converted to objects before inserting into SQLite?

I am making an Android app, and I have a big JSON file that I would like to store in SQLite.
Should I convert the JSON file to objects before inserting it into SQLite?
Thanks.
I see that you were able to convert it to a JSON object. After that, you should convert the JSON object to a String and then save the string as a VARCHAR/TEXT column in the DB.
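A minimal sketch of that idea, using Python's sqlite3 module purely for illustration (the thread is about Android, so treat the table and column names as hypothetical; on Android the same pattern applies when binding the string into an INSERT): serialize the parsed object back to a JSON string and store it in a single TEXT column.
import json
import sqlite3

# Hypothetical object parsed from the big JSON file
payload = {"name": "example", "items": [1, 2, 3]}

conn = sqlite3.connect("app.db")
conn.execute("CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, body TEXT)")

# Store the object as one serialized string in a TEXT column
conn.execute("INSERT INTO documents (body) VALUES (?)", (json.dumps(payload),))
conn.commit()

# Later, read the string back and parse it into an object again
row = conn.execute("SELECT body FROM documents").fetchone()
doc = json.loads(row[0])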