Ad-hoc queries to a massive JSON dataset

I have a massive dataset stored in Azure Blob Storage in JSON format. Some apps are constantly adding new data to it. Blobs are organized in partitions like
/dataset={name}/date={YYYY-MM-DD}/one_or_more_json_files
The data does not follow any particular schema, JSON field names are not in a consistent letter case, and some JSON rows may be malformed.
Could someone advise a good way to query this data without defining a schema in advance? I would like to do something like
select * from my_huge_json_dataset where dataset='mydataset' and date>'2015-04-01'
without defining an explicit schema for the table.
My first consideration was Hive, but it turns out that the SerDe needs a schema to be defined to create a table. json_tuple could be an answer, but it is case-sensitive and crashes if it meets a malformed JSON row.
I am also considering Apache Drill and Pig, but I have no experience with them and would like some guidance.

You could use Apache Drill; you only need to configure a new storage plugin pointing at your dataset folder:
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": {
      "type": "json",
      "extensions": ["json"]
    }
  }
}
So, if you defined that storage plugin as 'dfs', for example, you could query from the root directory without defining any schema, using ANSI SQL, just like:
SELECT * FROM dfs.dataset.date.`file.json`;
or even filter by your folder name in the same query using the implicit dir0 column.
I encourage you to visit the Apache Drill documentation site, and in your case especially the Querying JSON Files page.
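As a sketch of the dir0 idea against the partition layout in the question: Drill exposes each directory level under the queried path as implicit columns dir0, dir1, and so on, so the `dataset=` and `date=` folders can be filtered directly. The workspace name and path below are assumptions for illustration, not part of the question.

```sql
-- dir0 = first directory level (dataset=...), dir1 = second (date=...)
SELECT t.*
FROM dfs.root.`/` t
WHERE t.dir0 = 'dataset=mydataset'
  AND t.dir1 > 'date=2015-04-01';
```

Because the directory names embed the key (`dataset=mydataset`), the comparison is done on the full folder-name string here; the date ordering still works because of the YYYY-MM-DD format.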

Related

How to insert JSON data into a sqflite table without any user interaction in the app (e.g. at first init of the app)

Can you help me with inserting ready-made JSON data, stored in an assets file, into a sqflite table?
For example, my JSON file:
[
  {
    "word": "json",
    "value": "format that uses human-readable text to store and transmit data"
  },
  {
    "word": "life",
    "value": "a gift"
  },
  {
    "word": "work",
    "value": "money"
  }
]
Also, when inserting that data into the table, sqflite should add another string field for each entry, something like "inFavorite": "false".
I've seen a lot of videos doing this inside the app, but not one that inserts ready-made data into sqflite.
Is that possible? I tried a lot of things, but with no result. I would be grateful for any help!
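The question is about Flutter's sqflite, but the underlying idea (load the bundled JSON, add the extra field, bulk-insert) is the same in any SQLite binding. A minimal sketch in plain Python with sqlite3, where the table and column names are illustrative assumptions:

```python
import json
import sqlite3

# Stand-in for the bundled assets file from the question.
words_json = '''[
  {"word": "json", "value": "format that uses human-readable text to store and transmit data"},
  {"word": "life", "value": "a gift"},
  {"word": "work", "value": "money"}
]'''

conn = sqlite3.connect(":memory:")  # a real app would use a file path
conn.execute(
    "CREATE TABLE IF NOT EXISTS words (word TEXT, value TEXT, inFavorite TEXT)"
)

rows = [
    # Add the extra inFavorite field while inserting, defaulting to "false".
    (entry["word"], entry["value"], "false")
    for entry in json.loads(words_json)
]
conn.executemany(
    "INSERT INTO words (word, value, inFavorite) VALUES (?, ?, ?)", rows
)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM words").fetchone()[0])  # 3
```

In sqflite the shape is the same: run the CREATE TABLE and inserts in the database's onCreate callback so it happens once, at first init of the app.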

Can I import table in a JSON Schema validation?

I am writing a JSON Schema validation. I have an ID field whose allowed values are imported from a table in SQL Server. This list of values is large and frequently updated, so is there a way to dynamically connect to this table on the server and validate the JSON against it? Below is an example of my schema:
{
  "type": "object",
  "required": ["employees"],
  "properties": {
    "employees": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": {
            "type": "integer",
            "enum": [134, 2123, 3213, 444, 5525, 6234, 7532, 825, 9342]
          }
        }
      }
    }
  }
}
In place of 'enum' I want to connect to a table so the ID values are updated when the table is updated.
As Greg said, there is nothing in JSON Schema itself which allows you to do this.
Some implementations have created their own extensions to allow external sources, and many implementations allow custom keywords. You would have to check your implementation's documentation.
You should also consider the cost of querying the database at the same time as checking structural correctness. It may be beneficial to run the ID check, which hits your database, only after you have confirmed the data has the correct format and structure.
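Without committing to any particular validator, the two-phase idea can be sketched in plain Python: check the structure first, and only then verify the IDs against a set that a real system would refresh from SQL Server. The function names and the query mentioned in the comment are assumptions for illustration.

```python
def structurally_valid(doc) -> bool:
    """Phase 1: mirror the schema's shape. An object with an 'employees'
    array of objects whose 'id' (when present) is an integer."""
    if not isinstance(doc, dict) or "employees" not in doc:
        return False
    employees = doc["employees"]
    if not isinstance(employees, list):
        return False
    return all(
        isinstance(e, dict) and isinstance(e.get("id", 0), int)
        for e in employees
    )

def ids_valid(doc, known_ids) -> bool:
    """Phase 2: only run this (and the database query behind known_ids)
    once the structure is known to be correct."""
    return all(e["id"] in known_ids for e in doc["employees"] if "id" in e)

# known_ids would be refreshed from SQL Server, e.g. SELECT id FROM employees.
known_ids = {134, 2123, 3213, 444, 5525, 6234, 7532, 825, 9342}

doc = {"employees": [{"id": 134}, {"id": 825}]}
print(structurally_valid(doc) and ids_valid(doc, known_ids))  # True
```

Splitting the phases this way means a malformed payload never triggers a database round trip, which is the cost consideration mentioned above.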

MySQL JSON Document Store method for inserting data into a node 3 levels deep

I want to take the data from here: https://raw.githubusercontent.com/usnistgov/oscal-content/master/examples/ssp/json/ssp-example.json
which I've pulled into a MySQL database table called "ssp_models", into a JSON column called json_data. I need to add a new 'name' and 'type' entry to the 'parties' node, with a new uuid in the same format as the example.
So in my MySQL table, ssp_models, I have this entry, noting that I should be able to write the data by somehow referencing "66c2a1c8-5830-48bd-8fdd-55a1c3a52888" as the record to modify.
All the examples I've seen online seem to force me to read the entire JSON out into a variable, make the addition, and then cram it back into the json_data column, which seems costly, especially with large JSON datasets.
Isn't there a simple way I can say something like
"INSERT INTO ssp_models JSON_INSERT <somehow burrow down to 'system-security-plan'.metadata.parties (name, type) VALUES ('Raytheon', 'organization') WHERE uuid = '66c2a1c8-5830-48bd-8fdd-55a1c3a52888'
I was looking at this other Stack Overflow example for inserting into JSON:
How to create and insert a JSON object using MySQL queries?
However, that's mainly useful when you are starting from scratch, versus needing to add JSON data to data that already exists.
You may want to read https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html and explore each of the functions, and try them out one by one, if you're going to continue working with JSON data in MySQL.
I was able to do what you describe this way:
update ssp_models set json_data = json_array_append(
    json_data,
    '$."system-security-plan".metadata.parties',
    json_object('name', 'Bingo', 'type', 'farmer')
)
where uuid = '66c2a1c8-5830-48bd-8fdd-55a1c3a52888';
Then I checked the data:
mysql> select uuid, json_pretty(json_data) from ssp_models\G
*************************** 1. row ***************************
uuid: 66c2a1c8-5830-48bd-8fdd-55a1c3a52888
json_pretty(json_data): {
  "system-security-plan": {
    "uuid": "66c2a1c8-5830-48bd-8fdd-55a1c3a52888",
    "metadata": {
      "roles": [
        {
          "id": "legal-officer",
          "title": "Legal Officer"
        }
      ],
      "title": "Enterprise Logging and Auditing System Security Plan",
      "parties": [
        {
          "name": "Enterprise Asset Owners",
          "type": "organization",
          "uuid": "3b2a5599-cc37-403f-ae36-5708fa804b27"
        },
        {
          "name": "Enterprise Asset Administrators",
          "type": "organization",
          "uuid": "833ac398-5c9a-4e6b-acba-2a9c11399da0"
        },
        {
          "name": "Bingo",
          "type": "farmer"
        }
      ]
    }
  }
}
I started with data like yours, but for this test, I truncated everything after the parties array.
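For what it's worth, the effect of json_array_append can be mimicked client-side in plain Python. This is exactly the read-modify-write round trip the question hoped to avoid, but it makes the path semantics of '$."system-security-plan".metadata.parties' concrete. The document below is trimmed to the relevant part of the example.

```python
import json

# Trimmed stand-in for the json_data column contents.
json_data = '''{
  "system-security-plan": {
    "uuid": "66c2a1c8-5830-48bd-8fdd-55a1c3a52888",
    "metadata": {
      "parties": [
        {"name": "Enterprise Asset Owners", "type": "organization"}
      ]
    }
  }
}'''

doc = json.loads(json_data)

# Equivalent of the SQL path '$."system-security-plan".metadata.parties':
# each dotted step is one level of key lookup.
parties = doc["system-security-plan"]["metadata"]["parties"]
parties.append({"name": "Bingo", "type": "farmer"})

print(parties[-1]["name"])  # Bingo
```

The advantage of json_array_append is that MySQL performs this navigation and append inside the server, so only the UPDATE statement crosses the wire rather than the whole document.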

Create JSON file from a query with FOR JSON clause result in ADF

I need to create a JSON file from an Azure SQL database and store the file in Azure Blob Storage.
In ADF, I created a simple pipeline with one Copy Data activity to achieve this.
I used a T-SQL query with the FOR JSON clause to get data from the database:
SELECT * FROM stage.Employee FOR JSON AUTO, ROOT ('main_root')
Here is my source:
And this is the sink:
After executing the pipeline, the created file looks like this:
I want to get a normal JSON file with this structure:
{
  "main_root": [
    {
      "Employee_No": "1000",
      "Status": "Employee",
      ...
    },
    {
      "Employee_No": "1000",
      "Status": "Employee",
      ...
    },
    ...
  ]
}
Any help will be appreciated.
You are building a hierarchical structure from a relational source, so you'll want to build your relational-to-hierarchical logic in data flows to accommodate this data transformation.
Set the SQL DB table as your source, then build the hierarchical structure in a Derived Column transformation with sub-columns for the hierarchies, and collect data into arrays using an Aggregate transformation with the collect() function.
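Independent of ADF, the target shape itself is simple to pin down. A plain-Python sketch of turning flat rows into the desired {"main_root": [...]} document, which is also what FOR JSON AUTO, ROOT('main_root') produces server-side; the column values here are illustrative, taken loosely from the question:

```python
import json

# Flat rows as they might come back from stage.Employee.
rows = [
    {"Employee_No": "1000", "Status": "Employee"},
    {"Employee_No": "1001", "Status": "Contractor"},
]

# ROOT('main_root') wraps the row array in a single root object.
document = {"main_root": rows}

print(json.dumps(document, indent=2))
```

If the copy activity writes the FOR JSON output as an escaped string rather than this structure, that usually means the sink treated the query result as a single text column instead of parsing it as JSON, which is why the answer steers toward building the hierarchy in a data flow.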

Store Json field as String in Elastic search?

I am trying to index a JSON field in Elasticsearch. I have given it an explicit mapping saying this field should be treated as a string and not as JSON; indexing is also not required for it, so there is no need to analyze it. The mapping for this is the following:
"json_field": {
  "type": "string",
  "index": "no"
},
Still, at indexing time this field is getting analyzed, and because of that I am getting a MapperParsingException.
In short: how can we store JSON as a string in Elasticsearch without it being analyzed?
Finally got it.
If you want to store JSON as a string without analyzing it, the mapping should look like this:
"json_field": {
  "type": "object",
  "enabled": false
},
The enabled flag allows disabling parsing and indexing of a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON which should not be indexed, nor added to the mapping.
Update: as of Elasticsearch 7.12, "enabled" has been changed to "index", so the mapping should look like this:
"json_field": {
  "type": "object",
  "index": false
},
Solution
Set "enabled": false for the field.
curl -X PUT "localhost:9200/{{INDEX-NAME}}/_mapping/doc" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "json_field": {
      "type": "object",
      "enabled": false
    }
  }
}
'
Note: this cannot be applied to an existing field. Either set it in the mapping during creation of the index, or create a new field.
Explanation
The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way:
Ref: Elasticsearch Docs