Removing double-encoded JSON values in PSQL - json

Given a table in Postgresql, defined approximately as follows:
Column | Type | Modifiers | Storage | Stats target | Description
-------------+-----------------------------+-----------+----------+--------------+-------------
id | character varying | not null | extended | |
answers | json | | extended | |
we accidentally did a number of inserts into this database of doubly-encoded JSON objects, i.e. the json value is a string that is itself a JSON-encoded object -- for example:
"{\"a\": 1}"
We'd like to find a query that would convert these values to the JSON objects they represent, for example:
{"a": 1}
We can easily select the bad values by doing:
SELECT * FROM table WHERE json_typeof(answers) = 'string'
but we are having trouble coming up with a way to parse the JSON in PSQL.

Unfortunately, there is no string-extraction function for the json[b] types directly, but you can work around this by embedding the value inside a JSON array and using the ->> operator to extract it as text at array index 0:
UPDATE table
SET answers = (CONCAT('[', answers::text, ']')::json ->> 0)::json
WHERE json_typeof(answers) = 'string'
This should work with older PostgreSQL versions too (9.3). On newer versions (9.4+) you could build the array with json_build_array() instead.
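For completeness, a hedged sketch of that 9.4+ variant (same placeholder names as above; "table" stands in for your real table name):
UPDATE table
SET answers = (json_build_array(answers) ->> 0)::json
WHERE json_typeof(answers) = 'string'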

Related

Manual Insert into MySQLx Collection

I have a large amount of JSON data that needs to be inserted into a MySQLx Collection table. The current Node implementation keeps crashing when I attempt to load my JSON data, and I suspect it's because I'm inserting too much at once through the collection API. I'd like to manually insert the data into the database using a traditional SQL statement (in the hope that this will get me past the Node.js crash).
The problem is that I have this table def:
+--------------+---------------+------+-----+---------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------------------+
| doc | json | YES | | NULL | |
| _id | varbinary(32) | NO | PRI | NULL | STORED GENERATED |
| _json_schema | json | YES | | NULL | VIRTUAL GENERATED |
+--------------+---------------+------+-----+---------+-------------------+
But when running
insert into documents values ('{}', DEFAULT, DEFAULT)
I get:
ERROR 3105 (HY000): The value specified for generated column '_id' in table 'documents' is not allowed.
I've tried not providing the DEFAULTs, using NULL (but _id doesn't allow NULL even though that's its default), using 0 for _id, plain numbers, and uuid_to_bin(uuid()), but I still get the same error.
How can I insert this data into the table directly? (I'm using session.sql('INSERT...').bind(JSON.stringify(data)).execute() with the @mysql/xdevapi library.)
The _id column is auto-generated from the value of the namesake field in the JSON document. When you use the CRUD interface to insert documents, the X Plugin generates a unique value for that field. By executing a plain SQL statement you are bypassing that logic, so you can only insert documents if you generate the _ids yourself; otherwise you will bump into that error.
As an example (using crypto.randomInt()):
const { randomInt } = require('crypto')

session.sql('insert into documents (doc) values (?)')
    .bind(JSON.stringify({ _id: randomInt(Math.pow(2, 48) - 1) }))
    .execute()
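For reference, the equivalent plain SQL simply embeds an _id of your own choosing inside the document (the hex value below is made up for illustration; anything that fits the generated varbinary(32) column works):
INSERT INTO documents (doc)
VALUES ('{"_id": "00006075f4f40000000000000001"}');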
That said, I'm curious about the issue with the CRUD API and would like to see if I can reproduce it. How are you inserting those documents in that case, and what kind of feedback (if any) do you get when it "crashes"?
Disclaimer: I'm the lead developer of the MySQL X DevAPI connector for Node.js

Create Hive external table with complex data types and load from a CSV or TSV with a few columns containing serialized JSON objects

I have a CSV (or TSV) with one column ('nw_day' in the example below) containing a serialized array and another column ('res_m' in the example below) containing a serialized JSON object. It also has columns of STRING, TIMESTAMP, and FLOAT data types.
The TSV looks somewhat like this (showing the first row):
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | {'Fri','Mon','Sat','Sun','Thurs','Tue','Wed'} | {"some_str":"str_one","some_n":1,"some_t":2019-04-24 06:00:03.700+0000}|
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
I have tried the following statement, but it is not giving me perfect results.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
com_id STRING,
w_start_time TIMESTAMP,
cap FLOAT,
nw_day array <STRING>,
res_m STRUCT <
some_str: STRING,
some_n: BIGINT,
some_t: TIMESTAMP
>)
COMMENT 's_e_s'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/location/to/folder/containing/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
So I'm thinking I can deserialize those objects into Hive complex data types using ARRAY and STRUCT. But that is not exactly what I get when I run
select * from table_name limit 1;
which gives me
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | ["{'Fri'"," 'Mon'"," 'Sat'"," 'Sun'"," 'Thurs'"," 'Tue'"," 'Wed'}"] | {"some_str":"{\"some_str\":\"str_one\",\"some_n\":1,\"some_t\":2019-04-24 06:00:03.700+0000}\","some_n":null,"some_t":null}|
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
So it is treating the whole object as a string and splitting that string by the delimiter.
I need some help understanding how to load data from CSV/TSV to complex data types in Hive.
I found a similar question, but the requirement there is a little different and no complex data type is involved.
Any help would be much appreciated. If this cannot be done and a preprocessing step has to be included prior to loading, an example of loading input data into Hive complex data types would help me. Thanks in advance!

Count number of objects in JSON array stored as Hive string column

I have a Hive table with a JSON string stored as a string in a column.
Something like this.
Id | Column1 (String)
1 | [{k1:v1,k2:v2},{k3:v3,k4:v4}]
2 | [{k1:v1,k2:v2}]
I want to count the number of JSON objects in the column.
Id | Count
1 | 2
2 | 1
What would be the query to achieve this?
If the JSON objects are simple structs like these, with no nested structs, then you can split by '}' and use size()-1:
size(split(column,'[}]'))-1
It handles empty strings correctly; NULLs require special handling if you need them converted to 0:
case when column is null then 0 else size(split(column,'[}]'))-1 end
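Applied to the question's table, the query would look something like this (a sketch; the table name my_table is assumed, Id/Column1 as shown above):
select Id,
       case when Column1 is null then 0
            else size(split(Column1, '[}]')) - 1
       end as cnt
from my_table;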

Using MySQL JSON field to join on a table with custom fields

So I made this system to store custom objects with custom fields for an app that I'm developing. First I have object_def where I save the object definitions:
id | name | fields
------------------------------------------------------------
101 | Group 1 | [{"name": "Title", "id": "AbCdE123"}, ...]
102 | Group 2 | [{"name": "Name", "id": "FgHiJ456"}, ...]
So we have id (INT), name (VARCHAR) and fields (LONGTEXT). The fields column holds the object's field definitions, shaped like {id: string, type: string, name: string}[].
Now In the object table, I have this:
id | object_def_id | object_values
------------------------------------------------------------
235 | 101 | {"AbCdE123": "The Object", ... }
236 | 102 | {"FgHiJ456": "John Perez", ... }
Where object_values is a LONGTEXT also. With that system, I'm able to show the objects on a table in my app using JSON.parse().
Now I've learned that there is a JSON type in MySQL and I want to use it for queries and such (I'm really new to this).
I've changed the LONGTEXT columns to JSON, and now I want to do a SELECT that shows the results like this:
#Select objects in group 1:
id | group | Title | ... | other_custom_field
-------------------------------------------------------
235 | Group 1 | The Object | ... | other_custom_value
#Select objects in group 2:
id | group | Name | ... | other_custom_field
-------------------------------------------------------
236 | Group 2 | John Perez | ... | other_custom_value
The id, then the group name (I can get that with an INNER JOIN), and then all the custom fields with their respective values.
Is this possible? How can I achieve this (hopefully without changing my database structure)? I'm learning MySQL, SQL and databases as I go, so I really appreciate your help. Thanks!
Problems I see with your design:
Incorrect JSON format.
[{name: 'Title', id: 'AbCdE123'}, ...]
Should be:
[{"name": "Title", "id": "AbCdE123"}, ...]
You should use the JSON data type instead of LONGTEXT, because JSON will at least reject invalid JSON syntax.
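For example, converting the columns is a one-liner each (a sketch; it assumes every existing value is already valid JSON, otherwise the conversion fails):
ALTER TABLE object_def MODIFY fields JSON;
ALTER TABLE object MODIFY object_values JSON;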
Setting column headings based on data. You can't do this in SQL. Columns and headings must be fixed at the time you prepare the query. You can't do an SQL query that changes its own column headings.
Your object def has an array of attributes, but there's no way in MySQL 5.7 to loop over the "rows" of a JSON array. You'd need the JSON_TABLE() function from MySQL 8.0.
That will get you closer to being able to look up object values, but then you'll still have to pivot the data into the result set you describe, with one attribute in each column, as if the data had been stored in a traditional way. But SQL doesn't allow you to do dynamic pivoting in a single query. You can't make an SQL query that dynamically grows its own select-list based on the data it finds.
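To illustrate the JSON_TABLE() step, here is a sketch (MySQL 8.0+, using the table and column names from your question). It yields one row per attribute rather than one column per attribute, so the pivoting would still be up to your application:
SELECT o.id,
       d.name AS group_name,
       f.field_name,
       JSON_UNQUOTE(JSON_EXTRACT(o.object_values, CONCAT('$."', f.field_id, '"'))) AS field_value
FROM object AS o
JOIN object_def AS d ON d.id = o.object_def_id
CROSS JOIN JSON_TABLE(
         d.fields, '$[*]'
         COLUMNS (field_id   VARCHAR(20)  PATH '$.id',
                  field_name VARCHAR(100) PATH '$.name')
       ) AS f;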
This all makes me wonder...
Why don't you just store the data in the traditional way?
Create a table per object type. Add one column to that table per attribute. That way you get column names. You get column types. You get column constraints — for example, how would you simulate NOT NULL or UNIQUE in your current system?
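A minimal sketch of what that looks like (column names invented for illustration):
CREATE TABLE group1_object (
  id         INT PRIMARY KEY,
  title      VARCHAR(255) NOT NULL,   -- the "Title" custom field becomes a real, typed column
  owner_name VARCHAR(255) UNIQUE      -- constraints such as UNIQUE are now enforceable
);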
If you don't want to use SQL, then don't. There are alternatives, like document databases or key/value databases. But don't torture poor SQL by using it to implement an Inner-Platform.

json single value condition in sql

I have an SQL database field that contains JSON type data stored.
-----------------------------
id | tags |
-----------------------------
1 | ['cat','dog'] |
2 | ['lion','cat','dog'] |
I want to select from this table with a WHERE condition matching 'cat' and get all the JSON fields. How would I do this?
Use the JSON_EXTRACT function, available as of MySQL 5.7.8. The following is an excerpt from
https://dev.mysql.com/doc/refman/5.7/en/json.html#json-paths
A JSON path expression selects a value within a JSON document.
Path expressions are useful with functions that extract parts of or modify a JSON document, to specify where within that document to operate. For example, the following query extracts from a JSON document the value of the member with the name key:
mysql> SELECT JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name');
+---------------------------------------------------------+
| JSON_EXTRACT('{"id": 14, "name": "Aztalan"}', '$.name') |
+---------------------------------------------------------+
| "Aztalan" |
+---------------------------------------------------------+
Note that in the example they are creating the JSON object inline, so for you '{"id": 14, "name": "Aztalan"}' would actually be your column value.
JSON manipulation in MySQL is slow. I suggest that you use REGEXP.
-- get all rows having the 'cat' value
select * from animals_tbl
where tags regexp 'cat';
-- get all rows based on multiple filters
select * from animals_tbl
where tags regexp 'lion|cat';