I'm using the new JSON column type in ClickHouse, which was added in version 22.3.
There is a great blog post about it on the ClickHouse website: https://clickhouse.com/blog/clickhouse-newsletter-april-2022-json-json-json/
I'm trying to add unstructured JSON, where the document type isn't known until it's inserted. I've been using Postgres with JSONB and Snowflake with VARIANT for this, and it's been working great.
With ClickHouse (v22.4.5.9, current as of 2022-05-14), here is what I'm doing:
-- We need to enable this flag to use JSON, as it's currently (as of 2022-05-14) experimental.
set allow_experimental_object_type = 1;
-- Create an example table for our testing, we can use the Memory engine as it'll be tiny.
create table example_json (
json_data json
)
engine = Memory();
-- Now let's insert two different JSON documents. Usually this would be batched, but for the sake of this
-- example, let's just use two inserts.
INSERT INTO example_json VALUES ('{"animal": "dog"}');
-- Returns ('dog'), great.
select * from example_json;
-- Returns "dog", even cooler.
select json_data.animal from example_json;
-- Now we want to change the shape of the values
INSERT INTO example_json VALUES ('{"name": "example", "animal": {"breed": "cat"}}');
This throws the following error:
Code: 15. DB::Exception: Data in Object has ambiguous paths: 'animal.breed' and 'animal'. (DUPLICATE_COLUMN) (version 22.4.5.9 (official build))
I think that under the hood ClickHouse is converting the keys to columns with inferred types, but it won't change a column's type if a conflicting type is inserted later?
Is there a way to insert JSON like this into ClickHouse?
It is precisely as you described. ClickHouse will try to infer the types of all the columns on INSERT - JSON is internally represented as a tuple.
You can check what type is currently inferred by running:
SET describe_extend_object_types=1;
DESCRIBE example_json;
You will see that this table already has a column called animal, hence ClickHouse will report it as a duplicate:
DESCRIBE TABLE example_json
SETTINGS describe_extend_object_types = 1
Query id: 884a9a85-d883-45b9-8c90-f957a39a995e
┌─name──────┬─type─────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ json_data │ Tuple(animal String) │ │ │ │ │ │
└───────────┴──────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
1 row in set. Elapsed: 0.001 sec.
You can find more details about this here: https://clickhouse.com/docs/en/guides/developer/working-with-json/json-semi-structured#handling-data-changes
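If the goal is simply to land arbitrarily shaped documents, a common fallback (a sketch of the usual pattern, not something taken from the linked page) is to store the raw document as a String column and pull fields out at query time with ClickHouse's JSONExtract* functions, so nothing is inferred at insert time:
-- Keep the raw document as a String; extract fields on read.
CREATE TABLE example_json_raw (
    json_data String
)
ENGINE = Memory();
INSERT INTO example_json_raw VALUES ('{"animal": "dog"}');
INSERT INTO example_json_raw VALUES ('{"name": "example", "animal": {"breed": "cat"}}');
-- Scalar and nested values can coexist; missing paths come back as empty strings.
SELECT
    JSONExtractString(json_data, 'animal')          AS animal_scalar,
    JSONExtractString(json_data, 'animal', 'breed') AS animal_breed
FROM example_json_raw;
The trade-off is that you lose the typed subcolumn access that the experimental JSON Object type provides.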
Can you try the parse_json function within Snowflake to insert values into the JSON table?
https://docs.snowflake.com/en/sql-reference/functions/parse_json.html#examples
Basically, try this DML:
INSERT INTO example_json SELECT parse_json($$ {"name": "example", "animal": {"breed": "cat"}} $$);
Using Flink 1.13.1 with PyFlink and a user-defined table aggregate function (UDTAGG), with Hive tables as source and sink, I've been encountering this error:
pyflink.util.exceptions.TableException: org.apache.flink.table.api.TableException:
Table sink 'myhive.mydb.flink_tmp_model' doesn't support consuming update changes
which is produced by node PythonGroupAggregate
This is the SQL CREATE TABLE for the sink:
table_env.execute_sql(
"""
CREATE TABLE IF NOT EXISTS flink_tmp_model (
run_id STRING,
model_blob BINARY,
roc_auc FLOAT
) PARTITIONED BY (dt STRING) STORED AS parquet TBLPROPERTIES (
'sink.partition-commit.delay'='1 s',
'sink.partition-commit.policy.kind'='success-file'
)
"""
)
What's wrong here?
I imagine you are executing a streaming query that is doing some sort of aggregation that requires updating previously emitted results. The parquet/hive sink does not support this -- once results are written, they are final.
One solution would be to execute the query in batch mode. Another would be to use a sink (or a format) that can handle updates. Or modify the query so that it only produces final results -- e.g., a time-windowed aggregation rather than an unbounded one (see the sketch below).
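As a rough sketch of the last option (the names here are made up: an events source with an event-time attribute ts, a key user_id, and an append-only sink table), a tumbling-window aggregation emits each result row exactly once when the window closes, so the parquet/Hive sink only ever receives inserts:
-- Hypothetical windowed query; the sink table must exist with a matching schema.
INSERT INTO flink_tmp_model_windowed
SELECT
    user_id,
    COUNT(*) AS cnt,
    TUMBLE_END(ts, INTERVAL '1' HOUR) AS window_end
FROM events
GROUP BY
    user_id,
    TUMBLE(ts, INTERVAL '1' HOUR)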
I want to create a table by cloning the schema of an existing table, editing it to add some columns and rename others.
What I did is:
Find the schema of the table to clone:
bq show --format=json $dataset.$from_table | jq -c .schema
Edit it with some scripting, save as a file, e.g. schema.json (here simplified):
schema.json
{"fields":[{"mode":"NULLABLE","name":"project_name","type":"STRING"},
{"mode":"NULLABLE","name":"sample_name","type":"STRING"}]}
Then I attempted to create the new table with the command below:
bq mk --table --external_table_definition=schema.json test-project1:dataset1.table_v1_2_2
But I am getting this error:
BigQuery error in mk operation: Unsupported storage format for external data: STORAGE_FORMAT_UNSPECIFIED
I just want this to be another table of the same type I already have in the system, which I believe is shown as Location "Google Cloud BigQuery".
Any ideas?
The problem is that you are using the external_table_definition flag, which is only relevant when you are creating an external table over files on GCS or Drive, for example. A much easier way to create the new table is to use a CREATE TABLE ... AS SELECT ... statement. As an example, suppose that I have a table T1 with these columns and types:
foo: INT64
bar: STRING
baz: BOOL
I want to create a new table that renames bar and changes its type, and adds a column named id. I can run a query like this:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
If you just want to clone and update the schema without incurring any cost or copying the data, you can use LIMIT 0, e.g.:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
LIMIT 0
Now you'll have a new, empty table with the desired schema.
I have an external table in Hive that uses a SerDe to process JSON records. Occasionally there will be a value that does not match the data type in the table DDL, e.g. the table field is defined as INT but the JSON has a string value. During query execution Hive will correctly throw this error for the type mismatch:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing writable
Is there a way to set Hive to just ignore records that have data type violations?
Note that the JSON is syntactically valid, so SerDe properties like ignore.malformed.json are not applicable.
Example DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS test_tbl (
acd INT,
tzo INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
;
ALTER TABLE test_tbl SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
Example data - the record with "tzo": "alpha" will cause the error:
{"acd":6,"tzo":4}
{"acd":6,"tzo":7}
{"acd":6,"tzo":"alpha"}
You can set up Hive to tolerate a configurable number of failures:
SET mapred.skip.mode.enabled = true;
SET mapred.map.max.attempts = 100;
SET mapred.reduce.max.attempts = 100;
SET mapred.skip.map.max.skip.records = 30000;
SET mapred.skip.attempts.to.start.skipping = 1;
This is not Hive specific and can be applied to ordinary MapReduce as well.
I don't think there is a way to handle this in Hive yet. You may need an intermediate step using MapReduce, Pig, etc. to make sure the data is sound, and then load from that result.
There may be a configuration parameter you could use here:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-SerDes
I'm thinking you may be able to write your own exception handler to catch that and continue, by specifying your custom handler with hive.io.exception.handlers, along these lines:
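Untested sketch (the handler class name is made up; you would have to implement it yourself and put the jar on Hive's classpath):
-- Hypothetical custom handler registered for the session:
ADD JAR /path/to/my-handlers.jar;
SET hive.io.exception.handlers=com.example.SkipBadRecordsHandler;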
Or, if you are OK storing the data as an ORC file instead of a text file, you can specify the ORC file format with HiveQL statements such as these:
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
And then when you run your jobs you can use the skip setting:
set hive.exec.orc.skip.corrupt.data=true
I can't find any information about JSON Schema validation in PostgreSQL. Is there any way to implement JSON Schema validation on the PostgreSQL JSON data type?
There is another PostgreSQL extension that implements JSON validation. The usage is almost the same as "Postgres-JSON-schema":
CREATE TABLE example (id serial PRIMARY KEY, data jsonb);
-- use is_jsonb_valid instead of validate_json_schema
ALTER TABLE example ADD CONSTRAINT data_is_valid CHECK (is_jsonb_valid('{"type": "object"}', data));
INSERT INTO example (data) VALUES ('{}');
-- INSERT 0 1
INSERT INTO example (data) VALUES ('1');
-- ERROR: new row for relation "example" violates check constraint "data_is_valid"
-- DETAIL: Failing row contains (2, 1).
I've done some benchmarking validating tweets and it is 20x faster than "Postgres-JSON-schema", mostly because it is written in C instead of SQL.
Disclaimer: I've written this extension.
There is a PostgreSQL extension that implements JSON Schema validation in PL/PgSQL.
It is used like this (taken from the project README file):
CREATE TABLE example (id serial PRIMARY KEY, data jsonb);
ALTER TABLE example ADD CONSTRAINT data_is_valid CHECK (validate_json_schema('{"type": "object"}', data));
INSERT INTO example (data) VALUES ('{}');
-- INSERT 0 1
INSERT INTO example (data) VALUES ('1');
-- ERROR: new row for relation "example" violates check constraint "data_is_valid"
-- DETAIL: Failing row contains (2, 1).
What you need is something to translate JSON Schema constraints into PostgreSQL ones, e.g.:
{
"properties": {
"age": {"minimum": 21}
},
"required": ["age"]
}
to:
SELECT FROM ...
WHERE ((elem->>'age')::int >= 21)
I'm not aware of any existing tools. I know of something similar for MySQL which might be useful for writing your own, but nothing for using the JSON type in PostgreSQL.
I'm trying to use the MySQL SET type in PostgreSQL, but I only found arrays, which have quite similar functionality but don't meet my requirements.
Does PostgreSQL have a similar data type?
You can use the following workarounds:
1. BIT strings
You can define your set of maximum N elements as simply BIT(N).
It is a little bit awkward to populate and retrieve - you will have to use bit masks as set members. But bit strings really shine for set operations: intersection is simply &, union is |.
This type is stored very efficiently - bit per bit with small overhead for length.
Also, it is nice that the length is not really limited (but you have to decide it upfront).
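A minimal sketch of the BIT-string idea, assuming a set of three possible members mapped to fixed positions ('one' -> leftmost bit, 'two' -> middle bit, 'three' -> rightmost bit):
-- Each column value is the whole set, one bit per possible member.
CREATE TABLE bit_set_demo (
    flags bit(3) NOT NULL DEFAULT B'000'
);
-- 'one' and 'three' selected:
INSERT INTO bit_set_demo (flags) VALUES (B'101');
-- Set operations are plain bit arithmetic: intersection is &, union is |.
SELECT flags & B'100' <> B'000' AS has_one,
       flags | B'010'           AS with_two_added
FROM bit_set_demo;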
2. HSTORE
HSTORE type is an extension, but very easy to install. Simply executing
CREATE EXTENSION hstore
for most installations (9.1+) will make it available. Rumor has it that PostgreSQL 9.3 will have HSTORE as standard type.
It is not really a set type, but more like a Perl hash or a Python dictionary: it keeps an arbitrary set of key => value pairs.
With that, it is not very efficient (certainly not BIT-string efficient), but it does provide functions essential for sets: || for union; intersection is a little bit awkward, use
slice(a,akeys(b)) || slice(b,akeys(a))
You can read more about HSTORE here.
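And a minimal sketch of hstore used as a set, where the keys are the members and the values are just placeholders:
-- hstore ships as a contrib extension.
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE hstore_set_demo (
    members hstore NOT NULL DEFAULT ''::hstore
);
-- Member set {one, two}; the values ('1') carry no meaning here.
INSERT INTO hstore_set_demo (members) VALUES (hstore(ARRAY['one','two'], ARRAY['1','1']));
-- Union is ||, membership is ?, intersection uses the slice()/akeys() trick above.
SELECT members || hstore('three', '1')           AS with_three,
       members ? 'one'                           AS has_one,
       slice(members, akeys(hstore('two', '1'))) AS intersect_with_two
FROM hstore_set_demo;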
What about an array with a check constraint:
create table foobar
(
myset text[] not null,
constraint check_set
check ( array_length(myset,1) <= 2
and (myset = array[''] or 'one'= ANY(myset) or 'two' = ANY(myset))
)
);
This would match the definition of SET('one', 'two') as explained in the MySQL manual.
The only thing that this would not do is "normalize" the array. So
insert into foobar values (array['one', 'two']);
and
insert into foobar values (array['two', 'one']);
would be displayed differently than in MySQL (where both would wind up as 'one','two')
The check constraint will however get messy with more than 3 or 4 elements.
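If the member list grows, a hedged variant of the same idea keeps the constraint readable by using the array containment operator <@ ("is contained by"); note it only restricts membership, not the length or uniqueness of the array:
create table foobar2
(
    myset text[] not null,
    constraint check_set
        check ( myset <@ array['one', 'two', 'three'] )
);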
Building on a_horse_with_no_name's answer above, I would suggest something just a little more complex:
CREATE FUNCTION set_check(in_value anyarray, in_check anyarray)
RETURNS BOOL LANGUAGE SQL IMMUTABLE AS
$$
WITH basic_check AS (
select bool_and(v = any($2)) as condition, count(*) as ct
FROM unnest($1) v
GROUP BY v
), length_check AS (
SELECT count(*) = 0 as test FROM unnest($1)
)
SELECT bool_and(condition AND ct = 1)
FROM basic_check
UNION
SELECT test from length_check where test;
$$;
Then you should be able to do something like:
CREATE TABLE set_test (
my_set text[] CHECK (set_check(my_set, array['one'::text,'two']))
);
This works:
postgres=# insert into set_test values ('{}');
INSERT 0 1
postgres=# insert into set_test values ('{one}');
INSERT 0 1
postgres=# insert into set_test values ('{one,two}');
INSERT 0 1
postgres=# insert into set_test values ('{one,three}');
ERROR: new row for relation "set_test" violates check constraint "set_test_my_set_check"
postgres=# insert into set_test values ('{one,one}');
ERROR: new row for relation "set_test" violates check constraint "set_test_my_set_check"
Note this assumes that for your set, every value must be unique (we are talking sets here). The function should perform very well and should meet your needs. Moreover, it has the advantage of handling sets of any size.
Storage-wise it is completely different from MySQL's implementation. It will take up more space on disk, but it should handle sets with as many members as you like, provided you aren't running up against storage limits. So this should offer a superset of MySQL's functionality. One significant difference, though, is that it does not collapse the array into distinct values; it just prohibits duplicates. If you need that too, look at a trigger.
This solution also leaves the ordering of the input data intact, so '{one,two}' is distinct from '{two,one}'. If you need that behavior changed as well, you may want to look into exclusion constraints on PostgreSQL 9.2.
Are you looking for enumerated data types?
PostgreSQL 9.1 Enumerated Types
From reading the page referenced in the question, it seems like a SET is a way of storing up to 64 named boolean values in one column. PostgreSQL does not provide a way to do this. You could use independent boolean columns, or some size of integer and twiddle the bits directly. Adding two new tables (one for the valid names, and the other to join names to detail rows) could make sense, especially if you might need to associate other data with individual values.
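A quick sketch of that two-table approach (names are illustrative): one table holds the valid member names, and a join table ties members to the detail rows; the composite primary key gives you set semantics for free.
CREATE TABLE set_members (
    member_id   serial PRIMARY KEY,
    member_name text UNIQUE NOT NULL
);
CREATE TABLE detail (
    detail_id serial PRIMARY KEY,
    payload   text
);
CREATE TABLE detail_set (
    detail_id int REFERENCES detail (detail_id),
    member_id int REFERENCES set_members (member_id),
    PRIMARY KEY (detail_id, member_id)  -- each member at most once per detail row
);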
Some time ago I wrote a similar extension:
https://github.com/okbob/Enumset
but it is not complete.
Something more complete and closer to MySQL is the functionality from pltoolbox:
http://okbob.blogspot.cz/2010/12/bitmapset-for-plpgsql.html
http://pgfoundry.org/frs/download.php/3203/pltoolbox-1.0.2.tar.gz
http://postgres.cz/wiki/PL_toolbox_%28en%29
The find_in_set function can be emulated via arrays (see the sketch below):
http://okbob.blogspot.cz/2009/08/mysql-functions-for-postgresql.html
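For example, a small sketch of the find_in_set emulation with arrays (array_position requires PostgreSQL 9.5+; on older versions the ANY membership test still works):
-- Position of a member in the "set", like FIND_IN_SET('two', 'one,two,three') in MySQL:
SELECT array_position(ARRAY['one','two','three']::text[], 'two');  -- returns 2
-- Plain membership test:
SELECT 'two' = ANY (ARRAY['one','two','three']);                   -- returns true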