Environment: Impala version 4.1
Scenario: I have a table that has an id column and a JSON column, like so:
id  | json_col
----+--------------------------------------------------------------------
111 | [{"m_id":"896de41d","name":"DE"},{"m_id":"194a3028","name":"Free"}]
222 | [{"m_id":"687c6baa","name":"Texti"},{"m_id":"194a3028","name":"Default"},{"m_id":"896de41d","name":"Parcel"}]
Is there a function that dynamically extracts the m_id and name attributes into rows per id?
I tried the GET_JSON_OBJECT function but I'm not sure if it has that ability.
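A minimal sketch of that attempt (the table name my_table is a placeholder); it can only pull out a fixed array element, not one row per element, which is exactly the limitation I'm hitting:

-- only extracts the first array element; does not explode the array into rows
SELECT id,
       GET_JSON_OBJECT(json_col, '$[0].m_id') AS m_id,
       GET_JSON_OBJECT(json_col, '$[0].name') AS name
FROM my_table;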
That question got no answer, and I have a similar one, though I'll expand on it.
Suppose I have 3 CSV files in s3://test_path/. I want to create an external table and populate it with the data in these CSVs. However, not only does column order differ across CSVs, but some columns may be missing from some CSVs.
Is Redshift Spectrum capable of doing what I want?
a.csv:
id,name,type
a1,apple,1
a2,banana,2
b.csv:
type,id,name
1,b1,orange
2,b2,lemon
c.csv:
name,id
kiwi,c1
I create the external database/schema and table by running this in Redshift query editor v2 on my Redshift cluster:
CREATE EXTERNAL SCHEMA test_schema
FROM DATA CATALOG
DATABASE 'test_db'
REGION 'region'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS
;
CREATE EXTERNAL TABLE test_schema.test_table (
"id" VARCHAR,
"name" VARCHAR,
"type" SMALLINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
I expect SELECT * FROM test_schema.test_table to yield:
id | name   | type
---+--------+------
a1 | apple  | 1
a2 | banana | 2
b1 | orange | 1
b2 | lemon  | 2
c1 | kiwi   | NULL
Instead I get:
id   | name   | type
-----+--------+------
a1   | apple  | 1
a2   | banana | 2
1    | b1     | NULL
2    | b2     | NULL
kiwi | c1     | NULL
It seems Redshift Spectrum cannot match columns by name across files the way pandas.concat() can with data frames with differing column order.
No, your data needs to be transformed so that the columns align between files. Your Spectrum DDL specifies that the first row of each CSV is skipped, so the header information you would need isn't even being read.
If you want these files usable as one Spectrum table, you will need to transform them to align the columns and write new files to S3. You can do this with Redshift if you also have a supporting piece of code that reads the column order from each file. You could write a Lambda function to do this easily, or, if your CSV files are fairly simple, a Glue crawler will work. Almost any ETL tool can do this as well. There are lots of choices, but these files are not Spectrum-ready as is.
I have a table with a JSON field (example):
# table1
id | json_column
---+------------------------
1 | {'table2_ids':[1,2,3], 'sone_other_data':'foo'}
---+------------------------
2 | {'foo_data':'bar', 'table2_ids':[3,5,11]}
And
# table2
id | title
---+------------------------
1 | title1
---+------------------------
2 | title2
---+------------------------
...
---+------------------------
11 | title11
Yes, I know about storing the many-to-many relation in a third table. But that duplicates data (in the first case the relations live in json_column, in the second they live in the third table).
I know about generated columns in MySQL, but I don't understand how to use them to store m2m relations. Maybe I should use views to get pairs of table1.id <-> table2.id. But how would an index be used in that case?
I can't understand your explanation for why you can't use a third table to represent the many-to-many pairs. Using a third table is of course the best solution.
I think views have no relevance to this problem.
You could use JSON_EXTRACT() to access individual members of the array. You can use a generated column to pull each member out so you can easily reference it as an individual value.
create table table1 (
id int auto_increment primary key,
json_column json,
first_table2_id int as (json_extract(json_column, '$.table2_ids[0]'))
);
insert into table1 set json_column = '{"table2_ids":[1,2,3], "sone_other_data":"foo"}';
(You must use double-quotes inside a JSON string, and single-quotes to delimit the whole JSON string.)
select * from table1;
+----+-----------------------------------------------------+-----------------+
| id | json_column | first_table2_id |
+----+-----------------------------------------------------+-----------------+
| 1 | {"table2_ids": [1, 2, 3], "sone_other_data": "foo"} | 1 |
+----+-----------------------------------------------------+-----------------+
But this is still a problem: In SQL, the table must have the columns defined by the table metadata, and all rows therefore have the same columns. There is no such thing as each row populating additional columns based on its data.
So you need to create another extra column for each potential member of the array of table2_ids. If the array has fewer elements than the number of columns, JSON_EXTRACT() will fill in NULL when the expression returns nothing.
alter table table1 add column second_table2_id int as (json_extract(json_column, '$.table2_ids[1]'));
alter table table1 add column third_table2_id int as (json_extract(json_column, '$.table2_ids[2]'));
alter table table1 add column fourth_table2_id int as (json_extract(json_column, '$.table2_ids[3]'));
I'll query using vertical output, so the columns will be easier to read:
select * from table1\G
*************************** 1. row ***************************
id: 1
json_column: {"table2_ids": [1, 2, 3], "sone_other_data": "foo"}
first_table2_id: 1
second_table2_id: 2
third_table2_id: 3
fourth_table2_id: NULL
This is going to get very awkward. How many columns do you need? That depends on the maximum length of the table2_ids array.
If you need to search for rows in table1 that reference some specific table2 id, which column should you search? Any of the columns may have that value.
select * from table1
where first_table2_id = 2
or second_table2_id = 2
or third_table2_id = 2
or fourth_table2_id = 2;
You could put an index on each of these generated columns, but the optimizer won't use them.
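For reference, a sketch of adding those indexes (assuming MySQL 5.7+, which supports secondary indexes on generated columns); as noted, they won't help the OR query above:

-- the indexes can be created, but the OR query across all four columns won't benefit
alter table table1 add index idx_first_table2_id (first_table2_id);
alter table table1 add index idx_second_table2_id (second_table2_id);
alter table table1 add index idx_third_table2_id (third_table2_id);
alter table table1 add index idx_fourth_table2_id (fourth_table2_id);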
These are some reasons why storing comma-separated lists is a bad idea, even inside a JSON string, if you need to reference individual elements.
The better solution is to use a traditional third table to store the many-to-many data. Each value is stored on its own row, so you don't need many columns or many indexes. You can search one column if you need to look up references to a given value.
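A minimal sketch of such a join table (the name table1_table2 matches the query below; the foreign keys assume integer primary keys on table1 and table2):

create table table1_table2 (
table1_id int not null,
table2_id int not null,
primary key (table1_id, table2_id),
index (table2_id), -- supports lookups by table2_id
foreign key (table1_id) references table1 (id),
foreign key (table2_id) references table2 (id) -- assumes table2.id is its primary key
);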
select * from table1_table2 where table2_id = 2;
I have a Postgres table that contains a jsonb column, the data in which is arbitrarily deep.
id | jsonb_data
---|----------------------
1 | '{"a":1}'
2 | '{"a":1,"b":2}'
3 | '{"a":1,"b":2,"c":{"d":4}}'
Given a JSON object in my WHERE clause, I want to find the rows whose objects contain the same data and no more, in any order, preferably including nested objects.
SELECT * FROM table
WHERE json_match_ignore_order(jsonb_data, '{"b":2,"a":1}');
id | jsonb_data
---|-----------
2 | '{"a":1,"b":2}'
This would essentially work identically to the following Ruby code, but I'd really like to do it in the database if possible.
table.select { |row| row.jsonb_data_as_a_hash == {b: 2, a: 1} }
How can I do this?
With the jsonb type you can use the equals sign, even for values with nested objects.
Thus the following will also work:
create table jsonb_table(
id serial primary key,
jsonb_data jsonb
);
insert into jsonb_table(jsonb_data)
values
('{"a":1}'),
('{"a":{"c":5},"b":2}'),
('{"a":{"c":5},"b":2,"c":{"d":4}}');
select * from jsonb_table
where jsonb_data = '{"b":2,"a":{"c":5}}'::jsonb;
You will get the rows whose objects contain the same keys with the same values, compared recursively (in this case, only the second row).
I have the following nested types defined in postgres:
CREATE TYPE address AS (
name text,
street text,
zip text,
city text,
country text
);
CREATE TYPE customer AS (
customer_number text,
created timestamp WITH TIME ZONE,
default_billing_address address,
default_shipping_address address
);
I would now like to populate these types in a stored procedure that gets JSON as an input parameter. This works for top-level fields; the output shows the internal format of a Postgres composite type:
# select json_populate_record(null::customer, '{"customer_number":"12345678"}'::json)::customer;
json_populate_record
----------------------
(12345678,,,)
(1 row)
However, postgres does not handle a nested json structure:
# select json_populate_record(null::customer, '{"customer_number":"12345678","default_shipping_address":{"name":"","street":"","zip":"12345","city":"Berlin","country":"DE"}}'::json)::customer;
ERROR: malformed record literal: "{"name":"","street":"","zip":"12345","city":"Berlin","country":"DE"}"
DETAIL: Missing left parenthesis.
What does work, again, is if the nested property is already in Postgres's internal record format, like here:
# select json_populate_record(null::customer, '{"customer_number":"12345678","default_shipping_address":"(\"\",\"\",12345,Berlin,DE)"}'::json)::customer;
json_populate_record
--------------------------------------------
(12345678,,,"("""","""",12345,Berlin,DE)")
(1 row)
Is there any way to get postgres to convert from a nested json structure to a corresponding composite type?
Use json_populate_record() only for nested objects:
with a_table(jdata) as (
values
('{
"customer_number":"12345678",
"default_shipping_address":{
"name":"",
"street":"",
"zip":"12345",
"city":"Berlin",
"country":"DE"
}
}'::json)
)
select (
jdata->>'customer_number',
jdata->>'created',
json_populate_record(null::address, jdata->'default_billing_address'),
json_populate_record(null::address, jdata->'default_shipping_address')
)::customer
from a_table;
row
--------------------------------------------
(12345678,,,"("""","""",12345,Berlin,DE)")
(1 row)
Nested composite types are not what Postgres (or any RDBMS) was designed for. They are too complicated and troublesome.
In database logic, nested structures should be maintained as related tables, e.g.
create table addresses (
address_id serial primary key,
name text,
street text,
zip text,
city text,
country text
);
create table customers (
customer_id serial primary key, -- does not have to be `serial`; `integer` or `bigint` also work
customer_number text, -- maybe redundant
created timestamp with time zone,
default_billing_address int references addresses(address_id),
default_shipping_address int references addresses(address_id)
);
Sometimes it is reasonable to have a nested structure in a table, but in those cases it is usually more convenient and natural to use jsonb or hstore, e.g.:
create table customers (
customer_id serial primary key,
customer_number text,
created timestamp with time zone,
default_billing_address jsonb,
default_shipping_address jsonb
);
plpython to the rescue:
create function to_customer (object json)
returns customer
AS $$
import json
# PL/Python receives the json argument as a string; json.loads turns it into a
# dict whose keys are mapped onto the columns of the composite type "customer"
return json.loads(object)
$$ language plpythonu;
Example:
select to_customer('{
"customer_number":"12345678",
"default_shipping_address":
{
"name":"",
"street":"",
"zip":"12345",
"city":"Berlin",
"country":"DE"
},
"default_billing_address":null,
"created": null
}'::json);
to_customer
--------------------------------------------
(12345678,,,"("""","""",12345,Berlin,DE)")
(1 row)
Warning: when building the returned object from Python, PostgreSQL requires all null values to be present as None (i.e. it is not allowed to skip null values by leaving keys out), so we have to specify all null values in the incoming JSON. For example, this is not allowed:
select to_customer('{
"customer_number":"12345678",
"default_shipping_address":
{
"name":"",
"street":"",
"zip":"12345",
"city":"Berlin",
"country":"DE"
}
}'::json);
ERROR: key "created" not found in mapping
HINT: To return null in a column, add the value None to the mapping with the key named after the column.
CONTEXT: while creating return value
PL/Python function "to_customer"
This seems to be solved in Postgres 10. Searching the release notes for json_populate_record shows the following change:
Make json_populate_record() and related functions process JSON arrays and objects recursively (Nikita Glukhov)
With this change, array-type fields in the destination SQL type are properly converted from JSON arrays, and composite-type fields are properly converted from JSON objects. Previously, such cases would fail because the text representation of the JSON value would be fed to array_in() or record_in(), and its syntax would not match what those input functions expect.
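With that change, the nested call from the question should work directly on Postgres 10 or later (a sketch using the same customer and address types defined above):

-- on Postgres 10+ the nested object is converted recursively instead of
-- raising "malformed record literal"
select json_populate_record(
null::customer,
'{"customer_number":"12345678","default_shipping_address":{"name":"","street":"","zip":"12345","city":"Berlin","country":"DE"}}'::json
)::customer;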