Dynamic Partitioning + CREATE TABLE AS on Hive

I'm trying to create a new table from another table with CREATE TABLE ... AS SELECT and dynamic partitioning on the Hive CLI. I'm learning from the official Hive wiki, which gives this example:
CREATE TABLE T (key int, value string)
PARTITIONED BY (ds string, hr int) AS
SELECT key, value, ds, hr+1 hr1
FROM srcpart
WHERE ds IS NOT NULL
  AND hr > 10;
But I received this error:
FAILED: SemanticException [Error 10065]:
CREATE TABLE AS SELECT command cannot specify the list of columns for the target table
Source: https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions#DynamicPartitions-Syntax

Since you already know the full schema of the target table, try creating it first and then populating it with an INSERT OVERWRITE ... SELECT statement:
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE T (key int, value string)
PARTITIONED BY (ds string, hr int);

INSERT OVERWRITE TABLE T PARTITION (ds, hr)
SELECT key, value, ds, hr+1 AS hr
FROM srcpart
WHERE ds IS NOT NULL
  AND hr > 10;
Note: the SET command is needed because this is a full dynamic-partition insert (both ds and hr are dynamic); in the default strict mode, at least one partition column must be static. Depending on your Hive version you may also need SET hive.exec.dynamic.partition=true;.

In the above code, if the new table should have the same columns and partitioning as the source, you can replace the CREATE TABLE statement with: CREATE TABLE T LIKE srcpart;
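For reference, a minimal sketch of that variant (assuming, as the wiki example implies, that srcpart has columns key and value and is partitioned by ds and hr):

SET hive.exec.dynamic.partition.mode=nonstrict;

-- copies srcpart's full schema, including its (ds, hr) partitioning
CREATE TABLE T LIKE srcpart;

INSERT OVERWRITE TABLE T PARTITION (ds, hr)
SELECT key, value, ds, hr+1 AS hr
FROM srcpart
WHERE ds IS NOT NULL
  AND hr > 10;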

Related

How to specify columns for a SQLAlchemy Insert object with from_select?

I'm using a SQLAlchemy insert object to quickly insert a bunch of data from another table. The schemas are as follows:
create table master (id serial, name varchar);
create table mapping (id serial, new_name varchar, master_id integer);
-- master_id is a foreign key back to the master table, id column
I populate my master table with unique names and IDs. I then want my mapping table to be seeded with data from this master table. The SQL would be:
insert into mapping (master_id, new_name) select id, name from master;
I use the following SQLAlchemy statement. The problem is that SQLAlchemy can't seem to resolve the names, because they differ between the two tables.
stmt = sa_mapping_table.insert().from_select(['name', 'id'], stmt)
Is there a way to tell the insert object, "using this select statement select these columns and put the results in these columns of the target table"?
I think you are close, but you should specify the columns of mapping into which the select from master is inserted. This should work, where master_t and mapping_t are the SQLAlchemy Table() objects:
master_t = Table('master', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String, nullable=False))

mapping_t = Table('mapping', metadata,
    Column('id', Integer, primary_key=True),
    Column('new_name', String, nullable=False),
    Column('master_id', Integer, ForeignKey('master.id'), nullable=False))

# ...

with engine.connect() as conn, conn.begin():
    select_q = select(master_t.c.id, master_t.c.name)
    stmt = mapping_t.insert().from_select(["master_id", "new_name"], select_q)
    conn.execute(stmt)
This generates the following SQL:
INSERT INTO mapping (master_id, new_name) SELECT master.id, master.name
FROM master
See the SQLAlchemy docs on insert-from-select.

Rebuild a Hive table with different definitions

I want to build an automatic system to help me map Hive tables.
I have a SQL table with metadata: tableID, fieldName, fieldType, description, lastUpdated.
I want to update my tables automatically
where lastUpdated = CURDATE() - INTERVAL '1' DAY
But I have no indication of what change was made - it could be a new column, a renamed column, or even a description update.
Is there a way to "define" a table all over again when it already exists, so that all the changes I want to make are executed at once (all change types)?
for instance - I have a table that was defined like this:
CREATE EXTERNAL TABLE IF NOT EXISTS tableA (`a` string, `b` int, `c` int)
PARTITIONED BY (dt date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'File/Path';
And the change is that column `b` is now of type string. Is there a (generic) update/alter query that I can write:
*SomeCommand* tableA (`a` string, `b` string, `c` int)
and my column will be updated?
Same if I have a new column - d, type: float.
*SomeCommand* tableA (`a` string, `b` string, `c` int, `d` float)
I need one command that can cover these options, please. Or, if you have another good idea on how to do this, I would really appreciate it...
Thank you!
You can use ALTER TABLE ... REPLACE COLUMNS. It does exactly what you asked: it replaces all the columns at once. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns
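For example, using the table and columns from the question, applying both changes at once (`b` becomes string, `d` is added as float):

-- replaces the entire column list in one statement
ALTER TABLE tableA REPLACE COLUMNS (`a` string, `b` string, `c` int, `d` float);

Note that REPLACE COLUMNS only rewrites the table metadata, not the underlying files, so the existing data must still be readable under the new column types.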

postgres force json datatype

When working with the JSON datatype, is there a way to ensure the input JSON has certain elements? I don't mean a primary key; I want any JSON that gets inserted to have at least the id and name elements. It can have more, but at the minimum id and name must be there.
thanks
This function checks what you want:
create or replace function json_has_id_and_name(val json)
returns boolean language sql as $$
    select coalesce(
        (
            select array['id', 'name'] <@ array_agg(key)
            from json_object_keys(val) key
        ),
        false)
$$;
select json_has_id_and_name('{"id":1, "name":"abc"}'), json_has_id_and_name('{"id":1}');
 json_has_id_and_name | json_has_id_and_name
----------------------+----------------------
 t                    | f
(1 row)
You can use it in a check constraint, e.g.:
create table my_table (
    id int primary key,
    jdata json check (json_has_id_and_name(jdata))
);
insert into my_table values (1, '{"id":1}');
ERROR: new row for relation "my_table" violates check constraint "my_table_jdata_check"
DETAIL: Failing row contains (1, {"id":1}).
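If you can use jsonb instead of json (PostgreSQL 9.4 or later), a helper function isn't needed at all: the built-in ?& operator tests whether all of the listed keys exist. A sketch of the equivalent constraint:

create table my_table (
    id int primary key,
    -- ?& returns true only if every key in the array is present
    jdata jsonb check (jdata ?& array['id', 'name'])
);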

Hive insert into struct data type using a query

I have a use case where I have a table a. I want to select data from it, group by some fields, do some aggregations, and insert the result into another Hive table b that has a struct as one of its columns. I am facing some difficulty with it. Can someone please help and tell me what's wrong with my queries?
CREATE EXTERNAL TABLE IF NOT EXISTS a (
    date string,
    acct string,
    media string,
    id1 string,
    val INT
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'folder1/folder2/';
ALTER TABLE a ADD IF NOT EXISTS PARTITION (day='{DATE}') LOCATION 'folder1/folder2/Date={DATE}';
CREATE EXTERNAL TABLE IF NOT EXISTS b (
    date string,
    acct string,
    media string,
    st1 STRUCT<id1:STRING, val:INT>
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'path/';
FROM a
INSERT OVERWRITE TABLE b PARTITION (day='{DATE}')
SELECT date, acct, media, named_struct('id1', id1, 'val', sum(val))
WHERE day='{DATE}' AND media IS NOT NULL AND acct IS NOT NULL AND NOT (id1 = "0")
GROUP BY date, acct, media, id1;
The error I got:
SemanticException [Error 10044]: Line 3:31 Cannot insert into target table because column number/types are different ''2015-07-16'': Cannot convert column 4 from struct<id1:string,val:bigint> to struct<id1:string,val:int>.
sum() returns a BIGINT, not an INT, so declare
st1 STRUCT<id1:STRING, val:BIGINT>
instead of
st1 STRUCT<id1:STRING, val:INT>
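Alternatively, if you want to keep val as an INT in table b, a sketch of the same query that casts the aggregate instead (only the named_struct call changes):

FROM a
INSERT OVERWRITE TABLE b PARTITION (day='{DATE}')
-- CAST brings the BIGINT aggregate back to the INT the struct declares
SELECT date, acct, media, named_struct('id1', id1, 'val', CAST(sum(val) AS INT))
WHERE day='{DATE}' AND media IS NOT NULL AND acct IS NOT NULL AND NOT (id1 = "0")
GROUP BY date, acct, media, id1;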

Insert DataFrame into SQL table with AUTO_INCREMENT column

I have a MySQL table which includes a column that is AUTO_INCREMENT:
CREATE TABLE features (
    id INT NOT NULL AUTO_INCREMENT,
    name CHAR(30),
    value DOUBLE PRECISION
);
I created a DataFrame and wanted to insert it into this table.
case class Feature(name: String, value: Double)
val rdd: RDD[Feature]
val df = rdd.toDF()
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://...", "features", new Properties)
I get the error, Column count doesn’t match value count at row 1. If I delete the id column it works. How could I insert this data into the table without changing the schema?
You have to include an id field in the DataFrame, but its value will be ignored and replaced with the auto-incremented ID. That is:
case class Feature(id: Int, name: String, value: Double)
Then just set id to 0 when you create a Feature: with MySQL's default SQL mode, inserting 0 (or NULL) into an AUTO_INCREMENT column makes the server generate the next value, whereas an arbitrary nonzero id would be stored literally.
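For illustration, a sketch of what happens on the MySQL side (the row values here are made up):

-- Spark's JDBC writer emits an INSERT listing every DataFrame column;
-- with id = 0, MySQL substitutes the next AUTO_INCREMENT value instead of storing 0
INSERT INTO features (id, name, value) VALUES (0, 'height', 1.83);
-- the stored row gets id = 1 (or whatever the counter's next value is), not 0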