Hive External Table exclude records that violate data type - json

I have an external table in Hive that uses a SerDe to process JSON records. Occasionally a value will not match the data type in the table DDL, e.g. the table field is defined as INT but the JSON has a string value. During query execution Hive correctly throws this error for the type mismatch:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing writable
Is there a way to set Hive to just ignore these records that have data type violations?
Note the JSON is syntactically valid, so setting SerDe properties like ignore.malformed.json is not applicable.
Example DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS test_tbl (
acd INT,
tzo INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
;
ALTER TABLE test_tbl SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
Example data - the record with tzo = "alpha" will cause the error:
{"acd":6,"tzo":4}
{"acd":6,"tzo":7}
{"acd":6,"tzo":"alpha"}

You can set up Hive to tolerate a configurable number of failures.
SET mapred.skip.mode.enabled = true;
SET mapred.map.max.attempts = 100;
SET mapred.reduce.max.attempts = 100;
SET mapred.skip.map.max.skip.records = 30000;
SET mapred.skip.attempts.to.start.skipping = 1;
This is not Hive specific and can be applied to ordinary MapReduce as well.

I don't think there is a way to handle this in Hive yet. You may need an intermediate step using MR, Pig, etc. to make sure the data is sound, and then load from that cleaned result (see the sketch below).
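For example, here is a minimal pre-cleaning sketch (PySpark and the paths here are my own assumptions; any MR/Pig/Spark job would do): it keeps only records whose acd and tzo parse as JSON integers and writes the clean lines to a directory the external table can point at.
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean_test_tbl_json").getOrCreate()

def has_valid_types(line):
    # Keep only records where both fields are real JSON integers.
    try:
        rec = json.loads(line)
        return isinstance(rec.get("acd"), int) and isinstance(rec.get("tzo"), int)
    except ValueError:
        return False

# Read each line as plain text so a single bad record cannot fail the job.
raw = spark.read.text("/data/raw/test_tbl").rdd.map(lambda row: row.value)
raw.filter(has_valid_types).saveAsTextFile("/data/clean/test_tbl")  # point test_tbl here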
There may be a configuration parameter you could use here:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-SerDes
You may also be able to write your own exception handler to catch the error and continue, by specifying your custom handler with hive.io.exception.handlers.
Or, if you are OK with storing the data as an ORC file instead of a text file, you can specify the ORC file format with HiveQL statements such as these:
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
And then when you run your jobs you can use the skip setting:
set hive.exec.orc.skip.corrupt.data=true;

Related

PyFlink Error/Exception: "Hive Table doesn't support consuming update changes which is produced by node PythonGroupAggregate"

Using Flink 1.13.1 with PyFlink, a user-defined table aggregate function (UDTAGG), and Hive tables as source and sink, I've been encountering this error:
pyflink.util.exceptions.TableException: org.apache.flink.table.api.TableException:
Table sink 'myhive.mydb.flink_tmp_model' doesn't support consuming update changes
which is produced by node PythonGroupAggregate
This is the SQL CREATE TABLE for the sink:
table_env.execute_sql(
    """
    CREATE TABLE IF NOT EXISTS flink_tmp_model (
        run_id STRING,
        model_blob BINARY,
        roc_auc FLOAT
    ) PARTITIONED BY (dt STRING) STORED AS parquet TBLPROPERTIES (
        'sink.partition-commit.delay'='1 s',
        'sink.partition-commit.policy.kind'='success-file'
    )
    """
)
What's wrong here?
I imagine you are executing a streaming query that is doing some sort of aggregation that requires updating previously emitted results. The parquet/hive sink does not support this -- once results are written, they are final.
One solution would be to execute the query in batch mode. Another would be to use a sink (or a format) that can handle updates. Or modify the query so that it only produces final results -- e.g., a time-windowed aggregation rather than an unbounded one.
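For the batch-mode option, a minimal sketch (assuming the rest of the job stays the same) just builds the table environment in batch mode, so the aggregation emits only final rows that the parquet/Hive sink can accept:
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch mode: no update/retract stream is produced, only final results.
env_settings = EnvironmentSettings.new_instance().in_batch_mode().build()
table_env = TableEnvironment.create(env_settings)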

Problem dropping Hive table from pyspark script

I have a table in Hive created from many JSON files using the hive-json-serde method, WITH SERDEPROPERTIES ('dots.in.keys' = 'true'), since some keys there have a dot in them, like `aaa.bbb`. I created the external table and used backticks for these keys. Now I have a problem dropping this table from a pyspark script using sqlContext.sql("DROP TABLE IF EXISTS " + table_name); I'm getting this error message:
An error occurred while calling o63.sql.
: org.apache.spark.SparkException: Cannot recognize hive type string: struct<associations:struct<aaa.bbb:array<string> ...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 33)
== SQL ==
struct<associations:struct<aaa.bbb:array<string>,...
---------------------------------^^^
In Hue I can drop this table without any problem. Am I doing it wrong, or maybe there is a better way to do it?
It looks like it is not possible to work with Hive tables created with the hive-json-serde method and dots in keys using sqlContext.sql("...") from a pyspark script, as I wanted. I always get the same error, whether I want to drop such a Hive table or create it (I haven't tried other things yet). So my workaround is to use Python's os.system() and execute the required query through Hive itself:
import os

q = 'hive -e "DROP TABLE IF EXISTS ' + table_name + ';"'
os.system(q)
It's more complicated with a CREATE TABLE query, as we need to escape the backticks with '\':
statement = ("CREATE TABLE test111 (testA struct<\\`aa.bb\\`:string>) "
             "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
             "LOCATION 's3a://bucket/test111';")
q = 'hive -e "' + statement + '"'
os.system(q)
It outputs some additional Hive-related info, but it works!
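As a variant of the same workaround (not something from the original answer), subprocess lets you pass the statement as a separate argument, so no shell is involved and quotes/backticks need no extra escaping:
import subprocess

statement = "DROP TABLE IF EXISTS " + table_name + ";"
# The argument list bypasses the shell, so the statement passes through as-is.
subprocess.run(["hive", "-e", statement], check=True)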

S3 to MySQL AWS Data Pipeline Insert table error

It's my first time asking a question on here, so please bear with me.
I am trying to create a data pipeline to upload a CSV file from an S3 bucket to a MySQL database table (Production1) using the template provided by AWS, but it fails when executing RdsMySqlTableCreateActivity.
The SQL statement that I'm using (all column names match the CSV file) in the myRDSTableInsertSql parameter:
INSERT INTO `Production1` (`API`, `Normalized Month`, `DATE`, `Monthly Liquid`, `Cum Oil`, `BOPD`, `Monthly Gas Mcf/Month`, `Cum Gas`, `MCFPD`) VALUES(?,?,?,?,?,?,?,?,?);
The RdsMySqlTableCreateActivity error:
errorId
ActivityFailed:SQLException
errorMessage
No value specified for parameter 1
errorStackTrace
amazonaws.datapipeline.taskrunner.TaskExecutionException:
private.com.amazonaws.services.datapipeline.redshift.QueryStatementException: Exception No value specified for
parameter 1 while executing INSERT INTO `Production1` (`API`, `Normalized Month`, `DATE`, `Monthly Liquid`, `Cum Oil`, `BOPD`, `Monthly Gas Mcf/Month`, `Cum Gas`, `MCFPD`) VALUES(?,?,?,?,?,?,?,?,?);...
I ran the insert command in MySQL Workbench, replacing the (?,?,?,?,?,?,?,?,?) with (1,2,3,4,5,6,7,8,9), and it worked. The CSV file that I'm using only has 2 rows: the column names, and the values 1-9 for each column respectively. I'm really not sure what it means by "No value specified for parameter 1"; any help/guidance would really be appreciated!
For anyone who runs into the same issue using the "Load S3 data into RDS MySQL table" template:
My values for each parameter were the following.
myRDSTableInsertSql:
INSERT INTO tableName(`col_name1`, `col_name2`, `col_name3`, `col_name4`, `col_name5`, `col_name6`, `col_name7`, `col_name8`, `col_name9`) VALUES(?,?,?,?,?,?,?,?,?);
myRDSTableName: tableName
myRDSCreateTableSql:
CREATE TABLE tableName(`col_name1` type, `col_name2` type, `col_name3` type, `col_name4` type, `col_name5` type, `col_name6` type, `col_name7` type, `col_name8` type, `col_name9` type);
The main issue was with the actual CSV file format: you have to make sure there is no header row and that the types match exactly. Also make sure that your separators are "," and that the values are not quoted within your CSV file.
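For instance, with the nine columns above, a conforming file contains only data lines like this (no header row, plain commas, no quoting):
1,2,3,4,5,6,7,8,9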
This template is a good starting point, but for more detailed/complex CSV files, making your own data pipeline is a must!

Read a json file with 12 nested level into hive in AZURE hdinsights

I tried to create a schema for the JSON file manually and then create a Hive table, and I am getting:
column type name length 10888 exceeds max allowed length 2000.
I am guessing I have to change the metastore settings, but I am not sure where that config is located in Azure HDInsight.
The other way I tried was to get the schema from a Spark DataFrame and create the table from a temp view, but I still get the same error.
These are the steps I tried in Spark:
val tne1 = sc.wholeTextFiles("wasb:path").map(x => x._2)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tne2 = sqlContext.read.json(tne1)
tne2.createOrReplaceTempView("my_temp_table")
sqlContext.sql("create table s ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true') as select * from my_temp_table")
I am getting the error in this step:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: InvalidObjectException(message:Invalid column type name length 5448 exceeds max allowed length 2000, type struct
When I try to persist or create the RDD I get the schema, but only in a formatted (truncated) view. Even if I could get the full view, I might be able to extract the schema from it.
I added the following property through Ambari > Hive > Configs > Advanced > Custom hive-site:
hive.metastore.max.typename.length=14000
Now I am able to create tables with column type names up to 14000 characters long.
I was able to fix this problem by running the command below before my CREATE TABLE statement. You can set it to whatever limit fits your schema definition; I made mine extra large.
Note that you have to do this again for each Hive session.
set hive.metastore.max.typename.length=11000;

Tables created in Snappy shell do not show up in JDBC or Pulse

SnappyData v0.5
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
I create a table in SnappyData using the shell and a CSV file.
Data is here (roads.csv):
"roadId","name"
"1","Road 1"
"2","Road 2"
"3","Road 3"
"4","Road 4"
"5","Road 5"
"6","Road 6"
"7","Road 7"
"8","Road 8"
"9","Road 9"
"10","Road 10"
==========================================================
snappy> CREATE TABLE STAGING_ROADS
(road_id string, name string)
USING com.databricks.spark.csv
OPTIONS(path '/home/ubuntu/data/example/roads.csv', header 'true');
snappy> select * from STAGING_ROADS
Returns 10 rows.
I have a SnappyData JDBC connection (DBVisualizer & SquirrelSQL show the same). I cannot see that table in the "TABLES" list from the metadata. However, if I do a "select * from STAGING_ROADS", it returns 10 rows of CLOBs, which by the way are completely unusable:
road_id | name
=====================
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
Second, the Pulse web app does not register that I created the table when I did it from the snappy> shell. However, if I run a CREATE TABLE command from the JDBC client, it shows up there fine.
Am I doing something incorrectly? How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
This is a known issue (https://jira.snappydata.io/browse/SNAP-303). The JDBC metadata shows only the items in the store and not the external table. While the metadata issue is being tracked, the Pulse web app will not be able to see such external tables, since it is designed to monitor the SnappyData store.
A note: the "CREATE TABLE" DDL has been changed to "CREATE EXTERNAL TABLE" (https://github.com/SnappyDataInc/snappydata/pull/311) for sources outside of the store, to make things clearer.
How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?
They will show up for internal SnappyData sources (column and row tables). For other providers in USING, they will not show up, as mentioned.
CSV tables are usually only useful for loading data into column or row tables, as in the example provided by @jagsr.
I don't think creating a table using SQL with spark-csv as the data source has been tested. Here is a related JIRA: https://jira.snappydata.io/browse/SNAP-416.
We have been suggesting that folks use a Spark job to load the data in parallel. You can do this from the spark-shell as well:
val stagingRoadsDataFrame = snappyContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load(someFile)

// Save the DataFrame as a row table
stagingRoadsDataFrame.write.format("row").options(props).saveAsTable("staging_roads")
That said, could you try this (perhaps it might work):
CREATE TABLE STAGING_ROADS (road_id varchar(100), name varchar(500))
Note that there is no 'string' data type in SQL. By default, with no knowledge of the max length, we convert it to a CLOB. We are working to resolve this issue too.