I want to create a Hive table from a JSON array, but I am having trouble with the top-level array. Can anyone suggest a solution?
My JSON data looks like this:
[{"user_id": "a"," previous_user_id": "b"},{"user_id": "c"," previous_user_id": "d"},{"user_id": "e"," previous_user_id": "f"}]
Hive command to create the table:
create external table array_tmp (User array<struct<user_id: String, previous_user_id:String>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
Running select user.user_id from array_tmp throws this exception:
Row is not a valid JSON Object.
I have added the jar with ADD JAR json-serde-1.3.8-jar-with-dependencies.jar;
Any suggestions?
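You may need to make a few changes. As the error message says, the SerDe expects each row to be a JSON object, so wrap the top-level array in an object under a key. Here is an example.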
myjson/data.json
{"users":[{"user_id": "a"," previous_user_id": "b"},{"user_id": "c"," previous_user_id": "d"},{"user_id": "e"," previous_user_id": "f"}]}
Now create a Hive table
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
CREATE EXTERNAL TABLE tbl( users array<struct<user_id:string,previous_user_id:string>>)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
location '/user/cloudera/myjson';
Do a select
select users.user_id from tbl;
+----------------+--+
|    user_id     |
+----------------+--+
| ["a","c","e"]  |
+----------------+--+
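If the source file cannot be changed by hand, the wrapping step can be scripted. The following is a minimal sketch, not part of the original answer: the function names and the "users" key are illustrative assumptions, and the second function shows an alternative layout (one JSON object per line, so each array element becomes its own Hive row).

```python
import json


def wrap_top_level_array(text, key="users"):
    """Wrap a top-level JSON array in an object and return it as one line.

    The key name ("users") is an assumption; it must match the Hive column.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without indent emits a single line, i.e. one Hive row.
    return json.dumps({key: records})


def explode_to_lines(text):
    """Alternative: one JSON object per line, one Hive row per element."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    source = '[{"user_id": "a", "previous_user_id": "b"}, {"user_id": "c", "previous_user_id": "d"}]'
    print(wrap_top_level_array(source))
    print(explode_to_lines(source))
```

With the second layout the table would declare user_id and previous_user_id as plain string columns instead of an array of structs.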
I am trying to build a data pipeline from MySQL to KSQL.
Use case: the data source is MySQL. I have created a table in MySQL.
I am using
./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-jdbc/source-quickstart-sqlite.properties
to start a standalone connector, and it is working fine.
I start a console consumer on the topic:
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1Category --from-beginning
When I insert data into the MySQL table, it shows up in the consumer as well. I have also created a KSQL stream on the same topic and expected to see the same data there, but I get no results when I run
select * from <streamName>
Connector configuration (source-quickstart-mysql.properties):
name=jdbc_source_mysql
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/testDB?user=root&password=cloudera
#comment=Which table(s) to include
table.whitelist=ftest
mode=incrementing
incrementing.column.name=id
topic.prefix=ftopic
Sample Data
MySQL
1.) Create the database:
CREATE DATABASE testDB;
2.) Use the database:
USE testDB;
3.) Create the table:
CREATE TABLE products (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description VARCHAR(512),
weight FLOAT
);
4.) Insert data into the table:
INSERT INTO products(id,name,description,weight)
VALUES (103,'car','Small car',20);
KSQL
1.) Create Stream:
CREATE STREAM pro_original (id int, name varchar, description varchar,weight bigint) WITH \
(kafka_topic='proproducts', value_format='DELIMITED');
2.) Select Query:
Select * from pro_original;
Expected Output
Consumer
The consumer should show the data inserted into the MySQL table.
This works: I am getting the data inserted in MySQL.
KSQL
The stream should be populated with the data that is inserted into the MySQL table and reflected in the Kafka topic.
I am not getting the expected result in KSQL.
Please help me with this data pipeline.
Your data is in Avro format, but you have defined VALUE_FORMAT as DELIMITED instead of AVRO. KSQL needs to be told the format of the values stored in the topic. The following should do the trick; with AVRO, KSQL reads the column definitions from the Schema Registry, so they are not declared here.
CREATE STREAM pro_original_v2 \
WITH (KAFKA_TOPIC='products', VALUE_FORMAT='AVRO');
Data inserted into the Kafka topic after executing
SELECT * FROM pro_original_v2;
should now be visible in your KSQL console window.
You can have a look at some Avro examples in KSQL here.
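It is also worth checking that KAFKA_TOPIC names the topic the connector actually writes to. The sketch below assumes the JDBC source connector's default naming, where each table goes to a topic called topic.prefix followed by the table name; the helper names are hypothetical and the parsing covers only simple key=value lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as JDBC URLs stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def expected_topics(props):
    """Topic names under the assumed default: topic.prefix + table name."""
    prefix = props.get("topic.prefix", "")
    whitelist = props.get("table.whitelist", "")
    tables = [name.strip() for name in whitelist.split(",") if name.strip()]
    return [prefix + table for table in tables]


if __name__ == "__main__":
    config = """
    connection.url=jdbc:mysql://localhost:3306/testDB?user=root&password=cloudera
    #comment=Which table(s) to include
    table.whitelist=ftest
    topic.prefix=ftopic
    """
    print(expected_topics(parse_properties(config)))
```

Comparing this result with the topic passed to the console consumer and to CREATE STREAM helps rule out a simple name mismatch.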
I want to create a table by cloning the schema of an existing table, editing it by adding some columns, renaming others.
What I did is:
Find the schema of the table to clone:
bq show --format=json $dataset.$from_table | jq -c .schema
Edit it with some scripting, save as a file, e.g. schema.json (here simplified):
schema.json
{"fields":[{"mode":"NULLABLE","name":"project_name","type":"STRING"},
{"mode":"NULLABLE","name":"sample_name","type":"STRING"}]}
Then attempting to create the new table with the command below:
bq mk --table --external_table_definition=schema.json test-project1:dataset1.table_v1_2_2
But I am getting this error:
BigQuery error in mk operation: Unsupported storage format for external data: STORAGE_FORMAT_UNSPECIFIED
I just want this to be another table of the same type I have in the system, which I believe is Location "Google Cloud BigQuery".
Any ideas?
The problem is that you are using the external_table_definition flag, which is only relevant if you are creating an external table over files on GCS or Drive for example. A much easier way to go about creating the new table is to use a CREATE TABLE ... AS SELECT ... statement. As an example, suppose that I have a table T1 with columns and types
foo: INT64
bar: STRING
baz: BOOL
I want to create a new table that renames bar and changes its type, and with the addition of a column named id. I can run a query like this:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
If you just want to clone and update the schema without incurring any cost or copying the data, you can use LIMIT 0, e.g.:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
LIMIT 0
Now you'll have a new, empty table with the desired schema.
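If you would rather keep the schema-file approach from the question, the "edit it with some scripting" step might look like the sketch below. It is an illustration only: the function name, the rename mapping and the added column are hypothetical, and it assumes that bq mk takes the schema file as a plain JSON array of field definitions passed after the table name (bq mk --table dataset.table schema.json) rather than through --external_table_definition; check this against the bq documentation for your version.

```python
import json


def edit_schema(schema, renames=None, extra_fields=None):
    """Return the field list of a `bq show` schema with renames and additions.

    schema       -- dict shaped like {"fields": [...]}, as printed by bq show
    renames      -- hypothetical mapping of old column name to new column name
    extra_fields -- hypothetical list of field definitions to append
    """
    renames = renames or {}
    fields = []
    for field in schema["fields"]:
        updated = dict(field)  # copy, so the input schema is left untouched
        updated["name"] = renames.get(field["name"], field["name"])
        fields.append(updated)
    fields.extend(extra_fields or [])
    return fields


if __name__ == "__main__":
    original = {"fields": [
        {"mode": "NULLABLE", "name": "project_name", "type": "STRING"},
        {"mode": "NULLABLE", "name": "sample_name", "type": "STRING"},
    ]}
    new_fields = edit_schema(
        original,
        renames={"sample_name": "sample_id"},
        extra_fields=[{"mode": "NULLABLE", "name": "id", "type": "STRING"}],
    )
    # Write only the array of fields, not the {"fields": ...} wrapper.
    with open("schema.json", "w") as handle:
        json.dump(new_fields, handle, indent=2)
```

Unlike the CREATE TABLE ... AS SELECT route, this only defines the columns; no data is copied from the source table.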
I am working with the Cloudera QuickStart image on Docker, and I'm trying to create a table through the Hive interface.
This is my code:
add jar hdfs:///user/cloudera/hive-serdes-1.0-SNAPSHOT.jar;
drop table if exists tweets;
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user1:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user1 STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/';
load data inpath '/user/cloudera/search.json' into table tweets;
When I run "select * from tweets;", I get this error:
Fetching results ran into the following error(s):
Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='\xf2e\xcc\xb6v\x8eC"\xae^x\x89*\xd6j\xa7', guid='h\xce\xacgmZIP\x8d\xcc\xc0\xe8C\t\x1a\x0c')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:366', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:275', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:752', 'sun.reflect.GeneratedMethodAccessor19:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606',
Don't use your user folder as a Hive table location. A user folder is meant for general file storage and can contain subdirectories, such as the 2015_11_18 directory Hive is trying to read as a data file; it should not double as the storage for a Hive table.
Use a dedicated directory instead, for example LOCATION '/user/cloudera/tweets';
You could also just create a regular managed table if you don't mind the data being deleted when you drop the table.
I tried to create a schema for the JSON file manually and then create a Hive table, and I am getting
column type name length 10888 exceeds max allowed length 2000.
I am guessing I have to change the metastore settings, but I am not sure where the config is located in Azure HDInsight.
The other approach I tried:
I got the schema from a Spark DataFrame and tried to create a table from the view, but I still get the same error.
These are the steps I tried in Spark:
val tne1 = sc.wholeTextFiles("wasb:path").map(x=>x._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tne2 = sqlContext.read.json(tne1)
tne2.createOrReplaceTempView("my_temp_table");
sqlContext.sql("create table s ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true') as select * from my_temp_table")
I am getting the error in this step:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: InvalidObjectException(message:Invalid column type name length 5448 exceeds max allowed length 2000, type struct
When I persist or create the RDD, I get the schema, but only in a formatted view. If I could get the full view, I might be able to extract the schema.
I added the following property through Ambari > Hive > Configs > Advanced > Custom hive-site:
hive.metastore.max.typename.length=14000
and now I am able to create a table with column type names up to 14000 characters long.
I was able to fix this problem by running the command below before my create table statement. You can set it to whatever limit fits your schema definition; I made mine extra large.
Note that you have to do this again for each Hive session.
set hive.metastore.max.typename.length=11000;
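To pick a sensible value for the limit, it can help to estimate how long each column's type string will be. The sketch below is a rough illustration, not what Spark or Hive compute: the JSON-to-Hive type mapping and both function names are assumptions, and it infers types from a single sample record.

```python
def hive_type(value):
    """Build an approximate Hive type string for a sample JSON value.

    The mapping (int -> bigint, float -> double, ...) is an illustrative
    assumption; the types Spark or Hive infer may differ.
    """
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        inner = ",".join(f"{key}:{hive_type(item)}" for key, item in value.items())
        return f"struct<{inner}>"
    if isinstance(value, list):
        element = hive_type(value[0]) if value else "string"
        return f"array<{element}>"
    return "string"


def oversize_columns(record, limit=2000):
    """Return {column: type string length} for columns longer than limit."""
    lengths = {name: len(hive_type(value)) for name, value in record.items()}
    return {name: size for name, size in lengths.items() if size > limit}


if __name__ == "__main__":
    sample = {"id": 1, "payload": {f"field_{i}": "x" for i in range(300)}}
    print(oversize_columns(sample))
```

Any column reported here needs hive.metastore.max.typename.length set above its reported length.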
I have an external table in Hive that uses a SerDe to process JSON records. Occasionally a value does not match the data type in the table DDL, e.g. the table field is defined as int but the JSON has a string value. During query execution Hive correctly throws this error because of the type mismatch:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing writable
Is there a way to set Hive to just ignore these records that have data type violations?
Note that the JSON is syntactically valid, so SerDe properties that ignore malformed JSON are not applicable.
Example DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS test_tbl (
acd INT,
tzo INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
;
ALTER TABLE test_tbl SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
Example data - the TZO = alpha record will cause the error:
{"acd":6,"tzo":4}
{"acd":6,"tzo":7}
{"acd":6,"tzo":"alpha"}
You can set up Hive to tolerate a configurable number of failures.
SET mapred.skip.mode.enabled = true;
SET mapred.map.max.attempts = 100;
SET mapred.reduce.max.attempts = 100;
SET mapred.skip.map.max.skip.records = 30000;
SET mapred.skip.attempts.to.start.skipping = 1;
This is not Hive specific and can be applied to ordinary MapReduce as well.
I don't think there is a way to handle this in Hive yet. You may need an intermediate step using MR, Pig, etc. to make sure the data is sound, and then load from that result.
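Such an intermediate cleaning step does not have to be a full MR or Pig job. As a minimal sketch, assuming the two-column schema from the question (acd and tzo, both int) and hypothetical function names, a script could split the input into records that fit the DDL and records that do not before the data reaches the table location:

```python
import json

# Expected Python type per column, mirroring the example DDL (an assumption).
EXPECTED_TYPES = {"acd": int, "tzo": int}


def matches_schema(record, expected=EXPECTED_TYPES):
    """True if every present, non-null field has the expected type."""
    for field, expected_type in expected.items():
        value = record.get(field)
        if value is None:
            continue  # missing or null values are left for Hive to read as NULL
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    return True


def split_records(lines):
    """Split JSON lines into (good, bad) lists based on the expected types."""
    good, bad = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(line)
            continue
        if isinstance(record, dict) and matches_schema(record):
            good.append(line)
        else:
            bad.append(line)
    return good, bad


if __name__ == "__main__":
    data = ['{"acd":6,"tzo":4}', '{"acd":6,"tzo":7}', '{"acd":6,"tzo":"alpha"}']
    good, bad = split_records(data)
    print("good:", good)
    print("bad:", bad)
```

Keeping the rejected lines in a separate file makes it possible to inspect or repair them later instead of silently dropping them.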
There may be a configuration parameter here that you could use:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-SerDes
You may also be able to write your own exception handler to catch that error and continue, by specifying your custom handler with hive.io.exception.handlers.
Alternatively, if you are OK with storing the data as ORC instead of a text file, you can specify the ORC file format with HiveQL statements such as these:
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
Then, when you run your jobs, you can use the skip setting:
set hive.exec.orc.skip.corrupt.data=true;