Problems with importing a JSON tweet into hive - json

I work on the Cloudera QuickStart VM with Docker, and I'm trying to create a table in the Hive interface.
This is my code:
add jar hdfs:///user/cloudera/hive-serdes-1.0-SNAPSHOT.jar
drop table if exists tweets;
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user1:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user1 STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/';
load data inpath '/user/cloudera/search.json' into table tweets;
When I run "select * from tweets;", I get this error:
Fetching results ran into the following error(s):
Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='\xf2e\xcc\xb6v\x8eC"\xae^x\x89*\xd6j\xa7', guid='h\xce\xacgmZIP\x8d\xcc\xc0\xe8C\t\x1a\x0c')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:366', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:275', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:752', 'sun.reflect.GeneratedMethodAccessor19:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606',

Don't use your user folder as a Hive table location. A user folder is meant for general file storage, such as that 2015_11_18 directory Hive is trying to read, not for an entire Hive table's data.
Use LOCATION '/user/cloudera/tweets';, for example, instead.
You could also just make a regular managed table if you don't mind the data being deleted when you drop the table.
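A minimal sketch of that fix, assuming the JSON file is /user/cloudera/search.json as in the question: give the table its own directory, move the file into it, and drop the LOAD DATA step, since an external table reads whatever sits in its location.
# shell: create a dedicated directory and move the JSON file into it
hdfs dfs -mkdir -p /user/cloudera/tweets
hdfs dfs -mv /user/cloudera/search.json /user/cloudera/tweets/
-- Hive: same DDL as above, just pointed at the dedicated directory
CREATE EXTERNAL TABLE tweets ( ... )
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/tweets';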

Related

How to Integrate MySql tables Data To Ksql Stream or Tables?

I am trying to build a data pipeline from MySQL to KSQL.
Use case: the data source is MySQL, where I have created a table.
I am using
./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-jdbc/source-quickstart-sqlite.properties
to start a standalone connector, and it is working fine.
I am starting the consumer with the topic name, i.e.
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1Category --from-beginning
When I insert data into the MySQL table, I get the result in the consumer as well. I have also created a KSQL stream with the same topic name and expected the same result in the stream, but I am not getting any result when I run
select * from <streamName>
Connector configuration: source-quickstart-mysql.properties
name=jdbc_source_mysql
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/testDB?user=root&password=cloudera
#comment=Which table(s) to include
table.whitelist=ftest
mode=incrementing
incrementing.column.name=id
topic.prefix=ftopic
Sample Data
MySQL
1.) Create Database:
CREATE DATABASE testDB;
2.) Use Database:
USE testDB;
3.) create the table:
CREATE TABLE products (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description VARCHAR(512),
weight FLOAT
);
4.) Insert data into the table:
INSERT INTO products(id,name,description,weight)
VALUES (103,'car','Small car',20);
KSQL
1.) Create Stream:
CREATE STREAM pro_original (id int, name varchar, description varchar,weight bigint) WITH \
(kafka_topic='proproducts', value_format='DELIMITED');
2.) Select Query:
Select * from pro_original;
Expected Output
Consumer
The consumer receives the data inserted into the MySQL table, so this part works.
KSQL
The stream should be populated with the data that is inserted into the MySQL table and reflected in the Kafka topic, but I am not getting the expected result in KSQL.
Any help with this data pipeline would be appreciated.
Your data is in Avro format, but for the VALUE_FORMAT you've specified DELIMITED instead of AVRO. It is important to tell KSQL the format of the values stored in the topic. The following should do the trick for you.
CREATE STREAM pro_original_v2 \
WITH (KAFKA_TOPIC='products', VALUE_FORMAT='AVRO');
Data inserted into the Kafka topic should now be visible in your KSQL console window after executing
SELECT * FROM pro_original_v2;
You can have a look at some Avro examples in KSQL here.
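If the stream still comes up empty, it helps to confirm what is actually sitting in the topic before defining the stream. A quick check from the KSQL console (assuming the topic name proproducts from the question's stream):
-- print raw records from the topic; KSQL also reports the detected value format
PRINT 'proproducts' FROM BEGINNING;
If the output shows Avro-decoded records, the AVRO stream definition above is the right one; if it shows plain delimited text, the connector's converter settings deserve a second look.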

bigquery create table from json definition gives STORAGE_FORMAT_UNSPECIFIED error

I want to create a table by cloning the schema of an existing table, editing it by adding some columns and renaming others.
What I did is:
Find the schema of the table to clone:
bq show --format=json $dataset.$from_table | jq -c .schema
Edit it with some scripting, save as a file, e.g. schema.json (here simplified):
schema.json
{"fields":[{"mode":"NULLABLE","name":"project_name","type":"STRING"},
{"mode":"NULLABLE","name":"sample_name","type":"STRING"}]}
Then I attempted to create the new table with the command below:
bq mk --table --external_table_definition=schema.json test-project1:dataset1.table_v1_2_2
But I am getting this error:
BigQuery error in mk operation: Unsupported storage format for external data: STORAGE_FORMAT_UNSPECIFIED
I just want this to be another table of the same type I have in the system, which I believe is Location "Google Cloud BigQuery".
Any ideas?
The problem is that you are using the external_table_definition flag, which is only relevant when you are creating an external table over files on GCS or Drive, for example. A much easier way to create the new table is with a CREATE TABLE ... AS SELECT ... statement. As an example, suppose that I have a table T1 with columns and types
foo: INT64
bar: STRING
baz: BOOL
I want to create a new table that renames bar and changes its type, and that adds a column named id. I can run a query like this:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
If you just want to clone and update the schema without incurring any cost or copying the data, you can use LIMIT 0, e.g.:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
LIMIT 0
Now you'll have a new, empty table with the desired schema.
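If you would rather stay with bq mk, note that for a native table the schema is passed as a positional argument rather than via --external_table_definition, and the schema file should be the bare fields array, not the {"fields": ...} wrapper. A sketch under those assumptions, reusing the names from the question:
# extract just the fields array from the source table's schema
bq show --format=json $dataset.$from_table | jq -c .schema.fields > schema.json
# create the new, empty native table with that schema
bq mk --table test-project1:dataset1.table_v1_2_2 ./schema.json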

create hive table from json

I want to create a Hive table from a JSON array, but I am facing an issue with the top-level array. Can anyone suggest a solution?
My JSON looks like this:
[{"user_id": "a"," previous_user_id": "b"},{"user_id": "c"," previous_user_id": "d"},{"user_id": "e"," previous_user_id": "f"}]
Hive command to create the table:
create external table array_tmp (User array<struct<user_id: String, previous_user_id:String>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
select user.user_id from array_tmp gives this exception:
Row is not a valid JSON Object.
I have added the jar: ADD JAR json-serde-1.3.8-jar-with-dependencies.jar;
Any suggestions?
You may need to make a few changes. The SerDe expects each row to be a JSON object, so a bare top-level array fails with "Row is not a valid JSON Object"; wrap the array in an object instead. Here is an example.
myjson/data.json
{"users":[{"user_id": "a"," previous_user_id": "b"},{"user_id": "c"," previous_user_id": "d"},{"user_id": "e"," previous_user_id": "f"}]}
Now create a Hive table
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
CREATE EXTERNAL TABLE tbl( users array<struct<user_id:string,previous_user_id:string>>)
ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
location '/user/cloudera/myjson';
Do a select
select users.user_id from tbl;
+----------------+--+
| user_id |
+----------------+--+
| ["a","c","e"] |
+----------------+--+
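Note that users.user_id returns an array of values per row, as shown above. If you want one row per array element instead, a LATERAL VIEW explode should do it (a sketch against the tbl table above):
-- explode the users array so each struct becomes its own row
SELECT u.user_id, u.previous_user_id
FROM tbl
LATERAL VIEW explode(users) ex AS u;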

Json Data on hive

I am trying to read JSON data using a Hive external table, but I am getting a NullPointerException while using the JSON SerDe.
Below is the table command and error:
hive> create external table json_tab
> (
> name string, age string, passion string
> )
> row format SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' location '/home/pandi/hive_in';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.NullPointerException
I have added the below jars as well:
add jar /usr/local/apache-hive-2.1.1-bin/lib/hive-contrib-2.1.1.jar;
add jar /usr/local/apache-hive-2.1.1-bin/lib/hive-json-serde.jar;
Please help.
It looks like an issue with the SerDe class.
Try using this implementation instead: 'org.apache.hive.hcatalog.data.JsonSerDe', found in hive-hcatalog-core-0.13.0.jar.
This works for me.
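Applied to the table from the question, that would look something like the sketch below; the jar path is an assumption, since on a Hive 2.1.1 install the hcatalog jar typically sits under $HIVE_HOME/hcatalog/share/hcatalog.
-- assumed jar location under the question's Hive install
add jar /usr/local/apache-hive-2.1.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-2.1.1.jar;
create external table json_tab
(
name string, age string, passion string
)
row format SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' location '/home/pandi/hive_in';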

Hive error on CREATE

I'm following these instructions and have gotten to the point of running Hive. I ran the following commands:
ADD JAR /home/cloudera/Downloads/hive-serdes-1.0-SNAPSHOT.jar
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/home/cloudera/flume/tweets';
and then I encountered an error:
CREATE does not exist
Query returned non-zero code: 1, cause: CREATE does not exist.
As I'm new to Hive, I might be missing something obvious.
What might be causing such an error?
I was getting a similar error on my Hive console while running Hive commands:
create does not exist
Query returned non-zero code: 1, cause: create does not exist
I resolved the problem by changing the "Run as end user instead of Hive user" setting from True to False and restarting the Hive server/clients.
With this setting, my Hive commands ran as the hive user and started working; before the change, they ran as the root user.
This is a Hive setting issue. Please restart your Hive console and check your hive-jdbc and Hadoop version compatibility. Hope this solves your issue, as the query itself looks fine.
The problem is that you didn't put a ; at the end of the first statement, so Hive reads the CREATE line as another resource to add and fails with "CREATE does not exist".
You need to change this:
ADD JAR /home/cloudera/Downloads/hive-serdes-1.0-SNAPSHOT.jar
Into this:
ADD JAR /home/cloudera/Downloads/hive-serdes-1.0-SNAPSHOT.jar;
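With the semicolon in place, you can sanity-check that the jar was registered for the session before running the CREATE statement:
-- lists the resources added to the current Hive session
LIST JARS;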