How to integrate MySQL table data into a KSQL stream or table?

I am trying to build a data pipeline from MySQL to KSQL.
Use case: the data source is MySQL, and I have created a table there.
I am using
./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-jdbc/source-quickstart-sqlite.properties
to start a standalone connector, and it is working fine.
I am starting the consumer with the topic name, i.e.
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1Category --from-beginning
When I insert data into the MySQL table, I see it in the consumer as well. I have also created a KSQL stream with the same topic name and expect the same data to appear in the stream, but I get no results when I run
select * from <streamName>
Connector configuration (source-quickstart-mysql.properties):
name=jdbc_source_mysql
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/testDB?user=root&password=cloudera
#comment=Which table(s) to include
table.whitelist=ftest
mode=incrementing
incrementing.column.name=id
topic.prefix=ftopic
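With this configuration the JDBC source connector writes to a topic named with topic.prefix followed by the table name, i.e. ftopicftest here. Before creating the stream it is worth confirming from the KSQL CLI that this topic exists and is receiving records (a quick sanity check, assuming the connector and the KSQL server are running):
SHOW TOPICS;
PRINT 'ftopicftest' FROM BEGINNING;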
Sample Data
MySQL
1.) Create Database:
CREATE DATABASE testDB;
2.) Use Database:
USE testDB;
3.) create the table:
CREATE TABLE products (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description VARCHAR(512),
weight FLOAT
);
4.) Insert data into the table:
INSERT INTO products(id,name,description,weight)
VALUES (103,'car','Small car',20);
KSQL
1.) Create Stream:
CREATE STREAM pro_original (id int, name varchar, description varchar,weight bigint) WITH \
(kafka_topic='proproducts', value_format='DELIMITED');
2.) Select Query:
Select * from pro_original;
Expected Output
Consumer
The consumer should receive the data inserted into the MySQL table, and it does.
KSQL
The stream should be populated with the data that is inserted into the MySQL table and already reflected in the Kafka topic.
However, I am not getting the expected result in KSQL.
Please help me with this data pipeline.

Your data is in Avro format, but in VALUE_FORMAT you have defined DELIMITED instead of AVRO. It is important to tell KSQL the format of the values stored in the topic. The following should do the trick for you.
CREATE STREAM pro_original_v2 \
WITH (KAFKA_TOPIC='products', VALUE_FORMAT='AVRO');
Data inserted into the Kafka topic should now be visible in your KSQL console window when you execute
SELECT * FROM pro_original_v2;
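If the query still returns nothing, keep in mind that KSQL queries read from the latest offset by default, so rows produced before the query started are skipped. Setting the offset reset to earliest in the KSQL CLI replays the topic from the start:
SET 'auto.offset.reset' = 'earliest';
SELECT * FROM pro_original_v2;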
You can have a look at some Avro examples in KSQL here.

Related

PyFlink Error/Exception: "Hive Table doesn't support consuming update changes which is produced by node PythonGroupAggregate"

Using Flink 1.13.1 and PyFlink with a user-defined table aggregate function (UDTAGG), with Hive tables as source and sink, I've been encountering this error:
pyflink.util.exceptions.TableException: org.apache.flink.table.api.TableException:
Table sink 'myhive.mydb.flink_tmp_model' doesn't support consuming update changes
which is produced by node PythonGroupAggregate
This is the SQL CREATE TABLE statement for the sink:
table_env.execute_sql(
"""
CREATE TABLE IF NOT EXISTS flink_tmp_model (
run_id STRING,
model_blob BINARY,
roc_auc FLOAT
) PARTITIONED BY (dt STRING) STORED AS parquet TBLPROPERTIES (
'sink.partition-commit.delay'='1 s',
'sink.partition-commit.policy.kind'='success-file'
)
"""
)
What's wrong here?
I imagine you are executing a streaming query that is doing some sort of aggregation that requires updating previously emitted results. The parquet/hive sink does not support this -- once results are written, they are final.
One solution would be to execute the query in batch mode. Another would be to use a sink (or a format) that can handle updates. Or modify the query so that it only produces final results -- e.g., a time-windowed aggregation rather than an unbounded one.
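For that last option, a minimal sketch in Flink SQL (the source table scores, its event-time column ts, and the selected columns are hypothetical placeholders): a tumbling-window aggregation emits each window's result exactly once when the window closes, so an append-only parquet/Hive sink can consume it.
-- Hypothetical source: scores(run_id STRING, roc_auc FLOAT, ts TIMESTAMP(3) with a watermark).
-- Each hourly window produces one final row per run_id, so no update changes reach the sink.
SELECT
  run_id,
  AVG(roc_auc) AS roc_auc,
  window_start
FROM TABLE(
  TUMBLE(TABLE scores, DESCRIPTOR(ts), INTERVAL '1' HOUR))
GROUP BY run_id, window_start, window_end;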

bigquery create table from json definition gives STORAGE_FORMAT_UNSPECIFIED error

I want to create a table by cloning the schema of an existing table, then editing it by adding some columns and renaming others.
What I did is:
Find the schema of the table to clone:
bq show --format=json $dataset.$from_table | jq -c .schema
Edit it with some scripting, save as a file, e.g. schema.json (here simplified):
schema.json
{"fields":[{"mode":"NULLABLE","name":"project_name","type":"STRING"},
{"mode":"NULLABLE","name":"sample_name","type":"STRING"}]}
Then I attempt to create the new table with the command below:
bq mk --table --external_table_definition=schema.json test-project1:dataset1.table_v1_2_2
But I am getting this error:
BigQuery error in mk operation: Unsupported storage format for
external data: STORAGE_FORMAT_UNSPECIFIED
I just want this to be another table of the same type as the ones I already have in the system, which I believe have the location "Google Cloud BigQuery".
Any ideas?
The problem is that you are using the external_table_definition flag, which is only relevant if you are creating an external table over files on GCS or Drive, for example. A much easier way to create the new table is to use a CREATE TABLE ... AS SELECT ... statement. As an example, suppose that I have a table T1 with columns and types:
foo: INT64
bar: STRING
baz: BOOL
I want to create a new table that renames bar, changes its type, and adds a column named id. I can run a query like this:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
If you just want to clone and update the schema without incurring any cost or copying the data, you can use LIMIT 0, e.g.:
CREATE TABLE dataset.T2 AS
SELECT
foo,
CAST(bar AS TIMESTAMP) AS fizz,
baz,
GENERATE_UUID() AS id
FROM dataset.T1
LIMIT 0
Now you'll have a new, empty table with the desired schema.
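Alternatively, if you have already worked out the edited schema by hand, you can spell it out directly in DDL rather than deriving it from a query; a sketch using the column names from the example above:
CREATE TABLE dataset.T2 (
  foo INT64,
  fizz TIMESTAMP,
  baz BOOL,
  id STRING
);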

Problems with importing a JSON tweet into hive

I work on the Cloudera QuickStart image with Docker, and I'm trying to create a table through the Hive interface.
This is my code:
add jar hdfs:///user/cloudera/hive-serdes-1.0-SNAPSHOT.jar;
drop table if exists tweets;
CREATE EXTERNAL TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweeted_status STRUCT<
text:STRING,
user1:STRUCT<screen_name:STRING,name:STRING>,
retweet_count:INT>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user1 STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/';
load data inpath '/user/cloudera/search.json' into table tweets;
When I run "select * from tweets;", I get this error:
Fetching results ran into the following error(s):
Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='\xf2e\xcc\xb6v\x8eC"\xae^x\x89*\xd6j\xa7', guid='h\xce\xacgmZIP\x8d\xcc\xc0\xe8C\t\x1a\x0c')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/cloudera/2015_11_18:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:366', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:275', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:752', 'sun.reflect.GeneratedMethodAccessor19:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606',
Don't use your user folder as a Hive table location. The user folder is meant for general file storage, such as that 2015_11_18 directory it's trying to read, not for an entire Hive structure.
Use LOCATION '/user/cloudera/tweets'; instead, for example.
You could also just create a regular managed table if you don't mind the data being deleted when you drop the table.
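For example, a rough sketch of the cleanup run from the Hive shell (the paths are assumptions based on the DDL and the error above):
-- Move the JSON into its own directory so Hive only scans that file,
-- then point the external table at it.
dfs -mkdir -p /user/cloudera/tweets;
dfs -mv /user/cloudera/search.json /user/cloudera/tweets/;
ALTER TABLE tweets SET LOCATION 'hdfs://quickstart.cloudera:8020/user/cloudera/tweets';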

Tables created in Snappy shell do not show up in JDBC or Pulse

SnappyData v0.5
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
I create a table in SnappyData using the shell and a csv file.
Data is here (roads.csv):
"roadId","name"
"1","Road 1"
"2","Road 2"
"3","Road 3"
"4","Road 4"
"5","Road 5"
"6","Road 6"
"7","Road 7"
"8","Road 8"
"9","Road 9"
"10","Road 10"
==========================================================
snappy> CREATE TABLE STAGING_ROADS
(road_id string, name string)
USING com.databricks.spark.csv
OPTIONS(path '/home/ubuntu/data/example/roads.csv', header 'true');
snappy> select * from STAGING_ROADS
Returns 10 rows.
I have a SnappyData JDBC connection (DBVisualizer and SquirrelSQL show the same thing).
I cannot see that table in the "TABLES" list from the metadata.
However, if I run a "select * from STAGING_ROADS", it returns 10 rows of CLOBs, which are completely unusable:
road_id | name
=====================
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
Second, the Pulse web app does not register the table when I create it from the snappy> shell. However, if I run a CREATE TABLE command from the JDBC client, it shows up there fine.
Am I doing something incorrectly? How can I get the metadata about the tables I create in the snappy> shell to show up in JDBC and Pulse as well?
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
This is a known issue (https://jira.snappydata.io/browse/SNAP-303). The JDBC metadata shows only the items in the store and not the external table. While the metadata issue is being tracked, Pulse webapp will not be able to see such external tables since it is designed to monitor the snappydata store.
A note: the "CREATE TABLE" DDL has been changed to "CREATE EXTERNAL TABLE" (https://github.com/SnappyDataInc/snappydata/pull/311) for sources outside of store to make things clearer.
How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?
It will show up for internal SnappyData sources: column and row tables. For other providers in USING, they will not show up as mentioned.
CSV tables are usually useful only for loading data into column or row tables, as in the example provided by @jagsr; a rough sketch of that pattern is below.
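Illustrative only (the target table name, the column sizes, and the empty OPTIONS clause are assumptions): copy the CSV-backed staging table into a proper column table, which will then show up in the JDBC metadata and in Pulse.
-- ROADS is a regular SnappyData column table, populated from the CSV-backed staging table.
CREATE TABLE ROADS (road_id VARCHAR(100), name VARCHAR(500)) USING column OPTIONS();
INSERT INTO ROADS SELECT * FROM STAGING_ROADS;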
I don't think creating a table using SQL with spark-csv as the data source has been tested. Here is a related JIRA: https://jira.snappydata.io/browse/SNAP-416.
We have been suggesting that folks use a Spark job to load the data in parallel. You can do this using the spark-shell as well.
val stagingRoadsDataFrame = snappyContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load(someFile)
// Save the DataFrame as a row table (props holds any table options, e.g. Map.empty[String, String])
stagingRoadsDataFrame.write.format("row").options(props).saveAsTable("staging_roads")
That said, could you try this (perhaps it might work):
CREATE TABLE STAGING_ROADS (road_id varchar(100), name varchar(500))
Note that there is no 'String' data type in SQL. By default, with no knowledge of the maximum length, we convert it to a CLOB. We are working to resolve this issue too.

Connect to MySQL from Hive

I want to connect my MySQL database to Hive so that I can access MySQL tables through Hive. I have searched the net and only found solutions for setting up MySQL as the metastore database for Hive, but nothing addressing my problem. Can anyone please help me set this up? I am expecting something like this, except for MySQL instead of MongoDB.
You can achieve this in two ways.
One is by importing the MySQL table into HDFS and Hive using Sqoop; a direct Hive import is possible through Sqoop. This creates a Hive table in Hadoop corresponding to the MySQL one. Once you import the table into Hive, the new table works as a standalone Hive table.
Another way is to use a SerDe to access MySQL tables. I found one hive-mysql SerDe on GitHub. I haven't tested this SerDe. If you are good at Java, you can write your own SerDe.
The example that you mentioned above uses a hive-mongodb SerDe.
Hive 2.3.0+ provides the ability to define external tables over your MySQL/Postgres/etc. tables using JdbcStorageHandler:
CREATE EXTERNAL TABLE student_jdbc
(
name string,
age int,
gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
"hive.sql.dbcp.username" = "hive",
"hive.sql.dbcp.password" = "hive",
"hive.sql.table" = "STUDENT"
"hive.sql.dbcp.maxActive" = "1"
);
Also, you can use the hive.sql.query parameter instead of hive.sql.table to define a more specific query, like:
"hive.sql.query" = "SELECT name, age, gpa FROM STUDENT"
See Cloudera docs also.