Tables created in Snappy shell do not show up in JDBC or Pulse - snappydata

SnappyData v0.5
The issue I am having is that my JDBC connection's table metadata and the Pulse web app do not show the table I create below.
I create a table in SnappyData using the shell and a csv file.
Data is here (roads.csv):
"roadId","name"
"1","Road 1"
"2","Road 2"
"3","Road 3"
"4","Road 4"
"5","Road 5"
"6","Road 6"
"7","Road 7"
"8","Road 8"
"9","Road 9"
"10","Road 10"
==========================================================
snappy> CREATE TABLE STAGING_ROADS
(road_id string, name string)
USING com.databricks.spark.csv
OPTIONS(path '/home/ubuntu/data/example/roads.csv', header 'true');
snappy> select * from STAGING_ROADS
Returns 10 rows.
I have a SnappyData JDBC connection (DBVisualizer & SquirrelSQL show same).
I cannot see that table in the "TABLES" list from metadata.
However, if I run "select * from STAGING_ROADS", it returns 10 rows of CLOBs, which, by the way, are completely unusable:
road_id | name
========|=====
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
CLOB    | CLOB
Second, the Pulse web app does not register the table when I create it from the snappy> shell. However, if I run a CREATE TABLE command from the JDBC client, it shows up there fine.
Am I doing something incorrectly? How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?

The issue I am having is that my JDBC connection's table metadata and the Pulse web app do not show the table I create below.
This is a known issue (https://jira.snappydata.io/browse/SNAP-303). The JDBC metadata shows only the objects in the store, not external tables. While the metadata issue is being tracked, the Pulse web app will not be able to see such external tables, since it is designed to monitor the SnappyData store.
A note: the "CREATE TABLE" DDL has been changed to "CREATE EXTERNAL TABLE" (https://github.com/SnappyDataInc/snappydata/pull/311) for sources outside the store, to make things clearer.
How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?
They will show up for internal SnappyData sources, i.e. column and row tables. Tables created with other providers in USING will not show up, as mentioned above.
CSV tables are usually useful only for loading data into column or row tables, as in the example provided by #jagsr.
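For the original example, one way to do that from the snappy> shell is to copy the CSV-backed staging table into a column table with a CREATE TABLE ... AS SELECT. A minimal sketch, assuming the CTAS-into-column-table form is available in this build and using a hypothetical target name ROADS:
-- Copy the external CSV table into the store so JDBC metadata and Pulse can see it
CREATE TABLE ROADS USING column OPTIONS() AS SELECT * FROM STAGING_ROADS;
SELECT * FROM ROADS;
Once the data lives in a column (or row) table, it should appear in the JDBC "TABLES" metadata and in Pulse like any other store table.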

I don't think creating a table using SQL with spark-csv as the data source has been tested. Here is a related JIRA: https://jira.snappydata.io/browse/SNAP-416.
We have been suggesting that folks use a Spark job to load the data in parallel. You can do this from the spark-shell as well.
// someFile and props were left undefined in the original snippet; filled in here
// (path taken from the question, props assumed to be an empty options map)
val someFile = "/home/ubuntu/data/example/roads.csv"
val props = Map.empty[String, String]
val stagingRoadsDataFrame = snappyContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load(someFile)
// Save the DataFrame as a row table
stagingRoadsDataFrame.write.format("row").options(props).saveAsTable("staging_roads")
That said, could you try the following (perhaps this might work):
CREATE TABLE STAGING_ROADS (road_id varchar(100), name varchar(500))
Note that there is no "string" data type in SQL. By default, with no knowledge of the maximum length, we convert it to a CLOB. We are working to resolve this issue too.
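Putting that together with the OPTIONS clause from the question, the full statement to try would be (an untested sketch that simply merges the two DDLs above):
CREATE TABLE STAGING_ROADS
(road_id varchar(100), name varchar(500))
USING com.databricks.spark.csv
OPTIONS(path '/home/ubuntu/data/example/roads.csv', header 'true');
With explicit varchar lengths, the JDBC client should return readable strings instead of CLOBs.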

Related

Problem dropping Hive table from pyspark script

I have a table in Hive created from many JSON files using the hive-json-serde method, WITH SERDEPROPERTIES ('dots.in.keys' = 'true'), since some keys there have a dot in them, like `aaa.bbb`. I created it as an external table and use backticks for these keys. Now I have a problem dropping this table from a pyspark script using sqlContext.sql("DROP TABLE IF EXISTS " + table_name); I'm getting this error message:
An error occurred while calling o63.sql.
: org.apache.spark.SparkException: Cannot recognize hive type string: struct<associations:struct<aaa.bbb:array<string> ...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 33)
== SQL ==
struct<associations:struct<aaa.bbb:array<string>,...
---------------------------------^^^
In Hue I can drop this table without any problem. Am I doing it wrong, or maybe there is a better way to do it?
It looks like it is not possible to work with Hive tables created with the hive-json-serde method, with dots in keys, using sqlContext.sql("...") from a pyspark script, as I wanted. There is always the same error, whether I try to drop such a Hive table or create it (I haven't tried other things yet). So my workaround is to use Python's os.system() and execute the required query through Hive itself:
import os
q = 'hive -e "DROP TABLE IF EXISTS ' + table_name + ';"'
os.system(q)
It's more complicated with the CREATE TABLE query, as we need to escape the backticks with '\':
statement = "CREATE TABLE test111 (testA struct<\`aa.bb\`:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://bucket/test111';"
q='hive -e "'+ statement+'"'
It outputs some additional Hive-related info, but it works!

How to integrate MySQL table data into KSQL streams or tables?

I am trying to build a data pipeline from MySQL to KSQL.
Use case: the data source is MySQL. I have created a table in MySQL.
I am using
./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-jdbc/source-quickstart-sqlite.properties
to start a standalone connector. And it is working fine.
I am starting the consumer with the topic name, i.e.
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1Category --from-beginning
When I insert data into the MySQL table, I get the result in the consumer as well. I have also created a KSQL stream with the same topic name. I am expecting the same result in my KSQL stream, but I am not getting any result when I do
select * from <streamName>
Connector configuration (source-quickstart-mysql.properties):
name=jdbc_source_mysql
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/testDB?user=root&password=cloudera
#comment=Which table(s) to include
table.whitelist=ftest
mode=incrementing
incrementing.column.name=id
topic.prefix=ftopic
Sample Data
MySql
1.) Create Database:
CREATE DATABASE testDB;
2.) Use Database:
USE testDB;
3.) create the table:
CREATE TABLE products (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description VARCHAR(512),
weight FLOAT
);
4.) Insert data into the table:
INSERT INTO products(id,name,description,weight)
VALUES (103,'car','Small car',20);
KSQL
1.) Create Stream:
CREATE STREAM pro_original (id int, name varchar, description varchar,weight bigint) WITH \
(kafka_topic='proproducts', value_format='DELIMITED');
2.) Select Query:
Select * from pro_original;
Expected output
Consumer
Getting the data that is inserted into the MySQL table. Here I am indeed getting the data from MySQL.
KSQL
The stream should be populated with the data that is inserted into the MySQL table and reflected in the Kafka topic. I am not getting the expected result in KSQL.
Please help me with this data pipeline.
Your data is in Avro format, but for VALUE_FORMAT you have defined DELIMITED instead of AVRO. It is important to tell KSQL the format of the values stored in the topic. The following should do the trick for you.
CREATE STREAM pro_original_v2 \
WITH (KAFKA_TOPIC='products', VALUE_FORMAT='AVRO');
Data inserted into the Kafka topic, after executing
SELECT * FROM pro_original_v2;
should now be visible in your KSQL console window.
You can have a look at some Avro examples in KSQL here.
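If the stream still looks empty, a quick sanity check at the ksql> prompt is to describe the new stream (to confirm it picked up the Avro schema from the Schema Registry) and then query it. A small sketch, reusing the stream name from the answer; SET 'auto.offset.reset' is a standard KSQL property if you also want to see rows produced before the query started:
DESCRIBE pro_original_v2;
SET 'auto.offset.reset'='earliest';
SELECT * FROM pro_original_v2;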

How can I import a JSON file into a MySQL database?

[
{"link":"https://twitter.com/GreenAddress/status/550793651186855937",
"pDate":"2015 01 1",
"title":"GreenAddress",
"description": "btcarchitect coinkite blockchain circlebits coinbase bitgo some maybe some are oracle cosigners which require lesszero trust"},
{"link":"https://twitter.com/Bit_Swift/status/550765718581411840",
"pDate":"2015 01 1",
"title":"Bitswift™",
"description": "swiftstealth offers you privacy in bitswift v2 swiftstealth enables stealth address use on the bitswift blockchain swift"},
{"link":"https://twitter.com/allenday/status/550741133500772352",
"pDate":"2015 01 1",
"title":"Allen Day, PhD",
"description": "all in one article bitcoin blockchain 3dprinting drones and deeplearninghttp simondlr compost101071618938adecentralizedaivia simondlr"}
]
My test.json file looks like this, and my MySQL DB table is here.
I can import a text file in CSV format, but I have no idea how to import a JSON text file into MySQL.
I tried [create table test ( data json);] and [insert into test values ( '{json type}');]. When importing data in CSV format, LOAD DATA INFILE 'test.txt' made it possible, so I wonder whether JSON has the same kind of functionality.
Thanks for any advice.
MySQL does have a JSON data field. However, it will not work with your file and current table structure as-is, because it requires the field to be JSON. Loading your data will require a little bit of programming work. Depending on your current ability, you will need to write code that does the following:
Open a database connection
Read the JSON and loop through each value
Store each value using the following INSERT query (a concrete sketch follows this list):
INSERT INTO news(link, date, title, description) VALUES($link, $pDate, $title, $description);
Depending on your language and database connection library, close the database connection.
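To make step 3 concrete, here is a sketch of what the table and the INSERT could look like for the first record in test.json. The actual table isn't shown in the question, so the column names and types below are assumptions (pDate is used instead of date to avoid the keyword):
-- Hypothetical schema matching the column list used in the INSERT above
CREATE TABLE news (
  link        VARCHAR(255),
  pDate       VARCHAR(20),
  title       VARCHAR(255),
  description TEXT
);
-- Step 3, filled in with the first object from test.json
INSERT INTO news (link, pDate, title, description)
VALUES ('https://twitter.com/GreenAddress/status/550793651186855937',
        '2015 01 1',
        'GreenAddress',
        'btcarchitect coinkite blockchain circlebits coinbase bitgo some maybe some are oracle cosigners which require lesszero trust');
Your code would then execute one such INSERT per element of the JSON array, with the values passed as parameters rather than hard-coded.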

Read a JSON file with 12 nested levels into Hive in Azure HDInsight

I tried to create a schema for the JSON file manually and then create a Hive table, and I am getting
column type name length 10888 exceeds max allowed length 2000.
I am guessing I have to change the metastore settings, but I am not sure where that config is located in Azure HDInsight.
The other way I tried: I got the schema from a Spark DataFrame and tried to create the table from the view, but I still get the same error.
These are the steps I tried in Spark:
val tne1 = sc.wholeTextFiles("wasb:path").map(x=>x._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tne2 = sqlContext.read.json(tne1)
tne2.createOrReplaceTempView("my_temp_table");
sqlContext.sql("create table s ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'hive.serialization.extend.nesting.levels'='true') as select * from my_temp_table")
I am getting the error in this step:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: InvalidObjectException(message:Invalid column type name length 5448 exceeds max allowed length 2000, type struct
When I try to persist or create the RDD, I get the schema, but only in a formatted view. Even if I get the full view, I might be able to extract the schema from it.
I added the following property through Ambari > Hive > Configs > Advanced > Custom hive-site:
hive.metastore.max.typename.length=14000
and now I am able to create tables with column type names of up to 14000 characters.
I was able to fix this problem by running the command below before my CREATE TABLE statement. You can set it to whatever limit fits your schema definition; I made mine extra large.
Note that you have to do this again for each Hive session.
set hive.metastore.max.typename.length=11000;

Most effective way to push data from a SQL Server database into a Greenplum database?

Greenplum Database version:
PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version:
Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benefits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error that we can catch, even when it does error out and loads no data or only a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using the DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE; dog-slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()
    Dim v_shell
    Dim v_psql As String
    ' Embedded quotes must be doubled inside a VB string literal
    v_psql = "C:\Program Files\pgAdmin III\1.10\psql.exe -d ""MyGPDatabase"" -h ""MyGPHost"" -p ""5432"" -U ""MyServiceAccount"" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
    v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)
End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that provides abstraction over much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up at the location(s) where your data is staged, preferably as CSV text files, and then the EXTERNAL TABLE definition within Greenplum is made aware of the URIs of the gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
name=20, address=30, age=4,
preserve_blanks='on',null='NULL');
The above example expects to read text files whose fields from left to right are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.
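For example, a hedged sketch of that customization, reusing the students external table from above and picking name as the distribution key purely for illustration:
-- Same parallel load through gpfdist, but with an explicit distribution key
-- instead of accepting a randomly distributed table
CREATE TABLE staging_table
AS SELECT * FROM students
DISTRIBUTED BY (name);
Choose a column with high cardinality and an even value distribution so rows spread evenly across the segments.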