Problem dropping Hive table from pyspark script - json

I have a table in Hive created from many JSON files using the hive-json-serde method, WITH SERDEPROPERTIES ('dots.in.keys' = 'true'), as some keys there contain a dot, like `aaa.bbb`. I create an external table and use backticks for these keys. Now I have a problem dropping this table from a pyspark script: using sqlContext.sql("DROP TABLE IF EXISTS "+table_name), I'm getting this error message:
An error occurred while calling o63.sql.
: org.apache.spark.SparkException: Cannot recognize hive type string: struct<associations:struct<aaa.bbb:array<string> ...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 33)
== SQL ==
struct<associations:struct<aaa.bbb:array<string>,...
---------------------------------^^^
In HUE I can drop this table without any problem. Am I doing it wrong, or maybe there is a better way to do it?
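For reference, a minimal sketch reconstructing the failing call (the context setup and table name here are placeholders, not the original script):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="drop-json-table")
sqlContext = HiveContext(sc)  # Hive support is needed to see the table

table_name = "my_json_table"  # placeholder
# This is the call that raises the ParseException shown above
sqlContext.sql("DROP TABLE IF EXISTS " + table_name)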

It looks like it is not possible to work with Hive tables created with the hive-json-serde method and with dots in keys using sqlContext.sql("...") from a pyspark script, as I want. There is always the same error, whether I want to drop such a Hive table or create it (I haven't tried other things yet). So my workaround is to use Python os.system() and execute the required query through Hive itself:
q='hive -e "DROP TABLE IF EXISTS '+ table_name+';"'
os.system(q)
It's more complicated with a CREATE TABLE query, as we need to escape the backticks with '\' for the shell:
statement = "CREATE TABLE test111 (testA struct<\`aa.bb\`:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://bucket/test111';"
q='hive -e "'+ statement+'"'
It outputs some additional Hive-related info, but it works!
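As a hedged alternative sketch to os.system(): passing the statement to the hive CLI as a single argument via subprocess avoids the shell escaping of backticks and quotes altogether (assuming Python 3.5+ and the hive binary on the PATH):
import subprocess

statement = """
CREATE TABLE test111 (testA struct<`aa.bb`:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://bucket/test111';
"""
# No shell is involved, so the backticks need no extra escaping
subprocess.run(["hive", "-e", statement], check=True)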

Related

Apache Drill Parser error while creating table from simple select of JSON file

I get a parser error while creating a table from a SQL query over a JSON file in Apache Drill.
USE dfs.tmp;
CREATE Table myt AS
(SELECT KVGEN(repo)[1] reponame FROM dfs.`f:\DemoData\201901-000000000000.json`
WHERE STRPOS(payload,'ARM') >0)
error:
Org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: Encountered ";" at line 1, column 12. Was expecting one of: <EOF> "." ... "[" ... SQL Query USE dfs.tmp; ^ CREATE Table myt AS (SELECT KVGEN(repo)[1] reponame FROM dfs.`f:\DemoData\201901-000000000000.json` WHERE STRPOS(payload,'ARM') >0)
What am I doing wrong?
You are trying to submit two queries, but Drill doesn't support submitting several queries via a single form in the Drill Web UI.
Please create a Jira ticket to improve it: https://issues.apache.org/jira/browse/DRILL.
You can use Drill SqlLine (the Drill shell) instead; it doesn't have this limitation.
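If you need to script this outside the Web UI or SqlLine, one hedged option (a sketch, not the answer's approach) is Drill's REST endpoint, submitting one statement per request. The Drillbit host/port is an assumption, and qualifying the table as dfs.tmp.myt sidesteps the separate USE statement, since each REST request is typically its own session:
import requests

query = """
CREATE TABLE dfs.tmp.myt AS
(SELECT KVGEN(repo)[1] reponame
 FROM dfs.`f:\\DemoData\\201901-000000000000.json`
 WHERE STRPOS(payload, 'ARM') > 0)
"""

# POST one SQL statement per request to the Drill REST API
resp = requests.post(
    "http://localhost:8047/query.json",   # assumed Drillbit address
    json={"queryType": "SQL", "query": query},
)
resp.raise_for_status()
print(resp.json())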

Tables created in Snappy shell do not show up in JDBC or Pulse

SnappyData v.0-5
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
I create a table in SnappyData using the shell and a csv file.
Data is here (roads.csv):
"roadId","name"
"1","Road 1"
"2","Road 2"
"3","Road 3"
"4","Road 4"
"5","Road 5"
"6","Road 6"
"7","Road 7"
"8","Road 8"
"9","Road 9"
"10","Road 10"
==========================================================
snappy> CREATE TABLE STAGING_ROADS
(road_id string, name string)
USING com.databricks.spark.csv
OPTIONS(path '/home/ubuntu/data/example/roads.csv', header 'true');
snappy> select * from STAGING_ROADS
Returns 10 rows.
I have a SnappyData JDBC connection (DBVisualizer & SquirrelSQL show the same).
I cannot see that table in the "TABLES" list from the metadata.
However, if I do a "select * from STAGING_ROADS", it returns 10 rows with CLOBs, which by the way are completely unusable:
road_id | name
=====================
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
CLOB CLOB
Second, the Pulse Web App does not register that I created the table when I did it from the snappy> shell. However, if I run a CREATE TABLE command from the JDBC client, it shows up there fine.
Am I doing something incorrectly? How can I get metadata about the tables I create in the snappy> shell to show up in JDBC and Pulse as well?
The issue I am having is that my JDBC Connection's Table metadata and Pulse Web App do not see the table I created below.
This is a known issue (https://jira.snappydata.io/browse/SNAP-303). The JDBC metadata shows only the items in the store, not external tables. While the metadata issue is being tracked, the Pulse web app will not be able to see such external tables, since it is designed to monitor the SnappyData store.
A note: the "CREATE TABLE" DDL has been changed to "CREATE EXTERNAL TABLE" (https://github.com/SnappyDataInc/snappydata/pull/311) for sources outside of the store, to make things clearer.
How can I get metadata about the tables I create in snappy> shell to show up in JDBC and Pulse as well?
It will show up for internal SnappyData sources, i.e. column and row tables. Tables created with other providers in USING will not show up, as mentioned.
CSV tables are usually useful only for loading data into column or row tables, as in the example provided by #jagsr.
I don't think creating a table using SQL with spark-csv as the data source has been tested. Here is a related JIRA: https://jira.snappydata.io/browse/SNAP-416.
We have been suggesting that folks use a Spark job to load the data in parallel. You can do this using the spark-shell as well:
val stagingRoadsDataFrame = snappyContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")       // Use first line of all files as header
    .option("inferSchema", "true")  // Automatically infer data types
    .load(someFile)
// Save the DataFrame as a row table
stagingRoadsDataFrame.write.format("row").options(props).saveAsTable("staging_roads")
That said, could you try this (perhaps it might work):
CREATE TABLE STAGING_ROADS (road_id varchar(100), name varchar(500))
Note that there is no 'string' data type in SQL. By default, with no knowledge of the max length, we convert it to a CLOB. We are working to resolve this issue too.

Insert into Select command causing exception ParseException line 1:12 missing TABLE at 'table_name' near '<EOF>'

I am 2 days into Hadoop and Hive, so my understanding is very basic and my question might be silly. Question: I have a Hive external table ABC and have created a sample test table similar to it called ABC_TEST. My goal is to copy certain contents of ABC to ABC_TEST depending on a select clause. So I created ABC_TEST using the following command:
CREATE TABLE ABC_TEST LIKE ABC;
The problems with this are:
1) ABC_TEST is not an external table.
2) Using the DESC command, the LOCATION for ABC_TEST was something like
hdfs://somepath/somdbname.db/ABC_TEST
--> On the command "hadoop fs -ls hdfs://somepath/somdbname.db/ABC_TEST" I found no files.
--> Whereas "hadoop fs -ls hdfs://somepath/somdbname.db/ABC" returned 2 files.
3) When trying to insert values into ABC_TEST from ABC, I get the exception mentioned in the title. The following is the command I used to insert values into ABC_TEST:
INSERT INTO ABC_TEST select * from ABC where column_name='a_valid_value' limit 5;
Is it wrong to use the INSERT INTO ... SELECT option in Hive? What am I missing? Please help.
The correct syntax is "INSERT INTO TABLE [TABLE_NAME]"
INSERT INTO TABLE ABC_TEST select * from ABC where column_name='a_valid_value' limit 5;
I faced exactly the same issue, and the reason is the Hive version.
On one of our clusters we are using Hive 0.14, and on a new setup we're using Hive 2.3.4.
In Hive 0.14 the "TABLE" keyword is mandatory in the INSERT command.
However, in Hive 2.3.4 it is not mandatory.
So in Hive 2.3.4 the query you've mentioned above in your question will work perfectly fine, but in older versions you'll face the exception "FAILED: ParseException line 1:12 missing TABLE <>".
Hope this helps.

Hive External Table exclude records that violate data type

I have an external table in Hive that uses a serde to process JSON records. Occasionally there will be a value that does not match the table DDL data type, e.g. the table field definition is int but the JSON has a string value. During query execution Hive will correctly throw this error for a metadata exception due to the type mismatch:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing writable
Is there a way to set Hive to just ignore these records that have data type violations?
Note that the JSON is valid syntax, so setting serde properties such as ignore.malformed.json is not applicable.
Example DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS test_tbl (
acd INT,
tzo INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
;
ALTER TABLE test_tbl SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
Example data - the TZO = alpha record will cause the error:
{"acd":6,"tzo":4}
{"acd":6,"tzo":7}
{"acd":6,"tzo":"alpha"}
You can set up Hive to tolerate a configurable number of failures.
SET mapred.skip.mode.enabled = true;
SET mapred.map.max.attempts = 100;
SET mapred.reduce.max.attempts = 100;
SET mapred.skip.map.max.skip.records = 30000;
SET mapred.skip.attempts.to.start.skipping = 1;
This is not Hive specific and can be applied to ordinary MapReduce as well.
I don't think there is a way to handle this in Hive yet. You may need an intermediate step using MR, Pig, etc. to make sure the data is sound, and then load from that result.
There may be a configuration parameter here that you could use:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-SerDes
I'm thinking you may be able to write your own exception handler to catch the error and continue, by specifying your custom handler with hive.io.exception.handlers.
Or, if you are OK with storing the data as an ORC file instead of a text file, you can specify the ORC file format with HiveQL statements such as these:
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
And then when you run your jobs you can use the skip setting:
set hive.exec.orc.skip.corrupt.data=true

MySQL to PostgreSQL migration

My PostgreSQL is installed on Windows. How can I migrate data from a MySQL database to PostgreSQL?
I've read tons of articles. Nothing helps :(
Thanks.
My actions:
mysql dump:
mysqldump -h 192.168.0.222 --port 3307 -u root -p --compatible=postgresql synchronizer > c:\dump.sql
create db synchronizer at pgsql
import dump:
psql -h 192.168.0.100 -d synchronizer -U postgres -f C:\dump.sql
output:
psql:C:/dump.sql:17: NOTICE: table "Db_audit" does not exist, skipping
DROP TABLE
psql:C:/dump.sql:30: ERROR: syntax error at or near "("
LINE 2: "id" int(11) NOT NULL,
^
psql:C:/dump.sql:37: ERROR: syntax error at or near ""Db_audit""
LINE 1:LOCK TABLES "Db_audit" WRITE;
^
psql:C:/dump.sql:39: ERROR: relation "Db_audit" does not exist
LINE 1:INSERT INTO "Db_audit" VALUES (4068,4036,4,1,32,'2010-02-04 ...
^
psql:C:/dump.sql:40: ERROR: relation "Db_audit" does not exist
LINE 1:INSERT INTO "Db_audit" VALUES (19730,2673,2,2,44,'2010-11-23...
^
psql:C:/dump.sql:42: ERROR: syntax error at or near "UNLOCK"
LINE 1:UNLOCK TABLES;
^
psql:C:/dump.sql:48: NOTICE: table "ZHNVLS" does not exist, skipping
DROP TABLE
psql:C:/dump.sql:68: ERROR: syntax error at or near "("
LINE 2: "id" int(10) unsigned NOT NULL,
^
psql:C:/dump.sql:75: ERROR: syntax error at or near ""ZHNVLS""
LINE 1:LOCK TABLES "ZHNVLS" WRITE;
^
psql:C:/dump.sql:77: WARNING: nonstandard use of escape in a string literal
LINE 1:...???????? ??? ???????','10','4607064820115','0','','??????-??...
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
Cancel request sent
psql:C:/dump.sql:77: WARNING: nonstandard use of escape in a string literal
LINE 1:...??????????? ????????','10','4602784001189','0','','???????? ...
My experience with MySQL -> PostgreSQL migration wasn't really pleasant, so I'd have to second Daniel's suggestion about CSV files.
In my case, I recreated the schema by hand and then imported all tables, one by one, using mysqldump and pg_restore.
So, while this dump/restore may work for the data, you are most likely out of luck with the schema. I haven't tried any commercial solutions, so see what other people say and... good luck!
UPDATE: I looked at the code the process left behind and here is how I actually did it.
I had a slightly different schema in my PostgreSQL db, so some tables were joined and some were split. This is why a straightforward import was not an option; my case is probably more complex than what you describe, so this solution may be overkill.
For each table in the PG database I wrote a query that selects the relevant data from the MySQL database. In case the table is basically the same in both databases and there are no joins, it can be as simple as this:
select * from mysql_table_name
Then I exported the results of this query to XML. To do this, you need to run it like this:
echo "select * from mysql_table_name" | mysql [CONNECTION PARAMETERS] -X --default-character-set=utf8 > mysql_table_name.xml
This will create a simple XML file with the following structure:
<resultset statement="select * from mysql_table_name">
<row>
<field name="some_field">field_value</field>
...
</row>
...
</resultset>
Then I wrote a script that produces an INSERT statement for each row element in this XML file. The name of the table to insert the data into was given as a command-line parameter to the script. It was a Python script, in case you need it.
These sql statements were written to a file, and then fed to psql like this:
psql [CONNECTION PARAMETERS] -f FILENAME -1
The only trick in the XML -> SQL transformation is to recognize numbers and leave them unquoted.
To sum it up: mysql can produce query results as XML and you can use it.
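That script is not included here, but a minimal sketch of such an XML -> INSERT converter might look like this (an illustration rather than the author's original script; the table name is taken as a command-line argument and the numeric check is deliberately naive):
import sys
import xml.etree.ElementTree as ET

# Usage: python xml2inserts.py table_name mysql_table_name.xml > inserts.sql
table, xml_file = sys.argv[1], sys.argv[2]

def render(field):
    value = field.text
    if value is None:
        return "NULL"
    try:
        float(value)          # leave numbers unquoted
        return value
    except ValueError:
        return "'" + value.replace("'", "''") + "'"

for row in ET.parse(xml_file).getroot().findall("row"):
    names = [f.get("name") for f in row.findall("field")]
    values = [render(f) for f in row.findall("field")]
    print("INSERT INTO {} ({}) VALUES ({});".format(
        table, ", ".join(names), ", ".join(values)))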
It's a bit more complicated than that. There is plenty of documentation here:
http://wiki.postgresql.org/wiki/Converting_from_other_Databases_to_PostgreSQL#MySQL
There, you'll also find conversion scripts.
In my rather simple case (30 tables, 10000 records), I used a perl script:
http://pgfoundry.org/frs/?group_id=1000198
It chugged through the mysql dump file and produced a pg dump file, with the following issues.
I was importing to Heroku so I used their pgbackups plugin which worked almost flawlessly.
Issues to watch for
Boolean data types. MySQL stores these as 0 and 1. PostgreSQL stores them as t and f. Watch that the booleans don't get migrated as integers.
Auto-incrementing IDs. You may find your IDs start counting again from 1. You'll get errors like this: "duplicate key value violates unique constraint ...". It's easy to fix (see the sketch below), but watch out for it.
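The fix alluded to above is not shown; as a hedged illustration, the serial sequence can be bumped past the highest migrated id with setval (psycopg2, the connection string, and the table/column names here are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=synchronizer user=postgres")  # assumed DSN
with conn, conn.cursor() as cur:
    # Move the sequence behind my_table.id past the largest imported value
    cur.execute(
        "SELECT setval(pg_get_serial_sequence('my_table', 'id'), "
        "(SELECT COALESCE(MAX(id), 1) FROM my_table))"
    )
conn.close()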
I've used py-mysql2pgsql for converting a big MySQL database into Postgres. It handles most cases very well. I had to patch it for a couple of cases specific to my needs, though.
https://pypi.python.org/pypi/py-mysql2pgsql
By default, it reads data from MySQL and writes to Postgres. But you can ask it to write the schema and/or data to a file for inspecting before loading into Postgres.
You can use https://github.com/mihailShumilov/mysql2postgresql
This is a converter written in PHP.
There's also a very nice (fork of a) python converter that is maintained by the gitlab creators:
https://github.com/gitlabhq/mysql-postgresql-converter
The original project this was forked from is stale. For me, everything worked perfectly using this script.
Here is a project which migrates your current MySQL database to PostgreSQL in a couple of commands, including indexes and foreign keys. It also allows you to define name, index and column type parsing, so you can override the default behavior.
https://github.com/ggarri/mysql2psql
I hope it can be useful for anyone interested in migrating a current project to PG; in our case we obtained around a 20% performance increase.
It is much better to use a program that automates the migration process.
Even if you are familiar with all the gotchas, doing every step by hand may take a lot of time, especially when your db is "big".
Try FromMySqlToPostgreSql.
This tool is feature-rich and easy to use.
It maps data types and migrates constraints, indexes, PKs and FKs exactly as they were in your MySQL db.
Under the hood it uses PostgreSQL COPY, so data transfer is very fast.