Apache Drill Parser error while creating table from simple select of JSON file - apache-drill

I get parser error while creating table from SQL query of JSON file in apache Drill.
USE dfs.tmp;
CREATE Table myt AS
(SELECT KVGEN(repo)[1] reponame FROM dfs.`f:\DemoData\201901-000000000000.json`
WHERE STRPOS(payload,'ARM') >0)
error:
Org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: Encountered ";" at line 1, column 12. Was expecting one of: <EOF> "." ... "[" ... SQL Query USE dfs.tmp; ^ CREATE Table myt AS (SELECT KVGEN(repo)[1] reponame FROM dfs.`f:\DemoData\201901-000000000000.json` WHERE STRPOS(payload,'ARM') >0)
What am i doing wrong ?

You are trying to submit to queries, but Drill doesn't support submitting several queries via a single form in Drill Web UI.
Please create Jira ticket to improve it: https://issues.apache.org/jira/browse/DRILL.
You can use Drill SqlLine (Drill shell). It hasn't this limitation.

Related

DLT: commas treated as part of column name

I am trying to create a STREAMING LIVE TABLE object in my DataBricks environment, using an S3 bucket with a bunch of CSV files as a source.
The syntax I am using is:
CREATE OR REFRESH STREAMING LIVE TABLE t1
COMMENT "test table"
TBLPROPERTIES
(
"myCompanyPipeline.quality" = "bronze"
, 'delta.columnMapping.mode' = 'name'
, 'delta.minReaderVersion' = '2'
, 'delta.minWriterVersion' = '5'
)
AS
SELECT * FROM cloud_files
(
"/input/t1/"
,"csv"
,map
(
"cloudFiles.inferColumnTypes", "true"
, "delimiter", ","
, "header", "true"
)
)
A sample source file content:
ROW_TS,ROW_KEY,CLASS_ID,EVENT_ID,CREATED_BY,CREATED_ON,UPDATED_BY,UPDATED_ON
31/07/2018 02:29,4c1a985c-0f98-46a6-9703-dd5873febbbb,HFK,XP017,test-user,02/01/2017 23:03,,
17/01/2021 21:40,3be8187e-90de-4d6b-ac32-1001c184d363,HTE,XP083,test-user,02/09/2017 12:01,,
08/11/2019 17:21,05fa881e-6c8d-4242-9db4-9ba486c96fa0,JG8,XP083,test-user,18/05/2018 22:40,,
When I run the associated pipeline, I am getting the following error:
org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore.
For some reason, the loader is not recognizing commas as column separators and is trying to load the whole thing into a single column.
I spent a good few hours already trying to find a solution. Replacing commas with semicolons (both in the source file and in the "delimiter" option) does not help.
Trying to manually upload the same file to a regular (i.e. non-streaming) Databricks table works just fine. The issue is solely with a streaming table.
Ideas?
Not exactly the type of a solution I would have expected here but it seems to work so...
Rather than using SQL to create a DLT, using Python scripting helps:
import dlt
#dlt.table
def t1():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.load("/input/t1/")
)
Note that the above script needs to be executed via a DLT pipeline (running it directly from a notebook will throw a ModuleNotFoundError exception)

bigquery error: "Could not parse '41.66666667' as INT64"

I am attempting to create a table using a .tsv file in BigQuery, but keep getting the following error:
"Failed to create table: Error while reading data, error message: Could not parse '41.66666667' as INT64 for field Team_Percentage (position 8) starting at location 14419658 with message 'Unable to parse'"
I am not sure what to do as I am completely new to this.
Here is a file with the first 100 lines of the full data:
https://wetransfer.com/downloads/25c18d56eb863bafcfdb5956a46449c920220502031838/f5ed2f
Here are the steps I am currently taking to to create the table:
https://i.gyazo.com/07815cec446b5c0869d7c9323a7fdee4.mp4
Appreciate any help I can get!
As confirmed with OP (#dan), the error encountered is caused by selecting Auto detect when creating a table using a .tsv file as the source.
The fix for this is to manually create a schema and define the data type for each column properly. For more reference on using schema in BQ see this document.

Error while reading data, error message: CSV table references column position 15, but line starting at position:0 contains only 1 columns

I am new in bigquery, Here I am trying to load the Data in GCP BigQuery table which I have created manually, I have one bash file which contains bq load command -
bq load --source_format=CSV --field_delimiter=$(printf '\u0001') dataset_name.table_name gs://bucket-name/sample_file.csv
My CSV file contains multiple ROWS with 16 column - sample Row is
100563^3b9888^Buckname^https://www.settttt.ff/setlllll/buckkkkk-73d58581.html^Buckcherry^null^null^2019-12-14^23d74444^Reverb^Reading^Pennsylvania^United States^US^40.3356483^-75.9268747
Table schema -
When I am executing bash script file from cloud shell, I am getting following Error -
Waiting on bqjob_r10e3855fc60c6e88_0000016f42380943_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'project-name-
staging:bqjob_r10e3855fc60c6e88_0000ug00004521': Error while reading data, error message: CSV
table
encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection
for more details.
Failure details:
- gs://bucket-name/sample_file.csv: Error while
reading data, error message: CSV table references column position
15, but line starting at position:0 contains only 1 columns.
What would be the solution, Thanks in advance
You are trying to insert wrong values to your table per the schema you provided
Based on table schema and your data example I run this command:
./bq load --source_format=CSV --field_delimiter=$(printf '^') mydataset.testLoad /Users/tamirklein/data2.csv
1st error
Failure details:
- Error while reading data, error message: Could not parse '39b888'
as int for field Field2 (position 1) starting at location 0
At this point, I manually removed the b from 39b888 and now I get this
2nd error
Failure details:
- Error while reading data, error message: Could not parse
'14/12/2019' as date for field Field8 (position 7) starting at
location 0
At this point, I changed 14/12/2019 to 2019-12-14 which is BQ date format and now everything is ok
Upload complete.
Waiting on bqjob_r9cb3e4ef5ad596e_0000016f42abd4f6_1 ... (0s) Current status: DONE
You will need to clean your data before upload or use a data sample with more lines with --max_bad_records flag (Some of the lines will be ok and some not based on your data quality)
Note: unfortunately there is no way to control date format during the upload see this answer as a reference
We had the same problem while importing data from local to BigQuery. After researching the data we saw that there data which starting \r or \s enter image description here
After implementing ua['ColumnName'].str.strip() and ua['District'].str.rstrip(). we could add data to Bg.
Thanks

Problem dropping Hive table from pyspark script

I have a table in hive created from many json files using hive-json-serde method, WITH SERDEPROPERTIES ('dots.in.keys' = 'true'), as some keys there have a dot in, like `aaa.bbb`. I create external table and use backticks for these keys. Now I have a problem dropping this table from pyspark script, using sqlContext.sql("DROP TABLE IF EXISTS "+table_name), I'm getting this error message:
An error occurred while calling o63.sql.
: org.apache.spark.SparkException: Cannot recognize hive type string: struct<associations:struct<aaa.bbb:array<string> ...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 33)
== SQL ==
struct<associations:struct<aaa.bbb:array<string>,...
---------------------------------^^^
In HUE i can drop this table without any problem. Am I doing it wrong, or may be there is better way to do it?
It looks like it is not possible to work with Hive tables created with the hive-json-serde method, with dot in keys , using sqlContext.sql("...") from pyspark script, as I want. There is always the same error, if I want to drop such Hive table, or create it (haven't tried other things yet). So my workaround is to use python os.system() and execute required query through hive itself:
q='hive -e "DROP TABLE IF EXISTS '+ table_name+';"'
os.system(q)
It's more complicated with CREATE TABLE query, as we need to escape backticks with '\':
statement = "CREATE TABLE test111 (testA struct<\`aa.bb\`:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3a://bucket/test111';"
q='hive -e "'+ statement+'"'
It outputs some additional hive related info, but works!

Insert into Select command causing exception ParseException line 1:12 missing TABLE at 'table_name' near '<EOF>'

I am 2 days old into hadoop and hive. So, my understanding is very basic. I have a question which might be silly. Question :I have a hive external table ABC and have created a sample test table similar to the table as ABC_TEST. My goal is to Copy certain contents of ABC to ABC_TEST depending on select clause. So I created ABC_TEST using the following command:
CREATE TABLE ABC_TEST LIKE ABC;
Problem with this is:
1) this ABC_TEST is not an external table.
2) using Desc command, the LOCATION content for ABC_TEST was something like
hdfs://somepath/somdbname.db/ABC_TEST
--> On command "hadoop fs -ls hdfs://somepath/somdbname.db/ABC_TEST " I found no files .
--> Whereas, "hadoop fs -ls hdfs://somepath/somdbname.db/ABC" returned me 2 files.
3) When trying to insert values to ABC_TEST from ABC, I have the above exception mentioned in the title. Following is the command I used to insert values to ABC_TEST:
INSERT INTO ABC_TEST select * from ABC where column_name='a_valid_value' limit 5;
Is it wrong to use the insert into select option in Hive? what am I missing? Please help
The correct syntax is "INSERT INTO TABLE [TABLE_NAME]"
INSERT INTO TABLE ABC_TEST select * from ABC where column_name='a_valid_value' limit 5;
I faced exactly the same issue and the reason is the Hive version.
In one of our clusters, we are using hive 0.14 and on a new set up we're using hive-2.3.4.
In hive 0.14 "TABLE" keyword is mandatory to be used in the INSERT command.
However in version hive 2.3.4, this is not mandatory.
So while in hive 2.3.4, the query you've mentioned above in your question will work perfectly fine but in older versions you'll face exception "FAILED: ParseException line 1:12 missing TABLE <>".
Hope this helps.