DLT: commas treated as part of column name - csv

I am trying to create a STREAMING LIVE TABLE object in my Databricks environment, using an S3 bucket containing a number of CSV files as the source.
The syntax I am using is:
CREATE OR REFRESH STREAMING LIVE TABLE t1
COMMENT "test table"
TBLPROPERTIES
(
    "myCompanyPipeline.quality" = "bronze"
  , 'delta.columnMapping.mode' = 'name'
  , 'delta.minReaderVersion' = '2'
  , 'delta.minWriterVersion' = '5'
)
AS
SELECT * FROM cloud_files
(
    "/input/t1/"
  , "csv"
  , map
    (
        "cloudFiles.inferColumnTypes", "true"
      , "delimiter", ","
      , "header", "true"
    )
)
A sample source file content:
ROW_TS,ROW_KEY,CLASS_ID,EVENT_ID,CREATED_BY,CREATED_ON,UPDATED_BY,UPDATED_ON
31/07/2018 02:29,4c1a985c-0f98-46a6-9703-dd5873febbbb,HFK,XP017,test-user,02/01/2017 23:03,,
17/01/2021 21:40,3be8187e-90de-4d6b-ac32-1001c184d363,HTE,XP083,test-user,02/09/2017 12:01,,
08/11/2019 17:21,05fa881e-6c8d-4242-9db4-9ba486c96fa0,JG8,XP083,test-user,18/05/2018 22:40,,
When I run the associated pipeline, I am getting the following error:
org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore.
For some reason, the loader is not recognizing commas as column separators and is trying to load the whole thing into a single column.
I have already spent a good few hours trying to find a solution. Replacing commas with semicolons (both in the source file and in the "delimiter" option) does not help.
Trying to manually upload the same file to a regular (i.e. non-streaming) Databricks table works just fine. The issue is solely with a streaming table.
Ideas?

Not exactly the type of solution I would have expected here, but it seems to work.
Rather than using SQL to create the DLT table, defining it in Python helps:
import dlt

@dlt.table
def t1():
    # Read the CSV files incrementally with Auto Loader (the "cloudFiles" source)
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/input/t1/")
    )
Note that the above script needs to be executed via a DLT pipeline (running it directly from a notebook will throw a ModuleNotFoundError exception)
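The symptom in the question, a single column whose name contains commas, is what any CSV parser produces when the header line is not split on the intended delimiter. Here is a minimal local sketch using Python's standard csv module. It is not Auto Loader itself, and the helper name header_columns is made up for illustration:

```python
import csv
import io

# First two lines of the sample file from the question.
SAMPLE = (
    "ROW_TS,ROW_KEY,CLASS_ID,EVENT_ID,CREATED_BY,CREATED_ON,UPDATED_BY,UPDATED_ON\n"
    "31/07/2018 02:29,4c1a985c-0f98-46a6-9703-dd5873febbbb,HFK,XP017,test-user,02/01/2017 23:03,,\n"
)


def header_columns(text, delimiter):
    """Return the header row of CSV text, split on the given delimiter."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter))


split_ok = header_columns(SAMPLE, ",")    # delimiter matches the file
not_split = header_columns(SAMPLE, "\t")  # delimiter does not match the file

print(len(split_ok), len(not_split))
print(not_split[0])
```

When the delimiter does not match, the whole header comes back as one name that still contains the commas, which is the kind of name the metastore refuses.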

Related

During bulk insert date is detected as null

In one of my applications I am importing CSV data into my Access db using the following bulk insert query.
INSERT INTO Log_134_temp ([DATE],[TIME],CH0,CH1,CH2,CH3) SELECT [DATE],[TIME],CH0,CH1,CH2,CH3 FROM [Text;FMT=CSVDelimited;HDR=Yes;DATABASE=C:\tmp].[SAMPLE_1.csv]
The query gets executed and all the parameters in the query are correct. The issue is with just one of the csv files which gives the following error after the query is executed.
The field 'Log_134_temp.date' cannot contain a Null value because the
Required property for this field is set to True. Enter a value in
this field.
Whereas the other CSV files get imported without any issue.
The file that is imported successfully and the file with the issue look identical in format, however. This has puzzled me for over a day now.
The file that gets imported
https://www.dropbox.com/s/amddhzhi6nr24ex/SAMPLE_1_111.csv?dl=0
The file that doesn't get imported
https://www.dropbox.com/s/2rrgdf7oor5ptbf/SAMPLE_1_112.csv?dl=0
The bad line is in row 135169:
2019-02-14,16:57:54,310,837,300,650
It contains a lot of 0x00 (NUL) bytes.
I found this with a simple Python loop that prints every line whose length differs from the previous one:
In [43]: f = read_file(r'...SAMPLE_1_112.csv')
In [44]: li = f.split('\n')
...
In [60]: prev_len = 1
In [61]: for l in li:
    ...:     if not len(l): continue
    ...:     if prev_len != len(l): print(l)
    ...:     prev_len = len(l)
    ...:
DATE,TIME,CH0,CH1,CH2,CH3
2018-10-03,11:45:44,246,621,250,600
2019-02-14,16:57:54,310,837,300,650
2019-02-14,16:59:01,309,859,300,650
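The length comparison above only hints at where the odd line is. Assuming the "00 symbols" are literal NUL (0x00) bytes, a more direct check is to read the file in binary mode and report every line that contains one. This is a rough sketch: the function name lines_with_nul and the demo data are made up, and for the real file you would pass the path to SAMPLE_1_112.csv instead:

```python
import os
import tempfile


def lines_with_nul(path):
    """Yield (line_number, raw_line) for every line that contains a NUL byte."""
    with open(path, "rb") as f:
        for number, line in enumerate(f, start=1):
            if b"\x00" in line:
                yield number, line


# Self-contained demo on a small temporary file with made-up rows.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"DATE,TIME,CH0\n")
    tmp.write(b"2019-02-14,16:57:54,310\n")
    tmp.write(b"\x00\x00\x002019-02-14,16:59:01,309\n")
    demo_path = tmp.name

bad_lines = list(lines_with_nul(demo_path))
os.remove(demo_path)
print(bad_lines)
```

Reading in binary mode matters here, since a text-mode read may hide or mangle the bytes you are looking for.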

Problem dropping Hive table from pyspark script

I have a table in Hive created from many JSON files using the hive-json-serde method, WITH SERDEPROPERTIES ('dots.in.keys' = 'true'), as some keys there contain a dot, like `aaa.bbb`. I created an external table and used backticks for these keys. Now I have a problem dropping this table from a pyspark script using sqlContext.sql("DROP TABLE IF EXISTS "+table_name); I'm getting this error message:
An error occurred while calling o63.sql.
: org.apache.spark.SparkException: Cannot recognize hive type string: struct<associations:struct<aaa.bbb:array<string> ...
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting ':'(line 1, pos 33)
== SQL ==
struct<associations:struct<aaa.bbb:array<string>,...
---------------------------------^^^
In Hue I can drop this table without any problem. Am I doing something wrong, or is there a better way to do it?
It looks like it is not possible to work with Hive tables created with the hive-json-serde method, with dots in keys, using sqlContext.sql("...") from a pyspark script, as I wanted. The same error always occurs, whether I want to drop such a Hive table or create it (I haven't tried other things yet). So my workaround is to use Python's os.system() and execute the required query through hive itself:
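import os

q = 'hive -e "DROP TABLE IF EXISTS ' + table_name + ';"'
os.system(q)
It's more complicated with a CREATE TABLE query, as we need to escape the backticks with '\' so that the shell does not interpret them:
# "\\`" puts a literal backslash before each backtick in the shell command
statement = (
    "CREATE TABLE test111 (testA struct<\\`aa.bb\\`:string>) "
    "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
    "LOCATION 's3a://bucket/test111';"
)
q = 'hive -e "' + statement + '"'
os.system(q)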
It outputs some additional hive related info, but works!
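The backtick escaping above is only needed because os.system() passes the command through a shell. A possible alternative, assuming the hive CLI is on PATH, is to pass the arguments as a list to subprocess.run, which involves no shell at all. The helper names build_hive_command and run_hive are made up for this sketch:

```python
import subprocess


def build_hive_command(statement):
    """Build the argument list for `hive -e <statement>`; no shell quoting needed."""
    return ["hive", "-e", statement]


def run_hive(statement):
    """Run the statement with the hive CLI (assumed to be installed and on PATH)."""
    return subprocess.run(build_hive_command(statement), check=True)


# The backticks are written as-is, because no shell ever interprets them.
create_stmt = (
    "CREATE TABLE test111 (testA struct<`aa.bb`:string>) "
    "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' "
    "LOCATION 's3a://bucket/test111';"
)
command = build_hive_command(create_stmt)
print(command)
# run_hive(create_stmt)  # uncomment on a machine where the hive CLI is available
```

With check=True, a non-zero exit code from hive raises CalledProcessError, whereas os.system() only returns the exit status for you to inspect.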

S3 to MySQL AWS Data Pipeline Insert table error

It's my first time asking a question on here, so please bear with me.
I am trying to create a data pipeline to upload a CSV file in an S3 bucket to a MySQL database table (Production1) using the template provided by AWS, but it fails when executing RdsMySqlTableCreateActivity.
The SQL statement that I'm using in the myRDSTableInsertSql parameter (all column names match the CSV file):
INSERT INTO `Production1` (`API`, `Normalized Month`, `DATE`, `Monthly Liquid`, `Cum Oil`, `BOPD`, `Monthly Gas Mcf/Month`, `Cum Gas`, `MCFPD`) VALUES(?,?,?,?,?,?,?,?,?);
The RdsMySqlTableCreateActivity error:
errorId
ActivityFailed:SQLException
errorMessage
No value specified for parameter 1
errorStackTrace
amazonaws.datapipeline.taskrunner.TaskExecutionException:
private.com.amazonaws.services.datapipeline.redshift.QueryStatementException: Exception No value specified for
parameter 1 while executing INSERT INTO `Production1` (`API`, `Normalized Month`, `DATE`, `Monthly Liquid`, `Cum Oil`, `BOPD`, `Monthly Gas Mcf/Month`, `Cum Gas`, `MCFPD`) VALUES(?,?,?,?,?,?,?,?,?);...
I ran the insert command in MySQL Workbench, replacing the (?,?,?,?,?,?,?,?,?) with (1,2,3,4,5,6,7,8,9), and it worked. The CSV file that I'm using has only 2 rows: the column names, and the values 1-9 for the respective columns. I'm really not sure what it means by "No value specified for parameter 1"; any help or guidance would be appreciated!
For anyone that runs into the same issue using the "Load S3 data into RDS MySQL table" template
My values for each parameter were the following
myRDSTableInsertSql:
INSERT INTO tableName(`col_name1`, `col_name2`, `col_name3`, `col_name4`, `col_name5`, `col_name6`, `col_name7`, `col_name8`, `col_name9`) VALUES(?,?,?,?,?,?,?,?,?);
myRDSTableName: tableName
myRDSCreateTableSql:
CREATE TABLE tableName(`col_name1` type, `col_name2` type, `col_name3` type, `col_name4` type, `col_name5` type, `col_name6` type, `col_name7` type, `col_name8` type, `col_name9` type);
The main issue was with the actual CSV file format: you have to make sure there is no header, and that the types are exactly the same. Also make sure that your separators are "," and that the values are not quoted within your CSV file.
This template is a good starting point, but for more detailed/complex CSV files making your own data pipeline is a must!
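If the source CSV has a header row or quoted values, it can be rewritten before uploading to S3. Below is a rough sketch in Python under those assumptions. The function name to_headerless_unquoted is made up, and it does nothing about the column types, which still have to match the table:

```python
import csv
import io


def to_headerless_unquoted(csv_text):
    """Drop the header row and write the remaining values unquoted, comma-separated."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    lines = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        for value in row:
            # Such values cannot be represented without quoting, so fail loudly.
            if "," in value or "\n" in value or '"' in value:
                raise ValueError(f"value cannot be written unquoted: {value!r}")
        lines.append(",".join(row))
    return "".join(line + "\n" for line in lines)


example = 'API,DATE,BOPD\n"1","2","3"\n4,5,6\n'
cleaned = to_headerless_unquoted(example)
print(cleaned)
```

Values that contain a comma, a newline or a quote raise an error instead of being silently altered, since they cannot be written without quoting.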

How to create table without schema in BigQuery by API?

Simply put, I would like to create a table with a given name, providing only the data.
I have some JUnit tests with sample data (JSON files).
Currently I have to provide a schema for each of those files to create tables for them.
I suspect that I shouldn't need to provide those schemas.
Why? Because in the BigQuery console I can create a table from a query (even one as simple as: select 1, 'test'), or I can upload JSON to create a table with schema autodetection, so it should probably be possible to do this programmatically too.
I saw https://chartio.com/resources/tutorials/how-to-create-a-table-from-a-query-in-google-bigquery/#using-the-api and know that I could convert the JSON data into queries and use the Jobs.insert API to run them, but that is over-engineered and has other disadvantages, e.g. boilerplate code.
After some research I found a possibly simpler way of creating a table on the fly, but it doesn't work for me. Code below:
Insert insert = bigquery.jobs().insert(projectId,
    new Job().setConfiguration(
        new JobConfiguration().setLoad(
            new JobConfigurationLoad()
                .setSourceFormat("NEWLINE_DELIMITED_JSON")
                .setDestinationTable(
                    new TableReference()
                        .setProjectId(projectId)
                        .setDatasetId(dataSetId)
                        .setTableId(tableId)
                )
                .setCreateDisposition("CREATE_IF_NEEDED")
                .setWriteDisposition(writeDisposition)
                .setSourceUris(Collections.singletonList(sourceUri))
                .setAutodetect(true)
        )
    ));
Job myInsertJob = insert.execute();
The JSON file used as the source data, which sourceUri points to, looks like:
[
  {
    "stringField1": "value1",
    "numberField2": "123456789"
  }
]
Even though I used setCreateDisposition("CREATE_IF_NEEDED"), I still receive the error: "Not found: Table ..."
Is there any other method in the API, or a better approach than the above, that avoids providing the schema?
The code in your question is perfectly fine, and it does create the table if it doesn't exist. However, it fails when you use a partition id in place of the table id, i.e. when the destination table id is "table$20170323", which is what you used in your job. In order to write to a partition, you will have to create the table first.
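One way to catch this before submitting the load job is to check whether the destination id carries a partition decorator (the "$20170323" suffix). Here is a minimal sketch in Python, since only string handling is involved; the function names are made up and nothing in it calls the BigQuery API:

```python
def split_partition_decorator(table_id):
    """Split 'table$20170323' into ('table', '20170323'); partition is None if absent."""
    base, sep, partition = table_id.partition("$")
    return base, (partition if sep else None)


def targets_partition(table_id):
    """True when the id addresses a single partition rather than the whole table."""
    return split_partition_decorator(table_id)[1] is not None


for candidate in ("table$20170323", "table"):
    print(candidate, split_partition_decorator(candidate), targets_partition(candidate))
```

When targets_partition returns True, the base table has to be created first, as described above; otherwise CREATE_IF_NEEDED can create the table.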

Text file as data source in SSRS

I need to use text files as data source in SSRS. I tried accessing this with ‘OLEDB provider for Microsoft directory services’ connection. But I could not. The query is given below.
Also let me know how to query the data
I know this thread is old, but as it came up in my search results this may help other people.
There are two 'sort of' workarounds for this. See the following:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=130650
So basically you should use OLEDB as the data source, then in the connection string type:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=xxxx;Extended Properties="text;HDR=No;FMT=Delimited"
Then make sure your file is saved in .txt format, with comma delimiters. Where I've put xxxx you need to put the FOLDER directory - so C:\Temp - don't go down to the individual file level, just the folder it's in.
In the query you write for the dataset, you specify the file name as though it were a table - essentially your folder is your database, and the files in it are tables.
Thanks
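To illustrate the folder-as-database convention described above: given the full path of a text file, the folder goes into Data Source and the file name is used as the table. This is a rough Python sketch that only builds the two strings. The function name jet_text_source and the example path are made up, and the bracketed file name in the query is an assumption about how the Jet text driver expects the table to be written:

```python
import ntpath  # Windows path rules, regardless of the OS this runs on


def jet_text_source(file_path, has_header=False):
    """Return (connection_string, query) for reading a delimited text file via Jet."""
    folder, file_name = ntpath.split(file_path)
    hdr = "Yes" if has_header else "No"
    connection_string = (
        "Provider=Microsoft.Jet.OLEDB.4.0;"
        f"Data Source={folder};"
        f'Extended Properties="text;HDR={hdr};FMT=Delimited"'
    )
    # Assumption: the file name, in brackets, acts as the table name.
    query = f"SELECT * FROM [{file_name}]"
    return connection_string, query


conn, query = jet_text_source(r"C:\Temp\report_data.txt")
print(conn)
print(query)
```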
I have had great success creating linked servers in SQL to link to disparate text files for creating SSRS reports. Below is sample SQL to link to your txt files:
EXEC master.dbo.sp_addlinkedserver @server = N'', @srvproduct=N'', @provider=N'Microsoft.Jet.OLEDB.4.0', @datasrc=N'', @provstr=N'text'
EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'YourLinkedServerName',@useself=N'False',@locallogin=NULL,@rmtuser=NULL,@rmtpassword=NULL
I simply used the BULK INSERT command to load the flat file into a temporary table in SSRS, like this:
CREATE TABLE #FlatFile
(
    Field1 int,
    Field2 varchar(10),
    Field3 varchar(15),
    Field4 varchar(20),
    Field5 varchar(50)
)
BEGIN TRY
    BULK INSERT #FlatFile
    FROM 'C:\My_Path\My_File.txt'
    WITH
    (
        FIELDTERMINATOR ='\t', -- TAB delimited
        ROWTERMINATOR ='\n', -- or '0x0a' (whatever works)
        FIRSTROW = 2, -- has 1 header row
        ERRORFILE = 'C:\My_Path\My_Error_File.txt',
        TABLOCK
    );
END TRY
BEGIN CATCH
    -- do nothing (prevent the query from aborting on errors...)
END CATCH
SELECT * FROM #FlatFile
I don't think you can.
See "Data Sources Supported by Reporting Services". In the table there, your only chance would be "Generic ODBC data source"; however, a text file is not ODBC compliant AFAIK. No types, no structure etc.
Why not just display the text files? It seems a bit strange to query text files to bloat them into formatted HTML...
I'm not of the mind that you can, but a workaround for this, if your text files are CSVs or the like, is to create an SSIS package which brings that data into a table in SQL Server, which you can then query like there's no tomorrow. SSIS does Flat File Sources with ease.
You can even automate this by right-clicking the database in SSMS and choosing Tasks->Import Data. Walk through the wizard, and you can then save the package at the end.