What format applies to the Hive LazySimpleSerDe - csv

What exactly is the format for Hive LazySimpleSerDe?
A format like ParquetHiveSerDe tells me that Hive will read the HDFS files in parquet format.
But what is LazySimpleSerDe? Why not call it something explicit like CommaSepHiveSerDe or TabSepHiveSerDe, given LazySimpleSerDe is for delimited files?

LazySimpleSerDe is a fast, simple SerDe. It does not recognize quoted values, though it can work with different delimiters, not only commas; the default is TAB (\t). If you specify STORED AS TEXTFILE in the table DDL, LazySimpleSerDe will be used. For quoted values use OpenCSVSerDe; it is not as fast as LazySimpleSerDe, but it handles quoted values correctly.
LazySimpleSerDe is kept simple for the sake of performance, and it also creates objects lazily, which improves performance further; this is why it is preferable when possible (for text files).
See this example with pipe-delimited (|) file format: https://stackoverflow.com/a/68095278/2700344
The show create table command for such a table prints the serde class as org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe; STORED AS TEXTFILE is just a shortcut for it.
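For illustration, here is a minimal DDL sketch (the table and column names are made up) for a pipe-delimited text table like the one in the linked example:
CREATE TABLE demo_pipe_table (
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Running show create table demo_pipe_table afterwards reports ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' with 'field.delim'='|' in the SERDEPROPERTIES.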

Related

Export non-varchar data to CSV table using Trino (formerly PrestoDB)

I am working on some benchmarks and need to compare the ORC, Parquet and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try the similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other datatypes (i.e. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable for the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Anyone an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
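For instance, a benchmark predicate on regionkey would need a cast back to the numeric type. A minimal sketch against the region_csv table created above:
SELECT name, comment
FROM region_csv
WHERE CAST(regionkey AS bigint) = 1;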
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables by using the TEXTFILE format with ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. TEXTFILE normally uses the ^A character (ASCII 1) as the field separator, but when it is set to ',' the layout resembles CSV.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be written correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
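As a sketch of the safe case, a row whose text columns contain no quotes, commas, or newlines round-trips without problems:
INSERT INTO hive.test.region VALUES (2, 'ASIA', 'a plain comment without quotes commas or newlines');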

TPT12109: Export Operator does not support JSON column. Is there a way to export results involving a json column other than bteq export?

I have a table with 2 million records. I am trying to dump the contents of the table in JSON format. The issue is that TPT export does not allow JSON columns, and a BTEQ export would take a lot of time. Is there any way to handle this export in a more optimized way?
Your help is really appreciated.
If the JSON values are not too large, you could potentially CAST them in your SELECT as VARCHAR(64000) CHARACTER SET LATIN, or VARCHAR(32000) CHARACTER SET UNICODE if you have non-LATIN characters, and export them in-line.
Otherwise each JSON object has to be transferred DEFERRED BY NAME where each object is stored in a separate file and the corresponding filename stored in the output row. In that case you would need to use BTEQ, or TPT SQL Selector operator - or write your own application.
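A minimal sketch of that in-line approach (jsn_obj, <schemaname> and <tablename> are placeholders, and it assumes the JSON values fit within 64000 LATIN characters):
SELECT col1,
       col2,
       CAST(jsn_obj AS VARCHAR(64000) CHARACTER SET LATIN) AS jsn_obj_txt
FROM <schemaname>.<tablename>;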
You can do one thing: load the JSON-formatted rows into another Teradata table.
Keep that table's column as varchar and then do a TPT export of that column/table.
It should work.
INSERT INTO test (col1, col2, ..., jsn_obj)
SELECT col1, col2, ...,
JSON_Compose(<columns you want to include in your json file>)
FROM <schemaname>.<tablename>
;

csv data with comma values throws error while processing the file through the BizTalk flatfile Disassembler

I'm going to pick up a csv file in BizTalk and, after some processing, I want to update it with two or more different systems.
In order to get the csv file in, I'm using the default flat file disassembler to break it up and construct it as XML with the help of a generated schema. I can do that successfully with consistent data; however, if I use data with a comma in it (other than the delimiters), BizTalk fails!
Any other way to do this without using a custom pipeline component?
Expecting a simple configuration within the flatfile disassembler component!
So, here's the deal. BizTalk is not failing. Well, it is, but that is the expected and correct behavior.
What you have is an invalid CSV file. The CSV specification disallows the comma in field data unless a wrap character is used. Either way, both are reserved characters.
To accept the comma in field data, you must choose a wrap character and set that in the Wrap Character property in the Flat File Schema.
This is valid:
1/1/01,"Smith, John", $5000
This is not:
1/1/01,Smith, John, $5000
Since your schema definition has ',' as the delimiter, the flat file disassembler will treat data containing a comma as two fields and will fail due to a mismatch in the number of columns.
You have a few options:
Either add a new field to the schema, if you know the comma in the data will only be present in a particular field.
Or change the delimiter in the flat file from , to | (pipe) or some other character, so that the data does not conflict with the delimiter.
Or, as you mentioned, manipulate the flat file in a custom pipeline component, which should be the last resort if the above two are not feasible.

Unable to import 3.4GB csv into redshift because values contains free-text with commas

And so we found a 3.6GB csv that we have uploaded onto S3 and now want to import into Redshift, then do the querying and analysis from iPython.
Problem 1:
This comma-delimited file contains free-text values that also contain commas, and this is interfering with the delimiting, so we can't upload it to Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly put the values into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So the only way to get the import through is to declare this column as varchar. But then we can't do calculations later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can’t manipulate the data on a local machine because it is too big, and we can’t upload onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or use escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For the second issue, if you want to extract only the numbers from the column after loading it into a char field, use regexp_replace or other functions.
For the third issue, you can also load it into a VARCHAR field and then use cast(left(column_name, 10) || ' ' || right(column_name, 5) || ':00' as timestamp)
to load it into the final table from the staging table.
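If the free-text fields are wrapped in double quotes in the file (which would explain why Excel splits them correctly), a COPY along these lines may handle the embedded commas. This is only a sketch: the target table name, the s3:// path derived from the URL above, and the IAM role are assumptions.
COPY stack_overflow_train
FROM 's3://bucketbigdataclass/stack_overflow_train.csv'  -- s3 path assumed from the https link above
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'  -- placeholder role
CSV  -- treat double-quoted fields as single values, so embedded commas survive
TIMEFORMAT 'YYYY-MM-DD HH:MI';  -- accept timestamps that lack the seconds part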
For the first issue, you need to find out a way to differentiate between the two types of commas - the delimiter and the text commas. Once you have done that, replace the delimiters with a different delimiter and use the same as delimiter in the copy command for Redshift.
For the second issue, you need to first figure out if this column needs to be present for numerical aggregations once loaded. If yes, you need to get this data cleaned up before loading. If not, you can load it directly as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum/avg and the like) on this field.
For problem 3, you can use the TEXT(date, "yyyy-mm-dd hh:mm:ss") function in Excel to do a mass replace for this field.
Let me know if this works out.

How can I quickly reformat a CSV file into SQL format in Vim?

I have a CSV file that I need to format (i.e., turn into) a SQL file for ingestion into MySQL. I am looking for a way to add the text delimiters (single quotes) to the text, but not to the numbers, booleans, etc. I am finding it difficult because some of the text that I need to enclose in single quotes has commas itself, making it difficult to key in on the commas for search and replace. Here is an example line I am working with:
1239,1998-08-26,'Severe Storm(s)','Texas,Val Verde,"DEL RIO, PARKS",'No',25,"412,007.74"
This is a FEMA data file with 131,246 lines that I got off of data.gov and am trying to get into a MySQL database. As you can see, I need to insert a single quote after Texas and before Val Verde, so I tried:
s/,/','/3
But that only replaced the first occurrence of the comma on the first three lines of the file. Once I get past that, I will need to find a way to deal with "DEL RIO, PARKS", as that has a comma that I do not want to place a single quote around.
So, is there a "nice" way to manipulate this data to get it from plain CSV to a proper SQL format?
Thanks
CSV files are notoriously dicey to parse. Different programs export CSV in different ways, possibly including strangeness like embedding new lines within a quoted field or different ways of representing quotes within a quoted field. You're better off using a tool specifically suited to parsing CSV -- perl, python, ruby and java all have CSV parsing libraries, or there are command line programs such as csvtool or ffe.
If you use a scripting language's CSV library, you may also be able to leverage the language's SQL import as well. That's overkill for a one-off, but if you're importing a lot of data this way, or if you're transforming data, it may be worthwhile.
I think that I would also want to do some troubleshooting to find out why the CSV import into MySQL failed.
I would take an approach like this:
:%s/,\("[^"]*"\|[^,"]*\)/,'\1'/g
:%s/^\("[^"]*"\|[^,"]*\)/'\1'/g
In words: after each comma, look for either a double-quoted set of characters or (\| is the alternation) a run of characters containing no comma or double quote, and wrap that set of characters in single quotes.
The second command does the same for the first column in a row, anchoring at the start of the line (^) instead of at a comma.
Try the csv plugin. It allows you to convert the data into other formats. The help includes an example of how to convert the data for importing it into a database.
Just to bring this to a close, I ended up using @Eric Andres's idea, which was the MySQL LOAD DATA option:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n';
The initial .csv file still took a little massaging, but not as much as if I were to do it by hand.
When I commented that the LOAD DATA had truncated my file, I was incorrect. I was treating the file as a typical .sql file and assumed the "ID" column I had added would auto-increment. This turned out to not be the case. I had to create a quick script that prepended an ID to the front of each line. After that, the LOAD DATA command worked for all lines in my file. In other words, all data has to be in place within the file to load before the load, or the load will not work.
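As a related sketch (not what was actually run): if the text fields in the csv are double-quoted, LOAD DATA can be told about the enclosure character so that embedded commas such as the one in "DEL RIO, PARKS" survive without pre-processing:
LOAD DATA LOCAL INFILE '/path/to/file.csv'
INTO TABLE MYTABLE
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';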
Thanks again to all who replied, and to @Eric Andres for his idea, which I ultimately used.