Export table enclosing values with quotes to local csv in hive

I am trying to export a table to a local csv file in hive.
INSERT OVERWRITE LOCAL DIRECTORY '/home/sofia/temp.csv'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
select * from mytable;
The problem is that some of the values contain the newline "\n" character and the resulting file becomes really messy.
Is there any way of enclosing the values in quotes when exporting in Hive, so that the csv file can contain special characters (and especially the newline)?

One possible solution could be to use the Hive CSV SerDe (Serializer/Deserializer). It provides a way to specify custom separator, quote, and escape characters.
Limitation:
It does not handle embedded newlines
Availability:
The CSV SerDe is available in Hive 0.14 and greater.
Background:
The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
Usage:
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.
The original documentation is available at https://github.com/ogrodnek/csv-serde.
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Default separator, quote, and escape characters if unspecified
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER "
DEFAULT_SEPARATOR ,
Reference: Hive csv-serde
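For the export in the original question, a rough sketch (table name, columns, and path are placeholders, and all columns come out as strings): write through an OpenCSVSerde-backed table and collect the files from its location. As noted above, embedded newlines are still not handled.
-- Hypothetical staging table; with the default SerDe properties the files
-- under /tmp/mytable_csv_export use ',' as separator and '"' as the quote character.
CREATE TABLE mytable_csv_export (col1 string, col2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/tmp/mytable_csv_export';
-- Populate it from the source table; quoting is applied on write.
INSERT OVERWRITE TABLE mytable_csv_export
SELECT * FROM mytable;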

Related

Export non-varchar data to CSV table using Trino (formerly PrestoDB)

I am working on some benchmarks and need to compare the ORC, Parquet and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other datatypes (e.g. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable for the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Anyone an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
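For example, a reverse cast in a benchmark query could look like this (a sketch, assuming the exported columns kept their names):
-- Turn the varchar key back into a bigint inside the benchmark query.
SELECT CAST(regionkey AS bigint) AS regionkey, name, comment
FROM region_csv;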
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. It normally uses the ^A character (ASCII 1) as the field separator, but when set to ',' it resembles the structure of a CSV file.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be written correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
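If you still want to use this method, a possible sanity check (a sketch against the source table from the question) is to count the rows whose text columns contain a quote, comma, or newline before trusting the TEXTFILE output:
-- Rows matched here would be silently corrupted by the TEXTFILE-as-CSV approach.
SELECT count(*) AS unsafe_rows
FROM hive.tpch_sf1_orc.region
WHERE regexp_like(name, '[",\n]')
   OR regexp_like(comment, '[",\n]');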

How to prevent Hive Create Table from splitting the column that has comma "," inside the data into two columns

I imported a MySQL table using Sqoop. Some of the column values have a comma "," in them, for example "value, ST". I want to store that value in a single column, just as it is in MySQL, but when I create the Hive table, "value" and "ST" are stored in separate columns: "ST" ends up in the column to the right.
I've tried this
CREATE EXTERNAL TABLE IF NOT EXISTS personal_to_delete
(id_personal string,
no_ktp string,
nama string,
nama_tanpa_gelar string,
alamat1 string,
kodepos string,
id_kabupaten_alamat string,
id_propinsi string,
npwp string,
tgl_update string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\",
"quoteChar" = ","")
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/personal_to_delete';
I get NULL values after I run this query. How do I solve this?
The problem seems to be that the separator character used for separating the fields also appears in the values themselves. This makes it difficult for the create table command to work correctly. To make the file easy to understand for the import process, you need to either
escape this character within the fields, or
use a quote character to enclose the fields, or
use a different field separator which does not appear in the fields themselves.
I myself would probably take one of these two options:
As suggested by OneCricketeer: let Sqoop import directly into a Hive table.
When creating a file with Sqoop, try the option --fields-terminated-by, which sets the field separator character. If you set it to something different than a comma "," (e.g. a semicolon ";" or something else not appearing in your data), provide this information in your Hive create statement (e.g. FIELDS TERMINATED BY ';'), and leave out the SERDEPROPERTIES, it should work, as sketched below.
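A rough sketch of that second option (columns and location are taken from the question; the semicolon separator assumes the Sqoop export was re-run with --fields-terminated-by ';'):
-- Assumes the file was exported with ';' as the field separator,
-- so commas inside the values no longer split the fields.
CREATE EXTERNAL TABLE IF NOT EXISTS personal_to_delete (
  id_personal string,
  no_ktp string,
  nama string,
  nama_tanpa_gelar string,
  alamat1 string,
  kodepos string,
  id_kabupaten_alamat string,
  id_propinsi string,
  npwp string,
  tgl_update string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION '/user/hadoop/personal_to_delete';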

How to convert string "3.82384E+11" to BIGINT with MySQL?

I'm trying to save some ID values from a CSV that are automatically converted to exponential notation by Excel.
For example, 382383816413 becomes 3.82384E+11. So I'm doing a full import into my MySQL database with:
LOAD DATA LOCAL INFILE
'file.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(@item_id,
@colum2,
@colum3)
SET
item_id = @item_id;
I've tried using cast like:
CAST('3.82384E+11' as UNSIGNED) and it gives me just 3.
CAST('3.82384E+11' as BIGINT) and it doesn't work.
CAST('3.82384E+11' as UNSIGNED BIGINT) and gives me 3 again.
So, what's the better way to convert string exponent numbers to real big integers in MySQL?
Set the column format as text instead of number in Excel. Refer to the link below:
PHPExcel - set cell type before writing a value in it
My option was to convert the column with 3.82384E+11 back to a number in the Excel file, so it gets back to the original value. Then I export to CSV and use the SQL query to import it fine.
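If the conversion has to happen on the MySQL side instead (an option not covered by the answers above), a hedged sketch: adding 0 forces a numeric conversion that understands the exponent. Keep in mind that Excel already rounded the value in the CSV, so this recovers 382384000000, not the original 382383816413.
-- '3.82384E+11' + 0 converts the string to a number, exponent included;
-- the CAST then yields 382384000000 (the rounded value Excel wrote out).
SELECT CAST('3.82384E+11' + 0 AS UNSIGNED);
-- The same idea inside the LOAD DATA statement's SET clause:
-- SET item_id = CAST(@item_id + 0 AS UNSIGNED);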

Redshift loading CSV with commas in a text field

I've been trying to load a csv file with the following row in it:
91451960_NE,-1,171717198,50075943,"MARTIN LUTHER KING, JR WAY",1,NE
Note the comma in the name. I've tried all permutations of REMOVEQUOTES, DELIMITER ',', etc... and none of them work.
I have other rows with quotes in the middle of the name, so the ESCAPE option has to be there as well.
According to other posts,
DELIMITER ',' ESCAPE REMOVEQUOTES IGNOREHEADER 1;
should work but does not. Redshift gives a "Delimiter not found" error.
Is the ESCAPE causing issues and do I have to escape the comma?
I have tried loading your data using CSV as the data format parameter and this worked for me. Please keep in mind that CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.
create TEMP table awscptest (a varchar(40),b int,c bigint,d bigint,e varchar(40),f int,g varchar(10));
copy awscptest from 's3://sds-dev-db-replica/test.txt'
iam_role 'arn:aws:iam::<accounID>:<IAM_role>'
delimiter as ',' EMPTYASNULL CSV NULL AS '\0';
References: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#load-from-csv
This is a commonly recurring question. If you are actually using the CSV format for your files (not just some ad hoc text file that uses commas) then you need to enclose the field in double quotes. If you have commas and quotes then you need to enclose the field in double quotes and escape the double quotes in the field data.
There is a definition for the CSV file format, RFC 4180. All text characters can be represented correctly in CSV if you follow the format.
https://www.ietf.org/rfc/rfc4180.txt
Use the CSV option to the Redshift COPY command, not just TEXT with a DELIMITER of ','. Redshift will also follow the official file format if you tell it that the file is CSV, as sketched below.
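A minimal sketch of such a COPY (table name, S3 path, and IAM role are placeholders):
-- The CSV keyword makes COPY honor RFC 4180 quoting, so
-- "MARTIN LUTHER KING, JR WAY" loads as a single column.
COPY my_table
FROM 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
IGNOREHEADER 1;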
In this case, you have a comma (,) in the name field. Clean the data by removing that comma before loading to Redshift.
df = (df.withColumn('name', F.regexp_replace(F.col('name'), ',', ' ')))
Store the new dataframe in S3 and then use the copy command below to load it into Redshift:
COPY 'table_name'
FROM 's3 path'
IAM_ROLE 'iam role'
DELIMITER ','
ESCAPE
IGNOREHEADER 1
MAXERROR AS 5
COMPUPDATE FALSE
ACCEPTINVCHARS
ACCEPTANYDATE
FILLRECORD
EMPTYASNULL
BLANKSASNULL
NULL AS 'null';

how to load csv file into hive

This is my csv file:
id,name,address
"1xz","hari","streetno=1-23-2,street name=Lakehill,town=Washington"
"2xz","giri","streetno=5-6-3456,street name=second street,town=canada"
I loaded this data using the row format delimiter ",", but it was not loading properly. I am facing the problem with the address field: in the address field I have data in this format, "streetno=1-23-2,street name=Lakehill,town=Washington", and in this address field the values are again terminated by ",". I found one solution in Pig; help me to solve it using Hive.
I am getting this output:
"1xz" "hari" "streetno=1-23-2
"2xz" "giri" "streetno=5-6-3456
This is my schema:
create table emps (id string,name string,addresss string ) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
Use the split() function, it returns an array of strings: [0]='streetno', [1]='1-23-2':
split(address,'=')[1] as address --returns '1-23-2'
You already found a working solution in Pig, so why not transfer that relation to a Hive table directly using HCatalog?
STORE pig_relation INTO 'hive_table_name' USING org.apache.hive.hcatalog.pig.HCatStorer();
Make sure you start up Pig using:
>pig -useHCatalog
Table must already exist in Hive.
Hope this helps.
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
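Applied to the table from the question, a rough sketch (the file path is a placeholder, and it assumes skip.header.line.count takes care of the header row): with the SerDe's default quote character, the quoted address, commas and all, stays in one column, and str_to_map can then pull out the individual key=value parts.
-- Recreate the table with the OpenCSVSerde (defaults: separator ',', quote '"')
-- and skip the header line of the csv file.
CREATE TABLE emps_csv (id string, name string, address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
-- Hypothetical path to the csv file shown in the question.
LOAD DATA LOCAL INPATH '/path/to/emps.csv' INTO TABLE emps_csv;
-- str_to_map splits pairs on ',' and key/value on '=',
-- so ['town'] returns 'Washington' for the first row.
SELECT id, name, str_to_map(address, ',', '=')['town'] AS town
FROM emps_csv;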