I'm trying to load JSON data into a Hive table. The JSON data contains newline characters, and when I try to load it into the Hive table, it is not inserted properly.
My Hive table creation:
CREATE EXTERNAL TABLE serde_tab(
gender STRING, name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/input/text' ;
My JSON data:
{"gender":"femal\ne","name":"xyz"}
My Hive table data:
select * from serde_tab;
OK
serde_tab.gender serde_tab.name
femal
e xyz
Can anyone please help me out with this?
You can use the regexp_replace function to replace \n with an empty string ('').
hive> select regexp_replace("femal\ne", '\n', '');
+---------+--+
| _c0 |
+---------+--+
| female |
+---------+--+
Or write a shell script to replace all \n newline characters with empty values ('') before loading.
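Applied to the table from the question, a minimal sketch (assuming the embedded newline survived the load as part of the gender value, as the output above suggests; the view name is mine):

-- strip embedded newlines on read
select regexp_replace(gender, '\n', '') as gender, name
from serde_tab;

-- or hide the cleanup behind a view
create view serde_tab_clean as
select regexp_replace(gender, '\n', '') as gender, name
from serde_tab;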
Related
I have my data in CSV format in the form below:
Id -> tinyint
Name -> String
Id Name
1 Alex
2 Sam
When I export the CSV file to S3 and create an Athena table, the data transforms into the following format.
Id Name
1 "Alex"
2 "Sam"
How do I get rid of the double quotes while creating the table?
Any help is appreciated.
By default, if no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerde (specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable(
id tinyint,
Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/mytable/'
;
Read the manual: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe.
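After creating the table, a quick sanity check (a sketch, using the sample rows above):

select id, Name from mytable;
-- expected, with the surrounding quotes stripped by the SerDe:
-- 1    Alex
-- 2    Sam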
I have an external table using the Glue catalog that reads a CSV file. Fields are enclosed in double quotes if they contain a comma or an LF (line feed). I can read a field correctly as a single value when it contains the delimiter, but fields containing a line feed get split, and the columns after them come back as NULL.
I have used the SerDe row format to specify the quote character, and I have also tried the plain delimited row format with lines terminated by the line feed ASCII character. But so far, none of it works.
CREATE EXTERNAL TABLE schema.ext_table
(
id varchar (18),
name varchar (80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( 'separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\' )
STORED AS textfile
LOCATION 's3://path/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
For a file like this:
id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF
Note that the CRLF and LF markers in the file above can be seen with tools like Notepad++.
I want the output to be like:
1 abc def
2 a,b mno
3 a xyz
b    <- this b needs to be in the same cell as the a above
But the output is coming out like:
1 abc def
2 a,b mno
3 a null
null null null
Got the official response from AWS support: Redshift Spectrum doesn't support embedded line breaks in a CSV file.
Below is my table creation and a sample from my CSV:
DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id bigint, p_name string, h_name string, ufi int, city string, country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
74905,xxx,xyz,-5420642,City One,France
74993,xxx,zyx,-874432,City,Germany
75729,xxx,yzx,-1284248,City Two Long Name,France
I then load the data into a hive table with the following query:
LOAD DATA
INPATH '/user/xxx/hdfs_import/fbp.csv'
INTO TABLE xxx.fbp;
It seems that data is leaking from the 5th CSV "column" into the 6th column of the table, so I'm seeing city data in my country column.
SELECT country, count(country) from xxx.fbp group by country;
+---------+------+
| country | _c1  |
+---------+------+
| Germany | 1143 |
| City    | 1    |
+---------+------+
I'm not sure why city data is occasionally being imported to the country column. The csv is downloaded from Google Sheets and I've removed the header.
The reason could be that your line termination is not '\n'; Windows-based tools add additional characters, which creates issues. It could also be that some fields contain the column separator, which would cause this.
Solution (both checks sketched below):
1. Try printing the lines that have the issue with a where country = 'City' clause; this will give you some idea of how Hive created the record.
2. Try a binary storage format to be 100% sure about the data processed by Hive.
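A minimal sketch of both checks (table name taken from the question; ORC is just one binary format choice, my assumption):

-- 1. inspect how Hive parsed the offending rows
select * from xxx.fbp where country = 'City';

-- 2. rewrite into a binary format and re-check the counts
create table xxx.fbp_orc stored as orc as select * from xxx.fbp;
select country, count(country) from xxx.fbp_orc group by country;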
Hope it helps.
The issue was within the CSV itself. Some columns, such as p_name, contained , in several fields, which caused a field to end sooner than expected. I had to clean the data and remove all , characters. After that, it imported correctly. Quickly done with Python:
# Strip every comma from the raw CSV so the field delimiter can
# no longer appear inside a value (note: this also removes any
# legitimate commas from the data).
with open("fbp.csv") as infile, open("outfile.csv", "w") as outfile:
    for line in infile:
        outfile.write(line.replace(",", ""))
I have a CSV file with embedded commas that I want to drop in a Hive directory so my Hive table will immediately see the data. I don't wish to pre-process the data, and the data has some consecutive double quotes. e.g.:
"hi,there",999,""BROWN,FOX"","goodbye"
I know I need to create my table using the CSV SerDe, and I have:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
but when I select the data for this sample, I get this:
hive> select * from my_table;
hi,there 999 "BROWN FOX" goodbye
instead of what I want:
hive> select * from my_table;
hi,there 999 "BROWN,FOX" goodbye
or even:
hive> select * from my_table;
hi,there 999 BROWN,FOX goodbye
How do I get Hive to consider double double-quotes as a single double-quote, or otherwise read this data the way I want? Can I do this without pre-processing the data? Thank you in advance.
I have a csv file which has contents like this.
"DepartmentID","Name","GroupName","ModifiedDate"
"1","Engineering","Research and Development","2008-04-30 00:00:00"
I have
create external table if not exists AdventureWorks2014.Department
(
DepartmentID smallint ,
Name string ,
GroupName string,
rate_code string,
ModifiedDate timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///ds/Department' TBLPROPERTIES('skip.header.line.count'='1');
And after loading the data
LOAD DATA INPATH 'wasb:///ds/Department.csv' INTO TABLE AdventureWorks2014.Department;
The data is not loaded.
select * from AdventureWorks2014.Department;
The above select returns nothing.
I think the double quotes around each field are the issue. Is there a way to load the data from such a file into Hive tables without having to strip out the double quotes?
Try this:
create external table if not exists AdventureWorks2014.Department
(
DepartmentID smallint,
Name string,
GroupName string,
rate_code string,
ModifiedDate timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'wasb:///ds/Department'
Limitation:
This SerDe treats all columns as being of type string. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output will show string column types, because the type information is retrieved from the SerDe. To convert columns to the desired type, you can create a view over the table that CASTs them to the desired types.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
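A minimal sketch of that view-with-CAST pattern for the table above (the view name is mine):

create view Department_typed as
select
  cast(DepartmentID as smallint) as DepartmentID,
  Name,
  GroupName,
  rate_code,
  cast(ModifiedDate as timestamp) as ModifiedDate
from AdventureWorks2014.Department;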
FIELDS TERMINATED BY '","' is incorrect. Your fields are terminated by a , not ",". Change your DDL to FIELDS TERMINATED BY ','.
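Putting that together with the DDL from the question, a minimal corrected sketch (note that with the delimited row format the surrounding quotes stay part of each value, and numeric columns like DepartmentID will come back NULL because "1" is not a valid smallint, which is why the OpenCSVSerde answer above is usually the better route):

create external table if not exists AdventureWorks2014.Department
(
DepartmentID smallint,
Name string,
GroupName string,
rate_code string,
ModifiedDate timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///ds/Department'
TBLPROPERTIES('skip.header.line.count'='1');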