Move data from local to HDFS - column shift in CSV

I have a comma-separated file in .csv format:
name,address,zip
Ram,"123,ave st",1234
While moving the data to HDFS and creating a comma-separated Hive table, I am facing a column shift:
name - Ram
address - "123
zip - ave st"
What properties in Hive will fix this issue?

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\""
)
STORED AS TEXTFILE
LOCATION 'hdfs://path'
It works.
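For illustration, here is a minimal Python sketch (standard library only, separate from Hive) of why a quote-aware parser resolves the shift: a naive split on every comma breaks the quoted address apart, while a CSV-aware reader keeps it whole, which is what OpenCSVSerde does for the table.
import csv
import io

# Sample row from the question: the address contains a comma,
# protected by double quotes.
sample = 'name,address,zip\nRam,"123,ave st",1234\n'

# A naive split on "," - effectively what a plain
# FIELDS TERMINATED BY ',' table does - shifts the columns:
print(sample.splitlines()[1].split(','))
# -> ['Ram', '"123', 'ave st"', '1234']  (four fields instead of three)

# Quote-aware parsing keeps the address intact:
for row in csv.reader(io.StringIO(sample)):
    print(row)
# -> ['name', 'address', 'zip']
# -> ['Ram', '123,ave st', '1234']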

Related

Writing both a string and a JSON string to a CSV file

I use requests to pull JSON files of companies. How do I add a ticker column and the JSON string to a CSV file (separated by commas) so I can import the CSV file into PostgreSQL?
My Python code:
import json
import requests

ticker_list = ['AAPL', 'MSFT', 'IBM', 'APD']
for ticker in ticker_list:
    url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
    # get data in JSON array format
    json_array = requests.get(url_profile).json()
    # for each record within the JSON array, use json.dumps to turn it into a JSON string
    json_str = [json.dumps(element) for element in json_array]
    # add a ticker column and write both ticker and JSON string to a csv file:
    with open("C:\\DATA\\fmp_profile_no_brackets.csv", "a") as dest:
        for element in json_str:
            dest.writelines(ticker_str + ',' + element + '\n')
In Postgres I have a table t_profile_json with 2 columns:
ticker varchar(20) and profile jsonb
When I copy the file fmp_profile into Postgres using:
copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' delimiter ',';
I get this error:
ERROR: extra data after last expected column
CONTEXT: COPY t_profile_json, line 1: "AAPL,{"symbol": "AAPL", "price": 144.49, "beta": 1.219468, "volAvg": 88657734, "mktCap": 22985613828..."
SQL state: 22P04
The COPY command seems to read the whole "AAPL, json string..." as one string.
I did something wrong in the "dest.writelines(ticker_str + ',' + element + '\n')" line, but I don't know how to correct it.
Thank you so much in advance for helping!
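A sketch of one possible fix, assuming the same setup as the question (fmp_url, profile and apikey below are hypothetical placeholders): let Python's csv module quote each field so that the commas inside the JSON string stop looking like column delimiters.
import csv
import json
import requests

# Hypothetical placeholders standing in for the question's undefined variables.
fmp_url = "https://example.invalid/api/"   # assumed base URL
profile = "profile/"                       # assumed endpoint path
apikey = "YOUR_API_KEY"                    # assumed API key

ticker_list = ['AAPL', 'MSFT', 'IBM', 'APD']

with open("C:\\DATA\\fmp_profile.csv", "w", newline="") as dest:
    # QUOTE_ALL wraps every field in double quotes and doubles any
    # embedded quotes, so the commas inside the JSON stay in one column.
    writer = csv.writer(dest, quoting=csv.QUOTE_ALL)
    for ticker in ticker_list:
        url_profile = fmp_url + profile + ticker + '?apikey=' + apikey
        json_array = requests.get(url_profile).json()
        for element in json_array:
            # use the loop variable ticker (ticker_str was never defined)
            writer.writerow([ticker, json.dumps(element)])
On the Postgres side, COPY then has to be told the file is quoted CSV, for example copy fmp.t_profile_json(ticker,profile) from 'C:\DATA\fmp_profile.csv' with (format csv); with the plain delimiter ',' form, every comma inside the JSON counts as a new column, which is exactly what produces the "extra data after last expected column" error.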

Load pipe-delimited CSV data having " (double quote) in one of the columns in Hive

I have data as below:
Rollno|Name|height|department
101|Aman|5"2|C.S.E
I am taking all the columns as string.
When I load the above data in Hive, I get an extra quote at the start and end, as below:
Rollno:-"101
Name:-Aman
Height:-5"2
Department:-C.S.E"
Can anyone help me with a solution?
Specify your separator such as:
val df = spark.read.option("header","true").option("inferSchema","true").option("sep", "|").csv("test.csv")
df.show(false)
+------+----+------+----------+
|Rollno|Name|height|department|
+------+----+------+----------+
|101 |Aman|5"2 |C.S.E |
+------+----+------+----------+
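The same idea can be sketched in plain Python (standard csv module, not the asker's Hive setup): with quote handling disabled, the stray " in 5"2 is just another character, which mirrors giving the parser no usable quote character.
import csv
import io

data = 'Rollno|Name|height|department\n101|Aman|5"2|C.S.E\n'

# QUOTE_NONE: perform no special processing of quote characters,
# so 5"2 survives as a literal value.
for row in csv.reader(io.StringIO(data), delimiter='|', quoting=csv.QUOTE_NONE):
    print(row)
# -> ['Rollno', 'Name', 'height', 'department']
# -> ['101', 'Aman', '5"2', 'C.S.E']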

Hive SQL, SerDe: how to not quote my fields?

By default, the SerDe quotes fields with ". How can I keep my fields unquoted when using the SerDe?
I tried:
row format serde "org.apache.hadoop.hive.serde2.OpenCSVSerde"
with serdeproperties (
  "separatorChar" = ",",
  "quoteChar" = "")
But I'm getting:
FAILED: SemanticException java.lang.StringIndexOutOfBoundsException: String index out of range: 0
You could achieve this by specifying \u0000 as the quote character. Since quoteChar expects a string, you should use this Unicode version of NULL:
ROW FORMAT SERDE
  "org.apache.hadoop.hive.serde2.OpenCSVSerde"
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\u0000")
This Unicode NULL \u0000 is what the CSV writer class uses as the value for NO_QUOTE_CHARACTER: http://www.java2s.com/Code/Java/Development-Class/AverysimpleCSVwriterreleasedunderacommercialfriendlylicense.htm
For some reason "quoteChar" = "\u0000" didn't work for me as suggested in Nirmal's answer above.
When saving to file without quotes around the fields, I use:
-- saving to file
INSERT OVERWRITE LOCAL DIRECTORY 'file:/home/sidazhou/temp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT *
FROM temp_table
;
PS. I know this isn't what's being asked, which concerns ROW FORMAT SERDE instead of ROW FORMAT DELIMITED FIELDS.
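As a point of comparison (a Python sketch, not Hive itself): the csv module exposes the same "no quote character" idea directly, and like the CSV writer above it then needs an escape character for separators embedded in a field.
import csv
import sys

# QUOTE_NONE tells the writer never to emit quotes - the same effect
# as quoteChar = "\u0000" above. An escapechar is then required for
# fields that contain the separator.
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(['123', 'some random, text', '236'])
# -> 123,some random\, text,236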

Hive ESCAPED BY '\\' not working

I have a dataset in S3:
123, "some random, text", "", "", 236
I built an external table on this dataset:
CREATE EXTERNAL TABLE db1.myData(
  field1 bigint,
  field2 string,
  field3 string,
  field4 string,
  field5 bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/Issue:
When I do
select * from db1.myData
field2 is shown as
some random
but I need the field to be
some random, text
Gotchas:
1. I cannot change the delimiter, as there are over ~300 .csv files at this location.
2. ESCAPED BY is not escaping the '\\'.
3. I'm using Hive 0.13, so I cannot use the CSV SerDe, and I'm not allowed to import new jars to the cluster (it's a complicated process, as I have to go through Director-level approvals).
Question:
Is there a workaround for making 'ESCAPED BY' come alive?
Any other workarounds for this?
All suggestions are welcome!
N.B.: This is not a repeat question. If you think it's a repeat, please guide me to the right page and I will take this off of this portal :)
I had to use ESCAPED BY '\134', which translates to ESCAPED BY '\'.
Additionally, because I was calling the Athena CREATE TABLE statement by passing the statement in from a JSON file, I had to add an extra \ to mask the original \ in the JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
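The JSON double-escaping can be sanity-checked in a few lines of Python (a sketch, independent of Athena):
import json

# Inside a JSON file, backslash is itself an escape character, so the
# single \ in \134 has to be written as \\134. Decoding shows the
# round trip:
fragment = json.loads('"ESCAPED BY \'\\\\134\'"')
print(fragment)  # -> ESCAPED BY '\134'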
If you are using Hive 0.14, you can use the CSV SerDe like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\t",
  "quoteChar" = "'",
  "escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer to the link below for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
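As a side note on why ESCAPED BY could not help in the first place: the sample row protects its embedded comma with quotes, not with a backslash, so an escape-character rule has nothing to act on. A small Python illustration (skipinitialspace handles the space after each separator in the sample):
import csv
import io

row = '123, "some random, text", "", "", 236\n'

# No backslash appears in the data, so an escape rule has nothing to do;
# the embedded comma is protected by quotes, and only a quote-aware
# parser recovers it in one piece.
reader = csv.reader(io.StringIO(row), skipinitialspace=True)
print(next(reader))
# -> ['123', 'some random, text', '', '', '236']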

Error while trying to import CSV into Hive

I tried to import my CSV data into Hive.
My query:
CREATE EXTERNAL TABLE student(Stud_name String, dept String, year String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ","
)
STORED AS TEXTFILE
LOCATION '/home/codewarrior/Desktop/csv';
But it gives an error and quits from Hive. I hope somebody can help me.
You can try this code instead:
CREATE EXTERNAL TABLE student(Stud_name String, dept String, year String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/codewarrior/Desktop/csv';