Hive - Complex regexp_replace - csv

I'm not a specialist in regular expressions and I'm facing issues using regexp_replace in Hive.
I would like to load a CSV file into Hive, which contains rows like this:
AAA,1234,BBB,,,"""CC,CCC""","""DDD""","""EE"EEE""",,
"""AAA""",1234,BBB,,,CCCC,"""DD,DD""",,"""FFFF""",
As you can see, the format isn't perfect:
There are unescaped commas inside string fields
Some string fields are enclosed in """ (3 double quotes)
There are unescaped double quotes inside string fields
There are empty fields
When I try to import it into a Hive table, the columns aren't parsed correctly because of the unescaped commas.
So I imported the raw data as rows into a Hive table like this:
CREATE EXTERNAL TABLE MyRawTable
(
RAW_DATA STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/hdfs/file';
And I'm trying to use the regexp_replace function to transform the rows:
Escape the commas and the double and single quotes in string fields
Remove the double quotes enclosing string fields
So the data will look like this:
AAA,1234,BBB,,,CC\,CCC,DDD,EE\"EEE,,
AAA,1234,BBB,,,CCCC,DD\,DD,,FFFF,
I can't find the right regex for this. Any ideas? Thanks a lot!

Forget about the regex, you don't need it. The commas aren't escaped, but they are surrounded by double quotes. You can simply use the OpenCSVSerde:
CREATE EXTERNAL TABLE yourtable(foo int, bar string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\""
)
LOCATION '/your/folder/containing/csv/files/';
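Applied to the sample rows above, the table definition could look like this (the table and column names are made up; note that OpenCSVSerde reads every column as STRING regardless of the declared type):
CREATE EXTERNAL TABLE MyCsvTable (
  c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING,
  c6 STRING, c7 STRING, c8 STRING, c9 STRING, c10 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\""
)
LOCATION '/path/to/hdfs/file';
With quoteChar and escapeChar both set to ", a doubled quote inside a quoted field collapses to a single literal quote, which keeps the embedded commas from splitting the field.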

Related

Error parsing JSON: more than one document in the input (Redshift to Snowflake SQL)

I'm trying to convert a query from Redshift to Snowflake SQL.
The Redshift query looks like this:
SELECT
cr.creatives as creatives
, JSON_ARRAY_LENGTH(cr.creatives) as creatives_length
, JSON_EXTRACT_PATH_TEXT(JSON_EXTRACT_ARRAY_ELEMENT_TEXT (cr.creatives,0),'previewUrl') as preview_url
FROM campaign_revisions cr
The Snowflake query looks like this:
SELECT
cr.creatives as creatives
, ARRAY_SIZE(TO_ARRAY(ARRAY_CONSTRUCT(cr.creatives))) as creatives_length
, PARSE_JSON(PARSE_JSON(cr.creatives)[0]):previewUrl as preview_url
FROM campaign_revisions cr
It seems like JSON_EXTRACT_PATH_TEXT isn't converted correctly, as the Snowflake query results in the error:
Error parsing JSON: more than one document in the input
cr.creatives is formatted like this:
"[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]"
It seems to me that you are not working with valid JSON data inside Snowflake.
Please review the file format used for your COPY INTO command.
If you open the "JSON" text provided in a text editor, note that the information is not parsed or formatted as JSON because of the quoting you have. Once your issue with double quotes / escaped quotes is handled, you should be able to make good progress.
(Screenshot: proper JSON on the left, original data on the right.)
If you are not inclined to reload your data, see if you can create a JavaScript user-defined function to remove the quotes from your string; then you can use Snowflake to process the variant column.
The following JavaScript is a working proof of concept that removes the doubled quotes for you.
var textOriginal = '[{""previewUrl"":""https://someurl.com/preview1.png"",""device"":""desktop"",""splitId"":null,""splitType"":null},{""previewUrl"":""https://someurl.com/preview2.png"",""device"":""mobile"",""splitId"":null,""splitType"":null}]';
function parseText(input) {
    // collapse each doubled quote to a single one, then parse
    var a = input.replaceAll('""', '"');
    a = JSON.parse(a);
    return a;
}
x = parseText(textOriginal);
console.log(x);
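As a sketch of the UDF route (the function name is an invented placeholder; note that a JavaScript UDF's arguments are exposed in uppercase inside the function body):
create or replace function fix_double_quotes(s string)
returns variant
language javascript
as
$$
  // collapse doubled quotes, then parse into a variant
  return JSON.parse(S.replace(/""/g, '"'));
$$;
You could then rewrite the original query along the lines of:
select fix_double_quotes(cr.creatives)[0]:previewUrl::string as preview_url
from campaign_revisions cr;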
For anyone else seeing this doubled double-quote issue in JSON fields coming from CSV files in a Snowflake external stage (a slightly different issue than the one originally posted):
The issue is likely that you need to use the FIELD_OPTIONALLY_ENCLOSED_BY setting. Specifically, FIELD_OPTIONALLY_ENCLOSED_BY = '"' when setting up your fileformat.
(docs)
Example of creating such a file format:
create or replace file format mydb.myschema.my_tsv_file_format
type = CSV
field_delimiter = '\t'
FIELD_OPTIONALLY_ENCLOSED_BY = '"';
And example of querying from a stage using this file format:
select
  $1 field_one,
  $2 field_two
  -- ...and so on
from @my_s3_stage/path/to/file/my_tab_separated_file.csv (file_format => 'my_tsv_file_format');
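The same file format works for loading the data, too; a sketch with placeholder table and stage names:
copy into mydb.myschema.my_table
from @my_s3_stage/path/to/file/
file_format = (format_name = 'mydb.myschema.my_tsv_file_format');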

Read empty string as Null Athena

I want to create a table in Amazon Athena over a CSV file on S3. The CSV file looks like:
id,name,invalid
1,abc,
2,cba,y
The code for creating the table looks like:
CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table_name} (
id int,
name string,
invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
)
LOCATION '{s3}'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip')
So, my problem is that Athena reads empty strings as actual empty strings, but I'd like to see them as NULLs. I haven't found any property for that in the docs.
LazySimpleSerDe will interpret \N as NULL by default, but you can configure it to use other strings with the serialization.null.format serde property.
See this guide on CSV and Athena for more details.
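A minimal sketch, reusing the question's table definition with the null format set to the empty string (the schema, table name, and location are placeholders):
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.my_table (
  id int,
  name string,
  invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'serialization.null.format' = ''
)
LOCATION 's3://your-bucket/path/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip')
With this in place, the empty invalid field in row 1 comes back as NULL instead of ''.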

mySql JSON string field returns encoded

It's my first week dealing with a MySQL database and JSON field types, and I cannot figure out why values are automatically encoded and then returned in encoded form.
Given the following SQL
-- create a multiline string with a tab example
SET @str = "Line One
Line 2 Tabbed out
Line 3";
-- encode it
SET @j = JSON_OBJECT("str", @str);
-- extract the value by name
SET @strOut = JSON_EXTRACT(@j, "$.str");
-- show the object and attribute value.
SELECT @j, @strOut;
You end up with what appears to be a full formed JSON object with a single attribute encoded.
@j = {"str": "Line One\n\tLine 2\tTabbed out\n\tLine 3"}
but using JSON_EXTRACT to get the attribute value I get the encoded version including outer quotes.
@strOut = "Line One\n\tLine 2\tTabbed out\n\tLine 3"
I would expect to get my original string with the \n and \t all unescaped to their original values and no outer quotes, as such:
Line One
Line 2 Tabbed out
Line 3
I can't seem to find any JSON_DECODE or JSON_UNESCAPE or similar functions.
I did find a JSON_ESCAPE() function but that appears to be used to manually build a JSON object structure in a string.
What am I missing to extract the values to the original format?
I like to use the handy ->> operator for this.
It was introduced in MySQL 5.7.13, and basically combines JSON_EXTRACT() and JSON_UNQUOTE():
SET @strOut = @j ->> '$.str';
You are looking for the JSON_UNQUOTE function:
SET @strOut = JSON_UNQUOTE( JSON_EXTRACT(@j, "$.str") );
The result of JSON_EXTRACT() is intentionally a JSON document, not a string.
A JSON document may be:
An object enclosed in { }
An array enclosed in [ ]
A scalar string value enclosed in " "
A scalar number or boolean value
A null, but this is not an SQL NULL; it's a JSON null. This leads to confusing cases: you can extract a JSON field whose JSON value is null, and yet in an SQL expression it fails IS NULL tests, and it also fails to be equal to the SQL string 'null', because it's a JSON type, not a scalar type.
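A small demonstration of that last point:
SET @doc = '{"a": null}';
SELECT JSON_EXTRACT(@doc, '$.a') IS NULL;    -- 0: it is not an SQL NULL
SELECT JSON_TYPE(JSON_EXTRACT(@doc, '$.a')); -- 'NULL': it is a JSON null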

How to CONCAT two column value into a JSON format?

I'd like to CONCAT two column values into one string in JSON format. However I have problems with the quotes and double quotes in the query. How do I fix my query so it successfully produces the expected result?
$concat = "CONCAT('{"CODE":"pm_r.CODE","NAME":"pm_r.NAME"}') AS `JSON`"
$query = $this->db->query(
'SELECT pm_r.ID_REQUIREMENT, '.$concat.'FROM `pm_requirement` `pm_r`'
);
The expected out should be:
ID_REQUIREMENT JSON
ID001 {"CODE":"001","NAME":"Shane"}
To avoid quoting clashes, a solution is to use a PHP heredoc string.
I also fixed your CONCAT: the names of the columns to concatenate must be separated from the fixed parts of the string.
$query = $this->db->query(<<<EOT
SELECT
pm_r.ID_REQUIREMENT,
CONCAT('{"CODE":"', pm_r.CODE, '","NAME":"', pm_r.NAME, '"}') AS `JSON`
FROM `pm_requirement` `pm_r`
EOT
);
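As a side note, if your MySQL version is 5.7.8 or later, JSON_OBJECT builds the same string and handles the quoting and escaping for you, which avoids the clash entirely:
SELECT
    pm_r.ID_REQUIREMENT,
    JSON_OBJECT('CODE', pm_r.CODE, 'NAME', pm_r.NAME) AS `JSON`
FROM `pm_requirement` `pm_r`;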

I am trying to set the empty values in a CSV file to zero in Hive, but this code doesn't seem to work. What changes should I make?

This is the input .csv file
"1","","Animation"
"2","Jumanji",""
"","Grumpier Old Men","Comedy"
Hive Code
CREATE TABLE IF NOT EXISTS movies(movie_id int, movie_name string,genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"serialization.null.format" = '0'
);
Output
1 Animation
2 Jumanji
Grumpier Old Men Comedy
Empty strings in a CSV are interpreted as empty strings, not NULLs. To represent NULL inside a delimited text file you should use "\N".
Hive also provides a table property, "serialization.null.format", which can be used to treat a string of your choice as NULL in Hive SQL; in your case it should be the empty string "".
To convert NULLs to zeroes, use the NVL(col, 0) or COALESCE(col, 0) function depending on your Hive version (COALESCE should work in all).
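A minimal sketch of the query-time conversion, reusing the question's movies table (note that OpenCSVSerde returns every column as a string, so the comparison is against the empty string):
SELECT
  IF(movie_id = '' OR movie_id IS NULL, '0', movie_id) AS movie_id,
  movie_name,
  genre
FROM movies;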