Athena create table from CSV file (S3 bucket) with semicolon - csv

I'm trying to create a table from an S3 bucket which contains a CSV file. Because of the regional settings, the CSV has a semicolon as the separator, and one column even contains commas.
Input CSV file:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name, First;IT;Inbel
First2 Name2;f2.name2;Name2, First2;IT;Inbel
First3 Name3;f3.name3;Name3, First3;IT;Inbel
First4 Name4;f4.name4;Name4, First4;IT;Inbel
Athena query:
CREATE EXTERNAL TABLE IF NOT EXISTS `a`.`test` (
`Name` string,
`Phone` string,
`CRM` string,
`Desk` string,
`Rol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://***/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
The output comes out as:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name First;IT;Inbel
First2 Name2;f2.name2;Name2 First2;IT;Inbel
First3 Name3;f3.name3;Name3 First3;IT;Inbel
First4 Name4;f4.name4;Name4 First4;IT;Inbel
I tried scanning the web for solutions (especially for the separator), but nothing seems to work. I don't want to change the regional settings and would love to keep the input file as is. Also, if someone knows a solution for the CRM column, that would be a bonus!
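The quickest fix might simply be to change 'field.delim' (and 'serialization.format') to ';' in the query above. Below is a sketch of an alternative using the OpenCSVSerde, which also skips the header row and leaves the commas in the CRM column untouched since ';' is the only delimiter - untested, with the same placeholder location:
CREATE EXTERNAL TABLE IF NOT EXISTS `a`.`test` (
`Name` string,
`Phone` string,
`CRM` string,
`Desk` string,
`Rol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ';',
'quoteChar' = '\"',
'escapeChar' = '\\'
) LOCATION 's3://***/test/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');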

Related

How to handle spaces in field names in a json file while creating Athena table [duplicate]

My JSON file has field names with spaces in them, like "Customer ID". The JSON file sits in the S3 bucket, so when I try creating an Athena table on this JSON file, it throws an error because the field names have spaces. It loads fine when the fields with spaces are removed while loading. How do I handle this situation so that the entire data gets loaded properly?
If you have a chance, convert your JSON file to CSV. Then you can try something like:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table_name (
.......
`Customer_ID` int ,
   .......
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',    
'escapeChar' = '\\' )
STORED AS TEXTFILE LOCATION 's3://location'
TBLPROPERTIES ('skip.header.line.count'='1')
The main thought is TBLPROPERTIES ('skip.header.line.count'='1') - here you skip the header in your .csv file and set your own custom column names.
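Alternatively, if converting to CSV is not an option, the OpenX JSON SerDe's mapping property (used in several of the threads below) may let you keep the file as JSON by mapping a Hive-safe column name to the key containing the space. A hedged sketch, with hypothetical table, column and location names:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table_name (
`customer_id` int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customer_id' = 'Customer ID'
)
LOCATION 's3://location';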

Loading Json Array File in Hive

I have a file which contains the data as follows
[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]
I am trying to load the table in Hive.
I am using the following Hive SerDe:
CREATE EXTERNAL TABLE my_table (
my_array ARRAY<struct<col1:string,col2:int>>
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
I am getting an error when I try to run select * after running the create command:
['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:499', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:307', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:878', 'sun.reflect.GeneratedMethodAccessor29:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy35:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:559', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:751', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:147', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2207', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:494', '*org.apache.hadoop.hive.serde2.SerDeException:java.io.IOException: Start token not found where expected:30:1', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:184', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:502', '*java.io.IOException:Start token not found where expected:30:0', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:170'], statusCode=3), results=None, hasMoreRows=None)
I tried several things, none of which worked as expected. I can't change the input data format, as someone else is providing the data.
This is a malformed JSON issue as far as this SerDe is concerned: it expects every record to be a JSON object with curly braces at the beginning and the end, not a top-level array. So change your JSON file to look something like below.
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Create your table in the exact same way as you are doing it already.
CREATE EXTERNAL TABLE my_table
(
my_array ARRAY<struct<col1:string,col2:int>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
Now fire a select * on your newly created table to see the following results.
[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]
Use select my_array.col1 from my_table; to see the values for col1 from your array.
["col1","col11","col111"]
PS - Not the most efficient way to store the data. Consider transforming the data and storing it as ORC/Parquet.
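For example, a CTAS along these lines (just a sketch; the target table name is hypothetical) would rewrite the data as ORC:
CREATE TABLE my_table_orc STORED AS ORC AS
SELECT * FROM my_table;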
Hope that helps!
Looks like the issue is with your JSON data. Can you try with the below example?
Create an employee JSON file with the below content and place it in HDFS.
[root@quickstart spark]# hadoop fs -cat /user/cloudera/spark/employeejson/*
{"Name":"Vinayak","age":35}
{"Name":"Nilesh","age":37}
{"Name":"Raju","age":30}
{"Name":"Karthik","age":28}
{"Name":"Shreshta","age":1}
{"Name":"Siddhish","age":2}
Add the below JAR (execute this only if you get an error):
hive> ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;
hive>
CREATE TABLE employeefromjson(name string, age int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/hive/employeefromjson'
;
hive> LOAD DATA INPATH '/user/cloudera/spark/employeejson' OVERWRITE INTO TABLE employeefromjson;
hive> select * from employeefromjson;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.174 seconds, Fetched: 6 row(s)
For this SerDe, a JSON record should always start with '{' and not with '['. That is the problem. As you know, JSON has a structure of {'key':'value'}; what you have given in your file is a value which does not have any key. So, change your JSON to the below format:
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Your Create table statement should work fine.
If you want to get the data for each column for all the rows, use the below query.
select my_array.col1, my_array.col2 from my_table;
The above command will give you the below result.
OK
["col1","col11","col111"] [1,11,2]
If you want to get the result column-wise for each row separately, use the below query.
select a.* from my_table m lateral view outer inline (m.my_array) a;
The above command will give you the below result.
OK
col1 1
col11 11
col111 2
Hope this helps!

Hive json serde - Keys with white spaces not populating

I have an external table that is built off of a json file.
All of the json keys are columns and are populated as expected except for one key that has a space.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( event ARRAY<STRUCT<
value:STRING
,info:STRUCT
<id:STRING
,event_source:STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("mapping.event_source"="event source")
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'foo/bar'
All of the values show up as expected except for event_source, which shows up as NULL. The original form of event_source in the json file is 'event source' without the single quotes.
Is there something I need to do different with the WITH SERDEPROPERTIES setting in order to get the key to work properly?
Thanks
You mean that the JSON has data like
{ id: "myid", event source: "eventsource" }
If so, there's not much that can be done, since it's simply broken JSON.
If not, can you post a sample of the JSON you're trying to read?
I have encountered a similar problem to the one above, but with a slight variation: the input data is correct JSON.
I have an external table that is built off of a json file. All of the JSON keys are populated except one, msrp_currency.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( id string,
variants array<struct<pid:string, msrp_currency:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true" ,
'mapping.variants.msrp_currency' = 'variants.msrpcurrency')
LOCATION 'foo/bar'
All of the values show up as expected except for msrp_currency, which shows up as NULL. The reason I need to introduce the underscore is that later I need to extract the same field value as msrpCurrency using the Brickhouse to_json UDF.
sample values:
{ "pid": "mypid", "msrpCurrency": "USD" }

issue with Hive Serde dealing nested structs

I am trying to load a huge volume of JSON data with a nested structure into Hive using a JSON SerDe. Some of the field names in the nested structure start with $. I am mapping the Hive field names using SerDe properties, but when I query the table I get null in the fields starting with $. I tried different syntax, but no luck.
Sample JSON:
{
"_id" : "319FFE15FF90",
"SomeThing" :
{
"$SomeField" : 22,
"AnotherField" : 2112,
"YetAnotherField": 1
}
. . . etc . . . .
Using a schema as follows:
create table testSample
(
`_id` string,
something struct
<
$somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
This schema builds OK; however, somefield (starting with $) in the above table always returns null (all the other values exist and are correct).
We've been trying a lot of syntax combinations, but to no avail.
Does anyone know the trick to map a nested field with a leading $ in its name?
You almost got it right. Try creating the table like this.
The mistake you're making is in the SerDe properties mapping (mapping.somefield = "$somefield"): you're saying "when looking for the Hive column named 'somefield', look for the JSON field '$somefield'". But in Hive you defined the column with the dollar sign, which, if not outright illegal, is for sure not best practice in Hive.
create table testSample
(
`_id` string,
something struct
<
somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
I tested it with some test data:
{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
I am suddenly starting to see this problem as well, but for normal column names too (no special characters such as $).
I am populating an external table (Temp) from another internal table (Table2) and want the output of the Temp table in JSON format. I want the column names in camel case in the output JSON file, so I am also using the SerDe properties in the Temp table to specify the correct names. However, I am seeing that when I do Select * from the Temp table, it gives NULL values for the columns whose names have been used in the mapping.
I am running Hive 0.13. Here are the commands:
Create table command:
CREATE EXTERNAL TABLE Temp (
data STRUCT<
customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customerid' = 'customerId',
'mapping.marketplaceid' = 'marketplaceId'
)
LOCATION '/output';
INSERT INTO TABLE Temp
SELECT
named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin)
FROM Table2;
Select * from Temp:
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}
See how "customerid" and "marketplaceid" are null. The generated JSON file is:
{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}
Now, if I remove the WITH SERDEPROPERTIES, the table starts getting all the values:
{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}
And then the JSON file so generated is:
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}

Hive: parsing JSON

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
I need these values out of the above JSON:
Country   Page   impressions_s   impressions_o
-------   ----   -------------   -------------
US        227    10              10
There is Hive's json_tuple function, but I am not sure if it is the best one for this.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
You can use get_json_object:
select get_json_object(fieldname, '$.country'),
get_json_object(fieldname, '$.data.ad.impressions.s') from ...
You will get better performance with json_tuple, but I found a "how to" for getting values out of JSON nested inside JSON.
To format your table you can use something like this:
from table t lateral view
explode( split(regexp_replace(get_json_object(ln, '$.data.ad.impressions.s'), '\\[|\\]', ''), ',' ) ) tb1 as s
The code above will transform your "array" into a column.
For more: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
I hope this helps.
Here is what you can quickly try; I would suggest using a JSON SerDe.
nano /tmp/hive-parsing-json.json
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
Create the base table:
hive > CREATE TABLE hive_parsing_json_table ( json string );
Load the JSON file into the table:
hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
Query the table:
hive > select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
from hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
as Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2
as Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
as Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
as impressions_s,impressions_o;
Output:
v1.country v1.page v4.impressions_s v4.impressions_o
US 227 10 10
Using Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe') you can do this. Here are the steps:
ADD JAR /path/to/hive-hcatalog-core.jar;
Create a table as below:
CREATE TABLE json_serde_nestedjson (
country string,
page int,
data struct < ad: struct < impressions: struct < s:int, o:int > > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Then load the data (stored in a file):
LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
Then get the required data using:
SELECT country, page, data.ad.impressions.s, data.ad.impressions.o
FROM json_serde_nestedjson;
Implementing a SerDe to parse your JSON data is a better way for your case.
A tutorial on how to implement a SerDe for parsing JSON can be found here:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
You can use the following sample SerDe implementation as well:
https://github.com/rcongiu/Hive-JSON-Serde