How to handle spaces in filed names in a json file while creating Athena table [duplicate] - json

This question already has answers here:
Column name with a Space - Athena
(3 answers)
Closed 1 year ago.
My Json file has field names with spaces in it like "Customer ID".The json file sits in the S3 bucket .So,when I try creating an Athena table on this json file ,it throws me error as the field names has spaces.it loads fine when the fields with spaces are removed while loading.How do I handle this situation so the entire data gets properly loaded.

if you have a chance convert your json file to csv
you can try something like :
CREATE EXTERNAL TABLE IF NOT EXISTS db.table_name (
.......
`Customer_ID` int ,
   .......
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',    
'escapeChar' = '\\' )
STORED AS TEXTFILE LOCATION 's3://location'
TBLPROPERTIES ('skip.header.line.count'='1')
main thought is  - TBLPROPERTIES ('skip.header.line.count'='1') here you skip header in you .csv file and set your custom column name

Related

Athena create table from CSV file (S3 bucket) with semicolon

I'm trying to create a table with a S3 bucket which has an CSV file. Because of the regional settings the CSV has semicolon as a seperator and one row even contains commas.
Input CSV file:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name, First;IT;Inbel
First2 Name2;f2.name2;Name2, First2;IT;Inbel
First3 Name3;f3.name3;Name3, First3;IT;Inbel
First4 Name4;f4.name4;Name4, First4;IT;Inbel
Athena query:
CREATE EXTERNAL TABLE IF NOT EXISTS `a`.`test` (
`Name` string,
`Phone` string,
`CRM` string,
`Desk` string,
`Rol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://***/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
The output comes out as:
Name;Phone;CRM;Desk;Rol
First Name;f.name;Name First;IT;Inbel
First2 Name2;f2.name2;Name2 First2;IT;Inbel
First3 Name3;f3.name3;Name3 First3;IT;Inbel
First4 Name4;f4.name4;Name4 First4;IT;Inbel
I tried scanning the web for solutions (especially for the seperator), but nothing seems to work. I don't want to change regional settings and would love to keep the input file as is. Also if someone knows the solution for the CRM column it would be a bonus!

Loading Json Array File in Hive

I have a file which contains the data as follows
[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]
I am trying to load the table in Hive.
I am using following Hive serde
CREATE EXTERNAL TABLE my_table (
my_array ARRAY<struct<col1:string,col2:int>>
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
I am getting error when I try to run select * after running the create command -
['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:499', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:307', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:878', 'sun.reflect.GeneratedMethodAccessor29:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy35:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:559', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:751', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:147', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2207', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:494', '*org.apache.hadoop.hive.serde2.SerDeException:java.io.IOException: Start token not found where expected:30:1', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:184', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:502', '*java.io.IOException:Start token not found where expected:30:0', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:170'], statusCode=3), results=None, hasMoreRows=None)
I tried several things, none of which worked as expected. I can't change the input data format as it is someone else who is providing the data.
This is a malformed JSON issue. A JSON file will always have "curly braces" at the beginning and the end. So change your JSON file to look something like below.
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Create your table in the exact same way as you are doing it already.
CREATE EXTERNAL TABLE my_table
(
my_array ARRAY<struct<col1:string,col2:int>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";
Now fire a select * on your newly created table to see following results.
[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]
Use select my_array.col1 from my_table; to see the values for col1 from your array.
["col1","col11","col111"]
PS - Not the most efficient way to store the data. Consider transforming the data and storing it as ORC/Parquet.
Hope that helps!
Looks like the issue is with your json data. Can you try with below example?
Create employee json with below content and place it in hdfs.
[root#quickstart spark]# hadoop fs -cat /user/cloudera/spark/employeejson/*
{"Name":"Vinayak","age":35}
{"Name":"Nilesh","age":37}
{"Name":"Raju","age":30}
{"Name":"Karthik","age":28}
{"Name":"Shreshta","age":1}
{"Name":"Siddhish","age":2}
Add below jar(execute only if you get any error. )
hive> ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;
hive>
CREATE TABLE employeefromjson(name string, age int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/hive/employeefromjson'
;
hive> LOAD DATA INPATH '/user/cloudera/spark/employeejson' OVERWRITE INTO TABLE employeefromjson;
hive> select * from employeefromjson;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.174 seconds, Fetched: 6 row(s)
A JSON should always start with '{' and not with '['. That is the problem. As you know, JSON has a structure of {'key':'value'}. What you have given in your file is a value which does not have any key. So, change your JSON to the below formmat
{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}
Your Create table statement should work fine.
If you want to get the data for each column for all the rows, use the below query.
select my_array.col1, my_array.col2 from my_table;
The above command will give you the below result.
OK
["col1","col11","col111"] [1,11,2]
If you want to get the result column wise for each row seperately, use the below query.
select a.* from my_table m lateral view outer inline (m.my_array) a;
The above command will give you the below result.
OK
col1 1
col11 11
col111 2
Hope you this helps!

AWS Glue Crawler for JSONB column in PostgreSQL RDS

I've created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table I get an error.
How can I map a JSON file source to a JSONB target column?
It's not quite a direct copy, but an approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I then convert it to JSONB. For example:
alter table postgres_table
alter column column_with_json set data type jsonb using column_with_json::jsonb;
Note the use of the cast for the existing text data. Without that, the alter column would fail.
Crawler will identify JSONB column type as "string" but you can try to use Unbox Class in Glue to convert this column to json
let's check the following table in PostgreSQL
create table persons (id integer, person_data jsonb, creation_date timestamp )
There is an example of one record from person table
ID = 1
PERSON_DATA = {
"firstName": "Sergii",
"age": 99,
"email":"Test#test.com"
}
CREATION_DATE = 2021-04-15 00:18:06
The following code need to be added in Glue
# 1. create dynamic frame from catalog
df_persons = glueContext.create_dynamic_frame.from_catalog(database = "testdb", table_name = "persons", transformation_ctx = "df_persons ")
# 2.in path you need to add your jsonb column name that need to be converted to json
df_persons_json = Unbox.apply(frame = df_persons , path = "person_data", format="json")
# 3. converting from dynamic frame to data frame
datf_persons_json = df_persons_json.toDF()
# 4. after that you can process this column as a json datatype or create dataframe with all necessary columns , each json data element can be added as a separate column in dataframe :
final_df_person = datf_persons_json.select("id","person_data.age","person_data.firstName","creation_date")
You can also check the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html

Hive json serde - Keys with white spaces not populating

I have an external table that is built off of a json file.
All of the json keys are columns and are populated as expected except for one key that has a space.
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( event ARRAY <STRUCT
value:STRING
,info:STRUCT
<id:STRING
,event_source:STRING>>
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES("mapping.event_source"="event source")
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'foo/bar'
All of the values show up as expected except for event_source, which shows up as NULL. The original form of event_source in the json file is 'event source' without the single quotes.
Is there something I need to do different with the WITH SERDEPROPERTIES setting in order to get the key to work properly?
Thanks
you mean that the json has data like
{ id: "myid", event source: "eventsource" }
If so, there's not much that can be done since it's simply broken JSON.
If not, can you post a sample of the JSON you're trying to read ?
I have encountered a similar problem as above but with a slight variation that the input data is correct json.
I have an external table that is built off of a json file. All of the json keys are populated except one msrp_currency
Here is the DDL:
CREATE EXTERNAL TABLE foo.bar
( id string,
variants array<struct<pid:string, msrp_currency:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true" ,
'mapping.variants.msrp_currency' = 'variants.msrpcurrency')
LOCATION 'foo/bar'
All of the values show up as expected except for msrp_currency, which shows up as NULL. The reason I need to introduce underscore is because later I need to extract the same field value as msrpCurrecny using brickhouse to_json UDF.
sample values:
{ "pid": "mypid", "msrpCurrency": "USD" }

issue with Hive Serde dealing nested structs

I am trying to load a huge volume json data with nested structure to hive using a Json serde. some of the field names start with $ in nested structure. I am mapping hive filed names Using SerDeproperties, but how ever when i query the table, getting null in the field starting with $, tried with different syntax,but no luck.
Sample JSON:
{
"_id" : "319FFE15FF90",
"SomeThing" :
{
"$SomeField" : 22,
"AnotherField" : 2112,
"YetAnotherField": 1
}
. . . etc . . . .
Using a schema as follows:
create table testSample
(
`_id` string,
something struct
<
$somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
This schema builds OK, however, somefield(starting with $) in the above table is always returning null (all the other values exist and are correct).
We've been trying a lot of syntax combinations, but to no avail.
Does anyone know the trick to hap a nested field with a leading $ in its name?
You almost got it right. Try creating the table like this.
The mistake you're making is that when mapping in the serde properties (mapping.somefield ="$somefield") you're saying "when looking for the hive column named 'somefield', look for the json field '$somefield', but in hive you defined the column with the dollar sign, which if not outright illegal it's for sure not the best practice in hive.
create table testSample
(
`_id` string,
something struct
<
somefield:int,
anotherfield:bigint,
yetanotherfield:int
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.somefield" = "$somefield"
);
I tested it with some test data:
{ "_id" : "123", "something": { "$somefield": 12, "anotherfield":13,"yetanotherfield":100}}
hive> select something.somefield from testSample;
OK
12
I am suddenly starting to see this problem as well but for normal column names as well (no special characters such as $)
I am populating an external table (Temp) from another internal table (Table2) and want the output of Temp table in JSON format. I want column names in camel case in the output JSON file and so am also using the Serdepoperties in the Temp table to specify correct names. However, I am seeing that when I do Select * from the Temp table, it gives NULL values for the columns whose names have been used in the mapping.
I am running Hive 0.13. Here are the commands:
Create table command:
CREATE EXTERNAL TABLE Temp (
data STRUCT<
customerId:BIGINT, region:STRING, marketplaceId:INT, asin:ARRAY<STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.customerid' = 'customerId',
'mapping.marketplaceid' = 'marketplaceId'
)
LOCATION '/output';
INSERT INTO TABLE Temp
SELECT
named_struct ('customerId',customerId, 'region', region, 'marketplaceId', marketplaceId, 'asin', asin)
FROM Table2;
Select * from Temp:
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1PZC"]}
{"customerid":null,"region":"EU","marketplaceid":null,"asin":["B000FC1C9G"]}
See how "customerid" and "marketplaceid" are null. Generated JSON file is:
{"data":{"region":"EU","asin":["B000FC1PZC"]}}
{"data":{"region":"EU","asin":["B000FC1C9G"]}}
Now, if I remove the with serdeproperties, the table starts getting all values:
{"customerid":1,"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"]}
{"customerid":2,"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"]}
And then the JSON file so generated is:
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1PZC"],"customerid":1}}
{"data":{"region":"EU","marketplaceid":4,"asin":["B000FC1C9G"],"customerid":2}}