Add data query with Hive-JSON-SerDe - json

I'm working with Hive and I need to add data in JSON format. I use the https://github.com/rcongiu/Hive-JSON-Serde library. It loads data into Hive from a file.
~$ cat test.json
{"text":"foo","number":123}
{"text":"bar","number":345}
$ hadoop fs -put -f test.json /user/data/test.json
$ hive
hive> CREATE DATABASE test;
hive> CREATE EXTERNAL TABLE test ( text string, number int )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/data';
hive> SELECT * FROM test;
OK
foo 123
bar 345
But I need to load data from a query, like:
insert into table test values {"text": "abc", number: 666}
Does anyone know how to do this?

The question seems old; however, in case someone is looking for an answer:
I tried another approach, as follows:
Create table test (text string);
LOAD data inpath 'path/test.json' INTO TABLE test;
insert into table test values ("{'text':'abc','number':666}");
The only difference is that when you need to read the values, it will be something like:
select get_json_object(str,'$.text') as text1, get_json_object(str,'$.number') as number1 from test;

A SerDe is really intended for use with external tables, which read the data from files. So it will not directly help you insert JSON data, and the insert query you give as an example will not work as such. I suggest that you either write the data to a file on your HDFS and create an external table on the folder containing the file, or parse the incoming data so that you can insert it as columns.
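For example, a minimal sketch of the second option, assuming a staging table that holds raw JSON strings (all table and column names here are made up for illustration):
-- staging table holding one raw JSON document per row
CREATE TABLE test_raw (json_str string);
-- new documents can be inserted as plain strings
INSERT INTO TABLE test_raw VALUES ('{"text":"abc","number":666}');
-- typed target table
CREATE TABLE test_typed (text string, number int);
-- parse the JSON fields at insert time
INSERT INTO TABLE test_typed
SELECT get_json_object(json_str, '$.text'),
       CAST(get_json_object(json_str, '$.number') AS int)
FROM test_raw;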

Related

Upload CSVs of JSON data from S3 To Redshift

I have thousands of unusually formatted CSVs sitting in S3 that I need uploaded to Redshift.
The CSVs are formatted like so:
Column A Column B ..... Column Z
{"id": 2034823" "created": "2017-1-1" "result": true}
In other words, each row of the CSV is valid JSON.
I've tried a simple copy command, but to no avail. I tried adding the format as json 'auto'; flag, but I'm still receiving errors:
Invalid Value: err_code 1216, line number 1, position 0
Is there a recommended way to handle CSVs in this format? I want to load them into an existing Redshift table that already has types defined.
I have the exact same types of files. The steps I followed to load them into a Redshift table are:
Create an external table with a struct in Redshift Spectrum.
Insert into your Redshift table from the external table above.
In your case:
1.
CREATE EXTERNAL TABLE <spectrum schema>.<your external table>
(
data struct<
id:integer,
created:timestamp,
...
result:varchar(5)>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties (
'dots.in.keys' = 'true',
'mapping.requesttime' = 'requesttimestamp')
stored as textfile
location 's3://<your S3 bucket>';
2.
INSERT INTO <your Redshift table>
SELECT data.id, data.created, ..., data.result
FROM <your external table>
See how to set up Redshift Spectrum:
https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
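If Spectrum is not set up yet, the external schema itself is created along these lines (a sketch only; the schema name, catalog database, and IAM role ARN are placeholders):
create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;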
Let me know if you have further questions.

Export Dynamodb to S3 using Hive

I referred to this link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html.
My hive script is like below:
DROP TABLE IF EXISTS hiveTableName;
CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName;
CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucket/test-hive2';
SET dynamodb.throughput.read.percent=0.8;
INSERT OVERWRITE TABLE s3TableName SELECT *
FROM hiveTableName;
The DynamoDB table can be successfully exported to S3, but the file format is not JSON; it is like:
uuid{"s":"db154955-8555-4b49-bf40-ee36605ac510"}num{"n":"1294"}info{"s":"qwefjdkslafjdafl"}
uuid{"s":"d9898564-2b56-42ba-9cfb-fd092e7d0b8d"}num{"n":"100"}info{"s":"qwefjdkslafjdafl"}
Does someone know how to export in JSON format? I know I can use Data Pipeline, and it can export a DynamoDB table to S3 in JSON format, but for some reason I need to use EMR. I tried another tool, https://github.com/awslabs/emr-dynamodb-connector, with the command:
java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name
but the error was
Error: Could not find or load main class org.apache.hadoop.dynamodb.tools.DynamoDBExport
Can someone tell me how to solve these problems? Thanks.
== update ==
If I use to_json, as Chris suggested, my code is as below:
DROP TABLE IF EXISTS hiveTableName2;
CREATE EXTERNAL TABLE hiveTableName2 (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName2;
CREATE EXTERNAL TABLE s3TableName2 (item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://backup-restore-dynamodb/hive-test';
INSERT OVERWRITE TABLE s3TableName2 SELECT to_json(item)
FROM hiveTableName2;
When I look at the generated file, it's like
{"uuid":"{\"s\":\"db154955-8555-4b49-bf40-ee36605ac510\"}","num":"{\"n\":\"1294\"}","info":"{\"s\":\"qwefjdkslafjdafl\"}"}
What I want is a nested map, like
map<string, map<string, string>>
not
map<string, string>
Can someone give me some suggestions? Thanks.
Your SELECT * query is emitting a serialized form of the Hive map, which isn't guaranteed to be JSON. You may want to consider using the Brickhouse Hive UDFs. In particular, calling the to_json function would be a good fit for guaranteeing a JSON format in your output.
to_json -- Convert an arbitrary Hive structure (list, map, named_struct) into JSON
INSERT OVERWRITE TABLE s3TableName SELECT to_json(item)
FROM hiveTableName;
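If Brickhouse is not already available in the Hive session, the jar and the function are typically registered first. A rough sketch (the jar path and version are placeholders; the ToJsonUDF class name is the one the Brickhouse project ships, so verify it against your version):
ADD JAR /path/to/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF';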
On November 9, 2020, DynamoDB released a new feature to export your data to an S3 bucket - you can read more about it here:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
It's a native, serverless solution, and currently (as of 11/20) it supports DynamoDB JSON.
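As a rough sketch, the export can be started from the AWS CLI (the table ARN and bucket below are placeholders, and point-in-time recovery must be enabled on the table first):
aws dynamodb export-table-to-point-in-time \
    --table-arn arn:aws:dynamodb:us-west-2:123456789012:table/test_table \
    --s3-bucket my-export-bucket \
    --export-format DYNAMODB_JSON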

SQLITE: import data from a CSV

I need to import into an SQLite database a CSV file that uses both numbers and strings; here is a sample:
col_1|col_2|col_3
10|text2|http://www.google.com
For the import I use the Spatialite GUI because I also have to manage spatial data. Everything works fine during the import, but there is a problem when I try to select the data:
select * from test;
How do I have to structure my CSV file to store my "text2" string?
I solved it in a different manner.
Enter SQLite and give these commands:
CREATE TABLE test(
col_1 TEXT,
col_2 TEXT,
col_3 TEXT
);
.mode csv
.separator |
.import test.csv test
.quit
You can save this in a file (e.g. test.txt) and then, in a second file named test.sh, write:
sqlite3 dbtest.sqlite < test.txt
Save it, change its permissions (chmod 777), and then launch it from the command line:
./test.sh
This will create a table test in your dbtest.sqlite and populate it with data from the test.csv file.
It looks like you defined the type of col_2 as REAL, where it should be TEXT.
The structure of your CSV looks OK.
Disclaimer: I have never used Spatialite, this is just from looking at the information you provided.

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in the documentation (sadly the error log is not informative enough for me to solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL TABLE statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in sub folders in S3 in a following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates of a specific app-string (out of many). Is there any 'wildcard' I can use to replace the dates in the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE and only then query it, or can I add some built-in WHERE statements? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Because right now I'm building this CREATE statement only to reuse the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON then you can use the SerDe's mapping property (which I saw in "issue with Hive Serde dealing nested structs"):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.
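One way to make the partitioned approach work with the existing layout is to register each date prefix explicitly as a partition; a sketch (the table, column names, and date are illustrative):
CREATE EXTERNAL TABLE test_table (
  field1 string,
  field2 string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/';

ALTER TABLE test_table ADD PARTITION (dt='2017-01-01')
LOCATION 's3://location_name/2017/01/01/appstring/';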
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
For examples, see: Analyzing Data in S3 using Amazon Athena
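As a sketch of the Parquet suggestion, an Athena CTAS statement can rewrite the JSON table into Parquet once, so that later queries scan the columnar copy (table names and the output location are placeholders):
CREATE TABLE test_schema.test_table_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://folder/parquet-output/'
) AS
SELECT field1, field2
FROM test_schema.test_table;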

Creating hive table over complex parquet file

I am trying to put a Hive table on top of a Parquet file that I created based on the following JSON contents:
{"user_id":"4513","providers":[{"id":"4220","name":"dbmvl","behaviors":{"b1":"gxybq","b2":"ntfmx"}},{"id":"4173","name":"dvjke","behaviors":{"b1":"sizow","b2":"knuuc"}}]}
{"user_id":"3960","providers":[{"id":"1859","name":"ponsv","behaviors":{"b1":"ahfgc","b2":"txpea"}},{"id":"103","name":"uhqqo","behaviors":{"b1":"lktyo","b2":"ituxy"}}]}
{"user_id":"567","providers":[{"id":"9622","name":"crjju","behaviors":{"b1":"rhaqc","b2":"npnot"}},{"id":"6965","name":"fnheh","behaviors":{"b1":"eipse","b2":"nvxqk"}}]}
I basically used Spark SQL to read the JSON and write out a Parquet file.
I am running into issues with putting Hive on top of the produced Parquet file. Here is the Hive HQL I have:
create table test (mycol STRUCT<user_id:String, providers:ARRAY<STRUCT<id:String, name:String, behaviors:MAP<String, String>>>>) stored as parquet;
Alter table test set location 'hdfs:///tmp/test.parquet';
The above statements execute fine, but I get errors when I try to do a select * on the table:
Failed with exception java.io.IOException:java.lang.IllegalStateException: Column mycol at index 0 does not exist in {providers=providers, user_id=user_id}
Try changing your query to:
create table test (user_id String, providers ARRAY<STRUCT<id:String, name:String, behaviors:MAP<String, String>>>) stored as parquet;
The root JSON object gets flattened out when the Parquet file is written, so user_id and providers become top-level columns rather than fields of a single struct column.