I have a simple JSON file that I would like to insert into an HBase table.
My JSON file has the following format:
{
"word1":{
"doc_01":4,
"doc_02":7
},
"word2":{
"doc_06":1,
"doc_02":3,
"doc_12":8
}
}
The HBase table is called inverted_index, and it has one column family, matches.
I would like to use the keys word1, word2, etc. as row keys, with their values inserted into the column family matches.
I know that Hive supports JSON parsing and I've already tried it, but it only works when I know the JSON keys beforehand so I can access the records.
My problem is that I don't know which words, or how many, my JSON file contains, or how many matches each word will have (it can't be empty, though).
My question: is this even doable using Hive only? If so, kindly provide some pointers on which Hive queries/functions to use to achieve that.
If not, is there any alternative for tackling this? Thanks in advance.
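One possible direction: if the HBase table is exposed to Hive through the HBase storage handler with the matches column family mapped to a Hive MAP column, the column qualifiers (doc ids) do not need to be known in advance; the JSON still has to be parsed into (word, map) rows somehow, e.g. via a staging table. A rough, untested sketch (table and column names follow the question):
-- Map the existing HBase table into Hive.
-- The row key becomes the word column; the whole matches column family
-- becomes a MAP<STRING,INT>, so doc ids can be arbitrary.
CREATE EXTERNAL TABLE inverted_index_hive (word STRING, matches MAP<STRING,INT>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,matches:')
TBLPROPERTIES ('hbase.table.name' = 'inverted_index');
Rows written to inverted_index_hive (for example with INSERT INTO ... SELECT from whatever table holds the parsed JSON) then land in HBase with one column per document.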
Related
I have a table with 2 million records, and I am trying to dump its contents in JSON format. The issue is that TPT export does not allow JSON columns, and a BTEQ export would take a lot of time. Is there any way to handle this export in a more optimized way?
Your help is really appreciated.
If the JSON values are not too large, you could potentially CAST them in your SELECT as VARCHAR(64000) CHARACTER SET LATIN, or VARCHAR(32000) CHARACTER SET UNICODE if you have non-LATIN characters, and export them in-line.
Otherwise each JSON object has to be transferred DEFERRED BY NAME where each object is stored in a separate file and the corresponding filename stored in the output row. In that case you would need to use BTEQ, or TPT SQL Selector operator - or write your own application.
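For example, an in-line export query could look like this (the database, table, and column names are made up for illustration):
-- Cast the JSON column to plain VARCHAR so TPT/BTEQ can export it in-line.
SELECT id,
       CAST(json_col AS VARCHAR(64000) CHARACTER SET LATIN) AS json_txt
FROM mydb.mytable;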
One thing you can do is load the JSON-formatted rows into another Teradata table.
Keep that table's column as VARCHAR and then do a TPT export of that column/table.
It should work.
INSERT INTO test (col1, col2, ..., jsn_obj)
SELECT col1, col2, ...,
       JSON_Compose(<columns you want to include in your JSON object>)
FROM <schemaname>.<tablename>;
I have a MySQL table, WEBSITE_IMAGES, in which one of the fields, called Value, has data in JSON format.
The Value field looks like this:
{"1235":"custom_images","options":{"1235":{"product_image":"image","color":"","image":"{\"14669\":\"\/s\/i\/golden.png\",\"14754\":\"\/s\/m\/tealglass.png\"
I am wondering how I can extract the product_name and image_name only (e.g. 14669 golden.png, 14754 tealglass.png).
The best solution is to set your directory addresses on the application-settings side.
In case of file trouble or a migration, you would otherwise have problems with your files and data. Keep your data simple and let your program resolve the file locations later.
That way you will have less trouble with slashes and conversions.
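If you do need to pull the values out in SQL, here is a rough sketch using MySQL 5.7+ JSON functions. The key '1235', the sample product ID '14669', and the assumption that the image entry holds an escaped JSON string all come from the example above; real keys will differ, so treat this as a starting point rather than a finished query:
SELECT
  -- Product IDs are the keys of the embedded "image" JSON string.
  JSON_KEYS(JSON_UNQUOTE(JSON_EXTRACT(`Value`, '$.options."1235".image'))) AS product_ids,
  -- Pull one image path out of the embedded JSON and keep only the file name.
  SUBSTRING_INDEX(
    JSON_UNQUOTE(JSON_EXTRACT(
      JSON_UNQUOTE(JSON_EXTRACT(`Value`, '$.options."1235".image')),
      '$."14669"')),
    '/', -1) AS image_name
FROM WEBSITE_IMAGES;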
I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing 2 things here:
Preprocess your file and store the final layout of the file without the first row, i.e. the header row
BQ load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter (see the sketch below). This gives you the flexibility to alter the schema at any point in time, if required
Allowing BQ to autodetect schema is not advised.
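A minimal sketch of point 2 (every name below is a placeholder, not taken from the question): a schema file such as
[
  {"name": "first_col",  "type": "STRING",  "mode": "NULLABLE"},
  {"name": "second_col", "type": "INTEGER", "mode": "NULLABLE"}
]
saved as myschema.json can then be passed to the load job together with the header-skip option:
bq load --source_format=CSV --skip_leading_rows=1 --schema=myschema.json mydataset.mytable gs://my-bucket/myfile.csv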
Schema auto detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases for which column name detection fails is when you have similar data types all over your CSV file. For instance, BigQuery schema auto detect would not be able to detect header names for the following file since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip a number of header rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
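For example, with the bq command-line tool (dataset, table, and bucket names here are placeholders):
# Auto-detect the schema but skip the header row.
bq load --source_format=CSV --autodetect --skip_leading_rows=1 mydataset.mytable gs://my-bucket/myfile.csv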
Yes, you can dump the existing schema (i.e. the generated DDL) with bq show and then edit it:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that loading with the edited schema will result in you creating a brand-new BQ table altogether.
I have a workaround on the schema side when loading CSV into BigQuery: you only need to edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you let BigQuery generate the schema, the fields weight and total will be defined as INT64, but inserting the second row will then fail. So you only need to edit the first row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
This forces the fields weight and total to be detected as STRING; if you want to aggregate, you can simply convert the data type in BigQuery.
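For example, if weight and total end up as STRING, a query along these lines converts them back to numbers (the dataset and table names are placeholders; SAFE_CAST returns NULL instead of erroring on the quoted first row):
SELECT
  SAFE_CAST(weight AS FLOAT64) AS weight,
  SAFE_CAST(total  AS FLOAT64) AS total,
  summary
FROM mydataset.mytable;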
cheers
If the 'column name' and the data have the same type all over the CSV file, then BigQuery misinterprets the column name as data and adds a self-generated name for the column. I couldn't find a technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose header is a string while all of its values are numbers, e.g. a column named 'Test' with every value set to 0. Upload the file to BigQuery and use this statement to drop that extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.
Can I get help creating a table on AWS Athena?
Here is a sample of the data:
[{"lts": 150}]
AWS Glue generates the schema as:
array (array<struct<lts:int>>)
When I try to preview the table created by AWS Glue, I get this error:
HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray
The error message is clear, but I can't find the source of the problem!
Hive running under AWS Athena uses Hive-JSON-Serde to serialize/deserialize JSON. For some reason it doesn't support just any standard JSON: it requires one record per line, without a wrapping array. In their words:
The following example will work.
{ "key" : 10 }
{ "key" : 20 }
But this won't:
{
"key" : 20,
}
Nor this:
[{"key" : 20}]
You should create a JSON classifier to convert the array into a list of objects instead of a single array object. Use the JSON path $[*] in your classifier and then set up the crawler to use it:
Edit the crawler
Expand 'Description and classifiers'
Click 'Add' on the left pane to associate your classifier with the crawler
After that, remove the previously created table and re-run the crawler. It will create a table with the proper schema, but I think Athena will still complain when you try to query it. However, you can now read from that table using a Glue ETL job and process single record objects instead of array objects.
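If you prefer to script this, the same kind of classifier can be created with the AWS CLI and then referenced from the crawler; the classifier name below is made up:
# Create a JSON classifier that treats each array element as a record.
aws glue create-classifier --json-classifier '{"Name":"array-to-records","JsonPath":"$[*]"}'
# Then add "array-to-records" to the crawler's custom classifiers, as in the console steps above.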
This JSON - [{"lts": 150}] - would work like a charm with the query below:
select n.lts from table_name
cross join UNNEST(table_name.array) as t (n)
The output would be a single row with lts = 150.
But I have faced a challenge with JSON like [{"lts": 150},{"lts": 250},{"lts": 350}].
Even though there are 3 elements in the JSON, the query returns only the first element. This may be because of the limitation listed by #artikas.
We can definitely change the JSON as below to make it work:
{"lts": 150}
{"lts": 250}
{"lts": 350}
Please post if anyone has a better solution to this.
I have a plain text file in HDFS like this:
44,UK,{"name":{"name1":"John","name2":"marry","name3":"michel"},"fruits":{"fruit1":"apple","fruit2":"orange"}},31-07-2016
91,INDIA,{"name":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
and I want to store this in a Hive table with a schema like
create table data (SerNo int, country string, detail string, `date` string)
So what should the table definition be so that {"name": ..... } ends up as one column, "detail", and the rest of the fields in the other columns?
And what should the column separator be, so that I can query the detail column with the get_json_object UDF along with the other columns?
Thank you.
Hive works well with JSON-format data as long as the JSON is not nested too many levels deep. If it is, it is better to flatten your JSON file.
Refer to https://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
There you can find an explained answer to your question.
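As a concrete, untested sketch for the table definition asked about above: one option is Hive's built-in RegexSerDe, so the commas inside the JSON do not split the detail column. Note that RegexSerDe only supports STRING columns, so SerNo is declared as STRING here, and the regex and location are assumptions:
-- Four capture groups: serial number, country, the {...} JSON blob, date.
CREATE EXTERNAL TABLE data (
  SerNo   STRING,
  country STRING,
  detail  STRING,
  `date`  STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^([^,]*),([^,]*),(\\{.*\\}),([^,]*)$"
)
LOCATION '/path/to/data';
-- The JSON column can then be queried with get_json_object, e.g.:
SELECT SerNo, get_json_object(detail, '$.name.name1') AS name1 FROM data;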