I made a backup of TableA and downloaded it as several .json files, each line of which looks like:
{"Item": {"key_1": {"S": "value 1"}, "key_2": {"N": "2"}, ...}}
I modified some values, and I want to add the modified data to a table TableB.
Is there an easy way to do that?
If TableB already exists and you just want to add the JSON files to it, then you can use a Lambda function to read the files and insert the items into DynamoDB using PutItem/BatchWriteItem.
Lambda works well for small datasets; however, if you plan to ingest GBs of data, then AWS Glue is a better fit.
If TableB is a new table, then you can use the Import from S3 functionality released earlier this year.
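A minimal sketch of the Lambda approach, assuming the backup lines use the DynamoDB-JSON shape shown in the question (the table name and the boto3 call in the comment are placeholders, not tested code):

```python
import json

def parse_backup_lines(lines):
    # Each backup line looks like {"Item": {...typed attribute map...}}
    return [json.loads(line)["Item"] for line in lines if line.strip()]

def batches(items, size=25):
    # BatchWriteItem accepts at most 25 put requests per call
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Inside the Lambda handler you would then do roughly (placeholder table name):
#   client = boto3.client("dynamodb")
#   for batch in batches(items):
#       client.batch_write_item(RequestItems={
#           "TableB": [{"PutRequest": {"Item": it}} for it in batch]
#       })
```

Since the items are already in DynamoDB's typed-attribute format, they can be passed to the low-level client as-is, with no conversion to plain Python values.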
This question is a bit vague and I wouldn't be surprised if it gets closed for that reason, but here goes.
I have successfully been able to use this connector https://github.com/confluentinc/kafka-connect-bigquery to stream Kafka topics into a BigQuery instance. Some columns in the data end up looking like this:
[{ "key": "product", "value": "cell phone" }, { "key": "weight", "value": "heavy" }]
let's call this column product_details, and it automatically goes into a BQ table. I don't create the table or anything; it's created automatically, and this column is created as a RECORD type in BigQuery whose subfields are always KEY and VALUE, no matter what.
Now, when I use the Aiven GCS connector https://github.com/aiven/gcs-connector-for-apache-kafka, I end up loading the data in as a JSON file in JSONL format into a GCS bucket. I then create an external table to query these files in the GCS bucket. The problem is that the column product_details above will look like this instead:
{ "product": "cell phone", "weight": "heavy" }
As you can see, the subfields in this case are NOT KEY and VALUE. Instead, the subfields are PRODUCT and WEIGHT, and this is fine as long as they are not null. If there are null values, BigQuery throws an error saying Unsupported empty struct type for field product_details, because BigQuery does not currently support empty RECORD types.
Now, when I try to create an external table with a schema definition identical to the table created by the Wepay BigQuery sink connector (i.e., using KEY and VALUE as in the BigQuery sink connector), I get an error saying JSON parsing error in row starting at position 0: No such field <ANOTHER, COMPLETELY UNRELATED COLUMN HERE>. So the error relates to a totally different column, and that part I can't figure out.
What I would like to know is: where does the Wepay BigQuery connector insert data as KEY and VALUE rather than using the names of the subfields? I would like to somehow modify the Aiven GCS connector to do the same, but after searching endlessly through the repositories, I cannot figure it out.
I'm relatively new to Kafka Connect, so contributing to or modifying existing connectors is a very tall order. I would like to try and do that, but I really can't seem to figure out where in the Wepay code it serializes the data as KEY and VALUE subfields instead of the actual names of the subfields. I would like to then apply that to the Aiven GCS connector.
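The reshaping in question, a map field becoming a repeated KEY/VALUE struct, can be illustrated in plain Python. This is only a sketch of the transformation, not code from either connector, and the field names are just examples:

```python
def map_to_key_value_list(record, map_fields):
    # Rewrite map-typed fields as lists of {"key": ..., "value": ...} structs,
    # the shape that ends up as a repeated RECORD<key, value> in BigQuery.
    out = dict(record)
    for field in map_fields:
        if isinstance(out.get(field), dict):
            out[field] = [{"key": k, "value": v} for k, v in out[field].items()]
    return out

row = {"id": 1, "product_details": {"product": "cell phone", "weight": "heavy"}}
print(map_to_key_value_list(row, ["product_details"]))
# {'id': 1, 'product_details': [{'key': 'product', 'value': 'cell phone'},
#                               {'key': 'weight', 'value': 'heavy'}]}
```

A transformation like this could in principle be applied as a Single Message Transform before the GCS sink, which would avoid modifying the connector itself.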
On another note, I originally hoped to mitigate this whole issue by using CSV format in the Aiven connector, but it seems CSV files must use BASE64 encoding or "none". BigQuery cannot seem to decode either of these (BASE64 just produces a single column of data in one large BASE64 string, and "none" inserts characters into the CSV file which BigQuery does not support when using an external table).
Any help on where I can identify the code which uses KEY and VALUE instead of the actual subfield names would be really great.
I am exporting a GA360 table from BigQuery to Snowflake in JSON format using the bq CLI. I lose some fields when I load it as a table in Snowflake. I use the COPY command to load the JSON data from a GCS external stage into Snowflake tables, but I am missing some fields that are part of a nested array. I even tried compressing the file when exporting to GCS, but I still lose data. Can someone suggest how I can do this? I don't want to flatten the table in BigQuery and transfer that. My daily table size ranges from 1.5 GB to 4 GB.
bq extract \
--project_id=myproject \
--destination_format=NEWLINE_DELIMITED_JSON \
--compression GZIP \
datasetid.ga_sessions_20191001 \
gs://test_bucket/ga_sessions_20191001-*.json
I have set up my integration, file format, and stage in Snowflake. I am copying data from this bucket into a table that has one VARIANT field. The row count matches BigQuery, but the fields are missing.
I am guessing this is due to Snowflake's limit that each variant column must be under 16 MB. Is there some way I can compress each variant field to be under 16 MB?
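If the 16 MB variant limit really is the cause, one mitigation is to split each oversized session record across several rows before loading, each row carrying a slice of the nested array. A rough sketch, assuming the nested array is the hits field of the GA360 export and using an illustrative byte budget:

```python
import json

def split_record(record, list_field="hits", max_bytes=16_000_000):
    # Yield copies of record whose list_field is sliced so that each
    # serialized row stays under max_bytes (a single oversized item
    # still passes through as its own row, since it cannot be split).
    base = {k: v for k, v in record.items() if k != list_field}
    chunk = []
    for item in record.get(list_field, []):
        candidate = dict(base, **{list_field: chunk + [item]})
        if chunk and len(json.dumps(candidate)) > max_bytes:
            yield dict(base, **{list_field: chunk})
            chunk = [item]
        else:
            chunk.append(item)
    yield dict(base, **{list_field: chunk})
```

The split rows can then be recombined at query time by grouping on visitId (or whatever the session key is).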
I had no problem exporting GA360, and getting the full objects into Snowflake.
First I exported the demo table bigquery-public-data.google_analytics_sample.ga_sessions_20170801 into GCS, JSON formatted.
Then I loaded it into Snowflake:
create or replace table ga_demo2(src variant);
COPY INTO ga_demo2
FROM 'gcs://[...]/ga_sessions000000000000'
FILE_FORMAT=(TYPE='JSON');
And then to find the transactionIds:
SELECT src:visitId, hit.value:transaction.transactionId
FROM ga_demo2, lateral flatten(input => src:hits) hit
WHERE src:visitId='1501621191'
LIMIT 10
Cool things to notice:
I read the GCS files easily from Snowflake deployed in AWS.
JSON manipulation in Snowflake is really cool.
See https://hoffa.medium.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1 for more.
Can I get help creating a table on AWS Athena?
For a sample of the data:
[{"lts": 150}]
AWS Glue generates the schema as:
array (array<struct<lts:int>>)
When I try to preview the table created by AWS Glue, I get this error:
HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray
The error message is clear, but I can't find the source of the problem!
Hive running under AWS Athena uses Hive-JSON-Serde to serialize/deserialize JSON. For some reason, it doesn't support arbitrary standard JSON; it requires one record per line, without an array. In their words:
The following example will work.
{ "key" : 10 }
{ "key" : 20 }
But this won't:
{
"key" : 20,
}
Nor this:
[{"key" : 20}]
You should create a JSON classifier to convert the array into a list of objects instead of a single array object. Use the JSON path $[*] in your classifier, then set up the crawler to use it:
Edit crawler
Expand 'Description and classifiers'
Click 'Add' on the left pane to associate your classifier with the crawler
After that, remove the previously created table and re-run the crawler. It will create a table with the proper schema, but I think Athena will still complain when you try to query it. However, you can now read from that table using a Glue ETL job and process single record objects instead of array objects.
This JSON - [{"lts": 150}] - would work like a charm with the query below:
select n.lts from table_name
cross join UNNEST(table_name.array) as t (n)
The output would be the single value 150.
But I have faced a challenge with JSON like [{"lts": 150},{"lts": 250},{"lts": 350}].
Even though there are 3 elements in the JSON, the query returns only the first element. This may be because of the limitation noted by @artikas.
We can, of course, change the JSON as below to make it work:
{"lts": 150}
{"lts": 250}
{"lts": 350}
Please post if anyone has a better solution.
Is it possible to write the results of an AWS Athena query to a results.json within an S3 bucket?
My first idea was to use INSERT INTO SELECT ID, COUNT(*) ... or INSERT OVERWRITE, but this does not seem to be supported according to the Amazon Athena DDL Statements documentation and tdhopper's blog post.
Is it possible to CREATE TABLE with new data in AWS Athena?
Is there any workaround with AWS Glue?
Is it possible to trigger a Lambda function with the results of Athena?
(I'm aware of S3 hooks.)
It would not matter to me to overwrite the whole JSON file/table and always create a new JSON, since I aggregate only very limited statistics.
I know AWS Athena automatically writes query results to an S3 bucket as CSV. However, I want to do simple aggregations and write the output directly to a public S3 bucket so that an Angular SPA in the browser can read it. Thus the JSON format and a specific path are important to me.
My workaround uses Glue: use the Athena JDBC driver to run the query and load the result into a DataFrame, then save the DataFrame in the required format to the specified S3 location.
df = spark.read.format('jdbc').options(
        url='jdbc:awsathena://AwsRegion=region;UID=your-access-key;PWD=your-secret-access-key;Schema=database name;S3OutputLocation=s3 location where jdbc driver stores athena query results',
        driver='com.simba.athena.jdbc42.Driver',
        dbtable='(your athena query)'
    ).load()

df.repartition(1).write.format("json").save("s3 location")
Specify the query in the format dbtable='(select * from foo)'.
Download the jar from here and store it in S3.
While configuring the ETL job on Glue, specify the S3 location of the jar in the 'Jar lib path'.
You can get Athena to create data in S3 by using a "create table as select" (CTAS) query. In that query you can specify where, and in what format, you want the created table to store its data.
https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
For json, the example you are looking for is:
CREATE TABLE ctas_json_unpartitioned
WITH (
format = 'JSON',
external_location = 's3://my_athena_results/ctas_json_unpartitioned/')
AS SELECT key1, name1, address1, comment1
FROM table1;
This results in JSON output with one record per line.
Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete @rahul's answer, you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to specify the storage plugin (dfs). The "root" workspace can read from the whole disk but is not writable; the tmp workspace (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the json is nested or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id: 123, name: "joe"}, I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that Parquet is a "columnar" store, so you need columns; JSON isn't columnar by default, so you need to convert it.
The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.
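One way around hand-writing that schema-dependent select is to generate it from a sample record. A rough sketch that walks one JSON object and emits the aliased column list (the helper names are my own; Drill identifiers are quoted with backticks):

```python
def flatten_columns(obj, prefix=""):
    # Recursively collect nested fields as (path, alias) pairs,
    # e.g. members.id -> members_id.
    cols = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            cols.extend(flatten_columns(value, path))
        else:
            cols.append((path, path.replace(".", "_")))
    return cols

def build_select(sample, source="dfs.`/tmp/filename.json`"):
    cols = ", ".join(f"t.{p} as `{a}`" for p, a in flatten_columns(sample))
    return f"select {cols} from {source} t"

sample = {"id": 123, "members": {"id": 123, "name": "joe"}}
print(build_select(sample))
# select t.id as `id`, t.members.id as `members_id`, t.members.name as `members_name` from dfs.`/tmp/filename.json` t
```

This still only covers the fields present in the sample record, so data where records omit fields would need a richer sample (or several merged samples) to produce a complete column list.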