I'm having an issue with the S3 sink connector. I set my flush.size to 3 (for tests) and my S3 bucket is receiving the JSON file properly. But when I open the JSON, I don't have a list of JSON objects, I just have them one after another. Is there any way to get the JSON objects "properly" in a list when they are sent to my bucket? I want to try a "good way" to solve this, otherwise I'll fix it in a Lambda function (but I'd rather not...)
What I have:
{"before":null,"after":{"id":10230,"nome":"John","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"}
{"before":null,"after":{"id":10231,"nome":"Alan","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"}
{"before":null,"after":{"id":10232,"nome":"Rodrigo","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"}
What I want
[{"before":null,"after":{"id":10230,"nome":"John","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"},
{"before":null,"after":{"id":10231,"nome":"Alan","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"},
{"before":null,"after":{"id":10232,"nome":"Rodrigo","idade":30,"cidade":"São Paulo","estado":"SP","sexo":"M"}]
The S3 sink connector writes each Kafka message to S3 as its own JSON record, one per line.
What you want is something different: to batch messages together into discrete array objects.
To do this you'll need some kind of stream processing. For example, you could write a Kafka Streams processor that reads the topic and merges each batch of x messages into one message holding an array, as you want (a rough sketch of the idea follows below).
It's not clear how you expect to read these files other than manually, though; most analytical tools that read from S3 buckets (Hive, Athena, Spark, Presto, etc.) expect JSON Lines anyway.
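Kafka Streams itself is a Java library; purely to illustrate the batching idea, here is a minimal sketch using the segmentio/kafka-go client instead. The broker address, topic names and group ID are assumptions, and the batch size mirrors the flush.size of 3 from the question:

package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

const batchSize = 3 // mirrors the flush.size used for testing

func main() {
	ctx := context.Background()

	// Assumed broker address and topic names.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "json-batcher",
		Topic:   "cdc-events",
	})
	writer := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "cdc-events-batched",
	}

	batch := make([]json.RawMessage, 0, batchSize)
	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		batch = append(batch, json.RawMessage(msg.Value))

		if len(batch) < batchSize {
			continue
		}
		// One output message holding a JSON array of the whole batch; the S3
		// sink connector then writes that array to the bucket as a single record.
		payload, err := json.Marshal(batch)
		if err != nil {
			log.Fatal(err)
		}
		if err := writer.WriteMessages(ctx, kafka.Message{Value: payload}); err != nil {
			log.Fatal(err)
		}
		batch = batch[:0]
	}
}

You would then point the S3 sink connector at the batched topic. Keep in mind this trades away the JSON Lines layout that the analytical tools mentioned above expect.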
I have a CloudFormation template that consists of a Lambda function that reads messages from an SQS queue.
The Lambda function will read the message from the queue and transform it using a JSON template (which I want to be injected externally).
I will deploy different stacks for different products, and for each product I will provide a different JSON template to be used for the transformation.
I have different options but couldn't decide which one is better:
I can write all JSON files under the project, pack them together, and pass the related JSON name as a parameter to the Lambda.
I can store the JSON files on S3 and pass the S3 URL to the Lambda so I can read it at runtime.
I can store the JSON files in DynamoDB and read from there, using the same approach as in option 2.
The first one seems like the better approach as I don't need to read from an external source on every Lambda execution, but I will need to pack all templates together.
The last two are a cleaner approach but require an external call to read the JSON on every invocation.
Another approach could be (I'm not sure if it is possible) to inject a JSON file into the Lambda on deploy, from an S3 bucket or something similar, and have the Lambda function read it like an environment variable.
As you can see from the CloudFormation documentation, Lambda environment variables can only be a map of strings, so the actual value you can pass to the function as an environment variable must be a string. You could pass your JSON as a string, but the problem is that the max size for all environment variables is 4 KB.
If your templates are bigger and you don't want to call S3 or DynamoDB at runtime, you could do a workaround like writing a simple shell script that copies the correct template file into the Lambda folder before building and deploying the stack. This way the Lambda gets deployed in a package with the code and only the desired JSON template.
I decided to go with the S3 setup and also improved efficiency by storing the JSON in a global variable (after reading it the first time), so I read it once and use it for the lifetime of the Lambda container.
I'm not sure this is the best solution, but it works well enough for my scenario.
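For reference, a minimal sketch of that caching pattern in a Go Lambda, assuming an SQS trigger and bucket/key environment variables (both made up here); the template is fetched from S3 on the first invocation only and reused while the container stays warm:

package main

import (
	"context"
	"io"
	"log"
	"os"
	"sync"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

var (
	templateOnce sync.Once
	templateJSON []byte // cached for the lifetime of the Lambda container
)

// loadTemplate fetches the JSON template from S3 exactly once per container.
func loadTemplate() []byte {
	templateOnce.Do(func() {
		sess := session.Must(session.NewSession())
		out, err := s3.New(sess).GetObject(&s3.GetObjectInput{
			Bucket: aws.String(os.Getenv("TEMPLATE_BUCKET")), // assumed env vars,
			Key:    aws.String(os.Getenv("TEMPLATE_KEY")),    // set per product stack
		})
		if err != nil {
			log.Fatalf("loading template: %v", err)
		}
		defer out.Body.Close()
		templateJSON, err = io.ReadAll(out.Body)
		if err != nil {
			log.Fatalf("reading template: %v", err)
		}
	})
	return templateJSON
}

func handler(ctx context.Context, event events.SQSEvent) error {
	tmpl := loadTemplate()
	for _, msg := range event.Records {
		// transform msg.Body using tmpl here
		log.Printf("transforming message %s with a %d-byte template", msg.MessageId, len(tmpl))
	}
	return nil
}

func main() {
	lambda.Start(handler)
}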
I have a Kinesis stream in AWS and can send data (JSON) to it using the kinesis command, and can get it back from the stream with:
SHARD_ITERATOR=$(aws kinesis get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name mystream --query 'ShardIterator' --profile myprofile)
aws kinesis get-records --shard-iterator $SHARD_ITERATOR --profile myprofile
The output of this looks something like:
HsKCQkidmlkZW9Tb3VyY2UiOiBbCgkJCXsKCQkJCSJicmFuZGluZyI6IHt9LAoJCQkJInByb21vUG9vbCI6IFtdLAoJCQkJImlkIjogbnVsbAoJCQl9CgkJXSwKCQkiaW1hZ2VTb3VyY2UiOiB7fSwKCQkibWV0YWRhdGFBcHByb3ZlZCI6IHRydWUsCgkJImR1ZURhdGUiOiAxNTgzMzEyNTA0ODAzLAoJCSJwcm9maWxlIjogewoJCQkiY29tcG9uZW50Q291bnQiOiAwLAoJCQkibmFtZSI6ICJTUUVfQVRfUFJPRklMRSIsCgkJCSJpZCI6ICJTUUVfQVRfUFJPRklMRV9JRCIsCgkJCSJwYWNrYWdlQ291bnQiOiAwLAoJCQkicGFja2FnZXMiOiBbCgkJCQl7CgkJCQkJIm5hbWUiOiAiUEVBQ09DSy1MVEEiLAoJCQkJCSJpZCI6ICJmZDk5NTRmZC03NDYwLTRjZjItOTU5Ni05YzBhMjcxNTViODgiCgkJCQl9CgkJCV0KCQl9LAoJCSJ3b3JrT3JkZXJJZCI6ICJTUUVfQVRfSk9CX1NVQk1JU1
How do I get the actual JSON message in raw format (so it looks like JSON) - the same way as it was originally when I sent it?
Thanks
As per the docs, you need to use a Base64 decoding tool or use the KCL library to get the data back in the format it was sent:
The first thing you'll likely notice about your record in this part of the tutorial is that the data appears to be garbage; it's not the clear text testdata we sent. This is due to the way put-record uses Base64 encoding to allow you to send binary data. However, the Kinesis Data Streams support in the AWS CLI does not provide Base64 decoding because Base64 decoding to raw binary content printed to stdout can lead to undesired behavior and potential security issues on certain platforms and terminals. If you use a Base64 decoder (for example, https://www.base64decode.org/) to manually decode dGVzdGRhdGE= you will see that it is, in fact, testdata. This is sufficient for the sake of this tutorial because, in practice, the AWS CLI is rarely used to consume data, but more often to monitor the state of the stream and obtain information, as shown previously (describe-stream and list-streams). Future tutorials will show you how to build production-quality consumer applications using the Kinesis Client Library (KCL), where Base64 is taken care of for you. For more information about the KCL, see Developing KCL 1.x Consumers.
On Unix, you can use the base64 --decode command to decode the Base64-encoded Kinesis record data.
For example, to decode the data of the first record:
# define the name of the stream you want to read
KINESIS_STREAM_NAME='__your_stream_name_goes_here__'
# get a shard iterator; --output text keeps the value unquoted so it can be passed straight back to the CLI
SHARD_ITERATOR=$(aws kinesis get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name "$KINESIS_STREAM_NAME" --query 'ShardIterator' --output text)
# read the records, use `jq` to grab the data of the first record, and base64 decode it
aws kinesis get-records --shard-iterator "$SHARD_ITERATOR" | jq -r '.Records[0].Data' | base64 --decode
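If you end up consuming the stream from code instead of the CLI, the AWS SDKs hand you the decoded bytes directly. A minimal sketch with the Go SDK, reusing the stream and shard names from the question:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	sess := session.Must(session.NewSession())
	kc := kinesis.New(sess)

	iter, err := kc.GetShardIterator(&kinesis.GetShardIteratorInput{
		StreamName:        aws.String("mystream"),
		ShardId:           aws.String("shardId-000000000000"),
		ShardIteratorType: aws.String("TRIM_HORIZON"),
	})
	if err != nil {
		log.Fatal(err)
	}

	out, err := kc.GetRecords(&kinesis.GetRecordsInput{ShardIterator: iter.ShardIterator})
	if err != nil {
		log.Fatal(err)
	}

	for _, rec := range out.Records {
		// rec.Data is already raw bytes here; the SDK handles the
		// Base64 decoding that the CLI leaves to you.
		fmt.Println(string(rec.Data))
	}
}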
I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files - the problem is that the legacy system is generating JSON with a non-standard key:
{
"id": 12345,
"#address": "XXXXXX"
...
}
Of course the "#address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could deactivate the autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of:
Load the file in memory into a string variable
Replace #address with address
Convert the newline-delimited JSON into a list of dictionaries
Use BigQuery streaming inserts to insert the rows into BQ
But I'm afraid this will take a lot longer, the file size may exceed the 2 GB max for functions, I'd have to deal with Unicode when loading the file into a variable, etc.
What other options do I have?
And no, I cannot modify the legacy system to rename the "#address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "#address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
This is an error message on the BigQuery side, because columns/fields in BigQuery have naming restrictions. So you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:
Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
In said Cloud Dataflow pipeline, read the JSON file in a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), get rid of the "#" from the keys before creating your output TableRow object (see the sketch after this list for the key-cleaning idea).
Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
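The ParDo step above assumes Java and Jackson; independent of Beam, here is the key-cleaning transformation itself sketched in Go, i.e. roughly what each JSON line would go through before becoming a TableRow (the sample line is taken from the question):

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// cleanKeys renames keys that BigQuery would reject (here: a leading "#")
// so the row can be loaded as-is. Nested objects are handled recursively.
func cleanKeys(row map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(row))
	for k, v := range row {
		k = strings.TrimPrefix(k, "#")
		if nested, ok := v.(map[string]interface{}); ok {
			v = cleanKeys(nested)
		}
		out[k] = v
	}
	return out
}

func main() {
	line := `{"id": 12345, "#address": "XXXXXX"}`

	var row map[string]interface{}
	if err := json.Unmarshal([]byte(line), &row); err != nil {
		panic(err)
	}

	cleaned, _ := json.Marshal(cleanKeys(row))
	fmt.Println(string(cleaned)) // {"address":"XXXXXX","id":12345}
}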
To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
The advantages of this:
Event driven
Scalable
Serverless/no-ops
You get monitoring and alerting out of the box with Stackdriver
Minimal code
See:
Reading nested JSON in Google Dataflow / Apache Beam
https://cloud.google.com/dataflow/docs/templates/overview
https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/
Disclosure: the last link is to a blog post written by one of the engineers I work with.
I am trying to write a JSON exporter in Go using client_golang.
I could not find any useful example for this. I have a service, ABC, that produces JSON output over HTTP. I want to use client_golang to export this metric to Prometheus.
Take a look at the godoc for the Go client; it is very detailed and contains plenty of examples. The one for the Collector interface is probably the most relevant here:
https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Collector
Essentially, you would implement the Collector interface, which contains two methods: Describe and Collect.
Describe simply sends descriptions for the possible metrics of your Collector over the given channel. This includes their name, label names, and help string.
Collect creates the actual metrics that match the descriptions from Describe and populates them with data. So in your case, it would GET the JSON from your service, unmarshal it, and write the values to the relevant metrics.
In your main function, you then have to register your collector, and start the HTTP server, like this:
prometheus.MustRegister(NewCustomCollector())
http.Handle("/metrics", promhttp.Handler())
log.Fatal(http.ListenAndServe(":8080", nil))
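Tying those pieces together, here is a minimal, self-contained sketch of such a collector; the URL of service ABC and the shape of its JSON (jobs_running, queue_length) are made-up assumptions:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// abcStats mirrors the (assumed) JSON returned by service ABC,
// e.g. {"jobs_running": 3, "queue_length": 17}.
type abcStats struct {
	JobsRunning float64 `json:"jobs_running"`
	QueueLength float64 `json:"queue_length"`
}

type customCollector struct {
	url         string
	jobsRunning *prometheus.Desc
	queueLength *prometheus.Desc
}

func NewCustomCollector() *customCollector {
	return &customCollector{
		url: "http://localhost:9000/stats", // assumed endpoint of service ABC
		jobsRunning: prometheus.NewDesc("abc_jobs_running",
			"Number of running jobs reported by ABC.", nil, nil),
		queueLength: prometheus.NewDesc("abc_queue_length",
			"Queue length reported by ABC.", nil, nil),
	}
}

// Describe sends the descriptors of all possible metrics.
func (c *customCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.jobsRunning
	ch <- c.queueLength
}

// Collect fetches the JSON from the service and emits one metric per field.
func (c *customCollector) Collect(ch chan<- prometheus.Metric) {
	resp, err := http.Get(c.url)
	if err != nil {
		log.Printf("scrape failed: %v", err)
		return
	}
	defer resp.Body.Close()

	var stats abcStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		log.Printf("decode failed: %v", err)
		return
	}

	ch <- prometheus.MustNewConstMetric(c.jobsRunning, prometheus.GaugeValue, stats.JobsRunning)
	ch <- prometheus.MustNewConstMetric(c.queueLength, prometheus.GaugeValue, stats.QueueLength)
}

func main() {
	prometheus.MustRegister(NewCustomCollector())
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}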
You mean you want to write an exporter for your own service in Go? The exporters listed on the Prometheus exporters page are all good examples, many of which are written in Go; you could pick a simple one like the Redis exporter to see how it's implemented.
Basically what you need to do is:
Define your own Exporter type
Implement the prometheus.Collector interface; you can poll the JSON data from your service and build metrics based on it
Register your own Exporter with Prometheus via prometheus.MustRegister
Start an HTTP server and expose a metrics endpoint for Prometheus to scrape
I'm implementing a web service that needs to query a JSON file (size: ~100 MB; format: [{},{},...,{}]) about 70-80 times per second, and the JSON file is updated every hour. "Query a JSON file" means checking whether there's a JSON object in the file that has an attribute with a certain value.
Currently I think I will implement the service in Node.js and import (mongoimport) the JSON file into a collection in MongoDB. When a request comes in, it will query the MongoDB collection instead of reading and looking things up in the file directly. In the Node.js server there should also be a timer service, which every hour checks whether the JSON file has been updated, and if it has, "repopulates" the collection with the data in the new file.
The JSON file is retrieved by sending a request to an external API. The API has two methods: methodA lets me download the entire JSON file; methodB is actually just an HTTP HEAD call, which simply tells whether the file has been updated. I cannot get the incrementally updated data from the API.
My problem is with the hourly update. With the service running, requests are coming in constantly. When the timer detects an update to the JSON file, it will download it, and when the download finishes it will try to re-import the file into the collection, which I think will take at least a few minutes. Is there a way to do this without interrupting the queries to the collection?
Above is my first idea for approaching this. Is there anything wrong with the process? Looking things up in the file directly just seems too expensive, especially with requests coming in about 100 times per second.
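The question is about Node.js, but just to make the hourly check concrete, here is a rough sketch of that timer loop in Go; the feed URL is made up, and it assumes methodB's HEAD response exposes a Last-Modified header to compare against:

package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

const feedURL = "https://api.example.com/data.json" // assumed endpoint (methodA/methodB)

func main() {
	lastModified := ""
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()

	for range ticker.C {
		// methodB: a cheap HEAD call that only tells us whether the file changed.
		head, err := http.Head(feedURL)
		if err != nil {
			log.Printf("HEAD failed: %v", err)
			continue
		}
		head.Body.Close()

		lm := head.Header.Get("Last-Modified")
		if lm == "" || lm == lastModified {
			continue // nothing new this hour
		}

		// methodA: download the full ~100 MB file, then re-import it into the
		// collection (the part the question asks how to do without blocking queries).
		resp, err := http.Get(feedURL)
		if err != nil {
			log.Printf("GET failed: %v", err)
			continue
		}
		f, err := os.Create("data.json")
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(f, resp.Body); err != nil {
			log.Printf("download failed: %v", err)
		}
		resp.Body.Close()
		f.Close()

		lastModified = lm
		log.Printf("downloaded new snapshot (Last-Modified: %s); trigger re-import here", lm)
	}
}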