I'm currently struggling to understand how to create a data catalog for our data lake (= source).
Background:
We have an event-driven architecture and started to store all events produced by our application to a data lake (S3 Bucket). Before the events are stored we sanitize them (remove sensitive information) and add an envelope around each event with some general data:
event origin (which application generated the event)
event type (what kind of event was generated)
timestamp (when was the event generated)
...
With Kinesis Streams and Firehose, we batch those events together and store them as a JSON file in an S3 bucket. The bucket is structured like this:
/////
In there, we store the batched events with the envelope as JSON files. That means one JSON file contains multiple events:
{
  "origin": "hummingbird",
  "type": "AuthenticationFailed",
  "timestamp": "2019-06-30T18:24:13.868Z",
  "correlation_id": "2ff0c077-542d-4307-a58b-d6afe34ba748",
  "data": {
    ...
  }
}
{
  "origin": "hummingbird",
  "type": "PostingCreated",
  "timestamp": "2019-06-30T18:24:13.868Z",
  "correlation_id": "xxxx",
  "data": {
    ...
  }
}
The data object contains the event-specific payload.
Now I thought I could use AWS Glue to hook into the raw data and use ETL jobs to aggregate the event data. As I understand it, I need a data catalog for my source data, and this is where I'm struggling, since each JSON file always contains different events batched together. The standard crawler cannot handle this; well, it does, but it creates nonsense schemas based on every JSON file.
What I wanted to achieve:
Parse through the data lake to filter out events that I'm interested in
Use the events that I'm interested in and do some transformation/aggregation/calculation with it
Store results into our current Analytics RDS or wherever (enough for our purposes right now)
Parse through newly arrived events on a daily basis and insert/append/update them into our analytics RDS
My questions:
What's the best way to use Glue with our data lake?
Are there ways to use crawlers with custom classifiers and some sort of filter together with our data lake?
Do I need to transform the data beforehand to actually be able to use AWS Glue?
Let me give it a try.
Parse through the data lake to filter out events that I'm interested in
Use the events that I'm interested in and do some transformation/aggregation/calculation with it
--> You can flatten the JSON for each event, then export it to a different S3 bucket. Refer to the Python code here: https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
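As a rough illustration (not the exact code from that blog post), a Glue PySpark job using Relationalize could look something like this; the catalog database/table names, staging path and output bucket are placeholders, not names from your setup:

# Minimal Glue job sketch: flatten nested event JSON with Relationalize.
# Database/table names and S3 paths are placeholders for your own catalog entries.
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw, batched events from the Data Catalog (or read S3 directly with from_options)
raw_events = glue_context.create_dynamic_frame.from_catalog(
    database="raw_events_db",      # placeholder
    table_name="raw_events"        # placeholder
)

# Relationalize flattens the nested "data" object into flat columns / child tables
flattened = Relationalize.apply(
    frame=raw_events,
    staging_path="s3://my-temp-bucket/relationalize-staging/",  # placeholder
    name="root",
    transformation_ctx="flattened"
)

# Write the flattened root table to a new bucket for the crawler to pick up
glue_context.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://my-flattened-events-bucket/"},  # placeholder
    format="json"
)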
--> Use Glue to crawl your new bucket and generate a new table schema; then in Athena you should be able to see it and do your filter/query/aggregation on top of the table. Once you're happy with the transformed data, you can further import it into Redshift or RDS.
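For example, once the crawler has created a table (here assumed to be called flattened_events in an analytics_db database, both placeholders), a filter/aggregation could be submitted to Athena from Python like this:

# Sketch: run a filter/aggregation in Athena over the crawled table via boto3.
# Region, database, table name and result bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # placeholder region

query = """
    SELECT type,
           date_trunc('day', from_iso8601_timestamp("timestamp")) AS day,
           count(*) AS events
    FROM flattened_events            -- placeholder table created by the crawler
    WHERE type = 'PostingCreated'
    GROUP BY 1, 2
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics_db"},                 # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder
)
print(response["QueryExecutionId"])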
Store results into our current Analytics RDS or wherever (enough for our purposes right now)
--> From the Glue Catalog above, add a Redshift/RDS connection, then use PySpark (you need some basic knowledge of working with DataFrames) to load the data into Redshift or RDS.
https://www.mssqltips.com/sqlservertip/5952/read-enrich-and-transform-data-with-aws-glue-service/
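As a rough sketch of that load step (not taken from the linked article), assuming a JDBC connection named analytics-rds has been defined in the Glue Catalog and transformed_events is the DynamicFrame produced by your transformations:

# Sketch: write a transformed DynamicFrame to RDS/Redshift through a Glue JDBC connection.
# Connection name, table, database and paths are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_events,                                    # your transformed DynamicFrame
    catalog_connection="analytics-rds",                          # placeholder Glue connection
    connection_options={"dbtable": "event_aggregates",           # placeholder target table
                        "database": "analytics"},                # placeholder target database
    redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/"     # only needed for Redshift targets
)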
Parse through newly arrived events on a daily basis and insert/append/update them into our analytics RDS
--> You can schedule your Glue crawler to discover new data from the new bucket.
Alternatively, Lambda is also a good option for this. You can use S3 object creation (on the new bucket with the flattened JSON) to trigger a Lambda that pre-processes, ETLs and then inserts into Redshift/RDS (using a JDBC driver).
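A rough sketch of such a Lambda, assuming a PostgreSQL analytics RDS reached with psycopg2 instead of a JDBC driver; the event type, table/column names and environment variables are placeholders:

# Sketch: Lambda triggered by S3 ObjectCreated on the flattened-events bucket.
# Assumes a PostgreSQL analytics RDS and psycopg2 (packaged as a layer);
# event type, table/column names and connection settings are placeholders.
import json
import os

import boto3
import psycopg2

s3 = boto3.client("s3")

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],        # placeholder env vars
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            # assuming the flattened files are newline-delimited JSON objects
            for line in filter(None, body.splitlines()):
                evt = json.loads(line)
                if evt.get("type") != "PostingCreated":   # keep only events of interest
                    continue
                cur.execute(
                    "INSERT INTO posting_events (origin, occurred_at, correlation_id) "
                    "VALUES (%s, %s, %s)",
                    (evt["origin"], evt["timestamp"], evt["correlation_id"]),
                )
    conn.close()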
The values for data.CustomerNumber do not get written to the file; the other elements do.
The CustomerNumber is Pascal case because it is a custom value in the system I am pulling the data out of.
I have tried the "map complex values" option and the advanced editor, deleted everything and started over. I originally was trying to save this to an Azure SQL table but gave up and just tried to write to a file. I have been able to do this type of mapping in other pipelines, so I am really at a loss for why it doesn't work in this case.
The actual JSON content for the mapping:
I have a requirement to validate incoming CSV data against metadata coming from a BigQuery table. I am using Dataflow locally to do this, but I am unable to figure out how to use an Apache Beam transform to implement this logic.
Note: At the moment, I have seen some code where ParDo is used with a side input. Is this the right approach?
Validation required: before loading the data into a BigQuery table, I have to check it against the metadata; if it passes the validation, I have to insert the data into the BQ table.
Sample code:
"Process Data into TableRows" >> beam.ParDo(ProcessDF(),
beam.pvalue.AsList(metadata_for_validation))
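For context, here is roughly the shape of ProcessDF I have in mind; the metadata columns, field names and validation rule below are only illustrative, and the GCS path and table names are placeholders:

# Sketch: validate CSV rows against BigQuery metadata passed as a side input.
# Metadata shape, field names and the validation rule are illustrative only.
import apache_beam as beam

class ProcessDF(beam.DoFn):
    def process(self, element, metadata):
        # element: one CSV line, e.g. "123,john@example.com,2019-06-30"
        # metadata: list of dicts from BigQuery, e.g.
        #   [{"column_name": "id", "is_required": True}, ...]
        values = element.split(",")
        columns = [m["column_name"] for m in metadata]
        if len(values) != len(columns):
            return  # drop rows that don't match the expected column count
        row = dict(zip(columns, values))
        for m in metadata:
            if m.get("is_required") and not row.get(m["column_name"]):
                return  # drop rows failing the required-field check
        yield row  # a dict per valid row, ready for WriteToBigQuery

with beam.Pipeline() as p:
    metadata_for_validation = p | "Read metadata" >> beam.io.ReadFromBigQuery(
        query="SELECT column_name, is_required FROM `my_project.my_dataset.metadata`",  # placeholder
        use_standard_sql=True,
    )
    (
        p
        | "Read CSV" >> beam.io.ReadFromText("gs://my-bucket/input.csv")   # placeholder
        | "Process Data into TableRows" >> beam.ParDo(
            ProcessDF(), beam.pvalue.AsList(metadata_for_validation))
        | "Write to BQ" >> beam.io.WriteToBigQuery(                        # assumes table exists
            "my_project:my_dataset.my_table",                              # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )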
I am trying to get weather forecast data from OpenWeatherMap and integrate it into Orion by performing a registration request.
I was able to register and get the API key from OpenWeatherMap; however, it returns a JSON file with all the data inside, which is not supported by Orion.
I have followed the step-by-step tutorial https://fiware-tutorials.readthedocs.io/en/latest/context-providers/index.html#context-provider-ngsi-proxy where the data is acquired from OpenWeatherMap through an NGSI proxy and the API key has to be set in the docker-compose file as an environment variable; however, the data acquired is the current data, not the forecast, and it is specific to Berlin.
I have tried to access the files inside the container "fiware/tutorials.context-provider" and modify the parameters to match my needs, but I feel like I am going down a long, blocked path.
I don't think that's even considered good practice, but I have run out of ideas :(
Can anyone suggest how I could bring the forecast data to Orion and register it as a context provider?
Thank you in advance.
I imagine you aim to implement a context provider, able to speak NGSI with Orion.
OpenWeatherMap surely doesn't implement NGSI ...
If you have the data from OpenWeatherMap as a JSON string, perhaps you should parse the JSON and create your entities using selected key-values from the parsed OpenWeatherMap response. Save the entity (or entities) locally and then register those keys in Orion.
Alternatively (easier but I wouldn't recommend it), create local entities with the entire OpenWeatherMap data as the value of an attribute of the entity:
{
  "id": "id-from-OpenWeatherMap",
  "type": "OpenWeatherMap",
  "weatherData": {
    "value": ...
  },
  ...
}
Then you register id/weatherData in Orion.
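Depending on which of the two approaches you pick, the calls against Orion's NGSI-v2 API could look roughly like this; the Orion endpoint, entity id, forecast payload and provider URL are all placeholders:

# Sketch: create a local entity holding (parsed) OpenWeatherMap forecast data, and/or
# register a context provider for it in Orion. Endpoint, id, payload and provider URL
# are placeholders.
import requests

ORION = "http://localhost:1026"   # placeholder Orion endpoint

# Placeholder: whatever you parsed out of the OpenWeatherMap forecast response
forecast = {"temp": 21.4, "description": "scattered clouds"}

# Alternative 2: create the entity locally in Orion (NGSI-v2: POST /v2/entities)
entity = {
    "id": "urn:ngsi-ld:WeatherForecast:Berlin",   # placeholder id
    "type": "WeatherForecast",
    "weatherData": {"type": "StructuredValue", "value": forecast},
}
requests.post(f"{ORION}/v2/entities", json=entity).raise_for_status()

# Alternative 1: register your own context provider for that attribute
# (NGSI-v2: POST /v2/registrations)
registration = {
    "description": "OpenWeatherMap forecast provider",
    "dataProvided": {
        "entities": [{"id": "urn:ngsi-ld:WeatherForecast:Berlin", "type": "WeatherForecast"}],
        "attrs": ["weatherData"],
    },
    "provider": {"http": {"url": "http://my-context-provider:3000"}},  # placeholder proxy URL
}
requests.post(f"{ORION}/v2/registrations", json=registration).raise_for_status()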
I have a Google Cloud Storage bucket where a legacy system drops NEW_LINE_DELIMITED_JSON files that need to be loaded into BigQuery.
I wrote a Google Cloud Function that takes the JSON file and loads it into BigQuery. The function works fine with sample JSON files; the problem is that the legacy system generates JSON with a non-standard key:
{
  "id": 12345,
  "#address": "XXXXXX"
  ...
}
Of course the "#address" key throws everything off and the cloud function errors out ...
Is there any option to "ignore" the JSON fields that have non-standard keys? Or to provide a mapping and ignore any JSON field that is not in the map? I looked around to see if I could deactivate the autodetect and provide my own mapping, but the online documentation does not cover this situation.
I am contemplating the option of:
Loading the file in memory into a string var
Replace #address with address
Convert the json new line delimited to a list of dictionaries
Use bigquery stream insert to insert the rows in BQ
But I'm afraid this will take a lot longer, the file size may exceed the max 2 GB for functions, I'll have to deal with Unicode when loading the file into a variable, etc. etc. etc.
What other options do I have?
And no, I cannot modify the legacy system to rename the "#address" field :(
Thanks!
I'm going to assume the error that you are getting is something like this:
Errors: query: Invalid field name "#address". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
This is an error message on the BigQuery side, because cols/fields in BigQuery have naming restrictions. So, you're going to have to clean your file(s) before loading them into BigQuery.
Here's one way of doing it, which is completely serverless:
Create a Cloud Function to trigger on new files arriving in the bucket. You've already done this part by the sounds of things.
Create a templated Cloud Dataflow pipeline that is triggered by the Cloud Function when a new file arrives. It simply passes the name of the file to process to the pipeline.
In said Cloud Dataflow pipeline, read the JSON file in a ParDo and, using a JSON parsing library (e.g. Jackson if you are using Java), read each object and get rid of the "#" before creating your output TableRow object (see the sketch after these steps).
Write results to BigQuery. Under the hood, this will actually invoke a BigQuery load job.
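For illustration only, in Python the key-cleaning step could look roughly like this, using a Map (a thin ParDo); the file pattern and project/dataset/table names are placeholders, and it assumes the target table already exists:

# Sketch: Beam pipeline that strips the "#" prefix from keys before writing to BigQuery.
# File pattern and project/dataset/table names are placeholders.
import json

import apache_beam as beam

def clean_keys(line):
    """Parse one newline-delimited JSON record and rename non-standard keys."""
    record = json.loads(line)
    return {key.lstrip("#"): value for key, value in record.items()}

with beam.Pipeline() as p:
    (
        p
        | "Read NDJSON" >> beam.io.ReadFromText("gs://my-legacy-bucket/drop/*.json")  # placeholder
        | "Clean keys" >> beam.Map(clean_keys)
        | "Write to BQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",                                          # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )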
To sum up, you'll need the following in the conga line:
File > GCS > Cloud Function > Dataflow (template) > BigQuery
The advantages of this:
Event driven
Scalable
Serverless/no-ops
You get monitoring and alerting out of the box with Stackdriver
Minimal code
See:
Reading nested JSON in Google Dataflow / Apache Beam
https://cloud.google.com/dataflow/docs/templates/overview
https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/
Disclosure: the last link is to a blog post written by one of the engineers I work with.
I am a newbie on the AWS side, working on an AWS IoT project where all devices update their state and send JSON to AWS IoT. A rule is in place to save the data to DynamoDB, and I have created a table in DynamoDB.
I am sending the data below to AWS IoT:
{
  "state": {
    "reported": {
      "color": "Blue",
      "mac": "123:123"
    }
  }
}
But in DynamoDB it is saving three items: one for state, another for current, and one for metadata.
I want to save only the data that comes in for state. Is there any condition I have to write for this?
Instead of creating a rule to write directly to DynamoDB (which, IMHO, is not a good practice), have the rule trigger a Lambda function, which then processes the payload (and maybe even does some error checking) and writes to DynamoDB.
I don't believe there is any way to configure how you want the data mapped to DynamoDB, so you need something like Lambda to map it.
Longer term, if you need to change your schema ( or even change the database ), you can change the lambda to do something else.
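For reference, a minimal sketch of such a Lambda, assuming the rule forwards the shadow payload unchanged and the table's partition key is the device MAC (table name and key schema are placeholders):

# Sketch: Lambda invoked by an IoT rule; keeps only state.reported and writes it to DynamoDB.
# Table name and key attribute are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("DeviceState")   # placeholder table name

def handler(event, context):
    # event is the payload forwarded by the IoT rule, e.g.
    # {"state": {"reported": {"color": "Blue", "mac": "123:123"}}}
    reported = event.get("state", {}).get("reported", {})
    if not reported:
        return  # nothing to store
    item = dict(reported)               # e.g. {"color": "Blue", "mac": "123:123"}
    # assumes the table's partition key is "mac"; adjust to your key schema
    table.put_item(Item=item)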