I am currently working on a Service Bus trigger (using C#) which copies and moves related blobs to another blob storage and to Azure Data Lake. After copying, the function has to emit a notification to trigger further processing tasks. Therefore, I need to know when the copy/move task has finished.
My first approach was to use an Azure Function which copies all these files. However, Azure Functions have a processing time limit of 10 minutes (when set manually), so this does not seem to be the right solution. I was considering calling AzCopy or StartCopyAsync() to perform an asynchronous copy, but as far as I understand, the function will still run for as long as the copy takes. To solve the time limit problem I could use WebJobs instead, but there are also other technologies like Logic Apps, Durable Functions, Batch jobs, etc., which makes it hard to choose the right one for this problem. The function won't be called every second, but it might have to copy large amounts of data. Does anybody have an idea?
I just found out that Azure Functions only have a time limit when using the Consumption plan. If there is no better solution for blob copy tasks, I'll go with Azure Functions.
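For what it's worth, the copy itself can be offloaded to the storage service: a server-side copy only has to be started and then polled for completion, so the function mostly waits rather than streams bytes. Here is a rough sketch of that pattern, shown with the Python azure-storage-blob SDK purely for illustration (StartCopyAsync() is the .NET counterpart); the connection strings, container names, and the notify callback are placeholders.

# Sketch only: server-side blob copy plus completion polling, using the Python
# azure-storage-blob SDK for illustration (StartCopyAsync() is the .NET
# counterpart). Connection strings, container names, and notify() are placeholders.
import time
from azure.storage.blob import BlobServiceClient

source = BlobServiceClient.from_connection_string("<source-connection-string>")
target = BlobServiceClient.from_connection_string("<target-connection-string>")

def copy_blob_and_notify(container: str, blob_name: str, notify) -> None:
    src_blob = source.get_blob_client(container, blob_name)
    dst_blob = target.get_blob_client(container, blob_name)

    # Start a server-side copy; the storage service moves the bytes, so this
    # process only has to poll for completion. For a source in another
    # account, src_blob.url typically needs a SAS token appended.
    dst_blob.start_copy_from_url(src_blob.url)

    while True:
        status = dst_blob.get_blob_properties().copy.status  # pending/success/aborted/failed
        if status == "success":
            notify(blob_name)  # e.g. send a Service Bus or Event Grid message here
            return
        if status in ("aborted", "failed"):
            raise RuntimeError(f"Copy of {blob_name} ended with status {status}")
        time.sleep(5)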
I am new to GCP and need help setting up a system for this scenario.
There is a file in GCS that gets written to by an application program (for example, a log).
I need to capture every new record that is written to this file, process the record with some transformation logic, and finally write it into a BigQuery table.
I am thinking about this approach:
event trigger on Google Cloud Storage for the file
write into Pub/Sub
apply a Google Cloud Function
subscribe into BigQuery
I am not sure if this approach is optimal and right for this use case.
Please suggest.
It depends a bit on your requirements. Here are some options:
1
Is it appropriate to simply mount this file as an external table like this?
One example from those docs:
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
  Region STRING,
  Quarter STRING,
  Total_Sales INT64
) OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv'],
  skip_leading_rows = 1);
If your desired transformation can be expressed in SQL, this may be sufficient: you could define a SQL View that enacts the transformation but which will always query the most up-to-date version of the data. However, queries may turn out to be a little slow with this setup.
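As a minimal sketch of that idea (using the google-cloud-bigquery Python client; the view name and the transformation itself are made up), the view simply wraps the external table defined above:

# Sketch: a view over the external table from the example above, so queries
# always see the latest file contents. Dataset, view name and the SQL
# transformation are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    CREATE OR REPLACE VIEW mydataset.sales_by_region AS
    SELECT
      Region,
      SUM(Total_Sales) AS total_sales   -- the transformation, expressed in SQL
    FROM mydataset.sales                -- the external table defined above
    GROUP BY Region
    """
).result()  # wait for the DDL job to finish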
2
How up-to-date does your BigQuery table have to be? Real-time accuracy is often not needed, in which case a batch load job on a schedule may be most appropriate. There's a nice built-in system for this approach, the BigQuery Data Transfer Service, which you could use to sync the BigQuery table as often as every fifteen minutes.
Unlike with an external table, you could create a materialized view for your transformation, ensuring good performance with a guarantee that the data won't be more than 15 minutes out of date on the most frequent schedule.
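If you'd rather not use the Data Transfer Service, the same idea can be hand-rolled as a scheduled batch load; here is a sketch with the google-cloud-bigquery Python client (bucket, dataset, and table names are placeholders). The transformation can then live in a (materialized) view over the resulting native table.

# Sketch of a hand-rolled scheduled load: replace a native table with the
# current contents of the GCS file on every run (cron, Cloud Scheduler, ...).
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # mirror the file
)
client.load_table_from_uri(
    "gs://mybucket/sales.csv",      # placeholder source file
    "mydataset.sales_native",       # placeholder destination table
    job_config=job_config,
).result()  # block until the load job completes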
3
Okay, you need real-time availability and good performance, or your transformation is too complex to express in SQL? For this, your proposal looks okay, but it has quite a few moving parts, and there will certainly be some latency in the system. In this scenario you're likely better off following GCP's preferred route of using the Dataflow service, which provides a template for streaming files from GCS into BigQuery, with a transformation of your choosing applied via a function.
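To give a feel for the shape of that pipeline, here is a very rough sketch written directly against the Beam Python SDK (bucket, table, schema, and the transform itself are placeholders; the GCP-provided template does essentially this with a user-supplied UDF, and header handling is omitted):

# Rough sketch: watch a GCS path for files, apply an arbitrary Python
# transform, and stream rows into BigQuery. All names are placeholders.
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

def transform(line: str) -> dict:
    # Placeholder transformation: parse a CSV line into a BigQuery row.
    region, quarter, total = line.split(",")
    return {"Region": region, "Quarter": quarter, "Total_Sales": int(total)}

options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner etc.
with beam.Pipeline(options=options) as p:
    (
        p
        | "WatchFiles" >> fileio.MatchContinuously("gs://mybucket/sales*.csv", interval=60)
        | "ReadFiles" >> fileio.ReadMatches()
        | "Lines" >> beam.FlatMap(lambda f: f.read_utf8().splitlines())
        | "Transform" >> beam.Map(transform)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "myproject:mydataset.sales",
            schema="Region:STRING,Quarter:STRING,Total_Sales:INTEGER",
        )
    )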
4
There is one other case I didn't deal with, which is where you don't need real-time data but the transformation is complex and can't be expressed with SQL. In this case I would probably suggest a batch job run on a simple schedule (using a GCS client library and a BigQuery client library in the language of your choice).
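A minimal sketch of what such a batch job could look like in Python, using the GCS and BigQuery client libraries (the file, table names, and the transformation are placeholders):

# Sketch of option 4: pull the file with the GCS client, apply an arbitrarily
# complex Python transformation, and load the result into BigQuery.
from google.cloud import bigquery, storage

def run_batch_sync() -> None:
    blob = storage.Client().bucket("mybucket").blob("sales.csv")
    lines = blob.download_as_text().splitlines()[1:]  # skip the header row

    rows = []
    for line in lines:
        region, quarter, total = line.split(",")
        # Any transformation that is awkward to express in SQL can live here.
        rows.append({"Region": region.upper(), "Quarter": quarter, "Total_Sales": int(total)})

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_json(rows, "mydataset.sales_transformed",
                                job_config=job_config).result()

if __name__ == "__main__":
    run_batch_sync()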
There are many, many ways to do this sort of thing, and unless you are working on a completely greenfield project you almost certainly have one you could use. But I will remark that GCP has recently created the ability to use Cloud Scheduler to execute Cloud Run Jobs, which may be easiest if you don't already have a way to do this.
None of this is to say your approach won't work - you can definitely trigger a cloud function directly based on a change in a GCP bucket, and so you could write a function to perform the ELT process every time. It's not a bad all-round approach, but I have aimed to give you some examples that are either simpler or more performant, covering a variety of possible requirements.
I have a Pub/Sub topic with roughly 1 message per second published. The message size is around 1 KB. I need to get this data in real time into both Cloud SQL and BigQuery.
The data arrives at a steady rate and it's crucial that none of it gets lost or delayed. Writing it multiple times into the destination is not a problem. The size of all the data in the database is around 1 GB.
What are the advantages and disadvantages of using Google Cloud Functions triggered by the topic versus Google Dataflow to solve this problem?
Dataflow is focused on transforming the data before loading it into a sink. The streaming pattern of Dataflow (Beam) is very powerful when you want to perform computations on windowed data (aggregate, sum, count, ...). If your use case requires a steady rate, Dataflow can be a challenge when you deploy a new version of your pipeline (fortunately easily handled if duplicated values aren't a problem!).
Cloud Functions are the glue of the cloud. From your description, they seem a perfect fit. On the topic, create 2 subscriptions and 2 functions (one on each subscription): one writes into BigQuery, the other into Cloud SQL. This parallelisation ensures the lowest latency in the processing.
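A sketch of one of those two functions, as a Pub/Sub-triggered Cloud Function in Python (1st-gen signature; the table name and message format are assumptions, and the Cloud SQL twin would look the same with a database insert instead of the streaming insert):

# Sketch: background Cloud Function triggered by one Pub/Sub subscription,
# streaming each message into BigQuery. Table name and payload format are
# placeholders; duplicates on retry are acceptable per the question.
import base64
import json
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "myproject.mydataset.events"  # placeholder

def pubsub_to_bigquery(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    errors = client.insert_rows_json(TABLE, [payload])  # streaming insert
    if errors:
        # Raising lets Pub/Sub redeliver the message (at-least-once delivery).
        raise RuntimeError(f"BigQuery insert failed: {errors}")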
Currently I am using MySQL as my main database, which is not a real-time database. To synchronize client-side data with the server and keep data available offline, I am considering adding a real-time database as a slave database in my architecture. Synchronizing data is not easy, so I want to use Cloud Firestore.
From what I have found so far, it seems there is no practical way to synchronize between a non-real-time RDBMS (in my case MySQL) and Cloud Firestore. I cannot migrate the current data to Cloud Firestore because other services depend on it.
If there is no practical solution for this, please suggest the best alternative. Thanks.
Not sure, but it seems there is no ready-made solution to synchronize these. For a case like this, the synchronization has to be implemented manually on the client side.
I have been thinking about this as well. What I have thought up so far is this.
Use Firestore to create a document.
Write a Cloud Function that listens to create events on the collection that stores that document.
Send the created document over HTTP to a REST endpoint that stores the data in a relational DB (MySQL, Postgres). If the server is down or the status code is anything other than 200, set a boolean flag on the document called syncFailed to true.
Write a worker that periodically fetches documents where syncFailed is true and sends them to the endpoint again until the sync succeeds.
THINGS TO CONSIDER
Be careful in adopting this approach with update events. You may accidentally create an infinite loop that will cost you all your wealth. If you do add an update listener on a document, make sure you devise a way to avoid such a loop: an update-listening cloud function may update the document, which re-triggers the function, which updates the document again, and so on. Use a boolean flag or some other field on the document as a sentinel value to break out of that situation.
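A minimal sketch of steps 2 and 3 above as a 1st-gen Firestore-triggered Cloud Function in Python (the endpoint URL is a placeholder, and the document payload is forwarded in the raw Firestore event format for brevity):

# Sketch: on every document create, POST the document to a REST endpoint and
# set syncFailed=true on the document if the call does not return 200.
# Because this listens to create events only, the update cannot re-trigger it.
import requests
from google.cloud import firestore

db = firestore.Client()
ENDPOINT = "https://example.com/api/sync"  # placeholder REST endpoint

def on_document_create(data, context):
    doc_resource = data["value"]["name"]          # projects/.../documents/<collection>/<id>
    doc_path = doc_resource.split("/documents/")[1]

    try:
        resp = requests.post(ENDPOINT, json=data["value"]["fields"], timeout=10)
        failed = resp.status_code != 200
    except requests.RequestException:
        failed = True

    if failed:
        # The periodic worker from step 4 picks this flag up and retries later.
        db.document(doc_path).update({"syncFailed": True})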
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executed once. The thing is that I want to get the new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I think it may have been replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible and there's a couple ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
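For what it's worth, here is the same idea sketched in the Beam Python SDK rather than Java, where PeriodicImpulse plays the role of GenerateSequence and a plain MySQL client inside a DoFn stands in for JdbcIO.readAll() (connection details, table, and Datastore kind are placeholders, and for brevity the "last synced" watermark lives in the DoFn instead of in a MySQL table):

# Rough sketch: a periodic tick queries MySQL for rows newer than the last
# sync and writes them to Datastore. Assumes rows carry an insertion timestamp.
import datetime
import apache_beam as beam
import pymysql
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from google.cloud import datastore

class SyncNewRows(beam.DoFn):
    def setup(self):
        self.conn = pymysql.connect(host="mysql-host", user="user",
                                    password="secret", database="mydb")
        self.ds = datastore.Client()
        self.last_seen = datetime.datetime(1970, 1, 1)

    def process(self, _tick):
        with self.conn.cursor() as cur:
            cur.execute("SELECT id, payload, created_at FROM events "
                        "WHERE created_at > %s", (self.last_seen,))
            for row_id, payload, created_at in cur.fetchall():
                entity = datastore.Entity(self.ds.key("Event", row_id))
                entity.update({"payload": payload, "created_at": created_at})
                self.ds.put(entity)  # same key on re-sync just overwrites
                self.last_seen = max(self.last_seen, created_at)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "Every30s" >> PeriodicImpulse(fire_interval=30)
     | "SyncMySQLToDatastore" >> beam.ParDo(SyncNewRows()))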
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (additionally emitting records to Pub/Sub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
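A sketch of the publishing side of that approach in Python (the topic name, table, and record shape are placeholders); a streaming Dataflow job can then consume the topic via ReadFromPubSub:

# Sketch: whatever code inserts into MySQL also publishes the record to
# Pub/Sub, so the streaming pipeline sees it immediately.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "mysql-inserts")  # placeholders

def insert_record(cursor, record: dict) -> None:
    cursor.execute(
        "INSERT INTO events (id, payload) VALUES (%s, %s)",
        (record["id"], json.dumps(record["payload"])),
    )
    publisher.publish(topic_path, json.dumps(record).encode("utf-8")).result()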
I want to "replicate" a database to an external service. For doing so I could just copy the entire database (SELECT * FROM TABLE).
If some changes are made (INSERT, UPDATE, DELETE), do I need to upload the entire database again or there is a log file describing these operations?
Thanks!
It sounds like your "external service" is not just another database, so traditional replication might not work for you. More details on that service would be great so we can customize answers. Depending on how long you have to get data to your external service and performance demands of your application, some main options would be:
Triggers: add INSERT/UPDATE/DELETE triggers that update your external service's data when your data changes (this could be rough on your app's performance but provides near real-time data for your external service)
Log Processing: you can parse changes from the logs and use some level of ETL to make sure they'll run properly on your external service's data storage. I wouldn't recommend getting into this if you're not familiar with their structure for your particular DBMS.
Incremental Diffs: you could run diffs on some interval (maybe 3x a day, for example) and have a cron job or scheduled task run a script that moves all the data in a big chunk. This prioritizes your app's performance over the external service.
If you choose triggers, you may be able to tweak an existing trigger-based replication solution to update your external service. I haven't used these so I have no idea how crazy that would be, just an idea. Some examples are Bucardo and Slony.
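For the incremental-diffs option above, here is a minimal sketch of the scheduled script in Python (assuming PostgreSQL, which the other answers also suggest, and an updated_at column on the table; the endpoint, table, and credentials are placeholders):

# Sketch: pull rows changed since the last run and push them to the external
# service over HTTP; only advance the watermark if the push succeeded.
import datetime
import json
import psycopg2
import psycopg2.extras
import requests

ENDPOINT = "https://external-service.example.com/ingest"  # placeholder
STATE_FILE = "last_sync.txt"

def load_last_sync() -> str:
    try:
        with open(STATE_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01 00:00:00"

def sync() -> None:
    started = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
    conn = psycopg2.connect(host="db-host", dbname="mydb",
                            user="user", password="secret")
    with conn, conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute("SELECT * FROM my_table WHERE updated_at > %s",
                    (load_last_sync(),))
        rows = cur.fetchall()

    if rows:
        requests.post(ENDPOINT, data=json.dumps(rows, default=str),
                      headers={"Content-Type": "application/json"},
                      timeout=30).raise_for_status()

    with open(STATE_FILE, "w") as f:
        f.write(started)

if __name__ == "__main__":
    sync()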
There are many ways to replicate a PostgreSQL database. In the current version, 9.0, the PostgreSQL Global Development Group introduced two great new features called Hot Standby and Streaming Replication, taking PostgreSQL to a new level and providing a built-in replication solution.
On the wiki there is a complete review of the new PostgreSQL 9.0 features:
http://wiki.postgresql.org/wiki/PostgreSQL_9.0
There are other tools like Bucardo, Slony-I, Londiste (Skytools), etc., which you can use too.
Now, what do you want to do for log processing? What exactly do you need? Regards.