I have a Pub/Sub topic with roughly 1 message per second published. The message size is around 1 kB. I need to get this data in real time into both Cloud SQL and BigQuery.
The data arrive at a steady rate, and it's crucial that none of them get lost or delayed. Writing them to the destination multiple times is not a problem. The total size of the data in the database is around 1 GB.
What are the advantages and disadvantages of using Google Cloud Functions triggered by the topic versus Google Dataflow to solve this problem?
Dataflow is focused on transforming the data before loading it into a sink. Dataflow's (Beam's) streaming pattern is very powerful when you want to perform computations on windowed data (aggregate, sum, count, ...). If your use case requires a steady rate, Dataflow can be a challenge when you deploy a new version of your pipeline (though that is easily solved if duplicated values aren't a problem!).
Cloud Functions is the glue of the cloud. For what you describe, it seems a perfect fit. On the topic, create 2 subscriptions and 2 functions (one on each subscription). One writes to BigQuery, the other to Cloud SQL. This parallelisation ensures the lowest processing latency.
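As a minimal sketch of one such function (Python here; the actual BigQuery insert is shown only as a comment, since it needs the google-cloud-bigquery client and real credentials), a Pub/Sub-triggered Cloud Function mostly just decodes the base64 payload and writes the row:

```python
import base64
import json

def decode_pubsub_event(event):
    """Decode the base64-encoded payload of a Pub/Sub-triggered function event."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)

def handler(event, context):
    """Entry point for a Pub/Sub-triggered Cloud Function (1st gen signature)."""
    row = decode_pubsub_event(event)
    # Hypothetical insert; with google-cloud-bigquery you would do something like:
    #   bigquery.Client().insert_rows_json("mydataset.events", [row])
    # Letting an exception propagate makes Pub/Sub redeliver the message, so
    # duplicates are possible -- acceptable per the question.
    return row
```

The BigQuery twin and the Cloud SQL twin differ only in the write call, which is what makes the two-subscription layout so cheap to run.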
I am new to GCP and need help setting up a system for this scenario.
There is a file in GCS that gets written to by an application program (for example, a log).
I need to capture every new record written to this file, process it by applying some transformation logic, and finally write it into a BigQuery table.
I am thinking about this approach :
event trigger on Google storage for the file
write into pub/sub
apply google cloud function
subscribe into bigquery
I am not sure if this approach is optimal or right for this use case.
Please suggest.
It depends a bit on your requirements. Here are some options:
1
Is it appropriate to simply mount this file as an external table like this?
One example from those docs:
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
Region STRING,
Quarter STRING,
Total_Sales INT64
) OPTIONS (
format = 'CSV',
uris = ['gs://mybucket/sales.csv'],
skip_leading_rows = 1);
If your desired transformation can be expressed in SQL, this may be sufficient: you could define a SQL View that enacts the transformation but which will always query the most up-to-date version of the data. However, queries may turn out to be a little slow with this setup.
2
How up-to-date does your BigQuery table have to be? Real-time accuracy is often not needed, in which case a batch load job on a schedule may be most appropriate. There's a nice built in system for this approach, the BigQuery Data Transfer service, which you could use to sync the BigQuery table as often as every fifteen minutes.
Unlike with an external table, you could create a materialized view for your transformation, ensuring good performance with a guarantee that the data won't be more than 15 minutes out of date on the most frequent schedule.
3
Okay, you need real time availability and good performance/your transformation is too complex to express with SQL? For this, your proposal looks okay, but it has quite a few moving parts, and there will certainly be some latency in the system. In this scenario you're likely better off following GCP's preferred route of using the Dataflow service. The link there is to the template they provide for streaming files from GCS into BigQuery, with a transformation of your choosing applied via a function.
4
There is one other case I didn't deal with, which is where you don't need real-time data but the transformation is complex and can't be expressed with SQL. In this case I would probably suggest a batch job run on a simple schedule (using a GCS client library and a BigQuery client library in the language of your choice).
There are many, many ways to do this sort of thing, and unless you are working on a completely greenfield project you almost certainly have one you could use. But I will remark that GCP has recently created the ability to use Cloud Scheduler to execute Cloud Run Jobs, which may be easiest if you don't already have a way to do this.
None of this is to say your approach won't work - you can definitely trigger a Cloud Function directly on a change in a GCS bucket, and so you could write a function that performs the ETL process each time. It's not a bad all-round approach, but I have aimed to give you some examples that are either simpler or more performant, covering a variety of possible requirements.
I want to stream real time data from Twitter API to Cloud Storage and BigQuery. I have to ingest and transform the data using Cloud Functions but the problem is I have no idea how to pull data from Twitter API and ingest it into the Cloud.
I know I also have to create a scheduler and a Pub/Sub topic to trigger Cloud Functions. I have created a Twitter developer account. The main problem is actually streaming the data into Cloud Storage.
I'm really new to GCP and streaming data so it'll be nice to see a clear explanation on this. Thank you very much :)
You first have to design your solution. What do you want to achieve: streaming or microbatches?
If streaming, you have to use Twitter's streaming API. In short, you initiate a connection and stay up and running (and connected), receiving data as it arrives.
If batches, you have to query an API and download a set of messages, in a query-response mode.
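Whichever route you pick, the core of the ingestion is small. For the streaming case, you read newline-delimited JSON off a long-lived HTTP response; a sketch of the per-line handling (the "data" key and blank keep-alive lines follow Twitter's v2 filtered stream format, but treat the field names as an assumption):

```python
import json

def parse_stream_line(line: bytes):
    """Parse one line from a newline-delimited JSON streaming response.

    Twitter's v2 filtered stream sends one JSON object per line with the
    tweet under the "data" key; blank lines are keep-alive heartbeats.
    Returns the tweet dict, or None for heartbeats.
    """
    line = line.strip()
    if not line:  # keep-alive heartbeat, nothing to ingest
        return None
    obj = json.loads(line)
    return obj.get("data")
```

Each non-None result is what you would then write to Cloud Storage or publish to Pub/Sub for downstream loading into BigQuery.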
That being said, how do you implement it on Google Cloud? Streaming is problematic because you have to stay connected at all times, and with serverless products you have timeout concerns (9 minutes for Cloud Functions 1st gen, 60 minutes for Cloud Run and Cloud Functions 2nd gen).
However, you can invoke your serverless product regularly, stay connected for a while (say one hour), and schedule the trigger every hour.
Or use a VM to do that (or a pod on a Kubernetes cluster).
You can also consider microbatches, where you invoke your Cloud Function every minute and fetch all the messages from the past minute.
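For the microbatch option, the main subtlety is computing the time window each run should ask for. A small sketch (my own convention, not a GCP API): aligning windows to period boundaries means a retry of the same run requests the same window, so you get duplicates rather than gaps, which is fine when writes are idempotent.

```python
from datetime import datetime, timedelta, timezone

def microbatch_window(now: datetime, period: timedelta = timedelta(minutes=1)):
    """Return the [start, end) window for a fixed-period microbatch run.

    The window is aligned to period boundaries, so any invocation within a
    given minute asks for the same window as its retries.
    """
    period_s = int(period.total_seconds())
    end_ts = int(now.timestamp()) // period_s * period_s  # truncate to boundary
    end = datetime.fromtimestamp(end_ts, tz=timezone.utc)
    return end - period, end
```

The scheduled function would then query the Twitter search API with `start_time`/`end_time` set from this window.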
In the end, it all depends on your use case. What latency do you expect? Which products do you want to use?
I am currently working on a Service Bus trigger (using C#) which copies and moves related blobs to another blob storage account and Azure Data Lake. After copying, the function has to emit a notification to trigger further processing tasks. Therefore, I need to know when the copy/move task has finished.
My first approach was to use an Azure Function which copies all these files. However, Azure Functions have a processing time limit of 10 minutes (when set manually), so it seems not to be the right solution. I was considering calling azCopy or StartCopyAsync() to perform an asynchronous copy, but as far as I understand, the function's processing time will be as long as azCopy takes. To get around the time limit, I could use WebJobs instead, but there are also other technologies like Logic Apps, Durable Functions, Batch jobs, etc., which leaves me confused about choosing the right technology for this problem. The function won't be called every second but might copy large amounts of data. Does anybody have an idea?
I just found out that Azure Functions only have a time limit when using the Consumption plan. If there is no better solution for blob copy tasks, I'll go with Azure Functions.
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (since it is the result of a query), so my pipeline executes only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job but I can not find the CountingInput class. I thought that maybe it changed for the GenerateSequence class but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
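The periodic synchronization riding on GenerateSequence is essentially watermark bookkeeping. The Beam API above is Java, but the logic is language-agnostic; a sketch in Python, assuming rows carry an `inserted_at` column (the column name is an assumption):

```python
def fetch_new_rows(rows, watermark):
    """Return rows inserted strictly after `watermark`, plus the new watermark.

    `rows` stands in for the result of a JdbcIO-style query such as
      SELECT * FROM t WHERE inserted_at > ? ORDER BY inserted_at
    Re-running with the returned watermark yields only unseen rows, so each
    tick of the periodic trigger syncs just the delta.
    """
    new = [r for r in rows if r["inserted_at"] > watermark]
    if new:
        watermark = max(r["inserted_at"] for r in new)
    return new, watermark
```

In the Beam pipeline, the watermark would live in the extra MySQL table mentioned above (or in state), and each element emitted by GenerateSequence triggers one such delta query via JdbcIO.readAll().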
If the amount of data in MySql is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
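The producer-side change is small: serialize each inserted row and publish it right after (ideally transactionally close to) the MySQL INSERT. A sketch of the payload construction (Pub/Sub message bodies are bytes; the publish call shown in the comment is the google-cloud-pubsub client API):

```python
import json

def record_to_pubsub_message(record: dict) -> bytes:
    """Serialize a database row as a Pub/Sub message payload.

    Illustrative only: with google-cloud-pubsub you would then publish it as
        publisher.publish(topic_path, record_to_pubsub_message(row))
    and a streaming Dataflow pipeline reads the topic as an unbounded source.
    sort_keys makes the encoding deterministic, which helps with dedup.
    """
    return json.dumps(record, sort_keys=True).encode("utf-8")
```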
I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though it's quite a powerful one), and some large graph/chart queries take 20 seconds. So long, in fact, that our daemon often intervenes to kill them.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time-series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better for understanding their query language that way).
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used ElasticSearch for storing events and profiling information from a web site. I even wrote a statsd backend storing stat information in elasticsearch.
After Elasticsearch upgraded Kibana from 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric per query, so if you want to chart time, average time, and 90th-percentile time, you must run 3 queries instead of 1 that returns 3 values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time-series database that is supported by Grafana - a time-series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/