Capture new data in a file and write to BigQuery - function

I am new to GCP and need help setting up a system for this scenario.
There is a file in GCS that gets written to by an application program (for example, a log).
I need to capture every new record that is written to this file, then process the record by applying some transformation logic, and finally write it into a BigQuery table.
I am thinking about this approach :
event trigger on Google storage for the file
write into pub/sub
apply google cloud function
subscribe into bigquery
I am not sure if this approach is optimal and right for this use case.
Please suggest.

It depends a bit on your requirements. Here are some options:
1
Is it appropriate to simply mount this file as an external table like this?
One example from those docs:
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
Region STRING,
Quarter STRING,
Total_Sales INT64
) OPTIONS (
format = 'CSV',
uris = ['gs://mybucket/sales.csv'],
skip_leading_rows = 1);
If your desired transformation can be expressed in SQL, this may be sufficient: you could define a SQL View that enacts the transformation but which will always query the most up-to-date version of the data. However, queries may turn out to be a little slow with this setup.
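If you'd rather keep that scripted, creating the view via the Python client library might look roughly like this; the dataset, table, and column names are just the placeholders from the example above, and the SELECT is a stand-in for your real transformation:

from google.cloud import bigquery

client = bigquery.Client()

# Create (or replace) a logical view that applies the transformation;
# every query against the view reads the latest contents of the GCS file.
client.query(
    """
    CREATE OR REPLACE VIEW mydataset.sales_clean AS
    SELECT
      UPPER(Region) AS region,   -- placeholder transformation
      Quarter AS quarter,
      Total_Sales AS total_sales
    FROM mydataset.sales
    """
).result()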
2
How up-to-date does your BigQuery table have to be? Real-time accuracy is often not needed, in which case a batch load job on a schedule may be most appropriate. There's a nice built-in system for this approach, the BigQuery Data Transfer Service, which you could use to sync the BigQuery table as often as every fifteen minutes.
Unlike with an external table, you could create a materialized view for your transformation, ensuring good performance with a guarantee that the data won't be more than 15 minutes out of date on the most frequent schedule.
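As a sketch of what the scheduled load could look like with the Data Transfer Service's Python client: the project, dataset, bucket, and parameter values below are assumptions, and the exact parameter names should be checked against the GCS transfer documentation.

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# A transfer that loads the GCS file into a native table every 15 minutes.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="mydataset",
    display_name="sales csv sync",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://mybucket/sales.csv",
        "destination_table_name_template": "sales_raw",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
    schedule="every 15 minutes",
)

client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)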
3
Okay, so you need real-time availability and good performance, or your transformation is too complex to express in SQL? For this, your proposal looks okay, but it has quite a few moving parts, and there will certainly be some latency in the system. In this scenario you're likely better off following GCP's preferred route of using the Dataflow service. The link there is to the template they provide for streaming files from GCS into BigQuery, with a transformation of your choosing applied via a function.
4
There is one other case I didn't deal with, which is where you don't need real-time data but the transformation is complex and can't be expressed with SQL. In this case I would probably suggest a batch job run on a simple schedule (using a GCS client library and a BigQuery client library in the language of your choice).
There are many, many ways to do this sort of thing, and unless you are working on a completely greenfield project you almost certainly already have a scheduler you could use. But I will remark that GCP has recently added the ability to use Cloud Scheduler to execute Cloud Run Jobs, which may be easiest if you don't already have a way to do this.
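A minimal sketch of such a batch job, assuming the GCS and BigQuery Python client libraries, a CSV file, and placeholder bucket/table names (the transformation shown is arbitrary):

import csv
import io

from google.cloud import bigquery, storage


def run_batch_sync():
    # Read the whole file from GCS (placeholder bucket/object names).
    raw = storage.Client().bucket("mybucket").blob("sales.csv").download_as_text()

    # Apply whatever transformation can't be expressed in SQL.
    rows = []
    for record in csv.DictReader(io.StringIO(raw)):
        record["Total_Sales"] = int(record["Total_Sales"]) * 2  # placeholder logic
        rows.append(record)

    # Replace the destination table with the transformed rows.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True, write_disposition="WRITE_TRUNCATE"
    )
    client.load_table_from_json(
        rows, "mydataset.sales_transformed", job_config=job_config
    ).result()


if __name__ == "__main__":
    run_batch_sync()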
None of this is to say your approach won't work - you can definitely trigger a cloud function directly based on a change in a GCS bucket, and so you could write a function to perform the ETL process every time. It's not a bad all-round approach, but I have aimed to give you some examples that are either simpler or more performant, covering a variety of possible requirements.
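For completeness, the event-driven version you describe could be sketched like this as a (1st gen) Python Cloud Function with a Cloud Storage trigger; the destination table is a placeholder, and any per-record transformation would happen before the load:

from google.cloud import bigquery

client = bigquery.Client()


def gcs_to_bigquery(event, context):
    """Background Cloud Function triggered when an object changes in GCS."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    # Reload the updated file into BigQuery; transformation logic could be
    # applied here first (as in the batch sketch above) or pushed into a view
    # on top of the loaded table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )
    client.load_table_from_uri(
        uri, "mydataset.sales_raw", job_config=job_config
    ).result()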

Related

Azure - Trigger which copies multiple blobs

I am currently working on a ServiceBus trigger (using C#) which copies and moves related blobs to another blob storage and Azure Data Lake. After copying, the function has to emit a notification to trigger further processing tasks. Therefore, I need to know when the copy/move task has finished.
My first approach was to use an Azure Function which copies all these files. However, Azure Functions have a processing time limit of 10 minutes (when set manually), so it does not seem to be the right solution. I was considering calling azCopy or StartCopyAsync() to perform an asynchronous copy, but as far as I understand, the processing time of the function will be as long as azCopy takes. To get around the time limit I could use WebJobs instead, but there are also other technologies like Logic Apps, Durable Functions, Batch jobs, etc., which leaves me confused about choosing the right technology for this problem. The function won't be called every second but might copy large amounts of data. Does anybody have an idea?
I just found out that Azure Functions only have a time limit when using the Consumption plan. If there is no better solution for blob copy tasks, I'll go for Azure Functions.
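For reference, the server-side copy-plus-poll pattern under discussion looks roughly like this with the Python storage SDK (the equivalent of StartCopyAsync() in C#); the connection string, container, and blob names are placeholders:

import time

from azure.storage.blob import BlobClient

# Placeholder source URL (typically including a SAS token) and destination.
source_url = "https://sourceaccount.blob.core.windows.net/input/bigfile.bin"
dest = BlobClient.from_connection_string(
    "<destination-connection-string>",
    container_name="archive",
    blob_name="bigfile.bin",
)

# Kick off a server-side copy; the storage service moves the bytes, so the
# caller only polls for completion instead of streaming the data itself.
dest.start_copy_from_url(source_url)

props = dest.get_blob_properties()
while props.copy.status == "pending":
    time.sleep(5)
    props = dest.get_blob_properties()

print("copy finished with status:", props.copy.status)  # emit a notification here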

A way to execute pipeline periodically from bounded source in Apache Beam

I have a pipeline taking data from a MySQL server and inserting into Datastore using the Dataflow runner.
It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into Datastore in near real-time, but JdbcIO provides a bounded source (as it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
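The snippet above refers to the Java SDK; as a rough sketch of the same "tick, then sync" pattern, here is a Python-SDK analogue using PeriodicImpulse, with the MySQL-polling DoFn left as a hypothetical stub:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse


class PollNewRows(beam.DoFn):
    def process(self, unused_tick):
        # Hypothetical: query MySQL for rows whose insertion timestamp is
        # newer than the last synced watermark and yield them downstream
        # (e.g. towards a Datastore write).
        return []


with beam.Pipeline(options=PipelineOptions(streaming=True)) as pipeline:
    (
        pipeline
        | "TickEvery30s" >> PeriodicImpulse(fire_interval=30)
        | "PollMySQL" >> beam.ParDo(PollNewRows())
    )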
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (additionally emitting the records to Pub/Sub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
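As an illustration, publishing a change message alongside the MySQL insert is only a few lines with the Pub/Sub client library; the project and topic names below are placeholders:

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "mysql-changes")

# Publish the newly inserted record so the streaming Dataflow job can pick it up.
record = {"id": 42, "name": "example"}
future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
future.result()  # block until Pub/Sub acknowledges the message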

Google Cloud SQL Timeseries Statistics

I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries. So long, in fact, that our daemon often intervenes to kill the queries.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand their query language that way).
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
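As a small illustration of that HTTP JSON API, a time-bucketed count of events could be queried roughly like this; the index name, field name, and local URL are assumptions, and older Elasticsearch versions use interval instead of fixed_interval:

import requests

# Count events per hour using a date_histogram aggregation.
query = {
    "size": 0,
    "aggs": {
        "events_per_hour": {
            "date_histogram": {"field": "timestamp", "fixed_interval": "1h"}
        }
    },
}

response = requests.post("http://localhost:9200/events/_search", json=query)
for bucket in response.json()["aggregations"]["events_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])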
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used ElasticSearch for storing events and profiling information from a web site. I even wrote a statsd backend storing stat information in elasticsearch.
After Elasticsearch changed Kibana from version 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric from each query, so if you want to chart time, average time, and average time at the 90th percentile, you must run 3 queries instead of 1 that returns 3 values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana - a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events as something similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/

How to process Couchbase buckets in batches

I'm trying to migrate a lot of buckets from one production server to another. I'm currently using a script that queries a view and copies the results to the other server. However, I don't know how this process can be broken down into smaller steps. Specifically, I'm looking to copy all available buckets to the other server (this takes several hours), run some tests, and when the tests are successful, if there are new buckets, use the same script to only migrate the new ones.
Does Couchbase support, for its views, any feature that might help? Like LIMIT and OFFSET for the query, or maybe a last-modified date on each bucket so I can filter by that?
You should really consider using Backup and Restore.
To answer your question, yes. If you are using an SDK then you need to look into its API, but, for instance, using the console you can check all the filter options available to you. As an example, if you use HTTP you have &limit=10&skip=0 as arguments. Check more info here
To filter by modified date you need to create a view specifically for that, which would have the modified date as key in order to be searchable.
Here is a link that shows you how to search by date, which implies, as I mentioned, creating a map/reduce function with the date as the key and then querying that key: Date and Time Selection
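As a rough sketch of paging through a view over HTTP with limit/skip: the host, bucket, design document, and view names below are placeholders, and the 8092 view port assumes a default installation.

import requests

# Page through a Couchbase view 100 rows at a time using limit/skip.
base_url = "http://cb-host:8092/mybucket/_design/migration/_view/by_modified_date"
skip = 0
limit = 100

while True:
    resp = requests.get(base_url, params={"limit": limit, "skip": skip})
    rows = resp.json().get("rows", [])
    if not rows:
        break
    for row in rows:
        # Copy/process each row on the destination cluster here.
        print(row["id"], row["key"])
    skip += limit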

Replicating database changes

I want to "replicate" a database to an external service. For doing so I could just copy the entire database (SELECT * FROM TABLE).
If some changes are made (INSERT, UPDATE, DELETE), do I need to upload the entire database again, or is there a log file describing these operations?
Thanks!
It sounds like your "external service" is not just another database, so traditional replication might not work for you. More details on that service would be great so we can customize answers. Depending on how long you have to get data to your external service and performance demands of your application, some main options would be:
Triggers: add INSERT/UPDATE/DELETE triggers that update your external service's data when your data changes (this could be rough on your app's performance but provides near real-time data for your external service)
Log Processing: you can parse changes from the logs and use some level of ETL to make sure they'll run properly on your external service's data storage. I wouldn't recommend getting into this if you're not familiar with their structure for your particular DBMS.
Incremental Diffs: you could run diffs on some interval (maybe 3x a day, for example) and have a cron job or scheduled task run a script that moves all the data in a big chunk. This prioritizes your app's performance over the external service.
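A bare-bones sketch of the incremental-diff option, assuming PostgreSQL via psycopg2, an updated_at column on the table, and a hypothetical push_to_external_service() helper:

import psycopg2


def sync_changes(last_sync_timestamp):
    """Push rows changed since the previous run to the external service."""
    conn = psycopg2.connect("dbname=mydb user=replicator")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM mytable WHERE updated_at > %s",
            (last_sync_timestamp,),
        )
        rows = cur.fetchall()

    for row in rows:
        push_to_external_service(row)  # hypothetical uploader for your service

    # Persist the newest updated_at from this batch as the next run's start point.
    return max((r[2] for r in rows), default=last_sync_timestamp)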
If you choose triggers, you may be able to tweak an existing trigger-based replication solution to update your external service. I haven't used these so I have no idea how crazy that would be, just an idea. Some examples are Bucardo and Slony.
There are many ways to replicate a PostgreSQL database. In the current version, 9.0, the PostgreSQL Global Development Group introduced two great new features called Hot Standby and Streaming Replication, putting PostgreSQL on a new level and providing a built-in solution.
On the wiki, there is a complete review of the new PostgreSQL 9.0 features:
http://wiki.postgresql.org/wiki/PostgreSQL_9.0
There are other applications like Bucardo, Slony-I, Londiste (Skytools), etc., which you can use too.
Now, what do you want to do for log processing? What exactly do you want? Regards