Retrieve streaming data from API using Cloud Functions - google-cloud-functions

I want to stream real-time data from the Twitter API to Cloud Storage and BigQuery. I have to ingest and transform the data using Cloud Functions, but the problem is I have no idea how to pull data from the Twitter API and ingest it into the cloud.
I know I also have to create a scheduler and a Pub/Sub topic to trigger the Cloud Functions. I have created a Twitter developer account. The main problem is actually streaming the data into Cloud Storage.
I'm really new to GCP and streaming data, so it would be nice to see a clear explanation of this. Thank you very much :)

You first have to design your solution. What do you want to achieve: streaming or micro-batches?
If streaming, you have to use the Twitter streaming API. In short, you open a connection and stay up and running (and connected), receiving the data.
If batches, you have to query an API and download a set of messages, in a query-response mode.
That being said, how do you implement it on Google Cloud? Streaming is problematic because you have to stay connected at all times, and with serverless products you have timeout concerns (9 minutes for Cloud Functions 1st gen, 60 minutes for Cloud Run and Cloud Functions 2nd gen).
However, you can invoke your serverless product regularly, stay connected for a while (say, 1 hour), and schedule the trigger every hour.
Or use a VM to do that (or a pod on a Kubernetes cluster).
You can also consider micro-batches, where you invoke your Cloud Function every minute and fetch all the messages from the past minute (see the sketch below).
In the end, it all depends on your use case. How real-time do you need to be? Which product do you want to use?
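For the micro-batch option, a minimal sketch could look like the following: Cloud Scheduler publishes to a Pub/Sub topic every minute, which triggers this background Cloud Function; the function calls the Twitter v2 recent-search endpoint for the last minute of tweets and drops the raw JSON into Cloud Storage. The bucket name, search query and environment-variable names are placeholders, and error handling is kept to a minimum.

```python
import json
import os
from datetime import datetime, timedelta, timezone

import requests
from google.cloud import storage

BUCKET_NAME = os.environ["RAW_BUCKET"]        # placeholder, e.g. "my-twitter-raw"
BEARER_TOKEN = os.environ["TWITTER_BEARER"]   # Twitter API v2 bearer token
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def ingest_tweets(event, context):
    """Background Cloud Function triggered by a Pub/Sub message from Cloud Scheduler."""
    now = datetime.now(timezone.utc)
    start = now - timedelta(minutes=1)

    resp = requests.get(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "query": "from:some_account",   # placeholder query
            "start_time": start.isoformat(timespec="seconds"),
            "max_results": 100,
        },
        timeout=30,
    )
    resp.raise_for_status()
    tweets = resp.json().get("data", [])

    # One object per batch, partitioned by timestamp, written as
    # newline-delimited JSON so the files can later be loaded into BigQuery.
    blob_name = f"tweets/{now:%Y/%m/%d/%H%M%S}.json"
    payload = "\n".join(json.dumps(t) for t in tweets)
    storage.Client().bucket(BUCKET_NAME).blob(blob_name).upload_from_string(
        payload, content_type="application/json"
    )
```

From there, a load job (or a second function) can move the files from Cloud Storage into BigQuery at whatever cadence fits your latency requirement.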

Related

Subscribing to google pub/sub messages via cloud functions versus using dataflow

I have a Pub/Sub topic with roughly 1 message per second published. The message size is around 1 KB. I need to get this data in real time into both Cloud SQL and BigQuery.
The data arrives at a steady rate and it's crucial that none of it gets lost or delayed. Writing it multiple times into the destination is not a problem. The size of all the data in the database is around 1 GB.
What are the advantages and disadvantages of using Google Cloud Functions triggered by the topic versus Google Dataflow to solve this problem?
Dataflow is focused on transforming the data before loading it into a sink. The streaming mode of Dataflow (Beam) is very powerful when you want to perform computations over windowed data (aggregate, sum, count, ...). If your use case requires a steady rate, Dataflow can be a challenge when you deploy a new version of your pipeline (hopefully easily solved if duplicated values aren't a problem!).
Cloud Functions are the glue of the cloud. In your description, they seem a perfect fit. On the topic, create 2 subscriptions and 2 functions (one on each subscription). One writes to BigQuery, the other to Cloud SQL. This parallelisation gives you the lowest processing latency.
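As a rough illustration of the BigQuery side of that pair, a Pub/Sub-triggered background function could look like the sketch below (the table ID and row layout are placeholders; the Cloud SQL twin would follow the same shape with a SQL INSERT instead of the BigQuery client):

```python
import base64
import json

from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.events"   # placeholder table
client = bigquery.Client()


def pubsub_to_bigquery(event, context):
    """Background Cloud Function triggered by one of the two Pub/Sub subscriptions."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))

    errors = client.insert_rows_json(
        TABLE_ID,
        [{"message_id": context.event_id, "payload": json.dumps(message)}],
    )
    if errors:
        # Raising makes Pub/Sub redeliver the message, which is acceptable here
        # since writing the same data multiple times was stated to be fine.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```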

Should I run mysql on google cloud run? (or any database)

I've been researching the new option to run Docker containers on Google Cloud Run, but there seems to be no advice on whether or not one should run MySQL on Cloud Run. I know it isn't a web service, and I understand that in the official GCP documentation Google would probably just tell people to kindly use Cloud SQL (their managed SQL offering), but I haven't found any advice online about "running MySQL on Cloud Run", so I thought I'd ask here.
Will startup times from cold starts decrease the performance of the solution? (assuming one uses a bucket for storing the data)
Running a SQL database is not a good fit for Cloud Run.
First of all, the contract between the deployed container and Cloud Run is that the container needs to run an HTTP server on port 8080. That's not really the way MySQL works.
Second of all, the container is going to be limited to the filesystem that was included in the container image. This same image is going to be instantiated many times over as the service handles load. There will be no way to persist the data written to MySQL. You could have read-only data stored in that image that only changes when a new image is published, but that's not really what you would expect to use a relational database for.
Cloud Run is really good at operating HTTP/web services in a serverless and scalable way. These web services typically make use of other APIs and services deployed to Google Cloud, or third-party services. It's not really meant to offer persistent, scalable, ACID-compliant database services; that is a whole different sort of problem space.
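To make the contract concrete, here is a minimal sketch of the kind of workload Cloud Run expects: a stateless HTTP server listening on the port Cloud Run hands it (8080 by default). MySQL speaks its own wire protocol and keeps its state on local disk, which is exactly what this model does not provide.

```python
import os

from flask import Flask

app = Flask(__name__)


@app.route("/")
def handle():
    # Anything written to the local filesystem here disappears when the
    # instance is recycled, so there is nowhere durable for MySQL data files.
    return "OK"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```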

What is running on my Google compute engine

There is a lot of activity on my Google Compute Engine API. It's less than 1 request per second, which probably keeps me in the free tier, but how do I figure out what is running and whether I should stop it?
I have some Pub/Sub topics and a Cloud Function that copies data into a Datastore database. But even when I am not publishing any data (for days), I still see activity on the Compute Engine API. Can I disable it, or will that stop my Cloud Functions?

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipeline to warm up caches and update search engines.
Currently we do that with a rudimentary polling architecture, basically using SQL, but ideally we would have a push architecture reading the events directly from the transaction log.
Has anyone tried this? Any suggested architecture?
I've already worked with some customers doing that, in Oracle. It also seems that LinkedIn makes heavy use of this technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish it in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that already streams the binlog from MySQL to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing this as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for exactly that - your database transaction log is a stream of events. Borrowing an excerpt from the AWS public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
Hope this helps.
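For a rough idea of what the push approach can look like, here is a sketch using the open-source python-mysql-replication library to tail the RDS binlog (row-based binary logging must be enabled) and boto3 to forward each change to a Kinesis stream. The host, credentials, table names, server_id and stream name are placeholders, and checkpointing of the binlog position is left out.

```python
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = BinLogStreamReader(
    connection_settings={
        "host": "my-rds-endpoint",      # placeholder RDS endpoint
        "port": 3306,
        "user": "repl_user",
        "passwd": "secret",
    },
    server_id=100,                      # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    only_tables=["legacy_table"],       # the write-intensive table
    blocking=True,                      # keep tailing the log
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        record = {
            "schema": event.schema,
            "table": event.table,
            "type": type(event).__name__,
            "row": row,
        }
        kinesis.put_record(
            StreamName="legacy-table-changes",           # placeholder stream
            Data=json.dumps(record, default=str),
            PartitionKey=f"{event.schema}.{event.table}",
        )
```

This process has to run somewhere long-lived (an EC2 instance or a container), since it keeps a replication connection open, which is essentially the same operational trade-off discussed in the Cloud Functions answer above.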
The AWS DMS service offers data migration from a SQL database to Kinesis.
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/

Good implementation of sending data to a REST api?

Each day hundreds of thousands of items are inserted, updated and deleted on our service (backend using .Net and a MySql database).
Now we are integrating our service with another service using their RESTful API. Each time an item is inserted, updated or deleted on our service we also need to connect to their web service and use POST, PUT, DELETE.
What is a good implementation of this case?
It doesn't seem like a very good idea to connect to their API each time a user inserts an item in our service, as it would make for quite a slow experience for the user.
Another idea was to update our database as usual, then set up another server that constantly connects to our database and fetches the data that needs to be posted to the RESTful API. Is this the way to go?
How would you solve it? Any guides for implementing something like this would be great! Thanks!
It depends on whether a delay in updating the other service is acceptable or not. If not, then create an event and put it in the queue of an event processor that can send it to the second service.
If a delay is acceptable, then a background batch job can run periodically and send the data.
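One common shape for either option is an "outbox" pattern: the main service writes a row to an outbox table in the same transaction as the item change, and a separate worker drains that table and replays each change against the partner's REST API. A hedged sketch of such a worker follows (shown in Python rather than .NET for brevity; the table, column and endpoint names are invented for illustration):

```python
import time

import pymysql
import requests

API_BASE = "https://partner.example.com/api/items"   # placeholder endpoint
METHODS = {"insert": "POST", "update": "PUT", "delete": "DELETE"}


def process_outbox(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(
            "SELECT id, item_id, action, payload FROM outbox "
            "WHERE sent_at IS NULL ORDER BY id LIMIT 100"
        )
        for row in cur.fetchall():
            resp = requests.request(
                METHODS[row["action"]],
                f"{API_BASE}/{row['item_id']}",
                data=row["payload"],
                headers={"Content-Type": "application/json"},
                timeout=10,
            )
            resp.raise_for_status()
            # Mark the row as sent only after the remote call succeeded, so a
            # crash before this UPDATE means the change is retried on the next pass.
            cur.execute(
                "UPDATE outbox SET sent_at = NOW() WHERE id = %s", (row["id"],)
            )
    conn.commit()


if __name__ == "__main__":
    connection = pymysql.connect(
        host="localhost", user="app", password="secret", db="app"
    )
    while True:
        process_outbox(connection)
        time.sleep(5)
```

The same idea works with a proper message queue instead of a polled table; the key point is that the user-facing request only does a local write, and the remote API call happens asynchronously with retries.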