I'm trying to migrate a lot of buckets from one production server to another. I'm currently using a script that queries a view and copies the results to the other server. However, I don't know how this process can be broken down into smaller steps. Specifically, I'm looking to copy all available buckets to the other server (this takes several hours), run some tests, and when the tests are successful, if there are new buckets, use the same script to migrate only the new ones.
Does Couchbase support any feature for its views that might help? Something like LIMIT and OFFSET for the query, or maybe a last-modified date on each bucket so I can filter by that?
You should really consider using Backup and Restore.
To answer your question, yes. If you are using an SDK you need to look into its API, but you can also check all the filter options available to you from the console. As an example, if you query a view over HTTP you have &limit=10&skip=0 as arguments. Check more info here.
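For illustration, the same limit/skip options are exposed by the SDKs; here is a minimal sketch with the Java SDK 2.x, where the host, bucket, design document and view names are placeholders:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class PagedViewQuery {
  public static void main(String[] args) {
    Bucket bucket = CouchbaseCluster.create("source-host").openBucket("mybucket");
    // Page through the view 1000 rows at a time; limit/skip play the role of LIMIT/OFFSET.
    ViewResult result = bucket.query(
        ViewQuery.from("migration", "all_docs").limit(1000).skip(0));
    for (ViewRow row : result) {
      System.out.println(row.id());
    }
    bucket.close();
  }
}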
To filter by modified date you need to create a view specifically for that, with the modified date as the key so it is searchable.
Here is a link that shows you how to search by date, which implies, as I mentioned, creating a map/reduce function with the date as the key and then querying that key: Date and Time Selection
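As a rough sketch of that idea with the Java SDK, assuming each document has a modified field holding a sortable timestamp string (the design document and view names are made up):

import java.util.Collections;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.DefaultView;
import com.couchbase.client.java.view.DesignDocument;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewRow;

public class IncrementalMigration {
  public static void main(String[] args) {
    Bucket bucket = CouchbaseCluster.create("source-host").openBucket("mybucket");

    // Map function that emits the modified date as the key, so it can be range-queried.
    String map = "function (doc, meta) { if (doc.modified) { emit(doc.modified, null); } }";
    bucket.bucketManager().upsertDesignDocument(
        DesignDocument.create("migration", Collections.singletonList(
            DefaultView.create("by_modified", map))));

    // Only fetch documents modified since the last full copy.
    for (ViewRow row : bucket.query(
        ViewQuery.from("migration", "by_modified").startKey("2024-01-01T00:00:00Z"))) {
      System.out.println(row.id());
    }
    bucket.close();
  }
}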
I am new to GCP and need help to set up a system for this scenario.
There is a file in GCS and it gets written to by an application program (for example, a log).
I need to capture every new record that is written to this file, then process the record with some transformation logic, and finally write it into a BigQuery table.
I am thinking about this approach:
An event trigger on Google Cloud Storage for the file
Write into Pub/Sub
Apply a Cloud Function
Subscribe and write into BigQuery
I am not sure if this approach is optimal and right for this use case.
Please suggest.
It depends a bit on your requirements. Here are some options:
1
Is it appropriate to simply mount this file as an external table like this?
One example from those docs:
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales (
Region STRING,
Quarter STRING,
Total_Sales INT64
) OPTIONS (
format = 'CSV',
uris = ['gs://mybucket/sales.csv'],
skip_leading_rows = 1);
If your desired transformation can be expressed in SQL, this may be sufficient: you could define a SQL View that enacts the transformation but which will always query the most up-to-date version of the data. However, queries may turn out to be a little slow with this setup.
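One way to do that, reusing the columns from the example above (the aggregation itself is just an illustration), is to create the view once, for instance via the BigQuery Java client:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class CreateTransformView {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Hypothetical transformation: normalise region names and aggregate sales per quarter.
    String ddl =
        "CREATE OR REPLACE VIEW mydataset.sales_by_quarter AS "
            + "SELECT UPPER(Region) AS region, Quarter, SUM(Total_Sales) AS total_sales "
            + "FROM mydataset.sales "
            + "GROUP BY region, Quarter";
    bigquery.query(QueryJobConfiguration.newBuilder(ddl).build());
  }
}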
2
How up-to-date does your BigQuery table have to be? Real-time accuracy is often not needed, in which case a batch load job on a schedule may be most appropriate. There's a nice built-in system for this approach, the BigQuery Data Transfer Service, which you could use to sync the BigQuery table as often as every fifteen minutes.
Unlike with an external table, you could create a materialized view for your transformation, ensuring good performance with a guarantee that the data won't be more than fifteen minutes out of date on the most frequent schedule.
3
Okay, so you need real-time availability and good performance, or your transformation is too complex to express in SQL? For this, your proposal looks okay, but it has quite a few moving parts, and there will certainly be some latency in the system. In this scenario you're likely better off following GCP's preferred route of using the Dataflow service. The link there is to the template they provide for streaming files from GCS into BigQuery, with a transformation of your choosing applied via a function.
4
There is one other case I didn't deal with, which is where you don't need real-time data but the transformation is complex and can't be expressed with SQL. In this case I would probably suggest a batch job run on a simple schedule (using a GCS client library and a BigQuery client library in the language of your choice).
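A hedged sketch of that kind of batch job with the Java client libraries (the bucket, object, dataset, table and field names, and the "keep error lines" transformation, are all assumptions):

import java.util.HashMap;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class BatchLoadJob {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Read the whole log file from GCS (fine for modest file sizes).
    Blob blob = storage.get("mybucket", "app.log");
    String contents = new String(blob.getContent());

    InsertAllRequest.Builder insert =
        InsertAllRequest.newBuilder(TableId.of("mydataset", "events"));
    for (String line : contents.split("\n")) {
      // Hypothetical transformation: keep only error lines.
      if (line.contains("ERROR")) {
        Map<String, Object> row = new HashMap<>();
        row.put("raw", line);
        row.put("level", "ERROR");
        insert.addRow(row);
      }
    }
    bigquery.insertAll(insert.build());
  }
}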
There are many, many ways to run this sort of scheduled job, and unless you are working on a completely greenfield project you almost certainly already have a scheduler you could use. But I will remark that GCP has recently added the ability to use Cloud Scheduler to execute Cloud Run Jobs, which may be easiest if you don't already have a way to do this.
None of this is to say your approach won't work - you can definitely trigger a cloud function directly based on a change in a GCP bucket, and so you could write a function to perform the ELT process every time. It's not a bad all-round approach, but I have aimed to give you some examples that are either simpler or more performant, covering a variety of possible requirements.
I am new to the Amazon OpenSearch service, and I wish to know if there's any way I can sync a MySQL DB with OpenSearch in real time. I thought of Logstash, but it seems it doesn't support delete and update operations, which means my OpenSearch cluster might not get updated.
I'm going to comment for Elasticsearch as that is the tag used for this question.
You can:
Read from the database (SELECT * FROM TABLE)
Convert each record to a JSON document
Send the JSON document to Elasticsearch, preferably using the _bulk API.
Logstash can help with that. But I'd recommend modifying the application layer if possible and sending data to Elasticsearch in the same "transaction" as you send your data to the database.
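A minimal sketch of those three steps with the Elasticsearch high-level REST Java client (the table, columns, index name and credentials are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class MySqlToElasticsearch {
  public static void main(String[] args) throws Exception {
    RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));

    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/mydb", "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name, price FROM products")) {

      BulkRequest bulk = new BulkRequest();
      while (rs.next()) {
        // Convert each row to a JSON document keyed by the primary key.
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", rs.getString("name"));
        doc.put("price", rs.getDouble("price"));
        bulk.add(new IndexRequest("products").id(rs.getString("id")).source(doc));
      }
      client.bulk(bulk, RequestOptions.DEFAULT);
    }
    client.close();
  }
}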
I shared most of my thoughts here: http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/
Also have a look at this "live coding" recording.
Side note: If you want to run Elasticsearch, have a look at Cloud by Elastic, also available if needed from the AWS Marketplace, Azure Marketplace and Google Cloud Marketplace.
Cloud by Elastic is one way to have access to all of the features, all managed by us. Think of what is already there, like Security, Monitoring, Reporting, SQL, Canvas, Maps UI, Alerting and the built-in solutions named Observability, Security and Enterprise Search, and what is coming next :) ...
Disclaimer: I'm currently working at Elastic.
Keep a column that indicates when the row was last modified; then you will be able to push updates to OpenSearch. Similarly for deletes, have a column indicating whether the row is deleted or not (soft delete), plus the date it was deleted.
With this DB design, you can send "delete" or "update" actions to OpenSearch/Elasticsearch to update or delete the indexed documents based on the last-modified / deleted date. You can later have a scheduled maintenance job to delete these rows permanently from the database table.
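With that schema, an incremental sync could query rows changed since the last run and turn the soft-delete flag into delete or index actions; a rough sketch (the table, columns, index name and checkpoint handling are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class IncrementalSync {
  public static void main(String[] args) throws Exception {
    RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));
    String lastRun = "2024-01-01 00:00:00"; // checkpoint persisted somewhere between runs

    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/mydb", "user", "password");
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT id, name, is_deleted FROM products WHERE last_modified > ?")) {
      stmt.setString(1, lastRun);
      ResultSet rs = stmt.executeQuery();

      BulkRequest bulk = new BulkRequest();
      while (rs.next()) {
        String id = rs.getString("id");
        if (rs.getBoolean("is_deleted")) {
          // Soft-deleted row -> remove the document from the index.
          bulk.add(new DeleteRequest("products", id));
        } else {
          // New or updated row -> reindex the document.
          bulk.add(new IndexRequest("products").id(id)
              .source(Collections.singletonMap("name", rs.getString("name"))));
        }
      }
      client.bulk(bulk, RequestOptions.DEFAULT);
    }
    client.close();
  }
}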
Lastly, this article might be of help to you: How to keep Elasticsearch synchronized with a relational database using Logstash and JDBC
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executing once. The thing is that I want to get the new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible and there's a couple ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
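A rough sketch of that shape in Beam Java (the connection details, query, and 30-second interval are placeholders; a real job would track a proper watermark rather than a fixed lookback):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

public class PeriodicMySqlSync {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Tick every 30s",
            GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)))
     .apply("Read new rows", JdbcIO.<Long, String>readAll()
         .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                 "com.mysql.cj.jdbc.Driver", "jdbc:mysql://host:3306/mydb")
             .withUsername("user").withPassword("password"))
         // Each tick re-runs the query against the bounded source.
         .withQuery("SELECT id FROM events WHERE created_at > NOW() - INTERVAL 30 SECOND")
         .withParameterSetter((element, statement) -> { /* no parameters in this sketch */ })
         .withRowMapper(resultSet -> resultSet.getString("id"))
         .withCoder(StringUtf8Coder.of()));
    // ... continue with your transformation and a Datastore write.

    p.run();
  }
}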
If the amount of data in MySql is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
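For example, the system writing to MySQL could publish each new record right after the insert; a hedged sketch with the Pub/Sub Java client (the project, topic and payload are made up):

import java.util.concurrent.TimeUnit;

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishOnInsert {
  public static void main(String[] args) throws Exception {
    Publisher publisher =
        Publisher.newBuilder(TopicName.of("my-project", "mysql-changes")).build();

    // Right after the row is inserted into MySQL, publish the same record as JSON.
    String recordJson = "{\"id\": 42, \"name\": \"example\"}";
    publisher.publish(PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(recordJson))
        .build());

    publisher.shutdown();
    publisher.awaitTermination(1, TimeUnit.MINUTES);
  }
}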
I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries. So long in fact that our daemon intervenes to kill the queries often.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand their query language that way).
https://www.elastic.co
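To give a flavour of the query language, a per-day event count might look like this (the events index and timestamp field are assumptions; I'm showing the call in Java, but the same JSON body works from PHP or curl):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DailyEventCounts {
  public static void main(String[] args) throws Exception {
    // Bucket events per day; "calendar_interval" is the current name, older versions used "interval".
    String query = "{ \"size\": 0, \"aggs\": { \"per_day\": {"
        + " \"date_histogram\": { \"field\": \"timestamp\", \"calendar_interval\": \"day\" } } } }";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/events/_search"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(query))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());
  }
}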
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a web site. I even wrote a statsd backend that stores stat information in Elasticsearch.
After Elasticsearch changed Kibana from version 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric from each query, so if you want to chart time, average time, and the 90th-percentile time, you must run three queries instead of one that returns three values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana - a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events as row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
I have 2 Couchbase clusters: 1 for real-time work and 1 for back-end data queries.
I wish to replicate only 10% of the data from the real-time bucket to the back-end one because it's used for statistical analysis.
Note one: I know it's not possible through the UI; I'm looking for a way to write some kind of extension that could "sit" in the middle of the XDCR stream and filter it.
Note two: As I understand it, Elasticsearch uses the replication feature to get notified of changes on the cluster and build its own indexes. If I could "listen" for those notifications myself, I could take it from there, reading and sending the relevant data myself.
Any ideas on how I can make it work?
==NOTES==
I found the following link: http://blog.couchbase.com/xdcr-aspnet-and-nancy. It gives a basic example of a Sinatra-style project which XDCR can connect to, but there is no link to documentation on the REST API for someone who doesn't want to work with Sinatra.
As for @Cihan's question: replicating 10% of the data is the basic use case I wish for, and for that I could filter on the key alone. But in general I would probably like to manipulate the data and also be able to merge it into existing data - that would be the case if I had 2 real-time clusters replicating to 1 back-end cluster.
Couchbase doesn't have anything built in today to do this. You could set up XDCR and delete the data that you don't need on the destination cluster, but it may reappear as updates happen, so your cleanup would have to run continuously. Would a method like that work?
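One way such a cleanup could look is a job that walks a view of all keys on the destination bucket and removes everything outside the subset you want to keep; a rough sketch with the Java SDK, where the design document, view and the key-prefix rule (standing in for "the 10%") are all assumptions:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewRow;

public class DestinationCleanup {
  public static void main(String[] args) throws InterruptedException {
    Bucket backend = CouchbaseCluster.create("backend-host").openBucket("analytics");
    while (true) {
      // "all_keys" is a trivial view emitting every document id.
      for (ViewRow row : backend.query(ViewQuery.from("cleanup", "all_keys"))) {
        // Keep only the subset you care about, here approximated by a key prefix.
        if (!row.id().startsWith("stats::")) {
          backend.remove(row.id());
        }
      }
      Thread.sleep(60_000); // replicated documents can reappear, so re-run periodically
    }
  }
}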
By the way, we do plan to have this facility in the future. One comment that would be helpful for me: what type of filtering would suffice in your case? Could we filter with a key prefix only to achieve your case, or would you need a more sophisticated filtering expression?
Thanks,
Cihan Biyikoglu