Elasticsearch using river-jdbc to sync data with a remote MySQL server - mysql

My team and I want to use Elasticsearch on our project; however, we have a requirement that we don't want to use a local MySQL instance for each node. We want to use a remote MySQL server to store the data that the Elasticsearch services are querying.
So the idea is that each time a new item is added on an ES node, it is not written to a local MySQL instance but to a remote MySQL server (we are thinking of Amazon RDS). And for a search query on any index, we want the ES node to query the remote database (on the RDS instance).
We tried to use river-jdbc in its two flavours: river (for pulling data) and feeder (for pushing data to the RDS instance). But we were not able to get this working with river-jdbc.
Has anyone tried something similar? Or can anyone link to a blog post where this was done?
I appreciate any help
Thanks in advance

We use a similar approach. We use an Oracle database as the primary datastore.
We use PL/SQL to flatten/convert the data. For the initial load we add data (records) to a "oneshot" table. Updates of the data are flattened/converted and result in records in an "update" table. The oneshot and update tables are mapped to a single index in Elasticsearch.
Initial load of ES:
[Oracle DB]--->flatten data (PL/SQL)--->[records to animal_oneshot_river table, records to user_oneshot_river table]
The data will then be pulled by the river into, for example, http://localhost:9200/zoo/animal and http://localhost:9200/zoo/user.
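The flatten step itself boils down to ordinary SQL that produces one denormalised row per future Elasticsearch document. A minimal sketch (the source tables and columns below are hypothetical; only the animal_oneshot_river target name comes from the setup described above):

    -- Hypothetical flatten step for the initial load: denormalise the
    -- relational animal data into one row per document to be indexed.
    INSERT INTO animal_oneshot_river (animal_id, animal_name, species_name, zoo_name)
    SELECT a.id, a.name, s.name, z.name
    FROM   animals a
    JOIN   species s ON s.id = a.species_id
    JOIN   zoos    z ON z.id = a.zoo_id;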
Updates
[Software]--->change data--->[Oracle DB]--->flatten data (PL/SQL)--->[records to animal_update_river table, records to user_update_river table]
The update tables also contain the type of change (insert, update or delete).
The river will poll the update_river tables for changes and mutate the data in Elasticsearch (we use a pull model). The records are deleted after they have been processed by the river.
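As an illustration only (the column names and types are assumptions, not the actual schema), an update table along these lines carries the flattened document plus the change type, and is emptied once the river has applied a batch:

    -- Hypothetical shape of an update table.
    CREATE TABLE animal_update_river (
      animal_id    NUMBER        NOT NULL,
      animal_name  VARCHAR2(200),
      species_name VARCHAR2(200),
      zoo_name     VARCHAR2(200),
      change_type  VARCHAR2(10)  NOT NULL   -- 'insert', 'update' or 'delete'
    );

    -- After the river has pushed a row to Elasticsearch, the row is removed
    -- so the next poll only sees new changes.
    DELETE FROM animal_update_river WHERE animal_id = :processed_animal_id;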
Data changes made in Elasticsearch won't be sent back to Oracle. All changes to the primary datastore are made by our own business-logic software.
We also write data to _spare tables (e.g. animal_oneshot_river_spare) because that makes it possible to reload Elasticsearch without downtime and without synchronisation issues (we switch aliases after reloading Elasticsearch).

Related

How to prevent polling duplicated data from a MySQL database

I have a large amount of data in a MySQL database. I want to poll data from the database and push it to ActiveMQ in Camel. The connection between the database and the queue is lost every 15 minutes, and some of the messages are lost during the interruption. I need to know which messages were lost so that I can poll them again from the database. The messages should not be sent more than once, and this should be done without any changes to the database schema (I cannot add a Boolean status field to my database).
Any suggestion is welcome.
Essentially, you need to have some unique identifier in the data you pull from the source database. Maybe it is whatever has already been defined as the primary key. Or, maybe the table has some timestamp field. Or, maybe some combination of fields will be unique.
Once you identify that, when you are putting the data into the target, reject any key that is already in the target. You could use Camel's "idempotency" features, but if you are able to check for the key in the target database, you probably won't need anything else.
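If the target happens to be MySQL as well, one way to sketch the "reject keys already in the target" idea is to lean on a unique key there (target_table and its columns are made-up names, not anything from the question):

    -- Requires a unique/primary key on the identifier chosen above (here: id).
    -- Rows whose id already exists in the target are silently skipped, so
    -- re-sending a message after a broken connection does no harm.
    INSERT IGNORE INTO target_table (id, payload, created_at)
    VALUES (?, ?, ?);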
If you have to make the decision about what to send, but do not have access to your remote database from App #1, you'll need to keep a record on the other side of the firewall.
You would need to do this, even if the connection did not break every 15 minutes...because you could have failures for other reasons.
If you can have an idempotency database for App #1, another approach could be to transfer data from the local database to some other local table, and read from that. Then you poll this other table, and delete rows whenever the send is successful.
Example:
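(A minimal sketch of this outbox-style staging table, assuming MySQL; source_table, outbox and their columns are hypothetical names.)

    -- Hypothetical local "outbox" table for App #1.
    CREATE TABLE outbox (
      id      BIGINT PRIMARY KEY,
      payload TEXT   NOT NULL
    );

    -- Stage the rows that still need to be sent to ActiveMQ.
    INSERT IGNORE INTO outbox (id, payload)
    SELECT id, payload FROM source_table;

    -- Once ActiveMQ has acknowledged a message, drop it from the outbox;
    -- anything still present after a connection drop simply gets polled again.
    DELETE FROM outbox WHERE id = ?;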
It looks like you're using MySQL. If both databases are on MySQL, you could look into MySQL data replication rather than rolling your own app with Camel.

Querying data from 2 MySQL Databases to a new MySQL database

I want to query data from two different MySQL databases to a new MySQL database.
I have two databases with a lot of irrelevant data, and I want to create what can be seen as a data warehouse where only the relevant data from the two databases is present.
As of now, all data is sent to the two old databases; however, I would like scheduled updates so the new database stays up to date. There is a key linking the two databases, so in the best case I would like all the data to end up in one table, although this is not crucial.
I have done similar work with Logstash and ES; however, I do not know how to do it when it comes to MySQL.
The best way to do that is to create an ETL process with Pentaho Data Integration or any other ETL tool, where your sources are the two different databases; in the transformation part you can remove or add any business logic, and then load the data into the new database.
If you build this ETL job you can schedule it to run once a day so that your new database stays up to date.
If you want to do this without an ETL tool, then the databases must be on the same host. Then you can just add the database name before the table name in the query, like SELECT * FROM database.table_name.
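A minimal sketch of that same-host variant (db1, db2, warehouse and the column names are hypothetical; it assumes both schemas live on one MySQL server and share a common key):

    -- Pull the relevant columns from both source databases into the new one.
    INSERT INTO warehouse.combined (shared_key, col_a, col_b)
    SELECT a.shared_key, a.col_a, b.col_b
    FROM   db1.table_a AS a
    JOIN   db2.table_b AS b ON b.shared_key = a.shared_key
    ON DUPLICATE KEY UPDATE col_a = VALUES(col_a), col_b = VALUES(col_b);

Run something like that from a daily cron job or a MySQL scheduled event and you get the same "refresh once a day" behaviour without a separate ETL tool.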

MySQL Change Data Capture (CDC) - Azure services (Azure Data Factory)

I want to perform ETL operations on the data tables of a MySQL database and store the data in an Azure data warehouse. I do not have an updated-date column to identify records modified over a period, so how do I know which records have been modified? Does MySQL support CDC?
Is it possible to read the MySQL binlogs (binary logs) using Azure services (Azure Data Factory)?
If you can put together a single-statement query that returns what you want, using whatever functions and joins are available to you, then you can put that into the sqlReaderQuery part of ADF.
Otherwise you might be able to use a stored procedure activity (sorry, I'm not as familiar with MySQL as I am with ADF).
Do you have any column that is an increasing integer? If so, you can still use a lookup activity + copy activity + stored procedure activity to get an incremental load. More details here: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-powershell
ADF does not have built-in support for CDC yet. You can do that through a custom activity in ADF with your own code.
In MySQL you have the option to add a timestamp column that is updated automatically on every row-level update. CDC is not available, but to see the difference you can compare MAX(updatedate) on MySQL against (>=) your own MAX(ETLDate) to get all the modified records.
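A minimal sketch of that timestamp approach (orders, updatedate and @last_etl_date are hypothetical names):

    -- Add a column that MySQL maintains automatically on every row update.
    ALTER TABLE orders
      ADD COLUMN updatedate TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

    -- On each ETL run, pull only the rows touched since the last run
    -- (@last_etl_date holds the MAX(ETLDate) stored on the warehouse side).
    SELECT *
    FROM   orders
    WHERE  updatedate >= @last_etl_date;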

Accessing data from one MySQL database to another in MySQL Workbench

I have two different databases. I have to access data from one database and insert it into the other (with some data processing included; it is not just copying data). Also, the schema is really complex and each table has many rows, so copying the data into the schema of the second database is not an option. I have to do this using MySQL Workbench, so I have to do it with SQL queries. Is there a way to create a connection from one database to another and access its data?
While MySQL Workbench can be used to transfer data between servers (e.g. as part of a migration process), it is not useful when you have to process the data first. Instead, you have two other options:
Use a dedicated tool you write yourself to do that (as eddwinpaz mentioned).
Use the capabilities of your server. That is, copy the data to the target server, into a temporary table (using dump and restore). Then use queries to modify the data as you need it. Finally copy it to the target table.
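A minimal sketch of that second option (the staging schema, orders table and the transformation itself are all hypothetical):

    -- Step 1: dump the needed tables on the source server and restore the dump
    --         into a staging schema on the target server (outside Workbench,
    --         e.g. with mysqldump / mysql).

    -- Step 2: process the staged data with ordinary SQL.
    UPDATE staging.orders
    SET    customer_name = TRIM(UPPER(customer_name));   -- example transformation

    -- Step 3: copy the processed rows into the real target table.
    INSERT INTO target_db.orders (id, customer_name, total)
    SELECT id, customer_name, total
    FROM   staging.orders;

    -- Step 4: clean up.
    DROP TABLE staging.orders;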

MySQL backup multi-client DB for single client

I am facing a problem with a task I have to do at work.
I have a MySQL database which holds the information of several clients of my company, and I have to create a backup/restore procedure to back up and restore that information for any single client. To clarify: if client A loses their data, I have to be able to recover it while being sure I am not modifying the data of clients B, C, and so on.
I am not a DB administrator, so I don't know whether I can do this using standard MySQL tools (such as mysqldump) or other backup tools (such as Percona XtraBackup).
For the backup, my research (and my intuition) led me to this possible solution:
create the restore INSERT statements using the INSERT ... SELECT syntax (http://dev.mysql.com/doc/refman/5.1/en/insert-select.html);
save these inserts into an SQL file, either in the proper order or letting the script temporarily disable foreign key checks to satisfy the foreign key constraints;
of course, I would do this for all my clients on a daily basis, using one file per client (and day).
Then, in case I have to restore the data for a specific client:
I delete all of that client's remaining data;
I restore the correct data using the SQL file I created during the backup (see the sketch below).
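A minimal sketch of what such a per-client restore file could look like (clients, orders, client_id and backup_db are hypothetical names; it assumes every table carries a client_id column):

    -- Restore client 42 only, without touching other clients' rows.
    SET FOREIGN_KEY_CHECKS = 0;

    -- Remove whatever is left of this client's data.
    DELETE FROM orders  WHERE client_id = 42;
    DELETE FROM clients WHERE client_id = 42;

    -- Re-insert the rows captured at backup time (shown here as INSERT ... SELECT
    -- from a restored copy of the backup; plain INSERT statements work just as well).
    INSERT INTO clients SELECT * FROM backup_db.clients WHERE client_id = 42;
    INSERT INTO orders  SELECT * FROM backup_db.orders  WHERE client_id = 42;

    SET FOREIGN_KEY_CHECKS = 1;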
This way I believe I can recover the right data for client A without touching the data of client B. Will my solution actually work? Is there a better way to achieve the same result? Or do you need more information about my problem?
Please forgive me if this question is not well formed; I am new here and this is my first question, so I may be imprecise... thanks anyway for the help.
Note: we will also back up the entire database with mysqldump.
You can use mysqldump's --where parameter and provide a condition like client_id=N. Of course, I am making an assumption, since you don't provide any information about your schema.
If you have a star schema, then you could probably write a small script that backs up all the lookup tables (assuming they are reasonably small) using the --tables parameter, and use the --where condition for your client data table. For additional performance, you could perhaps partition the table by client_id.