Solr: continuous migration from MySQL - mysql

This may sound like an opinion question, but it's actually a technical one: Is there a standard process for maintaining a simple data set?
What I mean is this: let's say all I have is a list of something (we'll say books). The primary storage engine is MySQL. I see that Solr has a data import handler. I understand that I can use this to pull in book records on a first run - is it possible to use this for continuous migration? If so, would it work as well for updating books that have already been pulled into Solr as it would for pulling in new book records?
Otherwise, if the data import handler isn't the standard way to do it, what other ways are there? Thoughts?
Thank you very much for the help!

If you want to update documents from within Solr, I believe you'll need to use the UpdateRequestHandler as opposed to the DataImportHandler. I've never had need to do this where I work, so I don't know all that much about it. You may find this link of interest: Uploading Data With Index Handlers.
If you want to update Solr with records that have newly been added to your MySQL database, you would use the DataImportHandler for a delta-import. Basically, how it works is you have some kind of field in MySQL that shows the new record is, well, new. If the record is new, Solr will import it. For example, where I work, we have an "updated" field that Solr uses to determine whether or not it should import that record. Here's a good link to visit: DataImportHandler

The question looks similar to the one which we are doing, but not with SQL. Its with HBase(hadoop stack DB). However there we have Hbase indexer, which after mapping DB with Solr, listens to the events in hbase(DB) for new rows, and then executes code to fetch those values from DB and add in Solr. Not sure if there is such for SQL. However the concept looks similar. IN SQL I know about triggers which can listen to inserts and updates. At that even, you can trigger something to execute the steps of adding them in continuosly manner.

Related

MySQL search with typo

In my MySQL database I have a user table. I need to perform search as you type with typo over the user name field. There are few very old question on this topic. I tested the builtin full text search of mysql but it didn't work as expected (it does not handle typo) [I knew but I tried anyway].
What's my best option? I thought there should be an easy solution nowadays. I'm thinking about replicating the user table on elasticsearch and do the instant search from there, but I'd really like to avoid the syncronization nightmare that this will cause.
Thanks!!
You could use SOUNDEX for mysql. We have tried that but I can say that it does not work that well and it also makes the search a bit slow.
We Had a similar issue and switched to ES.
What we did is as follows:
Created a trigger for the table that will be synced to ES. The
trigger will write to a new table. The columns of such a table would
be:
IdToUpdate Operation DateTime IsSynced
The Operation would be create, update, delete. IsSynced will tell
whether the update is pushed to ES.
Then add a corn job that would query this table for all rows that will have issynced set to say '0', Add those ID's and operation to a Queue like RabbitMQ. And set the ISSynced to 1 for those ID's
The reason to use RabbitMQ is that it will make sure that the update is forwarded to ES. In case of failure we can always re-queue the the object.
Write a consumer to get the objects from the queue and update ES.
Apart from this you will also have to create a utility that will create an ES index from the database for first time use.
And you can also look at Fuzzy Search of ES that will handle typo's
Also Completion suggester which also supports fuzzy search.

Sybase to MySQL automatic exportation

I have two databases: Sybase and MySQL. I need to export records to MySql when these are inserted in Sybase or export in some scheduled event.
I've tried with output statement but this can not be used in triggers or procedures.
Any suggestion to solve this problem?
(disclaimer, I've done similar things previously, but by no means would I consider the answer below the state of the art - just one possible approach
google around something like 'cross-database replication' or 'cross rdbms replication' to see who's done this before.
).
I would first of all see if you can't score an ETL tool do the job without too much work. There are free open source ones and even things like Microsoft SSIS might work on non-MS databases.
If not, I would split this into different steps.
Find an appropriate Sybase output command that exports a subset of rows from one or more tables. By subset I mean you need to be able to add a WHERE clause, not just do a full table dump.
Use an appropriate MySQL import script/command to load the data gotten out of step #1. You may need to cycle back and forth between the 2 till you have something that works manually.
Write a Sybase trigger to insert lookup keys into a to-export table. You want to store at least the tablename & source Sybase table's keys for each inserted row. Use column names like key1_char, key2_char, not the actual column names, that makes it easier to extend to other source tables as needed. keep trigger processing as light as possible. What about updates btw?
Write a scheduled batch on Sybase side to run step #1 for the rows flagged in #3.
Write a scheduled batch on Mysql to import ,via #2, the results of #4. Or kick it off from #4.
Another approach is to do the #3 flagging bit as needed, but use to drive one scheduled batch that SELECTs data from Sybase and INSERTs it into mysql directly.
You'll have to pick up the data from Sybase's SELECT and bind it manually to the INSERT of mysql. But you probably get finer control over whats going on and you don't have to juggle 2 batches. That's what I think a clever ETL would already be doing on your behalf. Any half clever scripting language like php, python or ruby ought to handle it easily. Especially important if you have things like surrogate/auto-generated keys.
Keep in mind that in both cases you'll have to either delete the to-export rows that you've successfully inserted or flag them as done.

Entity Framework 4.1 Custom Database Initializer strategy

I would like to implement a custom database initialization strategy so that I can:
generate the database if not exists
if model change create only new tables
if model change create only new fields without dropping the table and losing the data.
Thanks in advance
You need to implement IDatabaseInitializer interface.
Eg
public class MyInitializer : IDatabaseInitializer<MyDbContext>
{
public void InitializeDatabase(MyDbContext context)
{
//your logic here
}
}
And then set your initializer at your application startup
Database.SetInitializer<ProductCatalog>(new MyInitializer());
Here's an example
You will have to manually execute commands to alter the database.
context.ObjectContext.ExecuteStoreCommand("ALTER TABLE dbo.MyTable ADD NewColumn VARCHAR(20) NULL");
You can use a tool like SQL Compare to script changes.
There is a reason why this doesn't exist yet. It is very complex and moreover IDatabaseInitializer interface is not very prepared for such that (there is no way to make such initialization database agnostic). Your question is "too broad" to be answered to your satisfaction. With your reaction to #Eranga's correct answer you simply expect that somebody will tell you step by step how to do that but we will not - that would mean we will write the initializer for you.
What you need to do what you want?
You must have very good knowledge of SQL Server. You must know how does SQL server store information about database, tables, columns and relations = you must understand sys views and you must know how to query them to get data about current database structure.
You must have very good knowledge of EF. You must know how does EF store mapping information. You must be able to explore metadata get information about expected tables, columns and relations.
Once you have old database description and new database description you must be able to write a code which will correctly explore changes and create SQL DDL commands for changing your database. Even this look like the simplest part of the whole process this is actually the hardest one because there are many other internal rules in SQL server which cannot be violated by your commands. Sometimes you really need to drop table to make your changes and if you don't want to lose data you must first push them to temporary table and after recreating table you must push them back. Sometimes you are doing changes in constraints which can require temporarily turning constrains off, etc. There is good reason why tools which do this on SQL level (comparing two databases) are probably all commercial.
Even ADO.NET team doesn't implemented this and they will not implement it in the future. Instead they are working on something called migrations.
Edit:
That is true that ObjectContext can return you script for database creation - that is exactly what default initializers are using. But how it could help you? Are you going to parse that script to see what changed? Are you going to execute that script in another connection to use the same code as for current database to see its structure?
Yes you can create a new database, move data from the old database to a new one, delete the old one and rename a new one but that is the most stupid solution you can ever imagine and no database administrator will ever allow that. Even this solution still requires analysis of changes to create correct data transfer scripts.
Automatic upgrade is a wrong way. You should always prepare upgrade script manually with help of some tools, test it and after that execute it manually or as part of some installation script / package. You must also backup your database before you are going to do any changes.
The best way to achieve this is probably with migrations:
http://nuget.org/List/Packages/EntityFramework.SqlMigrations
Good blog posts here and here.

Perl: How to copy/mirror remote MYSQL table(s) to another database? Possibly different structure too?

I am very new to this and a good friend is in a bind. I am at my wits end. I have used gui's like navicat and sqlyog to do this but, only manually.
His band info data (schedules and whatnot) is in a MYSQL database on a server (admin server).
I am putting together a basic site for him written in Perl that grabs data from a database that resides on my server (public server) and displays schedule info, previous gig newsletters and some fan interaction.
He uses an administrative interface, which he likes and desires to keep, to manage the data on the admin server.
The admin server db has a bunch of tables and even table data the public db does not need.
So, I created tables on the public side that only contain relevant data.
I basically used a gui to export the data, then insert to the public side whenever he made updates to the admin db (copy and paste).
(FYI I am using DBI module to access the data in/via my public db perl script.)
I could access the admin server directly to grab only the data I need but, the whole purpose of this is to "mirror" the data not access the admin server on every query. Also, some tables are THOUSANDS of rows and parsing every row in a loop seemed too "bulky" to me. There is however a "time" column which could be utilized to compare to.
I cannot "sync" due to the fact that the structures are different, I only need the relevant table data from only three tables.
SO...... I desire to automate!
I read "copy" was a fast way but, my findings in how to implement were too advanced for my level.
I do not have the luxury of placing a script on the admin server to notify when there was an update.
1- I would like to set up a script to check a table to see if a row was updated or added on the admin servers db.
I would then desire to update or insert the new or changed data to the public servers db.
This "check" could be set up in a cron job I guess or triggered when a specific page loads on the public side. (the same sub routine called by the cron I would assume).
This data does not need to be "real time" but, if he updates something it would be nice to have it appear as quickly as possible.
I have done much reading, module research and experimenting but, here I am again at stackoverflow where I always get great advice and examples.
Much of the terminology is still quite over my head so verbose examples with explanations really help me learn quicker.
Thanks in advance.
The two terms you are looking for are either "replication" or "ETL".
First, replication approach.
Let's assume your admin server has tables T1, T2, T3 and your public server has tables TP1, TP2.
So, what you want to do (since you have different table structres as you said) is:
Take the tables from public server, and create exact copies of those tables on the admin server (TP1 and TP2).
Create a trigger on the admin server's original tables to populate the data from T1/T2/T3 into admin server's copy of TP1/TP2.
You will also need to do initial data population from T1/T2/T3 into admin server's copy of TP1/TP2. Duh.
Set up the "replication" from admin server's TP1/TP2 to public server's TP1/TP2
A different approach is to write a program (such programs are called ETL - Extract-Transform-Load) which will extract the data from T1/T2/T3 on admin server (the "E" part of "ETL"), massage the data into format suitable for loading into TP1/TP2 tables (the "T" part of "ETL"), transfer (via ftp/scp/whatnot) those files to public server, and the second half of the program (the "L") part will load the files into the tables TP1/TP2 on public server. Both halfs of the program would be launched by cron or your scheduler of choice.
There's an article with a very good example of how to start building Perl/MySQL ETL: http://oreilly.com/pub/a/databases/2007/04/12/building-a-data-warehouse-with-mysql-and-perl.html?page=2
If you prefer not to build your own, here's a list of open source ETL systems, never used any of them so no opinions on their usability/quality: http://www.manageability.org/blog/stuff/open-source-etl
I think you've misunderstood ETL as a problem domain, which is complicated, versus ETL as a one-off solution, which is often not much harder than writing a report. Unless I've totally misunderstood your problem, you don't need a general ETL solution, you need a one-off solution that works on a handful of tables and a few thousand rows. ETL and Schema mapping sound scarier than they are for a single job. (The generalization, scaling, change-management, and OLTP-to-OLAP support of ETL are where it gets especially difficult.) If you can use Perl to write a report out of a SQL database, you probably know enough to handle the ETL involved here.
1- I would like to set up a script to check a table to see if a row was updated or added on the admin servers db. I would then desire to update or insert the new or changed data to the public servers db.
If every table you need to pull from has an update timestamp column, then your cron job includes some SELECT statements with WHERE clauses based on the last time the cron job ran to get only the updates. Tables without an update timestamp will probably need a full dump.
I'd use a one-to-one table mapping unless normalization was required... just simpler to my opinion. Why complicate it with "big" schema changes if you don't have to?
some tables are THOUSANDS of rows and parsing every row in a loop seemed too "bulky" to me.
Limit your queries to only the columns you need (and if there are no BLOBs or exceptionally big columns in what you need) a few thousand rows should not be a problem via DBI with a FETCHALL method. Loop all you want locally, just make as few trips to the remote database as possible.
If a row is has a newer date, update it. I will also have to check for new rows for insertion.
Each table needs one SELECT ... WHERE updated_timestamp_columnname > last_cron_run_timestamp. That result set will contain all rows with newer timestamps, which contains newly inserted rows (if the timestamp column behaves like I'd expect). For updating your local database, check out MySQL's ON DUPLICATE KEY UPDATE syntax... this will let you do it in one step.
... how to implement were too advanced for my level ...
Yes, I have actually done this already but, I have to manually update...
Some questions to help us understand your level... Are you hitting the database from the mysql client command-line or from a GUI? Have you gotten to the point where you've wrapped your SQL queries in Perl and DBI, yet?
If the two databases have different, you'll need an ETL solution to map from one schema to another.
If the schemas are the same, all you have to do is replicate the data from one to the other.
Why not just create identical structure on the 'slave' server to the master server. Then create a small table that keeps track of the last timestamp or id for the updated tables.
Then select from the master all rows changed since the last timestamp or greater than the id. Insert them into the matching table on the slave server.
You will need to be careful of updated rows. If a row on the master is updated but the timestamp doesn't change then how will you tell which rows to fetch? If that's not an issue the process is quite simple.
If it is an issue then you need to be more sophisticated, but without knowing the data structure and update mechanism its a goose chase to give pointers on it.
The script could be called by cron every so often to update the changes.
if the database structures must be different on the two servers then a simple translation step may need to be added, but most of the time that can be done within the sql select statement and maybe a join or two.

How to version control data stored in mysql

I'm trying to use a simple mysql database but tweak it so that every field is backed up up to an indefinite number of versions. The best way I can illustrate this is by replacing each and every field of every table with a stack of all the values this field has ever had (each of these values should be timestamped). I guess it's kind of like having customized version control for all my data..
Any ideas on how to do this?
The usual method for "tracking any changes" to a table is to add insert/update/delete trigger procedures on the table and have those records saved in a history table.
For example, if your main data table is "ItemInfo" then you would also have an ItemInfo_History table that got a copy of the new record every time anything changed (via the triggers).
This keeps the performance of your primary table consistent, yet gives you access to the history of any changes if you need it.
Here are some examples, they are for SQL Server but they demonstrate the logic:
My Repository table
My Repository History table
My Repository Insert trigger procedure
My Repository Update trigger procedure
Hmm, what you're talking about sounds similar to Slowly Changing Dimension.
Be aware that version control on arbitrary database structures is officially a rather Hard Problem. :-)
A simple solution would be to add a version/revision field to the tables, and whenever a record is updated, instead of updating it in place, insert a copy with the changes applied and the version number incremented. Then when selecting, always choose the record with the latest version. That's roughly how most such schemes are implemented (e.g. Wikimedia does it pretty much this exact way).
Maybe a tool can help you to do that for you. Have a look at nextep designer :
https://github.com/christophefondacci/nextep-designer
With this IDE you will be able to take snapshots of your database structure and data and put it under version control. After this you can compute the differences between any 2 versions and generate the appropriate SQL that can insert / update / delete your data.
Maybe this is an alternative way to achieve what you wanted.