Database strategy to provide data analytics - MySQL

I provide a solution that handles operations for brick-and-mortar shops. My next step is to provide analytics for my customers.
As I am in the starting phase, I am hoping to find a free way to do it myself instead of using third-party solutions. I am not expecting massive scale at this point, but I would like to get it done right instead of running queries off the production database.
For performance reasons, I am thinking I should run the analytics queries against separate tables in the same database. A cron job would run every night to replicate the data from the production tables to the analytics tables.
Is that the proper way to do this?
The other option I have in mind is to run the analytics from a different database (as opposed to just different tables). I am using Amazon RDS with MySQL, if that makes things more convenient.
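For what it's worth, the nightly cron copy described above can be as simple as a script that rebuilds the analytics tables from the production ones. A minimal sketch in Python, where the schema, table, and credential names are all made up for illustration:

```python
# Hypothetical nightly copy job, run from cron, assuming production tables live
# in a "shopops" schema and analytics copies go to "shopops_analytics" on the
# same MySQL instance. Uses mysql-connector-python; every name is a placeholder.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="etl", password="secret")
cur = conn.cursor()

TABLES = ["orders", "order_items", "customers"]  # tables worth reporting on

for table in TABLES:
    # Rebuild each analytics table from its production counterpart.
    cur.execute(f"DROP TABLE IF EXISTS shopops_analytics.{table}")
    cur.execute(
        f"CREATE TABLE shopops_analytics.{table} AS SELECT * FROM shopops.{table}"
    )

cur.close()
conn.close()
```

A crontab entry such as `0 2 * * * /usr/bin/python3 /opt/etl/nightly_copy.py` (path hypothetical) would run it every night at 02:00.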

It depends on how much analytics you want to provide.
I am a DWH manager and would start off with a small (free) BI (Business Intelligence) solution.
Your production DB and analytics DB should always be separate.
Take a look at Pentaho Data Integration (Community Edition). It's a free ETL tool that will help you get your data from your production database to your analytics database, and it can also perform transformations.
Check out some free reporting software like Jaspersoft to help you provide a reporting platform for your customers (if that's what you want; otherwise just use Excel).
BI never wants to throw away data. If you think the data in your analytics DB is going to grow large (2 TB+), don't use MySQL but rather PostgreSQL. MySQL does not handle big data well.
If you are really serious about this, read "The Data Warehouse Toolkit" by Ralph Kimball. That will set you up with some basic data warehouse knowledge.

Amazon RDS provides something called a read replica, which automatically performs replication and is optimised for reads.
I like this solution for its convenience. The downside is its price tag.
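If you do go the read-replica route, creating one is a single API call. A hedged sketch with boto3, where the instance identifiers and instance class are assumptions:

```python
# Sketch: create an RDS read replica to run analytics/reporting queries against,
# leaving the production instance alone. All identifiers are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="shop-analytics-replica",   # name of the new replica (made up)
    SourceDBInstanceIdentifier="shop-production",    # existing RDS MySQL instance (made up)
    DBInstanceClass="db.t3.medium",                  # may differ from the source instance
)
```

Analytics queries then point at the replica's endpoint, so heavy reports never touch the production writer.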

Related

Oracle Golden Gate for Integration with Microservices

I'm working on the migration/integration of a large on-premise monolithic Oracle app to cloud-based microservices. For a long time, the microservices will need to be fed from and synchronized with the Oracle DB.
One of the alternatives is using Oracle Golden Gate for DB-to-DB near-real-time replication. The advantage is that it seems to be reliable and resilient. The disadvantage is that it works on low-level CDC/DB changes (as opposed to app-level events).
An alternative is creating higher-level business events from the source DB by enriching the data and then pushing it to something like Kafka. The disadvantage is that this puts more load on the source DB and requires durability on the source.
Anybody dealt with similar problems? Any advice is appreciated.
The biggest problem for us has been that legacy data is on the LAN, and our microservices are in the public cloud (in an attempt to avoid a "new legacy" hybrid cloud future).
Oracle GoldenGate for Big Data can push change records as JSON to Kafka/Confluent. There's also the option to write your own handlers. You can find a lot of our PoC code on GitHub.
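For context, consuming those JSON change records on the microservice side might look roughly like the sketch below; the broker address, topic name, consumer group, and JSON field names all depend on how the Kafka handler is configured and are assumptions here:

```python
# Sketch: read GoldenGate-for-Big-Data change records from Kafka and turn them
# into higher-level business events. Broker, topic, group id and field names
# are assumptions; the real layout depends on the handler configuration.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "legacy-cdc-bridge",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ogg.ORDERS"])  # assumed: one topic per source table

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        change = json.loads(msg.value())
        # Enrich the low-level change into an app-level event here (lookups,
        # joins, renaming columns), then publish it onward to the microservices.
        print(change.get("op_type"), change.get("table"))
finally:
    consumer.close()
```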
As time has gone by, it has become apparent that the number of feeds will end up in the 300+ range, and we're now considering a data virtualisation + caching approach rather than pushing the legacy data to the cloud apps.

Synchronising data between different databases

I'm looking for a possible solution for the following problem.
First, the situation I'm in:
I have two databases, one Oracle DB and one MySQL DB. Although they have a lot of similarities, they are not identical. A lot of tables are available in both the Oracle DB and the MySQL DB, but the Oracle tables are often more extensive and contain more columns.
The situation with the databases can't be changed, so I have to deal with that.
Now I'm looking for the following:
I want to synchronise data from Oracle to MySQL and vice versa. This has to be done in real time, or as close to real time as possible. So when changes are made in one DB, they have to be synced to the other DB as quickly as possible.
Also, not every table has to be in sync, so the solution must offer a way of selecting which tables have to be synced and which do not.
Because the databases are not identical, I don't think replication is an option. But what is?
I hope you can help me find a way of doing this, or a tool which does exactly what I need. Maybe you know some good papers/articles I can use?
Thanks!
Thanks for the comments.
I did some further research on ETL and EAI.
I found out that I am searching for an ETL tool.
I read your question and your answer. I have worked with Oracle, SQL, ETL and data warehouses, and here are my suggestions:
A ready-made ETL tool is a good choice. But if your application is big enough to need a tailor-made ETL process, I suggest a home-made one.
If your transactional database is on Oracle, you can set up triggers on the key tables that in turn call an external procedure written in C, C++ or Java.
The reason for using an external procedure is to be able to communicate with both databases at the same time - Oracle and MySQL.
You can read more about Oracle external procedures here.
If not through ExtProc, you can develop a separate application in Java or .Net that would extract data from the first database, transform it according to your business rules, and load it into your warehouse.
In either approach, you will have greater control over the ETL process if you implement your own tool rather than going with a ready-made one.
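As an illustration of the "separate application" route above, a stripped-down extract-transform-load pass from Oracle to MySQL could look like this sketch (written in Python rather than Java/.Net; the drivers used, the table and column names, and the LAST_MODIFIED watermark column are all assumptions):

```python
# Sketch: pull rows changed since the last sync from Oracle, apply a simple
# transform, and upsert them into MySQL. Everything named here is a placeholder.
from datetime import datetime, timedelta

import mysql.connector
import oracledb

# In a real job this watermark would be read from a bookmark table or file.
last_sync_time = datetime.now() - timedelta(hours=1)

ora = oracledb.connect(user="app", password="secret", dsn="orahost/ORCLPDB1")
my = mysql.connector.connect(host="mysqlhost", user="etl", password="secret", database="appdb")

ora_cur = ora.cursor()
ora_cur.execute(
    "SELECT id, name, status FROM customers WHERE last_modified > :since",
    since=last_sync_time,
)

# Transform: the MySQL table only keeps a subset of the Oracle columns.
rows = [(cid, name.strip().upper(), status) for cid, name, status in ora_cur]

my_cur = my.cursor()
my_cur.executemany(
    "REPLACE INTO customers (id, name, status) VALUES (%s, %s, %s)", rows
)
my.commit()
```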

What strategy/technology should I use for this kind of replication?

I am currently facing a problem for which I have not yet figured out a good solution, so I hope to get some advice from you all.
My problem is as shown in the picture:
The Core Database is where all the clients connect to manage live data; it is really, really big and busy all the time.
The Feature Database is not used as often, but it needs some part of the live data (maybe 5%) from the Core Database. However, requests to this server take longer and consume more resources.
What is my current solution:
I use database replication between the Core Database and the Feature Database, and it works fine. But the problem is that I waste a lot of disk space storing unwanted data.
(Filtering while replicating does not work with my database schema.)
Using a queueing system would not keep the data live in time, as there are many requests to the Core Database.
Please suggest some ideas if you have dealt with this before.
Thanks,
Pang
What you describe is a classic data integration task. You can use any data integration tool to extract data from your core database and load it into the feature database. You can schedule your data integration jobs anywhere from near real time to any time-frame you like.
I used Talend in my mid-size (10GB) semi-scientific PostgreSQL database integration project. It worked beautifully.
You can also try SQL Server Integration Services (SSIS). This tool is very powerful as well. It works with all top-notch RDBMSs.
If all you're worried about is disk space, I would stick with the solution you have right now. 100 GB of disk space costs less than a dollar these days - for that money, you can't really afford to bring a new solution into the system.
Logically, there's also a case to be made for keeping the filtering in the same application: keeping the responsibility for knowing which records are relevant inside the app itself, rather than in some mysterious integration layer, will reduce overall solution complexity. Only accept the additional complexity of a separate integration layer if you really need it.
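A hedged sketch of that application-level filtering, assuming purely for illustration that both databases are MySQL and that a `featured` flag marks the small fraction of rows the Feature Database needs:

```python
# Sketch: a small scheduled job that copies only the rows the Feature Database
# actually needs, instead of replicating everything. Hosts, credentials, the
# table, and the filter predicate are all placeholders.
import mysql.connector

core = mysql.connector.connect(host="core-db", user="etl", password="secret", database="live")
feature = mysql.connector.connect(host="feature-db", user="etl", password="secret", database="feat")

src = core.cursor()
dst = feature.cursor()

# The application, not the replication layer, decides which records are relevant.
src.execute("SELECT id, sku, qty, updated_at FROM inventory WHERE featured = 1")

dst.executemany(
    "REPLACE INTO inventory (id, sku, qty, updated_at) VALUES (%s, %s, %s, %s)",
    src.fetchall(),
)
feature.commit()
```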

What tools are people using to measure SQL Server database performance?

I've experimented with a number of techniques for monitoring the health of our SQL Servers, ranging from the Management Data Warehouse functionality built into SQL Server 2008, to commercial products such as Confio Ignite 8, and of course rolling my own solution using perfmon, performance counters and various information collected from the dynamic management views and functions.
What I am finding is that whilst each of these approaches has its own strengths, they all have weaknesses too. I feel that to get people within the organisation to take the monitoring of SQL Server performance seriously, whatever solution we roll out has to be very simple and quick to use, must provide some form of dashboard, and the act of monitoring must have minimal impact on the production databases (and, perhaps even more importantly, it must be possible to prove that this is the case).
So I'm interested to hear what others are using for this task. Any recommendations?
RedGate's SQL Response is definitely a great tool for the job.
EDIT #1
There is also SQLIO, which tests the I/O of SQL Server. You will find further information by following the link provided.
There are other performance testing tools, such as DBMonster, which are open source (you need to scroll down).
I can't remember the name of the testing tool I already referenced here on SO. I shall add it here when I find it.
Check out Idera and Quest for additional tools/ideas.
www.idera.com/Content/Home.aspx
www.quest.com
Don't forget the SQL Server 2008 custom reports & Report Server for a roll-your-own solution.

ETL mechanisms for MySQL to SQL Server over WAN

I'm looking for some feedback on mechanisms to batch data from MySQL Community Server 5.1.32 on an external host down to an internal SQL Server 2005 Enterprise machine over VPN. The external box accumulates data throughout business hours (about 100Mb per day), which then needs to be transferred internationally across a WAN connection (quality not yet determined, but it's not going to be super fast) to an internal corporate environment before some BI work is performed. This should just be change-sets making their way down each night.
I'm interested in thoughts on the ETL mechanisms people have successfully used in similar scenarios. SSIS seems like a potential candidate; can anyone comment on its suitability for this scenario? Alternatively, other thoughts on how to do this in a cost-conscious way would be most appreciated. Thanks!
It depends on the use you have for the data received from the external machine.
If you must have the data for the next morning's calculations, or you do not have confidence in your network, you would prefer to loosely couple the two systems and put some message queuing between them, so that if something fails during the night (the DBs, the network links, anything that would be a pain for you to recover) you can still start every morning with some data.
If the data retrieval is not highly critical, any solution is good :)
Regarding SSIS, it's just a great ETL framework (yes, there's a subtlety :)). But I don't see it as part of the data transfer itself; rather, it belongs in the ETL part, once your data has been received or while it is still waiting in the message-queuing system.
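To make the loose coupling concrete, one possibility (purely a sketch; RabbitMQ is just one example of a queuing layer, and the host, queue name and payload are made up) is to publish each night's change-set as a durable message that can be re-consumed if the import fails:

```python
# Sketch: publish the nightly change-set to a durable queue so a failed transfer
# or a crashed import can be retried the next morning without losing data.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="mq.internal"))
channel = connection.channel()
channel.queue_declare(queue="nightly-changesets", durable=True)

# Payload extracted earlier from MySQL; the shape is illustrative only.
changeset = {"table": "orders", "rows": [{"id": 1, "total": 9.99}]}

channel.basic_publish(
    exchange="",
    routing_key="nightly-changesets",
    body=json.dumps(changeset),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
connection.close()
```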
First, if you are going to do this, have a good way to easily see what has changed since the last run. Every table should have a last-updated date or a timestamp column that changes when a record is updated (not sure if MySQL has this). This is far better than comparing every single field.
If you had SQL Server in both locations, I would recommend replication. Is it possible to use SQL Server instead of MySQL? If not, then SSIS is your best bet.
In terms of actually getting your data from MySQL into SQL Server, you can use SSIS to import the data using a number of methods. One would be to connect directly to your MySQL source (via an OLE DB connection or similar), or you could do a daily export from MySQL to a flat file and pick this up using an FTP task. Once you have the data, SSIS can perform the required transforms before loading the processed data into SQL Server.
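A minimal sketch of the flat-file route, assuming an `updated_at` timestamp column exists for change detection (table, column, path and credentials are placeholders); SSIS or an FTP task would then pick the file up on the SQL Server side:

```python
# Sketch: nightly incremental export of changed rows from MySQL to a CSV file.
import csv
import datetime

import mysql.connector

# In a real job the watermark would come from the previous run, not "yesterday".
since = datetime.datetime.now() - datetime.timedelta(days=1)

conn = mysql.connector.connect(host="localhost", user="etl", password="secret", database="shop")
cur = conn.cursor()
cur.execute(
    "SELECT id, customer_id, total, updated_at FROM orders WHERE updated_at >= %s",
    (since,),
)

with open("/var/exports/orders_delta.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "customer_id", "total", "updated_at"])
    writer.writerows(cur.fetchall())
```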