WSO2 ESB Analytics - Production Database Configuration

What is the best configuration for the ESB Analytics database? The data grows over time, so what is the best way to handle it in a production environment? Any ideas on this would be really helpful, thanks.

In production you can run purging jobs [1] to remove old data from the SECOND, MINUTE, and even HOURLY tables, since the same information is aggregated into the DAILY, MONTHLY, and YEARLY tables.
[1] https://docs.wso2.com/display/EI600/Purging+Analytics+Data
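The purging task in [1] is the documented route; if you instead manage purging yourself at the RDBMS level, it conceptually reduces to a scheduled delete of fine-grained rows older than your retention window. A minimal sketch (the table and column names are hypothetical, not actual WSO2 DAS names):

    -- Hypothetical per-minute statistics table; real WSO2 DAS table names differ.
    -- Keep only the last 30 days of minute-level rows; coarser aggregates live in other tables.
    DELETE FROM esb_stats_per_minute
    WHERE event_timestamp < NOW() - INTERVAL 30 DAY;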

Related

Hive or HBase: which one should I use for my data analytics?

I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop and Hive or HBase.
Currently I have 4 physical machines for a POC. Could someone please help me come up with the most efficient architecture?
I will get 5 GB of data per day.
I have to send a daily status report to each customer.
I also have to provide analysis reports on request, for example a one-week report or a report for the first two weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want the best possible performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or newer engines such as Spark. Its advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It is batch processing, and 5 more GB a day shouldn't have a big impact. It does have high startup latency, but that shouldn't be a problem if you only run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with one of them. HBase stores its data on the distributed filesystem, and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from either source.
For reports based on a request that specifies a date range, you should store the data in an efficient layout so you don't have to read data that isn't needed for the report. Hive supports partitioning, and that can be done by date (e.g. /<year>/<month>/<day>/). Using partitioning can significantly reduce your job execution times.
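For example, a date-partitioned Hive table might look like the sketch below (table and column names are made up for illustration):

    -- Hypothetical events table, partitioned by date so queries only read the days they need.
    CREATE TABLE customer_events (
      customer_id STRING,
      event_type  STRING,
      amount      DOUBLE
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS ORC;

    -- A "first two weeks of last month" report scans only 14 partitions, not the whole dataset.
    SELECT customer_id, event_type, SUM(amount) AS total
    FROM customer_events
    WHERE year = 2014 AND month = 4 AND day BETWEEN 1 AND 14
    GROUP BY customer_id, event_type;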
If you go with the NoSQL approach, make sure the row keys have a date prefix (e.g. 20140521...) so that you can select exactly the rows whose keys start with the dates you want.
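In Cassandra the same idea can be expressed by putting the date bucket into the primary key, so a date-range report only touches the relevant partitions (a CQL sketch with made-up names):

    -- One partition per customer per day; rows inside it are ordered by event time.
    CREATE TABLE customer_events_by_day (
      customer_id text,
      day         text,       -- e.g. '20140521'
      event_time  timestamp,
      amount      double,
      PRIMARY KEY ((customer_id, day), event_time)
    );

    -- Fetch a single day's data for one customer without scanning anything else.
    SELECT * FROM customer_events_by_day
    WHERE customer_id = 'c-42' AND day = '20140521';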
Some questions you should also consider are:
How much data do you want to keep in the cluster, e.g. the last 180 days. This affects the number of nodes and disks; beware that data is usually replicated 3 times, so 5 GB/day over 180 days is already roughly 2.7 TB of raw capacity.
How many files you will have in HDFS. When the number of files is high, the Namenode is hit hard when retrieving file metadata. Solutions exist, such as a replicated Namenode or the MapR Hadoop distribution, which doesn't rely on a Namenode per se.

database strategy to provide data analytics

I provide a solution that handles operations for brick and mortar shops. My next step is to provide analytics for my customers.
As I am in the starting phase, I am hoping to find a free way to do it myself instead of using third-party solutions. I am not expecting massive scale at this point, but I would like to get it done right instead of running queries off the production database.
For performance reasons, I am thinking I should run the analytics queries against separate tables in the same database, with a cron job running every night to replicate the data from the production tables to the analytics tables.
Is that the proper way to do this?
The other option I have in mind is to run the analytics from a different database (as opposed to just separate tables). I am using Amazon RDS with MySQL, if that makes a difference.
It depends on how much analytics you want to provide.
I am a DWH manager and would start off with a small (free) BI (Business Intelligence) solution.
Your production DB and analytics DB should always be separate.
Take a look at Pentaho Data Integration (Community Edition). It's a free ETL tool that will help you get your data from your production database into your analytics database and can also perform transformations.
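Whichever tool drives it, the nightly load from production to the analytics database usually boils down to an aggregate-and-append step; a minimal MySQL sketch (schema and table names are hypothetical):

    -- Append yesterday's per-shop totals from the production schema into the analytics schema.
    INSERT INTO analytics.daily_shop_sales (shop_id, sales_date, order_count, revenue)
    SELECT o.shop_id,
           DATE(o.created_at)  AS sales_date,
           COUNT(*)            AS order_count,
           SUM(o.total_amount) AS revenue
    FROM production.orders AS o
    WHERE o.created_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND o.created_at <  CURRENT_DATE
    GROUP BY o.shop_id, DATE(o.created_at);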
Check out some free reporting software like Jaspersoft to provide a reporting platform for your customers (if that's what you want; otherwise just use Excel).
BI never wants to throw away data. If you think the data in your analytics DB is going to grow large (2 TB+), don't use MySQL; use PostgreSQL instead. MySQL does not handle big data well.
If you are really serious about this, read "The Datawarehouse Toolkit" by Ralph Kimball. That will set you up with some basic Data Warehouse knowledge.
Amazon RDS provides something called a read replica, which automatically performs replication and is optimised for reads.
I like this solution for its convenience. The downside is its price tag.

OLTP to OLAP with CQRS or SSIS

Background Information
Our application reads/writes from 3 components:
ASP.NET MVC 3 customer front end website (write actions)
Winform verification tool at stores (write actions)
Silverlight Dashboard for tenant (95% aggregate reads 5% write actions)
(3) is the only piece that could use some performance improvement.
Our storage is a SQL Server Standard OLTP database with stored procedures that aggregate the data consumed by the Silverlight app.
Using the Database Engine Tuning Advisor and execution plans, we don't see any critical indexes missing, and we rebuild indexes with a SQL Agent job.
Most of the widgets are sparklines:
x = time, selected by interval (day, week, month, year)
y = aggregate (sum, avg, etc.)
Currently we return about 14 to 20 points per widget, and the dashboard opens with 10 widgets initially.
Our dimensions would be: tenant, store, (day,week,month,year)
Our facts: completed, incomplete, redeemed, score ...
I know a denormalized table would save SQL Server from recalculating these aggregates every time store managers, franchise owners, and corporate staff (~50 simultaneous users) view the data.
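For example, the denormalized table could be a daily aggregate keyed by those dimensions, maintained incrementally and read directly by the dashboard (a T-SQL sketch; all names are illustrative):

    -- One row per tenant, store and day; week/month/year views roll up from this table.
    CREATE TABLE dbo.DashboardDailyFacts (
        TenantId   INT   NOT NULL,
        StoreId    INT   NOT NULL,
        FactDate   DATE  NOT NULL,
        Completed  INT   NOT NULL,
        Incomplete INT   NOT NULL,
        Redeemed   INT   NOT NULL,
        Score      FLOAT NOT NULL,
        CONSTRAINT PK_DashboardDailyFacts PRIMARY KEY (TenantId, StoreId, FactDate)
    );

    -- A one-month sparkline for one store is then a read of roughly 30 rows.
    SELECT FactDate, SUM(Completed) AS Completed, AVG(Score) AS AvgScore
    FROM dbo.DashboardDailyFacts
    WHERE TenantId = @TenantId
      AND StoreId  = @StoreId
      AND FactDate >= DATEADD(MONTH, -1, CAST(GETDATE() AS DATE))
    GROUP BY FactDate
    ORDER BY FactDate;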
I'll be honest: if we go with OLAP, it will be my first hands-on experience with it.
Questions
What is the long term solution for a rich reporting dashboard?
I would assume OLAP. If so, how would you keep it up to date so that it matches the near-real-time dashboard we have today?
Putting up a maintenance page while the OLAP cube rebuilds is not an option.
Ideally we would update it incrementally, and we see NServiceBus (which we already use today) as a great bridge for updating these denormalized views. Do we put the denormalized views in the OLTP database as just more tables, or is there a way to incrementally update an OLAP data source?
References
http://www.udidahan.com/2009/12/09/clarified-cqrs/
http://www.udidahan.com/2011/10/02/why-you-should-be-using-cqrs-almost-everywhere%E2%80%A6/
"Putting a maintenance page while OLAP rebuilds itself is not an option."
Why would you say that? The OLAP cube remains available while it is rebuilding.
There are several storage modes that control how the refresh works: ROLAP, HOLAP and MOLAP. You can schedule automatic refreshes every X hours or even make the data available in near real time. Try reading about proactive caching in SSAS; it may give you some ideas.

VPS MySQL Problem but Where?

We have a VPS with around 40 databases, and one of them is getting very high traffic (selects). Is there any tool I can install to record or visualize which database is getting the most traffic/hits?
Thanks
MySQL Enterprise Monitor
Advanced Query Analyzer - provides multiple options for quickly finding the worst offenders. Collected queries can be sorted and filtered by server, database, query type, and query content, and analysis can be done over current time intervals or a historical date/time range.
MySQL and OS Metric Graphs - let the DBA monitor and correlate metrics such as database activity (selects, inserts, updates, deletes).
I believe it has a free trial.
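If you want a free first look before buying anything, MySQL's own performance_schema (5.6 and later, when enabled) can give a rough per-schema breakdown of statement counts:

    -- Statements grouped by the schema they ran against, busiest first.
    SELECT schema_name,
           SUM(count_star) AS statements
    FROM performance_schema.events_statements_summary_by_digest
    GROUP BY schema_name
    ORDER BY statements DESC;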

SQL Azure for extremely large databases

Consider a scenario with a database containing hundreds of millions of rows, reaching sizes of 500 GB, with maybe ~20 users. It's mostly storage for aggregated data to be reported on later.
Would SQL Azure be able to handle this scenario? If so, does it make sense to go that route compared to purchasing and housing 2+ high-end servers ($15k-$20k each) in a co-location facility, plus all the maintenance and backups?
Did you consider using Azure Table storage? Azure Tables do not have referential integrity, but if you are simply storing many rows, is that an option for you? You could use SQL Azure for your transactional needs and Azure Tables for the tables that do not fit in SQL Azure. Azure Tables will also be cheaper.
SQL Azure databases are limited to 50 GB at the moment, as described in the General Guidelines and Limitations.
I don't know whether SQL Azure can handle your scenario: 500 GB is a lot and doesn't even appear in the pricing list (50 GB max). I'm just trying to give some perspective on the pricing.
The official pricing of SQL Azure is around $10 per GB per month (http://www.microsoft.com/windowsazure/pricing/).
Therefore, 500 GB would cost roughly $5k per month. Two high-end servers at $20k each ($40k total, excluding license fees, maintenance and backups) would take about 8 months to pay off.
Or, from another point of view: assuming you replace your servers every 4 years, does a budget of $240k ($5k * 48 months) cover the hardware, installation/configuration, license fees and maintenance costs? (Not counting bandwidth and backups, since you'll pay extra for those with SQL Azure too.)
One option would be to use SQL Azure sharding. This spreads the data over multiple SQL Azure databases and has the advantage that each database gets its own CPU and disk (since each database is actually stored on a different machine in the data center), which should give you very good performance. Of course, this assumes your data can be sharded. There is some more info on this sharding here.