SQL Azure for extremely large databases - sql-server-2008

Consider a scenario with a database containing hundreds of millions of rows and reaching roughly 500 GB in size, with maybe ~20 users. It is mostly storage for aggregated data to be reported on later.
Would SQL Azure be able to handle this scenario? If so, does it make sense to go that route, compared to purchasing and housing 2+ high-end servers ($15k-$20k each) in a co-location facility, plus all maintenance and backups?

Did you consider using the Azure Table storage? Azure Tables do not have referential integrity, but if you are simply storing many rows, is that an option for you? You could use SQL Azure for your transactional needs, and use Azure Tables for those tables that do not fit in SQL Azure. Also, Azure Tables will be cheaper.

SQL Azure databases are limited to 50 GB at the moment, as described in the General Guidelines and Limitations.

I don't know whether SQL Azure can handle your scenario - 500 GB is a lot and does not appear in the pricing list (50 GB max). I'm just trying to give some perspective on the pricing.
Official pricing of SQL Azure is around $10 per GB per month ( http://www.microsoft.com/windowsazure/pricing/)
Therefore, 500 GB would cost roughly $5k each month. Two high-end servers at $20k each ($40k total, not counting license fees, maintenance and backups) would be paid for in about 8 months of those fees.
Or, from another point of view: assuming you replace your servers every 4 years, does a budget of $240k ($5k * 48 months) cover the hardware, installation/configuration, license fees and maintenance costs? (Not counting bandwidth and backups, since you'll pay extra for those with SQL Azure too.)

One option would be to use SQL Azure sharding. This is a way to spread the data over multiple SQL Azure databases, and it has the advantage that each database uses a different CPU and disk (since each database is actually stored on a different machine in the data center), which should give you really good performance. Of course, this is under the assumption that your data lends itself to sharding. There is some more info on sharding here.
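To make the idea concrete, here is a minimal sketch; the table and shard key are invented for illustration and are not from the question:

```sql
-- Hypothetical example: the same schema is created in every SQL Azure
-- database (shard); the application decides which shard each row lives in.
CREATE TABLE AggregatedMetrics (
    CustomerId  INT           NOT NULL,  -- shard key (assumption)
    MetricDate  DATE          NOT NULL,
    MetricName  NVARCHAR(100) NOT NULL,
    MetricValue BIGINT        NOT NULL,
    CONSTRAINT PK_AggregatedMetrics
        PRIMARY KEY (CustomerId, MetricDate, MetricName)
);
-- Application-side routing (pseudologic, not part of SQL Azure itself):
--   shard_index = CustomerId % number_of_shards
--   connect to the connection string registered for shard_index
```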

Related

Database for handling large data

We have started a new project using MySQL, Spring Boot, and AngularJS. Initially, we did not realize our DB was going to have to handle large amounts of data.
The number of tables will not be large (<130); only 10 to 20 tables will hold most of the data and receive almost all of the inserts/reads/updates.
The estimated amount of data in those 10 tables is going to grow by about 1,200,000 records a month, and we must not delete that data, so that we can run various reports on it.
There needs to be a (read-only) replicated database as a backup/failover, and maybe for offloading reports at peak times.
I don't have first-hand experience with databases that large, so I'm asking those who do which DB is the best choice in this situation. We have completed 100% of the coding and development, but only now realize this. I have doubts about whether MySQL can handle this much data. I know Oracle is the safe bet, but I'm interested in whether MySQL with a similar setup would work. We are not bound to MySQL; based on your feedback I can make the call for any DB.
An open-source DB is preferable, but it's not mandatory; we can go for a paid DB as well.
Handling Large Data
MySQL is more than capable of handling such loads. In fact, it can handle much, much more load than what you are describing. You just have to create the right kind of tables. You can do that by choosing (a rough sketch illustrating these choices follows the list):
the correct storage engine for your use-case
the correct character set
the optimal data type for your column
the right indexing strategy - creating indexes thoughtfully
the right partitioning strategy (if the data in the table exceeds tens of millions of records)
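As a minimal sketch of those choices (the table, columns, and partition ranges are invented, not taken from your schema), a heavy InnoDB table might look like this:

```sql
-- Illustrative only: InnoDB engine, utf8mb4 charset, compact column types,
-- a covering index for the hot query, and monthly RANGE partitioning.
CREATE TABLE order_events (
    id          BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    account_id  INT UNSIGNED     NOT NULL,
    event_type  TINYINT UNSIGNED NOT NULL,
    amount      DECIMAL(12,2)    NOT NULL,
    created_at  DATETIME         NOT NULL,
    PRIMARY KEY (id, created_at),               -- the partition column must be in every unique key
    KEY idx_account_created (account_id, created_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION p_max    VALUES LESS THAN MAXVALUE
);
```

New monthly partitions would be added by a maintenance job, and old ones can be dropped or archived cheaply if the retention policy ever changes.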
EDIT: You also have to choose the right data modelling and normalization strategy for your use-case. Most OLTP applications require some level of normalization. But if you want to run analytics and aggregates on heavy tables, you should either have a data warehouse, or highly denormalized tables to avoid joins, and/or a column-oriented database to support such queries.
MySQL is open-source and has very strong community support, so you will find a lot of literature on any issue you face. You can also find all the filed bugs (resolved and unresolved) here.
As far as the number of tables is concerned, there's really no practical cap on that. See here: MySQL permits 4 billion tables if you're using InnoDB as the engine.
A lot of very big companies with scale use MySQL in some capacity. Facebook is one of them.
Native JSON Support
With the growing popularity of JSON as the de facto data exchange format across the internet, MySQL has also provided native JSON support in 5.7, so now you can store and query JSON from your APIs, if required.
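For example, a hypothetical table with a JSON column and a path-based query might look like this (the table and payload are made up; the ->> operator needs 5.7.13 or later, older 5.7 releases can use JSON_UNQUOTE(JSON_EXTRACT(...))):

```sql
-- Sketch of native JSON usage in MySQL 5.7
CREATE TABLE api_events (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload JSON            NOT NULL
) ENGINE=InnoDB;

INSERT INTO api_events (payload)
VALUES ('{"user": {"id": 42, "country": "IN"}, "action": "login"}');

-- ->> extracts and unquotes a value by JSON path; if this predicate becomes
-- hot, a generated column plus an index on it can be added later.
SELECT id
FROM   api_events
WHERE  payload ->> '$.user.country' = 'IN';
```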
HA and Replication
MySQL Replication works! Earlier, MySQL only supported binary-log-coordinate (file and position) based replication, but it now supports GTID replication, which makes it easier to maintain and fix replication issues. There are also third-party replicators available. For instance, Continuent's Tungsten is a replicator written in Java that can replace native replication, and it comes with many configuration options that are not available with native MySQL replication.
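As a rough sketch (host and account names are placeholders), and assuming gtid_mode=ON and enforce_gtid_consistency=ON are already configured on both servers, pointing a 5.7 replica at the source with GTID auto-positioning looks like this:

```sql
-- Run on the replica; GTID auto-positioning replaces binlog file/offset
-- coordinates, so failover and re-pointing become much simpler.
CHANGE MASTER TO
    MASTER_HOST = 'primary.example.internal',  -- hypothetical host
    MASTER_USER = 'repl',                      -- hypothetical replication account
    MASTER_PASSWORD = '********',
    MASTER_AUTO_POSITION = 1;                  -- GTID-based positioning
START SLAVE;
```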
I agree with MontyPython: MySQL can do it, and the design is critical. Fortunately, MySQL allows you to stay flexible over time as needed.
I've had history tables used in daily reporting grow to over a billion records in plain MySQL with no problems.
I've also used MySQL MERGE tables to divide up tables with big-ish rows (100 KB+) to speed things up, basically keeping each underlying table's file size under 30 GB. However, that solution increases the open file count (in the system) per client, which might be a bigger deal on a clustered system (that one was not).
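A minimal sketch of that MERGE setup, with made-up table names; the underlying tables must be MyISAM and share the same definition:

```sql
-- One physical MyISAM table per time range keeps each file manageably small.
CREATE TABLE history_2023 (
    id      BIGINT UNSIGNED NOT NULL,
    payload MEDIUMBLOB      NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

CREATE TABLE history_2024 LIKE history_2023;

-- The MERGE table presents the pieces as one logical table; new rows go to
-- the last table in the UNION list. Uniqueness is not enforced across pieces,
-- so the PRIMARY KEY becomes a plain KEY here.
CREATE TABLE history_all (
    id      BIGINT UNSIGNED NOT NULL,
    payload MEDIUMBLOB      NOT NULL,
    KEY (id)
) ENGINE=MERGE UNION=(history_2023, history_2024) INSERT_METHOD=LAST;
```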
That said, I'd like to give an honorable mention to:
MariaDB - MySQL, but with contributions from Facebook, Alibaba, Google, and more.
I've moved most of my MySQL Community Edition projects over to MariaDB and have been very happy. It's an almost transparent upgrade.
They offer an interesting enterprise big data analytics package (MariaDB AX), but with your current requirements it's probably overkill, and the standard community edition will fulfill your needs.
For example, here's an informative tutorial on how to set up a scalable Cluster (Galera) and adding MaxScale for High Availability:
https://mariadb.com/resources/blog/getting-started-mariadb-galera-and-mariadb-maxscale-centos
Another interesting option is Vitess, developed at YouTube, which provides sharded MySQL through a (mostly) driver-based solution. It solves the problem of needing access to huge amounts of data while always yielding good performance. As such, it goes beyond high availability and focuses on a setup in which no single query (e.g. a report against millions of rows of historical data) can negatively impact the other queries that need to be performed.

1 big Google Cloud SQL instance, 2 small Google Cloud SQL instances or 1 medium + 1 replica?

I've started to use Google Cloud SQL and I need to improve my IOPS and network speed. I've seen that this is only possible by changing the machine type and/or increasing the disk size. And this is my question: in my case, I need to migrate 2 MySQL databases (from 2 different projects), and I don't know what is better: 1 big instance with 2 databases, 2 small instances with one database each, or 1 regular instance + 1 read replica instance?
Thank you in advance!
The answer is the usual "it depends".
If you're not concerned with data isolation issues, a single instance would be more efficient and easier to manage.
If you split data between instances, you're also capping performance per database. This can be a non-issue if your datasets are similar and process the same amount of requests.
Read replicas could be a solution to scale IOPS if your application workload is heavily skewed towards reads.
Also, independent of which option you choose, consider an HA setup.

Huge sql server database with varbinary entries

We have to design a SQL Server 2008 R2 database storing many varbinary blobs.
Each blob will be around 40 KB, and there will be around 700,000 additional entries a day.
The estimated maximum size of the database is 25 TB (30 months of data).
The blobs will never change. They will only be stored and retrieved.
The blobs will either be deleted on the same day they are added, or only during cleanup after 30 months. In between there will be no change.
Of course we will need table partitioning, but the general question is, what do we need to consider during implementation for a functioning backup (to tape) and restore strategy?
Thanks for any recommendations!
Take a look at the "piecemeal backup and restore" - you will find it very useful for your scenario, which would benefit from different backup schedules for different filegroups/partitions. Here are a couple of articles to get you started:
http://msdn.microsoft.com/en-us/library/ms177425(v=sql.120).aspx
http://msdn.microsoft.com/en-us/library/dn387567(v=sql.120).aspx
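To illustrate (the database and filegroup names here are invented, not from the question), a piecemeal-friendly schedule might back up the read-write filegroups frequently and each aged, read-only filegroup only once:

```sql
-- Frequent backup of the filegroups that still receive new blobs.
BACKUP DATABASE BlobStore
    READ_WRITE_FILEGROUPS
    TO DISK = N'X:\Backups\BlobStore_rw.bak'
    WITH COMPRESSION, CHECKSUM;

-- Each older partition's filegroup is switched to READ_ONLY once it stops
-- changing, then backed up a single time and sent to tape.
BACKUP DATABASE BlobStore
    FILEGROUP = N'FG_2014_Q1'
    TO DISK = N'X:\Backups\BlobStore_FG_2014_Q1.bak'
    WITH COMPRESSION, CHECKSUM;
```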
I have had the pleasure in the past of working with several very large databases, the largest environment I have worked with being in the 5+ TB range. Going even larger than that, I am sure that you will encounter some unique challenges that I may not have faced.
What I can say for sure is that any backup strategy you implement is going to take a while, so you should plan to have at least one day a week devoted to backups and maintenance, during which the database, while available, should not be expected to perform at its usual level.
Second, I have found the following MVP article extremely useful in planning backups taken through the native MSSQL backup operations. There are some options to the backup command, aimed at large databases, which could help reduce your backup duration. While these increase throughput, you can expect a performance impact. The options that had the greatest impact in my testing were BUFFERCOUNT, BLOCKSIZE, and MAXTRANSFERSIZE.
http://henkvandervalk.com/how-to-increase-sql-database-full-backup-speed-using-compression-and-solid-state-disks
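For illustration only, a full backup using those options might look like the following; the values are placeholders to test against your own hardware rather than recommendations:

```sql
-- Striping across two backup devices plus tuned buffer settings; measure the
-- impact on concurrent workload before adopting any of these numbers.
BACKUP DATABASE BlobStore
    TO DISK = N'X:\Backups\BlobStore_full_1.bak',
       DISK = N'Y:\Backups\BlobStore_full_2.bak'
    WITH COMPRESSION,
         BUFFERCOUNT = 64,
         BLOCKSIZE = 65536,
         MAXTRANSFERSIZE = 4194304;
```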
Additionally, assuming your data is stored on a SAN, you may wish as an alternative to investigate the use of SAN level tools in your backup strategy. Some SAN vendors provide software which integrates with SQL Server to perform SAN style snapshot backups while still integrating with the engine to handle things like marking backup dates and forwarding LSN values.
Based on your statement that the majority of the data will not change over time, including differential backups seems like a very useful option for you, allowing you to reduce the number of transaction logs that would have to be restored in a recovery scenario.
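A differential against the last full backup is then a one-liner (again, the database name is a placeholder):

```sql
-- Captures only extents changed since the last full backup; stays small when
-- most of the data is static.
BACKUP DATABASE BlobStore
    TO DISK = N'X:\Backups\BlobStore_diff.bak'
    WITH DIFFERENTIAL, COMPRESSION, CHECKSUM;
```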
Please feel free to get in touch with me directly if you would like to discuss further.

Can a magento store with 1000 websites and daily automated product updates be made to work by using multiple mysql per website?

The MySQL performance of running Magento in this situation under one MySQL installation is giving us a headache. I wonder if it is feasible to set up an individual MySQL instance for each website, so that updates to the catalog can occur concurrently across all websites.
It sure can be made to work within a cluster, provided you queue your updates and plan ahead for such a setup. But it won't be cheap, and I'd guess you'll need a MySQL instance for every 30 to 50 websites. It's worth looking at MySQL sharding for heavily used tables, and at ways to run all of this in RAM, to dramatically reduce the resources needed.
And for such a task you have to be a living and breathing InnoDB person.

want to create a data warehouse... new database or just pile the tables into the existing database?

So I've got a MySQL database for a web community that is a potential stats goldmine. Currently I'm serving stats built via all sorts of nasty queries on my well-normalized database. I've run into the "patience limit" for such queries on my shared hosting, and would like to move to data warehousing and a daily cron job, thereby sacrificing instant updates for a 100-fold increase in statistical depth.
I've just started reading about data warehouses, and particularly the star schema, and it all seems pretty straight-forward.
My question essentially is - should I toss all that crap into a new database, or just pile the tables into my existing MySQL database? The current database has 47 tables, the largest of which has 30k records. I realize this is paltry compared to your average enterprise application, but your average enterprise application does not (I hope!) run on shared-hosting!
So, keeping my hardware limits in mind, which method would be better?
I really don't know much about this at all, but I assume reading Table A, calculating, then updating Table B is a lot easier in the same database than across databases, correct?
Should I even care how many tables my DB has?
If you just need to improve performance, you should simply create a set of pre-computed reporting tables. Low effort and big performance gains. With the data volume you described, this won't even have a noticeable impact on the users of your web community.
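As a sketch of what that might look like (the source table and columns are invented, not from your schema), a daily cron job could rebuild a small summary table like this:

```sql
-- Summary table keyed by day and forum; tiny compared to the source data.
CREATE TABLE stats_daily_posts (
    stat_date  DATE NOT NULL,
    forum_id   INT  NOT NULL,
    post_count INT  NOT NULL,
    PRIMARY KEY (stat_date, forum_id)
) ENGINE=InnoDB;

-- Re-run nightly; REPLACE keeps the job idempotent if it runs twice.
REPLACE INTO stats_daily_posts (stat_date, forum_id, post_count)
SELECT DATE(created_at), forum_id, COUNT(*)
FROM   posts
WHERE  created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND  created_at <  CURRENT_DATE
GROUP  BY DATE(created_at), forum_id;
```

The stats pages then read only from the summary tables, so the nasty queries run once a day instead of on every page view.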
The different database approach has several benefits (see below) but I don't think you will gain any of them as you are on a shared database host.
You can support different SLAs for the DW and the web site
The DW and web databases can have different configurations
The DW database is basically read-only for a large portion of the day
The DW and web databases can have different release cycles (this is big)
Typical DW queries (large amounts of data) don't kill the cache for the web DB.
The number of tables in a particular database does not usually become a problem until you have thousands (or tens of thousands) of tables, and these problems usually come into play due to filesystem limits related to the maximum number of files in a directory.
You don't say what storage engine you are using. In general, you want the indexes in your database to fit into memory for good insert/update/delete performance, so the size of your key buffer or buffer pool must be large enough to hold the "hot" part of the index.
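As a rough way to gauge that (the schema name is a placeholder), you can check how much space the indexes of each table currently take and compare the total against your key buffer or buffer pool size:

```sql
-- Per-table index size in MB, largest first; sum the top tables you actually
-- query to estimate the "hot" index footprint.
SELECT table_name,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM   information_schema.tables
WHERE  table_schema = 'your_database'
ORDER  BY index_length DESC;
```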