I'm looking at building a Rails application which will have some pretty
large tables with upwards of 500 million rows. To keep things snappy
I'm currently looking into how a large table can be split into more
manageable chunks. I see that as of MySQL 5.1 there is a partitioning
option, which is a possibility, but I don't like the way the column
that determines the partitioning has to be part of the primary key on
the table.
What I'd really like to do is split the table that an AR model writes to
based upon the values written, but as far as I'm aware there is no way
to do this - does anyone have any suggestions as to how I might
implement this, or any alternative strategies?
Thanks
Arfon
Partitioning columns in MySQL are not limited to the primary key, but if the table has a primary or unique key, the partitioning column must be part of every unique key on the table; a table with no unique keys can be partitioned on any column. You can partition by RANGE, HASH, KEY and LIST (which is similar to RANGE, only over a set of discrete values). Read the MySQL manual for an overview of partitioning types.
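For illustration, a minimal RANGE partitioning sketch (the table and column names are made up); note how the partitioning column appears in the composite primary key:

    -- A sketch of RANGE partitioning on a date column.
    -- Because the table has a primary key, the partitioning column must be
    -- part of it, hence the composite key (id, created_at).
    CREATE TABLE events (
        id         BIGINT NOT NULL AUTO_INCREMENT,
        created_at DATE   NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (id, created_at)
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2009q1 VALUES LESS THAN (TO_DAYS('2009-04-01')),
        PARTITION p2009q2 VALUES LESS THAN (TO_DAYS('2009-07-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );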
There are alternative solutions such as HScale - a middleware plug-in that transparently partitions tables based on certain criteria. HiveDB is an open-source framework for horizontal partitioning of MySQL.
In addition to sharding and partitioning you should employ some sort of clustering. The simplest setup is replication-based and helps you spread the load over several physical servers. You should also consider more advanced clustering solutions such as MySQL Cluster (probably not an option due to the size of your database) and clustering middleware such as Sequoia.
I actually asked a relevant question regarding scaling with MySQL here on Stack Overflow some time ago, which I ended up answering myself several days later after collecting a lot of information on the subject. It might be relevant for you as well.
If you want to split your data by time, the following solution may fit your needs: you can probably use MERGE tables.
Let's assume your table is called MyTable and that you need one table per week
Your app always logs to the same table
A weekly job atomically renames your table and recreates an empty one: MyTable is renamed to MyTable-Year-WeekNumber, and a fresh empty MyTable is created
Merge tables are dropped and recreated.
If you want to get all the data of the past three months, you create a merge table which includes only the tables from the last 3 months. Create as many merge tables as you need for different periods. If you don't need to include the table into which data is currently inserted (MyTable in our example), you'll be even happier, as you won't have any read/write concurrency
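For illustration, a rough sketch of the weekly rotation and the MERGE table, assuming MyISAM tables (the MERGE engine only works over MyISAM tables with identical definitions; all table names here are made up):

    -- Hypothetical definition of the live table.
    CREATE TABLE MyTable (
        id        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        logged_at DATETIME NOT NULL,
        message   VARCHAR(255),
        KEY idx_logged_at (logged_at)
    ) ENGINE=MyISAM;

    -- Weekly job: pre-create an empty copy, then swap the names in one atomic statement.
    CREATE TABLE MyTable_new LIKE MyTable;
    RENAME TABLE MyTable TO MyTable_2009_42, MyTable_new TO MyTable;

    -- Recreate a MERGE table over whichever weeks you want to query as one unit.
    -- Uniqueness is not enforced across the underlying tables, hence KEY rather than
    -- PRIMARY KEY on the MERGE definition.
    DROP TABLE IF EXISTS MyTable_last_quarter;
    CREATE TABLE MyTable_last_quarter (
        id        BIGINT NOT NULL AUTO_INCREMENT,
        logged_at DATETIME NOT NULL,
        message   VARCHAR(255),
        KEY (id),
        KEY idx_logged_at (logged_at)
    ) ENGINE=MERGE
      UNION=(MyTable_2009_40, MyTable_2009_41, MyTable_2009_42)
      INSERT_METHOD=NO;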
You can handle this entirely in Active Record using DataFabric.
It's not that complicated to implement similar behavior yourself if that's not suitable. Google "sharding" for a lot of discussion on the architectural pattern of handling table partitioning within the app tier. It has the advantage of avoiding middleware or depending on db-vendor-specific features. On the other hand, it is more code in your app that you're responsible for.
I have to handle 25M rows of data that I have collected and transformed from about 50 different sources. Every source yields about 500,000 to 600,000 rows. Each record has the same structure, regardless of the source (let's say: id, title, author, release_date)
For flexibility, I would prefer to create a dedicated table for each source (then I can clear/drop data from a source and reload it very quickly using LOAD DATA INFILE). This way, it seems very easy to truncate a table with no risk of deleting rows from other sources.
But then I don't know how to select records having the same author across the different tables, and cherry on the cake, with pagination (LIMIT keyword).
Is the only solution to store everything in a single huge table and deal with the pain of indexing/backing up a 25M+ row database, or is there some kind of abstraction layer to virtually merge the 50 tables into one virtual table?
It is probably a common question for a DBA, but I could not find any answer yet...
Any help/ideas much appreciated. Thanks
This might be a good spot for MySQL partitioning.
This lets you handle a big volume of data, while giving you the opportunity to run DML operations on a specific partition when needed (such as truncate, or even drop) very efficiently, and without impacting the rest of your data. Partition selection is also supported in LOAD DATA statements.
You can run queries across partitions as you would with a normal table, or target a specific partition when you need to (which can be done very efficiently).
In your specific use case, list partitioning seems like a relevant choice: you have a pre-defined list of sources, so you would typically have one partition per source.
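A minimal sketch of what that could look like, assuming each source is identified by an integer source_id (all names are assumptions; the PARTITION clause on LOAD DATA requires MySQL 5.6+):

    -- The partitioning column has to be part of every unique key, hence the composite PK.
    CREATE TABLE records (
        id           BIGINT NOT NULL AUTO_INCREMENT,
        source_id    INT NOT NULL,
        title        VARCHAR(255),
        author       VARCHAR(255),
        release_date DATE,
        PRIMARY KEY (id, source_id),
        KEY idx_author (author)
    )
    PARTITION BY LIST (source_id) (
        PARTITION p_source1 VALUES IN (1),
        PARTITION p_source2 VALUES IN (2),
        PARTITION p_source3 VALUES IN (3)
        -- ... one partition per source
    );

    -- Reload a single source without touching the others.
    ALTER TABLE records TRUNCATE PARTITION p_source2;
    LOAD DATA INFILE '/tmp/source2.csv'
        INTO TABLE records PARTITION (p_source2)
        FIELDS TERMINATED BY ','
        (source_id, title, author, release_date);

    -- Cross-source queries work as on a normal table, pagination included.
    SELECT title, author FROM records WHERE author = 'Asimov' ORDER BY id LIMIT 20 OFFSET 0;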
I have a complex data model consisting of hundreds of tables. I have CDC enabled for all tables and have that data in corresponding CDC tables. I need a generic mechanism whereby, given an arbitrary SELECT query, I am able to return results corresponding to a point in time in the past.
I did not find any online recipes or blogs about this. I have managed to work out on my own so far that, in order to convert a normal SELECT query into its CDC-aware equivalent, it is important to consider the cardinality of JOINs and have some logic around choosing the transactions that matter. But it seems too complex and error-prone to write an equivalent query by hand on a per-query basis. Is there a tool out there which does this, or is this a market gap?
Change Data Capture in general would allow you to replicate the changes on the original database tables to a new target table.
One approach is that you could use LiveAudit or similar auditing functionality to have a full audit trail of the changes on that table (insert all changes into an audit table for ongoing history). You would still need some complex queries to identify the latest version of a row based on its primary key up to the time in question.
(https://www.ibm.com/support/knowledgecenter/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/mappingliveaudit.html - note this is an IBM product, not MS SQL CDC.)
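For illustration, a rough sketch of the kind of "latest version up to a point in time" query the audit-trail approach needs; the order_audit table and its columns are hypothetical:

    -- order_audit(order_id, change_time, operation, ...data columns...) is assumed
    -- to hold one row per change, with operation = 'I'/'U'/'D'.
    DECLARE @point_in_time DATETIME2 = '2023-01-01T00:00:00';

    WITH ranked AS (
        SELECT a.*,
               ROW_NUMBER() OVER (PARTITION BY a.order_id
                                  ORDER BY a.change_time DESC) AS rn
        FROM order_audit AS a
        WHERE a.change_time <= @point_in_time
    )
    SELECT *
    FROM ranked
    WHERE rn = 1
      AND operation <> 'D';  -- exclude rows whose last change before that time was a delete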
Temporal Tables may be a better approach, but it has its own complexities with design.
Both approaches could be limited by how much history you retain.
(https://learn.microsoft.com/en-us/azure/sql-database/sql-database-temporal-tables-retention-policy )
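For illustration, a minimal system-versioned (temporal) table sketch for SQL Server / Azure SQL; the table and column names are made up:

    CREATE TABLE dbo.Customer (
        CustomerId INT           NOT NULL PRIMARY KEY,
        Name       NVARCHAR(100) NOT NULL,
        ValidFrom  DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo    DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

    -- An ordinary SELECT becomes point-in-time aware with a FOR SYSTEM_TIME clause:
    SELECT CustomerId, Name
    FROM dbo.Customer FOR SYSTEM_TIME AS OF '2023-01-01T00:00:00';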
I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiple states at the same time (i.e. I won't need to frequently join tables if I go with Option 2). Here is a common use case: the front-end passes a state id to the back-end, and based on that id I need to query the data for the specified state and return it to the front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, splitting by state is the route I would recommend. You will get better performance from the partitioned tables when no joining is required, and there are multiple other benefits to having the smaller tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
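For illustration, a minimal sketch of the single-table-plus-index approach, assuming MySQL-style DDL and made-up column names:

    CREATE TABLE users (
        id         BIGINT  NOT NULL AUTO_INCREMENT PRIMARY KEY,
        state_id   CHAR(2) NOT NULL,
        email      VARCHAR(255) NOT NULL,
        created_at DATETIME NOT NULL,
        KEY idx_state_created (state_id, created_at)
    );

    -- Queries scoped to one state stay on the leading column of the index:
    SELECT id, email
    FROM users
    WHERE state_id = 'CA'
    ORDER BY created_at DESC
    LIMIT 100;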
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdown: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon DynamoDB), though this would be an invasive and fundamental departure from the relational model. That said, the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, in a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which holds the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, holding the foreign keys to both User and State, along with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
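For illustration, a sketch of that normalized shape (table names follow the description above; columns and types are assumptions):

    CREATE TABLE State (
        state_id CHAR(2)     NOT NULL PRIMARY KEY,
        name     VARCHAR(50) NOT NULL
    );

    CREATE TABLE User (
        user_id  BIGINT       NOT NULL PRIMARY KEY,
        email    VARCHAR(255) NOT NULL,
        state_id CHAR(2)      NOT NULL,   -- FK for the "one state per user" case
        FOREIGN KEY (state_id) REFERENCES State (state_id)
    );

    -- If a user can be associated with more than one state, move the link into
    -- a UserState table, with dates for time slicing:
    CREATE TABLE UserState (
        user_id    BIGINT  NOT NULL,
        state_id   CHAR(2) NOT NULL,
        start_date DATE    NOT NULL,
        end_date   DATE    NULL,          -- NULL while the association is current
        PRIMARY KEY (user_id, state_id, start_date),
        FOREIGN KEY (user_id)  REFERENCES User (user_id),
        FOREIGN KEY (state_id) REFERENCES State (state_id)
    );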
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate it) when they complete tool 2, since it already exists.
I think what it boils down to is: would it be better to create a structure that has 1500 tables, one for each data element (with additional data around each element), or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table I'm not sure how that will impact performance. Likewise, I'm not sure how difficult it will be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, and other shared items. Give each row a unique id.
Build a table for each unique tool with the company id from above and any data unique to that implementation. Give each table a primary (unique) key covering 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
More about normalization here.
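For illustration, a rough sketch of that layout, using tool 1 from the question as the example (all names are made up):

    -- Main table: one row per company, holding the data points shared by every tool.
    CREATE TABLE company (
        company_id      BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name            VARCHAR(255) NOT NULL,
        number_of_users INT,
        UNIQUE KEY uq_company_name (name)
        -- ... other shared data points
    );

    -- One table per tool: the shared company_id plus that tool's own data points.
    -- The composite primary key gives one row per (company, tool run).
    CREATE TABLE tool1_run (
        company_id          BIGINT NOT NULL,
        run_id              BIGINT NOT NULL,
        number_of_locations INT,
        cost                DECIMAL(12,2),
        PRIMARY KEY (company_id, run_id),
        FOREIGN KEY (company_id) REFERENCES company (company_id)
    );

    -- Pre-filling a shared data point when tool 2 is started is then a simple
    -- lookup against the common table:
    SELECT number_of_users FROM company WHERE company_id = 42;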
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This is essentially pre-building the joined reporting data into new tables, with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
Ultimately, in your case (and most others), I highly recommend building it in the way you are best able to maintain and understand, and then performing regular performance checks via slow query logs, EXPLAIN, and performance monitoring tools like Percona's tool set. This will give you insight into what is really happening and give you some data to come back here or to the MySQL forums with. We can always speculate here, but ultimately the real data and your setup will be the driving force behind what is right for you.
I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites, each averaging tens of thousands to hundreds of thousands of page views a day, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with hundreds of websites, each with its own table, seems like a big problem. On the other hand, performance and scalability are probably going to be a concern with such large data volumes. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
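For illustration, a minimal sketch (column names are assumptions); note that MySQL requires the partitioning column to appear in the primary key, hence the composite key:

    CREATE TABLE page_view (
        id         BIGINT   NOT NULL AUTO_INCREMENT,
        website_id INT      NOT NULL,
        visitor_id BIGINT   NOT NULL,
        viewed_at  DATETIME NOT NULL,
        url        VARCHAR(2048),
        PRIMARY KEY (id, website_id),
        KEY idx_site_time (website_id, viewed_at)
    )
    PARTITION BY HASH (website_id) PARTITIONS 32;

    -- Queries for a single site prune down to one partition:
    SELECT COUNT(*) FROM page_view
    WHERE website_id = 123 AND viewed_at >= '2012-01-01';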
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with the same columns, so from a database normalization standpoint it makes sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
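For illustration, one way to pre-calculate daily statistics into a summary table, assuming the raw table has website_id, visitor_id and viewed_at columns (all names are made up):

    -- Summary table keyed by site and day.
    CREATE TABLE daily_site_stats (
        website_id      INT    NOT NULL,
        stat_date       DATE   NOT NULL,
        page_views      BIGINT NOT NULL,
        unique_visitors BIGINT NOT NULL,
        PRIMARY KEY (website_id, stat_date)
    );

    -- Run nightly for the previous day; re-running it just refreshes the numbers.
    INSERT INTO daily_site_stats (website_id, stat_date, page_views, unique_visitors)
    SELECT website_id,
           DATE(viewed_at),
           COUNT(*),
           COUNT(DISTINCT visitor_id)
    FROM page_view
    WHERE viewed_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND viewed_at <  CURRENT_DATE
    GROUP BY website_id, DATE(viewed_at)
    ON DUPLICATE KEY UPDATE
        page_views      = VALUES(page_views),
        unique_visitors = VALUES(unique_visitors);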
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain. However, if you write enough scripting you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the hard limit on the number of open files - on Linux you may have to use ulimit to raise the limit; on Windows the limit is hard and there's no way to raise it).
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT. I did this so that I could perform off-line creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE ... ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described here: http://blog.coldlogic.com/categories/coldstore/).
Use a single data structure. Once you start encountering performance problems there are many solutions: you can partition your tables by website id (also known as horizontal partitioning), or you can use replication. This all depends upon the ratio of reads vs. writes.
But to start, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not, and take advantage of the various MySQL storage engines, like MyISAM or NDB (in-memory clustering), to boost performance. Caching also plays a big role in offloading load from the database: data that is mostly read-only and can be computed easily is usually put in the cache, and the cache serves those requests instead of the database, so only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should just do performance tests yourself to understand whether having one big table is sufficient.