MySQL Performance: Single table or multiple tables

I have 8 sets of data of about 30,000 rows each; the data has the same structure, just in different languages.
The front end of the site will get relatively high traffic.
So my question is regarding MySQL performance: should I have a single table with one column to distinguish which set the data belongs to (i.e. a column "language"), or create individual tables for each language set?
(an explanation of why, if possible, would be really helpful)
Thanks in advance
Shadi

I would go with the single-table design. Seek time, with a proper index, should be exactly the same no matter how large the table is.
Apart from performance issues, this will simplify the design and the relations with other tables (foreign keys etc.).

Another drawback to the "one table per language" design is that you have to change your schema every time you add a language.
A language column means you just have to add data, which is not intrusive. The language-column approach is the way to go.

I'd go with the one-table design too. Since the cardinality of the language_key is very low, I'd partition the table over language_key instead of defining an index (if your database supports it).
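MySQL does support this. A minimal sketch of LIST partitioning on the language column (the table and column names below are hypothetical; MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

    -- Hypothetical content table, partitioned by language (MySQL 5.5+).
    CREATE TABLE content (
        id       INT NOT NULL,
        language CHAR(2) NOT NULL,
        slug     VARCHAR(100) NOT NULL,
        title    VARCHAR(255),
        body     TEXT,
        PRIMARY KEY (id, language)
    ) ENGINE=InnoDB
    PARTITION BY LIST COLUMNS (language) (
        PARTITION p_en VALUES IN ('en'),
        PARTITION p_fr VALUES IN ('fr'),
        PARTITION p_de VALUES IN ('de'),
        PARTITION p_es VALUES IN ('es')
    );

    -- A query that filters on language only touches the matching partition:
    SELECT title FROM content WHERE language = 'fr' AND id = 42;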

I agree with the other responses - I'd use a single table. With regard to performance optimization, a number of things have the potential to have a bigger impact on performance:
appropriate indexing (see the sketch after this list)
writing/testing for query efficiency
choosing appropriate storage engine(s)
the hardware
type and configuration of the filesystem(s)
optimizing mysql configuration settings
... etc. I'm a fan of High Performance MySQL
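To illustrate the indexing point, a small sketch assuming a hypothetical content table with language, slug, title, and body columns: if the front end mostly fetches one article by language and slug, a composite index on exactly those columns turns the lookup into a single index seek, and EXPLAIN confirms it.

    -- Hypothetical access path: one article per (language, slug).
    CREATE INDEX idx_content_lang_slug ON content (language, slug);

    -- Check that the optimizer actually uses the index:
    EXPLAIN SELECT title, body FROM content
    WHERE language = 'en' AND slug = 'about-us';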


Best methods to increase database performance?

Assuming that I have 20 lakh (2,000,000) records,
Approach 1: Hold all 20 lakh records in a single table.
Approach 2: Make 20 tables and put 1 lakh (100,000) records into each.
Which is the best method to increase performance, and why? Or are there other approaches?
Splitting a large table into smaller ones can give better performance -- it is called sharding when the tables are then distributed across multiple database servers -- but when you do it manually it is most definitely an antipattern.
What happens if you have 100 tables and you are looking for a row but you don't know which table has it? If you add an index you'll need to do it 100 times. If somebody wants to join the data set, they might need to include 100 tables in the join in some use cases. You'd need to invent your own naming conventions and document and enforce them yourself, with no help from the database catalog. Backup, recovery, and all the other maintenance tasks will be a nightmare... just don't do it.
Instead just break up the table by partitioning it. You get 100% of the performance improvement that you would have gotten from multiple tables but now the database is handling the details for you.
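A minimal sketch of that approach in MySQL (the table, columns, and partition count are made up):

    -- One logical table; MySQL spreads the rows across 20 physical
    -- partitions behind the scenes -- no unions, no naming conventions.
    CREATE TABLE records (
        id   INT NOT NULL,
        data VARCHAR(255),
        PRIMARY KEY (id)
    ) ENGINE=InnoDB
    PARTITION BY HASH (id)
    PARTITIONS 20;

    -- You still query it as a single table:
    SELECT data FROM records WHERE id = 123456;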
When you are after read performance, indexes are a great way to improve it. However, indexes can slow down writes.
So if you are looking for read performance, prefer indexes.
A few things to keep in mind when creating an index:
Try to avoid null values in the index
The cardinality of the columns matters. It's been observed that putting a column with lower cardinality first gives better performance than putting a column with higher cardinality first.
The sequence of columns in the index should match your WHERE clause. For example, if you create an index on Col A and Col B but query on Col C, your index will not be used. So formulate your indexes according to your WHERE clauses.
When in doubt whether an index was used, run EXPLAIN to see which index was chosen (see the sketch below).
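A short sketch of those last two points, with made-up table and column names:

    -- Composite index on (col_a, col_b); nothing on col_c.
    CREATE TABLE t (
        id    INT PRIMARY KEY,
        col_a INT,
        col_b INT,
        col_c INT,
        KEY idx_a_b (col_a, col_b)
    ) ENGINE=InnoDB;

    -- Uses idx_a_b: the WHERE clause starts with the index's leading column.
    EXPLAIN SELECT * FROM t WHERE col_a = 1 AND col_b = 2;

    -- Does not use idx_a_b: col_c is not covered by any index, so this scans the table.
    EXPLAIN SELECT * FROM t WHERE col_c = 3;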
DB indexes can be a tricky subject for beginners, but imagining an index as a tree traversal helps visualize the path traced when reading the data.
The best/easiest option is to have a single table with proper indexes. On 100K rows I had 30 s per query, but with an index I got 0.03 s per query.
When it doesn't fit anymore you split tables (for me that was when I got to millions of rows).
And preferably onto different servers.
You can then create a microservice that accesses all the servers and returns data to consumers as if there were only one database.
But once you do this you had better not need joins, because it gets messy replicating data across every database.
I would stick to the first method.

Is it better to have many columns, or many tables?

Imagine a hypothetical database which is storing products. Each product can have 100 attributes, although any given product will only have values set for ~50 of these. I can see three ways to store this data:
A single table with 100 columns,
A single table with very few (say the 10 columns that have a value for every product), and another table with columns (product_id, attribute, value). I.e, An EAV data store.
A separate table for every column. So the core products table might have 2 columns, and there would be 98 other tables, each with two columns (product_id, value).
Setting aside the shades of grey between these extremes, from a pure efficiency standpoint, which is best to use? I assume it depends on the types of queries being run, i.e. if most queries are for several attributes of a product, or the value of a single attribute for several products. How does this affect the efficiency?
Assume this is a MySQL database using InnoDB, and all tables have appropriate foreign keys, and an index on the product_id. Imagine that the attribute names and values are strings, and are not indexed.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
I found a similar question here: Best to have hundreds of columns or split into multiple tables?
The difference is, that question is asking about a specific case, and doesn't really tell me about efficiency in the general case. Other similar questions are all talking about the best way to organize the data, I just want to know how the different organizational systems impact the speed of queries.
In a general sense, I am asking whether accessing a really big table takes more or less time than a query with many joins.
JOIN will be slower.
However, if you usually query only a specific subset of columns, and this subset is "vertically partitioned" into its own separate table, querying such a "lean" table is typically quicker than querying the "fat" table with all the columns.
But this is a very specific and fragile situation (easy to break apart as the system evolves), and you should test very carefully before going down that path. Your default starting position should be one table.
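A minimal sketch of such a vertical split, with hypothetical tables and columns:

    -- "Lean" table: the handful of hot columns nearly every query needs.
    CREATE TABLE products (
        product_id INT PRIMARY KEY,
        name       VARCHAR(100),
        price      DECIMAL(10,2)
    ) ENGINE=InnoDB;

    -- Rarely-read columns split off 1:1, joined back only when needed.
    CREATE TABLE product_details (
        product_id  INT PRIMARY KEY,
        description TEXT,
        spec_sheet  TEXT,
        FOREIGN KEY (product_id) REFERENCES products (product_id)
    ) ENGINE=InnoDB;

    -- The common listing query touches only the lean table:
    SELECT product_id, name, price FROM products ORDER BY name LIMIT 20;

    -- The full product page pays for the join only when it needs the detail:
    SELECT p.name, d.description
    FROM products p JOIN product_details d USING (product_id)
    WHERE p.product_id = 42;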
In general, the more tables you have, the more normalised, more correct, and hence better (ie: reduced redundancy of data) your design.
If you later find you have problems with reporting on this data, then that may be the moment to consider creating denormalised values to improve any specific performance issues. Adding denormalised values later is much less painful than normalising an existing badly designed database.
In most cases, EAV is a querying and maintenance nightmare.
An outline design would be to have a table for Products, a table for Attributes, and a ProductAttributes table that contained the ProductID and the AttributeID of the relevant entries.
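A rough sketch of that outline (the table names follow the answer; column types are assumptions):

    CREATE TABLE Products (
        ProductID INT PRIMARY KEY,
        Name      VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE Attributes (
        AttributeID INT PRIMARY KEY,
        Name        VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    -- Which attributes apply to which product:
    CREATE TABLE ProductAttributes (
        ProductID   INT NOT NULL,
        AttributeID INT NOT NULL,
        PRIMARY KEY (ProductID, AttributeID),
        FOREIGN KEY (ProductID)   REFERENCES Products (ProductID),
        FOREIGN KEY (AttributeID) REFERENCES Attributes (AttributeID)
    ) ENGINE=InnoDB;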
As you mentioned, it strictly depends on the queries that will be executed against this data. As you know, joins are expensive for the database; I can't imagine doing 50-60 joins for a simple read. In my humble opinion it would be madness. :) The best thing you can do is load test data and check your specific queries in a tool such as the Estimated Execution Plan in Management Studio; MySQL's equivalent is EXPLAIN.
I would tend to advise you to avoid creating so many tables; I think it would cause problems in the future. Maybe it is possible to move rarely used data into separate tables, or to use complex types? For string data you can try non-clustered indexes.

How to structure an extremely large table

This is more of a conceptual question. It's inspired by working with an extremely large table where even a simple query takes a long time (properly indexed). I was wondering if there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records, growing by something like 10,000/day. A table like that would add another 10,000,000 records roughly every 2.7 years. Let's say that more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying - say the query is expected to pull only a few records from a three-year span - I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level - backup/restore/indexing etc. This helps reduce downtime and allows far faster archiving by simply removing an entire partition at a time.
There are also non-relational storage mechanisms such as NoSQL stores and map-reduce, but ultimately how the data is used, loaded, and archived becomes the driving factor in deciding which structure to use.
10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partitioning in MySQL -- see, in its manual: Chapter 17. Partitioning.
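A hedged sketch of what MySQL partitioning could look like for the table described in the question (the names, types, and date column are assumptions):

    -- One logical table, one physical partition per year; queries that
    -- filter on created_at only read the partitions for the years they touch.
    CREATE TABLE events (
        id         BIGINT NOT NULL,
        created_at DATE NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (id, created_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2021 VALUES LESS THAN (2022),
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- A three-year span reads three partitions, not the whole table:
    SELECT COUNT(*) FROM events
    WHERE created_at BETWEEN '2021-01-01' AND '2023-12-31';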
There is a good scalability approach for these tables. Union is a workable way, but there is a better one.
If your database engine supports "semantic partitioning", then you can split one table into partitions. Each partition will cover some subrange (say, one partition per year). It will not affect anything in SQL syntax except DDL, and the engine will transparently run the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, since that is the limit for a union. But you will never need the keyword "union" in your queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view over the archived and active data combined, keeping only the active data in the table that most functions reference. You will have to have a good (automated) archiving strategy, though, or you can lose data or move it inefficiently. This is typically more difficult to maintain.
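A rough sketch of the archive-plus-view idea (all names are hypothetical):

    -- Recent rows live in orders_active; old rows are moved to orders_archive
    -- by a scheduled job.
    CREATE TABLE orders_active (
        id         BIGINT PRIMARY KEY,
        order_date DATE NOT NULL,
        total      DECIMAL(10,2)
    ) ENGINE=InnoDB;

    CREATE TABLE orders_archive LIKE orders_active;

    -- A view stitches the two back together for the rare full-history query:
    CREATE VIEW orders_all AS
        SELECT * FROM orders_active
        UNION ALL
        SELECT * FROM orders_archive;

    -- The automated archive step: copy old rows across, then delete them.
    INSERT INTO orders_archive
        SELECT * FROM orders_active WHERE order_date < '2020-01-01';
    DELETE FROM orders_active WHERE order_date < '2020-01-01';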
What you're talking about is horizontal partitioning or sharding.

Is indexing the answer? MySQL online database with thousands of records

I'm soon to have an application that uses an online database (MySQL). I could have many thousands of records in a particular table in my database. Am I going to see a performance hit? What if I turn on indexing - is indexing handled transparently (i.e. all done for me)? Or is it good practice to try to categorize the data into other tables (such as alphabetically)? Isn't that basically what indexing does?
Any other advice... I'm quite new to online connectivity. I mostly develop for standalone computer.
Thanks
Thomas
You should always index where appropriate, but there's not enough information provided to know what should be indexed. Many thousands of rows is not a lot for any DB.
Indexes are useful for speeding up queries that identify a tiny subset of all records. Indexes are sorted (well, not all types of indexes), which means they may also speed up certain queries where you need the rows in the same order as the columns in the index.
Indexes will not help you if you are selecting larger subsets of the data in your table, other than the special case where the index includes all the columns referenced by your query (a covering index).
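A small sketch of that special case (hypothetical table and columns): the query below touches a large share of the rows, so an ordinary lookup index would not help much, but because the index contains every column the query references, MySQL can answer it from the index alone (EXPLAIN shows "Using index").

    CREATE TABLE users (
        id         INT PRIMARY KEY,
        email      VARCHAR(100),
        last_login DATE,
        profile    TEXT,
        KEY idx_login_email (last_login, email)   -- covers the query below
    ) ENGINE=InnoDB;

    EXPLAIN SELECT email FROM users WHERE last_login >= '2023-01-01';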

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assume that I have 3 varchar(25) fields. This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because that's what you always see; consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently, like a username and password field? Indexes can be on a single column or multiple columns, so think about how you'll be querying for data and create indexes as necessary for the values you'll query against (see the sketch below).
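A brief sketch pulling those last two points together (the schema is hypothetical):

    -- Natural primary key, an explicit foreign key, and a secondary index
    -- matching a frequent "find account by email" lookup.
    CREATE TABLE users (
        username VARCHAR(30)  NOT NULL PRIMARY KEY,
        email    VARCHAR(100) NOT NULL,
        pw_hash  CHAR(60)     NOT NULL,
        KEY idx_email (email)
    ) ENGINE=InnoDB;

    CREATE TABLE orders (
        order_id  INT NOT NULL PRIMARY KEY,
        username  VARCHAR(30) NOT NULL,
        placed_at DATETIME NOT NULL,
        FOREIGN KEY (username) REFERENCES users (username)
    ) ENGINE=InnoDB;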
The number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
I agree that you should ensure your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k-100k rows - which roughly corresponds to the point when the RDBMS is likely to become memory constrained - then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is meant in the mathematical sense: in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is misleading without two additional data points: the approximate row size and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So with proper indexing, sufficient RAM (a gig or two), and - as in the original question - rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.