MySQL database design with multiple columns or a single column

Hi, just a simple question.
I need to store some data in a database, and there are two options I can see right now.
Data: a, b, c, d
1. Store a, b, c, d in one column; when needed, query it and split it in the application.
2. Store a, b, c, d in four different columns, so they can be queried directly from the database.
Which option is better? My concern is that splitting the data into four different columns will make the table contain many columns; does that slow down performance? I am also curious: is it possible for the query to be fast but the transfer of data to my application to be slow?

MySQL performance is a complicated subject. To the issue you raised:
My concern is that splitting the data into four different columns will make the table contain many columns; does that slow down performance?
There is nothing inherently worse, from a performance perspective, about having 4 columns, or 10, or 20, or 50.
Now, that being said, there are things that could impact performance, and probably will if you don't know about them. For example, if you SELECT * FROM {my_table} when really you only need to SELECT a FROM {my_table}... yeah, that'll impact your performance (although there are arguments to be made in favor of SELECT * FROM {my_table} depending on your caching strategy).
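For instance (a minimal sketch, assuming a hypothetical table my_table with an id column alongside the four value columns):

    -- Pulls every column back to the client, whether you use them or not:
    SELECT * FROM my_table WHERE id = 42;

    -- Sends only the column you actually need over the wire:
    SELECT a FROM my_table WHERE id = 42;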
Likewise, you'll want to consider LIMIT clauses. To your question:
Is it possible for the query to be fast but the transfer of data to my application to be slow?
Yes, of course. If you only need 50 rows and your table has 50,000, you're gonna want to add LIMIT clauses to your SQL statements, or you'll be sending a lot more data over the wire than you need to. Memory is faster than disk, and disk is faster than network. If you're sending a lot of data over the wire that you don't need, you better believe it's gonna cause performance problems. But again, keep in mind that this has nothing to do with how many columns you have. There is absolutely nothing inherent in the number of columns a table has that affects performance (at least not at the scale you're talking about, and not in the way you're thinking about it).
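As a rough sketch (table and column names are assumptions, not from your schema):

    -- Without a limit, this could return far more rows than the application needs:
    SELECT a, b FROM my_table ORDER BY created_at DESC;

    -- With a limit, only the 50 rows you actually want cross the network:
    SELECT a, b FROM my_table ORDER BY created_at DESC LIMIT 50;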
All of which is to say: performance is a complex topic, and you should look into it if you're interested. It sounds like a, b, c, and d are logically distinct pieces of data, so you should probably go ahead and store them in separate columns in MySQL. Hope this helps.
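If you go with separate columns, a minimal sketch might look like this (names and types are placeholders, since you haven't described the actual data):

    CREATE TABLE my_data (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      a VARCHAR(50) NOT NULL,
      b VARCHAR(50) NOT NULL,
      c VARCHAR(50) NOT NULL,
      d VARCHAR(50) NOT NULL
    );

    -- The database can now filter, index, and return each value directly,
    -- with no string splitting in the application:
    SELECT a, b FROM my_data WHERE c = 'some value';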

Related

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting amounts to 5 terabytes a year. I keep all of my data; I don't think I will want to delete anything early.
I am asking myself whether I should use a distributed database, because my data will grow every year: after 5 years I will have 25 terabytes without indexes (just calculating the raw data I save every day).
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are very highly interconnected.
I know it depends on the queries and on the table design, and that I could also run a distributed MySQL database.
I just want to know when I should start thinking about a distributed database.
Would this be a use case, or could MySQL handle a dataset this large?
EDIT:
On average I will have 1500 clients writing data per second, and they affect all tables.
I only need the old data for analytics, like machine learning and pattern matching.
A client should also be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
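To make the contrast concrete, here is a hedged sketch (your real columns will differ) of the two shapes of index being described:

    -- Sequential: new rows always land at the 'end' of the PRIMARY KEY and the ts index,
    -- so the hot blocks stay in cache and index updates are cheap.
    CREATE TABLE readings_seq (
      id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      ts TIMESTAMP NOT NULL,
      value DOUBLE NOT NULL,
      INDEX (ts)
    );

    -- Random: new rows land anywhere in the PRIMARY KEY, so at multi-TB scale
    -- most index updates miss the cache and cost real IOPs.
    CREATE TABLE readings_rand (
      uuid CHAR(36) NOT NULL PRIMARY KEY,
      ts TIMESTAMP NOT NULL,
      value DOUBLE NOT NULL
    );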
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
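A summary table is just a pre-aggregated copy that you keep up to date, along these lines (the table and column names are assumptions standing in for your 'Fact' table):

    CREATE TABLE daily_summary (
      day DATE NOT NULL,
      client_id INT UNSIGNED NOT NULL,
      row_count INT UNSIGNED NOT NULL,
      total_value DOUBLE NOT NULL,
      PRIMARY KEY (day, client_id)
    );

    -- Refreshed once a day instead of scanning terabytes at report time:
    INSERT INTO daily_summary
    SELECT DATE(ts), client_id, COUNT(*), SUM(value)
    FROM fact_table
    WHERE ts >= CURDATE() - INTERVAL 1 DAY
      AND ts <  CURDATE()
    GROUP BY DATE(ts), client_id;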
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc.? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such an action probably needs to copy the table over, which doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- do everything you can to shrink it: use smaller datatypes, normalize, etc. But don't "over-normalize", or you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this problem is to keep a "transactional" database for all the "regular" work and a data warehouse for reporting, using a regular Extract/Transform/Load job to move the data across and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
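A hedged sketch of that nightly ETL step, using invented production and warehouse table names (MySQL's EVENT scheduler is one way to drive it; a cron job works just as well):

    -- Requires the event scheduler to be enabled (SET GLOBAL event_scheduler = ON;)
    CREATE EVENT nightly_etl
    ON SCHEDULE EVERY 1 DAY
    DO
      INSERT INTO warehouse.orders_by_day (day, product_id, units, revenue)
      SELECT DATE(created_at), product_id, COUNT(*), SUM(price)
      FROM production.orders
      WHERE created_at >= CURDATE() - INTERVAL 1 DAY
        AND created_at <  CURDATE()
      GROUP BY DATE(created_at), product_id;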
If you want to keep everything logically consistent, you might use sharding and clustering, a sort-of out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

MySQL Database Structure

I will have a table with a few million entries, and I have been wondering whether it would be smarter to create more than just this one table, even though they would all have the same structure. Would it save resources and be more efficient in the end?
This is my particular concern: I plan on creating a small search engine which indexes about 3,000,000 sites, and each site will have approximately 30 words indexed. This is my structure right now:
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
Should I keep this structure, or should I create tables for A words, B words, C words, etc.? And the same for the appearances table?
SELECT queries are faster on smaller tables. You want to fit the indexes you sort on into your system's memory for better performance.
More importantly, tables should not be defined to hold a certain type of data, but rather a collection of associated data. So if the data you are storing has logical differences, it may make sense to break it into separate tables.
(An incomplete list of the trade-offs of splitting:)
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
If you aren't concerned with some latency on your database, it should be able to handle this on the order of a few million records without too much trouble.
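For concreteness, here is a hedged sketch of the original three-table layout with the indexes that matter for a word lookup (the column types are guesses):

    CREATE TABLE site (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      url VARCHAR(255) NOT NULL
    );

    CREATE TABLE word (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      word VARCHAR(64) NOT NULL,
      UNIQUE KEY (word)
    );

    CREATE TABLE appearances (
      site_id INT UNSIGNED NOT NULL,
      word_id INT UNSIGNED NOT NULL,
      score INT NOT NULL,
      PRIMARY KEY (word_id, site_id)  -- searches go word -> sites, so word_id comes first
    );

    -- One indexed lookup answers "which sites contain this word?",
    -- with no need for per-letter tables:
    SELECT a.site_id, a.score
    FROM appearances a
    JOIN word w ON w.id = a.word_id
    WHERE w.word = 'example'
    ORDER BY a.score DESC;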
Here's some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least simulate enough data that's reasonably similar to the genuine data to be structurally the same. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data you're just stabbing in the dark.

Splitting up a large mySql table into smaller ones - is it worth it?

I have about 28 million records to import into a MySQL database. The records contain personal information about members in the US and will be searchable by state.
My question is: is it more efficient to break the data up into smaller tables, as opposed to keeping everything in one big table? What I had in mind was to split it into 50 separate tables representing the 50 states, something like this: members_CA, members_AZ, members_TX, etc.
This way I could do a query like this:
'SELECT * FROM members_' . $_POST['state'] . ' WHERE members_name LIKE "John Doe" ';
This way I only have to deal with data for a given state at once. Intuitively it makes a lot of sense but I would be curious to hear other opinions.
Thanks in advance.
I posted as a comment initially but I'll post as an answer now.
Never, ever think of creating X tables based on differences in an attribute's value. That's not how things are done.
If your table will have 28 million rows, think of partitioning to split it into smaller logical sets.
You can read about partitioning at MySQL documentation.
The other thing is choosing the right db design and choosing your indexes properly.
The third thing is to avoid the terrible idea of using $_POST directly in your query, as you probably don't want someone to inject SQL and drop your database, tables, or whatnot.
The final thing is choosing appropriate hardware for the task; you don't want such an app running on a VPS with 500 MB or 1 GB of RAM.
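If you want per-state locality without 50 tables, the partitioning mentioned above looks roughly like this (a sketch; the column definitions are invented):

    CREATE TABLE members (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT,
      state CHAR(2) NOT NULL,
      members_name VARCHAR(100) NOT NULL,
      PRIMARY KEY (id, state),          -- the partitioning column must be part of every unique key
      KEY (state, members_name)
    )
    PARTITION BY KEY (state) PARTITIONS 50;

    -- One logical table, one schema change when you add a column, and the
    -- optimizer prunes to the relevant partition when you filter on state:
    SELECT * FROM members WHERE state = 'CA' AND members_name LIKE 'John%';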
Do not do that. Keep similar data together in one table. You will have serious problems implementing logic and building queries whenever a decision spans many states. Moreover, if you need to change the database definition, such as adding a column, you will have to perform the same operation over all of the numerous (seemingly infinite) tables.
Use indexing to increase performance, but stick to a single table!
You can also increase the memory cache for a performance boost; follow this article to do so.
If you create an index on the state column, a select of all members of one state will be as efficient as using separate tables. Splitting the table has a lot of disadvantages: if you add columns, you have to add them to 50 tables, and if you want data from multiple states, you have to use UNION statements that will be very ugly and inefficient. I strongly recommend sticking with one table.
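In other words (a sketch with guessed column names):

    CREATE INDEX idx_state ON members (state);

    -- One table handles single-state and multi-state queries with the same simple SQL:
    SELECT * FROM members WHERE state IN ('CA', 'AZ', 'TX') AND members_name LIKE 'John%';

    -- versus the per-state-table version, which needs a UNION for every extra state:
    -- SELECT * FROM members_CA WHERE ... UNION ALL SELECT * FROM members_AZ WHERE ... UNION ALL ...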
My first response is that you need to keep all like data together in one table. You should look into putting indexes on your table to increase performance, not into breaking it up into smaller tables.

MySQL - 1 large table with 100 columns OR split into 5 tables and JOIN

I had a 'large' MySQL table that originally contained ~100 columns and I ended up splitting it up into 5 individual tables and then joining them back up with CodeIgniter Active Record...
From a performance point of view, is it better to keep the original table with 100 columns, or to keep it split up?
Each table has around 200 rows.
200 rows? That's nothing.
I would split the table if the new ones combined columns in a way that was meaningful for your problem. I would do it with an eye towards normalization.
You sound like you're splitting them to meet some unstated criteria for "goodness" or because your current performance is unacceptable. Do you have some data that suggests a performance problem that is caused by your schema? If not, I'd recommend rethinking this approach.
No one can say what the impact on performance will be. More JOINs may be slower when you query, but you don't say what your use cases are.
So you've already made the change and now you're asking if we know which version of your schema goes faster?
(if the answer is the split tables, then you're doing something wrong).
Not only should the consolidated table be faster, it should also require less code and therefore be less likely to have bugs.
You've not provided any information about the structure of your data.
And with 200 rows in your database, performance is the last thing you need to worry about.
The concept you're referring to is called vertical partitioning, and it can have surprising effects on performance. A MySQL.com performance post discusses this in particular. An excerpt from the article:
Although you have to do vertical partitioning manually, you can benefit from the practice in certain circumstances. For example, let's say you didn't normally need to reference or use the VARCHAR column defined in our previously shown partitioned table.
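In practice that usually means pulling the rarely-used wide column into a side table keyed by the same id, roughly like this (a sketch with invented names):

    -- Hot columns that most queries touch:
    CREATE TABLE item (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(100) NOT NULL,
      price DECIMAL(10,2) NOT NULL
    );

    -- The rarely-referenced wide column, moved out of the hot table:
    CREATE TABLE item_description (
      item_id INT UNSIGNED NOT NULL PRIMARY KEY,
      description TEXT
    );

    -- Most queries never pay for the TEXT column; join only when it is needed:
    SELECT i.name, d.description
    FROM item i
    JOIN item_description d ON d.item_id = i.id
    WHERE i.id = 7;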
The important thing is that you can (and it's good style!) move columns containing temporary data into a separate table. You can also move optional columns into a separate table (this depends on your logic).
When you are designing a database, the most important thing is that each table should encapsulate one concept. It is better to create more tables and keep different concepts in different tables. The only exception is when you have to optimize your software because the 'straight' logical solution is too slow.
If you are dealing with a very complicated model, you should divide it into a few simple blocks with simple relations; this applies to database design as well.
As for performance, of course one table should give better performance, since you would not need any joins or keys to access all the data. Fewer relations means less lag.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating the data into a couple of tables. What I'm asking is: are there any patterns or design techniques for handling this kind of problem (a huge number of records)? Is MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem above is just an example case; in reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that some of these queries take quite a while (minutes) to execute because of the table's size.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
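For InnoDB, "buffer size" mostly means the buffer pool; a hedged example of raising it (the right value depends entirely on how much RAM the machine has):

    -- In my.cnf, under the [mysqld] section:
    --   innodb_buffer_pool_size = 8G

    -- Or, on recent MySQL versions, at runtime:
    SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8 GB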
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL innodb, it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs)
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
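A minimal sketch of the trigger-maintained counter (the table and trigger names are made up, and big_table stands in for your large table):

    CREATE TABLE row_counts (
      table_name VARCHAR(64) NOT NULL PRIMARY KEY,
      cnt BIGINT NOT NULL
    );
    INSERT INTO row_counts VALUES ('big_table', 0);

    CREATE TRIGGER big_table_cnt_ins AFTER INSERT ON big_table
      FOR EACH ROW UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'big_table';
    CREATE TRIGGER big_table_cnt_del AFTER DELETE ON big_table
      FOR EACH ROW UPDATE row_counts SET cnt = cnt - 1 WHERE table_name = 'big_table';

    -- The count is now a single-row lookup instead of a full table scan:
    SELECT cnt FROM row_counts WHERE table_name = 'big_table';

(On a write-heavy table the single counter row can become a hotspot, which is one reason the scheduled-task alternative above may impact other operations less.)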
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow COUNT(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL, and the COUNT(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total row count of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that COUNT(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of an answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indexes on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, mysql does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high-price and quality; Postgres (aka PostgreSql) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second Mark Baker and say that you need to build indices on your tables.
For queries other than the one you mentioned, you should also be aware that using constructs such as IN() is often faster than a series of OR clauses in the query. There are lots of little steps you can take to speed up individual queries.
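For example (a sketch; the table is hypothetical):

    -- Typically handled better by the optimizer than a long chain of ORs:
    SELECT * FROM customers WHERE state IN ('CA', 'AZ', 'TX');

    -- The equivalent, more verbose form:
    SELECT * FROM customers WHERE state = 'CA' OR state = 'AZ' OR state = 'TX';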
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this is less true in Oracle than in SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance-tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716