Use one giant MySQL table? - mysql

Say I have a table with 25,000 or so rows:
item_id, item_name, item_value, etc...
My application will allow users to generate dynamic lists of anywhere from 2 to 300 items each.
Should I store all of these relationships in one giant table with columns dynamic_list_id, item_id? Each dynamic list would end up having 2 to 300 rows in this table, and the table would likely balloon into the millions of rows, or even billions.
This table would also be queried quite frequently, retrieving several of these dynamic lists each second. Is a giant table the best way to go? Would it make sense to split it up into dynamic tables, perhaps named by user?
I'm really at a loss when it comes to preparing databases for giant amounts of data like this, so any insight would be much appreciated.

It's a relational database; it's designed for exactly that kind of thing - just go for it. A mere few million rows doesn't even count as "giant". Think very carefully about indexing, though - you have to balance insert/update performance, storage space, and query performance.

Yes, I recommend going with your proposed design: "a giant table with columns dynamic_list_id, item_id."
Performance can easily be addressed as required through index selection, by increasing the number of spindles and read/write heads, and with SSD caching.
And in the grand scheme of things, this database does not look to be particularly big. These days it takes dozens or hundreds of TB to be a BIG database.
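A minimal sketch of that design, assuming the 25,000-row table is called items (all names here are illustrative, not prescriptive):

CREATE TABLE dynamic_list_items (
    dynamic_list_id INT UNSIGNED NOT NULL,
    item_id         INT UNSIGNED NOT NULL,
    PRIMARY KEY (dynamic_list_id, item_id),  -- one index range scan fetches a whole list
    KEY idx_item (item_id)                   -- optional: find every list containing a given item
) ENGINE=InnoDB;

-- Retrieving one list is then a simple indexed join against the 25,000-row item table:
SELECT i.item_id, i.item_name, i.item_value
FROM dynamic_list_items AS dl
JOIN items AS i ON i.item_id = dl.item_id
WHERE dl.dynamic_list_id = 42;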

With such a large table, make sure to set your engine to InnoDB for row-level locks.
Make sure you're using indexes wisely. If your queries start to drag, increase innodb_buffer_pool_size to compensate.
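For example (the table name is carried over from the sketch above; the size shown is a placeholder, not a recommendation):

ALTER TABLE dynamic_list_items ENGINE=InnoDB;

-- Check the current buffer pool size, then raise it if the box has spare RAM.
-- (innodb_buffer_pool_size is resizable at runtime on MySQL 5.7+; older versions
-- need the value set in my.cnf and a restart.)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- e.g. 4 GB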

Related

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the queries makes a big difference in terms of performance). It's exploratory work (so we don't know ahead of time what will be asked and cannot easily pre-compute values): one time you might want to compare data points of one type in a time range to data points of another type, another time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~500 million row database, rarely on separately downloaded subsets.
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries, then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
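A sketch of what date-based partitioning could look like, assuming the sensor_readings table above with its DATETIME column ts (the ranges are placeholders):

ALTER TABLE sensor_readings
PARTITION BY RANGE (TO_DAYS(ts)) (
    -- ts is part of the primary key, which MySQL partitioning requires
    PARTITION p2015_01 VALUES LESS THAN (TO_DAYS('2015-02-01')),
    PARTITION p2015_02 VALUES LESS THAN (TO_DAYS('2015-03-01')),
    PARTITION p2015_03 VALUES LESS THAN (TO_DAYS('2015-04-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);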
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
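As a sketch, a 10-minute summary table along those lines might look like this (names and types are illustrative):

CREATE TABLE sensor_summary_10min (
    collecting_device_id INT UNSIGNED NOT NULL,
    sensor_id            INT UNSIGNED NOT NULL,
    ts_10min             DATETIME     NOT NULL,  -- start of the 10-minute bucket
    reading_count        INT UNSIGNED NOT NULL,
    value_sum            DOUBLE       NOT NULL,  -- keep SUM and COUNT; derive AVG when querying
    value_min            DOUBLE       NOT NULL,
    value_max            DOUBLE       NOT NULL,
    PRIMARY KEY (collecting_device_id, sensor_id, ts_10min)
) ENGINE=InnoDB;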
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data; partitioning is painful to add later)
750M rows -- per day? Per decade? If it's per month, that's not too much of a challenge.
Since you receive a batch every other day, it is quite easy to load each batch into a temp table, do the normalization, summarization, etc.; then store the results in the Summary table(s) and finally copy into the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead, sum and count are. If you also need standard deviation, keep sum-of-squares as well.
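A rough sketch of that per-batch step, assuming a staging table named sensor_staging with the raw columns and the 10-minute summary table sketched earlier (all names are invented for illustration; add a sum-of-squares column the same way if you need standard deviation):

INSERT INTO sensor_summary_10min
    (collecting_device_id, sensor_id, ts_10min,
     reading_count, value_sum, value_min, value_max)
SELECT  collecting_device_id,
        sensor_id,
        FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / 600) * 600),  -- floor ts to its 10-minute bucket
        COUNT(*), SUM(value), MIN(value), MAX(value)
FROM    sensor_staging
GROUP BY collecting_device_id, sensor_id,
         FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / 600) * 600)
ON DUPLICATE KEY UPDATE
    reading_count = reading_count + VALUES(reading_count),   -- merge if a bucket spans two batches
    value_sum     = value_sum     + VALUES(value_sum),
    value_min     = LEAST(value_min,    VALUES(value_min)),
    value_max     = GREATEST(value_max, VALUES(value_max));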
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (use utf8 or utf8mb4 only for the columns that need it), NOT NULL (NULL costs a bit), etc.
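For illustration, a Fact table defined with deliberately small types might look like this (the exact types depend on your real value ranges; mapping type names to TINYINT ids via a tiny lookup table is an assumption):

CREATE TABLE sensor_fact (
    device_id MEDIUMINT UNSIGNED NOT NULL,  -- 3 bytes, 0..16M devices
    sensor_id SMALLINT  UNSIGNED NOT NULL,  -- 2 bytes
    type_id   TINYINT   UNSIGNED NOT NULL,  -- 1 byte; 'temperature', 'humidity', ... live in a small lookup table
    value     FLOAT              NOT NULL,  -- 4 bytes; use DECIMAL if exact values matter
    ts        TIMESTAMP          NOT NULL,  -- 4 bytes
    PRIMARY KEY (device_id, sensor_id, ts)
) ENGINE=InnoDB DEFAULT CHARACTER SET ascii;  -- no utf8 overhead on a table with no text columns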
Sure, it is possible to have queries that 'never come back'. This one never comes back even with only 100 rows in table a: SELECT * FROM a JOIN a JOIN a JOIN a JOIN a. The result set has 10 billion rows.
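Spelled out with the aliases MySQL requires for self-joins, and with no join condition, that is a pure cross product of 100^5 = 10 billion rows:

SELECT *
FROM a AS a1
JOIN a AS a2
JOIN a AS a3
JOIN a AS a4
JOIN a AS a5;  -- 100 * 100 * 100 * 100 * 100 rows; don't actually run this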

Is having millions of tables and millions of rows within them a common practice in MySQL database design?

I am doing database design for an upcoming web app, and I was wondering, for anybody who uses MySQL heavily in their current web apps, whether this sort of design is efficient for a web app with, let's say, 80,000 users.
1 DB.
In that DB, millions of tables holding features for each user, and within each table, potentially millions of rows.
While this design is very dynamic and scales nicely, I was wondering a few things:
1. Is this a common design in web applications today?
2. How would this perform, time-wise, when querying millions of rows?
3. How does a DB perform if it contains MILLIONS of tables? (Again, time-wise, and is this even possible?)
4. If it performs well under the above conditions, how would it perform under strenuous load, if all 80,000 users accessed the DB 20-30 times each during 10-15 minute sessions every day?
5. How much server space would this require, very generally speaking (reiterating: millions of tables, each containing potentially millions of rows with 10-15 columns filled with text)?
Any help is appreciated.
1 - Definitely not. Almost anyone you ask will tell you millions of tables is a terrible idea.
2 - Millions of ROWS is common, so just fine.
3 - Probably terribly, especially if the queries are written by someone who thinks it's OK to have millions of tables. That tells me this is someone who doesn't understand databases very well.
4 - See #3
5 - Impossible to tell. You will have a lot of extra overhead from the extra tables as they all need extra metadata. Space needed will depend on indexes and how wide the tables are, along with a lot of other factors.
In short, this is a very very very seriously bad idea and you should not do it.
Millions of rows is perfectly normal usage, and can respond quickly if properly optimized and indexed.
Millions of tables is an indication that you've made a major goof in how you've architected your application. Millions of rows times millions of tables times 80,000 users means what, 80 quadrillion records? I strongly doubt you have that much data.
Having millions of rows in a table is perfectly normal and MySQL can handle this easily, as long as you use appropriate indexes.
Having millions of tables on the other hand seems like a bad design.
In addition to what others have said, don't forget that finding the right table based on the given table name also takes time. How much time? Well, this is internal to the DBMS and likely not documented, but probably more than you think.
So, a query searching for a row can take either:
1. the time to find the table plus the time to find the row in a (relatively) small table, or
2. just the time to find a row in one large table.
Option (2) is likely to be faster.
Also, frequently using different table names in your queries makes query preparation less effective.
If you are thinking of having millions of tables, I can't imagine that you are actually designing millions of logically distinct tables. Rather, I would strongly suspect that you are creating tables dynamically based on data. That is, rather than creating a FIELD for, say, the user id and storing one or more records for each user, you are contemplating creating a new TABLE for each user id. And then you'll have thousands and thousands of tables that all have exactly the same fields in them. If that's what you're up to: Don't. Stop.
A table should represent a logical TYPE of thing that you want to store data for. You might make a city table, and then have one record for each city. One of the fields in the city table might indicate what country that city is in. DO NOT create a separate table for each country holding all the cities for each country. France and Germany are both examples of "country" and should go in the same table. They are not different kinds of thing, a France-thing and a Germany-thing.
Here's the key question to ask: What data do I want to keep in each record? If you have 1,000 tables that all have exactly the same columns, then almost surely this should be one table with a field that has 1,000 possible values. If you really seriously keep totally different information about France than you keep about Germany, like for France you want a list of provinces with capital city and the population but for Germany you want a list of companies with industry and chairman of the board of directors, then okay, those should be two different tables. But at that point the difference is likely NOT France versus Germany but something else.
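In schema terms, the point is simply this (a toy sketch):

-- One table per TYPE of thing, with a column for the attribute you were
-- tempted to split on -- not one table per country:
CREATE TABLE city (
    city_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name         VARCHAR(100) NOT NULL,
    country_code CHAR(2)      NOT NULL,  -- 'FR', 'DE', ... instead of city_france / city_germany tables
    population   INT UNSIGNED     NULL,
    PRIMARY KEY (city_id),
    KEY idx_country (country_code)
) ENGINE=InnoDB;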
1] Look up dimension and fact tables in database design. You can start with http://en.wikipedia.org/wiki/Database_model#Dimensional_model.
2] Be careful about indexing too much: for high write/update tables you don't want to index too much, because that gets very expensive (think of the average or worst case of rebalancing a B-tree). For high-read tables, index only the fields you search by. For example, in
select * from mytable where A = '' and B = '';
you may want to index A and B.
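For a query shaped like that, one composite index covering both columns usually serves better than two separate single-column indexes (a sketch, using the names from the example above):

ALTER TABLE mytable ADD INDEX idx_a_b (A, B);

-- Verify that the index is actually being used:
EXPLAIN SELECT * FROM mytable WHERE A = 'foo' AND B = 'bar';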
3] It may not be necessary to start thinking about replication yet, but since you are talking about 10^6 entries and tables, maybe you should.
So, instead of me just telling you a flat no for the millions-of-tables question (and yes, my answer is NO), I think a little research will serve you better. As for the millions of records, that hints that you need to start thinking about "scaling out" -- as opposed to "scaling up."
SQL Server has many ways you can support large tables. You may find some help by splitting your indexes across multiple partitions (filegroups), placing large tables on their own filegroup, and indexes for the large table on another set of filegroups.
A filegroup is basically a separate drive. Each drive has its own dedicated read and write heads. The more drives you have, the more heads are searching the indexes at a time, and thus the faster your records are found.
Here is a page that talks in detail about filegroups:
http://cm-bloggers.blogspot.com/2009/04/table-and-index-partitioning-in-sql.html
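As a rough T-SQL sketch of that idea (database name, file paths, and object names are all placeholders):

-- Add a filegroup backed by a file on its own drive:
ALTER DATABASE MyDb ADD FILEGROUP fg_bigtable_data;
ALTER DATABASE MyDb ADD FILE
    (NAME = bigtable_data1, FILENAME = 'E:\sqldata\bigtable_data1.ndf')
    TO FILEGROUP fg_bigtable_data;

-- Place the large table on that filegroup, and a nonclustered index on another:
CREATE TABLE dbo.BigTable (
    id   BIGINT IDENTITY(1,1) PRIMARY KEY,
    name VARCHAR(200) NOT NULL
) ON fg_bigtable_data;

CREATE NONCLUSTERED INDEX ix_bigtable_name ON dbo.BigTable (name) ON [PRIMARY];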

Splitting up a large mySql table into smaller ones - is it worth it?

I have about 28 million records to import into a mySql database. The record contains personal information about members in the US and will be searchable by states.
My question is: is it more efficient to break up the table into smaller tables as opposed to keeping everything in one big table? What I had in mind was to split them up into 50 separate tables representing the 50 states, something like this: members_CA, members_AZ, members_TX, etc.
This way I could do a query like this:
'SELECT * FROM members_' . $_POST['state'] . ' WHERE members_name LIKE "John Doe" ';
This way I only have to deal with data for a given state at once. Intuitively it makes a lot of sense but I would be curious to hear other opinions.
Thanks in advance.
I posted as a comment initially but I'll post as an answer now.
Never, ever think of creating X tables based on a difference in attribute. That's not how things are done.
If your table will have 28 million rows, think of partitioning to split it into smaller logical sets.
You can read about partitioning at MySQL documentation.
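For example, LIST partitioning on the state column keeps one logical members table while physically grouping rows per state (a sketch; note that MySQL requires the partitioning column to be part of the primary key):

CREATE TABLE members (
    member_id   INT UNSIGNED NOT NULL,
    member_name VARCHAR(100) NOT NULL,
    state       CHAR(2)      NOT NULL,
    PRIMARY KEY (member_id, state),          -- partitioning column must be in the PK
    KEY idx_state_name (state, member_name)
) ENGINE=InnoDB
PARTITION BY LIST COLUMNS (state) (
    PARTITION p_ca VALUES IN ('CA'),
    PARTITION p_az VALUES IN ('AZ'),
    PARTITION p_tx VALUES IN ('TX')
    -- ... and so on for the remaining states
);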
The other thing is choosing the right db design and choosing your indexes properly.
The third thing would be to avoid the terrible idea of using $_POST directly in your query, as you probably don't want someone to inject SQL and drop your database, tables, or what not.
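With a single members table the query can also be parameterized instead of interpolating $_POST; at the SQL level that looks like the sketch below (in PHP you would use PDO or mysqli placeholders to the same effect):

PREPARE find_members FROM
    'SELECT * FROM members WHERE state = ? AND member_name LIKE ?';
SET @state = 'CA', @name = 'John Doe';
EXECUTE find_members USING @state, @name;
DEALLOCATE PREPARE find_members;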
The final thing is choosing appropriate hardware for the task; you don't want such an app running on a VPS with 500 MB or 1 GB of RAM.
Do not do that. Keep the similar data in one table. You will have heavy problems implementing logic and building queries whenever a decision spans many states. Moreover, if you need to change the database definition, like adding columns, you will have to perform the same operation over all of the numerous (seemingly infinite) tables.
Use indexing to increase performance, but stick to a single table!
You can also increase the memory cache for a performance boost. Follow this article to do so.
If you create an index on the state column, a select on all members of one state will be as efficient as the use of separate tables. Splitting the table has a lot of disadvantages. If you add columns, you have to add them in 50 tables. If you want data from different states, you have to use union statements that will be very ugly and inefficient. I strongly recommend sticking to one table.
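Concretely, something like this (column names assumed from the question):

ALTER TABLE members ADD INDEX idx_state_name (state, member_name);

SELECT * FROM members
WHERE state = 'CA' AND member_name = 'John Doe';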
My first response is that you need to keep all like data together and keep it as one table. You should look into putting indexes on your table to increase performance, not breaking it up into smaller tables.

Max number of rows in MySQL

I'm planning to generate a huge amount of data, which I'd like to store in a MySQL database. My current estimations point to four thousand million (4 billion) rows in the main table (only two columns, one of them indexed).
Two questions here:
1) Is this possible?
and more specifically:
2) Will such a table be efficiently usable?
Thanks!
Jaime
Sure, it's possible. Whether or not it's usable will depend on how you use it and how much hardware/memory you have. With a table that large, it would probably also be worth using partitioning, if that fits the kind of data you are storing.
ETA:
Based on the fact that you only have two columns with one of them being indexed, I'm going to take a wild guess here that this is some kind of key-value store. If that is the case, you might want to look into a specialized key-value store database as well.
It may be possible; MySQL has several table storage engines with differing capabilities. I think the MyISAM storage engine, for instance, has a theoretical data size limit of 256 TB, but it's further constrained by the maximum size of a file on your operating system. I doubt it would be usable. I'm almost certain it wouldn't be optimal.
I would definitely look at partitioning this data across multiple tables (probably even multiple DBs on multiple machines) in a way that makes sense for your keys, then federating any search results/totals/etc. you need. Amongst other things, this allows you to do searches where each partition is searched in parallel (in the multiple-servers approach).
I'd also look for a solution that's already done the heavy lifting of partitioning and federating queries. I wonder if Google's AppEngine data store (BigTable) or the Amazon SimpleDB would be useful. They'd both limit what you could do with the data (they are not RDBMS's), but then, the sheer size is going to do that anyway.
You should consider partitioning your data...for example if one of the two columns is a name, separate the rows into 26 tables based on the first letter.
I created a mysql database with one table that contained well over 2 million rows (imported U.S. census county line data for overlay on a Google map). Another table had slightly under 1 million rows (USGS Tiger location data). This was about 5 years ago.
I didn't really have an issue (once I remembered to create indexes! :) )
4 gigarows is not that big, actually; it is pretty average for any database engine today to handle. Even partitioning could be overkill. It should simply work.
Your performance will depend on your HW though.

Table Size Efficiency

We're doing some normalization of our data because it's too modular in some respects. The thing is that the table is getting very wide, with 400 or so columns so far. I've seen that the maximum number of columns is 1,024, but I'm interested in knowing about paging with large table structures. If we had, say, 1,000 columns, but some were quite large (varchar(max), for example), would there be a reduction in speed during queries? It's probably going to be accessed thousands of times a day, so making sure it's not doing something like paging is quite important.
Basically, what's the maximum we can have before we notice a performance hit?
Technically it would depend on the queries against that data. While I don't know the ins and outs of MS SQL Server too well, if all queries to such a table only ever hit the primary key index, it would be fast.
The bigger problem is how many repeats of the same varchar data there are, and whether it repeats across multiple fields as well as records. Generally you'll want to separate this out into an index<->data lookup table and use integer keys in the main table. After all, integers are faster to query against than string matching, and they take less storage.
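A sketch of that lookup pattern (table and column names invented for illustration):

-- Store each distinct string once and reference it by an integer key:
CREATE TABLE status_text (
    status_id   INT          NOT NULL PRIMARY KEY,
    status_text VARCHAR(200) NOT NULL
);

CREATE TABLE wide_table (
    row_id    INT NOT NULL PRIMARY KEY,
    status_id INT NOT NULL  -- 4 bytes instead of a repeated VARCHAR value
    -- ... the remaining columns ...
);

SELECT w.row_id, s.status_text
FROM wide_table AS w
JOIN status_text AS s ON s.status_id = w.status_id
WHERE s.status_text = 'shipped';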
It depends on hardware, OS, number of rows, DB configuration, indexes, and the queries themselves - a lot of things.
When it comes to paging, the main question is how much data is frequently accessed (this is essentially a lot of index data, as well as row data that is frequently retrieved) versus how much physical memory the machine has.
But in terms of general performance, you really need to take into account what sort of queries are being issued. If you're getting queries all over the place, each looking at different columns, then you'll hit performance problems much more easily. If a small set of columns is used in the bulk of queries, then you can probably use that to improve your performance by a reasonable amount.
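For instance, if the bulk of queries touch the same handful of columns, a covering index on just those columns lets SQL Server answer them without reading the wide rows at all (names are placeholders):

CREATE NONCLUSTERED INDEX ix_widetable_hot
    ON dbo.WideTable (customer_id, created_at)
    INCLUDE (status, total);  -- the few columns those queries actually select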
Note that if your table has lots of repeated data, you really should review your database design.