MySQL Database Structure - mysql

I will have a table with a few million entries, and I have been wondering whether it would be smarter to split it into more than one table, even though they would all have the same structure. Would that save resources and be more efficient in the end?
This is my particular concern because I plan on creating a small search engine which indexes about 3,000,000 sites, and each site will have approximately 30 words being indexed. This is my structure right now:
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
Should I keep this structure? Or should I create separate tables for A words, B words, C words, etc.? The same question applies to the appearances table.
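For reference, a minimal sketch of this schema as MySQL DDL might look like the following; the exact data types, lengths, and secondary index choices are assumptions rather than part of the original question:

CREATE TABLE site (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  url VARCHAR(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_url (url)
) ENGINE=InnoDB;

CREATE TABLE word (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  word VARCHAR(64) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_word (word)
) ENGINE=InnoDB;

-- One row per (site, word) pair: roughly 3,000,000 sites * 30 words = 90,000,000 rows.
CREATE TABLE appearances (
  site_id INT UNSIGNED NOT NULL,
  word_id INT UNSIGNED NOT NULL,
  score   INT NOT NULL,
  PRIMARY KEY (site_id, word_id),
  KEY idx_word_score (word_id, score)  -- supports "find sites for a word, ordered by score"
) ENGINE=InnoDB;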

SELECT queries are faster on smaller tables. For better performance, you want the indexes you sort on to fit into your system's memory.
More importantly, tables should not be defined to hold a certain type of data, but rather a collection of associated data. So if the data you are storing has logical differences, it may make sense to break it into separate tables.
An (incomplete) list of pros and cons:
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
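For instance, if the word table were split into per-letter tables (word_a, word_b, and so on; hypothetical names), even a simple lookup spanning two letters needs a UNION, roughly like this:

-- Look up two words when the word table is split by first letter
SELECT id, word FROM word_a WHERE word = 'apple'
UNION ALL
SELECT id, word FROM word_b WHERE word = 'banana';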
If you aren't concerned with some latency on your database, it should be able to handle this on the order of a few million records without too much trouble.

Here are some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least to simulate enough data that is similar enough to the genuine data to be structurally representative. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data, you're just stabbing in the dark.

Related

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms, each with different sensors). Data is downloaded at regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead of time what will be asked and cannot easily pre-compute values), so one time you might want to compare data points of one type in a time range to data points of another type; another time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~500-million-row database, rarely on separately downloaded subsets.
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries, then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried, you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table, for example). For now I suggest you put it all in one table.
You might consider partitioning the data by date, or by another dimension that might be useful (recording device, maybe?). Data of this size is often partitioned if you have the resources.
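A minimal sketch of such date-based partitioning, assuming a hypothetical sensor_readings table with monthly ranges (column names and types are guesses based on the fields listed in the question):

CREATE TABLE sensor_readings (
  collecting_device_id MEDIUMINT UNSIGNED NOT NULL,
  sensor_id            MEDIUMINT UNSIGNED NOT NULL,
  type_id              SMALLINT UNSIGNED NOT NULL,  -- lookup id for 'temperature', etc.; unit implied by type
  value                FLOAT NOT NULL,
  ts                   DATETIME NOT NULL,
  PRIMARY KEY (collecting_device_id, sensor_id, ts)  -- the partitioning column must be part of every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
  PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

Partitions can then be added per month and dropped cheaply if old data ever needs to be purged.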
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? Per decade? If it's per month, that's not too much of a challenge.
Since you receive a batch every other day, it is quite easy to load the batch into a temp table, do normalization, summarization, etc.; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
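As a sketch of that per-batch flow, assuming a hypothetical staging_batch table and a 10-minute summary table readings_10min that stores sum, count, and sum-of-squares (all names here are illustrative):

-- Summarize the freshly loaded batch into 10-minute buckets
INSERT INTO readings_10min (sensor_id, type_id, bucket_start, sum_val, cnt, sum_sq)
SELECT  sensor_id,
        type_id,
        FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(ts) / 600) * 600) AS bucket_start,  -- truncate to 10 minutes
        SUM(value),
        COUNT(*),
        SUM(value * value)
FROM    staging_batch
GROUP BY sensor_id, type_id, bucket_start;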
In reading my tips, you will notice that avg is not summarized; instead, sum and count are. If you also need standard deviation, keep sum-of-squares as well.
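For example, the average and (population) standard deviation over any time range can be derived later from those three columns; this query assumes the hypothetical readings_10min table sketched above:

SELECT  SUM(sum_val) / SUM(cnt)                                        AS avg_val,
        SQRT(SUM(sum_sq) / SUM(cnt) - POW(SUM(sum_val) / SUM(cnt), 2)) AS std_dev
FROM    readings_10min
WHERE   sensor_id = 42
  AND   bucket_start >= '2024-01-01'
  AND   bucket_start <  '2024-02-01';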
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal data size for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (use utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
Sure, it is possible to have 'queries that never come back'. This one never comes back, even with only 100 rows in table a: SELECT * FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5. With no join conditions this is a cross join, so the result set has 100^5 = 10 billion rows.

Best practices for creating a huge SQL table

I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiples states at the same time (i.e. won't need to frequently join tables if I go with Option 2). Here is a common use case: The front-end passes a state id to the back-end, and based on that id, I need to query data from the db regarding the specified state, and return data back to front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, splitting by state is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
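As a hedged sketch of what state-based partitioning might look like in MySQL (table and column names are assumptions, and note that the partitioning column has to be part of every unique key, including the primary key):

CREATE TABLE users (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  state_id TINYINT UNSIGNED NOT NULL,  -- logical reference to a State lookup table
  name     VARCHAR(100) NOT NULL,
  PRIMARY KEY (id, state_id)
) ENGINE=InnoDB
PARTITION BY LIST (state_id) (
  PARTITION p_al VALUES IN (1),
  PARTITION p_ak VALUES IN (2),
  PARTITION p_az VALUES IN (3)
  -- and so on, one partition per state
);

Queries that filter on state_id then touch only one partition (partition pruning), which gives much of the benefit of per-state tables without changing the front-end queries.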
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.

Having data stored across tables representing individual data types - Why is it wrong?

Say I have lots of time to waste and decide to make a database where information is not stored as entities but in separate inter-related tables representing INT,VARCHAR,DATE,TEXT, etc types.
It would be such a revolution to never have to design a database structure ever again, except that the fact that no one else has done it probably indicates it's not a good idea :p
So why is this a bad design? What principles does it go against? What issues could it cause from a practical point of view with a relational database?
P.S: This is for the learning exercise.
Why shouldn't you separate out the fields from your tables based on their data types? Well, there are two reasons, one philosophical, and one practical.
Philosophically, you're breaking normalization
A properly normalized database will have different tables for different THINGS, with each table having all fields necessary and unique for that specific "thing." If the only way to find the make, model, color, mileage, manufacture date, and purchase date of a given car in my CarCollectionDatabase is to join meaningless keys on three tables demarcated by data type, then my database has almost zero discoverability and no real cohesion.
If you designed a database like that, you'd find writing queries and debugging statements would be obnoxiously tiresome. Which is kind of the reason you'd use a relational database in the first place.
(And, really, that will make writing queries WAY harder.)
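To make that concrete, here is a rough sketch of what fetching a single car's make and purchase date could look like under such a design, assuming hypothetical entities, varchar_values, and date_values tables keyed by entity id and attribute name:

SELECT  v_make.value      AS make,
        d_purchase.value  AS purchase_date
FROM    entities e
JOIN    varchar_values v_make
          ON v_make.entity_id = e.id AND v_make.attribute = 'make'
JOIN    date_values d_purchase
          ON d_purchase.entity_id = e.id AND d_purchase.attribute = 'purchase_date'
WHERE   e.id = 123;

-- versus the normalized equivalent:
SELECT make, purchase_date FROM cars WHERE id = 123;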
Practically, databases don't work that way.
Every database engine or data-storage mechanism I've ever seen is simply not meant to be used with that level of abstraction. Whatever engine you had, I don't know how you'd get around essentially doubling your data design with fields. And with a five-fold increase in row count, you'd have a massive increase in index size, to the point that once you get a few million rows your indexes wouldn't actually help.
If you tried to design a database like that, you'd find that even if you didn't mind the headache, you'd wind up with slower performance. Instead of 1,000,000 rows with 20 fields, you'd have that one table with just as many fields, and some 5-6 extra tables with 1,000,000+ entries each. And even if you optimized that away, your indexes would be larger, and larger indexes run slower.
Of course, those two ONLY apply if you're actually talking about databases. There's no reason, for example, that an application can't serialize to a text file of some sort (JSON, XML, etc.) and never write to a database.
And just because your application needs to store SQL data doesn't mean that you need to store everything, or can't use homogeneous and generic tables. An Access-like application that lets users define their own "tables" might very well keep each field on a distinct row... although in that case your database's THINGS would be those tables and their fields. (And it wouldn't run as fast as a natively written database.)

Using multiple Tables instead of one BIG table?

Is there an advantage or disadvantage when I split big tables into multiple smaller tables when using InnoDB & MySQL?
I'm not talking about splitting the actual InnoDB file, of course; I'm just wondering what happens when I use multiple tables.
Circumstances:
I have a REAL big table with millions of rows (items), they are categorized (column "category").
Now, I'm thinking about using a separate table for each category instead.
I will not need the data across multiple tables under any circumstances, guaranteed.
Generally speaking, if your data sets have no relevance to each other, they should be in separate tables rather than one catch-all table.
However, if they are related, they should really reside in one table. You can manage performance of large tables in a number of ways. I suggest you have a look at partitioning the table if it grows so large that it starts to cause problems.
However, millions of rows isn't a "REAL big table" as you say; we have many tables with tens of millions of rows, and even a few with hundreds of millions, and they perform just fine thanks to a mixture of clever indexing, partitioning and read replicas.
Edit 1 - In response to comments:
Creating dynamic tables for each of your keys in a key-value pair is, as you rightly say, unusual, ugly and just very wrong: you're defeating the relational part of an RDBMS.
It is impossible for me to be specific with the following, as your schema and detailed information about what you want to achieve are still lacking from this question; however, I feel I grasp enough to edit my original answer.
There is a huge difference between partitioning a table within the same database and creating a new table in another database. You ask about performance; generally speaking they should perform the same (that is, a new table in a 1000 GB database and one in an empty database), providing you have enough resources such as memory for indexing and IO on the underlying data storage, and there is no bottleneck.
I can't really work out why you would want to create dynamic tables ("table_{category}"), or store the value/category in a text file. This really sounds like you need a 1-N relationship and a JOIN.

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites averaging tens of thousands to hundreds of thousands of page views a day each, so the amount of data will be very large.
Should I use a single table with a websiteid column, or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with the same columns, so from a database normalization standpoint it makes sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
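For example, daily totals per website could be maintained in a small summary table so that dashboards never have to scan the raw page-view rows; this sketch assumes hypothetical page_views and daily_pageviews tables with a unique key on (website_id, day):

INSERT INTO daily_pageviews (website_id, day, views, unique_visitors)
SELECT  website_id,
        DATE(viewed_at)            AS day,
        COUNT(*)                   AS views,
        COUNT(DISTINCT visitor_id) AS unique_visitors
FROM    page_views
WHERE   viewed_at >= CURDATE() - INTERVAL 1 DAY
  AND   viewed_at <  CURDATE()
GROUP BY website_id, day
ON DUPLICATE KEY UPDATE
        views = VALUES(views),
        unique_visitors = VALUES(unique_visitors);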
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain. However, if you write enough scripting, you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the Windows hard limit on the number of open files - in Linux you may have to use ulimit to raise the limit; there's no way to do it in Windows).
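For reference, a minimal MERGE setup looks roughly like this; it only works over identically defined MyISAM tables, and the names here are illustrative:

CREATE TABLE visits_site1 (id INT NOT NULL, url VARCHAR(255), INDEX (id)) ENGINE=MyISAM;
CREATE TABLE visits_site2 (id INT NOT NULL, url VARCHAR(255), INDEX (id)) ENGINE=MyISAM;

-- The MERGE table presents the underlying MyISAM tables as one logical table for SELECTs.
CREATE TABLE visits_all (id INT NOT NULL, url VARCHAR(255), INDEX (id))
  ENGINE=MERGE UNION=(visits_site1, visits_site2) INSERT_METHOD=LAST;

SELECT COUNT(*) FROM visits_all;  -- scans both underlying tables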
I have broken a huge table into many (hundreds) of tables and used MERGE to SELECT. I did this so that I could perform off-line creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE...ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described here: http://blog.coldlogic.com/categories/coldstore/).
Use the single data structure. Once you start encountering performance problems there are many solutions: you can partition your tables by website id (also known as horizontal partitioning), or you can use replication. It all depends on the ratio of reads vs. writes.
But to start, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not. You can take advantage of the various MySQL storage engines like MyISAM or NDB (in-memory clustering) to boost performance. Caching also plays a big role in offloading load from the database: data that is mostly read-only and can be computed easily is usually put in the cache, the cache serves those requests instead of going to the database, and only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should do performance tests yourself to understand whether having one big table is sufficient.