Basic question: Querying data and performance tradeoffs - mysql

Let's say I have 100 rows in my table, with 3 columns of numbers. I don't need all the rows, only about half of them every time I fetch data. I only want the rows that have been updated, since fetching the rest would be redundant.
Is it better to add a datetime column that records when each row was last updated, and use that as a criterion when SELECTing? Or would it be better to simply download all the data each and every time? (Currently the data is being sent back as JSON.)
What are the tradeoffs in terms of speed, bandwidth usage, and server CPU usage between these two options? Is the former just plain better than the latter?

Both Jens Struwe and roycl are right - but as you're asking a hypothetical question, you're going to get answers that are right and contradictory.
If only half the data is relevant, how is the client going to determine which data to show? If the decision can be made in software at all, it's more efficient to make it on the database side, and it's also more logical.
With tables of 100 rows, performance is neither here nor there; maintainability and long-term upgradability are a far bigger deal. Most developers would expect a logical database design, with sorting and filtering done on the DB rather than on the client.

Always (or at least wherever possible) select only the data you need to accomplish your task. Conversely: never select data that you then have to filter out. In short: add a timestamp field for the updates and select only those rows whose timestamp is greater than the given one.
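A minimal sketch of that approach, assuming a hypothetical table named readings and that the client remembers the time of its last fetch:

-- Hypothetical table and column names; adjust to your schema.
ALTER TABLE readings
    ADD COLUMN updated_at TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP;

-- Fetch only the rows changed since the client's last poll.
SELECT id, col_a, col_b, col_c
FROM readings
WHERE updated_at > '2012-06-01 12:00:00';  -- last fetch time, supplied by the client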

With 100 rows in your table and 3 columns of numbers, it really doesn't matter which approach you use, provided you don't mind the server taking a few tens of milliseconds to return the data. The rows, if queried frequently, will all be in memory anyway. Sending everything also makes your JSON code simpler and your client code dumber (which is probably good, and more maintainable).
If you had a several-million row table with only a small percentage of data that was required, you would naturally want to limit the return set, and the easiest way of doing that is with an SQL WHERE clause, such as WHERE dt_modified > my_timestamp. On a properly optimised database even this query could come in at well under 100ms.
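To keep that query fast as the table grows, the filter column needs an index. A sketch, reusing the hypothetical dt_modified column from the WHERE clause above (my_table and the column list are placeholders):

-- Index the modification timestamp so the WHERE clause can seek rather than scan.
CREATE INDEX idx_dt_modified ON my_table (dt_modified);

-- Check the plan: the access type should show "range" instead of a full table scan.
EXPLAIN
SELECT id, col_a, col_b, col_c
FROM my_table
WHERE dt_modified > '2012-06-01 12:00:00';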
The issue may have more to do with the time the data spends on the wire, and how much time the client spends either regenerating the page or updating it based on the returned data. Client processing time is often the slowest part of the process. Only testing on different browsers and over different network speeds will find the best balance between server-side tweaks, network fixes (such as gzipping to compress data) and optimising your JavaScript calls.

Related

Is filtered selection faster than fetching all the rows and then filtering

So I want to create a table in the frontend where I will list every single user. The thing is that the tables are relational and I have to get data from multiple tables in order to fulfill my goal.
Now here comes my question (keep in mind I have a MySQL database) :
Which method is better in the long run:
Generate joined queries that fetch all the data from each table where a user has any information (this outputs ~80 columns per row, and only 15 of them are needed)
Fetch the data that I need with multiple queries and then just "stick" the values together and output them (15 columns and all of them are needed, but I have to do extra work)
I would suggest you go for a third option.
Generate a joined query that fetches only the 15 columns your front end needs. That would be the most efficient way.
If you are facing challenges with joining the tables then you can share table structures with sample data and desired output here with your query. We can try to help you achieve your goal.
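In the meantime, here is a sketch of that third option with purely hypothetical table and column names, since the real schema hasn't been posted:

-- Fetch only the columns the front-end table needs, in a single round trip.
SELECT u.id,
       u.username,
       u.email,
       p.display_name,
       p.country,
       s.last_login
FROM users u
JOIN user_profiles p ON p.user_id = u.id
LEFT JOIN user_stats s ON s.user_id = u.id;
-- ...and so on for the remaining needed columns; never SELECT * here.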
This is a bit long for a comment.
I don't understand your first option. Why would you be selecting columns that you don't need? If there are 15 columns that you specifically want, then select those columns and nothing else.
In general, it is faster to have the database do most of the work. It can take advantage of its optimizer to produce the best execution plan that it can.
From experience with a MySQL server on embedded hardware:
If the hardware can cope and has enough resources, let the database server do the work, since it can use its optimizer.
But if the server hardware is weak in some respects, transfer all the data to the client and let it process the returned data in JavaScript.
The same goes for the bandwidth of the internet connection: if it is slow, you want to transfer fewer rows, because the user will notice the delay; even old smartphones have plenty of CPU power and can easily handle whatever you throw at them.
Basically, there is no simple answer. You have to check the server hardware and the typical bandwidth available, and then program the solution that works best.
A simple Rule of Thumb:
Fewer round-trips to the database server is usually the faster alternative.

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting amounts to about 5 terabytes a year. I keep all of my data; I don't think I will want to delete anything early.
I am wondering whether I should use a distributed database, because my data will keep growing every year. After 5 years I will have 25 terabytes without indexes (just calculated from the raw data I save every day).
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are very highly connected.
I know it depends on the queries and on the table design, and that I could also run MySQL itself as a distributed database.
I just want to know when I should start thinking about a distributed database.
Would this be a use case, or could MySQL handle a dataset this large?
EDIT:
On average I will have 1,500 clients writing data per second, and they affect all tables.
I only need the old data for analytics, such as machine learning and pattern matching.
A client should also be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
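As an illustration of the summary-table idea, here is a sketch with assumed table and column names (the real schema wasn't posted): one aggregated row per client per day stands in for millions of raw 'Fact' rows in reporting queries.

-- Hypothetical summary table: one aggregated row per client per day.
CREATE TABLE daily_summary (
    client_id INT    NOT NULL,
    day       DATE   NOT NULL,
    row_count INT    NOT NULL,
    sum_value DOUBLE NOT NULL,
    max_value DOUBLE NOT NULL,
    PRIMARY KEY (client_id, day)
);

-- Run once per day (or incrementally) after loading the raw data.
INSERT INTO daily_summary (client_id, day, row_count, sum_value, max_value)
SELECT client_id, DATE(ts), COUNT(*), SUM(value), MAX(value)
FROM fact_table
WHERE ts >= CURDATE() - INTERVAL 1 DAY
  AND ts <  CURDATE()
GROUP BY client_id, DATE(ts);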
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc.? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such an action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
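A stripped-down sketch of such an ETL step, with hypothetical schema and table names; real jobs usually do far more transformation in the SELECT and run from a scheduler:

-- Extract/Transform: copy one closed month into the warehouse schema,
-- reshaping it into whatever reporting layout the warehouse uses.
INSERT INTO warehouse.events_history (client_id, event_day, value)
SELECT client_id, DATE(created_at), value
FROM live.events
WHERE created_at >= '2016-01-01' AND created_at < '2016-02-01';

-- Archive: remove the moved rows from the transactional database.
DELETE FROM live.events
WHERE created_at >= '2016-01-01' AND created_at < '2016-02-01';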
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-of, kind-of out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

mysql getting rid of redundant values

I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points(~4000) a couple times every minute and stores them in my database. I need to be able to down sample based on the time stamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd like to do would be:
SELECT point_id,
       MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is that this seems super inefficient with respect to memory. Using this structure, each timestamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I have thought of that reduce the memory footprint of my database require me either to use separate tables (which to my understanding is super bad practice) or to store the data in CSV files, which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and would probably be substantially slower to search). Is there a database structure I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what: that is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So, a record would have, say 4+4+8=16 bytes (assuming a double floating point representation for value). By removing the timestamp you would get 12 bytes -- savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 Mbytes/day and 138 Mbytes/day, or 67 Gbytes/year and 50 Gbytes/year. You know, you are going to have to deal with biggish-data issues regardless of how you store the timestamp.
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
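A sketch of daily range partitioning for such a table; the time column is assumed to be a DATETIME (named ts here), and new partitions would need to be added ahead of time by a maintenance job:

-- Each day's readings live in their own partition, so date-bounded queries
-- (and eventual purging via DROP PARTITION) touch only the relevant days.
CREATE TABLE results (
    point_id INT      NOT NULL,
    ts       DATETIME NOT NULL,
    value    DOUBLE   NOT NULL,
    KEY idx_point_ts (point_id, ts)
)
PARTITION BY RANGE (TO_DAYS(ts)) (
    PARTITION p20140101 VALUES LESS THAN (TO_DAYS('2014-01-02')),
    PARTITION p20140102 VALUES LESS THAN (TO_DAYS('2014-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);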
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that!!! Normalising data (splitting it into separate, linked data structures) is actually good practice - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.
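A sketch of that normalised layout, with assumed names, as an alternative to repeating the timestamp on every row:

-- One row per capture instant...
CREATE TABLE capture_times (
    time_id INT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ts      DATETIME NOT NULL
);

-- ...and each of the ~4000 readings references it instead of storing the timestamp.
CREATE TABLE results (
    point_id INT    NOT NULL,
    time_id  INT    NOT NULL,
    value    DOUBLE NOT NULL,
    FOREIGN KEY (time_id) REFERENCES capture_times (time_id)
);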

Should I recalculate large amounts of data from tables, or should I save it in my database?

My question is more general than specific, yet I am using an example to transfer the idea.
I have a forum, and in each reply I display the number of messages the user has.
Assuming that on some pages there are 15 different users, each with over 20,000 messages, should I recalculate the number of messages by counting how many entries the user has in the messages table, or would it be better to create a column in the users table that holds this figure and update it every time a reply is made?
I know it defies the database normalization rules, but it seems like a big waste to calculate it every time.
I'm using mySQL, if it matters.
Generally no, but in some specific cases, yes.
You should avoid having redundant data in a database. However, sometimes you have to make that tradeoff to get a decent performance.
I have actually done exactly the thing in your example. It works great for the performance, but it's really hard to keep the message count correct. You will get some inconsistent values sooner or later, so you need a plan for how to go through the values periodically and recalculate them.
You are talking about denormalization. Quoting Wikipedia:
denormalization is the process of attempting to optimise the read performance of a database by adding redundant data or by grouping data.
Keeping denormalized data consistent in 'plain' code is not easy. Remember that:
You can keep redundant data up to date with triggers (see the sketch after this list).
If your architecture includes an ORM, it is easier to keep redundant data consistent.
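For example, a trigger-based version of the message counter from the question (table and column names are assumptions):

-- Keep users.message_count in step with the messages table on every insert;
-- a matching AFTER DELETE trigger would decrement it.
CREATE TRIGGER trg_messages_count
AFTER INSERT ON messages
FOR EACH ROW
UPDATE users
SET message_count = message_count + 1
WHERE id = NEW.user_id;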
You could also go halfway in your denormalisation: have a table with monthly data per user, filled by a monthly job, and calculate the number of messages on the fly by counting the messages since the 1st of the month and adding the sum of the monthly data. Or, if you don't need the monthly data, you can still calculate on the fly over the current month plus a monthly process that updates the end-of-month figures. That will avoid triggers...
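A sketch of that halfway house, with assumed table and column names:

-- Closed months, filled by a scheduled monthly job.
CREATE TABLE user_monthly_counts (
    user_id   INT  NOT NULL,
    month     DATE NOT NULL,   -- first day of the month
    msg_count INT  NOT NULL,
    PRIMARY KEY (user_id, month)
);

-- Current total = sum of closed months + live count for the running month.
SELECT COALESCE(SUM(m.msg_count), 0)
       + (SELECT COUNT(*)
          FROM messages
          WHERE user_id = 42   -- hypothetical user id
            AND created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')) AS total_messages
FROM user_monthly_counts m
WHERE m.user_id = 42;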
I'm surprised nobody has mentioned materialized views. These objects are very helpful when it comes to maintaining aggregates of data for performance reasons without violating the normalisation of our actual data. Find out more.
Have you tried to benchmark the results of counting the number of rows?
I'd recommend you just do your calculation in a view. With the denormalization you're proposing, you're just exposing yourself to the risk of data corruption: the post count column will end up with some arbitrary value that's got nothing to do with the real number of posts.
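A sketch of such a view, with assumed names; MySQL recomputes it at query time, so it always stays correct, though it is no faster than running the COUNT directly:

-- Derived post counts: always in sync, nothing stored that can drift.
CREATE VIEW user_post_counts AS
SELECT user_id, COUNT(*) AS post_count
FROM messages
GROUP BY user_id;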

Best database design for storing a high number of columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. They all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores keys and values, or using a document-based database so I have access to a variable schema and can scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage: XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will grow. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily, i.e. checking whether items have been updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', so there shouldn't be any need to query the database every time. Besides, 1000 per hour is only 17 per minute; that probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In that situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it costs me something at upload time, I do get some benefits (a plain-SQL sketch of the dictionary layout follows the list below):
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
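A plain-SQL sketch of that dictionary layout, with hypothetical column names (the real feed fields weren't posted):

-- Dictionary of distinct values for one low-cardinality text column.
CREATE TABLE country_dict (
    country_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country    VARCHAR(64) NOT NULL,
    UNIQUE KEY uk_country (country)
);

-- The master table stores the small integer key instead of the repeated text.
CREATE TABLE feed_master (
    id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country_id INT NOT NULL,
    -- ...the remaining feed columns...
    FOREIGN KEY (country_id) REFERENCES country_dict (country_id)
);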
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by almost all queries, while the remaining 60-70% are spread evenly across the queries and only used partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another table for the rest of the fields. In fact, you can have several such tables, logically grouping the data.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.