MySQL Group By Date Performance Query - mysql

I'm currently writing a query that will run multiple times on a MyISAM table in a MySQL DB.
The query takes a lot of rows (could be anything up to 100,000+) and gets monthly totals. The SQL I'm currently using is
SELECT DATE_FORMAT(ct_cdatetime, "%m-%Y") AS Month, SUM(ct_total), SUM(ct_charge)
FROM client_transaction
WHERE (...omitted for this example...)
GROUP BY DATE_FORMAT(ct_cdatetime, "%m-%Y") ORDER BY ct_cdatetime ASC
I'm aware of the performance issue of forcing MySQL to cast the date to a string. Would it be quicker and/or better practice to:
1) Leave it as is
2) Select all the rows and group them in PHP in an array
3) Have a month-year int field in the database and update it when I add the row (e.g. 714 for July 2014)? (a sketch of this option follows below)
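For reference, option 3 could look roughly like the sketch below. The column name ct_monthyear and the index name are only illustrative assumptions, and storing the value as YYYYMM (e.g. 201407) rather than 714 keeps the integer order chronological:
ALTER TABLE client_transaction
    ADD COLUMN ct_monthyear MEDIUMINT UNSIGNED NOT NULL,   -- e.g. 201407 for July 2014
    ADD INDEX idx_monthyear (ct_monthyear);

SELECT ct_monthyear, SUM(ct_total), SUM(ct_charge)
FROM client_transaction
GROUP BY ct_monthyear
ORDER BY ct_monthyear ASC;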

The answer to the question of which is fastest is simple: try both and measure.
The performance on the SQL side is not really affected by the date conversion. It is determined by the group by, and in particular, the sorting for the ordering.
I am skeptical that transferring the data and doing the work on the application side would be faster. In particular, you have to transfer a largish amount of data from the database to the application. Then, you have to replicate in the application the work that would otherwise be done in the database.
However, the in-memory algorithms on the application side can be faster than the more generic algorithms on the database side. It is possible that doing the work on the application side could be faster.

The following query is faster because it just extracts the components of the date instead of building a formatted string:
SELECT YEAR(dtfield), MONTH(dtfield), COUNT(*)
FROM mytable
GROUP BY YEAR(dtfield), MONTH(dtfield);
Please keep in mind that MySQL cannot use any index optimization for the GROUP BY criteria when functions are applied to the column. If you have a big table and need these computations often, create separate indexed columns for them (as noted by hellcode).
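If the table is on MySQL 5.7 or later, one way to get such indexed columns without touching the application's INSERTs is stored generated columns. A minimal sketch, assuming the column names from the question (the new column and index names are placeholders):
ALTER TABLE client_transaction
    ADD COLUMN ct_year  SMALLINT AS (YEAR(ct_cdatetime)) STORED,
    ADD COLUMN ct_month TINYINT  AS (MONTH(ct_cdatetime)) STORED,
    ADD INDEX idx_year_month (ct_year, ct_month);

SELECT ct_year, ct_month, SUM(ct_total), SUM(ct_charge)
FROM client_transaction
GROUP BY ct_year, ct_month;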

Related

How to (efficiently) get the start, end, and count of timeseries data from all SQL tables?

I have a massive number of SQL tables (50,000+), each with 100,000+ time series data points. I'm just looking for the most efficient way to get the start, end, and count of each table.
I've tried the following in a loop, but it's very slow; I time out when I try to query just 500 tables. Is there any way to improve this?
SELECT
    MIN(timestamp) AS `start`,
    MAX(timestamp) AS `end`,
    COUNT(value)   AS `count`
FROM
    table_NAME
Edit: To provide some context. Data is coming from a large number of sensors for engineering equipment. Each sensor has its own stream of data, including collection interval.
The type of SQL database is dependent on the building, there will be a few different types.
As for what the data will be used for, I need to know which trends are current and how old they are. If they are not current, I need to fix them. If there are very few data points, I need to check configuration of data collection.
(Note: The following applies to MySQL.)
Auto-generate query
Use information_schema.TABLES to list all the tables and generate the SELECT statements. Then copy/paste to run them.
Or write a Stored Procedure to do the above, including the execution. It might be better to have the SP build a giant UNION ALL to find all the results as one "table".
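As a rough sketch of the generate-and-run idea (the schema name is an assumption based on the question, and the timestamp/value columns come from the posted query), the following produces one UNION ALL branch per table; remove the trailing UNION ALL from the last generated line before running the combined statement:
SELECT CONCAT('SELECT ''', TABLE_NAME, ''' AS tbl, ',
              'MIN(timestamp) AS `start`, MAX(timestamp) AS `end`, COUNT(*) AS cnt ',
              'FROM `', TABLE_NAME, '` UNION ALL') AS stmt
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'sensor_db';   -- assumed schema name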
min/max
As already mentioned, if you don't have an index on timestamp, it will have to read all 5 billion rows -- which is a lot slower than fetching just the first and last values from 50K indexes.
COUNT
Use COUNT(*) instead of COUNT(value) -- The latter goes to the extra effort of checking value for NOT NULL.
The COUNT(*) will need to read an entire index. That is, if you do have INDEX(timestamp), the COUNT will be the slow part. Consider the following: don't do the COUNT; instead, do SHOW TABLE STATUS; it will find estimates of the number of rows for every table in the current database. That will be much faster.
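If estimated counts are acceptable, the same numbers are also available per schema from information_schema (TABLE_ROWS is exact for MyISAM and an estimate for InnoDB); the schema name below is a placeholder:
SELECT TABLE_NAME, TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'sensor_db';   -- placeholder schema name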

How to speed up MySQL queries in real-time app

I've created a Shiny app that renders plots, tables and maps by querying a MySQL database in real time. Since we are talking about big data (i.e. 12 million rows), queries take a long time to execute, both in SQL and Spark (using Scala and Python).
Do you have any suggestions in order to speed up these queries? I've been thinking about switching to Cassandra, but data migration from a relational to a NoSQL DB is challenging ...
Database background: Data about vehicle detection in a given timestamp and a given Bluetooth station. There are two tables: one for the position and the name of stations and one with timestamp, station and number of vehicles.
An example of a query I have is the following, where I group by month to acquire the total number of vehicles detected in each month.
SELECT MONTH(timestamp) AS month, SUM(count) AS c
FROM bluetoothstations.measurement
GROUP BY month(timestamp);
Thank you in advance!
The data does not change once it is inserted, correct? In that case, augment a "summary table" nightly (or as data is INSERTed). Then the summary table will allow for a much faster generation of the COUNT (or other aggregates).
More discussion: http://mysql.rjweb.org/doc.php/summarytables
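A minimal sketch of such a summary table, assuming the timestamp and count columns from the question (the new table and column names are made up); refresh it nightly or as data arrives:
CREATE TABLE bluetoothstations.measurement_monthly (
    yr          SMALLINT NOT NULL,
    mo          TINYINT  NOT NULL,
    total_count BIGINT   NOT NULL,
    PRIMARY KEY (yr, mo)
);

INSERT INTO bluetoothstations.measurement_monthly (yr, mo, total_count)
SELECT YEAR(timestamp), MONTH(timestamp), SUM(count)
FROM bluetoothstations.measurement
GROUP BY YEAR(timestamp), MONTH(timestamp)
ON DUPLICATE KEY UPDATE total_count = VALUES(total_count);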
I think that MONTH(timestamp) is causing a full table scan. My guess is that if you saved MONTH(timestamp) as a separate column in bluetoothstations.measurement called, say, month and then added an index on month, then you could run
SELECT month,SUM(count) as c
FROM bluetoothstations.measurement
GROUP BY month;
which I would expect to run faster.
Use DESCRIBE (aka EXPLAIN) to get the execution plan of your query; that should give you a better idea of what's slowing down your query, and where you need indexes.
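For instance, with the query from the question:
EXPLAIN SELECT MONTH(timestamp) AS month, SUM(count) AS c
FROM bluetoothstations.measurement
GROUP BY MONTH(timestamp);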

Should I use the sql COUNT(*) or use SELECT to fetch rows and count the rows later

I am writing a NodeJs application which should be very lightweight on the MySQL DB (engine: InnoDB).
I am trying to count the number of records of a table in the MySQL DB.
So I was wondering whether I should use the COUNT(*) function or get all the rows with a SELECT query and then count the rows using JavaScript.
Which way is better with respect to,
DB Operation cost
Overall performance
Definitely use the count() function - unless you also need the data within the records for other purposes.
If you query all rows, then on MySQL side the server has to prepare a resultset (memory consumption, time to fetch data into resultset), then push it down through the connection to your application (more data takes more time), your application has to receive the data (again, memory consumption and time to create the resultset), and finally your application has to count the number of records in the resultset.
If you use count(), MySQL counts records and returns just a single number.
count() is obviously better than fetching and counting separately, as count() gets the total from an index (if there is a primary key), whereas fetching all the data takes far more time (disk I/O and network operations).
When getting information from a database, the usual best approach is to get what you need and nothing more. This includes things like selecting specific columns rather than select *, and aggregating at the DBMS rather than in your client code. In this case, since all you apparently need is a count, use count().
It's a good bet that will outperform any other attempted solution since:
you'll be sending only what's absolutely necessary over the network (this may be less important for local databases but, once you have your data elsewhere, it can have a real impact); and
the DBMS will almost certainly be optimised for that use case.
Do a count(FIELD_NAME), as it will be much faster than fetching all the rows. It only retrieves the count, which can usually be served from an index on the table.

mysql database design?

I have a table in which I store ratings from users for each product.
The table consists of the following fields: productid, userid, rating (out of 5). This table might contain a million rows in the future.
So in order to select the top 5 products I am using the following query:
SELECT productid, avg( rating ) as avg_rating
from product_ratingstblx245v
GROUP BY productid
ORDER BY avg( rating ) DESC
LIMIT 5
My question is: since I will be showing this result on a few pages of my site, would it be better to maintain a separate table for the average ratings with fields productid, avgrating, totalvotes?
You don't need the answer to that question yet. You can start with a VIEW, that is the result of executing the above query. If, after performing load tests (e.g. with JMeter), you see that your site runs slow indeed, you can replace the VIEW with a TEMPORARY TABLE (stored in memory). Since the view and the temporary table will look the same from the outside, you will not have to change your business logic.
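A minimal sketch of that VIEW, using the table and columns from the question (the view name is arbitrary):
CREATE VIEW top_rated_products AS
SELECT productid, AVG(rating) AS avg_rating
FROM product_ratingstblx245v
GROUP BY productid
ORDER BY avg_rating DESC
LIMIT 5;

SELECT * FROM top_rated_products;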
Tbh, if MySQL wasn't able to handle queries on a simple table schema such as yours above for over a million records at (sub)millisecond speeds, I really would wonder why companies use it for LOB applications.
As it is, I'm an MS SQL developer, so I don't really know that much about MySQL's abilities. However, assuming that its database engine is as good as SQL Server's (I've heard good things about MySQL), you don't need to worry about your performance issue. If you do want to tweak it, then why not cache the results for 10 minutes (or longer) at your application layer? Triggers are generally pure (albeit sometimes necessary) evil. SQL servers are designed specifically for the type of query you wish to execute, so trust in the SQL.
Personally, I do not like the idea of running totals like this but if it does become necessary then I would not store the average, I would store the TOTAL VOTES and TOTAL RATING. That way it's a very simple UPDATE query (add 1 to TOTAL VOTES and add rating to TOTAL RATING). You can then calculate the average on the fly in minimal time.
As for how you might handle this, I would use a trigger as someone already suggested. But only after trying the VIEW thing that someone else suggested.
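One way the running-totals idea could look, with made-up table and column names; the totals would be bumped either by the application or by a trigger on the ratings table:
CREATE TABLE product_rating_totals (
    productid    INT NOT NULL PRIMARY KEY,
    total_votes  INT NOT NULL DEFAULT 0,
    total_rating INT NOT NULL DEFAULT 0
);

INSERT INTO product_rating_totals (productid, total_votes, total_rating)
VALUES (42, 1, 4)                      -- hypothetical: product 42 just received a rating of 4
ON DUPLICATE KEY UPDATE
    total_votes  = total_votes + 1,
    total_rating = total_rating + VALUES(total_rating);

-- top 5 by average, computed on the fly
SELECT productid, total_rating / total_votes AS avg_rating
FROM product_rating_totals
ORDER BY avg_rating DESC
LIMIT 5;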

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is, Is there anyway to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from ver. 5.1. (A sketch follows after this list.)
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
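As a sketch of the partitioning suggestion, a range-partitioned table for the JMX samples might look like this. The table, column, and partition names are invented; note that MySQL requires the partitioning column to be part of every unique key, including the primary key:
CREATE TABLE jmx_stats (
    server_id    INT NOT NULL,
    recorded_at  DATETIME NOT NULL,
    heap_used    BIGINT,
    thread_count INT,
    PRIMARY KEY (server_id, recorded_at)
)
PARTITION BY RANGE (TO_DAYS(recorded_at)) (
    PARTITION p2014_07 VALUES LESS THAN (TO_DAYS('2014-08-01')),
    PARTITION p2014_08 VALUES LESS THAN (TO_DAYS('2014-09-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);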
3 suggestions:
index
index
index
p.s. for timestamps you may run into performance issues -- depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers. (# secs since 1970 or whatever)
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features,
but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables by using partitioning.
If this also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data: for instance, pre-compute totals by hour, by user, or by week (whatever, you get the idea), and store that in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, good for you!
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete; partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
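With the partitioning sketch shown earlier (names are the same invented ones), purging a month of old data becomes a quick metadata operation rather than a huge DELETE:
ALTER TABLE jmx_stats DROP PARTITION p2014_07;   -- drops that month's rows almost instantly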
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there anyway to optimize the table or query so you can perform these queries
in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash aggregates, or anything else useful really; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases where random access is involved.
I suggest you test with Postgres. For big aggregates, the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL   Postgres
58s     100s      INSERT
75s     51s       CREATE INDEX on (category, id) (useless)
9.3s    5s        SELECT category, sum(counter) FROM t GROUP BY category;
1.7s    0.5s      SELECT category, sum(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' -- wrong: fetches and returns every matching row
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1 -- stops after the first match
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is that MySQL will have to perform the RAND() operation (which takes processing power) for every single row in the table before sorting them and giving you just one row.
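One commonly suggested alternative is a sketch like the following; the user table is just the example from above, and the offset value is a stand-in for a random number the application picks:
-- 1) get the row count once
SELECT COUNT(*) AS total FROM user;
-- 2) the application picks a random offset n in [0, total - 1], e.g. 3847
SELECT * FROM user LIMIT 3847, 1;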
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
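For example (the values shown are what MySQL returns for these calls):
SELECT INET_ATON('192.168.0.1');   -- 3232235521, suitable for an INT UNSIGNED column
SELECT INET_NTOA(3232235521);      -- '192.168.0.1'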