Simple Sphinx & mySQL Query - mysql

Forgive me for asking what should be a simple question, but I am totally new to Sphinx.
I am using Sphinx with a MySQL datastore. The table looks like the one below, with the title and content fields indexed by Sphinx.
CREATE TABLE `documents` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`group_id` int(11) NOT NULL,
`group_id2` int(11) NOT NULL,
`date_added` datetime NOT NULL,
`title` varchar(255) NOT NULL,
`content` text NOT NULL,
`url` varchar(255) NOT NULL,
`links` int(11) NOT NULL,
`hosts` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `url` (`url`)
) ENGINE=InnoDB AUTO_INCREMENT=439043 DEFAULT CHARSET=latin1
Now, if I connect through Sphinx with
mysql -h0 -P9306
I can run a simple query like...
SELECT * FROM test1 WHERE MATCH('test document');
And I will get back a result set like...
+--------+----------+------------+
| id     | group_id | date_added |
+--------+----------+------------+
| 360625 |        1 | 1499727792 |
| 362257 |        1 | 1499727807 |
| 362777 |        1 | 1499727811 |
| 159717 |        1 | 1499717614 |
| 160557 |        1 | 1499717621 |
+--------+----------+------------+
What I actually want is for it to return a result set with column values from the documents table (like the url, title, links, hosts, etc. columns) and, if at all possible, sort these by the relevancy of the Sphinx match.
Can that be accomplished in a single query? What might it look like?
Thanks in advance!

Two (main) options
Take the ids from the SphinxQL result and run a MySQL query to get the full details; see http://sphinxsearch.com/info/faq/#row-storage
e.g. SELECT * FROM documents WHERE id IN (3,5,7) ORDER BY FIELD(id,3,5,7)
This MySQL query should be VERY quick, because it's a PK lookup retrieving only a few rows (i.e. one page of results); the heavy lifting of searching the whole table has already been done by the first Sphinx query. The full flow is sketched just below.
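Putting option 1 together, the flow is two quick queries. This is only a sketch: the index name test1 and the ids are taken from the question's example, so the literal values are illustrative.
-- 1) against searchd (port 9306):
SELECT id FROM test1 WHERE MATCH('test document') LIMIT 0,20;
-- 2) against MySQL itself (port 3306), preserving Sphinx's relevance order:
SELECT url, title, links, hosts
FROM documents
WHERE id IN (360625,362257,362777)
ORDER BY FIELD(id,360625,362257,362777);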
Duplicate all the columns you want to retrieve in the result set as attributes. You've already made group_id and date_added attributes; you would need to make more.
sql_field_string is a very convenient shortcut that makes BOTH a field and a string attribute from one column. It is not available for other column types, but that matters less, as numeric columns are not typically needed as fields anyway. A config sketch follows.
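For the second option, a minimal source sketch, assuming a plain MySQL-sourced index like the stock example config (connection settings omitted; the source name is illustrative, the directives are standard Sphinx ones):
source documents_src
{
    type      = mysql
    # sql_host / sql_user / sql_pass / sql_db omitted

    sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, \
                title, content, url, links, hosts \
                FROM documents

    # existing attributes
    sql_attr_uint      = group_id
    sql_attr_timestamp = date_added

    # extra attributes so they come back in the SphinxQL result set
    sql_attr_uint      = links
    sql_attr_uint      = hosts
    sql_field_string   = title     # indexed as a field AND returned as a string attribute
    sql_attr_string    = url       # returned as an attribute only, not full-text indexed
}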
Option 1 is good in that it avoids duplicating the data and saves memory (Sphinx typically wants to hold attributes in memory), so it may be most practical on big datasets.
Option 2 is good in that it avoids a second query for each search. But because you then have a copy of the data, it may mean additional complication keeping everything in sync.
It doesn't look relevant in your case, but say you had a 'clicks' column that you want to increment often (when users click!) and need in the result set, but don't really need in Sphinx for query purposes. With the first option you only have to increment it in the database, and the MySQL query will always return the live value; with the second option you would have to keep the Sphinx index in sync at all times.

Related

Fulltext index not matching column list on mysql

I use Contao 4.9 and have problems with a view that only arise when using MySQL 5.7.35 instead of MariaDB. The query that creates the view is the following:
CREATE
OR REPLACE VIEW `tl_news4ward_articleWithTags` AS
SELECT
tl_news4ward_article.*,
GROUP_CONCAT(tag) AS tags
FROM
tl_news4ward_article
LEFT OUTER JOIN tl_news4ward_tag ON (
tl_news4ward_tag.pid = tl_news4ward_article.id
)
GROUP BY
tl_news4ward_article.id
The interesting part of the tl_news4ward_article table was created as follows:
CREATE TABLE `tl_news4ward_article` (
`title` varchar(255) NOT NULL DEFAULT '',
`keywords` text,
`description` text,
PRIMARY KEY (`id`),
KEY `pid` (`pid`),
KEY `alias` (`alias`)
) ENGINE=MyISAM AUTO_INCREMENT=53 DEFAULT CHARSET=utf8
And tl_news4ward_tag:
CREATE TABLE `tl_news4ward_tag` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`pid` int(10) unsigned NOT NULL DEFAULT '0',
`tstamp` int(10) unsigned NOT NULL DEFAULT '0',
`tag` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `pid` (`pid`),
FULLTEXT KEY `tag` (`tag`)
) ENGINE=MyISAM AUTO_INCREMENT=103 DEFAULT CHARSET=utf8
When I run the query SELECT MATCH (keywords,title,description) AGAINST (' something1 something2' IN BOOLEAN MODE) AS score FROM tl_news4ward_article it just works, but if I run SELECT MATCH (keywords,title,description) AGAINST (' something1 something2' IN BOOLEAN MODE) AS score FROM tl_news4ward_articleWithTags I get an error:
#1191 - Can't find FULLTEXT index matching the column list
I can provide more information if needed. It just works on MariaDB.
DB Fiddle:
https://www.db-fiddle.com/f/3LN1cAM6aoohaB4a1q6oWb/1
EDIT2:
The code comes from this Contao module, which we think is supposed to work with MySQL: https://github.com/psi-4ward/news4ward/issues/106
EDIT3:
A clearer comparison fiddle: https://dbfiddle.uk/?rdbms=mysql_5.7&rdbms2=mariadb_10.3&fiddle=2f1ab88e65a8acb3bb992a1cf6fb4101
EDIT4:
The above query is simplified - as you can see in the Contao module, the tags column should be included in the score and match as well.
The error comes from not having an index on the combination of the 3 columns:
FULLTEXT(keywords,title,description)
You should move to InnoDB, which now supports FULLTEXT. The two tables may need to use the same engine. Caution: there are several details of InnoDB's FULLTEXT that differ from MyISAM's. Here is a list: http://mysql.rjweb.org/doc.php/myisam2innodb#fulltext
MySQL's VIEWs are mostly syntactic sugar. While I have not encountered your specific problem, it seems clear that VIEW is getting in the way.
Did you try explicitly specifying the two 'algorithms' for expanding Views? https://dev.mysql.com/doc/refman/5.7/en/view-algorithms.html
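For reference, this is roughly what explicitly picking an algorithm looks like (syntax as on the linked manual page). Note that MySQL cannot use MERGE for a view containing GROUP BY / GROUP_CONCAT; it warns and falls back, which is exactly the kind of behaviour worth testing rather than assuming:
ALTER ALGORITHM = TEMPTABLE VIEW `tl_news4ward_articleWithTags` AS
SELECT tl_news4ward_article.*, GROUP_CONCAT(tag) AS tags
FROM tl_news4ward_article
LEFT OUTER JOIN tl_news4ward_tag ON (tl_news4ward_tag.pid = tl_news4ward_article.id)
GROUP BY tl_news4ward_article.id;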
If you are having performance problems with the query (once it works), it may be because of doing the JOIN before doing the filtering. If you encounter such, we can discuss that in a separate Question.
(From fiddle)
I prefer this way of getting the record, plus tags:
SELECT a.*,
( SELECT GROUP_CONCAT(tag)
FROM tl_news4ward_tag AS t
WHERE t.pid = a.id
) AS tags
FROM tl_news4ward_article AS a
But there is no way (with the current schema) to include tags in the MATCH.
4-col FT
If you decide that you must have tags in the FULLTEXT index, then put the comma-list of tags in the main table and get rid of the tags table. You then need some stored procedures/functions to handle add-tag, del-tag, etc. Tip: FIND_IN_SET('foo', tags) will come in handy for testing for a particular tag; see the sketch below.
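A minimal sketch of that approach, assuming a tags column (not in the original schema) is added to the article table:
ALTER TABLE tl_news4ward_article ADD COLUMN tags varchar(255) NOT NULL DEFAULT '';

-- test for a particular tag
SELECT id, title FROM tl_news4ward_article WHERE FIND_IN_SET('foo', tags) > 0;

-- naive add-tag; a real stored procedure should also guard against duplicates
UPDATE tl_news4ward_article SET tags = CONCAT_WS(',', NULLIF(tags, ''), 'foo') WHERE id = 1;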
You can't use full-text indexes on a view.
MySQL doesn't allow any form of index on a view, just on its underlying tables. The reason is that MySQL only materializes a view when you select from it, because the underlying tables may change their data at any time. If you had a view that returned 10 million rows, you'd have to apply a full-text index to it every time you selected from it, and that takes a lot of time.
So the only real option is to add the index on the base table:
ALTER TABLE tl_news4ward_article
ADD FULLTEXT(keywords,title,description)
and then query through a join against the base table:
SELECT * FROM tl_news4ward_articleWithTags t1
INNER JOIN ( SELECT MATCH (keywords,title,description) AGAINST (' something1 something2' IN BOOLEAN MODE) AS score, id
             FROM tl_news4ward_article ) t2
ON t1.id = t2.id;
id | title | keywords | description | tags | score | id
-: | :-------- | :-------- | :---------- | :--- | ----: | -:
1 | something | something | something | null | 0 | 1
db<>fiddle here
Views do not have indexes, so index hints do not apply. Use of index hints when selecting from a view is not permitted.
This actually might be a MySQL 5.7 regression.
The linked Contao module's code is 8 years old (2014):
https://github.com/psi-4ward/news4ward_related/blame/master/RelatedHelper.php#L18
MySQL 5.7 was released 6 years ago (2015).
The code was working in 5.6 but no longer works in 5.7:
https://dbfiddle.uk/?rdbms=mysql_5.6&rdbms2=mysql_5.7&fiddle=1d875703c4781dad878e64943057de5a
Not sure if this is a bug in MySQL 5.7 or an intended removal of that functionality.
The lowest-friction solution might really be to update the Contao module and remove the usage of the VIEW in the queries.

MySQL Date Range Query Optimization

I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs; however, all of those queries are extremely slow, despite the table being indexed with a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assuming the table is quite big, having date at the beginning of the PK puts the rows in a given date range all next to each other. This minimizes (somewhat) the I/O.
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
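A hedged DDL sketch of Plan B against the table above (the index name is illustrative, and swapping the primary key on a ~27M-row table rewrites the whole table, so test it on a copy first):
ALTER TABLE messages
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id),
  ADD INDEX idx_date_serverid (date, serverid);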
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(serverid, date)
In theory, it can hop through that secondary index checking each serverid. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
The index on (id, date) doesn't help because the first column is id, not date.
You can either
(a) drop the current index and index (date, id) instead (when date is the first column, the index can be used to filter on date regardless of the following columns), or
(b) just create an additional index on (date) alone to support the query.

Duplicating records in a mysql table intermittently causes statements to hang and not return

We are trying to duplicate existing records in a table: make 10 records out of each one. The original table contains 75,000 records, and once the statements are done it will contain about 750,000 (10 times as many). The statements sometimes finish after 10 minutes, but many times they never return; hours later we receive a timeout. This happens about 1 out of 3 times. We are using a test database that nobody is working on, so there is no concurrent access to the table. I don't see any way to optimise the SQL, since to me the EXPLAIN plan looks fine.
The database is MySQL 5.5 hosted on AWS RDS db.m3.x-large. The CPU load goes up to 50% during the statements.
Question: What could cause this intermittent behaviour? How do I resolve it?
This is the SQL to create a temporary table, make roughly 9 new records per existing record in ct_revenue_detail in the temporary table, and then copy the data from the temporary table to ct_revenue_detail
---------------------------------------------------------------------------------------------------------
-- CREATE TEMPORARY TABLE AND COPY ROLL-UP RECORDS INTO TABLE
---------------------------------------------------------------------------------------------------------
CREATE TEMPORARY TABLE ct_revenue_detail_tmp
SELECT r.month,
r.period,
a.participant_eid,
r.employee_name,
r.employee_cc,
r.assignments_cc,
r.lob_name,
r.amount,
r.gp_run_rate,
r.unique_id,
r.product_code,
r.smart_product_name,
r.product_name,
r.assignment_type,
r.commission_pcent,
r.registered_name,
r.segment,
'Y' as account_allocation,
r.role_code,
r.product_eligibility,
r.revenue_core,
r.revenue_ict,
r.primary_account_manager_id,
r.primary_account_manager_name
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
This is the second query. It copies the records from the temporary table back to the ct_revenue_detail TABLE
INSERT INTO ct_revenue_detail(month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name)
SELECT month,
period,
participant_eid,
employee_name,
employee_cc,
assignments_cc,
lob_name,
amount,
gp_run_rate,
unique_id,
product_code,
smart_product_name,
product_name,
assignment_type,
commission_pcent,
registered_name,
segment,
account_allocation,
role_code,
product_eligibility,
revenue_core,
revenue_ict,
primary_account_manager_id,
primary_account_manager_name
FROM ct_revenue_detail_tmp;
This is the EXPLAIN PLAN of the SELECT:
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
| 1 | SIMPLE | a | ref | ct_period,ct_unique_id | ct_period | 4 | const | 38828 | Using where |
| 1 | SIMPLE | r | ref | ct_period,ct_unique_id | ct_unique_id | 5 | optusbusiness_20160802.a.unique_id | 133 | Using where |
+----+-------------+-------+------+------------------------+--------------+---------+------------------------------------+-------+-------------+
This is the definition of ct_revenue_detail:
ct_revenue_detail | CREATE TABLE `ct_revenue_detail` (
`participant_eid` varchar(255) DEFAULT NULL,
`lob_name` varchar(255) DEFAULT NULL,
`amount` decimal(32,16) DEFAULT NULL,
`employee_name` varchar(255) DEFAULT NULL,
`period` int(11) NOT NULL DEFAULT '0',
`pk_id` int(11) NOT NULL AUTO_INCREMENT,
`gp_run_rate` decimal(32,16) DEFAULT NULL,
`month` int(11) DEFAULT NULL,
`assignments_cc` int(11) DEFAULT NULL,
`employee_cc` int(11) DEFAULT NULL,
`unique_id` int(11) DEFAULT NULL,
`product_code` varchar(50) DEFAULT NULL,
`smart_product_name` varchar(255) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`assignment_type` varchar(100) DEFAULT NULL,
`commission_pcent` decimal(32,16) DEFAULT NULL,
`registered_name` varchar(255) DEFAULT NULL,
`segment` varchar(100) DEFAULT NULL,
`account_allocation` varchar(25) DEFAULT NULL,
`role_code` varchar(25) DEFAULT NULL,
`product_eligibility` varchar(25) DEFAULT NULL,
`rollup` varchar(10) DEFAULT NULL,
`revised_amount` decimal(32,16) DEFAULT NULL,
`original_amount` decimal(32,16) DEFAULT NULL,
`comment` varchar(255) DEFAULT NULL,
`amount_revised_flag` varchar(255) DEFAULT NULL,
`exclude_segment` varchar(10) DEFAULT NULL,
`revenue_type` varchar(50) DEFAULT NULL,
`revenue_core` decimal(32,16) DEFAULT NULL,
`revenue_ict` decimal(32,16) DEFAULT NULL,
`primary_account_manager_id` varchar(100) DEFAULT NULL,
`primary_account_manager_name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`pk_id`,`period`),
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
) ENGINE=InnoDB AUTO_INCREMENT=15338782 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (period)
PARTITIONS 120 */ |
Edit 29.9: The intermittent behaviour was caused by the omission of a DELETE statement: the original table was not cleared before the records were duplicated again. The first time everything is fine: we started with 75,000 records and ended up with 750,000.
Because the delete statement was missed, the next time we already had 750,000 records and the script made 7.5M records out of them. That would still work, but the subsequent run, trying to turn 7.5M into 75M records, would fail. Hence the 1-in-3 failures.
We would then try all the scripts manually, and of course then we would clear the table properly, and all would go well. The reason we didn't see this beforehand is that our application does not output anything when running the SQL.
The real delay would be with your second query, inserting from the temporary table back into the original table. There are several issues here.
Sheer amount of data
Looking at your table, there are several varchar(255) columns; a conservative estimate would put the average length of your rows at 2KB. That's roughly 1.5GB being copied from one table to another and distributed across different partitions. Partitioning makes reads more efficient, but for inserts the engine has to figure out which partition each row should go to, so it's actually writing to lots of different files instead of sequentially to one file. For spinning disks, this is slow.
Rebuilding the indexes
One of the biggest costs of inserts is rebuilding the indexes. In your case you have many of them.
KEY `ct_participant_eid` (`participant_eid`),
KEY `ct_period` (`period`),
KEY `ct_employee_name` (`employee_name`),
KEY `ct_month` (`month`),
KEY `ct_segment` (`segment`),
KEY `ct_unique_id` (`unique_id`)
And some of these indexes, like the one on employee_name, are on varchar(255) columns. That means pretty hefty indexes.
Solution part 1 - Normalize
Your database isn't normalized. Here is a classic example:
primary_account_manager_id varchar(100) DEFAULT NULL,
primary_account_manager_name varchar(100) DEFAULT NULL,
You should really have a table called account_manager, and these two fields should live there; primary_account_manager_id should probably be an integer column. Only the id belongs in your ct_revenue_detail table.
Similarly, you really shouldn't have employee_name, registered_name, etc. in this table. They should be in separate tables, linked to ct_revenue_detail by foreign keys, roughly as sketched below.
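A rough sketch of that normalization (the table, column and constraint names are illustrative, not part of the original schema):
CREATE TABLE account_manager (
  id   int unsigned NOT NULL AUTO_INCREMENT,
  name varchar(100) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- ct_revenue_detail would then keep only the id, e.g.:
--   primary_account_manager_id int unsigned DEFAULT NULL,
--   CONSTRAINT fk_primary_am FOREIGN KEY (primary_account_manager_id)
--     REFERENCES account_manager (id)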
Solution part 2 - Rethink indexes.
Do you need so many? MySQL generally uses only one index per table per query, so some of these indexes are probably never used. Is this one really needed:
KEY `ct_unique_id` (`unique_id`)
You already have a primary key; why do you even need another unique-id column?
Indexes for the SELECT: For
SELECT ...
FROM ct_revenue_detail r
JOIN ct_account_allocation_revenue a
ON a.period = r.period AND a.unique_id = r.unique_id
WHERE a.period = 3 AND lower(a.rollup_revenue) = 'y';
a needs INDEX(period, rollup_revenue), in either order. However, you also need to declare rollup_revenue with a ..._ci collation and avoid wrapping the column in a function; that is, change lower(a.rollup_revenue) = 'y' to a.rollup_revenue = 'y'.
r needs INDEX(period, unique_id), in either order. But, as e4c5 mentioned, if unique_id is really "unique" in this table, then take advantage of that. A DDL sketch of both indexes follows.
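A sketch of the two suggested composite indexes (index names are illustrative):
ALTER TABLE ct_account_allocation_revenue
  ADD INDEX idx_period_rollup (period, rollup_revenue);

ALTER TABLE ct_revenue_detail
  ADD INDEX idx_period_unique (period, unique_id);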
Bulkiness is a problem when shoveling data around.
decimal(32,16) takes 16 bytes and gives you precision and range that are probably unnecessary. Consider FLOAT (4 bytes, ~7 significant digits, adequate range) or DOUBLE (8 bytes, ~16 significant digits, adequate range).
month int(11) takes 4 bytes. If that is just a value 1..12, then use TINYINT UNSIGNED (1 byte).
DEFAULT NULL -- I suspect most columns will never be NULL; if so, say NOT NULL for them.
amount_revised_flag varchar(255) -- if that is really a "flag", such as "yes"/"no", then use an ENUM and save lots of space.
It is uncommon to have both an id and a name in the same table (see primary_account_manager*); that is usually relegated to a "normalization table".
"Normalize" (already mentioned by #e4c5).
HASH partitioning
Hash partitioning is virtually useless. Unless you can justify it (preferably with a benchmark), I recommend removing partitioning. More discussion.
Adding or removing partitioning usually involves changing the indexes. Please show us the main queries so we can help you build suitable indexes (especially composite indexes) for the queries.
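For reference, if you do decide to drop it, removing the HASH partitioning is a single statement, though on a table this size it rewrites every row:
ALTER TABLE ct_revenue_detail REMOVE PARTITIONING;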

~150ms on a 2 million rows MySQL MyISAM table

I'm learning about MySQL performance with a pet project consisting of ~2 million rows + ~600k rows (two MyISAM tables). A range query using BETWEEN on two indexed INT(10) columns, LIMITed to 1 returned result, takes about 160ms (including an INNER JOIN). I figure my configuration isn't optimised and am looking for some advice on how to diagnose this, or perhaps a "common configuration".
I created a gist containing both tables, the query and the contents of my.cnf.
I created the B-tree index after inserting all the data, which was imported from a CSV file from MaxMind's open database. I tried two separate indexes, and now a combined one, with no difference in performance.
I'm running this locally on a MacBook Pro clocking in at 2.6GHz (i5) with 8GB of 1600MHz RAM. MySQL is installed using the downloadable binary from MySQL's download page (unable to supply a third link because my rep is too low). It's a default installation with no major additions to the my.cnf config file, included in the gist (located under the /usr/local/mysql-5.6.xxx/ directory on my system).
My concern is that I'm reaching ~160ms, which indicates to me that I'm missing something. I've considered compressing the table, but I have a feeling I'm missing other configurations. Also, myisampack wasn't in my PATH (I think), so I'm considering other optimisations before I explore this further.
Any advice is appreciated!
$ mysql --version
/usr/local/mysql-5.6.23-osx10.8-x86_64/bin/mysql Ver 14.14 Distrib 5.6.23, for osx10.8 (x86_64) using EditLine wrapper
Tables
CREATE TABLE `blocks` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`begin_range` int(10) unsigned NOT NULL,
`end_range` int(10) unsigned NOT NULL,
`_location_id` int(11) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `begin_range` (`begin_range`,`end_range`)
) ENGINE=MyISAM AUTO_INCREMENT=2008839 DEFAULT CHARSET=ascii;
CREATE TABLE `locations` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(2) NOT NULL DEFAULT '',
`region` varchar(255) DEFAULT NULL,
`city` varchar(255) DEFAULT NULL,
`postalcode` varchar(255) DEFAULT NULL,
`latitude` float NOT NULL,
`longitude` float NOT NULL,
`metro_code` int(11) DEFAULT NULL,
`area_code` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=641607 DEFAULT CHARSET=utf8;
Query
SELECT locations.latitude, locations.longitude
FROM blocks
INNER JOIN locations ON blocks._location_id = locations.id
WHERE INET_ATON('139.130.4.5') BETWEEN begin_range AND end_range
LIMIT 0, 1;
Edit:
Updated gist with EXPLAIN on the SELECT, also posted here for convenience.
EXPLAIN SELECT locations.latitude, locations.longitude FROM blocks INNER JOIN locations ON blocks._location_id = locations.id WHERE INET_ATON('94.137.106.123') BETWEEN begin_range AND end_range LIMIT 0, 1;
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
| 1 | SIMPLE | blocks | range | begin_range | begin_range | 4 | NULL | 1095345 | Using index condition; Using where |
| 1 | SIMPLE | locations | eq_ref | PRIMARY | PRIMARY | 4 | geoip.blocks._location_id | 1 | NULL |
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
2 rows in set (0.00 sec)
Edit 2: Included the data in the question for convenience.
The problem is that the normal approach (which your code exemplifies) leads to hitting 1095345 rows. I have an approach that can do that query in one disk hit, even if the cache is cold.
Excerpts from http://mysql.rjweb.org/doc.php/ipranges :
The Situation
Your data includes a large set of non-overlapping 'ranges'. These could be IP addresses, datetimes (show times for a single station), zipcodes, etc.
You have pairs of start and end values; one 'item' belongs to each such 'range'. So, instinctively, you create a table with start and end of the range, plus info about the item. Your queries involve a WHERE clause that compares for being between the start and end values.
The Problem
Once you get a large set of items, performance degrades. You play with the indexes, but find nothing that works well. The indexes fail to lead to optimal functioning because the database does not understand that the ranges are non-overlapping.
The Solution
I will present a solution that enforces the fact that items cannot have overlapping ranges. The solution builds a table to take advantage of that, then uses Stored Routines to get around the clumsiness imposed by it.
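A rough sketch of how that idea plays out against the schema in this question (a simplification of the linked article, which goes further and wraps the logic in stored routines; it assumes the blocks really are non-overlapping):
-- Because the ranges don't overlap, the candidate block is simply the one with the
-- largest begin_range not exceeding the IP: one index probe instead of a huge range scan.
SELECT l.latitude, l.longitude
FROM ( SELECT _location_id, end_range
       FROM blocks
       WHERE begin_range <= INET_ATON('139.130.4.5')
       ORDER BY begin_range DESC
       LIMIT 1
     ) AS b
JOIN locations AS l ON l.id = b._location_id
WHERE b.end_range >= INET_ATON('139.130.4.5');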

How to optimize a query that's using group by on a large number of rows

The table looks like this:
CREATE TABLE `tweet_tweet` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`text` varchar(256) NOT NULL,
`created_at` datetime NOT NULL,
`created_date` date NOT NULL,
...
`positive_sentiment` decimal(5,2) DEFAULT NULL,
`negative_sentiment` decimal(5,2) DEFAULT NULL,
`entity_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8
The explain on the query looks like this:
mysql> explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet`
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | tweet_tweet | ALL | NULL | NULL | NULL | NULL | 1097452 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
1 row in set (0.00 sec)
About 300k rows are added to the table every day. The query runs about 4 seconds right now but I want to get it down to around 1 second and I'm afraid the query will take exponentially longer as the days go on. Total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.
Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)
You may try reordering the fields in the index, i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow MySQL to use the index to reduce the number of rows that actually need to be grouped and ordered; a DDL sketch is below.
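A hedged sketch of that change (the new index name is illustrative):
ALTER TABLE tweet_tweet
  DROP INDEX tweet_tweet_entity_created,
  ADD INDEX tweet_tweet_created_entity (created_at, entity_id);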
You're not using the index tweet_tweet_entity_created. Change your query to:
explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
You can read more about index hints in the MySQL manual http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
Sometimes MySQL's query optimizer needs a little help.
MySQL has a dirty little secret: a composite index is only usable via its leftmost prefix, so an index over multiple columns doesn't help a query that filters only on a later column (here, created_at is second in the index). I've made tables that used unique keys and foreign keys, and I often had to add a separate index for one or more of the columns.
I suggest adding an extra index on just created_at at a minimum. I do not know whether adding indexes on the aggregated columns would also speed things up.
If your MySQL version is 5.1 or higher, you can consider the partitioning option for large tables.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html