Why is RediSearch slow when sorting by a numeric field?

I'm using RediSearch in a project whose index contains over 13 million documents. I need to fetch the latest documents when the user provides no filter. My index schema has a NUMERIC field with the SORTABLE flag, and I've tried to run the following query:
FT.SEARCH media * SORTBY media_id DESC LIMIT 0 10
It doesn't return a response for quite a while, and I usually end up terminating it.
Is there a way to get the latest documents in an acceptable time?

I was able to reproduce the behavior you describe by inserting documents with increasing values for the numeric field, and I created a flame chart to check which part of the code consumes the CPU.
The culprit is the sorting heap we use, which is an expensive data structure. In my experiment, every numeric value is inserted into the heap, which results in a lengthy query time. This is the expected behavior for the way you run your query.
As a workaround, you can run the query with LIMIT 0 1, which reduces the heap work to almost nothing, then use the value you get back to run a second query with a filter and LIMIT 0 10.
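For example (the concrete values here are made up; pick the lower bound of the numeric filter wide enough to cover the 10 newest IDs):
FT.SEARCH media * SORTBY media_id DESC LIMIT 0 1
Suppose the top document comes back with media_id = 13000000. Then fetch the latest 10 with a numeric range filter, which keeps the sorting heap small:
FT.SEARCH media "@media_id:[12999000 13000000]" SORTBY media_id DESC LIMIT 0 10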
We are considering ways to optimize such queries, but for now there is no built-in solution.
Cheers

A short-term workaround might be to store the latest document ID in a Redis string as you update the index. Run both commands in a pipeline to eliminate an unnecessary network round trip:
SET LATEST_DOCUMENT_ID $docId
HSET $docId KEY VALUE....
Then you can simply GET LATEST_DOCUMENT_ID if there are no search parameters.

Related

Will record order change between two identical queries in MySQL without ORDER BY?

The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assume that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index in MySQL and so is usually the most efficient way to retrieve the rows.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even if you think that nothing changes between queries (i.e., no rows are inserted or deleted) and that you'll therefore get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
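For instance, a minimal keyset-style paginator over the primary key might look like this (the parameter name is illustrative):
select *
from TABLE
where ID > :last_seen_id  -- the largest ID returned on the previous page
order by ID
limit 100
Because the predicate and the sort both use the clustered primary key, MySQL can seek straight to the page boundary instead of sorting or skipping rows.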
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see the rows come back in the same order as they were inserted. That makes sense, because the obvious way to store the rows is to append them as they are inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was shown on two different pages because the framework was naively sorting by a single non-unique column, and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, since both orderings complied with the requested condition.
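The usual fix (a sketch with illustrative names) is to add a unique column as a tie-breaker, so the ordering is deterministic:
SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10, 10;
With the unique id breaking ties there is exactly one valid ordering, so no row can appear on two different pages.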

Improve performance on MySQL fulltext search query

I have the following MySQL query:
SELECT p.*, MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
FROM posts p
WHERE p.post_id <> 23
AND MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
ORDER BY score DESC LIMIT 1
With 108,000 rows, it takes ~200 ms. With 265,000 rows, it takes ~500 ms.
Under performance testing (~80 concurrent users) it shows ~18 sec average latency.
Is there any way to improve the performance of this query?
EXPLAIN OUTPUT:
UPDATED
We have added a new mirror MyISAM table with only post_id and description, and synchronized it with the posts table via triggers. Now, fulltext search on this new MyISAM table takes ~400 ms (under the same performance load where InnoDB shows ~18 sec; this is a huge performance boost). It looks like MyISAM is much quicker for fulltext search in MySQL than InnoDB. Could you please explain this?
MySQL profiler results:
Tested on AWS RDS db.t2.small instance
Original InnoDB posts table:
MyISAM mirror table with post_id, description only:
Here are a few tips on what to look for in order to maximise the speed of such queries with InnoDB (a sketch of the last tip follows this list):
Avoid redundant sorting. Since InnoDB has already sorted the result according to ranking, the MySQL query processing layer does not need to sort again to get the top matching results.
Avoid row-by-row fetching to get the matching count. InnoDB provides all the matching records; all those not in the result list have a ranking of 0 and need not be retrieved. InnoDB also has a count of the total matching records on hand, so there is no need to recount.
Use a covering index scan. InnoDB results always contain the matching records' document ID and ranking, so if only the document ID and ranking are needed, there is no need to go to the user table to fetch the record itself.
Narrow the search result early to reduce user-table access. If the user wants the top N matching records, we do not need to fetch all matching records from the user table. We should be able to first select the top N matching doc IDs, and then fetch only the corresponding records for those doc IDs.
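A sketch of that last tip applied to the query from the question: the derived table touches only the fulltext index plus post_id, and the full rows are fetched last.
SELECT p.*, top.score
FROM posts p
JOIN (
    SELECT post_id,
           MATCH (description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
    FROM posts
    WHERE MATCH (description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
      AND post_id <> 23
    ORDER BY score DESC
    LIMIT 1
) AS top USING (post_id);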
I don't think you can get much faster looking only at the query itself; maybe try removing the ORDER BY part to avoid unnecessary sorting. To dig deeper into this, profile the query using MySQL's built-in profiler.
Other than that, you might look into the configuration of your MySQL server. Have a look at this chapter of the MySQL manual; it contains some good information on how to tune the fulltext index to your needs.
If you've already maximized the capabilities of your MySQL server configuration, then consider looking at the hardware itself; sometimes even a low-cost solution like moving the tables to another, faster hard drive can work wonders.
My best guess for the performance hit is the number of rows being returned by the query. To test this, simply remove the order by score and see if that improves the performance.
If it does not, then the issue is the full text index. If it does, then the issue is the order by. If so, the problem becomes a bit more difficult. Some ideas:
Determine a hardware solution to speed up the sorts (getting the intermediate files to be in memory).
Modifying the query so it returns fewer values. This might involve changing the stop-word list, changing the query to boolean mode, or other ideas.
Finding another way of pre-filtering the results.
The issue here is WHERE p.post_id <> 23
Design your system in such a way so that non-indexed columns — like post_id — need not be added to the WHERE clause.
Basically, MySQL will first search on the full-text indexed column and then filter on post_id. Hence, if the full-text search returns a lot of matches, the response time will not be as expected.
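One way around it (a sketch; LIMIT 2 in the inner query is just enough to survive dropping the one excluded ID): let the fulltext search pick the top candidates first, then apply the post_id filter to that small result.
SELECT *
FROM (
    SELECT p.*, MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') AS score
    FROM posts p
    WHERE MATCH (p.description) AGAINST ('random text that you can use in sample web pages or typography samples') > 0
    ORDER BY score DESC
    LIMIT 2
) AS t
WHERE t.post_id <> 23
ORDER BY t.score DESC
LIMIT 1;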

Does a SQL query (SELECT) continue or stop reading data from the table once it finds the value?

Greetings,
My question: does a SQL query (SELECT) continue or stop reading data (records) from the table when it finds the value I was looking for?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether the scanning stops: in general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one row. But then again, if you have an order by clause, the database engine cannot stop after the first hit; there might be later rows that should get priority given the sorting.
Summarizing: how the db engine does its scan depends on how smart it is, what indices are available, etc. If your select queries take too long, then consider reorganizing your indices, writing your select statements differently, or rebuilding the table.
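For instance (a sketch with made-up index names), an index on the searched columns lets the engine avoid the full table scan, and limit 1 lets it stop at the first hit:
create index idx_mytable_a_b on mytable (a, b);
select a,b,c from mytable where a=3 and b=5 limit 1;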
Whether the RDBMS reads data from disk is something you cannot know, should not care about, and must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors can lead to very different storage access patterns while running the same query several times a couple of minutes apart.
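For example (a hypothetical schema), a covering index can make the data file irrelevant for a query:
CREATE TABLE users (
    id INT NOT NULL PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    bio TEXT
);
CREATE INDEX idx_name ON users (last_name, first_name);
-- EXPLAIN reports "Using index" here: the query is answered from the
-- index alone, without touching the row data.
EXPLAIN SELECT last_name, first_name FROM users WHERE last_name = 'Smith';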
Yes, it scans the entire file, unless you put something like
select * from user where id=100 limit 1
Of course, even this will still scan all rows if id 100 happens to be the last record.
If id is a primary key, it will automatically be indexed and the search will be optimized.
I'm sorry... I meant the table. I will rephrase the question and explain it with the following image:
I understand that in CASE 1 all columns must be read on each iteration.
My question is: is it the same in CASE 2, or are the columns that are not selected in the query excluded from reading on each iteration?
Also, are both queries the same from a performance perspective?
To clarify:
CASE 1: the first SELECT prints all data.
CASE 2: the second SELECT prints only the columns first_name and last_name.
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire table to get that data (first_name, last_name)?
What interests me is how the server reads a table row in CASE 1 versus CASE 2.

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow, as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. The only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (I don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column). A sketch of this trick follows the list.
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
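A minimal sketch of that MediaWiki-style trick against the table from the question (the rand_val column and index names are made up):
ALTER TABLE tbl_mst_users ADD COLUMN rand_val DOUBLE NOT NULL DEFAULT 0;
UPDATE tbl_mst_users SET rand_val = RAND();  -- also set it on every insert
CREATE INDEX idx_rand ON tbl_mst_users (rand_val);
-- Compute the threshold once, then let the index seek to it:
SET @r := RAND();
SELECT userid, name, pic
FROM tbl_mst_users
WHERE rand_val >= @r
ORDER BY rand_val
LIMIT 1;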

Is there a way to get rows_examined in MySQL without the slow log?

I'm building some profiling information for a home-grown app. I'd like the debug page to show the query sent along with how many rows were examined, without assuming that the slow log is turned on, let alone parsing it.
Back in 2006, what I wanted was not possible. Is that still true today?
I see Peter Zaitsev has a technique where you:
Run FLUSH STATUS;
Run the query.
Run SHOW STATUS LIKE "Handler%";
and then in the output:
Handler_read_next=42250 means 42250 rows were analyzed during this scan
which sounds like, if MySQL is only examining indexes, it should give you the number. But is there a set of status vars you can poll and add up to find out how many rows were examined? Any other ideas?
It's slightly better than it was in 2006. You can issue SHOW SESSION STATUS before and after and then look at each of the Handler_read_* counts in order to be able to tell the number of rows examined.
There's really no other way. While the server protocol has a flag to say whether a table scan occurred, it doesn't expose rows_examined. Even tools like MySQL's Query Analyzer have to work by running SHOW SESSION STATUS before/after (although I think it only runs SHOW SESSION STATUS after, since it remembers the previous values).
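A minimal sketch of the before/after technique (the query in the middle is just a stand-in):
SHOW SESSION STATUS LIKE 'Handler_read%';  -- record the "before" counters
SELECT * FROM posts WHERE created_at > '2020-01-01';  -- query under test
SHOW SESSION STATUS LIKE 'Handler_read%';  -- subtract: the deltas across the Handler_read_* counters give the rows examined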
I know it's not related to your original question, but there are other expensive components to queries besides rows_examined. If you choose to do this via the slow log, you should check out this patch:
http://www.percona.com/docs/wiki/patches:microslow_innodb#changes_to_the_log_format
I can recommend looking for "Disk_tmp_table: Yes" and "Disk_filesort: Yes".
Starting in 5.6.3, the MySQL performance_schema database also exposes statement statistics, in tables such as performance_schema.events_statements_current.
The statistics collected for statements include the ROWS_EXAMINED column.
See http://dev.mysql.com/doc/refman/5.6/en/events-statements-current-table.html
From there, statistics are aggregated to provide summaries.
See http://dev.mysql.com/doc/refman/5.6/en/statement-summary-tables.html
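For instance (assuming the statement instrumentation and history consumer are enabled), recent statements and their row counts can be pulled with:
SELECT sql_text, rows_examined, rows_sent
FROM performance_schema.events_statements_history
ORDER BY timer_start DESC
LIMIT 10;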
From documentation:
Handler_read_rnd
The number of requests to read a row based on a fixed position. This value is high if you are doing a lot of queries that require sorting of the result. You probably have a lot of queries that require MySQL to scan entire tables or you have joins that don't use keys properly.
Handler_read_rnd_next
The number of requests to read the next row in the data file. This value is high if you are doing a lot of table scans. Generally this suggests that your tables are not properly indexed or that your queries are not written to take advantage of the indexes you have.
read_rnd* means reading actual table rows with a full scan.
Note that it will show nothing if there is an index scan combined with a row lookup; that still counts as a key read.
For a schema like this:
CREATE TABLE mytable (id INT NOT NULL PRIMARY KEY, data VARCHAR(50) NOT NULL);
INSERT
INTO mytable
VALUES …;
SELECT id
FROM mytable
WHERE id BETWEEN 100 AND 200;
SELECT *
FROM mytable
WHERE id BETWEEN 100 AND 200;
the latter two queries will both return 1 in read_key, 101 in read_next, and 0 in both read_rnd and read_rnd_next, despite the fact that actual row lookups occur in the second query.
Prepend the query with EXPLAIN. In MySQL, that will show the query's execution path: which tables were examined, as well as the number of rows examined for each table.
Here's the documentation.
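A quick sketch (the query is just an example; note that the rows column in EXPLAIN output is the optimizer's estimate, not an exact count):
EXPLAIN SELECT * FROM posts WHERE created_at > '2020-01-01';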