Is there a limit for Redisearch ft.mget command for number of doc ids in a single query? - redisearch

I want to pass maybe tens of thousands of document ids to the ft.mget command, or possibly even hundreds of thousands. Is this supported?

The only limit is the maximum number of arguments in a Redis command, which is 1M if I remember correctly.

Related

Why is RediSearch slow when sorting numeric field?

I'm using RediSearch in my project, which has an index with over 13 million documents. I need to fetch the latest documents when the user provides no filter. My index schema has a NUMERIC field with the SORTABLE flag, and I've tried to run the following query.
FT.SEARCH media * SORTBY media_id DESC LIMIT 0 10
It doesn't return a response for a long time and I usually end up terminating the query.
Is there a way to get the latest documents in an acceptable time?
I was able to reproduce the behavior you describe by inserting documents with increasing values for the numeric field. I have created a FlameChart to check which part of the code consumes the CPU.
The culprit is the sorting heap we use, which is an expensive data structure. In my experiment, each numeric value is inserted into the heap, which results in a lengthy query time. This is the expected behavior for this kind of query.
As a solution, you can run the query with LIMIT 0 1, which reduces the heap work to almost nothing, and then use the value you get back to run a second query with a filter and LIMIT 0 10, as sketched below.
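A minimal sketch of that two-step approach, assuming the media index and media_id field from the question (the numeric values are invented for illustration):
FT.SEARCH media * SORTBY media_id DESC LIMIT 0 1
Suppose that returns a document whose media_id is 13000000. A second query can then restrict the numeric range so that only a small tail of documents ever enters the sorting heap:
FT.SEARCH media "@media_id:[12990000 13000000]" SORTBY media_id DESC LIMIT 0 10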
We are considering ways to optimize such queries but for now, there is no solution.
Cheers
A short-term workaround might be to store the latest document ID in a Redis string as you update the index. Run both writes in a pipeline to eliminate an unnecessary network round trip:
SET LATEST_DOCUMENT_ID $docId
HSET $docId KEY VALUE....
Then you can simply GET LATEST_DOCUMENT_ID if there are no search parameters.
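A rough sketch of what that pipeline could look like from the shell, using redis-cli's pipe mode (doc:42 and the title field are just placeholders):
printf 'SET LATEST_DOCUMENT_ID doc:42\nHSET doc:42 title example\n' | redis-cli --pipe
From application code you would instead use your client library's pipelining support, so both writes go out in a single round trip.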

Which mysql method is fast?

I am saving some text in a database, say 1000 characters.
But I display only the first 200 characters.
Method 1
I could save the first 200 characters in one column and the remaining in a second column of the SQL table.
Method 2
I can save everything in one column and, while displaying, query for only the first 200 characters.
It would be "cleaner" to store everything in one column, and you can select only the first 200 characters like this
select substring(your_column, 1, 200) as your_column from your_table
It really is irrelevant, but if you try to optimize, then method 1 is better, as long as you limit your query to that column (or only query the columns you really need), because doing any substring work on the server side takes time and resources (times the number of requests...). Method 2 is cleaner, but since you are optimizing for time, go with method 1.
This will come down to one of two things:
If you are pulling the entire row back into PHP and then only showing the first 200 chars, then your network speed will potentially be a bottleneck on pulling the data back.
If, on the other hand, you have two columns, you will potentially have a bottleneck at the drive access which fetches the data back to your PHP - longer rows can cause slower access across multiple rows.
This will come down to a server-specific weigh-up. It will really depend on how your server performs. I would suggest running some scenarios where your code tries to pull back a few hundred thousand of each to see how long it takes.
Method 2.
First, duplicate storage of data is usually bad (denormalization). This is certainly true in this case.
Second, it would take longer to write to two tables than one.
Third, you have now made updates and deletes vulnerable to annoying inconsistencies (see #1).
Fourth, unless you are searching the first 200 characters for text, getting data out will be the same for both methods (just select a substring of the first 200 characters).
Fifth, even if you are searching the first 200 characters, you can index on those, and retrieval speed should be identical (see the sketch below).
Sixth, you don't want a database design that limits your UX--what if you need to change to 500 characters? You'll have a lot of work to do.
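A minimal sketch of the prefix index from the fifth point, with hypothetical table and column names - MySQL lets you index just the leading characters of a text column, and a search anchored at the start of the string can use it:
CREATE INDEX idx_text_prefix ON your_table (your_column(200));
SELECT id FROM your_table WHERE your_column LIKE 'searchterm%';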
This is a very obvious case of what not to do in database design.
reference: as answered by Joe Emison http://www.quora.com/MySQL/Which-mysql-method-is-fast

MySQL Improving speed of order by statements

I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, and order by any single field, and any number of qualifiers. There's a big slowdown in the sort, which is generally taking several seconds, sometimes as much as 7-10 seconds.
An example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing which fields are going to be sorted on, I'd have to index all 200 columns - what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to every column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes, most of them would probably never be used while still generating quite a large overhead (see chapter 7.3 in the link below for details). What you could do, however, is try to identify which columns appear most often in WHERE clauses, and index those.
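A hypothetical sketch, assuming the columns a, b and c from the example query turn out to be the ones that show up most often (the index names are made up):
ALTER TABLE your_table ADD INDEX idx_b_c (b, c), ADD INDEX idx_a (a);
With those in place, MySQL can resolve the b=1 AND c=2 branch and the a=0 branch of the example WHERE clause from the indexes (possibly via an index merge) instead of scanning every row.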
In the long run, however, you will probably need to find a way to rework your data structure into something more manageable, because as it is now, it has the smell of a 'spreadsheet turned into a database', which is not a nice smell.
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
In general the answer is yes. However, the actual details depend on your hardware, OS and which storage engine you use. See chapter 7.11 (especially 7.11.4) in the link below.
Also, the primary identifier is a crazy pattern they came up with in the nineties. [...] I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, into which you could store numeric values that map the PK order (precalculated from the stored procedure you're using).
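A rough sketch of that idea, assuming the ordering logic can be expressed as a stored function (your_sort_function, primarySortOrder and pk_column are placeholders):
ALTER TABLE your_table ADD COLUMN primarySortOrder INT, ADD INDEX idx_primary_sort (primarySortOrder);
UPDATE your_table SET primarySortOrder = your_sort_function(pk_column);
SELECT a, b, c FROM your_table ORDER BY primarySortOrder DESC LIMIT 25;
The expensive string manipulation then happens once per row at write time, and the default listing becomes a plain indexed sort.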
And the link you've been waiting for: Chapter 7 from the MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.

Is using LIMIT and OFFSET in MySQL less expensive than returning the full set of records?

It might be a silly question but I am just curious about what goes on behind the curtains.
If I want to paginate database records I can either use LIMIT and OFFSET or simply get all the records and extrapolate the ones I want with more code.
I know the second option is absolutely silly, I just want to know if it is more expensive.
If I use LIMIT and OFFSET, will the database grab just what I ask for, or will it internally get all the records matching my query (even hundreds of thousands) and then use a starting index (OFFSET) and an ending index (OFFSET + LIMIT) to pick out the requested subset of records?
I don't even know if I used the right words to describe the doubt I have, I hope someone can shed some light.
Thanks!
Yes, it would be more expensive, for two reasons.
1) MySQL will optimize internally to only compute the rows that it needs, rather than retrieving them all. Note that this optimization is much less effective if you have an ORDER BY in your query, because then MySQL has to match and sort all of the rows in the dataset, rather than stopping when it finds the first X rows of your limit.
2) When all the records are returned, they all need to be transmitted over the wire from the database to your application server. That can take time, especially for medium to large data sets.
The difference can be enormous. Not only is the network difference big sometimes (a few rows vs hundreds to thousands), but the number of rows the database needs to find can be large also. For example, if you ask for 10 rows, the database can stop after finding 10 rows, rather than having to check every row.
Whenever possible, use LIMIT and OFFSET.
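As a small illustration with made-up table and column names - if there is an index on the ORDER BY column, MySQL can stop reading as soon as it has produced the requested window:
SELECT id, title FROM articles ORDER BY created_at DESC LIMIT 10 OFFSET 20;
Note that MySQL still has to step over the first 20 rows to honour the OFFSET, so very large offsets get progressively more expensive even with a LIMIT.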

ORDER BY RAND() alternative [duplicate]

Possible Duplicate:
MySQL: Alternatives to ORDER BY RAND()
I currently have a query that ends ORDER BY RAND(HOUR(NOW())) LIMIT 40 to get 40 random results. The list of results changes each hour.
This kills the query cache, which is damaging performance.
Can you suggest an alternative way of getting a random(ish) set of results that changes from time to time? It does not have to be every hour and it does not have to be totally random.
I would prefer a random result, rather than sorting on an arbitrary field in the table, but I will do that as a last resort...
(this is a list of new products that I want to shuffle around a bit every now and then).
If you have an ID column it's better to do something like:
-- create a variable to hold a random row number
SET @rownum := (SELECT COUNT(*) FROM your_table);
SET @row := (SELECT CEIL(RAND() * @rownum));
-- use the random number to select on the id column
SELECT * FROM your_table WHERE id = @row;
The logic of selecting the random id number can be moved to the application level.
SELECT * FROM your_table ORDER BY RAND() LIMIT 40
is very inefficient because MySQL will process ALL the records in the table, performing a full table scan on all the rows and ordering them randomly.
It's going to kill the cache because you are expecting a different result set each time. There is no way you can cache a random set of values. If you want to cache a group of results, cache a large random set of values, and then, within sub-sections of the time you are going to use those values, do a random grab within the smaller set [outside of SQL].
I think the better way is to download the product identifiers to your middle layer, choose 40 random values when you need them (once per hour or for every request) and use them in the query: product_id IN (#id_1, #id_2, ..., #id_40).
You could also have a column with random values that you update every hour.
This is going to be a significantly nasty query if it needs to sort a large data set into a random order (which really does require a sort), then discard all but the first 40 records.
A better solution would be to just pick 40 random records. There are lots of ways of doing this and it usually depends on having keys which are evenly distributed.
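One common sketch, assuming a reasonably dense auto-increment id on a hypothetical products table (run it 40 times from the application to collect 40 rows):
SELECT p.* FROM products p JOIN (SELECT CEIL(RAND() * (SELECT MAX(id) FROM products)) AS rid) AS r ON p.id >= r.rid ORDER BY p.id LIMIT 1;
Gaps in the id sequence skew the distribution a little, which is usually acceptable for a "random-ish" list of products like this one.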
Another option is to pick the 40 random records in a batch job which is only run once per hour (or whatever) and then remember which ones they are.
One way to achieve it is to shuffle the objects you map the data to. If you don't map the data to objects, you could shuffle the result array from the database. I don't know if this will perform better or not, but you will at least get the benefits from the query cache as you mention.
You could also generate a random sequence from 1 to n, and index the result array (or object array) with those.
Calculate the current hour in your PHP code and pass that to your query. This will result in a static value that can be cached.
Note that you might also have a hidden bug: since you're only taking the hour, you only have 24 different seed values, which repeat every day. That means what's showing at 1 pm today will also be what shows at 1 pm tomorrow. You might want to change that.
Don't fight with the cache - exploit it!
Write your query as you have it (or even simpler). Then, in your code, cache the results, setting a cache expiry of 1 hour. If you are using a caching layer, like memcached, you are set. If not, you can build a fairly simple one:
[pseudocode]
global cache[24]
h = Time.hour
if (cache[h] == null) {
    cache[h] = ... run your query ...
}
return cache[h];
If you only need a new set of random data once an hour, don't hit the database - save the results to your application's caching layer (or, if it doesn't have one, just put them in a temporary file of some sort). The query cache is handy, but if you never need to execute a query at all, even better...