Couchbase 5.5
N1QL
I have 150k documents in a sandbox couchbase database where the document name is in the following format:
alpha_model::XXXXXXX::version
When I run this command:
SELECT META().id FROM Q1036628 WHERE META().id LIKE "alpha_model::100004993::%" LIMIT 10;
result count: 5. Elapsed time is 1.13s
However, when I add a '\' before the '_', performance improves greatly:
SELECT META().id FROM Q1036628 WHERE META().id LIKE "alpha\\_model::100004993::%" LIMIT 10;
result count: 5. Elapsed time is 8.16ms
Why is the second way over 100 times faster? Are underscores bad? Are there any other characters I should escape to improve performance?
_ is a wildcard that matches any single character at that position. If you want to match a literal underscore, you need to escape it. Check out LIKE at https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/comparisonops.html
Your data probably has no other character at that position, which is why both queries return the same results. If any document had a different character where the _ is, the results would differ.
An IndexScan can't be done on wildcards, so the IndexScan uses only the prefix string up to the first wildcard. That is why, without the escape character, the IndexScan produces more results and takes more time. With the _ escaped, the first wildcard is the %.
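To make the prefix idea concrete, here is a minimal Python sketch of how the usable literal prefix of a LIKE pattern could be computed. It is only an illustration of the principle, not Couchbase's actual planner code; the function name like_literal_prefix is hypothetical.

```python
def like_literal_prefix(pattern: str, escape: str = "\\") -> str:
    """Return the literal text before the first unescaped LIKE wildcard."""
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # Escaped character: the next character is taken literally.
            prefix.append(pattern[i + 1])
            i += 2
        elif ch in "%_":
            # First real wildcard: the index-friendly prefix ends here.
            break
        else:
            prefix.append(ch)
            i += 1
    return "".join(prefix)


unescaped = like_literal_prefix("alpha_model::100004993::%")
escaped = like_literal_prefix("alpha\\_model::100004993::%")
print(unescaped)  # alpha
print(escaped)    # alpha_model::100004993::
```

With the unescaped pattern, an index range built from the prefix covers every key starting with "alpha"; with the escaped pattern it covers only keys starting with "alpha_model::100004993::".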
Run EXPLAIN and check the spans to confirm that the query is correct and optimized.
Check out page 152, on how a predicate is transformed into IndexScan spans: https://blog.couchbase.com/wp-content/uploads/2017/10/N1QL-A-Practical-Guide-2nd-Edition.pdf
Check out page 341, on optimizing a query using request profiling.
Related
Our DB contains a lot of entries with a comma in their titles (in Hungarian, the comma is the decimal separator instead of the period), and we would like to match those with the right relevance. The search SQL currently looks like this when the user's input terms are 7,5x20 otherTerm:
SELECT (MATCH(title) AGAINST('(+7,5x20* +otherTerm* ) (7,5x20* otherTerm* ) (+7,5x20 +otherTerm )' IN BOOLEAN MODE)) AS Relevance,
id, title, product_id FROM versions
WHERE (MATCH(title) AGAINST('(+7,5x20* +otherTerm* ) (7,5x20* otherTerm* ) (+7,5x20 +otherTerm )' IN BOOLEAN MODE))
ORDER BY Relevance DESC LIMIT 50
Now the result order gives a higher relevance to e.g. 5x20 than to 7,5x20, so some kind of character escaping has to be done on the comma to prevent MySQL from treating the parts as separate strings. I didn't find the right one.
Thanks for any help in advance.
Edit: disassembling title into more digestible data is currently not an option. I'm really looking for a solution that escapes the comma or replaces it with a 'match any single character' operator, like the dot in regex.
FULLTEXT indexing is not designed to handle numbers, regardless of the Locale for the numbers.
One approach is to alter the incoming text, replacing punctuation that you want to treat as "letters" with, say, _. (And build a separate column for storing this altered text. Then add the FULLTEXT index to that column instead of the "real" text.)
Please note that +x will fail in a bad way: one-character strings are not indexed, so they cannot be found. Including strings that are too short therefore leads to zero matches being returned.
Alterations to the saved text (e.g., 7_5x20) need to be applied to the search, too.
50K rows? Write a special, one-time, script to perform the above transformation to the existing 50K rows. Then incorporate the transformations into both the INSERTs and the SELECTs.
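A minimal Python sketch of such a transformation follows. It assumes that only a comma sitting between two digits should be rewritten and that _ is treated as a word character by the FULLTEXT parser in use (worth verifying for your engine and version). The helper name normalize_for_fulltext and the sample title are hypothetical.

```python
import re

# A comma with a digit on both sides is treated as a decimal separator.
_DECIMAL_COMMA = re.compile(r"(?<=\d),(?=\d)")


def normalize_for_fulltext(text: str) -> str:
    """Replace decimal commas with '_' so '7,5x20' stays one token."""
    return _DECIMAL_COMMA.sub("_", text)


# The same function is applied to the stored text and to the search input.
stored_title = normalize_for_fulltext("Fuse 7,5x20 mm, glass")
search_terms = normalize_for_fulltext("7,5x20 otherTerm")
print(stored_title)  # Fuse 7_5x20 mm, glass
print(search_terms)  # 7_5x20 otherTerm
```

Ordinary commas (followed by a space) are left alone, so only the numeric tokens change.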
I have a SQL table, with genetic information (name of the gene, function, strand...)
I want to retrieve the amount of chromosomes (21 as I'm working with the human genome). Problem is that some chromosomes are "repeated". For example:
SELECT DISTINCT chrom FROM table LIMIT 6;
chr1
chr10
chr10_GL383545v1_alt
chr10_GL383546v1_alt
chr11
chr11_JH159136v1_alt
As you can see I have more than one chr10, so if I count the DISTINCT chromosomes I get about 6000.
I've tried using NOT LIKE "_" but it didn't work. I thought I could "force" the result with LIKE "chr1" and so on, but that feels like cheating and is not exactly what I'm looking for. I would like a way to exclude every value containing "_", but running
SELECT COUNT(DISTINCT chrom) NOT LIKE "_" FROM table; gives me back just 1 result...
LEFT is not optimal either, because I would have to specify the length of the string, and I want a system that I could use without knowing anything about the expected result. So running a LEFT "", 4 and LEFT "", 5 is not what I'm looking for.
Is there a way I can count everything that does NOT CONTAIN a certain character? Is there a better strategy?
Thank you very much!
Underscore is a wildcard character itself, so it must be escaped. Furthermore you want to match any characters before and after that underscore character so the % wildcard is needed around the escaped underscore.
SELECT count(chrom) FROM table WHERE chrom NOT LIKE '%\_%';
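To see why a bare "_" pattern filters nothing out here, the following Python sketch emulates LIKE with a regular expression. It is a simplified model (backslash as the escape character, no collation handling), and the helper names like_to_regex and like are hypothetical.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\") -> re.Pattern:
    """Translate a LIKE pattern into an equivalent compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # literal character
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    # LIKE must match the whole value, hence fullmatch.
    return like_to_regex(pattern).fullmatch(value) is not None


chroms = ["chr1", "chr10", "chr10_GL383545v1_alt", "chr11", "chr11_JH159136v1_alt"]
bare = [c for c in chroms if not like(c, "_")]
escaped = [c for c in chroms if not like(c, "%\\_%")]
print(bare)     # all five values: "_" only matches one-character strings
print(escaped)  # ['chr1', 'chr10', 'chr11']
```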
Also you could use substring_index() to get distinct string before the underscore and count those:
SELECT COUNT(DISTINCT SUBSTRING_INDEX(chrom, '_', 1)) FROM table;
Although that is almost definitely going to be slower.
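For illustration, this is roughly what SUBSTRING_INDEX(chrom, '_', 1) followed by COUNT(DISTINCT ...) computes, sketched in Python on the sample values from the question:

```python
chroms = [
    "chr1",
    "chr10",
    "chr10_GL383545v1_alt",
    "chr10_GL383546v1_alt",
    "chr11",
    "chr11_JH159136v1_alt",
]

# Keep the part before the first underscore; values without one are unchanged.
bases = {c.split("_", 1)[0] for c in chroms}
print(sorted(bases))  # ['chr1', 'chr10', 'chr11']
print(len(bases))     # 3
```

Note that this counts a base chromosome even if it appears only in an "_alt" form, whereas the NOT LIKE filter counts only values stored without an underscore.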
The problem with SELECT COUNT(DISTINCT chrom) NOT LIKE "_" FROM table; is the location of the comparison and the lack of the % wildcards in the LIKE comparison string.
Either of the following should work for you:
SELECT COUNT(DISTINCT chrom) FROM table WHERE chrom NOT LIKE '%|_%' ESCAPE '|';
Using ESCAPE and specifying an escape character after the LIKE is easier than using \ in many cases since, depending on your scenario, you may need to remember to double escape with \. (or if you are writing this in say php, triple escape)
SELECT COUNT(DISTINCT chrom) FROM table WHERE LOCATE('_', chrom) = 0;
LOCATE() is also easier to use here. But I believe it would be slower than just doing a LIKE. The performance difference is probably pretty insignificant, so in most cases, it's just preference.
Use REGEXP if you wish to keep it simple. LIKE is faster, though.
SELECT count(chrom) FROM table WHERE chrom NOT REGEXP '_';
I also recommend INSTR which I think will perform better than REGEXP.
SELECT count(chrom) FROM table WHERE INSTR(chrom, '_')=0;
1. select count(*) from tableX where code = "XYZ";
2. select count(*) from tableX where code like "%XYZ";
Result for query 1 is 18734. <== Not Correct
Result for query 2 is 93003. <== Correct
We know that query 2's count is correct based on independent verification.
We expect these two queries to return exactly the same count because we know that no rows in tableX have a code that merely ends with "XYZ" (that is, with other characters before it), so the wildcard at the beginning shouldn't affect the query.
Why would these queries produce different counts?
We have already researched the differences between "=" comparison and "like" string comparison, but based on all our verification checks, we still don't understand why this would give us different counts
We have confirmed the following:
There are no leading or trailing characters in the "code" field
There are no hidden characters (tried all found here: How can I find non-ASCII characters in MySQL?)
The collation is "utf8_unicode_ci"
We are using MySQL version 5.5.40-0ubuntu0.12.04.1.
Try this in order to get your answer:
SELECT code
FROM tableX
WHERE code LIKE "%XYZ"
AND code <> "XYZ"
LIMIT 10
My guess is that some of your codes end with a lowercase xyz, and since LIKE is case-insensitive, it matched these where = did not.
where code = "XYZ"; gives an exact match, whereas where code LIKE "%XYZ"; includes partial matches as well. In your case, there could be an extra space present that is giving the wrong count. Consider trimming before comparing, like
where UPPER(TRIM(code)) = 'XYZ';
We restarted the server that the database resides on, we re-ran the queries, and now they all are producing the expected, correct results...
We'll have to look into possibilities for why this "fixed" the issue.
Say you have a TEXT column on your table which could either be huge / paragraph long rants, or only a few sentence long lines.
Performance wise / server load wise, is it better to do:
SELECT SUBSTRING(myTxtColumn, 1, 200) AS myTxtColumn FROM myTable
Or to do:
SELECT myTxtColumn FROM myTable
Which of the queries puts more load on the server?
I'm curious whether it will be easier for the server to fetch the full value of the column than to do a SUBSTRING() on it, or whether the SUBSTRING() will be easier since it's only returning the first 200 chars rather than the several-KB-long text values.
I'd err on the side of caution and use the substring. Your first query may need to read the whole string before doing substring on it (I suspect mysql is smarter than that, but who knows). Your second query definitely has to read the whole string and send it over the network.
I'm trying to find rows where the first character is not a digit. I have this:
SELECT DISTINCT(action) FROM actions
WHERE qkey = 140 AND action NOT REGEXP '^[:digit:]$';
But, I'm not sure how to make sure it checks just the first character...
First there is a slight error in your query. It should be:
NOT REGEXP '^[[:digit:]]'
Note the double square brackets. You could also rewrite it as the following, which avoids matching the empty string as well:
REGEXP '^[^[:digit:]]'
Also note that using REGEXP prevents an index from being used and will result in a table scan or index scan. If you want a more efficient query you should try to rewrite the query without using REGEXP if it is possible:
SELECT DISTINCT(action) FROM actions
WHERE qkey = 140 AND action < '0'
UNION ALL
SELECT DISTINCT(action) FROM actions
WHERE qkey = 140 AND action >= ':'
Then add an index on (qkey, action). It's not as pleasant to read, but it should give better performance. If you only have a small number of actions for each qkey then it probably won't give any noticeable performance increase, so you can stick with the simpler query.
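The two range conditions work because, in ASCII, the digits occupy the contiguous range '0' to '9' and ':' is the very next character. A quick Python check of that equivalence is sketched below. It assumes plain code-point ordering; a MySQL collation may order some characters differently, so treat it as an illustration rather than proof for your column. The sample values are made up.

```python
import re

samples = ["", "7up", "0", "9z", "abc", "-1", ":x", "/path"]

# Mirrors: action NOT REGEXP '^[[:digit:]]'
not_regexp = [a for a in samples if re.match(r"[0-9]", a) is None]

# Mirrors: action < '0' (first query) or action >= ':' (second query)
by_range = [a for a in samples if a < "0" or a >= ":"]

print(not_regexp)  # ['', 'abc', '-1', ':x', '/path']
print(by_range)    # ['', 'abc', '-1', ':x', '/path']
```

Both versions return the empty string too, which is the case the '^[^[:digit:]]' variant excludes.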
Your current regex will match only values that are exactly one character long, not just the first character of longer values. Just remove the $ from the end of it; that means "end of value". Without it, only the first character is checked unless you tell it to check more.
^[[:digit:]] will work; that means "start of the value, followed by one digit". Note that the [:digit:] class has to sit inside a bracket expression, hence the double brackets.
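The effect of the trailing $ can be seen in a small Python sketch. Python's re module has no POSIX [[:digit:]] class, so [0-9] stands in for it here; the sample values are made up.

```python
import re

values = ["7", "42abc", "abc", ""]

# With "$": the whole value must be a single digit.
whole_value = [v for v in values if re.search(r"^[0-9]$", v)]

# Without "$": only the first character is checked.
first_char = [v for v in values if re.search(r"^[0-9]", v)]

print(whole_value)  # ['7']
print(first_char)   # ['7', '42abc']
```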