MySQL "KEEP ONLY" command?

I know that MySQL has a DELETE FROM <table> WHERE <exprs> command that deletes the rows of the specified table for which the expressions are true.
However, it becomes a burden to always apply De Morgan's laws to take the complement of a "keep only" expression.
My question is: is there a KEEP ONLY type of command for MySQL? I looked everywhere, but only came across examples that take the complement of the expression and then use the DELETE command.

There is nothing like KEEP ONLY. However, you usually don't need De Morgan's laws to invert a result set.
Imagine you have some query that produces the rows you want to keep. Remember that tables generally have some kind of primary key. What you want to do is select the key columns. Often, this is a surrogate key (an ID column), and it's as easy as selecting that one column:
SELECT ID FROM table WHERE X
What you can do is nest the query:
DELETE FROM table
WHERE ID NOT IN (SELECT ID FROM table WHERE X)
It's also common to write that as a JOIN, instead of a NOT IN, but as the whole point of this answer was to simplify your logical process for producing this code, I'll leave that as an exercise for the reader. I will add, however, that if your primary key has more than one column, you may have to write that JOIN code.
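The NOT IN pattern above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `items`, its columns, and the condition `status = 'active'` are made up for illustration. One caveat: MySQL itself can reject a DELETE whose subquery reads the table being deleted from (error 1093), in which case wrapping the subquery in a derived table is a common workaround.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO items (id, status) VALUES (?, ?)",
    [(1, "active"), (2, "archived"), (3, "active"), (4, "draft")],
)

# "Keep only" the rows matching the condition by deleting every other key.
conn.execute(
    "DELETE FROM items "
    "WHERE id NOT IN (SELECT id FROM items WHERE status = 'active')"
)

kept_ids = [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]
print(kept_ids)
```

Only the rows returned by the inner SELECT survive, so the inner query can be developed and checked on its own before it is wrapped in the DELETE.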

I just figured out another way to structure a "KEEP ONLY" type of command!
Say you want something like this:
KEEP the tuples that satisfy <massive_expression>
All you have to do is negate <massive_expression> in the DELETE command, like so:
DELETE FROM table
WHERE ! (massive_expression);
It makes total sense, I should have seen this before!
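One thing to watch with the negation approach is NULL handling: if the expression evaluates to NULL for a row, its negation is also NULL, so that row is not deleted even though it does not satisfy the "keep" condition. The sketch below illustrates this with Python's sqlite3 module standing in for MySQL; the table `t` and the condition `score > 5` are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here; both use three-valued logic for NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, 3), (3, None)])

# Intended meaning: keep only the rows where score > 5.
conn.execute("DELETE FROM t WHERE NOT (score > 5)")
after_plain_not = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]

# Treat an unknown (NULL) result as "does not satisfy the keep condition".
conn.execute("DELETE FROM t WHERE NOT COALESCE(score > 5, 0)")
after_coalesce = [r[0] for r in conn.execute("SELECT id FROM t ORDER BY id")]

print(after_plain_not, after_coalesce)
```

The row with a NULL score survives the plain NOT and is removed only once the NULL result is mapped to false, so nullable columns in the expression deserve a second look.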

MySQL- INDEX(): How to Create a Functional Key Part Using Last nth Characters?

How would I write the INDEX() statement to use the last N characters of a functional key part? I'm brand new to SQL/MySQL, and believe that's the proper wording for my question. An explanation of what I'm looking for is below.
The MySQL 8.0 Reference Manual explains how to use the first N characters, showing a secondary index that uses col2's first 10 characters via this example:
CREATE TABLE t1 (
col1 VARCHAR(40),
col2 VARCHAR(30),
INDEX (col1, col2(10))
);
However, I would like to know how one could form this using the ending characters? Perhaps something like:
...
INDEX ((RIGHT(col2, 3)))
);
However, I think that says to index over a column called 'xyz' instead of "put an index on each column value using the last 3 of 30 potential characters"? That's what I'm really trying to figure out.
For some context, it'd be helpful to index something with smooshed/mixed data and am playing around as to how such a thing could be accomplished. Example of the kind of data I'm talking about, below, is a simplified, adjusted version of exported data from an inventory/billing manager that hails from the 90's that I had to endure some years back...:
Col1 | Col2
GP6500012_SALES_FY2023_SBucks_503_Thurs | R-DK_Sumat__SKU-503-20230174
GP6500012_SALES_FY2023_SBucks_607_Mon | R-MD_Columb__SKU-607-2023035
GP6500012_SALES_FY2023_SBucks_627_Mon-pm | R-BLD_House__SKU-503-20230024
GP6500012_SALES_FY2023_SBucks_929_Wed | R-FR_Ethp__SKU-929-20230324
Undoubtedly, better options exist that bypass this question altogether- and I'll presumably learn those techniques with time in my data analytics coursework. For now, I'm just curious if it's possible to somehow index the rows by suffix instead of prefix, and what a code example would look like to accomplish that. TIA.
Proposed solution (INDEX ((RIGHT (col2,3)))):
Not available.
Case 1:
When you need to split apart a column to search it, you have probably designed the schema wrong. In particular, that part of the columns needs to be in its own column. That being said, it is possible to use a 'virtual' (or 'generated') column that is a function of the original column, then INDEX that.
Case 2:
If you are suggesting that the last 3 characters are the most selective and that might speed up any lookup, don't bother. Simply index the entire column.
That data:
I would consider splitting up the stuff that is concatenated together by _. Do it as you INSERT the rows. If it needs to be put back together, do so during subsequent SELECTs.
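As a rough sketch of that splitting step, the Python snippet below breaks a Col1 value on `_` before it would be inserted, and joins it back afterwards. The field names are guesses made up for illustration; the source does not say what each segment means.

```python
# Hypothetical field names: the original data does not label its segments.
FIELDS = ("account", "category", "fiscal_year", "customer", "store", "day")


def split_col1(value: str) -> dict:
    """Split an underscore-delimited Col1 value into one entry per field."""
    parts = value.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} segments, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def join_col1(row: dict) -> str:
    """Rebuild the original string, e.g. for display in a SELECT result."""
    return "_".join(row[name] for name in FIELDS)


original = "GP6500012_SALES_FY2023_SBucks_503_Thurs"
row = split_col1(original)
rebuilt = join_col1(row)
print(row["store"], row["day"], rebuilt == original)
```

With each segment in its own column, an ordinary index on the relevant column does the job and no suffix index is needed.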
DATEs:
Do not, on the other hand, split up dates (into year, month, etc). Keep them together. (That's another discussion.) Always go to the effort to convert dates (and datetimes) to the MySQL format (year-first) when storing. That way, you can properly use indexes and use the many date functions.
MySQL's Prefix indexing:
In general it is a "bad idea" to use the INDEX(col(10)) construct. It rarely is of any benefit; it often fails to use the index as much as you would expect. This is especially deceptive: UNIQUE(col(10)) -- It declares that the first 10 chars are unique, not the entire col!
CAST:
If the data is the wrong datatype (string vs int; wrong collation; etc), then I argue that it is a bad schema design. This is a common problem with EAV (Entity-Attribute-Value) schemas. When a number is stored as a string, CAST is needed to sort (ORDER BY) it.
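The sorting problem is easy to reproduce. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with a made-up `eav` table; in MySQL the cast would typically be written `CAST(value AS UNSIGNED)` or `CAST(value AS SIGNED)`.

```python
import sqlite3

# SQLite stands in for MySQL; the eav table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (attr TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES ('qty', ?)", [("10",), ("9",), ("100",)]
)

# Numbers stored as strings sort character by character.
as_text = [r[0] for r in conn.execute("SELECT value FROM eav ORDER BY value")]

# Casting restores numeric order, at the cost of computing it for every row.
as_number = [
    r[0]
    for r in conn.execute("SELECT value FROM eav ORDER BY CAST(value AS INTEGER)")
]

print(as_text, as_number)
```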
Functional indexes:
Your proposed solution is not a "prefix" index; it is something more complicated. I suspect any expression, even one on non-string columns, will work. This is when the feature became available:
2018-10-22, 8.0.13, General Availability:
MySQL now supports creation of functional index key parts that index
expression values rather than column values. Functional key parts
enable indexing of values that cannot be indexed otherwise, such as
JSON values. For details, see CREATE INDEX Syntax.
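To get a feel for indexing on an expression, here is a small sketch using Python's sqlite3 module, which supports indexes on expressions. SQLite is only a stand-in: it has no RIGHT() function, so `substr(col2, -3)` is used for the last three characters, and the index name is invented. On MySQL 8.0.13 or later the functional key part syntax described in the release note would presumably be used instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [
        ("GP6500012_SALES_FY2023_SBucks_503_Thurs", "R-DK_Sumat__SKU-503-20230174"),
        ("GP6500012_SALES_FY2023_SBucks_607_Mon", "R-MD_Columb__SKU-607-2023035"),
        ("GP6500012_SALES_FY2023_SBucks_627_Mon-pm", "R-BLD_House__SKU-503-20230024"),
        ("GP6500012_SALES_FY2023_SBucks_929_Wed", "R-FR_Ethp__SKU-929-20230324"),
    ],
)

# Index the last three characters of col2 (SQLite spelling of a suffix index).
conn.execute("CREATE INDEX idx_col2_suffix ON t1 (substr(col2, -3))")

# The WHERE clause repeats the indexed expression exactly as written above.
query = "SELECT col1 FROM t1 WHERE substr(col2, -3) = '174'"
matches = [r[0] for r in conn.execute(query)]

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(matches)
print(plan)
```

The query has to use the same expression as the index definition for the planner to consider the index at all, which is one more reason the separate-column approach is usually simpler.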

Can I use index in MySQL in this way? [duplicate]

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created (here assumed to be named EmployeeTypeId). It works for me. Please give it a try.

Erratic behaviour of a mysql query

I have the following query :
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario, days.local, admin.eMail, admin.passcode, days.id,
admin.username
FROM days,admin
WHERE days.id='9' AND days.nombre=admin.username
The problem is that the query sometimes works but sometimes doesn't. Sometimes it works only with certain IDs. Is there any other way to formulate the query?
You are currently using implicit joins. Explicit joins are easier to read and understand for you and tend to make for much more consistent queries.
You could rewrite your query using JOINs. So, instead of:
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario , days.local, admin.eMail, admin.passcode,
days.id, admin.username
FROM days,admin
WHERE days.id='9'
AND days.nombre=admin.username
You can use:
SELECT days.from,days.to,days.nombre,days.totalDays,days.bloque,
days.comentario,days.local,admin.eMail,admin.passcode,
days.id,admin.username
FROM days
INNER JOIN admin ON days.nombre=admin.username
WHERE days.id='9'
You may be able to note already how much easier it is to understand what is happening here. While this shouldn't in and of itself fix your query, it is far easier to read and thus to debug.
If you find that certain cases are not working, the best way to figure out why is to remove some restrictions and see whether the rows then appear. In this instance, first confirm that the rows that are not showing up actually have days.id equal to 9. Other potential issues when using a natural key are things like extra whitespace. Check for this in the cases that do not work, as the JOIN condition days.nombre=admin.username may be failing.
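The debugging steps described here can be sketched as follows, using Python's sqlite3 module as a stand-in for MySQL. The table and column names come from the question, but the data is invented, with a leading space deliberately stored in days.nombre. (Trailing spaces are a separate matter, since some MySQL collations ignore them when comparing strings.)

```python
import sqlite3

# SQLite stands in for MySQL; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admin (username TEXT, eMail TEXT)")
conn.execute("CREATE TABLE days (id INTEGER, nombre TEXT)")
conn.execute("INSERT INTO admin VALUES ('maria', 'maria@example.com')")
conn.execute("INSERT INTO days VALUES (9, ' maria')")  # note the leading space

# The original join condition finds nothing for this id.
strict = conn.execute(
    "SELECT days.id, admin.eMail FROM days "
    "INNER JOIN admin ON days.nombre = admin.username WHERE days.id = 9"
).fetchall()

# Relaxing to a LEFT JOIN shows which days rows have no matching admin.
unmatched = conn.execute(
    "SELECT days.id, days.nombre FROM days "
    "LEFT JOIN admin ON days.nombre = admin.username "
    "WHERE admin.username IS NULL"
).fetchall()

# Trimming both sides confirms that whitespace was the culprit.
trimmed = conn.execute(
    "SELECT days.id, admin.eMail FROM days "
    "INNER JOIN admin ON TRIM(days.nombre) = TRIM(admin.username) "
    "WHERE days.id = 9"
).fetchall()

print(strict, unmatched, trimmed)
```

TRIM in the join condition is a diagnostic here rather than a fix; cleaning the stored values (or switching to a surrogate key) avoids paying for it on every query.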
Your other option, if, in fact, whitespaces are causing you issues, is to do away with your natural keys and implement surrogate keys. Surrogate keys mean that you will be using a standard and unique key code like an int that increments over time. Rather than have days.nombre as your foreign key, you would have days.admin_id as your foreign key.
As a rule, while there are many pros to natural keys and it is a debate which rages on, it is generally accepted that natural keys only work if the keys are consistent and unique.
Just guessing, but here's something that caused a problem for me recently: check your table and column definitions that the character sets are consistent. It looks like you have a mixture of English and Spanish, so perhaps some non-ASCII characters like ñ are not matching as expected.

MySQL: One row or more?

I have content items, each with an ID, and I can assign multiple types to each item.
The question is: should I use multiple rows to store the multiple types, or use a single type field, put the types there separated by commas, and parse them in PHP?
Multiple Rows
`content_id` | `type`
1 | 1
1 | 2
1 | 3
VS
Single Row
`content_id` | `type`
1 | 1,2,3
EDIT
I'm looking for the faster answer, not the easier one; please consider this. Performance is really important for me. I'm talking about a really huge database with millions or tens of millions of rows.
I'd generally always recommend the "multiple rows" approach as it has several advantages:
You can use SQL to return for example WHERE type=3 without any great difficulty as you don't have to use WHERE type LIKE '%3%', which is less efficient
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when you store them as a comma-separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is allow you to store the data, and obtain subsets of the data.
I always use multiple rows. If you use single rows your data is hard to read and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs de-normalised design.
Usually I would recommend sticking to the "multiple rows" style (normalised),
although sometimes (for performance or storage reasons) people deliberately implement the "single row" style.
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; the main example is reporting, which tends to be easier with some denormalization. So if your code is cleaner or performs better with the single row, then go for that. Otherwise, multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens if there are more than 11 types? The above query would find "12". Okay, you could say "where type like '%,2,%'". Except that doesn't work if 2 is the first or the last in the list. Even if you came up with a way to do it reliably, a LIKE search with an initial % means a sequential read of every record in the table, which is very slow.
How big will you make the cram field? What if the string of types is too big to fit in your maximum?
Do you carry any data about the types? If you create a second table with a key of "type" and, say, a description of that type, how will you join to that table? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you make it consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5" ? Or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" will not look equal but they are really equivalent. In either case, updating a type field now becomes a program rather than a single SQL statement.
If there is some trivial performance gain, it's just not worth it.
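The false match described above is easy to reproduce. This sketch uses Python's sqlite3 module as a stand-in for MySQL; the table names `content_type` and `content_csv` are made up, and a second content item with type 12 is added to trigger the problem.

```python
import sqlite3

# SQLite stands in for MySQL; both table names are hypothetical.
conn = sqlite3.connect(":memory:")

# Multiple-rows design: one (content_id, type) pair per row, indexable.
conn.execute("CREATE TABLE content_type (content_id INTEGER, type INTEGER)")
conn.executemany(
    "INSERT INTO content_type VALUES (?, ?)", [(1, 1), (1, 2), (1, 3), (2, 12)]
)
conn.execute("CREATE INDEX idx_content_type_type ON content_type (type)")

# Single-row design: all types crammed into one comma-separated string.
conn.execute("CREATE TABLE content_csv (content_id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO content_csv VALUES (?, ?)", [(1, "1,2,3"), (2, "12")]
)

normalized = [
    r[0] for r in conn.execute("SELECT content_id FROM content_type WHERE type = 2")
]
crammed = [
    r[0]
    for r in conn.execute(
        "SELECT content_id FROM content_csv WHERE type LIKE '%2%' ORDER BY content_id"
    )
]
print(normalized, crammed)
```

The equality test on the normalized table returns only content 1, while the LIKE search on the crammed column also returns content 2, whose only type is 12.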

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001. How do we do a fast search using a hash table?
Isn't it the same as using "SELECT * from .. " in MySQL? From what I have read, "SELECT *" searches from beginning to end, but a hash table does not. Why, and how?
By using hash table, are we reducing the records we are searching? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc. In mysql i could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes roughly 17 times faster on average, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
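The scheme above can be written out as a small runnable sketch in Python. The class name `SimpleHashTable` is invented, and `zlib.crc32` is just one convenient choice of repeatable hash function; any function that always maps the same key to the same number would do.

```python
import zlib


class SimpleHashTable:
    """Toy hash table: a fixed number of lists, chosen by hash(key) % count."""

    def __init__(self, num_lists: int = 17):
        self._lists = [[] for _ in range(num_lists)]

    def _list_for(self, key: str) -> list:
        # crc32 is deterministic: the same key always lands in the same list.
        return self._lists[zlib.crc32(key.encode("utf-8")) % len(self._lists)]

    def add(self, key: str, record: str) -> None:
        self._list_for(key).append((key, record))

    def find(self, key: str):
        # Slow linear search, but only within the one list the key maps to.
        for stored_key, record in self._list_for(key):
            if stored_key == key:
                return record
        return None


table = SimpleHashTable()
for key, name in [("001", "Alex"), ("002", "Micheal"), ("003", "Daniel")]:
    table.add(key, name)

found = table.find("001")
missing = table.find("999")
print(found, missing)
```

Each lookup hashes the key once and then scans a single short list, which is where the speed-up over one long list comes from.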
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
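The difference between scanning every row and searching an index can be observed directly by asking the database for its query plan. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in; MySQL has its own EXPLAIN statement with different output. The table `people` and index name `idx_people_code` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (code TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("001", "Alex"), ("002", "Micheal"), ("003", "Daniel")],
)

query = "SELECT name FROM people WHERE code = '001'"


def plan(sql: str) -> str:
    """Return the human-readable detail column of SQLite's query plan."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Without an index the only option is to read through the whole table.
plan_without_index = plan(query)

# With an index on the filtered column the planner can search it instead.
conn.execute("CREATE INDEX idx_people_code ON people (code)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The query text is identical in both cases; only the presence of the index changes how the database chooses to locate the row.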
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query, since the row will always have the same id: it is derived from the input you build the hash value from, and you can always recreate this id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this, you should have a deterministic strategy that you use each time you get two values with the same id. You should check Wikipedia for more info on this strategy; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hashtables somewhere because of the O(1) feature norheim.se (up) mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2 etc), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.