I'm working with a MySql table. The table contains numeric values and unix date values only. Each row has a unique id, columns containing ids related to other parts of the db, totals (like downloads per day) and the date that the row was inserted. My queries need to get the latest date for each id combination, in order to get the downloads for that day. One row is inserted each day for each id combination, and the index spans all of the ids and the date.
I have found through testing that it is quicker to perform two queries to get the exact row I want under certain circumstances. I would like a second opinion about this.
Here is a scenario that is very fast and uses the index:
SELECT * FROM foo WHERE A = 1 AND B = 1 AND mydate BETWEEN 123456789 AND 134567890 ORDER BY mydate DESC LIMIT 1
(The index is A, B, mydate)
Here is one that is very slow and doesn't use the index:
SELECT * FROM foo WHERE A IN (1, 2) AND B = 1 AND mydate BETWEEN 123456789 AND 134567890 GROUP BY A, B ORDER BY mydate DESC
This returns the correct result but doesn't use the index and is very slow. In reality, this simple example might use the index, but something like A IN(1,2,3,4,5,6,7,8,....10000) AND B IN (1,2,3,4,5,... 10000) doesn't, and that's what I need to cater for.
Here is where it gets interesting.
The following uses the index and is very fast:
SELECT *, MAX(mydate) FROM foo WHERE A IN (1,2,3,4,5,6,7,8,....10000) AND B IN (1,2,3,4,5,6,7,8,....10000) AND mydate BETWEEN 123456789 AND 134567890 GROUP BY A, B
The rows returned contain each unique combination of ids and the MAX of mydate for each combination. But, the row returned for each combination isn't necessarily the one with the corresponding MAX(mydate), and therefore does not necessarily give the correct downloads of that day. The MAX value is the correct value for that specific combination though, so my second query can be specific and use the index. Assuming A was 1, B was 1 and the MAX(mydate) equalled 1235555555 for that specific id combination, then I can execute
SELECT * FROM foo WHERE A = 1 AND B = 1 AND mydate = 1235555555
This second query returns the specific row I want, uses the index and is therefore fast.
I do have to do a foreach in PHP, so there's a processing overhead there, but it's still significantly quicker than trying to get MySQL to do all the work.
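For reference, this is the kind of single-statement version I'd be comparing against: joining the grouped maxima back to the table (just a sketch using the same columns; I haven't confirmed whether it stays on the (A, B, mydate) index):
SELECT f.*
FROM (
    SELECT A, B, MAX(mydate) AS maxdate
    FROM foo
    WHERE A IN (1, 2) AND B = 1
      AND mydate BETWEEN 123456789 AND 134567890
    GROUP BY A, B
) AS m
JOIN foo AS f
  ON f.A = m.A AND f.B = m.B AND f.mydate = m.maxdate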
Another benefit is that all of these simple queries execute as separate MySQL processes.
It just doesn't feel right, am I missing something?
Having trouble with a query. Here is the outline -
Table structure:
CREATE TABLE `world` (
`placeRef` int NOT NULL,
`forenameRef` int NOT NULL,
`surnameRef` int NOT NULL,
`incidence` int NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb3;
ALTER TABLE `world`
ADD KEY `surnameRef_forenameRef` (`surnameRef`,`forenameRef`),
ADD KEY `forenameRef_surnameRef` (`forenameRef`,`surnameRef`),
ADD KEY `forenameRef` (`forenameRef`,`placeRef`);
COMMIT;
This table contains data like the following and has over 600,000,000 rows:
placeRef   forenameRef   surnameRef   incidence
1          1             2            100
2          1             3            600
This represents the number of people with a given forename-surname combination in a place.
I would like to query all the forenames that a surname is attached to, and then perform another search for the places where those forenames exist, with the summed incidence. For example: get all the forenames of people who have the surname "Smith"; then get a list of where those forenames occur, grouped by place and with the summed incidence. I can do this with the following query:
SELECT placeRef, SUM( incidence )
FROM world
WHERE forenameRef IN
(
SELECT DISTINCT forenameRef
FROM world
WHERE surnameRef = 214488
)
GROUP BY world.placeRef
However, this query takes about a minute to execute and will take more time if the surname being searched for is common.
The root problem is: performing a range query with a group doesn't utilize the full index.
Any suggestions how the speed could be improved?
In my experience, if your query has a range condition (i.e. any kind of predicate other than = or IS NULL), the column for that condition is the last column in your index that can be used to optimize search, sort, or grouping.
In other words, suppose you have an index on columns (a, b, c).
The following uses all three columns. It is able to optimize the ORDER BY c: all rows matching the specific values of a and b are by definition tied, and those matching rows are already stored in order by c, so the ORDER BY is a no-op.
SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c;
But the next example only uses columns a, b. The ORDER BY needs to do a filesort, because the index is not in order by c.
SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c;
A similar effect is true for GROUP BY. The following uses a, b for row selection, and it can also optimize the GROUP BY using the index, because the rows for each distinct value of c are guaranteed to be stored together in the index. So it can count the rows for each value of c, and when it's done with one group, it is assured there will be no more rows later with that value of c.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
But the range condition spoils that. The rows for each value of c are no longer grouped together; they may be scattered across all of the values of b greater than 2.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
In this case, MySQL can't optimize the GROUP BY in this query. It must use a temporary table to count the rows per distinct value of c.
MySQL 8.0.13 introduced a new type of optimizer behavior, the Skip Scan Range Access Method. But as far as I know, it only applies to range conditions, not ORDER BY or GROUP BY.
It's still true that if you have a range condition, this spoils the index optimization of ORDER BY and GROUP BY.
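A quick way to see this for yourself (just a sketch, assuming the same (a, b, c) index) is to compare the Extra column of EXPLAIN for the two GROUP BY queries:
EXPLAIN SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
EXPLAIN SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
The second one typically reports "Using temporary" in Extra, because the range on b prevents the index from delivering the rows already grouped by c.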
Unless I don't understand the task, it seems like this works:
SELECT placeRef, SUM( incidence )
FROM world
WHERE surnameRef = 214488
GROUP BY placeRef;
Give it a try.
It would benefit from a composite index in this order:
INDEX(surnameRef, placeRef, incidence)
Is incidence being updated a lot? If so, leave it off the suggested index.
You should consider moving from MyISAM to InnoDB. It will need a suitable PK, probably
PRIMARY KEY(placeRef, surnameRef, forenameRef)
and it will take 2x-3x the disk space.
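A sketch of both suggestions as SQL (the key name is arbitrary, and the PRIMARY KEY assumes each placeRef/surnameRef/forenameRef combination really is unique):
ALTER TABLE world
  ADD KEY surnameRef_placeRef_incidence (surnameRef, placeRef, incidence);
ALTER TABLE world
  ENGINE=InnoDB,
  ADD PRIMARY KEY (placeRef, surnameRef, forenameRef);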
I'm using following query:
SELECT plan.datum, plan.anketa, plan.ai as autoinc, plan.objekt as sifra , objekt.sifra, objekt.temp4_da FROM plan
LEFT JOIN objekt ON plan.objekt = objekt.sifra WHERE objekt.temp4_da = '1'
AND objekt.sifra >= 30 AND plan.datum > '2019-01-15' and plan.datum < '2019-01-30'
GROUP BY objekt.sifra
ORDER BY plan.datum ASC, plan.objekt ASC
I get results sorted by the last records, even though I asked for them to be sorted by date.
The results should start from 2019-01-15, but as you can see they are sorted toward the last date, plan.datum < '2019-01-30'...
How can I achieve this?
EDIT:
When I select from 2019-01-15 to 2019-01-20 I get this:
Your result comes from MySQL's ability to process incorrect SQL queries with GROUP BY. For example, most DBMSs are not able to process a query like this:
SELECT col1, col2
FROM tab
GROUP BY col1
Some DBMSs process this query if col1 is the primary key of tab; MySQL, however, always processes it and returns a RANDOM col2 value if more than one col2 value corresponds to col1! For example, given the table
col1 | col2
-----------
a | 1
a | 2
then MySQL may return the result (a, 1) on Monday and (a, 2) on Tuesday for the query shown above (I'm being a little sarcastic).
I believe that is also your case. MySQL picks a random plan.datum for each objekt.sifra (the GROUP BY attribute in your query), so you subsequently miss some plan.datum values in the result. Fix your query to obtain deterministic values and you will get rid of your problems.
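(Side note: with the ONLY_FULL_GROUP_BY SQL mode, which is the default from MySQL 5.7 on, the server rejects such queries outright instead of picking a value.) One way to make your query deterministic, as a sketch based on the query above and assuming what you actually want is the earliest plan.datum in the range for each objekt.sifra, is to aggregate the date explicitly:
SELECT objekt.sifra, MIN(plan.datum) AS first_datum, objekt.temp4_da
FROM plan
JOIN objekt ON plan.objekt = objekt.sifra
WHERE objekt.temp4_da = '1'
  AND objekt.sifra >= 30
  AND plan.datum > '2019-01-15' AND plan.datum < '2019-01-30'
GROUP BY objekt.sifra, objekt.temp4_da
ORDER BY first_datum ASC, objekt.sifra ASC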
Given that it does seem to have sorted how you wanted (the first 4 rows show that), those dates are all in the range specified.
You just need to go through basic diagnosis:
Does the 'plan' table actually contain data with those dates?
If it does, then the data is being removed by your query.
The next easiest thing to check is the WHERE, so remove the other clauses (i.e. leave only the 'datum' restrictions): does that data now appear?
If it still doesn't, then the LEFT JOIN is the issue, as joins are filters too.
If you do those and the data appears, then the data and your understanding of the data don't match, and you need to check/confirm any assumptions about the data you may have made.
I'm not 100% familiar with MySQL, but the GROUP BY looks really odd; you're not doing any sums, mins, or other operations on the group. Do you need that line?
I want to write a query to select a subset of a table, only starting from a given id.
I know about LIMIT x, y, but x here is the number of the row to start from. In my case I want to start from a specific id, no matter where it is located inside the table.
What I mean is that the query below selects starting from row number 5, but I want it to select 10 records starting from the row with id, say, 213odin2d211d21:
SELECT * FROM my_table Limit 5, 10
I can't find a way to do this. Any help will be appreciated.
Note that the id here is a mix of strings and integers, so I can't do
SELECT * FROM <table> WHERE id > (id)
What you want to do is not possible. By default, records in the database are not ordered; without ORDER BY you can't expect the server to return your rows in any particular order. Since you say you store some kind of digit/char identifier as your id, for which less than and greater than are not defined, it is not clear which records "follow" your specific record.
You will either have to:
Define another column to sort your records on, or
Define a behaviour for comparing your ids (What is "less than"? What is "greater than"?)
That being said, you can of course decide to sort your ids just like strings! In this case, you can use STRCMP() to compare two strings. Your query would look like this:
SELECT * FROM <table> WHERE STRCMP(id,?) = 1 ORDER BY id LIMIT 10
This will select the first 10 records, with id "greater than" ?.
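For example, using the table and id from your question as the starting point:
SELECT * FROM my_table WHERE STRCMP(id, '213odin2d211d21') = 1 ORDER BY id LIMIT 10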
I have been looking on the web for how to select a random row from big tables. I have found various approaches, but then I analyzed my data and figured out that the best way for me to go is to count the rows and select a random one of them with LIMIT.
While testing I started to wonder why this works:
SET #t = CEIL(RAND()*(SELECT MAX(id) FROM logo));
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=#t
ORDER BY id
LIMIT 1;
and gives random results, but this one always returns the same 4 or 5 results?
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id>=CEIL(RAND()*(SELECT MAX(id) FROM logo))
ORDER BY id
LIMIT 1;
The table has MANY fields (almost 100) and quite a few indexes, with over 14 million records and counting. When I select a random row it is almost NEVER from the whole table; I always have to select depending on various field values (all indexed).
Could it be a bug in my MySQL server version (5.6.13-log Source distribution)?
One possibility is that this statement in the documentation:
RAND() in a WHERE clause is re-evaluated every time the WHERE is executed.
is simply not always true. It is true when you do:
where rand() < 0.01
to get an approximate 1% sample of the rows. Perhaps the MySQL optimizer says something like "Oh, I'll evaluate the subquery to get one value back. And, just to be more efficient, I'll multiply that value by rand() before defining the constant."
If I had to guess, that would be the case.
Another possibility is that the data is arranged so that the values you are looking for have one row with a large id. Or it could be that there are lots of rows with small ids at the very beginning, and then a very large gap.
Your method of getting a random row, by the way, is not guaranteed to return a result when you are doing filtering. I don't know if that is important to you.
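If it does matter, a simple workaround (just a sketch, reusing the @t variable from your first query) is to fall back to the closest id below the threshold whenever the first probe returns nothing:
SELECT id
FROM logo
WHERE
current_status_id=29 AND
logo_type_id=4 AND
active='y' AND
id<@t
ORDER BY id DESC
LIMIT 1;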
EDIT:
Check to see if this version works as you expect:
SELECT id
FROM logo cross join
(SELECT MAX(id) as maxid FROM logo) c
WHERE current_status_id = 29 AND
logo_type_id = 4 AND
active = 'y' AND
id >= RAND() * maxid
ORDER BY id
LIMIT 1;
If so, the problem is that the max id is being calculated and then there is an extra step of multiplying it by rand() as execution of the query begins.
I have a table "A" with a "date" field. I want to make a select query that orders the rows with earlier dates in descending order, and then the rows with later dates in ascending order, all in the same query. Is it possible?
For example, table "A":
id date
---------------------
a march-20
b march-21
c march-22
d march-23
e march-24
I'd like to get, having as a starting date "march-22", this result:
id date
---------------------
c march-22
b march-21
a march-20
d march-23
e march-24
In one query, because I'm currently doing it with two of them and it's slow; the only difference between them is the sorting, and the joins I have to do are a bit "heavy".
Thanks a lot.
You could use something like this -
SELECT *
FROM test
ORDER BY IF(
date <= '2012-03-22',
DATEDIFF('2000-01-01', date),
DATEDIFF(date, '2000-01-01')
);
Here is a link to a test on SQL Fiddle - http://sqlfiddle.com/#!2/31a3f/13
That's wrong, sorry :(
From documentation:
However, use of ORDER BY for individual SELECT statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of ORDER BY in this context is typically in conjunction with LIMIT, so that it is used to determine the subset of the selected rows to retrieve for the SELECT, even though it does not necessarily affect the order of those rows in the final UNION result. If ORDER BY appears without LIMIT in a SELECT, it is optimized away because it will have no effect anyway.
This should do the trick. I'm not 100% sure about adding an order in a UNION...
SELECT * FROM A where date <= now() ORDER BY date DESC
UNION SELECT * FROM A where date > now() ORDER BY date ASC
I think the real question here is how to do the join only once. Create a temporary table with the result of the join, and run the two selects from that table. That way the time-consuming part happens only on creation (once), not on each select query (twice).
CREATE TABLE tmp SELECT ... JOIN -- do the heavy duty here
With this you can run the two select statements as you originally did.
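A sketch of that pattern; the join here is only a hypothetical placeholder (other_table and its a_id column are made up), with the id and date columns from the example:
CREATE TEMPORARY TABLE tmp AS
SELECT A.id, A.date
FROM A
JOIN other_table ON other_table.a_id = A.id; -- the "heavy" join, run once
SELECT * FROM tmp WHERE date <= '2012-03-22' ORDER BY date DESC;
SELECT * FROM tmp WHERE date > '2012-03-22' ORDER BY date ASC;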