Here is the query I'm trying to run:
SELECT A.*
FROM student_lesson_progress A
LEFT JOIN student_lesson_progress B
ON A.studentId = B.studentId
AND A.lessonId = B.lessonId
WHERE A.lessonStatusTypeId = 2 AND
EXISTS (SELECT * FROM student_lesson_progress WHERE B.lessonStatusTypeID = 4)
Basically I'm not very skilled with SQL, but am trying to return all rows with a lessonStatusTypeID = 2 but only if there is a row with the same studentId and lessonId that has lessonStatusTypeID = 4.
My end goal once I am certain I have the query right, is that if a Student (studentID) has achieved a Status (lessonStatusTypeId) of 4 on a particular lesson (lessonID) I want to delete all the rows where Status is 2 for that particular Student on that particular lesson, as that data is no longer needed.
I pieced together the above query, and it runs alright on a small test DB, and seems to be returning the desired rows. However, when I try and run it on the production DB, where the student_lesson_progress table has around 600,000 rows, it just runs and runs and runs, locks up the database, pins the server cpu at 100%, and never returns data.
My guess is that my query is very poorly put together, and probably overly complicated for what I'm trying to do. I would greatly appreciate any tips or nudges in the right direction with this one.
General rule of thumb: If you're using a sub-select, you're probably not doing it right. This is not always the case, but if you can avoid sub-selects, you should.
This should work for the query. Your sub-select is what is probably killing your performance. You also should index sutdentId and lessonId, or put a compound index on both columns.
SELECT A.*
FROM student_lesson_progress A
INNER JOIN student_lesson_progress B
ON A.studentId = B.studentId
AND A.lessonId = B.lessonId
WHERE A.lessonStatusTypeId = 2 AND B.lessonStatusTypeID = 4
You need to use corelated subquery. But first make sure you have right indices on the table.
SELECT distinct A.*
FROM student_lesson_progress A
WHERE A.lessonStatusTypeId = 2
AND A.studentId in (
SELECT B.studentId
FROM student_lesson_progress B
WHERE B.studentId = A.studentId
And B.lessonStatusTypeId = 4);
Essentially means, get me list of all students with status 2, who also have a corresponding lesson with status of 4. The distinct will eliminate duplicates (if a student has more than 1 lesson with status 4).
Hope this works..
Related
I want to check 2 databases to see if the money-payments are the same as the total. That is possible, but I get a very long table:
select
transaction_id
,total_low+total_high a
, sum(money_received) b
from
archive_transaction inner join archive_transaction_payment
on archive_transaction.id=archive_transaction_payment.transaction_id
group by transaction_id;
Actually I only want the transactions where the total is wrong!!
So now I want to add a!=b and that gives an invalid query. How to proceed?
Table archive_transaction has 1 row per transaction, but archive_transaction_payment can have multiple payments for one transaction. This makes it complicated for me.
select
transaction_id
,total_low+total_high a
, sum(money_received) b
from archive_transaction inner join archive_transaction_payment
on archive_transaction.id=archive_transaction_payment.transaction_id
where
a!=b
group by transaction_id;
Joins are still problematic for me, but I found an answer without join to find faults in the database.
SELECT id
FROM archive_transaction a
WHERE total_low + total_high != (SELECT Sum(money_received)
FROM archive_transaction_payment b
WHERE a.id = b.transaction_id);
Now I get a short list of problems in my database. Thanks for helping me out.
Here's my problem: I need to get the amount of test cases and issues associated to a project that meet certain conditions (test cases that are successful, and issues that are flaws of the application), but for some reason the amount doesn't add up. I have 10 test cases in a project, of which 6 are successful; and 8 issues, of which only 4 are flaws. However, the respective results for COUNT each show 24, which makes no sense. I did notice, though, that 24 happens to be 6 times 4, but I don't see how the query would multiply them.
Anyway... Can someone help me find which part of my query is wrong? How can I get the correct result? Thanks in advance.
Here's the query:
SELECT
p.codigo_proyecto,
p.nombre,
IFNULL(COUNT(iep.id_incidencia_etapa_proyecto), 0) AS cantidad_defectos,
IFNULL(COUNT(tc.id_test_case), 0) AS test_cases_exitosos,
CASE IFNULL(COUNT(tc.id_test_case), 0) WHEN 0 THEN 'No aplica'
ELSE CONCAT((IFNULL(COUNT(tc.id_test_case), 0) / IFNULL(COUNT(tc.id_test_case), 0)) * 100, '%') END AS tasa_defectos
FROM proyecto p
INNER JOIN etapa_proyecto ep ON p.codigo_proyecto = ep.codigo_proyecto
INNER JOIN incidencia_etapa_proyecto iep ON ep.id_etapa_proyecto = iep.id_etapa_proyecto
INNER JOIN incidencia i ON iep.id_incidencia = i.id_incidencia
INNER JOIN test_case tc ON ep.id_etapa_proyecto = tc.id_etapa_proyecto
INNER JOIN etapa_proyecto ep_ultima ON ep_ultima.id_etapa_proyecto =
(SELECT ep_ultima2.id_etapa_proyecto FROM etapa_proyecto ep_ultima2
WHERE p.codigo_proyecto = ep_ultima2.codigo_proyecto ORDER BY ep_ultima2.fecha_termino_real DESC LIMIT 1)
WHERE p.esta_cerrado = 1
AND i.es_defecto = 1
AND tc.resultado = 'Exitoso'
AND ep_ultima.fecha_termino_real BETWEEN '2015-01-01' AND '2016-12-31';
I would have thought it obvious that you're not going to get the expected output from an aggregate query without a GROUP BY (which suggests you're not really in a position to evaluate any advice given here effectively).
You've not said how the states of your data are represented in the database - so I'm having to make a lot of guesses based on SQL which is clearly very wrong. And I don't speak spanish/portugese or whatever your native language is.
It looks like you are inferring that a defect exists if the primary key of the defects table is null. Primary keys cannot be null. The only way this would make any sort of sense (BTW it still won't give you the answer you're looking for) is to do a LEFT JOIN rather than an INNER JOIN.
But even then a simple COUNT() will consider null cases (no record in source table) as 1 record in the output set.
Then you've got the problem that you will have the product of defects and test cases in your output - consider the case where you have no defects, but 2 tests cases (1,2) - the result of an outer joiun will be:
defect test
------ ----
null 1
null 2
If you just count the rows, you'll get 2 defects in your output.
Taking a simpler schema, this demonstrates the 2 methods for getting the values - note that they have very different performance characteristics.
SELECT project.id
, dilv.defects
, (SELECT COUNT(*)
FROM test_cases) AS tests
FROM project
LEFT JOIN ( SELECT project_id, COUNT(*) AS defects
FROM defect_table
GROUP BY project_id) AS dilv
ON project.id=dilv.project_id
I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints` AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
Having 250M records, I would shard the gtfsstop_times table on one column. Then each sharded table can be joined in a separate query that can run parallel in separate threads, you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this;
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There other valuable answers to your question and mine is an addition to it. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their table.
If you can validate and make sure that agency ids in the WHERE clause are valid, you can eliminate joining `vehicledata.gtfsagencys` in the JOINS as you are not fetching any records from the table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);
I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but i'll admit, I am no mysql veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that you can't use subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you since a relative newbie in SQL. Use "aliases" for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It can help your life go easier, and anyone AFTER you to better know which columns come from what table, especially if same column name in different tables. Prevents ambiguity in the query. Your left join, I think, from the sample, may be ambigous, but confirm the join of B.ID to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry could have the "Team_ID" that such posting was from, in addition to its OWN "ID" for the blabbing table's unique key indicator.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, not logical "||" or, but SQL "OR"
EDIT -- per your comment
This could be done in a variety of ways, such as dynamic SQL building and executing, calling multiple times, once for each ID and merging the results, or additionally, by doing a join to yet another temp table that gets cleaned out say... daily.
If you have another table such as "TeamJoins", and it has say... 3 columns: a date, a sessionid and team_id, you could daily purge anything from a day old of queries, and/or keep clearing each time a new query by the same session ID (as it appears coming from PHP). Have two indexes, one on the date (to simplify any daily purging), and second on (sessionID, team_id) for the join.
Then, loop through to do inserts into the "TempJoins" table with the simple elements identified.
THEN, instead of a hard-coded list IN, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TB.Team_ID IN NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is;
I added an extra column to my blabbing table called team_id and set it to null as well as another field in my team_blabbing table called mem_id
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.
Hey All.
I have a bunch of tables that have some common fields tying them together, but I can't figure out the right way to dump them in a meaningful way.
Basically, users will be given two tests, and each test may be taken multiple times.
The main table stores information about the user and the test, similar to the below (we'll call this table MAIN):
user_id test iteration completion_time
1 1 1 1:00
1 2 1 1:30
1 1 2 0:49
1 2 2 1:30
Each test page then has its own table to store the answers provided, since some pages have a ton of questions. We'll call this one sample table RESULTS, but there are many tables like this that are basically the same.
user_id test iteration q1 q2 q3
1 1 1 A B A
1 2 1 B B A
1 1 2 A B B
1 2 2 A B B
These results tables (again, there are many) basically store the results, plus just enough information to accurately tie the results together across all tables. I set it up this way because to use just one table for the results would have left me with a table with several hundred columns, which I had read was not recommended.
So the problem here is i can't figure out how best to tie together these tables and get the results out. I've read up on joins and unions and neither one seems right, as far as I can tell, because i need to pull data from ~10-15 tables at once.
I can do a huge complex select -- something along the lines of
select m.*, a.*, b.*, c.* from main m, results_a a, results_b b, results_c c where a.user_id=m.user_id and b.user_id=m.user_id and c.user_id=m.user_id'
and that works, but there's got to be a better way. Keep in mind that i've only given 3 results tables in this example -- in my actual application, it's going to be more like 15-20 tables of results.
Beyond being really complicated, it returns duplicates of some rows, and if i want to throw in any extra logic (lets say, for example, that i only want the same data i queried before, but only for test 2) it gets even more complicated. And lets not even talk about sorting.
From what I've read, JOIN is for 2 tables, and UNION combines rows of results, not columns.
I don't claim to be a mysql expert, but i have looked into this before posting. I feel like I must be close to the right answer, but just not quite hitting on it.
Can anyone give me some guidance?
To use inner joins try:
SELECT m.*, a.*, b.*, c.*
FROM main m
INNER JOIN results_a a ON a.user_id = m.user_id
INNER JOIN results_b b ON b.user_id = m.user_id
INNER JOIN results_c c ON c.user_id = m.user_id
WHERE m.user_id = x
To differentiate column names, explicitly name the column and assign an alias
SELECT m.*, a.q1 as A_Q1, a.q2 as A_Q2, b.q1 as B_Q1, b.q2 as B_Q2 ...
I think you've designed your database in a way that you're going to have to do joins. It sounds better than the alternative (large, non-normalized table). What are you trying to get out? View all tests and determine a user's best score?
So if I understand the tables you have multiple users (obviously). Each user can take multiple tests. Each test could be taken multiple times.
select m.*, a.*, b.*, c.* from main m
join results_a a
on a.user_id = m.user_id and a.test_id = m.test_id and a.iteration_id = m.iteration_id
join results_b b
on b.user_id = m.user_id and b.test_id = m.test_id and b.iteration_id = m.iteration_id
join results_c c
on c.user_id = m.user_id and c.test_id = m.test_id and c.iteration_id = m.iteration_id
That should give you one row per iteration, per test, per user. If you want a certain test for a user then add:
where m.user_id = #userid and m.test_id = #testid
and then you can look at all the iterations for a test, by a user.
If you want the latest iteration:
select top(1) ...
where m.user_id = #userid and m.test_id = #testid
order by m.iteration_id desc
You might consider generating a view when your questions are set-up so you don't have to generate the select statement at run-time which represents a particular test.