I'm attempting to use SQL to pull data from a database into a Jupyter (Python) notebook and work with it there. I have a query that pulls the yearweek of each flight's upload date and counts the number of flights in that yearweek, grouping the results by the yearweek of the upload date:
SELECT YEARWEEK(d.upload_date), COUNT(f.id)
FROM apps_flight f
LEFT JOIN apps_enginedatafile d ON d.id=f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)
I want to count the number of subscribers (located in another table) for each yearweek and compare it to the count of flights. So I'm trying to join that table by adding:
LEFT JOIN apps_subscription s ON s.basesubscription_ptr_id = f.id
But, when I do this, the counts of my flight values change!
The first few counts for the original query look like:
[327, 605, 78, 5768, 9716, 9686, 7902, 3699, 3323, 6081, 4966, 3456, 3181, 2749, 4577, 3157, 1792, 1806, ...]
After joining the table, it becomes:
[327, 738, 78, 8854, 17418, 16156, 13921, 7536, 5380, 10040, 7559, 5461, 6323, 6412, 6702, 5433, 2924, ...]
I'm not sure what's happening here. Perhaps the join is creating duplicate rows? The data set is very large, and the query takes about 30 minutes to run. Adding a LIMIT doesn't seem to speed it up, so as you can imagine, testing takes a little while. (If there's a way to speed up the query other than a LIMIT that I've overlooked, feel free to point it out!)
Thanks for any info.
Simply join two aggregate count queries. The query below assumes the same structure, including column names. (Adjust upload_date to the actual date/time column in apps_subscription.)
WITH agg_flights AS (
SELECT YEARWEEK(d.upload_date) AS year_week,
COUNT(f.id) AS flight_counts
FROM apps_flight f
LEFT JOIN apps_enginedatafile d
ON d.id = f.import_file_id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(d.upload_date)
), agg_subs AS (
SELECT YEARWEEK(s.upload_date) AS year_week, -- ADJUST date/time variable
COUNT(f.id) AS subscriber_counts
FROM apps_flight f
LEFT JOIN apps_subscription s
ON s.basesubscription_ptr_id = f.id
WHERE f.global_duplicate = 0
GROUP BY YEARWEEK(s.upload_date) -- ADJUST date/time variable
)
SELECT f.year_week,
f.flight_counts,
s.subscriber_counts
FROM agg_flights f
INNER JOIN agg_subs s
ON f.year_week = s.year_week
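As a rough illustration of the aggregate-then-join idea, here is a minimal sketch in Python using an in-memory SQLite database. The tables `flight` and `subscription`, their precomputed `year_week` columns, and the sample rows are all hypothetical stand-ins (SQLite has no YEARWEEK function, and the real schema may differ). Each table is counted per week on its own, and only the small per-week results are joined, so neither count can inflate the other.

```python
import sqlite3

# Hypothetical toy schema; the real tables and date columns will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flight (id INTEGER PRIMARY KEY, year_week INTEGER);
CREATE TABLE subscription (id INTEGER PRIMARY KEY, year_week INTEGER);
INSERT INTO flight VALUES (1, 202301), (2, 202301), (3, 202302);
INSERT INTO subscription VALUES (10, 202301), (11, 202302), (12, 202302), (13, 202302);
""")

# Aggregate each table separately, then join the per-week summaries.
weekly = conn.execute("""
    WITH agg_flights AS (
        SELECT year_week, COUNT(*) AS flight_count
        FROM flight
        GROUP BY year_week
    ), agg_subs AS (
        SELECT year_week, COUNT(*) AS subscriber_count
        FROM subscription
        GROUP BY year_week
    )
    SELECT f.year_week, f.flight_count, s.subscriber_count
    FROM agg_flights f
    JOIN agg_subs s ON s.year_week = f.year_week
    ORDER BY f.year_week
""").fetchall()

print(weekly)  # [(202301, 2, 1), (202302, 1, 3)]
```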
Joins create combined rows of all the tables joined. So your join between f and d will produce multiple rows (before the GROUP BY) for a single flight if more than one row in d matches that flight's import_file_id, and the join on s will add multiple rows if a flight matches more than one subscription. COUNT operates on the result of the joins, not on the f table before the join.
In this case, the easy fix is to just use COUNT(DISTINCT f.id) instead of COUNT(f.id), so each flight is only counted once per yearweek.
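The fan-out effect and the COUNT(DISTINCT ...) fix can be reproduced on a tiny data set. The sketch below uses Python's built-in sqlite3 module; the table names, the `year_week` column, and the rows are made up for illustration and only mimic the shape of the original query.

```python
import sqlite3

# Hypothetical toy data: flight 1 matches two subscription rows,
# flight 2 matches one, flight 3 matches none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flight (id INTEGER PRIMARY KEY, year_week INTEGER);
CREATE TABLE subscription (id INTEGER PRIMARY KEY, flight_id INTEGER);
INSERT INTO flight VALUES (1, 202301), (2, 202301), (3, 202302);
INSERT INTO subscription VALUES (10, 1), (11, 1), (12, 2);
""")

def weekly_counts(count_expr):
    """Run the joined query with the given COUNT expression."""
    sql = f"""
        SELECT f.year_week, {count_expr}
        FROM flight f
        LEFT JOIN subscription s ON s.flight_id = f.id
        GROUP BY f.year_week
        ORDER BY f.year_week
    """
    return conn.execute(sql).fetchall()

inflated = weekly_counts("COUNT(f.id)")            # counts joined rows
corrected = weekly_counts("COUNT(DISTINCT f.id)")  # counts each flight once

print(inflated)   # [(202301, 3), (202302, 1)]
print(corrected)  # [(202301, 2), (202302, 1)]
```

Week 202301 has two flights, but the join produces three rows for it, which is the same kind of inflation seen in the question.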
I've got a problem with MySQL select statement.
I have a table with different departments and statuses. There are 4 statuses for every department, but not every status occurs in every month, and I would like the analytics graph to show '0' for the missing ones.
My select statement only shows the statuses that exist (of course :D).
Is it possible to create a temporary table with all of the departments and statuses and a count of 0 for each, then update it with values from another select?
Select statement and screen how it looks in perfect situation, and how it looks in bad situation :
SELECT utd.Departament,uts.statusDef as statusoforder,Count(uts.statusDef) as Ilosc_Statusow
FROM ur_tasks_details utd
INNER JOIN ur_tasks_status uts on utd.StatusOfOrder = uts.statusNR
WHERE month = 'Sierpien'
GROUP BY uts.statusDef,utd.Departament
Perfect scenario, now bad scenario :
I've tried "union" statements, but I don't know whether it is possible to take only "the highest value" for every department.
example :
I've also heard about WITH (CTE) tables, but I don't really get how to use them. Would love to get some tips on that!
Thanks for your help.
Use a cross join to generate the rows you want. Then use a left join and aggregation to bring in the data:
select d.Departament, uts.statusDef as statusoforder,
Count(utd.StatusOfOrder) as Ilosc_Statusow
from (select distinct utd.Departament
from ur_tasks_details utd
) d cross join
ur_tasks_status uts left join
ur_tasks_details utd
on utd.Departament = d.Departament and
utd.StatusOfOrder = uts.statusNR and
utd.month = 'Sierpien'
group by uts.statusDef, d.Departament;
The first subquery should be your source of all the departments.
I also suspect that month is in the details table, so that should be part of the on clause.
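To see the cross join plus left join pattern produce explicit zeros, here is a small sketch with Python's sqlite3 module. The `status` and `task` tables and their rows are hypothetical simplifications of ur_tasks_status and ur_tasks_details. Note that the count is taken over a column from the left-joined table, so combinations with no matching rows count as 0 rather than 1.

```python
import sqlite3

# Hypothetical simplified schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (status_nr INTEGER PRIMARY KEY, status_def TEXT);
CREATE TABLE task (department TEXT, status_nr INTEGER, month TEXT);
INSERT INTO status VALUES (1, 'open'), (2, 'closed');
INSERT INTO task VALUES ('A', 1, 'Sierpien'), ('A', 1, 'Sierpien'),
                        ('B', 2, 'Sierpien'), ('B', 1, 'Lipiec');
""")

rows = conn.execute("""
    SELECT d.department, st.status_def, COUNT(t.status_nr) AS n
    FROM (SELECT DISTINCT department FROM task) d
    CROSS JOIN status st              -- every department x status pair
    LEFT JOIN task t
           ON t.department = d.department
          AND t.status_nr = st.status_nr
          AND t.month = 'Sierpien'    -- month filter stays in the ON clause
    GROUP BY d.department, st.status_nr, st.status_def
    ORDER BY d.department, st.status_nr
""").fetchall()

print(rows)
# [('A', 'open', 2), ('A', 'closed', 0), ('B', 'open', 0), ('B', 'closed', 1)]
```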
I have two tables: one is called participants_tb and the other allocation_tb. In participants_tb, the columns are participant_id, name, username.
In allocation_tb, the columns are allocation_id, sender_username, receiver_username, done. The column done holds one of these three numbers: 0, 1, 2.
I used this SQL statement to fetch my values:
SELECT *, COUNT(done) d
FROM participants_tb
JOIN allocation_tb ON (username=receiver_username)
WHERE done = 0 || done = 1
GROUP BY receiver_username
It worked very well. The problem is that I also want to include participants that are in participants_tb but not in allocation_tb. I tried a left outer join, but it did not work as expected: because the done condition in the WHERE clause refers to allocation_tb, participants without any allocation rows are still filtered out.
You seem to want:
SELECT p.*, COUNT(a.done) as d
FROM participants_tb p LEFT JOIN
allocation_tb a
ON p.username = a.receiver_username AND
a.done IN (0, 1)
GROUP BY p.participant_id;
Notes:
The LEFT JOIN keeps all participants.
The GROUP BY needs to be on the first table.
You can use SELECT p.* with the GROUP BY -- assuming that the GROUP BY key is unique (or the primary key).
All columns should be qualified.
IN is an easier way to express your logic.
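The difference between filtering in WHERE and filtering in ON can be checked on a toy data set. This is only a sketch using Python's sqlite3 module; the shortened table names and the three sample participants are invented for the example.

```python
import sqlite3

# Hypothetical data: 'ann' has allocations with done = 0, 1, 2,
# 'bob' only has done = 2, and 'cy' has no allocations at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participants (participant_id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE allocation (allocation_id INTEGER PRIMARY KEY,
                         receiver_username TEXT, done INTEGER);
INSERT INTO participants VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO allocation VALUES (1, 'ann', 0), (2, 'ann', 1), (3, 'ann', 2), (4, 'bob', 2);
""")

# Filter in WHERE: rows with NULL a.done are discarded after the join,
# so the LEFT JOIN effectively behaves like an inner join.
filter_in_where = conn.execute("""
    SELECT p.username, COUNT(a.done)
    FROM participants p
    LEFT JOIN allocation a ON p.username = a.receiver_username
    WHERE a.done IN (0, 1)
    GROUP BY p.participant_id, p.username
    ORDER BY p.participant_id
""").fetchall()

# Filter in ON: unmatched participants survive with a count of 0.
filter_in_on = conn.execute("""
    SELECT p.username, COUNT(a.done)
    FROM participants p
    LEFT JOIN allocation a ON p.username = a.receiver_username
                          AND a.done IN (0, 1)
    GROUP BY p.participant_id, p.username
    ORDER BY p.participant_id
""").fetchall()

print(filter_in_where)  # [('ann', 2)]
print(filter_in_on)     # [('ann', 2), ('bob', 0), ('cy', 0)]
```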
I have a relatively simple query that returns a user profile, together with 2 counts related to that user (stream events & inventory):
SELECT u.*, r.regionName, COUNT(i.fmid) AS invcount, COUNT(s.fmid) AS streamcount
FROM fm_users u
JOIN fm_regions r
ON u.region=r.regionid
LEFT OUTER JOIN fm_inventory i
ON u.fmid=i.fmid
LEFT OUTER JOIN fm_stream s
ON u.fmid=s.fmid
WHERE u.username='sampleuser'
Both the inventory & stream values could be zero, hence using a left outer join.
However, the values for both currently return numbers for all users, not the specific user (always identified by the integer 'fmid' in each table). This is obviously because the query doesn't specify a 'where' for either count, but I'm not sure where to put it. If I change the last part to this:
WHERE u.username='sampleuser' AND s.fmid=u.fmid AND i.fmid=u.fmid
GROUP BY u.fmid
It still returns incorrect numbers, albeit 'different' numbers depending on the search criteria, and the same number for both invcount and streamcount.
Your query cross joins the fm_inventory and fm_stream tables for each user. For example if a specific user has 2 matching fm_inventory rows, and fm_stream also has two, the result will have 4 (=2 x 2) rows and each will be counted in both COUNT() functions.
The way to accomplish what you are doing is to use correlated subselects for the counts.
SELECT u.*, r.regionName,
(select COUNT(*) from fm_inventory i
where i.fmid = u.fmid) AS invcount,
(select COUNT(*) from fm_stream s
where s.fmid = u.fmid) AS streamcount
FROM fm_users u
JOIN fm_regions r
ON u.region=r.regionid
WHERE u.username='sampleuser'
This allows the counts to be independent of each other. It works even if more than one user is selected: because each count is computed in its own subquery, the outer query needs no GROUP BY.
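Here is a small reproduction of the 2 x 3 multiplication and of the correlated-subselect fix, sketched with Python's sqlite3 module. The simplified `users`, `inventory`, and `stream` tables and their rows are hypothetical; only the join shape follows the original query.

```python
import sqlite3

# Hypothetical data: 'sampleuser' has 2 inventory rows and 3 stream rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (fmid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, fmid INTEGER);
CREATE TABLE stream (event_id INTEGER PRIMARY KEY, fmid INTEGER);
INSERT INTO users VALUES (1, 'sampleuser'), (2, 'other');
INSERT INTO inventory VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO stream VALUES (1, 1), (2, 1), (3, 1);
""")

# Two LEFT JOINs on the same user: every inventory row pairs with every
# stream row, so both counts see 2 * 3 = 6 joined rows.
joined = conn.execute("""
    SELECT COUNT(i.item_id), COUNT(s.event_id)
    FROM users u
    LEFT JOIN inventory i ON i.fmid = u.fmid
    LEFT JOIN stream s ON s.fmid = u.fmid
    WHERE u.username = 'sampleuser'
    GROUP BY u.fmid
""").fetchone()

# Correlated subselects: each count looks only at its own table.
subselects = conn.execute("""
    SELECT (SELECT COUNT(*) FROM inventory i WHERE i.fmid = u.fmid),
           (SELECT COUNT(*) FROM stream s WHERE s.fmid = u.fmid)
    FROM users u
    WHERE u.username = 'sampleuser'
""").fetchone()

print(joined)      # (6, 6)
print(subselects)  # (2, 3)
```

The joined version also explains why both counts in the question came back as the same number.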
When you mix normal columns with aggregates, include all normal columns in the GROUP BY. In this particular case, you are missing the r.regionName in the GROUP BY.
I'm reasonably new to MySQL and I'm trying to select a distinct set of rows using this statement:
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4);
However, the select statement is taking around 10 minutes, so something is clearly afoot.
One significant factor is that the table gtfsstop_times is huge. (~250 million records)
Indexes seem to be set up properly; all the above joins are using indexed columns. Table sizes are, roughly:
gtfsagencys - 4 rows
gtfsroutes - 56,000 rows
gtfstrips - 5,500,000 rows
gtfsstop_times - 250,000,000 rows
`transportdata`.stoppoints - 400,000 rows
The server has 22Gb of memory, I've set the InnoDB buffer pool to 8G and I'm using MySQL 5.6.
Can anybody see a way of making this run faster? Or indeed, at all!
Does it matter that the stoppoints table is in a different schema?
EDIT:
EXPLAIN SELECT... returns this:
It looks like you are trying to find a collection of stop points, based on certain criteria. And, you're using SELECT DISTINCT to avoid duplicate stop points. Is that right?
It looks like atcoCode is a unique key for your stoppoints table. Is that right?
If so, try this:
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints AS sp
JOIN (
SELECT DISTINCT st.fk_atco_code AS atcoCode
FROM `vehicledata`.gtfsroutes AS route
JOIN `vehicledata`.gtfstrips AS trip ON trip.route_id = route.route_id
JOIN `vehicledata`.gtfsstop_times AS st ON trip.trip_id = st.trip_id
WHERE route.agency_id BETWEEN 1 AND 4
) ids ON sp.atcoCode = ids.atcoCode
This does a few things: It eliminates a table (agency) which you don't seem to need. It changes the search on agency_id from IN(a,b,c) to a range search, which may or may not help. And finally it relocates the DISTINCT processing from a situation where it has to handle a whole ton of data to a subquery situation where it only has to handle the ID values.
(JOIN and INNER JOIN are the same. I used JOIN to make the query a bit easier to read.)
This should speed you up a bit. But, it has to be said, a quarter gigarow table is a big table.
With 250M records, I would shard the gtfsstop_times table on one column. Each shard can then be joined in a separate query, and those queries can run in parallel in separate threads; you'll only need to merge the result sets.
The trick is to reduce how many rows of gtfsstop_times SQL has to evaluate. In this case SQL first evaluates every row in the inner join of gtfsstop_times and transportdata.stoppoints, right? How many rows does transportdata.stoppoints have? Then SQL evaluates the WHERE clause, then it evaluates DISTINCT. How does it do DISTINCT? By looking at every single row multiple times to determine if there are other rows like it. That would take forever, right?
However, GROUP BY quickly squishes all the matching rows together, without evaluating each one. I normally use joins to quickly reduce the number of rows the query needs to evaluate, then I look at my grouping.
In this case you want to replace DISTINCT with grouping.
Try this;
SELECT sp.name, sp.longitude, sp.latitude, sp.atcoCode
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
INNER JOIN `vehicledata`.gtfsagencys as agency ON route.agency_id = agency.agency_id
WHERE agency.agency_id IN (1,2,3,4)
GROUP BY sp.name
, sp.longitude
, sp.latitude
, sp.atcoCode
There are other valuable answers to your question, and mine is an addition to them. I assume sp.atcoCode and st.fk_atco_code are indexed columns in their tables.
If you can validate and make sure that agency ids in the WHERE clause are valid, you can eliminate joining `vehicledata.gtfsagencys` in the JOINS as you are not fetching any records from the table.
SELECT DISTINCT sp.atcoCode, sp.name, sp.longitude, sp.latitude
FROM `transportdata`.stoppoints as sp
INNER JOIN `vehicledata`.gtfsstop_times as st ON sp.atcoCode = st.fk_atco_code
INNER JOIN `vehicledata`.gtfstrips as trip ON st.trip_id = trip.trip_id
INNER JOIN `vehicledata`.gtfsroutes as route ON trip.route_id = route.route_id
WHERE route.agency_id IN (1,2,3,4);
I just imported a large amount of data into two tables. Let's call them shipments and returns.
When I try to do a simple join (left or inner) based on any criteria in these two tables, the query appears to do a cross join, finding every combination instead of pulling what it should.
Each table has a PK id field, but there is no FK relationship between the two, only some shared fields.
I'm currently just trying to relate them on shipment_id.
I feel this is a simple answer. Am I missing a reference or something obvious that is causing this? Thanks!
Here's an example. This should return under 100 rows. Instead it returns hundreds of thousands.
SELECT r.*
FROM returns as r
left outer join shipments as s
on r.shipment_id = s.shipment_id
where r.date = '2011-06-20'
Here is a query that should work:
SELECT T0.*, T1.*
FROM shipments AS T0 LEFT JOIN returns AS T1 ON T0.shipment_id = T1.shipment_id
ORDER BY T0.shipment_id;
This query assumes a 1:1 relationship on shipment_id.
It would be nice if you included the query you were using.
You need to specify what you are joining on, otherwise it will do a cartesian join:
SELECT r.*
FROM returns as r
LEFT JOIN shipments as s ON s.shipment_id = r.shipment_id
where r.date = '2011-06-20'
Josh,
I would be interested in seeing what would happen if you forced a join to a specific record or set of records instead of the whole table. Assuming there is a shipment with an id of 5 in your table, you could try:
SELECT r.* FROM returns as r
left join shipments as s
ON 5 = r.shipment_id
WHERE r.date = '2011-06-20'
While just a fancy where clause, it would at least prove that the join you are attempting will eventually work correctly. The issue is that your on clause is always returning true, no matter what the value is. This could be because it's not interpreting the shipment_id as an integer, but instead as a true/false variable where any value evaluates to true.
Original Rejected Solution:
No foreign key relationship should be needed in order to make the joins happen. I'm assuming the PK id fields are integers (or number, or whatever your RDBMS equivalent is)?
Can you paste a snippet of your SQL query?
Updating based on posted query:
I would add your explicit join criteria in order to rule out any funny business (my guess is since no criteria is specified, it's using 1=1, which always joins). So I would change your query to look like:
SELECT r.*
FROM returns as r
left join shipments as s ON
s.ShipId = r.ReturnId
where r.date = '2011-06-20'
The issue turned out to be very simple, just not readily apparent until going through all the columns. It turns out that the shipment ID was duplicated through every row as it hit the upper limit for the int datatype. This is why joins were returning every record.
After switching the datatype to bigint and reimporting, everything worked great. Thanks all for looking into it.
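For anyone hitting the same symptom, the effect is easy to reproduce. The sketch below uses Python's sqlite3 module and assumes the import clipped out-of-range IDs to the signed 32-bit maximum (MySQL does this in non-strict SQL mode; the thread does not say which database or import tool was used, so treat that as an assumption). The helper `clamp_to_int` and the sample IDs are hypothetical.

```python
import sqlite3

INT_MAX = 2**31 - 1  # upper bound of a signed 32-bit INT column

def clamp_to_int(value):
    """Assumed import behaviour: clip out-of-range values to INT_MAX."""
    return min(value, INT_MAX)

# Hypothetical source IDs, all larger than INT_MAX.
source_ids = [3_000_000_001, 3_000_000_002, 3_000_000_003]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shipments (id INTEGER PRIMARY KEY, shipment_id INTEGER);
CREATE TABLE returns (id INTEGER PRIMARY KEY, shipment_id INTEGER);
""")
clamped = [(clamp_to_int(v),) for v in source_ids]
conn.executemany("INSERT INTO shipments (shipment_id) VALUES (?)", clamped)
conn.executemany("INSERT INTO returns (shipment_id) VALUES (?)", clamped)

# Every return now matches every shipment: 3 x 3 rows instead of 3.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM returns r
    LEFT JOIN shipments s ON r.shipment_id = s.shipment_id
""").fetchone()[0]

# A quick sanity check that would have exposed the problem early.
distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT shipment_id) FROM shipments"
).fetchone()[0]

print(joined_rows, distinct_ids)  # 9 1
```

Comparing COUNT(*) with COUNT(DISTINCT shipment_id) on each table right after an import is a cheap way to spot this kind of collapse before running any joins.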