ActiveRecord vs. SQL - Is there a cleaner way? - mysql

I have a Rails 4 based application that's handling some SIEM style work for us. I'm a big believer in making code as readable as possible and then worrying about optimization. I'm finding that attempting to find all of the events that contain a set of words leads to exceptionally poor performance if I rely on AR, so I've resorted to using SQL directly even though it's fragile.
Is there a better way to do the following using AR?
sql = "select event_id from events_words where generated>'#{starting_time.to_s(:db)}' and word_id in (select id from words where words.text in ('#{terms.join("', '")}')) group by event_id having count(distinct(word_id))=#{terms.count}"
events_words is a join table containing the word_id for every word in every event, the event_id for each event and generated, the timestamp when the event was generated. The generated field is being used to limit search results to a time frame and the table itself is partitioned by date to keep the indices to a size that can fit in RAM.

For even better performance and readability, consider using a JOIN operation in place of the IN (subquery). To improve readability, consider qualifying every column reference.
Personally, I would find this statement to be much more "readable":
SELECT e.event_id
FROM events_words e
JOIN ( SELECT w.id
FROM words w
WHERE w.text IN ('#{terms.join("', '")}')
) s
ON s.id = e.word_id
WHERE e.generated > '#{starting_time.to_s(:db)}'
GROUP BY e.event_id
HAVING COUNT(DISTINCT(e.word_id))=#{terms.count}
... ("readability" gauged in terms of the ability of the reader to quickly figure out what the query is doing).
As to getting a query like that done in ActiveRecord (if that's possible), I am inclined to pity the poor soul that has to wade through whatever that looks like to decipher what the query is actually doing.
EDIT
After reviewing again, I see there's no need for the inline view. (That was generated from the subquery during my initial change to the JOIN operation, but it's not really necessary.)
This should return an equivalent result:
SELECT e.event_id
FROM events_words e
JOIN words w
ON w.id = e.word_id
WHERE e.generated > '#{starting_time.to_s(:db)}'
AND w.text IN ('#{terms.join("', '")}')
GROUP BY e.event_id
HAVING COUNT(DISTINCT(e.word_id))=#{terms.count}
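Since the question calls the interpolated SQL fragile, here is a minimal sketch of the same JOIN / GROUP BY / HAVING query with bound parameters instead of string interpolation. It uses Python's sqlite3 module and an in-memory database purely for illustration; the helper name events_matching_all, the sample rows, and the timestamps are made up, and the real tables live in MySQL.

```python
import sqlite3

def events_matching_all(conn, terms, starting_time):
    """Return event_ids (after starting_time) that contain every word in terms."""
    terms = sorted(set(terms))                     # the HAVING count assumes distinct terms
    placeholders = ", ".join("?" for _ in terms)   # one "?" per term, values are bound below
    sql = f"""
        SELECT e.event_id
          FROM events_words e
          JOIN words w ON w.id = e.word_id
         WHERE e.generated > ?
           AND w.text IN ({placeholders})
         GROUP BY e.event_id
        HAVING COUNT(DISTINCT e.word_id) = ?
         ORDER BY e.event_id
    """
    params = [starting_time, *terms, len(terms)]
    return [row[0] for row in conn.execute(sql, params)]

# Toy schema and data (hypothetical, only to exercise the query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
    CREATE TABLE events_words (event_id INTEGER, word_id INTEGER, generated TEXT);
    INSERT INTO words VALUES (1, 'failed'), (2, 'login'), (3, 'root');
    INSERT INTO events_words VALUES
        (100, 1, '2014-01-02 00:00:00'), (100, 2, '2014-01-02 00:00:00'),
        (101, 1, '2014-01-02 00:00:00'),
        (102, 1, '2013-12-01 00:00:00'), (102, 2, '2013-12-01 00:00:00');
""")

matching = events_matching_all(conn, ["failed", "login"], "2014-01-01 00:00:00")
print(matching)  # -> [100]
```

Only the number of placeholders is built into the SQL text; the term values themselves never are, so a term containing a quote cannot break the statement.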

You might try this:
EventWord.joins(:word).
where(:words => {:text => terms}).
where("generated > ?", starting_time).
group(:event_id).
having("count(distinct(word_id)) = ?", terms.count).
select(:event_id)
Or ...
Event.joins(:word).
where(:words => {:text => terms}).
where("generated > ?", starting_time).
group(:id).
having("count(distinct(words.id)) = ?", terms.count)

Related

Is there any difference, performance wise, with these two queries? (Repeating the where clause inside the sub-query) MYSQL

I have a query that goes something like this.
Select *
FROM FaultCode FC
JOIN (
SELECT INNER_E.* FROM Equipment INNER_E
) E USING(EquipmentID)
LEFT JOIN AssetType AT ON AT.id_asset_type = E.id_asset_type AND AT.id_language = 'en-us'
LEFT JOIN Project P ON E.current_id_project = P.id_project
WHERE E.id_organization = 100057 AND E.equipment_status = 'ACTIVE'
AND FC.code_status = 'OPEN'
As you can see, there is a WHERE clause in the outer (main) query.
But also, on the inside, we have an Inner Join statement with the line SELECT INNER_E.* FROM Equipment INNER_E. This inner join makes us only retrieve the fault codes that are inside the equipment table (correct me if I'm wrong).
I am trying to optimize this query.
My question is, does it make any difference to do this
Select *
FROM FaultCode FC
JOIN (
SELECT INNER_E.* FROM Equipment INNER_E
WHERE INNER_E.id_organization = 100057 AND INNER_E.equipment_status = 'ACTIVE'
) E USING(EquipmentID)
LEFT JOIN AssetType AT ON AT.id_asset_type = E.id_asset_type AND AT.id_language = 'en-us'
LEFT JOIN Project P ON E.current_id_project = P.id_project
WHERE E.id_organization = 100057 AND E.equipment_status = 'ACTIVE'
AND FC.code_status = 'OPEN'
So repeating the where clause inside the inner sub query, to further limit it before it joins. Or does the optimizer know to do this automatically?
I tried implementing that line in code, and it seemed to only make my query slower strangely enough. Is there any way I can optimize that query above, or since it's pretty simple, is that the best it's going to get without indexes?
I tried running the Explain Select statement, but I have a hard time parsing what it's telling me. Are there any good resources I can look into to learn some tips or techniques to optimize my query?
I don't have any aggregate functions in my Select fields. So is the only real answer Indexes?
Why is the first subquery needed? Perhaps simply
Select *
FROM FaultCode FC
JOIN Equipment AS E USING(EquipmentID)
LEFT JOIN AssetType AT ON AT.id_asset_type = E.id_asset_type
AND AT.id_language = 'en-us'
LEFT JOIN Project P ON E.current_id_project = P.id_project
WHERE E.id_organization = 100057
AND E.equipment_status = 'ACTIVE'
AND FC.code_status = 'OPEN';
Likely Indexes:
FC: INDEX(code_status, EquipmentID)
E: INDEX(id_organization, equipment_status, EquipmentID)
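To see what composite indexes like these change, one can compare the query plan before and after creating them. The sketch below does that with Python's sqlite3 module on empty toy tables; SQLite's planner and its EXPLAIN QUERY PLAN output are not MySQL's, the column FaultCodeID and the index names are invented, so treat it only as an illustration of the workflow (run EXPLAIN, add the index, run EXPLAIN again).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Equipment (
        EquipmentID      INTEGER PRIMARY KEY,
        id_organization  INTEGER,
        equipment_status TEXT
    );
    CREATE TABLE FaultCode (
        FaultCodeID INTEGER PRIMARY KEY,   -- hypothetical key column
        EquipmentID INTEGER,
        code_status TEXT
    );
""")

QUERY = """
    SELECT FC.FaultCodeID
      FROM FaultCode FC
      JOIN Equipment E USING (EquipmentID)
     WHERE E.id_organization = 100057
       AND E.equipment_status = 'ACTIVE'
       AND FC.code_status = 'OPEN'
"""

def plan(conn, sql):
    # The 4th column of EXPLAIN QUERY PLAN is the human-readable step description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(conn, QUERY)

# The two composite indexes suggested above (index names are made up).
conn.executescript("""
    CREATE INDEX idx_fc_status_equipment
        ON FaultCode (code_status, EquipmentID);
    CREATE INDEX idx_e_org_status
        ON Equipment (id_organization, equipment_status, EquipmentID);
""")
plan_after = plan(conn, QUERY)

print("before:", plan_before)
print("after: ", plan_after)
```

Before the indexes exist, at least one table has to be scanned in full; afterwards the plan should mention one or both index names. The exact wording of the plan lines varies between SQLite versions.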
Probably unwise to do SELECT * -- It will give you all the columns of all 4 tables. (Without further details, I cannot suggest any "covering" indexes, which seems likely for AT.)
With my version of the query, your question about repeating the WHERE vanishes. With your version, it is likely to help. I don't think the Optimizer is smart enough to catch on to what you are doing.
Show us the EXPLAINs. We can help some with what the cryptic stuff is saying. (And what it is not saying.)
"the best it's going to get without indexes" -- Are you saying you have no indexes??! Not even a PRIMARY KEY for each table? "So is the only real answer Indexes?" Every time you write a query against a non-tiny table, you should ask "do the table(s) have adequate indexes for this query?"

Complex MySQL query problems and also SQL hangs

I am trying to write an SQL query which is pretty complex. The requirements are as follows:
I need to return these fields from the query:
track.artist
track.title
track.seconds
track.track_id
track.relative_file
album.image_file
album.album
album.album_id
track.track_number
I can select a random track with the following query:
select
track.artist, track.title, track.seconds, track.track_id,
track.relative_file, album.image_file, album.album,
album.album_id, track.track_number
FROM
track, album
WHERE
album.album_id = track.album_id
ORDER BY RAND() limit 10;
Here is where I am having trouble though. I also have tables named "trackfilters1" through "trackfilters10". Each row has an auto-incrementing ID field, so row 10 holds the data for album_id 10. The flags field is populated with 1's and 0's. For example, if album #10 has 10 tracks, then trackfilters1.flags will contain "1111111111" if all tracks are to be included in the search. If track 10 were to be excluded, it would contain "1111111110".
My problem is including this clause.
The latest query I have come up with is the following:
select
track.artist, track.title, track.seconds,
track.track_id, track.relative_file, album.image_file,
album.album, album.album_id, track.track_number
FROM
track, album, trackfilters1, trackfilters2
WHERE
album.album_id = track.album_id
AND
( (album.album_id = trackfilters1.id)
OR
(album.album_id=trackfilters2.id) )
AND
( (mid(trackfilters1.flags, track.track_number,1) = 1)
OR
( mid(trackfilters2.flags, track.track_number,1) = 1))
ORDER BY RAND() limit 2;
however this is causing SQL to hang. I'm presuming that I'm doing something wrong. Does anybody know what it is? I would be open to suggestions if there is an easier way to achieve my end result, I am not set on repairing my broken query if there is a better way to accomplish this.
Additionally, in my trials, I have noticed when I had a working query and added say, trackfilters2 to the FROM clause without using it anywhere in the query, it would hang as well. This makes me wonder. Is this correct behavior? I would think adding to the FROM list without making use of the data would just make the server procure more data, I wouldn't have expected it to hang.
There's not enough information here to determine what's causing the performance issue.
But here's a few suggestions and comments.
Ditch the old-school comma syntax for the join operations, and use the JOIN keyword instead. And relocate the join predicates to an ON clause.
And for heaven's sake, format the SQL so that it's decipherable by someone trying to read it.
There's some questions here... will there always be a matching row in both trackfilters1 and trackfilters2 for rows you want to return? Or could a row be missing from trackfilters2, and you still want to return the row if there's a matching row in trackfilters1? (The answer to that question determines whether you'd want to use an outer join vs an inner join to those tables.)
For best performance with large sets, having appropriate indexes defined is going to be critical.
Use EXPLAIN to see the execution plan.
I suggest you try writing your query like this:
SELECT track.artist
, track.title
, track.seconds
, track.track_id
, track.relative_file
, album.image_file
, album.album
, album.album_id
, track.track_number
FROM track
JOIN album
ON album.album_id = track.album_id
LEFT
JOIN trackfilters1
ON trackfilters1.id = album.album_id
LEFT
JOIN trackfilters2
ON trackfilters2.id = album.album_id
WHERE MID(trackfilters1.flags, track.track_number, 1) = '1'
OR MID(trackfilters2.flags, track.track_number, 1) = '1'
ORDER BY RAND()
LIMIT 2
And if you want help with performance, provide the output from EXPLAIN, and what indexes are defined.
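For what it's worth, the behaviour of the LEFT JOIN version can be checked on a toy dataset. The sketch below uses Python's sqlite3 module; SQLite has no MID(), so substr() stands in for it, ORDER BY RAND() is replaced by a fixed order so the result is repeatable, and the sample albums, tracks and flags are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE album (album_id INTEGER PRIMARY KEY, album TEXT);
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, album_id INTEGER,
                        track_number INTEGER, title TEXT);
    CREATE TABLE trackfilters1 (id INTEGER PRIMARY KEY, flags TEXT);
    CREATE TABLE trackfilters2 (id INTEGER PRIMARY KEY, flags TEXT);

    INSERT INTO album VALUES (10, 'First'), (11, 'Second');
    INSERT INTO track VALUES
        (1, 10, 1, 'a1'), (2, 10, 2, 'a2'), (3, 10, 3, 'a3'),
        (4, 11, 1, 'b1'), (5, 11, 2, 'b2');
    -- Album 10 only has a row in trackfilters1 (track 3 excluded);
    -- album 11 only has a row in trackfilters2 (track 1 excluded).
    INSERT INTO trackfilters1 VALUES (10, '110');
    INSERT INTO trackfilters2 VALUES (11, '01');
""")

rows = conn.execute("""
    SELECT track.title
      FROM track
      JOIN album
        ON album.album_id = track.album_id
      LEFT JOIN trackfilters1
        ON trackfilters1.id = album.album_id
      LEFT JOIN trackfilters2
        ON trackfilters2.id = album.album_id
     WHERE substr(trackfilters1.flags, track.track_number, 1) = '1'
        OR substr(trackfilters2.flags, track.track_number, 1) = '1'
     ORDER BY track.track_id
""").fetchall()

selected = [title for (title,) in rows]
print(selected)  # -> ['a1', 'a2', 'b2']
```

Because the filter tables are outer-joined, an album missing from one of them still comes back as long as the other table flags the track; with inner joins both albums here would drop out entirely.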

SQL: do we need ANY/SOME and ALL keywords?

I've been using SQL (SQL Server, PostgreSQL) for over 10 years and I have still never used the ANY/SOME and ALL keywords in my production code. In every situation I've encountered I could get away with IN, MAX, MIN, EXISTS, and I think that's more readable.
For example:
-- = ANY
select * from Users as U where U.ID = ANY(select P.User_ID from Payments as P);
-- IN
select * from Users as U where U.ID IN (select P.User_ID from Payments as P);
Or
-- < ANY
select * from Users as U where U.Salary < ANY(select P.Amount from Payments as P);
-- EXISTS
select * from Users as U where EXISTS (select * from Payments as P where P.Amount > U.Salary);
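The rewrites above are easy to sanity-check on a few rows. The sketch below uses Python's sqlite3 module; as far as I know SQLite does not implement ANY/SOME/ALL at all, so it can only run the IN, EXISTS and MAX forms, which is arguably the point of the question. The sample users and payments are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (ID INTEGER PRIMARY KEY, Salary INTEGER);
    CREATE TABLE Payments (User_ID INTEGER, Amount INTEGER);
    INSERT INTO Users VALUES (1, 100), (2, 500), (3, 900);
    INSERT INTO Payments VALUES (1, 300), (1, 600);
""")

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Stand-in for: U.ID = ANY (select P.User_ID from Payments as P)
in_ids = ids("""
    SELECT U.ID FROM Users AS U
     WHERE U.ID IN (SELECT P.User_ID FROM Payments AS P)
     ORDER BY U.ID
""")

# Stand-in for: U.Salary < ANY (select P.Amount from Payments as P)
exists_ids = ids("""
    SELECT U.ID FROM Users AS U
     WHERE EXISTS (SELECT 1 FROM Payments AS P WHERE P.Amount > U.Salary)
     ORDER BY U.ID
""")

# Same condition again, phrased with MAX.
max_ids = ids("""
    SELECT U.ID FROM Users AS U
     WHERE U.Salary < (SELECT MAX(P.Amount) FROM Payments AS P)
     ORDER BY U.ID
""")

print(in_ids)      # -> [1]
print(exists_ids)  # -> [1, 2]
print(max_ids)     # -> [1, 2]
```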
Using ANY/SOME and ALL:
PostgreSQL
SQL Server
MySQL
SQL FIDDLE with some examples
So the question is: am I missing something? is there some situation where ANY/SOME and ALL shine over other solutions?
I find ANY and ALL to be very useful when you're not just testing equality or inequality. Consider
'blah' LIKE ANY (ARRAY['%lah', '%fah', '%dah']);
as used in my answer to this question.
ANY, ALL and their negations can greatly simplify code that'd otherwise require non-trivial subqueries or CTEs, and they're significantly under-used in my view.
Consider that ANY will work with any operator. It's very handy with LIKE and ~, but will work with tsquery, array membership tests, hstore key tests, and more.
'a => 1, e => 2'::hstore ? ANY (ARRAY['a', 'b', 'c', 'd'])
or:
'a => 1, b => 2'::hstore ? ALL (ARRAY['a', 'b'])
Without ANY or ALL you'd probably have to express those as a subquery or CTE over a VALUES list with an aggregate to produce a single result. Sure, you can do that if you want, but I'll stick to ANY.
There's one real caveat here: On older Pg versions, if you're writing ANY( SELECT ... ), you're almost certainly going to be better off in performance terms with EXISTS (SELECT 1 FROM ... WHERE ...). If you're on a version where the optimizer will turn ANY (...) into a join then you don't need to worry. If in doubt, check EXPLAIN output.
No, I've never used the ANY, ALL, or SOME keywords either, and I've never seen them used in other people's code. I assume these are vestigial syntax, like the various optional keywords that appear in some places in SQL (for example, AS).
Keep in mind that SQL was defined by a committee.
I have tried all of them and I don't think you're missing anything; it's mostly a matter of habit. The only difference I notice is with a negated condition: EXISTS and IN need a NOT added, while with ANY/SOME you just change the operator to <>. I only use SQL Server, so I'm not sure whether other databases offer something more.

Select taking too long. Need advice for a better performance

Ok, here we go. There's this messy SELECT crossing other tables and ordering to get the one desired row. Basically I do the "math" inside the ORDER BY.
1 base table.
7 JOINS pointing to local tables.
WHERE with 2 clauses and a NOT IN crossing another table.
You'll see in the code the ORDER BY is pretty damn big/ugly, it sums the result of 5 different calculations. I need that result to order by those calculations in order to get the worst row-case.
The problem is once I execute the Stored Procedure it takes up to 8 seconds to run. That's kind of non-acceptable. So, I'm starting to check Indexes.
So, I'm looking for advice on how to make this query run faster.
I'm indexing the WHERE clauses and the field LINEA. Should I index something else, like the columns I'm using for the JOINs? Or should I approach the query differently?
Query:
SET @LINEA = (
SELECT TOP 1
BOA.LIN
FROM
BAND_BA BOA
LEFT JOIN
TEL PAR
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(PAR.Te,2,10)
LEFT JOIN
TELP CLP
ON REPLACE(BOA.Lin,'-','') = SUBSTRING(CLP.Numtel,2,10)
LEFT JOIN
CA C
ON REPLACE(BOA.Lin,'-','') = C.An
LEFT JOIN
RE R
ON REPLACE(BOA.Lin,'-','') = R.Lin
LEFT JOIN
PRODUCTOS2 P2
ON BOA.PRODUCTO = P2.codigo
LEFT JOIN
EN
ON REPLACE(BOA.Lin,'-','') = EN.G
LEFT JOIN
TIP ID
ON TIPID = ID.ID
WHERE
BOA.EST = 'C' AND
ID.SE = 'boA' AND
BOA.LIN NOT IN (
SELECT
LIN
FROM
BAN
)
ORDER BY (EN.VALUE + ANT.VALUE + REIT.VAL + C.VALUE + TEL.VALUE
) DESC,
I'll be frank, this is some pretty terrible SQL. Without seeing all your table structures, advice here will be incomplete. That being said, please don't post all your table structures because you are already very close to "hire a consultant" territory with this.
All the REPLACE logic should be done away with. If you need to JOIN on these fields, then add comparable fields to the tables so you don't need to manipulate the data. Every single JOIN that uses a REPLACE or SUBSTRING is a table or index scan - those are non-SARGable and a definite anti-pattern.
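A rough sketch of the "add comparable fields" idea, using Python's sqlite3 module rather than SQL Server: the derived columns lin_digits and te_key are hypothetical names, backfilled once from the existing values, and the sample phone numbers are invented. SQLite's planner is not SQL Server's, but the same principle shows up in its plan: wrapping the joined column in a function prevents an index seek on it, while a plain indexed column allows one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BAND_BA (id INTEGER PRIMARY KEY, Lin TEXT);
    CREATE TABLE TEL (id INTEGER PRIMARY KEY, Te TEXT);
    CREATE INDEX idx_tel_te ON TEL (Te);   -- cannot be used for a seek on SUBSTR(Te, ...)
    INSERT INTO BAND_BA (Lin) VALUES ('555-123-4567'), ('555-000-0000');
    INSERT INTO TEL (Te) VALUES ('15551234567'), ('15559999999');

    -- Hypothetical derived columns: backfilled once, then kept in sync on write.
    ALTER TABLE BAND_BA ADD COLUMN lin_digits TEXT;
    ALTER TABLE TEL ADD COLUMN te_key TEXT;
    UPDATE BAND_BA SET lin_digits = REPLACE(Lin, '-', '');
    UPDATE TEL SET te_key = SUBSTR(Te, 2, 10);
    CREATE INDEX idx_tel_te_key ON TEL (te_key);
""")

SLOW = """
    SELECT BOA.Lin, PAR.id
      FROM BAND_BA BOA
      LEFT JOIN TEL PAR
        ON REPLACE(BOA.Lin, '-', '') = SUBSTR(PAR.Te, 2, 10)
     ORDER BY BOA.id
"""
FAST = """
    SELECT BOA.Lin, PAR.id
      FROM BAND_BA BOA
      LEFT JOIN TEL PAR
        ON BOA.lin_digits = PAR.te_key
     ORDER BY BOA.id
"""

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN is the step description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

slow_rows, fast_rows = conn.execute(SLOW).fetchall(), conn.execute(FAST).fetchall()
slow_plan, fast_plan = plan(SLOW), plan(FAST)

print(slow_rows == fast_rows)  # both joins return the same rows
print("slow:", slow_plan)
print("fast:", fast_plan)
```

In SQL Server the derived values could also be persisted computed columns, which avoids keeping them in sync by hand; whether that fits depends on the actual table design.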
The ORDER BY is probably the most convoluted ORDER BY I have ever seen. Some major issues there:
Subqueries should all be eliminated and materialized either in the outer query or as variables
String manipulation should be eliminated (see item 1 above)
The entire query is basically a code smell. If you need to write code like this to meet business requirements then you either have a terribly inappropriate design or some other much larger issue in the organization or data.
One thing that can kill performance is using a lot of LEFT JOINs. To improve performance of LEFT JOIN, you might want to make sure that the column(s) to which you join have an index - that can have a huge impact on performance.

MySQL - Fastest way to select relational data avoiding left join

I've currently got a query that selects metrics data from two tables whilst getting the projects to query from two other tables (one is owned projects, the other is projects to which the user has access).
SELECT v.`projectID`,
(SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`) AS `sessions`,
(SELECT COUNT(pb.`interact`)
FROM `interactionData` pb WHERE pb.`projectID` = v.`projectID` GROUP BY pb.`projectID`) AS `interactions`
FROM `medias` v
LEFT JOIN `projectsExt` pa ON v.`projectsExtID` = pa.`projectsExtID`
WHERE (pa.`user` = '1' OR v.`ownerUser` = '1')
GROUP BY v.`projectID`
It takes too long, 1-2 seconds. This is obviously the multi left-join scenario. But, I've got a couple of ideas to improve speed and wondered what the thoughts were in principle. Do I:-
Try and select the list in the query and then get the data, rather than doing the joins. Not sure how this would work.
Do a select in a separate query to get the projectIDs and then run queries on each projectID afterwards. This may lead to hundreds or potentially thousands of requests, but may be better for the processing?
Other ideas?
There's two questions here:
how can I get my result in less than 2 seconds
how can I avoid a left join.
To answer #1 properly there has to be more information. Technical information, such as the explain plan for this particular query is a good start. Even better if we'd have the SHOW CREATE TABLE of all tables that you access, as well as the number of rows they contain.
But I'd also appreciate more functional information: what exactly is the question you're trying to answer? Right now, it seems you're looking at two different sets of medias:
either there is no matching row in projectsExt, in which case medias.ownerUser must equal '1' (is that '1' supposed to be a string btw?)
or there is exactly one matching row in projectsExt for which projectsExt.user must equal '1' (is that '1' supposed to be a string btw?)
Lacking enough information to answer #1, I can answer #2 - "how to avoid a left join". The answer is: write a UNION of the two sets, one where there is a match and one where there isn't a match.
SELECT v.`projectID`
, (
SELECT COUNT(m.`session`)
FROM `metricData` m
WHERE m.`projectID` = v.`projectID`
) AS `sessions`
, (
SELECT COUNT(pb.`interact`)
FROM `interactionData` pb
WHERE pb.`projectID` = v.`projectID`
GROUP BY pb.`projectID`
) AS `interactions`
FROM (
SELECT v.projectID
FROM medias v
WHERE ownerUser = '1'
GROUP BY projectID
UNION ALL
SELECT v.projectID
FROM medias v
INNER JOIN projectsExt pa
ON v.projectsExtID = pa.projectsExtID
WHERE v.ownerUser != '1'
AND pa.user = '1'
GROUP BY v.`projectID`
) v
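To check that splitting the LEFT JOIN ... OR into two sets returns the same projects, here is a small sketch using Python's sqlite3 module on invented rows (the mediaID column and all sample data are assumptions). Note one deliberate variation, which is mine and not part of the query above: it uses UNION rather than UNION ALL and drops the ownerUser != '1' test, so a project whose medias fall into both branches is still listed once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projectsExt (projectsExtID INTEGER PRIMARY KEY, user TEXT);
    CREATE TABLE medias (mediaID INTEGER PRIMARY KEY, projectID INTEGER,
                         projectsExtID INTEGER, ownerUser TEXT);
    INSERT INTO projectsExt VALUES (1, '1'), (2, '2');
    INSERT INTO medias VALUES
        (1, 10, NULL, '1'),   -- owned by user 1, not shared
        (2, 20, 1,    '2'),   -- owned by user 2, shared with user 1
        (3, 30, 2,    '3'),   -- neither owned by nor shared with user 1
        (4, 20, NULL, '1');   -- second media of project 20, owned by user 1
""")

USER = "1"

# Original shape: LEFT JOIN plus OR across both tables.
left_join_ids = [r[0] for r in conn.execute("""
    SELECT v.projectID
      FROM medias v
      LEFT JOIN projectsExt pa ON v.projectsExtID = pa.projectsExtID
     WHERE pa.user = ? OR v.ownerUser = ?
     GROUP BY v.projectID
     ORDER BY v.projectID
""", (USER, USER))]

# Two simple sets instead: owned projects, plus projects shared with the user.
union_ids = [r[0] for r in conn.execute("""
    SELECT projectID
      FROM medias
     WHERE ownerUser = ?
    UNION
    SELECT v.projectID
      FROM medias v
      JOIN projectsExt pa ON v.projectsExtID = pa.projectsExtID
     WHERE pa.user = ?
     ORDER BY 1
""", (USER, USER))]

print(left_join_ids)  # -> [10, 20]
print(union_ids)      # -> [10, 20]
```

Each branch of the UNION filters on a single table's column, which gives an index on medias.ownerUser or projectsExt.user a chance to be used; whether that actually beats the LEFT JOIN on the real data is something only EXPLAIN can tell.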
Have you tried, instead, to refactor everything into left joins? Seeing as how you're always grouping on the same field, it shouldn't be a problem. Try that and post an EXPLAIN to see what the bottlenecks are.
Subselects tend to perform worse than joins, because the engine can optimize joins to a much higher degree. In fact, the engine will usually rewrite subselects into joins where possible.
As a rule of thumb, there is no gain in splitting queries; all you gain is overhead and a confused optimizer. There are, as always, exceptions to this rule, but they come into play only after you've done what you can the traditional way and know you need such an approach.