MySQL SELECT query performance issues in a huge database

I have a pretty huge MySQL database and am having performance issues while selecting data. Let me first explain what I am doing in my project: I have a list of files. Every file should be analyzed with a number of tools. The result of the analysis is stored in a results table.
I have one table with files (samples). The table contains about 10 million rows. The schema looks like this:
idsample|sha256|path|...
The second table (really small) identifies the tool. Schema:
idtool|name
The third table is going to be the biggest one. It contains all results of the tools I am using to analyze the files (the number of rows will be the number of files TIMES the number of tools). Schema:
id|idsample|idtool|result information| ...
What I am looking for is a query, which returns UNPROCESSED files for a given tool id (where no result exists yet).
The most efficient way I have found so far to query those entries is the following:
SELECT s.idsample
FROM samples AS s
WHERE s.idsample NOT IN (
    SELECT idsample
    FROM results
    WHERE idtool = 1
)
LIMIT 100
The problem is that the query is getting slower and slower as the results table is growing.
Do you have any suggestions for improvements? One further problem is that I cannot change the structure of the tables, as this is a shared database which is also used by other projects. (I think) the only way to improve things is to find a more efficient SELECT query.
Thank you very much,
Philipp

A LEFT JOIN may perform better, especially if idsample is indexed in both tables; in my experience, this kind of lookup is better served by a JOIN than by that kind of subquery.
SELECT s.idsample
FROM samples AS s
LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
WHERE r.idsample IS NULL
LIMIT 100
;
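A correlated NOT EXISTS anti-join is another form worth benchmarking against the LEFT JOIN; it expresses the same "no result row exists" condition, and depending on the MySQL version the optimizer may produce a different (sometimes better) plan:
SELECT s.idsample
FROM samples AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM results AS r
    WHERE r.idsample = s.idsample
      AND r.idtool = 1
)
LIMIT 100;
Either way, a composite index on results (idtool, idsample) should let the anti-join be resolved entirely from the index. One caveat with the original NOT IN form: if results.idsample can ever be NULL, NOT IN returns no rows at all, while NOT EXISTS and the LEFT JOIN are unaffected.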
Another, more involved, solution would be to create a fourth table holding the full "unprocessed" list, and then use triggers on the other three tables to maintain it (a sketch of the last trigger follows the list); i.e.
when a new tool is added, add all the current files to that fourth table (with the new tool).
when a new file is added, add all the current tools to that fourth table (with the new file).
when a new result is entered, remove the corresponding record from the fourth table.
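As a minimal sketch of that third rule, assuming the fourth table is named unprocessed with columns (idsample, idtool); both names are illustrative:
CREATE TRIGGER results_after_insert
AFTER INSERT ON results
FOR EACH ROW
    -- A result now exists for this (file, tool) pair, so it is no longer unprocessed.
    DELETE FROM unprocessed
    WHERE idsample = NEW.idsample
      AND idtool = NEW.idtool;
The query for unprocessed files then becomes a plain indexed SELECT against that one table.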

Related

How to select Master and Detail for Joiner Transformation containing large records in IICS?

I am using the Informatica Intelligent Cloud Services (IICS) Intelligent Structure model to parse a JSON file. The file is located on an S3 bucket and contains 3 groups. Two groups contain lots of records (~100,000 each) and the third group contains ~10,000 records. According to the Intelligent Structure model, the largest group contains the PK, which I can use to join the other groups, but which group should I select as Master and which as Detail? Usually the group with fewer records should be selected, but in my case the group with fewer records contains the foreign key. Is there a workaround for this issue?
I am new to IICS, so how do I resolve this?
Any help will be appreciated. Thanks in advance!
The rule is: the table with the smaller row count should be the master, because during execution the master source is cached into memory for the join.
Having said that, you can use the third group, with fewer rows, as the master for both joins, like below. If it is a normal join the logic remains the same, but performance will improve if you choose a master with fewer rows and lower granularity.
Sq_gr1(d)\
Sq_gr3-jnr1(m)->|jnr2----->
Sq_gr2(d)------>/
An outer join will take time proportional to the row count.

Right way to phrase MySQL query across many (possible empty) tables

I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table but that doesn't bear on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. One option would be to optimize your query with CASE/COUNT(*) or CASE/LIMIT combinations to sort out empty tables; however, such conditional queries tend to cost more time.
You could split the SQL into several queries to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x; for instance, the OR can be split into a UNION, as sketched below.
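Here is a minimal sketch of that split, using the table and column names from the question; an empty Temp2 then simply makes its branch return nothing instead of wiping out the whole result, and UNION deduplicates rows that match both branches:
SELECT m.*
FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp2 ON m.source = Temp2.source
WHERE m.time > '2017-01-01'
UNION
SELECT m.*
FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp3 ON m.source = Temp3.source
WHERE m.time > '2016-11-15';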
You said that an API is involved; maybe you are able to handle the temporary "no data" tables differently, or even avoid storing them?
Another option would be to enable query caching (note that the query cache was deprecated in MySQL 5.7 and removed in 8.0, so this applies only to older versions):
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html

MySQL count selected rows in one table to update value in another table

I have created a table ("texts" table) for storing ocr text from scanned documents. The table now has 100,000 + records. It stores a separate record for each page in the document. I set up the table originally so it stored the documents' title and its location against each record, which was obviously bad design as the info was duplicated for many records. I have subsequently created a separate table which now only stores one record for each document ("documents" table). The original table still contains a record for each page in the document, but the only columns now are the ocr text and the id of the document record in the documents table.
The documents table has a column "total_pages". I am trying to update this value using the following query:
UPDATE documents SET total_pages=(SELECT Count(*) from texts where texts.docs_id=documents.id)
This just seems to take forever to execute and I have had to crash out of it on a couple of occasions. There are over 8000 records in the documents table.
I have tested the query by limiting it to just one document
UPDATE documents SET total_pages=(SELECT Count(*) from texts where texts.docs_id=documents.id and documents.id=1)
This works eventually with just one record, but it takes a very long time to execute. I am guessing that my full query needs a bit of optimization! Any help greatly appreciated.
This is your query:
UPDATE documents
SET total_pages = (SELECT COUNT(*)
                   FROM texts
                   WHERE texts.docs_id = documents.id)
For performance, you want an index on texts(docs_id). That will probably fix your performance problem. In fact, it might make it unnecessary to store this value in the master table.
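Adding that index is a one-liner (the index name is illustrative):
CREATE INDEX idx_texts_docs_id ON texts (docs_id);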
If you do decide to store the count, be sure that you keep the value up-to-date. That would typically require triggers to handle inserts and deletes (and perhaps updates, if docs_id can change).
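A minimal sketch of such triggers, assuming the column names from the question:
CREATE TRIGGER texts_after_insert
AFTER INSERT ON texts
FOR EACH ROW
    -- A new page was added, so bump that document's page count.
    UPDATE documents SET total_pages = total_pages + 1 WHERE id = NEW.docs_id;

CREATE TRIGGER texts_after_delete
AFTER DELETE ON texts
FOR EACH ROW
    -- A page was removed, so decrement the count.
    UPDATE documents SET total_pages = total_pages - 1 WHERE id = OLD.docs_id;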

Multiple ranking for large result set

I have a large MySQL table (approx 2 million rows). I want to run searches on it that may match up to 25k rows (returned results will be limited, e.g. 25 per page). What I wanted to do was to rank these results on certain criteria and use that to order them.
The solution I have so far is to create a script that goes through the table and assigns a score to each row based on my criteria. Each result would be given points depending on how it compares with my ideal result. I could then order by that score when executing a select, instead of calculating on the fly.
I was then thinking that I want other users of the system to be able to set up their own custom scoring criteria. My first thought was to create a separate table that would contain the first table's row id, a user id and a rank. But this table could get very large (2 million rows for each user). So I am thinking of alternatives; the options I have so far are:
1) Use separate ranking table
2) Use user specific ranking tables
3) Calculate on the fly
Anyone have any experience with a similar problem? The results will be searched in real time by users so my primary concern would be to make this part of the process as fast as possible.
Many Thanks
I prefer a separate table, as you can generate a temporary new table while re-calculating, and then rename the temp table to the real table.
This also may help you to bypass locking problems later...
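A minimal sketch of that swap, assuming a ranking table named user_ranks (the names are illustrative); RENAME TABLE is atomic in MySQL, so readers never see a half-built table:
CREATE TABLE user_ranks_new LIKE user_ranks;
-- ... repopulate user_ranks_new with the freshly calculated scores ...
RENAME TABLE user_ranks TO user_ranks_old,
             user_ranks_new TO user_ranks;
DROP TABLE user_ranks_old;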

MySQL Triple Join Performance help, Copying to Tmp Table

I'm working on a query for a news site, which will find FeaturedContent for display on the main homepage. Content marked this way is tagged as 'FeaturedContent', and ordered in a featured table by 'homepage'. I currently have the desired output, but the query runs in over 3 seconds, which I need to cut down on. How does one optimize a query like the one which follows?
EDIT: Materialized the view every minute as suggested, down to .4 seconds:
SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
FROM live.search_all s
INNER JOIN live.tags t
ON s.item_id = t.item_id AND s.item_type = t.item_type AND t.tag = 'FeaturedContent'
LEFT OUTER JOIN live.featured f
ON s.item_id = f.item_id AND s.item_type = f.item_type AND f.feature_type = 'homepage'
ORDER BY position IS NULL, position ASC, date
This returns all the homepage features in order, followed by other featured content ordered by date.
The explain looks like this:
|-id---|-select_type-|-table-|-type---|-possible_keys---------|-key--------|-key_len-|-ref---------------------------------------|-rows--|-Extra-------------------------------------------------------------|
|-1----|-SIMPLE------|-t2----|-ref----|-PRIMARY,tag_index-----|-tag_index--|-303-----|-const-------------------------------------|-2-----|-Using where; Using index; Using temporary; Using filesort;--------|
|-1----|-SIMPLE------|-t-----|-ref----|-PRIMARY---------------|-PRIMARY----|-4-------|-newswires.t2.id---------------------------|-1974--|-Using index-------------------------------------------------------|
|-1----|-SIMPLE------|-s-----|-eq_ref-|-PRIMARY, search_index-|-PRIMARY----|-124-----|-newswires.t.item_id,newswires.t.item_type-|-1-----|-------------------------------------------------------------------|
|-1----|-SIMPLE------|-f-----|-index--|-NULL------------------|-PRIMARY----|-190-----|-NULL--------------------------------------|-13----|-Using index-------------------------------------------------------|
And the Profile is as follows:
|-Status---------------|-Time-----|
|-starting-------------|-0.000091-|
|-Opening tables-------|-0.000756-|
|-System lock----------|-0.000005-|
|-Table lock-----------|-0.000008-|
|-init-----------------|-0.000004-|
|-checking permissions-|-0.000001-|
|-checking permissions-|-0.000001-|
|-checking permissions-|-0.000043-|
|-optimizing-----------|-0.000019-|
|-statistics-----------|-0.000127-|
|-preparing------------|-0.000023-|
|-Creating tmp table---|-0.001802-|
|-executing------------|-0.000001-|
|-Copying to tmp table-|-0.311445-|
|-Sorting result-------|-0.014819-|
|-Sending data---------|-0.000227-|
|-end------------------|-0.000002-|
|-removing tmp table---|-0.002010-|
|-end------------------|-0.000005-|
|-query end------------|-0.000001-|
|-freeing items--------|-0.000296-|
|-logging slow query---|-0.000001-|
|-cleaning up----------|-0.000007-|
I'm new to reading the EXPLAIN output, so I'm unsure if I have a better ordering available, or anything rather simple that could be done to speed these along.
The search_all table is the materialized view table which is periodically updated, while the tags and featured tables are views. These views are not optional, and cannot be worked around.
The tags view combines tags and a relational table to get back a listing of tags according to item_type and item_id, but the other views are all simple views of one table.
EDIT: With the materialized view, the biggest bottleneck seems to be the 'Copying to temp table' step. Without ordering the output, it takes .0025 seconds (much better!) but the final output does need ordered. Is there any way to enhance the performance of that step, or work around it?
Sorry if the formatting is difficult to read, I'm new and unsure how it is regularly done.
Thanks for your help! If anything else is needed, please let me know!
EDIT: Table sizes, for reference:
Tag Relations: 197,411
Tags: 16,897
Stories: 51,801
Images: 28,383
Videos: 2,408
Featured: 13
I think optimizing your query alone won't be very useful. My first thought is that joining against a subquery, itself made of UNIONs, is by itself a double bottleneck for performance.
If you have the option to change your database structure, then I would suggest merging the 3 tables stories, images and videos into one if they are, as it appears, very similar, adding a type ENUM('story', 'image', 'video') column to differentiate the records; this would remove both the subquery and the union. See the sketch below.
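A minimal sketch of the merged table, with illustrative column names based on the query in the question:
CREATE TABLE content (
    item_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    item_type ENUM('story', 'image', 'video') NOT NULL,
    title     VARCHAR(255),
    caption   VARCHAR(255),
    date      DATETIME,
    PRIMARY KEY (item_id),
    -- Supports filtering by type and ordering by date in one index.
    KEY type_date_index (item_type, date)
);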
Also, it looks like your views on stories and videos are not using an indexed field to filter content. Are you querying an indexed column?
It's a pretty tricky problem without knowing your full table structure and the distribution of your data!
Another option, which would not involve bringing modifications to your existing database (especially if it is already in production), would be to "cache" this information into another table, which would be periodically refreshed by a cron job.
The caching can be done at different levels, either on the full query, or on subparts of it (independent views, or the 3 unions merged into a single cache table, etc.)
The viability of this option depends on whether it is acceptable to display slightly outdated data, or not. It might be acceptable for just some parts of your data, which may imply that you will cache just a subset of the tables/views involved in the query.
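For instance, a minimal sketch of a cron-driven refresh of a full-query cache table (featured_content_cache is an illustrative name), accepting slightly stale data between runs:
-- Rebuild the cache inside a transaction so readers see either the old
-- or the new contents, never an empty table (assumes InnoDB).
BEGIN;
DELETE FROM featured_content_cache;
INSERT INTO featured_content_cache
SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
FROM live.search_all s
INNER JOIN live.tags t
  ON s.item_id = t.item_id AND s.item_type = t.item_type AND t.tag = 'FeaturedContent'
LEFT OUTER JOIN live.featured f
  ON s.item_id = f.item_id AND s.item_type = f.item_type AND f.feature_type = 'homepage';
COMMIT;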