Running SQL queries with JOINs on large datasets

I'm new to MySQL.
I'm trying to run an inner join between a table of 80,000 records (table B) and a 40 GB table with approximately 600 million records (table A).
Is MySQL suitable for running this sort of query?
What sort of run time should I expect?
The code I tried is below. However, it failed because my database connection timed out at 60000 secs.
set net_read_timeout = 36000;
INSERT
INTO C
SELECT A.id, A.link_id, link_ref, network,
date_1, time_per,
veh_cls, data_source, N, av_jt
from A
inner join B
on A.link_id = B.link_id;
I'm starting to look into ways of cutting the 40 GB table down into a temp table to make the query more manageable, but I keep getting
Error Code: 1206. The total number of locks exceeds the lock table size 646.953 sec
Am I on the right track?
Cheers!
My code for splitting the table is:
LOCK TABLES TFM_830_car WRITE, tfm READ;
INSERT
INTO D
SELECT A.id, A.link_id, A.time_per, A.av_jt
from A
where A.time_per = 34 and A.veh_cls = 1;
UNLOCK TABLES;
Perhaps my table indices are incorrect; all I have is a simple primary key:
CREATE Table A
(
id int unsigned Not Null auto_increment,
link_id varchar(255) not Null,
link_ref int not Null,
network int not Null,
date_1 varchar(255) not Null,
#date_2 time default Null,
time_per int not null,
veh_cls int not null,
data_source int not null,
N int not null,
av_jt int not null,
sum_squ_jt int not null,
Primary Key (id)
);
Drop table if exists B;
CREATE Table B
(
id int unsigned Not Null auto_increment,
TOID varchar(255) not Null,
link_id varchar(255) not Null,
ABnode varchar(255) not Null,
#date_2 time not Null,
Primary Key (id)
);
In terms of the schema, there are just these two tables (A and B) in one database.

I believe the answer has already been given in this post: The total number of locks exceeds the lock table size
i.e. use a table lock to avoid InnoDB's default row-by-row locking.
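Besides a table lock, another common way to stay under the lock limit is to copy the rows in primary-key ranges, so that each INSERT ... SELECT touches a bounded number of rows. A minimal sketch using the question's tables A and D; the batch size of 1,000,000 ids is an arbitrary assumption, and it relies on autocommit being enabled so that each statement releases its locks when it finishes:

```sql
-- Batch 1: ids 1 .. 1000000 (batch size is an assumed value; tune it)
INSERT INTO D
SELECT A.id, A.link_id, A.time_per, A.av_jt
FROM A
WHERE A.id BETWEEN 1 AND 1000000
  AND A.time_per = 34
  AND A.veh_cls = 1;

-- Batch 2: the next id range; repeat until MAX(id) of A is covered
INSERT INTO D
SELECT A.id, A.link_id, A.time_per, A.av_jt
FROM A
WHERE A.id BETWEEN 1000001 AND 2000000
  AND A.time_per = 34
  AND A.veh_cls = 1;
```

Error 1206 is also commonly addressed by raising innodb_buffer_pool_size, since InnoDB keeps its lock table in the buffer pool; whether that is practical depends on the memory available on the server.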

Thanks for your help.
Indexing seems to have solved the problem. I managed to reduce the query time from 700 secs to approx. 0.2 secs per record by indexing on:
A.link_id
i.e. the column used in the join:
from A
inner join B
on A.link_id = B.link_id;
I found this really useful post, very helpful for a newbie like myself:
http://hackmysql.com/case4
The code used to create the index was:
CREATE INDEX linkid_index ON A(link_id);
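To check whether the join actually uses the new index, EXPLAIN can be run on the SELECT part of the statement. The sketch below also indexes B.link_id, which the original post does not do; the index name b_linkid_index is made up for illustration.

```sql
-- Index the join column on the small table as well (hypothetical index name)
CREATE INDEX b_linkid_index ON B(link_id);

-- Show the execution plan for the join without running it
EXPLAIN
SELECT A.id, A.link_id, link_ref, network,
       date_1, time_per,
       veh_cls, data_source, N, av_jt
FROM A
INNER JOIN B
    ON A.link_id = B.link_id;
```

In the output, the key column for the table that is looked up by link_id should name one of the two indexes. A NULL key on table A would suggest that A is being scanned in full.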

Related

if attribute x = "something" do this

So I have the following tables running, but I'm having a problem in one specific situation.
I have a network of soap dispensers and want to keep track of their current soap level. I'm counting the number of pumps (3 milliliters each) and computing greatest(full_capacity - number_pumps * 3, 0), as seen in the view below.
But my problem is: there is a maintenance table, and one of the descriptions may be "refill". What I want is that when maintenance_description = "refill", number_pumps in the records table is set to 0 for that exact dispenser. Is it possible? I read about triggers, but couldn't really understand how to do this.
As a practical example, let's say I have soap dispenser id 1 with a max capacity of 1000 ml. I then count 300 pumps, so I know I have 100 ml left. I then do a refill and want the number of pumps to be set to 0. Otherwise, at the next use it will say I have 97 ml available, when in reality I have 997 ml because I already refilled.
Thank you very much in advance.
create table dispenser(
id_dispenser int not null auto_increment,
localization_disp varchar(20) not null,
full_capacity int not null,
primary key (id_dispenser));
create table records(
time_stamp DATETIME DEFAULT CURRENT_TIMESTAMP not null,
dispenser_id int not null,
number_pumps int not null,
battery_level float not null,
primary key (dispenser_id,time_stamp));
create table maintenance(
maintenance_id int not null auto_increment,
maintenance_date DATETIME DEFAULT CURRENT_TIMESTAMP not null,
employee_id int not null,
maintenance_description varchar(20) not null,
dispenser_id int not null,
primary key (maintenance_id));
CREATE VIEW left_capacity
AS
SELECT max(time_stamp) AS calendar,
id_dispenser AS dispenser,
full_capacity AS capacity,
greatest(full_capacity - number_pumps * 3, 0) AS available
FROM records r
INNER JOIN dispenser d
ON d.id_dispenser = r.dispenser_id
GROUP by id_dispenser;

If I understand correctly, you want a view with the amount remaining. This would be based on the number of pumps since the last refill, subject to your formula.
MySQL has had tricky issues with subqueries in views. I think the following is view-safe for MySQL:
select d.*,
(d.full_capacity -
(select count(*) * 3
from records r
where r.dispenser_id = d.id_dispenser and
r.time_stamp > (select max(m.maintenance_date)
from maintenance m
where m.dispenser_id = r.dispenser_id and
m.maintenance_description = 'refill'
)
)
) as available
from dispenser d;
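Since the question asked about triggers: as an alternative to the view approach, a trigger on maintenance could reset the counter whenever a refill is logged. This is only a sketch against the schema in the question. The trigger name and the employee id in the usage line are made up, and it assumes that overwriting number_pumps in every existing records row for that dispenser is acceptable (it discards the pump history).

```sql
DELIMITER //

CREATE TRIGGER reset_pumps_on_refill   -- hypothetical trigger name
AFTER INSERT ON maintenance
FOR EACH ROW
BEGIN
    -- Only react to refills, not to other maintenance entries
    IF NEW.maintenance_description = 'refill' THEN
        UPDATE records
           SET number_pumps = 0
         WHERE dispenser_id = NEW.dispenser_id;
    END IF;
END//

DELIMITER ;

-- Usage: logging a refill for dispenser 1 now also zeroes its pump count
INSERT INTO maintenance (employee_id, maintenance_description, dispenser_id)
VALUES (7, 'refill', 1);
```

DELIMITER is a command of the mysql command-line client, so other clients may need the trigger body submitted differently.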

Optimal Database Struct

I'm a data lover and created a list of possible item combinations for a widely known mobile game. There are 21.000.000 combinations (useless combos filtered out by logic).
So what I want to do now is create a website people can use to see what they need to get the best gear, or what's the best they can do with the gear they have right now.
My Item Database currently looks like this:
CREATE TABLE `items` (
`ID` int(8) unsigned NOT NULL,
`Item1` int(2) unsigned NOT NULL,
`Item2` int(2) unsigned NOT NULL,
`Item3` int(2) unsigned NOT NULL,
`Item4` int(2) unsigned NOT NULL,
`Item5` int(2) unsigned NOT NULL,
`Item6` int(2) unsigned NOT NULL,
`Item7` int(2) unsigned NOT NULL,
`Item8` int(2) unsigned NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB
ID range: 1 - 21.000.000
Every item is identified by a number, e.g. 11. The first digit is the category and the second digit is the item within that category. For example, 34 means Item3 --> 4. It's stored like this because I also have images to show on the website later that use this number as identification (34.png).
The Stats Database looks like this right now:
CREATE TABLE stats (
Stat1 FLOAT UNSIGNED NOT NULL,
Stat2 FLOAT UNSIGNED NOT NULL,
Stat3 FLOAT UNSIGNED NOT NULL,
Stat4 FLOAT UNSIGNED NOT NULL,
Stat5 FLOAT UNSIGNED NOT NULL,
Stat6 FLOAT UNSIGNED NOT NULL,
Stat7 FLOAT UNSIGNED NOT NULL,
Stat8 FLOAT UNSIGNED NOT NULL,
ID1 INT UNSIGNED,
ID2 INT UNSIGNED,
ID3 INT UNSIGNED,
ID4 INT UNSIGNED,
ID5 INT UNSIGNED,
ID6 INT UNSIGNED,
ID7 INT UNSIGNED,
ID8 INT UNSIGNED
) ENGINE = InnoDB;
Where Stat* stands for things like Attack, Defense, Health, etc. and ID* for the ID in the items table. Some combinations have the same values across all 8 stats, so I grouped them together to save some rows (I don't know yet whether that was smart). For example, one stat combination can have ID1, ID2 and ID3 filled while another has only ID1 (the max is 8 IDs though, I calculated it).
Right now I'm displaying a huge table sortable by every stat, and it's working fine.
What I want in the future, though, is to let the user search for items or exclude certain items from the list. I know I can do this with a join and WHERE clauses (where items.ID = stats.ID1 OR items.ID = stats.ID2 etc.), but I wonder whether my current structure is the smartest solution for this. I'm trying to get the best performance, as I'm running this on my old Pi 2.

When you have very large data sets with only a small number of matches, the best performance often comes from using a subquery in the FROM or WHERE clause.
SELECT SP.TerritoryID,
SP.BusinessEntityID,
SP.Bonus,
TerritorySummary.AverageBonus
FROM (SELECT TerritoryID,
AVG(Bonus) AS AverageBonus
FROM Sales.SalesPerson
GROUP BY TerritoryID) AS TerritorySummary
INNER JOIN
Sales.SalesPerson AS SP
ON SP.TerritoryID = TerritorySummary.TerritoryID
Copied from here
This effectively creates a virtual table of only those rows that match, then runs the join on the virtual table - a lot like selecting the matching rows into a tmp table, then joining on the tmp table. Running a join on the entire table, although you might think it would be OK, often comes out terrible.
You may also find using a subquery in the WHERE clause works
... where items.id in (select id1 from stats union select id2 from stats)
Or select your matching stats IDs into a tmp table, then index the tmp table.
It all depends quite a lot on what your other selection logic is.
It also sounds like you should get some indexes on the stats table. If you're not updating it a lot, then indexing every ID can work OK. Just make sure the unfilled stats IDs have the value NULL.
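The temporary-table variant could look roughly like this for the items and stats tables from the question. The table name tmp_item_ids and the filter Stat1 > 100 are placeholders for whatever selection logic applies; only ID1 and ID2 are shown, and ID3 to ID8 would follow the same pattern.

```sql
-- The primary key doubles as the index on the temporary table
CREATE TEMPORARY TABLE tmp_item_ids (
    item_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id)
);

-- INSERT IGNORE skips ids that were already collected
INSERT IGNORE INTO tmp_item_ids
SELECT ID1 FROM stats WHERE Stat1 > 100 AND ID1 IS NOT NULL;

INSERT IGNORE INTO tmp_item_ids
SELECT ID2 FROM stats WHERE Stat1 > 100 AND ID2 IS NOT NULL;

-- Join the small id list against the large items table
SELECT i.*
FROM items i
INNER JOIN tmp_item_ids t ON t.item_id = i.ID;
```

The idea is that the expensive OR across eight ID columns is replaced by a few simple scans of stats plus one primary-key join.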

Strange behaviour on MYSQL querying huge table

I'm trying to understand a strange performance behaviour happening in a MySQL data structure that I'm working on:
CREATE TABLE metric_values
(
dmm_id INT NOT NULL,
dtt_id BIGINT NOT NULL,
cus_id INT NOT NULL,
nod_id INT NOT NULL,
dca_id INT NULL,
value DOUBLE NOT NULL
)
ENGINE = InnoDB;
CREATE INDEX metric_values_dmm_id_index
ON metric_values (dmm_id);
CREATE INDEX metric_values_dtt_index
ON metric_values (dtt_id);
CREATE INDEX metric_values_cus_id_index
ON metric_values (cus_id);
CREATE INDEX metric_values_nod_id_index
ON metric_values (nod_id);
CREATE INDEX metric_values_dca_id_index
ON metric_values (dca_id);
CREATE TABLE dim_metric
(
dmm_id INT AUTO_INCREMENT
PRIMARY KEY,
met_id INT NOT NULL,
name VARCHAR(45) NOT NULL,
instance VARCHAR(45) NULL,
active BIT DEFAULT b'0' NOT NULL
)
ENGINE = InnoDB;
CREATE INDEX dim_metric_dmm_id_met_id_index
ON dim_metric (dmm_id, met_id);
CREATE INDEX dim_metric_met_id_index
ON dim_metric (met_id);
CONTEXT:
metric_values has close to 100 million rows and dim_metric has 1024 rows.
I'm doing a simple JOIN between these 2 tables and I'm having huge performance issues. While trying to figure out what the problem is, I stumbled on this strange behaviour.
I can't execute the JOIN using the column met_id as a filter. I left it running for 10 minutes and lost the connection to the database due to a timeout before I got any results back.
Running EXPLAIN on the query, I can see that the indexes are being used correctly (I assume) and only 1052 rows are being scanned from metric_values.
EXPLAIN
SELECT
count(0)
FROM metric_values v
INNER JOIN dim_metric m ON m.dmm_id = v.dmm_id
WHERE 1=1
AND m.met_id = 1;
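id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
-----------------------------------------------------------------------------------------------------------------
1 | SIMPLE | m | ref | PRIMARY,dim_metric_met_id_index,dim_metric_dmm_id_met_id_index | dim_metric_met_id_index | 4 | const | 1 | Using index
1 | SIMPLE | v | ref | metric_values_dmm_id_index | metric_values_dmm_id_index | 4 | oi_fact.m.dmm_id | 1052 | Using index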
Making a simple change to the query to use a subselect instead of a JOIN, I get the results after ~45 seconds.
Running EXPLAIN on the modified query, I can see that the index is not the primary resource being used to fetch the data and that almost 20 million rows were scanned to produce the result.
EXPLAIN
SELECT
count(0)
FROM metric_values v
WHERE 1=1
AND v.dmm_id = (SELECT m.dmm_id FROM dim_metric m WHERE m.met_id = 1);
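id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
-----------------------------------------------------------------------------------------------------------------
1 | PRIMARY | v | ref | metrics_values_dmm_id_index | metrics_values_dmm_id_index | 4 | const | 19589800 | Using where; Using index
2 | SUBQUERY | m | ref | dim_metric_met_id_index | dim_metric_met_id_index | 4 | const | 1 | Using index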
Can someone explain to me what is happening? Did I misunderstand what EXPLAIN is telling me? Can I make changes to the data model to improve the query performance?
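One thing worth checking here: the rows column in EXPLAIN is an estimate, not a count of rows actually read. In the JOIN plan the value for v is typically derived from index statistics (an average number of rows per dmm_id), while in the subquery plan the optimizer already knows the constant and can estimate for that specific value, so both plans may be doing similar work. A hedged sketch for comparing the estimates with reality, using only the tables from the question:

```sql
-- Refresh the index statistics the optimizer bases its estimates on
ANALYZE TABLE metric_values;

-- The Cardinality column shows the estimated number of distinct values per index
SHOW INDEX FROM metric_values;

-- Actual number of rows for the metric in question
SELECT COUNT(*) AS actual_rows
FROM metric_values
WHERE dmm_id = (SELECT dmm_id FROM dim_metric WHERE met_id = 1);
```

If actual_rows is close to 20 million, the figure of 1052 in the first plan is an average across all dmm_id values rather than a sign that the JOIN reads less data.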

Longest MySQL queries for worst case testing

I have a big MySQL database (about one million entries are planned) and I want to test its performance by creating the worst query (longest calculation time) I can.
For now it is a database with two tables:
CREATE TABLE user (ID BIGINT NOT NULL AUTO_INCREMENT,
createdAt DATETIME NULL DEFAULT NULL,
lastAction DATETIME NULL DEFAULT NULL,
ip TEXT NULL DEFAULT NULL,
browser TEXT NULL DEFAULT NULL,
PRIMARY KEY (ID))
CREATE TABLE evt (ID BIGINT AUTO_INCREMENT,
UID BIGINT NULL DEFAULT NULL,
timeStamp DATETIME NULL DEFAULT NULL,
name TEXT NULL DEFAULT NULL,
PRIMARY KEY (ID),
FOREIGN KEY (UID)
REFERENCES user(ID))
It's populated and is running locally so no connection is required.
Are there any rules of thumb on how to create horrible queries?
My worst query for now was:
SELECT user.browser, evt.name, count(*) as AmountOfActions
FROM evt
JOIN user ON evt.UID = user.ID
GROUP BY user.browser, evt.name
ORDER BY AmountOfActions DESC

The number one cost in a query is disk hits. So, make a table big enough so that it cannot be cached in RAM. And/or do a cross-join (etc) such that an intermediate table is too big to be cached in RAM.
A common problem on this forum is lots of joins followed by a group by. Or lots of joins, plus an order by on the big intermediate result.
Here's a double-whammy -- join two tables (each too big to be cached) on a UUID.
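As an illustration of these rules on the user and evt tables from the question, the sketch below joins evt to itself on the unindexed TEXT column name, which produces a large intermediate result, and then groups and sorts it. It is an assumed example, not a measured worst case, and on a million rows it may run for a very long time.

```sql
SELECT e1.name,
       COUNT(*) AS pair_count
FROM evt e1
JOIN evt e2
    ON e1.name = e2.name      -- name has no index, so no index lookup is possible
   AND e1.ID <> e2.ID         -- pair every event with all other events of the same name
GROUP BY e1.name
ORDER BY pair_count DESC;
```

The fewer distinct values name has, the more pairs each value generates, so the intermediate result grows roughly quadratically with the number of events per name.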

Indexing table columns

I am developing an application in which I build query strings in the program and pass them to a stored procedure that contains four prepared statements. After the variables are substituted, the statements are as follows:
DROP TABLE IF EXISTS tbl_correlatedData;
CREATE TABLE tbl_correlatedData
SELECT t0.*,t1.counttimeStamplocalIp,t2.countlocalPort,t3.countlocalGeo,t4.countisp,t5.countforeignip,t6.countforeignPort,t7.countforeignGeo,t8.countinfection
FROM tbl_union_threats t0
LEFT JOIN tbl_tsli t1
USING (timeStamp,localIp) LEFT JOIN tbl_tslilp t2 USING (timeStamp,localIp,localPort)
LEFT JOIN tbl_tslilplg t3
USING (timeStamp,localIp,localPort,localGeo)
LEFT JOIN tbl_tslilplgisp t4
USING (timeStamp,localIp,localPort,localGeo,isp)
LEFT JOIN tbl_tslilplgispfi t5
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip)
LEFT JOIN tbl_tslilplgispfifp t6
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort)
LEFT JOIN tbl_tslilplgispfifpfg t7
USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort,foreignGeo)
LEFT JOIN tbl_tslilplgispfifpfginf t8 USING (timeStamp,localIp,localPort,localGeo,isp,foreignip,foreignPort,foreignGeo,infection)
GROUP BY timeStamp,localIp;
ALTER TABLE tbl_correlatedData
MODIFY timeStamp VARCHAR(200) NOT NULL,
MODIFY localIp VARCHAR(200) NOT NULL,
MODIFY localPort VARCHAR(200) NOT NULL,
MODIFY localGeo VARCHAR(200) NOT NULL,
MODIFY isp VARCHAR(200) NOT NULL,
MODIFY foreignip VARCHAR(200) NOT NULL,
MODIFY foreignPort VARCHAR(200) NOT NULL,
MODIFY foreignGeo VARCHAR(200) NOT NULL,
MODIFY infection VARCHAR(200) NOT NULL;
CREATE INDEX id_index ON tbl_correlatedData (timeStamp,localIp,localPort,localGeo,isp,foreignIp,foreignPort,foreignGeo,infection);
But when the process gets to the indexing query, it fails with the error:
Incorrect key file for table 'tbl_correlateddata'; try to repair it
FYI:
I am trying this on 32-bit Windows Vista with 19 GB of free space on the drive, using the XAMPP server; phpMyAdmin shows the created table's size as 25 MB.
EDIT:
When I try to repair it, REPAIR TABLE tbl_correlateddata gives the following:
Table | Op | Msg_type | Msg_text
-----------------------------------------------------------------------------------------------------------------
db_threatanalysis.tbl_correlateddata | repair| Error | Table 'db_threatanalysis.tbl_correlateddata' doesn...
db_threatanalysis.tbl_correlateddata | repair| status | Operation failed
Thank you very much in advance for the help :)

You are trying to create a compound (multi-column) index that is simply too long for InnoDB.
You have 9 columns, each VARCHAR(200), so the total index width is 1800 characters. According to the MySQL documentation, an InnoDB index key is limited to 3072 bytes. With a single-byte character set 1800 characters would fit, but with a multi-byte character set (utf8 uses up to 3 bytes per character) they can need up to 5400 bytes. There is also no guarantee that ALTER TABLE ... MODIFY ... was able to reduce all column widths to 200 or less, so if even one remained at something like 4000 characters, you would get this error.
Solution:
Reduce the number of columns in your compound index.
Analyze the queries that are going to read this generated table, and only create the indexes that are really necessary. I would imagine most of them will be one-column indexes.
Also, it is rather strange that you need VARCHAR(200) to store something as simple as a timestamp, IP, port, etc. You can probably squeeze each of them into 10 bytes or less and call it a day.
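To see how wide the indexed columns really are after the ALTER TABLE, the byte lengths can be read from information_schema. A sketch, assuming the schema and table names shown in the REPAIR TABLE output of the question:

```sql
-- Maximum length per column in characters and in bytes
-- (the byte length depends on the column's character set)
SELECT COLUMN_NAME,
       CHARACTER_MAXIMUM_LENGTH AS max_chars,
       CHARACTER_OCTET_LENGTH   AS max_bytes
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'db_threatanalysis'
  AND TABLE_NAME   = 'tbl_correlateddata'
  AND COLUMN_NAME IN ('timeStamp', 'localIp', 'localPort', 'localGeo', 'isp',
                      'foreignip', 'foreignPort', 'foreignGeo', 'infection');
```

If the max_bytes values add up to roughly more than 3072, the nine-column index cannot be created as written.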

Your key size might be too long. I tried something similar on my local MySQL install. Since I don't have your source tables, I could not run your CREATE TABLE ... SELECT statement, so I created the table directly. As my database is set up for Unicode, my key size was over 4000 bytes. MySQL InnoDB can only create indexes with a key size of up to 3072 bytes.
My code looked as follows:
CREATE TABLE tbl_correlatedData
(
`timeStamp` VARCHAR(200) NOT NULL,
localIp VARCHAR(200) NOT NULL,
localPort VARCHAR(200) NOT NULL,
localGeo VARCHAR(200) NOT NULL,
isp VARCHAR(200) NOT NULL,
foreignip VARCHAR(200) NOT NULL,
foreignPort VARCHAR(200) NOT NULL,
foreignGeo VARCHAR(200) NOT NULL,
infection VARCHAR(200) NOT NULL
);
CREATE INDEX id_index ON tbl_correlatedData
(timeStamp,
localIp,
localPort,
localGeo,
isp,
foreignIp,
foreignPort,
foreignGeo,
infection
);
This resulted in the error:
Error Code: 1071. Specified key was too long; max key length is 3072 bytes
Please read about size limitations here: http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html. I suspect you have this problem.

Index key prefixes can be up to 767 bytes for InnoDB tables and up to 1000 bytes for MyISAM tables.
The total index key length for InnoDB is limited to 3072 bytes.
First check the length of the index and, if possible, reduce the column size (e.g. to VARCHAR(100)) for all columns.
If possible, create separate indexes (if that suits your requirements).
See these links:
http://dev.mysql.com/doc/refman/5.0/en//create-index.html
http://bugs.mysql.com/bug.php?id=6604
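These suggestions can be sketched as follows. The index names and the choice of columns are assumptions made for illustration; which columns to index depends on the queries that read the table, and the MODIFY step assumes no stored value is longer than 100 characters.

```sql
-- Shrink the columns first (100 is the width suggested above)
ALTER TABLE tbl_correlatedData
    MODIFY timeStamp VARCHAR(100) NOT NULL,
    MODIFY localIp   VARCHAR(100) NOT NULL;

-- Then index only what the queries filter on (hypothetical index names)
CREATE INDEX ts_localip_index ON tbl_correlatedData (timeStamp, localIp);

-- A prefix length limits how many characters of a wide column go into the key
CREATE INDEX infection_index ON tbl_correlatedData (infection(100));
```

Two narrow indexes like these stay well below the per-column and total key limits, at the cost of not covering every column combination.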
