Generating a histogram from MySQL data

I was wondering if anyone had some advice for me regarding a histogram-generating query. I have a query that I like (in that it works), but it is extremely slow. Here is the background:
I have a table of metadata, a table of data values where one row in meta_data is a key-row for many (perhaps several thousand) rows in data_values, and a table of histogram bin information:
create table meta_data (
id int not null primary key,
name varchar(100),
other_data char(10)
);
create table data_values (
id int not null primary key,
meta_data_id int not null,
data_value real
);
create table histogram_bins (
id int not null primary key,
bin_min real,
bin_max real,
bin_center real,
bin_size real
);
And a query that creates the histogram:
SELECT md.name AS `Name`,
md.other_data AS `OtherData`,
hist.bin_center AS `Bin`,
SUM(data.data_value BETWEEN hist.bin_min AND hist.bin_max) AS `Frequency`
FROM histogram_bins hist
LEFT JOIN data_values data ON 1 = 1
LEFT JOIN meta_data md ON md.id = data.meta_data_id
GROUP BY md.id, `Bin`;
In an earlier version of this query, the BETWEEN ... AND logical statement was down in the JOIN (replacing 1 = 1), but then I would only receive histogram rows with non-zero frequency. I need rows for all of the bins (even the zero-frequency ones), for analysis purposes.
It's pretty darn slow, to the tune of 10-15 minutes or so. The data_values table has about 7.9 million rows, and meta_data weighs in at 15,900 rows -- so maybe it is just going to take a long time!
Thanks very much!

I think this might help:
-- Outer query: one row per bin, with 0 for bins that received no values
SELECT h.bin_center AS `Bin`,
IFNULL(F.Frequency, 0) AS `Frequency`
FROM histogram_bins h
LEFT JOIN
-- Subquery: find the bin for every data row, then count rows per bin
-- (meta_data is not joined here, so the counts are totals across all meta_data rows)
(SELECT hist.bin_center AS `Bin`,
COUNT(data.id) AS `Frequency`
FROM data_values data
LEFT JOIN histogram_bins hist ON data.data_value BETWEEN hist.bin_min AND hist.bin_max
GROUP BY hist.bin_center) F ON F.`Bin` = h.bin_center;
I changed the order of the tables because I think it's best to find the corresponding bin for every record in the data, and then just count how many records there are per bin.
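A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption: the SQL is plain enough that it should behave the same on both). The bin edges and data rows are made up for illustration. Each value is matched to its bin first, the matches are counted per bin, and the counts are then left-joined back onto histogram_bins so that empty bins come out with a frequency of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_values (id INTEGER PRIMARY KEY, meta_data_id INT NOT NULL, data_value REAL);
    CREATE TABLE histogram_bins (id INTEGER PRIMARY KEY, bin_min REAL, bin_max REAL,
                                 bin_center REAL, bin_size REAL);
    -- Hypothetical sample data: three bins, the middle one left empty on purpose.
    INSERT INTO histogram_bins VALUES (1, 0, 10, 5, 10), (2, 10, 20, 15, 10), (3, 20, 30, 25, 10);
    INSERT INTO data_values VALUES (1, 1, 2.5), (2, 1, 7.0), (3, 2, 22.0);
""")

histogram = conn.execute("""
    SELECT h.bin_center, COALESCE(f.frequency, 0) AS frequency
    FROM histogram_bins h
    LEFT JOIN (
        -- Count only the values that actually fall in some bin.
        SELECT hist.id AS bin_id, COUNT(*) AS frequency
        FROM data_values d
        JOIN histogram_bins hist
          ON d.data_value BETWEEN hist.bin_min AND hist.bin_max
        GROUP BY hist.id
    ) f ON f.bin_id = h.id
    ORDER BY h.bin_center
""").fetchall()

print(histogram)  # [(5.0, 2), (15.0, 0), (25.0, 1)]
```

Note that BETWEEN is inclusive on both ends, so a value sitting exactly on an edge shared by two bins would be counted in both. To get one histogram per meta_data row, as in the question, the bins would presumably also have to be paired with every meta_data row before the counts are joined in.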


complex SQL query - one table

I am new to SQL.
I was wondering if there is a way to form a complex (I think) query of a certain form, regarding a single table - or a simple query for the same effect.
Let's say I have a table of voice actor candidates, with different attributes (columns) - name and characteristics.
Let's say I have two different actor evaluators (Stewie and Griffin), and every candidate was evaluated by at least one of them (one, or both). The evaluators evaluate the actors, and the table is built.
The rows in the table are per-evaluation, not per-person, meaning that some candidates have two separate rows, one from each evaluation.
The evaluator's name is also an attribute, a column.
Can I make a query that will choose all candidates that were evaluated by both evaluators? (and let's say show all these rows, an even number then)
(There is no attribute "evaluated by both" - that's the core)
I think it should find all rows with evaluator Stewie, then search the entire table for rows with the corresponding candidates' names, and get those with evaluator Griffin.
Summary
A table with people - names and characteristics. One or two rows per person. Each row was filled according to a different observer. There is an attribute "Is Nice". How to find all people that were observed by two observers, one marked "Yes" and one "No" under "Is Nice"?
Update
It will take me some time to check all the answers (I don't have much experience yet), and I will post an update on what worked for me.
Can I make a query that will choose all candidates that were evaluated
by both evaluators?
(and let's say show all these rows, an even number then)
There are multiple ways to do this. You can check the existence of other evaluator's evaluation, using EXISTS:
SELECT * FROM Candidate AS C1 WHERE EXISTS (SELECT * FROM Candidate AS C2 WHERE C1.id = C2.id AND C1.evaluator != C2.evaluator)
Or, you could join the table to itself: (The checks for evaluators should be changed as appropriate)
SELECT C1.candidateName FROM Candidate AS C1 JOIN Candidate AS C2 USING (id) WHERE C1.evaluator = 'Stewie' AND C2.evaluator = 'Griffin'
How to find all people that were observed by two observers, one marked
"Yes" and one "No" under "Is Nice"?
For this one, you add another condition to the queries above, that checks if one evaluation was "Yes" and the other one was "No".
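As a rough sketch of what that extra condition could look like on the EXISTS variant, here is a runnable version using Python's built-in sqlite3 module as a stand-in database. The column names (id, candidateName, evaluator, isNice) and the sample rows are assumptions made for illustration, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per evaluation; column names and rows are hypothetical.
    CREATE TABLE Candidate (id INT, candidateName TEXT, evaluator TEXT, isNice TEXT);
    INSERT INTO Candidate VALUES
        (1, 'Ann', 'Stewie',  'Yes'),
        (1, 'Ann', 'Griffin', 'No'),
        (2, 'Bob', 'Stewie',  'Yes'),
        (2, 'Bob', 'Griffin', 'Yes'),
        (3, 'Cy',  'Griffin', 'No');
""")

rows = conn.execute("""
    SELECT C1.candidateName, C1.evaluator, C1.isNice
    FROM Candidate AS C1
    WHERE EXISTS (
        SELECT 1
        FROM Candidate AS C2
        WHERE C2.id = C1.id
          AND C2.evaluator != C1.evaluator
          AND C2.isNice != C1.isNice   -- the extra condition: the two verdicts disagree
    )
    ORDER BY C1.id, C1.evaluator
""").fetchall()

print(rows)  # [('Ann', 'Griffin', 'No'), ('Ann', 'Stewie', 'Yes')]
```

Only Ann comes back: Bob was evaluated twice but both verdicts agree, and Cy was evaluated only once.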
You seem to want group by and having. Since a person cannot have more than two rows, and there are only two distinct possible values for isnice (yes or no), we can phrase the query as:
select name
from people
group by name
having max(isnice) <> min(isnice)
This filters for names that have (at least) two different values in isnice. Given the above assumptions, this is sufficient to ensure that the person was evaluated more than once, and that isnice has (at least) two different values.
So, I read the problem very carefully, and came up with my own solution.
Please check whether the code below is what you were really asking for.
--Create Candidates Table
CREATE TABLE tbl_candidates
(
c_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
c_name VARCHAR(30),
)
--Create Evaluators Table
CREATE TABLE tbl_evaluators
(
e_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
e_name VARCHAR(30),
)
--Create Evaluations Table
CREATE TABLE tbl_evaluations
(
ee_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
ee_title VARCHAR(30) NOT NULL,
ee_remarks VARCHAR(30) NOT NULL,
ee_date date NOT NULL,
c_id INT FOREIGN KEY (c_id) REFERENCES tbl_candidates(c_id) NOT NULL,
e_id1 INT FOREIGN KEY (e_id1) REFERENCES tbl_evaluators(e_id) NOT NULL,
e_id2 INT FOREIGN KEY (e_id2) REFERENCES tbl_evaluators(e_id),
IsNice VARCHAR(4)
)
--Populate data & check to verify
INSERT INTO tbl_candidates (c_name) VALUES ('Sam') , ('Smith')
SELECT * FROM tbl_candidates
INSERT INTO tbl_evaluators (e_name) VALUES ('Stewie'),('Griffin')
SELECT * FROM tbl_evaluators
INSERT INTO tbl_evaluations
(ee_title,ee_remarks,ee_date,c_id,e_id1,e_id2,IsNice)
VALUES
('Some Title','Some Comment','2020-6-12',1,1,NULL,'No'),
('Some Title','Some Comment','2020-6-12',2,1,2,'Yes'),
('Some Title','Some Comment','2020-6-12',3,2,NULL,'No')
--finally comparing whether we have the matching data of our input vs tables combined data display
select * from tbl_evaluations
select ee_id,ee_title,c_name,ee_remarks,e1.e_name,e2.e_name,ee_date,IsNice from tbl_evaluations ee
left join tbl_candidates c on c.c_id = ee.c_id left join tbl_evaluators e1 on e1.e_id = ee.e_id1 left join tbl_evaluators e2 on e2.e_id = ee.e_id2
This is surely not the best way to write it, but my first thought is
SELECT * FROM evaluations
WHERE PrName IN (
SELECT PrName
FROM evaluations
WHERE IsNice ='No')
AND PrName IN (
SELECT PrName
FROM evaluations
WHERE IsNice ='Yes')

Proper way for setting indexes in query

I have got 2 tables.
first - table t_games (alias g)
column type
g_id mediumint(8)
t_id_1 smallint(5)
t_id_2 smallint(5)
g_team_1 varchar(50)
g_team_2 varchar(50)
g_date datetime
g_live tinyint(3)
Primary index is set on g_id field and there is additional composite index set on (t_id_1, t_id_2, g_date, g_live) fields.
second - table t_teams (aliases: t1 and t2)
column type
t_id smallint(5)
t_gw_name varchar(50)
gw_cid tinyint(3)
Primary index is set on t_id.
Relation between tables (updated)
There are two teams in each game. The t_teams table holds the team names. In the t_games table I keep IDs that refer to t_teams, so that I can retrieve the name of each team taking part in the game. So, to retrieve a game ID with the team names:
SELECT g.g_id, t1.t_gw_name, t2.t_gw_name FROM t_games g
JOIN t_teams t1 ON (g.t_id_1 = t1.t_id)
JOIN t_teams t2 ON (g.t_id_2 = t2.t_id)
My SQL query:
SELECT g_id, t_id_1, t_id_2, g_team_1, g_team_2, g_date, g_live, t1.t_gw_name AS t_gw_name_1, t1.gw_cid AS gw_cid_1, t2.t_gw_name AS t_gw_name_2, t2.gw_cid AS gw_cid_2
FROM t_games g
JOIN t_teams t1 ON (t_id_1 = t1.t_id) JOIN t_teams t2 ON (t_id_2 = t2.t_id)
WHERE g.g_date < "2013-07-24 20:00:00" AND g.g_live < 2
Running EXPLAIN on it gives:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE g ALL t_id_1 NULL NULL NULL 16 Using where
1 SIMPLE t1 eq_ref PRIMARY PRIMARY 2 t_id_1 1
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 2 t_id_2 1
I tried many combinations of indexes on the table, but I can't get rid of the ALL scan.
In your case (for the query you've shown) you only need an index on the single column g_date.
You see ALL because of one (or both) of the following:
There are only 16 rows in the table (?)
You're selecting more than ~30% of the rows in the table
In both cases it's cheaper to scan the whole table than to use an index.
So, to check that the g_date index works:
Fill the t_games table with something like 1000 rows
Perform a query that would return about 10 rows from the t_games table
PS:
A composite index (g_date, g_live) won't work any better, because you have a range comparison on both columns, so only the g_date part can be used
An index on g_live alone won't be very effective, because that column has low cardinality

Select records which are changes to a series (MySQL/rails)

I have a table which records a series of values, as so ...
ID. VAL
1. 18
2. 18
3. 20
4. 20
5. 18
I'm trying to work out how to select the records at which the series changes (e.g. Record 1, 3 and 5). I'm using rails, but I'm guessing raw MySQL might be the way forward.
Would appreciate any help you could offer...
Assuming your table looks like this:
CREATE TABLE records (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
val INT UNSIGNED NOT NULL
);
What you want to do is join the table against itself, like this:
SELECT
records2.id
FROM
records AS records1
JOIN records AS records2 ON (records1.id = records2.id-1)
WHERE
records1.val != records2.val
Such that we join a record with the one preceding it. If the values for the two records differ - we have our answer.
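One thing worth noting: because that join only pairs a record with its predecessor, the very first record (which has no predecessor) is not returned, and the id - 1 trick assumes ids are consecutive with no gaps. The sketch below, run through Python's built-in sqlite3 module as a stand-in for MySQL, shows the original join next to a LEFT JOIN variant that also reports the start of the series. The aliases cur and prev are just illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, val INTEGER NOT NULL);
    INSERT INTO records (id, val) VALUES (1, 18), (2, 18), (3, 20), (4, 20), (5, 18);
""")

# Inner join: a row is reported only if it has a predecessor with a different val.
inner = conn.execute("""
    SELECT r2.id
    FROM records AS r1
    JOIN records AS r2 ON r1.id = r2.id - 1
    WHERE r1.val != r2.val
    ORDER BY r2.id
""").fetchall()

# Left join variant: a row with no predecessor (the start of the series) also counts.
with_first = conn.execute("""
    SELECT cur.id
    FROM records AS cur
    LEFT JOIN records AS prev ON prev.id = cur.id - 1
    WHERE prev.id IS NULL OR prev.val != cur.val
    ORDER BY cur.id
""").fetchall()

print(inner)       # [(3,), (5,)]
print(with_first)  # [(1,), (3,), (5,)]
```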

mysql left join, limit and sorting

I have a question. I need to do a left join between two tables and get only the first result (that is, the first record in table A that doesn't match anything in table B).
This is an example
create table a (
id int not null auto_increment primary key,
name varchar(50),
surname varchar(50),
prov char(2)
) engine = myisam;
insert into a (name,surname,prov)
values ('aaa','aaa','ss'),('bbb','bbb','ca'),('ccc','ccc','mi'),('ddd','ddd','mi'),('eee','eee','to'),
('fff','fff','mi'),('ggg','ggg','ss'),('hhh','hhh','mi'),('jjj','jjj','ss'),('kkk','kkk','to');
create table b (
id int not null auto_increment primary key,
id_name int
) engine = myisam;
insert into b (id_name) values (3),(4),(8),(5),(10),(1);
Query A:
select a.*
from a
left join b
on a.id = b.id_name
where b.id_name is null and a.prov = 'ss'
order by a.id
limit 1
Query B:
select a.*
from a
left join b
on a.id = b.id_name
where b.id_name is null and a.prov = 'ss'
limit 1
Both queries give me the right result, that is, the record with id = 7.
I want to know whether I can rely on query B even without specifying a sort on id, or whether it's just by chance that I get the right result.
I ask because on a large recordset (more than 10 million rows), the query without sorting gives me one record immediately, while with sorting it takes more than 20 seconds, even though a.id is the primary key.
Thanks in advance.
You can't rely on query B. MySQL just returned whichever row it found fastest.
Is there an index on table "b" on column "id_name"? If not, create one and tell us what you get (I mean, how fast it is). It doesn't matter that you are looking for unmatched rows: the JOIN has to be made before it can test whether there is a match or not.
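To make the point concrete, here is a small sketch with the question's own data, run through Python's built-in sqlite3 module as a stand-in for MySQL (so the MyISAM engine clause and auto_increment are dropped, and timings will not carry over). The index name idx_b_id_name is made up. Two rows satisfy the filter, so without ORDER BY either of them would be a legitimate answer to LIMIT 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT, surname TEXT, prov TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, id_name INT);
    INSERT INTO a (name, surname, prov) VALUES
        ('aaa','aaa','ss'),('bbb','bbb','ca'),('ccc','ccc','mi'),('ddd','ddd','mi'),('eee','eee','to'),
        ('fff','fff','mi'),('ggg','ggg','ss'),('hhh','hhh','mi'),('jjj','jjj','ss'),('kkk','kkk','to');
    INSERT INTO b (id_name) VALUES (3),(4),(8),(5),(10),(1);
    -- The index suggested in the answer; the name is hypothetical.
    CREATE INDEX idx_b_id_name ON b (id_name);
""")

# Every row of a with prov = 'ss' that has no match in b.
unmatched = conn.execute("""
    SELECT a.id
    FROM a
    LEFT JOIN b ON a.id = b.id_name
    WHERE b.id_name IS NULL AND a.prov = 'ss'
    ORDER BY a.id
""").fetchall()

# Query A from the question: ORDER BY makes "the first one" well defined.
first = conn.execute("""
    SELECT a.*
    FROM a
    LEFT JOIN b ON a.id = b.id_name
    WHERE b.id_name IS NULL AND a.prov = 'ss'
    ORDER BY a.id
    LIMIT 1
""").fetchone()

print(unmatched)  # [(7,), (9,)]
print(first)      # (7, 'ggg', 'ggg', 'ss')
```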

Optimizing a MySQL query with a large IN() clause or join on derived table

Let's say I need to query the associates of a corporation. I have a table, "transactions", which contains data on every transaction made.
CREATE TABLE `transactions` (
`transactionID` int(11) unsigned NOT NULL,
`orderID` int(11) unsigned NOT NULL,
`customerID` int(11) unsigned NOT NULL,
`employeeID` int(11) unsigned NOT NULL,
`corporationID` int(11) unsigned NOT NULL,
PRIMARY KEY (`transactionID`),
KEY `orderID` (`orderID`),
KEY `customerID` (`customerID`),
KEY `employeeID` (`employeeID`),
KEY `corporationID` (`corporationID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It's fairly straightforward to query this table for associates, but there's a twist: A transaction record is registered once per employee, and so there may be multiple records for one corporation per order.
For example, if employees A and B from corporation 1 were both involved in selling a vacuum cleaner to corporation 2, there would be two records in the "transactions" table; one for each employee, and both for corporation 1. This must not affect the results, though. A trade from corporation 1, regardless of how many of its employees were involved, must be treated as one.
Easy, I thought. I'll just make a join on a derived table, like so:
SELECT corporationID FROM transactions JOIN (SELECT DISTINCT orderID FROM transactions WHERE corporationID = 1) AS foo USING (orderID)
The query returns a list of corporations who have been involved in trades with corporation 1. That's exactly what I need, but it's very slow because MySQL can't use the corporationID index to determine the derived table. I understand that this is the case for all subqueries/derived tables in MySQL.
I've also tried to query a collection of orderIDs separately and use a ridiculously large IN() clause (typically 100 000+ IDs), but as it turns out MySQL has issues using indices on ridiculously large IN() clauses as well and as a result the query time does not improve.
Are there any other options available, or have I exhausted them both?
If I understand your requirement, you could try this.
select distinct t1.corporationID
from transactions t1
where exists (
select 1
from transactions t2
where t2.corporationID = 1
and t2.orderID = t1.orderID)
and t1.corporationID != 1;
or this:
select distinct t1.corporationID
from transactions t1
join transactions t2
on t2.orderID = t1.orderID
and t1.transactionID != t2.transactionID
where t2.corporationID = 1
and t1.corporationID != 1;
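A quick way to sanity-check that the two queries agree is to run them against a tiny data set. The sketch below does that with Python's built-in sqlite3 module as a stand-in for MySQL. All rows are hypothetical, and since the meaning of customerID is unclear in the question it is simply set to 0. Order 100 has two employees of corporation 1 on it, which is the duplicate case the question worries about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        transactionID INTEGER PRIMARY KEY,
        orderID INT NOT NULL, customerID INT NOT NULL,
        employeeID INT NOT NULL, corporationID INT NOT NULL
    );
    -- Hypothetical rows: corporation 1 shares order 100 with 2 and order 101 with 3.
    INSERT INTO transactions VALUES
        (1, 100, 0, 11, 1), (2, 100, 0, 12, 1), (3, 100, 0, 21, 2),
        (4, 101, 0, 11, 1), (5, 101, 0, 31, 3),
        (6, 102, 0, 41, 4), (7, 102, 0, 51, 5);
""")

via_exists = conn.execute("""
    SELECT DISTINCT t1.corporationID
    FROM transactions t1
    WHERE EXISTS (
        SELECT 1 FROM transactions t2
        WHERE t2.corporationID = 1 AND t2.orderID = t1.orderID
    )
    AND t1.corporationID != 1
    ORDER BY t1.corporationID
""").fetchall()

via_join = conn.execute("""
    SELECT DISTINCT t1.corporationID
    FROM transactions t1
    JOIN transactions t2
      ON t2.orderID = t1.orderID AND t1.transactionID != t2.transactionID
    WHERE t2.corporationID = 1 AND t1.corporationID != 1
    ORDER BY t1.corporationID
""").fetchall()

print(via_exists)  # [(2,), (3,)]
print(via_join)    # [(2,), (3,)]
```

Corporation 2 appears once even though two employees of corporation 1 were on the same order, and corporations 4 and 5 are left out because they never shared an order with corporation 1.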
Your data makes no sense to me, I think you are using corporationID where you mean customer ID at some point in there, as your query joins the transaction table to the transaction table for corporationID=1 based on orderID to get the corporationIDs...which would then be 1, right?
Can you please specify what the customerID, employeeID, and corporationIDs mean? How do I know employees A and B are from corporation 1 - in that case, is corporation 1 the corporationID, and corporation 2 is the customer, and so stored in the customerID?
If that is the case, you just need to do a group by:
SELECT customerID
FROM transactions
WHERE corporationID = 1
GROUP BY customerID
(Or select and group by orderID if you want one row per order instead of one row per customer.)
By using the group by, you ignore the fact that there are multiple records that are duplicate except for the employeeID.
Conversely, to return all corporations that have sold to corporation 2:
SELECT corporationID
FROM transactions
WHERE customerID = 2
GROUP BY corporationID