Problem
Given a user query Q on a webpage, I am attempting to compute the following somewhat complex function for each entity p in my database:

score(p) = Σ_{d ∈ D} w_{d,p} · u_{p,c*(d)} · v_{c*(d)},  where c*(d) = argmax_{c ∈ C_d ∩ Q} u_{p,c}

Here C_d is the set of c IDs related to d in the d/c relationship table, and D is the set of d IDs related to at least one c in Q (so D is also dependent upon the user query Q). There are roughly 440,000 p entities in my database, and each has an average of ~50 associated w_{d,p} weights and ~400 associated u_{p,c} weights. Right now, these weights are stored across three tables, and there is an additional table that defines the relationship between d and c:
table               columns         indices
-----               -------         -------
w weights:          p | d | w_dp    primary key (p, d), index on d
u weights:          p | c | u_pc    primary key (p, c), index on c
v weights:          c | v_c         primary key c
d/c relationship:   d | c           primary key (d, c), index on c
The IDs p, d and c are all varchars ranging in length from 10 to 16 characters. w_dp, u_pc, and v_c are floats.
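In MySQL DDL terms, that schema looks roughly like the following (the table names w_weights, u_weights, v_weights, and dc_rel are placeholders I've chosen for illustration; the columns and keys come straight from the description above):

CREATE TABLE w_weights (
    p    VARCHAR(16) NOT NULL,
    d    VARCHAR(16) NOT NULL,
    w_dp FLOAT       NOT NULL,
    PRIMARY KEY (p, d),
    INDEX (d)
);

CREATE TABLE u_weights (
    p    VARCHAR(16) NOT NULL,
    c    VARCHAR(16) NOT NULL,
    u_pc FLOAT       NOT NULL,
    PRIMARY KEY (p, c),
    INDEX (c)
);

CREATE TABLE v_weights (
    c   VARCHAR(16) NOT NULL,
    v_c FLOAT       NOT NULL,
    PRIMARY KEY (c)
);

CREATE TABLE dc_rel (
    d VARCHAR(16) NOT NULL,
    c VARCHAR(16) NOT NULL,
    PRIMARY KEY (d, c),
    INDEX (c)
);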
My current approach to computing the above equation involves multiple joins, subqueries, group by statements, etc. Depending on the Q, we may be able to trim the set of p entities we are interested in down to a more manageable number, and the computation may only take a few seconds. But some values of Q result in a computation that takes multiple minutes.
Example computation
Suppose we have the following four tables:
d/c relationship    w weights          u weights          v weights
d  | c              p  | d  | w_dp     p  | c  | u_pc     c  | v_c
d1 | c1             p1 | d1 | 1        p1 | c1 | 6        c1 | 12
d1 | c2             p1 | d2 | 3        p1 | c2 | 8        c2 | 16
d2 | c2             p2 | d1 | 2        p2 | c1 | 11       c3 | 15
d3 | c3             p3 | d2 | 4        p3 | c2 | 7
                    p3 | d3 | 5        p3 | c3 | 9
                    p4 | d3 | 10       p4 | c1 | 13
The query Q is actually a set of c IDs that restricts the set D the sum is over. So let's say Q = {c1, c2}. Using the d/c relationship table, our sum is then over d1 and d2. Computing the sum for p1, we have:
d1: w = 1; the candidates are u_{p1,c1} = 6 and u_{p1,c2} = 8, so c* = c2 and the term is 1 * 8 * v_{c2} = 1 * 8 * 16 = 128
d2: w = 3; the only candidate is u_{p1,c2} = 8, so c* = c2 and the term is 3 * 8 * v_{c2} = 3 * 8 * 16 = 384
So the score for p1 is 128 + 384 = 512. We would do a similar computation for p2 and p3. p4 is only associated with d3, and d3 has no relation with Q = {c1, c2}, so we have nothing to sum for p4.
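For reference, that computation translates fairly directly into a single query against the current schema (a sketch, assuming MySQL 8+ window functions and the illustrative table names from the DDL sketch above; this mirrors my slow multi-join approach rather than an optimized form):

SELECT scored.p, SUM(scored.w_dp * scored.u_pc * scored.v_c) AS score
FROM (
    SELECT w.p, w.d, w.w_dp, u.u_pc, v.v_c,
           ROW_NUMBER() OVER (PARTITION BY w.p, w.d
                              ORDER BY u.u_pc DESC) AS rn
    FROM w_weights w
    JOIN dc_rel r    ON r.d = w.d
    JOIN u_weights u ON u.p = w.p AND u.c = r.c
    JOIN v_weights v ON v.c = r.c
    WHERE r.c IN ('c1', 'c2')   -- the query Q = {c1, c2}
) AS scored
WHERE scored.rn = 1             -- keep only the argmax c for each (p, d)
GROUP BY scored.p;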
Question
Is there any sensible way to compute the above function quickly? In particular,
Using the existing db schema? My understanding is that using aggregate functions over joined tables like I am currently doing is quite slow, so I'm guessing the answer to this question is "no".
Using an alternative table structure? My only current idea is to join the w/u/v weights into one massive, ~700m row table, and use that table for the query. I'm not sure how helpful this will be, but maybe if properly indexed could be faster.
Should I do something else entirely? Is there an alternative data storage/processing tool I should be considering?
Because I am very much a novice when it comes to SQL, I was hoping to get a clearer sense of direction before I waste too much time on a dead end.
Update
I know this didn't get much traction, but for future viewers: the way I solved this was by doing all computations ahead of time. There are about 5,000 possible queries q and a maximum of ~500,000 entities p, which would result in a table of 2.5 billion rows where (p, q) is the primary key. In reality, for each query q only roughly 75k entities are assigned scores on average, resulting in a slightly more manageable table size of 75,000 * 5,000 = 375,000,000 rows. Now at query time, all I need to do to get the relevant list of scores is filter on q, which is much simpler than doing this crazy summation.
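In sketch form, the precomputed table and the query-time lookup are just the following (names are illustrative; note that I've put q first in the key here so that filtering on q is a prefix scan of the primary key, whereas with (p, q) you'd want a separate index on q):

CREATE TABLE precomputed_scores (
    q     VARCHAR(16) NOT NULL,
    p     VARCHAR(16) NOT NULL,
    score FLOAT       NOT NULL,
    PRIMARY KEY (q, p)
);

-- At query time, the whole summation collapses to one indexed filter:
SELECT p, score
FROM precomputed_scores
WHERE q = 'q123';   -- 'q123' stands in for the actual query ID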
That said, if anyone sees this and has input, I'd still like to hear the opinion of any SQL/MySQL experts out there.
Related
I'm post-processing traces for two different kinds of events, where the data is stored in tables A and B. Both tables have a producer ID and a time index value. While the same producer can trigger a record in both tables, the times when the different events occur are independent, and events are much more frequent in table B.
I want to update table A such that, for every row in table A, the column value from the most recent row in table B for the same producer is taken.
Example mappings between two tables:
Here is a simplified example with just one producer in both tables. The goal is not to get the oldest entry in table B, but rather the most recent entry in table B relative to a row in table A. I'm showing B.tIdx < A.tIdx in this example, but <= is just as good for my purposes; just a detail.
Table A                                 Table B
+----+------+----------------------+    +----+------+-------+
| ID | tIdx | NEW value SET FROM B |    | ID | tIdx | value |
+----+------+----------------------+    +----+------+-------+
|  1 |    2 | 12.5                 |    |  1 |    1 | 12.5  |
|  1 |    4 | 4.3                  |    |  1 |    2 | 9.0   |
+----+------+----------------------+    |  1 |    3 | 4.3   |
                                        |  1 |    4 | 7.8   |
                                        |  1 |    5 | 6.2   |
                                        +----+------+-------+
The actual tables have thousands of different IDs, millions of rows, and nearly as many distinct time index values as rows. I'm having trouble coming up with an UPDATE that doesn't take days to complete.
The following UPDATE works, but executes far too slowly; it starts off at a rate of 100s of updates/s, but soon slows to roughly 5 updates/s.
UPDATE A AS x
SET value = (SELECT y.value
             FROM B AS y
             WHERE x.ID = y.ID AND x.tIdx > y.tIdx
             ORDER BY y.tIdx DESC
             LIMIT 1);
I've tried creating indexes on ID and tIdx separately, and also multi-column indexes with both orders (ID, tIdx) and (tIdx, ID). But even when the multi-column indexes exist, EXPLAIN shows that only ID or tIdx is ever used, never both together.
I was wondering if the solution is to create nested SELECTs: first get a temporary table restricted to a particular ID, and then find the one row in table B that meets the time constraint for each tIdx for that ID. The following SELECT, with hardcoded ID and tIdx, works and is very fast, completing in 0.00 sec.
SELECT value, ID, tIdx
FROM (
SELECT value, ID, tIdx
FROM B
WHERE ID = 5216
) y
WHERE tIdx < 1253707
ORDER BY tIdx DESC LIMIT 1;
I'd like to incorporate this into an UPDATE somehow, but replace the hardcoded ID and tIdx with the ID,tIdx pair for each row in A.
Or try any other suggestion for a more efficient UPDATE statement.
This is my first post to stackoverflow. Sincere apologies in advance if I have violated any etiquette.
An UPDATE with INNER JOINs should do it, but it's going to get nasty.
UPDATE A
INNER JOIN
    (SELECT B.ID, maxb.atIdx, B.value
     FROM B
     INNER JOIN
         (SELECT A.ID, A.tIdx AS atIdx, MAX(B.tIdx) AS bigb
          FROM B
          INNER JOIN A ON B.ID = A.ID
          WHERE B.tIdx <= A.tIdx
          GROUP BY A.ID, A.tIdx) maxb
         ON B.ID = maxb.ID AND B.tIdx = maxb.bigb
    ) bestb ON A.ID = bestb.ID AND A.tIdx = bestb.atIdx
SET A.value = bestb.value;
To explain this, it's best to start with the innermost SQL and work your way out to the UPDATE. First, we join every record in table A to every record in table B with the same ID, filter out the B records that are too recent, and summarize that result for each table A record. That leaves us with, for every record key in A, the tIdx of the B row whose value should go into A. We then join that back to B to pick up the values, preserving the A table's keys, and finally join that result to A to perform the update.
You'll have to see whether this is fast enough for you - I'm worried that it accesses the B table twice and that the inner query creates a LOT of join combinations. I would pull out that inner query and see how long it runs by itself. On the positive side, these are all very simple, straightforward queries connected by inner joins, so there is some opportunity for the query optimizer to be efficient. I think indexes on A(ID, tIdx) [fast lookup to get the UPDATE row] and B(ID) would be useful here.
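For example, pulling out the inner query and adding the suggested indexes might look like this (the index names are just illustrative):

CREATE INDEX a_id_tidx ON A (ID, tIdx);
CREATE INDEX b_id ON B (ID);

-- The innermost aggregate on its own, to check its cost:
SELECT a.ID, a.tIdx AS atIdx, MAX(b.tIdx) AS bigb
FROM B b
INNER JOIN A a ON b.ID = a.ID
WHERE b.tIdx <= a.tIdx
GROUP BY a.ID, a.tIdx;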
One thing you can try is LEAD(), to see if that helps the performance:
UPDATE A JOIN
       (SELECT B.*,
               LEAD(tIdx) OVER (PARTITION BY ID ORDER BY tIdx) AS next_tIdx
        FROM B
       ) b
    ON A.ID = b.ID AND
       A.tIdx >= b.tIdx AND
       (b.next_tIdx IS NULL OR A.tIdx < b.next_tIdx)
SET A.value = b.value;
And for this you want an index on B(ID, tIdx).
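For instance (the index name is arbitrary):

CREATE INDEX b_id_tidx ON B (ID, tIdx);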
I want to access a temporarily created column
Please see the code example below.
Note: I have created a stored procedure for GP calculation. It is very long, and I want to simplify some of the calculation code.
SELECT 25 AS A, 35 AS B, SUM(A + B) AS C
I know the syntax is wrong, but I want something like the following:
| A  | B  | C  |
| 25 | 35 | 60 |
You can either store the values in a temp table first, or use a derived table inline:
SELECT A,
       B,
       SUM(A + B) AS C
FROM   (SELECT 25 AS A, 35 AS B) AS TempTable;
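If all you need is the row-wise total, a plain expression avoids the aggregate entirely (a minimal variant of the same idea):

SELECT A, B, A + B AS C
FROM   (SELECT 25 AS A, 35 AS B) AS TempTable;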
Suppose I have a database that contains two different types of information about certain unique objects; call their two classifiers 'State' and 'Condition'. The State can take the values A, B, C or D, and the Condition the values X or Y. Depending on where I am sourcing data from, sometimes this database lacks entries for a particular pair.
From this data, I'd like to make a crosstab query that shows the count of rows for each State and Condition combination, but to have it still yield a row even when every count in that row is 0. For example, I'd like the following table:
Unit | State | Condition
1 | A | X
2 | B | Y
3 | C | X
4 | B | Y
5 | B | X
6 | B | Y
7 | C | X
To produce the following crosstab:
Count | X | Y
A | 1 | 0
B | 1 | 3
C | 2 | 0
D | 0 | 0
Any help that would leave blanks instead of zeroes is fit for purpose as well, these are being pasted into a template Excel document that requires each crosstab to have an exact dimension.
What I've Tried:
The standard crosstab SQL
TRANSFORM Count(Unit)
SELECT Condition
FROM Sheet
GROUP BY Condition
PIVOT State;
obviously doesn't work, as it doesn't allow for the possibility of a D occurring. PIVOTing by a nested IIf that explicitly names D as a possible value does nothing either, nor does combining it with an Nz() around the TRANSFORM clause variable.
TRANSFORM Count(sheet.unit) AS CountOfunit
SELECT AllStates.state
FROM AllStates LEFT JOIN sheet ON AllStates.state = sheet.state
GROUP BY AllStates.state
PIVOT sheet.condition;
This uses a table "AllStates" that has a row for each state you want to force into the result. It will produce an extra column for entries that are neither Condition X nor Condition Y - that's where the forced entry for state D ends up, even though the count is 0.
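If the condition codes are fixed and known up front, you can also pin the crosstab's column headings with an IN list; that forces the X and Y columns to always appear and suppresses the extra column that the unmatched state D row would otherwise generate (the empty cells come back as blanks rather than zeroes, which you said was acceptable):

TRANSFORM Count(sheet.unit) AS CountOfunit
SELECT AllStates.state
FROM AllStates LEFT JOIN sheet ON AllStates.state = sheet.state
GROUP BY AllStates.state
PIVOT sheet.condition IN ("X", "Y");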
If you have a relatively small number of conditions, you can use this instead:
SELECT AllStates.state, Sum(IIf([condition]="x",1,0)) AS X, Sum(IIf([condition]="Y",1,0)) AS Y
FROM AllStates LEFT JOIN sheet ON AllStates.state = sheet.state
GROUP BY AllStates.state;
Unlike a crosstab, though, this won't automatically add new columns when new condition codes are added to the data. It can also be cumbersome if you have many condition codes.
I have three tables.
The first table is like:
+----+----+----+
| id | x | y |
+----+----+----+
The second and third tables are like:
+----+----+----+----+----+----+----+----+----+----+----+----+
| id | Z1 | Z2 | Z3 | .. | .. | .. | .. | .. | .. | .. | Zn |
+----+----+----+----+----+----+----+----+----+----+----+----+
n is quite large, about 800-900.
I know these are quite ugly tables and an ugly database design. But it is a raw data set and a learning set from a certain experiment. Please just ignore that.
And a skeleton of a query is like:
'SELECT a.*, b.*, c.* \
FROM `test_xy` a, `test_1` b, `test_2` c \
WHERE a.id = b.id AND b.id = c.id'
My concern is that the result of the query includes the id field three times. I want the id field to appear just once, at the front of the result.
I can do it by slicing the result table (in Python, MATLAB, etc.).
But is there a better way to do this with a large number of columns? I mean, can the id fields of the second and third tables be excluded at the query stage?
The answer is the USING syntax; see http://dev.mysql.com/doc/refman/5.5/en/join.html. Learn to use explicit JOINs before you do anything else; putting the join condition into the WHERE clause is just plain wrong.
SELECT *
FROM `test_xy` a
JOIN `test_1` b USING (`id`)
JOIN `test_2` c USING (`id`);
With USING, an unqualified SELECT * returns the shared id column only once, at the front of the result. Note that writing SELECT a.*, b.*, c.* would re-expand each table's columns and bring the duplicate id fields back.
Tables
stores (100,000 rows): id (pk), name, lat, lng, ...
store_items (9,000,000 rows): store_id (fk), item_id (fk)
items (200,000 rows): id(pk), name, ...
item_words (1,000,000 rows): item_id(fk), word_id(fk)
words (50,000 rows): id(pk), word VARCHAR(255)
Note: all ids are integers.
========
Indexes
CREATE UNIQUE INDEX storeitems_storeid_itemid_i ON store_items(store_id,item_id);
CREATE UNIQUE INDEX itemwords_wordid_itemid_i ON item_words(word_id,item_id);
CREATE UNIQUE INDEX words_word_i ON words(word);
Note: I prefer multi-column indexes (storeitems_storeid_itemid_i and itemwords_wordid_itemid_i) because of this: http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/
QUERY
select s.name, s.lat, s.lng, i.name
from words w, item_words iw, items i, store_items si, stores s
where iw.word_id=w.id
and i.id=iw.item_id
and si.item_id=i.id
and s.id=si.store_id
and w.word='MILK';
Problem: elapsed time is 20-120 sec (depending on the word)!!!
EXPLAIN output for the query above:
+----+-------------+-------+--------+---------------------------------------------------+---------------------------+---------+-------------+------+-------------+
| id | select_type | table | type   | possible_keys                                     | key                       | key_len | ref         | rows | Extra       |
+----+-------------+-------+--------+---------------------------------------------------+---------------------------+---------+-------------+------+-------------+
| 1  | SIMPLE      | w     | const  | PRIMARY,words_word_i                              | words_word_i              | 257     | const       | 1    | Using index |
| 1  | SIMPLE      | iw    | ref    | itemwords_wordid_itemid_i,itemwords_itemid_fk     | itemwords_wordid_itemid_i | 4       | const       | 1    | Using index |
| 1  | SIMPLE      | i     | eq_ref | PRIMARY                                           | PRIMARY                   | 4       | iw.item_id  | 1    |             |
| 1  | SIMPLE      | si    | ref    | storeitems_storeid_itemid_i,storeitems_itemid_fk  | storeitems_itemid_fk      | 4       | iw.item_id  | 16   | Using index |
| 1  | SIMPLE      | s     | eq_ref | PRIMARY                                           | PRIMARY                   | 4       | si.store_id | 1    |             |
+----+-------------+-------+--------+---------------------------------------------------+---------------------------+---------+-------------+------+-------------+
I want elapsed time to be less than 5 secs!!! Any ideas???
==============
What I tried
I tried to see where the increase in execution time happens by adding tables to the query one at a time.
1 table
select * from words where word='MILK';
Elapsed time: 0.4 sec
2 tables
select count(*)
from words w, item_words iw
where iw.word_id=w.id
and w.word='MILK';
Elapsed time: 0.5-2 sec (depending on word)
3 tables
select count(*)
from words w, item_words iw, items i
where iw.word_id=w.id
and i.id=iw.item_id
and w.word='MILK';
Elapsed time: 0.5-2 sec (depending on word)
4 tables
select count(*)
from words w, item_words iw, items i, store_items si
where iw.word_id=w.id
and i.id=iw.item_id
and si.item_id=i.id
and w.word='MILK';
Elapsed time: 20-120 sec (depending on word)
I guess the problem is with the indexes or with the design of the query/database. But there must be a way to make it work fast. Google does it somehow, and their tables are much bigger!
a) You're actually writing queries to do full-text search inside MySQL -> use a real FTS engine like Lucene instead.
b) Clearly, adding the 9M-row join is the performance issue.
c) How about limiting that join (maybe it's being done in full with the current query plan), like this:
SELECT s.name, s.lat, s.lng, i.name
FROM (SELECT * FROM words WHERE word = 'MILK') w
INNER JOIN item_words iw ON iw.word_id = w.id
INNER JOIN items i ON i.id = iw.item_id
INNER JOIN store_items si ON si.item_id = i.id
INNER JOIN stores s ON s.id = si.store_id;
The logic behind this is that instead of joining full tables and then limiting the results, you start by limiting the tables you are going to join; this (if the join order happens to be the one I wrote) will greatly reduce your working set and the inner query's running time.
d) Google does NOT use MySQL for FTS.
Consider de-normalising the structure. The first candidate is the 1-million-record item_words table: bring the words directly into the table. Creating a list of unique words might be more easily achieved through a view (it depends on how often you need that data compared to, for example, your need to extract a list of shops with products associated with a keyword).
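For concreteness, that de-normalisation might look something like this (a sketch; the column size and index name are assumptions):

-- Copy the word text into item_words so lookups can skip the words table:
ALTER TABLE item_words ADD COLUMN word VARCHAR(255);

UPDATE item_words iw
JOIN words w ON w.id = iw.word_id
SET iw.word = w.word;

-- Index the denormalised column for direct word -> item lookups:
CREATE INDEX itemwords_word_itemid_i ON item_words (word, item_id);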
Secondly, create indexed views (not an option in MySQL, but certainly an option in other commercial databases).
You don't have an index that it can use to find the store_id if given the item_id. If the cardinality of store_id is low enough it might gain some benefit from storeitems_storeid_itemid_i, but since you have 100,000 stores this might not be so useful. You might try creating an index on store_items that lists the item_id first:
CREATE UNIQUE INDEX storeitems_item_store ON store_items(item_id, store_id);
Also, I'm not sure if putting join conditions in the where clause will affect performance adversely in the way you're seeing but you might try changing the query to something like this:
select s.name, s.lat, s.lng, i.name
from words w LEFT JOIN item_words iw ON w.id=iw.word_id
LEFT JOIN items i ON i.id=iw.item_id
LEFT JOIN store_items si ON si.item_id=i.id
LEFT JOIN stores s ON s.id=si.store_id
where w.word='MILK';
Without knowing the exact layout of your tables it's hard to give a good answer. But these types of multi-table joins have a tendency to get really bogged down, especially when one of the selection criteria is a dynamic string.
You could try returning multiple resultsets of the tables in one go, from a stored procedure or something, and then joining the data outside of SQL. This way I once got the query time of a massive join down from 2 minutes to 4 seconds. Or do it using a temporary table and return the resultset from that when you are done.
Start with selecting from the words table, since that's where you have the dynamic string. Then you can select from the other tables based on the data returned from that query, as sketched below.
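A rough illustration of that staging (the id value 42 is made up for the example):

-- Stage 1: resolve the dynamic string to an id, once.
SELECT id FROM words WHERE word = 'MILK';   -- suppose this returns 42

-- Stage 2: run the heavy joins with the resolved id, so no string
-- comparison is involved in the large join.
SELECT s.name, s.lat, s.lng, i.name
FROM item_words iw
JOIN items i ON i.id = iw.item_id
JOIN store_items si ON si.item_id = i.id
JOIN stores s ON s.id = si.store_id
WHERE iw.word_id = 42;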
Try this one. Rewrite the query this way:
select s.name, s.lat, s.lng, i.name
from words w LEFT JOIN item_words iw ON w.id=iw.word_id AND w.word='MILK'
LEFT JOIN items i ON i.id=iw.item_id
LEFT JOIN store_items si ON si.item_id=i.id
LEFT JOIN stores s ON s.id=si.store_id
And create an index on (w.id, w.word).
Have you tried analyzing the tables? This will help the optimiser select the best possible execution plan. For example:
ANALYZE TABLE words;
ANALYZE TABLE item_words;
ANALYZE TABLE items;
ANALYZE TABLE store_items;
ANALYZE TABLE stores;
see: http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html