MySQL UNION between two large tables - mysql

This is probably a simple question for you gurus out there, but my limited knowledge of MySQL is really showing here:
I have two tables:
Table Q with fields id(pk), symbol, timestamp(bigint) and a few data fields
Table T with fields id(pk), symbol, timestamp(bigint) and a few data fields.
Table Q has about 800 million rows, table T about 80 million rows.
I want a report for one symbol, where rows from Q and T are mixed in timestamp order. With a row from T, the data fields from Q should be NULL and vice versa.
Can someone please recommend how the query should look? Also, a recommendation on how the index should be constructed would be great.
I have tried a lot of variations on inner joins, outer joins, UNION ALL, etc., but to no avail.

It sounds like you need a simple UNION ALL. (A plain UNION would also work, since you indicate that an entry exists in only one table or the other anyhow, but UNION ALL skips the duplicate-removal step.) Since the two tables have the same structure, this should be easy.
I would suggest an index on each table based on your criteria PLUS the date/time field, e.g. (symbol, timestamp), if you want the rows in a specific order; otherwise an ORDER BY clause can kill your performance. If you want more columns, just make sure that each query of the union lists the desired columns in the same order and with the same data types.
select ID, Symbol, timestamp
from Q
where symbol = 'something'
UNION ALL
select ID, Symbol, timestamp
from T
where symbol = 'something'
Again, if you want more criteria, you can adjust each WHERE clause to something like
where symbol = 'something'
and timestamp between someStartTime and someEndTime
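The question also asks for the other table's data fields to come back as NULL, with the rows mixed in timestamp order. A minimal sketch of that idea, run through Python's built-in sqlite3 module as a stand-in for MySQL. The data columns `bid` and `price`, the `src` marker column, and the index names are assumptions, since the question only mentions "a few data fields":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schemas: bid and price stand in for the unnamed data fields.
    CREATE TABLE Q (id INTEGER PRIMARY KEY, symbol TEXT, timestamp INTEGER, bid REAL);
    CREATE TABLE T (id INTEGER PRIMARY KEY, symbol TEXT, timestamp INTEGER, price REAL);
    -- Composite indexes: the filter column first, then the sort column.
    CREATE INDEX q_symbol_ts ON Q (symbol, timestamp);
    CREATE INDEX t_symbol_ts ON T (symbol, timestamp);
    INSERT INTO Q (symbol, timestamp, bid) VALUES ('ABC', 100, 1.5), ('ABC', 300, 1.6), ('XYZ', 150, 9.0);
    INSERT INTO T (symbol, timestamp, price) VALUES ('ABC', 200, 1.55), ('XYZ', 250, 9.1);
""")

# Each branch pads the other table's data column with NULL so the
# column lists line up; the final ORDER BY applies to the whole union.
query = """
    SELECT 'Q' AS src, id, symbol, timestamp, bid, NULL AS price
    FROM Q WHERE symbol = :sym
    UNION ALL
    SELECT 'T' AS src, id, symbol, timestamp, NULL AS bid, price
    FROM T WHERE symbol = :sym
    ORDER BY timestamp
"""
rows = conn.execute(query, {"sym": "ABC"}).fetchall()
for row in rows:
    print(row)
```

The same SELECT ... UNION ALL ... ORDER BY shape should carry over to MySQL, though how well it performs on hundreds of millions of rows depends on the indexes and would need checking with EXPLAIN.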

Related

mysql query using where clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has 24,588,820 rows.
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in the columns, so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't be required to go to the full data row of each stock record; both fields are in the index, so it can use that alone. Additionally, you are doing a DISTINCT, so having the index available to help optimize the distinctness should also help.
Now, the other issue is time. Since you are doing a DISTINCT from stock to product to product status, you are asking for all 24 million TBOM items (I assume bill of materials), and each BOM component could have multiple statuses created, so you are getting every BOM for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status ) as the index. This way, the WHERE clause would be optimized, and the DISTINCT would be too, since the pieces are already part of the index.
But, as you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 or 1,000 entries in your result set. Take this and span it across 24 million rows, and it's definitely not going to look good.
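The 10 * 100 multiplication is easy to reproduce on toy data. The sketch below uses Python's built-in sqlite3 module rather than MySQL, and the product names, status values, and dates are made up for illustration. It counts the joined rows first without, and then with, a date filter on the status table:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TBOM (Product TEXT, Component TEXT)")
conn.execute("CREATE TABLE TComponent_Status (component TEXT, status TEXT, date_created TEXT)")

# 10 BOM entries that all use the same component ...
conn.executemany("INSERT INTO TBOM VALUES (?, ?)", [(f"P{i}", "C1") for i in range(10)])
# ... and 100 status changes for that component, one per day.
start = date(2020, 1, 1)
conn.executemany(
    "INSERT INTO TComponent_Status VALUES (?, ?, ?)",
    [("C1", f"status-{i}", (start + timedelta(days=i)).isoformat()) for i in range(100)],
)

# Unfiltered join: every BOM entry pairs with every status change.
all_rows = conn.execute(
    "SELECT COUNT(*) FROM TBOM JOIN TComponent_Status TCS ON TBOM.Component = TCS.component"
).fetchone()[0]

# Filter the status table first (here: the last 5 days), then join.
cutoff = (start + timedelta(days=95)).isoformat()
recent_rows = conn.execute(
    """
    SELECT COUNT(*)
    FROM (SELECT component FROM TComponent_Status WHERE date_created >= ?) AS JustThese
    JOIN TBOM ON JustThese.component = TBOM.Component
    """,
    (cutoff,),
).fetchone()[0]

print(all_rows, recent_rows)  # 1000 50
```

Narrowing the status rows before the join shrinks the result from 1,000 rows to 50 here; whether a similar cut is possible on the real data depends on what the report actually needs.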

Comparing Relevance Scores From mySQL full text searches from different tables

Scenario:
I have 5 tables, all of which need to be searched. I have proper full text indexes (indices?) for each. I can search each individually using MATCH and AGAINST, ordering by their relevance scores.
The problem is I want to combine and interweave the search results of all 5 tables and base it off of relevance score. Like so:
(SELECT *, MATCH(column) AGAINST (query) as score
FROM table1
WHERE MATCH (column) AGAINST (query))
UNION
(SELECT *, MATCH(column) AGAINST (query) as score
FROM table2
WHERE MATCH (column) AGAINST (query))
UNION
...
ORDER BY score DESC
This works well and dandy, except that table 1 may have twice as many rows as table 2. Since MySQL takes uniqueness into account for relevance, the scores for results from table 1 are most often significantly higher than those for results from table 2.
Ultimately: How can I normalize the scores for the results from the 5 tables of varying size if I want to weight results from each table equally?
Your UNION'ing of the results from the five tables makes me believe you probably should merge the five tables into a single one (with perhaps an additional column that identifies the one of five types of data, currently spread in five tables).
Similarly, you could store just the text column in one single table, like this one:
CREATE TABLE text_table (
text_col TEXT,
fk INT, -- references the PK of an item in either table1, or table2, or...
ref_table INT, -- identifies the related table, e.g. 1 means 'table1', etc.
FULLTEXT INDEX (text_col)
)
Then you could run the full-text search on this table. JOIN'ing the results with the actual data tables should be straightforward.
As a note:
The suggestions above by YaK are likely the best options for most scenarios asking this question. The route I actually took was to record the average highest relevance score for each of the 5 tables. I would then divide all future relevance scores by this factor in an attempt to 'normalize' the scores so that they could be compared to the relevance scores from the other tables. Thus far it has worked well, but not perfectly (particularly for large queries).
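The per-table normalization described in this note could be sketched roughly as follows. The function name, the tuple layout, and the factor values are hypothetical; in practice the factors would be the recorded average top scores for each table:

```python
from typing import Dict, List, Tuple

# (table name, row id, raw relevance score) as returned by each per-table search
Result = Tuple[str, int, float]


def normalize_and_merge(results: List[Result], factors: Dict[str, float]) -> List[Result]:
    """Divide each score by its table's factor, then sort all rows together."""
    merged = [(table, row_id, score / factors[table]) for table, row_id, score in results]
    merged.sort(key=lambda r: r[2], reverse=True)
    return merged


# Made-up factors: table1 tends to produce scores about twice as high as table2.
factors = {"table1": 8.0, "table2": 4.0}
raw = [("table1", 1, 6.0), ("table1", 2, 2.0), ("table2", 7, 3.5)]

merged = normalize_and_merge(raw, factors)
for row in merged:
    print(row)
```

With these numbers, the table2 row (3.5 / 4.0) ends up ahead of the best table1 row (6.0 / 8.0), even though its raw score is lower.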

How to select records in one table but not another with multiple PKIDs?

Here is my setup:
Table records contains multiple (more than two) PKID columns along with some other columns.
Table cached_records only has two columns, which are the same as two of the PKIDs for records.
For instance, let's assume records has PKIDs 'keyA', 'keyB', and 'keyC' and cached_records only has 'keyA' and 'keyB'.
I need to pull the rows from the records table where the appropriate PKIDs (so, 'keyA' and 'keyB') are not in the cached_records table.
If I were working with only ONE PKID, I know how simple this task would be:
SELECT
pkid
FROM
records
WHERE
pkid NOT IN (SELECT pkid FROM cached_records)
However, the fact that there are two PKIDs means I can't use a simple NOT IN. This is what I currently have:
SELECT
`keys`.`keyA` AS `keyA`,
`keys`.`keyB` AS `keyB`
FROM
(
SELECT DISTINCT
`keyA`,
`keyB`
FROM
`records`
) AS `keys`
LEFT JOIN
`cached_records` AS `cached`
ON
`keys`.`keyA` = `cached`.`keyA`
AND
`keys`.`keyB` = `cached`.`keyB`
WHERE
(
`cached`.`keyA` IS NULL
AND
`cached`.`keyB` IS NULL
)
(The DISTINCT is needed because I am only grabbing two of the multiple PKIDs from the records table, so there could be duplicates, and I really don't need duplicates; 'keyC' is not being used, even though it helps determine the uniqueness of the records.)
The query above works just fine; however, as the cached_records table grows, it takes longer and longer to process (we're talking minutes now, and sometimes long enough that my code hangs and crashes).
So, I'm wondering what the most efficient way is to do this kind of operation (selecting rows from one table where the rows don't exist in another) with multiple PKIDS instead of just one...
This should be quicker:
SELECT DISTINCT
`records`.`keyA` AS `keyA`,
`records`.`keyB` AS `keyB`
FROM
`records`
LEFT JOIN
`cached_records` AS `cached`
ON
`records`.`keyA` = `cached`.`keyA`
AND
`records`.`keyB` = `cached`.`keyB`
WHERE
`cached`.`keyA` IS NULL -- one is enough here
Notes:
With the subquery as a derived table, you lose a lot of performance. You can do the DISTINCT in the outermost SELECT instead.
It is enough to check one of the two keys for NULL: since neither key column can be NULL in cached_records, a NULL there means no match was found.
You should verify that the keyA and keyB columns have the same type in both tables, so that no conversion occurs (I have seen this in live production code...).
You should have proper indexes on the tables. Minutes for this query is a sign of something awful going on... (or of an insane amount of data).
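As a small check of the anti-join on a composite key, the sketch below runs the LEFT JOIN ... IS NULL form next to an equivalent NOT EXISTS form. It uses Python's built-in sqlite3 module instead of MySQL, and the sample rows and the composite primary keys are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed keys: the composite primary key on cached_records doubles as
    -- the (keyA, keyB) index that the join condition can use.
    CREATE TABLE records (keyA INTEGER, keyB INTEGER, keyC INTEGER,
                          PRIMARY KEY (keyA, keyB, keyC));
    CREATE TABLE cached_records (keyA INTEGER, keyB INTEGER,
                                 PRIMARY KEY (keyA, keyB));
    INSERT INTO records VALUES (1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 2);
    INSERT INTO cached_records VALUES (1, 1);
""")

# Anti-join via LEFT JOIN: unmatched rows have NULLs on the cached side.
left_join = conn.execute("""
    SELECT DISTINCT r.keyA, r.keyB
    FROM records AS r
    LEFT JOIN cached_records AS c
      ON r.keyA = c.keyA AND r.keyB = c.keyB
    WHERE c.keyA IS NULL
    ORDER BY r.keyA, r.keyB
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists = conn.execute("""
    SELECT DISTINCT r.keyA, r.keyB
    FROM records AS r
    WHERE NOT EXISTS (
        SELECT 1 FROM cached_records AS c
        WHERE c.keyA = r.keyA AND c.keyB = r.keyB
    )
    ORDER BY r.keyA, r.keyB
""").fetchall()

print(left_join)   # [(1, 2), (2, 1)]
print(not_exists)  # [(1, 2), (2, 1)]
```

Both forms return the key pairs missing from cached_records. Which one MySQL executes faster on the real tables is something to verify with EXPLAIN rather than assume.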

Best way to combine multiple advanced mysql select queries

I have multiple select statements from different tables on the same database. I was using multiple, separate queries then loading to my array and sorting (again, after ordering in query).
I would like to combine into one statement to speed up results and make it easier to "load more" (see bottom).
Each query uses SELECT, LEFT JOIN, WHERE and ORDER BY commands which are not the same for each table.
I may not need order by in each statement, but I want the end result, ultimately, to be ordered by a field representing a time (not necessarily the same field name across all tables).
I would want to limit total query results to a number, in my case 100.
I then loop through the results, and for each row I test whether OBJECTNAME_ID (e.g. comment_id, event_id, upload_id) is set, then call LOAD_WHATEVER_OBJECT, which takes the row and pushes its data into an array.
I won't have to sort the array afterwards because it was loaded in order via mysql.
Later in the app, I will "load more" by skipping the first 100, 200 or whatever page*100 is and limit by 100 again with the same query.
The end result from the database would preferably look like this:
RESULT - selected fields from a table - field to sort on is greatest
RESULT - selected fields from a possibly different table - field to sort on is next greatest
RESULT - selected fields from a possibly different table - field to sort on is third greatest
etc, etc
I see a lot of simpler combined statements, but nothing quite like this.
Any help would be GREATLY appreciated.
The easiest way might be a UNION here ( http://dev.mysql.com/doc/refman/5.0/en/union.html ):
(SELECT a,b,c FROM t1)
UNION
(SELECT d AS a, e AS b, f AS c FROM t2)
ORDER BY a DESC
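To cover the "limit to 100" and "load more" parts of the question, the union can alias each table's time field to a common name and be paged with LIMIT and OFFSET. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the table names, the time columns, and the `load_page` helper are hypothetical, and only the *_id naming comes from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables with differently named time columns.
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, posted_at INTEGER);
    CREATE TABLE events (event_id INTEGER PRIMARY KEY, starts_at INTEGER);
    INSERT INTO comments VALUES (1, 10), (2, 40), (3, 50);
    INSERT INTO events VALUES (1, 20), (2, 30), (3, 60);
""")

# Each branch fills only its own *_id column, so the loop over the result
# can tell which kind of object a row is by checking which id is set.
FEED = """
    SELECT comment_id, NULL AS event_id, posted_at AS sort_time FROM comments
    UNION ALL
    SELECT NULL AS comment_id, event_id, starts_at AS sort_time FROM events
    ORDER BY sort_time DESC
    LIMIT :page_size OFFSET :offset
"""


def load_page(page, page_size=100):
    """Return one page of the merged feed, newest first."""
    params = {"page_size": page_size, "offset": page * page_size}
    return conn.execute(FEED, params).fetchall()


first = load_page(0, page_size=4)
second = load_page(1, page_size=4)
print(first)
print(second)
```

One caveat with OFFSET paging is that rows inserted between two page loads shift the later pages, so a "load more" feed may repeat items unless the app pages on the sort value instead.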

How to write this simple MySQL JOIN query?

I need to select records from 2 tables, one called cities and one called neighborhoods. They share a column called parent_state, which stores the id of the parent state.
I need to select all cities and neighborhoods that belong to a certain state. For example, if the state id is 10, I need to get all the cities and neighborhoods that have this value in their parent_state column.
The state id is stored in a PHP variable like so:
$parent_state = '10';
What would this query look like (preferably the merged results from both tables should be sorted by the column name in alphabetical order)?
EDIT
Yes, I probably do need a union. I'm very new to mysql and all I can do at the moment is query tables individually.
I can always query both the cities and neighborhoods tables individually but the reason why I want to merge the results is for the sole purpose of listing said results alphabetically.
So can someone please show how the UNION query for this would look?
Use:
SELECT c.name
FROM CITIES c
WHERE c.parent_state = 10
UNION ALL
SELECT n.name
FROM NEIGHBORHOODS n
WHERE n.parent_state = 10
UNION ALL will return the result set as a combination of both queries as a single result set. UNION will remove duplicates, and is slower for it - this is why UNION ALL is a better choice, even if it's unlikely to have a city & neighbourhood with the same name. Honestly, doesn't sound like a good idea mixing the two, because a neighbourhood is part of a city...
Something else to be aware of with UNION is that there needs to be the same number of columns in the SELECT clause for all the queries being UNION'd (this goes for UNION and UNION ALL). I.e., you'll get an error if the first query has three columns in the SELECT clause and the second query only has two.
Also, the data types should match -- that means not returning a DATE/TIME data type in the same position as another query returning an INTEGER.
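The difference between UNION and UNION ALL, the alphabetical sort the question asks for, and the column-count rule can all be seen on a few rows. This sketch uses Python's built-in sqlite3 module rather than MySQL and PHP, and the sample names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (name TEXT, parent_state INTEGER);
    CREATE TABLE neighborhoods (name TEXT, parent_state INTEGER);
    INSERT INTO cities VALUES ('Springfield', 10), ('Albany', 10), ('Boston', 11);
    INSERT INTO neighborhoods VALUES ('Springfield', 10), ('Downtown', 10);
""")

parent_state = 10  # plays the role of the PHP variable in the question

SQL = """
    SELECT c.name AS name FROM cities c WHERE c.parent_state = ?
    {op}
    SELECT n.name AS name FROM neighborhoods n WHERE n.parent_state = ?
    ORDER BY name
"""
params = (parent_state, parent_state)

# UNION ALL keeps the duplicate name; UNION removes it.
union_all = [r[0] for r in conn.execute(SQL.format(op="UNION ALL"), params)]
union = [r[0] for r in conn.execute(SQL.format(op="UNION"), params)]
print(union_all)  # ['Albany', 'Downtown', 'Springfield', 'Springfield']
print(union)      # ['Albany', 'Downtown', 'Springfield']

# Mismatched column counts are rejected before any rows are returned.
try:
    conn.execute("SELECT name, parent_state FROM cities UNION ALL SELECT name FROM neighborhoods")
    mismatch_rejected = False
except sqlite3.OperationalError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```

Binding the state id as a parameter, instead of pasting the variable into the SQL string, is the safer habit in PHP as well.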
What you want is probably not a join but rather a union. Note that a union can only select the exact same columns from both of the combined queries.
select * from city as c
inner join neighborhoods as n
on n.parent_state = c.parent_state
where c.parent_state=10
You can use a LEFT or RIGHT JOIN in case cities and neighborhoods don't have related data.