SSIS Lookup Returns Too Much Data

I have an SSIS package that performs a lookup against a table with tens of millions of rows. By default it seems to pull every row from the table into a refTable, and then select from that refTable where the columns match the specified parameters to find the matching lookup. Does it have to insert into a refTable to do this? Can I just filter by the parameters immediately? Currently it is pulling millions of records into the refTable, and that wastes a ton of time. Is it done this way because multiple records are looked up from that refTable, or is it pulling all of those records every time, for each lookup it tries to find?
Here is the slow way and my proposed new way of doing this:
-- old
select * from (SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice) [refTable]
where [refTable].[InvoiceNumber] = ? and [refTable].[CustomerId] = ? and [refTable].[InvoiceDate] = ?
-- new
SELECT i.InvoiceID, i.CustomerId, i.InvoiceNumber, i.InvoiceDate
FROM Invoice i
where i.InvoiceNumber = ? and i.CustomerId = ? and i.InvoiceDate = ?

The Partial Cache mode makes a new call to the database every time it encounters a new distinct value in the source data. Afterwards it caches this new value. It's not creating a massive ref table.
The two queries
SELECT * FROM A WHERE A.Id = ?
SELECT * FROM (SELECT * FROM A) [refTable] WHERE refTable.Id = ?
have the same execution plan, so there is no difference.
For reference, the SSIS lookup transformation supports three caching modes: full cache, partial cache, and no cache.
You can speed the whole thing up by not using a whole table as the lookup connection, but a SQL query that returns only the columns you need.
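For example, a minimal sketch using the Invoice columns from the question; a query like this would be used as the lookup's source instead of the whole table:
-- Only the match columns plus the column you need back from the lookup.
SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice
If you can also restrict the rows (for example to a date range you know covers the source data), the cached reference set shrinks further.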

The problem was that I had one lookup using full cache instead of partial cache like the others; it was loading nearly a million rows, so it slowed things down quite a bit. I have a good index created, so doing the lookup for each source item isn't bad.

Another good method for handling lookups is this -> Lookup Pattern: Cascading.
Use a full cache lookup for values that were created yesterday and a partial cache lookup for anything older. 80-20 rule: 80% of the rows should hit the full cache and fly through.
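A rough sketch of the two lookup queries that pattern implies; the one-day cutoff and the DATEADD/GETDATE expressions are illustrative assumptions, not part of the original answer:
-- Full cache lookup: only yesterday's and newer rows, loaded once into memory.
SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice
WHERE InvoiceDate >= DATEADD(day, -1, CAST(GETDATE() AS date));

-- Partial cache lookup wired to the first lookup's no-match output:
-- it queries the full table, but only for the rows that missed the cache above.
SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice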


Optimize query from view with UNION ALL

So I'm facing a difficult scenario. I have a legacy app, badly written and designed, with a table, t_booking. This app has a calendar view where, for every hall and every day in the month, it shows the reservation status, with this query:
SELECT mr1b.id, mr1b.idreserva, mr1b.idhotel, mr1b.idhall, mr1b.idtiporeserva, mr1b.date, mr1b.ampm, mr1b.observaciones, mr1b.observaciones_bookingarea, mr1b.tipo_de_navegacion, mr1b.portal, r.estado
FROM t_booking mr1b
LEFT JOIN a_reservations r ON mr1b.idreserva = r.id
WHERE mr1b.idhotel = '$sIdHotel' AND mr1b.idhall = '$hall' AND mr1b.date = '$iAnyo-$iMes-$iDia'
AND IF (r.comidacena IS NULL OR r.comidacena = '', mr1b.ampm = 'AM', r.comidacena = 'AM' AND mr1b.ampm = 'AM')
AND (r.estado <> 'Cancelled' OR r.estado IS NULL OR r.estado = '')
LIMIT 1;
(at first there was also an ORDER BY r.estado DESC, which I took out)
This query, after proper (I think) indexing, takes 0.004 seconds each, and the overall calendar view is presented in a reasonable time. There are indexes on idhotel, idhall, and date.
Now I have a new module, well written ;-), which stores reservations in another table, but I must present both types of reservations in the same calendar view. My first approach was to create a view joining the content of both tables and selecting the data for the calendar view from this view instead of from t_booking.
The view is defined like this:
CREATE OR REPLACE VIEW
t_booking_hall_reservation
AS
SELECT id,
idreserva,
idhotel,
idhall,
idtiporeserva,
date,
ampm,
observaciones,
observaciones_bookingarea,
tipo_de_navegacion, portal
FROM t_booking
UNION ALL
SELECT HR.id,
HR.convention_id as idreserva,
H.id_hotel as idhotel,
HR.hall_id as idhall,
99 as idtiporeserva,
date,
session as ampm,
observations as observaciones,
'new module' as observaciones_bookingarea,
'events' as tipo_de_navegacion,
'new module' as portal
FROM new_hall_reservation HR
JOIN a_halls H on H.id = HR.hall_id
;
(table new_hall_reservation has same indexes)
I tried UNION ALL instead of UNION as I read it is much more efficient.
Well, the former query, changing t_booking to t_booking_hall_reservation, takes 1.5 seconds, multiplied by each hall and each day, which makes the calendar view impossible to finish.
The app is spaghetti code, so looping twice, once over t_booking and then over new_hall_reservation, and combining the results is somewhat difficult.
Is it possible to tune the view to make this query fast enough? Another approach?
Thanks
PS: the less I modify the original query, the less I'll need to modify the legacy app, which is, to say the least, risky to modify.
This is too long for a comment.
A view is (almost) never going to help performance. Yes, they make queries simpler. Yes, they incorporate important logic. But no, they don't help performance.
One key problem is the execution of the view -- it doesn't generally take the filters in the overall query into account (although the most recent versions of MySQL are better at this).
One suggestion -- which might be a fair bit of work -- is to materialize the view as a table. When the underlying tables change, you need to update t_booking_hall_reservation using triggers. Then you can create indexes on the table to achieve your performance goals.
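A minimal sketch of that approach, assuming the table and column names from the view above; only the insert path for new_hall_reservation is shown, and t_booking plus updates and deletes would need the same treatment:
-- One-time build of the materialized copy, plus the index the calendar query needs.
CREATE TABLE t_booking_hall_reservation_mat AS
SELECT * FROM t_booking_hall_reservation;
ALTER TABLE t_booking_hall_reservation_mat
  ADD INDEX idx_hotel_hall_date (idhotel, idhall, date);

-- Keep it in sync on inserts into the new module's table.
DELIMITER //
CREATE TRIGGER trg_new_hall_reservation_ai
AFTER INSERT ON new_hall_reservation
FOR EACH ROW
BEGIN
  INSERT INTO t_booking_hall_reservation_mat
    (id, idreserva, idhotel, idhall, idtiporeserva, date, ampm,
     observaciones, observaciones_bookingarea, tipo_de_navegacion, portal)
  SELECT NEW.id, NEW.convention_id, H.id_hotel, NEW.hall_id, 99, NEW.date,
         NEW.session, NEW.observations, 'new module', 'events', 'new module'
  FROM a_halls H
  WHERE H.id = NEW.hall_id;
END//
DELIMITER ;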
t_booking, unless it is a VIEW, needs
INDEX(idhotel, idhall, date)
VIEWs are syntactic sugar; they do not enhance performance; sometimes they are slower than the equivalent SELECT.
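If that index does not exist yet, adding it is a one-line change (the index name here is arbitrary):
ALTER TABLE t_booking ADD INDEX idx_hotel_hall_date (idhotel, idhall, date);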

What's my best approach re: creating calculated tables based on scraped data

I have a few spiders running on my VPS to scrape data each day, and the data is stored in MySQL.
I need to build a pretty complicated time series model on the data from various data sources.
Here I run into an issue:
I need to create a new calculated table based on my scraped data. The model is quite complicated as it involves historical raw data and calculated data. I was going to write a Python script to do this, but it doesn't seem efficient enough.
I then realized that I can just create a view within MySQL and write my model as a nested SQL query. That said, I want the view to be materialized (which is not supported by MySQL right now), and I want the view to be refreshed each day when new data comes in.
I know there is a third-party plugin called flex***, but I searched online and it doesn't seem easy to install and maintain.
What is my best approach here?
Thanks for the help.
=========================================================================
To add some clarification, the time series model I made is very complicated; it involves:
rolling average on raw data
rolling average on the rolling averaged data above
So it depends on both the raw data and previously calculated data.
The timestamp solution does not really solve the complexity of the issue.
I'm just not sure about the best way to do this.
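To make the shape of that concrete: if your MySQL version supports window functions (8.0+), a rolling average of a rolling average can be written as nested window queries. The table and column names and the 7-row window below are only illustrative:
SELECT ts,
       metric,
       avg1,
       AVG(avg1) OVER (ORDER BY ts ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg2
FROM (
    SELECT ts,
           metric,
           AVG(metric) OVER (ORDER BY ts ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg1
    FROM raw_data
) AS t;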
Leaving aside whether you should use a dedicated time-series tool such as rrdtool or carbon, MySQL provides the functionality you need to implement a semi-materialized view, e.g. given data batch-consolidated by date:
SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
SELECT pc.date AS event_time, events AS number_of_events
, total AS metric
FROM pre_consolidated pc
UNION
SELECT rd.timestamp, 1
, rd.metric
FROM raw_data rd
WHERE rd.timestamp > @LAST_CONSOLIDATED_TIMESTAMP
) AS combined
GROUP BY DATE(event_time)
(Note that although you could create this as a view and access that, in my experience MySQL is not the best at optimizing queries involving views, and you might be better off using the equivalent of the above as a template for building your queries.)
The most flexible way to maintain an accurate record of @LAST_CONSOLIDATED_TIMESTAMP would be to add a state column to the raw_data table (to avoid locking and using transactions to ensure consistency) and an index on the timestamp of the event, then, periodically:
UPDATE raw_data
SET state='PROCESSING'
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state IS NULL;
INSERT INTO pre_consolidated (date, events, total)
SELECT DATE(rd.timestamp), COUNT(*), SUM(rd.metric)
FROM raw_data rd
WHERE rd.timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND rd.state='PROCESSING'
GROUP BY DATE(rd.timestamp);
SELECT @NEXT_CONSOLIDATED_TIMESTAMP := MAX(timestamp)
FROM raw_data
WHERE timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';
UPDATE raw_data
SET state='CONSOLIDATED'
WHERE timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';
SELECT @LAST_CONSOLIDATED_TIMESTAMP := @NEXT_CONSOLIDATED_TIMESTAMP;
(you should think of a way to persist @LAST_CONSOLIDATED_TIMESTAMP between DBMS sessions)
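One simple way to do that, as an assumption on top of the answer rather than part of it, is a one-row bookkeeping table:
CREATE TABLE consolidation_state (
  last_consolidated_timestamp DATETIME NOT NULL
);
INSERT INTO consolidation_state VALUES ('1970-01-01 00:00:00');

-- Load it at the start of a session ...
SELECT @LAST_CONSOLIDATED_TIMESTAMP := last_consolidated_timestamp
FROM consolidation_state;

-- ... and write it back after the consolidation run.
UPDATE consolidation_state
SET last_consolidated_timestamp = @LAST_CONSOLIDATED_TIMESTAMP;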
Hence the base query (to allow for more than one event with the same timestamp) should be:
SELECT DATE(event_time), SUM(number_of_events) AS events
, SUM(metric) AS total
, SUM(metric)/SUM(number_of_events) AS average
FROM (
SELECT pc.date AS event_time, events AS number_of_events
, total AS metric
FROM pre_consolidated pc
UNION
SELECT rd.timestamp, 1
, rd.metric
FROM raw_data rd
WHERE rd.timestamp > @LAST_CONSOLIDATED_TIMESTAMP
AND rd.state IS NULL
) AS combined
GROUP BY DATE(event_time)
Adding the state variable to the timestamp index will likely slow down the overall performance of the update as long as you are applying the consolidation reasonably frequently.

Right way to phrase MySQL query across many (possibly empty) tables

I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table but that doesn't impact on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. Optimizing your query with CASE/COUNT(*) or CASE/LIMIT combinations to sort out empty tables would be one option. However, such conditional queries cost more time.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
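For example, here is a sketch (not benchmarked against your data) of one way to split it: rewrite the OR as a UNION of two narrower queries, so the application can simply drop the branch whose temp table is empty:
SELECT m.* FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp2 ON m.source = Temp2.source
WHERE m.time > '2017-01-01'
UNION
SELECT m.* FROM Moment AS m
JOIN Temp1 ON m.source = Temp1.source
JOIN Temp3 ON m.source = Temp3.source
WHERE m.time > '2016-11-15';
UNION deduplicates the combined rows, which matches the set-union semantics you described.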
You said that an API is involved; maybe you are able to handle the temporary "no data" tables differently, or even avoid storing them?
Another option would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html

MySQL SELECT query performance issues in a huge database

I have a pretty huge MySQL database and am having performance issues while selecting data. Let me first explain what I am doing in my project: I have a list of files. Every file should be analyzed with a number of tools. The result of the analysis is stored in a results table.
I have one table with files (samples). The table contains about 10 million rows. The schema looks like this:
idsample|sha256|path|...
The other (really small table) is a table which identifies the tool. Schema:
idtool|name
The third table is going to be the biggest one. The table contains all results of the tools I am using to analyze the files (The number of rows will be the number of files TIMES the number of tools). Schema:
id|idsample|idtool|result information| ...
What I am looking for is a query, which returns UNPROCESSED files for a given tool id (where no result exists yet).
The most efficient way I found so far to query those entries is the following:
SELECT
s.idsample
FROM
samples AS s
WHERE
s.idsample NOT IN (
SELECT
idsample
FROM
results
WHERE
idtool = 1
)
LIMIT 100
The problem is that the query is getting slower and slower as the results table is growing.
Do you have any suggestions for improvements? One further problem is that I cannot change the structure of the tables, as this is a shared database which is also used by other projects. (I think) the only way to improve things is to find a more efficient SELECT query.
Thank you very much,
Philipp
A left join may perform better, especially if idsample is indexed in both tables; in my experience, those kinds of "inquiries" are better served by JOINs rather than that kind of subquery.
SELECT s.idsample
FROM samples AS s
LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
WHERE r.idsample IS NULL
LIMIT 100
;
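For comparison, the same anti-join can also be spelled with NOT EXISTS, which recent MySQL versions usually optimize the same way (a sketch under the same assumption that idsample and idtool are indexed):
SELECT s.idsample
FROM samples AS s
WHERE NOT EXISTS (
  SELECT 1
  FROM results AS r
  WHERE r.idsample = s.idsample
    AND r.idtool = 1
)
LIMIT 100;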
Another, more involved, possible solution would be to create a fourth table with the full "unprocessed" list, and then use triggers on the other three tables to maintain it (a sketch of the last case follows the list); i.e.
when a new tool is added, add all the current files to that fourth table (with the new tool).
when a new file is added, add all the current tools to that fourth table (with the new file).
when a new result is entered, remove the corresponding record from the fourth table.
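A minimal sketch of the third bullet, assuming a hypothetical fourth table unprocessed(idsample, idtool):
-- Remove the pair from the backlog as soon as a result arrives.
CREATE TRIGGER trg_results_after_insert
AFTER INSERT ON results
FOR EACH ROW
DELETE FROM unprocessed
WHERE idsample = NEW.idsample AND idtool = NEW.idtool;

-- The original query then collapses to:
SELECT idsample FROM unprocessed WHERE idtool = 1 LIMIT 100;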

How do I ensure that this query sticks to indexes?

I have a database with two tables. One contains accounts, and the other contains over 2 million rows of addresses and their coordinates. Obviously, with that many rows, any query that doesn't take full advantage of the indexes will take minutes if not hours to complete. Unfortunately, that is currently the case with one of my queries:
SELECT
addr.`Linje-ID` as lineid,
addr.`Sluttbruker` as companyname,
addr.`Gate` as street,
addr.`Husnr` as housenr,
addr.`Postnr` as zip,
addr.`Poststed` as location,
loc.`UX_KOORDINAT` as coord_x,
loc.`UY_KOORDINAT` as coord_y,
loc.`ADRESSE_ID` as addr_id
FROM
addresses addr INNER JOIN
locationdata loc ON
loc.`POSTSTED` = addr.`Poststed` AND
loc.`POST_NR` = addr.`Postnr` AND
loc.`GATENAVN` = addr.`Gate` AND
loc.`HUSNUMMER` = addr.`Husnr`
GROUP BY
addr.`Linje-ID`
The locationdata table has a primary index id as well as an index defined as (POSTSTED, POST_NR, GATENAVN, HUSNUMMER). Fetching rows from the table using those columns in that order goes very quickly. The query above, however, had to be cancelled as it was taking too long (>15 minutes).
As my MySQL client (HeidiSQL) freezes while queries are running, it's getting very tedious to force the application closed and start over for every attempt to fix this problem, so I'm asking for help here.
Just for testing, the table "addresses" only contains one row at the moment.
Can anyone identify why this query 'never' completes?
These are the EXPLAIN results I was asked for:
http://pastebin.com/qWdQhdv5
You should copy the content and paste it into a larger container, as it line-breaks.
EDIT: I've edited the query to reflect some of your replies. It still takes over 300 seconds where it shouldn't need one.
First of all, I'd remove the UPPER() from loc.GATENAVN = UPPER(addr.Gate), since this clause is already searching in a case-insensitive mode.
I went with subqueries instead of joins.
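For illustration only (the final query isn't shown above), a subquery-based version might look like the sketch below, pulling each coordinate through a correlated lookup that can use the (POSTSTED, POST_NR, GATENAVN, HUSNUMMER) index; the GROUP BY on addr.`Linje-ID` from the original query is omitted for brevity:
SELECT
  addr.`Linje-ID` AS lineid,
  addr.`Sluttbruker` AS companyname,
  addr.`Gate` AS street,
  addr.`Husnr` AS housenr,
  addr.`Postnr` AS zip,
  addr.`Poststed` AS location,
  (SELECT loc.`UX_KOORDINAT` FROM locationdata loc
   WHERE loc.`POSTSTED` = addr.`Poststed` AND loc.`POST_NR` = addr.`Postnr`
     AND loc.`GATENAVN` = addr.`Gate` AND loc.`HUSNUMMER` = addr.`Husnr`
   LIMIT 1) AS coord_x,
  (SELECT loc.`UY_KOORDINAT` FROM locationdata loc
   WHERE loc.`POSTSTED` = addr.`Poststed` AND loc.`POST_NR` = addr.`Postnr`
     AND loc.`GATENAVN` = addr.`Gate` AND loc.`HUSNUMMER` = addr.`Husnr`
   LIMIT 1) AS coord_y
FROM addresses addr;
ADRESSE_ID would be fetched the same way as the two coordinates.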