What's my best approach re: creating calculated tables based on scraped data - mysql

I have a few spiders running on my vps to scrape data each day and the data is stored in MySQL.
I need to build a pretty complicated time series model on the data from various data sources.
Here I run into an issue:
I need to create a new calculated table based on my scraped data. The model is quite complicated, as it involves both historical raw data and calculated data. I was going to write a Python script to do this, but it doesn't seem efficient enough.
I then realized that I could just create a view within MySQL and write my model as a nested SQL query. That said, I want the view to be materialized (which MySQL does not currently support), and I want the view to be refreshed each day when new data comes in.
I know there is a third-party plugin called flex***, but from what I found online it doesn't seem easy to install and maintain.
What is my best approach here?
Thanks for the help.
=========================================================================
To add some clarification: the time series model I built is quite complicated. It involves:
a rolling average on the raw data
a rolling average on the rolling-averaged data above
So it depends on both the raw data and previously calculated data.
The timestamp solution does not really solve the complexity of the issue.
I'm just not sure about the best way to do this.
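To make the model concrete, here is a rough sketch of the kind of nested query I have in mind (the table and column names are made up, and it assumes MySQL 8.0 window functions are available):
-- made-up schema: raw_data(scraped_at DATETIME, metric DOUBLE)
WITH daily AS (
    SELECT DATE(scraped_at) AS day, AVG(metric) AS raw_avg
    FROM raw_data
    GROUP BY DATE(scraped_at)
),
level1 AS (
    SELECT day, raw_avg,
           AVG(raw_avg) OVER (ORDER BY day
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7
    FROM daily
)
SELECT day, raw_avg, ma7,
       AVG(ma7) OVER (ORDER BY day
                      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7_of_ma7
FROM level1;
What I'm unsure about is the best way to materialize and refresh the result of something like this each day.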

Leaving aside whether you should use a dedicated time-series tool such as rrdtool or carbon, MySQL provides the functionality you need to implement a semi-materialized view, e.g. given data batch-consolidated by date:
SELECT DATE(event_time), SUM(number_of_events) AS events
     , SUM(metric) AS total
     , SUM(metric)/SUM(number_of_events) AS average
FROM (
    SELECT pc.date AS event_time, events AS number_of_events
         , total AS metric
    FROM pre_consolidated pc
    UNION
    SELECT rd.timestamp, 1
         , rd.metric
    FROM raw_data rd
    WHERE rd.timestamp > @LAST_CONSOLIDATED_TIMESTAMP
) AS combined
GROUP BY DATE(event_time)
(note that although you could create this as a view and access that, IME MySQL is not the best at optimizing queries involving views, and you might be better off using the equivalent of the above as a template to build your queries around)
The most flexible way to maintain an accurate record of @LAST_CONSOLIDATED_TIMESTAMP would be to add a state column to the raw_data table (to avoid locking and using transactions to ensure consistency) and an index on the timestamp of the event, then, periodically:
UPDATE raw_data
SET state='PROCESSING'
WHERE timestamp>=@LAST_CONSOLIDATED_TIMESTAMP
AND state IS NULL;

INSERT INTO pre_consolidated (date, events, total)
SELECT DATE(rd.timestamp), COUNT(*), SUM(rd.metric)
FROM raw_data rd
WHERE rd.timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND rd.state='PROCESSING'
GROUP BY DATE(rd.timestamp);

SELECT @NEXT_CONSOLIDATED_TIMESTAMP := MAX(timestamp)
FROM raw_data
WHERE timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

UPDATE raw_data
SET state='CONSOLIDATED'
WHERE timestamp>@LAST_CONSOLIDATED_TIMESTAMP
AND state='PROCESSING';

SELECT @LAST_CONSOLIDATED_TIMESTAMP := @NEXT_CONSOLIDATED_TIMESTAMP;
(you should think of a way to persist @LAST_CONSOLIDATED_TIMESTAMP between DBMS sessions)
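One way to do that, as a sketch, is a hypothetical one-row control table that holds the watermark:
-- hypothetical control table; survives between sessions and runs
CREATE TABLE consolidation_state (
    id TINYINT NOT NULL PRIMARY KEY,
    last_consolidated_timestamp DATETIME NOT NULL
);
INSERT INTO consolidation_state VALUES (1, '1970-01-01 00:00:00');

-- load the watermark at the start of each consolidation run ...
SELECT @LAST_CONSOLIDATED_TIMESTAMP := last_consolidated_timestamp
FROM consolidation_state
WHERE id = 1;

-- ... and write it back once the batch has been marked CONSOLIDATED
UPDATE consolidation_state
SET last_consolidated_timestamp = @LAST_CONSOLIDATED_TIMESTAMP
WHERE id = 1;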
Hence the base query (to allow for more than one event with the same timestamp) should be:
SELECT DATE(event_time), SUM(number_of_events) AS events
     , SUM(metric) AS total
     , SUM(metric)/SUM(number_of_events) AS average
FROM (
    SELECT pc.date AS event_time, events AS number_of_events
         , total AS metric
    FROM pre_consolidated pc
    UNION
    SELECT rd.timestamp, 1
         , rd.metric
    FROM raw_data rd
    WHERE rd.timestamp > @LAST_CONSOLIDATED_TIMESTAMP
    AND rd.state IS NULL
) AS combined
GROUP BY DATE(event_time)
Adding the state column to the timestamp index will likely slow down the overall performance of the update as long as you are applying the consolidation reasonably frequently.

SSIS Lookup Returns Too Much Data

I have an SSIS package that performs a lookup on a table with tens of millions of rows. It seems by default it returns all rows from the table into a refTable, and then selects from that refTable where the columns match specified parameters to find the matching lookup. Does it have to insert into a refTable to do this? Can I just filter out with the parameters immediately? Currently it is pulling the millions of records into the refTable and it is wasting a ton of time. Is it done this way because multiple records are being looked up from that refTable, or is it pulling all of those records every time for each lookup it tries to find?
Here is the slow way and my proposed new way of doing this:
-- old
select * from (SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice) [refTable]
where [refTable].[InvoiceNumber] = ? and [refTable].[CustomerId] = ? and [refTable].[InvoiceDate] = ?
-- new
SELECT i.InvoiceID, i.CustomerId, i.InvoiceNumber, i.InvoiceDate
FROM Invoice i
where i.InvoiceNumber = ? and i.CustomerId = ? and i.InvoiceDate = ?
The Partial Cache mode makes a new call to the database every time it encounters a new distinct value in the source data. Afterwards it caches this new value. It's not creating a massive reference table.
The two queries
Select * FROM A WHERE A.Id = ?
SELECT * FROM (SELECT * FROM A) [refTable] WHERE refTable.Id = ?
have the same execution plan, so there is no difference.
An overview of the different caching modes can be found in the SSIS lookup documentation.
You can speed the whole thing up by not using a whole table as the lookup connection, but instead a SQL query that returns only the columns you need.
The problem was that I had one lookup using a full cache instead of a partial cache like the others; it was loading nearly a million rows, so it slowed things down quite a bit. I have a good index created, so doing the lookup for each source item isn't bad.
Another good method for handling lookups is the Lookup Pattern: Cascading.
Use a full-cache lookup for values that were created yesterday and a partial-cache lookup for anything older. By the 80-20 rule, 80% of the rows should hit the full cache and fly through.
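As a sketch of what the two lookup connections could look like against the Invoice table above (the one-day cutoff is just illustrative):
-- full-cache lookup: only recent rows, narrow column list
SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice
WHERE InvoiceDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));

-- partial-cache lookup handling the cache misses: everything older
SELECT InvoiceID, CustomerId, InvoiceNumber, InvoiceDate
FROM Invoice
WHERE InvoiceDate < DATEADD(DAY, -1, CAST(GETDATE() AS DATE));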

Optimize query from view with UNION ALL

So I'm facing a difficult scenario: I have a legacy app, badly written and designed, with a table, t_booking. This app has a calendar view which, for every hall and for every day in the month, shows its reservation status with this query:
SELECT mr1b.id, mr1b.idreserva, mr1b.idhotel, mr1b.idhall, mr1b.idtiporeserva, mr1b.date, mr1b.ampm, mr1b.observaciones, mr1b.observaciones_bookingarea, mr1b.tipo_de_navegacion, mr1b.portal, r.estado
FROM t_booking mr1b
LEFT JOIN a_reservations r ON mr1b.idreserva = r.id
WHERE mr1b.idhotel = '$sIdHotel' AND mr1b.idhall = '$hall' AND mr1b.date = '$iAnyo-$iMes-$iDia'
AND IF (r.comidacena IS NULL OR r.comidacena = '', mr1b.ampm = 'AM', r.comidacena = 'AM' AND mr1b.ampm = 'AM')
AND (r.estado <> 'Cancelled' OR r.estado IS NULL OR r.estado = '')
LIMIT 1;
(at first there was also an ORDER BY r.estado DESC, which I took out)
This query, after proper (I think) indexing, takes 0.004 seconds each, and the overall calendar view is presented in a reasonable time. There are indexes over idhotel, idhall, and date.
Now, I have a new module, well written ;-), which stores reservations in another table, but I must present both types of reservations in the same calendar view. My first approach was to create a view joining the content of both tables, and selecting data for the calendar view from this view instead of t_booking.
The view is defined like this:
CREATE OR REPLACE VIEW
t_booking_hall_reservation
AS
SELECT id,
idreserva,
idhotel,
idhall,
idtiporeserva,
date,
ampm,
observaciones,
observaciones_bookingarea,
tipo_de_navegacion, portal
FROM t_booking
UNION ALL
SELECT HR.id,
HR.convention_id as idreserva,
H.id_hotel as idhotel,
HR.hall_id as idhall,
99 as idtiporeserva,
date,
session as ampm,
observations as observaciones,
'new module' as observaciones_bookingarea,
'events' as tipo_de_navegacion,
'new module' as portal
FROM new_hall_reservation HR
JOIN a_halls H on H.id = HR.hall_id
;
(table new_hall_reservation has same indexes)
I tried UNION ALL instead of UNION, as I read it is much more efficient.
Well, the former query, changing t_booking to t_booking_hall_reservation, takes 1.5 seconds, multiplied for each hall and each day, which makes the calendar view impossible to finish.
The app is spaghetti code, so looping twice, once over t_booking and then over new_hall_reservation, and combining the results is somewhat difficult.
Is it possible to tune the view to make this query fast enough? Another approach?
Thanks
PS: the less I modify the original query, the less I'll need to modify the legacy app, which is, at the least, risky to modify.
This is too long for a comment.
A view is (almost) never going to help performance. Yes, they make queries simpler. Yes, they incorporate important logic. But no, they don't help performance.
One key problem is the execution of the view -- it doesn't generally take the filters in the overall query into account (although the most recent versions of MySQL are better at this).
One suggestion -- which might be a fair bit of work -- is to materialize the view as a table. When the underlying tables change, you need to change t_booking_hall_reservation using triggers. Then you can create indexes on the table to achieve your performance goals.
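A sketch of what that could look like, with a made-up table name and only one of the required triggers shown (you would need INSERT/UPDATE/DELETE triggers on both t_booking and new_hall_reservation):
-- materialize the current contents of the view into a real, indexable table
CREATE TABLE t_booking_hall_reservation_mat AS
SELECT * FROM t_booking_hall_reservation;

ALTER TABLE t_booking_hall_reservation_mat
    ADD INDEX idx_hotel_hall_date (idhotel, idhall, date);

-- example trigger: keep the table in sync when a new-module reservation is added
DELIMITER //
CREATE TRIGGER trg_new_hall_reservation_ai
AFTER INSERT ON new_hall_reservation
FOR EACH ROW
BEGIN
    INSERT INTO t_booking_hall_reservation_mat
        (id, idreserva, idhotel, idhall, idtiporeserva, date, ampm,
         observaciones, observaciones_bookingarea, tipo_de_navegacion, portal)
    SELECT NEW.id, NEW.convention_id, H.id_hotel, NEW.hall_id, 99, NEW.date,
           NEW.session, NEW.observations, 'new module', 'events', 'new module'
    FROM a_halls H
    WHERE H.id = NEW.hall_id;
END//
DELIMITER ;
The calendar query would then select from t_booking_hall_reservation_mat instead of the view, leaving the rest of the legacy query unchanged.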
t_booking, unless it is a VIEW, needs
INDEX(idhotel, idhall, date)
VIEWs are syntactic sugar; they do not enhance performance; sometimes they are slower than the equivalent SELECT.

Performance Issues with DB Design and Heavy Data

I asked the following question regarding DB design and a performance issue in my application today.
DB Design and Data Retrieval from a heavy table
But I could not get many replies to it. I may not have explained the question properly, so I have now re-defined my question, hoping to get some suggestions from the experts.
I am facing performance issues while selecting data from a particular table. The business logic of the application is as follows:
I have a number of import processes which result in creating pivot columns under their parent column names when showing them to the user. As the columns get pivoted, the system takes time to convert rows into columns, which results in slow performance.
The database tables related to this functionality are as follows:
There can be N number of clients. CLT_Clients table stores client information.
There can be N number of projects associated to a client. PRJ_Projects table stores project information and a link to the client.
There can be N number of listings associated to a project. PRJ_Listings table stores listing information and a link to the project.
There can be N number of source entities associated to a listing. ST_Entities table stores source entity information and a link to the listing.
This source entity is the actual import that contains the InvestorID, position values, source date, active and formula status.
The name of the import, e.g. L1Entity1, is stored in the ST_Entities table along with an ID field, i.e. EntityID.
The InvestorID, Position, Source Date, Active and Formula values get stored in the ST_Positions table.
Database Diagram
The data needs to be viewed as follows:
With this design I'm able to handle any number of imports, because the Position, Source Date, IsActive and Formula columns get pivoted.
The problem I'm facing with this design is that the system performs very slowly when it has to select data for more than 10-12 source entities, and the requirement is to show about 150 source entities. Because the data is not stored the way it is displayed (it is stored row-wise and shown column-wise), dynamic queries are written to pivot these columns, which takes a long time.
Question 1: Please comment on whether my current database design is correct or needs to be changed to a new design with 150 columns each for Position, Source Date, IsActive and Formula. That way the data would already be stored the way I need to retrieve it, i.e. I would not have to pivot/unpivot it. But the downsides are:
a) There are going to be more than 600 columns in this table.
b) There will be a limit, i.e. 150, on the number of source entities.
Question 2: If I need to stick with my current design, what can be done to improve the performance?
Please see the indexing and pivot method information below:
Regarding indexes on the Positions table, I have a clustered index on the ProjectID field, as data is selected from the Positions table on the basis of either ProjectID or EntityID.
Whenever EntityID is used to select data from the Positions table, it is always used in a JOIN. Whenever ProjectID is used to select data from this table, it is always used in the WHERE clause.
The point to note here is that I have a clustered index on ProjectID but no index on the pivoted column or on EntityID. Is there any room for improvement here?
Pivoting Method used:
Example 1:
'Select * From
(
Select DD.InvestorID,Cast(1 As Bit) As IsDSInvestor,DD.Position,
Case DD.ProjectID
When ' + CAST(@ProjectID AS VARCHAR) +' Then DE.SourceName
Else ''' + @PPDeliveryDate + '''+DE.SourceName
End As SourceName
From DE_PositionData DD
Inner Join DE_DataEntities DE ON DE.EntityID=DD.EntityID
Where DD.ProjectID IN (' + CAST(@ProjectID AS VARCHAR) +',' + CAST(@PreviousProjectID AS VARCHAR) +') AND InvestorID IS NOT NULL
) IDD
Pivot
(
max(Position) for SourceName in ('+ @DataColumns+')
) as p1'
Example 2:
'Select * From
(
Select DD.InvestorID As DSSOFID,Cast(1 As Bit) As IsActiveInvestor,
Case ST.SourceTypeCode
When ''RSH'' Then Cast(IsNull(DD.IsActive,0) As Int)
Else Cast(IsNull(DD.IsActive,1) As Int)
End As IsActive,
''~''+DE.SourceName As ActiveSourceName
From DE_DataEntities DE
Left Join DE_PositionData DD ON DE.EntityID=DD.EntityID
Left Join
(
Select * From #DataSources
Where ProjectID=' + CAST(@ProjectID AS VARCHAR) +'
) ST ON ST.ESourceID=DE.ESourceID
Where DE.ProjectID=' + CAST(@ProjectID AS VARCHAR) +' AND ST.SourceTypeCode NOT IN (''PBC'',''EBL'',''REG'')
AND InvestorID IS NOT NULL
) IDD
Pivot
(
Max(IsActive) for ActiveSourceName in ('+ @DataColumns+')
) As p1'
I would suggest the following.
You should store your data in the normalized format. You should be able to set up indexes to make the pivoting of the data go faster. If you post the actual query that you are using for pivoting, we might be able to help.
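For example, given the join on EntityID in the pivot queries shown above, an index along these lines might help (a sketch; the index name and included columns are assumptions based on those queries):
-- cover the join on EntityID so the pivot source rows can be read from the index
CREATE NONCLUSTERED INDEX IX_DE_PositionData_EntityID
    ON DE_PositionData (EntityID)
    INCLUDE (ProjectID, InvestorID, Position, IsActive);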
There are many reasons you want to store the data this way:
Flexibility in the number of repeating blocks
Many databases have limits on the number of columns or total width of a table. You don't want your design to approach those limits.
You may want additional information on each block. I always include CreatedBy and CreatedAt columns in my tables, and you would want this per block.
You have additional flexibility in summarization.
Adding/removing intermediate values is cumbersome.
That said, the pivoted table has one key advantage: it is what users want to see. If your data is updated only once per day, then you should create a reporting table with the pivot.
If your data is updated incrementally throughout the day, then you can set up triggers (or stored procedure code) to update the base tables and the reporting summary.
But as I said earlier, you should ask another question about the particular method you are using for pivoting. Perhaps we can improve the performance there.
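For the once-per-day case, a sketch of rebuilding such a pivoted reporting table (the entity names in the IN list are illustrative and would normally be built dynamically, as in the question):
-- drop and rebuild the reporting table nightly
IF OBJECT_ID('dbo.RPT_PositionsPivot', 'U') IS NOT NULL
    DROP TABLE dbo.RPT_PositionsPivot;

SELECT *
INTO dbo.RPT_PositionsPivot
FROM (
    SELECT DD.InvestorID, DE.SourceName, DD.Position
    FROM DE_PositionData DD
    INNER JOIN DE_DataEntities DE ON DE.EntityID = DD.EntityID
    WHERE DD.InvestorID IS NOT NULL
) AS src
PIVOT (
    MAX(Position) FOR SourceName IN ([L1Entity1], [L1Entity2], [L1Entity3])
) AS p;

-- index the reporting table for the lookups the UI actually performs
CREATE CLUSTERED INDEX IX_RPT_PositionsPivot_InvestorID
    ON dbo.RPT_PositionsPivot (InvestorID);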

Is it a bad idea to keep a subtotal field in database

I have a MySQL table that represents a list of orders and a related child table that represents the shipments associated with each order (some orders have more than one shipment, but most have just one).
Each shipment has a number of costs, for example:
ItemCost
ShippingCost
HandlingCost
TaxCost
There are many places in the application where I need to get consolidated information for the order such as:
TotalItemCost
TotalShippingCost
TotalHandlingCost
TotalTaxCost
TotalCost
TotalPaid
TotalProfit
All those fields are dependent on the aggregated values in the related shipment table. This information is used in other queries, reports, screens, etc., some of which have to return a result on tens of thousands of records quickly for a user.
As I see it, there are a few basic ways to go with this:
Use a subquery to calculate these items from the shipment table whenever they are needed. This complicates things quite a bit for all the queries that need all or part of this information. It is also slow.
Create a view that exposes the subqueries as simple fields. This keeps the reports that need them simple.
Add these fields to the order table. This would give me the performance I am looking for, at the expense of having to duplicate data and recalculate it whenever I make any changes to the shipment records.
One other thing, I am using a business layer that exposes functions to get this data (for example GetOrders(filter)) and I don't need the subtotals each time (or only some of them some of the time), so generating a subquery each time (even from a view) is probably a bad idea.
Are there any best practices that anybody can point me to help me decide what the best design for this is?
Incidentally, I ended up doing #3 primarily for performance and query simplicity reasons.
Update:
Got lots of great feedback pretty quickly, thank you all. To give a bit more background: one of the places the information is shown is the admin console, where I have a potentially very long list of orders and need to show TotalCost, TotalPaid, and TotalProfit for each.
There's absolutely nothing wrong with doing rollups of your statistical data and storing them to enhance application performance. Just keep in mind that you will probably need to create a set of triggers or jobs to keep the rollups in sync with your source data.
I would probably go about this by caching the subtotals in the database for fastest query performance if most of the time you're doing reads instead of writes. Create an update trigger to recalculate the subtotal when the row changes.
I would only use a view to calculate them on SELECT if the number of rows was typically pretty small and access somewhat infrequent. Performance will be much better if you cache them.
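A sketch of such a trigger, assuming hypothetical orders and shipments tables (matching INSERT and DELETE triggers would follow the same pattern):
DELIMITER //
CREATE TRIGGER trg_shipments_after_update
AFTER UPDATE ON shipments
FOR EACH ROW
BEGIN
    -- recalculate the cached totals for the affected order
    UPDATE orders o
    SET o.total_item_cost = (SELECT IFNULL(SUM(s.item_cost), 0)
                               FROM shipments s
                              WHERE s.order_id = NEW.order_id),
        o.total_shipping_cost = (SELECT IFNULL(SUM(s.shipping_cost), 0)
                                   FROM shipments s
                                  WHERE s.order_id = NEW.order_id)
    WHERE o.id = NEW.order_id;
END//
DELIMITER ;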
Option 3 is the fastest
If and when you are running into performance issues and if you cannot solve these any other way, option #3 is the way to go.
Use triggers to do the updating
You should use triggers after insert, update and delete to keep the subtotals in your order table in sync with the underlying data.
Take special care when retrospectively changing prices and stuff as this will require a full recalc of all subtotals.
So you will need a lot of triggers that usually don't do much most of the time.
If a tax rate changes, it will change in the future, for orders that you don't have yet.
If the triggers take a lot of time, make sure you do these updates in off-peak hours.
Run an automatic check periodically to make sure the cached values are correct
You may also want to keep a golden subquery in place that calculates all the values and checks them against the stored values in the order table.
Run this query every night and have it report any abnormalities, so that you can see when the denormalized values are out-of-sync.
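A sketch of that kind of check (the table and column names are assumptions):
-- report orders whose cached total no longer matches the recalculated one
SELECT o.id,
       o.total_item_cost           AS cached_total,
       IFNULL(SUM(s.item_cost), 0) AS recalculated_total
FROM orders o
LEFT JOIN shipments s ON s.order_id = o.id
GROUP BY o.id, o.total_item_cost
HAVING cached_total <> recalculated_total;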
Do not do any invoicing on orders that have not been processed by the validation query
Add an extra date field to the order table called timeoflastsuccessfulvalidation and set it to NULL if the validation was unsuccessful.
Only invoice items with a timeoflastsuccessfulvalidation less than 24 hours ago.
Of course you don't need to check orders that are fully processed, only orders that are pending.
Option 1 may be fast enough
With regards to #1
It is also slow.
That depends a lot on how you query the DB.
You mention subselects; in the mostly complete skeleton query below I don't see the need for many subselects, so you have me a bit puzzled there.
SELECT field1,field2,field3
, oifield1,oifield2,oifield3
, NettItemCost * (1+taxrate) as TotalItemCost
, TotalShippingCost
, TotalHandlingCost
, NettItemCost * taxRate as TotalTaxCost
, (NettItemCost * (1+taxrate)) + TotalShippingCost + TotalHandlingCost as TotalCost
, TotalPaid
, somethingorother as TotalProfit
FROM (
SELECT o.field1,o.field2, o.field3
, oi.field1 as oifield1, oi.field2 as oifield2, oi.field3 as oifield3
, SUM(c.productprice * oi.qty) as NettItemCost
, SUM(IFNULL(sc.shippingperkg,0) * oi.qty * p.WeightInKg) as TotalShippingCost
, SUM(IFNULL(hc.handlingperwhatever,0) * oi.qty) as TotalHandlingCost
, t.taxrate as TaxRate
, IFNULL(paid.amountpaid,0) as TotalPaid
FROM orders o
INNER JOIN orderitem oi ON (oi.order_id = o.id)
INNER JOIN products p ON (p.id = oi.product_id)
INNER JOIN prices c ON (c.product_id = p.id
AND o.orderdate BETWEEN c.validfrom AND c.validuntil)
INNER JOIN taxes t ON (p.tax_id = t.tax_id
AND o.orderdate BETWEEN t.validfrom AND t.validuntil)
LEFT JOIN shippingcosts sc ON (o.country = sc.country
AND o.orderdate BETWEEN sc.validfrom AND sc.validuntil)
LEFT JOIN handlingcost hc ON (hc.id = oi.handlingcost_id
AND o.orderdate BETWEEN hc.validfrom AND hc.validuntil)
LEFT JOIN (SELECT pay.order_id, SUM(pay.payment) as amountpaid FROM payment pay
GROUP BY pay.order_id) paid ON (paid.order_id = o.id)
WHERE o.id BETWEEN '1245' AND '1299'
GROUP BY o.id DESC, oi.id DESC ) AS sub
Thinking about it, you would need to split this query up for stuff that's relevant per order and per order item, but I'm too lazy to do that now.
Speed tips
Make sure you have indexes on all fields involved in the join criteria.
Use a MEMORY table for the smaller tables, like taxes and shippingcosts, and use a hash index for the IDs in the memory tables.
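A sketch of the MEMORY-table idea for the taxes lookup (the column types are assumptions; the same pattern applies to shippingcosts):
CREATE TABLE taxes_mem (
    tax_id INT NOT NULL,
    taxrate DECIMAL(6,4) NOT NULL,
    validfrom DATE NOT NULL,
    validuntil DATE NOT NULL,
    INDEX idx_tax_id (tax_id) USING HASH
) ENGINE = MEMORY;

INSERT INTO taxes_mem
SELECT tax_id, taxrate, validfrom, validuntil
FROM taxes;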
I would avoid #3 as much as I can, for several reasons:
It's too hard to discuss performance without measurement. Imagine the user is shopping around, adding items to an order; every time an item is added, you need to update the order record, which may not be necessary (some sites only show the order total when you click the shopping cart and are ready to check out).
Having a duplicated column is asking for bugs - you cannot expect every future developer/maintainer to be aware of this extra column. Triggers can help but I think triggers should only be used as a last resort to address a bad database design.
A different database schema can be used for reporting purpose. The reporting database can be highly de-normalized for performance purpose without complicating the main application.
I tend to put the actual logic for computing subtotals at the application layer, because a subtotal is actually an overloaded thing related to different contexts - sometimes you want the "raw subtotal", sometimes you want the subtotal after applying a discount. You just cannot keep adding columns to the order table for every different scenario.
It's not a bad idea; unfortunately, MySQL doesn't have some features that would make this really easy: computed columns and indexed (materialized) views. You can probably simulate them with triggers.

mysql - storing a range of values

I have a resource with an availability field that lists what hours of the day it is available for use.
E.g. res1 is available between hours 0-8 and 19-23 on a day; the field holds comma-separated hour ranges, e.g. 0-23 for 24-hour access, 0-5,19-23, or 0-5,12-15,19-23.
What's the best way to store this? Is CHAR a good option? When the resource is being accessed, my PHP needs to check the current hour against the hours defined here and then decide whether to allow access or not. Can I ask MySQL to tell me if the current hour is in the range specified here?
I'd store item availability in a separate table, where for each row I'd have (given your example):
id, startHour, endHour, resourceId
And I'd just use integers for the start and end times. You can then do queries against a join to see availability given a certain hour of the day using HOUR(NOW()) or what have you.
(On the other hand, I would've preferred a non-relational database like MongoDb for this kind of data)
1) create a table for resource availability, normalized.
CREATE TABLE res_avail
(
ra_resource_id int,
ra_start TIME,
ra_end TIME
# add appropriate keys for optimization here
)
2) populate with ($resource_id, '$start_time', '$end_time') for each range in your list (use explode())
3) then, you can query: (for example, PHP)
sql = "SELECT ra_resource_id FROM res_avail where ('$time' BETWEEN ra_start AND ra_end)";
....
I know this is an old question, but since v5.7 MySQL supports storing values in JSON format. This means you can store all the ranges in one JSON field. This is great if you want to display opening times in your front-end using JavaScript. But it's not the best solution when you want to show all places that are currently open, because querying on a JSON field means a full table scan. It would be okay, though, if you only need to check one place at a time; for example, you load a page showing the details of one place and display whether it's open or closed.
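For illustration, a sketch of that approach (the resources table, the availability column, and the range format are assumptions, and JSON_TABLE requires MySQL 8.0; on 5.7 you would decode the JSON in PHP instead):
-- store the hour ranges as JSON
ALTER TABLE resources ADD COLUMN availability JSON;

UPDATE resources
SET availability = '[{"start": 0, "end": 8}, {"start": 19, "end": 23}]'
WHERE id = 1;

-- MySQL 8.0+: check whether the current hour falls inside any stored range
SELECT r.id
FROM resources r,
     JSON_TABLE(r.availability, '$[*]'
                COLUMNS (start_h INT PATH '$.start',
                         end_h   INT PATH '$.end')) AS ranges
WHERE r.id = 1
  AND HOUR(NOW()) BETWEEN ranges.start_h AND ranges.end_h;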