How can this query be optimized for speed? - exact-online

This query creates an export for UPS from the deliveries history:
select 'key'
,      ACC.Name
,      CON.FullName
,      CON.Phone
,      ADR.AddressLine1
,      ADR.AddressLine2
,      ADR.AddressLine3
,      ACC.Postcode
,      ADR.City
,      ADR.Country
,      ACC.Code
,      DEL.DeliveryNumber
,      CON.Email
,      case
       when CON.Email is not null
       then 'Y'
       else 'N'
       end
       Ship_Not_Option
,      'Y' Ship_Not
,      'ABCDEFG' Description_Goods
,      '1' numberofpkgs
,      'PP' billing
,      'CP' pkgstype
,      'ST' service
,      '1' weight
,      null Shippernr
from   ExactOnlineREST..GoodsDeliveries del
join   ExactOnlineREST..Accounts acc
on     ACC.ID = del.DeliveryAccount
join   ExactOnlineREST..Addresses ADR
on     ADR.ID = DEL.DeliveryAddress
join   ExactOnlineREST..Contacts CON
on     CON.ID = DEL.DeliveryContact
where  DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by     DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?

This query probably runs at most once a day, since the delivery date does not contain a time. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized by a solution such as the one in "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of a join_set(soe, orderid, 100) hint is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember that the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy@inmemorystorage as select ... from ...
Then create a temporary table per 100 rows or so, such as:
create or replace table gdysubpartition1@inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or alike.
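As a rough illustration of the split-volume idea outside of SQL, the sketch below cuts a list of rows into partitions of at most 100, so that each partition could drive one run of the join query. The function name `partitions` and the sample delivery numbers are hypothetical and not part of Invantive SQL.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def partitions(rows: Sequence[T], size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical delivery numbers standing in for the selected GoodsDeliveries rows.
deliveries = list(range(1, 451))

for part in partitions(deliveries, 100):
    # Each partition stays below the limit of roughly 125 left-hand rows,
    # so it could be used as the left-hand side of one index join.
    print(len(part), part[0], part[-1])
```

A partition size of 100 is taken from the text above; any value safely below the URL-length limit would do.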
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use `local memorize results clipboard` and `local save results clipboard to ...` to save the last results to a file manually and later restore them using `local load results clipboard from ...` and `local insert results clipboard in table ...`. And maybe then `insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate)`.
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the EOL APIs in another way. Or register a request at Exact to also allow POST requests to the API with the filter in the body.

Related

Suggestion/feedback on database design for work order tracking in multiple stations

I'm a student intern in a business team and my coworkers don't have a CS background, so I hope to get some feedback and suggestions for improvement on the database design for the Flask web application that I will work on. Also, I taught myself SQL a couple of years ago by following tutorials on YouTube.
When a new work order is received by the business, it is then passed to a line of 5 stations to process it further. Currently the status of the work order is either started or finished. We hope to track it better by knowing the current station/stage (A, B, C, D, E) of the work order and then help improve the flow by letting the operator at each station know what's next in line.
My idea is to create a web app (using Python 3, Flask, and PostgreSQL) that updates the database when an operator at each station scans the work order's barcode and two other static barcodes (in_station_X and out_station_X). Each station will have a tablet connected to a scanner.
I. Station Operator perspective (for example Station 1)
Scan the batch of all incoming work orders (barcodes) for that shift. For each item, they would also scan the in_station_1 barcode to record the time_in for each work order.
The work orders come in queue so eventually the web app running on the tablet can show them what's next in line.
When an item is processed, the operator would scan the work order again and also the out_station_1 barcode to record the time_out for each work order.
The item coming out of that station may not have the same order as the incoming queue due to different priority (boolean Yes/No).
II. Admin/dashboard perspective:
See the current station and cycle time of each work order in that day.
Modify the priority of a work order if need be.
Also, the possibility to see a reloop if a work order fails to be processed in station 2 and needs to go back to station 1.
III. The database:
a. Work Order Info table that contains fields such as:
id, workorder_barcode, requestor, priority (boolean Yes/No), date_created.
b. The Tracking Database: I'm thinking of having columns like:
- id (automatically generated for new row)
- workorder_barcode (nullable = False)
- current_station (nullable = False)
- time_in
- time_out
I have several questions/concerns related to this tracking table:
Every time a work order is scanned in or out, a new row will be created (which means either time_in or time_out is blank). Do you see any issues with this approach vs. looking up the existing row for the same work order that has a time_in and filling in its time_out? The reason for this is to avoid multiple lookups when the database grows large.
Since the app screen at each station will show what's next in line, do you think a simple query with ORDER_BY to show the order needed would suffice? What concerns me is showing the next item based on both the priority of each item and the current incoming order. I think I can sort by multiple columns (time_in and priority) and FILTER by current_station. However, as you can see below, I think the current table design may be more suitable for capturing events than doing queue control.
For example: the table for today would look like
id, workorder_barcode, current_station, time_in, time_out
61, 100.1, A, 6:00pm, null
62, 100.3, A, 6:01pm, null
63, 100.2, A, 6:02pm, null
...
70, 100.1, A, null, 6:03pm
71, 100.1, B, 6:04pm, null
...
74, 100.5, C, 6:05pm, null
At 6:05pm, the queue at each station would be
Station A queue: 100.3, 100.2
Station B queue: 100.1
Station C queue: 100.5
I think it can get complicated to have all 5 stations sharing the same table but seeing different queues. Is there a queue-based database that you would recommend I look into?
Thank you so much for reading this. I appreciate any questions, comments, and suggestions since I'm trying to learn more about database as I get hands-on with this project.
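As a sketch of how the per-station queues above could be derived from the event-style tracking table with a plain query, the example below uses Python's built-in sqlite3 instead of PostgreSQL. The table names `workorder` and `tracking`, the 24-hour text times, and the assumption that `id` increases with scan time are illustrative choices, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table workorder (
    barcode  text primary key,
    priority integer not null default 0   -- 1 = priority, 0 = normal
);
create table tracking (
    id                integer primary key,
    workorder_barcode text not null,
    current_station   text not null,
    time_in           text,
    time_out          text
);
""")
conn.executemany("insert into workorder values (?, ?)",
                 [("100.1", 0), ("100.2", 0), ("100.3", 0), ("100.5", 0)])
# The sample rows from the question, with times written as sortable text.
conn.executemany("insert into tracking values (?, ?, ?, ?, ?)", [
    (61, "100.1", "A", "18:00", None),
    (62, "100.3", "A", "18:01", None),
    (63, "100.2", "A", "18:02", None),
    (70, "100.1", "A", None, "18:03"),
    (71, "100.1", "B", "18:04", None),
    (74, "100.5", "C", "18:05", None),
])


def station_queue(conn, station):
    """Work orders scanned in at `station` that have no later scan-out there."""
    rows = conn.execute("""
        select t.workorder_barcode
        from   tracking t
        join   workorder w on w.barcode = t.workorder_barcode
        where  t.current_station = ?
        and    t.time_in is not null
        and    not exists (
                   select 1
                   from   tracking o
                   where  o.workorder_barcode = t.workorder_barcode
                   and    o.current_station = t.current_station
                   and    o.time_out is not null
                   and    o.id > t.id      -- assumes ids grow with scan time
               )
        order by w.priority desc, t.time_in, t.id
    """, (station,)).fetchall()
    return [r[0] for r in rows]


for station in ("A", "B", "C"):
    print(station, station_queue(conn, station))
```

With this shape, all stations share one table and each station only filters on its own `current_station`; whether that scales well enough would need measuring on real volumes.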

Get list of blocked Exact Online divisions

We have a few thousand companies in Exact Online from which a certain percentage runs their own accounting and has their own license. However, there is a daily changing group of companies that are behind with their payments to Exact and therefore their companies are blocked.
For all companies we run Invantive Data Replicator to replicate all Exact Online companies into a SQL Server datawarehouse for analytical reporting and continuous monitoring.
In the SystemDivisions table, the state of such a blocked company remains at 1 (Active). It does not change to 2 (Archive) or 0 (upcoming). Nor is there any enddate set in the past.
However, when the XML or REST APIs are used through a query from Invantive SQL or directly from Python on such a blocked company, there are a lot of fuzzy error messages.
Currently we have to open each company which had an error during replication individually each day and check whether a block by Exact is causing the error and for what reason.
It seems that there is no way to retrieve the list of blocked companies.
Is there an alternative?
Although it is not supported and is advised against, you can access a limited number of screens in Exact Online using native requests. It is rumoured that this is not possible for all screens.
However, you are lucky. The blocking status of a company can be requested using the following queries:
insert into NativePlatformScalarRequests(url, orig_system_group)
select /*+ ods(false) */ 'https://start.exactonline.nl/docs/SysAccessBlocked.aspx?_Division_=' || code
,      'BLOCK-DIV-CHECK-' || code
from   systemdivisions

create or replace table currentlyblockeddivisions@inmemorystorage
as
select blockingstatus
,      divisioncode
from   ( select regexp_replace(result, '.*<table class="WizardSectionHeader" style="width:100%;"><tr><th colspan="2">([^<]*)</th>.*', '$1', 1, 0, 'n') blockingstatus
         ,      replace(orig_system_group, 'BLOCK-DIV-CHECK-', '') divisioncode
         from   NativePlatformScalarRequests
         where  orig_system_group like 'BLOCK-DIV-CHECK-%'
       )
where  blockingstatus not like '%: Onbekend%'
Please note that the hyperlink with '.nl' needs to be replaced when you run in a different country. The same holds for searching on the Dutch term 'Onbekend' ('Unknown' in English).
This query runs several thousand HTTP requests, each requesting the screen with the blocking status of a company. However, when the company is not blocked, the screen reports back a reason of 'Unknown'.
These companies with an 'Unknown' reason are probably not blocked. The rest are.
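Since the question also mentions calling Exact Online directly from Python, here is a minimal sketch of the same extraction step in Python: pull the header text out of the returned HTML and treat anything other than an 'Onbekend' reason as probably blocked. The function names and the sample HTML are made up for illustration; the real page content may differ, and fetching the page (with authentication) is left out.

```python
import re

# Same table header that the SQL regular expression above looks for.
HEADER = re.compile(
    r'<table class="WizardSectionHeader" style="width:100%;">'
    r'<tr><th colspan="2">([^<]*)</th>',
    re.DOTALL,
)


def blocking_status(html):
    """Return the text of the first matching header, or None if absent."""
    match = HEADER.search(html)
    return match.group(1) if match else None


def is_probably_blocked(status, unknown_marker=": Onbekend"):
    """A status without the 'unknown' marker is taken as a sign of a block."""
    return status is not None and unknown_marker not in status


# Hypothetical responses; the wording of the reasons is an assumption.
sample_ok = ('<html><table class="WizardSectionHeader" style="width:100%;">'
             '<tr><th colspan="2">Reden: Onbekend</th></tr></table></html>')
sample_blocked = sample_ok.replace("Onbekend", "Betalingsachterstand")

print(blocking_status(sample_ok), is_probably_blocked(blocking_status(sample_ok)))
print(blocking_status(sample_blocked), is_probably_blocked(blocking_status(sample_blocked)))
```

As with the SQL version, the marker needs to be replaced by the local term when running against a different country.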

Analyze data volume of API calls with Invantive SQL

The SQL engine hides away all the nitty-gritty details of which API calls are made. However, some cloud solutions have pricing per API call.
For instance:
select *
from transactionlines
retrieves all Exact Online transaction lines of the current company, but:
select *
from transactionlines
where financialyear = 2016
filters it effectively on the REST API of Exact Online to just that year, reducing data volume. And:
select *
from gltransactionlines
where year_attr = 2016
retrieves all data since the where-clause is not forwarded to this XML API of Exact.
Of course I can attach Fiddler or Wireshark and try to analyze the data volume, but is there an easier way to analyze the data volume of API calls with Invantive SQL?
First of all, all calls handled by Invantive SQL are logged in the Invantive Cloud together with:
the time
data volume in both directions
duration
to enable consistent API use monitoring across all supported cloud platforms. The actual data is not logged and travels directly.
You can query the same numbers from within your session, for instance:
select * from exactonlinerest..projects where code like 'A%'
retrieves all projects with a code starting with 'A'. And then:
select * from sessionios@datadictionary
shows you the API calls made.
You can also put a query like the following at the end of your session before logging off:
select main_url
,      sum(bytes_received) bytes_received
,      sum(duration_ms) duration_ms
from   ( select regexp_replace(url, '\?.*', '') main_url
         ,      bytes_received
         ,      duration_ms
         from   sessionios@datadictionary
       )
group
by     main_url
which returns the total bytes received and the total duration per main URL.
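For readers who export the session I/O rows and want to post-process them elsewhere, the same aggregation can be sketched in Python: strip the query string from each URL and sum the bytes received and the duration per main URL. The function name `summarize` and the sample rows, including their URLs and numbers, are invented for illustration.

```python
import re
from collections import defaultdict


def summarize(calls):
    """Sum bytes_received and duration_ms per URL with the query string removed."""
    totals = defaultdict(lambda: {"bytes_received": 0, "duration_ms": 0})
    for url, bytes_received, duration_ms in calls:
        main_url = re.sub(r"\?.*", "", url)   # same pattern as the SQL above
        totals[main_url]["bytes_received"] += bytes_received
        totals[main_url]["duration_ms"] += duration_ms
    return dict(totals)


# Hypothetical (url, bytes_received, duration_ms) rows.
calls = [
    ("https://example.invalid/api/Projects?$filter=a", 1200, 310),
    ("https://example.invalid/api/Projects?$skiptoken=2", 800, 190),
    ("https://example.invalid/api/Accounts", 500, 120),
]

for main_url, total in summarize(calls).items():
    print(main_url, total["bytes_received"], total["duration_ms"])
```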

Comparison of sets in MySQL

I have a challenge with the following database structure:
HEADER table called 'DOC' containing document details, among which the document ID
DETAIL table called 'DOC_SET' containing data related to the document.
The header table is approximately 16000 records. The detail table contains on average 75 records per header record (1.2 million records in total).
I have one source document and its related set (source set). I would like to compare this source set to the other documents' sets (which I refer to as destination documents and sets). Through my application I have a list of IDs of the source set available, and as such also its length (in the example below, a list of 46 elements), which I can use in the query directly.
What I need per destination document is the length of the intersection (number of shared elements) of the source and destination sets and the length of the difference (the number of elements that are in the source set but not in the destination set) for display. I also need a filter to retrieve only records for which the intersection between source and destination reaches 75% of the source set.
Currently I have a query which does this by using sub selects containing expressions, but it is utterly slow and the results need to be available at page refresh in a web application. The point is I only need to display about 20 results at a time, but when sorting on calculated fields I need to calculate every destination record before being able to sort and paginate.
The query is something like this:
select
    DOC.id,
    calc_subquery._calcSetIntersection,
    calc_subquery._calcSetDifference
from
    DOC
inner join
    (
        select
            DOC.id as document_id,
            (
                select
                    count(*)
                from
                    DOC_SET
                where
                    DOC_SET.doc_id = DOC.id and
                    DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
            ) as _calcSetIntersection
            ,46-(
                select
                    count(*)
                from
                    DOC_SET
                where
                    DOC_SET.doc_id = DOC.id and
                    DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
            ) as _calcSetDifference
        from
            DOC
        where
            DOC.id = 2599
    ) as calc_subquery
on
    DOC.id = calc_subquery.document_id
where
    DOC.id = 2599 and
    _calcSetIntersection / 46 > 0.75;
I'm wondering if:
this is possible in < 100 msec or so on MySQL, on an average-spec server running MySQL fully in memory (24 GB).
I should use a better-suited solution for this, perhaps a NoSQL solution.
I should use some sort of temporary table or cache containing calculated values. This is an issue for me, as the source set of IDs might change in between queries and the whole thing needs to be calculated again.
Anyway, some thoughts or solutions are really appreciated.
Kind regards,
Eric
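One direction that may be worth trying is to compute the intersection for all destination documents in a single pass over DOC_SET with a GROUP BY, instead of two correlated subqueries per document. The sketch below demonstrates the idea with Python's built-in sqlite3 rather than MySQL, on a tiny made-up data set; it assumes that (doc_id, element_id) pairs are unique, and the function name `compare_sets` is hypothetical. Documents that share no element with the source set do not appear in the result at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table doc_set (
        doc_id     integer not null,
        element_id integer not null,
        primary key (doc_id, element_id)
    )
""")
# Made-up destination sets for three documents.
conn.executemany(
    "insert into doc_set values (?, ?)",
    [(10, e) for e in (1, 2, 3, 4, 9)]
    + [(11, e) for e in (1, 2, 3)]
    + [(12, e) for e in (1, 9)],
)


def compare_sets(conn, source_ids, threshold=0.75):
    """Return (doc_id, intersection, difference) for documents above the threshold."""
    n = len(source_ids)
    placeholders = ",".join("?" for _ in source_ids)
    sql = f"""
        select doc_id
        ,      count(*)     as set_intersection
        ,      ? - count(*) as set_difference
        from   doc_set
        where  element_id in ({placeholders})
        group  by doc_id
        having count(*) > ? * ?
        order  by set_intersection desc, doc_id
    """
    # Parameters follow the order of the placeholders in the statement.
    return conn.execute(sql, [n, *source_ids, threshold, n]).fetchall()


print(compare_sets(conn, [1, 2, 3, 4]))
```

The strict `>` mirrors the `> 0.75` comparison in the original query. Whether this reaches the 100 msec target on 1.2 million rows would have to be measured; an index starting with element_id is likely to matter.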

Is this a case for denormalisation?

I have a site with about 30,000 members to which I'm adding a functionality that involves sending a random message from a pool of 40 possible messages. Members can never receive the same message twice.
One table contains the 40 messages and another table maps the many-to-many relationship between messages and members.
A cron script runs daily, selects a member from the 30,000, selects a message from the 40 and then checks to see if this message has been sent to this user before. If not, it sends the message. If yes, it runs the query again until it finds a message that has not yet been received by this member.
What I'm worried about now is that this m-m table will become very big: at 30,000 members and 40 messages we already have 1.2 million rows through which we have to search to find a message that has not yet been sent.
Is this a case for denormalisation? In the members table I could add 40 columns (message_1, message_2 ... message_40) in which a 1 flag is added each time a message is sent. If I'm not mistaken, this would make the queries in the cron script run much faster?
I know that doesn't answer your original question, but wouldn't it be way faster if you selected all the messages that weren't yet sent to a user and then select one of those randomly?
See this pseudo-mysql here:
-- Assumes sent_messages has the columns user and message.
SELECT
    user.id AS user,
    GROUP_CONCAT(messages.id) AS unsent_messages
FROM
    messages,
    user
WHERE
    messages.id NOT IN (
        SELECT
            sent_messages.message
        FROM
            sent_messages
        WHERE
            sent_messages.user = user.id
    )
GROUP BY user.id
You could also append the id of the sent messages to a varchar-field in the members-table.
Although this goes against good practice, it would make it easy to use one statement to get a message that has not yet been sent to a specific member.
Just like this (if you surround the ids with '-')
SELECT message.id
FROM member, message
WHERE member.id = 2321
AND member.sentmessages NOT LIKE CONCAT('%-', message.id, '-%')
1.2 M rows @ 8 bytes (+ overhead) per row is not a lot. It's so small I wouldn't even bet it needs indexing (but of course you should do it).
Normalization reduces redundancy, and it is what you should do if you have a large amount of data, which seems to be your case. You need not denormalize. Let there be an M-to-M table between members and messages.
You can archive the old data as your M-to-M data increases. I don't even see any conflicts because your cron job runs daily for this task and accounts only for the data for the current day. So you can archive M-to-M table data every week.
I believe there will be a maintenance issue if you denormalize by adding additional columns to the members table. I don't recommend it. Archiving old data can save you from trouble.
You could store only available (unsent) messages. This implies extra maintenance when you add or remove members or message types (nothing that can't be automated with foreign keys and triggers) but simplifies delivery: pick a random line from each user, send the message and remove the line. Also, your database will get smaller as messages get sent ;-)
You can achieve the effect of sending random messages by preallocating the random string in your m-m table and a pointer to the offset of the last message sent.
In more detail, create a table MemberMessages with columns
memberId,
messageIdList char(80) or varchar ,
lastMessage int,
primary key is memberId.
Pseudo-code for the cron job then looks like this...
ONE. Select the next message for a member. If no row exists in MemberMessages for this member, go to step TWO. The SQL to select the next message looks like
select substr(messageIdList, 2*lastMessage + 1, 2) as nextMessageId
from MemberMessages
where memberId = ?
Send the message identified by nextMessageId,
then update lastMessage, incrementing it by 1, unless you have reached 39, in which case reset it to zero.
update MemberMessages
set lastMessage = MOD(lastMessage + 1, 40)
where memberId = ?
TWO. Create a random list of message IDs as a string of couplets like 2117390740... This is your random list of message IDs as an 80-character string. Insert a row into MemberMessages for your memberId, setting messageIdList to your 80-character string and lastMessage to 1.
Send the message identified by the first couplet from the list to the member.
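The string handling in the two steps above can be sketched in Python as follows. It assumes message IDs 1 to 40, each written as two digits, and the helper names are hypothetical; the zero-based Python slice corresponds to the one-based substr in the SQL.

```python
import random


def make_message_id_list(n_messages=40, rng=None):
    """Shuffle IDs 1..n_messages and join them as two-digit couplets."""
    rng = rng or random.Random()
    ids = list(range(1, n_messages + 1))
    rng.shuffle(ids)
    return "".join(f"{i:02d}" for i in ids)


def next_message_id(message_id_list, last_message):
    """Python equivalent of substr(list, 2*lastMessage + 1, 2)."""
    return int(message_id_list[2 * last_message:2 * last_message + 2])


def advance(last_message, n_messages=40):
    """Equivalent of MOD(lastMessage + 1, 40)."""
    return (last_message + 1) % n_messages


# Step TWO: create the list, send the first couplet, start the pointer at 1.
id_list = make_message_id_list(rng=random.Random(42))
first = next_message_id(id_list, 0)
pointer = 1

# Step ONE on a later day: read the next couplet and move the pointer.
second = next_message_id(id_list, pointer)
pointer = advance(pointer)
print(len(id_list), first, second, pointer)
```

Note that once the pointer wraps around to zero, the member would start receiving messages a second time, so the cron job needs a stop condition if that is not wanted.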
You can create a kind of queue / heap.
ReceivedMessages
UserId
MessageId
then:
Pick up a member and select message to send:
SELECT * FROM Messages WHERE MessageId NOT IN (SELECT MessageId FROM ReceivedMessages WHERE UserId = @UserId) LIMIT 1
then insert MessageId and UserId to ReceivedMessages
and do send logic here
I hope that helps.
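A minimal sketch of this pick-and-record approach, using Python's built-in sqlite3 instead of MySQL and picking at random among the unsent messages. The lowercase table and column names and the function `send_random_unsent` are assumptions for the example; actual delivery of the message is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table messages (id integer primary key, body text not null);
create table received_messages (
    member_id  integer not null,
    message_id integer not null references messages(id),
    primary key (member_id, message_id)
);
""")
conn.executemany("insert into messages values (?, ?)",
                 [(i, f"message {i}") for i in range(1, 41)])


def send_random_unsent(conn, member_id):
    """Pick a random message the member has not received and record it.

    Returns the message id, or None when all messages have been sent.
    """
    row = conn.execute("""
        select m.id
        from   messages m
        where  m.id not in (select r.message_id
                            from   received_messages r
                            where  r.member_id = ?)
        order  by random()
        limit  1
    """, (member_id,)).fetchone()
    if row is None:
        return None
    conn.execute(
        "insert into received_messages (member_id, message_id) values (?, ?)",
        (member_id, row[0]),
    )
    conn.commit()
    return row[0]


print(send_random_unsent(conn, 2321))
```

The primary key on (member_id, message_id) also guards against recording the same message twice for one member.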
There are potentially easier ways to do this, depending on how random you want "random" to be.
Consider that at the beginning of the day you shuffle an array A, [0..39] which describes the order of the messages to be sent to users today.
Also, consider that you have at most 40 Cron jobs, which are used to send messages to the users. Given the Nth cron job, and ID the selected user ID, numeric, you can choose M, the index of the message to send:
M = (A[N] + ID) % 40.
This way, a given ID would not receive the same message twice in the same day (because A[N] would be different), and two randomly selected users have a 1/40 chance of receiving the same message. If you want more "randomness" you can potentially use multiple arrays.
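The formula above can be tried out with a short Python sketch. The function names and the fixed seed are illustrative only; in practice the shuffled array would be generated once per day and shared by all cron jobs.

```python
import random


def daily_order(n_messages=40, seed=None):
    """Shuffle the indexes 0..n_messages-1; this is the array A."""
    order = list(range(n_messages))
    random.Random(seed).shuffle(order)
    return order


def message_index(order, job_number, member_id):
    """M = (A[N] + ID) % 40, generalized to len(order) messages."""
    return (order[job_number] + member_id) % len(order)


A = daily_order(seed=20240101)   # hypothetical seed, e.g. derived from the date
member_id = 2321

# The same member gets a different message index from every cron job that day.
indexes = [message_index(A, n, member_id) for n in range(len(A))]
print(indexes[:5])
```

Because A is a permutation, adding the same member ID to each entry modulo 40 keeps all values distinct, which is why a member cannot get the same message twice within one day. Across days a repeat is still possible, so this does not replace tracking which messages were sent.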