Comparison of sets in MySQL - mysql

I have a challenge with the following database structure:
A HEADER table called 'DOC' containing document details, including the document ID.
A DETAIL table called 'DOC_SET' containing data related to the document.
The header table has approximately 16000 records. The detail table contains on average 75 records per header record (1.2 million records in total).
I have one source document and its related set (the source set). I would like to compare this source set to the other documents' sets (which I refer to as destination documents and sets). Through my application I have the list of IDs of the source set available, and therefore also its length (46 elements in the example below), both of which I can use in the query directly.
What I need per destination document, for display, is the length of the intersection (the number of elements shared by the source and destination sets) and the length of the difference (the number of elements that are in the source set but not in the destination set). I also need a filter to retrieve only those records where the intersection covers more than 75% of the source set.
Currently I have a query which does this by using sub selects containing expressions, but it is utterly slow and the results need to be available at page refresh in a web application. The point is I only need to display about 20 results at a time, but when sorting on calculated fields I need to calculate every destination record before being able to sort and paginate.
The query is something like this:
select
DOC.id,
calc_subquery._calcSetIntersection,
calc_subquery._calcSetDifference
from
DOC
inner join
(
select
DOC.id as document_id,
(
select
count(*)
from
DOC_SET
where
DOC_SET.doc_id = DOC.id and
DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
) as _calcSetIntersection
,46-(
select
count(*)
from
DOC_SET
where
DOC_SET.doc_id = DOC.id and
DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
) as _calcSetDifference
from
DOC
where
DOC.id = 2599
) as calc_subquery
on
DOC.id = calc_subquery.document_id
where
DOC.id = 2599 and
_calcSetIntersection / 46 > 0.75;
I'm wondering:
Is this possible in under 100 ms or so on MySQL, on an average-spec server running MySQL fully in memory (24 GB)?
Should I use a better-suited solution for this, perhaps a NoSQL solution?
Should I use some sort of temporary table or cache containing calculated values? This is an issue for me, as the source set of IDs might change in between queries and the whole thing then needs to be calculated again.
Anyway, some thoughts or solutions are really appreciated.
Kind regards,
Eric
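One way to think about the requirement above is as a single grouped pass over DOC_SET instead of two correlated subqueries per document. The following is only a minimal sketch under stated assumptions: it uses Python's sqlite3 module purely so that it is self-contained, the function name rank_documents is hypothetical, and it assumes each (doc_id, element_id) pair occurs at most once. The SQL itself is plain enough that it should carry over to MySQL, where an index on (element_id, doc_id) would probably matter most, but that has not been measured here.

```python
import sqlite3

def rank_documents(conn, source_ids, min_ratio=0.75):
    """Return (doc_id, intersection, difference) rows for every document
    whose overlap with the source set exceeds min_ratio of the source set.
    Hypothetical helper; assumes (doc_id, element_id) pairs are unique."""
    n = len(source_ids)
    placeholders = ",".join("?" * n)
    sql = f"""
        SELECT doc_id,
               COUNT(*)     AS set_intersection,
               ? - COUNT(*) AS set_difference
        FROM DOC_SET
        WHERE element_id IN ({placeholders})
        GROUP BY doc_id
        HAVING COUNT(*) > ? * ?
        ORDER BY set_intersection DESC, doc_id
    """
    # Parameters are bound in order of appearance in the statement.
    return conn.execute(sql, [n, *source_ids, min_ratio, n]).fetchall()

# Tiny made-up data set: three documents, source set of four elements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DOC_SET (doc_id INTEGER, element_id INTEGER)")
conn.executemany(
    "INSERT INTO DOC_SET VALUES (?, ?)",
    [(10, e) for e in (1, 2, 3, 4, 9)]   # shares 4 of 4
    + [(11, e) for e in (1, 2, 3)]       # shares 3 of 4
    + [(12, e) for e in (7, 8)],         # shares nothing
)
source = [1, 2, 3, 4]
strict = rank_documents(conn, source)        # more than 75% required
loose = rank_documents(conn, source, 0.5)    # more than 50% required
```

Documents without any shared element never appear in the grouped result, which is harmless here because the 75% filter would discard them anyway. Sorting and pagination (ORDER BY plus LIMIT) can then be applied to this one result set.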

Related

How can this query be optimized for speed?

This query creates an export for UPS from the deliveries history:
select 'key'
, ACC.Name
, CON.FullName
, CON.Phone
, ADR.AddressLine1
, ADR.AddressLine2
, ADR.AddressLine3
, ACC.Postcode
, ADR.City
, ADR.Country
, ACC.Code
, DEL.DeliveryNumber
, CON.Email
, case
when CON.Email is not null
then 'Y'
else 'N'
end
Ship_Not_Option
, 'Y' Ship_Not
, 'ABCDEFG' Description_Goods
, '1' numberofpkgs
, 'PP' billing
, 'CP' pkgstype
, 'ST' service
, '1' weight
, null Shippernr
from ExactOnlineREST..GoodsDeliveries del
join ExactOnlineREST..Accounts acc
on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact
where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?
This query is probably run at most once a day, since the delivery date does not contain a time. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized with a solution such as the one described in "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of a join_set(soe, orderid, 100) hint is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy#inmemorystorage as select ... from ...
Then create a temporary table per 100 or similar rows such as:
create or replace table gdysubpartition1#inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or alike.
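The partitioning idea in this section is independent of the SQL dialect: take the eligible rows and process them in slices of at most 100. As a rough, generic illustration only (plain Python, not Invantive SQL; the helper name batches and the list of delivery IDs are made up for the example):

```python
def batches(rows, size=100):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Hypothetical stand-in for the eligible GoodsDeliveries of one day.
delivery_ids = list(range(450))
parts = list(batches(delivery_ids))
```

Each slice would then drive one run of the join, keeping the left-hand side below the limit mentioned above.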
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use 'local memorize results clipboard' and 'local save results clipboard to ...' to save the last results to a file manually, and later restore them using 'local load results clipboard from ...' and 'local insert results clipboard in table ...'. And maybe then 'insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate)'.
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the EOL APIs in another way. Or register a request at Exact to also allow POST requests to the API with the filter in the body.

SAP BusinessObjects - Merging dimensions with no directly related attributes

Given the following 3 queries
Query 1
SELECT
COMPONENTINFO__SOFTWARE.SOFTWARENAME,
COMPONENTINFO__SOFTWARE.SOFTWAREVERSION,
COMPONENTINFO__SOFTWARE.PARENTOID,
COMPONENTINFO__SOFTWARE.OID,
COMPONENT_VERSION_INFO.OID,
COMPONENT_VERSION_INFO.HWSERIAL,
COMPONENT_VERSION_INFO.COMPONENTID
FROM
COMPONENTINFO__SOFTWARE,
COMPONENT_VERSION_INFO
WHERE
( COMPONENTINFO__SOFTWARE.PARENTOID=COMPONENT_VERSION_INFO.OID )
Query 2
SELECT
V_MACH.OID,
V_MACH.NAME,
V_MACH.IPADDR
FROM
V_MACH
Query 3
SELECT
V_VERSIONINFO.MACHINEOID,
VM_VERSIONINFO_VERSIONINFOINFO.HWSERIAL,
VM_VERSIONINFO_VERSIONINFOINFO.OSVERSION,
VM_VERSIONINFO_VERSIONINFOINFO.PARENTOID,
VM_VERSIONINFO_VERSIONINFOINFO.OID,
COMPONENT_VERSION_INFO.PARENTOID,
V_VERSIONINFO.OID
FROM
V_VERSIONINFO,
VM_VERSIONINFO_VERSIONINFOINFO,
COMPONENT_VERSION_INFO
WHERE
( VM_VERSIONINFO_VERSIONINFOINFO.PARENTOID=V_VERSIONINFO.OID )
I'm trying to produce a report (Webi, using the rich client) that shows in 1 table:
V_MACH.NAME, COMPONENTINFO__SOFTWARE.SOFTWARENAME, COMPONENTINFO__SOFTWARE.SOFTWAREVERSION
But no matter what dimensions I merge, it won't let me put the NAME field alongside the software version fields.
I've tried to merge on:
VM_VERSIONINFO_VERSIONINFOINFO.HWSERIAL + COMPONENT_VERSION_INFO.HWSERIAL.
VM_VERSIONINFO_VERSIONINFOINFO.OID + COMPONENT_VERSION_INFO.OID (I found these represent the same values for each machine)
But nothing works.
Is the only way to do a join at the SQL level? I was hoping to avoid that but if it's the only way then that's ok.
I think what you need to do is this:
1) Create a merged dimension between V_MACH.OID in Query 2 and V_VERSIONINFO.MACHINEOID in Query 3. Call the merged dim "machineoid".
2) Create a merged dimension between VM_VERSIONINFO_VERSIONINFOINFO.OID in Query 3 and COMPONENT_VERSION_INFO.OID in Query 1. Call the merged dim "oid".
3) Create a new variable as a detail type, defined as =[V_MACH.NAME], and its associated dimension as the merged machineoid dimension. Call it name_detail.
4) Use the two merged dims in place of the underlying dims in your report block, then add in the name_detail variable.
The reason you're having trouble is that BO can't recognize what Query 2.NAME should be associated with. By creating a detail variable, you are explicitly telling it that it is an attribute of the now-merged OID dimension.

MySQL - return one row from 2 rows in the same table, overwrite the contents of the first 'default' with the populated fields of the second 'override'

I am trying to make use of the mobile device lookup data in the WURFL database at http://wurfl.sourceforge.net/smart.php but I'm having problems getting my head around the MySQL code needed (I use ColdFusion for the server backend). To be honest it's really doing my head in, but I'm sure there is a straightforward approach to this.
WURFL is supplied as XML (approx. 15200 records to date), and I have already written the method that saves the data to a MySQL database. Now I need to get the data back out in a useful way!
Basically it works like this: firstly run a select using the userAgent data from a CGI pull to match against a known mobile device (row 1) using LIKE; if found then use the resultant fallback field to look up the default data for the mobile device's 'family root' (row 2). The two rows need to be combined by overwriting the contents of (row 2) with the specific mobile device's features of (row 1). Both rows contain NULL entries and not all the features are present in (row 1).
I just need the fully populated row of data returned if a match is found. I hope that makes sense, I would provide what I think the SQL should look like but I will probably confuse things even more.
Really appreciate any assistance!
This would be my shot at it in SQL Server. In MySQL you would need to use IFNULL instead of ISNULL:
SELECT
ISNULL(row1.Feature1, row2.Feature1) AS Feature1
, ISNULL(row1.Feature2, row2.Feature2) AS Feature2
, ISNULL(row1.Feature3, row2.Feature3) AS Feature3
FROM
featureTable row1
LEFT OUTER JOIN featureTable row2 ON row1.fallback = row2.familyroot
WHERE row1.userAgent LIKE '%Some User Agent String%'
This should accomplish the same thing in MySQL:
SELECT
IFNULL(row1.Feature1, row2.Feature1) AS Feature1
, IFNULL(row1.Feature2, row2.Feature2) AS Feature2
, IFNULL(row1.Feature3, row2.Feature3) AS Feature3
FROM
featureTable AS row1
LEFT OUTER JOIN featureTable AS row2 ON row1.fallback = row2.familyroot
WHERE row1.userAgent LIKE '%Some User Agent String%'
What this does is take your feature table and alias it as row1 to get your specific model features. We then join it back to itself as row2 to get the family features. The ISNULL function (IFNULL in MySQL) then says "if there is no Feature1 value in row1 (it's null), get the Feature1 value from row2".
Hope that helps.
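To make the override behaviour concrete, here is a small self-contained sketch of the same self-join. It runs against SQLite (which also provides IFNULL) through Python's sqlite3 module only so that it can be executed as-is; the table layout, the user agent string and the feature values are invented for the illustration and are not the real WURFL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout following the column names used in the answer above.
conn.execute("""
    CREATE TABLE featureTable (
        userAgent  TEXT,
        fallback   TEXT,
        familyroot TEXT,
        Feature1   TEXT,
        Feature2   TEXT,
        Feature3   TEXT
    )
""")
conn.executemany(
    "INSERT INTO featureTable VALUES (?, ?, ?, ?, ?, ?)",
    [
        # row 1: the specific device, only partly populated
        ("Mozilla/5.0 (ExamplePhone X1)", "example_family", None,
         "touch", None, None),
        # row 2: the family root carrying the defaults
        (None, None, "example_family", "keypad", "240x320", None),
    ],
)

merged = conn.execute(
    """
    SELECT IFNULL(row1.Feature1, row2.Feature1) AS Feature1,
           IFNULL(row1.Feature2, row2.Feature2) AS Feature2,
           IFNULL(row1.Feature3, row2.Feature3) AS Feature3
    FROM featureTable AS row1
    LEFT OUTER JOIN featureTable AS row2
           ON row1.fallback = row2.familyroot
    WHERE row1.userAgent LIKE ?
    """,
    ("%ExamplePhone X1%",),
).fetchone()
```

Feature1 comes from the device row, Feature2 falls back to the family row, and Feature3 stays NULL because neither row has a value for it.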

Count of related items in a 2nd table with zero results needed (query check please)

This MySQL statement is a bit over my head. I pieced it together through a lot of Google searches. It seems to work right, but I just wanted to see if I could get a thumbs up. I'm paranoid that I did something a bit off and that some issue I don't understand could come up.
I have a 'directories' table, 'folders' table and 'documents' table. (directories have many folders, folders have many documents).
On a web page, I have a select where a user can choose a directory (which has many folders). This query is for an AJAX call that loads a second select with the list of all folders belonging to the directory (getting the id's and names to load the 'folders' select).
So, this query will be made against one directory to get a list of folder id's and folder names for that directory. I also needed the folder name to contain a count of how many documents are contained in each folder. Also, I originally had just "join" which did not return zero results but changing it to "left join" listed folders with 0 documents (don't have an understanding of the different types of joins yet).
MY FRANKEN-QUERY:
SELECT f.id, CONCAT(f.folder_name, ' (', COUNT(DISTINCT d.id), ' documents)') AS folder_name
FROM folders f
LEFT JOIN documents d ON d.folder_id = f.id
WHERE f.directory_id = '2'
GROUP BY f.id
ORDER BY f.folder_name
RESULTS (seems to work fine):
id folder_name
1 MAIN (2 documents)
8 test1 (2 documents)
9 test2 (3 documents)
50 test3 (0 documents)
Thanks - much appreciated!
It looks fine offhand, but just run a couple of tests on your data and make sure you get consistent (correct) results.
Assuming documents.id is a primary key, you can remove the DISTINCT keyword from the count.
For more on the various join types
http://en.wikipedia.org/wiki/Join_%28SQL%29
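The difference between the two join types that the question mentions can be seen on a tiny made-up data set. The sketch below uses SQLite through Python's sqlite3 module so it runs on its own; the table contents are invented, and the count is returned as a separate column rather than concatenated into the name to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE folders (id INTEGER PRIMARY KEY, "
             "directory_id INTEGER, folder_name TEXT)")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, "
             "folder_id INTEGER)")
conn.executemany("INSERT INTO folders VALUES (?, ?, ?)",
                 [(1, 2, "MAIN"), (50, 2, "test3")])
# Two documents in MAIN, none in test3.
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(100, 1), (101, 1)])

QUERY = """
    SELECT f.id, f.folder_name, COUNT(d.id) AS doc_count
    FROM folders f
    {join} documents d ON d.folder_id = f.id
    WHERE f.directory_id = 2
    GROUP BY f.id
    ORDER BY f.folder_name
"""

# LEFT JOIN keeps folders without documents; COUNT(d.id) skips the NULLs.
with_left_join = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
# A plain (inner) JOIN drops folders that have no matching document.
with_inner_join = conn.execute(QUERY.format(join="JOIN")).fetchall()
```

The empty folder only shows up, with a count of 0, in the LEFT JOIN variant.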

DynamicQuery: How to select a column with linq query that takes parameters

We want to set up a directory of all the organizations working with us. They are incredibly diverse (government, embassy, private companies, and organizations depending on them ). So, I've resolved to create 2 tables. Table 1 will treat all the organizations equally, i.e. it'll collect all the basic information (name, address, phone number, etc.). Table 2 will establish the hierarchy among all the organizations. For instance, Program for illiterate adults depends on the National Institute for Social Security which depends on the Labor Ministry.
In the Hierarchy table, each column represents a level. So, for the example above, (i)Labor Ministry - Level1(column1), (ii)National Institute for Social Security - Level2(column2), (iii)Program for illiterate adults - Level3(column3).
To attach an organization to a hierarchy, the user needs to go level by level (i.e. column by column). So, there will be at least 3 situations:
If an adequate hierarchy exists for an organization (for instance, level1: US Embassy), that organization can be added (for instance, level2: USAID) --> US Embassy/USAID, and so on.
What if one or more levels are missing? Then they need to be added.
What if the hierarchy needs to be modified? Not everything needs to be modified.
I do not have any choice but to work by level (i.e. column by column). It does not make sense to have all the levels in one form, as the user needs to navigate hierarchies to find the right one to attach an organization to.
Let's say, I have those queries in my repository (just that you get the idea).
Query1
var orgHierarchy = (from orgH in db.Hierarchy
                    select orgH.Level1).FirstOrDefault();
Query2
var orgHierarchy = (from orgH in db.Hierarchy
                    select orgH.Level2).FirstOrDefault();
Query3, Query4, etc.
The above queries are the same except for the property queried (level1, level2, level3, etc.)
Question: Is there a general way of writing the above queries in one? So that the user can track an hierarchy level by level to attach an organization.
In other words, not knowing in advance which column to query, I still need to be able to do so depending on some conditions. For instance, an organization X depends on Y. Knowing that Y is somewhere on the 3rd level, I'll go to the 4th level, linking X to Y.
I need to select (not manually) a column with only one query that takes parameters.
=======================
EDIT
As I just said to @Mark Byers, all I want is just to be able to query a column not knowing in advance which one. Check this out:
How about this
public Hierarchy GetHierarchy(string name)
{
    var myHierarchy = from hierarc in db.Hierarchy
                      where (hierarc.Level1 == name)
                      select hierarc;
    // Return the first matching row (or null) so the result fits the declared return type.
    return myHierarchy.FirstOrDefault();
}
Above, the query depends on name, which is a variable. It might be Planning Ministry, Embassy, Local Phone, etc.
Can I write the same query, but this time, instead of looking to match a value in the DB, have my query select a particular column?
var myVar = from orgH in db.Hierarchy
where (orgH.Level1 == "Government")
select orgH.where(level == myVariable);
return myVar;
I don't pretend that select orgH.where(level == myVariable) is even close to be valid. But that is what I want: to be able to select a column depending on a variable (i.e. the value is not known in advance like with name).
Thanks for helping
How about using DynamicQueryable?
http://weblogs.asp.net/scottgu/archive/2008/01/07/dynamic-linq-part-1-using-the-linq-dynamic-query-library.aspx
Your database is not normalized, so you should start by changing the hierarchy table to, for example:
OrganizationId Parent
1 NULL
2 1
3 1
4 3
To query this you might need to use recursive queries. This is difficult (but not impossible) using LINQ, so you might instead prefer to create a parameterized stored procedure using a recursive CTE and put the query there.
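As a rough illustration of the recursive CTE idea, the sketch below walks from one organization up to the root of the example table above. It is not SQL Server code: it uses SQLite via Python's sqlite3 module so that it can run as-is, the table name Organization and the function name chain_to_root are assumptions, and SQL Server writes the CTE without the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same rows as the example table above; the table name is hypothetical.
conn.execute("CREATE TABLE Organization ("
             "OrganizationId INTEGER PRIMARY KEY, Parent INTEGER)")
conn.executemany("INSERT INTO Organization VALUES (?, ?)",
                 [(1, None), (2, 1), (3, 1), (4, 3)])

def chain_to_root(conn, organization_id):
    """Return the organization followed by its ancestors, nearest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(OrganizationId, Parent, depth) AS (
            SELECT OrganizationId, Parent, 0
            FROM Organization
            WHERE OrganizationId = ?
            UNION ALL
            SELECT o.OrganizationId, o.Parent, c.depth + 1
            FROM Organization o
            JOIN chain c ON o.OrganizationId = c.Parent
        )
        SELECT OrganizationId FROM chain ORDER BY depth
        """,
        (organization_id,),
    ).fetchall()
    return [row[0] for row in rows]

path_from_4 = chain_to_root(conn, 4)
path_from_1 = chain_to_root(conn, 1)
```

With this layout the level of an organization is simply its depth in the chain, so no per-level column has to be chosen at query time.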