MySQL how to rank objects by similarity of multiple property rows - mysql

Hello all and a Happy New Year
SITUATION:
I have some tables in MySQL db:
Scores:
(Unique ID, unique (objectID, metricID))
| ID | ObjectID | MetricID | Score |
|--------+----------+----------+----------|
|0 | 1 | 7 | 0 |
|1 | 5 | 3 | 13 |
|2 | 7 | 2 | 78 |
|3 | 7 | 3 | 22 |
|.....
|--------+----------+----------+----------|
Objects:
(unique ID, unique ObjectName)
| ID | ObjectName |
|--------+------------|
|0 | Ook |
|1 | Oop |
|2 | Oww |
|3 | Oat |
|.....
|--------+------------|
Metrics:
(unique ID, unique MetricName)
| ID | MetricName |
|--------+------------|
|0 | Moo |
|1 | Mar |
|2 | Mee |
|3 | Meep |
|.....
|--------+------------|
For a given object ID:
Each object has anywhere from zero scores up to one score per metric.
REQUIREMENT:
For a given ObjectID, I want to return a sorted list based on the following criteria:
Returned rows ranked in order of similarity to the provided object
Returned rows not to include provided object
(This is the hard bit, I think.) Similarity is determined by an object's "score distance" from the provided object: the numeric difference between the two objects' scores, taken over every metric for which both the provided object and the object being examined have an entry
Contains objectID, Object name, score difference (or something similar)
PROBLEM STATEMENT:
I don't know the correct SQL syntax to use for this, and my experiments so far have failed. I would like to do as much of this work in the DB as possible and have little or none of this work done in nasty for-loops in the code or similar.
ADDITIONAL NON-FUNCTIONALS
At present there are only 200 rows in the Scores table. My calculations show that ultimately there may be up to around 2,000,000 rows, but probably no more.
The Objects table will only ever have up to around 5000 rows
The Metrics table will only ever have up to around 400 rows

Here's an approach to order objects based on their similarity to object 1:
select  other.ObjectID
,       avg(abs(target.Score - other.Score)) as Delta
from    Scores target
join    Scores other
on      other.MetricID = target.MetricID
        and other.ObjectID <> target.ObjectID
where   target.ObjectID = 1
group by
        other.ObjectID
order by
        Delta
Similarity is defined as the average difference in common metrics. Objects that do not share at least one metric with object 1 are not listed. If this answer makes wrong assumptions, feel free to clarify your question :)
Live example at SQL Fiddle.
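As a rough, self-contained sketch of the same idea, the snippet below runs the query against an in-memory SQLite database from Python (SQLite stands in for MySQL here; AVG and ABS behave the same way for this purpose). The extra join to Objects, the rank_similar and build_demo helpers, and all score values are assumptions added for illustration and are not part of the original answer.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Objects (ID INTEGER PRIMARY KEY, ObjectName TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE Scores (ID INTEGER PRIMARY KEY, ObjectID INTEGER, "
        "MetricID INTEGER, Score INTEGER, UNIQUE (ObjectID, MetricID))"
    )
    conn.executemany(
        "INSERT INTO Objects VALUES (?, ?)",
        [(0, "Ook"), (1, "Oop"), (2, "Oww"), (3, "Oat")],
    )
    # Hypothetical scores: (ObjectID, MetricID, Score)
    conn.executemany(
        "INSERT INTO Scores (ObjectID, MetricID, Score) VALUES (?, ?, ?)",
        [
            (1, 2, 10), (1, 3, 20),   # the target object
            (2, 2, 13), (2, 3, 21),   # shares metrics 2 and 3 with the target
            (3, 3, 30),               # shares only metric 3
            (0, 7, 5),                # shares no metric, so it is not listed
        ],
    )
    return conn

def rank_similar(conn, target_id):
    """Return (ObjectID, ObjectName, Delta) rows, most similar first."""
    sql = """
        SELECT other.ObjectID, o.ObjectName,
               AVG(ABS(target.Score - other.Score)) AS Delta
        FROM Scores AS target
        JOIN Scores AS other
          ON other.MetricID = target.MetricID
         AND other.ObjectID <> target.ObjectID
        JOIN Objects AS o ON o.ID = other.ObjectID
        WHERE target.ObjectID = ?
        GROUP BY other.ObjectID, o.ObjectName
        ORDER BY Delta, other.ObjectID
    """
    return conn.execute(sql, (target_id,)).fetchall()

ranking = rank_similar(build_demo(), 1)
```

Averaging the differences keeps an object that shares many metrics from being penalised just for having more rows; if total distance matters more, SUM could be swapped in for AVG.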

Related

creating a sorted list in database

So basically I have a table named contents where users can store their items. Normally, when a user adds a new item, it is appended to the end of the table.
Something like:
|ID | Name | Item |
--------------------
|1 | Jack | pen |
|2 | Mark | apple |
|3 | albert| orange|
|4 | Jack | pencil|
The problem is that finding all of Jack's items might take a long time once there are many users and items: Jack's first item might be at row ID 40 and his second at 1000048, and so on. So I was wondering how to sort the rows by Name so the table looks like this:
|ID | Name | Item |
--------------------
|1 | Jack | pen |
|2 | Jack | pencil|
|3 | Mark | apple |
|4 | albert| orange|
And if a user adds a new item, it should be added to the end of that user's rows.
All replies are much appreciated. Thanks :)
Add index(es) for any column (or combination of columns) you want to search for and/or want to order by.
Do not reorder the table, nor re-number the ids.
If you are talking about 1000 rows, you are unlikely to notice any performance problems even if you don't do proper indexing or normalization. With a million rows, you will notice.
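To make the indexing advice concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The index name idx_contents_name is an assumption; the equivalent MySQL statement would be a CREATE INDEX on the Name column. The rows stay in insertion order and the index is what lets the lookup find Jack's items quickly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contents (ID INTEGER PRIMARY KEY, Name TEXT, Item TEXT)")
conn.executemany(
    "INSERT INTO contents (Name, Item) VALUES (?, ?)",
    [("Jack", "pen"), ("Mark", "apple"), ("albert", "orange"), ("Jack", "pencil")],
)

# Index the column used in the WHERE clause; the table itself is not reordered.
conn.execute("CREATE INDEX idx_contents_name ON contents (Name)")

query = "SELECT Item FROM contents WHERE Name = ? ORDER BY ID"
jack_items = [row[0] for row in conn.execute(query, ("Jack",))]

# Ask the planner how it would run the query; the detail text is in column 3.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("Jack",)))
```

In MySQL, running EXPLAIN on the same SELECT plays the role of EXPLAIN QUERY PLAN here.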

Sum query for MySQL where field contain certain values

I need help with a query. I have a table like this:
| ID | codehwos |
| --- | ----------- |
| 1 | 16,17,15,26 |
| 2 | 15,32,12,23 |
| 3 | 53,15,21,26 |
I need an output like this:
| codehwos | number_of_this_code |
| -------- | ---------------------- |
| 15 | 3 |
| 17 | 1 |
| 26 | 2 |
I want to count how many times each code is used across the rows.
Can anyone write a query that does this for all the codes at once?
Thanks
You have a very poor data format. You should not store lists in strings and never store lists of numbers in strings. SQL has a great data structure for storing lists. Hint: it is called a "table" not a "string".
That said, sometimes one is stuck with other people's really poor design choices. We wouldn't make them ourselves, but we still need to get something done. Assuming you have a list of codes, you can do what you want with:
select c.code, count(*)
from codes c join
     table t
     on find_in_set(c.code, t.codehwos) > 0
group by c.code;
If you have any influence over the data structure, then advocate for a junction table, the right way to store this data in a relational database.
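A minimal sketch of the junction-table layout, again using Python's sqlite3 as a stand-in for MySQL. The table name row_codes and its column names are assumptions chosen for illustration; the comma-separated strings are the ones from the question, split once at load time.

```python
import sqlite3

# The original comma-separated rows from the question.
rows = [(1, "16,17,15,26"), (2, "15,32,12,23"), (3, "53,15,21,26")]

conn = sqlite3.connect(":memory:")
# One row per (row_id, code) pair instead of one string per row.
conn.execute(
    "CREATE TABLE row_codes (row_id INTEGER, code INTEGER, PRIMARY KEY (row_id, code))"
)
for row_id, csv_codes in rows:
    conn.executemany(
        "INSERT INTO row_codes VALUES (?, ?)",
        [(row_id, int(code)) for code in csv_codes.split(",")],
    )

# Counting becomes a plain GROUP BY, with no string functions involved.
counts = dict(conn.execute("SELECT code, COUNT(*) FROM row_codes GROUP BY code"))
```

With this layout the count no longer depends on find_in_set, and an ordinary index on code can be used.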

Access a parent field from sub query in mysql

I'm trying to access a field being called from the parent query within a nested one and here is my table
TABLE: reminders.
Columns: id:PK, rid:VARCHAR, title:VARCHAR, remind:Integer, start_day:DATE
SELECT id, remind, rid, title
FROM reminders
WHERE DATEDIFF(start_day, NOW()) <= (SELECT LEAST(3, remind))
Basically, the second "remind" column in the LEAST() call is supposed to reference the "remind" value of each row being scanned, but for reasons I can't work out I keep getting unexpected results.
EDIT
In response to Gordon's request for more detailed info: I don't really know how to present table data here, but I'll try my best.
So basically I'm trying to SELECT all rows from the reminders table WHERE the difference between the set day (start_day) and today doesn't exceed one of two values: either 3 or the value in the remind column of the current row. If the value in remind is less than 3, it should be used; if it exceeds 3, then 3 should be used. Here's a visual of the table.
+---+-----------------+--------------------+-----------------+-------------+
|id | rid | title | start_day | remind |
+---+-----------------+--------------------+-----------------+-------------+
|1 | ER456GH | This is real deep | 2014-01-01 | 10 |
|2 | OUBYV90 | This is also deep | 2014-01-13 | 10 |
|3 | UI90POL | This is deeper | 2014-01-13 | 60 |
|4 | TWEET90 | This is just deep | 2014-01-14 | 0 |
+---+-----------------+--------------------+-----------------+-------------+
While editing this I realized that the remind value on the 4th entry (remind = 0) was wrong, and that was what caused the unexpected results. Sigh. Some serious short-sightedness or lack of sleep on my part, I guess. The query does work. Thanks again.
You don't need a subquery here. Does this work?
SELECT id, remind, rid, title
FROM reminders
WHERE DATEDIFF(start_day, NOW()) <= LEAST(3, remind);
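The behaviour of the filter, including why a remind value of 0 excludes a row, can be checked with a small sketch. It uses Python's sqlite3 rather than MySQL, so two substitutions are assumptions made for portability: the difference of two julianday() values stands in for DATEDIFF, and SQLite's two-argument min() stands in for LEAST. "Today" is passed in as a parameter instead of NOW() so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reminders (id INTEGER PRIMARY KEY, rid TEXT, title TEXT, "
    "start_day TEXT, remind INTEGER)"
)
conn.executemany(
    "INSERT INTO reminders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ER456GH", "This is real deep", "2014-01-01", 10),
        (2, "OUBYV90", "This is also deep", "2014-01-13", 10),
        (3, "UI90POL", "This is deeper", "2014-01-13", 60),
        (4, "TWEET90", "This is just deep", "2014-01-14", 0),
    ],
)

def due_ids(conn, today):
    """Ids whose start_day is at most min(3, remind) days after `today`."""
    sql = """
        SELECT id FROM reminders
        WHERE julianday(start_day) - julianday(?) <= min(3, remind)
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, (today,))]

# Row 4 is two days away but its threshold is min(3, 0) = 0, so it is skipped.
due = due_ids(conn, "2014-01-12")
```

On 2014-01-14 itself the difference for row 4 drops to 0, so it would then be returned as well.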

Database structure for a classifieds website

I am developing a classifieds website similar to Quickr.com.
The main problem is that each category requires a different set of properties. For example, for a mobile phone the attributes might be Manufacturer, Operating System, Is Touch Screen, Is 3G enabled etc... Whereas for an apartment the attributes are Number of bedrooms, Is furnished, Which floor, total area etc. Since the attributes and the number of attributes varies for each category, I am keeping the attributes and their values in separate tables.
My current database structure is
Table classifieds_ads
This table stores all the ads. One record per ad.
ad_id
ad_title
ad_desc
ad_created_on
cat_id
Sample data
-----------------------------------------------------------------------------------------------
|ad_id | ad_title | ad_desc | ad_created_on | cat_id |
-----------------------------------------------------------------------------------------------
|1 | Nokia Phone | Nokia n97 phone for sale. Excellent condition | <timestamp> | 2 |
-----------------------------------------------------------------------------------------------
Table classifieds_cat
This table stores all the available categories. cat_id in the classifieds_ads table relates to cat_id in this table.
cat_id
category
parent_cid
Sample data
-------------------------------------------
|cat_id| category | parent_cid |
-------------------------------------------
|1 | Electronics | NULL |
|2 | Mobile Phone | 1 |
|3 | Apartments | NULL |
|4 | Apartments - Sale | 3 |
-------------------------------------------
Table classifieds_attribute
This table contains all the available attributes for a particular category. Relates to classifieds_cat table.
attr_id
cat_id
input_type
attr_label
attr_name
Sample data
-----------------------------------------------------------
|attr_id | cat_id | attr_label | attr_name |
-----------------------------------------------------------
|1 | 2 | Operating System | Operating_System |
|2 | 2 | Is Touch Screen | Touch_Screen |
|3 | 2 | Manufacturer | Manufacturer |
|4 | 3 | Bedrooms | Bedrooms |
|5 | 3 | Total Area | Area |
|6 | 3 | Posted By | Posted_By |
-----------------------------------------------------------
Table classifieds_attr_value
This table stores the attribute value for each ad in classifieds_ads table.
attr_val_id
attr_id
ad_id
attr_val
Sample data
---------------------------------------------
|attr_val_id | attr_id | ad_id | attr_val |
---------------------------------------------
|1 | 1 | 1 | Symbian OS |
|2 | 2 | 1 | 1 |
|3 | 3 | 1 | Nokia |
---------------------------------------------
========
Is this design okay?
Is it possible to index this data with solr?
How can I perform a faceted search on this data?
Does MySQL support field collapsing like solr?
My suggestion is to remove cat_id from the classifieds_attribute table, then create a new table.
The new table would look like:
cat_attr | id | cat_id | attr_id
This should help you decrease redundancy.
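One way this suggestion could look, sketched with Python's sqlite3 in place of MySQL. The column types, the attrs_for_category helper, and the idea of linking "Posted By" to two categories are assumptions made for illustration; only the table and column names come from the posts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# classifieds_attribute no longer carries cat_id.
conn.execute(
    "CREATE TABLE classifieds_attribute (attr_id INTEGER PRIMARY KEY, "
    "attr_label TEXT, attr_name TEXT)"
)
# cat_attr links categories to attributes, one row per pairing.
conn.execute(
    "CREATE TABLE cat_attr (id INTEGER PRIMARY KEY, cat_id INTEGER, "
    "attr_id INTEGER, UNIQUE (cat_id, attr_id))"
)
conn.executemany(
    "INSERT INTO classifieds_attribute VALUES (?, ?, ?)",
    [
        (1, "Operating System", "Operating_System"),
        (2, "Is Touch Screen", "Touch_Screen"),
        (3, "Manufacturer", "Manufacturer"),
        (4, "Bedrooms", "Bedrooms"),
        (5, "Total Area", "Area"),
        (6, "Posted By", "Posted_By"),
    ],
)
# Hypothetical links: "Posted By" is defined once but used by categories 2 and 3.
conn.executemany(
    "INSERT INTO cat_attr (cat_id, attr_id) VALUES (?, ?)",
    [(2, 1), (2, 2), (2, 3), (2, 6), (3, 4), (3, 5), (3, 6)],
)

def attrs_for_category(conn, cat_id):
    """Attribute labels attached to one category, in attr_id order."""
    sql = """
        SELECT a.attr_label
        FROM cat_attr AS ca
        JOIN classifieds_attribute AS a ON a.attr_id = ca.attr_id
        WHERE ca.cat_id = ?
        ORDER BY a.attr_id
    """
    return [row[0] for row in conn.execute(sql, (cat_id,))]

phone_attrs = attrs_for_category(conn, 2)
```

The redundancy saving shows up when the same attribute applies to several categories: it is stored once and linked as often as needed.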
Your design is fine, although I question why you are using hierarchical categories. I understand that you want to organize categories from an end-user standpoint. The hierarchy helps them drill down to the category that they are looking for. However, your schema allows for attribute values at every level. I would suggest that you only need (or possibly want) attributes at the leaf level.
It is certainly possible that you could come up with attributes that would be applicable at higher levels, but this would drastically complicate your management of the data since you'd have to spend a lot of time thinking about exactly how high up the chain a certain attribute belongs and whether or not there is some reason why a lower level might be an exception to the parent rule and so forth.
It also certainly overcomplicates your retrieval, which is part of the reason for your question, I think.
I would suggest creating an additional table that will be used to manage the hierarchy of categories above the leaf level. It would look exactly like your classifieds_cat table except the involuted relationship will obviously be to the new table. Then classifieds_cat.parent_cid becomes an FK to the new table rather than an involuted FK to classifieds_cat.
I think this schema change will reduce your application and data management complexity.

SSIS MDX Query Problem

Hello everyone!
I have a little problem with my MDX query.
I am trying to query the Damage Repair Types from my cube. First, let me explain my dimension and the fact table:
Dimension: Damage Repair Type
RepairTypeKey | Name        | RepairTypeAlternateKey | RepairSubTypeAlternateKey | SubName
0             | Unknown     | 0                      | NULL                      | NULL
1             | Repair      | 1                      | 1                         | 1 Boil
2             | Replacement | 2                      | NULL                      | NULL
3             | Repair      | 1                      | 2                         | 2 Boils
4             | Repair      | 1                      | 3                         | 3 Boils
In my fact table "ClaimCosts" there is one RepairTypeKey for every claim. I fill the tables and design a cube. The dimension has a hierarchy with RepairType and SubRepairType. I process the cube and it works fine:
Damage Repair Type
  Hierarchy
    Members
      All
        Replacement
        Repair
          1 Boil
          2 Boils
          3 Boils
        Unknown
Now I create a query in MDX:
select
  {
    [Measures].[Claim Count],
    [Measures].[Claim Cost Position Count],
    [Measures].[Claim Cost Original],
    [Measures].[Claim Cost Original Average],
    [Measures].[Claim Cost Possible Savings],
    [Measures].[Claim Cost Possible Savings Average],
    [Measures].[Claim Cost Possible Savings Percentage]
  } on 0,
  NON EMPTY {
    NonEmpty([Damage Repair Type].[Hierarchy].Allmembers, ([Measures].[Claim Count]))
  } on 1
from
  Cube
where
  (
    ({StrToMember(#DateFrom) : StrToMember(#DateTo)})
    ,([Claim Document Type].[Document Type].&[4])
  )
When I run the query it works, but too many rows are shown:
Damage Repair Type | Damage Repair Sub Type | Claim Count | ....
NULL               | NULL                   | 200000
Replacement        | NULL                   | 150000
Repair             | NULL                   | 45000
Repair             | 1 Boil                 | 10000
Repair             | 2 Boil                 | 15000
Repair             | 3 Boil                 | 19000
Unknown            | NULL                   | 1000
My problem is the first row (a sum) and the third row (also a sum). I don't need these rows, because I already have the children with the right counts, but I don't know how to filter them out.
How can I filter these rows out? Please help me.
Sorry for my bad English, and thank you!
Alex
Instead of:
NonEmpty([Damage Repair Type].[Hierarchy].Allmembers, ([Measures].[Claim Count]))
you can use:
NonEmpty([Damage Repair Type].[Hierarchy].Levels(2).Members, [Measures].[Claim Count])
This way we exclude the All members. Also, when you use the level members (e.g. [dim].[hier].[lvl].Members) instead of the hierarchy members (e.g. [dim].[hier].members) you don't get the aggregate members - e.g. the All member which is commonly present in all hierarchies other than non-aggregatable attribute hierarchies.