I have an imported table of several thousand customers, the development I am working on runs on the basis of anonymity for purchase checkouts (customers do not need to log in to check out), but if enough of their details match the database record then do a soft match and email the (probably new) email address and eventually associate the anonymous checkout with the account record on file.
This is rolling out this way due to the age of the records, many people have the same postal address or names but not the same email address, likewise some people will have moved house and some people will have changed name (marriage etc).
What I think I am looking for is a MySQL CASE system, however the CASE questions on Stack Overflow I've found don't appear to cover what I'm trying to get from this query.
The query should work something like this:
$input[0] = postcode (zip code)
$input[1] = postal address
$input[2] = phone number
$input[3] = surname
$input[4] = forename
SELECT account_id FROM account WHERE <4 or more of the variables listed match the same row>
The only way I KNOW I can do this is with a massive bunch of OR statements but that's excessive and I'm sure there's a cleaner more concise method.
I also apologise in advance if this is relatively easy but I don't [think I] know the keyword to research constructing this. As I say, CASE is my best guess.
I'm having trouble working out how to manipulate CASE to fit what I'm trying to do. I do not need to return the values only the account_id from the valid row (only) that matches 4 or 5 of the given inputs.
I imagine that I could construct a layout that does this:
SELECT account_id CASE <if postcode_column=postcode_var> X=X+1
CASE <if surname_column=surname_var> X=X+1
...
...
WHERE X > 3
Is CASE the right idea?
If not, What is the process I need to use to achieve the desired results?
What is [another] MySQL keyword / syntax I need to research, if not CASE.
Here is your pseudo query:
SELECT account_id
FROM account
WHERE (postcode = 'pc')+
(postal_address = 'pa')+
(phone_number = '12345678901')+
(surname = 'sn')+
(forename= 'fn') > 3
I stumbled uppon the same problem like in neo4j - Relationship between three nodes. I am new in Neo4J and wanted to model a relational database as a graph.
I have three types of data, A, B and C. An instance of B, B1, can belong to an instance of C, C1, and an instance of A, A1 can belong to B1 and C1, but if and only if B1 belongs to C1. The problem here is that, i can also have A1 belonging to B2 which belongs to C1 and A1 also belonging to B1 which belongs to C2. These would be three legit, separate combinations. So, in my DB, i would have a join table where in a record i would keep the combination of the three ids, expressing the relationships formed by the instances.
Now, from what i can figure, hyperedges is something like what i am searching(http://neo4j.com/docs/stable/cypher-cookbook-hyperedges.html). I was thinking, for starters, to add relationships like these:
A-[:PART_OF]->TRIPLE
B-[:PART_OF]->TRIPLE
C-[:PART_OF]->TRIPLE
1st problem/question: if a type B instance was to be deleted then the triple node would remain. I am guessing i would have to handle this in some way when deleting things?
2nd problem/question: My main question is, would it be more efficient if i added some more relationships? For example A-[:BELONGS_TO]->C. So, if i wanted all type C instances that A1 belongs to, i could have the results back with one hop. The relationship traversals would be less in number than the "triple" type nodes that would have to be visited. I think basically i am asking how Neo4j is looking for patterns internally(Where does it start, does it re-examine nodes etc). Because i am guessing(maybe wrongfully) if i want the Cs that A1 belongs to, i would write something like:
match A1-[:PART_OF]->triple<-[:PART_OF]-C
return C
How would that query work? First retrieve all triples A1 is part of, then those all C types are part of and finally return the matches?
Any hints?
Thanks in advance.
If I understand you correctly, you really want to represent a transitive relationship, as in:
(:A)-[:BELONGS_TO]->(:B)-[:BELONGS_TO]->(:C)
If so, there is no need to use "triplets", as neo4j can easily represent and handle transitive relationships.
Suppose your test data (as stated in your question) is created this way:
CREATE
(a1:A {name:"a1"}),
(b1:B {name:"b1"}), (b2:B {name:"b2"}),
(c1:C {name:"c1"}), (c2:C {name:"c2"}),
(a1)-[:BELONGS_TO]->(b1),
(b1)-[:BELONGS_TO]->(c1),
(b1)-[:BELONGS_TO]->(c2),
(a1)-[:BELONGS_TO]->(b2),
(b2)-[:BELONGS_TO]->(c1)
With that data, you can use the following query to get the C nodes that a specific A node indirectly belongs to. This query also returns the path that was used, since there can be multiple paths between any specific A and C pair. Here is a console that shows this working.
MATCH p=(:A { name:"a1" })-[:BELONGS_TO*2..2]->(c:C)
RETURN p, c;
If you just wanted distinct C nodes in the result, this would do that:
MATCH (:A { name:"a1" })-[:BELONGS_TO*2..2]->(c:C)
RETURN DISTINCT c;
Background
I'm faced with the following problem, relating to three tables
class_sectors table contains three categories of classes
classes table contains a list of classes students can attend
class_choices contains the first, second and third class choice of the student, for each sector. So for sector 1 Student_A has class_1 as first choihce, class_3 as second choice and class_10 as third choice for example, then for sector 2 he has another three choices, etc...
The class_choices table has these columns:
kp_choice_id | kf_personID | kf_sectorID | kf_classID | preference | assigned
I think the column names are self explanatory. preference is either 1, 2 or 3. And assigned is a boolean set to 1 once we have reviewed a student's choices and assigned them to a class.
Problem:
Writing an sql query that tells the students what class they are assigned to for each sector. If their class hasn't been assigned, it should default to show their first preference.
I have actually got this to work, but using two (very bloated??) sql queries as follows:
$choices = $db -> Q("SELECT
*, concat_ws(':', `kf_personID`, `kf_sectorID`) AS `concatids`
FROM
`class_choices`
WHERE
(`assigned` = '1')
GROUP BY
`concatids`
ORDER BY
`kf_personIDID` ASC,
`kf_sectorID` ASC;");
$choices2 = $db -> Q("SELECT
*, concat_ws(':', `kf_personID`, `kf_sectorID`) AS `concatids`
FROM
`class_choices`
WHERE
`preference` = '1'
GROUP BY
`concatids`
HAVING
`concatids` NOT IN (".iimplode($choices).")
ORDER BY
`kf_personID` ASC,
`kf_sectorID` ASC;");
if(is_array($choices2)){
$choices = array_merge($choices,$choices2);
}
Now $choices does have what I want.
But I'm sure there is a way to simplify this, merge the two SQL queries, and so it's a bit more lightweight.
Is there some kind of conditional SQL query that can do this???
Your solution uses two steps to enable you to filter the data as needed. Since you are generating a report, this is a pretty good approach even if it looks a bit more verbose than you might like.
The advantage of this approach is that it is much easier to debug and maintain, a big plus.
To improve the situation, you need to consider the data structure itself. When I look at the class_choices table, I see the following fields: kf_classID, preference, assigned which contain the key information.
For each class, the assigned field is either 0 (default) or 1 (when the class preference is assigned for the student). By default, the class with preference = 1 is the assigned one since you display it in the report when assigned=0 for all the student's class choices in a particular sector.
The data model could be improved by imposing a business rule as follows:
For preference=1 set the default value assigned=1. When the class selection process
takes place, and if the student gets assigned the 2nd or 3rd choice, then preference 1 is unassigned and the alternate choice assigned.
This means a bit more code in the application but it makes the reporting a bit easier.
The source of the difficulty is that the assignment process does not explicitly assign the 1st preference. It only updates assigned if the student cannot get the 1st choice.
In summary, your SQL is good and the improvements come from taking another look at the data model.
Hope this helps, and good luck with the work!
I have a couple systems which contain a users' table along with some form of karma/weight/reputation. Sometimes it's the number of posts a user has made, sometimes it's the number of up/down votes a user has received across all their activity on the site.
USER {
id int
name string
karma int
}
How do I use these numbers to calculate that user's "weight" or "authority"? For example, the vote of one long-time member is often worth much more than 4 votes from brand new users.
I was thinking about adding up the total points/karma/reputation of all members and then trying to come up with a 1-100 scale.
SUM(user.points) / COUNT(user.*) = average user points
Then something like
CEIL(userA.points / average user points) = their weight on an issue
However, there also needs to be a curve on the points this way as I don't want someone with 5,000 posts/karma to out weigh 20 new users votes.
Mathematically, your best bet is to weight by the log of the percentile ranking of user in question. However, that is painful in SQL.
Simpler would be to cheat and assume the mean is the same as the median (a very bad assumption statistically, but much simpler programmatically):
SELECT 1 - log10(SELECT COUNT (*) FROM user
WHERE (SUM(user.points) / COUNT(user.*)) < user.points)
/ SELECT (COUNT (*) from user))
In this way, your top 10% of karma would have one and a half the impact of your average user, almost twice the impact of a noob.
Changing the log base would scale this, obviously, where natural log (log() in mysql) would give the upper 10% 3 times as much impact as a noob, and twice the impact as average. Log2() is even more extreme. (Note: subtraction is required because the log will be negative.)
If you want a more severe effect you might try squaring the log. (Note: squaring makes the log squared positive, so addition is appropriate here.)
If you want a hyperprecise rule, you can go into standard deviations, but the sql gets cumbersome and slow. It all depends on how far down the rabbit hole you want to go....
There are probably some resources that can provide you with parameters for this, but you should probably decide exactly what you want rather than using some predefined model. I suggest you define some rules for which sets of users should be equivalent or which should outweigh each other (e.g. 10 0 karma users = 1 5k karma user) (equivalence is much easier to work with), which will very quickly produce parameters for some chosen equation.
Using log (as already suggested), some (fractional) power (like square root) or even just linear can work.
I suggest something like newKarma = a.karma^b + c, and it shouldn't be to difficult to solve a, b and c. I suggest you pick b rather than trying to calculate it. Using new users (with karma = 0) should make this quite easy to solve. Guessing values to get close to what you want can be easier than determining them mathematically (since some rules together won't fit any simple equation).
Note that c above is an offset to karma, which will give many new users more total karma than high-karma users. You may also want to think about a.(karma + c)^b, or a.(karma + c)^b + d. Analysing the rules you defined should tell you which one to use.
UPDATE: Added alternatives for c
EDIT: You have some options for SQL. A temp table (with sums) might actually be the fastest. You can also just use a view. A join on the same table might also be possible, though I'm not sure. Using a view would look something like: (for some chosen a,b,c and d) (you may also want to add indices to the view)
Votes(issueID, userID) // table structure
User(userID, karma, ...) // table structure
CREATE VIEW Sums AS
SELECT issueID, SUM(1*POWER(karma + 2, 3) + 4) AS sumVal
FROM Votes JOIN User ON User.userID = Votes.userID
GROUP BY issueID
Query:
SELECT (1*POWER(karma + 2, 3) + 4)/sumVal AS influenceOnIssue
FROM Votes JOIN User ON User.userID = Votes.userID
JOIN Sums on Sums.issueID = Votes.issueID
WHERE Votes.userID = #UserID AND Votes.issueID = #IssueID
A simplification may be to have a computed column that = 1*POWER(karma + 2, 3) + 4
The faster option would be to calculate the derived karma on insert/update, either by having an additional column and using triggers or just calculating in before you call insert/update, and calling insert/update with the new value.
Our win32 application assembles objects from the data in a number of tables in a MySQL relational database. Of such an object, multiple revisions are stored in the database.
When storing multiple revisions of something, sooner or later you'll ask yourself the question if you can visualize the differences between two revisions :) So my question is: what would be a good way to "diff" two such database objects?
Would you do the comparison at the database level? (Doesn't sound like a good idea: too low-level, and too sensitive to the schema).
Would you compare the objects?
Would you write a function that "manually" compares the properties and fields of two objects?
How would you store the diff? In a separate, generic "TDiff" object?
Any general recommendations on how to visualize such things in a user interface?
Advice, or stories about your own experiences with this, are very welcome; thanks a bunch!
Extra info on use case (20090515)
In reply to Antony's comment: this specific application is used to schedule training courses, run by teams of teachers. The schedule of a teacher is stored in various tables in the database, and contains info such as "where does she have to go on which day", "who are her colleagues in the team", etc. This information is spread out over multiple tables.
Once in a while, we "publish" the schedule, so the teachers can see it on a webpage. Each "publication" is a revision, and we'd like to be able to show the users (and later also the teachers) what's changed between two publications --- if anything.
Hope that makes the scenario a bit more tangible :)
Some final remarks
Well, the bounty has come to an end, so I've accepted an answer. If it'd somehow be possible to slice a couple of extra 100's off of my rep and give it to some of the other answers, I would do so without hesitation. All your guys' help has been great, and I am very grateful! ~ Onno 20090519
Just an idea, but would it be worthwhile for you to convert the two object versions being compared to some text format and then comparing these text objects using an existing diff program - like diff for example? There are lots of nice diff programs out there that can offer nice visual representations, etc.
So for example
Text version of Object 1:
first_name: Harry
last_name: Lime
address: Wien
version: 0.1
Text version of Object 2:
first_name: Harry
last_name: Lime
address: Vienna
version: 0.2
The diff would be something like:
3,4c3,4
< address: Wien
< version: 0.1
---
> address: Vienna
> version: 0.2
Assume that a class has 5 known properties - date, time, subject, outline, location. When I look at my schedule, I'm most interested in the most recent (ie current/accurate) version of these properties. It would also be useful for me to know what, if anything, has changed. (As a side note, if the date, time or location changed, I'd also expect to get an email/sms advising me in case I don't check for an updated schedule :-))
I would suggest that the 'diff' is performed at the time the schedule is amended. So, when version 2 of the class is created, record which values have changed, and store this in two 'changelog' fields on the version 2 object (there must already be one parent table that sits atop all your tables - use that one!). One changelog field is 'human readable text' eg 'Date changed from Mon 1 May to Tues 2 May, Time changed from 10:00am to 10:30am'. The second changelog field is a delimted list of changed fields eg 'date,time' To do this, before saving you would loop over the values submitted by the user, compare to current database values, and concatenate 2 strings, one human readable, one a list of field names. Then, update the data and set your concatenated strings as the 'changelog' values.
When displaying the schedule load the current version by default. Loop through the fields in the changelog field list, and annotate the display to show that the value has changed (a * or a highlight, etc). Then, in a separate panel display the human readable change log.
If a schedule is amended more than once, you would probably want to combine the changelogs between version 1 & 2, and 2 & 3. Say in version 3 only the course outline changed - if that was the only changelog you had when displaying the schedule, the change to date and time wouldn't be displayed.
Note that this denormalised approach won't be great for analysis - eg working out which specific location always has classes changed out of it - but you could extend it using an E-A-V model to store the change log.
Doing a comparison at the database level would be good if what you cared about was changes to the database. That makes the most sense if you're trying to design a layer of generic functionality on top of the database itself.
Doing a comparison at the object level would be good if you care about changes to the data. For example, if the data was the input to a program and you were interested in looking at changes in the input to verify that changes to the output were correct.
Your use case doesn't appear to be either of these. You appear to care about the output and want differences from that perspective. If that's the case, I would do differences on the output report (or a pure-text version of it) instead of on the underlying data. You can do that with any off-the-shelf diff tool. To make things easier for your end-users you could parse the diff results and render them as HTML. There are lots of options here: side-by-side with color coding to indicate changes, one document with markup for changes (e.g. red strikethrough for deletions and green for additions), maybe just highlight areas that have changed and use balloons to show the previous/current values on demand.
I've thought about doing database comparisons but never tried to implement it. As you noted, any such attempts are intimately intertwined with the schema.
I have done object-level comparisons. The general algorithm was this:
Do a set comparison on the lists of object IDs. This creates three result groupings: added objects, deleted objects, and objects that live in both sets.
Report the deletions.
Report the additions.
For the things in both sets, do an attribute-by-attribute comparison.
If any differences are found, report the object ID, the attributes that differ, and the respective values. If appropriate, highlight the portion of the attribute value that has changed.
In my case, the comparison algorithms were hand-written to match the object attributes. This gave me control over which attributes were compared and how. A generic comparator might be possible for some cases but would depend on the situation and at least partially on the implementation language.
I've looked into MysQL Diffing a number of times. Unfortunately, there aren't any really good solutions available.
One tool I've tried was mysqldiff (www.mysqldiff.org). mysqldiff is a tool written in PHP which is capable of diffing mysql schemas. Unfortunately, it doesn't do a great job a lot of the time.
MySQL Workbench, MySQLs own SQL IDE provides the option to generate an alter script and I would imagine it does this by performing some kind of diff operation internally.
Aqua Data Studio is another tool that is capable of comparing schemas and outputing a diff of the two. While the ADS diff is quite nice, it does not provide a tool to create an alter script.
If I were writing my own I guess I would write code capable of comparing structure of two tables. Such code could be tuned to be highly sensitive (Ig if column order differs from from version to the next, it's a difference) or more moderately sensitive (Eg Column order is not a major issue, datatypes and lengths are important, as are indices and constraints).
Storage, I'm not to sure. I would look into how a version control system such as Mercurial stores its diff information for revisions and use that to elaborate a method appropriate for the DB.
Finally, for visual output I recommend you take a look at the Aqua Data Stduio compare feature (You can use the Trial version to test this...). Its diff output is pretty good.
My application dbscript compares hierarchical data (database schemas) in a stored procedure, which of course has to compare each field/property of every object with its counterpart. I guess you won't get around that step (unless you have a generic object description model)
As for the UI part of your question, have a look at screenshots to view and select differences.
I would think about some sort of common text representation of the objects and let the texts compare with an existing diffing tool like WinMerge.
I see no need to invent diffing by myself since there are already plenty of nice tools I can use.
In your situation in PostgreSQL I used a difference tables with the schema:
history_columns (
column_id smallint primary key,
column_name text not null,
table_name text not null,
unique (table_name, column_name)
);
create temporary sequence column_id_seq;
insert into history_columns
select nextval('column_id_seq'), column_name, table_name
from information_schema.columns
where
table_name in ('table1','table2','table3')
and table_schema=current_schema() and table_catalog=current_database();
create table history (
column_id smallint not null references history_columns,
id int not null,
change_time timestamp with time zone not null
constraint change_time_full_second -- only one change allowed per second
check (date_trunc('second',change_time)=change_time),
primary key (column_id,id,change_time),
value text
);
And on the tables I used a trigger like this:
create or replace function save_history() returns trigger as
$$
if (tg_op = 'DELETE') then
insert into historia values (
find_column_id('id',tg_relname), OLD.id,
date_trunc('second',current_timestamp),
OLD.id );
[for each column_name] {
if (char_length(OLD.column_name)>0) then
insert into history values (
find_column_id(column_name,tg_relname), OLD.id,
OLD.change_time, OLD.column_name
)
}
elsif (tg_op = 'UPDATE') then
[for each column_name] {
if (OLD.column_name is distinct from NEW.column_name) then
insert into history values (
find_column_id(column_name,tg_relname), OLD.id,
OLD.change_time, OLD.column_name
);
end if;
}
end if;
$$ language plpgsql volatile;
create trigger save_history_table1
before update or delete on table1
for each row execute procedure save_history();
This isn't really an answer to the question you asked rather an attempt to re-imagine the problem. Would you consider altering your database and object model to store the aggregate root and a series of deltas? That is, model and store RevisionSets that are collections of Revisions; a Revision is an entity property paired with a value. In a sense this is internalizing the revision structure into your architecture that the other posters are suggesting that you bolt-on to what you already have via "logs".
It's trivial to display the aggregate from the deltas, and even easier to display the deltas as a change history. The fact that you are using a rich client with state and local memory makes this even more compelling. You could very easily display "all the changes since date xxxx" without revisiting the database.
Credit for the basic idea goes to Greg Young and his work with financial data streams, but it is imminently applicable to your problem.
I'm riffing off of what Harry Lime suggested: Output your properties to text format, then hash the results. That way you can compare the hash values and easily flag the data that has been altered. This way you get the best of both worlds as you can visually see differences but programmatically identify differences. With the has you'll have a good source for an index should you want to store and retrieve the deltas.
Given you want to create a UI for this and need to indicate where the differences are, it seems to me you can either go custom or create a generic object comparer - the latter being dependent on the language you are using.
For the custom method, you need to create a class that takes to two instances of the classes to be comparied. It then returns differences;
public class Person
{
public string name;
}
public class PersonComparer
{
public PersonComparer(Person old, Person new)
{
....
}
public bool NameIsDifferent() { return old.Name != new.Name; }
public string NameDifferentText() { return NameIsDifferent() ? "Name changed from " + old.Name + " to " + new.Name : ""; }
}
This way you can use the NameComparer object to create your GUI.
The gereric approach would be much the same, just that you generalize the calls, and use object insepection (getObjectProperty call below) to find differences;
public class ObjectComparer()
{
public ObjectComparer(object old, object new)
{
...
}
public bool PropertyIsDifferent(string propertyName) { return getObjectProperty(old, propertyName) != getObjectProperty(new, propertyName) };
public string PropertyDifferentText(string propertyName) { return PropertyIsDifferent(propertyName) ? propertyName + " " + changed from " + getObjectProperty(old, propertyName) + " to " + getObjectProperty(new, propertyName): ""; }
}
}
I would go for the second, as it makes things really easy to change GUI on needs. The GUI I would try 'yellowing' the differences to make them easy to see - but that depends on how you want to show the differences.
Getting the object to compare would be loading your object with the initial revision and latest revision.
My 2 cents... Not as techy as the database compare stuff already here.
Have you looked at Open Source DiffKit?
www.diffkit.org
I think it does what you want.
Example with Oracle.
Export ordered objects to text with dbms_metadata
Export ordered tables data into CSV or query format
Make big text file
Diff