Infinite Sub Category Ordering in MySQL - mysql

Given this table
Is it possible to write SQL to write a SELECT query that would result in an order such as this, while also being applicable regardless the number of subcategories in the table? I can write queries to fetch the correct order with a hard coded number of expected subcategories but I run into difficulty when that number is unknown.
Miscellaneous
Personal
Bobby
Jane
Susie
Tom
Other
Work Lunches
Or would I have to adjust the schema?

Well the best way is to manage the hierarchical data via the "Nested set" model:
http://en.wikipedia.org/wiki/Nested_set_model
it may seem alien first, but it is just fantastic once you get your head around it. (Yes I am using it, and it works great)
Of course this means you have to change your schema a bit (include the left & right values) and the selects/inserts/updates are different. But you can select or re-attach whole branches in one go very easily.

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database , there are the following examples:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this results in an undesirable amount of results which Name is not similar at all. Is there any way I can narrow down my results to include only similar (but not exact duplicate!) Name? It seems impossible, since I do not know exactly which rows have similar Name, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two(or more) strings whereas that poster seemed to have known some content. Ideally, I would like to have the query limit results to rows with the first 3 or 4 characters being the same in the "Name" column.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have exactly similar names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.
Issue with matching addresses is you have the same fuzy logic problems. 1st vs first. So if going this route I would actually turn into GPS coordinates using another service then accepting records within X amount of distance.
And the age old issue with this is Matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in gender of name but then Terry, Tracy, etc can be either....
Bottom line is only go the route of similarity of names if you have to and if you do look into other solutions like services by Melissa data, sql server data quality services as a tool.....
Update per comment about middle initial. If you always know the name will be the same except middle initial then this task can be fairly simple and not need any complicated algorithm. You could match based on one string + '%' being LIKE the other then testing to make sure length is only 2 different and that there is 1 more spaces in it than the smaller string. Or you could make an attempt at cleansing/removing the middle initial, this can be a little complicated if name has a space in it Doe, Ann Marie. But you could do it by testing if 2nd to last character is a space.

How can I replace a field with a similar result in MySQL

Unfortunately, I have to deal with a lot of user submitted data, text fields rather than option boxes. I have imported it into my MySQL database as strings. I do all this to be able to run statistics quickly on the data like top 10 most common companies. The problem I have run into is that some of the rows have slightly different names for the same companies. For example:
Brasfield & Gorrie, LLC VS Brasfield and Gorrie
Britt Peters and Associates VS Britt, Peters & Associates Inc.
Is there some fairly straightforward MySQL command or external tool that will allow me to go through and combine these sort of rows. I know how to use REPLACE(), but I don't think it has the power to do this simply. Correct me if I'm wrong!
Taking this example:
Brasfield & Gorrie, LLC VS Brasfield and Gorrie
Assuming that I want to keep the first one, I would find all records that have the ID of the second one and update them to use the first, assuming that this table that has these titles also has an ID field for each one.
You would create a page in PHP that will allow you to administer this with mouse clicks, but it will require regular pruning since you allow users to enter this data. For future entries, you can try to apply the Levenshtein Distance and try to provide a suggestion based on available similar matches so that you can help guide the users to something that already exists rather than a new db entry.

How do I properly structure my relational mySQL database

I am making a database that is for employee scheduling. I am, for the first time ever, making a relational mySQL database so that I can efficiently manage all of the data. I have been using the mySQL Workbench program to help me visualize how this is going to go. Here is what I have so far:
What I have pictured in my head is that, based on the drawing, I would set the schedule in the schedule table which uses references from the other tables as shown. Then when I need to display this schedule, I would pull everything from the schedule table. Whenever I've worked with a database in the past, it hasn't been of the normalized type, so I would just enter the data into one table and then pull the data out from that one table. Now that I'm tackling a much larger project I am sure that having all of the tables split (normalized) like this is the way to go, but I'm having trouble seeing how everything comes together in the end. I have a feeling it doesn't work the way I have it pictured, #grossvogel pointed out what I believe to be something critical to making this all work and that is to use the join function to pull the data.
The reason I started with a relational database was so that if I made a change to (for example) the shift table and instead of record 1 being "AM" I wanted it to be "Morning", it would then automatically change the relevant sections through the cascade option.
The reason I'm posting this here is because I am hoping someone can help fill in the blanks and to point me in the right direction so I don't spend a lot of hours only to find out I made a wrong turn at the beginning.
Maybe the piece you're missing is the idea of using a query with joins to pull in data from multiple tables. For instance (just incorporating a couple of your tables):
SELECT Dept_Name, Emp_Name, Stat_Name ...
FROM schedule
INNER JOIN departments on schedule.Dept_ID = departments.Dept_ID
INNER JOIN employees on schedule.Emp_ID = employees.Emp_ID
INNER JOIN status on schedule.Stat_ID = status.Stat_ID
...
where ....
Note also that a schedule table that contains all of the information needed to be displayed on the final page is not in the spirit of relational data modeling. You want each table to model some entity in your application, so it might be more appropriate to rename schedule to something like shifts if each row represents a shift. (I usually use singular names for tables, but there are multiple perspectives there.)
This is, frankly, a very difficult question to answer because you could get a million different answers, each with their own merits. I'd suggest you take a look at these (there are probably better links out there too, these just seemed like good points to note) :
http://www.devshed.com/c/a/MySQL/Designing-a-MySQL-Database-Tips-and-Techniques/
http://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form
http://www.sitepoint.com/forums/showthread.php?66342-SQL-and-RDBMS-Database-Design-DO-s-and-DON-Ts
I'd also suggest you try explaining what it is you want to achieve in more detail rather than just post the table structure and let us try to figure out what you meant by what you've done.
Often by trying to explain something verbally you may come to the realisations you need without anyone else's input at all!
One thing I will mention is that you don't have to denormalise a table to report certain values together, you should be considering views for that kind of thing...

Any way to compare/match sentences with only a different word order?

I have 2 MySQL tables , each with address data of companies in it. One table is more recent, but has no telephone and no website data. Now I want to unite these tables into 1 recent and complete table.
But for some companies the order of the words is different,like this:
'Bakery Johnson' in table 1 and 'Johnson Bakery' in table 2.
Now I need to find a way to compare these values, as they're obviously the same company.
I think I will somehow have to split those names first, and then order the different parts alphabetically.
Any chance anybody has done something like this before, and willing to share some code or function?
UPDATE:
I found a function that sorts words inside a string. I can use this to detect name swaps as described above. It's quite SLOW though...
See : MySQL: how to sort the words in a string using a stored function?
If your table is MyISAM you can run this query:
SELECT *
FROM mytable
WHERE MATCH(name) AGAINST ('+bakery +johnson')
This will find all records containing the words bakery and johnson (and probably some other words too).
Creating a FULLTEXT index on the table:
CREATE FULLTEXT INDEX
fx_mytable_name
ON mytable (name)
will speed up this query.
Going back a bit on your solution, you could go with a similar way as modern phones resolve duplicate names conflicts
You present your user with the option, as he finds something suspicious:
Is this a duplicate? Use our [ Merge ] option
You are merging Bakery Johnson, please select the source/original item:
[ Johnson Bakery v ] (my amazing dropdown!)
Everything not already in Johnson Bakery gets ported to Bakery Johnson (orders for example), you may also show an intermediate screen displaying what will be merged, or let the user pick, for example, he wants the address info from Johnson Bakery and orders from both etc
It is not self correcting as you asked, but the collaboration from the users may be more accurate than AI here. I also love low-tech solutions like this so let us know what you ended up doing.

Filter a MySQL Result in Delphi

I'm having an issue with a certain requirement to one of my Homework Assignments. I am required to take a list of students and print out all of the students with credit hours of 12 or more. The Credit hours are stored in a separate table, and referenced through a third table
basically, a students table, a classes table with hours, and an enrolled table matching student id to Course id
I used a SUM aggregate grouped by First name from the tables and that all works great, but I don't quite understand how to filter out the people with less than 12 hours, since the SQL doesn't know how many hours each person is taking until it's done with the query.
my string looks like this
'SELECT Students.Fname, SUM(Classes.Crhrs) AS Credits
FROM Students, Classes, Enrolled
WHERE Students.ID = Enrolled.StudentID AND Classes.ID = Enrolled.CourseID
GROUP BY Students.Fname;'
It works fine and shows the grid in the Delphi Project, but I don't know where to go from here to filter the results, since each query run deletes the previous.
Since it's a homework exercise, I'm going to give a very short answer: look up the documentation for HAVING.
Beside getting the desired result directly from SQL as Martijn suggested, Delphi datasets have ways to filter data on the "client side" also. Check the Filter property and the OnFilter record.
Anyway, remember it is usually better to apply the best "filter" on the database side using the proper SQL, and then use client side "filters" only to allow for different views on an already obtained data set, without re-querying the same data, thus saving some database resources and bandwidth (as long as the data on the server didn't change meanwhile...)