I have a MySQL database.
My main table is all the sentences of a series of 5 books, with indexes for the book, chapter in book, sentence in chapter.
Example: for Harry Potter book five, chapter 1, sentence 3, I'll have a row like this:
BookID ChapterID SentenceID Text
4 1 3 Deprived of their usual car-washing and lawn-mowing pursuits, the inhabitants of Privet Drive had retreated into the shade of their cool houses, windows thrown wide in the hope of tempting in a nonexistent breeze.
I need to retrieve all the occurrences of a letter or a word.
So if I search for 'e', I'll get the same row 17 times, because 'e' occurs 17 times in this row.
I've simplified the scenario; I have more information to retrieve for each letter.
So far I've been unable to get something useful.
Thank you
I'm 90% certain an SQL query won't return more than one row for a given database row, unless you use a JOIN. But that would be very inefficient for your purposes.
The way this would typically be implemented is using a query like SELECT * FROM books WHERE Text LIKE '%e%', which would return all of the rows that have at least one "e" in the text; then your application would iterate over the rows and count the occurrences.
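The filter-then-count approach described above could look roughly like the sketch below. It is only an illustration: SQLite (from Python's standard library) stands in for MySQL so the code runs anywhere, the sample sentences are made up, and find_occurrences is a hypothetical helper name. Note that the application-side search here is case-sensitive, and that a search term containing % or _ would need escaping before being used in LIKE.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names mirror the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (BookID INT, ChapterID INT, SentenceID INT, Text TEXT)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?, ?)",
    [(4, 1, 1, "The hottest day"), (4, 1, 2, "So far"), (4, 1, 3, "A breeze")],
)


def find_occurrences(conn, needle):
    """Return one (BookID, ChapterID, SentenceID, offset) tuple per occurrence."""
    hits = []
    # Step 1: let the database discard rows that cannot match at all.
    rows = conn.execute(
        "SELECT BookID, ChapterID, SentenceID, Text FROM books "
        "WHERE Text LIKE ? ORDER BY BookID, ChapterID, SentenceID",
        ("%" + needle + "%",),
    )
    # Step 2: the application walks each remaining row and records every hit.
    for book, chapter, sentence, text in rows:
        start = text.find(needle)
        while start != -1:
            hits.append((book, chapter, sentence, start))
            start = text.find(needle, start + 1)
    return hits


for hit in find_occurrences(conn, "e"):
    print(hit)
```

Each tuple identifies the row plus the character offset of the match, so any extra per-occurrence information can be attached in the loop.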
I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database, there are rows like the following:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns far too many rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Name values? It seems impossible, since I do not know exactly which rows have similar names, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to know some of the content. Ideally, I would like the query to limit results to rows where the first 3 or 4 characters of the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.
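As an illustration of one of the algorithms named above, here is a minimal Levenshtein distance sketch in plain Python (the function name levenshtein is just a label chosen for this example). It counts the single-character insertions, deletions and substitutions needed to turn one string into the other; a real deduplication job would probably use a tested library and combine the score with the other attributes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, using a row-by-row dynamic programme."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("Doe, John", "Doe, John A"))  # 2: a space and an initial were added
```

The cost is proportional to the product of the two string lengths per comparison, which is why narrowing candidates by DOB or another attribute first matters on a large table.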
The issue with matching addresses is that you have the same fuzzy logic problems, such as "1st" vs "first". So if going this route, I would actually turn the addresses into GPS coordinates using another service and then accept records within X amount of distance.
And the age-old issue with this is matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either...
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as services by Melissa Data or SQL Server Data Quality Services.
Update per the comment about the middle initial: if you always know the name will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match based on one string + '%' being LIKE the other, then test that the lengths differ by exactly 2 and that the longer string has one more space than the shorter one. Or you could attempt to cleanse/remove the middle initial; this can be a little complicated if the name has a space in it (Doe, Ann Marie), but you could do it by testing whether the second-to-last character is a space.
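A rough sketch of the prefix-plus-length test described above, done in application code. Everything here is an assumption for illustration: the function name middle_initial_pairs, the (id, name, dob) row shape, and the choice to only compare rows that share a DOB (following the earlier advice to narrow candidates by another attribute first).

```python
from collections import defaultdict


def middle_initial_pairs(rows):
    """rows: iterable of (id, name, dob). Returns (short_id, long_id) pairs."""
    # Only rows with the same DOB are compared with each other.
    by_dob = defaultdict(list)
    for row in rows:
        by_dob[row[2]].append(row)

    pairs = []
    for group in by_dob.values():
        for short_id, short_name, _ in group:
            for long_id, long_name, _ in group:
                # The longer name must be the shorter name plus exactly
                # " X": one extra space and one extra letter.
                if (
                    len(long_name) == len(short_name) + 2
                    and long_name.startswith(short_name + " ")
                    and long_name[-1].isalpha()
                ):
                    pairs.append((short_id, long_id))
    return pairs


rows = [
    (1, "Doe, John", "1990-01-01"),
    (2, "Doe, John A", "1990-01-01"),
    (3, "Doe, Ann", "1990-01-01"),
    (4, "Doe, Ann Marie", "1990-01-01"),
]
print(middle_initial_pairs(rows))
```

With this sample data only the John rows pair up; "Doe, Ann" and "Doe, Ann Marie" differ by more than two characters, so they are left alone.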
Take this table as an example :
CREATE TABLE UserServices (
ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
Service1 TEXT,
Service2 TEXT,
.
.
.
) ENGINE = MYISAM;
Every user will have a different number of services, so let's say the table starts with 10 service columns for each user. If one user has 11 services, must all other users have 11 columns too? Of course it is a table, and every row needs to have the same number of columns, but it just seems like an awful waste of memory. Maybe the use of another database type is better?
Thank you!!
Storing a boatload of nulls isn't really a "waste of memory" because the space is negligible - hard disks cost pence per gigabyte, programmers cost tens/hundreds of $/hr so it's certainly economical to burn the space and it's not really a great argument for avoidance.
There is a better argument though, as others have said: databases don't do variable numbers of columns for a particular ID in a table, but they DO do variable numbers of rows per ID. This is how DBs are designed: columns are fixed, rows are variable. Everything that a database does and offers in terms of querying, storage, retrieval, internal design etc. is optimised towards this pattern.
There are well-established operations (called pivots) that will turn your vertical arrangement of data into a horizontal one (with nulls) at query time, so you don't have to store the data horizontally.
Here's a pivot example:
Table:
ID, ServiceIdentifier, ServiceOwner
1, SV1, John
1, SV2, Sarah
2, SV1, Phil
2, SV2, John
2, SV3, Joe
3, SV2, Mark
SELECT
ID,
MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) as SV1_Owner,
MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) as SV2_Owner,
MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) as SV3_Owner
FROM
Table
GROUP BY
ID
Result:
ID  SV1_Owner  SV2_Owner  SV3_Owner
1   John       Sarah      NULL
2   Phil       John       Joe
3   NULL       Mark       NULL
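For anyone who wants to try the pivot without setting up a server, the sketch below runs the same MAX(CASE ...) query against an in-memory SQLite database from Python. SQLite is used only as a stand-in for MySQL, and ServiceOwners is a made-up table name (the example above just calls it Table, which is a reserved word).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ServiceOwners is a hypothetical name for the vertical table shown above.
conn.execute(
    "CREATE TABLE ServiceOwners (ID INT, ServiceIdentifier TEXT, ServiceOwner TEXT)"
)
conn.executemany(
    "INSERT INTO ServiceOwners VALUES (?, ?, ?)",
    [
        (1, "SV1", "John"),
        (1, "SV2", "Sarah"),
        (2, "SV1", "Phil"),
        (2, "SV2", "John"),
        (2, "SV3", "Joe"),
        (3, "SV2", "Mark"),
    ],
)

# One output column per service; MAX picks the single non-NULL value per group.
pivot = conn.execute(
    """
    SELECT
        ID,
        MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) AS SV1_Owner,
        MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) AS SV2_Owner,
        MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) AS SV3_Owner
    FROM ServiceOwners
    GROUP BY ID
    ORDER BY ID
    """
).fetchall()

for row in pivot:
    print(row)  # missing services come back as None (NULL)
```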
As noted, it's not a huge cost to just store the data horizontally and if you're sure the table will never change/ not need new columns adding on a weekly basis to cope with new services etc, then it might be a sensible developer optimisation to just have columns full of nulls. If you'll add columns regularly, or one day have thousands of services, then vertical storage is going to have to be the way it goes
To expand a little on what's already been said:
Is there a way to add an attribute to only 1 row in SQL?
No, and that's kinda fundamental to how relational databases (SQL) work - and that's in any version of SQL, whether it's MySQL, T-SQL, etc. If you have a table and you want to add an attribute to that table, it's going to be another column, and that column will be there for every row. Not just relational databases - that's just how tables work.
But, that's not how anyone would do it. What you would do is what Alan suggested - a separate table for Services, then a 3rd table (he suggested naming it 'UserServices') that links the two. And that's not a one-off suggestion - that's pretty much "the" way to do it. There's no waste.
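A minimal sketch of that three-table layout, run against in-memory SQLite from Python purely so it is easy to try. Only the UserServices name comes from the suggestion above; the Users and Services tables, their columns, the sample data and the services_for helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Users    (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Services (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);

    -- Link table: one row per (user, service) pair, so a user with
    -- 11 services simply has 11 rows here and nobody else is affected.
    CREATE TABLE UserServices (
        UserID    INTEGER NOT NULL REFERENCES Users(ID),
        ServiceID INTEGER NOT NULL REFERENCES Services(ID),
        PRIMARY KEY (UserID, ServiceID)
    );

    INSERT INTO Users    VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO Services VALUES (1, 'email'), (2, 'backup'), (3, 'hosting');
    INSERT INTO UserServices VALUES (1, 1), (1, 2), (1, 3), (2, 2);
    """
)


def services_for(conn, user_id):
    """Return the names of all services linked to one user."""
    rows = conn.execute(
        "SELECT s.Name FROM UserServices us "
        "JOIN Services s ON s.ID = us.ServiceID "
        "WHERE us.UserID = ? ORDER BY s.Name",
        (user_id,),
    )
    return [name for (name,) in rows]


print(services_for(conn, 1))
print(services_for(conn, 2))
```

Adding a new service is an INSERT into Services, not an ALTER TABLE, which is the main practical benefit of the design.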
Maybe the use of another database type is better?
Possibly. If you want something with fewer restrictions, then you could go with something other than SQL. Since SQL is so dominant, everything else is usually categorized as NoSQL. MongoDB is currently the most popular NoSQL database, which is why RC brought it up.
We have a large table with product information. Almost all the time we need to find product names that contain specific words, but unfortunately these queries take forever to run.
Example: Find all the products where the name contains the words "steel" and "102" (not necessarily next to each other, so a product like "Ninja steel iron 102 x" is a match, just like "Dragon steel 102 b").
Currently we are doing it like this:
SELECT columns FROM products WHERE name LIKE '%WORD1%' AND name LIKE '%WORD2%' (the number of like words are normally 2-4, but it can in theory be 7-8 or more).
Is there a faster way of doing this?
We are only matching words, so I wonder if that can help somehow (i.e. the products in the example above are matches, but "Samurai swordsteel 102 v" is not a match since "steel" doesn't stand alone).
My own thought is to make a helper table with the words from productnames in and then use that table to get the ids of the matching products.
i.e. a table like: [id, word, productid] so we get for example:
1, samurai, 3
2, swordsteel, 3
3, 102, 3
4, v, 3
Just wonder if there is a built in way to do this in MySQL, so I don't have to implement my own stuff + maintain two tables.
Thanks!
Unfortunately, you have wildcards at the beginning of the pattern. Hence, MySQL cannot use a standard index for this.
You have two options. First, if the words are really keywords/attributes, then you should have another table, with one row per word.
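If that is not the case, you can try a full-text index. Note that MySQL has settings for the minimum word length (ft_min_word_len for MyISAM, innodb_ft_min_token_size for InnoDB) and uses a stopword list. You should take these into account before building the index, since a short token such as "102" may otherwise not be indexed.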
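To make the first option (one row per word) concrete, here is a hedged sketch using in-memory SQLite from Python in place of MySQL. The product_words table follows the [word, productid] idea from the question; the index name and the add_product and find_products helpers are invented for this example, and splitting on whitespace is a simplification of real tokenising.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_words (word TEXT, productid INTEGER);
    -- Exact-match lookups on word can use an ordinary index.
    CREATE INDEX idx_product_words_word ON product_words (word, productid);
    """
)


def add_product(conn, product_id, name):
    """Insert a product and one helper row per distinct lower-cased word."""
    conn.execute("INSERT INTO products VALUES (?, ?)", (product_id, name))
    words = set(name.lower().split())
    conn.executemany(
        "INSERT INTO product_words VALUES (?, ?)",
        [(word, product_id) for word in words],
    )


def find_products(conn, words):
    """Return names of products that contain ALL of the given whole words."""
    words = sorted({w.lower() for w in words})
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT p.name FROM product_words w "
        "JOIN products p ON p.id = w.productid "
        f"WHERE w.word IN ({placeholders}) "
        "GROUP BY p.id, p.name "
        "HAVING COUNT(DISTINCT w.word) = ? "  # every searched word must be present
        "ORDER BY p.id"
    )
    return [name for (name,) in conn.execute(sql, (*words, len(words)))]


add_product(conn, 1, "Ninja steel iron 102 x")
add_product(conn, 2, "Dragon steel 102 b")
add_product(conn, 3, "Samurai swordsteel 102 v")
print(find_products(conn, ["steel", "102"]))
```

Because the lookup is an equality match on whole words, "swordsteel" does not match "steel", which is the behaviour the question asks for. The cost is keeping product_words in sync whenever a product name changes.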
I'm creating a DB for my office. We have about 200 employees. Each employee was required to complete at least 1 of 12 courses within 2 years of being hired to become qualified (so there are different completion/qualification dates for every course; some people have been here 20 years, some just 1 year). Some have completed multiple courses. Each course has to be refreshed periodically (each refresh period is different and based on the last refresher date). I'm having trouble with the layout of the table. Here's what I have as an idea, but I'm trying to see if there is a less busy way to lay out the data. I want to be able to run a query that tells me which person has completed which class (so it would have to look at all 3 class columns). I also want to be able to tell when their qualification has lapsed or is about to lapse. So far I've created an employee data table that looks like the table below.
ID  Name    Class1   Class2   Class3   QualDt-Cl1  QualDt-Cl2  QualDt-Cl3  LstRequal1  ...
1   Bob     Art      Spanish           3/17/1989   9/12/2010               3/8/2012
2   Sally   Math                       8/31/2012
3   George  Physics  History           2/6/2005    7/6/1996
4   Casey   History                    6/8/2000
5   Joe     English  Sports   Physics  12/10/1993  10/15/2001  4/22/2006
The classes are listed in their own table and each class column pulls from that. The qual date refresher will be a calculated column in the query based on the last refresher date.
Is there a way to put all the classes one person is qualified for in one column and have the associated requalification date for each particular course in another column?
I think it would be less confusing if you had a table per subject and register the people's names under each one with the date passed.
It would also probably help to declutter the table by dropping unnecessary detail such as the exact date the exam was passed. You could store month and year, or maybe just the year? If the leeway is 2 years, that would probably make more sense, and it would also make the qualification calculation easier.
The query would work if you searched per subject, maybe? Or by who would qualify to do what subject this current year and then the next.
This is not really the kind of question you would ask on here, by the way, but I hope the answer helps.
When designing a database, any time you find yourself adding columns with names like Class1, Class2, Class3 you should immediately stop and think about whether it makes more sense to put those columns in a separate child table called Classes with a link (relation) to the parent. There are several reasons for this, including:
What happens when somebody takes a fourth course? Saying "that will never happen" ignores the fact that "never is a very long time" and none of us can predict the future.
When checking whether or not someone has taken a course you really need to check (Class1 IS NOT NULL) OR (Class2 IS NOT NULL) OR (Class3 IS NOT NULL), and that can get really tedious. It also means that if you do have to add Class4 then all of that SQL code has to be corrected.
Similarly, if you want to find someone who took "CPR" you'd have to look for people with (Class1 = 'CPR') OR (Class2 = 'CPR') OR (Class3 = 'CPR'). Yuck.
So, save yourself some trouble (a lot of trouble, really) and create a Classes table:
ID
ClassName
QualDate
(etc. )
...where ID is the ID number from the main table (what is called a "foreign key"). From your sample data, your Classes table would look something like this:
ID ClassName QualDate
1 Art 3/17/1989
1 Spanish 9/12/2010
2 Math 8/31/2012
3 Physics 2/6/2005
3 History 7/6/1996
...
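To show why the child table makes the queries easier, here is a small sketch that loads the sample rows above into in-memory SQLite from Python (a stand-in for whatever database you actually use). The Employees table, the who_took and qualified_before helpers, and the choice to store dates as ISO text (YYYY-MM-DD) so they compare correctly are all assumptions for this example; the real refresher rule depends on each course's refresh period, which is not modelled here.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employees (ID INTEGER PRIMARY KEY, Name TEXT);
    -- Classes.ID is the foreign key back to Employees.ID.
    CREATE TABLE Classes (
        ID        INTEGER REFERENCES Employees(ID),
        ClassName TEXT,
        QualDate  TEXT   -- ISO dates so string comparison matches date order
    );
    INSERT INTO Employees VALUES (1, 'Bob'), (2, 'Sally'), (3, 'George');
    INSERT INTO Classes VALUES
        (1, 'Art',     '1989-03-17'),
        (1, 'Spanish', '2010-09-12'),
        (2, 'Math',    '2012-08-31'),
        (3, 'Physics', '2005-02-06'),
        (3, 'History', '1996-07-06');
    """
)


def who_took(conn, class_name):
    """One simple condition replaces the Class1/Class2/Class3 OR chain."""
    rows = conn.execute(
        "SELECT e.Name FROM Classes c JOIN Employees e ON e.ID = c.ID "
        "WHERE c.ClassName = ? ORDER BY e.Name",
        (class_name,),
    )
    return [name for (name,) in rows]


def qualified_before(conn, cutoff):
    """List (name, class) pairs whose qualification date is older than cutoff."""
    rows = conn.execute(
        "SELECT e.Name, c.ClassName FROM Classes c "
        "JOIN Employees e ON e.ID = c.ID "
        "WHERE c.QualDate < ? ORDER BY c.QualDate",
        (cutoff.isoformat(),),
    )
    return rows.fetchall()


print(who_took(conn, "History"))
print(qualified_before(conn, date(2000, 1, 1)))
```

A fourth or fifth course for the same person is just another row in Classes, so neither query needs to change.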
I have 2 MySQL tables , each with address data of companies in it. One table is more recent, but has no telephone and no website data. Now I want to unite these tables into 1 recent and complete table.
But for some companies the order of the words is different, like this:
'Bakery Johnson' in table 1 and 'Johnson Bakery' in table 2.
Now I need to find a way to compare these values, as they're obviously the same company.
I think I will somehow have to split those names first, and then order the different parts alphabetically.
Any chance anybody has done something like this before, and willing to share some code or function?
UPDATE:
I found a function that sorts words inside a string. I can use this to detect name swaps as described above. It's quite SLOW though...
See : MySQL: how to sort the words in a string using a stored function?
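If the stored function is too slow, the same split-and-sort idea from the question can be done once in application code. The sketch below is only an illustration under assumptions: name_key and match_companies are hypothetical helper names, rows are assumed to arrive as (id, name) tuples from each table, and words are split on whitespace and compared case-insensitively.

```python
def name_key(name):
    """Normalise a company name: lower-case it and sort its words."""
    return " ".join(sorted(name.lower().split()))


def match_companies(table1, table2):
    """Pair up (id, name) rows from both tables that share a sorted-word key."""
    # Index table2 by key once, so each table1 row is a dictionary lookup.
    by_key = {}
    for row_id, name in table2:
        by_key.setdefault(name_key(name), []).append(row_id)

    return [
        (id1, id2)
        for id1, name in table1
        for id2 in by_key.get(name_key(name), [])
    ]


table1 = [(1, "Bakery Johnson"), (2, "Smith Garage")]
table2 = [(10, "Johnson Bakery"), (11, "Garage Jones")]
print(match_companies(table1, table2))
```

The key could also be stored in an extra indexed column in each table, so that later comparisons become a plain join on that column.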
If your table is MyISAM you can run this query:
SELECT *
FROM mytable
WHERE MATCH(name) AGAINST ('+bakery +johnson' IN BOOLEAN MODE)
This will find all records containing the words bakery and johnson (and probably some other words too).
Creating a FULLTEXT index on the table:
CREATE FULLTEXT INDEX
fx_mytable_name
ON mytable (name)
will speed up this query.
Stepping back a bit from your solution, you could take an approach similar to the way modern phones resolve duplicate contact names.
You present your users with the option when they find something suspicious:
Is this a duplicate? Use our [ Merge ] option
You are merging Bakery Johnson, please select the source/original item:
[ Johnson Bakery v ] (my amazing dropdown!)
Everything not already in Johnson Bakery gets ported to Bakery Johnson (orders, for example). You may also show an intermediate screen displaying what will be merged, or let the user pick; for example, they may want the address info from Johnson Bakery and the orders from both.
It is not self-correcting as you asked, but collaboration from the users may be more accurate than AI here. I also love low-tech solutions like this, so let us know what you ended up doing.