I have a PHP snippet that looks up a MySQL table and returns the top 6 closest matches, both exact and partial, against a given search string. The SQL statement is:
SELECT phone, name FROM contacts_table WHERE phone LIKE :ph LIMIT 6;
Using the above example, if :ph is assigned, say, %981%, it would return every entry that contains 981, e.g. 9819133333, +917981688888, 9999819999, etc. However, is it also possible to return all entries whose values are contained within the search string using the same query? Thus, if the search string is 12345, it would return all of the following:
123456789 (contains the search string)
88881234500 (contains the search string)
99912345 (contains the search string)
123 (is contained within the search string)
45 (is contained within the search string)
2345 (is contained within the search string)
You can do a lookup where the number is LIKE the column:
SELECT * FROM `test`
WHERE '123456' LIKE CONCAT('%',`stuff`,'%')
OR `stuff` LIKE '%123456%';
An index will never be used, though, because an index cannot be used with a preceding %.
An alternate way to do it would be to create a temporary table in memory and insert tokenized strings and use a JOIN on the temporary table. This will likely be much slower than my solution above, but it is a potential option.
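To make the two-way containment check concrete, here is a small runnable sketch. It uses SQLite (via Python's sqlite3) purely for portability, so MySQL's CONCAT('%', stuff, '%') becomes the '%' || stuff || '%' concatenation; the logic is otherwise the same as the query above, and the sample rows are invented.

```python
import sqlite3

# Sketch of the two-way containment lookup. SQLite is used here for
# portability: MySQL's CONCAT('%', stuff, '%') is written as
# '%' || stuff || '%' in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (stuff TEXT)")
conn.executemany("INSERT INTO test VALUES (?)",
                 [("123456789",), ("2345",), ("45",), ("999",)])

search = "12345"
rows = conn.execute(
    "SELECT stuff FROM test"
    " WHERE ? LIKE '%' || stuff || '%'"   # column value contained in the input
    " OR stuff LIKE '%' || ? || '%'",     # input contained in the column value
    (search, search),
).fetchall()
print(sorted(r[0] for r in rows))  # ['123456789', '2345', '45']
```

Note that '999' is excluded: it neither contains '12345' nor is contained in it.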
You can try the option of dynamic SQL:
SELECT
phone
FROM
contacts_table
WHERE
phone LIKE :ph or
phone = :val1 or
phone = :val2 or
phone = :val3 or
phone = :val4 or
phone = :val5 -- and so forth
LIMIT 6;
Where :ph will be your regular input (e.g. %981%) and each :valX is a token of that input.
It would be a good idea to do the tokenizing smartly (say, if the input is of length 5, go for a token size of 3 or 4). Try to limit the number of tokens to get better performance.
If you are using PHP, then do something like:
$where_args = [];
foreach (getPhoneNumberTokens($input) as $phone) {
    if ($phone != "") {
        // Note: in real code, bind these as prepared-statement parameters
        // rather than interpolating strings into the SQL.
        $where_args[] = "phone = '$phone'";
    }
}
$where_clause = implode(' OR ', $where_args);
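getPhoneNumberTokens is left undefined above; the sketch below shows one hypothetical way such a tokenizer could work (the function name and the default token size of 3 are assumptions, not anything defined in the original).

```python
# Hypothetical tokenizer: the name and the default token size of 3 are
# assumptions; adjust the size to the input length as suggested above.
def get_phone_number_tokens(number: str, size: int = 3) -> list:
    digits = "".join(ch for ch in number if ch.isdigit())
    # every contiguous substring of length `size`
    return [digits[i:i + size] for i in range(len(digits) - size + 1)]

print(get_phone_number_tokens("12345"))     # ['123', '234', '345']
print(get_phone_number_tokens("12345", 4))  # ['1234', '2345']
```

A longer token size means fewer OR terms and fewer false matches, at the cost of missing shorter contained numbers.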
You could use three tables. I don't actually know how performant it will be, though, since I didn't insert any data to test it out.
contact would contain every contact. token would contain every valid token: when you insert into contact, you would also tokenize the phone number and insert every single token into the token table. Tokens would be unique. Then you would have a relation table containing the many-to-many relationship between contact and token.
Then, you would get all contacts that have tokens matching the input phone number.
Table definitions:
CREATE TABLE contact (id int NOT NULL AUTO_INCREMENT, phone varchar(16), PRIMARY KEY (id), UNIQUE(phone));
CREATE TABLE token (id int NOT NULL AUTO_INCREMENT, token varchar(16), PRIMARY KEY (id), UNIQUE(token));
CREATE TABLE relation (token_id int NOT NULL, contact_id int NOT NULL);
The query:
There might be a better way to write this query (maybe by using a subquery rather than so many joins?), but this is what I came up with.
SELECT DISTINCT contact_list.phone FROM contact AS contact_input
JOIN relation AS relation_input
ON relation_input.contact_id = contact_input.id
JOIN token AS all_tokens
ON all_tokens.id = relation_input.token_id
JOIN relation AS relation_query
ON relation_query.token_id = all_tokens.id
JOIN contact AS contact_list
ON contact_list.id = relation_query.contact_id
WHERE contact_input.phone LIKE '123456789'
Query Plan:
However, this is with no data actually in the database, so the execution plan could change if data were present. It looks promising to me, because of the eq_ref and key usage.
I also made an SQL Fiddle demonstrating this.
Notes:
I didn't add any indexes. You could probably add some and make it more performant... but indexes might not actually help in this instance, since you aren't querying over any duplicated rows.
It might be possible to add optimizer hints or use LEFT/RIGHT joins to improve the execution plan. LEFT/RIGHT joins in the wrong place could break the query, though.
As it currently stands, you'd have to insert the queried number into the contact table, tokenize it, and insert into relation and token prior to querying. Instead, you could use a temporary table for the queried tokens, then do JOIN temp_tokens ON temp_tokens.token = all_tokens.token... Actually, that's probably what you should do, but I'm not going to rewrite this answer right now.
Using integer columns for phone and token would perform better, if that is a valid option for you.
An alternate way, which would be better than inserting all the tokens into the table just for a query, would be to use an IN () list, like:
SELECT DISTINCT contact.phone FROM token
JOIN relation
ON relation.token_id = token.id
JOIN contact
ON relation.contact_id = contact.id
WHERE token.token IN ('123','234','345','and so on')
And here is another, improved fiddle: http://sqlfiddle.com/#!9/48d0e/2
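As a sanity check of the whole token-table scheme, here is a self-contained sketch in Python with SQLite standing in for MySQL (AUTO_INCREMENT becomes SQLite's INTEGER PRIMARY KEY, and the tokenizer with its token size of 3 is an assumption): it builds the three tables, tokenizes a few invented contacts on insert, and runs the IN () query from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (id INTEGER PRIMARY KEY, phone TEXT UNIQUE);
    CREATE TABLE token (id INTEGER PRIMARY KEY, token TEXT UNIQUE);
    CREATE TABLE relation (token_id INT NOT NULL, contact_id INT NOT NULL);
""")

def tokens(phone, size=3):
    # all contiguous substrings of length `size` (token size is an assumption)
    return {phone[i:i + size] for i in range(len(phone) - size + 1)}

# tokenize each contact at insert time, as the answer describes
for phone in ("123456789", "555123", "999"):
    contact_id = conn.execute(
        "INSERT INTO contact (phone) VALUES (?)", (phone,)).lastrowid
    for t in tokens(phone):
        conn.execute("INSERT OR IGNORE INTO token (token) VALUES (?)", (t,))
        token_id = conn.execute(
            "SELECT id FROM token WHERE token = ?", (t,)).fetchone()[0]
        conn.execute("INSERT INTO relation VALUES (?, ?)", (token_id, contact_id))

# query: tokenize the searched number and match via IN ()
query_tokens = sorted(tokens("12345"))
placeholders = ",".join("?" * len(query_tokens))
rows = conn.execute(
    "SELECT DISTINCT contact.phone FROM token"
    " JOIN relation ON relation.token_id = token.id"
    " JOIN contact ON relation.contact_id = contact.id"
    " WHERE token.token IN (%s)" % placeholders,
    query_tokens).fetchall()
print(sorted(r[0] for r in rows))  # ['123456789', '555123']
```

'555123' matches because it shares the token '123' with the search string; '999' shares no token and is excluded.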
Related
I am trying to apply a join over two tables. The columns the join needs to be applied on do not have identical values, so I need to use concat; the problem is that it takes a very long time to run. Here is the example:
I have two tables:
Table: MasterEmployee
Fields: varchar(20) id, varchar(20) name, Int age, varchar(20) status
Table: Employee
Fields: varchar(20) id, varchar(20) designation, varchar(20) name, varchar(20) status
I have a constant prefix: 08080. There is also a postfix of constant length (1 char), but its value is random.
id in Employee = 08080 + {id in MasterEmployee} + {1 char random value}
Sample data:
MasterEmployee:
999, John, 24, approved
888, Leo, 26, pending
Employee:
080809991, developer, John, approved
080808885, Tester, Leo, approved
Here is the query that I am using:
select * from Employee e inner join MasterEmployee me
on e.id like concat('%',me.id,'%')
where e.status='approved' and me.status='approved';
Is there any better way to do the same? I need to run the same kind of query over a very large dataset.
It would certainly be better to use the static prefix 08080 so that the DBMS can use an index. It won't use an index with LIKE and a leading wildcard:
SELECT * FROM Employee e INNER JOIN MasterEmployee me
ON e.id LIKE CONCAT('08080', me.id, '_')
AND e.status = me.status
WHERE e.status = 'approved';
Note that I added status to the JOIN condition since you want Employee.status to match MasterEmployee.status.
Also, since you only have one postfix character you can use the single-character wildcard _ instead of %.
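Here is a runnable sketch of the rewritten join, using SQLite via Python purely for illustration (SQLite concatenates with || where MySQL uses CONCAT; everything else mirrors the query above, with the sample data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MasterEmployee (id TEXT, name TEXT, age INT, status TEXT);
    CREATE TABLE Employee (id TEXT, designation TEXT, name TEXT, status TEXT);
    INSERT INTO MasterEmployee VALUES ('999', 'John', 24, 'approved'),
                                      ('888', 'Leo', 26, 'pending');
    INSERT INTO Employee VALUES ('080809991', 'developer', 'John', 'approved'),
                                ('080808885', 'Tester', 'Leo', 'approved');
""")

# Anchoring the pattern with the constant prefix means no leading wildcard,
# so an index on Employee.id could be used; '_' matches the single random
# postfix character.
rows = conn.execute("""
    SELECT e.id, me.name FROM Employee e
    JOIN MasterEmployee me
      ON e.id LIKE '08080' || me.id || '_'
     AND e.status = me.status
    WHERE e.status = 'approved'
""").fetchall()
print(rows)  # [('080809991', 'John')]
```

Leo is excluded because his MasterEmployee status is 'pending', which fails the status part of the join condition.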
It's not concat that's the issue; scalar operations are extremely cheap. The problem is the way you are using LIKE. Anything of the form field LIKE '%...' automatically skips the index, resulting in a scan operation, for what I think are obvious reasons.
If you have to have this code, then that's that, there's nothing you can do and you have to be resigned to the large performance hit you'll take. If at all possible though, I'd rethink either your database scheme or the way you address it.
Edit: Rereading it, what you want is to concatenate the prefix so your query takes the form field LIKE '08080...'. This will make use of any indexes you might have.
I'm working on an application that previously had unique handles for users only--but now we want to have handles for events, groups, places... etc. Unique string identifiers for many different first class objects. I understand the thing to do is adopt something like the Party Model, where every entity has its own unique partyId and handle. That said, that means on pretty much every data-fetching query, we're adding a join to get that handle! Certainly for every user.
So just what is the performance loss here? For a table with just three or four columns, is a join like this negligible? Or is there a better way of going about this?
Example Table Structure:
Party
int id
int party_type_id
varchar(256) handle
Events
int id
int party_id
varchar(256) name
varchar(256) time
int place_id
Users
int id
int party_id
varchar(256) first_name
varchar(256) last_name
Places
int id
int party_id
varchar(256) name
-- EDIT --
I'm getting a bad rating on this question, and I'm not sure I understand why. In PLAIN TERMS, I'm asking,
If I have three first class objects that must all share a UNIQUE HANDLE property, unique across all three objects, does adding an additional table that must be joined with on almost any request incur a significant performance hit? Is there a better way of accomplishing this in a relational database like MySQL?
-- EDIT: Proposed Queries --
Getting one user
SELECT * FROM Users u LEFT JOIN Party p ON u.party_id = p.id WHERE p.handle='foo'
Searching users
SELECT * FROM Users u LEFT JOIN Party p ON u.party_id = p.id WHERE p.handle LIKE '%foo%'
Searching all parties... I guess I'm not sure how to do this in one query. Would you have to select all Parties matching the handle and then get the individual objects in separate queries? E.g.
db.makeQuery(SELECT * FROM Party p WHERE p.handle LIKE '%foo%')
.then(function (results) {
// iterate through results and assemble lists of matching parties by type, then get those objects in separate queries
})
This last example is what I'm most concerned about I think. Is this a reasonable design?
The queries you show should be blazingly fast on any modern implementation, and should scale to tens or hundreds of millions of records without too much trouble.
Relational Database Management Systems (of which MySQL is one) are designed explicitly for this scenario.
In fact, the slow part of your second query:
SELECT * FROM Users u LEFT JOIN Party p ON u.party_id = p.id WHERE p.handle LIKE '%foo%'
is going to be WHERE p.handle LIKE '%foo%' as this will not be able to use an index. Once you have a large table, this part of the query will be many times slower than the join.
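To illustrate the cheap part, here is a minimal sketch of the exact-handle lookup with SQLite standing in for MySQL (the sample rows are invented): an equality filter on a uniquely indexed handle plus a join on the primary key is exactly the shape databases are optimized for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Party (id INTEGER PRIMARY KEY, party_type_id INT,
                        handle TEXT UNIQUE);
    CREATE TABLE Users (id INTEGER PRIMARY KEY, party_id INT,
                        first_name TEXT, last_name TEXT);
    INSERT INTO Party VALUES (1, 1, 'foo'), (2, 1, 'bar');
    INSERT INTO Users VALUES (1, 1, 'Alice', 'Smith'), (2, 2, 'Bob', 'Jones');
""")

# Exact-handle lookup: the p.handle = ? filter can use the unique index on
# handle, and the join on the primary key adds almost no cost.
row = conn.execute("""
    SELECT u.first_name FROM Users u
    JOIN Party p ON u.party_id = p.id
    WHERE p.handle = ?
""", ("foo",)).fetchone()
print(row)  # ('Alice',)
```

It is the LIKE '%foo%' variant, not the join, that forces a full scan of Party.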
For storing friend relationships in social networks, is it better to have another table with columns relationship_id, user1_id, user2_id, time_created, pending, or should the confirmed friends' user_ids be serialized/imploded into a single long string and stored alongside the other user details like user_id, name, dateofbirth, address, limited to, say, only 5000 friends similar to Facebook?
Are there any better methods? The first method will create a huge table! The second one has one column with really long string...
On the profile page of each user, all his friends need to be retrieved from the database to show around 30 friends, similar to Facebook, so I think the first method of using a separate table will cause a huge number of database queries?
The most proper way to do this would be to have the table of Members (obviously), and a second table of Friend relationships.
You should never ever store foreign keys in a string like that. What's the point? You can't join on them, sort on them, group on them, or any other things that justify having a relational database in the first place.
If we assume that the Member table looks like this:
MemberID int Primary Key
Name varchar(100) Not null
--etc
Then your Friendship table should look like this:
Member1ID int Foreign Key -> Member.MemberID
Member2ID int Foreign Key -> Member.MemberID
Created datetime Not Null
--etc
Then, you can join the tables together to pull a list of friends
SELECT m.*
FROM Member m
RIGHT JOIN Friendship f ON f.Member2ID = m.MemberID
WHERE f.Member1ID = #MemberID
(This is specifically SQL Server syntax, but I think it's pretty close to MySQL. The #MemberID is a parameter)
This is always going to be faster than splitting a string and making 30 extra SQL queries to pull the relevant data.
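A minimal runnable sketch of that friends lookup, with SQLite via Python standing in for SQL Server/MySQL and invented sample members:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Member (MemberID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Friendship (Member1ID INT, Member2ID INT, Created TEXT);
    INSERT INTO Member VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
    INSERT INTO Friendship VALUES (1, 2, '2012-01-01'), (1, 3, '2012-01-02');
""")

# Pull all friends of member 1 in one join -- no string splitting,
# no per-friend follow-up queries.
rows = conn.execute("""
    SELECT m.Name FROM Member m
    JOIN Friendship f ON f.Member2ID = m.MemberID
    WHERE f.Member1ID = ?
""", (1,)).fetchall()
print(sorted(r[0] for r in rows))  # ['Ben', 'Cat']
```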
Separate table as in method 1.
Method 2 is bad because you would have to unserialize it each time and won't be able to do JOINs on it; plus UPDATEs will be a nightmare if a user changes his name, email, or other properties.
Sure, the table will be huge, but you can index it on user1_id, set the foreign key back to your user table, and could have static row sizes and maybe even limit the number of friends a single user can have. I think it won't be an issue with MySQL if you do it right, even if you hit a few million rows in your relationship table.
I have a table called "users" with 4 fields: ID, UNAME, NAME, SHOW_NAME.
I wish to put this data into one view so that if SHOW_NAME is not set, UNAME is selected as NAME; otherwise NAME is used.
My current query:
SELECT id AS id, uname AS name
FROM users
WHERE show_name != 1
UNION
SELECT id AS id, name AS name
FROM users
WHERE show_name = 1
This generally works, but it does seem to lose the primary key (NaviCat telling me "users_view does not have a primary key...") - which I think is bad.
Is there a better way?
That should be fine. I'm not sure why it's complaining about the loss of a primary key.
I will offer one piece of advice. When you know that there can be no duplicates in your union (such as the two parts being when x = 1 and when x != 1), you should use union all.
The union clause will attempt to remove duplicates which, in this case, is a waste of time.
If you want more targeted assistance, it's probably best if you post the details of the view and the underlying table. Views themselves don't tend to have primary keys or indexes, relying instead on the underlying tables.
So this may well be a problem with your "NaviCat" product (whatever that is) expecting to see a primary key (in other words, it's not built very well for views).
If I am understanding your question correctly, you should be able to just use a CASE expression like the one below for your logic:
SELECT
CASE WHEN SHOW_NAME = 1 THEN NAME ELSE UNAME END
FROM users
This can likely be better written as the following:
SELECT id AS id, IF(show_name = 1, name, uname) AS name
FROM users
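A quick check of that conditional projection, using SQLite via Python (SQLite has no IF() function, so the sketch uses the equivalent CASE expression; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, uname TEXT,
                        name TEXT, show_name INT);
    INSERT INTO users VALUES (1, 'jdoe', 'John Doe', 1),
                             (2, 'asmith', 'Anna Smith', 0);
""")

# CASE picks name when show_name is set, uname otherwise -- one pass,
# no UNION, no duplicate-elimination work.
rows = conn.execute("""
    SELECT id, CASE WHEN show_name = 1 THEN name ELSE uname END AS name
    FROM users ORDER BY id
""").fetchall()
print(rows)  # [(1, 'John Doe'), (2, 'asmith')]
```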
I have two tables:
tbl_lists and tbl_houses
Inside tbl_lists I have a field called HousesList - it contains the ID's for several houses in the following format:
1# 2# 4# 51# 3#
I need to be able to select the mysql fields from tbl_houses WHERE ID = any of those ID's in the list.
More specifically, I need to SELECT SUM(tbl_houses.HouseValue) WHERE tbl_houses.ID IN tbl_lists.HousesList -- and I want to do this select to return the SUM for several rows in tbl_lists.
Can anyone help?
I'm thinking of how I can do this in a SINGLE query since I don't want to do any mysql loops (within PHP).
If your schema is really fixed, I'd do two queries:
SELECT HousesList FROM tbl_lists WHERE ... (your conditions)
In PHP, split the lists and build one array $houseIDs of IDs. Then run a second query:
$sql = "SELECT SUM(HouseValue) FROM tbl_houses WHERE ID IN (" . join(", ", $houseIDs) . ")";
I still suggest changing the schema into something like this:
CREATE TABLE tbl_lists (listID int primary key, ...)
CREATE TABLE tbl_lists_houses (listID int, houseID int)
CREATE TABLE tbl_houses (houseID int primary key, ...)
Then the query becomes trivial:
SELECT SUM(h.HouseValue)
FROM tbl_houses AS h, tbl_lists AS l, tbl_lists_houses AS lh
WHERE l.listID = <your value>
  AND lh.listID = l.listID
  AND lh.houseID = h.houseID
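A runnable sketch of the normalized schema and the sum query, with SQLite standing in for MySQL and invented house values (the list 1 holds the houses 1, 2, 4, 51, 3 from the question's example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_lists (listID INTEGER PRIMARY KEY);
    CREATE TABLE tbl_lists_houses (listID INT, houseID INT);
    CREATE TABLE tbl_houses (houseID INTEGER PRIMARY KEY, HouseValue INT);
    INSERT INTO tbl_lists VALUES (1);
    INSERT INTO tbl_houses VALUES (1, 100), (2, 200), (4, 400),
                                  (51, 510), (3, 300);
    INSERT INTO tbl_lists_houses VALUES (1, 1), (1, 2), (1, 4),
                                        (1, 51), (1, 3);
""")

# One join over the link table replaces splitting '1# 2# 4# 51# 3#' in PHP.
total = conn.execute("""
    SELECT SUM(h.HouseValue)
    FROM tbl_houses h
    JOIN tbl_lists_houses lh ON lh.houseID = h.houseID
    WHERE lh.listID = ?
""", (1,)).fetchone()[0]
print(total)  # 1510
```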
Storing lists in a single field really prevents you from doing anything useful with them in the database, and you'll be going back and forth between PHP and the database for everything. Also (no offense intended), "my project is highly dynamic" might be a bad excuse for "I have no requirements or design yet".
Normalise: http://en.wikipedia.org/wiki/Database_normalization