MySQL Similar values in VARCHAR column - mysql

I have a database table for storing restaurant names and the city they are located in. Example:
name | city
eleven madison park | NYC
gramercy tavern | NYC
Lotus of Siam | TOK
The Modern | LA
ABC Kitchen | LA
Before each INSERT, I check the incoming entry: if there is no similar restaurant name in the same city, I go ahead and perform the insert.
But if the entry is, say, { name: "Eleven Madison", city: "NYC" }, I want to find similar entries in the "name" column for the same city (in this example, "eleven madison park" in "NYC"). In that case I still want to do the insert, and also store a new row in the 'conflicts' table with the IDs of both restaurants (the last insert ID and the ID of the similar row).
I used the Levenshtein distance algorithm, with the following SQL query:
SELECT id, levenshtein_ratio(name, 'Eleven Madison') AS levsh from restaurants
where
city_name = 'NYC'
order by levsh asc
limit 0, 1
Then I set a threshold of 8, and if levsh is less than 8, then I mark it as a conflict i.e. insert a new record in 'conflicts' table. This query was working fine until the table grew to 1000 records. Now this query takes 2 seconds to finish.
I understand that this is because I am calculating levenshtein_ratio for every restaurant in the city, when I only need to apply it to similar names, e.g. the ones containing 'Eleven' or 'Madison'. Even better would be if I could do something like:
WHERE city_name = 'NYC' AND SOUNDEX(any word in `name`) = SOUNDEX(any word in 'Eleven Madison')
Please help with suggestions on how to improve and optimize this query, and if possible any better approach to what I am doing.
Thanks

Related

Saving a COUNT column appearance while printing each row in the table

So there is a question I have not been able to find an answer to. Say you want to print each row in a table like the following:
ID | Name | Location
----+------+----------
1 | Adam | New York
2 | Eva | London
3 | Jon | New York
which would give the result
1 Adam New York
2 Eva London
3 Jon New York
Say that, at the same time, I would like to count how many people live in each city, and save those values for printing after I've iterated through the table; is that possible? For example, printing the following:
1 Adam New York
2 Eva London
3 Jon New York
Inhabitants in New York: 2
Inhabitants in London: 1
Is this possible or would you have to iterate through the entire table twice by grouping by Location the second time, and counting those?
EDIT:
To clarify, I know I can solve it by calling:
SELECT * FROM table;
SELECT CONCAT('Inhabitants in ', Location, ': ', COUNT(ID))
FROM table
GROUP BY Location;
But now I am iterating through it twice. Is it possible to do it in only one iteration?
Generally speaking, yes: displaying every row from the table and displaying aggregated data are two separate tasks, which should be handled by the application, not by the database.
You have the option to run two queries, a plain select * from T and select location, count(*) from T group by location, and display the results sequentially. You also have the option to run only the select * from T query and count the rows within your application, since you're displaying all rows anyway: use any dictionary-like structure your app language provides, with the location string as the key and a running total integer as the value.
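A minimal sketch of the second option (one query, counting in the application), written in Python purely for illustration. The rows list stands in for the result of select * from T, and the function name render_rows_with_counts is hypothetical:

```python
from collections import Counter

def render_rows_with_counts(rows):
    """Build the output lines for every row plus per-location totals in one pass."""
    counts = Counter()
    lines = []
    for row_id, name, location in rows:
        lines.append(f"{row_id} {name} {location}")
        counts[location] += 1  # running total for this location
    # Counter is a dict subclass, so keys come back in first-seen order.
    for location, total in counts.items():
        lines.append(f"Inhabitants in {location}: {total}")
    return lines

# Stand-in for the result set of: select * from T
rows = [(1, "Adam", "New York"), (2, "Eva", "London"), (3, "Jon", "New York")]
for line in render_rows_with_counts(rows):
    print(line)
```

The table is read only once; the totals cost one dictionary update per row.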
If you're keen on keeping it a single query, check out the WITH ROLLUP clause - https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.html. This would certainly be an unusual way of using it, but if you group by location, id and then tamper with the results a little, you can get what you want.
select if(id is null, CONCAT('Inhabitants in ', location, ': ', cnt), concat(id, ' ', name, ' ', location))
from
(
select id, location, name, count(*) cnt
from t
where location is not null
group by location, id with rollup
) q
where location is not null
order by id is null, id asc;
Though the performance could be questionable, compared to two plain queries; you should experiment or check with EXPLAIN.
Try the query below, which uses a subquery:
select concat('Inhabitants in ', location, ': ', total)
from
(select location, count(id) total
from tablename group by location)a

Country State City Table - ID or name

Hi, I am re-examining the design of my MySQL database for efficiency.
Currently I have 3 tables:
table country :
country id, country name
table state :
state id, state name, country id
table city :
city id, city name, state id
I am wondering whether it would be better to have ...
country name instead of country id in table state
state name instead of state id in table city
This is because everywhere in my code I have to run extra queries to convert country id, state id and city id from numbers to names (e.g. 1 to USA). Wouldn't it be better to just reference them by name (fewer queries)?
The whole world has roughly
260 country/regions
5000 states
many many cities
Design varies based on what you need.
1 - For the purpose of tiny storage:
country(id,country)
state(id,state,country_id)
city(id,city,state_id)
2 - For the purpose of quick query:
city(id,city,state,country)
3 - For the purpose of middle way:
country(code,country)
state(code,country) -- you might merge country and state into one table based on code
city(state_code,city)
You might be interested to have a look at the iso codes:
https://en.wikipedia.org/wiki/ISO_3166-1 eg US
https://en.wikipedia.org/wiki/ISO_3166-2 eg US-NY
Note that the ISO state code contains the ISO country code.
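Because the country code is embedded in the state code, the country can be derived from it rather than stored separately. A small sketch in Python, for illustration only; the helper name split_subdivision_code is hypothetical, and it only checks the general "XX-YYY" shape, not whether the code actually exists in the standard:

```python
def split_subdivision_code(code):
    """Split an ISO 3166-2 style code such as 'US-NY' into (country, subdivision)."""
    country, sep, subdivision = code.partition("-")
    if not sep or not country or not subdivision:
        raise ValueError(f"not an ISO 3166-2 style code: {code!r}")
    return country, subdivision

country, state = split_subdivision_code("US-NY")
print(country, state)
```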
UPDATE as per more info from you:
If you are designing property websites for the USA:
1 - You do not need a country table; most likely all properties are within the USA.
2 - There are fewer than 60 states within the USA, so you can use an enum to store states. Nearly everyone will understand that NY = New York, so you do not need a state table.
3 - So you need a city table, as you will use city_id for more than 10,000,000 property records.
usa_cities(
id int PK
state enum('NY', 'LA', ...)
city varchar
...
)
properties(
id int PK
city_id int,
....
)
As the properties table is usually very big, you might skip the separate lookup table and denormalize the design, so that queries need fewer joins:
properties (
id int PK,
state enum('NY', 'LA',...)
city varchar
...
)
You might be able to use an enum for city as well. I'm not sure how many cities there are in the USA, but at first thought I would not encourage it.
If you want fewer queries, there is a technique called denormalization.
You can weigh what is most important and pick what fits your needs.
For more on what denormalization means, see Techopedia and Wikipedia.

MySQL Query Is Too Slow Or Times Out

I am having a complete nightmare with my application. I haven't worked with datasets this big before, and my query is either timing out or taking ages to return something. I've got a feeling that my approach is just all wrong.
I have a payments table with a postcode field (among others). It has 40,000 rows roughly (one for each transaction). It has an auto-inc PRIMARY key and an INDEX on the postcode foreign-key.
I also have a postcodes lookup table with 2,500,000 rows. The table is structured like so;
postcode | country | county | localauthority | gor
AB1 1AA | S99999 | E2304 | X | 45
AB1 1AB | S99999 | E2304 | X | 45
The postcode field is PRIMARY and I have INDEXes on all the other fields.
Each field (apart from postcode) has a lookup table. In the case of country it's something like;
code | description
S99999 | Wales
The point of the application is that the user can select areas of interest (such as "England", "London", "South West England" etc) and be shown payments results for those areas.
To do this, when a user selects the areas they are interested in, I create a temp table, with one row, listing ALL postcodes for the areas they selected. Then I LEFT JOIN it on to my payments table.
The problem is that if the user selects a big region (like "England") then I have to create a massive temp table (of about 1 million rows) and then compare it to the 40,000 payments to decide which to display.
UPDATE
Here is my code;
$generated_temp_table = "selected_postcodes_".$timestamp_string."_".$userid;
$temp_table_data = $temp_table
->setTempTable($generated_temp_table)
->newQuery()
->with(['payment' => function ($query) use ($column_values) {
$query->select($column_values);
}])
;
Here is my attempt to print out the raw query;
$sql = str_replace(['%', '?'], ['%%', "'%s'"], $temp_table_data->toSql());
$fullSql = vsprintf($sql, $temp_table_data->getBindings());
print_r($fullSql);
This is the result;
select * from `selected_postcodes_1434967426_1`
This doesn't look like the right query, I can't work out what Eloquent is doing here. I don't know why the full query is not printing out.
If you have too many results, such as 1 million rows, use LIMIT with OFFSET to fetch them a page at a time; that will cut the time each query takes. Also make sure your select query retrieves only the required fields (avoid select * from XXXX; use select A, B from XXX).

Retrieving the highest-frequency value from a field in one table and updating it into another

I have 2 MySQL tables with the following structure:
**tblLocations**
ID [primary key]
CITY [non-unique varchar]
NAME [non-unique varchar]
----------------------------------
**tblPopularNames**
ID [primary key]
CITY [unique varchar]
POPULARNAME [non-unique varchar]
I am receiving input from users through a web form, and PHP code then inserts the data into tblLocations. This part is simple. Now, every time an insert is made to tblLocations, I need to trigger the following actions:
See if tblPopularNames contains an entry for the inserted CITY value
If the entry exists, update the corresponding POPULARNAME field with the highest-frequency NAME value against the CITY field in tblLocations.
If the entry doesn't exist, make one with the values just entered.
Can this be done without using any query nesting? What would be the least expensive way to perform this action in terms of memory usage?
I can see a related post here, but the answers there only provide the maximum count of the values being sought, which isn't what I'm seeking to do. I need the least contrived way of accomplishing the two tasks. Also, I don't know exactly how the query would handle ties, i.e. two names enjoying the same frequency for the city entered. And I honestly don't mind the query returning either value in such a scenario as long as it doesn't throw an error.
I hope I have explained it as clearly as needed but should you have any doubts, feel free to comment.
P.S. Not sure if the question belongs here or over at DBA. I chose to go with SO because I saw other questions pertaining to queries on this site (e.g., this one). If one of the moderators feel DBA would be a better fit, request them to please move it as they deem appropriate.
The first table accepts two values from users: their name and the city
they live in. The fields affected in that table are CITY and NAME.
Then each time a new entry is made to this table, another is made to
tblPopularNames with that city and the name that occurs most
frequently against that city in tblLocations. For example, if John is
the most popular name in NY, tblPopularNames gets updated with NY,
John.
Okay, so let's break this up into a trigger. "Each time a new entry is made" translates to AFTER INSERT ON tblLocations FOR EACH ROW; "the name that occurs most frequently against that city in tblLocations" means we run a SELECT NEW.insertedCity, old.insertedName FROM tblLocations AS old WHERE insertedCity = NEW.insertedCity GROUP BY insertedName ORDER BY COUNT(*) DESC LIMIT 1; and we may want to add something to that ORDER BY so that, when several names have equal frequency, one of them is not picked at random.
There's an additional requirement, that if the city already exists in tblPopularNames the entry be updated. We need a UNIQUE KEY on tblPopularNames.popularCity for that; it will allow us to use ON DUPLICATE KEY UPDATE.
And finally:
DELIMITER //
CREATE TRIGGER setPopularName
AFTER INSERT ON tblLocations
FOR EACH ROW BEGIN
INSERT INTO tblPopularNames
SELECT NEW.insertedCity, insertedName
FROM tblLocations
WHERE insertedCity = NEW.insertedCity
GROUP BY insertedName
ORDER BY COUNT(*) DESC, insertedName
LIMIT 1
ON DUPLICATE KEY
UPDATE popularName = VALUES(popularName)
;
END;//
DELIMITER ;
Test
mysql> INSERT INTO tblLocations VALUES ('Paris', 'Jean'), ('Paris', 'Pierre'), ('Paris', 'Jacques'), ('Paris', 'Jean'), ('Paris', 'Etienne');
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM tblPopularNames;
+-------------+-------------+
| popularCity | popularName |
+-------------+-------------+
| Paris | Jean |
+-------------+-------------+
1 row in set (0.00 sec)
mysql> INSERT INTO tblLocations VALUES ('Paris', 'Jacques'), ('Paris', 'Jacques'), ('Paris', 'Etienne');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM tblPopularNames;
+-------------+-------------+
| popularCity | popularName |
+-------------+-------------+
| Paris | Jacques |
+-------------+-------------+
1 row in set (0.00 sec)
Triggers vs. code
There's no denying @Phil_1984's answer has lots and lots and lots of merit. Triggers have their uses but they aren't a silver bullet.
Moreover, at this stage it's possible that the design is still too early in its lifecycle for it to be worth the hassle of outsourcing the hard work to a trigger. What if, for example, you decide to go with the "counter" solution hinted at above? Or what if you decide to complicate the choice of the popularName?
There's little doubt that maintaining (which includes thoroughly field-testing) a trigger is much more expensive than the same thing done in code.
So what I'd really do is first to design a function or method with the purpose of receiving the insertedValues and doing some magic.
Then I'd emulate the trigger code with several queries in PHP, wrapped in a transaction. They would be the same queries that appear in the trigger, above.
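A rough sketch of that emulation, in Python with an in-memory SQLite database instead of PHP and MySQL, so that it runs standalone. The function name insert_location is hypothetical, the column names are taken from the trigger above, and SQLite's INSERT OR REPLACE stands in for MySQL's ON DUPLICATE KEY UPDATE:

```python
import sqlite3

def insert_location(conn, city, name):
    """Insert one row and refresh that city's most popular name atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO tblLocations (insertedCity, insertedName) VALUES (?, ?)",
            (city, name),
        )
        # Same SELECT as in the trigger: most frequent name, ties broken by name.
        popular = conn.execute(
            "SELECT insertedName FROM tblLocations WHERE insertedCity = ? "
            "GROUP BY insertedName ORDER BY COUNT(*) DESC, insertedName LIMIT 1",
            (city,),
        ).fetchone()[0]
        # Assumption: INSERT OR REPLACE plays the role of ON DUPLICATE KEY UPDATE.
        conn.execute(
            "INSERT OR REPLACE INTO tblPopularNames (popularCity, popularName) "
            "VALUES (?, ?)",
            (city, popular),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblLocations (insertedCity TEXT, insertedName TEXT)")
conn.execute(
    "CREATE TABLE tblPopularNames (popularCity TEXT PRIMARY KEY, popularName TEXT)"
)
for name in ["Jean", "Pierre", "Jacques", "Jean", "Etienne"]:
    insert_location(conn, "Paris", name)
print(conn.execute("SELECT * FROM tblPopularNames").fetchall())
```

Keeping all three statements inside one function means that switching to the trigger later only requires deleting the last two.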
Then I'd go on with the rest of the work, safe in the knowledge that this solution is working, if perhaps amenable to performance improvement.
If, much later, the design is convincing and gets committed, it will be very easy to modify the function to only run one INSERT query and avail itself of a trigger - that one, or the slightly modified one that has evolved in the meantime.
If the slightly modified one has been taken over by creeping featurism and is not easily backported to a trigger, you need do nothing, and have lost nothing. Otherwise you've lost the time for the initial implementation (very little) and are now ready to profit.
So my answer would be: both :-)
Slightly different use case (per comments)
The thing is, the first query being performed by PHP is an
indefinitely large one with potentially hundreds of entries being
inserted at once. And I do need to update the second table every time
a new entry is made to the first because by its very nature, the most
popular name for a city can potentially change with every new entry,
right? That's why I was considering a trigger since otherwise PHP
would have to fire hundreds of queries simultaneously. What do you
think?
The thing is: what should happen between the first and the last INSERT of that large batch?
Are you using the popular name in that cycle?
If yes, then you have little choice: you need to examine the popularity table after each insert (not really; there's a workaround, if you're interested...).
If no, then you can do all the calculations at the end.
I.e., you have a long list of
NY John
Berlin Gottfried
Roma Mario
Paris Jean
Berlin Lukas
NY Peter
Berlin Eckhart
You can retrieve all the popular names (or all the popular names with cities in the list you're inserting) together with their frequency, and have them in an array of arrays:
[
[ NY, John, 115 ],
[ NY, Alfred, 112 ],
...
]
Then from your list you "distill" the frequencies:
NY John 1
NY Peter 1
Berlin Gottfried 1
Roma Mario 1
Paris Jean 1
Berlin Lukas 1
Berlin Eckhart 1
and you add (you're still in PHP) the frequencies to the ones you retrieved. In this case for example NY,John would go from 115 to 116.
You can do both at the same time, by first getting the "distilled" frequency of the new inserts and then running a query:
while ($tuple = $exec->fetch()) {
    // $tuple is [ NY, John, 115 ]
    // Is there a [ NY, John ] in our distilled array?
    $found = array_filter($distilled, function($item) use ($tuple) {
        return (($item[0] === $tuple[0]) && ($item[1] === $tuple[1]));
    });
    if (empty($found)) {
        // This is probably an error: the outer search returned Rome,
        // yet there is no Rome in the distilled values. So how comes
        // we included Rome in the outer search?
        continue;
        // But if the outer search had no WHERE, it's OK; just continue
    }
    $datum = array_pop($found);
    // if (!empty($found)) { another error. Should be only one. }
    // So we have New York with popular name John and frequency 115
    $tuple[2] += $datum[2];
    $newFrequency[] = $tuple;
}
You can then sort the array by city and frequency descending using e.g. uasort.
uasort($newFrequency, function($f1, $f2) {
if ($f1[0] < $f2[0]) return -1;
if ($f1[0] > $f2[0]) return 1;
return $f2[2] - $f1[2];
});
Then you loop through the array
$popularNames = array();
$oldCity = null;
foreach ($newFrequency as $row) {
    // $row = [ 'New York', 'John', 115 ]
    if ($oldCity != $row[0]) {
        // Given the sorting, this is the new maximum.
        $popularNames[] = array( $row[0], $row[1] );
        $oldCity = $row[0];
    }
}
// Now popularNames[] holds the new cities with the new popular name.
// We can build a single query such as
INSERT INTO tblPopularNames VALUES
( city1, name1 ),
( city2, name2 ),
...
( city3, name3 )
ON DUPLICATE KEY
UPDATE popularName = VALUES(popularName);
This would insert those cities for which there's no entry, or update the popularNames for those cities where there is.
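The whole batch computation can also be condensed; below is a sketch in Python rather than PHP, for illustration. The function name popular_names_after_batch is hypothetical, and the Berlin and Roma frequencies in the example are made up. Unlike the PHP loop above, this version also counts names that appear in the batch but not yet in the stored frequencies, which is an assumption about the desired behavior:

```python
from collections import Counter

def popular_names_after_batch(existing, batch):
    """Return {city: most frequent name} for every city touched by the batch.

    existing: iterable of (city, name, frequency) rows already stored
    batch:    iterable of (city, name) rows about to be inserted
    """
    distilled = Counter(batch)           # (city, name) -> count within the batch
    totals = Counter()
    for city, name, freq in existing:
        totals[(city, name)] += freq
    totals.update(distilled)             # add batch counts to the stored counts
    cities = {city for city, _ in distilled}
    best = {}
    for (city, name), freq in totals.items():
        if city not in cities:
            continue                     # city not affected by this batch
        # Higher frequency wins; ties go to the alphabetically first name.
        if city not in best or (-freq, name) < (-best[city][1], best[city][0]):
            best[city] = (name, freq)
    return {city: name for city, (name, _) in best.items()}

existing = [("NY", "John", 115), ("NY", "Alfred", 112),
            ("Berlin", "Lukas", 3), ("Berlin", "Anna", 3)]
batch = [("NY", "John"), ("Berlin", "Gottfried"), ("Berlin", "Lukas"), ("NY", "Peter")]
print(popular_names_after_batch(existing, batch))
```

The returned dictionary holds exactly the (city, name) pairs that would go into the single INSERT ... ON DUPLICATE KEY UPDATE statement.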
I believe this is a question of application logic versus database logic, i.e. code vs. triggers.
Since what you are really doing is a form of indexing for use specifically in your application, I would recommend that this logic lies somewhere at your application level (e.g. PHP). It should be:
simple (I would just do 2 queries: a select count and an update)
easy to maintain (use good database interface abstraction, e.g. 1 function)
only run when needed (using logic in that function)
How you approach that solution is the tricky part. E.g. You might think that it's best to just do the calculation on every insert, but it would be inefficient to do this on every insert if you are doing a batch of inserts for the same city.
I have had a very bad experience of using triggers for everything and having the database get slow. Granted, it was in PostgreSQL (15 years ago, before MySQL triggers existed) and on a rather large database of about 500 tables. It's good because it catches 100% of inserts, but sometimes that's not what you want to do. You lose an element of control from the application's point of view by using triggers. You can end up slowing down your whole database with too many of these triggers. So that's an anti-triggers perspective. It's that loss of control which is a deal breaker for me.

Querying normalized database, 3 tables

I have three tables in a MySQL database:
stores (PK stores_id)
states (PK states_id)
join_stores_states (PK join_id, FK stores_id, FK states_id)
The "stores" table has a single row for every business. The join_stores_states table links an individual business to each state it's in. So, some businesses have stores in 3 states, so they have 3 rows in join_stores_states, and others have stores in 1 state, so they have just 1 row in join_stores_states.
I'm trying to figure out how to write a query that will list each business in one row, but still show all the states it's in.
Here's what I have so far, which is obviously giving me every row out of join_stores_states:
SELECT states.*, stores.*, join_stores_states.*
FROM join_stores_states
JOIN stores
ON join_stores_states.stores_id=stores.stores_id
JOIN states
ON join_stores_states.states_id=states.states_id
Loosely, this is what it's giving me:
store 1 | alabama
store 1 | florida
store 1 | kansas
store 2 | montana
store 3 | georgia
store 3 | vermont
This is more of what I want to see:
store 1 | alabama, florida, kansas
store 2 | montana
store 3 | georgia, vermont
Suggestions as to which query methods to try would be just as appreciated as a working query.
If you need the list of states as a string, you can use MySQL's GROUP_CONCAT function (or an equivalent, if you are using another SQL dialect), as in the example below. If you want to do any kind of further processing of the states separately, I would suggest running the query as you did and then collecting the result set into a more complex structure in the client by iterating over the resulting rows (a hashtable of arrays is the simplest option, but more complex OO designs are certainly possible).
SELECT stores.name,
GROUP_CONCAT(states.name ORDER BY states.name ASC SEPARATOR ', ') AS state_names
FROM join_stores_states
JOIN stores
ON join_stores_states.stores_id=stores.stores_id
JOIN states
ON join_stores_states.states_id=states.states_id
GROUP BY stores.name
Also, even if you only need the concatenated string and not a data structure, some databases might not have an aggregate concatenation function, in which case you will have to do the client processing anyway. In pseudocode, since you did not specify a language either:
perform query
stores = empty hash
for each row from query results:
    if the store name isn't in the hash:
        put an empty store object into the hash under the name
    get the store object from the hash by name
    add the state name to the store object's states array
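For a concrete version of the pseudocode, here is a small Python sketch. The rows list stands in for the result of the original ungrouped query, reduced to (store name, state name) pairs, and the function name group_states_by_store is hypothetical:

```python
from collections import defaultdict

def group_states_by_store(rows):
    """Collapse (store_name, state_name) rows into {store_name: [state_names]}."""
    stores = defaultdict(list)  # a missing store starts with an empty list
    for store_name, state_name in rows:
        stores[store_name].append(state_name)
    return dict(stores)

# Stand-in for the rows returned by the three-table join
rows = [
    ("store 1", "alabama"), ("store 1", "florida"), ("store 1", "kansas"),
    ("store 2", "montana"),
    ("store 3", "georgia"), ("store 3", "vermont"),
]
for store, states in group_states_by_store(rows).items():
    print(f"{store} | {', '.join(states)}")
```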