I'm merging two databases for a client. In an ideal world, I'd simply use the unique ID to join them, but in this case the newer table has different IDs.
So I have to join the tables on another column. For this I need a complex LIKE statement to join on the Title field. But... they have changed the titles of some rows, which breaks the join on those rows.
How can I write a complex LIKE statement to connect slightly different titles?
For instance:
Table 1 Title = Freezer/Pantry Storage Basket
Table 2 Title = Deep Freezer/Pantry Storage Basket
or
Table 1 Title = Buddeez Bread Buddy
Table 2 Title = Buddeez Bread Buddy Bread Dispenser
Again, there are hundreds of rows with titles only slightly different, but inconsistently different.
Thanks!
UPDATE:
How far can MySQL Full-Text Search get me? Looks similar to Shark's suggestion in SQL Server.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
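For reference, this is the rough shape of the join I have in mind, on the assumption (true in the examples above) that the Table 2 title contains the Table 1 title somewhere inside it. Table and column names are simplified:
SELECT t1.id AS old_id, t2.id AS new_id, t1.Title, t2.Title
FROM table1 t1
JOIN table2 t2 ON t2.Title LIKE CONCAT('%', t1.Title, '%');
For full-text, MySQL's MATCH ... AGAINST needs a constant search string, so it couldn't be used directly as a join condition; with a FULLTEXT index on table2.Title it would be more of a per-title lookup driven from the application, e.g.:
SELECT id, Title
FROM table2
WHERE MATCH(Title) AGAINST ('Freezer/Pantry Storage Basket');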
Do it in stages. First get all the exact matches out of the way so that you are only working with the exceptions. Your mind is far better than the computer at spotting things that are 'like' each other, so scan over the remaining data, look for similarities, and write SQL statements that cover the specific patterns you see until you have narrowed it down as much as possible.
You will have better results if you 'help' the computer in stages like this than if you try to develop a big routine to cover all cases at once.
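A very rough sketch of what I mean by stages (old_id here is a placeholder column on the new table for recording which old row it matched):
-- stage 1: exact title matches
UPDATE table2 t2
JOIN table1 t1 ON t1.Title = t2.Title
SET t2.old_id = t1.id;
-- stage 2: the new title merely extends the old one, e.g. 'Buddeez Bread Buddy' -> 'Buddeez Bread Buddy Bread Dispenser'
UPDATE table2 t2
JOIN table1 t1 ON t2.Title LIKE CONCAT(t1.Title, '%')
SET t2.old_id = t1.id
WHERE t2.old_id IS NULL;
-- stage 3, 4, ...: keep adding narrower patterns (suffix matches, specific word swaps) for whatever is still unmatched.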
Of course there are APIs out there that do this already (such as the one Google uses to guess your search phrase before you finish typing it), but whether any are freely available I don't know. It certainly wouldn't hurt to search for one, though.
It's fairly difficult to describe 'only slightly different' in a way that a computer would understand. I suggest choosing a set of criteria that are either the most common or the most important and working around those. I am not sure what those criteria should be, though, since I have only a vague idea of what the data set looks like.
Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable attributes like length, actors, directors, writers, genres, etc. The data I used for this was acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that, for each movie, all the above-mentioned parameters are stored as comma-separated strings. So far I have therefore used a query that covers all of these with LIKE statements. The query can become quite large, as I pretty much query for every parameter in the table: sometimes 5 different LIKE statements for different actors, and the same again for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I have started my first job, and to teach myself in my free time I want to work on my own website. Because I have no real concept of what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now, to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else, which essentially means that having the same name is treated as being the same actor.
Now here comes my actual question:
Assuming I want my SELECT queries to run faster, would it make sense to run a script that splits the comma-separated strings into extra tables (these are n-to-m relations, see the attempt below) and then JOIN all these tables (there will be 8 or more of them), or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster SELECT queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind that I would still have to create junction tables for the relation between movies and each of these tables. After doing that, I could remove the columns from the movie table, but I would end up having to join a lot of tables in EACH query. The only real advantage I can see here is that it would be easier to create indexes on the individual tables, rather than one (or a few) covering the one big movie table.
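Roughly, the split I have in mind looks like this (names are only placeholders; there would be a similar pair of tables for directors, writers, genres, and so on):
CREATE TABLE actor (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_actor_name (name)
);
CREATE TABLE movie_actor (
  movie_id INT UNSIGNED NOT NULL,
  actor_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (movie_id, actor_id),
  KEY idx_actor (actor_id)
);
-- finding movies that share an actor with movie 42 then becomes a join instead of a LIKE:
SELECT ma2.movie_id, COUNT(*) AS shared_actors
FROM movie_actor ma1
JOIN movie_actor ma2 ON ma2.actor_id = ma1.actor_id AND ma2.movie_id <> ma1.movie_id
WHERE ma1.movie_id = 42
GROUP BY ma2.movie_id
ORDER BY shared_actors DESC;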
I hope all of this makes sense to you. I appreciate any answer, short or long; like I said, this is mostly for self-study and as such I don't have/need a real business model.
I don't understand what you currently have. It seems that you have only shown the sizes of the tables, not their internal structure. You need to separate the data into separate tables using normalization rules and then add the correct indexes; indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE on your queries? Please post the query itself; I cannot guess it from the result alone. There are also a lot of optimization videos on YouTube.
I'm designing a central search function in a PHP web application. It is focused around a single table, and each result is exactly one unique ID out of that table. Unfortunately there are a few dozen tables related to this central one, most of them being 1:n relations. Even more unfortunately, I need to join quite a few of them: a couple to gather the necessary data for displaying the results, and a couple to filter according to the search criteria.
I have been mainly relying on a single query to do this. It has a lot of joins in it and, as there should be exactly one result displayed per ID, it also uses rather complex subqueries and GROUP BY clauses. It also gets sorted according to a user-selected sort method, and there's pagination in play as well, done with LIMIT.
Anyway, this query has become insanely complex, and while I build it up nicely in PHP, it is a PITA to change or debug. I have thus been considering another approach, and I'm wondering just how bad (or not?) this is for performance before I actually develop it. The idea is as follows:
Run one less complex query that only filters according to the search parameters. This means fewer joins, and I can completely ignore GROUP BY and similar constructs; I will just "SELECT DISTINCT item_id" on this and get a list of IDs.
Then run another query, this time only joining in the tables I need to display the results (only about 1/4 of the current total joins), using ... WHERE item_id IN (...) and passing the list of "valid" IDs gathered in the first query.
Note: Obviously the IN () could actually contain the first query in full, instead of relying on PHP to build up a comma-separated list.
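Sketched out with placeholder table and column names, the two steps would look something like this:
-- step 1: filter only, no GROUP BY, no display joins
SELECT DISTINCT i.item_id
FROM items i
JOIN filter_data f ON f.item_id = i.item_id
WHERE f.some_criterion = 'foo';
-- step 2: display joins only, fed with the IDs from step 1
SELECT i.item_id, d.col_a, d.col_b
FROM items i
JOIN display_data d ON d.item_id = i.item_id
WHERE i.item_id IN (3, 17, 42)
ORDER BY i.sort_column
LIMIT 0, 25;
(Or, as per the note above, the IN list gets replaced with the whole step 1 query as a subquery.)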
How bad will the IN be performance-wise? And how much will it hurt me that I cannot LIMIT the first query at all? I'm also wondering whether this is a common approach or whether there are more intelligent ways to do it. I'd be thankful for any input on this :)
Note to clarify: we're not talking about a few simple joins here. There is even (simple) hierarchical data in there, where I need to compare the search parameter not only against the item's own data but also against its parent's data. In no other project I've ever worked on have I encountered a query close to this complexity. And before you even say it, yes, the data itself has this inherent complexity, which is why the data model is complex too.
My experience has been that the WHERE IN (...) approach tends to be slower. I'd go with the joins, but make sure you're joining on the smallest data set possible first: reduce the simple main table down, then join onto that. Save your most complex joins for the end, to minimize the rows that have to be searched. Try to join on indexed columns wherever possible to improve speed, and ditch wildcards in JOINs where possible.
But I agree with Andomar: if you have the time, build both and measure.
Here is a concrete example:
WordPress stores user information (meta) in a table called wp_usermeta, where you get a meta_key field (e.g. first_name) and a meta_value field (e.g. John).
However, after only 50 or so users, the table already packs about 1,219 records.
So, my question is: at a large scale, performance-wise, would it be better to have a table with each meta as a field, or a table like WordPress uses, with each meta as a row?
Indexes are properly set in both cases. There is little to no need to add new metas. Keep in mind that a table like wp_usermeta must use a text/longtext field type (a large footprint) in order to accommodate any type of data that could be entered.
My assumptions are that the WordPress approach is only good when you don't know what the user might need. Otherwise:
Retrieving all the meta requires more I/O because the fields aren't stored in a single row, and the meta_value field type isn't optimised for the data it holds.
You can't really have an index on the meta_value field without suffering major drawbacks (indexing a longtext? Unless it's a partial/prefix index... but then, how long?).
Soon your database is cluttered with many rows, slowing down your searches even for the most precise meta.
It isn't developer-friendly: you can't really write a single join query to get everything you need and display it properly.
I may be missing a point though. I'm not a database engineer, and I know only the basics of SQL.
You're talking about Entity-Attribute-Value.
- Entity = User, in your Wordpress Example
- Attribute = 'First Name', 'Last Name', etc
- Value = 'John', 'Smith', etc
Such a schema is very good at allowing a dynamic number of Attributes for any given Entity. You don't need to change the schema to add an Attribute. Depending on the queries, the new attributes can often be used without changing any SQL at all.
It's also perfectly fast enough at retrieving those attribute values, provided that you know the Entity and the Attribute that you're looking for. It's just a big fancy Key-Value-Pair type of set-up.
It is, however, not so good when you need to search the records based on the Value contents, such as "get me all users called 'John Smith'". That's trivial to ask in English, and trivial to code against a 'normal' table: first_name = 'John' AND last_name = 'Smith'. But it is non-trivial to write in SQL against EAV, and the relative performance is awful (get all the Johns, then all the Smiths, then intersect them to find the Entities that match both).
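To illustrate with the wp_usermeta example from the question (assuming the standard wp_users/wp_usermeta layout):
-- 'normal' table: trivial
SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Smith';
-- EAV: one extra join per attribute you want to match on, each against a longtext meta_value
SELECT u.*
FROM wp_users u
JOIN wp_usermeta fn ON fn.user_id = u.ID AND fn.meta_key = 'first_name' AND fn.meta_value = 'John'
JOIN wp_usermeta ln ON ln.user_id = u.ID AND ln.meta_key = 'last_name' AND ln.meta_value = 'Smith';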
There is a lot said about EAV on-line, so I won't go in to massive detail here. But a general rule of thumb is: If you can avoid it, you probably should.
Depends on the number of names packed into wp_usermeta on average.
Text field searches are notoriously slow. Indexes are generally faster.
But some data warehouses index the crap out of every field, and WordPress might be doing the same thing.
I would vote for meta as a field, not a row.
Good SQL, good night.
Mike
Examples from two major pieces of software in the GPL arena illustrate how big the difference between the two designs is:
WordPress & osCommerce
Both have their flaws and strengths, both are massively dominant in their respective areas, and a lot of things are done with them. But one of the biggest and most fundamental differences between them is their approach to database table design. Of course, when comparing them, their code architecture also plays a role in how fast they search, but each is hampered by its own drawbacks and boosted by its own advantages, so the comparison is more or less accurate for production environments.
WordPress uses EAV. The general data (posts, with different post types) is stored as the main entity, and everything else is stored in post meta tables. Some fundamental data is kept in the main table, like revisions, post type, etc., but almost all the rest is stored in metas.
This is VERY good for adding and modifying data, and therefore functionality.
But try a search with a complex SQL join that needs to pick up 3-4 different values from the meta table and return the resulting set. It's an iron dog: the search comes out VERY slow, depending on the data you are searching for.
This is one of the reasons why you don't see many WordPress plugins that need to host complex data, and the ones that actually do tend to create their own tables.
osCommerce, on the other hand, keeps almost all product-related data in the products table, and the majority of osCommerce mods modify this table and add their own fields. There is a products_attribute table, but it is also rather flat and doesn't have any meta design; it's just linked to products via product IDs.
As a result, despite being aged spaghetti code from a very long time ago, osCommerce comes up with stunningly fast search results even when you search for complicated, combined product criteria. Actually, most of osCommerce's normal display code (like what it shows on the product details page) comes from quite complicated SQL pulling data from around 2-3 tables in complicated join statements. A comparably much simpler SQL statement with even one join can make WordPress duke it out with the database.
Therefore the conclusion is rather plain: EAV is very good for easy extension and modification of data for extended functionality (i.e. as in WordPress). Big, flat, monolithic tables are MUCH better for databases that represent complicated records and have complicated, multi-criteria searches run on them.
It's a question of application.
From what I've seen, the EAV model doesn't hurt performance unless you need the null values; in that case you have to join against the table that holds all the type_meta.
I don't agree with Dems's answer.
If you want to build the full name of the user, you don't ask for every name that matches the name.
For that you should use 5th or 6th normal form.
Or you may even have a table for the user entity where you have:
id
username
password
salt
And there you go. That's the base, and for all the user's "extra" data you should have user_meta and user_type_meta entities, which you then join with the user.
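As a rough sketch (all names here are only illustrative):
CREATE TABLE user (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  username VARCHAR(64) NOT NULL,
  password CHAR(64) NOT NULL,
  salt CHAR(32) NOT NULL
);
CREATE TABLE user_type_meta (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64) NOT NULL  -- e.g. 'first_name', 'last_name'
);
CREATE TABLE user_meta (
  user_id INT UNSIGNED NOT NULL,
  type_meta_id INT UNSIGNED NOT NULL,
  value VARCHAR(255) NOT NULL,
  PRIMARY KEY (user_id, type_meta_id)
);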
We need to implement a search filter (Net-log like) for my social networking site against user profiles; the filters on a profile include age range, gender and interests.
We have approx. 1M profiles running on MySQL. MySQL doesn't seem like the right option for implementing such filters, so we are looking at Cassandra as well.
So what is the best way to implement such a filter? The results need to be very quick.
e.g. age = 18 - 24 and gender = male and interest = Football
Age is stored as a date; gender and interests are varchar.
EDITED:
Let me rephrase the problem: how can I get the fastest results for any type of search?
It could be on the basis of profile name, or any other profile attribute, across 1M profile records.
Thanks
It would serve your project well to make an underlying SQL change. You might want to consider changing the Interest column from a free-input field (varchar) to a tag (Many-to-many on an additional table, for example).
You used the example of Football and running a LIKE operator on it. If you change it to a tag, you will have an initial structural problem of deciding where to place:
football
Football
American Football
Australian-rules football
But once you have done so, the tags will help your select statement go much faster.
Without this change, you will be pushing your data management problem from a database (which is equipped to handle it) to Java (which might not be).
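A sketch of that structure and the kind of query it enables (table and column names are only examples, and I'm assuming the profile stores a birth date rather than an age):
CREATE TABLE interest (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(64) NOT NULL,
  UNIQUE KEY uq_interest_name (name)
);
CREATE TABLE profile_interest (
  profile_id INT UNSIGNED NOT NULL,
  interest_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (profile_id, interest_id),
  KEY idx_interest (interest_id)
);
-- male, 18-24, interested in Football (age boundaries simplified)
SELECT p.*
FROM profile p
JOIN profile_interest pi ON pi.profile_id = p.id
JOIN interest i ON i.id = pi.interest_id
WHERE i.name = 'Football'
  AND p.gender = 'male'
  AND p.birth_date BETWEEN CURDATE() - INTERVAL 24 YEAR AND CURDATE() - INTERVAL 18 YEAR;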
It may make some sense to try to optimize your query (there may at least be some things that you can do). It sounds like you have a large database, and if you are returning a large result set and filtering the results with Java, you may run into performance issues because of all of the data being kept in cache.
If this is the case, one thing you could try is caching the results outside of the database and reading from that. This is something that Hibernate does very well, but you could implement your own version if needed. If this is something you are interested in, Memcached is a good starting place.
I just noticed this for MySQL: I do not know how efficient it is, but it has some built-in full-text search functions that may help speed things up.
The following question is about the speed difference between selecting by an exact match (for example, an INT) and a LIKE match on a varchar.
Is there much difference? The main reason I'm asking this is because I'm trying to decide if it's a good idea to leave IDs out of my current project.
For example, instead of:
http://mysite.com/article/391239/this-is-an-entry
Change to:
http://mysite.com/article/this-is-an-entry
Do you think I'll experience any performance problems in the long run? Should I keep the IDs?
Note:
I would use LIKE to make the URL easier for users to remember. For example, if they typed "http://mysite.com/article/this-is-an", it would redirect to the correct article.
Regarding the number of pages, let's say I'm at around 79,230 and the app is growing fast, say 1,640 entries per day.
An INT comparison will be faster than a string (varchar) comparison. A LIKE comparison is even slower as it involves at least one wildcard.
Whether this is significant in your application is hard to tell from what you've told us. Unless it's really intensive, i.e. you're doing gazillions of these comparisons, I'd go with clarity for your users.
Another thing to think about: are users always going to type the URL? Or are they simply going to use a search engine? These days I simply search, rather than try to remember a URL, which would make this a non-issue for me as a user. What are your users like? Can you tell from your application how they access your site?
Firstly, I think it doesn't really matter either way. Yes, it will be slower, as a LIKE clause involves more work than a direct comparison, but the difference is negligible on normal sites.
This can easily be tested by measuring the time it takes to execute your query; there are plenty of examples to help you in this department.
To move away slightly from your question, you have to ask yourself whether you even need to use LIKE for this query, because 'this-is-an-entry' should be unique, right?
SELECT id, friendly_url, name, content FROM articles WHERE friendly_url = 'this-is-an-article';
A "SELECT * FROM x WHERE = 391239" query is going to be faster than "SELECT * FROM x WHERE = 'some-key'" which in turn is going to be faster than "SELECT * FROM x WHERE LIKE '%some-key%'" (presence of wild-cards isn't going to make a heap of difference.
How much faster? Twice as fast? - quite likely. Ten times as fast? stretching it but possible. The real questions here are 1) does it matter and 2) should you even be using LIKE in the first place.
1) Does it matter
I'd probably say not. If you really do have 391,239+ unique articles/pages - and assuming you get a comparable level of traffic - then this is probably just one of many scaling problems you are likely to encounter. However, I'd warrant this is not the case, and therefore you shouldn't worry about a million page views until you get to a million and one.
2) Should you even be using LIKE
No. If the page/article title/name is part of the URL "slug", it has to be unique. If it's not, then you are shooting yourself in the foot in terms of SEO and writing yourself a maintenance nightmare. If the title/name is unique, then you can just use WHERE title = 'some-page', making sure the title column has a unique index on it.
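Something along these lines, with the column names assumed:
ALTER TABLE articles ADD UNIQUE INDEX uq_articles_title (title);
SELECT id, title, content
FROM articles
WHERE title = 'some-page';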
Edit
Your plan of using LIKE for the URLs is utterly, utterly crazy. What happens if someone visits
yoursite.com/articles/the
Do you return a list of all the pages starting with "the"? What then happens if:
Author A creates
yoursite.com/articles/stackoverflow-is-massive
2 days later Author B creates
yoursite.com/articles/stackoverflow-is-massively-flawed
Not only will A be pretty angry that his article has been hijacked, but all the permalinks he may have sent out will be broken, and Google is never going to give your articles any reasonable PageRank because the content keeps changing and effectively diluting itself.
Sometimes there is a pretty good reason you've never seen your amazing new "idea/feature/invention/time-saver" anywhere else before.
INT is much faster.
In the string case, I think you should not query with LIKE but just with =, because you are looking for this-is-an-entry, not this-is-an-entry-and-something.
There are a few things to consider:
The type of search performed on the database will be an "index seek" (searching for a single row using an index) most of the time.
This type of exact-match operation on a single row is not significantly faster using ints than strings; they have basically the same cost for any practical purpose.
What you can do is the following optimization: search the database using an exact match (no wildcards); this is as fast as using an int index. If there is no match, do a fuzzy search (using wildcards); this is more expensive, but on the other hand it is rarer and can produce more than one result. Some form of result ranking is needed if you want to go for the best match.
Pseudocode:
Search for an exact match using a string: Article Like 'entry'
if (match is found) display page
if (match is not found) Search using wildcards
If (one appropriate match is found) display page
If (more than one relevant match) display a "Did you try to find ...?" page
If (no matches) display error page
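In SQL terms (reusing the friendly_url-style column from the earlier answer; names assumed), the two lookups might be:
-- exact match first: cheap, uses the index
SELECT id, friendly_url
FROM articles
WHERE friendly_url = 'this-is-an-entry';
-- only if that returns nothing, fall back to the more expensive fuzzy search
SELECT id, friendly_url
FROM articles
WHERE friendly_url LIKE 'this-is-an%'
LIMIT 10;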
Note: keep in mind that fuzzy URLs are not recommended from an SEO perspective, because people can link to your site using multiple URLs, which will split your page rank instead of increasing it.
If you put an index on the varchar field it should be OK performance-wise; it really depends on how many pages you are going to have. You also have to be more careful and sanitize the string to prevent SQL injection, e.g. only allow a-z, 0-9, -, _, etc. in your query.
I would still prefer an integer ID, as it is faster and safer; just change the URL format to something nicer, like:
http://mysite.com/article/21-this-is-an-entry.html
As said, INT beats VARCHAR for comparison speed, and if the table is indexed on the field you're searching then that will help too, as the server won't have to build an index on the fly.
One thing which will help validate your queries for speed and sense is EXPLAIN. You can use this to show which indexes your query is using, as well as execution times.
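For example (query and column names assumed):
EXPLAIN SELECT id, title FROM articles WHERE title = 'this-is-an-entry';
-- look at the "key" column to see which index was chosen and "rows" for how many rows MySQL expects to examine;
-- "type" should be const or ref for an indexed exact match, rather than ALL (a full table scan).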
To answer your question: if it's possible to build your system using exact matches on the article ID (i.e. an INT), it'll be much "lighter" than trying to match the whole URL using a LIKE statement. LIKE will obviously work, but I wouldn't want to run a large, high-traffic site on it.