Sunspot: How to implement a Search Result Hierarchy? - mysql

I'm currently working on implementing Solr through Sunspot in a Rails project.
Looking at the documentation, I don't see how I would implement a hierarchy of search results. By that I mean:
All users that match the query and have profile pictures should be displayed first.
All users that match the query and don't have a profile picture should be displayed underneath.
And so on.
I would appreciate any guidance or references on how to implement such a system.

If you want to display users with profile pictures first and those without them later, you can sort on the picture field with sortMissingLast, which makes all records that have no value for that field appear last.
Alternatively, give records without a picture a default value that sorts last.
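The sortMissingLast behaviour can be sketched outside Solr in plain Python (user records and field names are hypothetical): sort by a key that pushes rows with a missing value to the end.

```python
# Sketch of what Solr's sortMissingLast does: rows whose sort field
# is missing (None here) are pushed to the end of the result list.
users = [
    {"name": "alice", "photo": "a.jpg"},
    {"name": "bob", "photo": None},
    {"name": "carol", "photo": "c.jpg"},
]

# Sort key: (is the field missing?, field value) -- False sorts before
# True, so present values come first and missing ones go last.
ranked = sorted(users, key=lambda u: (u["photo"] is None, u["photo"] or ""))

print([u["name"] for u in ranked])
```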

I've heard this request many times over the years, and it doesn't work quite like people expect. The worst case behavior is pretty bad and pretty common.
You may not want to do exactly that. As soon as you include a common term, like "Jr", you will have to show thousands of results with pictures before the first profile without a picture, even if that one is the right result.
This will happen more often than you expect, because common names are, well, common, so they show up in queries a lot and match a lot of documents. This may happen for your most common queries. Oops.
Instead, boost results with a quality factor. If there are two "Joe Smith" profiles, the one with the picture is better and should be shown first. You can do this with the "boost" parameter of the edismax query parser: if a profile has a photo, use a boost of 2, otherwise a boost of 1. You may have to play with the exact values to get what you want.
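The multiplicative boost idea can be sketched in plain Python (the documents, scores, and field names are hypothetical): multiply each document's relevance score by 2 when it has a photo, then rank by the boosted score.

```python
# Hypothetical search results: "score" is the text-relevance score from
# the engine; has_photo is the quality signal we want to boost on.
results = [
    {"name": "Joe Smith (no photo)", "score": 1.0, "has_photo": False},
    {"name": "Joe Smith (photo)",    "score": 1.0, "has_photo": True},
    {"name": "Joseph Smith",         "score": 0.4, "has_photo": True},
]

def boosted(doc):
    # Multiplicative boost, like edismax's boost param: 2x with a photo.
    return doc["score"] * (2.0 if doc["has_photo"] else 1.0)

ranked = sorted(results, key=boosted, reverse=True)
print([r["name"] for r in ranked])
```

Note the difference from a hard sort: a strong match without a photo (boosted score 1.0) still outranks a weak match with one (0.8), which is exactly the behaviour the sortMissingLast approach cannot give you.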

Related

MySQL JOIN vs LIKE - faster selects?

Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable attributes such as length, actors, directors, writers, genres etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that for each movie, all the above-mentioned parameters are stored as comma-separated values. Therefore I have so far used a query built from LIKE statements. The query can become quite large, as I pretty much query for every parameter in the table, sometimes with 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I started my first job and to teach myself in my freetime, I want to work on my own website. Because I have no real concept for what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings with no real entry anywhere else, which essentially means two entries with the same name are treated as the same actor.
Now here comes my actual question:
Assuming I want my select queries to operate faster, would it make sense to run a script that splits the comma divided strings into extra tables (these are n to m relations, see attempt below) and then JOIN all these tables (they will be 8 or more) or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster select queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind, I would still have to create join tables between movies and each of these new tables. After doing that, I could remove the columns from the movie table, and would end up joining a lot of tables with EACH query. The only real advantage I can see is that it would be easier to create indexes on the individual tables, rather than one (or a few) covering the one big movie table.
I hope all of this even makes sense to you. I appreciate any answer short or long, like I said this is mostly for self studies and as such, I don't have/need a real business model.
I don't understand what you currently have. It seems you only showed the size of the tables, not their internal structure. You need to separate the data into separate tables using normalization rules and then add the correct indexes; indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE on your queries? Please post the query itself; I cannot guess it from the result. There are also a lot of optimization videos on YouTube.
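For what it's worth, the two query shapes being compared can be sketched with sqlite3 (all table and column names here are hypothetical stand-ins for the movie schema): a `%...%` LIKE over a comma-separated column can never use an index, while the normalized version turns the same lookup into an index seek plus joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized: actors crammed into one comma-separated column.
cur.execute("CREATE TABLE movie_flat (id INTEGER PRIMARY KEY, title TEXT, actors TEXT)")
cur.execute("INSERT INTO movie_flat VALUES (1, 'Heat', 'Al Pacino,Robert De Niro')")
cur.execute("INSERT INTO movie_flat VALUES (2, 'Taxi Driver', 'Robert De Niro,Jodie Foster')")

# Normalized: movies, actors, and an n-to-m join table with an index.
cur.executescript("""
CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE actor (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE movie_actor (movie_id INTEGER, actor_id INTEGER);
CREATE INDEX idx_ma_actor ON movie_actor(actor_id);
INSERT INTO movie VALUES (1, 'Heat'), (2, 'Taxi Driver');
INSERT INTO actor VALUES (1, 'Al Pacino'), (2, 'Robert De Niro'), (3, 'Jodie Foster');
INSERT INTO movie_actor VALUES (1, 1), (1, 2), (2, 2), (2, 3);
""")

# LIKE on the flat table: always a full scan; '%...%' cannot use an index.
flat = cur.execute(
    "SELECT title FROM movie_flat WHERE actors LIKE ?", ("%Robert De Niro%",)
).fetchall()

# Join on the normalized tables: an index seek on actor.name, then the
# join table -- the same answer, but index-driven.
norm = cur.execute("""
    SELECT m.title FROM movie m
    JOIN movie_actor ma ON ma.movie_id = m.id
    JOIN actor a ON a.id = ma.actor_id
    WHERE a.name = 'Robert De Niro'
""").fetchall()

print(flat, norm)
```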

What is the best way... considering performance

I have around 500,000 users in my table, and each user is associated with some books (has_many).
I want to display all the users along with their books.
I won't be displaying all the users on the same page; they would be paginated.
What is the best way to do this, keeping performance and database hits in mind? What needs to be considered when dealing with large numbers of records?
It would be unreasonable to display all the users and their books on the same page. There are, I believe, two possible approaches to solving this:
You can have an index page for the users where you list all the users. Corresponding to each user you can have a "show" page where you display that user's books. This greatly simplifies the resulting database queries, as you only need to load the users for the index page, and only one user's books on his/her show page. That means no complex joins and not a lot of data each time.
If you really want to show multiple users and their books on the same page, then, like someone mentioned in the comments above, you need to use pagination, say load 5 users per page. However, to add to that, you would also need to use eager loading as that could easily turn into an N + 1 problem. You could read more in "Eager loading of associations".
Going back to the first approach, you could use pagination there as well, for example when listing the users, or even the books for a single user.
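The eager-loading point can be sketched with sqlite3 (the schema and names are hypothetical): load one page of users, then issue ONE batched query for all their books, instead of one query per user. This is roughly what ActiveRecord's `includes(:books)` does under the hood.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE books (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
INSERT INTO users VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
INSERT INTO books VALUES (1, 1, 'b1'), (2, 1, 'b2'), (3, 2, 'b3');
""")

# One query for the page of users...
page = cur.execute("SELECT id, name FROM users ORDER BY id LIMIT 2").fetchall()
ids = [u[0] for u in page]

# ...and ONE query for every book on the page, avoiding the N+1 pattern
# of a separate books query per user.
marks = ",".join("?" * len(ids))
books = cur.execute(
    f"SELECT user_id, title FROM books WHERE user_id IN ({marks})", ids
).fetchall()

by_user = {}
for user_id, title in books:
    by_user.setdefault(user_id, []).append(title)

print([(name, by_user.get(uid, [])) for uid, name in page])
```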
Queries with large offsets are inefficient in MySQL: when evaluating a query with an offset of 100,000, MySQL has to actually find those 100,000 rows and discard them before it can find the ten rows you end up displaying.
One way around this is to give your application hints: Rather than saying page 10000, say that it's the page where id > x, if you were sorting in primary key order.
It's also crucial that you have appropriate indexes.
There's a good article called "Efficient Pagination Using MySQL" on percona.com with a variety of approaches for paginating through large sets.
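The "id > x" hint can be sketched with sqlite3 (the table is a hypothetical stand-in): keyset ("seek") pagination remembers the last id shown and seeks straight to it via the primary-key index, returning the same page OFFSET would, without reading and discarding the skipped rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"user{i}") for i in range(1, 101)])

# OFFSET: the server reads and discards the first 50 rows every time.
offset_page = cur.execute(
    "SELECT id FROM users ORDER BY id LIMIT 5 OFFSET 50").fetchall()

# Keyset: remember the last id shown and seek straight past it.
last_seen = 50
keyset_page = cur.execute(
    "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT 5",
    (last_seen,)).fetchall()

print(offset_page == keyset_page)  # same page, but index-driven
```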
A few thoughts:
MySQL is smart enough to handle that many records. You could read all users and their books in a single query and then display them however you wish; however, showing that many records on a single web page will hurt the response time.
Hence pagination: read a limited number of records per page. This means a SQL query per page, but each one is still cheap, and of course you can use query caching.
A better option could be to show an alphabetical list of users, not necessarily just A-Z but also AB, AC, AD and so on, so your visitors can jump directly to a particular list. Add pagination to it if the number of users in a given list is too large.
I'm not sure how important it is for your website to show the latest updates immediately, but you could also think about generating XML files, as many as you deem necessary (for example, split alphabetically), and building your web pages from those files, regenerating them once every 24 hours. That means minimal DB load.
And please consider building a search, because navigating through that many users could be discouraging.
Hope it helps!

Complex SQL String Comparison

I'm merging two databases for a client. In an ideal world, I'd simply use the unique id to join them, but in this case the newer table has different ids.
So I have to join the tables on another column. For this I need to use a complex LIKE statement to join on the Title field. But they have changed the titles of some rows, which breaks the join on those rows.
How can I write a complex LIKE statement to connect slightly different titles?
For instance:
Table 1 Title = Freezer/Pantry Storage Basket
Table 2 Title = Deep Freezer/Pantry Storage Basket
or
Table 1 Title = Buddeez Bread Buddy
Table 2 Title = Buddeez Bread Buddy Bread Dispenser
Again, there are hundreds of rows with titles only slightly different, but inconsistently different.
Thanks!
UPDATE:
How far can MySQL Full-Text Search get me? Looks similar to Shark's suggestion in SQL Server.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Do it in stages. First get all the ones that match exactly out of the way, so that you are only working with the exceptions. Your mind is incredibly smarter than the computer at finding things that are 'like' each other, so scan over the data, look for similarities, and write SQL statements that cover the specific cases you see, until you have narrowed it down as much as possible.
You will have better results if you 'help' the computer in stages like this than if you try to develop one big routine to cover all cases at once.
Of course, there are APIs out there that do this already (such as the one Google uses to guess your search phrase before you finish it), but whether any are freely available I don't know. It certainly wouldn't hurt to search for one.
It's fairly difficult to describe "only slightly different" in a way that a computer would understand. I suggest choosing a group of criteria that can be considered either most common or most important and working around those. I am not sure what those criteria should be, though, since I have only a vague idea of what the data set looks like.
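One concrete way to handle the leftover rows that exact joins miss is a similarity ratio, sketched here with Python's difflib on the sample titles from the question (the 0.85 threshold is a judgment call you would tune against your data):

```python
from difflib import SequenceMatcher

# Near-identical titles score high; unrelated titles score low.
def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Freezer/Pantry Storage Basket", "Deep Freezer/Pantry Storage Basket"),
    ("Buddeez Bread Buddy", "Buddeez Bread Buddy Bread Dispenser"),
    ("Buddeez Bread Buddy", "Freezer/Pantry Storage Basket"),
]

for a, b in pairs:
    print(round(similarity(a, b), 2), a, "<->", b)
```

You could export both Title columns, score every cross pair like this, and hand-review only the pairs that land near the threshold, which matches the staged approach suggested above.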

MySQL is SELECT with LIKE expensive?

The following question is about the speed difference between selecting an exact match (for example, on an INT) and a LIKE match on a varchar.
Is there much difference? The main reason I'm asking this is because I'm trying to decide if it's a good idea to leave IDs out of my current project.
For example, instead of:
http://mysite.com/article/391239/this-is-an-entry
Change to:
http://mysite.com/article/this-is-an-entry
Do you think I'll experience any performance problems in the long run? Should I keep the IDs?
Note:
I would use LIKE to make the URLs easier for users to remember. For example, if they type "http://mysite.com/article/this-is-an", it would redirect to the correct article.
Regarding the number of pages, let's say I'm at around 79,230 and the app is growing fast, say by 1,640 entries per day.
An INT comparison will be faster than a string (varchar) comparison. A LIKE comparison is even slower as it involves at least one wildcard.
Whether this is significant in your application is hard to tell from what you've told us. Unless it's really intensive, ie. you're doing gazillions of these comparisons, I'd go with clarity for your users.
Another thing to think about: are users always going to type the URL? Or are they simply going to use a search engine? These days I simply search, rather than try and remember a URL. Which would make this a non-issue for me as a user. What are you users like? Can you tell from your application how they access your site?
Firstly, I think it doesn't really matter either way. Yes, it will be slower, as a LIKE clause involves more work than a direct comparison, but the difference is negligible on normal sites.
This can be easily tested if you were to measure the time it took to execute your query, there are plenty of examples to help you in this department.
To move away slightly from your question: ask yourself whether you even need LIKE for this query, because 'this-is-an-entry' should be unique, right?
SELECT id, friendly_url, name, content FROM articles WHERE friendly_url = 'this-is-an-article';
A "SELECT * FROM x WHERE id = 391239" query is going to be faster than "SELECT * FROM x WHERE friendly_url = 'some-key'", which in turn is going to be faster than "SELECT * FROM x WHERE friendly_url LIKE '%some-key%'" (the presence of wildcards isn't going to make a heap of difference).
How much faster? Twice as fast? Quite likely. Ten times as fast? Stretching it, but possible. The real questions here are: 1) does it matter, and 2) should you even be using LIKE in the first place?
1) Does it matter
I'd probably say not. If you indeed have 391,239+ unique articles/pages, and assuming you get a comparable level of traffic, then this is probably just one of many scaling problems you are likely to encounter. However, I'd warrant this is not the case, and therefore you shouldn't worry about a million page views until you get to a million and one.
2) Should you even be using LIKE
No. If the page/article title/name is part of the URL "slug", it has to be unique. If it's not, then you are shooting yourself in the foot in terms of SEO and writing yourself a maintenance nightmare. If the title/name is unique, then you can just use "WHERE title = 'some-page'", making sure the title column has a unique index on it.
Edit
Your plan of using LIKE for the URLs is utterly crazy. What happens if someone visits
yoursite.com/articles/the
Do you return a list of all the pages starting with "the"? And what happens if:
Author A creates
yoursite.com/articles/stackoverflow-is-massive
2 days later Author B creates
yoursite.com/articles/stackoverflow-is-massively-flawed
Not only will A be pretty angry that his article has been hijacked, but all the permalinks he may have sent out will be broken, and Google is never going to give your articles any reasonable PageRank, because the content keeps changing and effectively diluting itself.
Sometimes there is a pretty good reason you've never seen your amazing new "idea/feature/invention/time-saver" anywhere else before.
INT is much faster.
In the string case, I think you should not query with LIKE but just with =, because you are looking for this-is-an-entry, not this-is-an-entry-and-something.
There are a few things to consider:
Most of the time, the search performed on the database will be an "index seek": finding a single row using an index.
This type of exact-match operation on a single row is not significantly faster using ints than strings; for any practical purpose they cost basically the same.
What you can do is the following optimization: search the database using an exact match (no wildcards), which is as fast as using an int index. If there is no match, do a fuzzy search (with wildcards). This is more expensive, but it is also the rare case, and it can produce more than one result, so some form of ranking is needed if you want to pick the best match.
Pseudocode:
Search for an exact match using the string: Article = 'entry'
if (match is found) display page
if (match is not found) search using wildcards: Article LIKE 'entry%'
if (one appropriate match is found) display page
if (more relevant matches) display a "Did you mean ..." page
if (no matches) display an error page
Note: keep in mind that fuzzy URLs are not recommended from an SEO perspective, because people can link to your site using multiple URLs, which will split your page rank instead of increasing it.
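The pseudocode above can be made runnable; here is a sketch with sqlite3 (the articles table and slugs are hypothetical). The exact match is served by the primary-key index, and the wildcard fallback only runs when it misses:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE articles (slug TEXT PRIMARY KEY, title TEXT)")
cur.executemany("INSERT INTO articles VALUES (?, ?)", [
    ("this-is-an-entry", "This is an entry"),
    ("this-is-another-entry", "This is another entry"),
])

def find_article(slug):
    # Cheap path: exact match, served by the primary-key index.
    row = cur.execute(
        "SELECT title FROM articles WHERE slug = ?", (slug,)).fetchone()
    if row:
        return ("exact", [row[0]])
    # Expensive fallback: prefix LIKE; may return several candidates,
    # which the caller can rank or show on a "Did you mean ..." page.
    rows = cur.execute(
        "SELECT title FROM articles WHERE slug LIKE ?", (slug + "%",)).fetchall()
    return ("fuzzy", [r[0] for r in rows]) if rows else ("miss", [])

print(find_article("this-is-an-entry"))
print(find_article("this-is-an"))
print(find_article("nope"))
```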
If you put an index on the varchar field, it should be OK performance-wise; it really depends on how many pages you are going to have. You also have to be more careful and sanitize the string to prevent SQL injection, e.g. only allow a-z, 0-9, -, _, etc. in your query.
I would still prefer an integer id, as it is faster and safer; change the format to something nicer like:
http://mysite.com/article/21-this-is-an-entry.html
As said, an INT compares faster than a VARCHAR, and if the table is indexed on the field you're searching, that will help too, as the server won't have to fall back to a full scan.
One thing which will help validate your queries for speed and sense is EXPLAIN. You can use this to show which indexes your query is using, as well as execution times.
To answer your question, if it's possible to build your system using exact matches on the article ID (ie an INT) then it'll be much "lighter" than if you're trying to match the whole url using a LIKE statement. LIKE will obviously work, but I wouldn't want to run a large, high traffic site on it.

Pathing in a non-geographic environment

For a school project, I need to create a way to build personalized queries based on end-user choices.
Since the user can choose basically any fields from any combination of tables, I need to find a way to map the tables in order to build a join and avoid extraneous data. (This may lead to incoherent reports, but we're willing to live with that.)
For up to two tables, I have already designed an algorithm that works fine. However, when I add another table, I can't find a way to find a path through my database. All tables available for the personalized reports can be linked together, so it really all comes down to finding which path to use.
You might be able to try some form of an A* algorithm. Basically this looks at each of the possible next options to choose and applies a heuristic to it, a function that determines roughly how far it is between this node and your goal. It then chooses the one that is closer and repeats. The hardest part of implementing A* is designing a good heuristic.
Without more information on how the tables fit together, or what you mean by a 'path' through the tables, it's hard to recommend something though.
Looks like it didn't like my link, probably because of the * in it; try:
http://en.wikipedia.org/wiki/A*_search_algorithm
Edit:
If that is the whole database, I'd go with a depth-first exhaustive search.
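For a schema small enough for exhaustive search, plain breadth-first search already gives shortest join paths with no heuristic to design. A sketch (the table names and relations here are hypothetical, since the real map was withheld): edges are the known join relations, and BFS returns the shortest chain of tables connecting two selections.

```python
from collections import deque

# Hypothetical schema graph: each table maps to the tables it can join to.
SCHEMA = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A"],
    "E": ["C"],
}

def join_path(start, goal):
    # Breadth-first search: the first path that reaches the goal is the
    # shortest, i.e. the join with the fewest intermediate tables.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in SCHEMA[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # tables not connected

print(join_path("A", "C"))
```

For three or more selected tables, one option is to union the pairwise shortest paths; that is not guaranteed minimal overall (that would be a Steiner-tree problem), but on a backbone-shaped schema it is usually good enough.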
I thought about using A* or a similar algorithm, but as you said, the hardest part is designing the heuristic.
My tables are centered around somewhat of a backbone, with quite a few branches each leading to at most a single leaf node. Here is the actual map (table names removed because I'm paranoid). Assuming I want to view data from the A, B and C tables, I need an algorithm to find the blue path.