There is a page containing diverse items from different MySQL tables (news, articles, video, audio, ...), all bound to a certain tag (e.g. "economics").
At the moment, 100 rows bound to the tag are fetched from each table, then grouped and sorted.
I need to introduce pagination on the page, which is a pain in this situation, because one needs to collect all the items together in order to take a chunk at some offset with some limit.
I think I need to aggregate the items from each table into one data source, and then perform querying (filtering by tag) and sorting (by date) on it.
What can I use for this purpose? I'm considering the Sphinx search engine, but I'm not sure whether it fits this case - I need only filtering and sorting, not full-text search.
Sphinx is a very good solution for your case. You can define one index for all of your content types (news, articles, video, audio); just add a field such as "source_type" that indicates the source table (for example 1 = news, 2 = audio, 3 = video, and so on), and add all the fields you want to filter on.
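A rough sketch of such a unified source in sphinx.conf (the table and column names here are assumptions; since Sphinx requires a unique document ID across the whole index, the per-table ids are remapped):

source unified_content
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db   = mydb

    # One SELECT per table, merged with UNION ALL; the constant marks the source table,
    # and id * 10 + source_type keeps document ids unique across tables.
    sql_query = \
        SELECT id * 10 + 1 AS id, title, tag_id, UNIX_TIMESTAMP(created_at) AS created_at, 1 AS source_type FROM news \
        UNION ALL \
        SELECT id * 10 + 2 AS id, title, tag_id, UNIX_TIMESTAMP(created_at) AS created_at, 2 AS source_type FROM audio

    sql_attr_uint      = tag_id
    sql_attr_uint      = source_type
    sql_attr_timestamp = created_at
}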
If you want to find all audio with the tag "rock", you just filter by the "tag" and "source_type" fields. Sphinx does this much faster than MySQL, particularly if you have a very large amount of data. Sphinx will return only a bunch of the found rows (how many depends on max_matches in the Sphinx config).
At the same time, Sphinx can return the count of all matches very quickly, and by using LIMIT and OFFSET in your queries to Sphinx you can do the pagination.
That way you fetch the ids of the matching objects from Sphinx, and then fetch all the required data for those ids from MySQL.
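For illustration, via SphinxQL (field names follow the sketch above; the tag id and page size are made up), one page of the stream could be fetched like this:

SELECT id, source_type FROM unified_content
WHERE tag_id = 7
ORDER BY created_at DESC
LIMIT 20, 10;

SHOW META;  -- total_found here gives the overall match count for rendering the pager

-- Map each document id back to its source row ((id - source_type) / 10 with the
-- remapping above), then fetch the full rows from MySQL per source table:
SELECT * FROM news WHERE id IN (3, 15, 42);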
I have used this approach in the same situation, and it is very efficient.
I'm accessing the database (predominantly MS SQL Server and PostgreSQL) through an ORM, and defining attributes (like whether a field/column should have an index) in code.
I'm thinking that if a column will be ordered via ORDER BY, it should have an index; otherwise a full table scan will be required every time (e.g. if you want to get the top 5 records ordered by date).
As I'm defining these indexes in code (on Entity Framework POCO entities, as .NET attributes), I can access this metadata at runtime. When displaying the data in a grid, I'm planning to make only those columns sortable (by clicking on the column header) that have an index attribute. Is my thinking correct, or are there reasonable situations where sorting is desirable on a non-indexed column, or vice versa (where sorting on an indexed column would not make much sense)?
In short, is it reasonable to assume that only those columns that have a corresponding index at the database level should be sortable in the UI?
Or, to phrase the question more generically: should columns that will be ordered always have some sort of index?
Whether you need an index depends on how often you query the ordered sequence compared to how often you make changes that could influence the ordered sequence.
Every time you make a change that influences the ordered sequence, your database has to update the ordered index. So if you make considerably more changes than queries, the index will be reordered more often than the result of the ordering is actually used.
Furthermore, it depends on who is willing to wait for the result: the one who makes the change that requires a re-index, or the one who runs the queries.
I wouldn't be surprised if the index is reordered by a separate process after the change has been made. If a query arrives while the reordering is not yet finished, the database first needs to finish enough of the reordering before the query can return.
On the other hand, if a new change is made while the reordering triggered by an earlier change is still unfinished, the database will probably not finish the previous reordering, but start reordering the new situation instead.
So I guess it is not mandatory to have an ordered index for every query. Ordering every possible column combination would be too much work, but if a certain ordering is requested quite often by a process that waits for the results, it may be wise to create that ordered index.
ORDER BY doesn't mandate an index on a column, but if the column isn't indexed the database will end up doing a file sort rather than an index sort, so it's generally preferable to have a column indexed if you intend to use it in WHERE / JOIN ... ON / HAVING / ORDER BY clauses.
You can generate the query execution plan and see the difference between the two versions (indexed versus non-indexed).
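For example, in MySQL (the table and column names are hypothetical):

-- Without an index on `created`, EXPLAIN typically reports "Using filesort":
EXPLAIN SELECT * FROM records ORDER BY created DESC LIMIT 5;

-- After adding an index, the same query can read rows in index order instead:
CREATE INDEX idx_records_created ON records (created);
EXPLAIN SELECT * FROM records ORDER BY created DESC LIMIT 5;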
Kudos to @Harald Coppoolse for a thorough answer. There's something else you should know about sorting: it is often preferable to do it at the application level rather than in the database. See item number 2 in the following list: https://www.brentozar.com/archive/2013/02/7-things-developers-should-know-about-sql-server/
I have three to five search fields in my application and I'm planning to integrate it with Apache Solr. I tried this with a single table and it works fine. Here are my questions.
Can we index multiple tables in the same core, or should I create a separate core for each index (I guess this concept is wrong)?
Suppose I have 4 tables: users, careers, education and location. I have two search boxes on a PHP page: one to search for simple locations (just like an autocomplete box), and another to search for a keyword, which should check the careers and education tables. If multiple indexes are possible under a single core:
2.1 How do we define the query here?
2.2 Can we specify the index name in the query (like a table name in MySQL)?
Links that address my concerns are enough.
If you're expecting to query the same data as part of the same request, such as auto-completing users, educations and locations at the same time, indexing them to the same core is probably what you want.
The term "core" is probably identical to the term "index" in your usage, and having multiple sets of data in the same index will usually be achieved through having a field that indicates the type of document (and then applying a filter query if you want to get documents of only one type, such as fq=type:location. You can use the grouping feature of Solr to get separate result sets of documents back for each query as well.
If you're only ever going to query the data separately, having them in separate indexes is probably the way to go, as you'll be able to scale, analyze and tune each index independently (and you avoid always needing a filter query to select the type of content you're looking for).
Specifying the index name is the same as specifying the core; it is part of the URL you send to Solr: http://localhost:8983/solr/index1/ or http://localhost:8983/solr/index2/.
I have the following problem:
We have a lot of different, yet similar, types of data items that we want to record in a (MariaDB) database. All data items have some common parameters such as id, username, status, file glob, type, comments, and start & end timestamps. In addition, there are many (let's say between 40 and 100) parameters that are specific to each data item type.
We would prefer to keep the different data item types in the same table, because they will be displayed, as they happen, alongside several other data in one single list in the web application. This will appear like an activity stream or "Facebook wall".
It seems that the normalised approach, with a generic top-level table joined to type-specific tables underneath, will lead to bad performance. We would have to do a lot of joins and unions in order to display the activity stream, and the application will poll with this query frequently, so it's important that the query runs fast.
So, which is the better solution in terms of performance and storage optimization?
to utilize MariaDB's dynamic columns
to add all the different kinds of columns we need to one table, and accept that each data item type will only use a few of the columns, i.e. the rest will be NULL.
something else?
Does it matter if we use regular columns when a lot of the data in them will be null?
When should we use dynamic columns and when is it better to use regular columns?
I believe you should have separate columns for the values you are filtering by. However, you might have some values you never filter on; for those, it might be a good idea to store them together in a single column as a JSON object (simple to encode and decode).
A few regular columns -- the main ones used in WHERE and ORDER BY clauses (but not necessarily every column you might filter on).
A JSON column, or MariaDB dynamic columns, for the rest.
See my blog on why not to use an EAV schema. It focuses on how to do this with JSON, but MariaDB's dynamic columns are arguably better.
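A minimal sketch of this hybrid layout (the table and column names are assumptions; the extra blob is packed with MariaDB's dynamic-columns functions):

CREATE TABLE activity (
    id       BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(64) NOT NULL,
    type     TINYINT UNSIGNED NOT NULL,   -- which kind of data item this row is
    status   VARCHAR(16),
    start_ts DATETIME,
    end_ts   DATETIME,
    extra    BLOB,                        -- type-specific parameters as dynamic columns
    KEY idx_stream (start_ts)             -- supports the activity-stream ORDER BY
);

INSERT INTO activity (username, type, status, start_ts, extra)
VALUES ('alice', 2, 'done', NOW(), COLUMN_CREATE('bitrate', 320, 'codec', 'mp3'));

SELECT username, COLUMN_GET(extra, 'bitrate' AS INTEGER) AS bitrate
FROM activity
WHERE type = 2
ORDER BY start_ts DESC
LIMIT 20;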
I'm building a fairly large database where I will have a lot of tables with various data.
But each table has similar fields, for example video title or track title.
Now the problem I'm facing is how to build a query that looks for a keyword match across five or more tables; keep in mind that each table can potentially have from 100k to 1 million rows, or in some cases even a couple of million rows.
I think using joins, or separate queries for each table, would be very slow, so what I thought of is making one separate table where I would store the search data.
For example, I think it could have fields like these:
id ---- username ---- title ---- body ---- date ---- belongs_to ---- post_id
This way I think searches would perform a lot faster, or am I totally wrong?
The only problem with this approach that I can think of is that it would be hard to manage this table: if an original record in one of the source tables is deleted, I would also need to delete the corresponding record from the 'search' table.
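For reference, a sketch of that table as proposed (the FULLTEXT index is an assumption about how searches would run against it, and the trigger shows one way to keep it in sync, using a hypothetical news source table):

CREATE TABLE search (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username   VARCHAR(64),
    title      VARCHAR(255),
    body       TEXT,
    date       DATETIME,
    belongs_to VARCHAR(32),    -- which source table the row came from
    post_id    INT UNSIGNED,   -- id of the row in that source table
    FULLTEXT KEY ft_search (title, body),
    KEY idx_origin (belongs_to, post_id)
);

-- Mirror deletions with one trigger per source table:
CREATE TRIGGER news_delete AFTER DELETE ON news
FOR EACH ROW
    DELETE FROM search WHERE belongs_to = 'news' AND post_id = OLD.id;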
Don't use MySQL for joining lots of tables; I would suggest you take a look at Apache Solr alongside your RDBMS.
Take a look at some information retrieval systems. They also require their own indices, so you need to index the data after each update (or at regular intervals) to keep the search index up to date. But they offer the following advantages:
much faster, because they use special algorithms and data structures designed specifically for that purpose
ability to search for documents based on a set of terms (and maybe also a set of negative terms that must not appear in the result)
search for phrases (i.e. terms that appear after each other in a specific order)
automatic stemming (i.e. stripping word endings like "s", "ed", "ing", ...)
detection of spelling mistakes (i.e. "Did you mean ...?")
stopwords to avoid indexing really common meaningless words ("a", "the", etc.)
wildcard queries
advanced ranking strategies (i.e. ranking by relevance, based on the number and position of each occurrence of the search terms)
I have used Xapian in the past for my projects and I was quite happy with it. Lucene, Solr and Elasticsearch are some other really popular projects that might fit your needs.
On our new site (a shopping site), we will use Solr as the site's search engine. In the Solr index we keep a list of product ids, and a list of keywords for each product. The search query is run against the keywords.
Solr returns a list of product ids. These ids are then inserted into a MySQL query that selects all product data from the database. MySQL also handles the sorting of the results. E.g., the MySQL query might look like:
SELECT * FROM product WHERE id IN (1,4,42,32,46,...,39482) ORDER BY price ASC
We have around 100,000 products on the site. This method works fine when there are a couple of thousand results, but becomes slow when there are, for example, 50,000 results.
My assumption is that the bottleneck is the WHERE ... IN clause. A long-term solution is to move all product data into Solr so it can handle sorting the results and also apply refinement filters to the search (e.g., perhaps the user only wants to view products in a certain price range). However, we are inexperienced with Solr and need a short-term fix before we can implement this.
One option is to abandon Solr in the short term, store the keywords in a table in MySQL, and search against it with a FULLTEXT search, as sketched below.
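A rough sketch of that fallback (assuming the keywords are added as a column on product; MATCH ... AGAINST requires a FULLTEXT index, which InnoDB supports since MySQL 5.6):

ALTER TABLE product ADD FULLTEXT INDEX ft_keywords (keywords);

SELECT id
FROM product
WHERE MATCH(keywords) AGAINST('tooth paste' IN NATURAL LANGUAGE MODE)
ORDER BY price ASC
LIMIT 0, 10;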
Am I missing any other options?
The main problem for you is that Solr will return the results sorted by the number of matching keywords, but you want the results sorted by price. As you correctly mention, moving all your data to Solr is the best option - you would be very happy with Solr for your searching, sorting, faceting and pagination needs.
For the short term, however, it is well worth just adding the price field to Solr. When you get a search query like "tooth paste", you can issue a Solr query like
q=keywords:(tooth AND paste)&rows=10&fl=id&sort=price%20asc
to get only the first 10 results, and then paginate by specifying the start parameter, like so:
q=keywords:(tooth AND paste)&rows=10&start=10&fl=id&sort=price%20asc
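The numFound value in Solr's response gives the total number of matches, so you can render the pager without issuing a separate count query.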