Is there an alternative to LIKE? Note that I cannot use full-text search.
Here is my MySQL code:
SELECT *
FROM question
WHERE content LIKE '%$search_each%'
OR title LIKE '%$search_each%'
OR summary LIKE '%$search_each%'
Well, MySQL has regular expressions but I would like to ask you what the problem is with multiple LIKEs.
I know it won't scale well when tables get really large, but that's rarely a concern for the people using MySQL. (No disrespect to MySQL intended; I've just noticed that a lot of people seem to use it for small databases, leaving the large ones to the likes of Oracle, DB2 or SQL Server, or to NoSQL where ACID properties aren't so important.)
If, as you say:
I plan to use it for really large sites.
then you should avoid LIKE altogether. And, if you cannot use full text search, you'll need to roll your own solution.
One approach we've used in the past is to use insert/update/delete triggers on the table to populate yet another table. The insert/update trigger should:
evaluate the string in question;
separate it into words;
throw away inconsequential words (all-numerics, noise words like 'at', 'the', 'to', and so on); then
add those words to a table, along with a marker pointing back to the row in the original table.
Then use that table for searching, almost certainly much faster than multiple LIKEs. It's basically a roll-your-own sort-of-full text search where you can fine-tune and control what actually should be indexed.
The advantage of this is speed during the select process with a minor cost during the update process. Keep in mind this is best for tables that are read more often than written (most of them) since it amortises the cost of indexing the individual words across all reads. There's no point in incurring that cost on every read, better to do it only when the data changes.
And, by the way, the delete trigger will simply delete all entries in the indexing table which refer to the real record.
The table structures would be something like:
CREATE TABLE Comments (
    id      INT NOT NULL,
    comment VARCHAR(200),
    -- other columns ...
    PRIMARY KEY (id)
);

CREATE TABLE Words (
    id   INT NOT NULL,
    word VARCHAR(50),
    PRIMARY KEY (id),
    INDEX (word)
);

CREATE TABLE WordsInComments (
    wordid    INT NOT NULL,
    commentid INT NOT NULL,
    PRIMARY KEY (wordid, commentid),
    INDEX (commentid)
);
Setting up the many-to-many relationship as id-to-id (i.e., separate Words and WordsInComments tables) rather than id-to-text (combining them into one) is the correct thing to do for third normal form, but you may want to look at trading storage space for speed and combining them, provided you understand the implications.
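To make the approach concrete, here is a minimal sketch of the search query itself, assuming the table and column names above; the user variable @search_word stands in for a single, already-extracted search term:

SET @search_word = 'mysql';

SELECT c.*
FROM Words AS w
JOIN WordsInComments AS wic ON wic.wordid = w.id
JOIN Comments AS c ON c.id = wic.commentid
WHERE w.word = @search_word;   -- exact word lookup, served by INDEX (word)

For multi-word searches you would repeat the lookup per word (or use IN plus GROUP BY/HAVING to require that all words match), which is still far cheaper than a '%...%' LIKE scan.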
Let's say I have a table BOOK:
BOOK_ID INT(6) PK
--------------------
FILE_EXTENSION VARCHAR(5)
TITLE VARCHAR(60)
LANGUAGE VARCHAR(10)
EDITION INT(2)
PUBLISHMENT_OFFICE_ID INT(4)
PUBLISH_YEAR INT(4)
RATING INT(1)
FILE_UPLOAD_DATE DATE
LINK VARCHAR(150)
This table is meant to be used both for searching books (e.g. by extension, by publishing office, by authors from other tables, etc.) and for full visualization (printing all books with all these fields on a page).
So there is a question: For example, if I do
SELECT BOOK_ID FROM BOOK WHERE FILE_EXTENSION = 'PDF'
will this cause all the big fields (link, title, and maybe a planned BLOB) to be loaded as an intermediate result, or will the unnecessary fields be discarded as soon as the WHERE clause is evaluated, with no performance issues?
That question leads to a possible solution: separate the big fields into another table with the same PK, which slows down visualization (because a JOIN is needed) but speeds up the search. Is it worth it?
P.S. This particular DB is not meant to hold a really big amount of data, so my queries (I hope) won't be that slow. But this question is about database design in general (say, 10^8 entries).
P.P.S. Please don't link me to database normalization (my full DB is well normalized).
Columns are stored as part of their row. Rows are stored as part of a Page. If you need one column from one row, you need to read the whole row; in fact, you read the whole page that row is in. That's likely to be thousands of rows, including all of their columns. Hopefully that page also has other rows you are interested in and the read isn't wasted.
That's why Columnar databases are becoming so popular for analytics. They store columns separately. They still store the values in Pages. So you read thousands of rows off the disk for that column, but in analytics you're likely to be interested in all or most of those rows. This way you can have hundreds of columns, but only ever read the columns you're querying.
MySQL doesn't have ColumnStore. So, you need an alternative.
First is to have your large fields in a separate table, which you've already alluded to.
Second, you can use a covering index.
If you index (file_extension, book_id), the query SELECT book_id FROM book WHERE file_extension = 'pdf' can be satisfied just by reading the index. It never needs to read the table itself. (Indexes are still stored as pages on disk, but they contain only the columns the index covers, plus potentially a row pointer, so they are much narrower than the table.)
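As a sketch, using the question's column names (the index name is made up):

CREATE INDEX idx_ext_id ON book (file_extension, book_id);

-- EXPLAIN should report "Using index", meaning the query is answered
-- from the index alone without touching the table rows
EXPLAIN SELECT book_id FROM book WHERE file_extension = 'pdf';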
That's a bit clunky though, because the covering index needs to cover the columns you know you'll be interested in.
In practice, your fields are small enough to not warrant this attention until it actually becomes a problem. It would be wise to store BLOBs in a separate table though.
"Columns are stored as part of their row." -- Yes and no. All the 'small' columns are stored together in the row. But TEXT and BLOB, when 'big', are stored elsewhere. (This assumes ENGINE=InnoDB.)
SELECT book_id FROM ... WHERE ext = 'PDF' would benefit from INDEX(ext, book_id). Without such, the query necessarily scans the entire table (100M rows?). With that index, it will be very efficient.
"print on page all books with all these fields" -- Presumably this excludes the bulky columns? In that case SELECT book_id versus SELECT all-these-fields will cost about the same. This is a reasonable thing to do on a web page -- if you are not trying to display thousands of books on a single page. That becomes a "bad UI" issue, more than an "inefficient query" issue.
title and link are likely to come under the heading of "small" in my discussion above. But any BLOBs are very likely to be "big".
Yes, it is possible to do "vertical partitioning" to split out the big items, but that is mostly repeating what InnoDB is already doing. Don't bother.
100M rows is well into the arena where we should discuss these things. My comments so far only touch the surface. To dig deeper, we need to see the real schema and some of the important queries. I expect some queries to be slow. With 100M rows, improving one query sometimes hurts another query.
I am creating a MySQL table which contains several longtext columns. I am expecting a lot of users to enter a lot of text. Should I split the columns into individual tables or keep them together in one table? I am concerned about speed: will this affect query speed, and what about transferring the data in the future? I am using InnoDB, or should I use MyISAM?
CREATE TABLE MyGuests (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
diet longtext NOT NULL,
run longtext NOT NULL,
faith longtext,
apple longtext
);
The main speed concern with this layout is if your query is a SELECT * while the page only uses one of the fields (a very common performance degrader). Also, if you intend to display multiple texts per page in a listing of available texts, you'd probably want a separate description column (holding a truncated version of the complete text, if nothing else) and fetch only that, instead of fetching the full text only to truncate it in PHP.
If you intend to provide search functionality, you should definitely use FULLTEXT indexes to keep your performance in the clear. If your MySQL version is 5.6.4 or later, you can use either InnoDB or MyISAM for full-text search; in earlier versions, only MyISAM provides it.
You also have a third choice between an all-in-one table and separate-tables-for-each, which might be the way of choice, presuming you may end up adding more text types in the future. That is:
Have a second table with a reference to the ID of the first table, a column (ENUM would be most efficient, but really a marginal concern as long as you index it) indicating the type of text (diet, run, etc.), and a single longtext column that contains the text.
Then you can effortlessly add more text types in the future without the hassle of more dramatic edits to your table layouts (or code), and it will also be simple to fetch only texts of a particular type. An indexed join that combines the main entry-table (which might also hold some relevant metadata like author id, entry date, etc.) and the texts shouldn't be a performance concern.
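A rough sketch of that layout; the table and column names (entry, entry_text, text_type, body) are made up for illustration:

CREATE TABLE entry (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    author_id  INT,            -- example metadata
    created_at DATETIME
) ENGINE=InnoDB;

CREATE TABLE entry_text (
    entry_id  INT UNSIGNED NOT NULL,
    text_type ENUM('diet','run','faith','apple') NOT NULL,
    body      LONGTEXT NOT NULL,
    PRIMARY KEY (entry_id, text_type),
    FOREIGN KEY (entry_id) REFERENCES entry (id)
) ENGINE=InnoDB;

-- fetch only the texts of one type
SELECT e.id, t.body
FROM entry AS e
JOIN entry_text AS t ON t.entry_id = e.id
WHERE t.text_type = 'diet';

(With ENUM you would still need an ALTER TABLE to add a new type; a plain indexed VARCHAR avoids even that, at the marginal cost the answer mentions.)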
I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key typically occupies four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, with identifiers of 10-20 characters, a varchar(20) key will occupy more space than an integer key, but not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in each record devoted to something that may not seem useful. However, worrying about such efficiencies is premature and probably not worth the effort.
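For concreteness, a minimal sketch of that arrangement; the table and column names (parent, child, natural_key) are illustrative only:

CREATE TABLE parent (
    id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    natural_key VARCHAR(20) NOT NULL,
    UNIQUE KEY (natural_key)   -- keep the natural identifier, just don't join on it
) ENGINE=InnoDB;

CREATE TABLE child (
    id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    parent_id INT UNSIGNED NOT NULL,
    -- other columns ...
    FOREIGN KEY (parent_id) REFERENCES parent (id)
) ENGINE=InnoDB;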
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer big time if you apply functions to columns in a search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
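Where you can, rewrite such predicates into a form the index can serve; for example, the SUBSTR test above is really a prefix match:

WHERE colm LIKE 'abc%'   /* prefix match: can use an index on colm */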
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent varchar(20) pk fk to parent table
birthorder int pk
name varchar(20)
Then, you could have rows like...
parent birthorder name
homer 1 bart
homer 2 lisa
homer 3 maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision because (homer,1) is occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
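A quick DDL sketch of that compound key, keeping the example's column names (the table name child is assumed):

CREATE TABLE child (
    parent     VARCHAR(20) NOT NULL,   -- FK to the parent table's key
    birthorder INT NOT NULL,
    name       VARCHAR(20),
    PRIMARY KEY (parent, birthorder)
) ENGINE=InnoDB;

INSERT INTO child VALUES ('homer', 1, 'bart');
INSERT INTO child VALUES ('homer', 2, 'lisa');
INSERT INTO child VALUES ('homer', 3, 'maggie');
-- fails with a duplicate-key error, because (homer, 1) is already occupied:
INSERT INTO child VALUES ('homer', 1, 'badbart');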
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
The main benefit you get from numeric values is that they are easier to index. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data is in a nice, searchable format, like a binary search tree, it can be found in O(log n), much faster.
Indexing is the process MySQL uses to prepare data to be searched; it builds search trees and other clever structures that make finding data quick. It makes many searches much faster. However, to do this, it has to compare the value you are searching for to various 'key' values to determine whether your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to look up data quickly, your best bet is to have an integer 'key'.
Numeric row IDs have many advantages over string-based IDs.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, so a numeric key is more efficient.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row ID can serve as a foreign key, and numeric IDs are compact to store, making them efficient.
5. Using auto-increment on the primary key has its advantages too.
My client has a huge database containing just three fields:
Primary key (an unsigned number)
Name (multi-word text)
Description (up to 1000 varchar)
This database has over a few billion entries. I have no previous experience handling such large amounts of data.
He wants me to design an interface using AJAX (like Google) to search this database. My queries are as slow as a turtle.
What is the best way to search text fields in such a large database? If the user types a misspelling into the interface, how can I return what they wanted?
If you are using FULLTEXT indexes, you're writing your queries correctly, and the speed at which the results are returned is still not adequate, you are entering territory where MySQL may simply not be sufficient for you.
You may be able to tweak settings and purchase enough RAM to make sure that your entire data set fits 100% in memory; the performance gains there can be huge.
I'd definitely recommend looking into tweaking your MySQL configuration. We've had some silly settings in the past; operating system defaults tend to really suck!
However, if you have trouble at that point, you can:
Create a separate table containing each word (indexed) along with a record id that it refers to. This will allow you to search on single words.
Use a different system that's optimized for solving this problem. Unless my information is now outdated, the 2 engines that are the most popular for solving this problem are:
Sphinx
Solr / Lucene
If your table is MyISAM then you can add a FULLTEXT index on the Name and Description fields:
CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
Name VARCHAR(200),
Description TEXT,
FULLTEXT (Name,Description)
) ENGINE=MyISAM;
Then you can use queries like:
SELECT * FROM articles
WHERE MATCH (Name,Description) AGAINST ('database');
You can find more info at http://docs.oracle.com/cd/E17952_01/refman-5.0-en/fulltext-search.html
Before doing any of the above you might want to back up (or at least make a copy of) your database.
You can't. The only fast search in your scenario would be on the primary key, since that's most likely to be indexed. Text search is as slow as a turtle.
In all seriousness, you have a few solutions:
If you have to stick with your current database, you'll have to redesign your schema. It's hard to give you a good recommendation without knowing the requirements. One solution would be to index keywords in a separate table.
Another solution is to switch to a different search engine, you can find suggestions in other questions here such as: Fast SQL Server search on 40M text records
I am planning to implement database search on a website. I know MySQL offers full-text search, but it turns out it is not supported for the InnoDB engine (which I need for transaction support).
Other options are using Sphinx or similar indexing applications. However, they require some refactoring of the database structure and may take more time to implement than I have.
So what I decided on was to take each table and concatenate all its relevant columns into a newly added QUERY column. This QUERY column would also pull in columns from other relevant tables.
With that done, I will use the LIKE clause on the QUERY column of the table being searched to return results for specific domains (groups of related tables).
Since my database is not expected to be too huge (< 1mn rows in the biggest table), I am expecting reasonable query times.
Does any one agree with this method or have a better idea?
You will not be happy with the solution of using LIKE with wildcards. It performs hundreds or thousands of times slower than using a fulltext search technology.
See my presentation Practical Full-Text Search in MySQL.
Instead of copying the values into a QUERY column, I would recommend copying the values into a MyISAM table where you have a FULLTEXT index defined. You could use triggers to do this.
You don't need to concatenate the values together, you just need the primary key column and each of your searchable text columns.
CREATE TABLE OriginalTable (
original_id SERIAL PRIMARY KEY,
author_id INT,
author_date DATETIME,
summary TEXT,
body TEXT
) ENGINE=InnoDB;
CREATE TABLE SearchTable (
original_id BIGINT UNSIGNED PRIMARY KEY, -- not auto-increment
-- author_id INT,
-- author_date DATETIME,
summary TEXT,
body TEXT,
FULLTEXT KEY (summary, body)
) ENGINE=MyISAM;
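A sketch of the trigger idea mentioned above (the trigger name is made up, and you would want matching UPDATE and DELETE triggers as well):

DELIMITER //
CREATE TRIGGER OriginalTable_after_insert
AFTER INSERT ON OriginalTable
FOR EACH ROW
BEGIN
    -- mirror the searchable columns into the MyISAM table with the FULLTEXT index
    INSERT INTO SearchTable (original_id, summary, body)
    VALUES (NEW.original_id, NEW.summary, NEW.body);
END//
DELIMITER ;

-- then search with MATCH ... AGAINST instead of LIKE
SELECT original_id
FROM SearchTable
WHERE MATCH (summary, body) AGAINST ('some search terms');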
You'll want to add an index to your query column. If there is a wildcard at the beginning of the search expression, MySQL cannot use the index.
If you do any search other than "equals" (LIKE 'test') or "begins with" (LIKE 'test%'), MySQL will have to scan every row. For example, a "contains" search (LIKE '%test%') is unable to use the index.
You could allow an "ends with" search (LIKE '%test'), but you'd have to build a reversed copy of the column and index that, so the "ends with" becomes a "begins with" on the reversed value and can use the index.
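A minimal sketch of that reversed-column trick, with assumed names (some_table stands for the table holding your QUERY column, and query_col / query_col_rev for that column and its reversed copy):

ALTER TABLE some_table
    ADD COLUMN query_col_rev VARCHAR(255),
    ADD INDEX (query_col_rev);

-- keep query_col_rev = REVERSE(query_col) up to date in application code or a trigger
UPDATE some_table SET query_col_rev = REVERSE(query_col);

-- "ends with 'test'" becomes a prefix match on the reversed column, which can use the index
SELECT * FROM some_table
WHERE query_col_rev LIKE CONCAT(REVERSE('test'), '%');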
Any full scan is going to be slow, and the more rows, the slower it will be. The larger the field, the slower it will be.
You can see the limitation of using LIKE. Therefore, you might create a table called Tags, where you link individual key words to each entry rather than using the entire text, but I would still stick to "equals" and "begins with", even with tags.
Using LIKE without the aid of an index should be limited to the rare ad-hoc query or very small data sets.
No, it is not optimal, since it forces MySQL to read every row. But if your table is small (I'm not sure what < 1mn means here), it could be acceptable to some extent.
Also, you can limit the search feature. For example, some sites restrict searching to no more than one request per minute, while others force you to enter a captcha.