I'm currently developing a website which allows users to upload presentations, documents and e-books (something like Scribd and SlideShare), so I need to be able to search the files' content. I'm currently extracting the text from the files into a txt file.
I am considering 2 options as I am using MySQL:
1. Store the plain text in a separate table and use MySQL's fulltext index to search through it.
2. Use an inverted index to store words and search through them (2 new tables: words, and a many-to-many link with the documents table). In this case, what can I do about repeated words so that they add relevance to the results?
The text will only be used for searching. The problem with (1) is that the text of an e-book may be huge, so I am considering limiting it to (for example) 50 KB or less.
(2) also has a problem with the large number of words in an e-book, which, again, can be limited.
So can you guide me to the best way to index the text and be able to do fast fulltext searches? I need to get the best out of MySQL in this case.
I decided to use Sphinx as suggested by Rob Di Marco. It turns out it is the fastest (and open-source) fulltext search engine out there. I had some trouble compiling and getting SphinxSE not to crash MySQL, so I now use MariaDB, which includes the plugin.
I chose version 1.10 because of the real-time (RT) index. It means there is no need to wait for the indexer to rebuild the entire index if you just add a row. (I know about the main+delta workaround, but this is way easier to configure and use with SphinxQL.)
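Roughly what my RT setup looks like, as a sketch only -- the index, field and column names below are just examples from my own config, not anything you have to use:

    # sphinx.conf -- a minimal RT index definition
    index docs_rt
    {
        type         = rt
        path         = /var/lib/sphinx/docs_rt
        rt_field     = title
        rt_field     = content
        rt_attr_uint = user_id
    }

Then, over SphinxQL (the regular mysql client pointed at the searchd SphinxQL listener, port 9306 in the usual config):

    INSERT INTO docs_rt (id, title, content, user_id)
    VALUES (1, 'My e-book', 'the extracted plain text goes here', 42);

    SELECT * FROM docs_rt WHERE MATCH('santa christmas') LIMIT 10;

No indexer run is needed; the row is searchable as soon as the INSERT returns.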
See also Some questions related to SphinxSE and RT indexes
I use InnoDB on RDS, which unfortunately does not yet support MySQL full-text search. I'm therefore looking into alternatives. My app is on Heroku and I have considered the various add-ons that provide search capabilities, but I have a very large table of companies (~100M records) and I think they are prohibitively expensive. I only need to be able to search one field on the table -- company name.
I am therefore considering creating my own 'keyword' table. Essentially this would list every word contained in every company name. There would then be another table that shows the association between these keywords and the company_id.
Does this sound like a good idea? Are there any better alternatives?
What would be the most efficient way of creating the keyword table and the association table? I'd like to do it using T-SQL, if possible.
You can do it, and it's far better than using LIKE '%word%' queries.
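For what it's worth, the keyword approach is just ordinary DDL plus a join -- something along these lines (table and column names are only placeholders):

    CREATE TABLE keywords (
        keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        word       VARCHAR(64) NOT NULL,
        UNIQUE KEY uk_word (word)
    );

    CREATE TABLE company_keywords (
        keyword_id INT UNSIGNED NOT NULL,
        company_id INT UNSIGNED NOT NULL,
        PRIMARY KEY (keyword_id, company_id)
    );

    -- look up companies whose name contains the word 'acme'
    SELECT ck.company_id
    FROM keywords k
    JOIN company_keywords ck ON ck.keyword_id = k.keyword_id
    WHERE k.word = 'acme';

Populating it means splitting each company name into words in application code (or a stored procedure) and inserting one row per word per company.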
But it's not nearly as good as using proper fulltext indexing.
See my presentation Full Text Search Throwdown, where I compare the fulltext solutions for MySQL, including trigrams, which is approximately like the keyword solution you're considering.
The fastest solution -- by far -- was Sphinx Search.
I am working on an information retrieval system using MySQL's fulltext search in natural language mode.
The data I have is annotated according to different categories. E.g. monkey, cat and dog will be annotated as 'animals', whereas duck and sparrow as 'birds'. The problem is that I am retrieving documents based on the occurrences of these tags.
Now MySQL has a limitation: if a particular term appears in more than 50% of the rows, that term is ignored. Given my requirement, I want it to score all matching terms even if a particular term appears in more than 50% of the data.
I have read a few things about combining Sphinx with MySQL for search efficiency, but I am not sure whether this can be applied to my situation.
Please provide a solution for this problem
Sphinx is very good at very fast fulltext search. It doesn't have the 50% rule that MySQL has, but you will need to use it in place of MySQL's fulltext search. Basically, what you do is install Sphinx and set up an import to copy all your MySQL data into Sphinx. Then you can build SphinxSE, or query Sphinx directly through a client library, to get your results. You can then get the details of your results by querying MySQL.
I use SphinxSE because you can query Sphinx through MySQL and join your MySQL table to the results in a single query. It's quite nice.
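As a sketch, a SphinxSE setup looks roughly like this (the Sphinx index name 'documents' and the documents table are placeholders for whatever you actually have):

    -- SphinxSE "search table": the id, weight and query columns are required,
    -- and the engine forwards the query string to searchd
    CREATE TABLE documents_search (
        id     BIGINT UNSIGNED NOT NULL,
        weight INT NOT NULL,
        query  VARCHAR(3072) NOT NULL,
        INDEX (query)
    ) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/documents";

    -- run the fulltext search in Sphinx and join the hits back to MySQL
    SELECT d.*, s.weight
    FROM documents_search s
    JOIN documents d ON d.id = s.id
    WHERE s.query = 'monkey cat dog';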
I've looked for this question on Stack Overflow, but didn't find a really good answer to it.
I have a MySQL database with a few tables containing information about a specific product. When end users use the search function in my application, it should search across all of those tables, in specific columns.
Because the joins and the many WHERE clauses were not performing well, I created a stored procedure which splits up all the single words in these tables and columns and inserts them into a table. Each row is a combination of 'word' and 'productID'.
This table contains now over 3.3 million records.
At the moment, I can search pretty quickly if I match on the whole word, or on the beginning of the word (LIKE 'searchterm%'). That's expected, because those queries can use an index.
However, my client wants to search on partial words (LIKE '%searchterm%'). This isn't performing at all. FULLTEXT search isn't an option either, because it can only search from the beginning of a word, with a wildcard after it.
So what is the best practice for a search function like this?
While it's more work to set up, a dedicated fulltext search package like Lucene or Solr may be what you are looking for.
MySQL is not well tailored for text search; use other software for that. For example, use Sphinx to index data for text search. It will do a great job and is very simple to set up. If you use MySQL 5.1, you can use Sphinx as a storage engine (SphinxSE).
There are other servers that perform text search better than Sphinx, but they are either not free or require other software to be installed.
You can read more about: ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
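For the partial-word requirement specifically, Sphinx can index infixes, so '%searchterm%'-style matches hit an index instead of scanning. A rough sketch of the relevant sphinx.conf bits (source name, credentials and columns are obviously placeholders for your own schema):

    source products_src
    {
        type      = mysql
        sql_host  = localhost
        sql_user  = user
        sql_pass  = secret
        sql_db    = mydb
        # the first column must be the document ID
        sql_query = SELECT productID, name, description FROM products
    }

    index products
    {
        source        = products_src
        path          = /var/lib/sphinx/products
        min_infix_len = 3    # index substrings of 3+ characters
        enable_star   = 1    # allow *term* wildcard queries (needed on older Sphinx versions)
    }

After running the indexer, a query like MATCH('*term*') will match the term anywhere inside a word.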
In my MySQL db I have a user table consisting of 37,000 (or thereabouts) users.
When a user searches for another user on the site, I perform a simple wildcard LIKE (i.e. LIKE '{name}%') to return the users found.
Would it be more efficient and quicker to use a search engine such as Solr to do my 'LIKE' searches? Furthermore, I believe in Solr I can use wildcard queries (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/).
To be honest, it's not that slow at the moment using a LIKE query; however, as the number of users grows it'll become slower. Any tips or advice is greatly appreciated.
We had a similar situation about a month ago. Our database is roughly 33k records, and because our engine was InnoDB we could not use the MySQL full-text search feature (that, and it being quite blunt).
We decided to implement Sphinx (http://www.sphinxsearch.com) and we're really impressed with the results (I'm becoming quite a 'fanboy' of it).
When we did a large index search with many columns (loads of left joins) across all our rows, we actually halved the query response time compared to the MySQL 'LIKE' counterpart.
Although we haven't used it for long, if you're going to build for future scalability I'd recommend Sphinx.
You can speed things up by requiring the search word to have a minimum of 3 characters before the search starts, and by indexing your search column with an index prefix length of 3 characters.
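If that's the route you take, the prefix index would look something like this (table and column names assumed):

    -- index only the first 3 characters of the search column
    ALTER TABLE words ADD INDEX idx_word_prefix (word(3));

    -- queries anchored at the start of the word can then use the index
    SELECT productID FROM words WHERE word LIKE 'sea%';

Note this only helps LIKE 'term%' lookups; it does nothing for LIKE '%term%'.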
It's actually already built into MySQL: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
We're using Solr for this purpose, since you can search in 1-2 ms even with millions of documents indexed. We're mirroring our MySQL instance with the Data Import Handler and then we search on Solr.
As Neville pointed out, full-text search is built into MySQL, but Solr's performance is way better, since it was born as a full-text search engine.
Which techniques would you use to implement a search for content in a column of a very big table in MySQL? Say, for instance, that you have 10,000,000 emails stored in a table in the database and would like to implement a subject search that would let users search for one or more words present in the email subject. If the user searched for "christmas santa", you should find emails with subjects like "Santa visits us this christmas" and "christmas, will santa ever show".
My idea is to process all the words in the subjects (strip all numbers, special characters, commas, etc.) and save each word in an index table, where I have a unique index on the word column. Then I would link that to the email table via a many-to-many relationship table.
Is there a better way to perform wildcard searches on very big tables ?
Are there databases that natively support this kind of search?
You could use FULLTEXT indexes if you are using MyISAM as the storage engine. However, MySQL in general is not very good with text search.
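If you do stay on MyISAM, the setup is just an index on the column -- the table layout here is only an example:

    CREATE TABLE emails (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        subject VARCHAR(255) NOT NULL,
        FULLTEXT KEY ft_subject (subject)
    ) ENGINE=MyISAM;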
A much better option would be to go with a dedicated text indexing solution such as Lucene or Sphinx. Personally I'd recommend Sphinx - it has great integration with PHP and MySQL and is very, very fast (can be used to speed up even ordinary queries - performs very fast grouping and ordering).
Wikipedia has a nice list of different indexing engines - here.
MySQL's MyISAM tables support a FULLTEXT index, which helps in this kind of search.
But it's not the speediest technology available for this kind of search. And you can't use it on data stored in InnoDB tables.
I've heard some good things about Sphinx Search, but I haven't used it yet.
Here's another blog about Sphinx: http://capttofu.livejournal.com/13037.html
While a MySQL fulltext index is possible, I suspect I would look at using something designed to be a search engine, like Lucene.
This sounds like a full text search, which SQL Server supports.
But your idea is generally sound. You're effectively computing an "index" on your table in advance to speed up searches.
You want to look at the MATCH...AGAINST function.
See, for example: Using MySQL Full-text Searching
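Assuming a FULLTEXT index on the subject column, the query would look roughly like this:

    -- boolean mode: the + prefix requires each word to be present
    SELECT id, subject
    FROM emails
    WHERE MATCH(subject) AGAINST('+christmas +santa' IN BOOLEAN MODE);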
check "full text search" in MySQL docs (AFAIK, all current DBMS support this)