Solr for indexing one table vs. a regular MySQL InnoDB table buffer - mysql

I'm planning to use Solr for search over one MySQL table; this table will have millions of records.
I normalized this table so it can be indexed and searched by Solr for better performance.
Is this the right approach, or should I search the MySQL table itself, i.e. with a normal SELECT?
Please advise which is better, since it's just one table to be indexed and searched.
Thanks

If your requirement is to "search" only one MySQL table using regular SQL SELECT statements, then Apache Solr seems a bit roundabout and overkill.
With Solr, you'd need to periodically import data from the MySQL table into the Solr schema, which essentially means you'll be maintaining two copies of the data. There's also a delay between the time data is changed in the MySQL table and when the Solr schema is refreshed from the database.
Aside from that, as far as performance goes, I'd expect Solr to be nearly as fast at searching its schema as MySQL would be at searching its own table. But you'd really need to test that to make the determination for your particular requirements.
You'd likely want to make use of the Solr "delta import", specifying appropriate queries to identify rows in the MySQL table that have been inserted and updated since the last import.
If this is the only usage of the MySQL table, then the only indexes you would really need on the MySQL table, as far as Solr search/import performance is concerned, would be those needed to optimize the two "delta import" queries.
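As a rough sketch, the two delta-import queries might look something like this (the products table and its columns are hypothetical; the ${dataimporter...} placeholders are the ones the DataImportHandler substitutes at import time):
-- deltaQuery: find the primary keys of rows changed since the last import
SELECT id FROM products
WHERE last_modified > '${dataimporter.last_index_time}';

-- deltaImportQuery: fetch the full row for each changed primary key
SELECT id, name, description FROM products
WHERE id = '${dataimporter.delta.id}';
An index on last_modified (plus the primary key on id) is what keeps those two queries cheap.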

Try using a third-party search service like www.rockitsearch.com. It is an out-of-the-box search solution, so there is no need to worry about maintaining a Solr cluster.

Related

How can I improve MySQL database performance?

So I have a MySQL database in my project.
I have a main table that holds the main data, with updates and inserts.
There is heavy traffic on this data; mainly I am reading a .csv file and inserting the rows into the table.
Everything works fine for about 3 days, but once the table goes above 20 million records the database starts responding slowly, and at 60 million it is even slower.
What have I done so far?
I have added indexes where I think they are needed (on the WHERE-clause fields, for fast searching).
I don't think query optimization is the issue, because the database works fine for the first 3 days and only slows down as the table fills up; at 60 million records it is slower still.
Can you suggest an approach for handling this?
What should I do? Should I move the data out every 3 days, or what? What have you done in such a situation?
The purpose of a database is to store huge amounts of information. I don't think the problem is the database itself; it is more likely poor queries, joins, the database buffer, indexes, or the cache. These are the usual reasons for slow responses. For more info check this link.
I have added indexes where I think they are needed
Yes, indexes improve the performance of SELECT queries, but at the same time they degrade your DML operations, since the index has to be restructured whenever you change an indexed column.
Now, this depends entirely on your business needs: whether you need the index or not, and whether you can compromise on SELECT or on DML performance.
Currently, many industries use two different schemas: OLAP for reporting and analytics, and OLTP to store real-time data (including some real-time reporting).
First of all, it would help us to know what kind of data you want to store.
Normally it makes no sense to store such a huge amount of data in only 3 days, because no one will ever be able to use it effectively. So it is better to reduce the data before storing it in the database.
e.g.
If you get measured values from a device that delivers one value per millisecond, ask yourself whether any user will ever ask for a specific value at a specific millisecond, or whether it makes more sense to store the average value per second, minute, hour, or perhaps per day.
If you really need the milliseconds, but only when the user takes a deeper look, you can derive a table from the main table that contains only the average values per hour or day (or whatever granularity fits) and work with that table. Only when the user drills into the "milliseconds" view do you use the main table and have to live with its worse performance.
All of this is of course only possible if the database data is read-only. If the data in the database is changed by the application (and not only appended by the CSV import), then using more than one table will be error-prone.
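A minimal sketch of that aggregation idea, assuming a hypothetical measurements table with device_id, measured_at and value columns:
-- summary table with one row per device and hour (hypothetical schema)
CREATE TABLE measurements_hourly (
  device_id INT NOT NULL,
  hour_start DATETIME NOT NULL,
  avg_value DOUBLE NOT NULL,
  PRIMARY KEY (device_id, hour_start)
);

-- refresh the summary, e.g. after each CSV import
REPLACE INTO measurements_hourly (device_id, hour_start, avg_value)
SELECT device_id,
       DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00'),
       AVG(value)
FROM measurements
GROUP BY device_id, DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00');
Queries and reports then hit measurements_hourly, and only the "milliseconds" view touches the big raw table.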
Which operation do you want to speed up?
Insert operations
A good way to speed these up is to insert records in batches. For example, insert 1000 records in each INSERT statement:
insert into test values (value_list),(value_list)...(value_list);
Other operations
If your table has tens of millions of records, everything will slow down. This is quite common.
To speed things up in this situation, here is some advice:
Optimize your table definition. This depends on your particular case; creating indexes is a common approach.
Optimize your SQL statements. A good SQL statement will run much faster, and a bad one can be a performance killer.
Data migration. If only part of your data is used frequently, you can move the infrequently used data to another big table (see the sketch after this list).
Sharding. This is a more complicated approach, but it is commonly used in big-data systems.
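For the data-migration point, a hedged sketch (the events table and created_at column are invented names) would be to move old rows into an archive table:
-- create an archive table with the same structure
CREATE TABLE events_archive LIKE events;

-- move rows older than 3 days into the archive
INSERT INTO events_archive
SELECT * FROM events WHERE created_at < NOW() - INTERVAL 3 DAY;

DELETE FROM events WHERE created_at < NOW() - INTERVAL 3 DAY;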
For the .csv file, use LOAD DATA INFILE ...
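For example (the file path, table name, and column list here are made up; adjust them to your setup):
-- bulk-load the CSV directly instead of inserting row by row
LOAD DATA LOCAL INFILE '/path/to/data.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(col_a, col_b, col_c);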
Are you using InnoDB? How much RAM do you have? What is the value of innodb_buffer_pool_size? That may not be set right -- based on queries slowing down as the data increases.
Let's see a slow query. And SHOW CREATE TABLE. Often a 'composite' index is needed. Or reformulation of the SELECT.
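As an illustration (the table, column, and index names here are hypothetical), you can check the buffer pool setting and add a composite index that matches a slow query's WHERE clause:
-- how much memory InnoDB may use to cache data and indexes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- a 'composite' index covering both columns used in the WHERE clause
ALTER TABLE my_table ADD INDEX idx_status_created (status, created_at);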

Does using UNION or JOINs in a query cause slower performance in Sphinx?

I want to use Sphinx with MySQL for my current project.
I will use MyISAM as the storage engine, as this DB is going to be read-only, with 10-25 million records.
So I would like to know:
Does using UNION or JOINs in a query cause performance issues in Sphinx?
I am about to design the database, and if UNION/JOINs will cause slower performance then I can go for a design optimized for Sphinx.
Maybe something like creating one big table with all the fields and data, and then creating separate indexes in Sphinx depending on the data to be searched.
Please guide me in the right direction.
Thanks for your time.
Sphinx can't do joins anyway. It can do unions, by just searching multiple indexes at once.
Or do you mean for building the Sphinx index (i.e. in sql_query)? The indexer only runs the queries to build the indexes in the first place.
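For example, the sql_query the indexer runs can already do the joins up front, once, at index time, so Sphinx itself never has to (the table and column names below are made up):
-- the source query Sphinx's indexer runs to build the index
SELECT p.id, p.title, p.body, c.name AS category, a.name AS author
FROM posts p
JOIN categories c ON c.id = p.category_id
JOIN authors a ON a.id = p.author_id;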
As you say the data is read-only, hence no updates, the indexes should never need rebuilding, so it doesn't really matter how slow those queries are.
In general a Sphinx index will perform very similarly regardless of how many fields it has, so you shouldn't need to split it into different indexes. Just have one multi-purpose index (if that's possible).
You can however shard the index into pieces, so it can be distributed to multiple servers if performance becomes an issue.

Find the least recently used MySQL table index

I am cleaning up duplicate indexes from (InnoDB) tables in a MySQL database. The database has about 50 tables.
While I can check for duplicate keys using pt-duplicate-key-checker, I was wondering if there is a tool that could help me find the least recently used indexes across all the tables.
For example, if table "A" has three indexes defined, "index_1", "index_2" and "index_3", and "index_3" is used least frequently (say it is used in only 1 of every 10,000 queries made on the table), then the output of the script or tool should be "index_3".
Is there a good way or a tool that could help me run this analysis on the database?
Thanks in advance.
Starting with MySQL 5.6, the performance_schema instruments table io, and computes aggregated statistics by table, and by index.
See table performance_schema.table_io_waits_summary_by_index_usage:
http://dev.mysql.com/doc/refman/5.6/en/table-waits-summary-tables.html#table-io-waits-summary-by-index-usage-table
Finding the least recently used index would require times and timestamps.
What the performance schema measures is a count of I/O against each index, so it can be used to find the least often used index, which in practice should be pretty close.
Full example here:
http://sqlfiddle.com/#!9/c129c/4
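For reference, a query along these lines lists indexes ordered by how rarely they are touched (this is a sketch based on the 5.6 performance_schema columns; rows with a NULL index name represent table scans):
-- least-used indexes first, system schemas excluded
SELECT object_schema, object_name, index_name, count_star
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE index_name IS NOT NULL
  AND object_schema NOT IN ('mysql', 'performance_schema')
ORDER BY count_star ASC;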
Note: This question is also duplicated in
https://dba.stackexchange.com/questions/25406/find-least-recently-used-mysql-table-index

Document-oriented DBMS as primary DB and an RDBMS as secondary DB?

I'm having some performance issues with a MySQL database due to its normalization.
Most of my applications that use the database need to do some heavy nested queries, which in my case take a lot of time. Queries can take up to 2 seconds to run with indexes; without indexes, about 45 seconds.
A solution I came across a few months back was to use a faster, flatter document-based database, in my case Solr, as the primary database. As soon as something changed in the MySQL database, Solr was notified.
This worked really well. All queries against the Solr database took only about 3 ms.
The numbers look good, but I'm having some problems.
Huge database
The MySQL database is about 200 MB; the Solr DB contains about 1.4 GB of data.
Each time I need to change a table/column, the database needs to be reindexed, which in this example took over 12 hours.
It is difficult to render both a Solr object and an Active Record (MySQL) object without getting wet.
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
Like this.
# Controller
@song = Song.first
# View
@song.artist.urls.first.service.name
The problem in my case is that the data returned from Solr is flat, like this:
{
  id: 123,
  song: "Waterloo",
  artist: "ABBA",
  service_name: "Groveshark",
  urls: ["url1", "url2", "url3"]
}
This forces me to build an active record object that can be passed to the view.
My question
Is there a better way to solve the problem?
Some kind of super-fast, read-only primary database that can handle complex queries quickly would be nice.
Solr individual fields update
About reindexing everything on schema change: Solr does not support updating individual fields yet, though there is a JIRA issue about this that's still unresolved. However, how often do you change the schema?
MongoDB
If you can live without an RDBMS (without joins, schemas, transactions, foreign key constraints), a document-based DB like MongoDB
or CouchDB would be a perfect fit (here is a good comparison between them).
Why use MongoDB:
the data is in its native format (you can use an ORM mapper like Mongoid directly in the views, so you don't need to adapt your records as you do with Solr)
dynamic queries
very good performance on non-full text search queries
schema-less (no need for migrations)
built-in, easy-to-set-up replication
Why use Solr:
advanced, very performant full-text search
Why use MySQL:
joins, constraints, transactions
Solutions
So, the solutions (combinations) would be:
Use MongoDB + Solr
but you would still need to reindex all on schema change
Use only MongoDB
but drop support for advanced full-text search
Use MySQL in a master-slave configuration, and balance reads across the slave(s) (using a plugin like Octopus) + Solr
setup complexity
Keep current setup, denormalize data in MySQL
messy
Solr reindexing slowness
The MySQL database is about 200 MB; the Solr DB contains about 1.4 GB of data. Each time I need to change a table/column, the database needs to be reindexed, which in this example took over 12 hours.
Reindexing a 200 MB DB in Solr SHOULD NOT take 12 hours! Most probably you also have other issues, like:
MySQL:
n+1 issue
indexes
Solr:
commit after each request -- this is the default setup if you use a plugin like Sunspot, but it's a performance killer in production
From http://outoftime.github.com/pivotal-sunspot-presentation.html:
By default, Sunspot::Rails commits at the end of every request that updates the Solr index. Turn that off.
Use Solr's autoCommit functionality. That's configured in solr/conf/solrconfig.xml.
Be glad for assumed inconsistency. Don't use search where results need to be up-to-the-second.
other setup issues (http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance)
Look at the logs for more details
Instead of pushing your data into Solr to flatten the records, why don't you just create a separate table in your MySQL database that is optimized for read-only access?
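A hedged sketch of that idea (the table and column names are invented): build a flat, read-only table from the normalized ones and refresh it whenever the source data changes:
-- flat read-only table built from the normalized schema
CREATE TABLE songs_flat AS
SELECT s.id, s.title AS song, a.name AS artist, sv.name AS service_name
FROM songs s
JOIN artists a ON a.id = s.artist_id
JOIN services sv ON sv.id = s.service_id;

-- add the indexes the read queries need
ALTER TABLE songs_flat ADD PRIMARY KEY (id), ADD INDEX idx_artist (artist);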
Also, you seem to contradict yourself:
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
The problem in my case is that the data being returned from Solr is flat... This forces me to build a fake active record object that can be rendered by the view.

Do any third-party search engines (fulltext search and so on) work fine with InnoDB tables?

I know that InnoDB tables do not support fulltext searches yet. So I thought of using a third-party search engine like Solr, Xapian, or Whoosh. Do those third-party tools work as well with InnoDB tables as they do with MyISAM tables? I need to find e.g. spelling suggestions, similar strings, and so on...
You could use Solr/Lucene to do the fulltext search over your DB data. Since my MySQL DB is too big for a fast fulltext search, I decided to combine MySQL and Solr/Lucene.
It's important to know that Solr/Lucene is not a MySQL plugin, so you will not be able to search the fulltext index using typical MySQL SQL statements. A fulltext search initiated by the application should first send the request to the third-party fulltext index (Solr), which returns the primary keys of the related documents. The second step is to run an SQL statement against your MySQL InnoDB table with a WHERE clause containing the corresponding primary keys from the Solr result set.
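As a sketch of that second step (the articles table is hypothetical, and the IDs stand in for whatever primary keys Solr returned):
-- step 2: fetch the full rows for the primary keys Solr returned
SELECT id, title, body
FROM articles
WHERE id IN (17, 42, 108);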
That solution works very well in my case, and it is much, much faster (and better) than a typical MySQL MyISAM fulltext search.
As an alternative, you could not only index the data in Solr but also store it there. In that case, Solr is able to return the full text itself, so you don't need to get the data from the database as in the example above.
Do those third-party tools work as well with InnoDB tables as they do with MyISAM tables?
Absolutely. Solr has a DataImportHandler. There you define an SQL statement in order to get the data you'd like to index in Solr, like: select * from MyTable;
But keep in mind: right now (as far as I know) there is no MySQL Solr plugin available. The cooperation between Solr and MySQL has to be handled by the application.
Third-party fulltext search engines typically copy data returned by a MySQL query, and use it to populate their search index. There's no difference between MyISAM and InnoDB data sources in this respect.
I gave a presentation Practical Full-Text Search in MySQL a few years ago. You might find it interesting.
Sphinx supports its own index and just takes data from MySQL on a timely basis by issuing a query.
It is not even aware of the underlying table structure and as long as the query runs and returns the results, it's OK for Sphinx.
Other third party engines work in a similar way.