What is happening as my Sphinx search server warms up?

I have Sphinx Search running on a Linux server with 38GB of RAM. The Sphinx index contains 35M full-text documents plus metadata indexed from a MySQL table. When I launch a fresh server, I run a script that "warms up the Sphinx cache" by sending my 10,000 most common queries through it. It takes about an hour to run the warm-up script the first time, but the same script completes in just a few minutes if I run it again.
My confusion arises from the fact that Sphinx doesn't have any documented caching, other than a file-based cache that I am not using. The index is loaded into memory when Sphinx starts, but individual queries take the same length of time each time they are run once the system has been "warmed up".
There is a clear warm-up period when I run my script. What is going on? Is Linux caching something that helps Sphinx run faster? Does the underlying MySQL system cache queries (I believe Sphinx is basically a custom MySQL storage engine)? How are new queries that have never been run before being made faster by whatever is going on?
I realize there is likely a very complex explanation for this, but even a little direction should help me dig deeper.

(I believe Sphinx is basically a custom MySQL storage engine)
SphinxSE is a 'fake' storage engine - fake because it doesn't store any data, but rather takes requests for data from its 'table' and proxies them back to a running searchd instance in the background.
searchd itself doesn't have any caching - but as mentioned, as the indexes are read, the OS may well start caching the files, so reads don't have to go all the way back to disk.
If you are using SphinxSE, then queries may be cached by the normal MySQL query cache - so whole result sets are cached. But in addition, the usual way to use SphinxSE is to join the search results back with the original dataset, so you get both returned to the app in one go. So your queries also depend on the real MySQL data tables, and those are subject to the same OS caching - as MySQL reads data, it will be cached.
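To make that concrete, here is a minimal sketch of the usual SphinxSE pattern - querying the SphinxSE 'table' (which proxies to searchd) and joining the matched IDs back to the real data table. The table and column names (t1, documents) are illustrative, not taken from the question:

    <?php
    // Query the SphinxSE "table" and join the matching document IDs
    // back to the real MySQL data table. Names are hypothetical.
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=test', 'user', 'pass');
    $sql = "SELECT docs.id, docs.title
            FROM t1
            JOIN documents docs ON docs.id = t1.id
            WHERE t1.query = 'search terms;mode=any'";
    foreach ($pdo->query($sql) as $row) {
        echo $row['title'], "\n";
    }

Both halves of this - the searchd index lookup and the join against the real table - can be served from the OS page cache once it is warm, which is consistent with the warm-up behaviour you are seeing.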
When I launch a fresh server
That suggests you are using a VM? If so, the virtual disk might actually be located on a remote SAN (or EBS on Amazon EC2),
which means loading a large Sphinx index via that route might well be slow.
Depending on where your VM is hosted, you might be able to get special high-performance disks - ideally local to the host, maybe even SSD - which may well help.
Anyway, to trace the issue you should almost certainly enable the Sphinx query log. Look into that to see if queries are executing slowly there. There is also a startup option to searchd where you can enable iostats, which will log additional IO statistics to the query log as queries are run. This can give you additional insights.
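For reference, the query log is a single directive in the searchd section of sphinx.conf - a minimal excerpt (the path and port are just examples):

    searchd
    {
        listen    = 9312
        query_log = /var/log/sphinx/query.log
    }

Starting the daemon with searchd --iostats then adds per-query IO counters to that same log.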

Sphinx doesn't cache your queries, but the file system does. So yes, queries execute faster the second time they are run.

Related

Reduced performance in MySQL after upgrade to 5.6.27

Our application was using MySQL version 4.0.24 for a long time. We are trying to migrate it to version 5.6.27.
But, on testing the performance on 5.6.27, even simple selects and updates are 30-40% slower under load testing. The CPU and IO speeds are much better than on the older server. The storage engine of the tables is MyISAM in both versions. There's only one connection to the database. We tried the following options:
Changing the storage engine to InnoDB - this reduced performance drastically (70% slower)
Changing the InnoDB log size and buffer size - didn't help much
Increasing the key buffer size with the MyISAM storage engine - it made no difference
We tried modifying other parameters like the query cache, tmp_table_size and heap_table_size, but none of them made any difference.
Can you please let me know if there's any other option that we can try?
Here's a copy of my.cnf:
lower-case-table-names=1
myisam-recover=FORCE
key_buffer_size=2000M
Some things you can look at: whether the two servers have the same amount of RAM - it may be that the old server has more RAM and so can cache more things in memory.
You can also look at how you are connecting to the MySQL server - is it over a network? Is the network speed or quality different? Is one server accessed locally and the other over a network?
You tried tuning some good parameters, but in case there are ones you're missing, you can run mysql -e 'show variables' on both servers and then use a tool like WinMerge to compare the two sets of values and see what might be different that you might not have thought of.
After trying multiple options, here are the configurations that worked for us (there may be other solutions as well, but these worked for our system):
Option 1:
Turned off the performance schema
Added a couple of JDBC connection parameters: useConfigs=maxPerformance&maintainTimeStats=false
Option 2:
Migrate to MariaDB. From a performance perspective, this worked out really well - it gave about 5% better performance compared to MySQL for our system. But we couldn't pursue this option due to non-technical reasons.
Thank you for your inputs.

Options for speeding up slow SQL queries

We're having issues with a few queries - relatively simple queries - that take too long to process: everything from 3,000ms to 30,000ms. We are using PHP 5.5 and MySQL 5.5.28-29.1.
We have a few options, but I am posting here to see if anyone has any experience on each of them:
Currently we are accessing views to get our data; this was done to move the processing load from PHP to MySQL. Would accessing the tables directly improve query processing speed? I'm thinking not, because it would lead to a lot more queries, due to the fact that the views are just collations of data.
If we were to install a cache DB, such as SQLite3, to cache data locally and then sync it to the RDBMS, how would we do that? And would the speed improve?
We're thinking about a NodeJS version as well, using Node WebKit. As far as I can understand, there are npm packages out there that can act as a cache or a DB connection, which would rule out the need for PHP. But how about the speed?
Another option is to set up a dedicated server for this environment (we're using a virtual server environment at the moment), which would most likely speed some parts of it up. But if MySQL is still slow on that server, it's kind of wasted.
These are the alternatives I can think of at the moment. Any suggestions are appreciated.
(I can post the slow SQL queries if need be, but would like to see if anyone has anything to say about our options first)

Using Redis to cache SQL result

I have a SQL-based application and I would like to cache the results using Redis. You can think of the application as an address book with multiple SQL tables. The application performs the following tasks:
40% of the time:
Create a new record / Update an existing record
Bulk update multiple records
Review an existing record
60% of the time:
Search records based on user's criteria
This is my current approach:
The system caches a record when it is created or updated.
When user performs a search, the system will cache the query result.
On top of that, I have a Redis look-up table (a Redis Set) which stores the MySQL record ID and the Redis cache keys. That way I can delete the Redis caches if the MySQL record has been changed (e.g., bulk update).
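A minimal sketch of that look-up-table idea, using the phpredis extension (the key naming scheme, the 1-hour TTL and the helper names are illustrative choices, not from the question):

    <?php
    // Cache a search result and index it by the record IDs it contains,
    // so the caches can be dropped when any of those records change.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    function cacheSearch(Redis $redis, $cacheKey, array $rows)
    {
        $redis->setex($cacheKey, 3600, serialize($rows));
        foreach ($rows as $row) {
            // record:{id}:keys is the Redis Set acting as the look-up table
            $redis->sAdd('record:' . $row['id'] . ':keys', $cacheKey);
        }
    }

    // On update (or bulk update), delete every cached result set
    // that contains the changed record.
    function invalidateRecord(Redis $redis, $id)
    {
        $keys = $redis->sMembers('record:' . $id . ':keys');
        if ($keys) {
            $redis->del($keys);
        }
        $redis->del('record:' . $id . ':keys');
    }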
What if a new record is created after the system caches a search result? If the new record matches the search criteria, the system will keep returning the old cache (which does not include the new record) until the cache is deleted - which won't happen until an existing record in the cache is updated.
The search is driven by the users, and the combinations of search conditions are countless. It is not possible to evaluate which caches should be deleted when a new record is created.
So far, the only solution is to remove all caches of a MySQL table when a record is created. However, this is not a good choice because lots of records are created daily.
In this situation, what's the best way to implement Redis on top of MySQL?
Here's a surprising thing when it comes to PHP and MySQL (I am not sure about other languages) - not caching stuff into memcached or Redis is actually faster. Much faster. Basically, if you just built your app and queried MySQL - you'd get more out of it.
Now for the "why" part.
InnoDB, the default engine, is a superb engine. Specifically, its memory management (allocation and whatnot) is superior to that of any in-memory storage solution. That's a fact - you can look it up or take my word for it; it will, at least, perform as well as Redis.
Now what happens in your app: you query MySQL and cache the result into Redis. However, MySQL is also smart enough to keep results cached itself. What you just did is create an additional file descriptor that's required to connect to Redis. You also used some storage (RAM) to cache a result that MySQL had already cached.
Here comes another interesting part - the preferred way of serving PHP scripts is php-fpm; it's much quicker than any mod_* crap out there. Down to the core, php-fpm is a supervisor process that spawns child processes. These don't shut down after a script is served, which means they can cache connections to MySQL - connect once, use many times. So if you serve scripts using php-fpm, they will reuse an already established connection to MySQL, meaning you won't be opening and closing connections for each request. This is extremely resource-friendly and gives you lightning-fast connections to MySQL. MySQL, being memory-efficient and having the result cached, is much quicker than Redis.
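For illustration, a persistent connection is a single flag on the PDO constructor (the DSN and credentials below are placeholders); under php-fpm the handle survives between requests served by the same worker:

    <?php
    // The worker process keeps this connection open after the request
    // ends, so the next request it serves skips the connect handshake.
    $pdo = new PDO(
        'mysql:host=127.0.0.1;dbname=app;charset=utf8',
        'user',
        'pass',
        array(PDO::ATTR_PERSISTENT => true)
    );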
Now what does all of this mean for you? With a proper setup you get code that's small, simple and easy, doesn't involve Redis, eliminates all the problems you might otherwise have with cache invalidation, and doesn't waste memory holding the same data twice.
Ingredients you need for this to work:
php-fpm
MySQL with InnoDB-based tables and, most of all, sufficient RAM and a tuned innodb_buffer_pool_size variable. That variable controls how much RAM InnoDB is allowed to allocate for its purposes - the larger the better.
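As a rough sketch only - on a dedicated database box with, say, 32GB of RAM, that could look like this in my.cnf (the right number depends on what else runs on the machine):

    [mysqld]
    # Common rule of thumb on a dedicated server: give InnoDB
    # roughly 70-80% of physical RAM for its buffer pool.
    innodb_buffer_pool_size = 24G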
You've eliminated Redis from the game, kept your code simple and easy to maintain, avoided duplicating data, avoided introducing an additional system into play, and let software that's meant to take care of data do its job. A pretty cheap trade-off for maximum usefulness - even if you compile all the software from scratch, it won't take more than an hour or so to get it up and running.
Or, you can just ignore what I wrote and look for a solution using Redis.
We met the same problem and chose to do the same thing you are thinking of: remove all query caches affected by the table. It is not ideal, as you said, but fortunately our "write" load is not as high as 40%, so it's OK so far.
That's the nature of query-based caching. As an alternative, you can add entity-based caching: instead of caching only the search result, cache the entire table and do the search in memory. We use C# LINQ, so we can run fairly common queries in memory, but if the search is too complicated you are out of luck.
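A rough PHP equivalent of that entity-based approach, for illustration (the records:all key, the 5-minute TTL, the records table and the name filter are all made up):

    <?php
    // Entity-based caching: cache the whole table once, then filter
    // in application code. Only sensible for small-to-medium tables
    // and simple search criteria.
    function searchRecords(Redis $redis, PDO $pdo, $nameFilter)
    {
        $cached = $redis->get('records:all');
        if ($cached === false) {
            $rows = $pdo->query('SELECT * FROM records')->fetchAll();
            $redis->setex('records:all', 300, serialize($rows));
        } else {
            $rows = unserialize($cached);
        }
        return array_filter($rows, function ($r) use ($nameFilter) {
            return stripos($r['name'], $nameFilter) !== false;
        });
    }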

Which would be more efficient, having each user create a database connection, or caching?

I'm not sure if caching would be the correct term for this, but my objective is to build a website that will display data from my database.
My problem: There is a high probability of a lot of traffic and all data is contained in the database.
My hypothesized solution: would it be faster if I created a separate program (in Java, for example) to connect to the database every couple of seconds and update the HTML files (where the data is displayed) with the new data? (This would also increase security, as users would never connect to the database directly.) Or should I just have each user create a connection to MySQL (using PHP) and get the data?
If you've had any experience in a similar situation please share, and I'm sorry if I didn't word the title correctly - this is a pretty specific question and I'm not even sure I explained myself clearly.
Here are some thoughts for you to think about.
First, I do not recommend that you create files; trust MySQL, but work on configuring your environment to support your traffic/application.
You should understand your data a little more (How often does the data in your tables change? What kind of queries are you running against the data? Are your queries optimized?).
Make sure your tables are optimized and indexed correctly. Make sure all your queries run fast (nothing causing long row locks).
If your tables are not updated very often, you should consider using the MySQL query cache, as this will reduce your IO and increase query speed. (BUT wait! If your tables are updated all the time, this will kill your server's performance big time.)
If your query cache is set to "ON": based on my experience this is almost always a bad idea unless the data in your tables rarely changes. When it is set to "ON", MySQL will cache every query. Then, as soon as the data in a table changes, MySQL has to invalidate the cached queries for it - it works harder clearing up the cache, which gives you bad performance. I like to keep it set to "DEMAND";
from there you can control which queries should be cached and which should not, using SQL_CACHE and SQL_NO_CACHE.
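For example (a sketch - the table names are placeholders): with query_cache_type = 2 (DEMAND) set in my.cnf, only statements that opt in are cached:

    <?php
    // Assumes my.cnf contains: query_cache_type = 2   (i.e. DEMAND)
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'pass');

    // Worth caching: a hot read on a table that rarely changes.
    $countries = $pdo->query(
        'SELECT SQL_CACHE id, name FROM countries'
    )->fetchAll();

    // Not worth caching: a read on a table that changes constantly.
    $latest = $pdo->query(
        'SELECT SQL_NO_CACHE * FROM orders ORDER BY id DESC LIMIT 10'
    )->fetchAll();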
Another thing you want to review is your server configuration and specs.
How much physical RAM does your server have?
What type of hard drives are you using? If not SSD, at what speed do they rotate - perhaps 15k?
What OS are you running MySQL on?
How is the RAID set up on your hard drives? RAID 10 or RAID 50 will help you out a lot here.
Your processor speed will make a big difference.
If you are not using MySQL 5.6.20+, you should consider upgrading, as MySQL has been improved to help you even more.
Is your innodb_buffer_pool_size set to around 75% of your total physical RAM? Are you using InnoDB tables?
You can also use MySQL replication to increase the read capacity. With multiple servers holding the same data, you can point half of your traffic to read from server A and the other half from server B, so the same work is handled by multiple servers.
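A bare-bones sketch of that read split in PHP (hostnames are placeholders; production setups usually do this with a proxy or in the driver, and replica reads can be slightly stale due to replication lag):

    <?php
    // Writes go to the master; reads are spread across two replicas.
    $master   = new PDO('mysql:host=db-master;dbname=app', 'user', 'pass');
    $replicas = array('db-replica-a', 'db-replica-b');
    $reader   = new PDO(
        'mysql:host=' . $replicas[array_rand($replicas)] . ';dbname=app',
        'user',
        'pass'
    );

    $master->exec("UPDATE accounts SET hits = hits + 1 WHERE id = 42");
    $row = $reader->query('SELECT * FROM accounts WHERE id = 42')->fetch();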
Here is one argument for you to think about: Facebook uses MySQL and gets millions of hits per second, yet they are up virtually 100% of the time. True, they have an enormous budget and a huge network, but the idea here is to trust MySQL to get the job done.

Flush InnoDB cache

I have some reporting queries that are rarely run, which I need to be performant without relying on them being cached anywhere in the system. In testing various schema and sproc changes I'll typically see the first run be very slow and subsequent runs fast, so I know there's some caching going on that's making it cumbersome to test changes. Restarting mysqld or running several other large queries are the only reliable ways to reproduce it. I'm wondering if there's a better way.
The MySQL Query Cache is turned OFF.
Monitoring the disk, I don't see any reads happening except on the first run. I'm not that familiar with the disk cache, but I would expect that if that's where the caching were happening I'd still see disk reads - they'd just be very fast.
MONyog gives me what I think is definitive proof: the InnoDB cache hit ratio. Monitoring it, I see that when the query is fast it's hitting the InnoDB buffer pool, and when it's slow it's hitting disk.
On a live system I'll gladly let InnoDB do this, but for development and test purposes I'm interested in worst case scenarios.
I'm using MySQL 5.5 on Windows Server 2008 R2.
I found a post on the Percona blog that says:
For MySQL Caches you can restart MySQL and this is the only way to clean all of the caches. You can do FLUSH TABLES to clean MySQL table cache (but not Innodb table meta data) or you can do “set global key_buffer_size=0; set global key_buffer_size=DEFAULT” to zero out key buffer but there is no way to clean Innodb Buffer Pool without restart.
In the comments he goes on to say:
Practically everything has caches. To do real profiling you need to profile real query mix which will have each query having appropriate cache/hit ratio not running one query in the loop and assuming results will be fine.
I guess that sums it up. It does make it hard to test individual queries. My case is that I want to try forcing different indices to make sure the query planner is picking the right one, and apparently I'll have to restart MySQL between tests to take the cache out of the equation!
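For what it's worth, the comparison harness itself is simple - it's only the buffer pool state that can't be reset without a restart. A sketch (the table, index names and date filter are placeholders):

    <?php
    // Time the same reporting query with different forced indexes.
    // Restart mysqld between runs to start from a cold buffer pool.
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=reports', 'user', 'pass');
    foreach (array('idx_created_at', 'idx_customer') as $idx) {
        $t0 = microtime(true);
        $pdo->query(
            "SELECT COUNT(*) FROM orders FORCE INDEX ($idx)
             WHERE created_at >= '2015-01-01'"
        )->fetchColumn();
        printf("%s: %.3f s\n", $idx, microtime(true) - $t0);
    }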