Store common queries on disk with Mysql and windows - mysql

I have a Huge person database and do common search with name on it.
SELECT * FROM tbl_person WHERE full_name LIKE 'Sparow%Jack%';
SELECT * FROM tbl_person WHERE full_name LIKE 'Sparow%';
I rarely insert new data in this table.
I want to store common last_name queries on hark disk, queries already stored in ram but I loose it all each time the server reboot.
I have 1.7Billions row in my table and each row (with index) take 1k, yes it's a 1.7Tb database.
It's the main reason why I want to stored common select on disk.
Variable_name,Value
query_alloc_block_size,8192
query_cache_limit,1048576
query_cache_min_res_unit,1024
query_cache_size,4294966272
query_cache_type,ON
query_cache_wlock_invalidate,OFF
query_prealloc_size,8192
Edit :
SELECT * FROM tbl_person WHERE full_name LIKE 'Savard%';
take 1000 sec to execute first time and 2 sec after.
If I reboot the system and execute again, the query take 1000 sec again.
I simply want to avoid mysql take another 1000 sec runing the same query I already do before reboot.

Why not consider something like Redis for caching?
It's an in memory data store and it's very popular right now. Sites using Redis:
http://blog.togo.io/redisphere/redis-roundup-what-companies-use-redis
Redis also can persist data to disk: http://redis.io/topics/persistence
For caching though, saving to disk shouldn't be absolutely critical. The idea is that if some data is not cached, the worst case is not always loading from disk manually, but going straight through to your database.

If you are performing many such queries on your data, I suggest you index your table using Apache Lucene or Sphinx. Database are fast, but they are not so efficient (especially MySQL) when performing partial matches on millions of rows.
I already answered a similar question about Zend Framework and Lucene, and favor Zend's solution as I believe it is the easiest to setup and use with a PHP environment.
Luckily, Zend Framework can be used by module and you can easily only use the Zend Search Lucene module by itself without the entire class library.
** Edit **
The role of an indexer is not to replace your DB, but to improve it's search functionality by providing a way to perform partial searches. For example, given your table, you may only index a few of your fields (make them "queryable") and have other static (non-indexed) fields to reference your rows in your database.
The advantage in using an indexer is that you can also index pre-computations and directly search them, instead of querying the database.

Related

Should I use a MySQL view or a report cronjob

At my work my colleagues always build report cronjobs for heavy tables. With the cronjob we get all data from 1 day per user and insert the totals in a report table. The report overview page is not correct because it has a delay for at most 1 hour.
The cronjob runs 24 times a day (every hour).
Is it better to use a MySQL view? When a record has been added to the master table the MySQL view will updated, right? This is a very though action. Will that affect the users using the dashboard?
Kind regards,
Joost
Okay so some terminology first.
The cron jobs are most likely appending data to existing tables (perhaps using an upsert method like INSERT ... ON DUPLICATE KEY UPDATE). These data you are writing to the existing tables may be indexed, just like normal MySQL tables, and they are also persistent on disk
Views, on the other hand, are really nothing more than saved queries in MySQL. Every time you open a view, you run the query again. Views aren't really useful for performance optimization as much as they are useful for small, efficient queries that otherwise might be a pain to remember. Views cannot have indices (although they are effectively saved queries, so the query itself can make use of the indices on the tables it's referencing) and they are not persistent to disk. Every time you load the view, you will be running the query that makes up the view again
Now, in between views and tables populated by Cron jobs, you also could install a plugin for MySQL called Flexviews (https://github.com/greenlion/swanhart-tools). Flexviews allows MySQL to use what are called materialized views (eg http://en.wikipedia.org/wiki/Materialized_view). Materialized views are basically views that are persisted to disk as tables. And, since they are tables, they can also use indices.
Materialized views are not native to MySQL, but the developer who maintains that plugin is well known in the MySQL community, and he tends to write good, reliable SQL tools . Obviously it would be a mistake to test the plugin in a production environment, or without using backups. But there are plenty of folks who use Flexviews in production to accomplish exactly what it seems like you'd like to do... obtain near real time updates of dashboard/summary tables in a way that doesn't murder DB performance.
I'd definitely check Flexviews out... you can learn more about it
here: http://www.percona.com/blog/2011/03/23/using-flexviews-part-one-introduction-to-materialized-views/
and here: http://www.percona.com/blog/2011/03/25/using-flexviews-part-two-change-data-capture/

High frequency insert in MySQL

I have a problem with high frequency insert in MySQL. I've searched a lot on Internet but haven't found a good answer to my problem.
I need to log a lot of event at a very high frequency (~3000 inserts / s => 260 millions row per day), these event are stored in a InnoDB table like that :
log_events :
- id_user : BIGINT
- id_event : SMALLINT
- date : INT
- data : BIGINT (data associated to this event)
My problems are :
- How to speed inserts ? Event are send by thousands of visitors and we are not able to bulk insert
- How to limit IO write ? We are on a 6*600 GB SSD drives and have write IO problems
Do you have any ideas to these kind of problem ?
Thanks
François
Do you have any foreign keys on that table? If so, I would consider to remove them and add indexes only on cols which are used for reads. This should improve writes.
The second idea is use some in-memory db (eg. redis, memcache) as a queue and some worker could get data from it and inserts in a bulk (for example for every 2 seconds) to mysql storage.
The another option if you don't need frequent reads is use archive storage instead of innodb: http://dev.mysql.com/doc/refman/5.5/en/archive-storage-engine.html. But it looks like it's not an option for you as long as it hasn't indexes at all (which means full scan table reads).
Another option is reorganize your db structure, eg. use partitioning (http://dev.mysql.com/doc/refman/5.5/en/partitioning.html). But it depends on how SELECTS looks like.
My additional questions are:
could you show whole table definition?
which fields are used for reads? could you show them?
do you need all data for your reads or maybe only recently ones? If so, how recently data must be? (eg. only from last day/week/month/year)
id_event is an event type, right? Number of possible events is static or it could change in the future?
Event are send by thousands of visitors and we are not able to bulk insert
You need to either bulk insert or shard the data. I would be tempted to try the bulk insert route first.
That you think you can't suggests these events are being created by autonomous processes - you just need to funnel them through an intermediary rather than direct to the database. And it would be easiest to implement that funnel as an event based server (rather than a threaded or forking server).
You don't say what the events are nor where they originate - which has some impact on the details of implementing a solution.
Both rsyslog and syslogng will talk to a MySQL backend - hence you can eliminate the overhead of establishing a new connection per message - but I don't know if either implements buffering / bulk inserts. It would certainly be possible to tail the files they produce with a single process and create bulk inserts from there.
It would relatively simple to write a funnel using this event based server, this buffer tool along with a bit of code to implement asynch mysqli calls and a watchdog. Or you could use node.js with an async mysql lib. There's also tools like statsd (again using node.js) which can also perform some aggregation on the data on the data.
Or you could just write something from scratch.
A write-only database is a useless piece of hardware though. You've not provided any details of how this data will be used - which has some relevance to designing a solution. Also since ideally the data feed would be a single process / DB session, it might be a beter idea to use MyISAM rather than InnoDB (I see in your later comment you said you had problems with MyISAM - presumably this was with multiple clients).

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a mysql server to store some data but realized(after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows(although its very small data, an ID number in a column and a dictionary of ints in another). Most of the performance reports I have seen have shown insert speeds of 60 to 100k/second which would take over 10 hours. We need the data in very quickly so we can work on it that day and then we may discard it(or achieve the table to S3 or something).
What can I do? I have 8 servers at my disposal(in addition to the database server), can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use mysql with innodb(I can use any other settings it helps) but its not finalized so if mysql doesn't work is there something else that will(I have used hbase before but was looking for a mysql solution first in case I have problems seems more widely used and easier to get help)?
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also tricker to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.

Document-oriented dbms as primary db and a RDBMS db as secondary db?

I'm having some performance issues with MySQL database due to it's normalization.
Most of my applications that uses a database needs to do some heavy nested queries, which in my case takes a lot of time. Queries can take up 2 seconds to run, with indexes. Without indexes about 45 seconds.
A solution I came a cross a few month back was to use a faster more linear document based database, in my case Solr, as a primary database. As soon as something was changed in the MySQL database, Solr was notified.
This worked really great. All queries using the Solr database only took about 3ms.
The numbers looks good, but I'm having some problems.
Huge database
The MySQL database is about 200mb, the Solr db contains about 1.4Gb of data.
Each time I need to change a table/column the database need to be reindexed, which in this example took over 12 hours.
Difficult to render both a Solr object and a Active Record (MySQL) object without getting wet.
The view is relying on a certain object. It doesn't care if the object it self is an Active Record object or an Solr object, as long as it can call a set of attributes on the it.
Like this.
# Controller
#song = Song.first
# View
#song.artist.urls.first.service.name
The problem in my case is that the data being returned from Solr is flat like this.
{
id: 123,
song: "Waterloo",
artist: "ABBA",
service_name: "Groveshark",
urls: ["url1", "url2", "url3"]
}
This forces me to build an active record object that can be passed to the view.
My question
Is there a better way to solve the problem?
Some kind of super duper fast primary read only database that can handle complex queries fast would be nice.
Solr individual fields update
About reindexing all on schema change: Solr does not support updating individual fields yet, but there is a JIRA issue about this that's still unresolved. However, how many times do you change schema?
MongoDB
If you can live without a RDBMS (without joins, schema, transactions, foreign key constrains), a document-based DB like MongoDB,
or CouchDB would be a perfect fit. (here is a good comparison between them )
Why use MongoBD:
data is in native format (you can use an ORM mapper like Mongoid directly in the views, so you don't need to adapt your records as you do with Solr)
dynamic queries
very good performance on non-full text search queries
schema-less (no need for migrations)
build-in, easy to setup replication
Why use SOLR:
advanced, very performant full-text search
Why use MySQL
joins, constrains, transactions
Solutions
So, the solutions (combinations) would be:
Use MongoDB + Solr
but you would still need to reindex all on schema change
Use only MongoDB
but drop support for advanced full-text search
Use MySQL in a master-slave configuration, and balance reads from slave(s) (using a plugin like octupus) + Solr
setup complexity
Keep current setup, denormalize data in MySQL
messy
Solr reindexing slowness
The MySQL database is about 200mb, the Solr db contains about 1.4Gb of
data. Each time I need to change a table/column the database need to
be reindexed, which in this example took over 12 hours.
Reindexing 200MB DB in Solr SHOULD NOT take 12 hours! Most probably you have also other issues like:
MySQL:
n+1 issue
indexes
SOLR:
commit after each request - this is the default setup is you use a plugin like sunspot, but it's a perf killer for production
From http://outoftime.github.com/pivotal-sunspot-presentation.html:
By default, Sunspot::Rails commits at the end of every request
that updates the Solr index. Turn that off.
Use Solr's autoCommit
functionality. That's configured in solr/conf/solrconfig.xml
Be
glad for assumed inconsistency. Don't use search where results need to
be up-to-the-second.
other setup issues (http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance)
Look at the logs for more details
Instead of pushing your data into Solr to flatten the records, why don't you just create a separate table in your MySQL database that is optimized for read only access.
Also you seem to contradict yourself
The view is relying on a certain object. It doesn't care if the object it self is an Active Record object or an Solr object, as long as it can call a set of attributes on the it.
The problem in my case is that the data being returned from Solr is flat... This forces me to build a fake active record object that can be rendered by the view.

How to cache latest inserted data in MySQL?

Is it possible to cache recently inserted data in MySQL database internally?
I looked at query cache etc (http://dev.mysql.com/doc/refman/5.1/en/query-cache.html) but thats not what I am looking for. I know that 'SELECT' query will be cached.
Details:
I am inserting lots of data to MySQL DB every second.
I have two kind of users for this Data.
Users who query any random data
Users who query recently inserted data
For 2nd kind of users, my table has primary key as unix time-stamp which tells me how new the data is. Is there any way to cache the data at the time of insert?
One option is to write my own caching module which cache data and then 'INSERT'.
Users can query this module before going to MySQL DB.
I was just wondering if something similar is available.
PS: I am open to other database providing similar feature.
Usually you get the best performance from MySQL if you allow a big index cache (config setting key_buffer_size), at least for MyISAM tables.
If latency is really an issue (as it seems in your case) have a look at Sphinx which has recently introduced real-time indexes.