Query MySQL database vs reading static files

I have a MySQL database with fixed data that will never change, be edited, or be queried with complex queries.
It just has 2 columns:
Id|Data
It contains about 50k rows and has a size of around 70 MB.
I was thinking maybe I should create 50k static files, which would be named Id.xml and read that way. For example:
file_get_contents('2232.xml');
versus querying the MySQL database:
select data from table where id = 2232
Is it better to do it this way, for quicker performance and less RAM usage? Or would 50k inodes not be ideal for the system?

Go for the static files. One benefit is not having another database in the system. The system can definitely handle 50k inodes (if it were in the millions, you may need to reconsider).

MySQL best practice for archiving data

I have a 120 GB database with one specific very heavy table of 80 GB (storing more than 10 years of data).
I am thinking of moving the old data into an archive, but wonder which is better:
to move it to a new table in the same database
to move it to a new table in a new archive database
What would be the result from a performance point of view?
1/ If I reduce the table to only 8 GB and move 72 GB into another table in the same database, will the database run faster (we won't access the archive table with read/write operations, and r/w will be done on a lighter table)?
2/ Will keeping 72 GB of data in the archive table slow down the database engine anyway?
3/ Will having the 72 GB of data in another archive database have any benefit versus keeping the 72 GB in the archive table of the master database?
Thanks for your answers,
Edouard
The size of a table may or may not impact the performance of queries against that table. It depends on the query, innodb_buffer_pool_size and RAM size. Let's see some typical queries.
The existence of a big table that is not being used has no impact on queries against other tables.
It may or may not be wise to PARTITION BY RANGE (TO_DAYS(...)) and have monthly or yearly partitions. The main advantage comes when you get around to purging old data, but you don't seem to need that.
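For illustration only, a minimal sketch of such a layout (all table and column names here are hypothetical), with yearly partitions:

CREATE TABLE history (
  id BIGINT NOT NULL,
  created DATE NOT NULL,
  payload TEXT,
  PRIMARY KEY (id, created)    -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
  PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);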
If you do split into 72 + 8, I recommend copying the 8 from the 80 into a new table, then use RENAME TABLEs to juggle the table names.
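A sketch of that juggle, assuming a hypothetical table big with a created date column and a made-up cutoff; the copy runs while the original stays online, and the final RENAME is atomic:

CREATE TABLE big_new LIKE big;
INSERT INTO big_new
  SELECT * FROM big WHERE created >= '2022-01-01';   -- the ~8 GB you want to keep
RENAME TABLE big TO big_archive, big_new TO big;     -- the old table becomes the archive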
Two TABLEs in one DATABASE is essentially the same as having the TABLEs in different DATABASEs.
I'll update this Answer when you have provided more details.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a MySQL server to store some data, but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first in case I have problems; it seems more widely used and it's easier to get help.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
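A sketch of both ideas together, assuming a hypothetical MyISAM table daily_data(id, payload) and a CSV file already present on the database server:

ALTER TABLE daily_data DISABLE KEYS;            -- suspend non-unique index maintenance during the load
LOAD DATA INFILE '/var/lib/mysql-files/batch.csv'
  INTO TABLE daily_data
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  (id, payload);
ALTER TABLE daily_data ENABLE KEYS;             -- rebuild the indexes in one pass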

Only MySQL OR MySQL+SQLite OR MySQL+own solution

Currently I am building a quite big web system and I need a strong SQL database solution. I chose MySQL over Postgres because some of the tasks need to be read-only (MyISAM engine) and others are massive writes (InnoDB).
I have a question about this read-only part. It has to be extremely fast. The user must get an answer in well under one second.
Let's say we have one well-indexed table named "object" with no more than 10 million rows, and another one named "element" with around 150 million rows.
We also have a table named "element_object" containing information connecting rows from table "element" with table "object" (hundreds of millions of rows).
So we're going to partition tables "element" and "element_object" and have 8192 tables "element_hash_n{0..8191}a" and 24576 tables "element_object_hash_n{0..8191}_m{0..2}".
Answering a user's question would be a 2-step search:
Find the id of the element from the "element_hash_n" tables
Do the main SQL select on table "object" and join with table "element_object_hash_n_m" to filter the result with the ID found in the first step
I wonder about the first step:
What would be better:
store (all) over 32k tables in MySQL
create one SQLite database and store the 8192 first-step tables there
create 8192 separate SQLite files (databases)
create 8192 files in the file system and build my own binary solution to find the ID
I'm sorry for my English. It's not my native language.
I think you make way too many partitions. If you have more than 32000 partitions you have a tremendous management overhead. Given the name element_hash_* it seems as if you want to hash your elements and partition them that way. But a hash will give you a (most likely) even distribution of the data over all partitions. I can't see how this should improve performance. If your data is accessed across all of those partitions, you gain nothing by having partitions the size of your memory: every query will need to load data from a different partition.
We used partitions on a transaction system where more than 90% of the queries used the current day as a criterion. In such a case partitioning by day worked very well. But we only had 8 partitions and then moved the data off to another database for long-term storage.
My advice: try to find out what data will be needed that fast and try to group it together. You will also need to run your own performance tests. If it is so important to deliver data that fast, there should be enough management support to build a decent test environment.
Maybe your test results will show that you simply can't deliver the data fast enough with a relational database system. If so, you should look at NoSQL (as in Not Only SQL) solutions.
In what technology are you building your web system? You should test that part as well. A super fast database will not help you much if you lose the time in a poorly performing web application.

Database solution for large infrequently-accessed data sets

We use MongoDB to store daily logs of statistics about 10s of thousands of items in our database--the collection is currently approaching 100 million records. This data is useful for data mining, but is accessed infrequently. We recently moved it from our main MySQL database to a Mongo database; this turned out not to be ideal--Mongo is optimized for fast reads, keeping all of its indexes in memory, and the index on this table is very large.
What is a good way to store large amounts of data for daily large writes, but infrequent reads? We are considering a separate MySQL installation on a separate system. Another possibility might be a NoSQL solution that did not require an index kept in memory.
You are correct, a NoSQL database is good for fast reads of simple data. Since you will need to query and possibly do relational operations on this data, I'd recommend a separate MySQL installation for this.
You will want to minimize the SQL indexes for fast writes.

Database design for heavy timed data logging

I have an application where I receive 40,000 rows of data each day. I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Currently, those rows are all stored in the same table => slow to update, hard to back up, etc.
What kind of scheme is used in such applications to allow long-term accessibility to the data without problems with too-big tables, with easy backup and fast reads/writes?
Is PostgreSQL better than MySQL for such a purpose?
1 - 40000 rows / day is not that big
2 - Partition your data against the insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a datamart step (compute frequently requested metrics into intermediary tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant.
We have log tables of 100-200 million rows now, and it is quite painful.
Backup is impossible; it requires several days of downtime.
Purging old data is becoming too painful - it usually ties up the database for several hours.
So far we've only seen these solutions:
For backups, set up a MySQL slave. Backing up the slave doesn't impact the main db. (We haven't done this yet, as the logs we load and transform come from flat files; we back up these files and can regenerate the db in case of failures.)
For purging old data, the only painless way we've found is to introduce a new integer column that identifies the current date, and partition the tables (requires MySQL 5.1) on that key, per day. Dropping old data is a matter of dropping a partition, which is fast.
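A sketch of that scheme, with hypothetical table and column names (day_key being the integer date column):

ALTER TABLE log_data
  PARTITION BY RANGE (day_key) (
    PARTITION p20100114 VALUES LESS THAN (20100115),
    PARTITION p20100115 VALUES LESS THAN (20100116),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
  );
ALTER TABLE log_data DROP PARTITION p20100114;  -- purging a day is just dropping its partition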
If in addition you need to run transactions continuously against these tables (as opposed to just loading data every now and then and mostly querying it), you probably need to look into InnoDB rather than the default MyISAM tables.
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
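A sketch of that daily roll-up (table and column names are hypothetical), run once per day over yesterday's detail rows:

INSERT INTO DailySalesByLocation (sale_date, location_id, order_count, total_amount)
SELECT DATE(sold_at), location_id, COUNT(*), SUM(amount)
FROM Sales
WHERE sold_at >= '2011-06-01' AND sold_at < '2011-06-02'
GROUP BY DATE(sold_at), location_id;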
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert/update/delete from a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes specified on your log table, you should take a critical look at them and decide if they are indeed necessary. If not, drop them.
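A sketch of that review, with hypothetical table and index names:

SHOW INDEX FROM activity_log;                        -- see what is currently indexed
ALTER TABLE activity_log DROP INDEX idx_user_agent;  -- drop an index no query actually uses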
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
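A sketch of such an archiving pass, assuming the archive database lives on the same MySQL server and using hypothetical names throughout:

INSERT INTO archive_db.app_log
  SELECT * FROM main_db.app_log
  WHERE logged_at < NOW() - INTERVAL 1 YEAR;
DELETE FROM main_db.app_log
  WHERE logged_at < NOW() - INTERVAL 1 YEAR;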
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/