For a small application, I need to use a flat-file database with relational capabilities (2 or 3 tables).
A couple of questions regarding this:
Do such databases exist?
Are there any performance hits with large datasets (say, 10k-20k entries)?
The reason I want to go with a flat-file database is so the whole thing (root directory) can be copied and pasted, without having to worry about exporting the database, installing and configuring a database on another system, etc.
thanks.
Try SQLite.
It is easy to use, portable, requires no configuration, and has great performance (10k/20k rows is nothing).
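To make that concrete, here is a minimal sketch of the kind of setup being asked about: a single SQLite file sitting inside the application directory, holding two related tables. The file, table, and column names are just placeholders.

```python
import sqlite3

# The database is one ordinary file in the app's root directory, so copying
# the directory copies the data along with it. (File name is illustrative.)
conn = sqlite3.connect("app.db")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the relational constraints

conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS posts (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    body    TEXT NOT NULL
);
""")

conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.execute("INSERT INTO posts (user_id, body) VALUES (?, ?)", (1, "hello"))
conn.commit()

# An ordinary relational join, even though the whole database is one flat file.
for name, body in conn.execute(
        "SELECT u.name, p.body FROM users u JOIN posts p ON p.user_id = u.id"):
    print(name, body)
```

At 10k-20k rows per table, queries like this are effectively instantaneous in SQLite.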
I have a database with data that is read-only as far as the application using it is concerned. However, different groups of tables within the database need to be refreshed weekly or monthly (the organization generating the data provides entirely new flat files for this purpose). Most of the updates are small, but some are large (more than 5 million rows with a large number of fields). I like to load the data into a test database and then just replace the entire database in production. So far I have been doing this by exporting the data using mysqldump and then importing it into production. The problem is that the import takes 6-8 hours, and the system is unusable during that time.
I would like to get the downtime as short as possible. I've tried all the tips I could find to speed up mysqldump, such as those listed here: http://vitobotta.com/smarter-faster-backups-restores-mysql-databases-with-mysqldump/#sthash.cQ521QnX.hpG7WdMH.dpbs. I know that many people recommend Percona's XtraBackup, but unfortunately I'm on a Windows 8 server and Percona does not run on Windows. Other fast backup/restore options are too expensive (e.g., MySQL Enterprise). Since my test server and production server are both 64-bit Windows machines and are both running the same version of MySQL (5.6), I thought I could just zip up the database files and copy them over to swap out the whole database at once (all tables are InnoDB). However, that didn't work. I saw the tables in MySQL Workbench, but couldn't access them.
I'd like to know if copying the database files is a viable option (I may have done it wrong), and if it is not, what low-cost options are available to reduce my downtime?
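For reference, the current dump-and-restore workflow described above looks roughly like the sketch below. Host names, database name, and file paths are placeholders, and the flags shown are common speed-oriented mysqldump options rather than a fix for the 6-8 hour import itself.

```python
import subprocess

DB = "reference_data"          # placeholder database name
DUMP = "reference_data.sql"    # placeholder dump file

# Export from the test server. --single-transaction gives a consistent InnoDB
# snapshot without locking, --quick streams rows instead of buffering them,
# and --extended-insert packs many rows into each INSERT statement.
# Credentials are assumed to come from an option file (e.g. ~/.my.cnf).
with open(DUMP, "w") as out:
    subprocess.run(
        ["mysqldump", "-h", "test-server", "-u", "loader",
         "--single-transaction", "--quick", "--extended-insert", DB],
        stdout=out, check=True)

# Import into production; this is the step that currently takes 6-8 hours.
with open(DUMP) as src:
    subprocess.run(
        ["mysql", "-h", "prod-server", "-u", "loader", DB],
        stdin=src, check=True)
```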
I have a 2 GB DigitalOcean VPS with 2 CPUs that hosts a social network app written in Java. Right now my app stores its data in Cassandra, but Cassandra is a newer technology and not as proven as MySQL, which has been around for years, and I don't have much experience managing Cassandra as a DBA. So I want to switch my primary data source back to MySQL. However, some of the data is essentially schemaless; for example, there are per-user lists that are easy to store in Cassandra. For this type of data, I would keep Cassandra as the primary database.
So, to sum up, I would replicate my entire data set in both databases. Data would be written to both databases but read from whichever can serve it most performantly. This would also help me if the entire Cassandra cluster goes down: I could serve from MySQL, or vice versa. Is this usually done, and is it recommended?
(Right now I have a single 2 GB VPS that would host my app as well as the databases)
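To make the pattern being asked about concrete, here is a minimal sketch of a dual-write, read-with-fallback layer. The store objects are placeholders for whatever MySQL and Cassandra client libraries the app actually uses; this only illustrates the idea, not an endorsement of it.

```python
class DualStore:
    """Write every change to both stores; read from the preferred one."""

    def __init__(self, cassandra_store, mysql_store):
        # Placeholder objects exposing put(key, value) / get(key);
        # in the real app these would wrap the actual drivers.
        self.preferred = cassandra_store
        self.fallback = mysql_store

    def write(self, key, value):
        # Both writes must succeed (or be retried) to keep the copies in
        # sync; handling a partial failure is the hard part of this pattern.
        self.preferred.put(key, value)
        self.fallback.put(key, value)

    def read(self, key):
        try:
            return self.preferred.get(key)
        except ConnectionError:
            # If the preferred store is down, serve from the other copy.
            return self.fallback.get(key)
```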
Normally we don't see people managing two separate database systems just for the purpose of data loss and recovery; it is always better to rely on the replication or mirroring features of a single database system. Both MySQL and Cassandra are good and provide adequate replication, so you are better off choosing just one of them.
I need to know how data from databases is stored on a filesystem. I am sure that different databases use different ways of storing data, but I want to know what the general rule is (if there is one), and what can be changed in the settings of a particular DB.
How is the whole database stored? In one big file or one file per table?
What if a table is enormous? Would it be split into several files?
What is the typical file size in that case?
The answer to this question is both database dependent and implementation dependent. Here are some examples of how data can be stored:
As a single file per database. (This is the default for SQL Server.)
Using a separate file system manager, which could be the operating system. (MySQL has several storage engine options, with names like InnoDB.)
Using separate files for each table. (MySQL's MyISAM engine works this way, for example.)
As multiple physical files, spread across multiple file systems, but represented as a single "file". (Hive, for instance, uses a distributed file system to store the data.)
However, these are the default configurations. Real databases typically let you split the data among multiple physical devices. SQL Server and MySQL call these partitions; Oracle calls them tablespaces. These are typically set up by knowledgeable DBAs who understand the performance requirements of the system.
The final questions are easy to answer, though. Most databases give you the option of either growing the database as space is needed or giving the database a fixed (or fixed maximum) size. I have not encountered a database engine that will split the underlying data into multiple files automatically, although it is possible that newer column-oriented databases (such as Vertica) do something similar.
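As a concrete illustration of the file-per-table case, the snippet below lists the per-table .ibd files MySQL creates for a database when innodb_file_per_table is enabled. The data directory path and database name are placeholders; with the option disabled, the data lives in the shared ibdata files instead.

```python
import os

# Hypothetical MySQL data directory: each database (schema) gets its own
# subdirectory, and with innodb_file_per_table=ON each InnoDB table gets
# its own .ibd file inside it.
datadir = "/var/lib/mysql/mydb"

for name in sorted(os.listdir(datadir)):
    if name.endswith(".ibd"):
        size_mb = os.path.getsize(os.path.join(datadir, name)) / (1024 * 1024)
        print(f"{name}: {size_mb:.1f} MB")  # one file per table
```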
Does it matter how many databases I use in my web system? I am planning to have:
User information and related tables
Admin tables and all system tables
Reporting system
Audit logs of tables
User object tables like Photos, Videos, Comments
User API applications to read/write data.
Questions:
I am using MySQL and MongoDB with CakePHP. So if I implement the above, I will use 6 databases in the system; add backups (2 of each) and that makes 12 databases total. Are there any advantages or disadvantages to this approach versus dumping all the tables into 1 database? I assume that these days, for sites like Yahoo, Amazon, Facebook, etc., having hundreds or thousands of databases is the norm, or are these all powered by 1 database with multiple instances?
For lookup tables: do I duplicate them in each database, or is one copy in the admin database good enough?
Also, if I have multiple instances of the same DB, do I need to name them DB1, DB2, DB3, or can I call them anything?
We are developing a local reviews website, so we expect lots of users eventually.
Everything that has related information should be in the same DB.
So if all the things you have mentioned share related info, you would probably need only 3 DBs: dev, prod, and backup.
If some of that info is not related to anything else, then it should be in a separate DB.
As a developer, I always create a new DB for each new unrelated project. Otherwise, you create or add new features in the existing DB.
Use a single database. The problems with using multiple DBs include distributed queries (as pointed out already), plus the overhead associated with each DB server/instance, and general maintenance complexity.
What you want are tablespaces: http://dev.mysql.com/doc/refman/5.1/en/create-tablespace.html
Consider that with DBs like Oracle, the overhead per instance is 300-500 MB+, not to mention a new set of processes and separate buffer caches. You want a single, unified buffer cache to make the most of your RAM.
Partitioning with a database as the partition unit doesn't save you much, but it will create a giant headache. MySQL can handle huge amounts of data (terabytes) as long as you design your schemas well and tune the storage. And use separate tablespaces.
Moving your app, and backup/restore, should be simple too.
The only reason I create separate DBs is if there are multiple customers involved, or the project requires it. But it is usually not required from a technical point of view.
Disadvantages: trying to do joins across multiple databases.
Any area that you might need to relate to another area should be in the same DB.
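For what it's worth, when the separate databases live on the same MySQL server, a join across them is still possible by qualifying tables with their database name, but it breaks as soon as the databases move to different servers, which is part of the disadvantage mentioned above. A minimal sketch, with made-up database, table, and credential names:

```python
import mysql.connector  # assumes the mysql-connector-python package

# Database, table, and credential names are all illustrative.
conn = mysql.connector.connect(host="localhost", user="app", password="secret")
cur = conn.cursor()

# A cross-database join works on one server by qualifying table names,
# but it is no longer possible once users_db and media_db are split
# across separate servers.
cur.execute("""
    SELECT u.username, p.title
    FROM users_db.users AS u
    JOIN media_db.photos AS p ON p.user_id = u.id
""")
for username, title in cur.fetchall():
    print(username, title)

cur.close()
conn.close()
```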
I'm just getting started with learning Hadoop, and I'm wondering the following: suppose I have a bunch of large MySQL production tables that I want to analyze.
It seems like I have to dump all the tables into text files, in order to bring them into the Hadoop filesystem -- is this correct, or is there some way that Hive or Pig or whatever can access the data from MySQL directly?
If I'm dumping all the production tables into text files, do I need to worry about affecting production performance during the dump? (Does it depend on what storage engine the tables are using? What do I do if so?)
Is it better to dump each table into a single file, or to split each table into 64 MB (or whatever my block size is) files?
Importing data from MySQL can be done very easily. I recommend you use Cloudera's Hadoop distribution; it comes with a program called 'sqoop' which provides a very simple interface for importing data straight from MySQL (other databases are supported too).
Sqoop can be used with mysqldump or with a normal MySQL query (SELECT * ...).
With this tool there's no need to manually partition tables into files, but for Hadoop it's much better to have one big file.
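A typical Sqoop invocation looks roughly like the sketch below; the connection string, credentials, table name, and target directory are all placeholders.

```python
import subprocess

# Pull one MySQL table into HDFS with Sqoop; Sqoop runs the extract as
# parallel map tasks (--num-mappers) and writes the result under --target-dir.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/production",
    "--table", "orders",
    "--username", "readonly_user", "-P",   # -P prompts for the password
    "--num-mappers", "4",
    "--target-dir", "/user/hadoop/orders",
], check=True)
```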
Useful links:
Sqoop User Guide
2)
Since I don't know your environment, I will err on the safe side: YES, worry about affecting production performance.
Depending on the frequency and quantity of data being written, you may find that it processes in an acceptable amount of time, particularly if you are just writing new/changed data. [subject to complexity of your queries]
If you don't require real-time data, or your servers typically have periods when they are under-utilized (overnight?), then you could create the files at that time.
Depending on how your environment is set up, you could replicate/log-ship to specific DB server(s) whose sole job is to create your data file(s).
3)
No need for you to split the file; HDFS will take care of partitioning the data file into blocks and replicating it over the cluster. By default it will automatically split it into 64 MB data blocks.
see - Apache - HDFS Architecture
re: Wojtek's answer - Sqoop link (it doesn't work in comments)
If you have more questions or specific environment info, let us know
HTH
Ralph