I have two identical (in structure) databases residing on separate backend servers.
I need to come up with some logic to 'merge' their data into a single database on a third server.
My initial design is to load their data (by table) into memory using a combination of Perl hashes and arrays and merging them there, then doing a single massive write to a local DB (also identical in structure).
I would repeat for all tables (4-5).
I've seen posts about merging tables, but not sure if I can use some of those responses as my tables reside in separate databases (let alone separate machines).
My question is am I stuck with having to load the results into memory first or are there features of MySQL that I can use to my advantage?
What "mu" said needs addressing, but I'm not sure I'd go with this approach at all.
Get the two databases onto the target server using a standard mysqldump and restore.
Then merge them into the third DB using standard queries.
You should let MySQL do the heavy lifting.
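A minimal sketch of the "merge with standard queries" step, under stated assumptions: SQLite (from Python's standard library) stands in for MySQL so the example is self-contained, the `users` table and its columns are hypothetical, and "skip rows whose key already exists" is only one possible conflict rule. On a single MySQL server the same `INSERT ... SELECT` pattern would reference the restored dumps as `db_a.users` and `db_b.users`.

```python
import sqlite3

# SQLite stands in for MySQL here. ATTACH plays the role of having both
# restored dumps available on one server as separate databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS db_a")
conn.execute("ATTACH DATABASE ':memory:' AS db_b")

# Hypothetical table, identical in structure in all three databases.
ddl = "CREATE TABLE {}.users (id INTEGER PRIMARY KEY, name TEXT)"
for schema in ("main", "db_a", "db_b"):
    conn.execute(ddl.format(schema))

conn.executemany("INSERT INTO db_a.users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.executemany("INSERT INTO db_b.users VALUES (?, ?)", [(2, "bob"), (3, "cy")])

# Copy everything from the first source, then only those rows from the
# second source whose key is not already present (assumed conflict rule).
conn.execute("INSERT INTO main.users SELECT id, name FROM db_a.users")
conn.execute(
    "INSERT INTO main.users "
    "SELECT b.id, b.name FROM db_b.users AS b "
    "WHERE b.id NOT IN (SELECT id FROM main.users)"
)

merged = conn.execute("SELECT id, name FROM main.users ORDER BY id").fetchall()
print(merged)
```

The point is that the database does the set logic; no table has to be pulled into application memory first.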
In MySQL (and PostgreSQL), what exactly constitutes a DB instance and a DB partition?
For example, do different DB partitions need to necessarily live on different database instances? Or can a single DB instance manage multiple partitions? If the latter, what's the point of calling it a "partition"? Would the DB have any knowledge of it in this case?
Here's a quote from a document describing a system design from an online course:
How can we plan for the future growth of our system?
We can have a large number of logical partitions to accommodate future data growth, such that in the beginning, multiple logical partitions reside on a single physical database server. Since each database server can have multiple database instances on it, we can have separate databases for each logical partition on any server. So whenever we feel that a particular database server has a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that can map our logical partitions to database servers; this will enable us to move partitions around easily. Whenever we want to move a partition, we only have to update the config file to announce the change.
These terms are confusing, misused, and inconsistently defined.
For MySQL:
A Database has multiple definitions:
A "schema" (as used by other vendors/standards). This is a collection of tables. There are one or more "databases" in an instance.
The instance. You should use "server" or "database server" to be clearer.
The data. "Dataset" might be a better term.
An instance refers to a copy of mysqld running on some machine somewhere.
You can have multiple instances on a single piece of hardware. (Rare)
You can have multiple instances on a single piece of hardware, with the instances in different VMs or Dockers. (handy for testing)
Usually "instance" refers to one server with one copy of MySQL on it. (Typical for larger-scale situations)
A PARTITION is a specific way to lay out a table (in a database).
It is seen in CREATE TABLE (...) PARTITION BY ....
It is a "horizontal" split of the data, often by date, but could be by some other 'column'.
It has no direct impact on performance, making it rarely useful.
Sharding is not implemented in MySQL, but can be done on top of MySQL.
It is also a "horizontal" split of the data, but in this case across multiple "instances".
The use case is, for example, social media where there are millions of "users" that are mostly handled by themselves. That is, most of the queries focus on a single slice of the data, hence it is practical to put a bunch of users on one server and do all those queries there.
It can be called "horizontal partitioning" but should not be confused with PARTITIONs of a table.
Vertical partitioning is where some columns are pulled out of a table and put into a parallel table.
Both tables would (normally) have the same PRIMARY KEY, thereby facilitating JOINs.
Vertical partitioning would (normally) be done only in a single "instance".
The purposes include splitting off big TEXT/BLOB columns, and splitting off optional columns (using LEFT JOIN to get NULLs).
Vertical partitioning was somewhat useful in MyISAM, but rarely useful in InnoDB, since that engine automatically does such.
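To make the vertical-partitioning points concrete, here is a small sketch. It assumes hypothetical `article` and `article_body` tables and uses SQLite from Python's standard library as a stand-in for MySQL. Both tables share the same primary key, and the `LEFT JOIN` yields NULL (`None` in Python) where the optional row is absent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Narrow, frequently scanned columns stay in the main table.
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    -- The bulky, optional column lives in a parallel table with the same key.
    CREATE TABLE article_body (id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO article VALUES (1, 'With body'), (2, 'Without body');
    INSERT INTO article_body VALUES (1, 'a long text column ...');
""")

# LEFT JOIN keeps every article; a missing body comes back as NULL.
rows = conn.execute(
    "SELECT a.id, a.title, b.body FROM article AS a "
    "LEFT JOIN article_body AS b ON b.id = a.id ORDER BY a.id"
).fetchall()
print(rows)
```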
Replication and Clustering
Multiple instances contain the same data.
Used for "High Availability" (HA).
Used for scaling out reads.
Orthogonal to partitioning or sharding.
Does not make sense to have the instances on the same server (except for testing/experimenting/staging/etc).
Partitions, in terms of MySQL and PostgreSQL feature set, are physical segmentations of data. They exist within a single database instance, and are used to reduce the scope of data you're interacting with at a particular time, to cope with high data volume situations.
The document you're quoting from is speaking of a more abstract concept of a data partition at the system design level.
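The system-design sense of "partition" in the quoted document can be sketched as a lookup table. This is only an illustration: the partition count, the server names, and the choice of CRC32 as the hash are all assumptions, not something the quoted document specifies.

```python
import zlib

# Hypothetical: the number of logical partitions is fixed up front and is
# much larger than the number of physical servers.
NUM_LOGICAL_PARTITIONS = 16

# Hypothetical config mapping each logical partition to a database server.
# Initially every partition lives on the same server.
partition_map = {p: "db-server-1" for p in range(NUM_LOGICAL_PARTITIONS)}


def logical_partition(key: str) -> int:
    """Map a key to a logical partition with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_LOGICAL_PARTITIONS


def server_for(key: str) -> str:
    """Look up which server currently holds the key's partition."""
    return partition_map[logical_partition(key)]


def move_partition(partition: int, new_server: str) -> None:
    """Announce a move by updating the config; copying the data is a separate step."""
    partition_map[partition] = new_server


p = logical_partition("user:42")
before = server_for("user:42")
move_partition(p, "db-server-2")
after = server_for("user:42")
print(p, before, after)
```

The key never changes its logical partition; only the partition-to-server mapping changes, which is why the database itself needs no knowledge of the scheme.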
In my primary role, I handle laboratory testing data files that can contain upwards of 2000 parameters for each unique test condition. These files are generally stored and processed as CSV formatted files, but that becomes very unwieldy when working with 6000+ files with 100+ rows each.
I am working towards a future database storage and query solution to improve access and efficiency, but I am stymied by the row length limitation of MySQL (specifically MariaDB 5.5.60 on RHEL 7.5). I am using MYISAM instead of InnoDB, which has allowed me to get to around 1800 mostly-double formatted data fields. This version of MariaDB forces dynamic columns to be numbered, not named, and I cannot currently upgrade to MariaDB 10+ due to administrative policies.
Should I be looking at a NoSQL database for this application, or is there a better way to handle this data? How do others handle many-variable data sets, especially numeric data?
For an example of the CSV files I am trying to import, see below. The identifier I have been using is an amalgamation of TEST, RUN, TP forming a 12-digit unsigned bigint key.
Example File:
RUN ,TP ,TEST ,ANGLE ,SPEED ,...
1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04,...
Example key:
548000010001 <-- Test = 5480, Run = 1, TP = 1
I appreciate any input you have.
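For reference, the key layout in the example can be written as a pair of helper functions. This is a sketch that assumes each of TEST, RUN and TP occupies four decimal digits, which is inferred from the single example key and may not match the real scheme; the function names are hypothetical.

```python
from typing import Tuple

FIELD_WIDTH = 10**4  # assumed: four decimal digits per field


def make_key(test: int, run: int, tp: int) -> int:
    """Pack TEST, RUN and TP into one 12-digit integer key."""
    for name, value in (("test", test), ("run", run), ("tp", tp)):
        if not 0 <= value < FIELD_WIDTH:
            raise ValueError(f"{name}={value} does not fit in four digits")
    return (test * FIELD_WIDTH + run) * FIELD_WIDTH + tp


def split_key(key: int) -> Tuple[int, int, int]:
    """Recover (test, run, tp) from a packed key."""
    rest, tp = divmod(key, FIELD_WIDTH)
    test, run = divmod(rest, FIELD_WIDTH)
    return test, run, tp


# The CSV stores the fields as floats in scientific notation.
key = make_key(int(float("5.480000E+03")), int(float("1.000000E+00")), 1)
print(key, split_key(key))
```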
The complexity comes from the fact that you have to handle a huge amount of data, not from the fact that it is split over many files with many rows.
Using a database storage & query system will superficially hide some of this complexity, but at the expense of complexity at several other levels, as you have already experienced, including obstacles that are out of your control like changing versions and conservative admins. Database storage & query systems are made for other application scenarios, where they have advantages that are not pertinent to your case.
You should seriously consider leaving your data in files, i.e. use your file system as your database storage system. Possibly, transcribe your CSV input into a modern self-documenting data format like YAML or HDF5. For queries, you may be better off writing scripts or programs that directly access those files, instead of writing SQL queries.
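As a rough sketch of what "scripts that directly access those files" could look like, the following uses only Python's standard library. The helper names are hypothetical, the header padding is stripped because the example file pads column names with spaces, the elided columns are simply left out, and the second data row is made up for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, TextIO

Row = Dict[str, float]


def read_rows(handle: TextIO) -> Iterator[Row]:
    """Yield one dict per data row, keyed by the stripped column names."""
    reader = csv.reader(handle)
    header = [name.strip() for name in next(reader)]
    for record in reader:
        yield {name: float(value) for name, value in zip(header, record)}


def query(handles: Iterable[TextIO], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Scan every file and yield the rows for which the predicate holds."""
    for handle in handles:
        for row in read_rows(handle):
            if predicate(row):
                yield row


# In real use the handles would come from open(path, newline="") over the
# data directory; an in-memory file keeps the sketch self-contained.
sample = io.StringIO(
    "RUN ,TP ,TEST ,ANGLE ,SPEED\n"
    "1.000000E+00,1.000000E+00,5.480000E+03,1.234567E+01,6.345678E+04\n"
    "1.000000E+00,2.000000E+00,5.480000E+03,2.500000E+01,6.000000E+04\n"
)
hits = list(query([sample], lambda row: row["ANGLE"] > 20.0))
print(hits)
```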
MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL was set up accordingly). One can create indexes on any fields.
Redis stores data in memory indexed by one key at a time, and in my understanding MySQL can do this job too.
Are there any things that make Redis better for storing a big amount (100-200k rows) of volatile data? I can only explain the appearance of Redis by the fact that not every project has MySQL inside and probably some other databases don't support temporary tables.
If I already have MySQL in my project, does it make sense to put up with Redis?
Redis is like working with indexes directly. There's no ACID, SQL parser and many other things between you and the data.
It provides some basic data structures and they're specifically optimized to be held in memory, and they also have specific operations to read and modify them.
On the other hand, Redis isn't designed to query data (though you can implement very powerful, high-performance filters with SORT, SCAN, intersections and other operations) but to store the data in the form in which it's going to be consumed later. If you want to get, for example, customers sorted by 3 different criteria, you'll need to fill 3 different sorted sets. There are a lot of use cases with other data structures, but I would end up writing a book in an answer...
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you need Redis instead of temporary tables in MySQL (and other engines which have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis can both outperform your current approach and give you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.
I need to start off by pointing out that by no means am I a database expert in any way. I do know how to get around to programming applications in several languages that require database backends, and am relatively familiar with MySQL, Microsoft SQL Server and now MEMSQL - but again, not an expert at databases so your input is very much appreciated.
I have been working on developing an application that has to cross reference several different tables. One very simple example of an issue I recently had, is I have to:
1. On a daily basis, pull down 600K to 1M records into a temporary table.
2. Compare what has changed between this new data pull and the old one. Record that information on a separate table.
3. Repopulate the table with the new records.
Running #2 is a query similar to:
SELECT * FROM (NEW TABLE) LEFT JOIN (OLD TABLE) ON (JOINED FIELD) WHERE (OLD TABLE.FIELD) IS NULL
In this case, I'm comparing the two tables on a given field and then pulling the information of what has changed.
In MySQL (v5.6.26, x64), my query times out. I'm running 4 vCPUs and 8 GB of RAM but note that the rest of my configuration is default configuration (did not tweak any parameters).
In MEMSQL (v5.5.8, x64), my query runs in about 3 seconds on the first try. I'm running the exact same virtual server configuration with 4 vCPUs and 8 GB of RAM, also note that the rest of my configuration is default configuration (did not tweak any parameters).
Also, in MEMSQL, I am running a single node configuration. Same thing for MySQL.
I love the fact that using MEMSQL allowed me to continue developing my project, and I'm coming across even bigger cross-table calculation queries and views that are running fantastically on MEMSQL... but, in an ideal world, I'd use MySQL. I've already come across the fact that I need to use a different set of tools to manage my instance (i.e.: MySQL Workbench works relatively well with a MEMSQL server but I actually need to build views and tables using the open source SQL Workbench and the mysql java adapter. Same thing for using the Visual Studio MySQL connector, works, but can be painful at times, for some reason I can add queries but can't add table adapters)... sorry, I'll submit a separate question for that :)
Considering both virtual machines are exactly the same configuration, and SSD backed, can anyone give me any recommendations on how to tweak my MySQL instance to run big queries like the one above on MySQL? I understand I can also create an in-memory database but I've read there might be some persistence issues with doing that, not sure.
Thank you!
The most likely reason this happens is that you don't have an index on your joined field in one or both tables. According to this article:
https://www.percona.com/blog/2012/04/04/join-optimizations-in-mysql-5-6-and-mariadb-5-5/
Vanilla MySQL only supports nested loop joins, which require an index to perform well (otherwise they take quadratic time).
Both MemSQL and MariaDB support a so-called hash join, which does not require you to have indexes on the tables, but consumes more memory. Since your dataset is negligibly small for modern RAM sizes, that extra memory overhead is not noticed in your case.
So all you need to do to address the issue is to add indexes on the joined field in both tables.
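A minimal sketch of that fix, with SQLite (Python standard library) standing in for MySQL and hypothetical table and column names. In MySQL the index could be added with something like `ALTER TABLE old_table ADD INDEX (joined_field)`; the anti-join itself is the same query shape as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables; 'ref' plays the role of the joined field.
    CREATE TABLE old_records (ref TEXT, payload TEXT);
    CREATE TABLE new_records (ref TEXT, payload TEXT);
    -- Index the joined field on both sides so the join can do lookups
    -- instead of scanning the other table once per row.
    CREATE INDEX idx_old_ref ON old_records (ref);
    CREATE INDEX idx_new_ref ON new_records (ref);
""")
conn.executemany("INSERT INTO old_records VALUES (?, ?)", [("a", "1"), ("b", "2")])
conn.executemany(
    "INSERT INTO new_records VALUES (?, ?)", [("a", "1"), ("b", "2"), ("c", "3")]
)

# Anti-join: rows present in the new pull that have no match in the old one.
added = conn.execute(
    "SELECT n.ref, n.payload FROM new_records AS n "
    "LEFT JOIN old_records AS o ON o.ref = n.ref "
    "WHERE o.ref IS NULL"
).fetchall()
print(added)
```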
Also, please describe the issues you are facing with the open source tools when connecting to MemSQL in a separate question, or at chat.memsql.com, so that we can fix them in the next version (I work for MemSQL, and compatibility with MySQL tools is one of our priorities).
I am in the process of setting up a MySQL server to store some data but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
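The "over 10 hours" estimate can be checked with a quick calculation. This sketch assumes a single load stream running at a constant rate, which real loads rarely sustain.

```python
ROWS = 5_000_000_000  # about 5 billion rows per day, as stated above


def hours_to_load(rows: int, rows_per_second: int) -> float:
    """Time to insert all rows at a constant rate, in hours."""
    return rows / rows_per_second / 3600


slow = hours_to_load(ROWS, 60_000)   # lower end of the reported insert speeds
fast = hours_to_load(ROWS, 100_000)  # upper end of the reported insert speeds
print(f"{fast:.1f} to {slow:.1f} hours")
```

So even the optimistic rate takes roughly 14 hours, and the pessimistic one roughly 23.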
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow merge all the separated data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before but was looking for a MySQL solution first, since it seems more widely used and easier to get help with if I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
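The per-day table, bulk load, and index toggling suggestions can be combined into one load plan. The sketch below only builds the statements as strings; running them needs a MySQL client connection, which is not shown. The table prefix, file path, and field/line terminators are assumptions, the table is assumed to exist already, and `ALTER TABLE ... DISABLE KEYS` is assumed to be applied to a myISAM table, where it suspends maintenance of non-unique indexes.

```python
import datetime
import re
from typing import List


def daily_table_name(day: datetime.date, prefix: str = "events") -> str:
    """One table per day, e.g. events_20240102 (the prefix is hypothetical)."""
    return f"{prefix}_{day:%Y%m%d}"


def load_plan(table: str, data_file: str) -> List[str]:
    """Statements to bulk-load one file with index maintenance suspended."""
    # Identifiers and paths are interpolated, so reject anything unexpected.
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if "'" in data_file or "\\" in data_file:
        raise ValueError(f"unexpected file path: {data_file!r}")
    return [
        f"ALTER TABLE {table} DISABLE KEYS",
        f"LOAD DATA INFILE '{data_file}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
        f"ALTER TABLE {table} ENABLE KEYS",
    ]


table = daily_table_name(datetime.date(2024, 1, 2))
plan = load_plan(table, "/var/lib/mysql-files/day.csv")  # hypothetical path
for statement in plan:
    print(statement + ";")
```

Dropping a day's data then becomes a matter of dropping or renaming one table rather than deleting billions of rows.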