I am on a Linux platform with PostgreSQL 5.5. I am trying to monitor all PostgreSQL traffic between Master and Slave. To that end, I used Wireshark to capture the traffic. I started PostgreSQL and ran three queries (created table Hello, created table Bye, and inserted an image into the PostgreSQL database). While the queries ran, I had Wireshark running on the Master to capture the traffic between Master and Slave.
But there is one problem with the PostgreSQL traffic captured by Wireshark: all of it is sent/received in TCP packets in an encoded form, and I can't read the data. I want to find the three queries I ran against the PostgreSQL database in the Wireshark capture.
What is the best way to go about finding the queries in the PostgreSQL traffic?
On the other hand, I ran the same queries against a MySQL database and repeated the experiment described above. I can easily read all three queries in the Wireshark dump because they are not encoded.
The Wireshark file from the PostgreSQL experiment is available at Wireshark-File. I need to find the three queries above in that Wireshark file.
About the file:
192.168.50.11 is the source machine from which I issued the queries to the remote PostgreSQL Master server. 192.168.50.12 is the Master server's IP. 192.168.50.13 is the Slave's IP address. Queries were executed from .11, applied on .12, and then replicated to .13 using the Master-Slave approach.
Pointers will be very welcome.
You are probably using WAL-based replication (the default), which means you can't.
This involves shipping the transaction logs between machines. They are the actual on-disk representation of the data, not the SQL text.
There are alternative trigger-based replication methods (Slony etc.) and the newer logical replication.
Neither will let you recreate the complete original query, as I understand it, but they would let you get closer.
There are systems which duplicate the queries on nodes (like MySQL) but they aren't quite the same thing.
If you want to know exactly what queries are running on the master, turn on query logging and monitor the logs instead.
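For example, a minimal sketch of turning on statement logging from Python with psycopg2 (the connection details just mirror the ones used elsewhere in this post and require superuser privileges; you can equally set log_statement = 'all' directly in postgresql.conf):
import psycopg2

# Placeholder connection details; requires superuser privileges
con = psycopg2.connect(host="192.168.50.12", database="postgres",
                       user="postgres", password="faban")
con.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = con.cursor()

# Log every statement ('ddl' or 'mod' are less verbose alternatives)
cur.execute("ALTER SYSTEM SET log_statement = 'all'")
# Reload the configuration so the change takes effect without a restart
cur.execute("SELECT pg_reload_conf()")

cur.close()
con.close()
Every statement the master executes then appears in the PostgreSQL log files, which is far easier to read than a packet capture.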
Solution to my own problem:
I found the solution to my own question.
I used Python code to insert queries into the remote PostgreSQL database, and I used the following line in Python to connect to the database.
con = psycopg2.connect(host="192.168.50.12", database="postgres", user="postgres", password="faban")
If you use the approach above, all the data is sent in encrypted form, because libpq defaults to sslmode=prefer and negotiates SSL when the server supports it. If you use the approach below in the Python code instead, all the data is sent unencrypted, and you can easily read all the queries in Wireshark.
con = psycopg2.connect("host=192.168.50.12 dbname=postgres user=postgres password=faban sslmode=disable")
The same is the case for C code (libpq) as well.
Unencrypted (readable) data:
sprintf(conninfo, "dbname=postgres hostaddr=192.168.50.12 user=postgres password=faban sslmode=disable");
Encrypted data:
sprintf(conninfo, "dbname=postgres hostaddr=192.168.50.12 user=postgres password=faban");
Related
I have a system in which data is written constantly. It runs on MySQL. I also have a second system that runs on SQL Server and uses some parameters from the first database.
Question: how is it possible (is it even possible) to constantly transfer values from one database (MySQL) to the other (SQL Server)? Consolidating everything into a single database is not an option. As I understand it, it will be necessary to write a program, for example in Delphi, that transfers values from one database to the other.
You have a number of options.
SQL Server can access another database using ODBC, so you could set up SQL Server to obtain the information it needs directly from tables held in MySQL.
MySQL supports replication using log files, so you could configure MySQL replication (which does not have to cover all tables) to write the relevant transactions to a log file. You would then need to process that log file, which you could do in (almost) real time just as standard MySQL replication does, to identify what needs to be written to the MS SQL Server. Typically this produces a set of statements to run against the MS SQL Server. You can use any number of languages to process the log file and issue the updates.
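One way to consume that replication stream programmatically is the python-mysql-replication package; this is an assumption on my part rather than part of your setup, and the connection details and table name below are placeholders. A rough sketch:
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

# Placeholder connection details; the MySQL side needs binary logging
# enabled with binlog_format=ROW.
MYSQL = {"host": "mysql.example.local", "port": 3306,
         "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=100,               # must be unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    only_tables=["parameters"],  # hypothetical table holding the values
    blocking=True,               # keep waiting for new events
)

for event in stream:
    for row in event.rows:
        # Translate each row change into a statement for MS SQL Server here
        print(event.table, row)

stream.close()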
You could have a scheduled task that reads the required parameters from MySQL and posts them to MS SQL, but this would leave periods of time where the two may not be in sync. Given that you may run into issues with parsing log files and posting the updates, you may still want to implement this as a fallback even if you are processing log files.
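A minimal sketch of that scheduled task in Python, assuming the pymysql and pyodbc packages and hypothetical table and column names:
import pymysql
import pyodbc

# Source: MySQL (placeholder credentials, hypothetical 'parameters' table)
mysql_conn = pymysql.connect(host="mysql.example.local", user="reader",
                             password="secret", database="app")
# Target: SQL Server via ODBC (placeholder connection string)
mssql_conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mssql.example.local;"
    "DATABASE=app;UID=writer;PWD=secret"
)

src = mysql_conn.cursor()
dst = mssql_conn.cursor()

src.execute("SELECT name, value FROM parameters")
for name, value in src.fetchall():
    # Upsert each parameter on the SQL Server side
    dst.execute(
        "MERGE parameters AS t "
        "USING (SELECT ? AS name, ? AS value) AS s ON t.name = s.name "
        "WHEN MATCHED THEN UPDATE SET t.value = s.value "
        "WHEN NOT MATCHED THEN INSERT (name, value) VALUES (s.name, s.value);",
        name, value,
    )

mssql_conn.commit()
mysql_conn.close()
mssql_conn.close()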
If the SQL Server and the MySQL server are on the same network, the external-tables method is likely to be the simplest and lowest-maintenance option, but depending on the amount of data involved you may find the overhead of the external connection and queries affects the overall performance of the queries made against the MS SQL Server.
For a project we are working with several external partners. For the project we need access to their MySQL database. The problem is, they can't grant that. Their database is hosted in a managed environment where they don't have many configuration options, and they don't want to give us access to all of their data. So the solution they came up with is the federated storage engine.
We now have one table for each table of their database. The problem is that the amount of data we get is huge and will only increase in the future. That means a lot of inserts are performed on our database. The optimal solution for us would be to intercept all incoming MySQL traffic, process it, and then store it in bulk. We also thought about using something like Redis to store the data.
Additionally, we plan to get more data from different partners. They will potentially provide the data in different ways, so using Redis would allow us to have all our data in one place.
Copying the data to Redis after it is stored in the MySQL database is not an option. We just can't handle that many inserts, and we need the data as fast as possible.
TL;DR
Is there a way to pretend to be a MySQL server so we can directly process data received via the federated storage engine?
We also thought about using the Blackhole engine in combination with binary logging on our side, so incoming data would only be written to the binary log and wouldn't be stored in the database. But then performance would still be limited by disk I/O.
Just some context: in our old data pipeline system we run MySQL 5.6 or Aurora on Amazon RDS. The bad thing about the old pipeline is that it runs a lot of heavy computation on the database servers, because we are handcuffed by the original design: the transactional databases are treated as a data warehouse, and our backend API directly and heavily "fishes" in those databases. We are currently patching this old data pipeline while redesigning the new data warehouse in Snowflake.
In the old system, the data pipeline calculation is a series of sequential MySQL queries. As our data grows bigger and bigger, the problem is that the calculation might hang forever at, for example, the step-3 MySQL query, while all the metrics we monitor in Amazon CloudWatch/Grafana (CPU, database connections, freeable memory, network throughput, swap usage, read latency, available storage, write latency, etc.) look normal. The MySQL slow query log is not really helpful here because every query in the pipeline is essentially slow anyway (a query can take hours to run, since the old pipeline does so much heavy computation on the database servers). The way we usually solve these problems is to "blindly" upgrade the MySQL/Aurora Amazon RDS service and hope it solves the issue. I am wondering:
(1) What are the recommended database metrics in MySQL 5.6 or Aurora on Amazon RDS that we should monitor in real time to help identify why a query freezes forever? Something like innodb_buffer_pool_size?
(2) Is there any existing tool and/or in-house approach that would let us predict how many hardware resources we need before we can confidently execute a query and know it will succeed? Could someone share their two cents?
One thought: since Amazon RDS is sometimes a bit of a black box, one possible approach is to host our own MySQL server on an Amazon EC2 instance in parallel with our Amazon MySQL 5.6/Aurora RDS production server, so we can SSH into the MySQL server and run command-line tools like mytop (https://www.tecmint.com/mysql-performance-monitoring/) to gather far more real-time MySQL metrics to help us triage the issue. Open to any two cents from gurus. Thank you!
None of the tools mentioned at that link needs to run on the database server itself, and their behavior is no different when they don't. Run them on any Linux server, giving the appropriate --host, --user, and --password arguments (in whatever form they expect). Even mysqladmin works remotely. Most of the MySQL command-line tools do (such as the mysql CLI, mysqldump, mysqlbinlog, and even mysqlcheck).
There is no magic coupling that most administrative utilities can gain by running on the same server as MySQL Server itself -- this is a common misconception but, in fact, even when running on the same machine, they still have to make a connection to the server, just like any other client. They may connect to the unix socket locally rather than using TCP, but it's still an ordinary client connection, and provides no extra capabilities.
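As a concrete illustration of an ordinary remote client connection doing the diagnostic work, a minimal sketch assuming the pymysql package and placeholder credentials:
import pymysql

# Placeholder endpoint/credentials for the RDS or Aurora instance
conn = pymysql.connect(host="mydb.xxxxxx.us-east-1.rds.amazonaws.com",
                       user="admin", password="secret")
cur = conn.cursor()

# What is running right now, and for how long?
cur.execute("SHOW FULL PROCESSLIST")
for row in cur.fetchall():
    print(row)

# Long-running InnoDB transactions are often the real culprit behind a
# query that appears to hang forever (e.g. waiting on row locks).
cur.execute(
    "SELECT trx_id, trx_started, trx_state, trx_rows_locked, trx_query "
    "FROM information_schema.innodb_trx ORDER BY trx_started"
)
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()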
It is also possible to run an external replica of an RDS/MySQL or Aurora/MySQL server on your own EC2 instance (or in your own data center, even). But this isn't likely to tell you a whole lot that you can't learn from the RDS metrics, particularly in light of the above. (Note also, that even replica servers acquire their replication streams using an ordinary client connection back to the master server.)
Avoid the temptation to tweak server parameters. On RDS, most of the defaults are quite sane, and unless you know specifically and precisely why you want to adjust a parameter... don't do it.
The most likely explanation for slow queries... is poorly written queries and/or poorly designed indexes.
If you are not familiar with EXPLAIN SELECT, then you need to learn it, live it, and love it. SQL is declarative, not procedural. That is, SQL tells the server what you want -- not specifically how to obtain it internally. For example, SELECT ... FROM x JOIN y tells the server to match up the rows from tables x and y ON certain criteria, but does not tell the server whether to read from x and then find the matching rows in y... or read from y and find the matching rows in x. The net result is the same either way -- it doesn't matter which table the server examines first, internally -- but if the query or the indexes don't allow the server to correctly deduce the optimum path to the results you've requested, it can spend countless hours churning through unnecessary effort.
Take, for an extreme and overly simplified example, a table with millions of rows and a table with 1 row. It would make sense to read the small table first, so you know what single value you're trying to join in the large table. It would make no sense to read through each row in the large table and then go check the small table for a match, once for each of the millions of rows. The order in which you join tables can be different from the order in which the actual joining is done.
And that's where EXPLAIN comes in. It allows you to inspect the query plan -- the strategy the internal query optimizer has concluded will get it to the answer you need with the least amount of effort. This is the core of the magic of relational database systems -- finding the correct solution in the optimal time, based on what it knows about the data. EXPLAIN shows you the order in which the tables are being accessed, how they're being joined, which indexes are being used, and an estimate of how many rows from each table are involved -- and those numbers multiply together to give you an estimate of the number of permutations involved in resolving your query. Two small tables, each with 50,000 rows, joined without a proper index, means an entirely unreasonable 2,500,000,000 unique combinations between the two tables that must be evaluated; every row must be compared to every other row. In short, if this turns out to be the kind of thing you are (unknowingly) asking the server to do, then you are definitely doing something wrong. Inspecting your query plan should be second nature any time you write a complex query, to ensure that the server is using a sensible strategy to resolve it.
The output is cryptic, but secret decoder rings are available.
https://dev.mysql.com/doc/refman/5.7/en/explain.html#explain-execution-plan
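A minimal sketch of pulling a plan from Python (assuming the pymysql package and hypothetical tables x and y):
import pymysql

conn = pymysql.connect(host="mydb.example.local", user="admin",
                       password="secret", database="app")
cur = conn.cursor()

# EXPLAIN returns the optimizer's plan instead of executing the query
cur.execute(
    "EXPLAIN SELECT x.id, y.value "
    "FROM x JOIN y ON y.x_id = x.id "
    "WHERE x.created_at > '2020-01-01'"
)
columns = [desc[0] for desc in cur.description]
for row in cur.fetchall():
    # Watch the 'rows' estimates and the 'key' column: a NULL key plus a
    # huge row count usually means a missing or unusable index.
    print(dict(zip(columns, row)))

cur.close()
conn.close()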
A current project I am working on has been exclusively using MySQL as our RDBMS. We are currently looking to segment the database into two different databases. One will be moving to Redshift (which runs on a modified PostgreSQL) while the other will continue using MySQL.
My concern does not stem from splitting the data, but rather from how applications will interact with the segmented data. Effectively, our current application will be reading static data from Redshift and writing to the MySQL database, and I am curious whether it is bad practice to intermingle these SQL dialects.
Would it be better to migrate the MySQL DB to Postgres to limit complications arising from their differences?
We (Looker) work with many customers (100s) that have both MySQL and Redshift. The progression as their needs grow is usually:
MySQL
MySQL + MySQL slave
MySQL + MySQL Writable Slave
MySQL + MySQL Writable Slave + Redshift
So your best bet, if you haven't done so already, is to set up a MySQL replica (slave) database. The replica follows your master write database and is essentially an exact copy of the master.
You can also make your replica writable. This becomes really useful for building summary tables. Here are some instructions on how to make a writable replica in RDS, but you can do it in other systems too.
http://www.looker.com/docs/setup-and-management/database-config/mysql-rds
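As an illustration of the summary-table idea, a sketch with hypothetical table names, assuming the pymysql package and a replica where read_only has been switched off:
import pymysql

# Connects to the writable replica, not the master (placeholder endpoint)
replica = pymysql.connect(host="replica.example.local", user="etl",
                          password="secret", database="app")
cur = replica.cursor()

# Rebuild a daily revenue summary from the replicated orders table
cur.execute("DROP TABLE IF EXISTS daily_revenue")
cur.execute(
    "CREATE TABLE daily_revenue AS "
    "SELECT DATE(created_at) AS day, SUM(amount) AS revenue "
    "FROM orders GROUP BY DATE(created_at)"
)

replica.commit()
cur.close()
replica.close()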
If you have big event data that you want to integrate with your transactional data, the next step is to set up a process that migrates all your MySQL data into Redshift and pumps in data from other sources (like your event data, for example). Moving all the data gives you the ability to ask any question from Redshift.
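A very rough sketch of the load step, assuming the data has already been exported from MySQL to gzipped CSV files in S3; the cluster endpoint, bucket, IAM role, and table name are placeholders, and psycopg2 works here because Redshift speaks the PostgreSQL protocol:
import psycopg2

# Placeholder Redshift cluster endpoint and credentials
rs = psycopg2.connect(host="mycluster.xxxxxx.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="analytics", user="etl",
                      password="secret")
rs.autocommit = True
cur = rs.cursor()

# Bulk-load the exported MySQL data from S3 into Redshift
cur.execute(
    "COPY events "
    "FROM 's3://my-bucket/mysql-export/events/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
    "FORMAT AS CSV GZIP"
)

cur.close()
rs.close()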
Redshift will lag hours or more behind the MySQL database. If you need to answer real time questions, query MySQL. If you want general insights, query the Redshift database.
Two machines, each running mysql, each synchronized to the other peer-to-peer. I do not want a master db replicated. Rather, I want two users to be able to work on the data offline (each running a mysql server on his machine) and then when reconnected synchronize to each other. Any way to do this with mysql? Any other database I should be looking at to accomplish this better than mysql?
Two-way replication is provided by various database systems (e.g. SQL Server, Sybase, etc.), but there are always problems with such a setup.
For example, if the same row is updated at the same time on the two databases, which update wins?
If your aim is to provide a highly-available MySQL database, then there are better options than using replication. MySQL has a clustering solution (though I've not had much success with it) or you can use things like DRBD and heartbeat to provide automatic failover with no loss of data.
If you mean synchronous writing back and forth, this would cause serious data consistency issues. I think you may be referring to MySQL replication, wherein a master server sends its updates to one or more slave database servers, which can be queried.
As for "Other Database Options" SQLServer supports a fairly advanced "replication" process for synchronizing the data between two or more db's. Looks like MySql has something like this as well though.