Initially I thought this would be a stupid question, but now I am inspired by the following question.
Background: I have a lot of data in MySQL, but MySQL's spatial support is terrible. Ideally I would like to migrate everything to Postgres, but converting from MySQL to Postgres is a massive ball of hurt (I've already wasted close to a week struggling with it). Now I am thinking: if only I could keep just the spatial portion in Pg, do the spatial queries in Pg, then use the resulting row ids to query the non-spatial data from MySQL.
I am a Perl DBI person. My question is thus -- can I create a single database handle that actually allows querying by JOINing a table from Pg with a table from MySQL, assuming they have a common id column?
No, you will need to query both separately and combine the data at the application layer. See a more informed answer here:
How to create linked server MySQL
No, I don't think you can do it that way. You would have to query the data separately and combine the results in your code. I don't believe any real RDBMS can do what you want.
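For what it's worth, the id handoff is simple enough at the application level. A minimal sketch, assuming a PostGIS table spatial_features and a MySQL table attributes that share an id column (all names here are made up):

-- Step 1 (Postgres/PostGIS): run the spatial query and collect the ids
SELECT id
FROM spatial_features
WHERE ST_DWithin(geom, ST_MakePoint(-122.4, 37.8)::geography, 1000);

-- Step 2 (MySQL): fetch the non-spatial columns for those ids
SELECT *
FROM attributes
WHERE id IN (101, 102, 103);  -- the ids returned by step 1

Your application (Perl DBI, in this case) would hold two handles and build the IN list from the first result set.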
Related
In my database, entities are stored as full JSON-formatted objects. There is another table for the properties of entities; each property of an entity is stored in its own row. In a sense, the second table holds the properties of the JSON entities.
I'm aware that this is a terrible design that violates normal forms, and that it is an anti-pattern called Entity-Attribute-Value (EAV).
For a simple query, I must make a number of joins, and it is terrible for performance.
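For illustration, fetching just two properties in this kind of schema already needs two joins; a sketch with made-up table and column names:

-- entity(id, json_doc); entity_property(entity_id, name, value)
SELECT e.id,
       p1.value AS color,
       p2.value AS weight
FROM entity e
JOIN entity_property p1 ON p1.entity_id = e.id AND p1.name = 'color'
JOIN entity_property p2 ON p2.entity_id = e.id AND p2.name = 'weight';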
I wonder if there is a way to optimize these queries by reducing the number of joins?
The following solutions are out of the question, since there are lots of constraints:
normalizing the tables
migrating to a NoSQL engine
By the way, the database engine is MySQL 5.6.21. If I upgraded the MySQL database to another version, could I run queries on the JSON objects directly?
Thank you in advance.
MySQL 5.7.8 supports JSON as a datatype. Without more details, it's hard to see if that will help you - but it will almost certainly be faster than the EAV queries you currently have to do.
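For instance, with 5.7's JSON type you can query document fields directly, and even index a frequently-filtered path via a generated column. A minimal sketch, assuming a table entities(id, doc JSON) (names are made up):

-- query a JSON field directly
SELECT id
FROM entities
WHERE JSON_UNQUOTE(JSON_EXTRACT(doc, '$.status')) = 'active';

-- index a frequently-filtered path through a generated column (5.7+)
ALTER TABLE entities
  ADD COLUMN status VARCHAR(20)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(doc, '$.status'))) STORED,
  ADD INDEX idx_entities_status (status);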
I'm planning to use Solr for search over one MySQL table; this table will have millions of records.
I normalized this table so that it can be indexed and searched by Solr for better performance.
Is this right, or should I just search the MySQL table itself with a normal SELECT?
Please advise which is better, since it's just one table to be indexed and searched.
Thanks
If your requirement is to "search" only one MySQL table using regular SQL SELECT statements, then Apache Solr seems a bit roundabout and overkill.
With Solr, you'd need to periodically import data from the MySQL table into the Solr schema, which essentially means you'll be maintaining two copies of the data. There's also a delay between the time data is changed in the MySQL table and when the Solr schema is refreshed from the database.
Aside from that, as far as performance goes, I'd expect Solr to be nearly as fast searching its schema as MySQL would be at searching its own table. But you'd really need to test that to make the determination for your particular requirements.
You'd likely want to make use of the Solr "delta import", specifying appropriate queries to identify rows in the MySQL table that have been inserted or updated since the last import.
If this is the only usage of the MySQL table, then the only indexes you would really need on it, as far as Solr search/import performance is concerned, are those needed to optimize the two "delta import" queries.
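For example, on the MySQL side the delta query might look like this, assuming the table carries a last_modified timestamp (names and values are illustrative):

-- rows inserted or updated since the last import (Solr substitutes the real timestamp)
SELECT id
FROM items
WHERE last_modified > '2015-06-01 12:00:00';

-- the index that keeps that delta scan cheap
CREATE INDEX idx_items_last_modified ON items (last_modified);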
Try using a third-party search service like www.rockitsearch.com. It's an out-of-the-box search solution, so there's no need to worry about maintaining a Solr cluster.
I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-topology data warehouse. The table sizes range from 1GB up to 700GB (the fact table), and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which may even include joins.
A simple analytical query on this database would be "find the number of smokers per state and display it in descending order". This requirement could be translated into a simple query like:
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a very long time to execute on this database; the time taken is on the order of tens of hours.
This database is also heavily used for insertion, which means thousands of rows are added every few minutes.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it would be as easy to run analytical queries on it, especially when I have to use WHERE clauses and GROUP BY constructs.
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries there. I am also not sure I want to invest right away in at least three machines for the name node, ZooKeeper, and the data nodes! Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data in simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the MySQL environment setup:
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are InnoDB, with file-per-table enabled
5) indexes on string as well as int columns
Pre-calculating values is an option, but the requirements for this kind of ad-hoc aggregated value keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column contain string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table and keep integer keys in the fact table (this will help the sizes of the indexes too); see the sketch after this list.
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
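On the "strings out of the fact table" point above, a minimal sketch (all names are assumptions about the schema in the question):

-- small lookup table for the status strings
CREATE TABLE smoking_status (
  status_id TINYINT PRIMARY KEY,
  label     VARCHAR(32) NOT NULL   -- 'current smoker', 'recently quit', ...
);

-- the fact table keeps only the small integer key, which also shrinks the index
ALTER TABLE abc ADD COLUMN smoking_status_id TINYINT,
                ADD INDEX idx_abc_status (smoking_status_id);

-- the analytical query then filters on the integer key
SELECT f.state, COUNT(*) AS smokers
FROM abc f
JOIN smoking_status s ON s.status_id = f.smoking_status_id
WHERE s.label = 'current smoker'
GROUP BY f.state
ORDER BY smokers DESC;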
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ and based on MySQL, so it's probably the easiest to fit into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries at the Infobright server, keep the rest of the system as it is, job done) - or see if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system, it may take more care and feeding than the other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day, and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate the number of new smokers every day? (See the sketch after these questions.)
Can you use Hadoop for the aggregation process above? Hadoop is pretty good at that kind of thing. Basically, use Hadoop just for daily or batch processing and store the results in the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indexes on string columns? Can you turn those into ints, etc.?
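On the pre-aggregation question above, a minimal sketch of a daily summary table, refreshed by a nightly job (all names are made up except those taken from the question):

CREATE TABLE daily_smokers_by_state (
  snapshot_date DATE,
  state         VARCHAR(32),
  smokers       INT,
  PRIMARY KEY (snapshot_date, state)
);

-- nightly job: fold today's numbers into the summary
INSERT INTO daily_smokers_by_state
SELECT CURRENT_DATE, state, COUNT(*)
FROM abc
WHERE smokingStatus = 'current smoker'
GROUP BY state;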
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: a lack of parallel query capability. It cannot utilize multiple CPUs for a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating queries written in HiveQL (an SQL-like language) into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating the daily data to it, and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You will still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive: each query takes at least a few dozen seconds.
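To give a feel for it, the aggregation from the question runs essentially unchanged as HiveQL once the fact table is replicated into Hive:

-- HiveQL: same shape as the MySQL query, executed as MapReduce jobs
SELECT state, COUNT(smokingStatus) AS smokers
FROM abc
WHERE smokingStatus = 'current smoker'
GROUP BY state
ORDER BY smokers DESC;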
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema onto Cassandra, and you still would not get the flexibility and indexing capabilities of an RDBMS.
As a bottom line: Hive alongside MySQL should be a good solution.
I want to create a query result page for a simple search, and I don't know whether I should use views in my database, or whether it would be better to write the query directly in my code, using the same syntax I would use to create the view.
What is the better solution for merging 7 tables when I want to build a search module for my site, which has lots of users and page loads?
(I'm searching in several tables at the same time.)
You would be better off using a plain query with joins instead of a view. Views in MySQL are not optimized. Be sure to have your tables properly indexed on the fields used in the joins.
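A minimal sketch of the plain-query approach, with made-up table names; the point is the indexes on the fields used for joining and filtering:

SELECT u.username, p.title
FROM users u
JOIN posts p ON p.user_id = u.id
WHERE p.title LIKE 'searchterm%';

-- indexes on the join column and the filter column
CREATE INDEX idx_posts_user_id ON posts (user_id);
CREATE INDEX idx_posts_title   ON posts (title);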
If you always use all 7 tables, I think you should use views. Be aware that MySQL rewrites your original query when creating the view, so it's always good practice to save your query elsewhere.
Also, remember that you can tweak MySQL's query cache variable so that it stores more data, making your queries respond faster. However, I would suggest using some other method for caching, such as memcached. The paid version of MySQL supports memcached natively, but I'm sure you can implement it at the application layer, no problem.
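For instance, the relevant server variable can be raised at runtime (the size here is just an example):

-- MySQL 5.x query cache; takes effect immediately, requires the SUPER privilege
SET GLOBAL query_cache_size = 67108864;  -- 64 MB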
Good luck!
I've seen questions for doing the reverse, but I have an 800MB PostgreSQL database that needs to be converted to MySQL. I'm assuming this is possible (all things are possible!), and I'd like to know the most efficient way of going about this and any common mistakes to look out for. I have next to no experience with Postgres. Any links to guides on this would be helpful also! Thanks.
One piece of advice is to start with a current version of MySQL; otherwise you will not have subqueries, stored procedures, or views. The other obvious difference is auto-increment fields. Check out:
pg2mysql
You should not convert to a new database engine based solely on the fact that you do not know the old one. These databases are very different: MySQL is about speed and simplicity, Postgres about robustness and concurrency. It will be easier for you to learn Postgres; it is not that hard.
pg_dump can produce the dump as INSERT statements (the --inserts option) along with CREATE TABLE statements. That should get you close. The bigger question, though, is why you want to switch. You may do a lot of work and not get any real gain from it.