I already have a product in production that uses Entity Framework with SQL Server as the database. I use full-text search: I store binary data (FILESTREAM) along with the file extension, which the Full-Text Search engine uses to index the content. Setting that up was a cakewalk.
Now I am planning to move to MySQL (for the obvious reasons: cost, open source, etc.). The product follows a SaaS model, so volume will eventually be high and the search engine must be scalable. Traffic isn't high yet, so now is the time to make a definite call.
Migration to MySQL is easy (I will be using InnoDB, again for obvious reasons). I am only stuck on full-text search, because right now only the binary data is stored in SQL Server. Although InnoDB supports full-text search from version 5.6 onwards, I did not find a way to build a full-text index over binary data.
I am not sure about using a third-party full-text search engine (Lucene, Sphinx, etc.), as my searches combine structured and unstructured data. For example: all the customers from Ohio (structured data, which I capture and store in the relational database) who have "insurance" in their set of uploaded documents (unstructured). In SQL Server I use CONTAINSTABLE, which gives me ranked results.
I have the following questions:
Will this move really be fruitful in the long run, assuming I do migrate the existing data (in four figures)?
Although InnoDB provides all the necessary features, does it match SQL Server in ease of use, administration and scale?
Some more questions:
Can I move the full-text table to MySQL as is? I guess I can't index BLOB columns.
Should I use MySQL full-text search or a third-party engine?
That's really hard to say. I've used both systems and I prefer MySQL, although I find that index rebuilding is a lot faster in MSSQL. A tip for MySQL: if you have a lot of updates to make to a full-text table, drop the whole full-text index, do the updates, and then re-add the index. This will save you an insane amount of time and give you fewer headaches.
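A minimal sketch of that drop-and-re-add tip, assuming a hypothetical table customer_document with a FULLTEXT index named ft_body on a body_text column (none of these names come from the original post):

```sql
-- Hypothetical names: customer_document, ft_body, body_text.

-- 1. Drop the full-text index before the bulk change.
ALTER TABLE customer_document DROP INDEX ft_body;

-- 2. Run the bulk INSERT / UPDATE / DELETE statements here.
--    They no longer pay for full-text index maintenance row by row.

-- 3. Rebuild the index in a single pass once the data is in place.
ALTER TABLE customer_document ADD FULLTEXT INDEX ft_body (body_text);
```

Keep in mind that searches relying on the index will fail or slow down while it is absent, so this is probably best done in a maintenance window.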
I use MySQL Workbench, and it has proven very useful. But here MSSQL has a massive advantage with Management Studio, which has a lot more features and scalability.
The question itself is very broad; there are so many aspects. If your biggest concern is cost, then MySQL is the best option. Performance-wise, if the MySQL server is set up well, I have seen very little difference in performance between the two RDBMSs.
I'm not sure if my answer was helpful, but I hope it gave you a little insight.
EDIT:
No, you cannot put a full-text index on a BLOB column.
In that case I suggest a document-based search engine such as Solr or Lucene.
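If you would rather stay with MySQL's built-in full-text search, one possible workaround is to extract the plain text in the application before inserting and to store it in a TEXT column next to the BLOB. The following is only a sketch under that assumption. All table and column names are hypothetical, and it targets InnoDB on MySQL 5.6 or later. It shows the "customers from Ohio with insurance in their documents" search, with a relevance score playing the role of the CONTAINSTABLE rank:

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE customer (
  customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
  name        VARCHAR(200) NOT NULL,
  state       CHAR(2)      NOT NULL,
  KEY idx_state (state)
) ENGINE=InnoDB;

CREATE TABLE customer_document (
  document_id INT UNSIGNED NOT NULL PRIMARY KEY,
  customer_id INT UNSIGNED NOT NULL,
  file_ext    VARCHAR(10)  NOT NULL,
  file_data   LONGBLOB     NOT NULL,  -- original binary, not full-text indexable
  body_text   MEDIUMTEXT,             -- text extracted by the application (assumption)
  KEY idx_customer (customer_id),
  FULLTEXT KEY ft_body (body_text)
) ENGINE=InnoDB;

-- Structured filter (state) combined with a ranked full-text match.
SELECT c.customer_id,
       c.name,
       MATCH(d.body_text) AGAINST('insurance' IN NATURAL LANGUAGE MODE) AS score
FROM customer AS c
JOIN customer_document AS d ON d.customer_id = c.customer_id
WHERE c.state = 'OH'
  AND MATCH(d.body_text) AGAINST('insurance' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

The extraction step itself (PDF, Word and so on) has to live outside the database, which is the part SQL Server handled for you.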
The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64G memory, 2T disc, 2.30Ghz CPU with 8 cores and 16 logical, and the query is running on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table has been set to ENGINE=MyISAM and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried as shown below.
EDIT:
Untagged MS SQL Server. I had tagged it hoping for help from our MS SQL Server colleagues, in case SQL Server has some way of structuring data and/or queries that makes this type of query run faster on that platform than on others such as MySQL.
Removed image of code to more closely conform with community standards
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySql). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution here?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback
The MS Sql Server tag was used here as one of the systems I'm comparing is MS Sql Server. We're really working with very limited information here. I have two systems: My MySql system, which is knowable as I'm running it, and the MS Sql Server running the same database in someone else's system that I have very little information about (all I have is a read only sql prompt). I am comparing apples and oranges: The same query runs well on the orange (MS Sql Server) and does not run well on the apple (My MySql instance). I'd like to know why so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS Sql Server? Do I need to deploy on different hardware? Is the other system running some kind of in memory caching system on top of their database instance? Most of these possibilities would require a non trivial amount of time to explore and validate. So yes, I would like help from MS Sql Server experts that might know if there are caching options, transactional v warehouse options, etc. that could be set that would make a world of difference, that would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was included because I was trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table without explicitly specifying ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases help with performance. Suggest you change the engine as you add the indexes.
Show us some more queries; we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(visit_start_date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
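As a rough sketch of the engine change plus index suggested above: the table and column names are the ones from the question, while the index name idx_visit_start_date is made up for illustration.

```sql
-- Find the tables in the current schema that are still MyISAM.
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND engine = 'MyISAM';

-- Convert one table and add the index in the same rebuild,
-- so the table is only rewritten once.
ALTER TABLE visit_occurrence
  ENGINE=InnoDB,
  ADD INDEX idx_visit_start_date (visit_start_date);
```

On a large table this rebuild can take a while and needs free disk space for the copy, so try it on one table first.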
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; do not use it without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server groups by the entire table.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
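To check whether the index is actually being used, something along these lines should do. It assumes the visit_start_date index shown earlier in this answer has been created.

```sql
-- Show the plan without running the query.
EXPLAIN SELECT MIN(visit_start_date) FROM visit_occurrence;

-- MySQL 8.0.18 and later only: run the query and report actual timings.
EXPLAIN ANALYZE SELECT MIN(visit_start_date) FROM visit_occurrence;

-- List the indexes the optimizer has to choose from.
SHOW INDEX FROM visit_occurrence;
```

With the index in place, MySQL typically reports "Select tables optimized away" in the Extra column for this query, meaning it answered from the index alone. Without the index you would expect a full table scan (type ALL).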
I'm currently using a MySQL DB to pull/run queries in conjunction with Tableau. Because of the amount of data, the queries are taking hours to run. I'm thinking of switching to PostgreSQL but I'm new to it. Would this be a good idea, or can I optimize MySQL for my needs? I will be adding various data sources as we grow as well.
It's hard to answer definitively without knowing: your schema, your indexes and your queries sent by Tableau.
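If it helps, here is one way to collect that information on the MySQL side. This is only a sketch: sales_fact is a hypothetical table name, the 5-second threshold is arbitrary, and SET GLOBAL needs administrative privileges.

```sql
-- Log every statement that runs longer than 5 seconds.
-- The new threshold applies to connections opened after this point.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 5;

-- Where the log is written.
SHOW VARIABLES LIKE 'slow_query_log_file';

-- Schema and indexes of a table that Tableau reads from.
SHOW CREATE TABLE sales_fact;
SHOW INDEX FROM sales_fact;
```

The statements that show up in the slow log are the ones worth running through EXPLAIN.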
MySQL (and MariaDB) are excellent databases for certain use cases. Postgres is excellent for most of those use cases, and also others. [risk of generalizing alert]: Postgres can utilize complex indexes better, and also can be finer tuned.
Your statement "Based on the amount of data" suggests indexes are not aligned with the info you want to pull. I know from experience, an index that supports my data pulls makes queries run like a hot knife through butter, no matter what db is used.
8 times out of 10, MySQL or Postgres would suffice. This tableau page suggests a conversation with your DBA would help you.
If you are your own DBA as is often the case, I'd go with Postgres.
I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-topology data warehouse. The table sizes range from 700 GB (the fact table) down to 1 GB, and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database could be "find the number of smokers per state and display it in descending order". This requirement could be converted into a simple query like
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries on it. I am not too sure I want to invest right away in at least three machines for the name node, ZooKeeper and data nodes. Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for this kind of ad-hoc aggregated value keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first: it's open source/$$ and based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, point your analytical queries to the Infobright server, keep the rest of the system as it is, job done). Vertica would be another option if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts. It's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care and feeding than other options.
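The "move the strings out of the fact table" step could look roughly like this. The table abc and the column smokingStatus come from the question. The lookup table, the new column and the index name are hypothetical, and the sketch assumes state is a short CHAR/VARCHAR column.

```sql
-- Hypothetical lookup table for the status strings.
CREATE TABLE smoking_status (
  smoking_status_id TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  label             VARCHAR(30)      NOT NULL UNIQUE
) ENGINE=InnoDB;

INSERT INTO smoking_status (smoking_status_id, label) VALUES
  (1, 'current smoker'),
  (2, 'recently quit'),
  (3, 'quit 1+ years'),
  (4, 'never smoked');

-- Expensive on a 700 GB table: plan a maintenance window.
ALTER TABLE abc
  ADD COLUMN smoking_status_id TINYINT UNSIGNED NULL,
  ADD INDEX idx_status_state (smoking_status_id, state);

-- Backfill the integer key from the existing string column.
UPDATE abc AS f
JOIN smoking_status AS s ON s.label = f.smokingStatus
SET f.smoking_status_id = s.smoking_status_id;

-- Check the plan: the composite index should be able to answer this alone.
EXPLAIN
SELECT f.state, COUNT(*) AS smokers
FROM abc AS f
WHERE f.smoking_status_id = 1
GROUP BY f.state
ORDER BY smokers DESC;
```

The composite index holds both columns the query needs, so in principle the 700 GB of row data never has to be read for it.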
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day, and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
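To illustrate the partial aggregation idea, here is a sketch of a daily summary table. It assumes the fact table has a timestamp column, called created_at here, which is not mentioned in the question. The summary table name is also made up.

```sql
-- Hypothetical summary table: one row per day and state.
CREATE TABLE smokers_by_state_daily (
  summary_date DATE         NOT NULL,
  state        VARCHAR(50)  NOT NULL,
  smokers      INT UNSIGNED NOT NULL,
  PRIMARY KEY (summary_date, state)
) ENGINE=InnoDB;

-- Run once a day: aggregate yesterday's rows only.
INSERT INTO smokers_by_state_daily (summary_date, state, smokers)
SELECT DATE(created_at), state, COUNT(*)
FROM abc
WHERE smokingStatus = 'current smoker'
  AND created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY DATE(created_at), state
ON DUPLICATE KEY UPDATE smokers = VALUES(smokers);

-- The report then reads the small table instead of the fact table.
SELECT state, SUM(smokers) AS total_smokers
FROM smokers_by_state_daily
GROUP BY state
ORDER BY total_smokers DESC;
```

This only covers questions you knew about in advance, which is the trade-off the asker already pointed out.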
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: a lack of parallel query capability. It cannot utilize multiple CPUs in a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating your queries, written in HiveQL (a SQL-like language), into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating daily data to it and running heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive: each query will take at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema into Cassandra, and you still would not get the flexibility and indexing capabilities of an RDBMS.
As a bottom line, Hive alongside MySQL should be a good solution.
I need to store large amount of data every hour in the database. What kind of data? Text data.
What is the best way? Store it in multiple tables or one large table?
Edit: I just said, large text data. 10000 times the word "data"
Every hour a new line is added like:
hour - data
Edit 2: Just because you can't understand the question does not mean it is unreadable. I also said "EVERY HOUR", so imagine that a new line will be created every hour for the next 10 years.
Use a column of datatype 'text', 'mediumtext', or 'longtext' according to your needs.
See: http://dev.mysql.com/doc/refman/5.0/en/blob.html
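For what it's worth, one row per hour for 10 years is only about 24 × 365 × 10 = 87,600 rows, so a single table should cope easily. A minimal sketch, with made-up table and column names:

```sql
-- Hypothetical table: one row per hour.
CREATE TABLE hourly_data (
  recorded_at DATETIME   NOT NULL,
  data        MEDIUMTEXT NOT NULL,  -- up to 16 MB per row
  PRIMARY KEY (recorded_at)
) ENGINE=InnoDB;

-- Example row: the word "data" repeated 10000 times.
INSERT INTO hourly_data (recorded_at, data)
VALUES ('2024-01-01 00:00:00', REPEAT('data ', 10000));

-- Fetch one day's rows through the primary key.
SELECT recorded_at, CHAR_LENGTH(data) AS chars
FROM hourly_data
WHERE recorded_at >= '2024-01-01'
  AND recorded_at <  '2024-01-02';
```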
Alternatively, you could just output the data to a file. Files are more appropriate for logging large amounts of data that may not need to be accessed often, which seems to be the case here.
MySQL added many features in version 5.7, so you can now do this in several ways.
Oracle-style Big Data is now being integrated into MySQL.
MySQL has unlocked new Big Data insights with MySQL and Hadoop.
Solution 1: You can use MySQL as a document store. It can store a very large number of objects as JSON. It is highly recommended and extensible.
MySQL Document Store = (MySQL + NoSQL).
The X DevAPI helps you produce JSON with SQL and perform CRUD operations over the X Protocol. It is also possible to maintain an X Session.
It is well suited to transparent data sending and sharing for a chat application or group application.
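The X DevAPI itself is used from a client connector rather than from SQL, so the following is not the document store API. It is only a sketch of the same idea using a plain JSON column and the SQL JSON functions available from MySQL 5.7. The table name and the JSON keys are made up.

```sql
-- Hypothetical table holding one JSON document per row.
CREATE TABLE hourly_doc (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  doc JSON NOT NULL
) ENGINE=InnoDB;

-- Store a document built from key/value pairs.
INSERT INTO hourly_doc (doc)
VALUES (JSON_OBJECT('hour', '2024-01-01 00:00:00',
                    'data', 'data data data'));

-- Read a field back out of the document and filter on it.
SELECT id,
       JSON_UNQUOTE(JSON_EXTRACT(doc, '$.hour')) AS hour_recorded
FROM hourly_doc
WHERE JSON_UNQUOTE(JSON_EXTRACT(doc, '$.hour')) >= '2024-01-01 00:00:00';
```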
Solution 2: MySQL Sysbench: Read Only is another good solution. It will be very fast and scalable for building a group chat application.
Solution 3: Use MySQL 5.7 with InnoDB and the NoSQL Memcached API, which interacts directly with the InnoDB storage engine. It is 6x faster than MySQL 5.6.
Facebook still uses this technology, because it is very fast.
For more details:
https://www.mysql.com/news-and-events/web-seminars/introducing-mysql-document-store/
https://dev.mysql.com/doc/refman/5.7/en/document-store-setting-up.html
About Big Data: https://www.oracle.com/big-data/index.html
https://www.youtube.com/watch?v=1Dk517M-_7o
I think it is better to use a database that is not used by anything other than whatever consumes this data (as it is a lot of text data and may slow down SQL queries), and to create separate tables for each category of data.
Ad#m
For the same data set, with mostly text data, how does the data (table + index) size of PostgreSQL compare to that of MySQL?
Postgresql uses MVCC, that would suggest its data size would be bigger
In this presentation, the largest blog site in Japan talked about their migration from Postgresql to MySQL. One of their reasons for moving away from Postgresql was that data size in Postgresql was too large (p. 41):
Migrating from PostgreSQL to MySQL at Cocolog, Japan's Largest Blog Community
Postgresql has data compression, so that should make the data size smaller. But MySQL Plugin also has compression.
Does anyone have any actual experience about how the data sizes of Postgresql & MySQL compare to each other?
MySQL uses MVCC as well, just check InnoDB. But in PostgreSQL you can change the FILLFACTOR to make space for future updates. With this, you can create a database that has space for current data but also for some future updates and deletes. When autovacuum and HOT do their things right, the size of your database can be stable.
The blog is about old versions; a lot of things have changed and PostgreSQL does a much better job with compression than it did in the old days.
Compression depends on the datatype, configuration and speed as well. You have to test to see how it works for your situation.
I did a couple of conversions from MySQL to PostgreSQL and in all these cases, PostgreSQL was about 10% smaller (MySQL 5.0 => PostgreSQL 8.3 and 8.4). This 10% was used to change the fillfactor on the most updated tables; these were set to a fillfactor of 60 to 70. Speed was much better (no more problems with over 20 concurrent users) and data size was stable as well, with no MVCC going out of control or vacuum falling too far behind.
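For reference, setting the fillfactor on a heavily updated table looks something like this in PostgreSQL. The table name blog_post is hypothetical.

```sql
-- PostgreSQL. Leave 30% of each page free for future row versions.
ALTER TABLE blog_post SET (fillfactor = 70);

-- The setting affects pages written from now on. Rewriting the table
-- applies it to existing data, at the cost of an exclusive lock.
VACUUM FULL blog_post;

-- Confirm the storage option is set.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'blog_post';

-- Watch how many updates are HOT and how many dead rows accumulate.
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'blog_post';
```

A high share of n_tup_hot_upd relative to n_tup_upd suggests the free space is doing its job.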
MySQL and PostgreSQL are two different beasts: PostgreSQL is all about reliability, whereas MySQL is popular.
Both have their storage requirements in their respective documentation:
MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Postgres: http://www.postgresql.org/docs/current/interactive/datatype.html
A quick comparison of the two doesn't show any flagrant "zomg PostGres requires 2 megabytes to store a bit field" type differences. I suppose Postgres could have higher metadata overhead than MySQL, or has to extend its data files in larger chunks, but I can't find anything obvious that Postgres "wastes" space for which migrating to MySQL is the cure.
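Rather than comparing documentation, you could load the same data set into both and measure. A sketch of the size queries on each side, run against whichever schema holds your tables:

```sql
-- MySQL: data and index size per table in the current schema, in MB.
-- For InnoDB these figures are estimates.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
       ROUND(index_length / 1024 / 1024, 1) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY data_length + index_length DESC;

-- PostgreSQL (9.0 or later): table size (including TOAST) and index size.
SELECT relname,
       pg_size_pretty(pg_table_size(relid))   AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```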
I'd like to add that for large column values, PostgreSQL also compresses them using a "fairly simple and very fast member of the LZ family of compression techniques".
To read more about this, check out http://www.postgresql.org/docs/9.0/static/storage-toast.html
It's rather low-level and probably not necessary to know, but since you're using a blog, you may benefit from it.
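If you are curious how much that compression saves on your own data, a query along these lines compares the raw length with the stored size. The table blog_post and the columns id and body are hypothetical.

```sql
-- PostgreSQL. octet_length is the uncompressed size in bytes;
-- pg_column_size is what the value actually occupies when stored.
SELECT id,
       octet_length(body)   AS raw_bytes,
       pg_column_size(body) AS stored_bytes
FROM blog_post
ORDER BY octet_length(body) DESC
LIMIT 10;

-- EXTENDED (compress, then move out of line) is already the default
-- for text columns; shown here only to make the setting explicit.
ALTER TABLE blog_post ALTER COLUMN body SET STORAGE EXTENDED;
```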
About indexes,
MySQL stores the data within the index, which makes the indexes huge. Postgres doesn't. This means that the storage size of a b-tree index in Postgres doesn't depend on the number of columns it spans or which data type the columns have.
Postgres also supports partial indexes (e.g. WHERE status=0), a very powerful feature that avoids building indexes over millions of rows when only a few hundred are needed.
Since you're going to put a lot of data in Postgres you will probably find it practical to be able to create indexes without locking the table.
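Both features can be combined in a single statement. A sketch with hypothetical names (blog_post, status, created_at):

```sql
-- PostgreSQL. Index only the rows with status = 0, and build the index
-- without blocking writes. CONCURRENTLY cannot run inside a
-- transaction block.
CREATE INDEX CONCURRENTLY idx_blog_post_pending
  ON blog_post (created_at)
  WHERE status = 0;

-- A query whose WHERE clause matches the index predicate can use it.
EXPLAIN
SELECT *
FROM blog_post
WHERE status = 0
ORDER BY created_at
LIMIT 20;
```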