Storing one column in a NoSQL DB?

In an app I am working on, we use a MySQL database and want to store articles in a table. Rather than keeping the article text in MySQL, we were looking at storing just the key to the article in MySQL and the large body of text in a NoSQL DB. Is this a good problem to solve with NoSQL, or should we just create another table in MySQL and store the large amounts of text there?
We are looking at using MongoDB to store the text.

First thing I'd do is check how MySQL runs with the 'large amount of data'. If you're getting acceptable performance, then there's no point trying to make the system more complicated.
Putting the text content into a separate table in MySQL wouldn't accomplish anything. Putting it into a separate DB might help, but I wouldn't do it unless you're sure that MySQL is a significant bottleneck, and that you can't do anything else, like optimize your queries.

MongoDB is good at storing large blob-ish fields, whether text or binary. So I think that is a good thing to do.
There is a size limit on BSON documents (MongoDB records): 4MB in older versions, 16MB in current ones. If your text fields are larger than that, you can use GridFS, which has no such limit.
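For text above the document size limit, a minimal sketch using pymongo's GridFS support might look like this; the database name, connection details, and filename are assumptions, not anything prescribed by MongoDB:

    # A minimal sketch, assuming a local MongoDB instance and a database named
    # "articles"; GridFS splits large payloads into chunks, so the per-document
    # size limit no longer applies.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["articles"]
    fs = gridfs.GridFS(db)

    # Store the article body keyed by a filename of our choosing.
    article_text = "very long article body..." * 100000
    file_id = fs.put(article_text.encode("utf-8"), filename="article-123")

    # Read it back later using the returned id (or by filename via fs.get_last_version).
    restored = fs.get(file_id).read().decode("utf-8")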

Related

Giant unpartitioned MySQL table issues

I have a MySQL table which is about 8TB in size. As you can imagine, querying is horrendous.
I am thinking about:
Create a new table with partitions
Loop through a series of queries to dump data into those partitions
But the loop will require lots of queries to be submitted & each will be REALLY slow.
Is there a better way to do this? Repartitioning the production database in situ isn't going to work; this seemed like an OK option, but it is slow.
And is there a tool that will make life easier? Rather than a Python job looping & submitting jobs?
Thanks a lot in advance
You could use pt-online-schema-change. This free tool allows you to partition the table with an ALTER TABLE statement, but it does not block clients from using the table while it's restructuring it.
Another useful tool could be pt-archiver. You would create a new table with your partitioning idea, then use pt-archiver to gradually copy or move data from the old table to the new table.
Of course try out using these tools in a test environment on a much smaller table first, so you get some practice using them. Do not try to use them for the first time on your 8TB table.
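For reference, here is a rough sketch of what the partitioning change itself might look like, with a hypothetical table and date column; the same PARTITION BY clause could be handed to pt-online-schema-change through its --alter option rather than run directly:

    # A sketch only: "events" and "created_at" are placeholder names, and the
    # partition key must be part of every unique key on the table (see the
    # partitioning limitations page linked below).
    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="mydb")
    cur = conn.cursor()
    cur.execute("""
        ALTER TABLE events
        PARTITION BY RANGE (YEAR(created_at)) (
            PARTITION p2022 VALUES LESS THAN (2023),
            PARTITION p2023 VALUES LESS THAN (2024),
            PARTITION pmax  VALUES LESS THAN MAXVALUE
        )
    """)
    # Queries that filter on created_at can then be pruned to a single partition;
    # queries that don't mention it still scan every partition.
    conn.close()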
Regardless of what solution you use, you are going to need enough storage space to store the entire dataset twice, plus binary logs. The old table will not shrink, even as you remove data from it. So I hope your filesystem is at least 24TB. Or else the new table should be stored on a different server (or ideally several other servers).
It will also take a long time no matter which solution you use. I expect at least 4 weeks, and perhaps longer if you don't have a very powerful server with direct-attached NVMe storage.
If you use remote storage (like Amazon EBS) it may not finish before you retire from your career!
In my opinion, 8TB for a single table is a problem even if you try partitioning. Partitioning doesn't magically fix performance, and it can make some queries worse. Do you have experience querying partitioned tables? And do you understand how partition pruning works, and when it doesn't?
Before you choose partitioning as your solution, I suggest you read the whole chapter on partitioning in the MySQL manual (https://dev.mysql.com/doc/refman/8.0/en/partitioning.html), especially the page on limitations (https://dev.mysql.com/doc/refman/8.0/en/partitioning-limitations.html). Then try it out with a smaller table.
A better strategy than partitioning for data at this scale is to split the data into shards, and store each shard on one of multiple database servers. You need a strategy for adding more shards as I assume the data will continue to grow.
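As a very rough illustration of that last point, here is a sketch of hash-based shard routing; the shard list and key format are hypothetical, and a consistent-hashing ring or a lookup table would make adding shards later less disruptive than plain modulo:

    # A minimal sketch: route each row to one of several MySQL servers by
    # hashing its sharding key. SHARDS and the key format are placeholders.
    import hashlib

    SHARDS = [
        "mysql://shard0.example.internal/mydb",
        "mysql://shard1.example.internal/mydb",
        "mysql://shard2.example.internal/mydb",
    ]

    def shard_for(key: str) -> str:
        """Return the DSN of the shard responsible for this key."""
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("customer-42"))  # always maps the same key to the same shard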

storing telemetry data from 10000s of nodes

I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be.
Each node has a unique ID, and there will be a timestamp for each packet of variables (which will probably need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
The size of each telemetry packet is 64 bytes, including the device ID and timestamp, so around 100 GB+ per year.
I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards this. If I were to go for MySQL would it make sense to have a separate table for each device ID? - Would this make queries much faster or would having 10000s of tables be a problem?
I don't think querying the variables from all devices in one go is going to be needed but it might be. Or should I just stick it all in a single table and use MySQL cluster if it gets really big?
Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas, or if anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
Time-series databases built on HBase or Cassandra:
OpenTSDB
KairosDB
Axibase Time-Series Database - my affiliation
If you want to go with MySQL, keep in mind that although it will easily keep up when you throw something like 100 GB per year at it on modern hardware, you cannot execute schema changes afterwards on a live system without locking. This means you'll have to have a good, complete database schema to begin with.
I don't know whether this telemetry data might grow more fields, but if it does, you don't want to have to lock your database for hours when you need to add a column or an index.
However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays which make such changes somewhat easier. No performance problems to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online and is sometimes smarter about the use of indexes. (For example, index condition pushdown, http://kb.askmonty.org/en/index-condition-pushdown, is a relatively new trick in MySQL/MariaDB that lets the storage engine filter rows using the index before fetching them; PostgreSQL has had this kind of optimisation for a long time.)
Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.
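If the MySQL route is taken, a rough sketch of the single-table design mentioned in the question, keyed on device ID plus timestamp, might look like this; the table, column, and variable names are all placeholders:

    # A sketch only: one row per (device, timestamp); the composite primary key
    # keeps one device's readings clustered together, which helps the
    # "one variable over a time range" queries described above.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="telemetry")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            device_id INT UNSIGNED NOT NULL,
            ts        DATETIME     NOT NULL,  -- server-generated timestamp
            temp      SMALLINT     NOT NULL,  -- one column per telemetry variable
            voltage   SMALLINT     NOT NULL,
            PRIMARY KEY (device_id, ts)
        ) ENGINE=InnoDB
    """)

    # One variable for one device across a time range.
    cur.execute(
        "SELECT ts, temp FROM readings WHERE device_id = %s AND ts BETWEEN %s AND %s",
        (42, "2024-01-01", "2024-02-01"),
    )
    rows = cur.fetchall()
    conn.close()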

mysql optimization script file

I'm looking at having someone do some optimization on a database. If I gave them a similar version of the db with different data, could they create a script file to run all the optimizations on my database (ie create indexes, etc) without them ever seeing or touching the actual database? I'm looking at MySQL but would be open to other db's if necessary. Thanks for any suggestions.
EDIT:
What if it were an identical copy with transformed data? Along with a couple of sample queries that approximated what the db was used for (i.e. OLAP vs OLTP)? Would a script be able to contain everything, or would they need hands-on access to the actual db?
EDIT 2:
Could I create a copy of the db, transform the data to make it unrecognizable, create a backup file of the db, give it to vendor and them give me a script file to run on my db?
Why are you concerned that they should not access the database? You will get better optimization if they have the actual data, as they can consider table sizes, which queries run the slowest, whether to denormalise if necessary, whether small tables can be kept completely in memory, and so on.
If it is an issue of confidentiality, you can always anonymise the data by replacing names.
If it's just adding indices, then yes. However, there are a number of things to consider when "optimizing". Which are the slowest queries in your database? How large are certain tables? How can certain things be changed/migrated to make those certain queries run faster? It could be harder to see this with sparse sample data. You might also include a query log so that this person could see how you're using the tables/what you're trying to get out of them, and how long those operations take.
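If sharing a query log is an option, a minimal sketch of turning on MySQL's slow query log at runtime (assuming you have the privileges to set global variables) would be:

    # A sketch only: enables the slow query log so it can be handed over
    # together with the anonymised copy of the schema and data.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="root", password="change-me")
    cur = conn.cursor()
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL long_query_time = 1")  # log anything slower than 1 second
    cur.execute("SET GLOBAL log_queries_not_using_indexes = 'ON'")
    conn.close()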

Best way to store large data in mysql

I need to store a large amount of data in the database every hour. What kind of data? Text data.
What is the best way? Store it in multiple tables or in one large table?
Edit: As I said, it is large text data, roughly 10,000 repetitions of the word "data".
Every hour a new line is added, like:
hour - data
Edit 2: Just because you didn't understand the question does not mean it isn't readable. As I said, "EVERY HOUR", so imagine a new line being created every hour for the next 10 years.
Use a column of datatype TEXT, MEDIUMTEXT, or LONGTEXT according to your needs.
See: http://dev.mysql.com/doc/refman/5.0/en/blob.html
Alternatively, you could just output the data to a file. Files are more appropriate for logging large amounts of data that may not need to be accessed often, which seems to be the case here.
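If it stays in MySQL, a minimal sketch of the single-table approach with one row per hour could look like the following; the table and column names are hypothetical:

    # A sketch only: MEDIUMTEXT holds up to 16 MB per row, which is far more
    # than 10,000 repetitions of the word "data".
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="logs")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS hourly_log (
            logged_at DATETIME   NOT NULL PRIMARY KEY,  -- the "hour"
            body      MEDIUMTEXT NOT NULL               -- the text data
        )
    """)
    cur.execute("INSERT INTO hourly_log (logged_at, body) VALUES (NOW(), %s)",
                ("data " * 10000,))
    conn.commit()
    conn.close()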
MySQL added many features in version 5.7, so there are now several ways to do this.
Oracle is now integrating big-data capabilities into MySQL, and MySQL can be combined with Hadoop for big-data analytics.
Solution 1: Use MySQL as a Document Store. It can store many objects as JSON and is highly extensible, so it is well worth considering (see the sketch after the links below).
MySQL Document Store = MySQL + NoSQL.
The X DevAPI lets you mix SQL and CRUD operations on JSON documents over the X Protocol, and it supports sessions (X Session).
This works well for transparently sending and sharing data in a chat or group application.
Solution 2: MySQL's read-only performance (as measured with sysbench) is very fast and scales well, which also suits a group chat application.
Solution 3: Use MySQL 5.7's NoSQL interface, the memcached API, which talks directly to the InnoDB storage engine. It is claimed to be up to 6x faster than MySQL 5.6.
Facebook still uses this technology because it is very fast.
For more details:
https://www.mysql.com/news-and-events/web-seminars/introducing-mysql-document-store/
https://dev.mysql.com/doc/refman/5.7/en/document-store-setting-up.html
About Big Data: https://www.oracle.com/big-data/index.html
https://www.youtube.com/watch?v=1Dk517M-_7o
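As a simpler, plain-SQL cousin of the Document Store idea in Solution 1, here is a sketch that stores JSON documents in a MySQL 5.7+ JSON column instead of going through the X DevAPI; the table and field names are placeholders:

    # A sketch only: a JSON column gives schemaless storage inside an ordinary
    # table; the ->> operator extracts fields at query time.
    import json
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="change-me", database="chat")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id  INT AUTO_INCREMENT PRIMARY KEY,
            doc JSON NOT NULL
        )
    """)
    cur.execute("INSERT INTO messages (doc) VALUES (%s)",
                (json.dumps({"from": "alice", "text": "hello"}),))
    conn.commit()

    cur.execute("SELECT doc->>'$.text' FROM messages WHERE doc->>'$.from' = 'alice'")
    print(cur.fetchall())
    conn.close()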
I think it is better to use a database that is dedicated to whatever consumes this data (since it is a lot of text and may slow down other SQL queries), and to create separate tables for each category of data.
Ad#m

How do I set up replication from MySQL to MongoDB?

I have a bunch of data from a scientific experiment stored in a MySQL database, but I want to use MongoDB to take advantage of its map/reduce functionality to power some web charts. What is the best way to have new writes to MySQL replicate into Mongo? Some solution where I inspect the binary MySQL log and update accordingly, just like standard MySQL replication?
Thanks!
Alex
MySQL and MongoDB use very different data and query models, so you can't transfer data directly.
Alas, moving data between the two must be done manually, and doing it efficiently depends very much on your data. For example, you could transfer each table to a separate collection (roughly the MongoDB equivalent of a table), mapping each table's unique attribute to the _id attribute. Alternatively, you can set _id to tablename+unique_id.
Basically, since document databases are essentially free-form, you are free to invent your own schemes ad infinitum (as long as the _id attributes are unique within each collection).
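A rough sketch of that kind of one-off transfer, mapping each MySQL row's primary key to MongoDB's _id; all table, column, and database names here are assumptions:

    # A sketch only: copy one table into one collection, using the table's
    # unique "id" column as the document _id as suggested above.
    import mysql.connector
    from pymongo import MongoClient

    mysql_conn = mysql.connector.connect(host="localhost", user="app",
                                         password="change-me", database="experiment")
    mongo_coll = MongoClient("mongodb://localhost:27017")["experiment"]["measurements"]

    cur = mysql_conn.cursor(dictionary=True)
    cur.execute("SELECT * FROM measurements")
    for row in cur:
        row["_id"] = row.pop("id")  # reuse the unique column as the document _id
        mongo_coll.replace_one({"_id": row["_id"]}, row, upsert=True)
    mysql_conn.close()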
Tungsten Replicator is a data replication engine for MySQL.
Using heterogeneous replication, you may be able to set up MySQL to MongoDB replication.
I am not familiar with MongoDB, but my quick look suggests it is incompatible with MySQL, so unless someone has written something to import from MySQL, you are out of luck.
You could write your own import function.
Assuming your MySQL tables use an auto-incrementing unique 'id' field, you could track the last row seen in MySQL and send any new rows to MongoDB as they appear (a rough sketch follows below).
Alterations and deletions would be much more difficult to deal with. If handling them is important, then inserting the data into MongoDB at the source is probably the best bet.
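A rough sketch of that polling approach; table, column, and database names are assumptions, and as noted above it only picks up inserts, not updates or deletes:

    # A sketch only: remember the highest id already copied, then periodically
    # ship any newer rows from MySQL to MongoDB.
    import time
    import mysql.connector
    from pymongo import MongoClient

    # autocommit=True so each poll sees rows committed since the previous one.
    mysql_conn = mysql.connector.connect(host="localhost", user="app",
                                         password="change-me",
                                         database="experiment", autocommit=True)
    mongo_coll = MongoClient("mongodb://localhost:27017")["experiment"]["measurements"]

    last_id = 0
    while True:
        cur = mysql_conn.cursor(dictionary=True)
        cur.execute("SELECT * FROM measurements WHERE id > %s ORDER BY id", (last_id,))
        for row in cur:
            mongo_coll.insert_one({**row, "_id": row["id"]})
            last_id = row["id"]
        cur.close()
        time.sleep(10)  # poll every 10 seconds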
Do you need to insert the data into MySQL at all? Could you do it all in MongoDB and save all the trouble?
DC