Best way to store large data in MySQL

I need to store a large amount of data in the database every hour. What kind of data? Text data.
What is the best way: storing it across multiple tables, or in one large table?
Edit: As I said, it's large text data, e.g. the word "data" repeated 10,000 times.
Every hour a new row is added, like:
hour - data
Edit 2: Just because you didn't understand the question doesn't mean it isn't readable. As I said, "EVERY HOUR", so imagine a new row being created every hour for the next 10 years.

Use a column of datatype TEXT, MEDIUMTEXT, or LONGTEXT according to your needs.
See: http://dev.mysql.com/doc/refman/5.0/en/blob.html
Alternatively, you could just output the data to a file. Files are more appropriate for logging large amounts of data that may not need to be accessed often, which seems to be the case here.
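For example, if you go the database route, a minimal table might look like this (the table and column names are just an illustration):

    -- One row per hour; MEDIUMTEXT holds up to ~16 MB of text
    CREATE TABLE hourly_data (
        logged_at DATETIME   NOT NULL PRIMARY KEY,
        payload   MEDIUMTEXT NOT NULL
    );

    INSERT INTO hourly_data (logged_at, payload)
    VALUES ('2012-01-01 13:00:00', 'data data data ...');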

MySQL added many features in version 5.7, so now you can do this in several ways.
Oracle is integrating big-data capabilities into MySQL, and has unlocked new big-data insights with MySQL and Hadoop.
Solution 1: You can use MySQL as a document store. It is possible to store many objects as JSON; this is highly recommended and extensible.
MySQL Document Store = MySQL + NoSQL.
The X DevAPI lets you produce JSON with SQL and run CRUD operations over the X Protocol. It is also possible to maintain an X Session.
This works well for transparent data sending and sharing in a chat or group application.
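As a rough sketch, even plain SQL in MySQL 5.7 can store and query JSON documents via the native JSON column type; the table and column names below are illustrative assumptions, not part of the Document Store API itself:

    -- Store each object as a JSON document
    CREATE TABLE messages (
        id  BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        doc JSON NOT NULL
    );

    INSERT INTO messages (doc)
    VALUES ('{"room": "general", "user": "alice", "text": "hello"}');

    -- Query a field inside the document
    SELECT JSON_UNQUOTE(JSON_EXTRACT(doc, '$.text')) AS text
    FROM messages
    WHERE JSON_UNQUOTE(JSON_EXTRACT(doc, '$.user')) = 'alice';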
Solution 2: MySQL read-only workloads (see the sysbench benchmarks) are another good option: they are very fast and scalable, which suits a group chat application.
Solution 3: Use MySQL 5.7's InnoDB with the NoSQL Memcached API, which talks directly to the InnoDB storage engine. It is up to 6x faster than MySQL 5.6.
Facebook is still using this technology because it is very fast.
For more details:
https://www.mysql.com/news-and-events/web-seminars/introducing-mysql-document-store/
https://dev.mysql.com/doc/refman/5.7/en/document-store-setting-up.html
About Big Data: https://www.oracle.com/big-data/index.html
https://www.youtube.com/watch?v=1Dk517M-_7o

I think it is better to use a dedicated database that is used by nothing but whatever consumes this data (as it is a lot of text data and may slow down other SQL queries), and to create separate tables for each category of data.
Ad#m

Related

Using SQL Server for data mining

I am working on a project where I am storing data in a SQL Server database for data mining. I'm at the first step of data mining: collecting data.
All the data is currently stored in a SQL Server 2008 database, spread across a couple of different tables at the moment. The main table grows by about 100,000 rows per day.
At this rate the table will have more than a million records in about a month's time.
I am also running certain SELECT statements against these tables to get up-to-the-minute, real-time statistics.
My question is how to handle such a large amount of data without impacting query performance. I have already added some indexes to help with the SELECT statements.
One idea is to archive the data once the table hits a certain number of rows. Is this the best solution going forward?
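For example, something along these lines (the table and column names here are made up just to illustrate the idea, and assume the archive table has the same structure):

    -- Move rows older than 30 days into an archive table, then remove them
    INSERT INTO CollectedData_Archive
    SELECT * FROM CollectedData
    WHERE CollectedAt < DATEADD(DAY, -30, GETDATE());

    DELETE FROM CollectedData
    WHERE CollectedAt < DATEADD(DAY, -30, GETDATE());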
Can anyone recommend the best way to handle such data, keeping in mind that down the road I want to do some data mining if possible?
Thanks
UPDATE: I have not researched enough to decide what tool I would use for data mining. My first task is to collect the relevant information, and then do the data mining.
My question is how to manage the growing table so that running SELECTs against it does not cause performance issues.
What tool will you be using to do the data mining? If you use a tool that reads from a relational source, you can check the workload it submits to the database and optimise based on that. In other words, you won't know what indexes you'll need until you actually start doing the data mining.
If you are using the SQL Server data mining tools, they pretty much run off SQL Server cubes (which pre-aggregate the data). So in this case you want to consider which data structure will allow you to build cubes quickly and easily.
That data structure would be a star schema. But there is additional work required to get the data into a star schema, and in most cases you can build a cube off a normalised/OLTP structure just fine.
So, assuming you are using the SQL Server data mining tools, your next step is to build a cube from the tables you have right now and see what challenges you run into.
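For what it's worth, a minimal star-schema sketch might look like this (the dimension and fact names are hypothetical, since I don't know your domain):

    -- Dimension tables describe the "who/what/when"
    CREATE TABLE DimDate (
        DateKey  INT  PRIMARY KEY,   -- e.g. 20120131
        FullDate DATE NOT NULL
    );

    CREATE TABLE DimSource (
        SourceKey  INT PRIMARY KEY,
        SourceName VARCHAR(100) NOT NULL
    );

    -- The fact table holds the measures, keyed by the dimensions
    CREATE TABLE FactCollectedData (
        DateKey          INT NOT NULL REFERENCES DimDate (DateKey),
        SourceKey        INT NOT NULL REFERENCES DimSource (SourceKey),
        ObservationCount INT NOT NULL,
        Amount           DECIMAL(18, 2) NOT NULL
    );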

Storing telemetry data from 10,000s of nodes

I need to store telemetry data that is being generated every few minutes by over 10,000 nodes (which may increase), each sending the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be...
Each node has a unique ID, and there will be a timestamp for each packet of variables (which will probably need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
The size of each telemetry packet is 64 bytes, including the device ID and timestamp. So around 100 GB+ per year.
I'd want to be able to query the data for variables across time ranges, and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with using MySQL, so I'm leaning towards that. If I were to go with MySQL, would it make sense to have a separate table for each device ID? Would this make queries much faster, or would having 10,000s of tables be a problem?
I don't think querying the variables from all devices in one go will be needed, but it might be. Or should I just stick it all in a single table and use MySQL Cluster if it gets really big?
Or is there a better solution? I've been looking around at some non-relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB, for example, would have quite a lot of size overhead per row, and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also, MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas? If anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
HBase time-series databases:
OpenTSDB
KairosDB
Axibase Time-Series Database - my affiliation
If you want to go with MySQL, keep in mind that although it will handle something like 100 GB per year easily on modern hardware, you cannot execute schema changes afterwards on a live system. This means you'll need a good, complete database schema to begin with.
I don't know whether this telemetry data might grow more fields, but if it does, you don't want to lock your database for hours just to add a column or an index.
However, some tools such as http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html are available nowadays that make such changes somewhat easier. No performance problems are to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online and is sometimes somewhat smarter about the use of indexes. (For example, http://kb.askmonty.org/en/index-condition-pushdown is a relatively new optimization in MySQL/MariaDB, while PostgreSQL has had comparable optimizations for a long time.)
Regarding overhead: you will probably be storing your 64 bytes of telemetry data in an unpacked form, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.
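To make the "complete schema up front" advice concrete, a minimal single-table layout might look like the following sketch; the variable columns are assumptions, since I don't know your actual telemetry fields:

    CREATE TABLE telemetry (
        device_id   INT UNSIGNED NOT NULL,
        recorded_at DATETIME     NOT NULL,
        var1        FLOAT        NOT NULL,
        var2        FLOAT        NOT NULL,
        -- ... one column per telemetry variable ...
        PRIMARY KEY (device_id, recorded_at)
    ) ENGINE=InnoDB;

    -- Typical range query: one device over a time window
    SELECT recorded_at, var1
    FROM telemetry
    WHERE device_id = 42
      AND recorded_at BETWEEN '2012-01-01' AND '2012-02-01';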

Document-oriented DBMS as primary DB and an RDBMS as secondary DB?

I'm having some performance issues with a MySQL database due to its normalization.
Most of my applications that use the database need to do some heavy nested queries, which in my case take a lot of time. Queries can take up to 2 seconds to run with indexes; without indexes, about 45 seconds.
A solution I came across a few months back was to use a faster, more linear, document-based database, in my case Solr, as the primary database. As soon as something changes in the MySQL database, Solr is notified.
This worked really great. All queries using the Solr database only took about 3 ms.
The numbers look good, but I'm having some problems.
Huge database
The MySQL database is about 200 MB; the Solr db contains about 1.4 GB of data.
Each time I need to change a table/column, the database needs to be reindexed, which in this example took over 12 hours.
Difficult to render both a Solr object and an Active Record (MySQL) object without getting wet.
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
Like this.
    # Controller
    @song = Song.first

    # View
    @song.artist.urls.first.service.name
The problem in my case is that the data being returned from Solr is flat like this.
    {
        id: 123,
        song: "Waterloo",
        artist: "ABBA",
        service_name: "Groveshark",
        urls: ["url1", "url2", "url3"]
    }
This forces me to build an active record object that can be passed to the view.
My question
Is there a better way to solve the problem?
Some kind of super-fast, read-only primary database that can handle complex queries would be nice.
Solr individual fields update
About reindexing everything on schema change: Solr does not yet support updating individual fields, but there is a JIRA issue about this that's still unresolved. However, how often do you actually change the schema?
MongoDB
If you can live without an RDBMS (without joins, schemas, transactions, foreign key constraints), a document-based DB like MongoDB or CouchDB would be a perfect fit (here is a good comparison between them).
Why use MongoDB:
data is in native format (you can use an ORM mapper like Mongoid directly in the views, so you don't need to adapt your records as you do with Solr)
dynamic queries
very good performance on non-full text search queries
schema-less (no need for migrations)
built-in, easy-to-set-up replication
Why use SOLR:
advanced, very performant full-text search
Why use MySQL:
joins, constraints, transactions
Solutions
So, the solutions (combinations) would be:
Use MongoDB + Solr
but you would still need to reindex everything on schema change
Use only MongoDB
but drop support for advanced full-text search
Use MySQL in a master-slave configuration, and balance reads from slave(s) (using a plugin like octupus) + Solr
setup complexity
Keep current setup, denormalize data in MySQL
messy
Solr reindexing slowness
The MySQL database is about 200 MB, the Solr db contains about 1.4 GB of data. Each time I need to change a table/column the database needs to be reindexed, which in this example took over 12 hours.
Reindexing a 200 MB database in Solr SHOULD NOT take 12 hours! Most probably you also have other issues, like:
MySQL:
n+1 issue
indexes
SOLR:
commit after each request - this is the default setup if you use a plugin like Sunspot, but it's a performance killer in production
From http://outoftime.github.com/pivotal-sunspot-presentation.html:
By default, Sunspot::Rails commits at the end of every request that updates the Solr index. Turn that off.
Use Solr's autoCommit functionality. That's configured in solr/conf/solrconfig.xml.
Be glad for assumed inconsistency. Don't use search where results need to be up-to-the-second.
other setup issues (http://wiki.apache.org/solr/SolrPerformanceFactors#Indexing_Performance)
Look at the logs for more details
Instead of pushing your data into Solr to flatten the records, why don't you just create a separate table in your MySQL database that is optimized for read-only access?
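For example, a flattened, read-only table mirroring the Solr record shown in the question might look like this (column names follow that example; everything else is an assumption):

    CREATE TABLE songs_denormalized (
        song_id      INT PRIMARY KEY,
        song         VARCHAR(255),
        artist       VARCHAR(255),
        service_name VARCHAR(255),
        urls         TEXT            -- e.g. a comma-separated or serialized list
    );

    -- Rebuild this table from the normalized tables whenever the source data
    -- changes; the views read only from here.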
Also you seem to contradict yourself
The view relies on a certain object. It doesn't care whether the object itself is an Active Record object or a Solr object, as long as it can call a set of attributes on it.
The problem in my case is that the data being returned from Solr is flat... This forces me to build a fake active record object that can be rendered by the view.

How do I set up replication from MySQL to MongoDB?

I have a bunch of data from a scientific experiment stored in a MySQL database, but I want to use MongoDB to take advantage of its map/reduce functionality to power some web charts. What is the best way to have new writes to MySQL replicate into Mongo? Some solution where I inspect the binary MySQL log and update accordingly, just like standard MySQL replication?
Thanks!
Alex
MySQL and MongoDB use very different data and query models, so you can't transfer data directly.
Alas, moving data between the two must be done manually, and doing that efficiently depends very much on your data. E.g. you could transfer each table to a separate collection (roughly the equivalent of a table in MongoDB lingo), making the unique attribute in each table the _id attribute. Alternatively, you can make the _id tablename + unique_id.
Basically, as document databases are essentially free-form, you are free to invent your own schemes ad infinitum (as long as the _id attributes are unique within the collection).
Tungsten Replicator is a data replication engine for MySQL.
Using heterogeneous replication, you may be able to set up MySQL to MongoDB replication.
I am not familiar with MongoDB, but my quick look shows it is incompatible with MySQL, so unless someone has written something to import from MySQL, you are out of luck.
You could write your own import function.
Assuming your MySQL tables use an auto-incrementing unique id field, you could track the last row seen in MySQL and then send new rows to MongoDB as they appear.
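For instance, the MySQL side of such a poller could be as simple as the following sketch (the table name and bookkeeping variable are illustrative; in practice you would persist the marker somewhere durable rather than in a session variable):

    -- 1. Grab rows added since the last run
    SELECT *
    FROM experiment_results
    WHERE id > @last_copied_id
    ORDER BY id;

    -- 2. After inserting those rows into MongoDB, advance the marker
    SET @last_copied_id = (SELECT COALESCE(MAX(id), @last_copied_id)
                           FROM experiment_results);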
Alterations and deletions would be much more difficult to deal with; if they matter, then inserting the data at the source is probably the best bet.
Do you need to insert the data into MySQL at all? Could you do it all in MongoDB and save all the trouble?
DC

Storing one column in a NoSQL DB?

In an app I am working on, we use a MySQL database and want to store articles in a table. Rather than storing them in the SQL DB, we were looking at keeping just the key in MySQL and storing the article text in a NoSQL db. Is this a good problem to solve with NoSQL, or should we just create another table in MySQL and store the large amounts of text there?
We are looking at using MongoDB to store the text.
First thing I'd do is check how MySQL runs with the 'large amount of data'. If you're getting acceptable performance, then there's no point trying to make the system more complicated.
Putting the text content into a separate table in MySQL wouldn't accomplish anything. Putting it into a separate DB might help, but I wouldn't do it unless you're sure that MySQL is a significant bottleneck and that you can't do anything else, like optimizing your queries.
MongoDB is good at storing large blob-ish fields, whether text or binary. So I think that is a good thing to do.
There is a limit on BSON objects (MongoDB "records") of 4 MB. If your text fields are larger than 4 MB, you can use GridFS, and then there is no limit.