Right database for machine learning on 100 TB of data

I need to perform classification and clustering on about 100 TB of web data, and I was planning on using Hadoop, Mahout, and AWS. What database do you recommend I use to store the data? Will MySQL work, or would something like MongoDB be significantly faster? Are there other advantages of one database or the other? Thanks.

The simplest and most direct answer would be to just put the files directly in HDFS or S3 (since you mentioned AWS) and point Hadoop/Mahout directly at them. Other databases have different purposes, but Hadoop/HDFS is designed for exactly this kind of high-volume, batch-style analytics. If you want a more database-style access layer, then you can add Hive without too much trouble. The underlying storage layer would still be HDFS or S3, but Hive can give you SQL-like access to the data stored there, if that's what you're after.
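For concreteness, here's a minimal Python sketch of staging files into S3 with boto3 (the bucket name and file paths are made up for illustration); Hadoop/Mahout jobs can then read the data directly via s3:// paths:

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")
    bucket = "my-webdata-bucket"  # hypothetical bucket

    # Upload raw crawl files; a Hadoop job can later read them
    # via paths like s3://my-webdata-bucket/raw/part-0001.txt
    for name in ["part-0001.txt", "part-0002.txt"]:
        s3.upload_file(f"/data/raw/{name}", bucket, f"raw/{name}")
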
Just to address the two other options you brought up: MongoDB is good for low-latency reads and writes, but you probably don't need that. And I'm not up on all the advanced features of MySQL, but I'm guessing 100 TB is going to be pretty tough for it to deal with, especially when you start getting into large queries that access all of the data. It's designed more for traditional, transactional access.

Related

Use XML file or DB

For my simple app, I have an FTP server where I can store a file (JSON or XML) or a DB. Multiple clients could access that file or DB to read or write (the DB or file would only ever hold up to 100 entries).
From one point of view, a DB is better suited to large numbers of entries, due to indexing. But from another point of view, I am not sure whether there would be issues with an XML or JSON file if multiple clients try to read or write the same file at the same time. So I am thinking of using a DB just to avoid that issue.
I'd suggest using a database for a few reasons:
Databases are designed for exactly this scenario.
If you ever need to work on a larger scale you won't need to change your code.
You'll get to practice writing db code that will be useful in future, larger scale projects.
There are some really good database technologies that will work well for what you need, for example MongoDB, MySQL, or SQL Server. They all have great support and code examples, and you'll be able to use Stack Overflow to ask questions about them.
After googling, it seems that SQLite is the best choice. It is a good fit for a small DB: it is self-contained and allows safe access from multiple processes or threads. Exactly what I needed.
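As a minimal sketch of how little code that takes in Python (the table layout is invented for illustration), SQLite's built-in locking handles concurrent clients; the timeout makes a writer wait for the lock instead of failing immediately:

    import sqlite3

    # timeout=10: wait up to 10 s for another client's write lock
    conn = sqlite3.connect("app.db", timeout=10)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("INSERT INTO entries (payload) VALUES (?)", ("hello",))
    conn.commit()
    conn.close()
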

In-memory database for Mahout recommendation

I have been working with Mahout lately. The current version supports input from files, MySQL, etc., via its DataModels. In my case, the raw data resides in a Postgres DB at a client location. The raw data requires a good amount of pre-processing before being fed into a Mahout DataModel. Currently I'm storing the refined data as a simple *.csv file and loading it into Mahout using the built-in FileDataModel.
Is it possible to use an in-memory DB to store the refined data and load it into Mahout using its existing MySQLJDBCDataModel/JDBCDataModel? If so, what kind of in-memory DB would serve this purpose?
SQLite3 is quite often the go-to in-memory database, and for good reason: it's one of the most battle-hardened databases out there and can be found literally everywhere. The browser you're using is likely using it. It has an in-memory option that's fairly straightforward. Even disk-based, it's fast.
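For illustration, in Python's built-in sqlite3 module the in-memory option is just a special connection string (a sketch only; Mahout itself would connect over JDBC, so from Java you'd reach an in-memory database through a JDBC driver in the same spirit):

    import sqlite3

    # ":memory:" creates a database that lives entirely in RAM and
    # disappears when the connection is closed.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prefs (user_id INTEGER, item_id INTEGER, rating REAL)")
    conn.executemany("INSERT INTO prefs VALUES (?, ?, ?)",
                     [(1, 10, 4.5), (1, 11, 3.0), (2, 10, 5.0)])
    print(conn.execute("SELECT COUNT(*) FROM prefs").fetchone()[0])  # prints 3
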
Given enough RAM, most databases will end up serving most of your data from RAM anyway. I used PostgreSQL as the backend for a search engine for a long time, and most access was to RAM, with almost nothing going to disk when reading. If you already have the database in PostgreSQL, it might be simpler to keep it there.
Keep in mind that you can only access an SQLite in-memory database from a single process.
If you need the ultimate performance, even a fully cached persistent database won't be as fast as a true in-memory database system. To me, though, it doesn't sound like you need that level of extreme performance.

SQLite3 database per customer

Scenario:
Building a commercial app consisting of a RESTful backend in Symfony2 and a frontend in AngularJS
This app will never be used by many customers (if I get to sell 100 that would be fantastic; hopefully many more, but in any case it will never be massive)
I want to have a multi tenant structure for the database with one schema per customer (they store sensitive information for their customers)
I'm aware of the problems with updating schemas, but I will have to live with that.
Today I have a MySQL demo database that I clone each time a new customer purchases the app.
There is no relationship between my customers, so I don't need to communicate with multiple shards for any query
For one customer, several devices can be using the app at the same time, but there won't be massive write operations on the DB
My question
Trying to set up some functional tests for the backend API, I read about having a dedicated SQLite database for loading test data, which seems to be a good idea.
However, I wonder if it's also a good idea to switch from MySQL to SQLite3 as the main database for the application, and if it's common practice to have one dedicated SQLite3 database PER CLIENT. I've never used SQLite, and I have no idea whether the process of updating a schema and replicating the changes across all the databases works the same way as for other RDBMSs.
Is this a good use case for SQLite?
Any suggestions (e.g. a tutorial) on how to achieve this?
[I wonder] if it's common practice to have one dedicated SQLite3 database PER CLIENT
Only if the database is deployed along with the application, like on a phone. Otherwise I've never heard of such a thing.
I've never used SQLite, and I have no idea whether the process of updating a schema and replicating the changes across all the databases works the same way as for other RDBMSs
SQLite is a SQL database and responds to ALTER TABLE and the like. As for updating all the schemas, you'll have to re-run the update against each database.
Schema syncing is usually handled by an outside utility; usually your ORM will have something. Some are server-agnostic, some only support specific servers. There are also dedicated database change management tools such as Sqitch.
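To make "re-run the update against each database" concrete, here is a rough Python sketch (the file layout and migration SQL are hypothetical) that applies one change to every per-tenant SQLite file:

    import glob
    import sqlite3

    MIGRATION = "ALTER TABLE customers ADD COLUMN phone TEXT"  # hypothetical change

    # Assumes one database file per tenant, e.g. tenants/acme.db
    for path in glob.glob("tenants/*.db"):
        conn = sqlite3.connect(path)
        try:
            conn.execute(MIGRATION)
            conn.commit()
        except sqlite3.OperationalError:
            pass  # column already exists: migration was applied earlier
        finally:
            conn.close()
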
However, I wonder if it's also a good idea to switch from MySQL to SQLite3 as the main database for the application, and
SQLite's main advantage is not requiring you to install and run a server. That makes sense for quick projects, or where you have to deploy the database along with the application, as in a phone app. For a server-based application, there's no problem with having a database server, and SQLite's very restricted set of SQL features becomes a disadvantage. It will also likely run slower than a server database for anything but the simplest queries.
Trying to set up some functional tests for the backend API, I read about having a dedicated SQLite database for loading test data, which seems to be a good idea.
Under no circumstances should you test with a different database than the production database. Databases do not all implement SQL the same way; MySQL is particularly bad about this, and your tests will not reflect reality. Running a MySQL instance for testing is not much work.
This separate schema thing claims three advantages...
Extensibility (you can add fields whenever you like)
Security (a query cannot accidentally show data for the wrong tenant)
Parallel Scaling (you can potentially split each schema onto a different server)
What they're proposing is equivalent to having a separate, customized copy of the code for every tenant. You wouldn't do that; it's obviously a maintenance nightmare. Code at least has the advantage of version control systems with branching and merging. I know of only one database change management tool that supports branching: Sqitch.
Let's imagine you've made a custom change to tenant 5's schema. Now you have a general schema change you'd like to apply to all of them. What if the change to 5 conflicts with this? What if the change to 5 requires special data migration different from everybody else? Now let's imagine you've made custom changes to ten schemas. A hundred. A thousand? Nightmare.
Different schemas will require different queries. The application will have to know which schema each tenant is using, there will have to be some sort of schema version map you'll need to maintain. And every different possible query for every different possible schema will have to be maintained in the application code. Nightmare.
Yes, putting each tenant in a separate schema is more secure, but that only protects against writing bad queries or including a query builder (which is a bad idea anyway). There are better ways to mitigate the problem, such as the view filter suggested in the docs. There are many other ways an attacker can access tenant data that this doesn't address: gaining a database connection, gaining access to the filesystem, sniffing network traffic. I don't see the small security gain being worth the maintenance nightmare.
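As a sketch of that view-filter idea (all table, column, and setting names here are invented for illustration), in PostgreSQL the application can set a per-connection tenant id and query only a view that filters on it:

    import psycopg2  # PostgreSQL driver

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # The view only ever exposes rows for the current tenant.
    cur.execute("""
        CREATE VIEW tenant_invoices AS
        SELECT * FROM invoices
        WHERE tenant_id = current_setting('app.tenant_id')::int
    """)

    cur.execute("SET app.tenant_id = '5'")
    cur.execute("SELECT * FROM tenant_invoices")  # only tenant 5's rows
    conn.commit()
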
As for scaling, the article is ten years out of date. There are far, far better ways to achieve parallel scaling than to coarsely put schemas on different servers; there are entire databases dedicated to this idea. Fortunately, you don't need any of this! Scaling won't be a problem for you until you have tens of thousands to millions of tenants. Front-loading your design with a schema maintenance nightmare for a hypothetical big parallel-scaling problem is putting the cart so far before the horse, it's already at the pub having a pint.
If you want to use a relational database, I would recommend PostgreSQL. It has a very rich SQL implementation, it's fast and scales well, and it has something that renders this whole idea of separate schemas moot: a built-in JSON type. This can be used to implement the "extensibility" mentioned in the article. Each table can have a meta column using the JSON type into which you can throw any extra data you like. The application does not need special queries; the meta column is always there. PostgreSQL's JSON operators make working with the metadata easy and efficient.
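A minimal sketch of that meta-column pattern (table and field names are hypothetical), again from Python:

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    # One fixed schema for every tenant; per-tenant extras go in meta.
    cur.execute("CREATE TABLE items (id serial PRIMARY KEY, name text, meta jsonb)")
    cur.execute("INSERT INTO items (name, meta) VALUES (%s, %s)",
                ("widget", json.dumps({"color": "red", "reorder_level": 3})))

    # ->> extracts a JSON field as text, so it can be filtered on.
    cur.execute("SELECT name FROM items WHERE meta->>'color' = %s", ("red",))
    print(cur.fetchall())
    conn.commit()
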
You could also look into a NoSQL database. There are plenty to choose from and many support custom schemas and parallel scaling. However, it's likely you will have to change your choice of framework to use one that supports NoSQL.

Options for transferring data between MySQL and SQLite via a web service

I've only recently started to deal with database systems.
I'm developing an iOS app that will have a local database (SQLite) and that will have to periodically update its internal database with the contents of a database stored on a web server (MySQL). My question is: what's the best way to fetch the data from the web server and store it in the local database? Some options came to mind; I don't know if all of them are possible:
Webserver->XML/JSON->Send it->Locally convert and store in local database
Webserver->backupFile->Send it->Feed it to the SQLite db
Are there any other options? Which one is better in terms of the amount of data transferred?
Thank you
The XML/JSON route is by far the simplest while providing sufficient flexibility to handle updates to the database schema/older versions of the app accessing your web service.
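As a rough Python sketch of that flow (the endpoint URL and table layout are invented; on iOS you'd do the equivalent with NSJSONSerialization and SQLite, but the shape is the same):

    import json
    import sqlite3
    import urllib.request

    # Hypothetical endpoint returning [{"id": 1, "name": "..."}, ...]
    with urllib.request.urlopen("https://example.com/api/products.json") as resp:
        rows = json.load(resp)

    conn = sqlite3.connect("local.db")
    conn.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT)")
    # INSERT OR REPLACE keeps the periodic sync idempotent.
    conn.executemany("INSERT OR REPLACE INTO products (id, name) VALUES (:id, :name)", rows)
    conn.commit()
    conn.close()
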
In terms of the second option you mention, there are two approaches: either use an SQL statement dump or a CSV dump. However:
The "default" (i.e.: mysqldump generated) backup files won't import into SQLite without substantial massaging.
Using a CSV extract/import will mean you have considerably less flexibility in terms of schema changes, etc. so it's probably not a sensible approach if the data format is ever likely to change.
As such, I'd recommend sticking with the tried and tested XML/JSON approach.
In terms of the amount of data transmitted, JSON may be smaller than the equivalent XML, but it really depends on the variable/element names used, etc. (See the existing How does JSON compare to XML in terms of file size and serialisation/deserialisation time? question for more information on this.)

Converting MySQL to NoSQL databases

I have a production database server running MySQL 5.1. Now we need to build an app for reporting that will fetch data from the production database server. Since reporting queries across the entire database may slow it down, we are planning to switch to NoSQL. The whole system runs on the AWS stack, and we plan to use DynamoDB. Kindly suggest ways to sync data from the production MySQL server to the NoSQL database server.
Just remember the simple fact that most NoSQL databases are essentially document databases; it's really difficult to automatically convert a typical relational MySQL database into a good document design.
In a document database you have collections of documents, and each document will typically contain data that would live in related rows across multiple tables. The advantage of a NoSQL redesign is that most data access becomes simpler and faster, without requiring you to write complex join statements.
If you mechanically convert each MySQL table to a corresponding NoSQL collection, you won't really be taking advantage of a NoSQL DB: you'll end up loading many more documents, and thus making many more calls to the database than needed, losing the simplicity and speed a NoSQL DB can offer.
Perhaps a better approach is to look at how your applications use the MySQL database and design the document model from there. You might then write a simple utility script yourself, since you know your MySQL database design well.
As the data in a NoSQL database like MongoDB, Riak, or CouchDB has a very different structure than in a relational database like MySQL, the only way to migrate/synchronise the data is to write a job that reads from MySQL using SELECT queries and writes to the NoSQL database, as stated on the MongoDB website:
Migrate the data from the database to MongoDB, probably simply by writing a bunch of SELECT * FROM statements against the database and then loading the data into your MongoDB model using the language of your choice.
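A bare-bones sketch of such a job in Python (the connection details and the customers/orders embedding are invented for illustration), using pymysql and pymongo:

    import pymysql
    from pymongo import MongoClient

    mysql = pymysql.connect(host="prod-db", user="report", password="secret",
                            database="shop", cursorclass=pymysql.cursors.DictCursor)
    mongo = MongoClient("mongodb://localhost:27017")["shop"]

    with mysql.cursor() as cur:
        cur.execute("SELECT * FROM customers")
        for customer in cur.fetchall():
            # Embed each customer's orders so reports need no joins.
            with mysql.cursor() as c2:
                c2.execute("SELECT * FROM orders WHERE customer_id = %s",
                           (customer["id"],))
                customer["orders"] = c2.fetchall()
            mongo.customers.insert_one(customer)
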
Depending on the quantity of data, this could take a while to process.
If you have any other questions, don't hesitate to ask.