Design of the recommendation engine database? - mysql

I am currently working on recommendation systems, especially for audio files, but I am a beginner at this subject. I am trying to design the database first with MySQL, but I can't decide how to do it. It is basically a system in which users create a profile and then search for music, and the system recommends music similar to what they liked.
Which database should I use? (MySQL comes to mind as a first guess.)
It is a web project, with a mobile side to follow. Which technologies should I use? (PHP, the Android platform...)
What are the pitfalls of this project?
How should I design the database for a system like that?

Any relational database should be good for storing the raw data like lists of songs, list of users, users' song preferences..
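To make the "raw data" part concrete, here is a minimal sketch of such a schema. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (users, songs, likes) are my own assumptions, not something from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the schema is deliberately generic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE songs (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    artist TEXT NOT NULL
);
-- One row per (user, song) preference; the composite key prevents duplicates.
CREATE TABLE likes (
    user_id INTEGER NOT NULL REFERENCES users(id),
    song_id INTEGER NOT NULL REFERENCES songs(id),
    PRIMARY KEY (user_id, song_id)
);
-- Lets "who liked this song?" be answered without scanning the whole table.
CREATE INDEX likes_by_song ON likes (song_id);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO songs (id, title, artist) VALUES (?, ?, ?)",
                 [(1, "Song A", "Artist X"), (2, "Song B", "Artist Y")])
conn.executemany("INSERT INTO likes (user_id, song_id) VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])


def liked_titles(user_name):
    """Return the titles a user has liked, sorted alphabetically."""
    rows = conn.execute(
        """SELECT s.title
           FROM likes l
           JOIN users u ON u.id = l.user_id
           JOIN songs s ON s.id = l.song_id
           WHERE u.name = ?
           ORDER BY s.title""",
        (user_name,),
    )
    return [title for (title,) in rows]
```

The likes table is the only part the recommender really needs to read; everything else is ordinary profile and catalogue data.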
I think you'll find that relational databases (and SQL) are not that great for storing the various data structures that your recommender will be constructing. Your recommendation engine will probably be creating data that doesn't really need to be in tables, and manipulating it for storage in a relational database may just be wasted work.
Just be aware of what you are doing and don't spend time putting stuff into a SQL database if it feels wrong. Maybe look into using a document oriented database like MongoDB.
The recommender that I recently wrote is actually a Java server process that reads in the raw data from MySQL, does all of its work in-memory, and provides recommendation data to my application via an HTTP API. I didn't even bother storing the recommendation data permanently since it can be regenerated.
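As a rough illustration of "read the raw data, do the work in memory", here is a small Python sketch. It is not the Java server described above; the scoring (Jaccard overlap between users' liked sets) is just one simple technique I picked for the example, and the data is made up.

```python
from collections import defaultdict

# Hypothetical raw data, as it might be loaded from a "likes" table:
# user name -> set of liked song ids.
likes = {
    "alice": {"song_a", "song_b"},
    "bob": {"song_b", "song_c"},
    "carol": {"song_a", "song_b", "song_c"},
}


def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, top_n=3):
    """Suggest songs the user has not liked yet, weighted by how similar
    the other users who liked them are to this user."""
    mine = likes[user]
    scores = defaultdict(float)
    for other, theirs in likes.items():
        if other == user:
            continue
        weight = jaccard(mine, theirs)
        for song in theirs - mine:
            scores[song] += weight
    # Highest score first; ties broken by song id so the result is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, score in ranked[:top_n] if score > 0]
```

Nothing here is persisted: the scores are rebuilt from the likes whenever needed, which is the same reason the answer above does not bother storing recommendation data.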

Go read "Programming Collective Intelligence". They have a number of fine algorithms for recommendations in Chapter 2, "Making Recommendations".

Well, this is a vague question and a half, but I'll do my best to answer:
MySQL is a solid database, and so is PostgreSQL. Both are free and open source. MySQL is more widely supported and a little easier to use, but Postgres has some very cool features and functionality that are worth taking a gander at. WikiVS has a good comparison of the two.
Smartphone browsers keep getting better. Use PHP or ASP.NET (whatever you're comfortable with), and then build out a mobile site that looks better at the smaller resolutions.
There are a lot. First and foremost, how good is your recommendation algorithm? Secondly, storing audio files can eat up storage space quickly. What's your plan for scaling? Thirdly, how well do you know database design? Can you design a large, hefty database and index it properly? If not, you need to start reading everything you can on indices and database design. Fourthly, it's a software project, and those always have pitfalls. The best you can do is post here when problems arise and we can always see what the fine people of StackOverflow can do to help.

Related

Should I go with MySql or mongodb

I am building a social network (connections and their connections, messages and locations) and I am a little confused in deciding whether to go with a relational database (MySQL) or a no-sql system (MongoDB) when designing our backend APIs. Does anyone have any views on what to use when?
PS: I am building developer APIs for developers to tap into our system with OAuth, so scalability and performance are also key factors. Rails 3 + Devise (most likely).
This depends largely on which technology you are comfortable with, what exactly you want to get out of it, and so on.
Coming back to your question: not all data is relational, and for those situations NoSQL can be helpful. That said, NoSQL stands for "Not Only SQL". It's not intended to knock MySQL or supplant it.
SQL or MySQL has several very big advantages:
A strong mathematical basis.
Declarative syntax.
A well-known language in Structured Query Language (SQL).
Highly proven and extremely reliable technology. MySQL has been around far more than the oldest noSQL. It's a mature piece of technology. Google Adsense runs on MySQL, Facebook persistent store is MySQL. The examples suggest its reliability.
As a result of being mature technology, people have optimised the shit out of it.
Enormous online and open source community both for support and providing features as opposed to noSQL technologies (look what happened to Cassandra)
All of the above points matter to me when I choose a piece of technology. If it's a Sunday evening project that you want to whip up with few real-world consequences, then do whatever you fancy, but if it's slightly more serious, please consider these points.
SQL hasn't gone away (even in noSQL). It's a mistake to think about this as an either/or argument. NoSQL is an alternative that people need to consider when it fits, that's all.
Documents can be stored in non-relational databases like CouchDB, or even in MySQL (it borders on abuse, but still). A relational database could in principle make a very good NoSQL solution.
Check out this hilarious video. This gives a different perspective on this topic :)
I chose MongoDB for my "Social" application because of the flexibility of the schema and scalability/performance. MongoDB has allowed me to adjust my schema without having to make drastic code changes and makes reading/finding data very easy.
I also chose MongoDB as a learning experience. I wanted to know what all the fuss was about with these "noSQL" databases...and now I know why. MongoDB is awesome in my opinion and definitely worth looking at for a Social network that requires scalability and performance. Node.js would also be an excellent choice for the API ;)
Neither.
Go with a network/graph database; you will not regret it. My current favorite is Neo4j.
http://neo4j.org/
Note: I am not affiliated with Neo4j.
I think the latest version of Neo4j has a SQL interface, just in case you need SQL compatibility. Otherwise, do your CRUD using their native library. It is very fast.
If you need to visualize the graph data, which would be very impressive to show to your boss, use the yEd package. To export Neo4j to GraphML format, use this:
Convert Neo4j DB to XML?
You could front end your relationships in Neo4J and backend it with a relational db or mongodb. I have seen those hybrid architectures as well.
If your project requires actual relationships between certain objects, then MySQL will be fine. If you are storing things that typically just have inherent data to them, such as a user with messages to other users, then a document-style database, such as MongoDB, makes more sense.
You can do relationships in mongo, but they make a lot more sense in a relational database. But, if most of your data is more inherent of a user then mongo makes more sense.
In your case a document type of schema makes more sense, where each user has a list of connected users and their own personal attributes, etc.
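To show what I mean by a document per user, here is a plain-Python sketch (no MongoDB driver involved). The field names and sample users are made up for illustration, and the dict of dicts just stands in for a collection keyed by user id.

```python
# Hypothetical user documents: each one carries its own attributes
# and an embedded list of connected user ids.
users = {
    "ann": {"city": "Izmir", "connections": ["ben", "cat"]},
    "ben": {"city": "Austin", "connections": ["ann", "dan"]},
    "cat": {"city": "Leeds", "connections": ["ann"]},
    "dan": {"city": "Oslo", "connections": ["ben"]},
}


def second_degree(user_id):
    """Connections of connections, excluding the user and direct connections."""
    direct = set(users[user_id]["connections"])
    reachable = set()
    for friend in direct:
        # One extra document lookup per direct connection.
        reachable.update(users[friend]["connections"])
    return sorted(reachable - direct - {user_id})
```

Reading one user's profile and connection list is a single document fetch; "connections of connections" costs one more fetch per direct connection, which is the trade-off to keep in mind.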

Are there any advantages to using mongodb over mysql if said mongo db were used without embedded documents?

I'm using a php framework with a mongodb adapter that doesn't currently comprehend embedded documents as a Model/association relationship. After reading about mongodb for a few days it seems that you should use embedded documents for objects that are most often displayed together. This makes a lot of sense to me. It was said during one mongo schema talk that a collection of many small documents can negate some of the advantages of mongo over an RDBMS.
In searching stackoverflow and beyond, I can't seem to see what advantages exist, if any, when deploying mongodb into an environment where it is implemented with a reasonably normalized schema like you'd find in a traditional RDBMS.
Are there still advantages to using MongoDB when used in this way? Scaling? Performance?
If by "reasonably normalized" you mean that you need information from one table to filter the information from another table (i.e. a join), then mongo is going to work against you. In a SQL database you can easily get the info from multiple tables with a single query. In mongo you'll need multiple queries to get data from multiple collections. Any speed advantage mongo gives you in pulling from a single collection will quickly be negated by making multiple round trips to the database.
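Here is a small sketch of the difference, using Python's sqlite3 module so it runs anywhere. The authors/posts tables are hypothetical. The second function imitates what application code has to do when the store cannot join for you: fetch ids from one collection, then query the other with them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
INSERT INTO authors VALUES (1, 'TR'), (2, 'US'), (3, 'TR');
INSERT INTO posts VALUES (10, 1, 'first'), (11, 2, 'second'), (12, 3, 'third');
""")


def titles_with_join(country):
    # One round trip: the database filters across both tables.
    rows = conn.execute(
        """SELECT p.title
           FROM posts p
           JOIN authors a ON a.id = p.author_id
           WHERE a.country = ?
           ORDER BY p.id""",
        (country,),
    )
    return [title for (title,) in rows]


def titles_without_join(country):
    # Round trip 1: find the matching author ids.
    ids = [i for (i,) in conn.execute(
        "SELECT id FROM authors WHERE country = ?", (country,))]
    if not ids:
        return []
    # Round trip 2: fetch the posts for those ids.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT title FROM posts WHERE author_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return [title for (title,) in rows]
```

Both functions return the same rows; the second simply needs two trips to the database, and the gap widens with every extra collection involved.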
Here are some advantages that MongoDB might give you (depending on your use case):
Schemaless: More flexible if document structure is modified later.
Performance: MongoDB makes very good use of the available RAM, which makes it very performant.
Easy replication: Replication is easy to set up.
Sharding/Clustering: MongoDB is designed with sharding in mind. It is easy to set up and doesn't require experts.
Map/Reduce: If you happen to need this, there is built-in support.
Javascript: Intuitive to use if you already know Javascript (and who doesn't nowadays :) )
The MongoDB website has a good list of case studies of production deployments.
MongoDB has replication and sharding built in.
These are things that can be done with MySQL.
The downside is the learning curve and lack of programmers that know it.
If it's just for you, it would be fun as a learning project.
If this is for a larger project, you'll need to weigh the lack of MongoDB programmers and learning curve against popularity of MySQL.
I have been developing my University dissertation project with MySQL first then thought to give a shot to MongoDB to improve performance. Rewriting code was really easy and straightforward with Jongo. Production has been really smooth.
Unfortunately, performance was terrible. I am not particularly skilled with MongoDB queries, but I believe I did quite a lot of research: I have used map reduce, I have used the aggregation framework, $limit and all that stuff... When at some stage I got the message "request heap use exceeded 10% of physical RAM", I really gave up and delivered the MySQL version.
For me it's really a shame, because I was working so hard to make it work the best way possible with MongoDB (a University project stands out if you do something different). I think I will continue to study MongoDB in the future, but for the moment I am sticking with performance (or rather, with what I can make perform).
I hope my comment will not offend MongoDB fans, but this is my experience.

SQLite or MySQL for a read-mostly website?

Is it practical to use SQLite as the database backend for a website with, say, 300,000 unique visitors a month ?
Writes to the database will be pretty limited: users signing up or logging in, adding comments, etc. The vast majority of the use will just be queries getting content based on a primary key in the URL. I'd like to know if SQLite can cope as a website backend and won't end up dramatically slower than MySQL.
I've seen this SO question and others, but they're not really the same and seem like they could be out-of-date now too. http://www.sqlite.org/whentouse.html suggests it'd be fine, but they might be a bit biased!
SQLite is a very cool product - and with HTML5 on the horizon, it's a good idea for any web developer to get acquainted with it. However you should bear in mind that sqlite does not scale well. If you ever need to share data across multiple webservers, it's going to be very difficult using sqlite as the data-substrate.
However, to ease development, you could look at PDO or the dbx_ functions in PHP, which provide an abstraction layer (i.e. the same code talks to all sorts of databases). There are some subtle variations in the way different systems implement things like transactions, and variations in SQL, so if you do go down this route, I'd recommend maintaining your own abstraction layer between the PHP PDO/DBX calls and your application: think stored procedures implemented in PHP.
300,000 unique visitors a month ?
aaaarrrgghhhh! pet hate. While you need to think about how much money your site will be making in order to plan a budget, this is not a useful metric for capacity planning. Really you want to look at expected hit/page rates.
I think you would be fine. SQLite is able to support multi-threading just fine, and as you are mostly reading from it, there shouldn't be a problem. Also, if you are writing to it, it fully supports transactions as well. You have to remember, though, that it's still just one file and no service, so if you are going to cluster it you will be out of luck. Maybe you should check exactly what problems you have with MySQL and solve them.
SQLite is very fast, but it becomes difficult to use once you need to cluster. However, pretty much all databases are difficult once you need to cluster. If you are read-oriented, it shouldn't matter too much which you use. Just make sure you are using memcached.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and is often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time to learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput that cripples a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can dramatically reduce db load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
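For what it's worth, the pattern I have in mind looks roughly like the sketch below (often called cache-aside). It is plain Python with an in-process dictionary standing in for memcached; TTLCache, load_from_db and the key format are all hypothetical names for illustration, not a real client API.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache such as memcached."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)


stats = {"db_calls": 0}


def load_from_db(submission_id):
    """Pretend database read; counts how often the db is actually hit."""
    stats["db_calls"] += 1
    return {"id": submission_id, "text": f"submission {submission_id}"}


cache = TTLCache(ttl_seconds=60)


def get_submission(submission_id):
    key = f"submission:{submission_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache hit: no db work at all
    value = load_from_db(submission_id)    # cache miss: read once...
    cache.set(key, value)                  # ...and remember it for next time
    return value
```

The part that needs thought in a real app is invalidation: when a submission is updated, its key has to be deleted or overwritten, otherwise readers see stale data until the TTL runs out.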
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although someone with more legal experience could better explain the implications. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which one is your best database choice and you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their design they do not handle data distribution as well... but will you handle so much data that it does not fit on one server, or have access rights that need separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, like most relational database engines, can handle rather large amounts of data with proper indexes and a sane database schema.
Why don't you test how MySQL behaves with a bigger amount of data in your setup? Write some scripts that generate realistic data into a MySQL test database, generate some load on the system, and see if it is fast enough.
Only if it is not fast enough should you start considering optimizing the database or changing to a different database engine.
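A test script along those lines can be very short. The sketch below is Python with the built-in sqlite3 module standing in for a MySQL test database; the table layout, row counts and query are assumptions chosen only to show the shape of such a script, so replace them with your real schema and realistic data.

```python
import random
import sqlite3
import time


def build_test_db(n_users=1_000, n_submissions=20_000, seed=42):
    """Fill a throwaway database with generated rows (fixed seed, so repeatable)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
    )
    rows = (
        (rng.randrange(n_users), "x" * rng.randint(20, 200))
        for _ in range(n_submissions)
    )
    conn.executemany("INSERT INTO submissions (user_id, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def time_lookups(conn, n_queries=200, n_users=1_000, seed=7):
    """Run the same kind of query many times and return the elapsed seconds."""
    rng = random.Random(seed)
    start = time.perf_counter()
    for _ in range(n_queries):
        conn.execute(
            "SELECT COUNT(*) FROM submissions WHERE user_id = ?",
            (rng.randrange(n_users),),
        ).fetchone()
    return time.perf_counter() - start


conn = build_test_db()
before = time_lookups(conn)
conn.execute("CREATE INDEX submissions_by_user ON submissions (user_id)")
after = time_lookups(conn)
print(f"no index: {before:.3f}s, with index: {after:.3f}s")
```

The numbers only mean something on your own hardware with your own data sizes, which is the whole point: measure there before deciding anything.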
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code with but performs badly with large amounts of data. For example, whether to use lazy or eager fetching for associations should be considered carefully. I don't mean that you shouldn't use NHibernate, but make sure that you understand how it works, for example what the "n + 1 selects" problem means.
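In case the "n + 1 selects" problem is unfamiliar, here is a small illustration. It is Python with sqlite3 rather than NHibernate, and the tables are hypothetical, but the query pattern is the same one a lazily fetched association can produce: one query for the parent rows, then one more per parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
INSERT INTO users VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
INSERT INTO submissions VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, so the two strategies can be compared."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def titles_lazy():
    # 1 query for the users, then 1 more per user: n + 1 in total.
    result = {}
    for user_id, name in run("SELECT id, name FROM users ORDER BY id"):
        rows = run(
            "SELECT title FROM submissions WHERE user_id = ? ORDER BY id",
            (user_id,),
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_eager():
    # A single joined query fetches users and their submissions together.
    result = {}
    rows = run(
        """SELECT u.name, s.title
           FROM users u
           LEFT JOIN submissions s ON s.user_id = u.id
           ORDER BY u.id, s.id"""
    )
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # users without submissions still get an entry
            result[name].append(title)
    return result
```

With three users the lazy version issues four queries and the eager version one. With ten thousand users on a listing page, that difference is what kills you.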
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables are, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
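To make the blog-post example concrete, here is roughly what such a document looks like. This is a plain-Python sketch with no MongoDB driver involved; JSON serialization stands in for storing and loading the document, and the field names are made up.

```python
import json

# Hypothetical blog post with its comments nested inside the same document.
post = {
    "_id": "post-1",
    "title": "Hello",
    "tags": ["intro"],
    "comments": [
        {"author": "ann", "text": "Nice"},
        {"author": "ben", "text": "Thanks"},
    ],
}

# Round-trip through JSON as a stand-in for saving and reloading the document.
stored = json.dumps(post)
loaded = json.loads(stored)

# The nested list comes back with the post: no second table, no join.
comment_authors = [comment["author"] for comment in loaded["comments"]]
```

In a relational schema the same data would normally be a posts table plus a comments table with a foreign key, reassembled by the ORM on every read.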
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Which is the Best database for Rails application?

I am developing a Rails application that will access a lot of RSS feeds or crawl sites for data (mostly news). It will be something like Google News but with a different approach, so I'll store a lot of news (or news summaries), classify them in different categories and use ranking and recommendation techniques.
Should I go with MySQL?
Is it worthwhile using IBM DB2 pureXML to store the documents? Also, Ruby search implementations (Ferret, Ultrasphinx and others) are not needed if I choose DB2. Is that correct?
What are the advantages of PostgreSQL in this?
Does it make sense to use CouchDB in this scenario?
I'd like to choose the best option but without over-complicating the solution. So I discarded the idea to use two different storage solutions (one for the news documents and other for the rest of the data). I'm also considering only "free" options, so I didn't look at Oracle or MS SQL Server.
pureXML is heavier than SQL, so you pay more for your round trip between webserver and DB. If you plan to have lots of users, I'd avoid it; you're better off letting your webserver cache the requests, thus avoiding creating XML (RSS) every time, if that is what you are thinking about.
I'd go with MySQL because it's really good at serving and it's totally free. PostgreSQL is too, but I haven't used it, so I can't say.
CouchDB could make sense, but not if you plan on doing OLAP (online analytical processing) on your data; a normal RDBMS will be better at that.
Admitting firstly that I generally don't like mysql, I will say that there has been writing on this topic regarding postgres:
http://oldmoe.blogspot.com/2008/08/101-reasons-why-postgresql-is-better.html
This is always my choice when I need a pure relational database. I don't know whether a document database would be more appropriate for your application without knowing more about it. It does sound like it's something you should at least investigate.
MySQL is probably one of the best options out there; light, easy to install and maintain, multiplatform and free. On top of that there are some good free client tools.
Something to think about; because of the nature of your system you will probably have some tables that will grow quite a lot very quickly so you might want to think about performance.
MySQL supports table partitioning for this (horizontal partitioning, i.e. splitting rows across partitions), but only from version 5.1.
It sounds to me the application you will build can easily become a large-scale web app. I would suggest PostgreSQL, for it has been known for its reliability.
You can check out the following link -- Bob Ippolito from MochiMedia tells us why they ditched MySQL for PostgreSQL. Although the posts are more than 3 years old, the issues MySQL 5.1 has recently tend to prove that they are still relevant.
http://bob.pythonmac.org/archives/category/sql/mysql/
MySQL is good in production. I haven't used PostgreSQL for rails, but it's a good solution as well.
In the dev and test environments I'd start out with SQLite (default), and perhaps migrate to your target DB in the test environment as you move closer to completion.