What I've got right now with MySQL
Here is my database:
I search users by location a lot, using a bounding box.
There are two more tables: user_tag and tags. The overall database size is about 1 GB.
I've implemented an arbitrary tag system with these tables, so that when a user wants to use a tag that hasn't been created yet, the tag is inserted into the tags table.
I also search users by tags.
Benchmarks
I have no indexes in this database except those on primary keys.
As you can see, there are a lot of inserts, and they take a lot of time. The main problem here is the time-consuming inserts and updates.
Create new Event with tags (~150 ms):
http://pastebin.com/vyw6qhrN
Update Event (~200 ms):
http://pastebin.com/f28yvn9z
What I don't like in this solution:
When I create a new user, I have to insert into 3 tables to link the user with his tags.
When updating user information, I also need to do 3 updates, plus a deletion or an insert whenever the user changes his tags.
Searching users by tags gets really messy, with a complex query (see "How to implement tag system").
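For illustration, here is roughly the shape such a query takes. This is only a sketch against the tables described above; the exact column names (users.id, users.name, tags.name) are assumed. It finds users who have both the football and fishing tags:

-- Find users tagged with both 'football' and 'fishing'
-- (column names assumed, not taken from the real schema):
SELECT u.id, u.name
FROM users u
JOIN user_tag ut ON ut.user_id = u.id
JOIN tags t ON t.id = ut.tag_id
WHERE t.name IN ('football', 'fishing')
GROUP BY u.id, u.name
HAVING COUNT(DISTINCT t.name) = 2;

And that is still the simple case; combining a tag filter with the bounding-box search makes it messier.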
What can I get with NoSQL
I want to use a document-oriented database. Then I will need only one collection:
{
"name": "Dan",
"lat": 60
"lon": 30
"tags":["football", "fishing"]
}
I will be able to set indexes on tags and on lat/lon for faster searches.
My questions
Should I switch to NoSQL, or can I somehow improve my current implementation? Or maybe switch to a different RDBMS?
In case I should switch: which NoSQL database is the best in this case?
In case I should switch to MongoDB: is it reliable and mature enough? I ask because I've read a lot of posts about people moving away from MongoDB. For example: http://www.reddit.com/search?q=mongodb
Both technologies can probably solve your problem. Some scenarios are easier to handle with an RDBMS, others with a more specialized database. It depends on the details of your requirements, your experience, and your personal preferences.
@mvp commented on the "convenience of SQL". Personally, I find SQL a major pain, because object-oriented code and SQL don't map to each other easily. People often use ORM behemoths, which I find an antipattern: chances are the ORM's code size is more than 50 times your entire application code, so something is fishy. But that is just my opinion; SQL is still probably the most common data store.
Personally, I have the feeling your problem maps to MongoDB quite nicely, because:
It has geo indexes and supports various geo queries
It is very easy to create simple tagging, if that is what you need
It's easy to work with, and handling a few GB of data is easy, too.
It's easy to administer. I don't need to meddle with innodb_buffer_pool_size or whatnot at that scale.
Joins are overrated. Joins are only needed because you split up data that belongs together in order to squeeze it into tables. If you want to answer questions like "users who like football and live in foo also like...?", the aggregation framework plus caching is easier and more scalable than huge joins.
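To see why those joins get huge, here is a sketch of that very question written relationally; the schema (users, user_tag, tags, and a users.city column) is assumed for illustration:

-- What else do users who like football and live in foo like?
SELECT t2.name, COUNT(*) AS cnt
FROM users u
JOIN user_tag ut1 ON ut1.user_id = u.id
JOIN tags t1 ON t1.id = ut1.tag_id AND t1.name = 'football'
JOIN user_tag ut2 ON ut2.user_id = u.id
JOIN tags t2 ON t2.id = ut2.tag_id AND t2.name <> 'football'
WHERE u.city = 'foo'
GROUP BY t2.name
ORDER BY cnt DESC;

Five joined table references for one question, and every one of them grows with your user base.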
If I were you, I'd sit down for a day or two and give it a spin: you have a reasonably sized data set, so you can do testing with real-world data, and changing just a few queries should be very easy. It will be fun, and you'll get a feeling for the upsides and downsides first-hand.
By the way, three of the articles on reddit refer to each other: "Don't use MongoDB" on pastebin, Eliot Horowitz's answer at news.ycombinator.com, and "The MongoDB story was a hoax". So no, MongoDB doesn't just crash randomly or have a gazillion bugs. But of course, it's not a silver bullet that magically makes scaling issues disappear.
I guess this has been brought up many times, but I'm bringing it up again!!!
Anyway... in Ruby on Rails, SQLite3 is already set up and no extra picking and choosing is needed, but...
after numerous readings here and elsewhere, some say it's not scalable, while others say it can actually be quite good at scaling. Some say MySQL is much better for bigger projects, while others think you should just go with PostgreSQL.
I'm interested in hearing your opinion on this, in two scenarios: one where you're starting a little news publishing website like CNN, and another where you're creating a website similar to Twitter.
It highly depends on your application.
Generally speaking, any write operation into a SQLite database is slow. Even a plain :update_attribute or :create may take up to 0.5 seconds. But if your app doesn't write much (a killer for SQLite: writing to the DB on every request!), SQLite is a solid choice for most web apps out there. It is proven to handle small to medium amounts of traffic. It is also a very good choice during development, since it needs zero configuration. It performs very well in your test suite in in-memory mode, too (unless you have thousands of migrations, since the schema is rebuilt from scratch every time). Also, it is mostly seamless to switch from SQLite to, e.g., MySQL if its performance isn't enough any longer.
MySQL is currently a rock-solid choice. The future will tell what happens to MySQL under Oracle.
PostgreSQL is the fastest one as far as I know, but I haven't used it in production yet. Maybe others can say more.
I'd vote for Postgres; it's consistently getting better, especially performance-wise, if that's a concern. Taking you up on the CNN and Twitter examples: start out on as solid a footing as you can. You'll be glad later on down the road.
For websites, SQLite3 will suffice and scale fine for anything up to upper-mid-range traffic. So, unless you start getting hit by millions of requests per hour, there's no need to worry about SQLite3's performance or scalability.
That said, SQLite3 doesn't support all the typical features a dedicated SQL server would. Access control is limited to whatever file permissions you can set for UNIX accounts on the machine holding your database file, there's no daemon to speak of, and the set of built-in functions is rather small. Also, there are no stored procedures of any kind, although you could emulate those with views and triggers.
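For instance, here is a minimal sketch of that emulation in SQLite. The users/archived_users tables and the "archive a user" operation are made up for illustration:

-- A write-only view acts as the "procedure" entry point
-- (tables are hypothetical):
CREATE VIEW archive_user AS SELECT id FROM users WHERE 0;

-- The INSTEAD OF trigger is the "procedure body":
CREATE TRIGGER archive_user_proc INSTEAD OF INSERT ON archive_user
BEGIN
    INSERT INTO archived_users SELECT * FROM users WHERE id = NEW.id;
    DELETE FROM users WHERE id = NEW.id;
END;

-- "Calling" the procedure:
INSERT INTO archive_user (id) VALUES (42);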
If you're worried about any of those points, you should go with PostgreSQL. MySQL has (indirectly) been bought by Oracle, and considering they also had their own database before acquiring MySQL, I wouldn't put it past them to just drop it somewhere along the line. I've also had a far smoother experience maintaining PostgreSQL in the past and - anecdotally - it always felt a bit snappier and more reliable.
DISCLAIMER:
My opinion is completely biased, as I have used MySQL since it first came out.
Your question brings in another argument about how your development environment should be set up. A number of individuals will argue that you should be using the same DBMS in development as you do in testing/production. That is totally dependent upon what you're doing in the first place. SQLite will work fine in development in most cases.
I've personally been involved with more sites using MySQL and MSSQL than Postgres.
I was involved in a project that scrubbed the National Do-Not-Call list against client numbers. We stored that data locally. Some area codes easily have over 5 million records. The app was initially written in .NET using MSSQL. It was "not-so-fast". I changed it to use PHP and MySQL (sad days, before I found out about Ruby). It would insert/digest 5 million rows in about 3 seconds, which was infinitely faster than processing it through MSSQL. We also stored call log data in tables that would grow to 20 million records in less than a day. MySQL handled everything we threw at it like a champ. Processing naturally took a hit when we set up replication, but it was such a small one that we ignored it.
It really comes down to your project and what solution fits the need of the project.
I understand that this is very broad, so let me give you the setting and be specific about my focus points.
Setting:
I am working with an existing PHP application using MySQL. Almost all tables use the MyISAM engine and, for the most part, contain millions of rows. One of the largest tables uses an EAV design, which is necessary but hurts performance. The application was written to best leverage the MySQL query cache. It makes a fair number of queries per page load (partially because of this) and is complex enough that it has to go through most tables of the whole DB on each page load.
Pros:
it's free
MyISAM tables support full-text indexes, which are important to the application
Cons:
With the way things are set up, MySQL is limited to one CPU for the whole application. If one very demanding query is run (or the server is under a lot of load), it queues all the others, making the site unresponsive
MySQL's caching and its lack of WITH or INTERSECT mean we have to break our queries down to make better use of the cache, thus multiplying the number of queries made. For instance, using subqueries over multiple tables with millions of rows (even with decent indexing) turns out to be a big issue with the current/upcoming load and the constraint laid out in the point above (CPU usage); see the sketch below
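For reference, the kind of query the second point is about looks like this. Both Oracle and PostgreSQL accept it as written, while MySQL at the time supported neither WITH nor INTERSECT; the table and column names here are made up:

-- One pass instead of several separately cached sub-results
-- (hypothetical tables):
WITH busy_products AS (
    SELECT product_id
    FROM page_views
    GROUP BY product_id
    HAVING COUNT(*) > 1000
)
SELECT product_id FROM busy_products
INTERSECT
SELECT product_id FROM purchases;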
Feeling the need to scale up in the upcoming year, but not necessarily ready to pay for licensing right away, I've been thinking about rewriting the application and switching DBs.
The three options being considered are: continue using MySQL but with the InnoDB engine, so we can leverage more CPU power; adapt to Oracle XE and buy a license once we outgrow its limits of a 4 GB database, 1 GB of RAM, or 1 CPU (none of which we have hit yet); or adapt to PostgreSQL.
So the questions are :
How would losing full-text indexing impact performance in the three cases? (Do Oracle or PostgreSQL have an equivalent?)
How do Oracle and PostgreSQL leverage caching for subqueries, WITH, and UNION/INTERSECT statements?
How do Oracle and PostgreSQL leverage multicore/CPU power (if/when we get an Oracle license)?
I think that's already a lot to answer, so I'll stop here. I don't mind simple/incomplete answers if there are links to complement them.
If you need any more information just let me know
Thanks in advance guys, the help is appreciated.
PostgreSQL supports full text search and indexes. Details here.
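To make that concrete, here is a minimal sketch; the articles table and its body column are assumptions, not anything from your schema:

-- Index a text column for full text search (names assumed):
CREATE INDEX articles_fts_idx ON articles
    USING gin (to_tsvector('english', body));

-- Query it:
SELECT id
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cache & query');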
And it can use any number of CPU cores: it creates a separate process for every session, plus some additional support processes. Details here.
PostgreSQL doesn't have built in query caching, but there are lots of open source utilities for this purpose.
I want to understand how to build the database architecture of a large site for chat messages (for example facebook.com or gmail.com).
I think the messages are distributed across different tables, because keeping all the messages in one table is impossible given their huge quantity, right? (And I don't think plain partitioning can help here.)
So, what logic is used to distribute messages across different tables? I have several variants in mind, but I think none of them is optimal.
Generally, I'm interested in what you think about this. Also, if you know some good articles about it, please post the links.
The answer currently is Hadoop.
They have a distributed file system, and a database that can use it: http://hbase.apache.org
http://en.wikipedia.org/wiki/HBase
OK, well, the problem is how to partition the dataset. The easiest (and often the best) way to think about this is to consider the access pattern: which messages are needed quickly, which ones can be slow, and how to manage each of them.
Generally older messages can be held on low network speed/low memory/very large storage nodes (multi-terabyte).
New messages should be on high bandwidth network/high memory/low storage nodes (gigabytes are enough).
As traffic grows, you'll need to add storage to the slow nodes, and add more nodes to the fast ones (scaling horizontally).
Each night (or more often) you can copy old messages to the historical database, and remove the messages from the current database. Queries may need to address two databases, but this is not too much trouble.
As you scale out, the data will probably need to be sharded, i.e. split by some data value. Splitting by user id makes sense. To make life easy, all sides of a conversation can be stored with each user. I would recommend using time-bucketed text for this (disk access is usually on 4 KB boundaries), though this may be too complicated for you initially.
Queries now need to be user-aware so they query against the correct database. A simple lookup table will help there.
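A sketch of what that lookup table could be (the names are made up); the application reads the shard id first, then connects to the matching database to fetch the messages:

-- Hypothetical user -> shard lookup table:
CREATE TABLE user_shard (
    user_id  BIGINT PRIMARY KEY,
    shard_id SMALLINT NOT NULL
);

-- Which database holds this user's messages?
SELECT shard_id FROM user_shard WHERE user_id = 12345;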
The other thing to do is to compress the messages on the way in and decompress them on the way out. Text compresses easily, and this may double your throughput for a small increase in CPU usage.
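If you stay on MySQL, one way to do this inline is with its COMPRESS()/UNCOMPRESS() functions. This is only a sketch, with a made-up messages table whose body column would need to be a BLOB:

-- Compress on the way in (hypothetical table):
INSERT INTO messages (sender_id, body)
VALUES (42, COMPRESS('the message text ...'));

-- Decompress on the way out:
SELECT UNCOMPRESS(body) FROM messages WHERE sender_id = 42;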
Many NoSQL databases do a lot of this hard work for you, but until you've run out of capacity on your current system, you may wish to stick to the technologies you know.
Good luck!
A while ago there was an article about how reddit went from small to large.
They don't have a user messaging system, but I guess this will work out for a lot of scenarios with huge amounts of data:
http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
Edit: the "interesting" part about the database is #3 - dont worry about the schema.. they use 2 tables for everything. Thing and Data.
Facebook uses Apache Cassandra (a distributed NoSQL database) for some of its storage, and makes heavy use of memcached to scale well.
Here's more about FB's nuts and bolts.
You might also find some gems in the FB developer news.
There is a lot of talk at the moment about NoSQL; from my understanding, MongoDB is one of them. To me, NoSQL seems like it is still SQL, just not in the same sense that we know MySQL.
Think of it this way: they both store data. One does it in a fixed database with so-called limits, while the other stores data when it thinks it's the best time to store it, and supposedly has no limits, or very few.
However, this is confusing to web developers who are making the switch or thinking about making it. In my case, I work for a big telco company, and making a switch like this is something that needs to be really looked at; we can't rely on something that has no physical being, so to say.
Maybe I am not understanding the meaning of NoSQL, or maybe I have it right.
I am currently rewriting the whole CMS that we use, and it would be nice to know: should I spend the time looking at NoSQL, or keep MySQL (which does not seem to have any issues at the moment)?
We only have 5000 rows in the customer details table and 14000 in the backup; it gets backed up just in case the master table decides to screw up.
Are you being forced to select one or the other? If not, why limit the potential solutions to your business requirements by having to do 'this' or having to do 'that'? I equate the workflow of software engineering to that of a doctor.
A doctor has to make a number of decisions to ensure an operation goes successfully. This includes the diagnosis, determining the incision points, and selecting the required tools of the trade (scalpel, bone saw, etc.) to complete the operation. If you told the doctor they could only do the operation with a crossbow, the end result wouldn't work out well for the patient or the doctor (malpractice).
So, stepping away from the clumsy analogy, here are a few reasons why I opt to use both, using an online bookstore as an example:
Book data such as ISBN, author name(s), dates published, etc. are stored in an RDBMS (let's say MySQL). By storing this type of data in MySQL, I can run any number of queries to present to a user. For example, I can run a query returning all books published by authors whose last names begin with the letter Z, with a publish date of 2005, ordered by ISBN descending (see the query sketch after this list). This type of data manipulation is critical when creating useful features for your company (or clients).
Book assets, such as cover art, are stored on the filesystem using a NoSQL solution. This solves two problems. First, I don't want voluminous data (blobs) ballooning up my MySQL database, so I'll store this data on the filesystem. And second, a book's cover art has nothing to do with any of the actual book data (are people really going to want all books with the color blue in their cover art?). Yet we simply cannot forgo a book's cover art, as it could make or break a sale when a user is browsing our online inventory.
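For what it's worth, the query described in the first point looks roughly like this in MySQL; the table and column names are assumed:

-- Authors starting with Z, published in 2005, by ISBN descending
-- (hypothetical schema):
SELECT isbn, title, author_last_name, published_on
FROM books
WHERE author_last_name LIKE 'Z%'
  AND YEAR(published_on) = 2005
ORDER BY isbn DESC;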
In closing, my recommendation to you is to select any and all the tools you need to finish the operation successfully, and in a way which makes it easy to add new features in the future.
With that amount of data, MySQL wouldn't be a problem. NoSQL databases are designed for large data sets and are designed quite differently (everything you can do in a NoSQL database you can also do in a SQL database).
Besides, NoSQL databases are far harder to administer. Cassandra needs the right configuration to be faster than a normal MySQL database; otherwise it is much slower (and even then you can run into a few problems with it). And for most NoSQL databases you need a VPS or dedicated hosting.
NoSQL databases are worthy of some evaluation, but they have a niche that they suit, and it's not CMSes with 5,000 rows.
I think you should stick with a proper relational, SQL-based database. You may find PostgreSQL to be a better free choice than MySQL, but you'll have to evaluate this yourself [1].
[1] There's a variety of resources for this, for instance: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL
We are currently using MySQL for a product we are building, and are keen to move to PostgreSQL as soon as possible, primarily for licensing reasons.
Has anyone else done such a move? Our database is the lifeblood of the application and will eventually be storing TBs of data, so I'm keen to hear about experiences of performance improvements/losses, major hurdles in converting SQL and stored procedures, etc.
Edit: Just to clarify to those who have asked why we don't like MySQL's licensing. We are developing a commercial product which (currently) depends on MySQL as a database back-end. Their license states we need to pay them a percentage of our list price per installation, and not a flat fee. As a startup, this is less than appealing.
Steve, I had to migrate my old application the other way around, that is, PgSQL -> MySQL. I must say, you should consider yourself lucky ;-)
Common gotchas are:
Pg's SQL is actually pretty close to the language standard, so you may suffer from the MySQL dialect you already know
MySQL quietly truncates varchars that exceed the max length, whereas Pg complains; a quick workaround is to make these columns 'text' instead of 'varchar' and use triggers to truncate long values (see the sketch after this list)
double quotes are used around identifiers instead of backticks
boolean fields are compared using the IS and IS NOT operators; however, a MySQL-compatible INT(1) with = and <> is still possible
there is no REPLACE; use a DELETE/INSERT combo
Pg is pretty strict about enforcing foreign key integrity, so don't forget to use ON DELETE CASCADE on references
if you use PHP with PDO, remember to pass a parameter to the lastInsertId() method; it should be the sequence name, which is usually named [tablename]_[primarykeyname]_seq
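As an illustration of the truncating-trigger workaround from the second point, here is a minimal PL/pgSQL sketch; the books table, its title column, and the 255-character limit are made up:

-- Silently truncate over-long values, MySQL-style
-- (hypothetical table and limit):
CREATE FUNCTION truncate_title() RETURNS trigger AS $$
BEGIN
    NEW.title := substring(NEW.title from 1 for 255);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER books_truncate_title
BEFORE INSERT OR UPDATE ON books
FOR EACH ROW EXECUTE PROCEDURE truncate_title();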
I hope that helps at least a bit. Have lots of fun playing with Postgres!
I have done a similar conversion, but for different reasons. It was because we needed better ACID support, and the ability to have web users see the same data they could via other DB tools (one ID for both).
Here are the things that bit us:
MySQL does not enforce constraints as strictly as PostgreSQL.
There are different date handling routines. These will need to be manually converted.
Any code that does not expect ACID compliance may be an issue.
That said, once it was in place and tested, it was much nicer. With correct locking for safety reasons and heavy concurrent use, PostgreSQL performed better than MySQL. On the things where locking was not needed (read only) the performance was not quite as good, but it was still faster than the network card, so it was not an issue.
Tips:
The automated scripts in the contrib directory are a good starting point for your conversion, but they will usually need to be touched up a little.
I would highly recommend that you use the serializable isolation level as a default.
The pg_autodoc tool is good for really seeing your data structures and helps find any relationships you forgot to define and enforce.
We did a move from MySQL 3 to PostgreSQL 8.2, then 8.3. PostgreSQL covers the basics of SQL and a lot more, so if your MySQL usage doesn't rely on fancy MySQL-specific stuff, you will be OK.
From my experience: our MySQL database (version 3) didn't have foreign keys... PostgreSQL lets you have them, so we had to change that... and it was a good thing, and we found some mistakes.
The other thing we had to change was the (C#) connector code, which wasn't the same for MySQL. The MySQL connector was more stable than the PostgreSQL one; we still have a few problems with the PostgreSQL one.