So I decided I want my image names stored in MySQL. So I would save the "1" in my MySQL database.
So far I intend to name my images with just numbers, auto-incremented actually. But should I name my images at all? I heard MySQL can be rather slow, and I don't want a bunch of image names in my database if it's going to make it run that much slower, especially since image names really aren't that important.
Also, would autoincrement make my database run faster, as opposed to manual numbering?
EDIT: Totally unrelated question. I notice some people/sites separate their images into smaller chunks. Something that could easily be one image is split into multiple image files. I don't get that, especially since Amazon's image hosting service charges per "get", so Amazon charges per picture too. So why do people split up images? I have my assumptions but I'm sure there's more to it.
Uhm....not sure about the context of your question, but broadly speaking, MySQL is one of the fastest database engines you can get for moderate use (the paid for products like Oracle and MS SQL Server are usually better for extreme requirements).
Storing the name as a string will not be noticeably slower than storing it as an integer - but it would help to get more of an idea of the actual application.
Auto increment is a sensible choice for automatically assigning a primary key; it guarantees uniqueness of the number even if two processes access the table at the exact same time. In terms of performance, again - I don't think you could measure the difference.
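A minimal sketch of that idea, assuming a hypothetical `images` table and a `.jpg` extension. SQLite from Python's standard library stands in for MySQL here so the example runs anywhere; in MySQL the column would be declared with `AUTO_INCREMENT` instead.

```python
import sqlite3

# SQLite stands in for MySQL; the table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " title TEXT NOT NULL)"
)


def add_image(title):
    """Insert a row and return the id the database assigned to it."""
    cur = conn.execute("INSERT INTO images (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


first = add_image("sunset")
second = add_image("beach")

# The file on disk can simply be named after the generated key.
filenames = [f"{image_id}.jpg" for image_id in (first, second)]
```

The application never picks the number itself, so two simultaneous uploads cannot end up with the same name.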
Broadly speaking, with database stuff like this, I'd recommend building a logically clean implementation, creating the right indexing structure, and optimizing it as far as you can; then measure performance. If that performance is not satisfactory, work out alternatives. Worrying about the kind of thing you're asking risks optimizing things that are already lightning quick at the expense of maintainability, and may often introduce other, far more severe performance issues...
I'm trying to answer based on your clarification response. As there would technically be more data being stored, it would, technically, make it slower. If you are indexing the column that could have the filename, then that could also potentially cause a performance hit due to using up more memory. If or when that becomes a problem depends on how much memory your server has, etc.
In either case I suspect it's not anything you need to worry about right now and falls into the category of premature optimization. It's going to make so little of a difference that you will not notice a difference at this point unless I'm greatly underestimating your table sizes and/or traffic.
Here's an update for the second part.
It depends on exactly what you're seeing, but I would lean towards it being one of a few things.
1) Some parts of the image may need to change dynamically and it allows just loading/changing the bit that needs to change rather than the whole image.
2) By sending a large image in smaller chunks of several images, it can cause the image to start displaying sooner on the client which gives the impression of better responsiveness and faster loading.
3) The website is from a template that was provided as a PSD file. These are horrible things where someone makes an image of the website in Photoshop or the like, then chops it up into a bunch of small images that get loaded separately into a table or divs.
It's not that clear from your question what you intend to do. But I feel I should share my experience.
I stored image filenames (users' profile pictures) as strings in the User table, in _profile_pic.jpeg convention. The problem was that browsers cached the images, so when a user changed his/her profile picture in the Update Profile form, the same image showed up. Since the filename of the image was the same, the browser thought it was the same resource and served it from cache instead of fetching it over HTTP. So we ended up putting images in another table, with an auto increment field as primary key, and renamed each image file to the corresponding PK.
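Roughly, the fix can be sketched like this. The table layout, function names, and `.jpeg` extension are assumptions for illustration, and SQLite stands in for MySQL. Every upload creates a new row, so the filename (and therefore the URL the browser sees) changes with every new picture.

```python
import sqlite3

# SQLite stands in for MySQL; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " user_id INTEGER NOT NULL)"
)


def upload_profile_picture(user_id):
    """Record a new upload and return the filename derived from its key."""
    cur = conn.execute("INSERT INTO images (user_id) VALUES (?)", (user_id,))
    conn.commit()
    return f"{cur.lastrowid}.jpeg"


def current_profile_picture(user_id):
    """Return the filename of the user's most recent upload, if any."""
    row = conn.execute(
        "SELECT id FROM images WHERE user_id = ? ORDER BY id DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    return None if row is None else f"{row[0]}.jpeg"


old_name = upload_profile_picture(7)
new_name = upload_profile_picture(7)
```

Because `old_name` and `new_name` differ, a cached copy of the old picture is never mistaken for the new one.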
Auto increment will not strictly make your database faster, but you should still avoid man
Related
I need professional programmers/DBAs to bounce my idea off of and to know if it would/could even work. Please read below and give me any information that may break this theory. Thanks.
Overview of Website Idea:
The website will be used by sports card collectors to chat, answer questions on forums, showcase their cards/box breaks, trade/sell to/with other users, and keep a collection of their cards.
Design Issue:
A user can have an unlimited number of cards. This could make for some very large tables.
Design Question:
I do not want to limit the users on how many cards they can have in their collection on the site. If they have 5 copies of one card, and would rather have 5 records, one for each card, then that is their prerogative. This may also be necessary as each of the cards may be in a different condition. However, by allowing this to happen, this means that having only one table to store all records for all users is not even close to an option. I know sports card collectors with over 1,000,000 cards.
I was thinking that by either creating a table or a database for each user, it would allow for faster queries. All databases would be on the same server (I don't know who my host will be yet, only in design phase currently). There would be a main database with data that everyone would need (the base item while the user table/database would have a reference to the base item). I do see that it is possible for a field to be a foreign key from another database, so I know my idea in that aspect is possible, but overall I'm not sure what the best idea is.
I see most hosts say "unlimited number of databases" which is what got me to thinking about a database for each user. I could use this for that users posts on threads, their collection items, their preferences, and other information. Also, by having each user have a different table/database, if someone's table needed to be reindexed for whatever reason, it wouldn't affect the other users.
However, my biggest concern in either fashion would be additions/deletions to the structure of the tables/databases. I'm pretty sure a script could be written to make the necessary changes, but it seems like a pretty high risk. For instance, I'm pretty sure that I could write a script to add a field to a specific table in each database, or all of the like tables, but then to verify them it could prove difficult.
Any ideas you can throw out there for me would be greatly appreciated. I've been trying to work on this site for over a year now and keep getting stuck on the database design because of my worry of too large of tables, slow response time, and if the number of users grow, breaking some constraints set by phpmyadmin/MySQL. I also don't want to get half way through the database building and then think that there's a better way to do it. I know there may be multiple ways to do it, but what is the most common practice for it? Thank you all very much.
I was thinking that by either creating a table or a database for each user, it would allow for faster queries.
That's false. A single database will be faster.
1,000,000 cards per user isn't really a very large number unless you have 1,000,000 users.
Multiple databases is an administration nightmare. A single database is always preferred.
my worry of too large of tables, slow response time, and if the number of users grow, breaking some constraints set by phpmyadmin/MySQL
You'll be hard-pressed to exceed MySQL limits.
Slow response is part of your application and details of your SQL queries more than anything else.
Finally. And Most Important.
All technology goes out of date. Eventually, you must replace something. In order to get to the point where you're forced to upgrade, you must first get something running.
Don't worry about "large database" until you have numbers of rows in the billions.
Don't worry about "long-term" solutions because all software technology expires. Quickly.
Regarding number of users.
Much of web interaction is time spent interacting with the browser through JavaScript. Or reading a page. Clicks are actually sort of rare. MySQL on a reasonably large server should handle 30 or more nearly concurrent queries with sub-second response. Your application will probably take very little time to format and start sending an HTML page. Things can rip along at a very, very good clip on a typical server.
If your database design avoids the dreaded full-table scan.
You must have proper indexes for the most common queries.
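As a sketch of what "proper indexes" means for the card collection, assume a single hypothetical `user_cards` table shared by all users. SQLite stands in for MySQL, and its `EXPLAIN QUERY PLAN` plays the role that `EXPLAIN` plays in MySQL.

```python
import sqlite3

# SQLite stands in for MySQL; table, column and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_cards ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " card_id INTEGER NOT NULL,"
    " card_condition TEXT)"
)
# The most common query is "all cards of one user", so index user_id.
conn.execute("CREATE INDEX idx_user_cards_user ON user_cards (user_id)")
conn.executemany(
    "INSERT INTO user_cards (user_id, card_id, card_condition) VALUES (?, ?, ?)",
    [(user, card, "mint") for user in range(100) for card in range(50)],
)


def plan(sql, params=()):
    """Return the database's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


# Filtering on the indexed column is an index search.
indexed_plan = plan("SELECT card_id FROM user_cards WHERE user_id = ?", (42,))
# Filtering on an unindexed column falls back to scanning the whole table.
unindexed_plan = plan(
    "SELECT card_id FROM user_cards WHERE card_condition = ?", ("mint",)
)
```

With the index in place, one user's cards are found without touching anyone else's rows, which is the benefit the per-user tables were meant to provide.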
Now. What are the odds of 30 nearly concurrent requests? If a user only clicks once every 10 seconds (they have to read the page, fill in the form, re-read the page, think, drink their beer) then the odds of 30 clicks in a single second means you have to have 300 concurrent users. Considering that people have other things to do in their lives, that means you must have 50,000 or so users (figuring they're spending 1 hour each week on your site.)
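The estimate above, written out with the same figures (one click per user every 10 seconds, one hour per user per week):

```python
# Target load: 30 requests arriving within the same second.
requests_per_second = 30
seconds_between_clicks = 10

# Each active user contributes 1/10 of a request per second.
concurrent_users = requests_per_second * seconds_between_clicks  # 300

# A week has 168 hours; each user is assumed to be on the site for 1 of them.
hours_per_week = 7 * 24
hours_on_site_per_user = 1

# Users needed so that 300 of them are online at any given moment.
total_users = concurrent_users * hours_per_week // hours_on_site_per_user  # 50400
```

That is where the "50,000 or so" figure comes from, assuming visits are spread evenly across the week.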
I wouldn't go down the path of creating a database for every user... that will create countless headaches for you: data integrity issues, referential integrity issues, administrative issues...
As long as your table is well normalized and indexed, I don't think a table with hundreds of millions of rows is prohibitively large.
Instead, I would just start with a simple table design. If your site is wildly successful, it wouldn't be any extra effort to implement partitioning or sharding in MySQL down the road as opposed to scaling out right off the bat.
If I were in your shoes I would start with one database and one table and not worry too much about the possible size of the table. If you ever get so successful that you reach the size you imagine, you will probably have a lot more resources and knowledge of your domain to make a better informed decision. Once that happens, you can also consider NoSQL solutions such as HBase, MongoDB and others that allow for horizontal scaling (unlimited size), with some limitations that businesses dealing with big data are bound to face. You can also use MySQL partitions or other sharding solutions. So, go build your product with one table and don't sweat this problem until you absolutely need to. Good luck!
I know this theme has been widely discussed in the past, and I thoroughly analysed the many insightful answers on the matter - confirming my idea that, generally, storing blobs in the db is bad practice.
Now let's take a look at the following scenarios:
There's users, which has a one-to-many relationship with images;
The images rows would contain, apart from users' FK and some metadata (date, title...), the following binaries (or file paths pointing to the following binaries):
Thumbnail (a ridiculously small binary);
Fullsize (will actually be preprocessed, to be about 400x600 and around 35-45kb);
I will never need any data from the images table without an image as well (and anyway, I know where not to use SELECT *);
I want to use fs and memory cache;
In the most common scenario, I'll just need the thumbs (only getting the fullsize images dynamically on some events and, in those cases, getting them by ID). Clarifying: either many very very small pictures or one still pretty small one per call;
Users will want to change their pictures' data, delete them, change them a lot.
Everything seems to make me think that the DB solution is optimal.
Are there drawbacks I fail to see (apart from the obvious open db connection in the event of no cache)?
One problem in this scenario is making backups, or syncing a "development" or "test" environment with the production one (moving around a db dump would be painful). It's also much simpler to edit all the images en masse if they are on the fs (say: mass create a new size for the images).
Anyway, I don't fully get into the real advantages of this approach versus a "pointers in db, files on fs" one. The memory cache, perhaps? These days when I think about blobs, I think of S3 anyway :)
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
I can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following / for each person I follow, select his photos).
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n meaning the number of people who follow me, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet, a third approach that comes to mind, is even a less denormalized schema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
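A rough sketch of the denormalized feed with pruning, assuming hypothetical `follows` and `feed` tables and integer timestamps. SQLite stands in for MySQL. Each upload is copied into one row per follower (expensive write), so reading a feed is a single indexed lookup (cheap read).

```python
import sqlite3

# SQLite stands in for MySQL; schema and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE feed (
        owner_id INTEGER,      -- whose feed this row belongs to
        photo_id INTEGER,
        uploader_id INTEGER,
        created_at INTEGER     -- seconds since some epoch
    );
    CREATE INDEX idx_feed_owner_time ON feed (owner_id, created_at);
    """
)


def follow(follower_id, followee_id):
    conn.execute("INSERT INTO follows VALUES (?, ?)", (follower_id, followee_id))


def publish_photo(uploader_id, photo_id, now):
    """Write one feed row per follower of the uploader."""
    conn.execute(
        "INSERT INTO feed (owner_id, photo_id, uploader_id, created_at) "
        "SELECT follower_id, ?, ?, ? FROM follows WHERE followee_id = ?",
        (photo_id, uploader_id, now, uploader_id),
    )


def read_feed(owner_id, limit=20):
    """Newest first, served from the (owner_id, created_at) index."""
    rows = conn.execute(
        "SELECT photo_id FROM feed WHERE owner_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (owner_id, limit),
    )
    return [row[0] for row in rows]


def prune(now, max_age):
    """Drop feed rows older than max_age seconds."""
    conn.execute("DELETE FROM feed WHERE created_at < ?", (now - max_age,))


follow(2, 1)
follow(3, 1)
publish_photo(uploader_id=1, photo_id=100, now=1000)
publish_photo(uploader_id=1, photo_id=101, now=2000)
feed_before = read_feed(2)
prune(now=2500, max_age=1000)
feed_after = read_feed(2)
```

Shrinking `max_age` as the table grows is the knob described above: it bounds the table size at the cost of a shorter history.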
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your time worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take
Add more hardware, Memory, CPU -- Enter cloud hosting
How does 24GB of memory sound? Most of your frequently accessed DB information can fit in memory.
Choose a host with expandable SSDs.
Use an events-based system in your application to write the "history" of all users. It would look like this: id, user_id, event_name, date, event_parameters -- an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>. Most important of all, this table will be in memory, so you no longer need to worry about write performance. After the records are older than, e.g., 3 days they can be purged into another table (not in memory) and included in the query results if the user chooses to go back that far. By having all this in one table you avoid having to do multiple queries and SELECTs to build up this information.
Consider using InnoDB for the history/feeds table.
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they come back as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you away from non-transactional reads.
These kinds of problems are why NoSQL solutions are used these days. What I did in my previous projects is really simple. I keep user->wall and user->history, which contain purely feed ids, in memory stores (my favorite is Redis). So on every insert I do 1 insert operation on the database and (n * read optimization) insert operations in the memory store. I design the memory store to optimize my reads. If I want to filter user history (or wall) for videos, I push the feed id to a list like user::{userid}::wall::videos.
Well, of course you can build the system purely in memory stores as well, but it's nice to have 2 systems each doing what it does best.
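To make the key layout concrete, here is a small sketch in which a plain Python dictionary of deques stands in for Redis lists, so it runs without a server. The function names are made up for illustration; with a real Redis client, the `appendleft` calls would correspond to list pushes and `read_wall` to a range read.

```python
from collections import defaultdict, deque

# Stand-in for Redis: each key maps to a list with the newest item first.
store = defaultdict(deque)


def wall_key(user_id, kind=None):
    """Build keys like user::7::wall or user::7::wall::videos."""
    key = f"user::{user_id}::wall"
    return f"{key}::{kind}" if kind else key


def publish(feed_id, kind, follower_ids):
    """Push only the feed id onto every follower's wall lists."""
    for follower_id in follower_ids:
        store[wall_key(follower_id)].appendleft(feed_id)
        # A second, pre-filtered list makes "videos only" a plain read.
        store[wall_key(follower_id, kind)].appendleft(feed_id)


def read_wall(user_id, kind=None, limit=10):
    """Return the newest feed ids; the full rows still live in the database."""
    return list(store[wall_key(user_id, kind)])[:limit]


publish(feed_id=1, kind="photos", follower_ids=[7, 8])
publish(feed_id=2, kind="videos", follower_ids=[7])
```

The lists hold only ids, so the memory store stays small and the database remains the source of truth for the feed items themselves.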
edit :
checkout these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice.
Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in similar manner - by sharding data across network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.
I am building a very simple classified site.
There is a form that puts data in mysql table.
Now how should this data be displayed ? Is it better to build html pages from the data in a table , and then display it to the users OR is it better to, fetch the data from the mysql table each time a user wants to see the data ?
I hope I was clear!
Performance-wise, it's generally better to keep the static versions of the HTML pages.
However, you may have too much dynamic content, which can bloat your disk space, and you will need to put extra effort into tracking cache expiration (which can be even more expensive than generating the content dynamically).
It's a matter of tradeoff, and to give any further advice we would need to know the nature of your data.
If it's a blog with content updated rarely but read often, it's better to cache.
If it's a product search engine with mostly unique queries and changing stock, it's better to always query the database.
Note that MySQL (before version 8.0, which removed the feature) implements a query cache: it can cache the result sets of queries, and if a query is repeated verbatim and no underlying tables were changed since the last query, then it's served out of the cache.
It tracks the cache expiration automatically, saves you of the need to keep the files on the disk and generally combines the benefits of both methods.
You can use PHP caching techniques if the data does not change frequently. Keep loading the cached contents for frequent visits.
http://www.developertutorials.com/tutorials/php/php-caching-1370/
Use both, via a caching mechanism. Based on parameters, the page would be re-rendered (has not been viewed in X time or at all) or displayed from cache otherwise.
As stated though, it depends heavily on the amount of and frequency with which the data is accessed. More information would warrant a more detailed response.
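One way to sketch the "use both" approach is a small cache that re-renders a page only when the stored copy is older than some limit. The class and parameter names below are hypothetical, and `render` represents whatever function builds the HTML from the MySQL data.

```python
import time


class PageCache:
    """Serve rendered pages from memory until they are max_age_seconds old."""

    def __init__(self, render, max_age_seconds, clock=time.monotonic):
        self._render = render          # builds the HTML, e.g. from a DB query
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (time rendered, html)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]            # still fresh: no database work
        html = self._render(key)       # missing or stale: rebuild and store
        self._entries[key] = (now, html)
        return html
```

For a classified ad this could be used as `PageCache(render_ad, max_age_seconds=300).get(ad_id)`, where `render_ad` is the existing function that queries the table. The expiry limit is the tradeoff between freshness and database load.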
It depends on a few things. Ask yourself two easy questions:
1) How often does the content change? Are your classified ads static or are they changing a lot on the page. How much control do you want on that page to have rotating ads, comments from users, reviews etc.
2) Are you going to be VERY high traffic? So much so that you are looking at a bottleneck at the database?
If you can say "Yes, no doubts, tomorrow" to question #2, go static, even if it means adding other things in via ajax or non-database calls (i.e. includes) in order to make the page pseudo-dynamic.
Otherwise if you say "Yes" to question #1, go with a dynamic page so you have the freshest content at all times. These days users have gotten very used to instant gratification on posts and such. Gone are the days we would wait for hours for a comment to appear in a thread (I am looking at you Slashdot).
Hope this helps!
Start with the simplest possible solution that satisfies the requirements, and go from there.
If you implement a cache but didn't need one, you have wasted time (and/or money). You could have implemented features instead. Also, now you (might) have to deal with the cache every time you add features.
If you don't implement a cache and realize you need one, you are now in a very good position to implement a smart one, because now you know exactly what needs to be cached.
I'm building an English web dictionary where users can type in words and get definitions. I thought about this for a while and since the data is 100% static and I was only to retrieve one word at a time I was better off using the filesystem (ext3) as the database system instead of opting to use MySQL to store definitions. I figured there would be less overhead considering that you have to connect to MySQL and that in itself is a very slow operation.
My fear is that if my system were to get bombarded by let's say 500 word retrievals/sec, would I still be better off using the filesystem as the database? or will the increased filesystem reads hinder performance as opposed to something that MySQL might be doing under the hood?
Currently the hierarchy is segmented by first letter, second letter and third letter of the word. So if you were to search for the definition of "water", the script (PHP) will try to read from "../dict/w/a/t/water.word" (after cleaning up the word of problematic characters and lowercasing it)
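The path scheme described above can be sketched as follows. The original script is PHP; this is a Python illustration only. The cleaning rule (keep only the letters a-z) and the handling of words shorter than three letters are assumptions, since the post does not spell them out.

```python
import re
from pathlib import PurePosixPath


def word_path(word, root="../dict"):
    """Map a word to its file, e.g. 'water' -> ../dict/w/a/t/water.word."""
    # Assumed cleaning rule: lowercase, then drop anything that is not a-z.
    cleaned = re.sub(r"[^a-z]", "", word.lower())
    if not cleaned:
        raise ValueError("word has no usable letters")
    # One directory level per leading letter, up to three levels.
    levels = list(cleaned[:3])
    return str(PurePosixPath(root, *levels, cleaned + ".word"))


path = word_path("Water")
```

Stripping everything but letters also keeps input such as `../` from escaping the dictionary directory.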
Am I heading in the right direction with this or is there a faster solution (not counting storing definitions in memory using something like memcached)? Will the amount of files stored in any directory factor in performance? What's the rough benchmark for the number of files that I should store in a directory?
What are your grounds for your belief that this decision will matter to the overall performance of the solution? What does it do other than provide definitions?
Do you have MySQL as part of the solution anyway, or would you need to add it should you select it as the solution here?
Where is the definitive source of definitions? The (maybe replicated) filesystem, or some off line DB?
It seems like something that should be in a DB architecturally - filesystems are a strange place to map a large number of names to values (as is evidenced by your file system structure breaking things down by initial letters)
If it's in the DB, answering questions like "how many definitions are there?" is a lot easier, but if you don't care about such things for your application, this may not matter.
So to some extent this feels like looking to hyper optimise the performance of something whose performance won't actually make much difference to the overall solution.
I'm a fan of "make it correct, then make it fast", and "correct" would be more straightforward to achieve with a DB.
And of course, the ultimate answer would be to try both and see which one works best in your situation.
Paul
The type of lookups that a dictionary requires is exactly what a database is good at. I think the filesystem method you describe will be unworkable. Don't make it hard! Use a Database.
You can keep a connection pool around to speed up connecting to the DB.
Also, if this application needs to scale to multiple servers, the file system may be tricky to share between servers.
So, I third the suggestion. Use a DB.
But unless it's a fabulously large dictionary, caching would mean you're nearly always getting stuff from local memory, so I don't think this is going to be the biggest issue for your application :)
A DB sounds perfect for your needs.
I also don't see why memcached is relevant (how big is your data? Can't be more than a few GB... right?)
The data is approximately a couple of GBs. And my goal is speed, speed, speed (definitions will be loaded using XHR). The data, as I said, is static and is never going to change, and nowhere would I be using anything other than a single read operation for each request. So I'm having a pretty hard time getting convinced to use MySQL and all its bloat.
Which would be first to fail under high load using this strategy, the filesystem or MySQL? As for scaling replication is the answer since the data will never change and is only a couple of GBs.
Make it work first. Premature optimisation is bad.
Using a database enables easier refactoring of your schema, and you don't have to write an implementation of an index-based lookup, which in actual fact is nontrivial.
Saying that connecting to a database "is a very slow operation" overstates the problem. Actually connecting should not take very long, plus you can reuse connections anyway.
If you are worried about read-scaling, a 1G database is very small, so you can push readonly replicas of it to each web server and they can each read from their local copy. Provided the writes stay at a level which doesn't impact read performance, that gives you almost perfect read-scalability.
Moreover, 1G of data will fit into ram easily, so you can make it fast by loading the entire database into memory at startup time (before that node advertises itself to the load balancer).
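Loading everything at startup could look roughly like this. A hypothetical `definitions` table with `word` and `definition` columns is assumed, and SQLite stands in for MySQL so the sketch is runnable on its own.

```python
import sqlite3


def load_definitions(conn):
    """Read every (word, definition) row into a dict in a single pass."""
    return dict(conn.execute("SELECT word, definition FROM definitions"))


# Stand-in database with two sample rows; the real one would be MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE definitions (word TEXT PRIMARY KEY, definition TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO definitions VALUES (?, ?)",
    [
        ("water", "a clear liquid"),
        ("fire", "combustion giving off heat and light"),
    ],
)

# Done once, before the node starts accepting requests.
definitions = load_definitions(conn)


def lookup(word):
    """Serve a request from memory; returns None for unknown words."""
    return definitions.get(word.lower())
```

After startup, every lookup is a hash-table access in process memory, and the database is only touched again when the node restarts.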
500 lookups per second is trivially small. I would start worrying about 5000 per second per server, maybe. If you can't achieve 5000 key lookups per second on modern hardware (from a database which fits in RAM?!!), there is something seriously wrong with your implementation.
Agreeing that this is premature optimization, and that MySQL surely will be performant enough for this use case. I must add you can also use a file based database, like the very fast Tokyo Cabinet as a compromise. Sadly it doesn't have a PHP binding so you could use its grandfather, DBM.
That said, do not use a filesystem, there's no good reason to, as far as I can see.
Use a virtual drive in RAM (search for a how-to for your distro), or if your data is served by PHP use APC; memcache might also work well with MySQL. Personally I don't think the optimization you are doing here is really where you should be spending your time. 500 requests a second is massive, and I think using MySQL would give you better options for adding features later. I think you need to concentrate on features and not speed if you want to differentiate yourself from your competitors. Also, there are a few good talks about UI for the web; server speed is only a small factor in the whole picture.
Good luck
You might also think about a NoSQL database (like Riak, MongoDB, or even Redis) for something like this. They are all super-fast and help out with your replication. MySQL might be overkill and hard to scale in an instance like this, but the other ones have some robust tools.