I am building a very simple classified site.
There is a form that puts data into a MySQL table.
Now, how should this data be displayed? Is it better to build HTML pages from the data in the table and then serve those to users, OR is it better to fetch the data from the MySQL table each time a user wants to see it?
I hope I was clear!
Performance-wise, it's generally better to keep the static versions of the HTML pages.
However, you may have too much dynamic content, which can bloat your disk space, and you will have to apply some extra effort to track cache expiration (which can be even more expensive than generating the content dynamically).
It's a matter of tradeoff, and to give any further advice we would need to know the nature of your data.
If it's a blog with content updated rarely but read often, it's better to cache.
If it's a product search engine with mostly unique queries and changing stock, it's better to always query the database.
Note that MySQL implements a query cache: it can cache the result sets of queries, and if a query is repeated verbatim and no underlying tables have changed since the last run, it is served out of the cache. It tracks cache expiration automatically, saves you the need to keep files on disk, and generally combines the benefits of both methods. (Note, though, that the query cache was removed in MySQL 8.0.)
You can use PHP caching techniques if the data will not change frequently. Keep serving the cached contents for frequent visits.
http://www.developertutorials.com/tutorials/php/php-caching-1370/
Use both, via a caching mechanism. Based on parameters, the page is re-rendered (if it has not been viewed within X time, or at all) or displayed from cache otherwise.
As stated, though, it depends heavily on the amount of data and the frequency with which it is accessed. More information would warrant a more detailed response.
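A minimal sketch of that cache-or-render idea in Python (the render function, cache path, and 5-minute TTL are all invented for illustration, not taken from the question):

```python
import os
import time
import tempfile

CACHE_TTL = 300  # seconds before a cached page counts as stale (assumed value)

def render_page():
    # Stand-in for the expensive part: querying MySQL and building the HTML.
    return "<html><body>classified listings</body></html>"

def get_page(cache_path, ttl=CACHE_TTL):
    """Serve from cache if fresh enough; otherwise re-render and refresh it."""
    if os.path.exists(cache_path) and time.time() - os.path.getmtime(cache_path) < ttl:
        with open(cache_path) as f:
            return f.read()
    html = render_page()
    with open(cache_path, "w") as f:
        f.write(html)
    return html

cache_file = os.path.join(tempfile.mkdtemp(), "listings.html")
first = get_page(cache_file)   # cache miss: renders and stores the page
second = get_page(cache_file)  # cache hit: read straight from disk
```

The same shape works whether the cache lives on disk, in memcached, or in the database itself.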
It depends on a few things. Ask yourself two easy questions:
1) How often does the content change? Are your classified ads static, or do they change a lot on the page? How much control do you want over that page, to have rotating ads, comments from users, reviews, etc.?
2) Are you going to be VERY high traffic? So much so that you are looking at a bottleneck at the database?
If you can say "Yes, no doubt, tomorrow" to question #2, go static, even if it means adding other things in via Ajax or non-database calls (i.e. includes) to make the page pseudo-dynamic.
Otherwise, if you say "Yes" to question #1, go with a dynamic page so you have the freshest content at all times. These days users have gotten very used to instant gratification on posts and such. Gone are the days when we would wait hours for a comment to appear in a thread (I am looking at you, Slashdot).
Hope this helps!
Start with the simplest possible solution that satisfies the requirements, and go from there.
If you implement a cache but didn't need one, you have wasted time (and/or money) that you could have spent implementing features instead. Also, you (might) now have to deal with the cache every time you add features.
If you don't implement a cache and realize you need one, you are now in a very good position to implement a smart one, because now you know exactly what needs to be cached.
Every page on my website queries something from the database.
Each query in itself is very small and the time it takes to load is unnoticeable, but I'm wondering if it's okay to do this on every page, since I don't really know much about how querying the database works, or whether doing it multiple times - in my case on every page load - affects anything significantly.
As with all things, the answer is it depends. :-)
Most web sites you visit query something from a database on every page load. If the queries are crafted well, they look up just the data they need and avoid scanning through a big database. You might like my presentation How to Design Indexes, Really (video) to help with this.
Another strategy is to use a fast cache in RAM for data that is needed frequently. RAM is thousands of times faster than disk drives. You might like to familiarize yourself with the Numbers Everyone Should Know. Those numbers are just examples, but the intention is to get programmers to think about the fact that moving data around has different costs as you use RAM vs. disk vs. network vs. CPU.
P.S.: Please don't buy into the myth that you're not good at computers because you're a woman. Everyone starts out as a novice, no matter what their gender or background. Only through practice and study do any of us learn this stuff. I recommend seeing Hidden Figures, the story of the women who did pioneering math and programming for NASA.
Another notable woman is Margaret Hamilton, who practically invented the profession of "software engineering."
Yes, you are OK to query the database on every page load.
Think about websites like Facebook. When you visit the site it needs to know who you are - it gets that from a database. It needs to know all of the status updates that it's going to show you - it gets that from a database. When you hit the bottom of the news feed and it gets more for you to read - it gets that from a database.
That's normal. Most web applications have to query the database for each page load (usually several times), since most of the page content comes from the database.
If you're concerned about performance, think about this: is the query different for each page? Or is it loading the same data over and over? If it keeps querying the same thing (like the current user's name), you can improve performance by storing the data in the application's session state. But if it's different (like how many unread messages the user has), you'll need to run the query each time.
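A sketch of that distinction, caching a stable per-user value in session state while re-querying the volatile one (the session dict and the two query functions are hypothetical stand-ins, not real framework APIs):

```python
# Hypothetical session store; in a real web app this would be the
# framework's per-user session, not a module-level dict.
session = {}

def query_user_name(user_id):
    # Stand-in for a DB query; stable for the whole session, so cache it.
    return f"user-{user_id}"

def query_unread_count(user_id):
    # Stand-in for a DB query; can change between page loads, so always run it.
    return 3

def handle_page_load(user_id):
    if "name" not in session:                # queried once, then reused
        session["name"] = query_user_name(user_id)
    unread = query_unread_count(user_id)     # queried on every page load
    return session["name"], unread

name, unread = handle_page_load(42)
```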
Imagine visiting a website with features like a "who's online" list or a messaging system: whenever you click to another page, the site needs to update the database so it can keep track of where you are. If you receive a private message, it is accessible on the next page click because the database was updated when the message was sent. The trick is to run queries only for tasks that are required at that time. For instance, if you are looking up a username, searching the whole table is slow; searching by an indexed column is faster, and it is faster still if you bound the result with something like LIMIT in the query.
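The same point demonstrated with SQLite standing in for MySQL (the table and data are invented for the example): an index on the searched column plus a LIMIT keeps the lookup from scanning the whole table.

```python
import sqlite3

# SQLite stands in for MySQL; the table and data are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)",
                 [(f"user{i}",) for i in range(1000)])

# Index the column we search by, so lookups don't scan the whole table.
conn.execute("CREATE INDEX idx_users_username ON users (username)")

# Search by the indexed column and bound the result with LIMIT.
row = conn.execute(
    "SELECT id, username FROM users WHERE username = ? LIMIT 1",
    ("user500",),
).fetchone()
```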
I have a query that takes about a minute to complete since it deals with a lot of data, but I also want to put the results on a website. The obvious conclusion is to cache it (right?) but the data changes as time goes by and I need a way to automatically remake the cached page maybe every 24 hours.
Can someone point me to how to do this?
edit: I want to make a "top 10" type of thing, so it's not displaying the page that is the problem but the amount of time it takes for the query to run.
Caching the results of the query with a 24hr TTL (expiry) would probably work fine. Use a fragment cache assuming this is a chunk of the page.
You can set up memcached or Redis, as stated, to store the cache. Another thing you can do is set up a job that warms the cache every 24 hrs (or as desired) so that an unlucky user doesn't have to generate the cache for you.
If you know when the cache is expired based on a state change in your database, you can expire the cache based on that. A lot of the time I use the created-at or updated-at fields as part of the cache key to assist in this process.
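A sketch of that key-based expiration trick (the record and cache here are plain dicts standing in for a DB row and memcached): because the updated-at timestamp is part of the key, changing the record automatically sidesteps the stale entry.

```python
cache = {}  # stand-in for memcached/redis

def cache_key(record):
    # The updated-at timestamp is baked into the key, so changing the
    # record produces a new key and the stale entry is simply never hit again.
    return f"top10/{record['id']}/{record['updated_at']}"

def fetch_top10(record):
    key = cache_key(record)
    if key not in cache:
        # The slow, minute-long query would run here.
        cache[key] = f"top-10 as of {record['updated_at']}"
    return cache[key]

record = {"id": 7, "updated_at": "2011-01-01T10:00:00"}
first = fetch_top10(record)                   # miss: runs the expensive query
record["updated_at"] = "2011-01-02T10:00:00"  # the underlying data changed
second = fetch_top10(record)                  # old key abandoned, fresh result cached
```

Abandoned keys are left to the store's normal eviction (LRU in memcached, TTLs in Redis), which is why this pairs well with the 24hr expiry above.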
There is some good stuff in the Scaling Rails screencasts by Envy Labs and New Relic: http://railslab.newrelic.com/scaling-rails (a little out of date, but the principles are still the same).
Also, check out the Rails caching guide: http://guides.rubyonrails.org/caching_with_rails.html
Finally, make sure indexes are set up properly; thoughtbot's post here is useful: http://robots.thoughtbot.com/post/163627511/a-grand-piano-for-your-violin
Typed on my phone so apologies for typos.
Think a little beyond the query. If your goal is to allow the user to view a lot of data, then grab that data as they want it rather than fighting with a monstrous query that's going to overwhelm your UI. The result not only looks better, but is much, much quicker.
My personal trick for this pattern is DataTables. It's a grid that lets you use Ajaxed queries (which are built in) to get data from your query a "chunk" at a time, as the user wants to see it. It can sort, page, filter, limit, and even search with some simple additions to the code. It even has plug-ins to export results to Excel, PDF, etc.
The biggest thing that DataTables has that others don't is a concept called "pipelining", which lets you fetch the amount to show (say 20) plus an additional amount forward and/or backward. This keeps queries manageable without hitting the database each time the user clicks "next page".
I've got an app dealing with millions of records. One query of all the data would be impossible - it would just take too long. Grabbing 25 at a time, however, is lightning fast, no tricks required. Once the datatable was up, I just performance-tuned my query, did some indexing where needed, and voila: a great, responsive app.
Here's a simple example:
<table id="example"></table>
$('#example').dataTable({
    "bProcessing": true,
    "bServerSide": true,
    "sAjaxSource": "/processing/file.php"
});
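On the server side, the endpoint behind sAjaxSource only ever fetches the slice the grid is currently showing. Roughly like this, with SQLite standing in for the real database and DataTables' exact request/response format left out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO records (name) VALUES (?)",
                 [(f"record {i}",) for i in range(1000)])

def fetch_page(offset, limit=20):
    """Return just the slice of rows the grid is currently displaying."""
    return conn.execute(
        "SELECT id, name FROM records ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()

page = fetch_page(offset=40)  # rows 41-60, not the whole table
```

Pipelining just means asking for a somewhat larger limit than the visible page so the next click can be served from what was already fetched.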
Use a cache store that allows auto-expiration after a certain length of time.
Memcached does it, Redis too I guess !
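Both support TTLs natively (memcached's expiry argument on set, Redis's SETEX/EXPIRE); the behaviour amounts to something like this toy in-memory version, written from scratch for illustration:

```python
import time

class TTLCache:
    """Tiny cache whose entries expire after a fixed time-to-live."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:  # expired: treat it as a miss
            del self._data[key]
            return None
        return value

cache = TTLCache()
cache.set("top10", ["a", "b", "c"], ttl_seconds=24 * 3600)  # the 24hr TTL
hit = cache.get("top10")
cache.set("gone", "x", ttl_seconds=0)  # expires immediately
time.sleep(0.01)
miss = cache.get("gone")
```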
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
I can easily design the activity feed as a normalized part of the database, which will save me write cycles but will enormously increase the complexity of selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following; or, for each person I follow, select his photos).
An optimization option could be the introduction of a series of threshold constraints which, for instance, would allow me to order the people I follow by the date of their last upload, or even exclude some to save cycles, and select only the last 5 (for example) uploaded photos per user.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people who follow me, i.e. lots of write cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (a queue).
Yet a third approach that comes to mind is a less denormalized schema, where the server-side application takes some of the complexity off the DB. I have seen that some social apps, such as FriendFeed, rely heavily on storing serialized objects, such as JSON objects, in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure there are many things I've missed or have still to learn. I would highly appreciate it if someone could at least point me in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your time worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take:
Add more hardware, memory, CPU -- enter cloud hosting.
How does 24GB of memory sound? Most of your frequently accessed DB information can fit in memory alone.
Choose a host with expandable SSDs.
Use an events-based system in your application to write the "history" of all users. A row would look like: id, user_id, event_name, date, event_parameters - for example: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>. Most important of all, this table will be in memory, so you no longer need to worry about write performance. After records get older than, say, 3 days, they can be purged into another (non-memory) table and included in the query results if the user chooses to go back that far. By having all of this in one table, you avoid having to do multiple queries and SELECTs to build up this information.
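A sketch of that events table and the purge step, with SQLite standing in for MySQL's MEMORY/InnoDB engines (the archive table name and the ISO date format are choices made for the example; ISO dates keep string comparisons in order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (          -- hot, recent history (in memory, per the answer)
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    event_name TEXT,
    date TEXT,
    event_parameters TEXT
);
CREATE TABLE events_archive (  -- older rows purged out of the hot table
    id INTEGER PRIMARY KEY,
    user_id INTEGER,
    event_name TEXT,
    date TEXT,
    event_parameters TEXT
);
""")

conn.execute("INSERT INTO events VALUES (1, 8, 'CHANGED_PROFILE_PICTURE', '2011-03-26 12:34', '99')")
conn.execute("INSERT INTO events VALUES (2, 8, 'UPLOADED_PHOTO', '2011-03-20 09:00', '100')")

# Purge rows older than the 3-day cutoff into the archive table.
cutoff = '2011-03-23 00:00'
conn.execute("INSERT INTO events_archive SELECT * FROM events WHERE date < ?", (cutoff,))
conn.execute("DELETE FROM events WHERE date < ?", (cutoff,))

hot = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM events_archive").fetchone()[0]
```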
Consider using InnoDB for the history/feeds table.
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (non-locking) reads to pull the information back out, making sure to use a cursor so that you can process results as they come back rather than waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you toward transactional reads.
These kinds of problems are why NoSQL solutions are used these days. What I did in my previous projects is really simple: I keep user->wall and user->history, which contain purely feed IDs, in memory stores (my favorite is Redis). So on every insert I do one insert operation on the database and (n * read-optimization) insert operations in the memory store. I design the memory store to optimize my reads: if I want to be able to filter a user's history (or wall) for videos, I push the feed ID to a list like user::{userid}::wall::videos.
Well, of course you could build the system purely on memory stores as well, but it's nice to have two systems, each doing what it does best.
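A sketch of that fan-out, with a plain dict of lists standing in for Redis (in real Redis these would be LPUSH calls against keys like user::{userid}::wall::videos):

```python
# Stand-in for Redis: key -> list of feed IDs, newest first.
store = {}

def lpush(key, value):
    store.setdefault(key, []).insert(0, value)

def publish_photo(feed_id, author_id, follower_ids, kind="photos"):
    # One insert in the database (not shown), then one fan-out write per
    # follower in the memory store, pre-filtered by content type so reads
    # of "just photos" or "just videos" need no further filtering.
    for follower in follower_ids:
        lpush(f"user::{follower}::wall::{kind}", feed_id)
    lpush(f"user::{author_id}::history::{kind}", feed_id)

publish_photo(feed_id=101, author_id=1, follower_ids=[2, 3])
wall = store["user::2::wall::photos"]  # follower 2's pre-built wall
```

Reading a wall is then a single list fetch, which is where the write-time cost pays off.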
edit:
Check out these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them; however, no one ever mentions the drawbacks of such a choice.
The most obvious one for me is the lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in a similar manner - by sharding data across the network (naturally, there are more choices, but this is the most obvious one). Since NoSQL does less work (there is no SQL layer, so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.
So I decided I want my image names stored in MySQL; I would save the "1" in my MySQL database.
Now, so far I intend to name my images with just numbers, auto-incremented actually. But should I give my images proper names? Because I heard MySQL can be rather slow, and I don't want a bunch of image names in my database if it's going to make my database run that much slower, especially since image names really aren't that important.
Also, would autoincrement make my database run faster, as opposed to manual numbering?
EDIT: Totally unrelated question. I notice some people/sites separate their images into smaller chunks - something that could easily be one image is separated into multiple image files. I don't get that, especially since Amazon's image hosting service charges per "get", i.e. Amazon charges per picture too. So why do people split up images? I have my assumptions, but I'm sure there's more to it.
Uhm... not sure about the context of your question, but broadly speaking, MySQL is one of the fastest database engines you can get for moderate use (paid-for products like Oracle and MS SQL Server are usually better for extreme requirements).
Storing the name as a string will not be noticeably slower than storing it as an integer - but it would help to get more of an idea of the actual application.
Auto increment is a sensible choice for automatically assigning a primary key; it guarantees uniqueness of the number even if two processes access the table at the exact same time. In terms of performance, again - I don't think you could measure the difference.
Broadly speaking, with database stuff like this, I'd recommend building a logically clean implementation, creating the right indexing structure, and optimizing it as far as you can; then measure performance. If that performance is not satisfactory, work out alternatives. Worrying about the kind of thing you're asking about risks optimizing things that are already lightning quick at the expense of maintainability, and may well introduce other, far more severe performance issues...
I'm trying to answer based on your clarification response. As there would technically be more data being stored, it would, technically, make it slower. If you are indexing the column that holds the filename, that could also cause a performance hit by using up more memory. If or when that becomes a problem depends on how much memory your server has, etc.
In either case I suspect it's not anything you need to worry about right now and falls into the category of premature optimization. It's going to make so little of a difference that you will not notice a difference at this point unless I'm greatly underestimating your table sizes and/or traffic.
Here's an update for the second part.
It depends on exactly what you're seeing, but I would lean towards it being one of a few things.
1) Some parts of the image may need to change dynamically and it allows just loading/changing the bit that needs to change rather than the whole image.
2) By sending a large image in smaller chunks of several images, it can cause the image to start displaying sooner on the client which gives the impression of better responsiveness and faster loading.
3) The website is from a template that was provided as a PSD file. These are horrible things where someone makes an image of the website in Photoshop or the like, chops it up into a bunch of small images, and loads them separately into a table or divs.
It's not that clear from your question what you intend to do. But I feel I should share my experience.
I stored image filenames (users' profile pictures) as strings in the User table, using a _profile_pic.jpeg naming convention. The problem was that, since browsers cached the images, when a user changed his/her profile picture in the Update Profile form, the same image still showed up. Since the filename of the image was the same, the browser thought it was the same resource and served it from cache instead of fetching it over HTTP. So we ended up putting images in another table, with an auto-increment field as the primary key, and renaming each image file to its corresponding PK.
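A sketch of that fix, with SQLite and a temp directory standing in for MySQL and the real image store: each upload inserts a new row, the file is saved under the new primary key, and the changed URL means the browser cache can never serve the old picture.

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, user_id INTEGER)")
image_dir = tempfile.mkdtemp()

def save_profile_pic(user_id, data):
    # New row per upload: the auto-increment PK becomes the filename, so a
    # changed picture always gets a fresh URL the browser hasn't cached.
    cur = conn.execute("INSERT INTO images (user_id) VALUES (?)", (user_id,))
    filename = f"{cur.lastrowid}.jpeg"
    with open(os.path.join(image_dir, filename), "wb") as f:
        f.write(data)
    return filename

first = save_profile_pic(7, b"old picture bytes")
second = save_profile_pic(7, b"new picture bytes")  # different filename, no stale cache
```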
Auto increment will not strictly make your database faster, but you should still avoid manual numbering.
What are some use cases that would benefit from using memcached with a MySQL DB? I would guess it would be good for data that does not change much over time.
More specifically, if my data changes often, then it's not worth using memcached, right?
Even more specifically, I am trying to use the DB as a data structure for a multiplayer game, so the records are going to change with every move the players make, and all players' views should be updated with the latest moves. So my app is getting read- and write-intensive, and I'm trying to see what I can do about it. If I use memcached, for every write we read 3 times at most, since at most 4 players can play the game at a time.
Thanks.
Pav
Use case: a webshop with a lot of products. These products are assigned to various pages, and per product a user gets to see certain specs. The specs are fetched with a "getSpec" function. This is expensive: a query every time.
If we put these in memcached, it's much quicker. Every time someone changes something about the product, you just update memcached.
So even if your data changes, it can still be worth it! Not everything changes at once.
edit: In your case, you could make your writes also update memcached: no stale cache. But that's just a random thought; I don't know whether making your writes heavier like that has any disadvantages. This would essentially mean you're running everything from memcached and just using your DB as a sort of backup :)
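A sketch of that write-through idea, with dicts standing in for MySQL and memcached: every write hits both stores in the same step, so reads never see a stale spec.

```python
db = {}     # stand-in for the MySQL products table
cache = {}  # stand-in for memcached

def set_spec(product_id, spec):
    # Write-through: update the database and the cache together,
    # so the cache never goes stale.
    db[product_id] = spec
    cache[product_id] = spec

def get_spec(product_id):
    if product_id in cache:
        return cache[product_id]  # fast path: no query at all
    spec = db[product_id]         # the expensive query in the real app
    cache[product_id] = spec
    return spec

set_spec(1, {"color": "red"})
spec = get_spec(1)              # served from cache
set_spec(1, {"color": "blue"})  # the write refreshes the cache too
updated = get_spec(1)
```

The tradeoff is exactly the one the edit raises: writes get heavier so that reads can skip the database entirely.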
Caching is a tradeoff between speed and (potentially) stale data. You have to determine if the speed gain is appropriate given your own use cases.
We cache everything that doesn't require real-time data. Some things that are typically cached: reports, user content, entire pages (though you may consider caching these to disk via some other system), etc.
Our API allows clients to query for huge amounts of data. We use memcached to store that for quick paging on the client's end.
If you plan ahead, you can setup your application to cache most everything and just invalidate parts of the cache as needed (for instance, when some data in your db is updated).
It's going to depend on how often "often" is and how busy your app is. For example, if you have a piece of data that changes hourly, but that data is queried 500 times per hour, it would probably make sense to cache it even though it changes relatively frequently.