Meaning every page in my website queries something from the database.
The query in itself is very small and the time it takes to load is unnoticeable, but I'm wondering if it's okay to do this for every page since I don't really know much about how querying from the database works and whether doing it multiple times, and in my case for every page load, affects anything significantly.
As with all things, the answer is it depends. :-)
Most web sites you visit query something from a database on every page load. If the queries are crafted well, they look up just the data they need and avoid scanning through a big database. You might like my presentation How to Design Indexes, Really (video) to help with this.
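To make "look up just the data they need" concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for your real database. The users table, its columns, and the index name are made up for illustration. It asks the database for its query plan before and after adding an index on the column being searched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, bio TEXT)")
conn.executemany(
    "INSERT INTO users (username, bio) VALUES (?, ?)",
    [(f"user{i}", "x" * 20) for i in range(10_000)],
)

def plan(sql, params=()):
    """Return SQLite's description of how it would execute the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column is the detail text

query = "SELECT id, bio FROM users WHERE username = ?"

plan_before = plan(query, ("user4242",))  # no index yet: every row is scanned
conn.execute("CREATE INDEX idx_users_username ON users (username)")
plan_after = plan(query, ("user4242",))   # now an index lookup on username

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should mention a scan of the table and the second should mention the index. MySQL offers the same kind of check through EXPLAIN.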
Another strategy is to use a fast cache in RAM for data that is needed frequently. RAM is thousands of times faster than disk drives. You might like to familiarize yourself with the Numbers Everyone Should Know. Those numbers are just examples, but the intention is to get programmers to think about the fact that moving data around has a different cost depending on whether you use RAM, disk, network, or CPU.
P.S.: Please don't buy into the myth that you're not good at computers because you're a woman. Everyone starts out as a novice, no matter what their gender or background. Only through practice and study do any of us learn this stuff. I recommend seeing Hidden Figures, the story of the women who did pioneering math and programming for NASA.
Another notable woman is Margaret Hamilton, who practically invented the profession of "software engineering."
Yes, it is OK to query the database on every page load.
Think about websites like Facebook. When you visit the site it needs to know who you are - it gets that from a database. It needs to know all of the status updates that it's going to show you - it gets that from a database. When you hit the bottom of the news feed and it gets more for you to read - it gets that from a database.
That's normal. Most web applications have to query the database for each page load (usually several times), since most of the page content comes from the database.
If you're concerned about performance, think about this: is the query different for each page? Or is it loading the same data over and over? If it keeps querying the same thing (like the current user's name), you can improve performance by storing the data in the application's session state. But if it's different (like how many unread messages the user has), you'll need to run the query each time.
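Here is a small sketch of that distinction, assuming a plain dict stands in for the session state and sqlite3 stands in for the real database. The table and column names are invented for the example. The user's name is fetched once and kept in the session, while the unread count is queried on every page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, is_read INTEGER);
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO messages (user_id, is_read) VALUES (1, 0), (1, 0), (1, 1);
""")

def page_context(session, user_id):
    """Build per-page data; `session` is a dict standing in for session state."""
    if "user_name" not in session:  # stable data: query once per session
        row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        session["user_name"] = row[0]
    unread = conn.execute(          # volatile data: query on every page load
        "SELECT COUNT(*) FROM messages WHERE user_id = ? AND is_read = 0", (user_id,)
    ).fetchone()[0]
    return {"user_name": session["user_name"], "unread": unread}

session = {}
first = page_context(session, 1)
conn.execute("INSERT INTO messages (user_id, is_read) VALUES (1, 0)")
second = page_context(session, 1)
print(first, second)
```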
Imagine visiting a website which has features like "who's online" or a messaging system. Whenever you click on another page, the site needs to update the database so that it can keep track of where you are on the site. If you receive a private message, it would be accessible on the next page click, since the database would have been updated when the message was sent. The trick is to run queries only for tasks which are required at that time. For instance, if you were looking for a username and the query had to scan every row of the table, it would run a lot slower. If you searched by a particular (indexed) column it will be faster, and it will be even faster if you restrict the number of rows returned with something like LIMIT in the query.
I am sure there are lots of tutorials on this kind of topic, but I can't find what I want because I don't know the jargon for it. So I ask StackOverflow.
Here is the example:
People can Like or Dislike videos on YouTube, and the database should update the counts for Like or Dislike. However, it's impractical, especially for sites like YouTube, to update the database every time a user clicks on the Like / Dislike button.
How can we cache the query / count numbers at a time interval, and when the time expired we send all the queries / update the database at one time? Or any similar technique for this kind of situation?
So what you're observing is the time delay between something happening and being able to view the results of what happened.
And you're on the right path to only update periodically.
But you're on the wrong path as far as where to do the periodic updates.
Thing is, you WANT to update the "database" every time, ASAP (namely the database(s) responsible for writing - choose your missing corner of the CAP triangle), to capture everything pretty quickly, but for your visitors/viewers, you give them a slightly-behind (a few seconds to maybe a day, depending on the situation) view of the write database(s).
You do NOT want to store this on the browser and potentially lose what the user did should the request fail, the internet go down, etc.
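A minimal sketch of that split, with sqlite3 standing in for the write database and an in-process dict standing in for whatever serves the read view. The votes table and the StaleCount class are invented for illustration. Every vote is written immediately, but readers get a count that may be up to max_age seconds old.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (video_id INTEGER, user_id INTEGER, value INTEGER)")

def record_vote(video_id, user_id, value):
    """Writes go to the database right away so nothing is lost."""
    conn.execute("INSERT INTO votes VALUES (?, ?, ?)", (video_id, user_id, value))
    conn.commit()

class StaleCount:
    """Serve a like count that may lag the database by up to max_age seconds."""

    def __init__(self, max_age, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._cache = {}  # video_id -> (count, fetched_at)

    def likes(self, video_id):
        now = self.clock()
        hit = self._cache.get(video_id)
        if hit is not None and now - hit[1] < self.max_age:
            return hit[0]  # fresh enough: skip the database
        count = conn.execute(
            "SELECT COUNT(*) FROM votes WHERE video_id = ? AND value = 1", (video_id,)
        ).fetchone()[0]
        self._cache[video_id] = (count, now)
        return count

fake_now = [0.0]  # controllable clock so the example does not need to sleep
view = StaleCount(max_age=5, clock=lambda: fake_now[0])

record_vote(7, 1, 1)
a = view.likes(7)   # fetched from the database
record_vote(7, 2, 1)
b = view.likes(7)   # served from the slightly-behind view
fake_now[0] = 6.0
c = view.likes(7)   # view has expired, so it is refreshed
print(a, b, c)      # 1 1 2
```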
Slightly off topic - you typically do not try to "prematurely optimize" without data showing how much you're going to save by caching, buffering, etc. Optimizations like that add complexity - and you will stay sane, longer, if you keep things simple for as long as possible. Keep your design simple and optimize your bottlenecks once you know what they are.
Slightly more off topic - I'd recommend reading on distributed computing, specifically as it pertains to databases and then some design. You'll realize these highly focused abstract problems all have "solutions" with various advantages and disadvantages.
I have a query that takes about a minute to complete since it deals with a lot of data, but I also want to put the results on a website. The obvious conclusion is to cache it (right?) but the data changes as time goes by and I need a way to automatically remake the cached page maybe every 24 hours.
Can someone point me to how to do this?
edit: I want to make a "top 10" type of thing so it's not displaying the page that is the problem but the amount of time it takes for the query to run.
Caching the results of the query with a 24hr TTL (expiry) would probably work fine. Use a fragment cache assuming this is a chunk of the page.
You can set up memcached or Redis, as stated, to store the cache. Another thing you can do is set up a job that warms the cache every 24 hrs (or as desired) so that an unlucky user doesn't have to generate the cache for you.
If you know when the cache is expired based on a state or change in your database, you can expire the cache based on that. A lot of times I use the created_at or updated_at fields as part of the cache key to assist in this process.
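Here is a rough sketch of the TTL-plus-warming idea. The TTLCache class is a tiny in-process stand-in for memcached or Redis (it is not their API), and slow_top10 is a hypothetical placeholder for the minute-long query.

```python
import time

class TTLCache:
    """Tiny in-process stand-in for a cache store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._data[key] = (value, self.clock() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or self.clock() >= entry[1]:
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]

def slow_top10():
    """Hypothetical placeholder for the expensive query."""
    return [f"item{i}" for i in range(1, 11)]

DAY = 24 * 60 * 60
cache = TTLCache()

def warm_top10():
    """Call this from a scheduled job so visitors never pay for the slow query."""
    cache.set("top10", slow_top10(), ttl=DAY)

def top10():
    result = cache.get("top10")
    if result is None:  # cold or expired cache: compute it now as a fallback
        warm_top10()
        result = cache.get("top10")
    return result

print(top10())
```

If the warming job runs a little more often than the TTL, the fallback branch should rarely be taken by a real visitor.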
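A sketch of the "timestamp in the cache key" trick, again with sqlite3 and a dict as stand-ins. The scores table and the key format are assumptions made for the example. When any row changes, the key changes, so the old entry is simply never read again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (id INTEGER PRIMARY KEY, name TEXT, points INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO scores (name, points, updated_at) VALUES (?, ?, ?)",
    [("a", 10, "2024-01-01 00:00:00"), ("b", 20, "2024-01-02 00:00:00")],
)

cache = {}        # stand-in for memcached/Redis
computations = 0  # how many times the expensive query actually ran

def cache_key():
    """Cheap query whose result changes whenever the data changes."""
    latest, count = conn.execute("SELECT MAX(updated_at), COUNT(*) FROM scores").fetchone()
    return f"top10/{count}/{latest}"

def top10():
    global computations
    key = cache_key()
    if key not in cache:
        computations += 1
        cache[key] = conn.execute(
            "SELECT name FROM scores ORDER BY points DESC LIMIT 10"
        ).fetchall()
    return cache[key]

top10()
top10()  # same key, served from the cache
conn.execute("UPDATE scores SET points = 99, updated_at = '2024-01-03 00:00:00' WHERE name = 'a'")
top10()  # key changed, so the query runs again
print(computations)  # 2
```

Stale keys are left behind here. In a real cache store they would normally be evicted by a TTL or by memory pressure.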
There is some good stuff in the Scaling Rails screencasts by Envy Labs and New Relic: http://railslab.newrelic.com/scaling-rails. They are a little out of date, but the principles are still the same.
Also, check out the Rails caching guide: http://guides.rubyonrails.org/caching_with_rails.html
Finally, make sure indexes are set up properly. See thoughtbot's post here: http://robots.thoughtbot.com/post/163627511/a-grand-piano-for-your-violin
Typed on my phone so apologies for typos.
Think a little beyond the query. If your goal is to allow the user to view a lot of data, then grab that data as they want it rather than fighting with a monstrous query that's going to overwhelm your UI. The result not only looks better, but is much, much quicker.
My personal trick for this pattern is DataTables. It's a grid that allows you to use Ajax queries (support for which is built in) to get data from your query a "chunk" at a time, as the user wants to see it. It can sort, page, filter, limit, and even search with some simple additions to the code. It even has a plug-in to export results to Excel, PDF, etc.
The biggest thing that DataTables has that others don't is a concept called "pipelining", which allows you to get an amount to show (say 20) plus an additional amount forward and/or backwards. This allows you to still do manageable queries, but not have to hit the database each time the user hits "next page".
I've got an app dealing with millions of records. One query of all data would be impossible... it would just take too long. Grabbing 25 at a time, however, is lightning fast, no tricks required. Once the datatable was up, I just performance tuned my query, did some indexing where needed, and voila... great, responsive app.
Here's a simple example:
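<table id="example"></table>
<script>
// Requires jQuery and the DataTables plug-in to be loaded before this script.
$('#example').dataTable({
    "bProcessing": true,                   // show a "processing" indicator while loading
    "bServerSide": true,                   // let the server do paging, sorting, and filtering
    "sAjaxSource": "/processing/file.php"  // server script that returns each chunk of rows
});
</script>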
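On the server side, the script behind that URL only has to return one chunk per request. Below is a minimal sketch of that part in Python with sqlite3. The records table and the start/length parameter names are hypothetical and are not the exact request format DataTables sends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO records (label) VALUES (?)", [(f"row {i}",) for i in range(1, 1001)]
)

def fetch_page(start, length):
    """Return one chunk of rows plus the total count the grid needs for its pager."""
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    rows = conn.execute(
        "SELECT id, label FROM records ORDER BY id LIMIT ? OFFSET ?", (length, start)
    ).fetchall()
    return {"total": total, "rows": rows}

page = fetch_page(start=50, length=25)
print(page["total"], len(page["rows"]), page["rows"][0])
```

On very large tables a big OFFSET can itself get slow, in which case paging on an indexed column (for example WHERE id > last_seen_id) is a common alternative.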
Use a cache store that allows auto-expiration after a certain length of time.
Memcached does this, and so does Redis.
I need professional programmers/DBAs to bounce my idea off of and to know if it would/could even work. Please read below and give me any information that may break this theory. Thanks.
Overview of Website Idea:
The website will be used by sports card collectors to chat, answer questions on forums, showcase their cards/box breaks, trade/sell to/with other users, and keep a collection of their cards.
Design Issue:
A user can have an unlimited number of cards. This could make for some very large tables.
Design Question:
I do not want to limit the users on how many cards they can have in their collection on the site. If they have 5 copies of one card, and would rather have 5 records, one for each card, then that is their prerogative. This may also be necessary as each of the cards may be in a different condition. However, by allowing this to happen, this means that having only one table to store all records for all users is not even close to an option. I know sports card collectors with over 1,000,000 cards.
I was thinking that by either creating a table or a database for each user, it would allow for faster queries. All databases would be on the same server (I don't know who my host will be yet, only in design phase currently). There would be a main database with data that everyone would need (the base item while the user table/database would have a reference to the base item). I do see that it is possible for a field to be a foreign key from another database, so I know my idea in that aspect is possible, but overall I'm not sure what the best idea is.
I see most hosts say "unlimited number of databases", which is what got me thinking about a database for each user. I could use this for that user's posts on threads, their collection items, their preferences, and other information. Also, by having each user have a different table/database, if someone's table needed to be reindexed for whatever reason, it wouldn't affect the other users.
However, my biggest concern in either fashion would be additions/deletions to the structure of the tables/databases. I'm pretty sure a script could be written to make the necessary changes, but it seems like a pretty high risk. For instance, I'm pretty sure that I could write a script to add a field to a specific table in each database, or all of the like tables, but then to verify them it could prove difficult.
Any ideas you can throw out there for me would be greatly appreciated. I've been trying to work on this site for over a year now and keep getting stuck on the database design because of my worry about tables that are too large, slow response time, and, if the number of users grows, breaking some constraints set by phpMyAdmin/MySQL. I also don't want to get halfway through building the database and then think that there's a better way to do it. I know there may be multiple ways to do it, but what is the most common practice for it? Thank you all very much.
I was thinking that by either creating a table or a database for each user, it would allow for faster queries.
That's false. A single database will be faster.
1,000,000 cards per user isn't really a very large number unless you have 1,000,000 users.
Multiple databases are an administration nightmare. A single database is always preferred.
my worry of too large of tables, slow response time, and if the number of users grow, breaking some constraints set by phpmyadmin/MySQL
You'll be hard-pressed to exceed MySQL limits.
Slow response depends on your application and the details of your SQL queries more than anything else.
Finally. And Most Important.
All technology goes out of date. Eventually, you must replace something. In order to get to the point where you're forced to upgrade, you must first get something running.
Don't worry about "large database" until you have numbers of rows in the billions.
Don't worry about "long-term" solutions because all software technology expires. Quickly.
Regarding number of users.
Much of web interaction is time spent interacting with the browser through JavaScript. Or reading a page. Clicks are actually sort of rare. MySQL on a reasonably large server should handle 30 or more nearly concurrent queries with sub-second response. Your application will probably take very little time to format and start sending an HTML page. Things can rip along at a very, very good clip on a typical server.
If your database design avoids the dreaded full-table scan.
You must have proper indexes for the most common queries.
Now. What are the odds of 30 nearly concurrent requests? If a user only clicks once every 10 seconds (they have to read the page, fill in the form, re-read the page, think, drink their beer), then seeing 30 clicks in a single second means you have to have 300 concurrent users. Considering that people have other things to do in their lives, that means you must have 50,000 or so users (figuring they're spending 1 hour each week on your site).
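The back-of-the-envelope arithmetic above can be written out as follows. The variable names are mine, and the calculation assumes visits are spread evenly across the week, which real traffic is not.

```python
peak_requests_per_second = 30   # what the server is assumed to handle
seconds_between_clicks = 10     # one click per user every 10 seconds
hours_on_site_per_week = 1      # time each user spends on the site
hours_per_week = 7 * 24

# Each concurrent user produces 1/10 of a request per second.
concurrent_users = peak_requests_per_second * seconds_between_clicks

# Each user is only online 1 hour out of 168, so the total user base is larger.
total_users = concurrent_users * hours_per_week // hours_on_site_per_week

print(concurrent_users)  # 300
print(total_users)       # 50400
```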
I wouldn't go down the path of creating a database for every user... that will create countless headaches for you: data integrity issues, referential integrity issues, administrative issues...
As long as your table is well normalized and indexed, I don't think a table with hundreds of millions of rows is prohibitively large.
Instead, I would just start with a simple table design. If your site is wildly successful, it wouldn't be any extra effort to implement partitioning or sharding in MySQL down the road as opposed to scaling out right off the bat.
If I were in your shoes I would start with one database and one table and not worry too much about the possible size of the table. If you ever get so successful that you reach the size you imagine, you would probably have a lot more resources and knowledge of your domain to make a better informed decision. Once that happens, you can also consider NoSQL solutions such as HBase, MongoDB, and others that allow for horizontal scaling (unlimited size), with some limitations that businesses that deal with big data are bound to face. You can also use MySQL partitions or other sharding solutions. So, go build your product with one table and don't sweat this problem until you absolutely need to. Good luck!
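For illustration, here is roughly what "one simple, indexed table" could look like, using sqlite3 in place of MySQL. The table names, column names, and sample data are made up. All users share one collection table, and an index that starts with user_id lets the database find one user's cards without scanning everyone else's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_items (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE collection_items (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        base_item_id INTEGER NOT NULL REFERENCES base_items(id),
        condition TEXT
    );
    CREATE INDEX idx_collection_user ON collection_items (user_id, base_item_id);
""")
conn.execute("INSERT INTO base_items VALUES (1, 'example card')")
# 200 users, each owning 5 copies of the same card in various conditions.
conn.executemany(
    "INSERT INTO collection_items (user_id, base_item_id, condition) VALUES (?, 1, ?)",
    [(u, c) for u in range(1, 201) for c in ("mint", "good", "fair", "poor", "mint")],
)

def cards_for(user_id):
    """Fetch one user's collection from the shared table."""
    return conn.execute(
        "SELECT b.description, c.condition FROM collection_items c "
        "JOIN base_items b ON b.id = c.base_item_id WHERE c.user_id = ?",
        (user_id,),
    ).fetchall()

plan = " | ".join(
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT condition FROM collection_items WHERE user_id = ?", (42,)
    )
)
print(len(cards_for(42)), plan)
```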
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
I can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following / for each person I follow, select his photos).
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people who follow me, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet a third approach that comes to mind is an even less denormalized schema, where the server side application will take some part of the complexity off the DB. I saw that some social apps, such as FriendFeed, rely heavily on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
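Here is a rough sketch of the denormalized feed with pruning, using sqlite3 as a stand-in. The follows and feed tables, their columns, and the integer timestamps are assumptions made for the example. Each upload writes one feed row per follower, reading a feed is a single indexed lookup, and old rows are deleted by age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE feed (owner_id INTEGER, photo_id INTEGER,
                       uploader_id INTEGER, created_at INTEGER);
    CREATE INDEX idx_feed_owner ON feed (owner_id, created_at);
    INSERT INTO follows VALUES (2, 1), (3, 1), (3, 2);
""")

def upload_photo(photo_id, uploader_id, now):
    """Write one feed row for each follower of the uploader."""
    conn.execute(
        "INSERT INTO feed (owner_id, photo_id, uploader_id, created_at) "
        "SELECT follower_id, ?, ?, ? FROM follows WHERE followee_id = ?",
        (photo_id, uploader_id, now, uploader_id),
    )

def read_feed(owner_id, limit=20):
    """Reading is one cheap query against the owner's own rows."""
    return [
        row[0]
        for row in conn.execute(
            "SELECT photo_id FROM feed WHERE owner_id = ? "
            "ORDER BY created_at DESC LIMIT ?",
            (owner_id, limit),
        )
    ]

def prune(older_than):
    """Delete feed rows older than the cutoff to keep the table small."""
    conn.execute("DELETE FROM feed WHERE created_at < ?", (older_than,))

upload_photo(100, 1, now=10)  # user 1 is followed by users 2 and 3
upload_photo(101, 2, now=20)  # user 2 is followed by user 3
print(read_feed(3))           # [101, 100]
prune(older_than=15)
print(read_feed(3))           # [101]
```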
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your time worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take
Add more hardware, Memory, CPU -- Enter cloud hosting
How does 24 GB of memory sound? Most of your frequently accessed DB information can fit in memory.
Choose a host with expandable SSDs.
Use an events based system in your application to write the "history" of all users. So it will be like so: id, user_id, event_name, date, event_parameters -- an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>. Most important of all, this table will be in memory, so you no longer need to worry about write performance. After the records go past, e.g., 3 days, they can be purged into another table (not in memory) and included in the query results if the user chooses to go back that far. By having all this in one table, you avoid having to do multiple queries and SELECTs to build up this information.
Consider using InnoDB for the history/feeds table.
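The events table from the list above might be sketched like this. Both tables live in an in-memory sqlite3 database here purely for the demo, whereas the suggestion above is a memory-backed table for recent events and an ordinary table for the archive. The function names and the integer dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, event_name TEXT,
                         date INTEGER, event_parameters TEXT);
    CREATE TABLE events_archive (id INTEGER PRIMARY KEY, user_id INTEGER, event_name TEXT,
                                 date INTEGER, event_parameters TEXT);
""")

def log_event(user_id, event_name, date, event_parameters):
    conn.execute(
        "INSERT INTO events (user_id, event_name, date, event_parameters) VALUES (?, ?, ?, ?)",
        (user_id, event_name, date, event_parameters),
    )

def archive_older_than(cutoff):
    """Move old events to the archive table in one transaction."""
    with conn:
        conn.execute("INSERT INTO events_archive SELECT * FROM events WHERE date < ?", (cutoff,))
        conn.execute("DELETE FROM events WHERE date < ?", (cutoff,))

def history(user_id, include_archive=False):
    """Recent events by default; add archived ones only when the user asks."""
    sql = "SELECT event_name, date FROM events WHERE user_id = ?"
    params = [user_id]
    if include_archive:
        sql += " UNION ALL SELECT event_name, date FROM events_archive WHERE user_id = ?"
        params.append(user_id)
    sql += " ORDER BY date DESC"
    return conn.execute(sql, params).fetchall()

log_event(8, "CHANGED_PROFILE_PICTURE", 1, "picture_id=5")
log_event(8, "UPLOADED_PHOTO", 100, "photo_id=9")
archive_older_than(50)
print(history(8))
print(history(8, include_archive=True))
```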
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they're coming back, as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you toward transactional reads.
These kinds of problems are why NoSQL solutions are used these days. What I did in my previous projects is really simple. I don't keep user->wall and user->history in the database; they hold only feed IDs and live in an in-memory store (my favorite is Redis). So for every insert I do 1 insert operation on the database and (n * read optimization) insert operations in the memory store. I design the memory store to optimize my reads. If I want to filter user history (or wall) for videos, I push the feed ID to a list like user::{userid}::wall::videos.
Well, of course you can build the system purely in memory stores as well, but it's nice to have 2 systems each doing what they do best.
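A minimal sketch of that layout, with a Python list standing in for the database and a dict of deques standing in for Redis lists. The followers mapping and the function names are invented, and only the key format comes from the answer above.

```python
from collections import defaultdict, deque

database = []                                      # stand-in for the durable feed table
memstore = defaultdict(lambda: deque(maxlen=100))  # stand-in for capped Redis lists
followers = {1: [2, 3]}                            # hypothetical: user 1 is followed by 2 and 3

def publish(feed_id, author_id, kind):
    """One durable insert, then several cheap pushes shaped for fast reads."""
    database.append((feed_id, author_id, kind))
    memstore[f"user::{author_id}::history"].appendleft(feed_id)
    for uid in followers.get(author_id, []):
        memstore[f"user::{uid}::wall"].appendleft(feed_id)
        memstore[f"user::{uid}::wall::{kind}"].appendleft(feed_id)

def wall(user_id, kind=None, count=20):
    """Read a wall (optionally filtered by kind) without touching the database."""
    key = f"user::{user_id}::wall" + (f"::{kind}" if kind else "")
    return list(memstore.get(key, ()))[:count]

publish(10, 1, "photos")
publish(11, 1, "videos")
print(wall(2))            # [11, 10]
print(wall(2, "videos"))  # [11]
```

The lists hold only IDs, so the full items still have to be fetched from the database (or another cache) when the page is rendered.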
edit :
Check out these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them; however, no one ever mentions the drawbacks of such a choice.
The most obvious one for me is the lack of transactions - imagine if you lost a few records every now and then (there are reports that this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in a similar manner - by sharding data across the network (naturally, there are more choices, but this is the most obvious one). Since NoSQL does less work (no SQL layer, so CPU cycles aren't spent on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.
I am building a very simple classified site.
There is a form that puts data in mysql table.
Now how should this data be displayed? Is it better to build HTML pages from the data in a table and then display them to the users, OR is it better to fetch the data from the MySQL table each time a user wants to see the data?
I hope I was clear!
Performance-wise, it's generally better to keep the static versions of the HTML pages.
However, you may have too much dynamic content, which can bloat your disk space, and you would have to put extra effort into tracking cache expiration (which can be even more expensive than generating the content dynamically).
It's a matter of tradeoff, and to give any further advice we would need to know the nature of your data.
If it's a blog with content updated rarely but read often, it's better to cache.
If it's a product search engine with mostly unique queries and changing stock, it's better to always query the database.
Note that MySQL versions before 8.0 implement a query cache: it can cache the result sets of queries, and if a query is repeated verbatim and no underlying tables have changed since the last time it ran, it's served out of the cache.
It tracks cache expiration automatically, saves you the need to keep the files on disk, and generally combines the benefits of both methods.
You can use PHP caching techniques if the data does not change frequently. Keep loading the cached contents for frequent visits.
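The same "serve from cache until the underlying table changes" idea can also be sketched in application code. This is not how MySQL implements it, only an illustration with made-up names. A version counter is bumped on every write, and the rendered page is rebuilt only when the version it was built from is out of date.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, title TEXT)")

page_cache = {}    # table_version -> rendered HTML
table_version = 0  # bumped on every write, which invalidates older pages
renders = 0        # how many times the page was actually rebuilt

def add_ad(title):
    global table_version
    conn.execute("INSERT INTO ads (title) VALUES (?)", (title,))
    table_version += 1

def listing_page():
    global renders
    if table_version not in page_cache:
        page_cache.clear()  # drop pages built from older data
        renders += 1
        items = "".join(
            f"<li>{html.escape(title)}</li>"
            for (title,) in conn.execute("SELECT title FROM ads ORDER BY id")
        )
        page_cache[table_version] = f"<ul>{items}</ul>"
    return page_cache[table_version]

add_ad("Bike")
listing_page()
listing_page()  # unchanged table: served from the cache
add_ad("Sofa & chair")
print(listing_page())
print(renders)  # 2
```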
http://www.developertutorials.com/tutorials/php/php-caching-1370/
Use both, via a caching mechanism. Based on parameters, the page would be re-rendered (has not been viewed in X time or at all) or displayed from cache otherwise.
As stated though, it depends heavily on the amount of and frequency with which the data is accessed. More information would warrant a more detailed response.
It depends on a few things. Ask yourself two easy questions:
1) How often does the content change? Are your classified ads static, or are they changing a lot on the page? How much control do you want that page to have over rotating ads, comments from users, reviews, etc.?
2) Are you going to be VERY high traffic? So much so that you are looking at a bottleneck at the database?
If you can say "Yes, no doubts, tomorrow" to question #2, go static, even if it means adding other things in via Ajax or non-database calls (i.e. includes) in order to make the page pseudo-dynamic.
Otherwise if you say "Yes" to question #1, go with a dynamic page so you have the freshest content at all times. These days users have gotten very used to instant gratification on posts and such. Gone are the days we would wait for hours for a comment to appear in a thread (I am looking at you Slashdot).
Hope this helps!
Start with the simplest possible solution that satisfies the requirements, and go from there.
If you implement a cache but didn't need one, you have wasted time (and/or money). You could have implemented features instead. Also, now you (might) have to deal with the cache every time you add features.
If you don't implement a cache and realize you need one, you are now in a very good position to implement a smart one, because now you know exactly what needs to be cached.