Caching query results per user - MySQL

I have a system (developed by someone else) where all registered users can query data (similar to data.stackexchange.com). The system is getting big, more users are querying it, and during high-traffic periods the database is slow. I am also worried about security now.
What can I do to make the system more secure?
What can I do to make the queries faster to execute?
I have only a very basic knowledge of MySQL and databases, and I want to learn. Can you point me to where I need to look and what I can do? (I would like to build it myself, so please no code.)

Well, you have two large jobs to do :)
How to make the system more secure? Well, use SSL where you need to. If the data is not sensitive you can get away without it. That said, if you want to lock your logins down properly, insist on HTTPS. Beyond that, ensure that you never compare passwords directly; instead compare hashes of the passwords (with a salt included). Additionally, if your website allows people to be remembered, use a token-based approach: assign the client a unique cookie ID that is valid for a limited period. It's not fool-proof, but it is better than nothing, and paired with your SSL login requirements it will be pretty good.
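You said you'd rather build it yourself, so treat the following only as a minimal sketch of the salted-hash idea, written in Python with the standard library purely for illustration (the function names are made up):

    import hashlib
    import hmac
    import os

    def hash_password(password):
        """Return (salt, digest) for storage; never store the plain password."""
        salt = os.urandom(16)  # a unique random salt per user
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return salt, digest

    def verify_password(password, salt, stored_digest):
        """Re-hash the candidate with the stored salt and compare in constant time."""
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return hmac.compare_digest(candidate, stored_digest)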
Have a look at cache managers. But before you do, look at what is taking the most time. Which pages are hitting your database the hardest? Once you ascertain that, you can come up with a caching strategy, which is, unfortunately, completely site-dependent: what works for one site would be inadmissible for yours. You can use some kind of memcache to store the common stuff so that the basic "front page" and "portal" queries are cached efficiently. The rest will have to be dealt with in the regular way.
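To make the memcache idea concrete, here is a minimal sketch using the pymemcache client; the key name, TTL and query are assumptions for illustration, not part of your system:

    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    FRONT_PAGE_KEY = "front_page_posts"  # hypothetical cache key
    TTL_SECONDS = 60                     # serve slightly stale data to spare the DB

    def get_front_page(cursor):
        cached = cache.get(FRONT_PAGE_KEY)
        if cached is not None:
            return json.loads(cached)    # cache hit: no database work at all
        cursor.execute("SELECT id, title FROM posts ORDER BY created_at DESC LIMIT 20")
        rows = cursor.fetchall()
        cache.set(FRONT_PAGE_KEY, json.dumps(rows, default=str), expire=TTL_SECONDS)
        return rows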

Related

Persistent MySQL connection, regardless of users connected?

I want to have ONE single MySQL connection used by EVERY user that selects the data all the time and updates it if specific conditions are met (like a placed bid) - preferably even when no user is visiting the website, if that's even possible.
For the last few days I have been googling all the time, trying hard to figure out how to solve my issue, but it seems there is nobody with enough knowledge to help me with my problem. So I will ask my question as simply as possible without confusing you with my code. (But if you're interested in seeing the code: http://pastebin.com/dRFzWtEH)
This is all about an auction website with a live countdown timer, and I just want to run a Node.js server that SELECTs the data every second and sends it over a WebSocket, so that every user visiting the website sees the countdown and price updates (on bids) in real time.
I accomplished this whole task using single MySQL queries, but then I ran into errors. The author of the node-mysql module on GitHub then suggested I use a MySQL pool. But there is almost nothing to be found about the specific aim stated in the first sentence of this question.
Now I want to ask in general: how could I accomplish this, and is it even possible, or does at least one user have to be on my website?
What would the code/code-structure/logical process look like?
And I guess I don't need to close the connection at all, so I won't need functions like connection.end()?
No, don't worry about connection pooling. It is not a big deal in MySQL.
Furthermore, a "pool" has a problem - it must clear out all settings, @variables, transaction state, etc., before allowing the next 'client' to use the pooled connection. This can take time, especially if the client is far from the server.
MySQL's connection/disconnection time is very low, unlike competing products.
If you are developing a web product, keep in mind that HTTP is "stateless": you cannot hang onto a connection from one 'page' to the next 'page', so no 'state' can be saved in the connection.
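Because connecting is so cheap in MySQL, the plain connect-per-request pattern is usually enough; a minimal sketch with PyMySQL, where the credentials, table and column names are invented:

    import pymysql

    def current_price(item_id):
        # Open a fresh connection for this request; MySQL makes this cheap.
        conn = pymysql.connect(host="localhost", user="app",
                               password="secret", database="auction")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT current_price FROM items WHERE id = %s", (item_id,))
                return cur.fetchone()
        finally:
            conn.close()  # no pool, so no session state to reset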
Edit
If you have "Across the pond" latency problems (100-200ms between US and Europe), client-side connection pool could be very useful. However, if the pool software is injecting commands to reset things, that could totally defeat the pooling.
If you can turn on the 'general log' (in a hosted service, you may have to use log_output=TABLE), do so to see what extra commands are injected.
Also, consider combining multiple client SQL statements into Stored Procedures to cut down on back-and-forth.
Also consider either moving the MySQL server closer to the client, or moving the client closer to the MySQL server, depending on how the end-user-to-client traffic compares to the client-to-MySQL traffic.
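For the stored-procedure suggestion, the point is to replace several round trips with one server-side call; a sketch with PyMySQL, where the place_bid procedure is entirely hypothetical:

    import pymysql

    def place_bid(conn, item_id, user_id, amount):
        """One round trip instead of separate SELECT/UPDATE/INSERT statements."""
        with conn.cursor() as cur:
            # The server-side procedure would validate the bid and update the item.
            cur.callproc("place_bid", (item_id, user_id, amount))
        conn.commit()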

What's the most efficient architecture for this system? (push or pull)

All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySQL.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds a pass/fail, I would like to update that history. My options seem to be:
1. Blindly refresh the history regularly (in Delphi, I display it in a DB grid, so I would close then reopen the query), but this seems inefficient.
2. Ask the database server regularly if anything changed in the last X minutes.
3. Never poll the database server, instead letting it inform the user's app when something changes.
Option 1 seems inefficient. Option 2 seems better. Option 3 reduces TCP traffic, but not by much - it's only a few bytes for each poll in option 2 anyway. However, it has the disadvantage that both sides must now act as both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
I guess there are two things to consider: most importantly, reducing network traffic and, less importantly, keeping my code simple.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled and can't find an answer. I am hoping for the benefit of your experience. Can you help me reword this to be acceptable? Or maybe give a URL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for a middle layer which propagates changes using message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use both peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenario, clients can also use the middleware message broker to send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What is the best way to do this?
If this is a real requirement (reports are usually an immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while being watched, similar to a stock ticker), it is easy to implement: a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexibly and with little resource usage using existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
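As a rough illustration of that 'subscribe to an information channel' idea, here is a sketch using Redis pub/sub as a stand-in for a full message broker; the channel name and callbacks are invented:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Middle layer: announce a change right after the database write succeeds.
    def announce_change(record_id):
        r.publish("report.updates", str(record_id))  # hypothetical channel name

    # Report viewer: refresh only when something actually changed.
    def watch_for_changes(refresh_report):
        sub = r.pubsub()
        sub.subscribe("report.updates")
        for message in sub.listen():
            if message["type"] == "message":
                refresh_report(message["data"])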
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions is viable in certain situations.
I've been writing software for a long time, and the comments below relate to personal experience dating back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to set out the positives and negatives of each approach, and the context around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient
It is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to document a tech debt-inducing fix, and/or share it with team members who could be affected by the change, when it has been checked in.
If it is not intended as production-quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (The client provides the time, to avoid the service having to store and retrieve session state.)
If nothing has changed, the server can provide a time when the client should poll again. A system under excessive load is then able to back clients off: if the server application is aware of such conditions, it can control the polling rate of compliant clients by instructing them to wait for a longer period before retrying.
After considering that, ask "Is the API as simple as it can possibly be?"
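A sketch of that polling API from the client's side, where the server returns both the changes and a suggested delay before the next poll; the endpoint and field names are assumptions:

    import time
    import requests

    def poll_loop(apply_change, base_url="https://example.com/api"):
        last_seen = 0  # the client, not the server, remembers the 'since' marker
        while True:
            resp = requests.get(f"{base_url}/changes", params={"since": last_seen})
            body = resp.json()
            for change in body.get("changes", []):
                apply_change(change)
            last_seen = body.get("as_of", last_seen)
            # A loaded server can slow compliant clients down via this hint.
            time.sleep(body.get("retry_after_seconds", 30))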
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about- publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push - timeout scenarios, number of clients, system resource consumption, and so on.
Consider that the "pusher" has to become aware of all consuming applications. If you use an industry-standard message queueing system (RabbitMQ, MS MQ, MQ Series, etc., all of which naturally support publish/subscribe topics or an equivalent), this problem is abstracted away, but it also adds some complexity to your application.
Consider the scenarios where clients suddenly become unavailable; hypothesize the failure modes and test the robustness of your system so you have confidence that it can recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?

Scaling up a Ruby, ActiveRecord, MySQL app

I have an app...
The app does a market comparison for a financial product - for a given quote request, it contacts several other sites for their quotes. It then gives the user the results - several quotes for their details.
To manage these requests they get saved to MySQL, and then my app kicks in, picking up the pending quotes and farming them out to threads (all on the same Linux box) to process each site lookup.
I am using JRuby as I had thread/DB related issues, with Java thread pools to control the number of threads. With the current hardware/VPS it can handle around 200 threads. A lot of the limitations seem to relate to each thread grabbing its own MySQL connection - fetching the quote details and saving back the results. We want to handle more concurrent threads and so are looking for ways to scale up.
Wondering which way to go ...
1. Bigger hardware.
2. More machines, with some kind of queueing mechanism (with priorities) to share the load across the machines - the threads don't touch the DB, all the details/responses go via the queue, so the DB hit is less, but then maybe I am just pushing the problem into the queue. I'm thinking of using something like MongoDB for the queue, but I'm open to suggestions - something easy to use with Ruby :)
3. Some kind of remote/RPC mechanism, e.g. DRb - theoretically this seems like a good option, but I haven't done anything with it yet to know how complex it would make things.
4. Something else...?
From this link - Reasons for NOT scaling-up vs. -out? - it would seem this problem is suited to running more machines to solve it.
So, any thoughts on which way to go...
Cheers,
Chris
My usual approach to problems like this is to pay very close attention to the database queries you're making and tune them aggressively. Retrieve only what you need, skipping columns that aren't explicitly used, and be very careful about eager loading things you don't need in their entirety.
You'll often find you can get significant speed gains by adding indexes, or strategically de-normalizing certain attributes in your database to avoid ugly, time-consuming JOIN operations.
Further, think about caching: the fastest database call is the one that's never made. It's not hard to use something like Memcached to save the results of a moderately time-consuming record retrieval, and if done carefully it's even easy to invalidate and expire it, provided you channel your updates through a few methods.
For scheduling workers, a simple first-in, first-out queue can be implemented in Redis to off-load a lot of the processing overhead from MySQL itself. This is usually very simple to add if you follow an example.
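A minimal sketch of such a first-in, first-out queue with the redis-py client; the queue name and job format are made up for illustration:

    import json
    import redis

    r = redis.Redis()
    QUEUE = "quote_requests"  # hypothetical queue name

    def enqueue(request_id):
        r.lpush(QUEUE, json.dumps({"id": request_id}))  # producer: the web app

    def worker_loop(process_quote):
        while True:
            _key, raw = r.brpop(QUEUE)                  # consumer: blocks until a job arrives
            process_quote(json.loads(raw))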
A cache like Memcached can handle an extremely high amount of traffic, so whenever possible, cache against this to avoid hitting your database for every last thing.
If you've exhausted these options, it's time for more front-end servers and even more database capacity, but only then.
Queueing is the easiest thing for you to implement. Use something like this: http://beanstalkd.github.com/beaneater/
Basically, you can prepend your method calls with async., which will put them into the queue and execute them from there. The queue and the workers can be on the same server or a different one.

Filtering at server or at client?

I am thinking about how to build an advertising site which works like Twitter.
That means most users don't visit the site with a browser; they run a dedicated client application on their PC or smartphone. They set some filters describing what kind of adverts they are interested in, and when a new post appears that fulfils their criteria, the client raises a notification.
To make the client as close to real time as possible, it has to poll the server at a short interval.
The problem is, should I do the filtering at the server side when client polls, or should I simply transfer all new posts to client and let client do the filtering?
Doing the filtering on the server side might cost too many CPU cycles, but blindly transferring every post to the client might waste a lot of bandwidth.
Just a brain game. :)
Filtering the data on the server side by applying a simple filter query (SELECT * FROM tweets WHERE category IN (1,2,3,4,5)) won't cost you much in performance - much less than distributing all available data to all clients, anyway.
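If you go that route, build the IN list with placeholders rather than string concatenation; a sketch with PyMySQL, reusing the tweets table from the query above (the column names are assumed):

    import pymysql

    def matching_posts(conn, category_ids):
        """Filter on the server so only matching rows cross the wire."""
        placeholders = ", ".join(["%s"] * len(category_ids))
        sql = f"SELECT id, title, body FROM tweets WHERE category IN ({placeholders})"
        with conn.cursor() as cur:
            cur.execute(sql, category_ids)
            return cur.fetchall()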
If by filtering you mean an SQL query, then doing it on the server will of course be better. Querying an SQL database is very cheap, even if you make thousands of SELECTs.
As others have pointed out, there is no point in sending data which isn't going to be used. People only want to download what they can use. If someone pays for their mobile data allowance and your app shows them 2 ads but downloads the data for 1000, they will stop using your service.
You can filter by certain types at the database side, or you can filter by some more indepth business logic in the service before the final data is sent back to the client.
The main point is; low data transfer, quicker responses, happier user :-)
Oh, and especially if you're also considering deployment onto mobile devices, always filter on the server side. The main issue may be finding appropriate data structures for linking new postings to the filters so that the matching isn't expensive. You can also keep the most-asked-for entries and filters in memcached so you don't always hit the database.
There is absolutely no sense at all in transferring all the stuff to the client and then not showing it.
I think filtering on the server would be the much better way; it will reduce the amount of data transferred (especially for smartphone users this will be a huge gain).

post/redirect/get

In producing a web-based data entry system, is the fact that you are adding an extra server request per page a significant concern when deciding whether or not to use a post/redirect/get design?
The request alone isn't a problem, especially as the alternative gives a pretty bad user experience.
However, when using a site with load balancing and/or database replication, you need to take care to ensure that the GET after POST will see the data that has been posted.
When using load balancing and caching, this is sometimes solved with "sticky sessions" that direct the same user to the same machine, so data stored in a write-through cache on that machine will be current.
When using database replication, GET requests after POST may need to read directly from the "primary" database, instead of a local "secondary" as usual.
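One simple way to arrange that is to note, per session, that a write just happened and route the next few reads to the primary; a sketch where the connection factories and the 10-second window are assumptions:

    import time

    READ_FROM_PRIMARY_WINDOW = 10  # seconds; tune to your replication lag

    def record_write(session):
        session["last_write_at"] = time.time()

    def connection_for_read(session, primary_factory, replica_factory):
        """Use the primary right after a POST so the redirected GET sees fresh data."""
        if time.time() - session.get("last_write_at", 0) < READ_FROM_PRIMARY_WINDOW:
            return primary_factory()
        return replica_factory()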
If I understand your question (and I'm not entirely sure I do), it is definitely good design to do a redirect after a post, even if you are showing them the same page with the updated info.
By doing the redirect you are breaking the connection between the page being viewed and the POST which caused the change. The user can bookmark and/or refresh the page without any popup asking "Do you want to resend the data?"
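A minimal Post/Redirect/Get sketch using Flask, where the routes and the save_entry and render_entries helpers are invented:

    from flask import Flask, redirect, request, url_for

    app = Flask(__name__)

    @app.route("/entries", methods=["POST"])
    def create_entry():
        save_entry(request.form)  # hypothetical persistence call
        # Redirect so a refresh or bookmark re-issues a harmless GET, not the POST.
        return redirect(url_for("show_entries"), code=303)

    @app.route("/entries", methods=["GET"])
    def show_entries():
        return render_entries()   # hypothetical view rendering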
Most of the time, POSTs only happen when data is changed. Most of the traffic and CPU time on sites is generated by queries (GETs) rather than changes, so I think these extra requests aren't very significant.
I think the usability that this offers outweighs the small performance hit.
Test it out by performing some performance benchmarks and you will be able to see if it is going to be a concern in your particular case. See this article for more information.