I have a web crawler that saves information to a database as it crawls the web. While it does this, it also saves a log of its actions, and any errors it encounters, to a log field in a MySQL database (the field grows to anywhere from 64 KB to 100 KB). It accomplishes this by concatenating new entries onto the field using the MySQL CONCAT function.
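For context, the append presumably looks something like the following minimal sketch (the mysql2/promise client, the crawl_log table, and the log/id column names are illustrative assumptions, not taken from the original setup):

```typescript
import mysql from 'mysql2/promise';

// Hypothetical pool; connection settings are placeholders.
const pool = mysql.createPool({ host: 'localhost', user: 'crawler', database: 'crawl' });

// Append one log entry to the TEXT log field of a crawl record.
// Each call makes the server read the whole (growing) field and write it back
// slightly larger, which is why the cost rises as the log approaches 100 KB.
async function appendLog(crawlId: number, entry: string): Promise<void> {
  await pool.query(
    "UPDATE crawl_log SET log = CONCAT(IFNULL(log, ''), ?) WHERE id = ?",
    [entry + '\n', crawlId]
  );
}
```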
This seems to work fine, but I am concerned about the CPU usage / impact it has on the MySQL database. I've noticed that the web crawling performs slower than before I implemented saving the log to the database.
I view this log from a management web page, and the current implementation seems to work fine other than the slow loading. Any recommendations for speeding this up, or for a better implementation?
Reading 100 KB strings into memory numerous times and then writing them to disk via a DB? Of course you're going to experience slowdown! Every part of what you are doing is going to tax memory, disk, and CPU (especially if memory usage hits the system max and you start swapping to disk). Let me count some of the ways you're possibly going to decrease overall site performance:
1. SQL connections max out and back up, because storing 100 KB records increases the time a single process holds a connection.
2. Web server processes eat up the free process pool, max out, and take longer to free up because they have to wait on DB connections to free.
3. Web server processes begin to bloat and each takes more memory, possibly more than the system can handle without swapping. This is compounded by hitting the maximum number of processes due to #2.
... A book could be written on your situation.
I am trying to decide between holding static data (gets updated on a nightly basis, not real-time) in a database or in flat JSON files to supply a Node.js application. In preliminary tests the flat file method is twice as fast. My question is about the issue of memory when using the file method.
If my Node.js app reads the data from the file and then does JSON.parse and passes the object to the template to render... does the in-memory size of that data get duplicated with each user connection?
i.e. if the data file is 1MB and there are 1000 concurrent users, does it consume 1000MB of server memory during that period?
Each connection runs separately, so if you have 1000 concurrent users, they aren't really running their requests all at the same time because Node.js is single-threaded. It runs a single connection until it either finishes or hits a non-blocking operation such as async I/O. Assuming you are using async file I/O, you could have a few connections in process at the same time, but as soon as one finishes, its memory will be returned to the system by the garbage collector.
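As a rough illustration of that flow (the data.json file name, the port, and the placeholder response are assumptions for the sketch), each request might do something like:

```typescript
import { promises as fs } from 'fs';
import http from 'http';

// Hypothetical handler: every request re-reads and re-parses the 1 MB file.
// The parsed object lives only for the duration of the request and is then
// garbage-collected, so concurrent memory use is bounded by the handful of
// requests actually in flight at the same moment.
http.createServer(async (req, res) => {
  const raw = await fs.readFile('./data.json', 'utf8'); // async, non-blocking
  const data = JSON.parse(raw);                         // CPU-bound, blocks briefly
  res.setHeader('Content-Type', 'application/json');
  res.end(JSON.stringify({ itemCount: Object.keys(data as object).length })); // placeholder "rendering"
}).listen(3000);
```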
Your operation sounds ideal for an in-memory cache. You can decide what lifetime works best for the cache, but you could load the JSON, store it in memory, and set an expiration time 10 minutes from now; as long as the current time is not greater than the expiration time, you just return the result from the cache with no disk access. Thus, you'd only ever retrieve the data from disk once every 10 minutes at most, the data would be returned even faster, and the average memory used per request would be significantly lower.
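A minimal sketch of that time-based cache, assuming the data lives in a data.json file (the file name and the 10-minute TTL are placeholders):

```typescript
import { promises as fs } from 'fs';

const TTL_MS = 10 * 60 * 1000; // 10 minutes
let cached: unknown = null;
let expiresAt = 0;

// Returns the parsed JSON, re-reading the file at most once per TTL window.
async function getData(): Promise<unknown> {
  const now = Date.now();
  if (cached !== null && now < expiresAt) {
    return cached;               // served from memory, no disk access
  }
  const raw = await fs.readFile('./data.json', 'utf8');
  cached = JSON.parse(raw);
  expiresAt = now + TTL_MS;
  return cached;
}
```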
The only downside to this cache approach is that when the data is updated, it could take up to 10 minutes (on average half the cache time, or 5 minutes) for the cached data to expire and the new data to be returned. Since the update only happens once nightly, this may not be a big deal to you, but there are ways to deal with the issue if you want to. For example, you can check the date/time of the data file on every request, and if it hasn't changed since the last time, you just keep using your cached version of the data. When it does change, you read it from disk and replace the cached version. This adds an extra disk I/O operation on each request, but guarantees that the user always gets the latest version while still keeping the benefit of a cached version that only has to be re-read into memory when the data has actually changed.
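A variant of the sketch above that uses the file's modification time instead of a fixed TTL (again with a placeholder file name), so the cache is refreshed only when the nightly update actually changes the file:

```typescript
import { promises as fs } from 'fs';

let cached: unknown = null;
let cachedMtimeMs = 0;

// One cheap fs.stat() per request; the full read + parse happens only when
// the file's mtime has changed since the cached copy was built.
async function getData(): Promise<unknown> {
  const { mtimeMs } = await fs.stat('./data.json');
  if (cached === null || mtimeMs !== cachedMtimeMs) {
    const raw = await fs.readFile('./data.json', 'utf8');
    cached = JSON.parse(raw);
    cachedMtimeMs = mtimeMs;
  }
  return cached;
}
```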
One other thing to consider: if the data is 1 MB and you're generating a giant HTML file from it, your page rendering may be where the largest memory consumption is, since expanding a large data structure into HTML can often make it 10-20x larger, and how well your rendering engine does with memory consumption depends entirely on the rendering engine.
If there is no per-user customization in the HTML or anything else in the rendered HTML that varies from one rendering to the next (as long as the JSON hasn't changed), you might want to actually cache the rendered HTML so all you have to do is stream it to each request.
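If the page really is identical for every user, the same idea can be pushed one level up and applied to the HTML itself; a sketch, with render() standing in for whatever template engine is actually in use:

```typescript
let cachedHtml: string | null = null;

// render() is a placeholder for the real template call; it is only invoked
// when the cache is empty or has been invalidated after a data refresh.
function getPage(data: unknown, render: (d: unknown) => string): string {
  if (cachedHtml === null) {
    cachedHtml = render(data);
  }
  return cachedHtml;
}

// Call this whenever the underlying JSON is reloaded (e.g. from the mtime
// check above) so the next request re-renders against the new data.
function invalidatePage(): void {
  cachedHtml = null;
}
```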
This seems like it should be a clear issue, but I was unable to find an explicit answer. Consider a simple MySQL database with an indexed ID and no complicated processing: just reading a row with a WHERE clause. Does it really need to be cached? Reducing MySQL queries apparently satisfies everyone. But I tested reading a text both from a flat cache file and via a MySQL query, in a for loop of 1 to 100,000 cycles. Reading from the flat file was only 1-2 times faster (but needed double the memory). The CPU usage (by rough estimate from top over SSH) was almost the same.
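For reference, the comparison described above could look roughly like this (the mysql2/promise client, the cache.txt file, and the pages table/body column are made-up names for the sketch):

```typescript
import { promises as fs } from 'fs';
import mysql from 'mysql2/promise';

async function benchmark(iterations = 100_000): Promise<void> {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'test', database: 'test' });

  let t = Date.now();
  for (let i = 0; i < iterations; i++) {
    await fs.readFile('./cache.txt', 'utf8');                       // flat-file read
  }
  console.log('flat file:', Date.now() - t, 'ms');

  t = Date.now();
  for (let i = 0; i < iterations; i++) {
    await conn.query('SELECT body FROM pages WHERE id = ?', [1]);   // indexed primary-key lookup
  }
  console.log('mysql:', Date.now() - t, 'ms');

  await conn.end();
}
```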
Now I do not see any reason for using a flat-file cache. Am I right, or is the case different in the long term? What might make queries slow in such a simple system? Is it still useful to reduce MySQL queries?
P.S. I am not discussing the internal query cache or systems like memcached.
It depends on how you look at the problem.
There is a limit on the number of MySQL connections that can be established at any one time.
Holding MySQL connection resources on a busy site could lead to max-connection errors.
Establishing a connection to MySQL via TCP is a resource eater (if your database sits on a different server). In that case, accessing a local disk file will be much faster.
If your database server is located outside the local network, the cost of physical distance weighs even heavier.
If records are updated only once daily, a cache really is requested once and reused for the rest of the day.
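A sketch of that once-a-day cache, shown in TypeScript for illustration (the same pattern applies in any language); loadFromDatabase() and the per-day cache file name are stand-ins, not part of the original setup:

```typescript
import { promises as fs } from 'fs';

// Cache today's records in a per-day file; only the first request of the day
// pays for the database round trip. loadFromDatabase() is a placeholder for
// the real query code.
async function getDailyRecords(loadFromDatabase: () => Promise<unknown>): Promise<unknown> {
  const cachePath = `./cache-${new Date().toISOString().slice(0, 10)}.json`; // e.g. cache-YYYY-MM-DD.json
  try {
    return JSON.parse(await fs.readFile(cachePath, 'utf8'));
  } catch {
    const records = await loadFromDatabase();
    await fs.writeFile(cachePath, JSON.stringify(records));
    return records;
  }
}
```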
Overview of the application:
I have a Delphi application that allows a user to define a number of queries, and run them concurrently over multiple MySQL databases. There is a limit on the number of threads that can be run at once (which the user can set). The user selects the queries to run, and the systems to run the queries on. Each thread runs the specified query on the specified system using a TADOQuery component.
Description of the problem:
When the queries retrieve a low number of records, the application works fine, even when lots of threads (up to about 100) are submitted. The application can also handle larger numbers of records (150,000+) as long as only a few threads (up to about 8) are running at once. However, when the user runs more than around 10 queries at once (i.e. 10+ threads), and each thread retrieves around 150,000+ records, we start getting errors. Here are the specific error messages that we have encountered so far:
a: Not enough storage is available to complete this operation
b: OLE error 80040E05
c: Unspecified error
d: Thread creation error: Not enough storage is available to process this command
e: Object was open
f: ODBC Driver does not support the requested properties
Evidently, the errors are due to a combination of factors: number of threads, amount of data retrieved per thread, and possibly the MySQL server configuration.
The main question really is why the errors are occurring. I appreciate that it appears to be in some way related to resources, but given the different errors being returned, I'd like to get my head around exactly why they are cropping up. Is it down to resources on the PC, or something to do with the configuration of the server, for example?
The follow-up question is what we can do to avoid the problems. We're currently throttling the application down by lowering the number of threads that can run concurrently. We can't force the user to retrieve fewer records, as the queries are totally user-defined; if they want to retrieve 200,000 records, then that's up to them, so there's not much we can do about that side of things. Realistically, we don't want to throttle down the speed of the application, because most users will be retrieving small amounts of data and we don't want to make the application too slow for them to use. And although the number of threads can be changed by the user, we'd rather get to the root of the problem and try to fix it without having to rely on tweaking the configuration all the time.
It looks like you're loading a lot of data client-side. The rows may need to be cached in client memory (especially if you use bidirectional cursors), and in a 32-bit application that memory may not be enough, depending on the average row size and how efficiently the library stores rows.
Usually the best way to accomplish database work is to perform it on the server directly, without retrieving the data to the client. Databases usually have an efficient cache system and can write data out to disk when it doesn't fit in memory.
Why do you retrieve 150,000 rows at once? You could use a mechanism to transfer data only when the user actually accesses it (a sort of paging through the data), to avoid large chunks of "wasted" memory.
This makes perfect sense (the fact you're having problems, not the specific errors). Think it through - you have the equivalent of 10 database connections (1 per thread) each receiving 150,000 rows of data (1,500,000 rows total) across a single network connection. Even if you're not using client-side cursors and the rows are small (just a few small columns), this is a HUGE flow of data across a single network interface, and a big hit on memory on the client computer.
I'd suspect the error messages are incorrect, in the same way that sometimes you have an access violation caused by a memory overwrite in another code location.
Depending on your DBMS, to help with the problem you could use the LIMIT/TOP SQL clauses to limit the amount of data returned.
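A sketch of that paging idea against MySQL (the page size is illustrative and db.query stands in for whatever data-access layer the application uses; a real implementation would also need a deterministic ORDER BY so pages are stable):

```typescript
// Hypothetical fetch of one "page" of a user-defined query: only pageSize
// rows cross the network per call, instead of the full 150,000+ result set.
async function fetchPage(
  db: { query: (sql: string) => Promise<unknown[]> }, // placeholder data-access layer
  baseSql: string,                                    // user-defined query, assumed to end with ORDER BY
  page: number,
  pageSize = 1000
): Promise<unknown[]> {
  const offset = page * pageSize;
  // MySQL syntax shown; other DBMSs use TOP or FETCH FIRST instead of LIMIT.
  return db.query(`${baseSql} LIMIT ${pageSize} OFFSET ${offset}`);
}
```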
Things I would do:
write a very simple test application which only uses the necessary parts of the connection / query creation (with threads); this would eliminate all side effects caused by other parts of your software
use a different database access layer instead of ODBC, to find out if the ODBC driver is the root cause of the problem
it looks like the memory usage is no problem when the number of threads is low - to verify this, I would also measure / calculate the memory requirement of the records, compare it with the memory usage of the application in the operating system. For example if tests show that four threads can safely query 1.5 GB of total data without problems, but ten threads fail with under 0.5 GB of total data, I would say it is a threading problem
I have read every answer I could find and searched via Google for the correct answer to the following question, but I am rather a novice and don't seem to get a clear understanding.
A lot of what I've read has to do with web servers, but I don't have a web server; I have an intranet database.
I have a MySQL database on a Windows server at work.
I will have many users accessing this database constantly to perform simple queries and to write new records back to it.
The read/write load will not be that heavy (chances are 50-100 users will do so at exactly the same time, even if thousands could be connected).
The GUI will be either via Excel forms and/or Access.
What I need to know is the maximum number of active connections I can have at any given time to the database.
I know I can change the number in MySQL Admin, however I really need to know what will actually work...
I don't want to allow 1,000 users if the system will only really handle 100 correctly (beyond that, although connected, the performance would be too slow, for example).
Any ideas or first-hand experiences will be appreciated.
This depends mainly on your server hardware (RAM, CPU, networking) and on the server load from other processes if the machine is not dedicated to the database. I don't think you'll get an absolute answer; the best way is to test.
I think something like 1,000 should work OK, as long as you use a 64-bit MySQL server. With 32 bit, too many connections may create virtual memory pressure: each connection has its own thread, and every thread needs a stack, so the stack memory will reduce the possible size of the buffer pool and other buffers.
MySQL generally does not slow down if you have many idle connections; however, special commands that enumerate every connection, e.g. SHOW PROCESSLIST or KILL, will be somewhat slower.
If a connection stays idle for too long (its idle time exceeds the wait_timeout parameter), it is dropped by the server. If that could happen in your scenario, you might want to increase wait_timeout (its default value is 8 hours).
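A quick sketch of checking and adjusting those settings, shown through a TypeScript client for consistency with the other examples (the credentials and the values 500 / 28800 are placeholders; the same statements can be run from any MySQL console, and SET GLOBAL changes are lost on restart unless also written to my.cnf):

```typescript
import mysql from 'mysql2/promise';

async function inspectConnectionLimits(): Promise<void> {
  const conn = await mysql.createConnection({ host: 'dbserver', user: 'admin', password: 'secret' });

  // Current ceiling and how many connections are actually in use right now.
  const [maxConnections] = await conn.query("SHOW VARIABLES LIKE 'max_connections'");
  const [threadsConnected] = await conn.query("SHOW STATUS LIKE 'Threads_connected'");
  console.log(maxConnections, threadsConnected);

  // Raise the limits on the running server (requires admin privileges).
  await conn.query('SET GLOBAL max_connections = 500');
  await conn.query('SET GLOBAL wait_timeout = 28800'); // 8 hours, the default

  await conn.end();
}
```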
I have written a program that uses a MySQL database, and the traffic between the database server (a very powerful one) and the client goes over an ADSL connection (1 Mbit/s).
But I have a very, very slow connection between each client and the server. Only approximately 3-4 KB/s of data flows between the client and the server. Neither the server nor the clients use the Internet for other purposes; only my program uses the connection.
I can't figure out why. Could the reason be the MySQL server packet size?
Any suggestions?
Try using mytop to identify the cause of the server's low performance.
Another possibility: you may be using SELECT COUNT(*) FROM ... on large InnoDB tables, which causes a table scan.
And can you test with some other service whether the data exchange rate between the machines is OK? Even if the output bandwidth is lower for ADSL users, 3-4 KB/s might not be the reason for the low performance.
The effective transfer rate is often heavily limited by the number of roundtrips between client and server. Without seeing your code it is sort of difficult to tell, but you should check the number of requests happening.
If you have a single request that results in many records being returned, you should see a better usage of bandwidth than with a higher number of requests which only deliver a few rows each.
In the latter case the actual result transfer is probably quite fast, but the latencies involved in the "control communications" (i.e. the statements themselves, login requests, etc.) add up, effectively lowering overall throughput.
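To illustrate the difference, compare fetching N rows one statement at a time with fetching them in a single statement (a sketch; the items table is made up and the Db type stands in for the real client library):

```typescript
type Db = { query: (sql: string, params?: unknown[]) => Promise<unknown[]> };

// Many round trips: each statement pays the full client<->server latency, so
// on a high-latency ADSL link throughput collapses even though every
// individual result is tiny.
async function fetchOneByOne(db: Db, ids: number[]): Promise<unknown[]> {
  const rows: unknown[] = [];
  for (const id of ids) {
    rows.push(...await db.query('SELECT * FROM items WHERE id = ?', [id]));
  }
  return rows;
}

// One round trip: the same rows arrive in a single request/response cycle,
// so the latency cost is paid once instead of ids.length times.
async function fetchInOneQuery(db: Db, ids: number[]): Promise<unknown[]> {
  const placeholders = ids.map(() => '?').join(', ');
  return db.query(`SELECT * FROM items WHERE id IN (${placeholders})`, ids);
}
```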
As for the packet size: when it is very small, there is more overhead in the communications, increasing the aforementioned effect. The server's default max_allowed_packet size is 1 MB if memory serves, but that should be fine with your connection.
You first have to debug both connections.
What is your upload speed if you upload a file with WinSCP or an equivalent tool to the MySQL server? It should be near 90 KB/s with 1 Mbit/s ADSL.