Really streaming large MySQL result sets - mysql

There's a similar question about streaming large results but the answer just points at docs and no clear answer emerges.
I believe that merely treating a full result set as a stream still takes a lot of memory on the jdbc driver side..
I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
And in particular I am not sure why setFetchSize(Integer.MIN_VALUE) is a very good idea, as it seems far from optimal if that means each row is sent on its own on the wire.
I believe libraries like jooq and slick already take care of that... and am curious as to how to accomplish it with and without them.
Thanks!

I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
The best practice is not to do synchronous streaming but rather fetch in moderate size chunks. However avoid using OFFSET (also see). If your doing a batch process this can be facilitated by first selecting and pushing the data into a temporary table (ie turn your original results you want into a table first and then select chunks from the table... databases are really fast at copying data internally).
Synchronous streaming in general does not scale (aka iterator). It does not scale well for batch processing and it certainly does not scale for handling loads of clients. This is why the drivers vary and do so many different things because its fairly difficult to figure out how much resources to load because it is a pull model. Async streaming (push model) would probably help but unfortunately the JDBC standard does not support async streaming.
You might notice but this is one of the reasons why many of the wrappers around JDBC such as Spring JDBC do not return Iterators (along with fact that the resource also needs to be manually cleaned up). Some of the wrappers provide iterators but really they just turn the results into a list .
Your link to the Scala version is rather disturbing that its upvoted given the stateful nature of managing a ResultSet... its very un-Scala like... I'm not sure those folks know they have to consume the iterator or close the connection/ResultSet properly which requires a fair amount of imperative programming.
While it may seem inefficient to let the database decide how much to buffer just remember that most database connections are extremely heavy memory wise (at least on postgres they are). So if you take a long time streaming and have many clients your going to have to create more connections and put serious burden on the database. Not to mention the default buffers have probably been highly optimized (ie the resultset size that client ends up with).
Finally for batch processing chunks can be done in parallel which is obviously more efficient than a synchronous pipeline as well as being restarted (with out having to rework already processed data) if a problem occurs.

Related

Can we do Parallel Streaming, with IJSON for bulky JSON data Parsing/processing

I am aware how IJSON is solving, the bulky JSON reading and processing challenges. However i am not able to find any article which specifies how to speed up this:
I have seen few things to achieve that
1. Use YAZL backend
2. Play with buff_size parameter(not seeing any significant improvement)
Question which I had in my mind or I guess many are already working on that:
Now if we want to utilize parallel processing power of the machine, will IJSON supports that.
I know at no point IJSON knows entire stream size, so splitting is out of question.
My knowledge is quite limited in this area.
Any thread, document or link would be a nice start for me to understand this more clearly.

JSON or HTML: Which output can perform better?

I am thinking of improving website performance by moving rendering to the client side. The current stack is: (router, sphinx, db) + HTML. I am thinking of changing this to: (router, sphinx, db) + JSON.
All of clients are running i7 processors and they don't care much about client side rendering performance. We also have client side app which is ready to connect to resful JSON API (this is not to go into discussion about client vs server side rendering).
1) Rendering on server takes about 20% of time (and 80% goes to routing, sphinx, db). I heard that outputting JSON takes about half of the time that it takes to output HTML, so I think it would be 10% improvement, and those 10% could go into data processing. Am I right about that?
2) I believe that 10% improvement for one server means that, to get the same amount of performance with a large scale app with 100 physical servers, we need 10% less servers: in this case 90 instead of 100. Is this correct?
3) How is it possible to get the best performance in Ruby to output JSON instead of any other format?
4) Taking daily scenarios, what difference could be made performance wise if we output JSON instead of HTML?
1, 2) Probably yes, but there could be uncounted for factors, which may make the performance increase to be less than you expect. Like, if bottleneck is IO, and as HTML creation probably is CPU-limited, then reducing CPU load will only let CPUs idle more. Only way to find out really is to have reliable benchmarks while running parallel request handling, and get hard numbers.
Further, spending the hours to develop client-side rendering might be more expensive, than just paying for more server capacity... Moore's law is still holding, doing that kind of optimization for such a small improvement is probably not worth the development cost... Probably better to concentrate those dev resources on something which would increase revenue, instead of trying to make small savings.
3) JSON generation probably uses a native library, while HTML generation happens in Ruby script code. And native code is typically 1-2 orders of magnitude faster than interpreted (and not JIT-compiled) code at low level operations. The higher level operation it is, the narrower the gap, so if "generate JSON" is the high level operation, then it's equally fast if you call it from Ruby or from compiled language code.
4) Well, not sure I understand the question, but see answer 1,2...
see http://openmymind.net/2012/5/30/Client-Side-vs-Server-Side-Rendering/ maybe that will help you
best way to find out for you particular case is to implement it and test.
you can use new relic and google analytics (maybe others as well) to see client performance and rendering times and experience

Any good threads related job-interview question?

When interviewing graduates I usually ask them questions about data structures, algorithms and complexity theory. I would really like to ask a question that will enable them to show their familiarity with multi-threaded concepts, without dwelling into language specific issues.
Any good questions? The only question I could think of is how to write a Singleton that supports multi-threaded access.
I find the classic "write me a consumer-producer queue" question to be quite good. You can talk about synchronization in a handwavy way beforehand for five minutes or so (e.g. start with "What does Object.wait() do? What other methods on Object is it related to? Can you give me an example of when you might use these? What other concurrency techniques might you use in practice [because really, it's quite rare that actually using the wait/notify primitives is the best approach]?"). Make sure the candidate addresses (or at least makes clear he is aware of) both atomicity ("missed updates") and volatility (visibility of the new value on other threads)
Then after you've had a chat about the theory of these, get them to spend a few minutes actually writing the code for a primitive producer-consumer queue. This should be straightforward to anyone who actually understands what they were talking about above, yet it will weed out those who can "talk the talk" but don't actually understand it in practice (arguably the most dangerous group).
What I like about these mini-coding exercises, is that they're often easy to extend. For instance, if the candidate completes the task easily, you can ask how they would extend it for situation XXX - invent requirements that you know will push the limits of the noddy solution you asked for. This not only lets you tailor the depth of questions you're asking but gives some insight into how well the candidate handles clarification of requirements, and modifications of existing design (which is pretty important in this industry).
Here you can find some topics to discuss:
threads implementation ( kernel vs user space)
thread local storage
synchronization primitives
deadlocks, livelocks
Differences between mutex and
semaphore.
Use of condition variables.
When not to use threads. (eg. IO multiplexing)
Talk with them about a popular, but not well-known topic, where thread handling is essential.
I recommend you, build a web server with them, of course, only on paper or just in words. The result should look something like this: there is a main thread, it's listening on a socket. When something arrives, it passes the socket into the pool, then this thread returns back to socket listening. The pool has fixed number of slots. The request processing threads are dedicated to get job from the pool. Find out, what's better, if the threads are checking the pool concurrently, or the listner main thread selects a free slot/thread for the new incoming request. Try to write a small pseudocode, or a graph for both side of the pool handling.
Let's introduce a small application: page counter, which tells that how many page request has been made since server startup. Don't tell them that the counter must be protected against concurrent modification, let them to find it out how to do this with mutexes or synchronization or whatsoever. Maybe you could skip the web server part, the page counter app is easier to specify.
Another example is a chat, with 2+ clients and a server, find out, how to solve the problem, that all the messages should arrive in the same order for all clients. Or reflex game: the server waits for 1..5 secs random, then says "peek-a-boo", and the player wins who presses space key first. Specify it with 2 player, then try to expand it to N players.
Also, be aware of NPPs. NPP stands: "non-programming programmer". There are dudes, who can talk about programming issues, they know all the 3/4-letter abbrevations (there're lot in the Java world, EJB, JSP, XSLT, and my favourite: POJO, which means Pure Old Java Objects, lol), they understand and modify codes, or make similar programs from a base, but they fail even with small problems, it it has to do it theirselves, e.g. finding the nearest element to a base in an array. Sometimes it takes months, until it turns out. They performs well at interviews, because they prepare for it. Maybe they don't even known, that they're NPPs, this is a known effect: http://en.wikipedia.org/wiki/Dunning-Kruger_effect
It's hard to recognize the opposite dudes, who have not heard about trendy libraries or patterns, but they can learn it even at the job interview. (Personal remark: my last interview was in 1999, and it seems that I will not do interview anymore. I have never heard of dynamic web pages before, but I've figured out the term "session" during the interview, the question was that how to build a simple hanging man web app. I was hired.)

Is it possible to do asynchronous / parallel database query in a Django application?

I have web pages that take 10 - 20 database queries in order to get all the required data.
Normally after a query is sent out, the Django thread/process is blocked waiting for the results to come back, then it'd resume execution until it reaches the next query.
Is there's any way to issue all queries asynchronously so that they can be processed by the database server(s) in parallel?
I'm using MySQL but would like to hear about solutions for other databases too. For example I heard that Postgresql has an async client library - how would I use that in this case?
This very recent blog entry seems to imply that it's not built in to either the django or rails frameworks. I think it covers the issue well and is quite worth a read along with the comments.
http://www.eflorenzano.com/blog/post/how-do-we-kick-our-synchronous-addiction/ (broken link)
I think I remember Cal Henderson mentioning this deficiency somewhere in his excellent speech http://www.youtube.com/watch?v=i6Fr65PFqfk
My naive guess is you might be able to hack something with separate python libraries but you would lose a lot of the ORM/template lazy evaluation stuff django gives to the point you might as well be using another stack. Then again if you are only optimizing a few views in a large django project it might be fine.
I had a similar problem and I solved it with javascript/ajax
Just load the template with basic markup and then do severl ajax requsts to execute the queries and load the data. You can even show loading animation. User will have a web 2.0 feel instead of just gloomy page loading. Ofcourse, this means several more HTTP requests per page, but it's up to you to decide.
Here is how my example looks: http://artiox.lv/en/search?query=test&where_to_search=all (broken link)
Try Celery, there's a bit of a overhead of having to run a ampq server, but it might do what you want. Not sure about concurrency of the DB tho. Also, if you want speed for your DB, I'd recommend MongoDB (but you'll need django-nonrel for that).

How do web spiders differ from Wget's spider?

The next sentence caught my eye in Wget's manual
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant for the spider option in wget.
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
I would like to see the differences in code, not abstractly. I love code examples.
How do web spiders differ from Wget's spider in code?
A real spider is a lot of work
Writing a spider for the whole WWW is quite a task --- you have to take care about many "little details" such as:
Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth. (asynchronous socket i/o).
You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work)
You need to be polite to the spidered web sites:
Respect the robots.txt files.
Don't fetch a lot of information too quickly: this overloads the servers.
Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
Besides following links, you may want to parse sitemaps to get URLs of pages.
You may want to evaluate which information is important for you and changes frequently to be refreshed more frequently than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational data bases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc. you get down to the specs of wget.
Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.
Let me add a few more technical remarks on three points
Parallel connections / asynchronous socket communication
You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection. And this amount of parallel processes/threads produces too much overhead.
A better solution is asynchronous input/output: process about 1000 parallel connections in one single thread by opening the sockets in non-blocking mode and use epoll or select to process just those connections that have received data. Since Linux kernel 2.4, Linux has excellent support for scalability (I also recommend that you study memory-mapped files) continuously improved in later versions.
Note: Using asynchronous i/o helps much more than using a "fast language": It's better to write an epoll-driven process for 1000 connections written in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100Mb connection with processes written in perl.
From the original answer:
The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
Edit five years later:
Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous http implementation of node.js --- it saves you all the work mentioned in the above original paragraph. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" this processing to a process written in another programming language. But things are changing quickly and by the time you read this comment, this problem may already a thing of the past...
Partitioning the work over several servers
One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest to assign certain "ranges of domain names" to each server: keep a central data base of domain names with a reference to a spider computer.
Extract URLs from received web pages in batches: sort them according to their domain names; remove duplicates and send them to the responsible spider computer. On that computer, keep an index of URLs that already are fetched and fetch the remaining URLs.
If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.
Read the standards
I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forget parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.
It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.
I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to only use a single thread of execution (at least by what you have shown).
"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.
Again this is all just a guess based on what I know of spiders in general, and what you posted here.
Unfortunately, many of the more well-known 'real' web spiders are closed-source, and indeed closed-binary. However there are a number of basic techniques wget is missing:
Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
Prioritization; some pages are more important to spider than others
Rate limiting; you'll be banned quickly if you keep pulling down pages as quickly as you can
Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months.
There are also various other inputs that can be used such as sitemaps and the like. Point is, wget isn't designed to spider the entire web, and it's not really a thing that can be captured in a small code sample, as it's a problem of the whole overall technique being used, rather than any single small subroutine being wrong for the task.
I'm not going to go into details of how to spider the internet, I think that wget comment is regarding to spidering one website which is still a serious challenge.
As a spider you need to figure out when to stop, not go into recursive crawls just because the URL changed like date=1/1/1900 to 1/2/1900 and so
Even bigger challenge to sort out URL Rewrite (I have no clue what so ever how google or any other handles this). It's pretty big challenge to crawl enough but not too much. And how one can automatically recognise URL Rewrite with some random parameters and random changes in the content?
You need to parse Flash / Javascript at least up to some level
You need to consider some crazy HTTP issues like base tag. Even parsing the HTML is not easy, considering most of the websites are not XHTML and browsers are so flexible in the syntax.
I don't know how much of these implemented or considered in wget but you might want to take a look at httrack to understand the challenges of this task.
I'd love to give you some code examples but this is big tasks and a decent spider will be about 5000 loc without 3rd party libraries.
+ Some of them already explained by #yaakov-belch so I'm not going to type them again