CPU bound applications vs. IO bound - language-agnostic

For 'number-crunching' style applications that use a lot of data (read: hundreds of MB, but not into the GB range, i.e. it will fit comfortably in memory alongside the OS), does it make sense to read all your data into memory before processing starts, to avoid potentially making your program IO bound while reading large related datasets, and instead load them from RAM?
Does the answer change with different data backings? That is, would it be the same whether you were using XML files, flat files, a full DBMS, etc.?

Your program is only as fast as its bottleneck. It makes sense to do things like keeping your data in memory if that improves overall performance, but there is no hard and fast rule that says it will. When you fix one bottleneck, something new becomes the bottleneck, so resolving one issue may yield a 1% increase in performance or a 1000% increase, depending on what the next bottleneck is. And the thing you're improving may still be the bottleneck.
I think about these things as generally fitting into one of three levels:
Eager. When you need something from disk, from a network, or as the result of a calculation, you go and get it or do it on the spot. This is the simplest to program and the easiest to test and debug, but the worst for performance. It's fine so long as this aspect isn't the bottleneck;
Lazy. Once you've done a particular read or calculation, don't do it again for some period of time, which may be anything from a few milliseconds to forever. This can add a lot of complexity to your program, but if the read or calculation is expensive it can reap enormous benefits; and
Over-eager. This is much like a combination of the previous two. Results are cached, but instead of doing the read or calculation only when requested, there is a certain amount of preemptive activity to anticipate what you might want. For example, if you read a 10K block from a file, there is a reasonably high likelihood that you'll later want the next 10K block, so rather than delay execution you fetch it just in case it's requested (a rough sketch of all three levels follows).
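As a minimal sketch of the three levels, using the 10K file-read example above (the function names, cache sizes, and synchronous prefetch are assumptions for illustration, not anything prescribed by the answer):

```python
import functools

BLOCK_SIZE = 10 * 1024  # the 10K blocks from the example above


def read_block_eager(path, block_index):
    """Eager: hit the disk every single time the block is requested."""
    with open(path, "rb") as f:
        f.seek(block_index * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)


@functools.lru_cache(maxsize=256)
def read_block_lazy(path, block_index):
    """Lazy: the same read, but cached so repeat requests skip the disk."""
    return read_block_eager(path, block_index)


_prefetched = {}


def read_block_over_eager(path, block_index):
    """Over-eager: serve from cache and prefetch the next block just in case."""
    if (path, block_index) not in _prefetched:
        _prefetched[(path, block_index)] = read_block_eager(path, block_index)
    # Preemptively pull in the following block; a real implementation would do
    # this on a background thread so the current caller is not delayed.
    nxt = (path, block_index + 1)
    if nxt not in _prefetched:
        _prefetched[nxt] = read_block_eager(path, block_index + 1)
    return _prefetched[(path, block_index)]
```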
The lesson to take from this is the (somewhat over-used and often misquoted) observation from Donald Knuth that "premature optimization is the root of all evil." Lazy and over-eager solutions add a huge amount of complexity, so there is no point applying them to something that won't yield a useful benefit.
Programmers often make the mistake of creating a highly optimized (or allegedly optimized) version of something before determining whether they need to and whether it will actually be useful.
My own take on this is: don't solve a problem until you have a problem.

I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.
Most database tables have regular offsets for fields in each row. For example, a customer record may be 50 bytes long with a pants_size column starting at the 12th byte. Selecting all pants sizes is then as easy as getting the values at offsets 12, 62, 112, 162, ad nauseam.
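As a toy illustration of why those fixed offsets are cheap to exploit (the 50-byte layout and the 2-byte unsigned pants_size field are assumptions carried over from the example, not the format of any real DBMS):

```python
import struct

RECORD_SIZE = 50        # each customer record is 50 bytes, as in the example
PANTS_SIZE_OFFSET = 12  # pants_size starts at byte offset 12 of every record


def all_pants_sizes(path):
    """Jump straight to offsets 12, 62, 112, ... without parsing anything else."""
    sizes = []
    with open(path, "rb") as f:
        record_index = 0
        while True:
            f.seek(record_index * RECORD_SIZE + PANTS_SIZE_OFFSET)
            raw = f.read(2)               # assume a 2-byte unsigned integer field
            if len(raw) < 2:
                break
            sizes.append(struct.unpack("<H", raw)[0])
            record_index += 1
    return sizes
```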
XML, however, is a lousy format for fast data access. You'll need to slog through a bunch of variable-length tags and attributes to get at your data, and you won't be able to jump instantly from one record to the next, unless you parse the file into a data structure like the one mentioned above, in which case you'd have something very much like an RDBMS, so there you go.

Related

Which one is faster: a single big query or a few small queries?

So I want to grab data from a database. Which method is faster: creating several queries or one multi-query?
Each "round trip" to the database will have some overhead. So the fewer round-trips, the less overhead. Consider also that fewer requests means fewer packets from client to server. If the result of the consolidated query gives you just what you want, then single query is the way to go. If your single query is returning extra or redundant data (perhaps because of de-normalization) then the overhead savings of a single round trip may be lost in the extra data transferred.
Another consideration is latency. If the queries have to be completed in sequence because some part of the output of one is needed in the input of the next, consolidating into one query will cut out all the network latencies in between all the individual smaller queries, so a final result can be delivered faster. However, if the smaller queries are independent of each other, launching them in parallel can get all the results delivered faster, albeit less efficiently.
Bottom line: the answer depends on the specifics of your situation. The best way to get an answer will probably be to implement both ways, test, and compare the resource usage of each implementation.
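For instance, a rough way to compare the two approaches is a sketch like the one below, using an in-memory SQLite table (the table and column names are invented; with a real networked DBMS the per-round-trip cost would be much more pronounced than it is in-process):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(i, f"name{i}") for i in range(10_000)])
ids = list(range(10_000))

start = time.perf_counter()
rows_many = [conn.execute("SELECT name FROM customer WHERE id = ?", (i,)).fetchone()
             for i in ids]                                   # many small queries
t_many = time.perf_counter() - start

start = time.perf_counter()
rows_one = conn.execute("SELECT name FROM customer ORDER BY id").fetchall()  # one query
t_one = time.perf_counter() - start

print(f"{len(ids)} small queries: {t_many:.4f}s, one consolidated query: {t_one:.4f}s")
```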
In general it would be the one multi-query, but it depends on a lot of things such as the hardware, the data structures, etc.
Each connection does take a little time, every one of them.

How should I configure my Solr filterCache, firstSearcher and newSearcher?

Question 1: I'm trying to optimize my searchers in my solrconfig.xml, and there are two different searchers that can get warmed. My understanding is that firstSearcher only fires on server startup, while a newSearcher is created whenever you need a new searcher. It seems to me that we would want the same fqs and facets specified in each. In what case would you want them to differ?
Question 2: Is there any way I can determine the effect of adding an fq or facet on searcher startup time? I know that I can brute-force measure the startup time of a searcher with fqs/facets vs. one without, but that's not very granular. Assuming there's a cost/benefit to weigh for an individual fq/facet, I'd like to be able to measure it so I can decide which things are worth warming and which are not.
Question 3: How can I effectively size my filterCache? I have a specific set of fqs that I know are likely to be hit, about 500 of them, so it seems like I would set it to 500. However, Solr seems to use the filterCache for query results that have to be faceted. Since 90% of my queries are faceted, it seems like we'd have to use the number of queries expected as the basis of the cache size. Does that sound right?
Your understanding is correct. However, a newSearcher can be autowarmed from the last one, so that's one difference. Another is that since newSearcher happens per commit, if you're doing frequent commits, you may want to do considerably less work than if you're starting cold.
I'm not aware of a great way. Queries are run serially, and at least with firstSearcher, show up in the access log, so you can literally see how long they take. Whether a given query set results in something "warm enough" is pretty much trial and error though.
The biggest thing to remember about FilterCache size is that a single entry is around (Number of Documents in your index)/8 bytes. So if you set the size to 500, and you have 100M docs in your index, you'll need 6.25G of heap just to hold it. Generally the recommendation is that you size your heap as small as possible to leave more memory for the disk cache, but this is an exception. As far as facet queries putting eviction pressure on your cache goes, I have the same problem, and I'm not aware of any solution. See https://issues.apache.org/jira/browse/SOLR-8171.
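The arithmetic above as a quick sanity check (the 100M documents and 500 entries are the figures from this answer):

```python
docs_in_index = 100_000_000    # 100M documents
filter_cache_entries = 500     # filterCache size

bytes_per_entry = docs_in_index / 8           # roughly one bit per document
total_bytes = bytes_per_entry * filter_cache_entries
print(f"{total_bytes / 1e9:.2f} GB of heap just for the filterCache")  # 6.25 GB
```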

Most suitable data store for billions of indexes

So we're looking to store two kinds of indexes.
The first kind will be on the order of billions, each with between 1 and 1000 values, each value being one or two 64-bit integers.
The second kind will be on the order of millions, each with about 200 values, each value between 1KB and 1MB in size.
And our usage pattern will be something like this:
Both kinds of index will have values added to the top up to thousands of times per second.
Indexes will be infrequently read, but when they are read, the entirety of the index is read.
Indexes should be pruned, either when writing values to the index or in some kind of batch job.
Now we've considered quite a few databases; our favourites at the moment are Cassandra and PostgreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra, and a major requirement is that it can't take too much manpower to maintain. I get the feeling that Cassandra is going to throw up unexpected scaling issues, whereas PostgreSQL is just going to be a pain to shard, but at least for us it's a known quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.
So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!
Thanks,
-Alec
You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.
You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.
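As a rough sanity check on that estimate, here is a back-of-the-envelope calculation; the per-index averages are guesses, only the counts and value sizes come from the question, and with larger averages the total easily reaches tens of terabytes:

```python
# First kind: ~1B indexes, assume ~100 values each, each value ~12 bytes.
first_kind_bytes = 1_000_000_000 * 100 * 12
# Second kind: ~1M indexes, ~200 values each, assume ~32 KB per value.
second_kind_bytes = 1_000_000 * 200 * 32 * 1024

total_tb = (first_kind_bytes + second_kind_bytes) / 1e12
print(f"~{total_tb:.1f} TB raw, before replication")   # ~7.8 TB with these averages
```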
Billions is not a big number by today's standards, so why not write a benchmark instead of guessing? That will give you a better decision tool and it's really easy to do. Just install your target OS and each database engine, then run queries with, let's say, Perl (because I like it).
It won't take you more than a day to do all this; I've done something like this before.
A nice way to benchmark is to write a script that executes queries randomly, or with something like a Gaussian bell curve, "simulating" real usage. Then plot the data, or do it like a boss and just read the logs.
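A minimal sketch of that kind of simulated-usage benchmark, against an in-memory SQLite table (all of the table, column, and distribution parameters are invented; swap in the real drivers and queries you actually want to compare):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(i, "x" * 100) for i in range(100_000)])


def next_id():
    # Gaussian around the "hot" middle of the table, clamped to the valid range,
    # so some rows are hit far more often than others, a bit like real usage.
    return min(max(int(random.gauss(50_000, 10_000)), 0), 99_999)


start = time.perf_counter()
for _ in range(50_000):
    conn.execute("SELECT payload FROM item WHERE id = ?", (next_id(),)).fetchone()
elapsed = time.perf_counter() - start
print(f"50,000 simulated queries in {elapsed:.2f}s "
      f"({50_000 / elapsed:,.0f} queries/s)")
```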

Dynamic or pre-calculated data

I'm a bit new to programming and had a general question that I just thought of.
Say I have a database with a bunch of stock information, with one column for price and another for earnings. To get the price/earnings ratio, would it be better to calculate it every day or to calculate it on demand? Performance-wise, I think it'd be quicker to just read a stored value, but I'm wondering whether, for math-type functions, it's worth the batch job to pre-calculate it (is the difference even noticeable?).
So how do the professionals do it? Have the application process the data, or have it already available in the database?
The professionals use a variety of methods. It all depends on what you're going for. Do the new ratios need to be displayed immediately? How often does the core data change? Ideally you would only recalculate the ratio when the price or earnings change, but that takes extra development, and it's probably not worth it if you don't have a substantial amount of activity on the site.
On the other hand, if you're receiving hundreds of visits every minute, you're definitely going to want to cache whatever you're calculating, as the time required to re-display a cached result is much less than recreating the result (in most scenarios).
However, as a general rule of thumb, don't get stuck trying to optimize something you haven't anticipated any performance issues with.
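If caching does turn out to be worth it, a minimal sketch is something like the following; the 60-second TTL, the function names, and the P/E example in the usage comment are all arbitrary choices, not anything from the question:

```python
import time

_cache = {}        # key -> (expires_at, value)
TTL_SECONDS = 60   # recompute at most once per minute


def cached(key, compute, ttl=TTL_SECONDS):
    """Return a cached value for key, recomputing it only when the entry expires."""
    now = time.time()
    entry = _cache.get(key)
    if entry is None or entry[0] < now:          # missing or expired: recompute
        _cache[key] = (now + ttl, compute())
    return _cache[key][1]


# usage: pe = cached(("pe_ratio", ticker), lambda: price / earnings)
```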
It would be good to keep statistical data in a separate table, since that data is effectively read-only. You could calculate average, max, and min values directly with SQL functions and save them. In the meantime, for the current period (day), you could calculate and show the values dynamically. This statistical information can be used for reports or forecasting.
A pre-calculated value is (of course) faster.
However, it all depends on the requirements themselves.
Will this value be invoked frequently? If so, using a pre-calculated value will bring a huge advantage.
Does the calculation really take a long time and/or huge resources? If so, pre-calculating will be helpful.
Please bear in mind that sometimes a slow process or large resource consumption is caused by the programming implementation itself, not by a wrongly designed system.
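A small sketch of both options with SQLite (the table and column names are invented, since the question doesn't give a schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stock (
    ticker TEXT PRIMARY KEY, price REAL, earnings REAL, pe_ratio REAL)""")
conn.execute("INSERT INTO stock VALUES ('ABC', 50.0, 2.5, NULL)")

# On demand: compute the ratio in the query. Always up to date, costs a little CPU.
print(conn.execute(
    "SELECT price / earnings FROM stock WHERE ticker = 'ABC'").fetchone())

# Pre-calculated: store the ratio whenever price or earnings change, so reads are
# a plain column fetch. The write path has to remember to keep it fresh.
conn.execute("UPDATE stock SET pe_ratio = price / earnings WHERE ticker = 'ABC'")
print(conn.execute(
    "SELECT pe_ratio FROM stock WHERE ticker = 'ABC'").fetchone())
```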

How to increase performance of a MySQL query if we have more than 1 million records?

In my User table I have more than 1 million records, so how can I manage this using MySQL and Symfony 1.4 to make performance better, so that it gives quick output?
To significantly improve the performance of a well-designed system, about all you can do is increase the resources. Typically, these days, the cheapest way to do that is to distribute the task.
For example, a slow part of an RDBMS system is reading from and writing to storage (typically RDBMS systems start out I/O bound, that is, they mostly wait for data to be read from or written to storage).
So, to offset this, the RDBMS will very commonly allow you to split a table across multiple HDDs, effectively multiplying the I/O performance (an approach similar to RAID 0).
Adding more hard disks keeps increasing performance, up to the maximum I/O your system can support (either because the system simply cannot push more data through its circuits, or because it needs to crunch the numbers a bit when it fetches them and so becomes CPU bound; optimally you would be utilising both).
After that you have to start multiplying the systems, distributing the data across database nodes. For this to work, either the RDBMS must support it or there must be an application layer that coordinates distributing the tasks and merging the results; but normally things will still scale.
I would say that with 512 systems you could have a full trillion records (10^12) effectively cached and achieve relatively nice performance. But really you should specify what kind of performance you are looking for; there is a difference between running full-text searches on tera-records and running mostly simple fetches and updates. Also, for certain work 500ms (or even more) is considered good performance, while for other work it would be horrible.
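To make the application-layer distribution mentioned above concrete, here is a deliberately naive sketch of routing rows to nodes by key; the node addresses and hashing scheme are placeholders, and a real deployment would more likely use consistent hashing so that adding a node doesn't remap most keys:

```python
import hashlib

NODES = ["db-node-0:5432", "db-node-1:5432", "db-node-2:5432", "db-node-3:5432"]


def node_for(key: str) -> str:
    """Pick a database node for a key with a stable hash (not Python's built-in
    hash(), which changes between runs), taken modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


print(node_for("user:123456"))   # the same user always routes to the same node
```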
At first: there's a big difference between 1 trillion and 1 million.
To your performance problems: show us the query that's running slow; without seeing it, it's hard to tell what's wrong with it. What you could try:
use EXPLAIN to get more information about your slow queries and see whether they're using your indexes (and if not, why not?); a small demonstration follows this list
use correct and reasonable indexes
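An illustration of that using SQLite's EXPLAIN QUERY PLAN, since it's easy to run self-contained; the idea is the same as MySQL's EXPLAIN, only the syntax and output differ, and the table and column names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.executemany("INSERT INTO user (email, city) VALUES (?, ?)",
                 [(f"u{i}@example.com", f"city{i % 100}") for i in range(100_000)])

query = "SELECT id FROM user WHERE email = 'u42@example.com'"

# Without an index: a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# With an index on email: a direct index search instead of a scan.
conn.execute("CREATE INDEX idx_user_email ON user (email)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```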