Should I enable shuffle hash join when the left data is large (~1B records) with a power-law key and the right data is small (but > 2 GB)? - apache-spark-2.3

I have a very large dataset, 350 million to 1 billion records depending on the batch.
On the right side I have a much smaller dataset, usually around 10 million records, not more.
I cannot simply broadcast the right side (it sometimes grows beyond 8 GB, which is a hard limit).
And to top it off, my left side has a power-law distribution on the join key.
I have tried the trick of exploding the right-side key with a random salt in order to counter the power-law distribution on the left side.
This works as intended, but for the occasional batch I get a container failure for exceeding memory limits (19.5 GB used out of a 19 GB limit). I can only go as far as 17 GB plus 2 GB overhead per executor. I tried reducing the number of cores in order to have more memory per thread, but the same problem happens. The issue occurs 2 or 3 times per 50 or so batches, and the same batch runs correctly when the job is restarted from the point of failure.
The right side of the join is produced by joining small data to medium-sized data via a broadcast join, and the larger side of that join is checkpointed in order to save time if errors occur.
val x = larger.join(broadcast(smaller), Seq("col1", "col2", ...), "left")
The result is obtained by joining very large data to x.
val res = very_large.join(x, Seq("col2", "col3", ....), "left_outer").where(condition)
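For context, a rough sketch of the salting variant mentioned above (Scala; the salt range, the shortened key list and the variable names are illustrative, not the actual job):

// Illustrative sketch of the salting trick: salt the skewed left side randomly
// and explode the smaller right side over every salt value so matches survive.
import org.apache.spark.sql.functions._

val saltBuckets = 32  // how far to spread each hot key; tuned per batch in practice

val leftSalted  = very_large.withColumn("salt", (rand() * saltBuckets).cast("int"))
val rightSalted = x.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// The real join key list is longer (elided in the question); "salt" is simply appended to it.
val res_salted = leftSalted.join(rightSalted, Seq("col2", "col3", "salt"), "left_outer").where(condition)

The idea is simply that each hot left-side key is spread across saltBuckets partitions, at the price of duplicating the right side saltBuckets times.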
My question is whether re-enabling shuffle hash join (disabled by default) would be a better option in this case. My understanding is that, given my right side is so much smaller than the left side of the join, a shuffle hash join could be a better option than sort merge join (which is the default).
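For what it's worth, a minimal sketch of what I would try, assuming an existing SparkSession named spark (Scala, Spark 2.3; the partition count is a placeholder):

// Ask the planner to consider shuffle hash join instead of sort merge join.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

// As I understand the 2.3 planner, shuffle hash join is only picked when, on top of the
// flag above, the build side's estimated size is below
// spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions and is several
// times smaller than the other side, so partition count and plan statistics both matter.
spark.conf.set("spark.sql.shuffle.partitions", "2000")  // placeholder value to tune

// Re-run the join and verify the physical plan actually changed.
val res2 = very_large.join(x, Seq("col2", "col3"), "left_outer").where(condition)  // same keys as above, shortened here
res2.explain()  // expect ShuffledHashJoin rather than SortMergeJoin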
I use Spark 2.3 (can't upgrade due to platform constraints).
I do have some custom Catalyst expressions, but they have been tested and don't crash in other jobs. // I am listing this only for the sake of completeness.
Note: I cannot paste code samples due to IP.

Related

Microsoft Access Lost Back-end File Size

I have a problem with MS Access. My current MS Access back-end file is 320 MB, but after I compact the database it is only 222 MB, so it seems I lost 98 MB of file size. My questions are: what is that problem? After it lost that 98 MB, why does it keep getting slower when users use it? And were any records in that file lost? Thank you in advance.
This is normal behaviour, and you did not lose any data. A compact + repair (C + R) is normal maintenance that you should do on your database. How often you need this kind of maintenance depends largely on how many users you have, how much data “churning” occurs, etc.
So some back ends can go for weeks, or perhaps even longer, without needing a C + R; others, much less time.
So why does the file grow like this?
There are several reasons, but one is simply this: to allow multiple users, Access cannot reclaim the disk space when you delete a record, because you (may) have other users working with the data. You cannot “move” all the other data down to fill the “hole”, because that would CHANGE the position of existing data.
So if I am editing record 400, and another user deletes record 200, then a “hole” exists at 200. However, to reclaim that space, I would have to “move down” every single record after it to fill the hole. So if the database has 100,000 records and I delete record 50, I now have to move a MASSIVE 99,950 records back down to fill that one hole! That is far too slow.
So in place of that HUGE (and slow) process of shuffling 99,950 records (a lot of data), Access simply leaves the “hole” in that spot.
The reason is multi-user. With, say, 5 users working on the system, you can’t start moving data around WHILE users are working. The place or spot of an existing record would be moving all the time.
So moving records around is NOT practical if you are to allow multiple users.
The other issue that causes file growth is editing a record (again, say the record at position 50 out of 100,000). What happens if you type in extra information and now the record is TOO LARGE for that spot at position 50?
So now your record is too large. We have the opposite problem of a delete: we need to expand the “hole” or spot 50. And to do that, we might have to move 100,000 or more records just to increase the size of the hole for that one record.
The “hole” or “spot” for the record is NOT large enough anymore.
So what Access does is simply mark the old record (the old spot) as deleted, and then put the now-too-large record we were just editing at the end of the file (thus the file expands at the end). So the file grows even with just editing, and not necessarily only due to deletes.
So deleting a record does not really “remove” the hole, because doing so is too slow from a performance point of view.
And as noted, if we moved records around (which is way too slow), then other users working on the data would find that the record they are working on is no longer in the same place.
So we can’t start “moving around” that data during editing.
So Access is NOT able to reclaim space during operation. It is too slow, causes way too much disk I/O for a simple delete, and, as noted, would not work in multi-user when the positions of records keep changing due to some delete.
To reclaim all those “holes” and “spots”, you do a C + R. So this is a scheduled type of maintenance that you do when no one is working on the data (say late at night, or after all the workers go home). This also explains why only ONE user can be connected in order to do a C + R.
So you are not losing any data – the C + R simply reclaims all those “holes” and “spots” of unused space, but the process is time consuming.
So it is too slow “during” operation of your application to reclaim those spots. Such reclaiming of wasted and unused space thus only occurs during a C + R, and not during high-speed, interactive operation while your users are working.
I should point out that “most” database systems have this issue, and while “some” attempt to reclaim the unused space automatically, it is simply better to have a separate process and a separate action to reclaim that space during system maintenance, not during use of the application.
What you are seeing is normal.
And after a C + R you should see improved performance. Often not by much, but if the file is really large and full of gaps and holes, then a C + R will reduce the file size a lot, and that can help performance. Access also rebuilds the indexes and re-orders the data in PK order – this can also increase performance, since you will “more often” read data in PK order.

Is there a limit to the size of an R dataframe, and how do I check the RAM or memory limit for this?

Using the packages RMySQL and dbConnect, I have a program that pulls tables from my company's MySQL database. I'm an intern here, leaving in a few weeks, and I need to make my code robust for the person who will be running it weekly once I'm gone (I don't want it to break on him, and he will have trouble debugging without me). One of the things I'm worried about is the size of the queries, and whether or not R will always be able to handle / fit the size of the data pulled by the query. Currently, the code involves running a 10 - 30 minute query that results in a dataframe with X rows and Y columns.
Is there any way to check (1) how much space in memory this dataframe is taking up, and (2) how big this dataframe could become before R has an issue with it (other than by brute-force testing the query with a very large pull until it's too big)?
Thanks,

Number of disk block reads in nested-loop join

I'm trying to understand how to calculate the number of disk blocks that are read when a nested-loop join is performed.
In my book it says that the number of I/Os made in a nested-loop join is:
O + ⌈O/(b−2)⌉ * I
where O is the number of blocks in the outer loop and I is the number of blocks in the inner loop.
Is this the same as calculating the number of blocks that need to be read from disk when performing a nested-loop join?
Your book is probably wrong. The number of attempted reads will be given by a formula like that. I would expect the formula to be O*I -- and it is not clear what b is in the formula.
However, databases have something called a page cache. So, the pages are generally stored in memory and the number of actual reads would be more like O + I. Once the pages are in memory, they do not need another I/O.
Of course, not all tables fit into memory. In that case the number of reads is much higher and depends on how the cache works.
If the conclusion of the discussion is to avoid nested loop joins, then the book is correct.
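To make the numbers concrete, here is a small illustrative calculation (Scala, made-up block counts) comparing the book's formula with the rough O*I and O+I estimates mentioned above, under the common textbook reading that b is the number of available buffer blocks:

// Illustrative only: evaluates the book's block nested-loop formula next to the
// naive O*I estimate and the fully cached O+I estimate. Here b is assumed to be
// the number of buffer blocks available to the join, which may not match the book.
def blockNestedLoopIO(outerBlocks: Long, innerBlocks: Long, bufferBlocks: Long): Long =
  outerBlocks + math.ceil(outerBlocks.toDouble / (bufferBlocks - 2)).toLong * innerBlocks

val o = 1000L  // blocks in the outer relation (made-up numbers)
val i = 500L   // blocks in the inner relation
val b = 102L   // buffer blocks

println(blockNestedLoopIO(o, i, b))  // 1000 + ceil(1000/100) * 500 = 6000
println(o * i)                       // 500000: one full inner scan per outer block
println(o + i)                       // 1500: everything fits in the page cache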

Reason for 6x performance increase in data fetch when caching is disabled?

I'm running MySQL 5.5 on Windows and I'm trying to optimize a query. However, I can't seem to make any progress, because when I baseline it I get 131s, but if I re-execute it I get 23s. If I wait a while (like 10 minutes or so) it goes back to 131s, but I never know whether it's gone back until I execute it. So I can't really figure out if my optimizations are helping. Of course, I assumed it was caused by the query cache, so I turned that off, but I'm still getting the same results.
It's a select with several inner joins and a couple of outer joins. Two of the tables in the inner joins are very large, but it's generally joining on indexes. There are a couple of "in" statements in the joins.
So, my question is, what would cause this change in response time? Execution plan caching? OS file caching? Index caching? Something else?
Thank you!
Edit:
Here is the query and the table sizes:
select SQL_NO_CACHE count(1)
from reall_big_table_one ml
inner join pretty_big_table_one ltl on ml.sid=ltl.sid
inner join pretty_big_table_two md on ml.lid=md.lid
inner join reference_table ltp on ltl.ltlp_id=ltp.ltlp_id
left join pretty_big_table_three o on ml.sid=o.sid and o.cid not in (223041,226855,277890,123953,218150,264789,386817,122435,277902,278466,278430,277911,363986,373233,419863) and o.status_id in (100,400,500,700,800,900,1000)
left join medium_table ar on o.oid=ar.oid and ar.status_id in (1,2)
where ml.date_orig >= '2011-03-01' and ml.date_orig < '2011-04-01' and ml.lid=910741
ml has 50M rows
ltl has 1M rows
md has 1M rows
ltp has 800 rows
o has 7M rows
ar has 25K rows
The operating system is caching the drive the second time? If you run an exe twice it will come up far more quickly the second time too.
The db server isn't the only thing that caches data. The operating system has its own caching system, as does the hard drive you're accessing. That's the reason you're seeing performance increases on subsequent calls.
To establish a baseline you could go from a cold boot to running the query. This is a tedious process (I've done it myself on large, complex queries), but a cold boot makes certain that no data is cached. You can also simply throw out the first query, run ten or more of your test queries back to back, and take the average as your performance benchmark. You can also run the query many more times in sequence, even after a cold boot, and take the average. This is a good idea anyway, to see what changes under different circumstances such as page faults, spikes from other applications (which you would want to replicate manually), etc.
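As a rough illustration of the throw-out-the-first-run-and-average approach (Scala via JDBC here; the URL, credentials and query are placeholders, not the poster's actual setup):

// Sketch of a warm-cache benchmark: run the query runs+1 times, discard the first
// (cold) run, and average the rest. URL, credentials and SQL are placeholders.
import java.sql.DriverManager

// Older MySQL connectors may need the driver registered explicitly.
Class.forName("com.mysql.jdbc.Driver")

val url  = "jdbc:mysql://localhost:3306/test"
val sql  = "select SQL_NO_CACHE count(1) from reall_big_table_one where lid = 910741"
val runs = 10

val conn = DriverManager.getConnection(url, "user", "password")
val stmt = conn.createStatement()

val timesMs = (0 to runs).map { _ =>
  val start = System.nanoTime()
  val rs = stmt.executeQuery(sql)
  while (rs.next()) {}  // drain the result set so the work actually happens
  (System.nanoTime() - start) / 1e6
}

println(f"cold run: ${timesMs.head}%.1f ms, warm avg: ${timesMs.tail.sum / runs}%.1f ms")

stmt.close(); conn.close()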

Considerations for very large SQL tables?

I'm building, basically, an ad server. This is a personal project that I'm trying to impress my boss with, and I'd love any form of feedback about my design. I've already implemented most of what I describe below, but it's never too late to refactor :)
This is a service that delivers banner ads (http://myserver.com/banner.jpg links to http://myserver.com/clicked) and provides reporting on subsets of the data.
For every ad impression served and every click, I need to record a row that has (id, value), where value is the cash value of the transaction; e.g. -$.001 per served banner ad at $1 CPM, or +$.25 for a click. My output is all based on earnings per impression (abbreviated EPC): SUM(value)/COUNT(impressions), but computed on subsets of the data, like "earnings per impression where browser == 'Firefox'". The goal is to output something like "Your overall EPC is $.50, but where browser == 'Firefox', your EPC is $1.00", so that the end user can quickly see significant factors in their data.
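As a rough illustration of that per-subset computation (plain Scala, made-up records and field names):

// Illustrative only: computes overall EPC and EPC per browser from a handful of
// made-up hit records, mirroring SUM(value) / COUNT(impressions) per subset.
case class Hit(browser: String, isImpression: Boolean, value: Double)

val sampleHits = Seq(
  Hit("Firefox", isImpression = true,  value = -0.001),
  Hit("Firefox", isImpression = false, value =  0.25),   // a click
  Hit("Chrome",  isImpression = true,  value = -0.001),
  Hit("Chrome",  isImpression = true,  value = -0.001)
)

def epc(subset: Seq[Hit]): Double = {
  val impressions = subset.count(_.isImpression)
  if (impressions == 0) 0.0 else subset.map(_.value).sum / impressions
}

println(f"overall EPC: ${epc(sampleHits)}%.4f")
sampleHits.groupBy(_.browser).foreach { case (browser, subset) =>
  println(f"EPC where browser == '$browser': ${epc(subset)}%.4f")
}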
Because there's a very large number of these subsets (tens of thousands), and reporting output only needs to include the summary data, I'm precomputing the EPC-per-subset with a background cron task, and storing these summary values in the database. Once in every 2-3 hits, a Hit needs to query the Hits table for other recent Hits by a Visitor (e.g. "find the REFERER of the last Hit"), but usually, each Hit only performs an INSERT, so to keep response times down, I've split the app across 3 servers [bgprocess, mysql, hitserver].
Right now, I've structured all of this as 3 normalized tables: Hits, Events and Visitors. Visitors are unique per visitor session, a Hit is recorded every time a Visitor loads a banner or makes a click, and Events map the distinct many-to-many relationship from Visitors to Hits (e.g. an example Event is "Visitor X at Banner Y", which is unique, but may have multiple Hits). The reason I'm keeping all the hit data in the same table is because, while my above example only describes "Banner impressions -> clickthroughs", we're also tracking "clickthroughs -> pixel fires", "pixel fires -> second clickthrough" and "second clickthrough -> sale page pixel".
My problem is that the Hits table is filling up quickly, and slowing down ~linearly with size. My test data only has a few thousand clicks, but already my background processing is slowing down. I can throw more servers at it, but before launching the alpha of this, I want to make sure my logic is sound.
So I'm asking you SO-gurus, how would you structure this data? Am I crazy to try to precompute all these tables? Since we rarely need to access Hit records older than one hour, would I benefit from splitting the Hits table into ProcessedHits (with all historical data) and UnprocessedHits (with roughly the last hour's data), or does having the Hit.at date column indexed make this superfluous?
This probably needs some elaboration, sorry if I'm not clear, I've been working for past ~3 weeks straight on it so far :) TIA for all input!
You should be able to build an app like this in a way that it won't slow down linearly with the number of hits.
From what you said, it sounds like you have two main potential performance bottlenecks. The first is inserts. If you can have your inserts happen at the end of the table, that will minimize fragmentation and maximize throughput. If they're in the middle of the table, performance will suffer as fragmentation increases.
The second area is the aggregations. Whenever you do a significant aggregation, be careful that you don't cause all in-memory buffers to get purged to make room for the incoming data. Try to minimize how often the aggregations have to be done, and be smart about how you group and count things, to minimize disk head movement (or maybe consider using SSDs).
You might also be able to do some of the accumulations at the web tier based entirely on the incoming data rather than on new queries, perhaps with a fallback of some kind if the server goes down before the collected data is written to the DB.
Are you using InnoDB or MyISAM?
Here are a few performance principles:
Minimize round-trips from the web tier to the DB
Minimize aggregation queries
Minimize on-disk fragmentation and maximize write speeds by inserting at the end of the table when possible
Optimize hardware configuration
Generally you have detailed "accumulator" tables where records are written in realtime. As you've discovered, they get large quickly. Your backend usually summarizes these raw records into cubes or other "buckets" from which you then write reports. Your cubes will probably define themselves once you map out what you're trying to report and/or bill for.
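For what it's worth, a tiny sketch of that raw-records-to-buckets roll-up (plain Scala, made-up fields; the real version would live in the backend job and write to summary tables):

// Illustrative only: roll raw hit rows up into (hour, browser) buckets holding a
// running value sum and impression count, so reports never scan the raw hits.
case class RawHit(epochHour: Long, browser: String, isImpression: Boolean, value: Double)
case class Bucket(valueSum: Double, impressions: Long)

def summarize(raw: Seq[RawHit]): Map[(Long, String), Bucket] =
  raw.groupBy(h => (h.epochHour, h.browser)).map { case (key, hs) =>
    key -> Bucket(hs.map(_.value).sum, hs.count(_.isImpression).toLong)
  }

// Buckets from different runs merge cheaply, which is handy if recent hits are kept
// separate from historical ones as suggested in the question.
def merge(a: Bucket, b: Bucket): Bucket =
  Bucket(a.valueSum + b.valueSum, a.impressions + b.impressions)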
Don't forget fraud detection if this is a real project.