Related
I think I understand how each of repartition, hive partitioning, and bucketing affect the number of output files, but I am not quite clear on the interaction of the various features. Can someone help fill in the number of output files for each of the below situations where I've left a blank? The intent is to understand what the right code is for a situation where I have a mix of high and low cardinality columns that I need to partition / bucket by, where I have frequent operations that filter on the low cardinality columns, and join on the high cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions 10)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the # of output files is not constant and will depend on the actual data of the input partition.To clarify, let's say df is 200 partitions, not 200 files. Output files scale with # of input partitions, not # of files. 200 files could be misleading as that could be 1 partition to 1000's of partitions.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I could be wrong on this, but I believe it's max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets)
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have any effect within the scope of the final stage's tasks, and effect how the task writes its data into files on disk.
In particular, each final stage task will write one file per hive-partition/bucket combination present in its data. Combinations not present in that task will not write an empty file if you're using hive-partitioning or bucketing.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to make sure you repartition your data before writing to make sure the data layout matches your hive-partitioning/bucketing settings (i.e. each hive-partition/bucket combination is not split between multiple tasks), otherwise you could end up writing huge numbers of files.
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with 190 empty files, and 10 with data.
*Note: with AQE in spark 3, spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, only 10 will contain data".
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions 10)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
Now we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
Conclusion
There's no one-size fits all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partition/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principals.
I have 4-dimensional data points stored in my MySQL database in server. One time dimension data with three spatial GPS data (lat, lon, alt). GPS data are sampled by 1 minute timeframe for thousands of users and are being added to my server 24x7.
Sample REST/post json looks like,
{
"id": "1005",
"location": {
"lat":-87.8788,
"lon":37.909090,
"alt":0.0,
},
"datetime": 11882784
}
Now, I need to filter out all the candidates (userID) whose positions were within k meters distance from a given userID for a given time period.
Sample REST/get query params for filtering looks like,
{
"id": "1001", // user for whose we need to filter out candidates IDs
"maxDistance":3, // max distance in meter to consider (euclidian distance from users location to candidates location)
"maxDuration":14 // duration offset (in days) from current datetime to consider
}
As we see, thousands of entries are inserted in my database per minute which results huge number of total entries. Thus to iterate over for all the entries for filtering, I am afraid trivial naive approach won't be feasible for my current requirement. So, what algorithm should I implement in the server? I have tried to implement naive algorithm like,
params ($uid, $mDis, $mDay)
1. Init $candidates = []
2. For all the locations $Li of user with $uid
3. For all locations $Di in database within $mDay
4. $dif = EuclidianDis($Li, $Di)
5. If $dif < $mDis
6. $candidates += userId for $Di
7. Return $candidates
However, this approach is very slow in practice. And pre calculation might not be feasible as it costs huge space for all userIDs. What other algorithm can improve efficiency?
You could implement a spatial hashing algorithm to efficiently query your database for candidates within a given area/time.
Divide the 3D space into a 3D grid of cubes with width k, and when inserting a data-point into your database, compute which cube the point lies in and compute a hash value based on the cube co-ordinates.
When querying for all data-points within k of another datapoint d, compute the cube that d sits in, and find the 8 adjacent cubes (+/- 1 in each dimension). Compute the hash values of the 9 cubes and query your database for all entries with these hash values within the given time period. You will have small candidate set from which you can then iterate over to find all datapoints within k of d.
If your value of k can range between 2-5 meters, give your cubes a width of 5.
Timestamps can be stored as a separate field, or alternatively you can make your cubes 4-dimensional and include the timestamp in the hash, and search 27 cubes instead of 9.
I'm working with some rather large time-series data sets related to futures prices and am in the process of converting some calculations which I previously did in Excel to R. This conversion has been relatively straightforward thus far but I m having a bit of trouble replicating my histograms with their cumulative frequency distributions in R as I had them in Excel. If you're familiar with Excel, the Histogram function in the Data Analysis Toolpack automatically creates a Cumulative Frequency Distribution table with the cumulative percentages of each, in this case, Price Level, next to the histogram.
I've had some success creating some basic histograms using ggplot, here is a snippet of that code:
ggplot(data=CrudeRaw, aes(x=CrudeRaw$X7_1_F))+
geom_histogram(breaks=seq(X7_F_M_L, X7_F_M_H, by=0.01),
col="blue",
fill="white",
alpha= 0.2)+
labs(title="X7 1 Month Price Distribution", x="Price Levels",
y="Frequency") +
xlim(c(X7_F_M_L, X7_F_M_H)) +
ylim(c(0,100))
Several questions regarding formatting and usage.
a) CrudeRaw is a dataframe which contains roughly 276 rows, and no less then 50 columns. For the purposes of this project I've chopped the data into 20 period, 60 period, 120 period, 180 period, and 240 period subsets. The data is in chronological order by date.
Question(s): ggplot cannot take numeric data types, only data frames, so I can only feed it the entire df even though I am interested in creating distributions for the aforementioned subsets. Is there a way that I can still do this?
b) How do I get every bin (price) to show up on the x-axis rather than a number marking every 5 bins (-15, -10, -5, 0, 5 ..., 15)?
c) I've successfully created a cumulative frequency table using the follow code,
round(cbind(cumsum(table(X7_F)))/NROW(X7_F),2)
But I'd like a way to a) output each of these tables (of which there are many) to a CSV file OR, ideally create a "report" of sorts with R which can be saved to a pdf, or perhaps even within the histogram which the table/data is associated with.
d) I've done some searching on how to output data to a CSV file, but it wasnt clear from the examples I went over how I could output multiple arrays to the same sheet or workbook, en masse. That is, I would like to output my 20, 60, 120, 180, and 240 period arrays of prices to the same workbook. I'm thinking that by creating another dataframe that I could then pass these subsets of the data to the ggplot function like I mentioned I was having trouble doing in part a)
e) Lastly (for now) how do I overlay the CFD onto my histograms?
Please advise if you require any additional information or colour in order to help me and many thanks in advance for your responses!
I have a CSV file which is generated by a process that outputs the data in pre-defined bins (say from -100 to +100 in steps of 10). So, each line looks somewhat like this:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
i.e. 20 comma separated values, the first representing the frequency in the range -100 to -90, while the last represents the frequency between 90 to 100.
The problem is, Gnuplot seems to require the raw data for it to be able to generate a histogram, whereas I have only the frequency distribution. How do I proceed in this case? I'm looking for the simplest possible histogram, that perhaps displays the data using vertical bars.
You already have histogram data, so you mustn't use "set histogram".
Generate the x-values from the linenumbers, and do a simple boxplot
plot dataf using (($0-10)*10):$1 with boxes
Does the value returned by MySQL's MD5 hash function continue to change indefinitely as the string given to it grows indefinitely?
E.g., will these continue to return different values:
MD5("A"+"B"+"C")
MD5("A"+"B"+"C"+"D")
MD5("A"+"B"+"C"+"D"+"E")
MD5("A"+"B"+"C"+"D"+"E"+"D")
... and so on until a very long list of values ....
At some point, when we are giving the function very long input strings, will the results stop changing, as if the input were being truncated?
I'm asking because I want to use the MD5 function to compare two records with a large set of fields by storing the MD5 hash of these fields.
======== MADE-UP EXAMPLE (YOU DON'T NEED THIS TO ANSWER THE QUESTION BUT IT MIGHT INTEREST YOU: ========
I have a database application that periodically grabs data from an external source and uses it to update a MySQL table.
Let's imagine that in month #1, I do my first download:
downloaded data, where the first field is an ID, a key:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
I store this
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
Month #2, I get
1,"A","B","C"
2,"A","D","X"
3,"B","D","E"
4,"B","F","E"
Notice that the record with ID 2 has changed. Record with ID 4 is new. So I store two new records:
1,"A","B","C"
2,"A","D","E"
3,"B","D","E"
2,"A","D","X"
4,"B","F","E"
This way I have a history of *changes* to the data.
I don't want have to compare each field of the incoming data with each field of each of the stored records.
E.g., if I'm comparing incoming record x with exiting record a, I don't want to have to say:
Add record x to the stored data if there is no record a such that x.ID == a.ID AND x.F1 == a.F1 AND x.F2 == a.F2 AND x.F3 == a.F3 [4 comparisons]
What I want to do is to compute an MD5 hash and store it:
1,"A","B","C",MD5("A"+"B"+"C")
Let's suppose that it is month #3, and I get a record:
1,"A","G","C"
What I want to do is compute the MD5 hash of the new fields: MD5("A"+"G"+"C") and compare the resulting hash with the hashes in the stored data.
If it doesn't match, then I add it as a new record.
I.e., Add record x to the stored data if there is no record a such that x.ID == a.ID AND MD5(x.F1 + x.F2 + x.F3) == a.stored_MD5_value [2 comparisons]
My question is "Can I compare the MD5 hash of, say, 50 fields without increasing the likelihood of clashes?"
Yes, practically, it should keep changing. Due to the pigeonhole principle, if you continue doing that enough, you should eventually get a collision, but it's impractical that you'll reach that point.
The security of the MD5 hash function is severely compromised. A collision attack exists that can find collisions within seconds on a computer with a 2.6Ghz Pentium4 processor (complexity of 224).
Further, there is also a chosen-prefix collision attack that can produce a collision for two chosen arbitrarily different inputs within hours, using off-the-shelf computing hardware (complexity 239).
The ability to find collisions has been greatly aided by the use of off-the-shelf GPUs. On an NVIDIA GeForce 8400GS graphics processor, 16-18 million hashes per second can be computed. An NVIDIA GeForce 8800 Ultra can calculate more than 200 million hashes per second.
These hash and collision attacks have been demonstrated in the public in various situations, including colliding document files and digital certificates.
See http://www.win.tue.nl/hashclash/On%20Collisions%20for%20MD5%20-%20M.M.J.%20Stevens.pdf
A number of projects have published MD5 rainbow tables online, that can be used to reverse many MD5 hashes into strings that collide with the original input, usually for the purposes of password cracking.