I think I understand how each of repartition, hive partitioning, and bucketing affect the number of output files, but I am not quite clear on the interaction of the various features. Can someone help fill in the number of output files for each of the below situations where I've left a blank? The intent is to understand what the right code is for a situation where I have a mix of high and low cardinality columns that I need to partition / bucket by, where I have frequent operations that filter on the low cardinality columns, and join on the high cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the # of output files is not constant and will depend on the actual data of the input partition. To clarify, let's say df is 200 partitions, not 200 files. Output files scale with # of input partitions, not # of files. "200 files" could be misleading, as that could be anywhere from 1 partition to thousands of partitions.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I could be wrong on this, but I believe it's max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets)
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have any effect within the scope of the final stage's tasks, and affect how each task writes its data into files on disk.
In particular, each final stage task will write one file per hive-partition/bucket combination present in its data. Combinations not present in that task will not write an empty file if you're using hive-partitioning or bucketing.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to make sure you repartition your data before writing to make sure the data layout matches your hive-partitioning/bucketing settings (i.e. each hive-partition/bucket combination is not split between multiple tasks), otherwise you could end up writing huge numbers of files.
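The rule above (one file per hive-partition/bucket combination present in each final-stage task) can be sketched as a toy model in plain Python. This is only an illustration, not Spark: the helper names `repartition` and `count_output_files` are hypothetical, Python's built-in `hash` stands in for Spark's hash function, and AQE coalescing is ignored.

```python
import random

def repartition(tasks, n, cols):
    """Toy shuffle: move every row into one of n tasks by hashing the given columns."""
    out = [[] for _ in range(n)]
    for rows in tasks:
        for row in rows:
            out[hash(tuple(row[c] for c in cols)) % n].append(row)
    return out

def count_output_files(tasks, partition_cols=(), bucket_col=None, bucket_count=None):
    """Count files written under the model described in the text."""
    total = 0
    for rows in tasks:
        if not partition_cols and bucket_col is None:
            total += 1  # plain write: exactly one file per task, even if it is empty
            continue
        combos = set()
        for row in rows:
            part = tuple(row[c] for c in partition_cols)
            bucket = None if bucket_col is None else hash((row[bucket_col],)) % bucket_count
            combos.add((part, bucket))
        total += len(combos)  # one file per combination present in this task
    return total

# Assumed test data: 10 values of colA, 1000 values of colB, spread over 100 tasks.
rng = random.Random(0)
rows = [{"colA": rng.randrange(10), "colB": rng.randrange(1000)} for _ in range(20000)]
tasks = [rows[i::100] for i in range(100)]

print(count_output_files(tasks))
print(count_output_files(tasks, partition_cols=["colA"]))
print(count_output_files(repartition(tasks, 200, ["colA"]), partition_cols=["colA"]))
print(count_output_files(repartition(tasks, 200, ["colB"]),
                         partition_cols=["colA"], bucket_col="colB", bucket_count=200))
```

Comparing the second and third counts shows why shuffling to match the write layout matters: the same data and the same `partition_cols` produce far fewer files once each colA value lives in a single task.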
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with 190 empty files, and 10 with data.
Note: with AQE in Spark 3, Spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, only 10 will contain data".
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
Now that we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
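The worst-case figures in this walkthrough all come from the same multiplication: tasks times the hive-partition values a task could see, times the buckets a task could see. A minimal sketch of that arithmetic, using the 100-task assumption from the examples (the helper name `worst_case_files` is made up for illustration):

```python
def worst_case_files(tasks, hive_partitions_per_task=1, buckets_per_task=1):
    """Upper bound on files: every task sees every combination it possibly could."""
    return tasks * hive_partitions_per_task * buckets_per_task

scenarios = {
    "partition_cols=['colA']": worst_case_files(100, 10),
    "partition_cols=['colB']": worst_case_files(100, 1000),
    "partition_cols=['colA', 'colB']": worst_case_files(100, 10 * 1000),
    "bucket_cols=['colB'], bucket_count=100": worst_case_files(100, 1, 100),
    "partition colA + 200 buckets on colB": worst_case_files(100, 10, 200),
    "repartition(200, 'colB') first": worst_case_files(200, 10, 1),
}
for name, bound in scenarios.items():
    print(f"{name}: up to {bound:,} files")
```

These are upper bounds only; the actual count depends on which combinations really occur in each task.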
Conclusion
There's no one-size-fits-all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partition/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principles.
I'm importing a large CSV file into GNU Octave, doing some simple data manipulation, and creating some plots. The file has about 6.5 million rows. I expected reading the file to take about two to three hours, because in my experience that's how long it usually takes to create a file of this size. When it wasn't finishing, I added a status counter and found that it was slowing down as it read: after 12 hours it was only at line 1.5 million and moving at a crawl. According to Resource Monitor, though, there are no memory issues.
Is there a more efficient way to read the file than the code I have below? Do I need to do something special to allocate memory to the process so it doesn't slow down?
This is the loop that reads the CSV. It's a while loop that scans the CSV one line at a time, extracts the columns I need, and ends when it reaches the first blank line:
% Process File
F=1;
while 1
  % Status Counter
  printf ("Status: %d \r", F);
  fflush (stdout);
  F=F+1;
  % Read first unread line
  line = fgetl(fileID);
  % Exit while loop if line is empty
  if ~ischar(line)
    break;
  endif
  % Translate Line
  Bank = textscan (line, '%f', 'Delimiter', ',');
  Bank = cell2mat (Bank);
  Bank = transpose (Bank);
  % Append Bank to Output
  Output = [Output; Bank(1, 1:9), Bank(1, 13:14), Bank(1, 20:21)];
endwhile
This is the slow part:
Output = [Output; Bank(1, 1:9), Bank(1, 13:14), Bank(1, 20:21)];
What you do here is create a new matrix, copy Output and the new row into it, and assign it to Output. As Output becomes larger, the copy becomes increasingly expensive.
What you need to do is preallocate the output array. Always preallocate!
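To see why the append gets slower over time, here is a rough cost model written in Python rather than Octave, followed by a preallocated version of the same read loop. It is a sketch under assumptions: the function names are invented for illustration, `max_rows` must be a known upper bound on the row count, and the column slices mirror `1:9`, `13:14` and `20:21` from the question.

```python
def rows_written_when_growing(n_rows):
    """Rows copied if every append rebuilds the matrix, as in Output = [Output; row]."""
    written = 0
    for existing in range(n_rows):
        written += existing + 1  # copy all existing rows, then add the new one
    return written               # equals n_rows * (n_rows + 1) / 2

def select_columns(line):
    """Keep columns 1-9, 13-14 and 20-21 (1-based) of one comma-separated line."""
    values = [float(v) for v in line.split(",")]
    return values[0:9] + values[12:14] + values[19:21]

def read_preallocated(lines, max_rows):
    """Fill a container allocated once up front, then trim the unused tail."""
    output = [None] * max_rows
    n = 0
    for line in lines:
        if not line.strip():     # stop at the first blank line
            break
        output[n] = select_columns(line)
        n += 1
    return output[:n]

print(rows_written_when_growing(1000))  # 500500 row copies for only 1000 rows
sample = ",".join(str(i) for i in range(1, 22))
print(read_preallocated([sample, sample, ""], max_rows=10))
```

With preallocation each row is written once, so the total work grows linearly with the number of rows instead of quadratically.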
I'm using Apache Hive.
I created a table in Hive (similar to external table) and loaded data into the same using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc), Hive gives me errors when I ask for data involving slight computations (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive CLI output is shown below:
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
I'm not sure where I have gone wrong; many answers on Stack Overflow tend to point back to YARN configs (my environment config is detailed below).
I have the log as well but it contains more than 30000 characters (Stack Overflow limit)
My hadoop environment is configured as follows -
1 Name Node & 1 Data Node. Each has 20 GB of RAM with sufficient ROM. Have allocated 13 GB of RAM for the yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb each with the mapreduce.map.memory.mb being set as 4 GB and the mapreduce.reduce.memory.mb being set as 12 GB. Number of reducers is currently set to default (-1). Also, Hive is configured to run with a MySQL DB (rather than Derby).
You should set appropriate values for the properties shown in your trace,
e.g. edit the properties in hive-site.xml:
<property>
  <name>hive.exec.reducers.bytes.per.reducer</name>
  <value>67108864</value>
</property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error because the number of reducers is missing in the Hive config.
Try setting it using the command below:
hive> SET mapreduce.job.reduces=XX;
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
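The rule of thumb quoted above is simple arithmetic. A small sketch, where the cluster size passed in is a made-up example rather than the asker's actual setup and `suggested_reducers` is a hypothetical helper name:

```python
def suggested_reducers(nodes, max_containers_per_node):
    """Apply the 0.95 and 1.75 factors to (nodes * max containers per node)."""
    slots = nodes * max_containers_per_node
    return round(0.95 * slots), round(1.75 * slots)

# Hypothetical cluster: 4 nodes with up to 8 containers each.
low, high = suggested_reducers(nodes=4, max_containers_per_node=8)
print(f"SET mapreduce.job.reduces={low};   (or {high} for the 1.75 factor)")
```

Either result can be substituted for XX in the SET command.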
I managed to get Hive and MR to work - increased the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
Also incorporated parts of the answers by #Sathiyan S and #vmorusu - set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through application of its heuristic techniques).
In Jmeter v2.13, is there a way to capture Throughput via non-GUI/Command Line mode?
I have the jmeter.properties file configured to output via the Summariser and I'm also outputting another [more detailed] .csv results file.
call ..\..\binaries\apache-jmeter-2.13\bin\jmeter -n -t "API Performance.jmx" -l "performanceDetailedResults.csv"
The performanceDetailedResults.csv file provides:
timeStamp
elapsed time
responseCode
responseMessage
threadName
success
failureMessage
bytes sent
grpThreads
allThreads
Latency
However, no amount of tweaking the .properties file or the test itself seems to provide Throughput results like I get via the GUI's Summary Report's Save Table Data button.
All articles, postings, and blogs seem to indicate it wasn't possible without manual manipulation in a spreadsheet. But I'm hoping someone out there has figured out a way to do this with no, or minimal, manual manipulation as the client doesn't want to have to manually calculate the Throughput value each time.
Throughput is calculated by JMeter Listeners, so it isn't something you can enable via properties files. The same applies to other calculated metrics, such as:
Average response time
50, 90, 95, and 99 percentiles
Standard Deviation
Basically, throughput is calculated by simply dividing the total number of requests by the elapsed time.
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time)
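As a rough illustration of that formula, the detailed results file could be post-processed with a short script instead of a spreadsheet. This sketch assumes the CSV has a header row with `timeStamp` (sample start, in epoch milliseconds) and `elapsed` (milliseconds) columns; the function name and sample data are invented for the example.

```python
import csv
import io

def throughput_per_second(csv_text):
    """Requests divided by the time from the first sample start to the last sample end."""
    count = 0
    first_start = None
    last_end = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = int(row["timeStamp"])
        end = start + int(row["elapsed"])
        first_start = start if first_start is None else min(first_start, start)
        last_end = end if last_end is None else max(last_end, end)
        count += 1
    if count == 0 or last_end == first_start:
        return 0.0
    return count / ((last_end - first_start) / 1000.0)

sample = """timeStamp,elapsed,success
1000,200,true
1500,300,true
2500,500,true
"""
print(throughput_per_second(sample))  # 3 requests over 2.0 seconds -> 1.5
```

For a real run, the text of performanceDetailedResults.csv would be passed in place of `sample`.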
Hopefully it won't be too hard for you.
References:
Glossary #1
Glossary #2
Did you take a look at JMeter-Plugins?
This tool can generate an aggregate report from the command line.
I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, each with a hash, a semicolon, and a clear-text string.
So basically, the program should check whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
read the file line by line, always doing a strstr();
read a larger chunk of the file (e.g. 10,000 lines) and do a strstr() on it
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
1. Keep a start offset and end offset. Initialize the start offset to zero and end offset to the file size.
2. If start = end, there is no match.
3. Read some data from the offset (start + end) / 2.
4. Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
5. If the hash you're on is the hash you're looking for, go on to step 6.
   Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
   If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
6. Skip to the semicolon and trailing space. The unhashed data will be from the current position to the next newline.
This can be easily converted into a while loop with breaks.
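A minimal sketch of such a loop, in Python for brevity. It is a variant of the steps above rather than a literal transcription: it searches for the smallest byte offset whose next complete line has a hash greater than or equal to the target, which avoids special-casing the first line. It assumes the file is sorted by hash in byte order and uses the `hash; string` line format from the question; the function names are illustrative.

```python
import os
import tempfile

def _line_at_or_after(f, pos):
    """Return the first complete line starting at byte offset >= pos (b'' at EOF)."""
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()  # consume through the first newline at or after pos - 1
    return f.readline()

def lookup(path, target_hash):
    """Binary-search a hash-sorted file; return the clear text or None."""
    target = target_hash.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_at_or_after(f, mid)
            if not line or line.split(b";", 1)[0] >= target:
                hi = mid       # candidate line is at or before this offset
            else:
                lo = mid + 1   # everything up to this line is too small
        line = _line_at_or_after(f, lo)
    key, _, value = line.partition(b";")
    if key != target:
        return None
    return value.lstrip(b" ").rstrip(b"\r\n").decode("utf-8")

# Demo on a tiny sorted file built from the sample lines in the question.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"1f129c42de5e4f043cbd88ff6360486f; somestring\n"
              b"4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring\n"
              b"78f640ec8bf82c0f9264c277eb714bcf; anotherstring\n")
print(lookup(tmp.name, "4ed312643e945ec4a5a1a18a7ccd6a70"))
os.remove(tmp.name)
```

Each iteration costs one seek and two short reads, so a 13 GB file needs only a few dozen iterations per lookup.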
Importing it into MySQL with appropriate indices and such would use a similarly (or more, since it's probably packed nicely) efficient algorithm.
Your last solution might be the easiest one to implement, as you hand the whole performance optimization over to the database (and databases are usually optimized for that).
strstr is not useful here, as it searches an arbitrary string, whereas you know the specific format and can jump and compare in a more targeted way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case for file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. With it, you can leave loading the contents to the operating system and don't have to care about caching yourself.
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so that all records are equally long. This gives you "random" access to your data, as you can easily calculate the position of the nth entry.
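The padding idea can be sketched as follows, in Python for brevity. The record size of 64 bytes and the NUL padding byte are arbitrary assumptions for illustration, and an in-memory buffer stands in for the real file.

```python
import io

RECORD_SIZE = 64  # hypothetical fixed record length in bytes, padding included

def pad_record(hash_hex, text):
    """Encode one 'hash;text' record and pad it with NUL bytes to RECORD_SIZE."""
    raw = f"{hash_hex};{text}".encode("utf-8")
    if len(raw) > RECORD_SIZE:
        raise ValueError("record does not fit in RECORD_SIZE bytes")
    return raw.ljust(RECORD_SIZE, b"\0")

def read_record(f, n):
    """Jump straight to the nth record (0-based): its offset is n * RECORD_SIZE."""
    f.seek(n * RECORD_SIZE)
    raw = f.read(RECORD_SIZE).rstrip(b"\0")
    hash_hex, _, text = raw.decode("utf-8").partition(";")
    return hash_hex, text

data = io.BytesIO(
    pad_record("1f129c42de5e4f043cbd88ff6360486f", "somestring")
    + pad_record("4ed312643e945ec4a5a1a18a7ccd6a70", "yetanotherstring")
    + pad_record("78f640ec8bf82c0f9264c277eb714bcf", "anotherstring")
)
print(read_record(data, 2))
```

With equal-length records, a binary search can work on record indices directly and never has to scan for newlines.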
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 13 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits costs 48 bits (not 12*8=96-bits). This alone would halve the size of the hash files. If the strings are aligned on a 4 byte boundary in the strings file then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
After all these changes; you'd use the highest 16-bits of the hash to find the right hash file, load the hash file and do a binary search. Then (if found) you'd get the offset for the start of the string (in the strings file) from entry in the hash file, plus get the offset for the next string from next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For example, if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with the top bit, bit 63, of the 64-bit integer/hash clear) where possible; and only use some sort of lookup (with the top bit of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
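One possible reversible encoding for lower-case strings is sketched below in Python. It uses bijective base 26 (a=1 through z=26) so that strings of different lengths such as "a" and "aa" map to different integers; that choice is an assumption of this sketch, not something the answer specifies. The largest 13-letter value still fits below 2^63.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encode(s):
    """Map a lower-case ASCII string to a unique non-negative integer."""
    n = 0
    for ch in s:
        n = n * 26 + (ALPHABET.index(ch) + 1)  # digits run from 1 to 26, never 0
    return n

def decode(n):
    """Inverse of encode: recover the original string from the integer."""
    chars = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

for word in ["a", "aa", "somestring", "z" * 13]:
    number = encode(word)
    assert decode(number) == word and number < 2**63
    print(word, number)
```

Strings containing other characters, or longer than 13 letters, would fall back to the lookup path with the top bit set, as described above.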