I think I understand how each of repartition, hive partitioning, and bucketing affect the number of output files, but I am not quite clear on the interaction of the various features. Can someone help fill in the number of output files for each of the below situations where I've left a blank? The intent is to understand what the right code is for a situation where I have a mix of high and low cardinality columns that I need to partition / bucket by, where I have frequent operations that filter on the low cardinality columns, and join on the high cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the # of output files is not constant and will depend on the actual data of the input partitions. To clarify, let's say df is 200 partitions, not 200 files. Output files scale with the # of input partitions, not the # of files; "200 files" could be misleading, since 200 files could correspond to anywhere from 1 partition to thousands of partitions.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I could be wrong on this, but I believe it's a max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets; see the sketch below)
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
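To make the worst-case arithmetic above explicit, here is a tiny hypothetical helper (plain Python, names are mine) that computes the upper bound as input partitions * hive-partition values * bucket count:

def max_output_files(input_partitions, hive_values=1, bucket_count=1):
    # Worst case: every input partition (task) sees every hive-partition/bucket
    # combination and therefore writes one file for each of them.
    return input_partitions * hive_values * bucket_count

print(max_output_files(200, hive_values=10))                    # partition_cols=['colA'] -> 2000
print(max_output_files(200, hive_values=1000))                  # partition_cols=['colB'] -> 200000
print(max_output_files(200, hive_values=10, bucket_count=200))  # colA partitions + 200 colB buckets -> 400000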
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have an effect within the scope of the final stage's tasks, and affect how each task writes its data into files on disk.
In particular, each final-stage task will write one file per hive-partition/bucket combination present in its data. Combinations that are not present in a task's data produce no file at all (not even an empty one) when you're using hive-partitioning or bucketing.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to make sure you repartition your data before writing to make sure the data layout matches your hive-partitioning/bucketing settings (i.e. each hive-partition/bucket combination is not split between multiple tasks), otherwise you could end up writing huge numbers of files.
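As a concrete illustration of "repartition so the layout matches your hive-partitioning/bucketing settings", here is a rough sketch in plain PySpark rather than Foundry's output.write_dataframe; the input path, table name, and session setup are placeholders, and open-source bucketBy requires saveAsTable:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/input")   # placeholder input

# Shuffle so that each hive-partition/bucket combination lives in exactly one task,
# then write with a matching layout: each task writes one file per combination it holds.
(df.repartition(200, "colB")        # match the bucket column and bucket count
   .write
   .partitionBy("colA")             # hive partitioning on the low-cardinality column
   .bucketBy(200, "colB")           # bucketing on the high-cardinality column
   .sortBy("colB")
   .mode("overwrite")
   .saveAsTable("example_output"))  # placeholder table name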
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with 190 empty files, and 10 with data.
*Note: with AQE in spark 3, spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, only 10 will contain data".
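If you want to check the default yourself, a quick illustrative snippet (assuming an existing SparkSession named spark and the DataFrame df) looks like this:

print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" unless overridden
print(df.repartition("colA").rdd.getNumPartitions())    # 200 partitions, only ~10 non-empty (AQE may coalesce this, as noted above)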
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=['colB'], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=['colA'], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition('colA'), bucket_cols=['colA'], bucket_count=10) = ???
Now we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
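To see why a bucket can end up empty, here is a pure-Python analogue of the value-to-bucket assignment; note that Spark uses its own Murmur3-based hash rather than Python's hash, so this only illustrates the collision effect, not the exact assignment:

# Ten distinct colA values hashed into 10 buckets: collisions can leave some buckets unused.
values = ["A%d" % i for i in range(10)]
buckets = {v: hash(v) % 10 for v in values}
print(buckets)
print(len(set(buckets.values())))   # often < 10, i.e. fewer than 10 non-empty buckets/files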
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition('colA', 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, 'colB'), partition_cols=['colA'], bucket_cols=['colB'], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
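For the downstream part of the original question (filter on colA, then join on colB), the read side would look roughly like this in plain PySpark; the table names are placeholders (reusing example_output from the sketch above), and the join only avoids a shuffle on colB if both sides are bucketed on colB with the same bucket count:

# Filter on the hive-partition column first so Spark can prune colA partitions,
# then join on the bucketed column.
left = spark.table("example_output").where("colA = 'some_value'")
right = spark.table("other_table_bucketed_by_colB")   # assumed bucketed by colB, bucket_count=200
joined = left.join(right, on="colB")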
Conclusion
There's no one-size-fits-all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partitioning/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principles.
Hi, I am new to NetLogo with no programming background.
I am trying to create a network of "neighbours" using the GIS extension.
So far I'm using the in-radius function, but I am not sure it's the suitable one,
since I don't understand the unit of the radius in NetLogo.
Here's the code:
to setup
  clear-drawing
  clear-all
  reset-ticks
  ; zoom to study area
  resize-world 0 45 0 20
  set-patch-size 20
  ; load city boundaries
  set mosul-data gis:load-dataset "data/MosulBoundries.shp"
  gis:set-world-envelope gis:envelope-of mosul-data
  gis:apply-coverage mosul-data "Q_NAME_E" neighbor
end

to Neighbour-network
  ;; set 7 neighbour agents inside the city
  ask turtles [
    let target other turtles in-radius 1
    if any? target
      [ ask one-of target [ create-link-with myself ] ]
  ]
  print count links
end
I want each agent, within each neighbourhood, to be linked to its 7 nearest neighbours.
My guess is that something has to change in the line if any? target, but all my attempts have failed so far.
Thanks in advance
I am unclear how GIS relates to this question and you haven't provided the code for creating the agents so I can't give a complete answer. NetLogo has a coordinate system, automatically built in. Each agent has a position on that coordinate system and each patch occupies the space 1 unit by 1 unit square (centred on integer coordinates). The in-radius and distance primitives are in the distance units.
However, if all you want to do is connect to the 7 nearest turtles, you don't need any of that, because NetLogo can simply find those turtles directly as the ones with the minimum distance to the asking turtle. This uses min-n-of to find the given number of turtles with the relevant minimum, and [distance myself] as the reporter to minimise. The whole thing, including creating the links with the generated turtleset, can be done in a single line of code.
Here is a complete model to show you what it looks like:
to testme
clear-all
create-turtles 100 [setxy random-xcor random-ycor]
ask n-of 5 turtles
[ create-links-with min-n-of 7 other turtles [distance myself]
]
end
Sarah:
1) This helped me understand the use of 'in-radius' in NetLogo (and the unit of the radius): if a turtle sitting at the centre of a patch asks for patches in-radius 1, 5 patches are selected (the patch the asking turtle is on and its four orthogonal neighbours, not all 8 neighbouring patches).
2) Consider using 'min-one-of target [ distance myself ]' instead of 'one-of target'.
min-one-of: http://ccl.northwestern.edu/netlogo/docs/dict/min-one-of.html
distance myself: http://ccl.northwestern.edu/netlogo/docs/dict/distance.html
to Neighbour-network
  ; link each turtle to (up to) its 7 nearest neighbours within the radius
  ask turtles [
    let target other turtles in-radius 1
    let counter 0
    while [ any? target and counter < 7 ]
    [ let nearest min-one-of target [ distance myself ]
      ask nearest [ create-link-with myself ]
      set target target with [ self != nearest ]   ; don't pick the same turtle again
      set counter counter + 1
    ]
    show my-links
  ]
end
3) Consider exploring Nw extension: https://ccl.northwestern.edu/netlogo/docs/nw.html
I have a shapefile containing the locations of several thousand people. I want to import this and, for each patch, count the number of people on that exact patch (each person is an entry in the shapefile, but multiple people can be located on the same patch depending on my world size).
I have managed to do so using the following code:
ask patches [ set population-here 0 ]
let population-dataset gis:load-dataset "population_file.shp"
foreach gis:feature-list-of population-dataset [ a ->
  ask patches gis:intersecting a [
    set population-here population-here + 1 ] ]
However, it takes several hours to load the dataset in a world running from -300 to 300. Is there a faster way of counting the number of individual entries for each patch?
The population should be placed on an underlying shapefile of an area. This area is imported as follows:
let area-dataset gis:load-dataset "area.shp"
ask patches [set world? false]
gis:set-world-envelope gis:envelope-of area-dataset
ask patches gis:intersecting area-dataset
[ set pcolor grey
set world? true
]
Okay, I can't test this and I'm not entirely confident in this answer as I use GIS very rarely. But I suggest you adapt the code in the GIS general examples in the NetLogo model library (see File menu). What you appear to want to do is create a turtle breed for your people, and create a person at each location where there is a person in your population dataset.
breed [people person]
patches-own [population]

to setup
  ; < all the other stuff you have >
  let population-dataset gis:load-dataset "population_file.shp"
  foreach gis:feature-list-of population-dataset
  [ thisFeature ->
    ; centroid of the feature, converted to NetLogo world coordinates
    let location gis:location-of gis:centroid-of thisFeature
    if not empty? location
    [ create-people 1
      [ set xcor item 0 location
        set ycor item 1 location
      ]
    ]
  ]
  ask patches [ set population count people-here ]
end
You can also import other variables from the population set (eg gender or age group) and have those variables transfer to appropriate attributes of your NetLogo people.
If you haven't found it yet, I recommend this tutorial https://simulatingcomplexity.wordpress.com/2014/08/20/turtles-in-space-integrating-gis-and-netlogo/.
Note that this assumes there is a reason why you want the people in the correct (as defined by the GIS dataset) position for your model rather than simply having some sort of population count (or density) in your GIS file and then create the people in NetLogo on the correct patch.
I have a CSV file with two columns:

no. of packet    size
1                60
2                70
3                400
4                700
...
1000000          60

where the first column is the number of the packet and the second column is the size of the packet in bytes.
The total number of packets in the CSV file is one million. I need to plot a histogram for this data file with:
xrange = [0, 5, 10, 15]
which denotes the packet size in bytes: bin [0] holds packets smaller than 100 bytes, bin [5] holds packets smaller than 500 bytes, and so on.
yrange = [10, 100, 10000, 100000000]
which denotes the number of packets.
Any help will be highly appreciated.
Don't quite remember exactly how this works, but the commands given in my Gnuplot in Action book for creating a histogram are
bin(x,s) = s*int(x/s)
plot "data-file" using (bin(1,0.1)):(1./(0.1*300)) smooth frequency with boxes
I believe smooth frequency is the command that's important to you, and you need to figure out what the using argument should be (possibly with a different function used).
This should do the job:
# binning function for arbitrary ranges, change as needed
bin(x) = x<100 ? 0 : x<500 ? 5 : x<2500 ? 10 : 15
# every occurrence is counted as (1)
plot datafile using (bin($2)):(1) smooth freq with boxes
I'm not really sure what you mean by "yrange [10 100 1000 ...]"; do you want a log-scaled ordinate?
Then just
set xrange [1:1e6]
set logscale y
before plotting.
I want to import CSV files with about 40 million lines into Neo4j. For this I am trying to use the "batch importer" from https://github.com/jexp/batch-import.
Maybe the problem is that I provide my own IDs. This is the example:
nodes.csv
i:id l:label
315041100 Person
201215100 Person
315041200 Person
rels.csv:
start end type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
the content of batch.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1000M
neostore.relationshipstore.db.mapped_memory=5000M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2000M
neostore.propertystore.db.arrays.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=1500M
neostore.propertystore.db.index.mapped_memory=1500M
batch_import.node_index.node_auto_index=exact
./import.sh graph.db nodes.csv rels.csv
runs without errors, but it takes about 60 seconds!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 54 seconds
When I use smaller IDs - for example 3150411 instead of 315041100 - it takes just 1 second!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 1 seconds
Actually I would like to use even bigger IDs with 10 digits. I don't know what I'm doing wrong. Can anyone see an error?
JDK 1.7
batchimporter 2.1.3 (with neo4j 2.1.3)
OS: ubuntu 14.04
Hardware: 8-Core-Intel-CPU, 16GB RAM
I think the problem is that the batch importer is interpreting those IDs as actual physical ids on disk. And so the time is spent in the file system, inflating the store files up to the size where they can fit those high ids.
The IDs that you're giving are intended to be "internal" to the batch import, right? Although I'm not sure how to tell the batch importer that this is the case.
#michael-hunger any input there?
The problem is that those IDs are internal to Neo4j, where they represent disk record IDs. If you provide high values there, Neo4j will create a lot of empty records until it reaches your IDs.
So either you create your node IDs starting from 0 and store your original ID as a normal node property,
or you don't provide node IDs at all and only look up nodes via their "business ID" value. The first option looks like this:
i:id id:long l:label
0 315041100 Person
1 201215100 Person
2 315041200 Person
start:id end:id type relart
0 1 HAS_RELATION 30006
2 0 HAS_RELATION 30006
or you have to configure and use an index:
id:long:people l:label
315041100 Person
201215100 Person
315041200 Person
id:long:people id:long:people type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
HTH Michael
Alternatively you can also just write a small Java or Groovy program to import your data if handling those IDs with the batch importer is too tricky.
See: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
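If renumbering the IDs up front is easier than configuring an index, a small preprocessing script along the lines of Michael's first option could rewrite the files before import; the file names and the tab delimiter here are assumptions about your setup:

import csv

# Renumber nodes 0..N-1 for the batch importer and keep the original
# business ID as a normal "id:long" node property (Michael's first option).
# Note: keeps the whole ID map in memory, which is workable for ~40M rows.
id_map = {}

with open("nodes.csv") as src, open("nodes_renumbered.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)                                   # skip original header
    writer.writerow(["i:id", "id:long", "l:label"])
    for new_id, row in enumerate(reader):
        business_id, label = row[0], row[1]
        id_map[business_id] = new_id
        writer.writerow([new_id, business_id, label])

with open("rels.csv") as src, open("rels_renumbered.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)                                   # skip original header
    writer.writerow(["start", "end", "type", "relart"])
    for start, end, rel_type, relart in reader:
        writer.writerow([id_map[start], id_map[end], rel_type, relart])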