Efficient way to read all the data from a large couchbase bucket - couchbase

We have about 80M product in a bucket in couchbase and we need to daily read all data from it and do some calculations on them.
I am using select queries with 512 limit on each select and a for while to read data from each collecion but sometimes I get timout error.
Is there any better and efficient way to read data from couchbase ?
for x_range in range(math.ceil(total_couchbase / buffer_size)):
mycollection = couchbase_cluster.query(get_select_query(cluster_name, scope, collection_name, buffer_size, x_range * buffer_size))
for current_document in mycollection:
# do some calculations

Related

How many files are output by a Foundry Transform in various combinations of repartition, hive partitioning, and bucketing?

I think I understand how each of repartition, hive partitioning, and bucketing affect the number of output files, but I am not quite clear on the interaction of the various features. Can someone help fill in the number of output files for each of the below situations where I've left a blank? The intent is to understand what the right code is for a situation where I have a mix of high and low cardinality columns that I need to partition / bucket by, where I have frequent operations that filter on the low cardinality columns, and join on the high cardinality columns.
Assume that we have a data frame df that starts with 200 input partitions, colA has 10 unique values, and colB has 1000 unique values.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files?
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions 10)
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
For hive partitioning + bucketing, the # of output files is not constant and will depend on the actual data of the input partition.To clarify, let's say df is 200 partitions, not 200 files. Output files scale with # of input partitions, not # of files. 200 files could be misleading as that could be 1 partition to 1000's of partitions.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
df.repartition('colB') = 1000 output files
df.repartition(50, 'colA') = 50 output files
df.repartition(50, 'colB') = 50 output files
Hive partitions:
output.write_dataframe(df, partition_cols=['colA']) = upper bound of 2,000 output files (200 input partitions * max 10 values per partition)
output.write_dataframe(df, partition_cols=['colB']) = max 200,000 output files (200 * 1000 values per partition)
output.write_dataframe(df, partition_cols=['colA', 'colB']) = max 2,000,000 output files (200 partitions * 10 values * 1000)
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Bucketing:
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = max 20,000 files (200 partitions * max 100 buckets per partition)
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = max 2,000 files (200 partitions * max 10 buckets per partition)
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = exactly 10 files (repartitioned dataset makes 10 input partitions, each partition outputs to only 1 bucket)
All together now:
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I could be wrong on this, but I believe it's max of 400,000 output files (200 input partitions * 10 colA partitions * 200 colB buckets)
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = I believe this is exactly 10,000 output files (repartition colA,colB = 10,000 partitions, each partition contains exactly 1 colA and 1 bucket of colB)
Background
The key to being able to reason about output file counts is understanding at which level each concept applies.
Repartition (df.repartition(N, 'colA', 'colB')) creates a new spark stage with the data shuffled as requested, into the specified number of shuffle partitions. This will change the number of tasks in the following stage, as well as the data layout in those tasks.
Hive partitioning (partition_cols=['colA', 'colB']) and bucketing (bucket_cols/bucket_count) only have any effect within the scope of the final stage's tasks, and effect how the task writes its data into files on disk.
In particular, each final stage task will write one file per hive-partition/bucket combination present in its data. Combinations not present in that task will not write an empty file if you're using hive-partitioning or bucketing.
Note: if not using hive-partitioning or bucketing, each task will write out exactly one file, even if that file is empty.
So in general you always want to make sure you repartition your data before writing to make sure the data layout matches your hive-partitioning/bucketing settings (i.e. each hive-partition/bucket combination is not split between multiple tasks), otherwise you could end up writing huge numbers of files.
Your examples
I think there is some misunderstanding floating around, so let's go through these one by one.
First a few ones to check my understanding:
df.repartition(100) = 100 output files of the same size
Yes - the data will be randomly, evenly shuffled into 100 partitions, causing 100 tasks, each of which will write exactly one file.
df.repartition('colA') = 10 output files of different sizes, since each file will contain all rows for 1 value of colA
No - the number of partitions to shuffle into is unspecified, so it will default to 200. So you'll have 200 tasks, at most 10 of which will contain any data (could be fewer due to hash collisions), so you will end up with 190 empty files, and 10 with data.
*Note: with AQE in spark 3, spark may decide to coalesce the 200 partitions into fewer when it realizes most of them are very small. I don't know the exact logic there, so technically the answer is actually "200 or fewer, only 10 will contain data".
df.repartition('colB') = 1000 output files
No - Similar to above, the data will be shuffled into 200 partitions. However in this case they will (likely) all contain data, so you will get 200 roughly-equally sized files.
Note: due to hash collisions, files may be larger or smaller depending on how many values of colB happened to land in each partition.
df.repartition(50, 'colA') = 50 output files?
Yes - Similar to before, except now we've overridden the partition count from 200 to 50. So 10 files with data, 40 empty. (or fewer because of AQE)
df.repartition(50, 'colB') = 50 output files, so some files will contain more than one value of colB?
Yes - Same as before, we'll get 50 files of slightly varying sizes depending on how the hashes of the colB values work out.
Hive partitions:
(I think the below examples are written assuming df is in 100 partitions to start rather than 200 as specified, so I'm going to go with that)
output.write_dataframe(df, partition_cols=['colA']) = 1,000 output files (because I get potentially 100 files in each of the 10 hive partitions 10)
Yes - You'll have 100 tasks, each of which will write one file for each colA value they see. So up to 1,000 files in the case the data is randomly distributed.
output.write_dataframe(df, partition_cols=['colB']) = 10,000 output files
No - Missing a 0 here. 100 tasks, each of which could write as many as 1,000 files (one for each colB value), for a total of up to 100,000 files.
output.write_dataframe(df, partition_cols=['colA', 'colB']) = 100,000 output files
No - 100 tasks, each of which will write one file for each combination of partition cols it sees. There are 10,000 such combinations, so this could write as many as 100 * 10,000 = 1,000,000 files!
output.write_dataframe(df.repartition('colA'), partition_cols=['colA']) = 10 output files of different sizes (1 file in each hive partition)
Yes - The repartition will shuffle our data into 200 tasks, but only 10 will contain data. Each will contain exactly one value of colA, so will write exactly one file. The other 190 tasks will write no files. So 10 files exactly.
Bucketing:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, bucket_cols=[‘colB’], bucket_count=100) = 100 output files? In an experiment, this did not seem to be the case
No - Since we haven't laid out the data carefully, we have 100 tasks with (maybe) randomly distributed data. Each task will write one file per bucket it sees. So this could write up to 100 * 100 = 10,000 files!
output.write_dataframe(df, bucket_cols=[‘colA’], bucket_count=10) = 10 output files?
No - Similar to above, 100 tasks, each could write up to 10 files. So worst-case is 1,000 files here.
output.write_dataframe(df.repartition(‘colA’), bucket_cols=[‘colA’], bucket_count=10) = ???
Now we're adjusting the data layout before writing, we'll have 200 tasks, at most 10 of which will contain any data. Each value of colA will exist in only one task.
Each task will write one file per bucket it sees. So we should get at most 10 files here.
Note: Due to hash collisions, one or more buckets might be empty, so we might not get exactly 10.
All together now:
Again, assuming 100 partitions for df, not 200
output.write_dataframe(df, partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ???
100 tasks. 10 hive-partitions. 200 buckets.
Worst case is each task writes one file per hive-partition/bucket combination. i.e. 100 * 10 * 200 = 200,000 files.
output.write_dataframe(df.repartition(‘colA’, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200) = ??? -- Is this the command that I want to use in the end? And anything downstream would first filter on colA to take advantage of the hive partitioning, then join on colB to take advantage of the bucketing?
This one is sneaky. We have 200 tasks and the data is shuffled carefully so each colA/colB combination is in just one task. So everything seems good.
BUT each bucket contains multiple values of colB, and we have done nothing to make sure that an entire bucket is localized to one spark task.
So at worst, we could get one file per value of colB, per hive partition (colA value). i.e. 10 * 1,000 = 10,000 files.
Given our particular parameters, we can do slightly better by just focusing on getting the buckets laid out optimally:
output.write_dataframe(df.repartition(200, ‘colB’), partition_cols=[‘colA’], bucket_cols=[‘colB’], bucket_count=200)
Now we're making sure that colB is shuffled exactly how it will be bucketed, so each task will contain exactly one bucket.
Then we'll get one file for each colA value in the task (likely 10 since colA is randomly shuffled), so at most 200 * 10 = 2,000 files.
This is the best we can do, assuming colA and colB are not correlated.
Conclusion
There's no one-size fits all approach to controlling file sizes.
Generally you want to make sure you shuffle your data so it's laid out in accordance with the hive-partition/bucketing strategy you're applying before writing.
However the specifics of what to do may vary in each case depending on your exact parameters.
The most important thing is to understand how these 3 concepts interact (as described in "Background" above), so you can reason about what will happen from first principals.

Big binary data share between processes

I have a big binary data iof ip data about Xmb. Processes use binary do some search algorithm to lookup ip address. I have three method.
1. put in ets. but I suppose every read access will copy big binary to process. :(
2. put in gen_server state. processes use gen_server:call to get address.The short coming concurrency.
3. compile binary into beam. but when I compile get
eheap_alloc: Cannot allocate 1318267840 bytes of memory (of type "heap")
which the best practice of big data share in erlang?
Binaries over 64 bytes in size are stored as reference counted binaries and their data is stored outside the heap of any process. If such a binary is sent to any process, the underlying data is not duplicated. So, if you store such a binary in an ETS table and then access it from various processes, the underlying data will not be copied, only its reference count will be incremented/decremented. I'd suggest going with the ETS table solution.
Here's a demonstration of the memory usage at boot, after inserting a 100MB binary into an ETS table, and after fetching a copy of the binary into the shell process. The memory usage does not change after we have a copy binary stored in the shell process. The same would not be true if it was million character string (list of integers) that we were copying in from ETS or another process.
1> erlang:memory().
[{total,21912472},
{processes,5515456},
{processes_used,5510816},
{system,16397016},
{atom,223561},
{atom_used,219143},
{binary,844872},
{code,4808780},
{ets,301232}]
2> ets:new(foo, [named_table, set]).
foo
3> ets:insert(foo, {foo, binary:copy(<<".">>, 104857600)}).
true
4> erlang:memory().
[{total,127038632},
{processes,5600320},
{processes_used,5599952},
{system,121438312},
{atom,223561},
{atom_used,220445},
{binary,105770576},
{code,4908097},
{ets,308416}]
5> X = ets:lookup(foo, foo).
[{foo,<<"........................................................................................................"...>>}]
6> erlang:memory().
[{total,127511632},
{processes,6082360},
{processes_used,6081992},
{system,121429272},
{atom,223561},
{atom_used,220445},
{binary,105761504},
{code,4908097},
{ets,308416}]
You can find a lot more info about how to efficiently work with binaries in Erlang in the link above.

Hive query does not begin MapReduce process after starting job and generating Tracking URL

I'm using Apache Hive.
I created a table in Hive (similar to external table) and loaded data into the same using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc), Hive gives me errors when I ask for data involving slight computations (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive cli output is as shown in the following link -
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
Not sure where I have gone wrong (many answers on StackOverflow tend to point back to YARN configs (environment config detailed below).
I have the log as well but it contains more than 30000 characters (Stack Overflow limit)
My hadoop environment is configured as follows -
1 Name Node & 1 Data Node. Each has 20 GB of RAM with sufficient ROM. Have allocated 13 GB of RAM for the yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb each with the mapreduce.map.memory.mb being set as 4 GB and the mapreduce.reduce.memory.mb being set as 12 GB. Number of reducers is currently set to default (-1). Also, Hive is configured to run with a MySQL DB (rather than Derby).
You should set the appropriate values to the properties show in your trace,
eg: Edit the properties in hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value></property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error as the number of reducers is missing in Hive config.
Try setting it using below command:
Hive> SET mapreduce.job.reduces=XX
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
I managed to get Hive and MR to work - increased the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
Also incorporated parts of the answers by #Sathiyan S and #vmorusu - set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through application of its heuristic techniques).

JMeter: Capturing Throughput in Command Line Interface Mode

In Jmeter v2.13, is there a way to capture Throughput via non-GUI/Command Line mode?
I have the jmeter.properties file configured to output via the Summariser and I'm also outputting another [more detailed] .csv results file.
call ..\..\binaries\apache-jmeter-2.13\bin\jmeter -n -t "API Performance.jmx" -l "performanceDetailedResults.csv"
The performanceDetailedResults.csv file provides:
timeStamp
elapsed time
responseCode
responseMessage
threadName
success
failureMessage
bytes sent
grpThreads
allThreads
Latency
However, no amount of tweaking the .properties file or the test itself seems to provide Throuput results like I get via the GUI's Summary Report's Save Table Data button.
All articles, postings, and blogs seem to indicate it wasn't possible without manual manipulation in a spreadsheet. But I'm hoping someone out there has figured out a way to do this with no, or minimal, manual manipulation as the client doesn't want to have to manually calculate the Throughput value each time.
It is calculated by JMeter Listeners so it isn't something you can enable via properties files. Same applies to other metrics which are calculated like:
Average response time
50, 90, 95, and 99 percentiles
Standard Deviation
Basically throughput is calculated as simple as dividing total number of requests by elapsed time.
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time)
Hopefully it won't be too hard for you.
References:
Glossary #1
Glossary #2
Did you take a look at JMeter-Plugins?
This tool can generate aggregate report through commandline.

R: NaiveBayes incrementally on a large data set

I have a large data set in a MySQL database (at least 11 GB of data). I would like to train a NaiveBayes model on the entire set and then test is against a smaller but also quite large data set (~3 GB).
The second part seems feasible - I assume that I would run the following in a loop:
data_test <- sqlQuery(con, paste("select * from test_data LIMIT 10000", "OFFSET", (i*10000) ))
model_pred <- predict(model, data_test, type="raw")
...and then dump the predictions back to MySQL or a CSV.
How can I, however, train my model incrementally on such a large data set? I noticed in the R documentation of the function (http://www.inside-r.org/packages/cran/e1071/docs/naiveBayes) that there is an addtional argument in the predict function "newdata" which suggests that incremental learning is possible. The predict function however will return the predictions and not a new model.
Please provide me with an example of how to incrementally train my model.