Cassandra sstableloader: load data from CSV with various partition keys

I want to load a large CSV file into my Cassandra cluster (a single node at the moment).
Based on http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
my data is transformed by CQLSSTableWriter into SSTable files, then I use sstableloader to load those SSTables into a Cassandra table that already contains some data.
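Roughly, the writer step looks like this (a simplified sketch; the keyspace, table, and column names here are placeholders, not my real schema):

    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class CsvToSSTables {
        public static void main(String[] args) throws Exception {
            // Placeholder keyspace/table/columns for illustration only.
            String schema = "CREATE TABLE my_ks.my_table (pk text, ck int, value text, "
                          + "PRIMARY KEY (pk, ck))";
            String insert = "INSERT INTO my_ks.my_table (pk, ck, value) VALUES (?, ?, ?)";

            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                    .inDirectory("/tmp/my_ks/my_table")   // <keyspace>/<table> directory layout
                    .forTable(schema)
                    .using(insert)
                    .build();

            // One addRow per parsed CSV line (CSV parsing omitted here).
            writer.addRow("some-partition-key", 1, "some value");
            writer.close();
            // Afterwards: sstableloader -d <node address> /tmp/my_ks/my_table
        }
    }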
That CSV file contains various partition keys.
Now let's assume that a multi-node Cassandra cluster is used.
My questions:
1) Is the loading procedure that I use correct in the case of a multi-node cluster?
2) Will those SSTable files be split by sstableloader and sent to the nodes responsible for the specific partition keys?
Thank you

1) Loading into a single-node cluster or a 100-node cluster is the same. The only difference is that the data will be distributed around the ring if you have a multi-node cluster. The node where you run sstableloader becomes the coordinator (as @rtumaykin already stated) and will send the writes to the appropriate nodes.
2) No. As in my response above, the "splitting" is done by the coordinator. Think of the sstableloader utility as just another instance of a client sending writes to the cluster.
3) In response to your follow-up question, the sstableloader utility is not sending files to nodes but sending writes of the rows contained in those SSTables. The sstableloader reads the data and sends write requests to the cluster.

Yes.
It will actually be done by the coordinator node, not by sstableloader itself.

Related

Converting JSON .gz files into Delta Tables

I have Datadog log archives streaming to an Azure Blob, each stored as a single 150MB JSON file compressed into a 15MB .gz file. These are being generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective solution to get this data into Delta Lake?
From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Has anyone done this successfully without breaking the bank?
You are right about the big downside of the gzip format: it is not splittable, and therefore a file cannot be distributed across all your workers and cores; a single core has to read each archive in its entirety and decompress it in one batch.
The only sensible workaround I've used myself is to give the driver only a few cores, but the most powerful ones possible. Since you are using Azure Blob storage, I assume you are running Databricks on Azure as well; look through the available Azure VM types and pick the one with the fastest cores.
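For completeness, the read/write itself is simple; here is a minimal sketch in Java Spark (the paths are placeholders and it assumes the Delta Lake libraries are available on the cluster):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DatadogGzJsonToDelta {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("datadog-gz-json-to-delta")
                    .getOrCreate();

            // Spark decompresses .gz transparently, but because gzip is not
            // splittable each archive file is read by a single task.
            Dataset<Row> logs = spark.read()
                    .json("wasbs://archive@myaccount.blob.core.windows.net/datadog/*.json.gz"); // placeholder path

            logs.write()
                    .format("delta")
                    .mode("append")
                    .save("/mnt/delta/datadog_logs"); // placeholder Delta table location

            spark.stop();
        }
    }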

How can we use streaming in Spark from multiple sources? E.g. first take data from HDFS and then consume a stream from Kafka

The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS.
New data also arrives in the database every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources and do analytics in near real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL directly over JDBC rather than needing HDFS exports; see the sketch below.
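As a rough sketch of that tip (Java Spark; the connection details, table name, and partitioning bounds below are placeholders), reading MySQL in parallel over JDBC and subscribing to the Kafka topic could look like this:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CombineMysqlAndKafka {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("mysql-plus-kafka")
                    .getOrCreate();

            // Batch read of the existing transactions straight from MySQL over JDBC,
            // split into parallel partitions by a numeric column.
            Dataset<Row> history = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://mysql-host:3306/sales")   // placeholder
                    .option("dbtable", "transactions")                     // placeholder
                    .option("user", "report_user")
                    .option("password", "***")
                    .option("partitionColumn", "id")
                    .option("lowerBound", "1")
                    .option("upperBound", "50000000")
                    .option("numPartitions", "16")
                    .load();

            // Structured Streaming read of the new records arriving through Kafka.
            Dataset<Row> updates = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")      // placeholder
                    .option("subscribe", "transactions")                   // placeholder topic
                    .load();

            // Joining/aggregating the two and persisting the results is left out here;
            // the point is that the HDFS-export step disappears entirely.
        }
    }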
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could be more than a few minutes).
However, if you use Debezium to get data from MySQL into Kafka, then you have the data centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases and query patterns.

Automatic distributed reading of a CSV file in JMeter distributed load testing?

My scenario: while doing distributed load testing through JMeter, I want the CSV file to be read in an automatically distributed manner. For example:
If I have 100 user entries in the CSV Data Set Config file and the number of slave servers is 10, then in the normal scenario I have to split the CSV entries and arrange them manually, like:
users 1 to 10 at slave-1
users 11 to 20 at slave-2
.
.
.
users 91 to 100 at slave-10
Instead, I want the same CSV file, containing the entries for all 100 users, to be placed on every slave, and have JMeter automatically read entries from these files and distribute them.
You cannot do that at the moment, but if you want a configuration in which each thread uses unique data even across slave machines, then you should use different test data files on different slave machines.
You have to place the test data file on each slave machine (at the same location as on the master). JMeter will use the test data from the slave machine and not from the master, so placing a different data set on each machine will ensure uniqueness.
JMeter doesn't provide such functionality out of the box, so the only option I can think of is reading the required X lines, at an offset that depends on the slave's hostname or IP address, somewhere in a setUp Thread Group using a JSR223 Sampler and the Groovy language, and writing that range of lines into a new file which is then used by the CSV Data Set Config. Something like the sketch below.
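A rough sketch of that idea, written as the JSR223 script body in plain Java-style syntax (which the Groovy engine accepts as-is); the host list and file names are assumptions you would adapt:

    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;

    // Assumption: the full CSV (all 100 users) is present on every slave.
    List<String> all = Files.readAllLines(Paths.get("users.csv"), StandardCharsets.UTF_8);

    // Assumption: you maintain the slave host list yourself; JMeter has no "slave index".
    List<String> hosts = Arrays.asList("slave-1", "slave-2", "slave-3"); // ... up to slave-10
    int idx = hosts.indexOf(InetAddress.getLocalHost().getHostName());
    if (idx < 0) { idx = 0; } // fall back if this hostname is not in the list

    int slice = all.size() / hosts.size();
    int start = idx * slice;
    int end = (idx == hosts.size() - 1) ? all.size() : start + slice; // last slave takes the remainder

    // Write this slave's slice to the file referenced by the CSV Data Set Config.
    Files.write(Paths.get("users_local.csv"), all.subList(start, end), StandardCharsets.UTF_8);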
Another possible solution would be going for the HTTP Simple Table Server plugin; its READ endpoint allows removing a value after reading it, so you will have unique data across all slaves.

Loading NoSQL data into Spark nodes

I am trying to understand what happens when I load data into Spark from a NoSQL source, i.e. will it try to load the records into the driver and then distribute them to the worker nodes, or will it load records into all the worker nodes simultaneously? Basically, is there any way to load the data in parallel, and if yes, how do I ensure that the same record is not processed by more than one node?
If it is not a parallel process, would writing the same JSON into a ".json" file help (provided each line is a record)?
It will always load directly to the workers. Depending on the source of the data and how it is stored, it can be loaded in parallel. When the data is loaded, it will be sharded into non-overlapping sets of rows, so you won't have to worry about processing the same data twice. The file format is irrelevant. Which data source are you loading from (Mongo, Cassandra, HBase)? I can give a better answer if you tell me the source system.
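Purely as an illustration (assuming Cassandra as the source and the spark-cassandra-connector on the classpath; the keyspace and table names are placeholders), a load like the one below is planned as token-range scans that the executors run in parallel, each reading a disjoint slice of the table:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CassandraParallelLoad {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("cassandra-parallel-load")
                    .config("spark.cassandra.connection.host", "cassandra-host") // placeholder
                    .getOrCreate();

            // Each Spark partition maps to a set of Cassandra token ranges, so the
            // executors read disjoint slices of the table directly, not via the driver.
            Dataset<Row> df = spark.read()
                    .format("org.apache.spark.sql.cassandra")
                    .option("keyspace", "my_ks")   // placeholder
                    .option("table", "my_table")   // placeholder
                    .load();

            System.out.println("partitions: " + df.rdd().getNumPartitions());
            spark.stop();
        }
    }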

Where can I find the clear definitions for a Couchbase Cluster, Couchbase Node and a Couchbase Bucket?

I am new to Couchbase and NoSQL terminology. From my understanding, a Couchbase node is a single system running the Couchbase Server application, and a collection of such nodes that share data through replication forms a Couchbase cluster.
Also, a Couchbase bucket is somewhat like a table in an RDBMS, in which you put your documents. But how do I relate the node to the bucket? Can someone please explain it to me in simple terms?
a Node is a single machine (one IP/hostname) that runs Couchbase Server.
a Cluster is a group of Nodes that talk to each other. Data is distributed between the nodes automatically, so the load is balanced. The cluster can also provide replication of data for resilience.
a Bucket is the "logical" entity where your data is stored. It is both a namespace (like a database schema) and a table, to some extent. You can store multiple types of data in a single bucket; it doesn't care what form the data takes as long as it is a key and its associated value (so you can store users, apples, and oranges in the same Bucket).
The bucket gives the level of granularity for things like configuration (how much of the available memory do you want to dedicate to this bucket?), replication factor (how many backup copies of each document do you want on other nodes?), password protection...
Remember that I said Buckets are a "logical" entity? They are in fact divided into 1024 virtual fragments (vBuckets) which are spread across all the nodes of the cluster (that's how data distribution is achieved).
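To make the node/cluster/bucket relationship concrete, here is a minimal sketch using the (2.x) Couchbase Java SDK; the hostnames, bucket name, and password are placeholders. Note that you connect to the cluster (bootstrapping from one or more known nodes) and then open a bucket; you never address an individual node for data operations:

    import com.couchbase.client.java.Bucket;
    import com.couchbase.client.java.Cluster;
    import com.couchbase.client.java.CouchbaseCluster;
    import com.couchbase.client.java.document.JsonDocument;
    import com.couchbase.client.java.document.json.JsonObject;

    public class CouchbaseHello {
        public static void main(String[] args) {
            // Bootstrap against the cluster via any known node(s); placeholders below.
            Cluster cluster = CouchbaseCluster.create("node1.example.com", "node2.example.com");

            // Open the logical bucket (with its optional password); the SDK then talks
            // directly to whichever node owns each document's vBucket.
            Bucket bucket = cluster.openBucket("app-data", "bucket-password");

            JsonObject user = JsonObject.create().put("name", "Alice").put("type", "user");
            bucket.upsert(JsonDocument.create("user::42", user));

            System.out.println(bucket.get("user::42").content());
            cluster.disconnect();
        }
    }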