Loading NoSQL data into Spark nodes - JSON

I am trying to understand what happens when I load data into Spark from a NoSQL source, i.e. will it try to load the records into the driver and then distribute them to the worker nodes, or will it load records into all the worker nodes simultaneously? Basically, is there any way to load the data in parallel, and if so, how do I ensure the same record is not processed by more than one node?
If it is not a parallel process, would writing the same JSON into a ".json" file help (provided each line is a record)?

It will always load directly to the workers. Depending on the source of the data and how it is stored, it may be possible to load it in parallel. When the data is loaded, it will be sharded into non-overlapping rows, so you won't have to worry about processing the same data twice. The file format is irrelevant. Which data source are you loading from (Mongo, Cassandra, HBase)? I can give a better answer if you tell me the source system.
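For example, if the source were Cassandra, a minimal PySpark sketch might look like the following (assuming the DataStax spark-cassandra-connector is on the classpath; the host, keyspace, and table names are placeholders):

```python
# Minimal sketch, assuming the DataStax spark-cassandra-connector package is available.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nosql-parallel-load")
    .config("spark.cassandra.connection.host", "10.0.0.1")  # hypothetical contact point
    .getOrCreate()
)

# The connector maps Cassandra token ranges to Spark partitions, so each executor
# reads a disjoint slice of the table in parallel and no record is read twice.
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholder names
    .load()
)

print(df.rdd.getNumPartitions())  # one partition per token-range split
```

The Mongo and HBase connectors behave similarly: the read is planned on the driver, but the records themselves flow straight from the source partitions/shards to the executors.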

Related

Converting JSON .gz files into Delta Tables

I have Datadog log data archives streaming to an Azure Blob, stored as a single 150 MB JSON file compressed into a 15 MB .gz file. These are being generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective solution to get this data into Delta Lake?
From what I understand the driver that unpacks this data can only run on a single node spark cluster, which will take a very long time and cost a lot of DBU's.
Has anyone done this successfully without breaking the bank?
Yes, that's the big downside of the gzip format - it is not splittable and therefore cannot be distributed across all your workers and cores; the driver has to load a file in its entirety and decompress it in a single batch.
The only sensible workaround I've used myself is to give the driver only a few cores, but as powerful ones as possible. I assume that since you are using Azure Blob you are using Databricks on Azure as well, so you can look through the available Azure VM types and pick the one with the fastest cores.
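For the Delta side, a rough sketch of the read-and-append step (assuming a Databricks runtime with Delta Lake available; the storage account, container, and paths are placeholders):

```python
# Rough sketch, assuming Delta Lake is available (it is bundled with the Databricks runtime).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datadog-gz-to-delta").getOrCreate()

# Each .gz file is read and decompressed by a single task because gzip is not splittable,
# but many small files can still be processed concurrently across the cluster.
raw = spark.read.json("abfss://logs@mystorageaccount.dfs.core.windows.net/datadog/*.json.gz")

# Append each 5-minute batch into a Delta table for downstream analytics.
(raw.write
    .format("delta")
    .mode("append")
    .save("abfss://lake@mystorageaccount.dfs.core.windows.net/delta/datadog_logs"))
```

Since a new archive arrives every 5 minutes, this could also be wired up with Auto Loader or a scheduled job, but the core read/write is the same.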

SSIS Cache Manager

Is it possible to use an SSIS Cache Manager with anything other than a Lookup? I would like to use similar data across multiple data flows.
I haven't been able to find a way to cache this data in memory in a cache manager and then reuse it in a later flow.
Nope - the Cache Connection Manager was specific to solving Lookup tasks, originally only allowing an OLE DB connection to be used.
However, if you have a set of data that you want to be static for the life of a package run and usable across data flows, or even other packages, as a table-like entity, perhaps you're looking for a Raw File. It's a tight, binary representation of the data stored to disk. Since it's stored to disk, you will pay a write and a subsequent read performance penalty, but it's likely that the files are right-sized so that any penalty is offset by how well it meets your specific needs.
The first step is to define the data that will go into the Raw File and connect a Raw File Destination. That will involve creating a Raw File Connection Manager, where you define where the file lives and the rules about the data in it (recreate, append, etc.). At this point, run the data flow task so the file is created and populated.
The next step is that everywhere you want to use the data, you patch in a Raw File Source. It will behave much like any other data source in your toolkit at this point.

Loading 20 million records from SSIS to Snowflake through ODBC

I am trying to load around 20 million records from SSIS to Snowflake using an ODBC connection, and this load is taking forever to complete. Is there any faster method than using ODBC? I can think of loading the data into a flat file and then using that flat file to load it into Snowflake, but I am not sure how to do it.
Update:
I generated a text file using bcp, then put that file on a Snowflake stage using the ODBC connection, and then used the COPY INTO command to load the data into the tables.
Issue: the generated text file is 2.5 GB, and ODBC is struggling to send the file to the Snowflake stage. Any help on this part?
It should be faster to write compressed objects to the cloud provider's object store (AWS S3, Azure Blob, etc.) and then COPY INTO Snowflake, but it is also more complex.
You aren't, by chance, writing one row at a time, for 20,000,000 database calls?
ODBC is slow on a database like this. Snowflake (and similar columnar warehouses) also wants to eat shredded files, not single large ones. The problem with your original approach is that no method of ODBC usage is going to be particularly fast on a system designed to load nodes in parallel from shredded, staged files.
The problem with your second approach is that no shredding took place. Non-columnar databases with a head node (say, Netezza) would happily eat and shred your single file, but Snowflake or Redshift is basically going to ingest it as a single thread on a single node. Thus your ingest of a single 2.5 GB file is going to take the same amount of time on an XS 1-node Snowflake warehouse as on an L 8-node cluster; the single node doing the work is not saturated and has plenty of CPU cycles to spare, doing nothing. Snowflake appears to use up to 8 write threads per node for an extract or ingest operation. You can see some tests here: https://www.doyouevendata.com/2018/12/21/how-to-load-data-into-snowflake-snowflake-data-load-best-practices/
My suggestion would be to make at least 8 files of size 2.5 GB / 8, i.e. about eight ~315 MB files; for a 2-node warehouse, at least 16. This likely involves some effort in your file-creation process if it does not natively shred and scale horizontally, although as a bonus it breaks your data up into bite-sized pieces that are easier to abort/resume/etc. should any problems occur.
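As a rough illustration of the shredding step (a hypothetical helper, not part of any Snowflake tooling; paths and the chunk count are placeholders):

```python
# Hypothetical helper that shreds one large delimited text file into N smaller files
# so Snowflake can load them in parallel. Paths and the chunk count are placeholders.
import os

def shred_file(src_path, out_dir, num_chunks=8):
    """Split src_path into roughly num_chunks files, keeping whole lines together."""
    os.makedirs(out_dir, exist_ok=True)
    target_bytes = os.path.getsize(src_path) // num_chunks
    outputs, idx, written, out = [], 0, 0, None
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new chunk when the current one is full (the last chunk absorbs the remainder).
            if out is None or (written >= target_bytes and idx < num_chunks):
                if out:
                    out.close()
                idx += 1
                path = os.path.join(out_dir, "part_%03d.txt" % idx)
                outputs.append(path)
                out = open(path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return outputs

# e.g. shred_file("/data/export.txt", "/data/shredded", num_chunks=8)
```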
Also note that once the data is bulk inserted into Snowflake it is unlikely to be optimally placed to take advantage of micro-partitioning, so I would recommend something like rebuilding the table with the loaded data and at least sorting it on a frequently filtered column; for a fact table, for example, I would at least rebuild and sort by date. https://www.doyouevendata.com/2018/03/06/performance-query-tuning-snowflake-clustering/
Generate the file and then use the SnowSQL CLI to PUT it in an internal stage, and use COPY INTO for stage -> table. There is some coding to do, and you can never avoid transporting gigabytes over the network, but PUT can compress the file and transfer it in chunks.
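A minimal sketch of that stage-and-copy flow using the snowflake-connector-python package instead of the CLI (the account, credentials, table, file format, and paths are all placeholders):

```python
# Minimal sketch using snowflake-connector-python; all identifiers and paths are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical account/credentials
    user="loader",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Upload the shredded files to the table's internal stage; PUT gzips them client-side by default.
    cur.execute("PUT file:///data/shredded/part_*.txt @%MY_TABLE AUTO_COMPRESS=TRUE")

    # Load all staged files; Snowflake distributes them across the warehouse's nodes and threads.
    cur.execute("""
        COPY INTO MY_TABLE
        FROM @%MY_TABLE
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')
        PURGE = TRUE
    """)
finally:
    cur.close()
    conn.close()
```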

Cassandra sstableloader: load data from CSV with various partition keys

I want to load a large CSV file into my Cassandra cluster (1 node at the moment).
Based on: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
My data is transformed by CQLSSTableWriter into SSTable files, then I use sstableloader to load those SSTables into a Cassandra table that already contains some data.
The CSV file contains various partition keys.
Now let's assume that a multi-node Cassandra cluster is used.
My questions:
1) Is the loading procedure I use correct in the case of a multi-node cluster?
2) Will the SSTable files be split by sstableloader and sent to the nodes responsible for the specific partition keys?
Thank you
1) Loading into a single-node cluster or a 100-node cluster is the same. The only difference is that the data will be distributed around the ring if you have a multi-node cluster. The node where you run sstableloader becomes the coordinator (as @rtumaykin already stated) and will send the writes to the appropriate nodes.
2) No. As in my response above, the "splitting" is done by the coordinator. Think of the sstableloader utility as just another instance of a client sending writes to the cluster.
3) In response to your follow-up question, the sstableloader utility is not sending files to nodes but sending writes of the rows contained in those SSTables. The sstableloader reads the data and sends write requests to the cluster.
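To make the "just another client sending writes" point concrete, here is a rough sketch that loads a CSV with the DataStax Python driver instead of sstableloader (the contact point, keyspace, table, and column names are placeholders; adjust the types of the bound values to match your schema):

```python
# Rough sketch: a plain client loading a CSV in parallel with the DataStax Python driver.
# Contact point, keyspace, table, and column names are placeholders.
import csv
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(["10.0.0.1"])           # hypothetical contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Whichever node coordinates each write routes the row to the replicas that own
# its partition key - the same routing sstableloader relies on when it streams data.
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

with open("data.csv", newline="") as f:
    rows = [(row["id"], row["value"]) for row in csv.DictReader(f)]

# Keep many writes in flight at once; each result is a (success, result_or_exception) pair.
results = execute_concurrent_with_args(session, insert, rows, concurrency=100)
failures = [res for ok, res in results if not ok]
print("%d rows written, %d failures" % (len(rows) - len(failures), len(failures)))

cluster.shutdown()
```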
Yes.
It will actually be done by the coordinator node, not by sstableloader.

Where does Neo4j save its data?

I read somewhere that it is better to use Redis as a cache server, because Redis holds its data in memory; so if you are going to save lots of data, Redis is not a good choice - it is good for keeping temporary data. Now my questions are:
1. Where do the rest of the databases (especially Neo4j and SQL Server) save their data?
Don't they save data in memory?
If not, where do they save it?
If they do, why do we use them for saving lots of data?
2. "It is better to save indices/relationships in Neo4j and data in MySQL, and retrieve the index from Neo4j and then take the data related to the index from MySQL" (I have read this somewhere) - is this because Neo4j has the same problem as Redis does?
Neo4j and SQL Server both store data on the file system. However, both also implement caching strategies. I am not an expert on the caching in these databases, but usually you can expect recently accessed data to be cached and data that has not been accessed for a while to fall out of the cache. If the DB needs something that is in the cache, it can avoid hits to the file system. Neo4j saves data in a subfolder called "data" by default. This link may help you find the location of a SQL Server database: http://technet.microsoft.com/en-us/library/dd206993.aspx
This will depend a lot on your specific use-case and the required performance characteristics. My gut feeling is to put data in one or the other based on some initial performance tests. Split the data up if it solves some specific problem.