Converting JSON .gz files into Delta Tables - json

I have Datadog log archives streaming to Azure Blob Storage. Each archive is a single 150 MB JSON file compressed into a 15 MB .gz file, and a new one is generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective way to get this data into Delta Lake?
From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Has anyone done this successfully without breaking the bank?

"From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs."
Yes, that's the big downside of the gzip format: it is not splittable, so the work cannot be distributed across all your workers and cores. The Driver has to load a file in its entirety and decompress it in a single batch. See the topic related to this question.
The only sensible workaround I've used myself is to give the Driver only a few cores, but the most powerful ones available. I assume that, since you are using Azure Blob, you are using Databricks on Azure as well; here you can find all Azure VM types, so you just have to pick the one with the fastest cores.
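To make the "decompress in a single pass" point concrete, here is a minimal standard-library Python sketch of reading such an archive. It assumes the archive holds one JSON object per line, which is an assumption about the file layout and not something stated in the question; the function name `iter_gzip_json_lines` is hypothetical. It is not Spark code, only an illustration of how a gzip stream is consumed front to back by one reader.

```python
import gzip
import json
from typing import Iterator


def iter_gzip_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip file.

    gzip is a sequential stream format, so a single reader decompresses
    the file from start to end; nothing here splits the work.
    Assumption: the archive contains one JSON object per line.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because records are yielded one at a time, memory use stays flat even for a 150 MB payload; the decompression itself is still bound to one core per file.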

Related

How can we use streaming in Spark from multiple sources? E.g. first take data from HDFS and then consume a stream from Kafka

The problem arises when I already have a system and want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL and want to do reporting on that data. I thought of dumping the data into HDFS.
New data also arrives in the DB every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save the results, because future data needs previous results.
Joins are possible in SparkSQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read over JDBC rather than needing HDFS exports.
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (could be more than a few minutes).
However, if you use Debezium to get data from MySQL into Kafka, then you have the data centralized in one location, and you can ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose built for that use case, and you don't need Spark. Pick one or more as they each support different use cases / query patterns.

loading 20 million records from SSIS to SNOWFLAKE through ODBC

I am trying to load around 20 million records from SSIS to Snowflake using an ODBC connection, and the load is taking forever to complete. Is there any faster method than using ODBC? I can think of loading the data into a flat file and then loading that file into Snowflake, but I'm not sure how to do it.
Update:
I generated a text file using bcp, then put that file on a Snowflake stage using the ODBC connection, and then used the COPY INTO command to load the data into tables.
Issue: the generated text file is 2.5 GB and ODBC is struggling to send it to the Snowflake stage. Any help on this part?
It should be faster to write compressed objects to the cloud provider's object store (AWS S3, Azure blob, etc.) and then COPY INTO Snowflake. But also more complex.
You are, by chance, not writing one row at a time, for 20,000,000 database calls?
ODBC is slow on a database like this. Snowflake (and similar columnar warehouses) also want to eat shredded files, not single large ones. The problem with your original approach is that no method of ODBC usage is going to be particularly fast on a system designed to load nodes in parallel across shredded staged files.
The problem with your second approach is that no shredding took place. Non-columnar databases with a head node (say, Netezza) would happily eat and shred your single file, but a Snowflake or a Redshift is basically going to ingest it as a single thread into a single node. Thus your ingest of a single 2.5 GB file is going to take the same amount of time on an XS 1-node Snowflake as on an L 8-node Snowflake cluster. The single node itself is not saturated and has plenty of CPU cycles to spare, doing nothing. Snowflake appears to use up to 8 write threads per node for an extract or ingest operation. You can see some tests here: https://www.doyouevendata.com/2018/12/21/how-to-load-data-into-snowflake-snowflake-data-load-best-practices/
My suggestion would be to make at least 8 files of size (2.5 GB / 8), i.e. about 8 files of roughly 315 MB each. For 2 nodes, make at least 16. This likely involves some effort in your file creation process if it is not natively shredding and scaling horizontally, although as a bonus it breaks your data into easier bite-sized pieces to abort, resume, etc. should any problems occur.
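As a rough illustration of the splitting step suggested above, the following Python sketch cuts a line-oriented export into a fixed number of gzip-compressed parts without ever breaking a row in half. The function name `split_into_gzip_parts` and the `part_NNN.txt.gz` naming are hypothetical, and the default of 8 parts simply mirrors the number suggested in this answer; adjust it to your warehouse size.

```python
import gzip
import os
from typing import List


def split_into_gzip_parts(src_path: str, out_dir: str, parts: int = 8) -> List[str]:
    """Split a line-oriented text file into at most `parts` gzip files.

    A part is closed at the first line boundary after it reaches the
    target size, so rows are never cut in half. The last part takes
    whatever remains.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    total = os.path.getsize(src_path)
    target = max(1, -(-total // parts))  # ceiling division: bytes per part
    written: List[str] = []
    out = None
    size = 0
    with open(src_path, "rb") as src:
        for line in src:
            if out is None:
                # Hypothetical naming scheme for the output parts.
                name = os.path.join(out_dir, f"part_{len(written):03d}.txt.gz")
                out = gzip.open(name, "wb")
                written.append(name)
                size = 0
            out.write(line)
            size += len(line)  # uncompressed bytes written to this part
            if size >= target and len(written) < parts:
                out.close()
                out = None
    if out is not None:
        out.close()
    return written
```

The resulting parts can then be staged and loaded together, so that each load thread gets its own file.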
Also note that once the data is bulk inserted into Snowflake, it is unlikely to be optimally placed to take advantage of micro-partitions, so I would recommend something like rebuilding the table with the loaded data and at least sorting it on an often-restricted column; e.g. a fact table I would at least rebuild and sort by date. https://www.doyouevendata.com/2018/03/06/performance-query-tuning-snowflake-clustering/
Generate the file and then use Snow CLI to PUT it in the internal stage. Use COPY INTO for stage -> table. There is some coding to do, and you can never avoid transporting GBs over the net, but PUT could compress and transfer the file in chunks.

EC2 suitability for synching large CSV files from an FTP

I have to execute a task twice per week. The task consists of fetching a 1.4 GB CSV file from a public FTP server. Then I have to process it (apply some filters, discard some rows, make some calculations) and sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve a SKU entry from the database and determine whether it needs an update or not.
My question is whether EC2 could work as a solution for me. My main concern is memory. I have searched for some solutions, such as https://github.com/goodby/csv, which handle this issue by fetching row by row instead of pulling everything into memory; however, they do not work if I try to read the .csv directly from the FTP server.
Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?
You won't be able to stream the file directly from FTP; instead, you will have to copy the entire file and store it locally. Using the curl or ftp command is likely the most efficient way to do this.
Once you do that, you will need to write some kind of program that reads the file a line at a time, or several at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work, and your parallelization options are limited.
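For the "read the file a line at a time" step, a minimal Python sketch could look like the following. It assumes the file has already been downloaded locally and has a header row; the function name `iter_filtered_rows` and the column names used in the usage note (`sku`, `price`) are hypothetical.

```python
import csv
from typing import Callable, Dict, Iterator


def iter_filtered_rows(
    path: str, keep: Callable[[Dict[str, str]], bool]
) -> Iterator[Dict[str, str]]:
    """Yield the rows of a local CSV file for which keep(row) is true.

    csv.DictReader pulls one row at a time from the file handle, so
    memory use does not grow with the size of the file.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if keep(row):
                yield row
```

A caller might write `for row in iter_filtered_rows("feed.csv", lambda r: float(r["price"]) > 0): ...` and do the per-SKU lookup inside the loop, ideally batching the database calls.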
Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.
A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
The FTP protocol doesn't do "streaming". You cannot read a file from FTP chunk by chunk.
Honestly, downloading the file and running a bigger instance is not a big deal if you only run twice a week. Just choose r3.large (it costs less than $0.20/hour), execute ASAP, and stop it. The internal SSD disk space should give you the best possible I/O compared to EBS.
Just make sure your OS and code are deployed on EBS for future reuse (unless you have an automated code deployment mechanism). You must also make sure RDS can handle the burst I/O, otherwise it will become the bottleneck.
Even better, using an r3.large instance, you can split the CSV file into smaller chunks, load them in parallel, then shut down the instance after everything finishes. You just need to pay the minimal root EBS storage cost afterwards.
I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short and fast processing (at the time of writing, it terminates after 300 seconds).
(update):
If you open a file, the simplest way to parse it is to read it sequentially, which may not put the CPU to full use. You can split up the CSV file by following this answer here.
Then, using the same script, you can process the pieces simultaneously by sending each run to a background process. The example below shows putting Python processes in the background under Linux.

    # run one parser per chunk in the background, then wait for all of them
    python parse_csvfile.py csv1 &
    python parse_csvfile.py csv2 &
    python parse_csvfile.py csv3 &
    wait

So instead of sequential I/O on a single file, it will make use of multiple files. In addition, splitting the file should be a snap on an SSD.
So I made it work like this.
I used Python and two great libraries. First of all, I wrote Python code to request and download the CSV file from the FTP server so I could load it into memory. The first package is Pandas, a tool for analyzing large amounts of data. It includes methods to read CSV files easily, and I used its built-in features to filter and sort. I filtered the large CSV by a field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue.
I also used Eloquent, a library inspired by Laravel's ORM. It lets you create a connection using the AWS public DNS, database name, username and password, and make queries using simple methods, without writing a single Postgres query.
Finally, I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.

HIVE, HBASE which one I have to use for My Data Analytics

I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop and Hive or HBase.
Currently I have 4 physical machines for a POC. Could someone please help me come up with the most efficient architecture?
I will get 5 GB of data per day.
I have to send a daily status report to each customer.
I also have to provide analysis reports on request, for example a 1-week report, or a report on the first 2 weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want to get the best performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or others, using emerging technologies such as Spark. The advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that have the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It's batch processing, and 5 more GB a day shouldn't have a big impact. It has a high latency overhead, but that shouldn't be a problem if you run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with either of those. HBase uses the DFS to store the data and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from these two sources.
For reports on request that specify a date range, you should store the data in an efficient way so you don't have to ingest data that's not needed for your report. Hive supports partitioning, and that can be done by date (i.e. /<year>/<month>/<day>/). Partitioning can significantly reduce your job execution times.
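To show why date partitioning helps with range reports, here is a small Python sketch that lists the partition directories a report would need to touch, following the /<year>/<month>/<day>/ layout mentioned above. The function name `partition_paths` and the root path in the usage note are hypothetical; this is only an illustration of the pruning idea, not Hive's own logic.

```python
from datetime import date, timedelta
from typing import List


def partition_paths(root: str, start: date, end: date) -> List[str]:
    """Return one <root>/<year>/<month>/<day> path per day in [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1  # inclusive range
    return [
        f"{root}/{d.year:04d}/{d.month:02d}/{d.day:02d}"
        for d in (start + timedelta(days=i) for i in range(days))
    ]
```

A 1-week report, for example `partition_paths("/data/events", date(2014, 5, 15), date(2014, 5, 21))`, touches only 7 directories, regardless of how many days of history the cluster holds.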
If you go with the NoSQL approach, be sure the row keys have some date format as a prefix (e.g. 20140521...) so that you can select those that start with the dates you want.
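The effect of a date prefix can be sketched in plain Python: when keys are kept in sorted order, all keys for a given date (or month) form one contiguous run that two binary searches can locate. This is an illustration of the idea only, not the HBase or Cassandra API; the function name `scan_by_prefix` and the `<date>#<customer>` key layout in the usage note are hypothetical, and keys are assumed to be plain ASCII strings.

```python
import bisect
from typing import List


def scan_by_prefix(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return all keys that start with `prefix` from a sorted key list.

    Assumption: keys are ASCII, so prefix + "\uffff" sorts after every
    key that begins with the prefix.
    """
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]
```

With keys such as `20140521#c1`, calling `scan_by_prefix(keys, "20140521")` selects one day and `scan_by_prefix(keys, "201405")` selects the whole month without looking at any other key.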
Some questions you should also consider are:
- How much data do you want to store in your cluster, e.g. the last 180 days? This will affect the number of nodes / disks. Beware that data is usually replicated 3 times.
- How many files do you have in HDFS? When the number of files is high, the Namenode will be hit hard on retrieval of file metadata. Some solutions exist, such as a replicated Namenode or using the MapR Hadoop distribution, which doesn't rely on a Namenode per se.
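As a back-of-the-envelope check for the first question, the sizing arithmetic can be written out in Python. The 150 GB and 5 GB/day figures come from the question, and the 180-day retention and 3x replication come from this answer; the function name `cluster_storage_gb` is hypothetical, and the sketch assumes the initial data is kept alongside the retained days and ignores compression and temporary job space.

```python
def cluster_storage_gb(initial_gb: int, daily_gb: int,
                       retention_days: int, replication: int = 3) -> int:
    """Estimate raw disk needed: (initial + daily * days) * replication."""
    logical = initial_gb + daily_gb * retention_days
    return logical * replication


# 150 GB existing data + 5 GB/day for 180 days, replicated 3 times
print(cluster_storage_gb(150, 5, 180))  # prints 3150
```

So roughly 3 TB of raw disk across the cluster under these assumptions, before leaving any headroom.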

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself when running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
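The slice-per-mapper idea described above can be sketched in a few lines of Python. This is only an illustration of the partitioning scheme, not how Sqoop is implemented: threads stand in for mappers, the dataset is assumed to be addressable by an integer key range, and the names `make_slices`, `transfer_all` and `transfer_slice` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def make_slices(lo: int, hi: int, count: int) -> List[Tuple[int, int]]:
    """Split the half-open range [lo, hi) into at most `count` contiguous slices."""
    if count < 1 or hi < lo:
        raise ValueError("need count >= 1 and hi >= lo")
    step, extra = divmod(hi - lo, count)
    slices = []
    start = lo
    for i in range(count):
        size = step + (1 if i < extra else 0)  # spread the remainder
        if size == 0:
            break
        slices.append((start, start + size))
        start += size
    return slices


def transfer_all(lo: int, hi: int, mappers: int,
                 transfer_slice: Callable[[int, int], int]) -> List[int]:
    """Run transfer_slice(start, end) once per slice, one worker per slice."""
    slices = make_slices(lo, hi, mappers)
    with ThreadPoolExecutor(max_workers=mappers) as pool:
        return list(pool.map(lambda s: transfer_slice(*s), slices))
```

Here `transfer_slice` would read the rows in its key range from the source and write them to the target; adding workers raises throughput only as long as neither side becomes the bottleneck.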
As for the load on the Cassandra cluster, I am not certain, since I have not used the Cassandra connector with Sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, do it once a day at a time when traffic is lowest, so that the impact is minimal in case your Cassandra availability drops.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different approaches, such as the COPY command or CQL queries; all of them time out, regardless of changes to the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a 'where' clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.