Any setting to avoid using all memory in apache drill when reading lots of json files and output to a single file? - apache-drill

Is there any setting to keep Drill from using all of its direct memory when reading many large JSON files (about 20 KB per file, 100k+ files) and writing the output to a single file?
For example, with the query below, say there are 2k JSON files at stroageplugin.root./inputpath/, each with about 40 KB of string data in the "Content" attribute. The query consumes about 80 MB of direct memory to complete. With 100k JSON files, it consumes about 4 GB of direct memory.
Is there a way to reduce direct memory consumption when merging this many files into a single output file?
CREATE TABLE stroageplugin.output./outputpath/ AS
SELECT Id, CreatedTime, Content
FROM stroageplugin.root./inputpath/;

You can configure memory settings using environment variables (or by editing <drill_installation_directory>/conf/drill-env.sh directly):
DRILL_HEAP=8G
DRILL_MAX_DIRECT_MEMORY=10G
DRILLBIT_CODE_CACHE_SIZE=1024M
See https://drill.apache.org/docs/configuring-drill-memory/

Related

Converting JSON .gz files into Delta Tables

I have Datadog log archives streaming to an Azure Blob, each stored as a single 150 MB JSON file compressed into a 15 MB .gz file. These are generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective solution to get this data into Delta Lake?
From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Has anyone done this successfully without breaking the bank?
"From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs."
Yes, that's the big downside of the gzip format: it is not splittable, so decompression cannot be distributed across all your workers and cores; the driver has to load a file in its entirety and decompress it in a single batch.
The only sensible workaround I've used myself is to give the driver only a few cores, but as powerful ones as possible. Since you are using Azure Blob, I assume you are also using Databricks on Azure; there you can find all the Azure VM types, so just pick the one with the fastest cores.
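For context, here is a rough PySpark sketch of the straightforward (file-per-task) approach; the archive path, partition count, and Delta table path are placeholders, not anything from the original setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark reads .json.gz transparently, but because gzip is not splittable
# each archive is decompressed by a single task; parallelism comes only
# from the number of files, not from splitting within a file.
raw = spark.read.json("/mnt/datadog-archive/*.json.gz")

# Repartition after the read so downstream processing and the Delta write
# are spread across the cluster instead of the handful of reader tasks.
(raw.repartition(64)
    .write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/datadog_logs"))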

SSIS Cache Manager

Is it possible to use an SSIS Cache Manager with anything other than a Lookup? I would like to use similar data across multiple data flows.
I haven't been able to find a way to cache this data in memory in a cache manager and then reuse it in a later flow.
Nope, the Cache Connection Manager was built specifically to solve Lookup tasks, which originally only allowed an OLE DB connection to be used.
However, if you have a set of data that you want to be static for the life of a package run and usable across data flows, or even other packages, as a table-like entity, perhaps you're looking for a Raw File. It's a tight, binary representation of the data stored to disk. Since it's stored to disk, you pay a write and a subsequent read performance penalty, but the files are likely right-sized enough that any penalty is offset by your specific needs.
The first step is to define the data that will go into the Raw File and connect it to a Raw File Destination. That involves creating a Raw File Connection Manager, where you define where the file lives and the rules about the data in it (recreate, append, etc.). At that point, run the data flow task so the file is created and populated.
The next step is that everywhere you want to use the data, you patch in a Raw File Source. It behaves much like any other data source in your toolkit at that point.

loading 20 million records from SSIS to SNOWFLAKE through ODBC

I am trying to load around 20 million records from SSIS to Snowflake using an ODBC connection, and the load is taking forever to complete. Is there any faster method than using ODBC? I can think of loading the data into a flat file and then using the flat file to load into Snowflake, but I'm not sure how to do it.
Update:
I generated a text file using bcp, then put that file on the Snowflake stage using the ODBC connection, and then used the COPY INTO command to load the data into tables.
Issue: the generated text file is 2.5 GB and ODBC is struggling to send the file to the Snowflake stage. Any help on this part?
It should be faster to write compressed objects to the cloud provider's object store (AWS S3, Azure blob, etc.) and then COPY INTO Snowflake. But also more complex.
You aren't, by any chance, writing one row at a time, i.e. making 20,000,000 database calls?
ODBC is slow on a database like this; Snowflake (and similar columnar warehouses) also wants to eat shredded files, not single large ones. The problem with your original approach is that no method of ODBC usage is going to be particularly fast on a system designed to load its nodes in parallel from shredded, staged files.
The problem with your second approach is that no shredding took place. Non-columnar databases with a head node (say, Netezza) would happily eat and shred your single file, but Snowflake or Redshift will basically ingest it as a single thread into a single node. Thus your ingest of a single 2.5 GB file will take about the same amount of time on an XS 1-node Snowflake warehouse as on an L 8-node cluster; the single node itself is not saturated and has plenty of CPU cycles to spare, doing nothing. Snowflake appears to use up to 8 write threads per node for an extract or ingest operation. You can see some tests here: https://www.doyouevendata.com/2018/12/21/how-to-load-data-into-snowflake-snowflake-data-load-best-practices/
My suggestion would be to make at least 8 files of size (2.5 GB / 8), i.e. about eight 315 MB files; for a 2-node warehouse, at least 16. This likely takes some effort in your file-creation process if it doesn't natively shred and scale horizontally, but as a bonus it breaks your data up into bite-sized pieces that are easier to abort/resume/etc. should any problems occur.
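As an illustration of the chunking step, here is a rough Python sketch that round-robins the rows of a line-delimited export into a fixed number of roughly equal files; the file names and chunk count are arbitrary, and it assumes the bcp output has no header row:

CHUNKS = 8  # at least one file per expected write thread

def split_into_chunks(src_path, chunks=CHUNKS):
    # Round-robin lines across the outputs so each chunk ends up roughly
    # the same size without knowing the total row count up front.
    outputs = [open(f"chunk_{i}.csv", "w") for i in range(chunks)]
    try:
        with open(src_path) as src:
            for i, line in enumerate(src):
                outputs[i % chunks].write(line)
    finally:
        for out in outputs:
            out.close()

split_into_chunks("export.csv")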
Also note that once the data is bulk-inserted into Snowflake, it is unlikely to be optimally placed to take advantage of micro-partitions, so I would recommend something like rebuilding the table with the loaded data and at least sorting it on an often-restricted column; e.g. for a fact table I would at least rebuild and sort by date. https://www.doyouevendata.com/2018/03/06/performance-query-tuning-snowflake-clustering/
Generate the file and then use the SnowSQL CLI to PUT it in the internal stage, then use COPY INTO for stage -> table. There's some coding to do, and you can never avoid transporting gigabytes over the net, but PUT can compress and transfer the file in chunks.
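To make the stage-then-copy route concrete, here is a hedged Python sketch using the Snowflake Python connector; the connection parameters, table name, and file paths are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="MY_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# PUT uploads the local chunks to the table's internal stage;
# AUTO_COMPRESS gzips them on the way up.
cur.execute("PUT file:///tmp/export/chunk_*.csv @%MY_TABLE AUTO_COMPRESS=TRUE")

# COPY INTO then loads all staged chunks in parallel across the warehouse.
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV)
    PURGE = TRUE
""")

cur.close()
conn.close()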

Streaming very large json directly from a POST request in Django

I'm looking for efficient ways to handle very large json files (i.e. possibly several GB in size, amounting to a few million json objects) in requests to a Django 2.0 server (using Django REST Framework). Each row needs to go through some processing, and then get saved to a DB.
The biggest pain point so far is the sheer memory consumption of the file itself, and the fact that memory consumption keeps increasing while the data is being processed in Django, with no way to manually release the memory used.
Is there a recommended way of processing very large JSON files in requests to a Django app without slaughtering memory consumption? Is it possible to combine this with compression (gzip)? I'm thinking of uploading the JSON to the API as a regular file, streaming it to disk, and then streaming from the file on disk using ijson or similar. Is there a more straightforward way?
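One possible shape for the upload-to-disk-then-stream idea mentioned above; this is only a sketch, assuming the payload is a top-level JSON array, and the view, model, and field names are hypothetical:

import tempfile

import ijson
from rest_framework.response import Response
from rest_framework.views import APIView

from myapp.models import Record  # hypothetical model with a JSON field


class BulkUploadView(APIView):
    def post(self, request):
        # Spool the raw request body to disk in chunks so the whole payload
        # never has to sit in memory at once.
        with tempfile.TemporaryFile() as spooled:
            for chunk in iter(lambda: request.stream.read(1024 * 1024), b""):
                spooled.write(chunk)
            spooled.seek(0)

            # ijson yields one object at a time from the top-level array,
            # so memory use stays bounded by the batch size below.
            batch = []
            for obj in ijson.items(spooled, "item"):
                batch.append(Record(payload=obj))
                if len(batch) >= 1000:
                    Record.objects.bulk_create(batch)
                    batch = []
            if batch:
                Record.objects.bulk_create(batch)

        return Response({"status": "ok"})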

split a csv file by content of first column without creating a copy?

I am trying to accomplish something similar to what was described in this thread: How to split a huge csv file based on content of first column?
There, the best solution seemed to be to use awk, which does do the job. However, I am dealing with massive CSV files and would like to split up the file without creating a new copy, since disk I/O speed is killing me. Is there a way to split the original file without creating a new copy?
I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.
You will need to iterate through the first file and write the "segments" back to disk as new files, regardless of whether you use awk, Python or a text editor for this. You do not need to make a copy of the first file beforehand, though.
"Splitting a file" still requires RAM and disk I/O. There's no way around that; it's just how the world works.
However, you can certainly reduce the impact of I/O-bound processes on your system. Some obvious solutions are:
Use a RAM disk to reduce disk I/O.
Use a SAN disk to reduce local disk I/O.
Use an I/O scheduler to rate-limit your disk I/O. For example, most Linux systems support the ionice utility for this purpose.
Chunk up the file and use batch queues to reduce CPU load.
Use nice to reduce CPU load during file processing.
If you're dealing with files, then you're dealing with I/O. It's up to you to make the best of it within your system constraints.
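For what it's worth, here is a minimal Python sketch of the iterate-and-write approach described above, assuming the file has a header row, the first column is the split key, and the number of distinct keys is small enough to keep one output file open per key:

import csv

def split_by_first_column(path):
    writers = {}  # key -> (file handle, csv writer)
    try:
        with open(path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)  # assumes a header row
            for row in reader:
                key = row[0]
                if key not in writers:
                    handle = open(f"{key}.csv", "w", newline="")
                    writer = csv.writer(handle)
                    writer.writerow(header)
                    writers[key] = (handle, writer)
                writers[key][1].writerow(row)
    finally:
        for handle, _ in writers.values():
            handle.close()

split_by_first_column("huge.csv")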