I am new to Go and I am using the encoding/csv package's ReadAll() to read all the records of a CSV file, e.g.:
records, err := csv.NewReader(file).ReadAll() // file is an opened *os.File (or any io.Reader)
I just want to know whether there are any constraints I should be aware of, e.g. around CSV file size. How big a CSV file can I read with ReadAll() without issues?
The only limitation here comes from the machine's available RAM, since ReadAll() loads every record into memory at once.
If you run into that limit, you can (depending on the case) resolve it with stream processing: you read and process one record at a time before moving on to the next.
Here is an example from another thread.
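In the same spirit, here is a minimal sketch using csv.Reader.Read, which reads one record at a time (the file name below is a placeholder; put your per-record processing where the Println is):

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    // "records.csv" is a placeholder path for this sketch.
    f, err := os.Open("records.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    for {
        // Read returns one record at a time, so memory use stays
        // bounded no matter how large the file is.
        record, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(record)
    }
}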
May I know if there is any setting to stop Drill from using all of its direct memory when reading many big JSON files (20 KB per file, 100k+ files) and writing the output to a file?
E.g., by running a query like the one below: say there are 2k JSON files at stroageplugin.root./inputpath/, each with about 40 KB of string data in the "Content" attribute. The query consumes about 80 MB of direct memory to complete. With 100k JSON files, it consumes about 4 GB of direct memory.
Do we have a way to reduce the direct memory consumption here when merging lots of files into a single file?
CREATE TABLE stroageplugin.output./outputpath/ AS
SELECT Id, CreatedTime, Content
FROM stroageplugin.root./inputpath/;
You can configure memory settings using environment variables (or by editing <drill_installation_directory>/conf/drill-env.sh directly):
DRILL_HEAP=8G
DRILL_MAX_DIRECT_MEMORY=10G
DRILLBIT_CODE_CACHE_SIZE=1024M
See https://drill.apache.org/docs/configuring-drill-memory/
Is it possible to use an SSIS Cache Manager with anything other than a Lookup? I would like to use similar data across multiple data flows.
I haven't been able to find a way to cache this data in memory in a cache manager and then reuse it in a later flow.
Nope, the Cache Connection Manager is specific to the Lookup task; it was originally introduced because Lookups only allowed an OLE DB connection to be used.
However, if you have a set of data that you want to stay static for the life of a package run and be usable across data flows, or even other packages, as a table-like entity, perhaps you're looking for a Raw File. It's a tight, binary representation of the data stored to disk. Since it's stored to disk, you will pay a write and a subsequent read performance penalty, but it's likely that the files will be right-sized so that any penalty is offset by your specific needs.
The first step is to define the data that will go into the Raw File and connect it to a Raw File Destination. That involves creating a Raw File Connection Manager, where you define where the file lives and the rules about the data in it (recreate, append, etc.). At this point, run the Data Flow Task so the file is created and populated.
Then, everywhere you want to use the data, patch in a Raw File Source. From that point on it behaves much like any other data source in your toolkit.
Serialising JSON to a file is a convenient way for storing arbitrary data structures in a persistent manner. At work, I see this happening a lot, and I think it's understandable to do this instead of using something like SQLite because it's just so damn easy.
The problem with this is that when you modify the file programmatically, you might end up corrupting the file and you might lose your data or the software might be unable to proceed after the data has been corrupted. E.g. the file could be only partially written due to an abrupt power failure or a crash. Also, if there are multiple processes modifying the file, it will need some locking, but this is rarely the case.
A few years back, I came up with what I believe is a failsafe approach to modifying JSON files on Linux, but I am not a database expert and people say that "you should never write your own database". Thus, I'd appreciate some feedback from database experts on the matter. Is this really a failsafe approach?
For the single-consumer, single-producer case, it goes like this:
1. Read the JSON file and parse it.
2. Change a node in the object tree.
3. Serialise the new object tree to a file on the same file system.
   E.g. if the path to the old file is /etc/example/config.json then the new file should be at /etc/example/config.json.tmp.
   It's important to keep the temporary file on the same filesystem, and not on e.g. /tmp, because this makes the rename() system call atomic with regard to filesystem operations.
   This means that after a rename(), the file is guaranteed to be complete. If the system experiences a power failure at the time of rename(), the change may be lost, but the old file will still be complete and not corrupted.
4. Run fdatasync() or fsync() on the new file.
   This can take a while; don't run it in the main loop.
   When this call returns, the file is guaranteed to have been written to persistent storage.
5. (optional) Read the new file back and verify that it's valid JSON and that it fits the schema.
6. Rename the new file to the name of the old file using the rename() system call.
We almost never share files between processes, but in the multiple-producer case one might use something like flock(). Read locks are not necessary because of the rename() logic described above: the file is guaranteed to always be complete.
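For illustration only, here is a minimal sketch of the steps above in Go (Go is just an example language here; the helper name is made up, and the optional read-back validation step is omitted). The key calls map to write(), fsync() and rename().

package main

import (
    "encoding/json"
    "os"
)

// writeJSONAtomically is a hypothetical helper sketching the steps above:
// serialise to a temporary file in the same directory (hence the same
// filesystem), fsync it, then rename it over the old file.
func writeJSONAtomically(path string, v interface{}) error {
    data, err := json.MarshalIndent(v, "", "  ")
    if err != nil {
        return err
    }

    tmp := path + ".tmp" // same directory, so same filesystem as the target
    f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return err
    }
    // fsync: when this returns, the data has reached persistent storage.
    if err := f.Sync(); err != nil {
        f.Close()
        return err
    }
    if err := f.Close(); err != nil {
        return err
    }
    // Atomic with regard to filesystem operations: readers see either the
    // old file or the new one, never a partially written file.
    return os.Rename(tmp, path)
}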
I am in the process of building my first live node.js web app. It contains a form that accepts data regarding my client's current stock. When submitted, an object is created and saved to an array of current stock. This stock is then permanently displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure whether it is necessary to use MongoDB to store these, or whether there is a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store the data in a JSON file, or even in a cache of sorts such as Redis, but I still think MongoDB would be your best bet for a live site.
Storing something in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens), you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file: if two processes try to update the file at the same time, you can potentially lose data. Technically speaking, a JSON file would work, but it's not recommended.
Storing in memory, e.g. in Redis, has a similar implication in that the data is only available on that one server. The data is also not persistent, so if your server restarted for whatever reason, you'd lose whatever was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to run a load test. But since you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler route, i.e. a JSON file, use the nedb API, which is plenty fast as well.
I'm writing a script which runs a MySQL query that returns ~5 million results. I need to manipulate this data:
Formatting
Encoding
Convert some sections to JSON
etc
Then write that to a CSV file in the shortest amount of time possible.
Since the node-mysql module is able to handle streams, I figured that's my best route. However I'm not familiar enough with writing my own streams to do this.
What would be the best way to stream the data from MySQL, through a formatting function, and into a CSV file?
We're in a position where we don't mind what order the data goes in, so long as it's written to a CSV file real quick.
https://github.com/wdavidw/node-csv seems to do what you need. It can consume streams, write streams, and has built-in transformers.
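The question targets node-mysql and node-csv, but the shape of the pipeline (row source, per-row transform, CSV writer) is the same in any language. Purely as a language-neutral illustration of that shape, and not of the node-csv API, here is a minimal Go sketch using database/sql and encoding/csv; the DSN, table and column names are placeholders:

package main

import (
    "database/sql"
    "encoding/csv"
    "log"
    "os"

    _ "github.com/go-sql-driver/mysql" // one possible MySQL driver
)

func main() {
    // Placeholder DSN, table and column names for this sketch.
    db, err := sql.Open("mysql", "user:password@/dbname")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    rows, err := db.Query("SELECT id, payload FROM items")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    out, err := os.Create("out.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    w := csv.NewWriter(out)
    defer w.Flush()

    // Rows are fetched and written one at a time, so the ~5 million
    // results never have to fit in memory at once.
    for rows.Next() {
        var id, payload string
        if err := rows.Scan(&id, &payload); err != nil {
            log.Fatal(err)
        }
        // Per-row formatting, encoding, or JSON conversion would go here.
        if err := w.Write([]string{id, payload}); err != nil {
            log.Fatal(err)
        }
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}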