Let's say I have a moderately large matrix with data. Around 200k elements.
Whenever I mistakenly input X into the console, it locks into a printing loop that lasts minutes (I can't imagine what it would be like with a real data matrix with millions of elements); meanwhile, languages like Python have the much saner default of printing only a few rows from the start and end of such matrices (or lists).
Is there any option (I'm not finding one in the docs) to make this the default behavior? My quality of life using Octave would improve a lot.
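For comparison, the Python behaviour alluded to above is what NumPy does by default; a quick illustration (the array here is just a stand-in of roughly the same size):

import numpy as np

# Arrays with more than ~1000 elements are summarised with "..." by default.
big = np.arange(200_000).reshape(1000, 200)
print(big)  # prints only the first and last few rows/columns

# The cutoff and the number of edge items shown are configurable:
np.set_printoptions(threshold=50, edgeitems=2)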
Before I go much further down this thought process, I wanted to check whether the idea was feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will be presented in CSV files for ingesting.
Basically, what I'll need to do is take records from the first dataset, do a lookup against the second dataset, and then produce an output CSV file that merges the two together. The expected number of records after the merge will be in the range of 1.5-2M.
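For reference, the operation being described boils down to a keyed join; a minimal pandas sketch (the file names and the record_id join key are hypothetical) of what Power Automate would have to replicate with lookups and "Apply to each":

import pandas as pd

# Hypothetical file names and join key, for illustration only.
first = pd.read_csv("dataset_one.csv")    # ~500K records
second = pd.read_csv("dataset_two.csv")   # ~500K records

# Look up each record of the first dataset against the second and merge.
merged = first.merge(second, on="record_id", how="inner")

merged.to_csv("merged_output.csv", index=False)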
So, my questions are,
Will Power Automate allow me to work with CSV datasets of those sizes?
Will the "Apply to each" operator function across that large of a dataset?
Will Power Automate allow me to produce the export CSV file with that size?
Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.
I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is that the saved files get huge. This seems to be correlated with the amount of data I use for training (which is several tens of millions of samples per forest, leading to dumped files on the order of up to 20 GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
import pickle
import joblib
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = ...  # some data

classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)

# dump with pickle
pickle.dump(classifier, open(path + 'classifier.pickle', 'wb'))
or
# dump with joblib (compression enabled)
joblib.dump(classifier, path + 'classifier.joblib', compress=True)
Since the saved files got quite big (5 GB to nearly 20 GB, compressed to approx. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the training set, I found different sizes for the saved classifier, making me believe that information about the training data is pickled/joblibed as well. This seems unintuitive to me: for predictions, I only need the information of all the trained weak predictors (decision trees), which should be fixed, and since the number of trees and the max depth are not too high, they should not take up that much space - and certainly not more space because of a larger training set.
All in all, I suspect that the structure contains more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the information necessary for my future predictions.
I ran into a similar issue, and in the beginning I also thought that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out that decision trees are indeed memory-hungry structures consisting of multiple arrays whose length is given by the total number of nodes. The number of nodes generally grows with the size of the data (and parameters like max_depth cannot effectively be used to limit growth, since reasonable values still leave room to generate a huge number of nodes). See the details in this answer, and the sketch after the list below, but the gist is:
a single decision tree can easily grow to a few MB (the example above has a 5 MB decision tree for 100K data points and a 50 MB decision tree for 1M data points)
a random forest commonly contains at least 100 such decision trees, so for the example above you would have models in the range of 0.5-5 GB
compression is usually not enough to reduce them to reasonable sizes (ratios of 1/2 to 1/3 are typical)
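To see this effect yourself, here is a small sketch (using a synthetic dataset, so the exact numbers are only indicative) that prints the node count and the pickled size of a single unconstrained tree; re-running it with a larger n_samples shows both numbers growing.

import pickle
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the node count, and hence the serialized size, grows with n_samples.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

tree = DecisionTreeClassifier().fit(X, y)
print("nodes:", tree.tree_.node_count)
print("pickled size (MB):", len(pickle.dumps(tree)) / 1e6)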
Other notes:
with a different algorithm, models might remain of a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by the decision trees if you only plan to reuse them for prediction. In particular, I imagine the impurity array and possibly those related to n_samples might not be needed, but I have not checked.
with respect to your hypothesis that the random forest is saving the data on which it is trained: no, it is not, and the data itself would likely be one or more orders of magnitude smaller than the final model
so in principle another strategy, if you have a reproducible training pipeline, could be to save the data instead of the model and retrain on demand, but this is only possible if you can spare the time to retrain (for example in a use case where you have a long-running service which keeps the model in memory, and you serialize the model only to have a backup for when the service goes down)
there are probably also other options to limit the growth of the random forest; the best one I have found so far is in this answer, where the suggestion is to set min_samples_leaf as a percentage of the data (a sketch follows below)
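As a sketch of that last suggestion (the 0.01% value is just an example to adapt): scikit-learn accepts a float for min_samples_leaf and interprets it as a fraction of the training samples, which caps how many nodes each tree can grow no matter how much data you feed it.

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(
    n_estimators=24,
    max_depth=10,
    min_samples_leaf=0.0001,  # float => fraction of n_samples per leaf (here 0.01%)
)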
In pure Python it is easy:
in_string = 'abc,def,ghi,jklmnop,, '
out = in_string.lower().rstrip().split(',')  # too slow!!!
# out == ['abc', 'def', 'ghi', 'jklmnop', '', '']
In my case this is called several million times and I need to speed it up a little. I am already using Cython but I do not know how to speed up this particular portion of code.
There can be up to 300 substrings. Pure ASCII. Letters, numbers and some other printable characters. There can be no comma "," in a substring. So a comma is the separator.
Edit:
OK, I see that a simple question turns into a big one. So: the data comes from files which have a CSV-like format (no ready-to-run software works on them) and which in total can be 100 GB in size. The method reads each file line by line, needs to get the substrings, and then sends the substrings to an SQLite database (I am already using executemany). The whole thing is done in a multiprocessing manner, so each file is processed by its own process. It is already fast, but I want to squeeze out the last bit of performance. Additionally, I want to learn more about Cython. So I have picked this (simple) part of the Python code and have run it with "cython -a", which produces a big amount of generated code. So I think this is the best part to start optimizing.
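For context, a stripped-down sketch of one such per-file worker (the table name, column count and encoding are assumptions, not the real code):

import sqlite3

def process_file(path, db_path):
    # Hypothetical worker run in its own process: split each line and bulk-insert the fields.
    conn = sqlite3.connect(db_path)
    with open(path, "r", encoding="ascii") as fh, conn:
        rows = (line.lower().rstrip().split(",") for line in fh)
        # The number of placeholders must match the real number of fields per line.
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.close()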
Profiling the code is not that easy because multiprocessing and Cython are being used.
So once someone answers my question, I can implement the suggestion and make a test run. Even if I don't improve the speed of my code, I will for sure learn a lot. Unfortunately, I am a C noob.
Yes, you can do this in Cython; the larger question is whether you should.
Where does the input come from?
Is it a file? Then other optimisations are possible, e.g. you could map the file into memory.
Is it a database or network connection? In that case your runtime is probably dominated by waiting for disk/network.
What do you plan to do with the output?
Does the output have to be a string, or can it be a buffer?
"abc,def" -> "abc\0def\0"
buffer1 ------^ ^
buffer2 -----------!
You mentioned that the string-splitting fragment is called millions of times. Processing the string itself is not that slow; what probably kills performance is allocating all the small strings and an array to hold the result, and then collecting the garbage once the substrings are no longer used.
If you could give out pointers to existing data instead, you could speed things up a bit.
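The "hand out pointers" idea, sketched in plain Python for readability (in Cython the same thing would be done over a char* or a memoryview): yield index pairs into the original line and only materialise a field when it is actually needed.

def field_spans(line):
    # Yield (start, end) offsets of the comma-separated fields without
    # allocating a new string per field.
    start = 0
    while True:
        end = line.find(",", start)
        if end == -1:
            yield start, len(line)
            return
        yield start, end
        start = end + 1

spans = list(field_spans("abc,def,ghi,jklmnop,,"))
# [(0, 3), (4, 7), (8, 11), (12, 19), (20, 20), (21, 21)]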
How often are these substrings used? If split is called millions of times, it seems to suggest that most substrings are discarded (or you'd run out of memory).
For example, consider the problem "split into substrings and return numbers only"
filter(str.isdigit, "dfasdf,6785,2,dhs,dfgsd,dsg,dsffg".split(","))
If you know in advance that most substrings are not numbers, you'd want to optimise this larger problem as a single block.
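A sketch of what "optimise the larger problem as a single block" could mean here: pull out only the all-digit fields in one pass over the raw line, without ever building the intermediate list of substrings (the regex is an illustration, not a drop-in replacement for the real parsing).

import re

# Matches runs of digits that form a whole comma-separated field.
DIGIT_FIELD = re.compile(r"(?<![^,])\d+(?![^,])")

line = "dfasdf,6785,2,dhs,dfgsd,dsg,dsffg"
numbers = DIGIT_FIELD.findall(line)  # ['6785', '2']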
How many substrings are there in a typical input?
If there are 4, like in your example, it's not worth it. If there are millions, or even thousands, you may get somewhere.
Is there unicode?
.lower() on an ASCII string is trivial, but not so on unicode. I'd stick to Python if you expect unicode.
I am trying to come up with an efficient way to characterize two narrowband tones separated by about 900 kHz (one at around 100 kHz and one at around 1 MHz once translated to baseband). They don't move much in frequency over time but may have amplitude variations we want to monitor.
Each tone is roughly 100 Hz wide and we are required to characterize these two beasts over long periods of time down to a resolution of about 0.1 Hz. The samples are coming in at over 2 MSamples/sec (TBD) to adequately acquire the higher tone.
I'm trying to avoid (if possible) doing brute-force >2MSample FFTs on the data once a second to extract frequency-domain data. Is there a more efficient approach? Something akin to performing two (much) smaller FFTs around the bands of interest? I've looked at Goertzel and chirp-z methods, but I am not certain they help save processing.
Something akin to performing two (much) smaller FFTs around the bands of interest
There is: it's called Goertzel, and it is kind of an FFT for single bins - and you have already looked at it. It will save you CPU time.
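For reference, the Goertzel recurrence for a single bin is only a few lines; a plain-Python sketch (k is the bin index to evaluate over an n-sample block, and the block length is up to you):

import math

def goertzel_power(samples, k):
    # Power of DFT bin k of an n-sample block, via the Goertzel recurrence.
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s_prev2, s_prev = s_prev, x + coeff * s_prev - s_prev2
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2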
Anyway, there's no reason to do a 2M-point FFT; first of all, you only want a resolution of about 1/20 of the sampling rate, hence a 20-point FFT would totally do, and should be pretty doable for your CPU at these low rates; since you don't seem to care about the phase of your tones, FFT -> complex_to_mag.
However, there's one thing that you should always do: look at your signal of interest, and decimate down to the rate that fits exactly that. Since GNU Radio's filters are implemented cleverly, the filter itself will only run at the decimated rate, and you can spend the CPU cycles saved on a better filter.
Because a direct decimation from 2 MHz to 100 Hz (a decimation factor of 20000) would require a really ugly filter length, you should do this multi-rate:
I'd first decimate by 100, and then in a second step by 100 again, leaving you with 200 Hz of observable spectrum. The xlating FIR filter blocks will let you use a simple low-pass filter (use the "Low-Pass Filter Taps" block to define a variable that contains such taps) as a band selector.
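Outside of GNU Radio, the same multirate idea looks roughly like this in NumPy/SciPy (the 42.1 Hz tone stands in for a signal that the xlating filter has already translated close to DC; all numbers are illustrative):

import numpy as np
from scipy.signal import decimate

fs = 2_000_000
t = np.arange(fs) / fs                        # one second of synthetic samples
x = np.cos(2 * np.pi * 42.1 * t)              # stand-in for an already-translated tone

stage1 = decimate(x, 100, ftype="fir")        # 2 MS/s  -> 20 kS/s
stage2 = decimate(stage1, 100, ftype="fir")   # 20 kS/s -> 200 S/s
spectrum = np.abs(np.fft.rfft(stage2))        # ~1 Hz bins from a 1 s capture;
                                              # integrate longer for 0.1 Hz resolution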
"The output column "A" (67) on output "Output0" (5) and component "Data Flow Task" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance."
Please resolve my problem
These warnings indicate that you have columns in your data flow that are not used. A Data Flow works by allocating "buckets" of fixed-size memory, filling them with data from the source and allowing the downstream components to directly access those memory addresses to perform synchronous transformations.
Memory is a finite resource. If SSIS detects it has 1 GB to work with and one row of data costs 4096 KB, then you could have at most 256 rows of data in the pipeline before running out of memory space. Those 256 rows get split into N buckets of rows because, as much as you can, you want to perform set-based operations when working with databases.
Why does all this matter? SSIS detects whether you've used everything you've brought into the pipeline. If a column is never used, then you're wasting memory. Instead of a single row costing 4096 KB, by excluding unused columns you might reduce the amount of memory required for each row down to 1024 KB, and now you can have 1024 rows in the pipeline just by taking only what you need.
How do you get there? In your data source, write a query instead of selecting a table. Don't use SELECT * FROM myTable; instead, explicitly enumerate all of the columns you need and nothing more. The same goes for Flat File Sources: uncheck the columns that are never used. You'll still pay a disk penalty for having to read the whole row in, but those columns don't have to hit your Data Flow and consume that memory. Same story for any Lookups - only query the data you need.
Asynchronous components are the last thing to be aware of, as this has turned into a diatribe on performance. The above calculations are much like freshman calculus classes: assume a cow is a sphere to make the math easier. Asynchronous components result in your memory being split before and after the component. They radically change the shape of the rows going through a component such that downstream components can't reuse the address space above it. This results in a physical memory copy, which is a slow operation.
My final comment, though, is that if your package is performing adequately and finishing in an acceptable time frame, then unless you have nothing else to do, leave it be and go on to your next task. These are just warnings and should not "grow up" into full-blown errors.