Improving read performance of pyarrow - pyarrow

I have a partitioned dataset stored on an internal S3 cloud. I am reading the dataset with pyarrow.dataset:
import pyarrow.dataset as ds
my_dataset = ds.dataset(ds_name, format="parquet", filesystem=s3file, partitioning="hive")
fragments = list(my_dataset.get_fragments())
required_fragment = fragments.pop()
The metadata from the required fragment shows the following:
required_fragment.metadata
<pyarrow._parquet.FileMetaData object at 0x00000291798EDF48>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 22
num_rows: 949650
num_row_groups: 29
format_version: 1.0
serialized_size: 68750
Converting this to a table, however, takes a long time:
%timeit required_fragment.to_table()
6min 29s ± 1min 15s per loop (mean ± std. dev. of 7 runs, 1 loop each)
The size of the table itself is about 272 MB:
required_fragment.to_table().nbytes
272850898
Any ideas how I can speed up converting the dataset fragment to a table?
Updates
So instead of pyarrow.dataset, I tried using pyarrow.parquet.
The only part of my code that changed is:
import pyarrow.parquet as pq
my_dataset = pq.ParquetDataset(ds_name, filesystem=s3file, partitioning="hive", use_legacy_dataset=False)
fragments = my_dataset.fragments
required_fragment = fragments.pop()
When I tried the code again, the performance was much better:
%timeit required_fragment.to_table()
12.4 s ± 1.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
While I am happy with the better performance, it still feels confusing: with use_legacy_dataset=False, both code paths should be using the same dataset machinery under the hood and have similar outcomes.
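One possible factor, offered as an assumption worth verifying rather than a confirmed explanation: pq.ParquetDataset appears to default to pre-buffering column chunks (pre_buffer=True), while passing format="parquet" to ds.dataset may not. A minimal sketch of enabling it explicitly on the ds.dataset path, assuming pyarrow 9.x and the same ds_name and s3file as above:
import pyarrow.dataset as ds

# Hedged sketch: turn on pre-buffering of column chunks, which coalesces many
# small S3 range requests into fewer larger ones (an assumption about the
# source of the slowdown, not a confirmed diagnosis).
scan_opts = ds.ParquetFragmentScanOptions(pre_buffer=True)
pq_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_opts)
my_dataset = ds.dataset(ds_name, format=pq_format,
                        filesystem=s3file, partitioning="hive")
required_fragment = list(my_dataset.get_fragments()).pop()
table = required_fragment.to_table()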
PC Information
Installed RAM: 21.0 GB
Software: Windows 10 Enterprise
Internet speed: 10 Mbps / 156 Mbps (download / upload)
s3 location: Asia

Related

How to free RAM space in Google Colab

I am working on a Deep Learning project where I am trying different CNN architectures with CIFAR10. I've built some custom functions and use some nested for-loops to iterate over my different architectures. The problem I get is that the 12 GB of RAM get close to 100% and I cannot free that space to continue. I would like a solution other than "reset your runtime environment"; I want to free that space, given that 12 GB should be enough for what I am doing if it is managed correctly.
What I've done so far:
Added gc.collect() at the end of each training epoch
Added keras.backend.clear_session() after each model is trained
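For context, a minimal sketch of how those two cleanup steps combine between runs (build_model, x_train and y_train are illustrative placeholders, not the actual code):
import gc
from tensorflow import keras

def train_and_release(build_model, x_train, y_train):
    # Sketch: train one architecture, then release its memory before the next.
    model = build_model()
    model.fit(x_train, y_train, epochs=1, verbose=0)
    score = model.evaluate(x_train, y_train, verbose=0)
    del model                         # drop the reference to the finished model
    keras.backend.clear_session()     # clear Keras/TensorFlow graph state
    gc.collect()                      # ask Python to reclaim unreferenced memory
    return score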
I've also tried to see the locals() using
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
Which yields
xtrain: 1.1 GiB
xtest: 234.4 MiB
_i13: 3.4 KiB
So I cannot understand how the other 10 GB are allocated in my current session.

Understanding a WAV file exported from a DAW

I have generated a tone in Audacity at 440 Hz with an amplitude of 1 for 1 second.
I understand that this is going to create 440 peaks in 1 sec with Amplitude as 1.
Here I see that it is a 32-bit file and the sample rate is 44100 Hz, which means there are 44100 samples per second. The amplitude is 1, which is as expected because that is what I chose.
What I don't understand is: what is the unit of this amplitude? When right-clicked it shows linear (-1 to +1).
There is an option to select dB, which shows (0 to -60 to 0); I don't understand how this is converted.
Now I use this WAV file in Python with scipy to read the wave and get values of time and amplitude.
How do I match, or get the relation between, what I generated and what I see when I read the WAV file?
The peak amplitude is 32767.987724003342 and the frequency is 439.99002267573695.
The code I have used in Python is:
import numpy as np
from scipy.io import wavfile

wavFileName = "440Hz.wav"
sample_rate, sample_data = wavfile.read(wavFileName)
print("Sample Rate or Sampling Frequency is", sample_rate, "Hz")
l_audio = len(sample_data.shape)
print("Channels", l_audio, "Audio data shape", sample_data.shape, "l_audio", l_audio)
if l_audio == 2:
    sample_data = sample_data.sum(axis=1) / 2
N = sample_data.shape[0]
length = N / sample_rate
print("Duration of audio wav file in secs", length, "Number of Samples chosen", sample_data.shape[0])
time = np.linspace(0, length, sample_data.shape[0])
sampling_interval = time[1] - time[0]
Notice in Audacity that when you created the one second of audio with an amplitude choice of 1.0, right before saving the file it says signed 16-bit integer. So an amplitude from -1 to +1 means the WAV file in PCM format stores your raw audio as signed integers varying from their maximum negative to their maximum positive value. Since 2^16 is 65536, the signed 16-bit integer range is -32768 to 32767, in other words from -2^15 to (+2^15 - 1). To get a better plot I suggest you choose a time period much shorter than one second, say 0.1 seconds; once you are OK with that, boost it back to a full second, which is hard to visualize on a plot due to the 44100 samples.
import os
import scipy.io
import scipy.io.wavfile
import numpy as np
import matplotlib.pyplot as plt
myAudioFilename = '/home/olof/sine_wave_440_Hz.wav'
samplerate, audio_buffer = scipy.io.wavfile.read(myAudioFilename)
duration = len(audio_buffer)/samplerate
time = np.arange(0,duration,1/samplerate) #time vector
plt.plot(time,audio_buffer)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.title(myAudioFilename)
plt.show()
Here is 0.1 seconds of 440 Hz using signed 16-bit samples; notice that the Y-axis amplitude range matches the min-to-max signed integer range mentioned above.
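To relate the raw integer samples back to the linear -1 to +1 scale Audacity displays, a small added sketch (not part of the original answer, and assuming the file was exported as signed 16-bit PCM as described above) is to divide by 2^15:
import numpy as np
from scipy.io import wavfile

# Sketch: normalise signed 16-bit PCM samples to the -1..+1 "linear" scale.
sample_rate, samples = wavfile.read("440Hz.wav")    # file name from the question
normalised = samples.astype(np.float32) / 32768.0   # -32768..32767 -> -1.0..~+1.0
print(normalised.min(), normalised.max())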

Spark CSV GZip to Parquet?

I am using Spark 2.3.1 PySpark (AWS EMR)
I am getting memory errors:
Container killed by YARN for exceeding memory limits
Consider boosting spark.yarn.executor.memoryOverhead
I have an input of 160 files, each approximately 350-400 MB, each in gzipped CSV format.
To read the csv.gz files (with a wildcard) I use this PySpark:
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")
To save the data frame I use this (PySpark)
(dfgz
.write
.partitionBy("yyyymm")
.mode("overwrite")
.format("parquet")
.option("path", "s3://mybucket/mytable_parquet")
.saveAsTable("data_test.mytable")
)
One line of code to save all 160 files.
I tried this with 1 file and it works fine.
Total size of all 160 files (csv.gz) is about 64 GB.
Each file, as pure CSV when unzipped, is approximately 3.5 GB. I am assuming Spark may unzip each file in RAM and then convert it to Parquet in RAM?
I want to convert each csv.gzip file to Parquet format i.e. I want 160 Parquet files as output (ideally).
The task runs for a while and it seems to create 1 Parquet file for each csv.gz file. After some time it always fails with a YARN memory error.
I tried various settings for executor memory and memoryOverhead, and all resulted in no change: the job always fails. I tried memoryOverhead of up to 1-8 GB and executor memory of 8 GB.
Apart from manually breaking up the 160-file input workload into many small workloads, what else can I do?
Do I need a Spark cluster with a total RAM capacity of much greater than 64 GB?
I use 4 slave nodes, each with 8 CPUs and 16 GB of RAM, plus one master with 4 CPUs and 8 GB of RAM.
This is (with overhead) less than the 64 GB of input gzip CSV files I am trying to process, but the files are evenly sized at 350-400 MB, so I don't understand why Spark is throwing memory errors: it could easily process these one file at a time per executor, discard it, and move on to the next file. It does not appear to work this way. I feel it is trying to load all input csv.gz files into memory, but I have no way of knowing (I am still new to Spark 2.3.1).
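For reference, a hedged sketch (an added illustration, not from the original post) of one commonly suggested mitigation: gzip files are not splittable, so each input file becomes a single task holding the whole decompressed file, and repartitioning after the read spreads the rows over more, smaller write tasks. The factor of 8 partitions per input file is an arbitrary illustrative choice:
# Hedged sketch: read as before, then repartition before the Parquet write.
dfgz = spark.read.load("s3://mybucket/yyyymm=201708/datafile_*.csv.gz",
    format="csv", sep="^", inferSchema="false", header="false",
    multiLine="true", quote="^", nullValue="~", schema="id string,....")

(dfgz
 .repartition(160 * 8)   # illustrative: ~8 smaller tasks per input file
 .write
 .partitionBy("yyyymm")
 .mode("overwrite")
 .format("parquet")
 .option("path", "s3://mybucket/mytable_parquet")
 .saveAsTable("data_test.mytable"))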
Late Update: I managed to get it to work with the following memory config:
4 slave nodes, each 8 CPU and 16 GB of RAM
1 master node, 4 CPU and 8 GB of RAM:
spark maximizeResourceAllocation false
spark-defaults spark.driver.memoryOverhead 1g
spark-defaults spark.executor.memoryOverhead 2g
spark-defaults spark.executor.instances 8
spark-defaults spark.executor.cores 3
spark-defaults spark.default.parallelism 48
spark-defaults spark.driver.memory 6g
spark-defaults spark.executor.memory 6g
Needless to say - I cannot explain why this config worked!
Also, this took over 2 hours to process 64 GB of gzip data, which seems slow even for a small 4+1 node cluster with a total of 32+4 CPUs and 64+8 GB of RAM. Perhaps S3 was the bottleneck.
FWIW I just did not expect to micro-manage a database cluster for memory, disk I/O or CPU allocation.
Update 2:
I just ran another load on the same cluster with the same config: a smaller load of 129 files of the same sizes, and it failed with the same YARN memory errors.
I am very disappointed with Spark 2.3.1 memory management.
Thank you for any guidance

Python 1Hz measurement temperature

I want to create a CSV file in Python, storing data from the sensor plus a timestamp of each reading. But the sensor measures fast and I need exactly one measurement from the sensor exactly every second. For example, the sensor value is 20 at time 12:34:15 and I need the value exactly at 12:34:16. I cannot use time.sleep because it creates a delay of more than a second, which will affect the log file if I have to take more than a hundred readings.
Consumer PCs do not have real-time operating systems; there is no guarantee that a particular process will execute at least once per second, and certainly no guarantee that it will be executing at each 1-second interval. If you want precisely timed measurements with Python, you should look at MicroPython executing on a microcontroller board. It may be able to do what you want. Python on a Raspberry Pi board might also work better than a PC.
On a regular PC, I would start with something using perf_counter.
from time import perf_counter as timer
from somewhere import sensor, save  # read temperature, save value

t0 = t1 = timer()
delta = .99999  # adjust by experiment to average 1 sec reading intervals
while True:
    while t1 - t0 < delta:
        t1 = timer()
    value = sensor()
    save(value)
    t0 = t1
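Since the question also asks about logging each reading with a timestamp to a CSV file, here is a small added sketch combining the timing loop above with the csv module (the file name and the count of 100 readings are illustrative; sensor is the same placeholder import as above):
import csv
from datetime import datetime
from time import perf_counter as timer
from somewhere import sensor  # placeholder sensor read, as above

with open("sensor_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "value"])
    t0 = t1 = timer()
    delta = .99999
    for _ in range(100):        # e.g. one hundred one-second readings
        while t1 - t0 < delta:  # busy-wait until ~1 second has elapsed
            t1 = timer()
        writer.writerow([datetime.now().isoformat(), sensor()])
        t0 = t1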

Multiple regression with lagged time series using libsvm

I'm trying to develop a forecaster for electric consumption, so I want to perform a regression using daily data for an entire year. My dataset has several features. Googling, I've found that my problem is a multiple regression problem (please correct me if I am mistaken).
What I want to do is train an SVM for regression with several independent variables and one dependent variable with n lagged days. Here's a sample of my independent variables; I actually have around 10. (We used PCA to determine which variables had some correlation to our problem.)
Day Indep1 Indep2 Indep3
1 1.53 2.33 3.81
2 1.71 2.36 3.76
3 1.83 2.81 3.64
... ... ... ...
363 1.5 2.65 3.25
364 1.46 2.46 3.27
365 1.61 2.72 3.13
And independent variable 1 is actually my dependent variable in the future. So, for example, with p=2 (lagged days) I would expect my SVM to train with the first 2 time steps of all three independent variables.
Indep1 Indep2 Indep3
1.53 2.33 3.81
1.71 2.36 3.76
And the output value of the dependent variable would be "1.83" (independent variable 1 at time 3).
My main problem is that I don't know how to train properly. What I was doing is just putting all features for the p lagged days in one array for my "x" variables, and for my "y" variable I'm putting my independent variable at time p+1, since I want to predict the next day's power consumption.
Example of training.
x with p = 2 and 3 independent variables y for next day
[1.53, 2.33, 3.81, 1.71, 2.36, 3.76] [1.83]
I tried with x being a two-dimensional array, but when I combine it for several days it becomes a 3D array, and libsvm says it can't handle that.
Perhaps I should change from libsvm to another tool or maybe it's just that I'm training incorrectly.
Thanks for your help,
Aldo.
Let me answer with the python / numpy notation.
Assume the original time series data matrix with columns (Indep1, Indep2, Indep3, ...) is a numpy array data with shape (n_samples, n_variables). Let's generate it randomly for this example:
>>> import numpy as np
>>> n_samples, n_variables = 100, 5
>>> data = np.random.randn(n_samples, n_variables)
>>> data.shape
(100, 5)
If you want to use a window size of 2 time-steps, then the training set can be built as follows:
>>> targets = data[2:, 0] # shape is (n_samples - 2,)
>>> targets.shape
(98,)
>>> features = np.hstack([data[0:-2, :], data[1:-1, :]]) # shape is (n_samples - 2, n_variables * 2)
>>> features.shape
(98, 10)
Now you have your 2D input array + 1D targets that you can feed to libsvm or scikit-learn.
Edit: it might very well be the case that extracting more time-series-oriented features such as a moving average, moving min, moving max, moving differences (time-based derivatives of the signal) or an STFT would help your SVM model make better predictions.
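For illustration only (an added sketch under the same random-data assumption as above; the window size of 3 is arbitrary), such features can be derived with plain numpy:
import numpy as np

# Same assumption as above: data has shape (n_samples, n_variables).
data = np.random.randn(100, 5)

window = 3
# Moving average of each column over `window` consecutive time steps.
moving_avg = np.vstack([
    np.convolve(data[:, j], np.ones(window) / window, mode="valid")
    for j in range(data.shape[1])
]).T                             # shape (n_samples - window + 1, n_variables)

# First-order differences, a crude time-based derivative of each signal.
diffs = np.diff(data, axis=0)    # shape (n_samples - 1, n_variables)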