The meaning of time_step in time series data

I have an application that generates 50 data points per second, which means its sampling rate is 50 Hz, so the data is a time series. I prepare my data like below before training.
steps = 50
inp = []
out = []
for i in range(len(data_scaled_out) - steps):
    inp.append(data_scaled_in[i:i + steps])
    out.append(data_scaled_out[i + steps])
Each input sample then has shape (50, 89): 50 time steps, each with 89 features.
If I use the input shape (50, 89), is that the same as changing the model's sampling rate to 1 Hz, meaning the model can only predict one value per second? My professor said that using the input shape (50, 89) is the same as downsampling the frequency. Is that true?
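For illustration, here is what that loop produces with hypothetical array sizes (10 seconds of 50 Hz data, 89 input features, one target column):
import numpy as np

# Hypothetical data: 10 s at 50 Hz, i.e. 500 samples, with 89 input features and 1 target.
data_scaled_in = np.random.rand(500, 89)
data_scaled_out = np.random.rand(500, 1)

steps = 50
inp, out = [], []
for i in range(len(data_scaled_out) - steps):
    inp.append(data_scaled_in[i:i + steps])
    out.append(data_scaled_out[i + steps])

inp, out = np.array(inp), np.array(out)
print(inp.shape, out.shape)  # (450, 50, 89) and (450, 1)
# The window slides by one sample, so a new (50, 89) input, and hence a new
# prediction, becomes available for every incoming sample; the 50 steps are a
# look-back length, not a change of sampling rate.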


Why is transformer decoder always generating output of same length as gold labels?

I am generating summaries with a fine-tuned BART model, and I've noticed something strange. If I feed the labels to the model, it always generates summaries of the same length as the label, whereas if I do not pass the labels, it generates outputs of length 1024 (BART's maximum sequence length). This is unexpected, so I'm trying to understand whether there is a problem or bug in the reproducible example below.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model=AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')
tokenizer=AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
sentence_to_summarize = ['This is a text to summarise. I just went for a walk in the park and saw very large crowds gathering to watch an impromptu football match']
encoded_dict = tokenizer.batch_encode_plus(sentence_to_summarize, return_tensors='pt', max_length=1024, padding='max_length')
input_ids = encoded_dict['input_ids']
attention_mask = encoded_dict['attention_mask']
label = tokenizer.encode('I went to the park', return_tensors='pt')
Notice the following two cases.
Case 1:
output = model(input_ids=input_ids, attention_mask=attention_mask)
print(output['logits'].shape)
shape printed is torch.Size([1, 1024, 50264])
Case 2:
output = model(input_ids=input_ids, attention_mask=attention_mask, labels=label)
print(output['logits'].shape)
shape printed is torch.Size([1, 7, 50264]) where 7 is the length of the label 'I went to the park' (including start and end tokens).
Ideally the summarization model would learn when to generate the EOS token, but this should not always lead to summaries of identical length to the gold output (i.e. the label). Why is the label length influencing the model output in this way?
I would expect the only difference between cases 1 and 2 to be that in the second case the output also contains the loss value; I wouldn't expect the labels to influence the logits in any way.
The original example does not use the labels parameter:
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.example
The labels parameter is optional and, I think, is not meant to be used for summarizing:
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.labels
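For summarization itself, the usual route is model.generate() rather than a forward pass with labels; a minimal sketch continuing the example above (the generation settings are illustrative, not tuned):
# Generate a summary with beam search instead of calling forward() with labels.
summary_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    num_beams=4,
    max_length=60,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))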

How do I load length-frequency histogram data into mixtools?

I want to use mixtools to separate 1, 2, and 3+ year old cohorts in shellfish length-frequency data. I am totally new to R coding. The package example uses the Old Faithful geyser data, but that is merely a list of 272 data points. I have various tables of lengths (size-class midpoints) and frequencies, generally about 15 length classes with counts between 0 and 50 in each. I can create a data frame from my MS Excel table, but I am not sure how to call normalmixEM() with it. Thanks.

Impact of large dataset on the trained model size?

If the dataset is large, does that mean the model size will also be large?
For example, suppose the real-world data follow the function y = ax + b:
the dataset is all the data points on this line,
the model is our mathematical representation of these data,
and the model size is the number of parameters we use for that representation.
The model size is therefore related to the real-world function we want to represent.
We can create a huge dataset for the line ax + b, but a model of this function only needs two parameters, a and b, and two points on the line are already enough to fit it.
Real-world data are not as simple as a line, but if several models give the same result, the one with the fewest parameters is the best one.
The training dataset size you need depends on the complexity of your model.
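As a minimal illustration of this point, a sketch with synthetic NumPy data (the dataset size and the noise level are arbitrary assumptions):
import numpy as np

# A "huge" dataset: one million points on the line y = 3x + 2, plus a little noise.
x = np.linspace(0, 10, 1_000_000)
y = 3 * x + 2 + np.random.normal(scale=0.1, size=x.shape)

# The fitted model is still just two parameters, a and b, no matter how many points we use.
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # approximately 3 and 2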

What algorithm can be used to filter higher-dimensional data points?

I have 4-dimensional data points stored in a MySQL database on my server: one time dimension plus three spatial GPS dimensions (lat, lon, alt). GPS positions are sampled at 1-minute intervals for thousands of users and are added to my server 24x7.
A sample REST POST JSON looks like:
{
    "id": "1005",
    "location": {
        "lat": -87.8788,
        "lon": 37.909090,
        "alt": 0.0
    },
    "datetime": 11882784
}
Now I need to filter out all the candidates (user IDs) whose positions were within k metres of a given user ID during a given time period.
Sample REST GET query params for filtering look like:
{
    "id": "1001",          // user for whom we need to find candidate IDs
    "maxDistance": 3,      // max distance in metres (Euclidean distance from the user's location to a candidate's location)
    "maxDuration": 14      // duration offset (in days) back from the current datetime
}
As you can see, thousands of entries are inserted into my database every minute, which results in a huge number of total entries. I am afraid a trivial naive approach that iterates over all entries for filtering won't be feasible for my current requirements. So what algorithm should I implement on the server? I have tried a naive algorithm like this:
params ($uid, $mDis, $mDay)
    Init $candidates = []
    For each location $Li of user $uid:
        For each location $Di in the database within the last $mDay days:
            $dif = EuclideanDist($Li, $Di)
            If $dif < $mDis:
                Add the userId of $Di to $candidates
    Return $candidates
However, this approach is very slow in practice, and pre-calculation might not be feasible because it would cost huge space for all user IDs. What other algorithm could improve efficiency?
You could implement a spatial hashing algorithm to efficiently query your database for candidates within a given area/time.
Divide the 3D space into a 3D grid of cubes of width k. When inserting a data point into your database, compute which cube the point lies in and compute a hash value based on the cube's coordinates.
When querying for all data points within k of another data point d, compute the cube that d sits in and find the 26 adjacent cubes (+/- 1 in each dimension). Compute the hash values of these 27 cubes and query your database for all entries with those hash values within the given time period. This gives you a small candidate set which you can then iterate over to find all data points within k of d.
If your value of k can range between 2 and 5 metres, give your cubes a width of 5.
Timestamps can be stored as a separate field; alternatively you can make your cells 4-dimensional and include the timestamp in the hash, and search 81 cells instead of 27.
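A minimal sketch of this idea in Python (the grid helpers, the cell width, and the assumption that coordinates have already been projected to metres are all illustrative, not a specific database API):
import math

CELL = 5.0  # cube width in metres, chosen to cover k values in the 2-5 m range

def cell_key(x, y, z):
    # Map a point (in metres, e.g. after projecting lat/lon/alt) to its cube's integer coordinates.
    return (math.floor(x / CELL), math.floor(y / CELL), math.floor(z / CELL))

def neighbour_keys(key):
    # The cube itself plus the 26 adjacent cubes (+/- 1 in each of the three dimensions).
    cx, cy, cz = key
    return [(cx + dx, cy + dy, cz + dz)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for dz in (-1, 0, 1)]

# At insert time: store cell_key(...) (or a hash of it) alongside the row and index that column.
# At query time: fetch only the rows whose stored key is in neighbour_keys(cell_key(x, y, z))
# and whose timestamp falls in the requested window, then check the exact Euclidean distance.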

One Hot Encoding dimension - Model Complexity

I will explain my problem:
I have around 50,000 samples, each described by a list of codes representing "events".
The number of unique codes is around 800.
The maximum number of codes a sample can have is around 600.
I want to represent each sample using one-hot encoding. If we pad the samples that have fewer codes, the representation is an 800x600 matrix.
Giving this representation as input to a network means flattening each matrix into a vector of 800x600 = 480,000 values.
In the end the dataset would consist of 50,000 vectors of size 480,000.
Now, I have two considerations:
How is it possible to handle a dataset of that size? (I tried a data generator to build the representation on the fly, but it is really slow.)
Having a vector of 480,000 values as input for each sample means that the complexity of my model (the number of parameters to learn) is extremely high (around 15,000,000 in my case), so I would need a huge dataset to train the model properly, wouldn't I?
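For a sense of scale, a quick back-of-the-envelope check of both concerns (the 32-unit first layer is a hypothetical choice used only for illustration):
n_samples = 50_000
vec_len = 800 * 600                    # 480,000 values per flattened sample
bytes_dense = n_samples * vec_len * 4  # stored densely as float32
print(bytes_dense / 1e9, "GB")         # roughly 96 GB

hidden = 32                            # hypothetical width of the first Dense layer
params = vec_len * hidden + hidden     # weights plus biases of that single layer
print(params)                          # about 15.4 million parameters already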
Why don't you use the conventional approach from NLP?
These events can be translated, as you say, through an embedding matrix.
Then you can model the chains of events with an LSTM (or a GRU, a vanilla RNN, or a bidirectional LSTM); the difference from a conventional feed-forward network is that the same module is applied repeatedly, once per time step.
So your input is not really 480,000 values wide; internally, an event A indirectly helps the network learn about an event B, because the LSTM applies the same module to each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking, what I would do is the following (in Keras pseudo-code):
Detect the total number of events and build a list of the unique ones:
unique_events = list(set([event_0, ..., event_n]))
You can translate a sequence of events into indices with:
seq_events_idx = list(map(unique_events.index, seq_events))
Add the necessary padding to each sequence:
sequences_pad = pad_sequences(sequences, maxlen=max_seq)
Then you can use an Embedding layer to map each event to an associated vector of whatever dimension you consider appropriate:
input_ = Input(shape=(max_seq,), dtype='int32')
embedding = Embedding(len(unique_events),
                      dimensions,
                      input_length=max_seq,
                      trainable=True)(input_)
Then you define the architecture of your LSTM, for example:
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)(embedding)
Add the Dense layer with the output you want:
out = Dense(10, activation='softmax')(lstm)
I think this type of model can help you and give better results.
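Putting those snippets together, a minimal self-contained sketch (using tf.keras; the embedding size, LSTM width, and number of output classes are illustrative assumptions, and unlike the snippet above the LSTM here returns only its final state, so the Dense layer gives one prediction per sample):
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

max_seq = 600    # longest event sequence, from the question
n_events = 800   # number of unique event codes, from the question
dimensions = 64  # embedding size, an illustrative choice

input_ = Input(shape=(max_seq,), dtype='int32')
# +1 reserves index 0 for padding; mask_zero makes the LSTM ignore padded positions.
embedding = Embedding(n_events + 1, dimensions, mask_zero=True)(input_)
lstm = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedding)
out = Dense(10, activation='softmax')(lstm)

model = Model(inputs=input_, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()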