Python\\ How to find out duplicate elements in 3 files [closed] - duplicates

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 3 years ago.
Improve this question
I have 3 files and I need to find out duplicate elements that appear in different files twice at-least.

The simplest method is to read each file into memory and then compare the results.
For example given two lists you can do this to identify the difference (items in the first that are not in the second).
list(set(['foo', 'bar']) - set(['bar']))
So if you had three lists s1, s2, and s3 you could:
s1 = ['a', 'b', 'c', 'd']
s2 = ['b', 'c']
s3 = ['c', 'd']
list(set(s1) - set(s2) - set(s3))
// gives us ['a']
Now we can take that and apply it to reading the files.
This example makes a few assumptions:
- you're comparing lines in the file. If this isn't accurate, you'll need to do your own list/set preparation after reading the files
- You simply want to identify the unique lines, if you want to do something else with the duplicates you'll need to modify it accordingly.
with open('s1.txt') as f:
s1 = f.readlines()
with open('s2.txt') as f:
s2 = f.readlines()
with open('s3.txt') as f:
s3 = f.readlines()
unique_lines = list(set(s1) - set(s2) - set(s3))
print(unique_lines)
Note: This is not particularly performant for large files/datasets but is sufficient for most simple examples.
Update: As per comments, to find the duplicates themselves, you can union the intersection between each set.
>>> s1 = set(['a', 'b', 'c', 'd'])
>>> s2 = set(['x', 'c'])
>>> s3 = set(['z', 'd'])
>>> s1 & s2
{'c'}
>>> s2 & s3
set()
>>> s3 & s1
{'d'}
>>> s1 & s2 | s2 & s3 | s3 & s1
{'d', 'c'}
Regarding your data size, unless you have specific low memory profile constraints, just be aware it might take a few hundred Mb memory when executing. This is because you have all three data sets in memory.

Related

How do I use two loss for two dataset in one pytorch nn? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I am quite new to pytorch and deep learning. Here is my question. I have two different datasets with the same feature domain sharing one neural network for a regression problem. The input is the features and the output is the target value. The first dataset uses a normal loss while the second dataset, I am trying to create a new loss for it.
I have searched multi-loss problems, people usually have two loss summed up for the backward process. But I want to use the loss in turn. (When I train the first dataset, the nn uses the first loss and when I train the second dataset, the nn uses the other loss)
Is this possible to do? Appreciate if anyone has some idea.
The loss function does not necessarily have to do with network topology. You can use the corresponding loss with each dataset you use, e.g.
if first_task:
dataloader = torch.utils.data.DataLoader(first_dataset)
loss_fn = first_loss_fn
else:
dataloader = torch.utils.data.Dataloader(second_dataset)
loss_fn = second_loss_fn
# The pytorch training loop, very roughly
for batch in dataloader:
x, y = batch
optimizer.zero_grad()
loss = loss_fn(network.forward(x), y) # calls the corresponding loss function
loss.backward()
optimizer.step()
You can do this for the two datasets sequentially (meaning you interleave by epochs):
for batch in dataloader_1:
...
loss = first_loss_fn(...)
for batch in dataloader_2:
...
loss = second_loss_fn(...)
or better
dataset = torch.utils.data.ChainDataset([first_dataset, second_dataset])
dataloader = torch.utils.data.DataLoader(dataset)
You can also do simultaneously (interleave by examples). The standard way I think would be to use torch.utils.data.ConcatDataset
dataset = torch.utils.data.ConcatDataset([first_dataset, second_dataset])
dataloader = torch.utils.data.DataLoader(dataset)
Note that here you need each sample to store information about the dataset it comes from so you can determine which cost to apply.
A simpler way would be to interleave by batches (then you apply the same cost to the entire batch). For this case one way proposed here is to use separate dataloaders (this way you get flexibility on how often to sample each of them).

Is there a many to many convolution in Pytorch? is this a thing? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I have been thinking about convolutions recently. There are common 3by3 convs, where (3,3) kernel's information is weighted and aggregated to supply information to a single spatial point on the output. There are also 3 by 3 upconvs, where a single spatial point on the input supplies weighted information to a 3 by 3 output space.
The conv is a many to one relationship and the upconv is a one to many relationship.
I have however never heard of many to many conv? is there such a thing? For example, a 3by3 kernel supplying information to another 3by3 kernel. I would like to experiment with it in PyTorch. My internet searching has not revealed anything.
You can combine pixel shuffle and averaging to get what you want.
for example, if you want 3x3 -> 3x3 mapping with in_channels to out_channels:
from torch import nn
import torch.nn.functional as nnf
class ManyToManyConv2d(nn.Module):
def __init__(in_channels, out_channels, in_kernel, out_kernel):
self.out_kernel = out_kernel
self.conv = nn.Conv2d(in_channels, out_channles * out_kernel * out_kernel, in_kernel)
def forward(self, x):
y = self.conv(x) # all the output kernel are "folded" into the channel dim
y = nnf.pixel_shuffle(y, self.out_kernel) # "unfold" the out_kernel - image size *out_kernel bigger
y = nnf.avg_pool2d(y, self.out_kernel)
return y

Created factors with EFA, tried regressing (lm) with control variables - Error message "variable lengths differ"

EFA first-timer here!
I ran an Exploratory Factor Analysis (EFA) on a data set ("df1" = 1320 observations) with 50 variables by creating a subset with relevant variables only that have no missing values ("df2" = 301 observations).
I was able to filter 4 factors (19 variables in total).
Now I would like to take those 4 factors and regress them with control variables.
For instance: Factor 1 (df2$fa1) describes job satisfaction.
I would like to control for age and marital status.
Fa1Regression <- lm(df2$fa1 ~ df1$age + df1$marital)
However I receive the error message:
Error in model.frame.default(formula = df2$fa1 ~ df1$age + :
variable lengths differ (found for 'df1$age')
What can I do to run the regression correctly? Can I delete observations from df1 that are nonexistent in df2 so that the variable lengths are the same?
Its having a problem using lm to regress a latent factor on other coefficients. Instead, use the lavaan package, where your model statement would be myModel<- 'df2$fa1~ x1+x2+x3'

Understanding stateful LSTM [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows :
1. Training batching size
In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch ?
2. One-Char Mapping Problems
In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping' were given a code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K') and predicts 'B' then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (last part too, I kept lines 52 and above):
# demonstrate a random starting point
letter1 = "M"
seed1 = [char_to_int[letter1]]
x = numpy.reshape(seed, (1, len(seed), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed1[0]], "->", int_to_char[index])
letter2 = "E"
seed2 = [char_to_int[letter2]]
seed = seed2
print("New start: ", letter1, letter2)
for i in range(0, 5):
x = numpy.reshape(seed, (1, len(seed), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed[0]], "->", int_to_char[index])
seed = [index]
model.reset_states()
and these outputs:
M -> B
New start: M E
E -> C
C -> D
D -> E
E -> F
It looks like the LSTM did not learn the alphabet but just the positions of the letters, and that regardless of the first letter we feed in, the LSTM will always predict B since it's the second letter, then C and so on.
Therefore, how does keeping the previous hidden state as initial hidden state for the current hidden state help us with the learning given that during test if we start with the letter 'K' for example, letters A to J will not have been fed in before and the initial hidden state won't be the same as during training ?
3. Training an LSTM on a book for sentence generation
I want to train my LSTM on a whole book to learn how to generate sentences and perhaps learn the authors style too, how can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe I should use stateful LSTMs could help but I'm not sure how.
Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you could check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful; of course the state will always flow within each sequence in the batch and you do not need to have batches of size 1 for that to happen. I see two scenarios where stateful models are useful:
You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
On prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with batch size of 1).
I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of examples (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You may have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNN with textual data, but there is a number of approaches you can research. You can build character-based models (like the ones in the post), where your input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions to do that. Having one single number as feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each for, which is what they call en embedding. You can go even further with the preprocessing and look into something like NLTK, specially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you) you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be the similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the autor, your output could be a softmax Dense layer.
Hope that helps.

What is out of bag error in Random Forests? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
What is out of bag error in Random Forests?
Is it the optimal parameter for finding the right number of trees in a Random Forest?
I will take an attempt to explain:
Suppose our training data set is represented by T and suppose data set has M features (or attributes or variables).
T = {(X1,y1), (X2,y2), ... (Xn, yn)}
and
Xi is input vector {xi1, xi2, ... xiM}
yi is the label (or output or class).
summary of RF:
Random Forests algorithm is a classifier based on primarily two methods -
Bagging
Random subspace method.
Suppose we decide to have S number of trees in our forest then we first create S datasets of "same size as original" created from random resampling of data in T with-replacement (n times for each dataset). This will result in {T1, T2, ... TS} datasets. Each of these is called a bootstrap dataset. Due to "with-replacement" every dataset Ti can have duplicate data records and Ti can be missing several data records from original datasets. This is called Bootstrapping. (en.wikipedia.org/wiki/Bootstrapping_(statistics))
Bagging is the process of taking bootstraps & then aggregating the models learned on each bootstrap.
Now, RF creates S trees and uses m (=sqrt(M) or =floor(lnM+1)) random subfeatures out of M possible features to create any tree. This is called random subspace method.
So for each Ti bootstrap dataset you create a tree Ki. If you want to classify some input data D = {x1, x2, ..., xM} you let it pass through each tree and produce S outputs (one for each tree) which can be denoted by Y = {y1, y2, ..., ys}. Final prediction is a majority vote on this set.
Out-of-bag error:
After creating the classifiers (S trees), for each (Xi,yi) in the original training set i.e. T, select all Tk which does not include (Xi,yi). This subset, pay attention, is a set of boostrap datasets which does not contain a particular record from the original dataset. This set is called out-of-bag examples. There are n such subsets (one for each data record in original dataset T). OOB classifier is the aggregation of votes ONLY over Tk such that it does not contain (xi,yi).
Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the training set (compare it with known yi's).
Why is it important?
The study of error estimates for bagged classifiers in Breiman
[1996b], gives empirical evidence to show that the out-of-bag estimate
is as accurate as using a test set of the same size as the training
set. Therefore, using the out-of-bag error estimate removes the need
for a set aside test set.1
(Thanks #Rudolf for corrections. His comments below.)
In Breiman's original implementation of the random forest algorithm, each tree is trained on about 2/3 of the total training data. As the forest is built, each tree can thus be tested (similar to leave one out cross validation) on the samples not used in building that tree. This is the out of bag error estimate - an internal error estimate of a random forest as it is being constructed.