How to apply countvectorizer to bigrams in a pandas dataframe - nltk

I'm trying to apply the countvectorizer to a dataframe containing bigrams to convert it into a frequency matrix showing the number of times each bigram appears in each row but I keep getting error messages.
This is what I tried using
cereal['bigrams'].head()
0 [(best, thing), (thing, I), (I, have),....
1 [(eat, it), (it, every), (every, morning),...
2 [(every, morning), (morning, my), (my, brother),...
3 [(I, have), (five, cartons), (cartons, lying),...
.........
bow = CountVectorizer(max_features=5000, ngram_range=(2,2))
train_bow = bow.fit_transform(cereal['bigrams'])
train_bow
Expected results
(best,thing) (thing, I) (I, have) (eat,it) (every,morning)....
0 1 1 1 0 0
1 0 0 0 1 1
2 0 0 0 0 1
3 0 0 1 0 0
....

I see you are trying to convert a pd.Series into a count representation of each term.
Thats a bit different from what CountVectorizer does;
From the function description:
Convert a collection of text documents to a matrix of token counts
The official example of case use is:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
So, as one can see, it takes as input a list where each term is a "document".
Thats problaby the cause of the errors you are getting, you see, you are passing a pd.Series where each term is a list of tuples.
For you to use CountVectorizer you would have to transform your input into the proper format.
If you have the original corpus/text you can easily implement CountVectorizer on top of it (with the ngram parameter) to get the desired result.
Else, best solution wld be to treat it as it is, a series with a list of items, which must be counted/pivoted.
Sample workaround:
(it wld be a lot easier if you just use the text corpus instead)
Hope it helps!

Related

How can I merge/join multiple columns from two dataframes, depending on a matching pattern

I would like to merge two dataframes based on similar patterns in the chromosome column. I made various attempts with R & BASH such as with "data.table" "tidyverse", & merge(). Could someone help me by providing alternative solutions in R, BASH, Python, Perl, etc. for solving this solution? I would like to merge based on the chromosome information and retain both counts/RXNs.
NOTE: These two DFs are not aligned and I am also curious what happens if some values are missing.
Thanks and Cheers:
DF1:
Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"
DF2:
Chromosome;Count1;Count2;Count3;Count4;Count5
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0
Expected Result:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
1009250;q9hxn4;5;0;0;17;0
1010820;p16256;152;7;0;11;4
31783;p16588;1;0;0;0;0
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
As bash was mentioned in the text body, I offer you an awk solution. The dataframes are in files df1 and df2:
$ awk '
BEGIN {
FS=OFS=";" # input and output field delimiters
}
NR==FNR { # process df1
a[$1]=$2 # hash to an array, 1st is the key, 2nd the value
next # process next record
}
{ # process df2
$2=(a[$1] OFS $2) # prepend RXN field to 2nd field of df2
}1' df1 df2 # 1 is output command, mind the file order
The 2 last lines could be written perhaps more clearly:
...
{
print $1,a[$1],$2,$3,$4,$5,$6
}' df1 df2
Output:
Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
1010820;p16256;152;7;0;11;4
1009250;q9hxn4;5;0;0;17;0
31783;p16588;1;0;0;0;0;0
Output will be in the order of df2. Chromosome present in df1 but not in df2 will not be included. Chromosome in df2 but not in df1 will be output from df2 with empty RXN field. Also, if there are duplicate chromosomes in df1, the last one is used. This can be fixed if it is an issue.
If I understand your request correctly, this should do it in Python. I've made the Chromosome column into the index of each DataFrame.
from io import StringIO
txt1 = '''Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"'''
txt2 = """Chromosome;Count1;Count2;Count3;Count4;Count5;Count6
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0"""
df1 = pd.read_csv(
StringIO(txt1),
sep=';',
index_col=0,
header=0
)
df2 = pd.read_csv(
StringIO(txt2),
sep=';',
index_col=0,
header=0
)
DF1:
RXN ID
Chromosome
1009250 q9hxn4 NaN
1010820 p16256 NaN
31783 p16588 PNTOt4;PNTOt4pp
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH
DF2:
Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 1 31 1 0 0 0.0
1010820 152 7 0 11 4 NaN
1009250 5 0 0 17 0 NaN
31783 1 0 0 0 0 0.0
result = pd.concat(
[df1.sort_index(), df2.sort_index()],
axis=1
)
print(result)
RXN ID Count1 Count2 Count3 Count4 Count5 Count6
Chromosome
203 3-DEHYDROQUINATE-DEHYDRATASE-RXN DHQTi;DQDH 1 31 1 0 0 0.0
31783 p16588 PNTOt4;PNTOt4pp 1 0 0 0 0 0.0
1009250 q9hxn4 NaN 5 0 0 17 0 NaN
1010820 p16256 NaN 152 7 0 11 4 NaN
The concat command also handles mismatched indices by simply filling in NaN values for columns in e.g. df1 if df2 doesn't have have the same index, and vice versa.

How to find the nodes of a triangle from an adjacency matrix in Octave

I know how to find the number of triangles in an adjacency matrix.
tri = trace(A^3) / 6
But i require to find the nodes so that i can finally find the value of the edges from adjacency matrix since it's a sign graph. Is there already existing function which does that?
Taking the power of the adjacency matrix loses information about the intermediate nodes. Instead of a 2-dimensional matrix, we need 3 dimensions.
Given a graph:
and its adjacency matrix:
A =
0 0 0 0 1 1 0 1 0 0
0 0 0 1 0 1 0 0 0 0
0 0 0 1 0 0 0 1 0 1
0 1 1 0 1 0 1 0 0 0
1 0 0 1 0 0 1 0 0 0
1 1 0 0 0 0 0 1 1 0
0 0 0 1 1 0 0 0 1 0
1 0 1 0 0 1 0 0 0 0
0 0 0 0 0 1 1 0 0 0
0 0 1 0 0 0 0 0 0 0
Compute the 3d matrix T such that T(i,j,k) == 1 iff there is a path in the graph i=>j=>k=>i.
T = and(A, permute(A, [3 1 2]))
This is the equivalent of squaring the adjacency matrix, but keeping the path information. and is used here instead of multiplication in case A is a weighted adjacency matrix. If you sum along the 2nd dimension, you'll get A^2:
>> isequal(squeeze(sum(T,2)), A^2)
ans = 1
Now that we've got the paths of length 2, we just need to filter so we keep only the paths that return to their starting points.
T = and(T, permute(A.', [1 3 2])); % Transpose A in case graph is directed
Now, if T(i,j,k) == 1, then there is a triangle starting at node i, through nodes j and k and returning to node i. If you want to find all such paths:
[M,N,P] = ind2sub(size(T), find(T));
P = [M,N,P];
P will be a list of all triangular paths:
P =
8 6 1
6 8 1
7 5 4
5 7 4
7 4 5
4 7 5
8 1 6
1 8 6
5 4 7
4 5 7
6 1 8
1 6 8
In this case we get 12 paths. All paths in an undirected graph have 6 duplicates: one starting at each triangle point, times 2 directions. This gives the same results as trace:
>> trace(A^3)
ans = 12
If you want to remove the duplicates, the simplest way for triangles is to simply sort the vertex ordering and then take the unique rows of the list. This works for triangles only because all permutations of the nodes in the cycle are present. For longer cycles, this will not work.
P = unique(sort(P, 2), 'rows');
P =
1 6 8
4 5 7
Here is a solution using matrix multiplication:
C = (A * A.') & A;
[x, y] = find(tril(C));
n = numel(x);
D = sparse([x; y], [1:n 1:n].', 1, size(A,1), n);
[X, ~, V] = find(C * D);
tri = [x y X(V == 2)]
tri = unique(sort(tri, 2), 'rows');
First we need to know what are triangle nodes. Two nodes are triangle nodes if they have a common neighbor and both of them are neighbor of each other.
We take the definition to compute an adjacency matrix C that only contains triangle nodes and all other node are removed.
The expression A * A.' selects nodes that have common neighbors and the & A operator says that those nodes that have common neighbors should by neighbor of each other.
Now we can use [x, y] = find(tril(C)); to extract the first and the second points of each triangle as x and y respectively.
For the third node we need to find a node that has x and y as its neighbors. As before we can use the multiplication of boolean matrix trick to speed up the computation.
Finally the result tri has duplicates that should be remove using unique and sort.

What are the states and rewards in the reward matrix?

This code :
R = ql.matrix([ [0,0,0,0,1,0],
[0,0,0,1,0,1],
[0,0,100,1,0,0],
[0,1,1,0,1,0],
[1,0,0,1,0,0],
[0,1,0,0,0,0] ])
is from :
https://github.com/PacktPublishing/Artificial-Intelligence-By-Example/blob/47bed1a88db2c9577c492f950069f58353375cfe/Chapter01/MDP.py
R is defined as the "Reward matrix for each state" . What are the states and rewards in this matrix ?
# Reward for state 0
print('R[0,]:' , R[0,])
# Reward for state 0
print('R[1,]:' , R[1,])
prints :
R[0,]: [[0 0 0 0 1 0]]
R[1,]: [[0 0 0 1 0 1]]
Is [0 0 0 0 1 0] state0 & [0 0 0 1 0 1] state1 ?
According to the book that uses that example, R represents the reward of the transitions from one current state s to another next state s'.
Specifically, R is associated with the following graph:
Each line in the matrix R represents a letter from A to F, and each column represents a letter from A to F. The 1 values represent the nodes of the graphs. I.e., R[0,]: [[0 0 0 0 1 0]] means that you can go from state s=A to next state s'=E and receive a reward of 1. Similarly, R[1,]: [[0 0 0 1 0 1]] means that you receive a reward of 1 if you go from B to F or D. The goal seems to be achieving and remaining in C, which obtains the largest reward.

Octave element wise comparisons [duplicate]

let us consider following code for impulse function
function y=impulse_function(n);
y=0;
if n==0
y=1;
end
end
this code
>> n=-2:2;
>> i=1:length(n);
>> f(i)=impulse_function(n(i));
>>
returns result
f
f =
0 0 0 0 0
while this code
>> n=-2:2;
>> for i=1:length(n);
f(i)=impulse_function(n(i));
end
>> f
f =
0 0 1 0 0
in both case i is 1 2 3 4 5,what is different?
Your function is not defined to handle vector input.
Modify your impluse function as follows:
function y=impulse_function(n)
[a b]=size(n);
y=zeros(a,b);
y(n==0)=1;
end
In your definition of impulse_function, whole array is compared to zero and return value is only a single number instead of a vector.
In the first case you are comparing an array to the value 0. This will give the result [0 0 1 0 0], which is not a simple true or false. So the statement y = 0; will not get executed and f will be [0 0 0 0 0] as shown.
In the second you are iterating through the array value by value and passing it to the function. Since the array contains the value 0, then you will get 1 back from the function in the print out of f (or [0 0 1 0 0], which is an impulse).
You'll need to modify your function to take array inputs.
Perhaps this example will clarify the issue further:
cond = 0;
if cond == 0
disp(cond) % This will print 0 since 0 == 0
end
cond = 1;
if cond == 0
disp(cond) % This won't print since since 1 ~= 0 (not equal)
end
cond = [-2 -1 0 1 2];
if cond == 0
disp(cond) % This won't print since since [-2 -1 0 1 2] ~= 0 (not equal)
end
You could define your impulse function simply as this one -
impulse_function = #(n) (1:numel(n)).*n==0
Sample run -
>> n = -6:4
n =
-6 -5 -4 -3 -2 -1 0 1 2 3 4
>> out = impulse_function(n)
out =
0 0 0 0 0 0 1 0 0 0 0
Plot code -
plot(n,out,'o') %// data points
hold on
line([0 0],[1 0]) %// impulse point
Plot result -
You can write an even simpler function:
function y=impulse_function(n);
y = n==0;
Note that this will return y as a type logical array but that should not affect later numerical computations.

Is there a general uniques counting function?

By general, I mean it can count the different elements in the input given it is either a list of numbers (or other atoms), a list of vectors or a list of matrices.
Example: given a list of row vectors of length 3:
x = [1 1 1; 1 0 1; 0 1 1; 1 0 1; 1 1 1; 1 0 1];
the expected outcome should be:
[1 1 1] --> 2
[1 0 1] --> 3
[0 1 1] --> 1
returned in e.g. two lists. I know about the count_uniques function, but it deals with non-array inputs only, as far as I know.
You can use unique. If the input is an array use unique(X,'rows').
If you want a universal function you can do:
function varargout=universal_unique(X);
if(size(X,2)==1)
[varargout{:}]=unique(X);
else
[varargout{:}]=unique(X,'rows');
end
end