NLTK frequency distribution for group of words - nltk

Could you please help me calculate the frequency distribution of a "group of words"?
I have a text file, and here is my code to find the 50 most common words in it:
f=open('myfile.txt','rU')
text=f.read()
text1=text.split()
keywords=nltk.Text(text1)
fdist1=FreqDist(keywords)
fdist1.most_common(50)
In the results, each word is counted on its own.
It works well, but I am trying to find the frequency distribution of each line in the text file. For instance, the first line contains the term 'conceptual change'. The program counts 'conceptual' and 'change' as different keywords. However, I need the frequency distribution of the whole term 'conceptual change'.

You're splitting the text on any whitespace. See the docs: this is the default behavior of split() when you do not pass a separator.
If you print the value of text1 in your example program, you will see this. It's simply a list of words, not lines, so the damage has already been done by the time it's passed to FreqDist.
To fix it, just replace text.split() with text.split("\n"):
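import nltk
from nltk import FreqDist
# the 'rU' mode was removed in Python 3.11; plain 'r' already handles universal newlines
with open('myfile.txt','r') as f:
    text=f.read()
# split on newlines so that each line, not each word, becomes one item
text1=text.split("\n")
keywords=nltk.Text(text1)
print(type(keywords))
fdist1=FreqDist(keywords)
print(fdist1.most_common(50))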
This gives an output like:
[('conceptual change', 1), ('coherence', 1), ('cost-benefit tradeoffs', 1), ('interactive behavior', 1), ('naive physics', 1), ('rationality', 1), ('suboptimal performance', 1)]
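For comparison, the same per-line counting can be sketched with the standard library alone, since FreqDist is a subclass of collections.Counter. The helper name count_terms and the sample text are assumptions made up for illustration; they are not part of the original program.

```python
from collections import Counter


def count_terms(text):
    """Count each non-empty line of `text` as one term (hypothetical helper)."""
    # splitlines() handles both \n and \r\n and yields no trailing empty string
    terms = [line.strip() for line in text.splitlines() if line.strip()]
    return Counter(terms)


# Made-up sample standing in for the contents of myfile.txt
sample = "conceptual change\ncoherence\nnaive physics\nconceptual change\n"
counts = count_terms(sample)
print(counts.most_common(2))
```

Skipping blank lines avoids counting an empty string as a term when the file ends with a newline.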

Related

Julia box plots: columns whose csv header contains spaces and parentheses are not read, but one-word column titles work fine

So here's the code in Julia
using CSV
using DataFrames
using PlotlyJS
df= CSV.read("path", DataFrame)
plot(df, x=:Age, kind="box")
#I DO get the box plot for this one, because in the csv that column is headed with "Age"
plot(df, x=:Annual Income (k$), kind="box")
ERROR: syntax: missing comma or ) in argument list
Stacktrace:
[1] top-level scope
@ none:1
#here I get an error asking about syntax, but I don't understand since the x= part is exactly what the column is labeled. If I try 'x=:Annual' I get a box plot of nothing, but the column title is "Annual Income (k$)".
Help is greatly appreciated!
Reference: https://plotly.com/julia/box-plots/
Try:
plot(df, x=Symbol("Annual Income (k\$)"), kind="box")
The : syntax constructs a Symbol, but only up to the next space. So :Annual Income (k$) says to build the Symbol Symbol("Annual"), but then leaves the Income (k$) part dangling. Instead, you can explicitly construct the Symbol yourself as above.
The backslash before the $ symbol is there because Julia normally uses $ for string interpolation, and here we want the raw $ character itself. You can also write plot(df, x=Symbol(raw"Annual Income (k$)"), kind="box") instead, as no interpolation happens inside raw"" strings.

Octave - dlmread and csvread convert the first value to zero

When I try to read a csv file in Octave, I notice that the very first value in it is converted to zero. I tried both csvread and dlmread, and I receive no errors. I can open the file in a plain text editor and see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. The files also contain only numbers. The only thing that I feel might be important is that I have five columns/groups that each have a different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
(with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your csv file contains a byte order mark (BOM) right before the first number. You can confirm this by opening the file in a hex editor: you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers in 'front' of the string sequence, it is parsed as the number zero (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file and press the right arrow key once, the cursor does not visibly move (meaning it has just stepped over the invisible byte order mark, which takes no visible space). Pressing backspace at this point deletes the byte order mark and allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper ASCII file (or UTF-8 without the byte order mark).
Also, it may be worth checking how this file was produced; if you have any control in that process, perhaps you can find why this mark was placed in the first place and prevent it. E.g., if this was exported from Excel, you can choose plain 'csv' format instead of 'utf-8 csv'.
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of Octave. See #58813 :)
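If a hex editor is not at hand, the byte order mark check and removal described above can be sketched in Python. This is only a minimal illustration on in-memory bytes; the function name strip_utf8_bom and the sample data are assumptions, not something taken from the question.

```python
import codecs


def strip_utf8_bom(data):
    """Return `data` (bytes) without a leading UTF-8 byte order mark."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


# Made-up file contents: a BOM followed by the first row of the csv
raw = codecs.BOM_UTF8 + b"1.1,2.1,3.1,4.1,5.1\n"
cleaned = strip_utf8_bom(raw)
print(raw[:3].hex())            # hex of the first three bytes
print(cleaned.decode("ascii"))  # the row without the mark
```

For a real file, the bytes would be read with open(path, "rb") and written back with open(path, "wb"). Reading text with encoding="utf-8-sig" also drops a leading BOM.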

How to replace MATLAB's timeseries and synchronize functions in Octave?

I have a MATLAB script that I would like to run in Octave. But it turns out that the timeseries and synchronize functions from MATLAB are not yet implemented in Octave. So my question is if there is a way to express or replace these functions in Octave.
For understanding, I have two text files with different row lengths, which I want to synchronize into one text file with the same row length over time. The content of the text files is:
Text file 1:
1st column contains the distance
2nd column contains the time
Text file 2:
1st column contains the angle
2nd column contains the time
Here is the part of my code that I use in MATLAB to synchronize the files.
ts1 = timeseries(distance,timed);
ts2 = timeseries(angle,timea);
[ts1 ts2] = synchronize(ts1,ts2,'union');
distance = ts1.Data;
angle = ts2.Data;
Thanks in advance for your help.
edit:
Here are some example files.
input distance
input rotation angle
output
The synchronize function seems to create a common timeseries from two separate ones (here, specifically via their union), and then use interpolation (here 'linear') to find interpolated values for both distance and angle at the common timepoints.
An example of how to achieve this to get the same output in octave as your provided output file is as follows.
Note: I had to preprocess your input files first to replace 'decimal commas' with dots, and then 'tabs' with commas, to make them valid csv files.
Distance_t = csvread('input_distance.txt', 1, 0); % skip header row
Rotation_t = csvread('input_rotation_angle.txt', 1, 0); % skip header row
Common_t = union( Distance_t(:,2), Rotation_t(:,2) );
InterpolatedDistance = interp1( Distance_t(:,2), Distance_t(:,1), Common_t );
InterpolatedRotation = interp1( Rotation_t(:,2), Rotation_t(:,1), Common_t );
Output = [ InterpolatedRotation, InterpolatedDistance ];
Output = sortrows( Output, -1 ); % sort according to column 1, in descending order
Output = Output(~isna(Output(:,2)), :); % remove NA entries
(Note: the step that removes NA entries is necessary because we did not request extrapolation during the interpolation step, so some of the resulting distance values fall outside the original time range, and Octave labels those as NA.)
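The same idea (union of the time axes, linear interpolation, no extrapolation) can be sketched in plain Python to make the steps explicit. The sample values and the helper interp_linear are made up for illustration, the time stamps are assumed to be strictly increasing, and unlike the Octave code above this sketch drops a row when either series is out of range.

```python
from bisect import bisect_right


def interp_linear(xs, ys, x):
    """Linearly interpolate at x; return None outside [xs[0], xs[-1]]."""
    if x < xs[0] or x > xs[-1]:
        return None  # no extrapolation, comparable to the NA entries above
    i = bisect_right(xs, x) - 1
    if i == len(xs) - 1:
        return ys[-1]  # x is exactly the last sample time
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])


# Made-up samples: two series recorded at different time stamps
time_d, distance = [0.0, 2.0, 4.0], [10.0, 20.0, 30.0]
time_a, angle = [1.0, 2.0, 3.0], [0.0, 90.0, 180.0]

common_t = sorted(set(time_d) | set(time_a))  # union of both time axes
rows = [(t, interp_linear(time_d, distance, t), interp_linear(time_a, angle, t))
        for t in common_t]
rows = [r for r in rows if None not in r[1:]]  # drop out-of-range rows
for row in rows:
    print(row)
```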

What does 'Multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data of two csv files and followed the process below in RIDE:
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are my two csv contents -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As some of the data does not match, the result is FAIL when I run the code in RIDE. But the log shows the data below:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
 Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When Robot compares multiline strings (data that has newlines in it), it shows the differences in the same format as the standard unix tool diff. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it treats the data as two files and shows the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line tells you which line ranges are being compared (here, 4 lines starting at line 1 in each string), and the - and + markers show which lines appear only in the first or only in the second string.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the last line (which begins with kjhjmb in the first string and with acc in the second). Since the line beginning with Harshil does not show a + or -, it was identical in both strings.
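To see where such output comes from, here is a small sketch that produces a unified diff for the same two strings with Python's difflib module. It only illustrates the format; whether Robot Framework calls difflib in exactly this way is an assumption.

```python
import difflib

csv_a = "Harshil,45,8.03,DMJ\nDivy,55,8,VVN\nParth,1,9,vvn\nkjhjmb,44,0.5,bugg"
csv_b = "Harshil,45,8.03,DMJ\nDivy,55,78,VVN\nParth,1,9,vvnbcb\nacc,5,6,afafa"

# Compare line by line; "first" and "second" are just labels for the header
diff_lines = list(difflib.unified_diff(
    csv_a.splitlines(), csv_b.splitlines(),
    fromfile="first", tofile="second", lineterm=""))
print("\n".join(diff_lines))
```

The first two lines of the result name the two inputs, the third is the hunk header with the line ranges, and every following line starts with a space (unchanged), - (only in the first) or + (only in the second).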

liblinear's train.exe: "Wrong input format at line 1"

I'm trying to run liblinear's train.exe on Windows:
>train ex1_train.txt
Wrong input format at line 1
Here's the beginning of the file. What's wrong?
17.592 1:6.1101
9.1302 1:5.5277
13.662 1:8.5186
11.854 1:7.0032
6.8233 1:5.8598
11.886 1:8.3829
4.3483 1:7.4764
12 1:8.5781
6.5987 1:6.4862
3.8166 1:5.0546
3.2522 1:5.7107
15.505 1:14.164
3.1551 1:5.734
7.2258 1:8.4084
0.71618 1:5.6407
3.5129 1:5.3794
5.3048 1:6.3654
0.56077 1:5.1301
3.6518 1:6.4296
5.3893 1:7.0708
Liblinear requires the same input format as LibSVM. And, from their README file,
The format of training and testing data file is:
<label> <index1>:<value1> <index2>:<value2> ...
Each line contains an instance and is ended by a '\n' character. For
classification, <label> is an integer indicating the class label
(multi-class is supported). For regression, <label> is the target
value which can be any real number. For one-class SVM, it's not used
so can be any number. The pair <index>:<value> gives a feature
(attribute) value: <index> is an integer starting from 1 and <value>
is a real number. The only exception is the precomputed kernel, where
<index> starts from 0; see the section of precomputed kernels. Indices
must be in ASCENDING order.
Since we don't have the entire file, the best answer we can provide is to make sure all of these instructions are followed, e.g., that there is no TAB instead of a space, no '\r\n' instead of '\n', etc. A good way to debug is to take the first few lines and keep adding more until you get the error.
head -10 <yourfile> > tmp10
head -20 <yourfile> > tmp20
etc. And see where the error pops up.
My problems were that you can't use zero as a feature id, and that the features need to be sorted.
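The rules quoted from the README (numeric label, index:value pairs, indices starting from 1 in ascending order) can also be checked mechanically. The sketch below is a hypothetical checker written for illustration, not liblinear's own parser; it splits on any whitespace, so it does not detect tabs or '\r\n' line endings.

```python
def check_line(line):
    """Return a problem description for one data line, or None if it looks fine."""
    fields = line.split()
    if not fields:
        return "empty line"
    try:
        float(fields[0])  # label: class label or regression target
    except ValueError:
        return "label is not a number"
    previous = 0  # indices start at 1 and must keep increasing
    for field in fields[1:]:
        index_text, _, value_text = field.partition(":")
        try:
            index = int(index_text)
            float(value_text)
        except ValueError:
            return "bad feature %r" % field
        if index <= previous:
            return "index %d is zero, negative, or out of order" % index
        previous = index
    return None


# Made-up sample lines: one valid, three with different problems
sample = ["17.592 1:6.1101", "1 0:2.5 1:3", "2 3:1 2:4", "abc 1:1"]
problems = {n: check_line(s) for n, s in enumerate(sample, start=1)}
for n, problem in problems.items():
    print(n, problem)
```

Running such a check over the whole file reports the first offending line number directly, instead of growing the file ten lines at a time.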