How can I make nltk.NaiveBayesClassifier.train() work with my dictionary - nltk

I'm currently building a simple spam/ham email filter using Naive Bayes.
To explain my algorithm's logic: I have a folder with lots of files, which are examples of spam/ham emails. I also have two other files in this folder, one containing the titles of all my ham examples and another with the titles of all my spam examples. I organized it like this so I can open and read these emails properly.
I'm putting all the words I judge to be important into a dictionary structure, with a label "spam" or "ham" depending on which kind of file I extracted them from.
Then I'm using nltk.NaiveBayesClassifier.train() so I can train my classifier, but I'm getting the error:
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
I don't know why this is happening. When I looked for a solution, I found that lists are not hashable, and I was using a list to do it, so I turned it into a dictionary, which is hashable as far as I know, but I keep getting this error.
Someone knows how to solve it? Thanks!
All my code is listed below:
import nltk
import re
import random
stopwords = nltk.corpus.stopwords.words('english') #Words I should avoid since they have weak value for classification
my_file = open("spam_files.txt", "r") #my_file now has the name of each file that contains a spam email example
word = {} #a dictionary where I will store all the words and the label they get (spam or ham)
for lines in my_file: #for each file name (which will be represented by LINES) in my_file
    with open(lines.rsplit('\n')[0]) as email: #I will open the file pointed to by LINES, and then read the email example that is inside this file
        for phrase in email: #After that, I will take every phrase of this email example I just opened
            try: #and I'll try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue #I will ignore non-ascii elements
            for c in tokens: #for each token
                regex = re.compile('[^a-zA-Z]') #I will also exclude numbers
                c = regex.sub('', c)
                if (c): #If there is any element left
                    if (c not in stopwords): #And if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'}) #I put this element in my dictionary. Since I'm analysing spam examples, variable C is labeled "spam".
my_file.close()
email.close()
#The same logic is used for the Ham emails. Since my ham emails contain only ascii elements, I don't test them with TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
with open(lines.rsplit('\n')[0]) as email:
for phrase in email:
tokens = nltk.word_tokenize(phrase)
for c in tokens:
regex = re.compile('[^a-zA-Z]')
c = regex.sub('', c)
if (c):
if (c not in stopwords):
c.lower()
word.update({c: 'ham'})
my_file.close()
email.close()
#And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)

nltk.NaiveBayesClassifier.train() expects “a list of tuples (featureset, label)” (see the documentation of the train() method)
What is not mentioned there is that featureset should be a dict of feature names mapped to feature values.
So, in a typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' or 1/0 or True/False;
the feature names are the occurring words and the values are the number of times each word occurs.
For example, the argument to the train() method might look like this:
[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
({'money': 3}, 'spam'),
...
({'dear': 1, 'meeting': 2}, 'ham'),
...
]
If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
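To make that concrete for the code in the question, here is a minimal, hedged sketch of how the word-collecting loops could be reorganised so that each email becomes one (featureset, label) tuple; the helper name extract_features and the bag-of-words counting are illustrative assumptions, not part of the original program:

import re
import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))
regex = re.compile('[^a-zA-Z]')

def extract_features(path):  # hypothetical helper: one featureset per email file
    counts = {}
    with open(path) as email:
        for phrase in email:
            try:
                tokens = nltk.word_tokenize(phrase)
            except Exception:
                continue  # skip phrases that fail to tokenize (e.g. non-ascii)
            for token in tokens:
                token = regex.sub('', token).lower()
                if token and token not in stopwords:
                    counts[token] = counts.get(token, 0) + 1
    return counts

labeled_featuresets = []
with open("spam_files.txt") as my_file:
    for line in my_file:
        labeled_featuresets.append((extract_features(line.strip()), 'spam'))
with open("ham_files.txt") as my_file:
    for line in my_file:
        labeled_featuresets.append((extract_features(line.strip()), 'ham'))

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)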

Related

How can I write certain sections of text from different lines to multiple lines?

So I'm currently trying to use Python to transform large amounts of data from a .txt file into a neat and tidy .csv file. The first stage is trying to get the 8-digit company numbers into one column called 'Company numbers'. I've created the header and just need to put each company number from each line into the column. What I want to know is, how do I tell my script to read the first eight characters of each line in the .txt file (which correspond to the company number) and then write them to the .csv file? This is probably very simple but I'm new to Python!
So far, I have something which looks like this:
with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv','w',newline='') as wf:
        outputDictWriter = csv.DictWriter(wf,['Company number'])
        outputDictWriter.writeheader()
        rf = rf.read(8)
        for line in rf:
            wf.write(line)
My recommendation would be to 1) read the file in, 2) make the relevant transformation, and then 3) write the results to file. I don't have sample data, so I can't verify whether my solution exactly addresses your case:
with open('input.txt','r') as file_handle:
    file_content = file_handle.read()

list_of_IDs = []
for line in file_content.split('\n'):
    print("line = ", line)
    print("first 8 =", line[0:8])
    list_of_IDs.append(line[0:8])

with open("output.csv", "w") as file_handle:
    file_handle.write("Company\n")
    for line in list_of_IDs:
        file_handle.write(line + "\n")
The value of separating these steps is to enable debugging.
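If you would rather keep the csv.DictWriter approach from the question, here is a hedged sketch along the same lines (input.txt and output.csv are placeholder paths, and the first eight characters of each line are assumed to be the company number):

import csv

with open('input.txt', 'r') as rf, open('output.csv', 'w', newline='') as wf:
    writer = csv.DictWriter(wf, fieldnames=['Company number'])
    writer.writeheader()
    for line in rf:
        line = line.rstrip('\n')
        if line:  # skip blank lines
            writer.writerow({'Company number': line[:8]})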

Reading a .dat file in Julia, issues with variable delimeter spacing

I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column, with all the data from each column concatenated into one string. I tried to even specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix
you can also pre-process the file turning multiple spaces into single spaces
That's probably the easiest way to do this and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
    open(csv_path, write=true) do io
        for line in eachline(dat_path)
            join(io, split(line), ',')
            println(io)
        end
    end
    return csv_path
end

function dat2csv(dat_path::AbstractString)
    base, ext = splitext(dat_path)
    ext == ".dat" ||
        throw(ArgumentError("file name doesn't end with `.dat`"))
    return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line from it at a time
eachline strips any trailing newline from each line, so each line will be a bunch of numbers separated by whitespace, with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
Try using the readdlm function in DelimitedFiles library, and convert to DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

How to write a list of string as double quotes, so that json can load?

Consider the following code, where the json can't be loaded back because after the manipulation the double quotes become single quotes. How can I write the list to the file with double quotes so that json can load it back?
import configparser
import json
config = configparser.ConfigParser()
config.read("config.ini")
l = json.loads(config.get('Basic', 'simple_list'))
new_config = configparser.ConfigParser()
new_config.add_section("Basic")
new_config.set('Basic', 'simple_list', str(l))
with open("config1.ini", 'w') as f:
new_config.write(f)
config = configparser.ConfigParser()
config.read("config1.ini")
l = json.loads(config.get('Basic', 'simple_list'))
The settings.ini file content is like this:
[Basic]
simple_list = ["a", "b"]
As already mentioned by @L3viathan, the purely technical answer is "use json.dumps() instead of str()" (and yes, it works for dicts too).
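For completeness, here is a minimal sketch of that fix applied to the snippet above (config1.ini reused from the question; only the set() call changes):

import configparser
import json

l = ["a", "b"]
new_config = configparser.ConfigParser()
new_config.add_section("Basic")
new_config.set('Basic', 'simple_list', json.dumps(l))  # json.dumps keeps the double quotes
with open("config1.ini", 'w') as f:
    new_config.write(f)

config = configparser.ConfigParser()
config.read("config1.ini")
print(json.loads(config.get('Basic', 'simple_list')))  # ['a', 'b'] round-trips cleanly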
BUT: storing json in an ini file is a very bad idea. "ini" is a file format on its own (even if not as strictly specified as json or yaml) and it has been designed to be user-editable with just any text editor. FWIW, the simple canonical way to store "lists" in an ini file is simply to store them as comma-separated values, i.e.:
[Basic]
simple_list = a,b
and parse this back when reading the config as
values = config.get('Basic', 'simple_list').split(",")
wrt/ "storing dicts", an ini file IS already a (kind of) dict since it's based on key:value pairs. It's restricted to two levels (sections and keys), but here again that's by design - it's a format designed for end-users, not for programmers.
Now if the ini format doesn't suit your needs, nothing prevents you from just using a json (or yaml) file instead for the whole config.
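As a hedged sketch of that last suggestion (the file name config.json is just an example), the whole config can live in a plain JSON file and round-trip without any string massaging:

import json

# Write the whole configuration as JSON.
with open("config.json", "w") as f:
    json.dump({"Basic": {"simple_list": ["a", "b"]}}, f, indent=2)

# Read it back; lists and nested dicts come back as real Python objects.
with open("config.json") as f:
    config = json.load(f)
print(config["Basic"]["simple_list"])  # ['a', 'b']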

Term for a "Special Identifier" Embedded in String Data

I'm mostly at a loss for how to describe this, so I'll start with a simple example that is similar to some JSON I'm working with:
"user_interface": {
username: "Hello, %USER.username%",
create_date: "Your account was created on %USER.create_date%",
favorite_color: "Your favorite color is: %USER.fav_color%"
}
The "special identifiers" located in the username create_date and favorite_color fields start and end with % characters, and are supposed to be replaced with the correct information for that particular user. An example for the favorite_color field would be:
Your favorite color is: Orange
Is there a proper term for these identifiers? I'm trying to search google for best practices or libraries when parsing these before I reinvent the wheel, but everything I can think of results in a sea of false-positives.
Just some thoughts on the subject of %special identifier%. Let's take a look at a small subset of examples that implement a similar kind of string replacement.
WSH Shell ExpandEnvironmentStrings Method
Returns an environment variable's expanded value.
WSH .vbs code snippet
Set WshShell = WScript.CreateObject("WScript.Shell")
WScript.Echo WshShell.ExpandEnvironmentStrings("WinDir is %WinDir%")
' WinDir is C:\Windows
.NET Composite Formatting
The .NET Framework composite formatting feature takes a list of objects and a composite format string as input. A composite format string consists of fixed text intermixed with indexed placeholders, called format items, that correspond to the objects in the list. The formatting operation yields a result string that consists of the original fixed text intermixed with the string representation of the objects in the list.
VB.Net code snippet
Console.WriteLine(String.Format("Prime numbers less than 10: {0}, {1}, {2}, {3}, {4}", 1, 2, 3, 5, 7 ))
' Prime numbers less than 10: 1, 2, 3, 5, 7
JavaScript replace Method (with RegEx application)
... The match variables can be used in text replacement where the replacement string has to be determined dynamically... $n ... The nth captured submatch ...
Also called Format Flags, Substitution, Backreference and Format specifiers.
JavaScript code snippet
console.log("Hello, World!".replace(/(\w+)\W+(\w+)/g, "$1, dear $2"))
// Hello, dear World!
Python Format strings
Format strings contain “replacement fields” surrounded by curly braces {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output...
Python code snippet
print "The sum of 1 + 2 is {0}".format(1+2)
# The sum of 1 + 2 is 3
Ruby String Interpolation
Double-quote strings allow interpolation of other values using #{...} ...
Ruby code snippet
res = 3
puts "The sum of 1 + 2 is #{res}"
# The sum of 1 + 2 is 3
TestComplete Custom String Generator
... A string of macros, text, format specifiers and regular expressions that will be used to generate values. The default value of this parameter is %INT(1, 2147483647, 1) %NAME(ANY, FULL) lives in %CITY. ... Also, you can format the generated values using special format specifiers. For instance, you can use the following macro to generate a sequence of integer values with the specified minimum length (3 characters) -- %0.3d%INT(1, 100, 3).
Angular Expression
Angular expressions are JavaScript-like code snippets that are mainly placed in interpolation bindings such as {{ textBinding }}...
Django Templates
Variables are surrounded by {{ and }} like this: My first name is {{ first_name }}. My last name is {{ last_name }}. With a context of {'first_name': 'John', 'last_name': 'Doe'}, this template renders to: My first name is John. My last name is Doe.
Node.js v4 Template strings
... Template strings can contain place holders. These are indicated by the Dollar sign and curly braces (${expression}). The expressions in the place holders and the text between them get passed to a function...
JavaScript code snippet
var res = 3;
console.log(`The sum of 1 + 2 is ${res}`);
// The sum of 1 + 2 is 3
C/C++ Macros
Preprocessing expands macros in all lines that are not preprocessor directives...
Replacement in source code.
C++ code snippet
std::cout << __DATE__;
// Jan 8 2016
AutoIt Macros
AutoIt has a number of Macros that are special read-only variables used by AutoIt. Macros start with the @ character ...
Replacement in source code.
AutoIt code snippet
MsgBox(0, "", "CPU Architecture is " & #CPUArch)
; CPU Architecture is X64
SharePoint solution Replaceable Parameters
Replaceable parameters, or tokens, can be used inside project files to provide values for SharePoint solution items whose actual values are not known at design time. They are similar in function to the standard Visual Studio template tokens... Tokens begin and end with a dollar sign ($) character. Any tokens used are replaced with actual values when a project is packaged into a SharePoint solution package (.wsp) file at deployment time. For example, the token $SharePoint.Package.Name$ might resolve to the string "Test SharePoint Package."
Apache Ant Replace Task
Replace is a directory based task for replacing the occurrence of a given string with another string in selected file... token... the token which must be replaced...
So, based on the functional context, I would call it a %token% (a flavor of string with an identified "meaning").
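If you do end up rolling your own expansion for the %USER.field% tokens from the question, a hedged Python sketch (the token pattern and the user dict are assumptions for illustration, not a standard library) might look like this:

import re

# Hypothetical user data; the field names mirror the %USER.*% tokens in the question.
user = {"username": "Ada", "create_date": "2016-01-08", "fav_color": "Orange"}

TOKEN = re.compile(r'%USER\.(\w+)%')  # matches tokens such as %USER.fav_color%

def expand(template, data):
    # Replace each token with its value; leave unknown tokens untouched.
    return TOKEN.sub(lambda m: str(data.get(m.group(1), m.group(0))), template)

print(expand("Your favorite color is: %USER.fav_color%", user))
# Your favorite color is: Orange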

Auto building an output list of possible words from OCR'ing errors based on a given population of words

I wonder if, in Perl/MySQL, it is possible to build a list of variant words based on a given word, covering the common OCR errors that may occur in it (i.e. 8 instead of b)? In other words, if I have a list of words, and in that list is the word "Alphabet", then is there a way to extend or build a new list that includes my original word plus the OCR-error variants of "Alphabet"? So in my output, I could have the following variants of Alphabet, perhaps:
Alphabet
A1phabet
Alpha8et
A1pha8et
Of course it would be useful to code for most if not all of the common errors that appear in OCR'ed text. Things like 8 instead of b, or 1 instead of l. I'm not looking to fix the errors, because my data itself could have OCR errors, but I want to create a variant list of words as my output based on a list of words I give it as input. So in my data, I may have Alpha8et, but if I do a simple search for Alphabet, it won't find this obvious error.
My quick and dirty MySQL approach
Select * from
(SELECT Word
 FROM words
 union all
 -- Rule 1 (8 instead of b)
 SELECT
    case
        when Word regexp 'b|B' = 1
        then replace(replace(Word, 'B','8'),'b','8')
    end as Word
 FROM words
 union all
 -- Rule 2 (1 instead of l)
 SELECT
    case
        when Word regexp 'l|L' = 1
        then replace(replace(Word, 'L','1'),'l','1')
    end as Word
 FROM words) qry
where qry.Word is not null
order by qry.Word;
I'm thinking there must be a more automated and cleaner method
If you have examples of scanned texts with both the as-scanned (raw) version, and the corrected version, it should be relatively simple to generate a list of the character corrections. Gather this data from enough texts, then sort it by frequency. Decide how frequent a correction has to be for it to be "common," then leave only the common corrections in the list.
Turn the list into a map keyed by the correct letter; the value being an array of the common mis-scans for that letter. Use a recursive function to take a word and generate all of its variations.
This example, in Ruby, shows the recursive function. Gathering up the possible mis-scans is up to you:
VARIATIONS = {
  'l' => ['1'],
  'b' => ['8'],
}

def variations(word)
  return [''] if word.empty?
  first_character = word[0..0]
  remainder = word[1..-1]
  possible_first_characters =
    [first_character] | VARIATIONS.fetch(first_character, [])
  possible_remainders = variations(remainder)
  possible_first_characters.product(possible_remainders).map(&:join)
end

p variations('Alphabet')
# => ["Alphabet", "Alpha8et", "A1phabet", "A1pha8et"]
The original word is included in the list of variations. If you want only possible mis-scans, then remove the original word:
def misscans(word)
  variations(word) - [word]
end

p misscans('Alphabet')
# => ["Alpha8et", "A1phabet", "A1pha8et"]
A quick-and-dirty (and untested) version of a command-line program would couple the above functions with this "main" function:
input_path, output_path = ARGV
File.open(input_path, 'r') do |infile|
  File.open(output_path, 'w') do |outfile|
    while word = infile.gets
      outfile.puts misscans(word.chomp)  # chomp the trailing newline before generating variants
    end
  end
end
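Since most of the other examples on this page are in Python, here is a hedged Python sketch of the same recursive idea (purely illustrative; the Ruby version above is the answer's own code):

# Map each correct character to its common mis-scans.
VARIATIONS = {
    'l': ['1'],
    'b': ['8'],
}

def variations(word):
    if not word:
        return ['']
    first, remainder = word[0], word[1:]
    possible_firsts = [first] + VARIATIONS.get(first, [])
    return [f + rest for f in possible_firsts for rest in variations(remainder)]

def misscans(word):
    return [v for v in variations(word) if v != word]

print(variations('Alphabet'))
# ['Alphabet', 'Alpha8et', 'A1phabet', 'A1pha8et']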
An efficient way of achieving this is to use the bitap algorithm. Perl has re::engine::TRE, a binding to libtre which implements fuzzy string matching in regexps:
use strict;
use warnings qw(all);
use feature qw(say);
use re::engine::TRE max_cost => 1;

# match "Perl"
if ("A pearl is a hard object produced..." =~ /(Perl)/i) {
    say $1; # finds "pearl"
}
Plus, there is agrep tool which allows you to use libtre from the command line:
$ agrep -i -E 1 peArl *
fork.pl:#!/usr/bin/env perl
geo.pl:#!/usr/bin/env perl
leak.pl:#!/usr/local/bin/perl
When you need to match several words against the OCRized text, there are two distinct approaches.
You could simply build one regexp with your entire dictionary, if it is small enough:
/(Arakanese|Nelumbium|additionary|archarios|corbeil|golee|layer|reinstill)/
Large dictionary queries can be optimized by building trigram index.
Perl has a String::Trigram module for doing this in-memory.
Several RDBMS also have trigram index extensions. PostgreSQL-flavored pg_trgm allows you to write queries like this, which are fast enough even for really big dictionaries:
SELECT DISTINCT street, similarity(street, word)
FROM address_street
JOIN (
    SELECT UNNEST(ARRAY['higienopolis','lapa','morumbi']) AS word
) AS t0 ON street % word;
(this one took ~70ms on a table with ~150K rows)