How to get the prediction value in LSTM model? - deep-learning

This model tries to classify emails as either spam or ham.
Training is finished, but how should I feed in new email text and get a value that tells me whether the email is spam or ham?
This is the example code from the textbook:
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
# prediction on test data
predicted_blstm=model.predict(test_data)
predicted_blstm
# model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_blstm.round()))
and the result:
precision: [0.98782961 0.95348837]
recall: [0.99387755 0.91111111]
fscore: [0.99084435 0.93181818]
support: [980 135]
############################
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       980
           1       0.95      0.91      0.93       135

   micro avg       0.98      0.98      0.98      1115
   macro avg       0.97      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
 samples avg       0.98      0.98      0.98      1115
I tried the code
model.predict()
but the result looks like:
array([[9.9973804e-01, 2.6198191e-04],
       [9.9988401e-01, 1.1600493e-04],
       [9.9996233e-01, 3.7628190e-05],
       [9.9998081e-01, 1.9162568e-05],
       [9.9998498e-01, 1.5043216e-05],
       [9.9907982e-01, 9.2014833e-04],
       ...
       [9.9996233e-01, 3.7628190e-05],
       [9.9996233e-01, 3.7628190e-05]], dtype=float32)
What do these numbers mean?
Can I get the answer from this array as the message "spam" or "ham"?

I don't know the features in your train and test data, but if your model is trained only on the email text feature, then you could do the following:
1) Convert the email text, for example "This is the email I want to test", into a vector using the same vectorizer that was used on the training data.
2) If your vector is stored in a variable 'vec', you can predict whether the email is ham or spam using
prediction = model.predict(vec)
The variable 'prediction' will hold your answer.
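As for the array you got: each row is the model's softmax output for one email, i.e. two probabilities that sum to 1, one per class. To display "spam" or "ham", take the index of the largest probability (argmax) and map it through the label encoding used during training. A minimal pure-Python sketch, assuming class 0 was encoded as ham and class 1 as spam (check your own label encoding; it may be reversed):

```python
# Example rows shaped like model.predict() output: one softmax pair per email.
probs = [
    [9.9973804e-01, 2.6198191e-04],   # highest probability in column 0
    [1.2000000e-02, 9.8800000e-01],   # highest probability in column 1
]

class_names = ["ham", "spam"]  # ASSUMED label order from training

def to_label(row):
    # argmax: the index of the highest probability picks the class
    return class_names[row.index(max(row))]

labels = [to_label(row) for row in probs]
print(labels)  # -> ['ham', 'spam']
```

On the real prediction array, np.argmax(predicted_blstm, axis=1) gives the same class indices in one call.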

Related

BeautifulSoup returns nothing

I'm trying to learn how to scrape components from a website, specifically this one: https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load
Following guidance from the internet, I collected several important elements, such as the class
"article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible"
and HTML elements like th and td, to get its specific content using this code:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
teapot_loads = results.find_all("table", class_="article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible")
for teapot_loads in teapot_loads:
    table_head_element = teapot_loads.find("th", class_="headerSort")
    print(table_head_element)
    print()
I seem to have written the correct element (th) and the correct class name "headerSort", but the program doesn't return anything, although there is no error either. What did I do wrong?
You can debug your code to see what went wrong, where. One such debugging effort is below, where we keep only one class for tables, and then print out the full class of the actual elements:
import requests
from bs4 import BeautifulSoup

URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
# print(results)
teapot_loads = results.find_all("table", class_="article-table")
for teapot_load in teapot_loads:
    print(teapot_load.get_attribute_list('class'))
    table_head_element = teapot_load.find("th", class_="headerSort")
    print(table_head_element)
This will print out (besides the element you want printed) the table's class as seen by requests/BeautifulSoup: ['article-table', 'sortable', 'mw-collapsible']. After the original HTML loads in the page (with the original classes, as seen by requests/BeautifulSoup), the JavaScript in that page kicks in and adds new classes to the table. As you are searching for elements containing such dynamically added classes, your search fails.
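The mismatch can be reproduced offline. A small sketch (assuming bs4 is installed; the HTML string is a made-up stand-in for the server response) showing that matching a server-rendered class succeeds while matching a JavaScript-added class finds nothing:

```python
from bs4 import BeautifulSoup

# Simulated server HTML: the table carries only the static classes;
# "jquery-tablesorter" etc. are added later by JavaScript in the browser,
# so requests/BeautifulSoup never sees them.
html = '<table class="article-table sortable mw-collapsible"><tr><th>Name</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Matching a static class finds the table ...
print(soup.find_all("table", class_="article-table"))        # one result
# ... matching a class that only exists after the JS ran finds nothing.
print(soup.find_all("table", class_="jquery-tablesorter"))   # []
```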
Nonetheless, here is a more elegant way of obtaining that table:
import pandas as pd
url = 'https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load'
dfs = pd.read_html(url)
print(dfs[1])
This will return a dataframe with that table:
   Image                                            Name  Adeptal Energy  Load  ReducedLoad  Ratio
0    NaN          "A Bloatty Floatty's Dream of the Sky"              60    65           47   0.92
1    NaN                   "A Guide in the Summer Woods"              60    35           24   1.71
2    NaN               "A Messenger in the Summer Woods"              60    35           24   1.71
3    NaN  "A Portrait of Paimon, the Greatest Companion"              90    35           24   2.57
4    NaN                      "A Seat in the Wilderness"              20    50           50   0.40
5    NaN                     "Ballad-Spinning Windwheel"              90   185          185   0.49
6    NaN                            "Between Nine Steps"              30   550          550   0.05
[...]
Documentation for bs4 (BeautifulSoup) can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Also, docs for pandas.read_html: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

How to use HuggingFace nlp library's GLUE for CoLA

I've been trying to use the HuggingFace nlp library's GLUE metric to check whether a given sentence is a grammatical English sentence, but I'm getting an error and am stuck without being able to proceed.
What I've tried so far (reference and prediction are two text sentences):
!pip install transformers
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
reference="Security has been beefed across the country as a 2 day nation wide curfew came into effect."
prediction="Security has been tightened across the country as a 2-day nationwide curfew came into effect."
import nlp
glue_metric = nlp.load_metric('glue',name="cola")
#Using BertTokenizer
encoded_reference=tokenizer.encode(reference, add_special_tokens=False)
encoded_prediction=tokenizer.encode(prediction, add_special_tokens=False)
glue_score = glue_metric.compute(encoded_prediction, encoded_reference)
The error I'm getting:
ValueError Traceback (most recent call last)
<ipython-input-9-4c3a3ce7b583> in <module>()
----> 1 glue_score = glue_metric.compute(encoded_prediction, encoded_reference)
6 frames
/usr/local/lib/python3.6/dist-packages/nlp/metric.py in compute(self, predictions, references, timeout, **metrics_kwargs)
198 predictions = self.data["predictions"]
199 references = self.data["references"]
--> 200 output = self._compute(predictions=predictions, references=references, **metrics_kwargs)
201 return output
202
/usr/local/lib/python3.6/dist-packages/nlp/metrics/glue/27b1bc63e520833054bd0d7a8d0bc7f6aab84cc9eed1b576e98c806f9466d302/glue.py in _compute(self, predictions, references)
101 return pearson_and_spearman(predictions, references)
102 elif self.config_name in ["mrpc", "qqp"]:
--> 103 return acc_and_f1(predictions, references)
104 elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]:
105 return {"accuracy": simple_accuracy(predictions, references)}
/usr/local/lib/python3.6/dist-packages/nlp/metrics/glue/27b1bc63e520833054bd0d7a8d0bc7f6aab84cc9eed1b576e98c806f9466d302/glue.py in acc_and_f1(preds, labels)
60 def acc_and_f1(preds, labels):
61 acc = simple_accuracy(preds, labels)
---> 62 f1 = f1_score(y_true=labels, y_pred=preds)
63 return {
64 "accuracy": acc,
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in f1_score(y_true, y_pred, labels, pos_label, average, sample_weight, zero_division)
1097 pos_label=pos_label, average=average,
1098 sample_weight=sample_weight,
-> 1099 zero_division=zero_division)
1100
1101
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in fbeta_score(y_true, y_pred, beta, labels, pos_label, average, sample_weight, zero_division)
1224 warn_for=('f-score',),
1225 sample_weight=sample_weight,
-> 1226 zero_division=zero_division)
1227 return f
1228
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight, zero_division)
1482 raise ValueError("beta should be >=0 in the F-beta score")
1483 labels = _check_set_wise_labels(y_true, y_pred, average, labels,
-> 1484 pos_label)
1485
1486 # Calculate tp_sum, pred_sum, true_sum ###
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py in _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
1314 raise ValueError("Target is %s but average='binary'. Please "
1315 "choose another average setting, one of %r."
-> 1316 % (y_type, average_options))
1317 elif pos_label not in (None, 1):
1318 warnings.warn("Note that pos_label (set to %r) is ignored when "
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
However, I'm able to get results (pearson and spearmanr) for 'stsb' with the same approach as above.
Any help or a workaround for this (cola) is really appreciated. Thank you.
In general, if you are seeing this error with HuggingFace, you are trying to use the F-score as a metric on a text classification problem with more than 2 classes (the encoded token IDs you passed in look like multiclass targets). Pick a different metric, like "accuracy".
For this specific question:
Despite what you entered, it is trying to compute the F-score. Following the example notebook, you should set the metric name as:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
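Note also that CoLA is scored with the Matthews correlation coefficient over binary grammaticality labels (0/1), one per sentence, not over token IDs. A minimal pure-Python sketch of that metric with hypothetical labels (the function and data here are illustrative, not the library's API):

```python
import math

def matthews_corrcoef(preds, refs):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for p, r in zip(preds, refs) if p == 1 and r == 1)
    tn = sum(1 for p, r in zip(preds, refs) if p == 0 and r == 0)
    fp = sum(1 for p, r in zip(preds, refs) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(preds, refs) if p == 0 and r == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention: return 0 when any confusion-matrix margin is empty
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 1 = grammatical, 0 = ungrammatical (hypothetical labels)
print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))  # -> 1.0
```

With the real metric you would pass equal-length 0/1 lists as predictions and references, not tokenizer output.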

Statsmodels: How can I select different confidence intervals for regressions

I want to run a regression with a 99% confidence interval instead of the default 95% using statsmodels.
I looked at the documentation to see whether there is an argument in the fit() method, but I didn't find one. I also tried the conf_int method, but I am confused by its output.
import pandas as pd
import math
import statsmodels.formula.api as sm
df = pd.read_excel(r'C:\TestData.xlsx')
df['LogBalance'] = df['Balance'].map(lambda x: math.log(x))
est = sm.ols(formula='LogBalance ~ N + Rate',
             data=df).fit(cov_type='HAC', cov_kwds={'maxlags': 1})
print(est.summary())
print(est.conf_int(alpha=0.01, cols=None))
Since I am new to Python, can you tell me whether and how I can perform a regression in statsmodels with adjusted confidence intervals, ideally in the initial regression output?
Thanks
You can specify the confidence interval in .summary() directly. Please consider the following example:
import statsmodels.formula.api as smf
import seaborn as sns
# load a sample dataset
df = sns.load_dataset('tips')
# run model
formula = 'tip ~ size + total_bill'
results = smf.ols(formula=formula, data=df).fit()
# use 95 % CI (default setting)
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.468
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     105.9
Date:                Fri, 21 Jun 2019   Prob (F-statistic):           9.67e-34
Time:                        21:42:09   Log-Likelihood:                -347.99
No. Observations:                 244   AIC:                             702.0
Df Residuals:                     241   BIC:                             712.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6689      0.194      3.455      0.001       0.288       1.050
size           0.1926      0.085      2.258      0.025       0.025       0.361
total_bill     0.0927      0.009     10.172      0.000       0.075       0.111
==============================================================================
Omnibus:                       24.753   Durbin-Watson:                   2.100
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.169
Skew:                           0.545   Prob(JB):                     9.43e-11
Kurtosis:                       4.831   Cond. No.                         67.6
==============================================================================
# use 99 % CI
print(results.summary(alpha=0.01))
                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.468
Model:                            OLS   Adj. R-squared:                  0.463
Method:                 Least Squares   F-statistic:                     105.9
Date:                Fri, 21 Jun 2019   Prob (F-statistic):           9.67e-34
Time:                        21:45:57   Log-Likelihood:                -347.99
No. Observations:                 244   AIC:                             702.0
Df Residuals:                     241   BIC:                             712.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.005      0.995]
------------------------------------------------------------------------------
Intercept      0.6689      0.194      3.455      0.001       0.166       1.172
size           0.1926      0.085      2.258      0.025      -0.029       0.414
total_bill     0.0927      0.009     10.172      0.000       0.069       0.116
==============================================================================
Omnibus:                       24.753   Durbin-Watson:                   2.100
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.169
Skew:                           0.545   Prob(JB):                     9.43e-11
Kurtosis:                       4.831   Cond. No.                         67.6
==============================================================================
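For intuition, each interval in that table is coef ± (critical value) × std err. A quick stdlib sketch of the large-sample normal approximation, using the intercept from the summary above (statsmodels itself uses the t distribution with the 241 residual df here, so its bounds, 0.166 to 1.172, are slightly wider):

```python
from statistics import NormalDist

def conf_int(coef, std_err, alpha=0.05):
    """Normal-approximation (1 - alpha) confidence interval for a coefficient.

    For small samples, statsmodels' t-based interval is slightly wider.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 2.576 for alpha=0.01
    return coef - z * std_err, coef + z * std_err

# Intercept from the summary above: coef 0.6689, std err 0.194
lo, hi = conf_int(0.6689, 0.194, alpha=0.01)
print(round(lo, 3), round(hi, 3))  # -> 0.169 1.169
```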

Different Sum Sq and MSS using lme4::lmer and lmerTest::lmer

I get sums of squares and mean sums of squares 10x higher when I use anova on lmerTest::lmer objects compared to lme4::lmer objects. See the R log file below. Note the warning that when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function, and I suspect that this leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about the use of the Kenward-Roger approximation for degrees of freedom that changes the calculation of Sum Sq and MSS and the specification of the anova type that I don't understand?
I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case.
Thanks in advance for any advice you all can provide about this.
> library(lme4)
Loading required package: Matrix
Attaching package: ‘lme4’
The following object is masked from ‘package:stats’:
sigma
> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading
‘lmerTest’
> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
prepost 1 3.956 3.956 18.4825
lowhi 1 130.647 130.647 610.3836
prepost:lowhi 1 0.038 0.038 0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
prepost 37.15 37.15 1 308.04 18.68 2.094e-05 ***
lowhi 1207.97 1207.97 1 376.43 607.33 < 2.2e-16 ***
prepost:lowhi 0.35 0.35 1 376.43 0.17 0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Update: Thanks, Ben. I did a little code archeology on lmerTest to see if I could dope out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as an "mermodLmerTest" object. The only effect of this is to invoke versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone farther to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr. Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova. (Explanation of my second question above.) Not a great choice if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.
However, there are other surprises using the lmerTest versions of summary and anova, as shown in the R console output below. I used lmerTest's included sleepstudy data so that this code should be replicable.
a. Note that "sleepstudy" has 180 records (with 3 variables)
b. The summaries of fm1 and fm1a are identical except for the added Fixed effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd, given that there are only 180 records in "sleepstudy."
c. Just as in my original model above, the lme4 and lmerTest versions of anova give very different values of Sum Sq and Mean Sq (30031 and 446.65, respectively).
d. Interestingly, the lmerTest versions of anova using the Satterthwaite and Kenward-Roger estimates of the DenDF are wildly different (5794080 and 28, respectively). The K-R value seems more reasonable.
Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values. Based on your (Doug Bates's) posting (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am now using (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html) in the final bit of code below to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Roger estimate of df (using the R package 'pbkrtest' -- the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think one can feel comfortable with a p-value in the range [naive t-test, KR estimate].
> library(lme4); library(lmerTest); library(pbkrtest)
> dim(sleepstudy)
[1] 180 3
>
> fm1 <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.84
Days 10.467 1.546 6.77
Correlation of Fixed Effects:
(Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.405 6.825 1371.100 302.06 <2e-16 ***
Days 10.467 1.546 1281.700 55.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML
criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Days 1 30031 30031 45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 5794080 45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 27.997 45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
Estimate Std..Error t.value p.z
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11
>
> # Kenward-Rogers
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
Estimate Std..Error t.value p.z p.KR
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00 0.0000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11 3.5447e-07

Regression estimation in Eviews

I am estimating the dependence of exports on GDP and human capital. Choosing the linear method, I got this:
Dependent Variable: EXPORTS
Method: Least Squares
Date: 05/23/15 Time: 18:20
Sample: 1960 2011
Included observations: 52
Variable              Coefficient   Std. Error   t-Statistic   Prob.
C                        2.63E+10     1.38E+10      1.911506   0.0618
HC                      -1.36E+10     6.08E+09     -2.233089   0.0301
GDP                      2903680.     192313.2      15.09870   0.0000

R-squared                0.967407    Mean dependent var       1.90E+10
Adjusted R-squared       0.966076    S.D. dependent var       2.22E+10
S.E. of regression       4.08E+09    Akaike info criterion    47.15324
Sum squared resid        8.16E+20    Schwarz criterion        47.26581
Log likelihood          -1222.984    Hannan-Quinn criter.     47.19640
F-statistic              727.1844    Durbin-Watson stat       0.745562
Prob(F-statistic)        0.000000
The sign of the HC coefficient is negative, which goes against the theory. I have tried logarithmic and exponential forms, but I still get negative results for HC.
I wonder what the right way to estimate it is.
Thank you in advance.
here is my data
Durbin-Watson stat 0.745562
It means that there is an autocorrelation problem in your model: a Durbin-Watson statistic near 2 indicates no first-order autocorrelation, while a value as low as 0.75 points to strong positive autocorrelation in the residuals, which makes the reported standard errors and t-statistics unreliable.
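The statistic itself is just d = Σ(e_t − e_{t−1})² / Σe_t² over the residuals: near 2 means no first-order autocorrelation, near 0 positive, near 4 negative. A small illustrative sketch with made-up residuals:

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: ~2 = no first-order autocorrelation,
    toward 0 = positive, toward 4 = negative autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Slowly drifting residuals (positive autocorrelation) give a value near 0;
# residuals that flip sign every period give a value well above 2.
print(durbin_watson([1.0, 1.1, 1.2, 1.1, 1.0]))   # near 0
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))      # -> 3.0
```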