spaCy / NLTK: Language detection

I am currently working on a project dealing with a bunch of social media posts.
Some of these posts are in English and some in Spanish.
My current code runs quite smoothly. However, I am wondering: does spaCy/NLTK automatically detect which language stemmer/stopwords/etc. it has to use for each post (depending on whether the post is English or Spanish)? At the moment, I am just passing each post to a stemmer without explicitly specifying the language.
This is a snippet of my current script:
import re
import pandas as pd
!pip install pyphen
import pyphen
!pip install spacy
import spacy
!pip install nltk
import nltk
from nltk import SnowballStemmer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
!pip install spacy-langdetect
from spacy_langdetect import LanguageDetector
!pip install textblob
from textblob import TextBlob
# Download Stopwords
nltk.download('stopwords')
stop_words_eng = set(stopwords.words('english'))
stop_words_es = set(stopwords.words('spanish'))
# Import Stemmer
p_stemmer = PorterStemmer()
#Snowball (Porter2): Nearly universally regarded as an improvement over porter, and for good reason.
snowball_stemmer = SnowballStemmer("english")
dic = pyphen.Pyphen(lang='en')
# Load Data
data = pd.read_csv("mergerfile.csv", error_bad_lines=False)
pd.set_option('display.max_columns', None)
posts = data.loc[data["ad_creative"] != "NONE"]
# Functions
def get_number_of_sentences(text):
    sentences = [sent.string.strip() for sent in text.sents]
    return len(sentences)

def get_average_sentence_length(text):
    number_of_sentences = get_number_of_sentences(text)
    tokens = [token.text for token in text]
    return len(tokens) / number_of_sentences

def get_token_length(text):
    tokens = [token.text for token in text]
    return len(tokens)

def text_analyzer(data_frame):
    content = []
    label = []
    avg_sentence_length = []
    number_sentences = []
    number_words = []
    for string in data_frame:
        string.join("")
        if len(string) <= 4:
            print(string)
            print("filtered")
            content.append(string)
            avg_sentence_length.append("filtered")
            number_sentences.append("filtered")
            number_words.append("filtered")
        else:
            # print list
            print(string)
            content.append(string)
            ## Average sentence length
            result = get_average_sentence_length(nlp(string))
            avg_sentence_length.append(result)
            print("avg sentence length:", result)
            ## Number of sentences
            result = get_number_of_sentences(nlp(string))
            number_sentences.append(result)
            print("#sentences:", result)
            ## Number of words
            result = get_token_length(nlp(string))
            number_words.append(result)
            print("#Words", result)
    return content, avg_sentence_length, number_sentences, number_words

content, avg_sentence_length, number_sentences, number_words = text_analyzer(
    data["posts"])

The short answer is no: neither NLTK nor spaCy will automatically determine the language of a text and apply the appropriate algorithms to it.
spaCy has separate language models, each with its own methods and part-of-speech and dependency tagsets. It also ships a set of stopwords for each available language.
NLTK is more modular; for stemming there are RSLPStemmer (Portuguese), ISRIStemmer (Arabic), and SnowballStemmer (Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish).
Once you have determined the language of a post, for example through spacy_langdetect, the next step is to explicitly load the appropriate spaCy language model or select the matching NLTK stemmer and stopword list.
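For example, a minimal sketch of that routing step (assuming spaCy v2 with the spacy_langdetect component already used in the question, and that only English and Spanish occur; names such as stem_post are illustrative):
import spacy
from spacy_langdetect import LanguageDetector
from nltk import SnowballStemmer
from nltk.corpus import stopwords

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

stemmers = {'en': SnowballStemmer('english'), 'es': SnowballStemmer('spanish')}
stop_words = {'en': set(stopwords.words('english')),
              'es': set(stopwords.words('spanish'))}

def stem_post(post):
    # Detect the language of this post, then stem with the matching resources
    doc = nlp(post)
    lang = doc._.language['language']             # e.g. 'en' or 'es'
    stemmer = stemmers.get(lang, stemmers['en'])  # fall back to English
    stops = stop_words.get(lang, stop_words['en'])
    return [stemmer.stem(tok.text) for tok in doc
            if not tok.is_punct and tok.text.lower() not in stops]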

Use the googletrans library for this:
#!/usr/bin/python
from googletrans import Translator
translator = Translator()
translator.detect('이 문장은 한글로 쓰여졌습니다.')
This returns
<Detected lang=ko confidence=0.27041003>
This is the best way to go if you have an internet connection, and in most cases it works better than spaCy's detection, since Google Translate is more mature and has stronger detection algorithms. ;)
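As a rough sketch, the detected language code can then drive the choice of NLTK resources (assuming only English and Spanish posts; post_text is a placeholder variable):
from googletrans import Translator
from nltk import SnowballStemmer
from nltk.corpus import stopwords

translator = Translator()
det = translator.detect(post_text)  # post_text is a placeholder
if det.lang == 'es':
    stemmer, stops = SnowballStemmer('spanish'), set(stopwords.words('spanish'))
else:
    stemmer, stops = SnowballStemmer('english'), set(stopwords.words('english'))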

Related

PDFMiner does not detect all pages

I am trying to extract text from PDFs, but I am running into a problem: my script sometimes detects every page of a PDF and sometimes only the first page. I even included this line from a previous post on Stack Overflow:
print(len(list(extract_pages(pdf_file))))
Whenever my script extracted just the first page, this line also reported only 1 page.
I've even tried another library (PyPDF2) to extract text, but got even worse results.
If I look up the properties of the PDFs that my script mishandles, Adobe clearly shows the correct number of pages.
Below is the code I am using. Any recommendations on how I might change my script to detect all pages of a PDF would be appreciated.
import os
from os.path import isfile, join
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdf_dir = "/dir/pdfs/"
txt_dir = "/dir/txt/"
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

for filename in corpus:
    print(filename)
    output_string = StringIO()
    with open(join(pdf_dir, filename), 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
Here is a solution: after trying different libraries in R (pdftools) and Python (pdfplumber), PyMuPDF worked best.
from io import StringIO
import os
from os.path import isfile, join
import fitz  # PyMuPDF

pdf_dir = "pdf path"
txt_dir = "txt path"
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

for filename in corpus:
    print(filename)
    output_string = StringIO()
    doc = fitz.open(join(pdf_dir, filename))
    for page in doc:
        # getText("rawdict") returns a dict, not a string, so extract plain text instead
        output_string.write(page.getText("text"))
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
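If it helps to narrow down whether the extraction or the PDF itself is at fault, a small sanity check can compare the page counts the two libraries report (a sketch; "some_file.pdf" is a placeholder path):
import fitz
from pdfminer.high_level import extract_pages

path = "some_file.pdf"                 # placeholder
doc = fitz.open(path)
print(len(doc))                        # number of pages PyMuPDF sees
print(len(list(extract_pages(path))))  # number of pages pdfminer sees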

How to train own model and test it with spacy

I am using the code below to train an already existing spaCy NER model. However, I don't get correct results when I test it.
What am I missing?
import spacy
import random
from spacy.gold import GoldParse
from spacy.language import EntityRecognizer

train_data = [
    ('Who is Rocky babu?', [(7, 16, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)
        nlp.entity.update([doc], [gold])
Now, when I try to test the above model using the code below, I don't get the expected output.
text = ['Who is Rocky babu?']
for a in text:
    doc = nlp(a)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
My output is as follows:
Entities []
whereas my expected output is as follows:
Entities [('Rocky babu', 'PERSON')]
Can someone please tell me what I'm missing?
Could you retry with
nlp = spacy.load('en_core_web_sm', entity=False, parser=False)
If that gives an error because you don't have that model installed, you can run
python -m spacy download en_core_web_sm
on the command line first.
And of course, keep in mind that for proper training of the model, you'll need many more examples so that it can generalize!
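For reference, a minimal training-loop sketch in the spaCy v2 style (an assumption on my part about the spaCy version; en_core_web_sm must be installed, and real training data should contain far more examples):
import random
import spacy

TRAIN_DATA = [
    # end offset 17 covers the full 'Rocky babu' span
    ('Who is Rocky babu?', {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.', {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
]

nlp = spacy.load('en_core_web_sm')
ner = nlp.get_pipe('ner')
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations['entities']:
        ner.add_label(label)

# Update only the NER component
other_pipes = [p for p in nlp.pipe_names if p != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], drop=0.35, sgd=optimizer, losses=losses)
        print(losses)

doc = nlp('Who is Rocky babu?')
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])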

How to remove unusual characters from JSON dump in Python?

I have been searching around for a good way to remove all unusual characters from a JSON dump of tweets that I am using to compile a dataset for sentiment analysis.
Characters I am trying to remove: ンボ チョボ付 最安値
These characters appear in my tweet data, and I am trying to remove them using regex, but to no avail.
import json
import csv
import pandas as pd
import matplotlib.pyplot as plt

tweets_data_path = 'twitter_data.txt'
tweets_data = []
tweets_text_data = []

tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

for tweet in tweets_data:
    if tweet['text']:
        tweets_text_data.append(tweet['text'])

print(tweets_text_data)

with open('dataset_file', 'w') as dataset_file:
    writer = csv.writer(dataset_file)
    writer.writerow(tweets_text_data)
I tried using re.sub() to remove these characters, but it did not work. How can I make this work?
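One common approach (a sketch, not necessarily what the original poster ended up using) is to strip everything outside the ASCII range before writing the CSV:
import re

def strip_non_ascii(text):
    # Remove every character outside the ASCII range (this drops all CJK characters)
    return re.sub(r'[^\x00-\x7F]+', '', text)

# e.g. applied while collecting the tweet texts:
# tweets_text_data.append(strip_non_ascii(tweet['text']))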

raise_FirstSetError in SpaCy topic modeling

I want to create an LDA topic model and am using spaCy to do so, following a tutorial. The error I receive when I try to use spaCy is one I cannot find on Google, so I'm hoping someone here knows what it's about.
I'm running this code on Anaconda:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt

df = pd.DataFrame(data)

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:1])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else ''
                                   for token in doc if token.pos_ in allowed_postags]))
    return texts_out

nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
And I receive the following error:
File "C:\Users\maart\AppData\Local\Continuum\anaconda3\lib\site-packages\_regex_core.py", line 1880, in get_firstset
raise _FirstSetError()
_FirstSetError
The error must occur somewhere after the lemmatization, because the other parts work fine.
Thanks a bunch!
I had this same issue and I was able to resolve it by uninstalling regex (I had the wrong version installed) and then running python -m spacy download en again. This will reinstall the correct version of regex.

Can't load dataset into ipython. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 1: invalid continuation byte

I'm fairly new to IPython, so I still get confused quite easily. Here is my code so far. After loading, I have to display only the first 5 rows of the file.
# Import useful packages for data science
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load concerts.csv
path1 = 'C:\\Users\\Cathal\\Documents\\concerts.csv'
concerts = pd.read_csv(path1)
Thanks in advance for any help.
Try
concerts = pd.read_csv(path1, encoding='utf8')
If that doesn't work, try
concerts = pd.read_csv(path1, encoding="ISO-8859-1")
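If neither encoding works, one option (an assumption on my part, not part of the original answer) is to let the chardet library guess the file's encoding first:
import chardet
import pandas as pd

path1 = 'C:\\Users\\Cathal\\Documents\\concerts.csv'

# Guess the encoding from a sample of the raw bytes
with open(path1, 'rb') as f:
    guess = chardet.detect(f.read(100000))
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

concerts = pd.read_csv(path1, encoding=guess['encoding'])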