spacy_wordnet -> lang extra fields not permitted - nltk

I was following a tutorial for spacy-wordnet and running this code:
import spacy
print(spacy.__version__)
from spacy_wordnet.wordnet_annotator import WordnetAnnotator
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})
but I'm getting this error:
spacy_wordnet -> lang extra fields not permitted
How can I fix it? I'm using VS Code, Python 3.1.0, and spaCy 3.3.0.
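If it helps: as far as I recall, the spacy-wordnet README for spaCy 3.x registers the pipe without any config block (the language is inferred from the pipeline itself), and the unrecognised 'lang' field is exactly what trips the "extra fields not permitted" validation. A sketch of that form, assuming spacy-wordnet >= 0.1.0 and the NLTK wordnet corpora are already downloaded:

```python
import spacy
# importing the annotator registers the "spacy_wordnet" factory
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

nlp = spacy.load("en_core_web_sm")
# spaCy 3.x style: no config dict -- passing {'lang': ...} is what
# raises "extra fields not permitted" in this version
nlp.add_pipe("spacy_wordnet", after="tagger")

token = nlp("prices")[0]
print(token._.wordnet.synsets())
```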

Related

Detect language/script from pdf with python

I am trying to create a Python script that detects the language(s)/script(s) inside a not-yet-OCRed PDF with the help of pytesseract, before doing the 'real' OCR by passing the correct detected language(s).
I have about 10,000 PDFs, not always in standard English and sometimes 1,000 pages long. In order to do the real OCR I need to autodetect the language first.
So it is a sort of two-step OCR, if you will, both of which tesseract can perform:
Detecting the language/script on some centered pages
Performing the real OCR with the found language/script over all pages
Any tips to fix/improve this script? All I want is the language(s) detected on the given pages.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
page_range = '[{}-{}]'.format(center_page - surround, center_page + surround)
with Image(filename=pdffilename + page_range) as im:
    print(pytesseract.image_to_osd(im, lang='osd', config='psm=0 pandas_config=None', nice=0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
At the moment I am getting the following error:
TypeError: Unsupported image object
Assuming that you have converted your PDF file using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect():
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
output fr
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
output zh-cn
There are other libraries, such as polyglot, that are useful if you have mixed languages.
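If pulling in a detection library is not an option, a very rough script-level check can be done with the standard library alone. This is a minimal sketch, not a library API: the function name, categories, and heuristic are my own, and it only distinguishes broad Unicode script families, not actual languages.

```python
import unicodedata

def guess_script(text):
    """Tally the Unicode script family of each alphabetic character
    and return the dominant one -- a crude script (not language) guess."""
    counts = {"latin": 0, "cjk": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "CJK" in name or "HIRAGANA" in name or "KATAKANA" in name:
            counts["cjk"] += 1
        elif "CYRILLIC" in name:
            counts["cyrillic"] += 1
        elif "LATIN" in name:
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return max(counts, key=counts.get)

print(guess_script("je suis un petit chat"))  # latin
print(guess_script("我是法国人"))              # cjk
```

This can serve as a cheap pre-filter before running the heavier per-language OCR pass.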

drop_duplicates() got an unexpected keyword argument 'ignore_index'

On my machine the code runs normally, but on my friend's machine there is an error about drop_duplicates(). The error is the one in the title.
Open your command prompt, type pip show pandas to check the current version of your pandas.
If it's lower than 1.0.0, as #paulperry says, then type pip install --upgrade pandas --user
(the --user flag installs into your per-user site-packages, so no admin rights are needed)
Type import pandas as pd; pd.__version__ to see what version of pandas you are using, and make sure it's >= 1.0.
I was having the same problem as Wzh -- but am running pandas version 1.1.3. So, it was not a version problem.
Ilya Chernov's comment pointed me in the right direction. I needed to extract a list of unique names from a single column in a more complicated DataFrame so that I could use that list in a lookup table. This seems like something others might need to do, so I will expand a bit on Chernov's comment with this example, using the sample CSV file "iris.csv" that is available on GitHub. The file lists sepal and petal lengths for a number of iris varieties. Here we extract the variety names.
import pandas as pd

df = pd.read_csv('iris.csv')
# drop duplicates BEFORE extracting the column
names = df.drop_duplicates('variety', inplace=False, ignore_index=True)
# THEN extract the column you want
names = names['variety']
print(names)
Here is the output:
0 Setosa
1 Versicolor
2 Virginica
Name: variety, dtype: object
The key idea here is to get rid of the duplicate variety names while the object is still a DataFrame (without changing the original file), and then extract the one column that is of interest.
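For the narrower case of just collecting one column's unique values, the same result can be had without drop_duplicates() at all, which also sidesteps the ignore_index version issue entirely. A sketch with a small inline frame (the iris data here is abbreviated to a few made-up rows for self-containment):

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["Setosa", "Setosa", "Versicolor", "Virginica", "Versicolor"],
    "petal_length": [1.4, 1.3, 4.7, 6.0, 4.5],
})

# Series.unique() keeps first-seen order and takes no extra arguments
names = list(df["variety"].unique())
print(names)  # ['Setosa', 'Versicolor', 'Virginica']
```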

Any library that can help me create a JSON file with dummy records

I am looking for any library (in Java) that can help me generate a dummy JSON file to test my code. For example, the JSON file could contain random user profile data: name, address, zipcode.
I searched StackOverflow and found the following link: How to generate JSON string in Java?
I think the suggested library https://github.com/DiUS/java-faker seems useful; however, because of security constraints I cannot use this particular library. Are there any more recommendations?
Use, for instance, Faker, like this:
#!/usr/bin/env python3
from json import dumps
from faker import Faker

fake = Faker()

def user():
    return dict(
        name=fake.name(),
        address=fake.address(),
        bio=fake.text()
    )

print('[')
try:
    while True:
        print(dumps(user()))
        print(',')
except KeyboardInterrupt:
    # XXX: json array can not end with a comma
    print(dumps(user()))
    print(']')
You can use it like this:
python3 fake_user.py > users.json
Use Ctrl+C to stop it when the file is big enough
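If even Faker is off-limits (as java-faker was for the asker), purely random records can be built with the standard library alone. A sketch, with made-up field names chosen to match the question; serializing the whole list in one json.dumps call also avoids the trailing-comma workaround used above:

```python
import json
import random
import string

def random_word(n):
    # a nonsense lowercase word of length n
    return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

def user():
    return {
        "name": random_word(6).title() + " " + random_word(8).title(),
        "address": str(random.randint(1, 999)) + " " + random_word(7).title() + " St",
        "zipcode": "".join(random.choice(string.digits) for _ in range(5)),
    }

users = [user() for _ in range(3)]
text = json.dumps(users, indent=2)  # one dump: no trailing-comma problem
print(text)
```

The records are gibberish rather than plausible-looking names, but for exercising JSON parsing code that is usually enough.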

jsonstat.from_file() return error "can't multiply sequence by non-int of type 'list'"

I'm trying to parse a json-stat file using jsonstat.py (v 0.1.7) but am getting an error.
The code below is copied from the examples on github (https://github.com/26fe/jsonstat.py/tree/master/examples-notebooks):
from __future__ import print_function
import os
import jsonstat
os.chdir(r'D:\Desktop\JSON_Stat')
url = 'http://www.cso.ie/StatbankServices/StatbankServices.svc/jsonservice/responseinstance/NQQ25'
file_name = "test02.json"
file_path = os.path.abspath(os.path.join("..","JSON_Stat", "CSO", file_name))
I added this line to deal with non-ASCII characters in the file:
# -*- coding: utf-8 -*-
This successfully downloads the JSON file to my desktop:
if os.path.exists(file_path):
    print("using already downloaded file {}".format(file_path))
else:
    print("download file and storing on disk")
    jsonstat.download(url, file_path)
From here, I can load and pprint the data using the json module:
import json
import pprint as pp
with open(r"CSO\test02.json") as data_file:
    data = json.load(data_file)
pp.pprint(data)
... but when I try to use the jsonstat module (as specified in the examples), I get the error mentioned in the subject:
collection = jsonstat.from_file(r"D:\Desktop\JSON_Stat\CSO\test02.json")
collection
# abbreviated error message
--> 384 self.__pos2cat = self.__size * [None]
TypeError: can't multiply sequence by non-int of type 'list'
I understand what the error message itself means but, having studied the dimensions.py module where it occurs, I am stuck trying to understand why. I was able to run the sample OECD code without issue, so perhaps the data itself is not formatted in the expected way, though the source site (http://www.cso.ie/webserviceclient/) states that the JSON-stat format is being used.
So, finally, my questions are: has anyone run into this error and resolved it? Has anyone successfully used the jsonstat module to parse this specific data? Alternatively, any general advice on troubleshooting this issue is welcome.
Thanks
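One way to narrow a problem like this down without jsonstat.py is to inspect the raw file with the stdlib json module and check which JSON-stat flavour it follows. jsonstat.py 0.1.7 predates JSON-stat 2.0, so a version mismatch is a plausible (but unconfirmed) culprit here. A sketch against an inline sample; the function name and the heuristic are my own:

```python
import json

# a tiny hand-written sample in JSON-stat 2.0 shape (not real CSO data)
sample = json.loads("""
{"version": "2.0", "class": "dataset",
 "id": ["sex", "year"], "size": [2, 3],
 "value": [1, 2, 3, 4, 5, 6]}
""")

def jsonstat_flavour(data):
    """Rough check: JSON-stat 2.0 puts 'version'/'class' at the top level;
    1.x instead wraps each dataset in a named object."""
    if "version" in data or "class" in data:
        return "2.0-style"
    return "1.x-style"

print(jsonstat_flavour(sample))  # 2.0-style
```

If the file turns out to be 2.0-style, trying a newer release of jsonstat.py (or the pyjstat library) would be the next thing to test.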

Why can't I import a default export with "import ... as" with BabelJS

In version 5.6.4 of BabelJS, I seemingly cannot "import ... as." Here are examples of what I am trying to do:
In file 'test.js':
export default class Test {};
In file 'test2.js' (in the same directory):
import Test as Test2 from './test';
I have also tried to do:
import {Test as Test2} from './test';
Even though it says nothing about that here:
http://babeljs.io/docs/learn-es2015/#modules
And only uses brackets in the non-default syntax here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import
Has anyone done this successfully?
EDIT: It is absolutely because of the default keyword. So, in this case, the question becomes, does anyone have any links to documentation that states that I should not be able to alias a default import? ECMA or Babel.
You can import the default export by either
import Test2 from './test';
or
import {default as Test2} from './test';
The default export doesn't have Test as a name that you would need to alias; you just need to import the default under whatever name you want.
The best docs I've found so far are in the article "ECMAScript 6 modules: the final syntax" on Axel Rauschmayer's blog.