Why am I getting an "expected string or bytes-like object" when tokenizing? - nltk

I am a newbie in Python, and I read and execute code carefully. However, I found it hard to solve this basic problem: I always encounter "expected string or bytes-like object" when I try to tokenize using nltk and a pandas DataFrame. I tried converting my text to a string because of this, but I get the same result.
This is my code:
df = pd.read_csv(r'C:\Users\nec_2\Documents\data_finalSLCA.csv', sep='\t', encoding='utf-8', delimiter=",")
print(df)
#extracting only the column abstract
df = pd.DataFrame(df, columns=['abstract'])
#converting to list
df_2 = [t for t in df.abstract.tolist()]
#tokenizing
from nltk.tokenize import word_tokenize
data = df_2
tokenized_text=word_tokenize(data)
print(tokenized_text)
print(type(tokenized_text))
I am getting the error "TypeError: expected string or bytes-like object".
Can somebody help me out with this problem?
Thanks in advance.
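For what it's worth, the error comes from passing the whole list to word_tokenize, which expects a single string; tokenizing each element of the list fixes it. A minimal sketch of the pattern, using a regex stand-in for word_tokenize so it runs without NLTK's data files (simple_tokenize is a hypothetical substitute, not an NLTK function):

```python
import re

def simple_tokenize(text):
    # regex stand-in for nltk.tokenize.word_tokenize: words and punctuation
    return re.findall(r"\w+|[^\w\s]", text)

# df_2 in the question is a list of abstract strings, e.g.:
abstracts = ["First abstract.", "Second, longer abstract."]

# word_tokenize expects one string, so tokenize each list element separately
tokenized = [simple_tokenize(t) for t in abstracts]
```

With NLTK installed, the same pattern would be `tokenized = [word_tokenize(t) for t in df_2]`.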

Related

sympy: function to convert string into actual sympy object

assume I have a csv-file in the following format:
csc($0);\csc##{$0};1;;;;;https://docs.sympy.org/latest/modules/functions/elementary.html#csc
cot($0);\cot##{$0};1;;;;;https://docs.sympy.org/latest/modules/functions/elementary.html#cot
;;;;;;;
sinh($0);\sinh##{$0};1;;;;;https://docs.sympy.org/latest/modules/functions/elementary.html#sinh
I want to read it into a Python script using:
import csv

default_semantic_latex_table = {}
with open("CAS_SymPy.csv") as file:
    reader = csv.reader(file, delimiter=";")
    for line in reader:
        if line[0] != '':
            tup = (line[0], line[2])
            default_semantic_latex_table[tup] = str(line[1])
I'd like to get a dict of the following form:
default_semantic_latex_table = {
    (sympy.functions.elementary.trigonometric.sinh, 1): FormatTemplate(r"\sinh#{$0}")
}
The first element of the tuple should not be a string but an actual sympy-object.
Does anyone know about a function that can convert a string into a sympy-object such as sympy.functions.elementary.trigonometric.sin, sympy.functions.elementary.exponential.exp or sympy.concrete.products.Product? I'd greatly appreciate any help!
sympify converts strings to SymPy objects (in the REPL session below, _ is the previous result, i.e. the cos function):
>>> from sympy import sympify, pi
>>> sympify('cos')
cos
>>> _(pi)
-1
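If the table is meant to hold full dotted paths like sympy.functions.elementary.trigonometric.sinh rather than bare names, a stdlib alternative to sympify is importlib plus getattr. A sketch, shown with math.cos so it runs without SymPy (resolve is a hypothetical helper, not a SymPy function):

```python
import importlib

def resolve(dotted_path):
    # "math.cos" -> the actual function/class object named by the path
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

cos = resolve("math.cos")
```

The same call with a SymPy path, e.g. resolve("sympy.functions.elementary.trigonometric.sinh"), would return the actual sinh class.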

Error while running a python-Storing Data object in JSON

I've extracted data via an API and had to transform it to read the data in tabular format. Sample code:
import json
import requests
import pandas as pd
from pandas import json_normalize
result = requests.get('https://website.com/api')
data = result.json()
df = pd.DataFrame(data['result']['records'])
Every time I run the above Python (.py) file in the terminal, I get an error on the line that says:
in <module>
data = result.json()
Also this:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Not sure why I am getting this error. Can anyone tell me how to fix this?
Any help would be appreciated.
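That traceback usually means the server returned something other than JSON: an HTML error page, an empty body, or a redirect to a login page. A sketch of a defensive check before decoding (parse_json_response is a hypothetical helper; in practice you would pass result.status_code, result.headers.get('Content-Type', ''), and result.text from the requests response):

```python
import json

def parse_json_response(status_code, content_type, body):
    # "Expecting value: line 1 column 1 (char 0)" typically means the body
    # is not JSON at all, so inspect the status and Content-Type first.
    if status_code != 200:
        raise RuntimeError(f"HTTP {status_code}: {body[:100]!r}")
    if "json" not in content_type.lower():
        raise RuntimeError(f"Not JSON (Content-Type: {content_type}): {body[:100]!r}")
    return json.loads(body)

data = parse_json_response(200, "application/json", '{"result": {"records": []}}')
```

The error messages then show the beginning of the unexpected body, which usually makes the real cause obvious.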

line-delimited json format txt file, how to import with pandas

I have a line-delimited JSON file in .txt format. Now I want to import it with pandas. Usually I would import it with one of
df = pd.read_csv('df.txt')
df = pd.read_json('df.txt')
df = pd.read_fwf('df.txt')
but they all give me an error:
ParserError: Error tokenizing data. C error: Expected 29 fields in line 1354, saw 34
ValueError: Trailing data
One of them returns the data, but it is organized in a weird way, with the column name on the left next to the data.
Can anyone tell me how to solve this?
pd.read_json('df.txt', lines=True)
read_json accepts a boolean argument lines, which will read the file as one JSON object per line.
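With lines=True, pandas parses each line as its own JSON object (plain read_json expects a single document, hence the Trailing data error). The stdlib equivalent of that parsing step, on a small made-up sample:

```python
import json
from io import StringIO

# sample line-delimited JSON: one complete object per line
raw = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

# what lines=True does conceptually: decode each line independently
records = [json.loads(line) for line in StringIO(raw) if line.strip()]
```

pd.read_json('df.txt', lines=True) then builds the DataFrame from those per-line records.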

Unable to print output of JSON code into a .csv file

I'm getting the following errors when trying to decode this data; the second error appeared after I tried to compensate for the Unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the second part (as below), I have no problem running this. I think your first problem is a consequence of the second error (of that line being incorrect).
I am on Python 3 and can run your code with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
I personally would use requests.
import requests
import csv
data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error: looking at the documentation for the open function, you need to use the right argument names, and name the mode argument if you are not relying on positional matching.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
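A quick self-contained check that the corrected open call round-trips the curly-quote character (u'\u201c') from Error 1. This writes to a temp file rather than data.csv, and adds newline="", which the csv docs recommend to avoid blank lines between rows on Windows:

```python
import csv
import os
import tempfile

# row containing the curly quote that triggered the original UnicodeEncodeError
subjects = [["Episode \u201cOne\u201d", 1]]

path = tempfile.mkstemp(suffix=".csv")[1]
# newline="" prevents csv from writing blank lines between rows on Windows
with open(path, encoding="utf-8", mode="w", newline="") as writeFile:
    csv.writer(writeFile).writerows(subjects)

with open(path, encoding="utf-8", newline="") as readFile:
    back = list(csv.reader(readFile))
os.remove(path)
```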

python 3 read csv UnicodeDecodeError

I have a very simple bit of code that takes in a CSV and puts it into a 2D array. It runs fine on Python 2, but in Python 3 I get the error below. Looking through the documentation, I think I need to use .decode(). Could someone please explain how to use it in the context of my code, and why I don't need to do anything in Python 2?
Error:
line 21, in
for row in datareader:
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 5002: invalid start byte
import csv
import sys
fullTable = sys.argv[1]
datareader = csv.reader(open(fullTable, 'r'), delimiter=',')
full_table = []
for row in datareader:
    full_table.append(row)
print(full_table)
open(sys.argv[1], encoding='ISO-8859-1')
The CSV contained characters which were not UTF-8, which seems to be the default. I am, however, surprised that Python 2 dealt with this issue without any problems. (Python 2's open returned raw bytes and never attempted to decode them, which is why it didn't complain.)
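The failure and the fix can be reproduced end to end: write a 0xa9 byte (the one named in the traceback) and read it back with ISO-8859-1, which maps every byte to a character. A sketch using a temp file; the correct encoding for any given file may of course differ:

```python
import csv
import os
import tempfile

# byte 0xa9 is the copyright sign in ISO-8859-1 but an invalid start byte in UTF-8
path = tempfile.mkstemp(suffix=".csv")[1]
with open(path, "wb") as f:
    f.write(b"name,symbol\ncopyright,\xa9\n")

# reading this with encoding='utf-8' would raise UnicodeDecodeError
with open(path, "r", encoding="ISO-8859-1", newline="") as f:
    rows = list(csv.reader(f))
os.remove(path)
```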