How to tokenize a text with NLTK in Python - nltk

I have a text like this:
Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid()
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
I tokenize this text with word_tokenize in Python and the output is:
Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist
But as you can see, the second line contains several words joined together by dots. How can I separate these into individual words?
I use this Python code:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.corpus import stopwords
>>> stopwords = set(stopwords.words('english'))
>>> f = open('001.txt')
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]
In fact, I want all the words to be separated, like this:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist

f = "Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() \
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet' \
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist'"
s = ''
f_list = f.replace('.', ' ').split(' ')
for item in f_list:
    # print(item)
    s = s + ' ' + item + '\n'
print(s)
Output:
Exception
in
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid()
with
cause
=
'org
hibernate
exception
SQLGrammarException:
could
not
extract
ResultSet'
Caused
by:
java
sql
SQLSyntaxErrorException:
ORA-00942:
table
or
view
does
not
exist'
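A slightly more compact variant of the same idea (a sketch, not part of the original answer) splits on runs of dots and whitespace in a single pass with re.split:
import re

# Reusing the string f defined in the snippet above: one split on
# runs of dots and whitespace, dropping any empty tokens.
tokens = [t for t in re.split(r'[.\s]+', f) if t]
print('\n'.join(tokens))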

I found a simple way using RegexpTokenizer from nltk.tokenize, like this:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> text = [w for w in tokenizer.tokenize(open('001.txt').read()) if w not in stopwords]
The output, after removing stopwords, is as follows:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
org
hibernate
exception
SQLGrammarException
could
extract
ResultSet
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist
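For reference, a complete, self-contained version of this approach could look as follows (the filename 001.txt and the English stopword list are assumptions carried over from the question):
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores, so dots,
# colons, quotes, and parentheses all act as token boundaries.
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

with open('001.txt') as f:
    tokens = [w for w in tokenizer.tokenize(f.read())
              if w not in stop_words]

print('\n'.join(tokens))
Note that with r'\w+' a hyphen is also a separator, so a code like ORA-00942 would split into two tokens; a pattern such as r'[\w-]+' would keep it intact.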

Related

Why can't I replace an existing table in my MySQL database using the following code?

import pandas as pd
from sqlalchemy import create_engine

db_instrumentos = pd.DataFrame(columns=('Market', 'Ticker', 'cficode'))
db_instrumentos = db_instrumentos.append({'Market': 'NYSE',
                                          'Ticker': 'MMM',
                                          'cficode': 'EAEWD25A'},
                                         ignore_index=True)
db = db_instrumentos
sql_engine = create_engine('mysql+pymysql://root:#localhost')
sql_conn = sql_engine.connect()
sql_conn.execute("CREATE DATABASE IF NOT EXISTS proof_Rofex")
sql_conn.execute("USE proof_Rofex")
db.to_sql(con=sql_conn, name='proof_table2', if_exists="replace")
sql_conn.close()
I want to use this line of code, db.to_sql(con=sql_conn, name='proof_table2', if_exists="replace"), and it throws the following error:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1050, "Table 'proof_table2' already exists") [SQL: CREATE TABLE proof_table2 (index BIGINT, Market TEXT, Ticker TEXT, cficode TEXT)
] (Background on this error at: http://sqlalche.me/e/14/e3q8)
The proof_table2 table already exists, but that line should let me drop and replace it; the error is thrown precisely on that line.
I was able to solve it by adding the schema argument; for your example it should be as below:
db.to_sql(con=sql_conn, name='proof_table2', if_exists='replace',
          schema='proof_Rofex')
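Passing schema likely matters here because pandas inspects that schema to decide whether the table already exists; without it, the existence check runs against the connection's default database, misses proof_table2, and the subsequent CREATE TABLE collides with the existing table. An alternative sketch (hypothetical credentials) bakes the database into the engine URL so every connection defaults to proof_Rofex:
from sqlalchemy import create_engine

# Hypothetical credentials; naming the database in the URL makes
# proof_Rofex the default schema for every connection.
sql_engine = create_engine('mysql+pymysql://root:password@localhost/proof_Rofex')
sql_conn = sql_engine.connect()
db.to_sql(con=sql_conn, name='proof_table2', if_exists='replace')
sql_conn.close()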

Python 3 psycopg2 COPY from stdin failed: error in .read()

I am trying to apply the code found on this page, in particular the 'Copy Data from String Iterator' part of the table of contents, but I run into an issue with my code.
Since not all lines coming from the generator (here log_lines) can be imported into the PostgreSQL database, I try to filter for the correct lines (here row) using itertools.filterfalse, as in the code block below:
def copy_string_iterator(connection, log_lines) -> None:
    with connection.cursor() as cursor:
        create_staging_table(cursor)
        log_string_iterator = StringIteratorIO((
            '|'.join(map(clean_csv_value, (
                row['date'],
                row['time'],
                row['cs_uri_query'],
                row['s_contentpath'],
                row['sc_status'],
                row['s_computername'],
                ...
                row['sc_substates'],
                row['s_port'],
                row['cs_version'],
                row['c_protocol'],
                row.update({'cs_cookie':'x'}),
                row['timetakenms'],
                row['cs_uri_stem'],
            ))) + '\n'
            for row in filterfalse(lambda line: "#" in line.get('date'), log_lines)
        ))
        cursor.copy_from(log_string_iterator, 'log_table', sep='|')
When I run this, cursor.copy_from() gives me the following error:
QueryCanceled: COPY from stdin failed: error in .read() call
CONTEXT: COPY log_table, line 112910
I understand why this error happens: in the test file I use, only 112909 lines meet the filterfalse condition. But why does it try to copy line 112910 and throw the error instead of just stopping?
Since Python doesn't have a null-coalescing operator, add something like:
(map(clean_csv_value, (
    row['date'] if 'date' in row else None,
    :
    row['cs_uri_stem'] if 'cs_uri_stem' in row else None,
))) + '\n')
for each of your fields, so you can handle any missing fields in the JSON file. Of course, the fields should be nullable in the database if you use None; otherwise, replace None with some default value for that field.
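As an aside (not from the original answer), dict.get already returns None for missing keys, so each of those guards can collapse to a single call:
# dict.get returns None when the key is missing, so
#   row['date'] if 'date' in row else None
# is equivalent to
#   row.get('date')
row = {'date': '2021-01-01'}   # 'cs_uri_stem' intentionally absent
print(row.get('date'))         # 2021-01-01
print(row.get('cs_uri_stem'))  # None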

Exception handling using mysql with twisted adbapi and scrapy

I'm using this scrapy pipeline. If there is any error in the SQL in the _insert_record function, it fails silently. For example, if a column name is misspelled, like this:
def _insert_record(self, tx, item):
    print "before tx.execute"
    result = tx.execute(
        """INSERT INTO table(col_one, col_typo, col_three) VALUES (1,2,3)"""
    )
    print "after tx.execute"
    if result > 0:
        self.stats.inc_value('database/items_added')
then nothing is printed after "before tx.execute". There is a handle_error method, but that is not called either. How can I catch and handle such errors?
Just needed to surround it with try...except
try:
    result = tx.execute(
        """INSERT INTO table(col_one, col_typo, col_three) VALUES (1,2,3)"""
    )
except Exception, e:
    print str(e)
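If the goal is for the pipeline's handle_error method to fire as well, another option (a sketch, assuming the pipeline holds a twisted.enterprise.adbapi.ConnectionPool in self.dbpool, as the linked pipeline does) is to attach an errback to the Deferred returned by runInteraction:
def process_item(self, item, spider):
    d = self.dbpool.runInteraction(self._insert_record, item)
    # Without an errback, the Failure from a bad query is swallowed
    # silently; with one, it is delivered to _handle_error.
    d.addErrback(self._handle_error)
    return d

def _handle_error(self, failure):
    print failure.getErrorMessage()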

Error: JsonStorage in Pig local mode

I am running my Pig script in local mode in Eclipse. When I try to store the output using JsonStorage, I get the following exception:
Exception in thread "main" java.lang.RuntimeException: Cannot instantiate:org.apache.pig.builtin.JsonStorage
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:4976)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1351)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:706)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.PigServer.registerScript(PigServer.java:407)
at com.paypal.debugpig.DebugPig.main(DebugPig.java:13)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.builtin.JsonStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:458)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:470)
... 14 more
Pig script:
REGISTER C:/path/to/jar/pig.jar;
REGISTER C:/path/to/jar/UpperUDf/UpperUDf_fat.jar;
A = LOAD 'C:/path/to/data/file/student.txt' using PigStorage('\t') AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name) ,age, gpa ;
STORE B INTO 'output_student_Json' USING org.apache.pig.builtin.JsonStorage();
When I dump or store the output in a text file it works, but the issue occurs when I try to store it in JSON format.
Any pointers appreciated
Thank you
I have verified it, and it works for me when I use the line below to store the output in JSON file format.
store B into 'json_output' using JsonStorage();

Error in fromJSON(paste(raw.data, collapse = "")) : unclosed string

I am using the R package rjson to download weather data from Wunderground.com. Often I leave the program to run and there are no problems, with the data being collected fine. However, the program often stops running and I get the following error message:
Error in fromJSON(paste(raw.data, collapse = "")) : unclosed string
In addition: Warning message:
In readLines(conn, n = -1L, ok = TRUE) :
incomplete final line found on 'http://api.wunderground.com/api/[my_API_code]/history_20121214pws:1/q/pws:IBIRMING7.json'
Does anyone know what this means, and how I can avoid it since it stops my program from collecting data as I would like?
Many thanks,
Ben
I can recreate your error message using the rjson package.
Here's an example that works.
rjson::fromJSON('{"x":"a string"}')
# $x
# [1] "a string"
If we omit a double quote from the value of x, then we get the error message.
rjson::fromJSON('{"x":"a string}')
# Error in rjson::fromJSON("{\"x\":\"a string}") : unclosed string
The RJSONIO package behaves slightly differently. Rather than throwing an error, it silently returns a NULL value.
RJSONIO::fromJSON('{"x":"a string}')
# $x
# NULL
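As for avoiding it: the "incomplete final line" warning suggests the HTTP response was truncated mid-download, which is what leaves the string unclosed. One common workaround (an assumption about Ben's setup, not something from the answer above) is to wrap the readLines/fromJSON step in tryCatch and retry the request when parsing fails, so a single truncated response does not stop the collector.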