Image to OCR processing using PySpark

I am trying to extract text from an image (using pytesser, which wraps the Tesseract OCR engine) from PySpark. I have uploaded my image file to HDFS and am trying to read it from Spark.
This is my code:
>>> import sys
>>> sys.path.append("/home/dsuser/Downloads/pytesser")
>>> from PIL import Image
>>> from pytesser import *
>>> image_file = 'hdfs://localhost:9000/image/image6.jpg'
>>> rdd = sc.binaryFiles(image_file)
>>> img = Image.open(rdd.first())
>>> text = image_to_string(img)
>>> print("=====output=======\n")
>>> print(text)
While running, Spark is able to load the image file from HDFS, but I get the error below when I call this line:
>>> im = Image.open(rdd.first())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 2000, in open
    prefix = fp.read(16)
AttributeError: 'tuple' object has no attribute 'read'
I am not sure what this error means; I need help converting the image to text with OCR.
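For what it's worth, sc.binaryFiles returns an RDD of (path, content) tuples rather than file objects, which is why PIL's fp.read(16) call fails here. A minimal sketch of unpacking the tuple first (an assumption about the fix, untested against this exact setup):
>>> import io
>>> path, data = rdd.first()            # binaryFiles yields (filename, bytes) pairs
>>> img = Image.open(io.BytesIO(data))  # wrap the raw bytes in a file-like object
>>> text = image_to_string(img)
>>> print(text)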

Error message when importing .csv files into MySQL using Python

I am a novice when it comes to Python and I am trying to import a .csv file into an already existing MySQL table. I have tried it several different ways but I cannot get anything to work. Below is my latest attempt (not the best syntax I'm sure). I originally tried using ‘%s’ instead of ‘?’, but that did not work. Then I saw an example of the question mark but that clearly isn’t working either. What am I doing wrong?
import mysql.connector
import pandas as pd
db = mysql.connector.connect(**Login Info**)
mycursor = db.cursor()
df = pd.read_csv("CSV_Test_5.csv")
insert_data = (
    "INSERT INTO company_calculations.bs_import_test(ticker, date_updated, bs_section, yr_0, yr_1, yr_2, yr_3, yr_4, yr_5, yr_6, yr_7, yr_8, yr_9, yr_10, yr_11, yr_12, yr_13, yr_14, yr_15) "
    "VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
)
for row in df.itertuples():
    data_inputs = (row.ticker, row.date_updated, row.bs_section, row.yr_0, row.yr_1, row.yr_2, row.yr_3, row.yr_4, row.yr_5, row.yr_6, row.yr_7, row.yr_8, row.yr_9, row.yr_10, row.yr_11, row.yr_12, row.yr_13, row.yr_14, row.yr_15)
    mycursor.execute(insert_data, data_inputs)
db.commit()
Error Message:
Traceback (most recent call last):
  File "C:\...\Python_Test\Excel_Test_v1.py", line 33, in <module>
    mycursor.execute(insert_data, data_inputs)
  File "C:\...\mysql\connector\cursor_cext.py", line 325, in execute
    raise ProgrammingError(
mysql.connector.errors.ProgrammingError: Not all parameters were used in the SQL statement
MySQL Connector/Python supports named parameters (which also include printf-style (format) parameters).
>>> import mysql.connector
>>> mysql.connector.paramstyle
'pyformat'
According to PEP-249 (DB API level 2.0) the definition of pyformat is:
pyformat: Python extended format codes, e.g. ...WHERE name=%(name)s
Example:
>>> cursor.execute("SELECT %s", ("foo", ))
>>> cursor.fetchall()
[('foo',)]
>>> cursor.execute("SELECT %(var)s", {"var" : "foo"})
>>> cursor.fetchall()
[('foo',)]
As far as I know, the qmark paramstyle (using a question mark as a placeholder) is only supported by MariaDB Connector/Python.
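Applied to the code in the question, a minimal sketch using %s placeholders (assuming the CSV columns appear in exactly the order listed in the INSERT statement):
insert_data = (
    "INSERT INTO company_calculations.bs_import_test(ticker, date_updated, bs_section, yr_0, yr_1, yr_2, yr_3, yr_4, yr_5, yr_6, yr_7, yr_8, yr_9, yr_10, yr_11, yr_12, yr_13, yr_14, yr_15) "
    "VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
)
for row in df.itertuples(index=False):
    # itertuples(index=False) yields the values in column order, so the whole
    # tuple can be passed straight through as the parameter sequence
    mycursor.execute(insert_data, tuple(row))
db.commit()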

Convert dataframe to JSON using Python

I have been trying to convert a dataframe to JSON using Python. I am able to do it successfully, but I am not getting the required format of JSON.
Code -
df1 = df.rename_axis('CUST_ID').reset_index()
df.to_json('abc.json')
Here, abc.json is the filename of JSON and df is the required dataframe.
What I am getting -
{"CUST_LAST_UPDATED":
{"1000":1556879045879.0,"1001":1556879052416.0},
"CUST_NAME":{"1000":"newly
updated_3_file","1001":"heeloo1"}}
What I want -
[{"CUST_ID":1000,"CUST_NAME":"newly
updated_3_file","CUST_LAST_UPDATED":1556879045879},
{"CUST_ID":1001,"CUST_NAME":"heeloo1","CUST_LAST_UPDATED":1556879052416}]
Error -
Traceback (most recent call last):
  File "C:/Users/T/PycharmProject/test_pandas.py", line 19, in <module>
    df1 = df.rename_axis('CUST_ID').reset_index()
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 3379, in reset_index
    new_obj.insert(0, name, level_values)
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2613, in insert
    allow_duplicates=allow_duplicates)
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\internals.py", line 4063, in insert
    raise ValueError('cannot insert {}, already exists'.format(item))
ValueError: cannot insert CUST_ID, already exists
df.head() Output -
CUST_ID CUST_LAST_UPDATED CUST_NAME
0 1000 1556879045879 newly updated_3_file
1 1001 1556879052416 heeloo1
How to change the format while converting dataframe to JSON?
Use DataFrame.rename_axis with DataFrame.reset_index to turn the index into a column, and then DataFrame.to_json with orient='records':
df1 = df.rename_axis('CUST_ID').reset_index()
df1.to_json('abc.json', orient='records')
[{"CUST_ID":"1000",
"CUST_LAST_UPDATED":1556879045879.0,
"CUST_NAME":"newly updated_3_file"},
{"CUST_ID":"1001",
"CUST_LAST_UPDATED":1556879052416.0,
"CUST_NAME":"heeloo1"}]
EDIT:
Because the data already has a default index, use:
df1.to_json('abc.json', orient='records')
Verify:
print (df1.to_json(orient='records'))
[{"CUST_ID":1000,
"CUST_LAST_UPDATED":1556879045879,
"CUST_NAME":"newly pdated_3_file"},
{"CUST_ID":1001,
"CUST_LAST_UPDATED":1556879052416,
"CUST_NAME":"heeloo1"}]
You can convert a dataframe to JSON format using to_dict:
df1.to_dict('records')
The output would be the one that you need.
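For example, a short sketch pairing to_dict with the standard json module (reusing the question's abc.json file name):
import json

records = df1.to_dict('records')  # list of {column: value} dicts, one per row
with open('abc.json', 'w') as f:
    json.dump(records, f)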
If the dataframe has NaN values in some rows and you don't want them in your JSON file, follow the code below:
import pandas as pd
import json
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--csv")
    parser.add_argument("--json")
    args = parser.parse_args()
    entities = pd.read_csv(args.csv)
    # drop NaN values row by row so each record keeps only its populated fields
    json_data = [row.dropna().to_dict() for index, row in entities.iterrows()]
    with open(args.json, "w") as file:
        json.dump(json_data, file)
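A hypothetical invocation, assuming the script is saved as drop_nan_to_json.py (the file names are illustrative):
python drop_nan_to_json.py --csv input.csv --json output.json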

Export JSON to CSV using Python

I wrote some code to extract information from a website. The output is JSON, and I want to export it to CSV, so I tried to convert it to a pandas dataframe and then export it to CSV with pandas. I can print the results, but it still doesn't convert the response to a pandas dataframe. Do you know what the problem with my code is?
# -*- coding: utf-8 -*-
# To create http request/session
import requests
import re, urllib
import pandas as pd
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?
q=construction%20manager&l=Houston&start=10"
# create session
s = requests.session()
html = s.get(url).text
# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)
#Convert to pandas dataframe
df = pd.read_json(ajax_content)
#Export to CSV
df.to_csv("c:\\users\\Name\desktop\\newcsv.csv")
The error message is:
Traceback (most recent call last):
  File "C:\Users\Mehrdad\Desktop\Indeed 06.py", line 21, in <module>
    df = pd.read_json(ajax_content)
  File "c:\python27\lib\site-packages\pandas\io\json\json.py", line 408, in read_json
    path_or_buf, encoding=encoding, compression=compression,
  File "c:\python27\lib\site-packages\pandas\io\common.py", line 218, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <type 'dict'>
The problem was that nothing was going into the dataframe when you called read_json(), because the response was a nested JSON dict:
import requests
import re, urllib
import pandas as pd
from pandas.io.json import json_normalize
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston&start=10"
s = requests.session()
html = s.get(url).text
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
ajax_content= s.get(ajax_url).json()
df = json_normalize(ajax_content).transpose()
df.to_csv('your_output_file.csv')
Note that I called json_normalize() to collapse the nested columns from the JSON. I also called transpose() so that the rows, rather than the columns, were labelled with the job IDs. This will give you a dataframe that looks like this:
0079ccae458b4dcf <p><b>Company Environment: </b></p><p>Planet F...
0c1ab61fe31a5c62 <p><b>Commercial Construction Project Manager<...
0feac44386ddcf99 <div><div>Trendmaker Homes is currently seekin...
...
It's not really clear what your expected output is, though ... what are you expecting the DataFrame/CSV file to look like? If you were actually looking for just a single row/Series with the job IDs as column labels, just remove the call to transpose().
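As an aside, if the response really is a flat {job_id: html} mapping (an assumption; the question never shows its exact shape), the frame can also be built directly from a Series:
# a flat dict becomes index -> value; to_frame() then names the single column
df = pd.Series(ajax_content, name='description').to_frame()
df.to_csv('your_output_file.csv')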

error while using nltk.pos_tag

I have been trying to use nltk.pos_tag in my code, but I face an error when I do so. I have already downloaded the Penn Treebank and max_ent_treebank_pos resources, but the error persists. Here is my code:
import nltk
from nltk import tag
from nltk import *
a = "Alan Shearer is the first player to score over a hundred Premier League goals."
a_sentences = nltk.sent_tokenize(a)
a_words = [nltk.word_tokenize(sentence) for sentence in a_sentences]
a_pos = [nltk.pos_tag(sentence) for sentence in a_words]
print(a_pos)
and this is the error I get:
"Traceback (most recent call last):
File "<pyshell#9>", line 1, in <module>
print (nltk.pos_tag(text))
File "C:\Python34\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
tagger = PerceptronTagger()
File "C:\Python34\lib\site-packages\nltk\tag\perceptron.py", line 140, in __init__
AP_MODEL_LOC = 'file:'+str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
File "C:\Python34\lib\site-packages\nltk\data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron
_tagger.pickle' not found. Please use the NLTK Downloader to
obtain the resource: >>> nltk.download()
Searched in:
- 'C:\\Users\\T01142/nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- 'C:\\Python34\\nltk_data'
- 'C:\\Python34\\lib\\nltk_data'
- 'C:\\Users\\T01142\\AppData\\Roaming\\nltk_data'
Call this from Python:
nltk.download('averaged_perceptron_tagger')
I had the same problem on a Flask server. nltk used a different path when running under the server config, so I resorted to adding nltk.data.path.append("/home/yourusername/whateverpath/") inside the server code, right before the pos_tag call.
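A minimal sketch of that workaround (the path is a placeholder for wherever nltk_data actually lives):
import nltk

# placeholder path from the answer; point it at the directory holding nltk_data
nltk.data.path.append("/home/yourusername/whateverpath/")
words = nltk.word_tokenize("Alan Shearer is the first player to score over a hundred Premier League goals.")
print(nltk.pos_tag(words))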
Note that there is some replication of this question:
How to config nltk data directory from code?
nltk doesn't add $NLTK_DATA to search path?
POS tagging with NLTK. Can't locate averaged_perceptron_tagger
To resolve this error, run the following commands at the Python prompt:
import nltk
nltk.download('averaged_perceptron_tagger')

simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 3 (char 2)

I am trying to send an HTTP request to a URL and get the response using the requests library. Following is the code that I have used:
>>> import requests
>>> r = requests.get("http://www.youtube.com/results?bad+blood")
>>> r.status_code
200
When I then try to parse the response as JSON, I get the following error.
>>> r.json()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/requests/models.py", line 808, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Library/Python/2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/Library/Python/2.7/site-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/Library/Python/2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 3 (char 2)
Can someone tell me what's wrong with the code?
PS: I am using python 2.7.10
The response isn't JSON, it is 'text/html; charset=utf-8'. If you want to parse it, use something like BeautifulSoup.
>>> import requests, bs4
>>> rsp = requests.get('http://www.youtube.com/results?bad+blood')
>>> rsp.headers['Content-Type']
'text/html; charset=utf-8'
>>> soup = bs4.BeautifulSoup(rsp.content, 'html.parser')
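From there you can query the parsed tree; for example (generic BeautifulSoup calls, not specific to YouTube's markup):
>>> title = soup.title.string                            # text of the page's <title> tag
>>> links = [a.get('href') for a in soup.find_all('a')]  # every link target on the page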
I'd recommend using the YouTube Search API instead. Log in to Google Developers Console, set up an API key following the API Key Setup instructions, then you can make the request using the YouTube Search API:
>>> from urllib import parse
>>> import requests
>>> query = parse.urlencode({'q': 'bad blood',
... 'part': 'snippet',
... 'key': 'OKdE7HRNPP_CzHiuuv8FqkaJhPI2MlO8Nns9vuM'})
>>> url = parse.urlunsplit(('https', 'www.googleapis.com',
... '/youtube/v3/search', query, None))
>>> rsp = requests.get(url, headers={'Accept': 'application/json'})
>>> rsp.raise_for_status()
>>> response = rsp.json()
>>> response.keys()
dict_keys(['pageInfo', 'nextPageToken', 'regionCode', 'etag', 'items', 'kind'])
Note that the example is using Python 3. If you want to use Python 2, then you will have to import urlencode from urllib and urlunsplit from urlparse.
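For reference, the Python 2 equivalents of those imports would be:
>>> from urllib import urlencode
>>> from urlparse import urlunsplit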
That URL returns HTML, not JSON, so there's no point calling .json() on the response.
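More generally, a small sketch that checks the Content-Type header before attempting to parse (plain requests calls; the printed value matches the header shown above):
>>> import requests
>>> rsp = requests.get('http://www.youtube.com/results?bad+blood')
>>> if 'application/json' in rsp.headers.get('Content-Type', ''):
...     data = rsp.json()
... else:
...     print('Not JSON: ' + rsp.headers.get('Content-Type', ''))
...
Not JSON: text/html; charset=utf-8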