I'm trying to navigate between two HTML pages served by Tornado. Below is the code for the routes and their respective handlers:
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        log.info("Rendering index.html")
        self.render("index.html")

class NotificationsPageHandler(tornado.web.RequestHandler):
    def get(self):
        log.info("Rendering notifications")
        self.render("notifications.html")

def start_server():
    settings = {
        "static_path": os.path.join(os.path.dirname(__file__), "static")
    }
    application = tornado.web.Application([
        (r"/", MainHandler),
        (r"/notifications.html", NotificationsPageHandler),
    ], **settings)
    application.listen(8989)
    tornado.ioloop.IOLoop.current().start()
When I load 127.0.0.1:8989 in the browser, I get the index.html page, but when I try to navigate to notifications.html through an anchor tag in index.html, I get the following stack trace:
2016-07-06 12:07:06,546 - tornado.application - ERROR - Uncaught exception GET /notifications.html (127.0.0.1)
HTTPServerRequest(protocol='http', host='127.0.0.1:8989', method='GET', uri='/notifications.html', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Accept-Language': 'en-US,en;q=0.8', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Host': '127.0.0.1:8989', 'Upgrade-Insecure-Requests': '1', 'Accept-Encoding': 'gzip, deflate, sdch', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36', 'Referer': 'http://127.0.0.1:8989/', 'Connection': 'keep-alive'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 1443, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "BADWebServer.py", line 231, in get
    self.render("notifications.html")
  File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 699, in render
    html = self.render_string(template_name, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 806, in render_string
    return t.generate(**namespace)
  File "/usr/local/lib/python3.5/dist-packages/tornado/template.py", line 345, in generate
    return execute()
  File "notifications_html.generated.py", line 5, in _tt_execute
    _tt_tmp = item.score  # notifications.html:37
NameError: name 'item' is not defined
2016-07-06 12:07:06,548 - tornado.access - ERROR - 500 GET /notifications.html (127.0.0.1) 4.51ms
I have seen a similar post, "how to navigate from one html to other in tornado using anchor tag", but I'm not sure why I'm getting the exception.
You're getting the error because, as the traceback says, "name 'item' is not defined". Your notifications.html template contains some markup like:
{{ item.score }}
... but you haven't passed an "item" variable in. See the template syntax guide for an example.
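For instance, here is a minimal sketch of the fix (the namedtuple and the score value 42 are illustrative stand-ins for your real data; only the names item and score are known from the traceback):

import collections

# Stand-in for whatever object the template dereferences as item.score.
Notification = collections.namedtuple("Notification", ["score"])

class NotificationsPageHandler(tornado.web.RequestHandler):
    def get(self):
        # Every name the template references must be supplied to render()
        # as a keyword argument.
        self.render("notifications.html", item=Notification(score=42))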
I was unable to scrape data from this specific website with the requests library's get() function: running the code block below results in a status code of 403 (unsuccessful).
import requests

# using headers in order to emulate a browser
headers = {'user-agent': 'Chrome/55.0.2883.87'}
url = "https://www.rumah.com/properti-dijual"

# Make a request to the website and retrieve the data
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful", response.status_code)
    # Print the page source
    print(response.text)
else:
    print("Request was not successful", response.status_code)
However, when I used the same source code to scrape a different website, the request was successful (status code 200).
import requests

# using headers in order to emulate a browser
headers = {'user-agent': 'Chrome/55.0.2883.87'}
url = "https://www.subscene.com"

# Make a request to the website and retrieve the data
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful", response.status_code)
    # Print the page source
    print(response.text)
else:
    print("Request was not successful", response.status_code)
I'm trying to scrape housing data from the website, which requires a successful request first. I realize that some websites prevent scraping and list the disallowed pages in their robots.txt file. However, I can't find the specific page I want to scrape in the robots.txt file, so I thought I should be able to scrape this website.
Here is the robots.txt file for the specific webpage: (screenshot omitted)
This is my first question on Stack Overflow. Any help would be appreciated!
Your URL https://www.rumah.com/properti-dijual is behind Cloudflare protection, and so is https://www.subscene.com. But perhaps https://www.rumah.com enforces a stricter policy.
In case you're getting a 403 error, provide the full set of browser headers, as follows:
import requests

headers = {
    'authority': 'www.rumah.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not_A Brand";v="99", "Microsoft Edge";v="109", "Chromium";v="109"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70',
}

response = requests.get('https://www.rumah.com/properti-dijual', headers=headers)
If that doesn't work, try issuing the request from JavaScript:
fetch("https://www.rumah.com/properti-dijual", {
"headers": {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"accept-language": "de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4",
"cache-control": "no-cache",
"pragma": "no-cache",
"sec-ch-ua": "\"Not_A Brand\";v=\"99\", \"Microsoft Edge\";v=\"109\", \"Chromium\";v=\"109\"",
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": "\"Windows\"",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1"
},
"body": null,
"method": "GET",
"mode": "cors",
});
You can also drive a real browser from Python using Selenium or Selenium-Profiles (undetected, uses Chrome).
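For example, a minimal Selenium sketch (assuming Selenium 4.6+, which fetches a matching chromedriver by itself; some anti-bot systems block headless mode, so the flag may need to be dropped):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # optional; remove if the site blocks headless

# A real Chrome session passes the TLS and JavaScript checks that a bare
# requests.get() fails.
driver = webdriver.Chrome(options=options)
driver.get("https://www.rumah.com/properti-dijual")
html = driver.page_source  # the rendered HTML, after any JS challenge runs
driver.quit()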
I'm trying to connect Hue 4.8 with Impala 3.0.0-cdh6.0.0. From the editor I can see all the databases, but if I try to refresh the tables or execute a query, I receive this error:
[23/Jan/2021 12:46:46 -0800] thrift_util WARNING Unable to unpack the secret and guid in Thrift Handle: unpack requires a buffer of 16 bytes
[23/Jan/2021 12:46:46 -0800] api ERROR Autocomplete data fetching error
Traceback (most recent call last):
  File "/usr/share/hue/apps/beeswax/src/beeswax/api.py", line 111, in _autocomplete
    response['functions'] = _get_functions(db, database)
  File "/usr/share/hue/apps/beeswax/src/beeswax/api.py", line 183, in _get_functions
    functions = db.get_functions(prefix=database)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/dbms.py", line 1167, in get_functions
    handle = self.execute_and_wait(query, timeout_sec=5.0)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/dbms.py", line 973, in execute_and_wait
    handle = self.client.query(query)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 1420, in query
    return self._client.execute_async_query(query, statement, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 988, in execute_async_query
    return self.execute_async_statement(statement=query_statement, conf_overlay=configuration, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 1021, in execute_async_statement
    (res, session) = self.call_return_result_and_session(thrift_function, thrift_request, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 753, in call_return_result_and_session
    return self._call_return_result_and_session(fn, req, status=status, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 790, in _call_return_result_and_session
    raise QueryServerException(Exception(message), message=message)
beeswax.server.dbms.QueryServerException: Malformed THandleIdentifier (guid size: 17, expected 16, secret size: 17, expected 16)
2021-01-23T20:46:46.959706000Z
[23/Jan/2021 12:46:46 -0800] access INFO 172.18.0.1 marco - "POST /notebook/api/autocomplete/ HTTP/1.1" returned in 73ms 200 129
[23/Jan/2021 12:46:46 -0800] access INFO 172.18.0.1 marco - "POST /notebook/api/autocomplete/ HTTP/1.1" returned in 73ms 200 129
172.18.0.1 - - [23/Jan/2021:20:46:46 +0000] "POST /notebook/api/autocomplete/ HTTP/1.1" 200 129 "http://localhost:8888/hue/editor/?type=impala" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
This is my Hue configuration:
[impala]
# Host of the Impala Server (one of the Impalad)
server_host=impala
# Port of the Impala Server
server_port=21050
coordinator_url=http://impala:25000
impersonation_enabled=false
#use_thrift_http=true
Can someone help me to resolve this issue?
Thanks
I am making a Python module that interacts with Carousell using the requests module. Now I am trying to send a POST request with a JSON payload, but I keep getting HTTP error code 422 (UNPROCESSABLE ENTITY). I don't know what's wrong with my JSON payload or the Python dict (before it's converted to JSON), or perhaps I am missing something in my request headers.
I tried taking the raw JSON string (from the POST request that I captured using Chrome dev tools), converting it to a dict, copying that dict (printed out), and using it in the program. It didn't work.
import json

import requests

# `cookies` and `query_url` are defined elsewhere in the module.
login_session = requests.session()
login_session.headers.update({"DNT": "1", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36", "Origin": "https://sg.carousell.com"})

login_payload = {'requests': {'g0': {'resource': 'sso', 'operation': 'create', 'params': {'loginToken': cookies["login-token"]}, 'body': {}}}, 'context': {'_csrf': cookies["_csrf"]}}
login_cookies = {"__cfduid": cookies["__cfduid"], "_csrf": cookies["_csrf"], "gtkprId": cookies["gtkprId"], "login-token": cookies["login-token"], "redirect": "redirect"}
login_headers = {'accept': '*/*', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-GB,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,en-US;q=0.6', 'x-requested-with': 'XMLHttpRequest', 'content-type': 'application/json'}

login_data = login_session.post(query_url, cookies=login_cookies, data=json.dumps(login_payload), headers=login_headers)
Here's the output from the debugging logger:
DEBUG:urllib3.connectionpool:https://sg.carousell.com:443 "POST /ui/iso?_csrf=TNZTMZpBdQYgRFFouCF4ELVB HTTP/1.1" 422 0
Edit:
Here's the JSON payload that was sent to the server; I am trying to replicate it.
{"requests":{"g0":{"resource":"sso","operation":"create","params":{"loginToken":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpYXQiOjE1NTg1MDQyMjQsImlzcyI6ImxvZ2luLmNhcm91c2VsbC5jb20iLCJzc29pZCI6IkRacG1rd1l1SXAxdDF5U3A2M1RXWExPUTJnWmRFRzBOSHd3d0ZGSm9PSkFvVFFOdGFyNWt0MDMzNm5EVHRudHoiLCJ1c2VyaWQiOiIxNDczMjI3NCJ9.x7YxdLLk1ID6_jWy4trtLzbrPnZZ0eI7g_cQN1BilF8"},"body":{}}},"context":{"_csrf":"hPPhgajp-1GMLSbgjZBNBD7z2EGPVGCuA_mU"}}
Note that login token and _csrf are data from the cookies.
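For reference, the same POST can be written with requests' json= parameter, which serializes the dict and sets the Content-Type header in one step (all variables as defined above; a sketch, not a confirmed fix):

# json= serializes login_payload and sets Content-Type: application/json,
# so the body and the header cannot drift out of sync. An explicit
# content-type in login_headers still takes precedence.
login_data = login_session.post(query_url, cookies=login_cookies,
                                json=login_payload, headers=login_headers)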
I want to use scrapy to crawl a JSON API. Here is my code:
import json

import scrapy

class zhangjiaweiSpider(scrapy.Spider):
    name = "zhangjiawei"
    start_urls = [
        "https://zhuanlan.zhihu.com/api/columns/zhangjiawei/posts?limit=20&offset="
    ]

    def start_requests(self):
        for i in range(1):
            url = self.start_urls[0] + str(i * 20)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        jsonbody = json.loads(response.body.decode('utf-8', 'ignore'))
        print(jsonbody)
But when I run it, I get errors:
Traceback (most recent call last):
  File "d:\soft\python\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\code\python\spider\zhihuzhuanlan\zhihuzhuanlan\spiders\zhangjiaweispider.py", line 24, in parse
    jsonbody = json.loads(response.body.decode('utf-8','ignore'))
  File "d:\soft\python\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "d:\soft\python\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "d:\soft\python\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I tried to print response.body.decode(), but got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 0: invalid start byte
If I print response.body.decode('utf-8', 'ignore'), the output is garbled.
I think this error is caused by the decoding of response.body, but I don't know how to resolve the problem.
My settings.py:
BOT_NAME = 'zhihuzhuanlan'

SPIDER_MODULES = ['zhihuzhuanlan.spiders']
NEWSPIDER_MODULE = 'zhihuzhuanlan.spiders'

DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6',
    'USER-AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}

ITEM_PIPELINES = {
    'zhihuzhuanlan.pipelines.ArticleDataBasePipeline': 5,
}

FEED_EXPORT_ENCODING = 'utf-8'

# linux: pip install MySQL-python
DATABASE = {
    'drivername': 'mysql',
    'host': '192.168.203.95',
    'port': '3306',
    'username': 'root',
    'password': 'Password',
    'database': 'spider',
    'query': {'charset': 'utf8'},
}

ROBOTSTXT_OBEY = True
I'm using Python Tornado to build a simple web server. Here is the Tornado code:
import json

import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
from tornado.options import define, options

define("port", default=80, help="run on the given port", type=int)

class IndexHandler(tornado.web.RequestHandler):
    def get(self, param):
        print("\n\nthis is a get request from indexhandler:")
        if param:
            print("param is NOT null")
            self.render(r"frontend/" + param)
        else:
            print("param is null")
            self.render(r"frontend/index.html")

if __name__ == "__main__":
    tornado.options.parse_command_line()
    app = tornado.web.Application(handlers=[(r"/(.*)", IndexHandler)])
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.instance().start()
All of the frontend code is in the directory /frontend, so I used a simple regex (.*) to let users access all of the resources in /frontend, such as JS and CSS files.
However, when I try to visit my website, I get some 304 errors on the server:
[I 170501 14:31:59 web:2063] 200 GET /html/country.html
[I 170501 14:31:59 web:2063] 304 GET /css/bootstrap.min.css
[I 170501 14:31:59 web:2063] 304 GET /css/reset.css
[I 170501 14:31:59 web:2063] 304 GET /css/icon/iconfont.css
[I 170501 14:31:59 web:2063] 304 GET /css/country.css
[I 170501 14:31:59 web:2063] 304 GET /css/common.css
UPDATE
I have another issue: error 500.
In short, I get some 304s and some 500s. All of the 500s look like this:
[E 170501 22:53:19 web:1590] Uncaught exception GET /images/main-img1.jpg (X.X.X.X)
HTTPServerRequest(protocol='http', host='X.X.X.X', method='GET', uri='/images/main-img1.jpg', version='HTTP/1.1', remote_ip='X.X.X.X', headers={'Accept-Encoding': 'gzip, deflate, sdch', 'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4', 'Accept': 'image/webp,image/*,*/*;q=0.8', 'Host': 'X.X.X.X', 'Referer': 'http://X.X.X.X/html/country.html', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 1509, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "tmp.py", line 20, in get
    self.render("frontend/" + param)
  File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 724, in render
    html = self.render_string(template_name, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 862, in render_string
    t = loader.load(template_name)
  File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 427, in load
    self.templates[name] = self._create_template(name)
  File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 455, in _create_template
    template = Template(f.read(), name=name, loader=self)
  File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 304, in __init__
    reader = _TemplateReader(name, escape.native_str(template_string),
  File "/usr/local/lib/python3.5/site-packages/tornado/escape.py", line 218, in to_unicode
    return value.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
[E 170501 22:53:19 web:2063] 500 GET /images/main-img1.jpg (X.X.X.X) 2.07ms
HTTP 304 is "Not Modified". It's not an error, it's an optimization. Tornado (like most web servers) tells your browser each static file's last-modified-date and a checksum of its contents (its "ETag" in HTTP terms). When the browser requests the file again, the browser tells Tornado which last-modified date and ETag it has in the browser's cached copy; Tornado compares those to its own and, if they haven't changed, just tells the browser "304 Not Modified". Thus, the browser knows it can use its cached copy and doesn't have to re-download the original.
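You can watch this exchange happen with a couple of requests calls against any file your server serves (the URL below is a placeholder; Tornado attaches an ETag to its 200 responses by default):

import requests

url = "http://127.0.0.1/css/common.css"  # placeholder: any served file

first = requests.get(url)
etag = first.headers.get("ETag")

# Replaying the request with If-None-Match tells the server which version
# the "browser" already has; an unchanged file comes back as a bodyless 304.
second = requests.get(url, headers={"If-None-Match": etag})
print(second.status_code, len(second.content))  # e.g. 304 0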
The HTTP 500 is the actual problem. Your catch-all route passes every path, including /images/main-img1.jpg, to self.render(), which tries to read the file as a UTF-8 template. A JPEG is binary data, not UTF-8; the "position 0" in the error message corresponds to the 0xff byte that starts every JPEG file, hence the decode failure.
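One common way to avoid this is to serve assets through tornado.web.StaticFileHandler and reserve render() for real templates. A sketch under the question's /frontend layout (the route prefixes and port are assumptions based on the logs above):

import os

import tornado.ioloop
import tornado.web

FRONTEND = os.path.join(os.path.dirname(__file__), "frontend")

class IndexHandler(tornado.web.RequestHandler):
    def get(self):
        self.render("index.html")

if __name__ == "__main__":
    app = tornado.web.Application(
        [
            # StaticFileHandler streams files verbatim and handles the
            # ETag/304 caching itself, so binary assets such as JPEGs
            # never pass through the UTF-8 template engine.
            (r"/((?:css|js|images|html)/.*)", tornado.web.StaticFileHandler,
             {"path": FRONTEND}),
            (r"/", IndexHandler),
        ],
        template_path=FRONTEND,
    )
    app.listen(80)
    tornado.ioloop.IOLoop.current().start()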