I'm using Python Tornado to build a simple web server. Here is the Tornado code:
import json
import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
from tornado.options import define, options

define("port", default=80, help="run on the given port", type=int)

class IndexHandler(tornado.web.RequestHandler):
    def get(self, param):
        print("\n\nthis is a get request from indexhandler:")
        if param:
            print("param is NOT null")
            self.render(r"frontend/" + param)
        else:
            print("param is null")
            self.render(r"frontend/index.html")

if __name__ == "__main__":
    tornado.options.parse_command_line()
    app = tornado.web.Application(handlers=[(r"/(.*)", IndexHandler)])
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.instance().start()
All of the frontend code is in the directory /frontend, so I used a simple regex (.*) to let users access every resource in /frontend, such as JS and CSS files.
However, when I try to visit my website, I get some 304 errors at the server:
[I 170501 14:31:59 web:2063] 200 GET /html/country.html
[I 170501 14:31:59 web:2063] 304 GET /css/bootstrap.min.css
[I 170501 14:31:59 web:2063] 304 GET /css/reset.css
[I 170501 14:31:59 web:2063] 304 GET /css/icon/iconfont.css
[I 170501 14:31:59 web:2063] 304 GET /css/country.css
[I 170501 14:31:59 web:2063] 304 GET /css/common.css
UPDATE
I have another issue: error 500.
In short, I get some 304s and some 500s. All of the 500s look like this:
[E 170501 22:53:19 web:1590] Uncaught exception GET /images/main-img1.jpg (X.X.X.X)
HTTPServerRequest(protocol='http', host='X.X.X.X', method='GET', uri='/images/main-img1.jpg', version='HTTP/1.1', remote_ip='X.X.X.X', headers={'Accept-Encoding': 'gzip, deflate, sdch', 'Accept-Language': 'fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4', 'Accept': 'image/webp,image/*,*/*;q=0.8', 'Host': 'X.X.X.X', 'Referer': 'http://X.X.X.X/html/country.html', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'})
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 1509, in _execute
result = method(*self.path_args, **self.path_kwargs)
File "tmp.py", line 20, in get
self.render("frontend/" + param)
File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 724, in render
html = self.render_string(template_name, **kwargs)
File "/usr/local/lib/python3.5/site-packages/tornado/web.py", line 862, in render_string
t = loader.load(template_name)
File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 427, in load
self.templates[name] = self._create_template(name)
File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 455, in _create_template
template = Template(f.read(), name=name, loader=self)
File "/usr/local/lib/python3.5/site-packages/tornado/template.py", line 304, in __init__
reader = _TemplateReader(name, escape.native_str(template_string),
File "/usr/local/lib/python3.5/site-packages/tornado/escape.py", line 218, in to_unicode
return value.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
[E 170501 22:53:19 web:2063] 500 GET /images/main-img1.jpg (X.X.X.X) 2.07ms
HTTP 304 is "Not Modified". It's not an error, it's an optimization. Tornado (like most web servers) tells your browser each static file's last-modified-date and a checksum of its contents (its "ETag" in HTTP terms). When the browser requests the file again, the browser tells Tornado which last-modified date and ETag it has in the browser's cached copy; Tornado compares those to its own and, if they haven't changed, just tells the browser "304 Not Modified". Thus, the browser knows it can use its cached copy and doesn't have to re-download the original.
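The exchange described above can be sketched in a few lines. This is a simplified model of the idea, not Tornado's actual implementation: the function names make_etag and respond are made up for illustration, SHA-1 is an arbitrary choice of checksum, and the last-modified comparison is left out.

```python
import hashlib
from typing import Optional, Tuple


def make_etag(content: bytes) -> str:
    """Derive a quoted ETag from the file contents (SHA-1 is an arbitrary choice here)."""
    return '"' + hashlib.sha1(content).hexdigest() + '"'


def respond(content: bytes, if_none_match: Optional[str] = None) -> Tuple[int, bytes, str]:
    """Return (status, body, etag) the way a caching-aware server would."""
    etag = make_etag(content)
    if if_none_match == etag:
        # The browser's cached copy is still current: send no body.
        return 304, b"", etag
    return 200, content, etag


css = b"body { margin: 0; }"
first_status, first_body, etag = respond(css)       # first visit: full download
second_status, second_body, _ = respond(css, etag)  # revisit: browser sends its ETag
print(first_status, second_status)  # prints: 200 304
```

As soon as the file contents change, the ETag changes with them and the server goes back to answering 200 with the new body.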
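The HTTP 500 is the actual problem. The file being loaded as a template contains bytes that are not valid UTF-8; the "position 0" in the error message shows that the very first byte (0xff) is already invalid. Here the "template" is /images/main-img1.jpg: self.render() reads every file as a UTF-8 text template, and a JPEG is binary data that starts with the byte 0xff. Images and other binary assets should be served as static files (for example with tornado.web.StaticFileHandler) rather than passed through render().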
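The decoding failure can be reproduced without Tornado. The sketch below assumes the requested file is an ordinary JPEG, whose data begins with the marker bytes 0xFF 0xD8; the four-byte jpeg_header value stands in for the real file contents.

```python
# First bytes of a typical JPEG file (start-of-image marker, then an APP0 marker).
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

error = None
try:
    # This is what a text-template loader effectively does with the file contents.
    jpeg_header.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# The failure is reported at offset 0, matching "position 0" in the traceback.
print(error.start)
```

Any binary file routed through a text template loader fails in the same way, which is why the CSS and HTML requests succeed while the image requests return 500.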
Related
I was unable to use the requests library's get() function to scrape data from this specific website: running the code block below results in a status code of 403 (unsuccessful).
import requests

# Use headers in order to emulate a browser
headers = {'user-agent': 'Chrome/55.0.2883.87'}
url = "https://www.rumah.com/properti-dijual"

# Make a request to the website and retrieve the data
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful", response.status_code)
    # Print the page source
    print(response.text)
else:
    print("Request was not successful", response.status_code)
However, when I used the same code to scrape a different website, the request was successful (status code 200).
import requests

# Use headers in order to emulate a browser
headers = {'user-agent': 'Chrome/55.0.2883.87'}
url = "https://www.subscene.com"

# Make a request to the website and retrieve the data
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful", response.status_code)
    # Print the page source
    print(response.text)
else:
    print("Request was not successful", response.status_code)
I'm trying to scrape housing data from the website, which first requires a successful request to it. I understand that some websites prevent scraping and list the affected pages in their robots.txt file. However, I can't find the page that I want to scrape in the robots.txt file, so I thought that I should be able to scrape this website.
Here is the robots.txt file for the website:
[image: screenshot of the robots.txt file]
This is my first question on Stack Overflow. Any help would be appreciated!
Your URL https://www.rumah.com/properti-dijual is behind Cloudflare protection, and so is https://www.subscene.com.
However, https://www.rumah.com probably has the stricter policy, since it is the one that returns 403.
If you are getting a 403 error, send the full set of browser headers, as follows:
import requests

headers = {
    'authority': 'www.rumah.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-ch-ua': '"Not_A Brand";v="99", "Microsoft Edge";v="109", "Chromium";v="109"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36 Edg/109.0.1518.70',
}
response = requests.get('https://www.rumah.com/properti-dijual', headers=headers)
If that doesn't work, try using JavaScript:
fetch("https://www.rumah.com/properti-dijual", {
  "headers": {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-language": "de,de-DE;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6,fr;q=0.5,de-CH;q=0.4",
    "cache-control": "no-cache",
    "pragma": "no-cache",
    "sec-ch-ua": "\"Not_A Brand\";v=\"99\", \"Microsoft Edge\";v=\"109\", \"Chromium\";v=\"109\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1"
  },
  "body": null,
  "method": "GET",
  "mode": "cors",
});
You can also run JavaScript from Python using Selenium or Selenium-Profiles (undetected, uses Chrome).
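On the robots.txt point from the question: robots.txt is an advisory file that well-behaved crawlers consult, and it is not what produces a 403; that response comes from the server or its protection layer. Checking a URL against robots.txt rules can still be done with the standard library. A minimal sketch, using made-up rules and an example.com URL rather than the site's real robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real robots.txt of any site.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch() answers: may this user agent crawl this URL under the parsed rules?
allowed = parser.can_fetch("*", "https://www.example.com/properti-dijual")
blocked = parser.can_fetch("*", "https://www.example.com/private/page")
print(allowed, blocked)
```

A page that passes this check can still be refused by the server, which is the situation described in the question.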
I'm trying to connect Hue 4.8 to Impala 3.0.0-cdh6.0.0. From the editor I can see all databases, but if I try to refresh the tables or execute a query, I receive this error:
[23/Jan/2021 12:46:46 -0800] thrift_util WARNING Unable to unpack the secret and guid in Thrift Handle: unpack requires a buffer of 16 bytes
[23/Jan/2021 12:46:46 -0800] api ERROR Autocomplete data fetching error
Traceback (most recent call last):
  File "/usr/share/hue/apps/beeswax/src/beeswax/api.py", line 111, in _autocomplete
    response['functions'] = _get_functions(db, database)
  File "/usr/share/hue/apps/beeswax/src/beeswax/api.py", line 183, in _get_functions
    functions = db.get_functions(prefix=database)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/dbms.py", line 1167, in get_functions
    handle = self.execute_and_wait(query, timeout_sec=5.0)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/dbms.py", line 973, in execute_and_wait
    handle = self.client.query(query)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 1420, in query
    return self._client.execute_async_query(query, statement, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 988, in execute_async_query
    return self.execute_async_statement(statement=query_statement, conf_overlay=configuration, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 1021, in execute_async_statement
    (res, session) = self.call_return_result_and_session(thrift_function, thrift_request, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 753, in call_return_result_and_session
    return self._call_return_result_and_session(fn, req, status=status, session=session)
  File "/usr/share/hue/apps/beeswax/src/beeswax/server/hive_server2_lib.py", line 790, in _call_return_result_and_session
    raise QueryServerException(Exception(message), message=message)
beeswax.server.dbms.QueryServerException: Malformed THandleIdentifier (guid size: 17, expected 16, secret size: 17, expected 16)
2021-01-23T20:46:46.959706000Z
[23/Jan/2021 12:46:46 -0800] access INFO 172.18.0.1 marco - "POST /notebook/api/autocomplete/ HTTP/1.1" returned in 73ms 200 129
[23/Jan/2021 12:46:46 -0800] access INFO 172.18.0.1 marco - "POST /notebook/api/autocomplete/ HTTP/1.1" returned in 73ms 200 129
172.18.0.1 - - [23/Jan/2021:20:46:46 +0000] "POST /notebook/api/autocomplete/ HTTP/1.1" 200 129 "http://localhost:8888/hue/editor/?type=impala" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
This is my Hue configuration:
[impala]
# Host of the Impala Server (one of the Impalad)
server_host=impala
# Port of the Impala Server
server_port=21050
coordinator_url=http://impala:25000
impersonation_enabled=false
#use_thrift_http=true
Can someone help me to resolve this issue?
Thanks
I am sending a POST request with a JSON body to a server but cannot extract the JSON when it arrives. I have done exhaustive searches, but to no avail. I have provided both the client and server scripts to illustrate what is happening.
All I need is to extract the JSON portion at the end of the received string so I can analyze the request and return the appropriate data.
I'm sure it's simple, but I can't seem to find the answer. Any direction would be appreciated.
***
CLIENT: script to test Server
import json
import requests

def info_send():
    url = 'http://1234abcd.ngrok.io'
    payload = {
        'command': '["command", "status", "off", None]',
        'userID': 'userID string',
        'status': 'current status',
    }
    # Send the payload as a JSON-encoded request body
    requests.post(url, data=json.dumps(payload))

info_send()
***
SERVER: receives json POST request
import socket

HOST, PORT = '', 5000
listen_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listen_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listen_socket.bind((HOST, PORT))
listen_socket.listen(1)
print('Listening on port %s' % PORT)
while True:
    client_connection, client_address = listen_socket.accept()
    # Read up to 1024 bytes of the raw HTTP request and decode it as text
    request = client_connection.recv(1024).decode('utf-8')
    print(request)
***
This is what is printed at the server
POST / HTTP/1.1
Host: 1234abcd.ngrok.io
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Content-Length: 112
X-Forwarded-For: 112.162.214.265
{"command": "[\"command\", \"status\", \"off\", None]", "userID": "userID string", "deviceID": "current status"}
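One way to get at the body is to rely on the HTTP message layout: headers and body are separated by a blank line, so splitting the received text at the first "\r\n\r\n" leaves the JSON, which json.loads can parse. The sketch below assumes the whole request arrived in a single recv() call (a larger request may need several reads, guided by Content-Length). The function name extract_json_body and the sample request are made up for illustration.

```python
import json


def extract_json_body(raw_request: str):
    """Return the parsed JSON body of a raw HTTP request string."""
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, separator, body = raw_request.partition("\r\n\r\n")
    if not separator:
        raise ValueError("no blank line found; the request may be incomplete")
    return json.loads(body)


# Hypothetical sample shaped like the request printed by the server.
sample = (
    "POST / HTTP/1.1\r\n"
    "Host: 1234abcd.ngrok.io\r\n"
    "Content-Length: 44\r\n"
    "\r\n"
    '{"userID": "userID string", "status": "off"}'
)
payload = extract_json_body(sample)
print(payload["userID"])
```

In the server loop, the same function could be called on the decoded request string in place of print(request).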
I'm trying to navigate between 2 HTML pages through Tornado. Following is the code for the routes and their respective handlers:
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        log.info("Rendering index.html")
        self.render("index.html")

class NotificationsPageHandler(tornado.web.RequestHandler):
    def get(self):
        log.info("Rendering notifications")
        self.render("notifications.html")

def start_server():
    settings = {
        "static_path": os.path.join(os.path.dirname(__file__), "static")
    }
    application = tornado.web.Application([
        (r"/", MainHandler),
        (r"/notifications.html", NotificationsPageHandler),
    ], **settings)
    application.listen(8989)
    tornado.ioloop.IOLoop.current().start()
When I load 127.0.0.1:8989 in the browser, I get the index.html page, but when I try to navigate to notifications.html through an anchor tag in index.html, I get the following stack trace:
2016-07-06 12:07:06,546 - tornado.application - ERROR - Uncaught exception GET /notifications.html (127.0.0.1)
HTTPServerRequest(protocol='http', host='127.0.0.1:8989', method='GET', uri='/notifications.html', version='HTTP/1.1', remote_ip='127.0.0.1', headers={'Accept-Language': 'en-US,en;q=0.8', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Host': '127.0.0.1:8989', 'Upgrade-Insecure-Requests': '1', 'Accept-Encoding': 'gzip, deflate, sdch', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36', 'Referer': 'http://127.0.0.1:8989/', 'Connection': 'keep-alive'})
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 1443, in _execute
result = method(*self.path_args, **self.path_kwargs)
File "BADWebServer.py", line 231, in get
self.render("notifications.html")
File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 699, in render
html = self.render_string(template_name, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tornado/web.py", line 806, in render_string
return t.generate(**namespace)
File "/usr/local/lib/python3.5/dist-packages/tornado/template.py", line 345, in generate
return execute()
File "notifications_html.generated.py", line 5, in _tt_execute
_tt_tmp = item.score # notifications.html:37
NameError: name 'item' is not defined
2016-07-06 12:07:06,548 - tornado.access - ERROR - 500 GET /notifications.html (127.0.0.1) 4.51ms
I have seen a similar post, "how to navigate from one html to other in tornado using anchor tag", but I'm not sure why I'm getting the exception.
You're getting the error because, as the trace says, "name 'item' is not defined". Your notifications.html template contains some markup like:
{{ item.score }}
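... but you haven't passed an "item" variable in. Template variables are supplied as keyword arguments to render(), for example self.render("notifications.html", item=item), where item is whatever object your template expects. See the template syntax guide for an example.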
I'm trying to stream an audio/wav file using the HTML5 audio element in the following way:
<audio type="audio/wav" src="/sound/10/audio/full" sound-id="10" version="full" controls="">
</audio>
This works quite well in Chrome, except that it's impossible to replay the audio or change the currentTime attribute:
var audioElement = $('audio').get(0)
audioElement.currentTime
> 1.2479820251464844
audioElement.currentTime = 0
audioElement.currentTime
> 1.2479820251464844
I'm serving the audio file from a Grails controller, using following code:
def audio() {
    File file = soundService.getAudio(...)
    response.setContentType('audio/wav')
    response.setContentLength(file.bytes.length)
    response.setHeader("Content-disposition", "attachment;filename=${file.getName()}")
    response.outputStream << file.newInputStream() // Performing a binary stream copy
    response.status = 206
    return false
}
It seems, though, that Grails is returning an HTTP 200 response instead of 206 (Partial Content), as you can see from the following output from Chrome:
Request URL:http://localhost:8080/sound/10/audio/full
Request Method:GET
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:identity;q=1, *;q=0
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:JSESSIONID=9395D8FFF34B7455F937190F521AA1BC
Host:localhost:8080
Range:bytes=0-3189
Referer:http://localhost:8080/cms/sound/10/edit
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
Response Headers
Content-disposition:attachment;filename=full.wav
Content-Length:3190
Content-Type:audio/wav
Date:Wed, 19 Dec 2012 13:58:44 GMT
Server:Apache-Coyote/1.1
Any idea what might be wrong?
Thanks, Amit.
ADDITION:
Changing the controller logic to:
response.status = 206
response.setContentType(version.mime)
response.setContentLength (file.bytes.length)
response.setHeader("Content-disposition", "attachment;filename=${file.getName()}")
response.outputStream << file.newInputStream() // Performing a binary stream copy
This did help with returning HTTP 206 (Partial Content); however, neither Chrome nor Firefox will play the audio file now (again, Chrome did play it when it received the file with a 200).
The following was captured for that request:
Request URL:http://localhost:8080/sound/10/audio/full
Request Headers
Accept-Encoding:identity;q=1, *;q=0
Range:bytes=0-
Referer:http://localhost:8080/cms/sound/10/edit
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11
You might want to try:
response.reset();
response.setStatus(206);
response.setHeader("Accept-Ranges", "bytes");
response.setHeader("Content-length", Integer.toString(length + 1));
response.setHeader("Content-range", "bytes " + start.toString() + "-" + end.toString() + "/" + Long.toString(f.size()));
response.setContentType(...);
This kind of response should only be sent if the client specifically asked for a range. You can check by using:
String range = request.getHeader("range");
If range is not null, you'll have to parse it for the requested start and end bytes. Note that a range can be open-ended, such as "0-". In some cases you'll also see "0-1", a request that probes whether your service knows how to handle range requests.
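The parsing step can be sketched as follows. It is written in Python rather than Groovy to keep the logic compact, and it is only a sketch: the helper names parse_byte_range and partial_headers are made up, multi-range requests are ignored, and an unusable header is treated the same as no header (a stricter server might answer 416 instead).

```python
from typing import Dict, Optional, Tuple


def parse_byte_range(header: Optional[str], file_size: int) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets, or None if no usable single range was sent."""
    if not header or not header.startswith("bytes="):
        return None
    spec = header[len("bytes="):]
    if "," in spec:  # several ranges in one header: out of scope for this sketch
        return None
    first, dash, last = spec.partition("-")
    if not dash:
        return None
    try:
        if first == "":  # suffix form "-N": the last N bytes of the file
            length = int(last)
            if length <= 0:
                return None
            start, end = max(file_size - length, 0), file_size - 1
        else:
            start = int(first)
            end = int(last) if last else file_size - 1  # "0-" means "to the end"
    except ValueError:
        return None
    end = min(end, file_size - 1)
    if start < 0 or start > end:
        return None
    return start, end


def partial_headers(start: int, end: int, file_size: int) -> Dict[str, str]:
    """Headers that accompany a 206 response for the given inclusive range."""
    return {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{file_size}",
        "Content-Length": str(end - start + 1),
    }


print(parse_byte_range("bytes=0-", 3190))
print(partial_headers(0, 1, 3190))
```

When parse_byte_range returns None, the server would send the whole file with a plain 200; otherwise it sends status 206, the headers above, and only the bytes from start to end inclusive.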