Using Scrapy: getting CrawlSpider to work with an authenticated (logged-in) user session

Hello, how can I get my CrawlSpider to work? I am able to log in, but after that nothing happens and nothing gets scraped. I have also been reading the Scrapy docs and I really don't understand the rules to use for scraping. Why does nothing happen after "Successfully logged in. Let's start crawling!"?
I also had this rule at the end of my else statement, but removed it because it wasn't even being called from inside the else block. I then moved it to the top of the start_requests() method but got errors, so I removed my rules:
rules = (
    Rule(extractor, callback='parse_item', follow=True),
)
My code:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from linkedconv.items import LinkedconvItem
class LinkedPySpider(CrawlSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    # start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]
    start_urls = ["http://www.linkedin.com/csearch/results"]

    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # def init_request(self):
    #     """This function is called before crawling starts."""
    #     return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': 'myemail@gmail.com', 'session_password': 'mypassword'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            self.log('Hi, this is an item page! %s' % response.url)
            return
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id="result-set"]/li')
        items = []
        for site in sites:
            item = LinkedconvItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
My output:
C:\Users\ye831c\Documents\Big Data\Scrapy\linkedconv>scrapy crawl LinkedPy
2013-07-12 13:39:40-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedconv)
2013-07-12 13:39:40-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Enabled item pipelines:
2013-07-12 13:39:41-0500 [LinkedPy] INFO: Spider opened
2013-07-12 13:39:41-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-12 13:39:41-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-12 13:39:41-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2013-07-12 13:39:42-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www.linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit>
2013-07-12 13:39:45-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedin.com/nhome/> (referer: https://www.linkedin.com/uas/login)
2013-07-12 13:39:45-0500 [LinkedPy] DEBUG:
Successfully logged in. Let's start crawling!
2013-07-12 13:39:45-0500 [LinkedPy] DEBUG: Hi, this is an item page! http://www.linkedin.com/nhome/
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Closing spider (finished)
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1670,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 65218,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 12, 18, 39, 45, 136000),
'log_count/DEBUG': 11,
'log_count/INFO': 4,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2013, 7, 12, 18, 39, 41, 50000)}
2013-07-12 13:39:45-0500 [LinkedPy] INFO: Spider closed (finished)

Right now, the crawling ends in check_login_response() because Scrapy has not been told to do anything more.
1st request to the login page using start_requests(): OK
2nd request to POST the login information: OK
whose response is parsed with check_login_response()... and that's it.
Indeed, check_login_response() returns nothing. To keep the crawling going, you need to return Request instances that tell Scrapy which pages to fetch next (see the Scrapy documentation on spiders' callbacks).
So, inside check_login_response(), you need to return a Request instance for the starting page containing the links you want to crawl next, probably one of the URLs you defined in start_urls.
def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in."""
    if "Sign Out" in response.body:
        self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
        # Now the crawling can begin..
        return Request(url='http://linkedin.com/page/containing/links')
By default, if you do not set a callback for your Request, the spider calls its parse() method (http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.BaseSpider.parse).
In your case, it will call CrawlSpider's built-in parse() method for you automatically, which applies the Rules you have defined to get next pages.
You must define your CrawlSpider rules within a rules attribute of your spider class, just as you did for name, allowed_domains, etc., at the same level.
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider-example provides example Rules. The main idea is that you tell the extractor what kind of absolute URL you are interested in within the page, using regular expression(s) in allow. If you do not set allow in your SgmlLinkExtractor, it will match all links.
And each Rule should have a callback to use for these links, in your case parse_item().
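For instance, a minimal rules definition might look like the sketch below. The allow pattern here is an illustrative placeholder, not taken from your page; adjust it to the absolute URLs you actually want to follow:
    rules = (
        # Illustrative: follow company-search result pages and hand each one to parse_item()
        Rule(SgmlLinkExtractor(allow=(r'/csearch/results',)),
             callback='parse_item', follow=True),
    )
With this in place, CrawlSpider's built-in parse() will apply the rule to every page it fetches after login.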
Good luck with parsing LinkedIn pages; I think a lot of what's in those pages is generated via JavaScript and may not be inside the HTML content fetched by Scrapy.

Related

Telegraf json_v2 parser error: Unable to convert field to type int: strconv.ParseInt: parsing "15.4": invalid syntax

Good day!
I have built a small IoT device that monitors the conditions inside a specific enclosure using an ESP32 and a couple of sensors. I want to monitor that data by publishing it to the ThingSpeak cloud, then writing it to InfluxDB with Telegraf and finally using the InfluxDB data source in Grafana to visualize it.
So far I have made everything work flawlessly, but with one small exception.
Which is: One of the plugins in my telegraf config fails with the error:
parsing metrics failed: Unable to convert field 'temperature' to type int: strconv.ParseInt: parsing "15.4": invalid syntax
The plugins are [[inputs.http]] and [[inputs.http.json_v2]], and what I am doing with them is authenticating against my ThingSpeak API and parsing the JSON output of my fields. In my /etc/telegraf/telegraf.conf, under [[inputs.http.json_v2.field]], I have added type = "int", as otherwise Telegraf writes my metrics as strings in InfluxDB and the only way to visualize them is with either a table or a single stat, because the rest of the Flux queries fail with the error unsupported input type for mean aggregate: string. However, when I change to type = "float" in the config file I get a different error:
unprocessable entity: failure writing points to database: partial write: field type conflict: input field "temperature" on measurement "sensorData" is type float, already exists as type string dropped=1
I suspect that I have misconfigured the parser plugin; however, after hours of debugging I couldn't come up with a solution.
Some information that might be of use:
Telegraf version: Telegraf 1.24.2
Influxdb version: InfluxDB v2.4.0
Please see below for my telegraf.conf as well as the error messages.
Any help would be highly appreciated! (:
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 1000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb_v2]]
  urls = ["http://localhost:8086"]
  token = "XXXXXXXX"
  organization = "XXXXXXXXX"
  bucket = "sensor"

[[inputs.http]]
  urls = [
    "https://api.thingspeak.com/channels/XXXXX/feeds.json?api_key=XXXXXXXXXX&results=2"
  ]
  name_override = "sensorData"
  tagexclude = ["url", "host"]
  data_format = "json_v2"
  ## HTTP method
  method = "GET"
  [[inputs.http.json_v2]]
    [[inputs.http.json_v2.field]]
      path = "feeds.1.field1"
      rename = "temperature"
      type = "int"    # Error message 1
      #type = "float" # Error message 2
Error when type = "float":
me@myserver:/etc/telegraf$ telegraf -config telegraf.conf --debug
2022-10-16T00:31:43Z I! Starting Telegraf 1.24.2
2022-10-16T00:31:43Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
2022-10-16T00:31:43Z I! Loaded inputs: http
2022-10-16T00:31:43Z I! Loaded aggregators:
2022-10-16T00:31:43Z I! Loaded processors:
2022-10-16T00:31:43Z I! Loaded outputs: influxdb_v2
2022-10-16T00:31:43Z I! Tags enabled: host=myserver
2022-10-16T00:31:43Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"myserver", Flush Interval:10s
2022-10-16T00:31:43Z D! [agent] Initializing plugins
2022-10-16T00:31:43Z D! [agent] Connecting outputs
2022-10-16T00:31:43Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-10-16T00:31:43Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-10-16T00:31:43Z D! [agent] Starting service inputs
2022-10-16T00:31:53Z E! [outputs.influxdb_v2] Failed to write metric to sensor (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: field type conflict: input field "temperature" on measurement "sensorData" is type float, already exists as type string dropped=1
2022-10-16T00:31:53Z D! [outputs.influxdb_v2] Wrote batch of 1 metrics in 8.9558ms
2022-10-16T00:31:53Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
Error when type = "int":
me@myserver:/etc/telegraf$ telegraf -config telegraf.conf --debug
2022-10-16T00:37:05Z I! Starting Telegraf 1.24.2
2022-10-16T00:37:05Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
2022-10-16T00:37:05Z I! Loaded inputs: http
2022-10-16T00:37:05Z I! Loaded aggregators:
2022-10-16T00:37:05Z I! Loaded processors:
2022-10-16T00:37:05Z I! Loaded outputs: influxdb_v2
2022-10-16T00:37:05Z I! Tags enabled: host=myserver
2022-10-16T00:37:05Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"myserver", Flush Interval:10s
2022-10-16T00:37:05Z D! [agent] Initializing plugins
2022-10-16T00:37:05Z D! [agent] Connecting outputs
2022-10-16T00:37:05Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2022-10-16T00:37:05Z D! [agent] Successfully connected to outputs.influxdb_v2
2022-10-16T00:37:05Z D! [agent] Starting service inputs
2022-10-16T00:37:10Z E! [inputs.http] Error in plugin: [url=https://api.thingspeak.com/channels/XXXXXX/feeds.json?api_key=XXXXXXX&results=2]: parsing metrics failed: Unable to convert field 'temperature' to type int: strconv.ParseInt: parsing "15.3": invalid syntax
Fixed it by leaving type = "float" under [[inputs.http.json_v2.field]] in telegraf.conf and creating a NEW bucket with a new API key in InfluxDB.
The issue was that the bucket sensor that I had previously defined in my telegraf.conf already had the field temperature created in my InfluxDB database from previous tries, with its type stored as string, and an existing field type cannot be overwritten with the new float type.
As soon as I deleted all pre-existing buckets, everything started working as expected.
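For reference, a minimal sketch of the working parser section, assuming the same field layout as the config above (the surrounding [[inputs.http]] block is unchanged):
    [[inputs.http.json_v2]]
      [[inputs.http.json_v2.field]]
        path = "feeds.1.field1"
        rename = "temperature"
        type = "float"   # written into a fresh bucket, so no type conflict
Deleting the old bucket (or pointing the output at a brand-new one) is what clears the conflicting string schema; the float type then sticks on the first write.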

I am trying to call a DAG (written in Python) using a Cloud Function (Python 3.7) and getting the error "405 Method Not Found". Could someone help here?

I have used the below code, available on the Google Cloud docs platform:
from google.auth.transport.requests import Request
from google.oauth2 import id_token
import requests

IAM_SCOPE = 'https://www.googleapis.com/auth/iam'
OAUTH_TOKEN_URI = 'https://www.googleapis.com/oauth2/v4/token'

def trigger_dag(data, context=None):
    """Makes a POST request to the Composer DAG Trigger API

    When called via Google Cloud Functions (GCF),
    data and context are Background function parameters.
    For more info, refer to
    https://cloud.google.com/functions/docs/writing/background#functions_background_parameters-python

    To call this function from a Python script, omit the ``context`` argument
    and pass in a non-null value for the ``data`` argument.
    """
    # Fill in with your Composer info here
    # Navigate to your webserver's login page and get this from the URL
    # Or use the script found at
    # https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/composer/rest/get_client_id.py
    client_id = '87431184677-jitlhi9o0u9sin3uvdebqrvqokl538aj.apps.googleusercontent.com'
    # This should be part of your webserver's URL:
    # {tenant-project-id}.appspot.com
    webserver_id = 'b368a47a354ddf2f6p-tp'
    # The name of the DAG you wish to trigger
    dag_name = 'composer_sample_trigger_response_dag'
    webserver_url = (
        #'https://'
        webserver_id
        + '.appspot.com/admin/airflow/tree?dag_id='
        + dag_name
        #+ '/dag_runs'
    )
    # Make a POST request to IAP which then Triggers the DAG
    make_iap_request(
        webserver_url, client_id, method='POST',
        json={"conf": data, "replace_microseconds": 'false'})

# This code is copied from
# https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/iap/make_iap_request.py
# START COPIED IAP CODE
def make_iap_request(url, client_id, method='GET', **kwargs):
    """Makes a request to an application protected by Identity-Aware Proxy.

    Args:
      url: The Identity-Aware Proxy-protected URL to fetch.
      client_id: The client ID used by Identity-Aware Proxy.
      method: The request method to use
              ('GET', 'OPTIONS', 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE')
      **kwargs: Any of the parameters defined for the request function:
                https://github.com/requests/requests/blob/master/requests/api.py
                If no timeout is provided, it is set to 90 by default.

    Returns:
      The page body, or raises an exception if the page couldn't be retrieved.
    """
    # Set the default timeout, if missing
    if 'timeout' not in kwargs:
        kwargs['timeout'] = 90
    # Obtain an OpenID Connect (OIDC) token from metadata server or using service
    # account.
    google_open_id_connect_token = id_token.fetch_id_token(Request(), client_id)
    # Fetch the Identity-Aware Proxy-protected URL, including an
    # Authorization header containing "Bearer " followed by a
    # Google-issued OpenID Connect token for the service account.
    resp = requests.request(
        method, url,
        headers={'Authorization': 'Bearer {}'.format(
            google_open_id_connect_token)}, **kwargs)
    if resp.status_code == 403:
        raise Exception('Service account does not have permission to '
                        'access the IAP-protected application.')
    elif resp.status_code != 200:
        raise Exception(
            'Bad response from application: {!r} / {!r} / {!r}'.format(
                resp.status_code, resp.headers, resp.text))
    else:
        return resp.text
# END COPIED IAP CODE
You encounter the "405 method not found" error because you are sending your request to your_webserver_id.appspot.com/admin/airflow/tree?dag_id=composer_sample_trigger_response_dag, which is the URL of the "Tree View" page in the Airflow webserver, not an API endpoint.
To properly send requests to the Airflow API you need to construct the webserver_url just like in the documentation in Trigger DAGs in Cloud Functions, where it is built against the endpoint that triggers a new DAG run. So if you'd like to trigger the DAG you can stick with the code below.
Airflow run DAG endpoint:
POST https://airflow.apache.org/api/v1/dags/{dag_id}/dagRuns
webserver_url = (
    'https://'
    + webserver_id
    + '.appspot.com/api/experimental/dags/'
    + dag_name
    + '/dag_runs'
)
Moving forward, if you would like to perform different operations using the Airflow API, you can check the Airflow REST API reference.
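For clarity, here is how the corrected construction fits into the trigger_dag() function from the question; everything except webserver_url is unchanged (client_id, webserver_id, dag_name and make_iap_request() are all as defined there):
    # Experimental Airflow API endpoint that actually triggers a DAG run
    webserver_url = (
        'https://'
        + webserver_id
        + '.appspot.com/api/experimental/dags/'
        + dag_name
        + '/dag_runs'
    )
    # POST through IAP exactly as before
    make_iap_request(
        webserver_url, client_id, method='POST',
        json={"conf": data, "replace_microseconds": 'false'})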

What is wrong in the following Lambda code that throws up a module error?

I am using the following code to make an API that connects to Amazon AWS. This is the Amazon Lambda code that I use:
import boto3
import json
import requests
from requests_aws4auth import AWS4Auth

region = 'us-east-1'
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region,
                   service, session_token=credentials.token)

host = 'XXX.com'
index = 'items'
url = 'https://' + host + '/' + index + '/_search'

# Lambda execution starts here
def handler(event, context):
    # Put the user query into the query DSL for more accurate search results.
    # Note that certain fields are boosted (^).
    query = {
        "query": {
            "multi_match": {
                "query": event['queryStringParameters']['q'],
            }
        }
    }
    # ES 6.x requires an explicit Content-Type header
    headers = { "Content-Type": "application/json" }
    # Make the signed HTTP request
    r = requests.get(url, auth=awsauth, headers=headers,
                     data=json.dumps(query))
    # Create the response and add some extra content to support CORS
    response = {
        "statusCode": 200,
        "headers": {
            "Access-Control-Allow-Origin": '*'
        },
        "isBase64Encoded": False
    }
    # Add the search results to the response
    response['body'] = r.text
    return response
This should connect to an AWS ES cluster with endpoint XXX.com. This is the output I get when trying to test:
START RequestId: f640016e-e4d6-469f-b74d-838b9402968b Version: $LATEST
Unable to import module 'index': Error
    at Function.Module._resolveFilename (module.js:547:15)
    at Function.Module._load (module.js:474:25)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
END RequestId: f640016e-e4d6-469f-b74d-838b9402968b
REPORT RequestId: f640016e-e4d6-469f-b74d-838b9402968b Duration: 44.49 ms  Billed Duration: 100 ms  Memory Size: 128 MB  Max Memory Used: 58 MB
When creating a Lambda function, we need to specify a handler, which is a function in your code that the AWS Lambda service invokes when the Lambda function is executed.
By default, a Python Lambda function is created with the handler lambda_function.lambda_handler, which signifies that the service must invoke the lambda_handler function contained inside the lambda_function module.
From the error you're receiving, it seems that the handler is configured as something like index.<something>, and since there is no module called index in your deployment package, Lambda is unable to import it in order to start execution.
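As an illustration of that naming contract (the file and function names here are the Python defaults, not taken from your deployment package):
    # lambda_function.py -- the module half of the handler setting
    # With the handler configured as "lambda_function.lambda_handler",
    # Lambda imports this module and calls this function on each invocation.
    def lambda_handler(event, context):
        return {"statusCode": 200, "body": "ok"}
Renaming either the file or the function without updating the handler setting reproduces exactly this kind of "Unable to import module" error.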
If I am getting things correctly, to connect to an AWS ES cluster you need something of this sort:
import logging
import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
#from aws_requests_auth.aws_auth import AWSRequestsAuth

LOGGER = logging.getLogger()
ES_HOST = {'host': 'search-testelasticsearch-xxxxxxxxxx.eu-west-2.es.amazonaws.com', 'port': 443}

def lambda_handler(event, context):
    LOGGER.info('started')
    dump2 = {
        'number': 9
    }
    service = 'es'
    credentials = boto3.Session().get_credentials()
    print('-------------------------------------------')
    print(credentials.access_key)
    print(credentials.secret_key)
    print('--------------------------------------------------------')
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, "eu-west-2", service,
                       session_token=credentials.token)
    es = Elasticsearch(hosts=[ES_HOST], http_auth=awsauth, use_ssl=True, verify_certs=True,
                       connection_class=RequestsHttpConnection)
    DAVID_INDEX = 'test_index'
    response = es.index(index=DAVID_INDEX, doc_type='is_this_important?', body=dump2, id='4')

500 Internal Server Error from third-party API

Python 3.6 - Scrapy 1.5
I'm scraping the John Deere warranty webpage to watch all new PMPs and their expiration dates. Looking at the network communication between the browser and the webpage, I found a REST API that feeds data into the page.
Now I'm trying to get JSON data from the API rather than scraping the JavaScript-generated page content. However, I'm getting an Internal Server Error and I don't know why.
I'm using Scrapy to log in and fetch the data.
import scrapy

class PmpSpider(scrapy.Spider):
    name = 'pmp'
    start_urls = ['https://jdwarrantysystem.deere.com/portal/']

    def parse(self, response):
        self.log('***Form Request***')
        login = {
            'USERNAME': *******,
            'PASSWORD': *******
        }
        yield scrapy.FormRequest.from_response(
            response,
            url='https://registration.deere.com/servlet/com.deere.u90950.registrationlogin.view.servlets.SignInServlet',
            method='POST', formdata=login, callback=self.parse_pmp
        )
        self.log('***PARSE LOGIN***')

    def parse_pmp(self, response):
        self.log('***PARSE PMP***')
        cookies = response.headers.getlist('Set-Cookie')
        for cookie in cookies:
            cookie = cookie.decode('utf-8')
            self.log(cookie)
            cook = cookie.split(';')[0].split('=')[1]
            path = cookie.split(';')[1].split('=')[1]
            domain = cookie.split(';')[2].split('=')[1]
            yield scrapy.Request(
                url='https://jdwarrantysystem.deere.com/api/pip-products/collection',
                method='POST',
                cookies={
                    'SESSION': cook,
                    'path': path,
                    'domain': domain
                },
                headers={
                    "Accept": "application/json",
                    "accounts": ["201445","201264","201167","201342","201341","201221"],
                    "excludedPin": "",
                    "export": "",
                    "language": "",
                    "metric": "Y",
                    "pipFilter": "OPEN",
                    "pipType": ["MALF","SAFT"]
                },
                meta={'dont_redirect': True},
                callback=self.parse_pmp_list
            )

    def parse_pmp_list(self, response):
        self.log('***LISTA PMP***')
        self.log(response.body)
Why am I getting this error? How can I get data from this API?
2018-07-05 17:26:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 1 times): 500 Internal Server Error
2018-07-05 17:26:20 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 2 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (failed 3 times): 500 Internal Server Error
2018-07-05 17:26:21 [scrapy.core.engine] DEBUG: Crawled (500) <POST https://jdwarrantysystem.deere.com/api/pip-products/collection> (referer: https://jdwarrantysystem.deere.com/portal/)
2018-07-05 17:26:21 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://jdwarrantysystem.deere.com/api/pip-products/collection>: HTTP status code is not handled or not allowed
I found the problem: this is a POST request that must carry its data as a JSON body, because unlike a GET request, the parameters don't go in the URI. The request also needs a "content-type": "application/json" header. See: How parameters are sent in POST request and Rest POST in Python. So, editing the function parse_pmp (with import json added at the top of the file):
def parse_pmp(self, response):
    self.log('***PARSE PMP***')
    cookies = response.headers.getlist('Set-Cookie')
    for cookie in cookies:
        cookie = cookie.decode('utf-8')
        self.log(cookie)
        cook = cookie.split(';')[0].split('=')[1]
        path = cookie.split(';')[1].split('=')[1]
        domain = cookie.split(';')[2].split('=')[1]
        data = json.dumps({"accounts": ["201445","201264","201167","201342","201341","201221"], "excludedPin": "", "export": "", "language": "", "metric": "Y", "pipFilter": "OPEN", "pipType": ["MALF","SAFT"]})  # <---- parameters move into the body
        yield scrapy.Request(
            url='https://jdwarrantysystem.deere.com/api/pip-products/collection',
            method='POST',
            cookies={
                'SESSION': cook,
                'path': path,
                'domain': domain
            },
            headers={
                "Accept": "application/json",
                "content-type": "application/json"  # <----
            },
            body=data,  # <----
            meta={'dont_redirect': True},
            callback=self.parse_pmp_list
        )
Everything works fine!

scrapy - how to get text from 'div'

I just started getting to know Scrapy. Now I am trying to crawl by following the tutorials, but I have difficulty crawling text from a div.
This is items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    name = Field()
    title = Field()
    pass
This is dmoz_spider.py:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["roxie.com"]
    start_urls = ["http://www.roxie.com/events/details.cfm?eventID=4921702B-9E3D-8678-50D614177545A594"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="eventdescription"]')
        items = []
        for site in sites:
            item = DmozItem()
            item['name'] = hxs.select("text()").extract()
            items.append(item)
        return items
And now I am trying to crawl from the top folder with this command:
    scrapy crawl dmoz -o scraped_data.json -t json
But the file is created containing only '['.
It works perfectly in the shell (running each select by hand), but somehow it doesn't work as a script. I am really a beginner with Scrapy. Could you guys let me know how I can get the data in the 'div'? Thanks in advance.
* In addition, this is what I get:
2013-06-19 08:43:56-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: dmoz)
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/web/microdom.py:181: SyntaxWarning: assertion is always true, perhaps remove parentheses?
assert (oldChild.parentNode is self,
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Enabled item pipelines:
2013-06-19 08:43:56-0700 [dmoz] INFO: Spider opened
2013-06-19 08:43:56-0700 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-06-19 08:43:56-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-06-19 08:43:56-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.roxie.com/events/details.cfm?eventID=4921702B-9E3D-8678-50D614177545A594> (referer: None)
2013-06-19 08:43:56-0700 [dmoz] INFO: Closing spider (finished)
2013-06-19 08:43:56-0700 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 27451,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 6, 19, 15, 43, 56, 569164),
'log_count/DEBUG': 7,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 6, 19, 15, 43, 56, 417661)}
2013-06-19 08:43:56-0700 [dmoz] INFO: Spider closed (finished)
I can get you the title of the film, but I'm somewhat crappy with XPath so the description XPath will get you everything within the <div class="tabbertab" title="Synopsis"> element. It's not ideal, but it's a starting point. Getting the image URL is left as an exercise for the OP. :)
from scrapy.item import Field, Item
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozItem(Item):
    title = Field()
    description = Field()

class DmozSpider(BaseSpider):
    name = "test"
    allowed_domains = ["roxie.com"]
    start_urls = [
        "http://www.roxie.com/events/details.cfm?eventID=4921702B-9E3D-8678-50D614177545A594"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = DmozItem()
        item["title"] = hxs.select('//div[@style="width: 100%;"]/text()').extract()
        item["description"] = hxs.select('//div[@class="tabbertab"]').extract()
        return item
Just replace
    item['name'] = hxs.select("text()").extract()
with
    item['name'] = site.select("text()").extract()
so that the text is extracted relative to each matched div (site) rather than from hxs, which points at the whole document. Hope that helps.