I developed this Scrapy crawler with a loop to scrape 10 pages from one site.
The loop works, and the log shows the correct list of URLs:
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=8&ajax=true>
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=9&ajax=true>
But the result is always the same: every request returns the content of page 1.
I tested the URLs in the shell and in the browser, and they work correctly there. The problem occurs only with the Scrapy crawler.
I tried with start_urls and with the url method, and the problem is always the same.
Any ideas?
import scrapy
import json
import urllib
import time
import datetime
import re
from re import sub
from decimal import Decimal
#from prod.items import ProdItem
from staging.items import StagingItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d')

class QuotesSpider(scrapy.Spider):
    name = "lazada2"

    def start_requests(self):
        for i in range(1, 10):
            urls = 'https://www.lazada.vn/trang-diem/?page=%s&ajax=true' % i
            yield scrapy.Request(url=urls, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        next_page = data['mainInfo']['page']
        for product in data['mods']['listItems']:
            item = StagingItem()
            item['collector_sku'] = product['name']
            if 'originalPrice' in product:
                item['collector_price_promo'] = product['originalPrice']
            else:
                item['collector_price_promo'] = ''
            item['collector_retailer'] = 'Lazada'
            item['collector_url'] = product['productUrl']
            item['collector_photo_url'] = product['image']
            item['collector_brand'] = product['brandName']
            item['collector_quantity'] = 'NA'
            item['collector_category'] = 'Makeup'
            item['collector_price'] = product['price']
            item['collector_timestamp'] = timestamp
            item['collector_local_id'] = ''
            item['collector_location_id'] = ''
            item['collector_location_name'] = ''
            item['collector_vendor_id'] = ''
            item['collector_vendor_name'] = ''
            yield item
The same request with cookies and headers:
headers = {
    "content-type": "application/json",
    "authority": "www.lazada.vn",
    "scheme": "https",
    "Accept-Language": "en-SG,en;q=0.9,en-US;q=0.8,zh-CN;q=0.7,zh;q=0.6,vi;q=0.5,fr;q=0.4",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Accept": "*/*",
    "Path": "/trang-diem/?page=%s" % i,
    "Referer": "https://www.lazada.vn/trang-diem/?page=%s&ajax=true" % i,
    "accept-encoding": "gzip, deflate, br"
}
cookies = {
    "cookie": "_uab_collina=153864259681792402093714; _bl_uid=qpj7jm4CuXhcUk26er9n7hnhyRqd; t_fv=1538642596635; t_uid=mbei2vPUviVx0oPB6KjX1uVgASJvw7dA; lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; lzd_sid=1d8bf18519bb7fd8fb661ac558726c4d; _tb_token_=58e7f715a30eb; cna=O5A8FGGivzcCAXNPwzeoH+5y; hng=VN|vi|VND|704; userLanguageML=vi; cto_lwid=c9ad6486-acac-465f-ab05-6e0b3744d1dc; _ga=GA1.2.1435138343.1538642600; _gid=GA1.2.19901051.1538642600; cto_axid=zGni0uxNaRyv441RxQNq7EZ_LS8xiGmL; JSESSIONID=85306FF3F7612F91677FC6ED978B42E1; isg=BJ6eL8eUSXz4CZ0YqjCefDlu7zTqVCYsGgm5Z0gmm-DyaztFsOyk6OZNZi9CoFrx"
}
body ="?ajax=true&page=%s" % i
urls = "https://www.lazada.vn/trang-diem/?ajax=true&page=%s" % i
yield scrapy.Request(url=urls, body=body, cookies=cookies, headers=headers, callback=self.parse)
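One way to confirm what the server actually returns, sketched under assumptions: the parse callback already reads data['mainInfo']['page'], so that value can be compared with the page number in the requested URL. The helper name served_page_matches is hypothetical, the response body below is fabricated for the demonstration, and it is an assumption that mainInfo.page reflects the page that was really served.

```python
import json
from urllib.parse import urlparse, parse_qs


def served_page_matches(request_url: str, body: str) -> bool:
    """Compare the page requested in the URL with the page reported in the JSON body."""
    requested = parse_qs(urlparse(request_url).query)["page"][0]
    served = json.loads(body)["mainInfo"]["page"]
    return str(served) == requested


# Fabricated response body, only to exercise the helper.
fake_body = json.dumps({"mainInfo": {"page": 1}, "mods": {"listItems": []}})

print(served_page_matches("https://www.lazada.vn/trang-diem/?page=1&ajax=true", fake_body))
print(served_page_matches("https://www.lazada.vn/trang-diem/?page=8&ajax=true", fake_body))
```

Inside parse, a call such as served_page_matches(response.url, response.text) could be logged for each response. A mismatch would suggest the server is ignoring the page parameter for these requests rather than the loop being wrong.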
I am trying to place an order, but I get this error:
{"code":"400005","msg":"Invalid KC-API-SIGN"}
I would be very thankful if someone could check my code and tell me what the problem is.
import requests
import time
import base64
import hashlib
import hmac
import json
import uuid
api_key = 'XXXXXXXXXXXXXXXXXXXXXXX'
api_secret = 'XXXXXXXXXXXXXXXXXXXXXX'
api_passphrase = 'XXXXXXXXXXXXXXX'
future_base_url = "https://api-futures.kucoin.com"
clientOid = uuid.uuid4().hex
params = {
    "clientOid": str(clientOid),
    "side": str(side),
    "symbol": str(symbol),
    "type": "limit",
    "leverage": "5",
    "stop": "down",
    "stopPriceType": "TP",
    "price": str(price),
    "size": int(size),
    "stopPrice": str(stopprice)
}
json_params = json.dumps(params)
print(json_params)
now = int(time.time() * 1000)
str_to_sign = str(now) + 'POST' + '/api/v1/orders' + json_params
signature = base64.b64encode(hmac.new(api_secret.encode('utf-8'), str_to_sign.encode('utf-8'), hashlib.sha256).digest())
passphrase = base64.b64encode(hmac.new(api_secret.encode('utf-8'), api_passphrase.encode('utf-8'), hashlib.sha256).digest())
headers = {
    "KC-API-SIGN": signature,
    "KC-API-TIMESTAMP": str(now),
    "KC-API-KEY": api_key,
    "KC-API-PASSPHRASE": passphrase,
    "KC-API-KEY-VERSION": "2",
    "Content-Type": "application/json"
}
response = requests.request('POST', future_base_url + '/api/v1/orders', params=params, headers=headers)
print(response.text)
This worked for me:
tickerK = "AVAXUSDTM"
# Timestamp in milliseconds; defined first because clientOid uses it
now = int(time.time() * 1000)
clientOid = tickerK + '_' + str(now)
side = "buy"
typee = "market"
leverage = "2"
stop = "up"
stopPriceType = "TP"
stopPrice = "12"
size = "3"
# Set the request body
data = {
    "clientOid":clientOid,
    "side":side,
    "symbol":tickerK,
    "type":typee,
    "leverage":leverage,
    "stop":stop,
    "stopPriceType":stopPriceType,
    "stopPrice":stopPrice,
    "size":size
}
# Serialize without spaces so the signed string matches the body that is sent
data_json = json.dumps(data, separators=(',', ':'))
url = 'https://api-futures.kucoin.com/api/v1/orders'
str_to_sign = str(now) + 'POST' + '/api/v1/orders' + data_json
signature = base64.b64encode(
    hmac.new(api_secret.encode('utf-8'), str_to_sign.encode('utf-8'), hashlib.sha256).digest())
passphrase = base64.b64encode(hmac.new(api_secret.encode('utf-8'), api_passphrase.encode('utf-8'), hashlib.sha256).digest())
headers = {
    "KC-API-SIGN": signature,
    "KC-API-TIMESTAMP": str(now),
    "KC-API-KEY": api_key,
    "KC-API-PASSPHRASE": passphrase,
    "KC-API-KEY-VERSION": "2",
    "Content-Type": "application/json"
}
# Send the POST request
response = requests.request('post', url, headers=headers, data=data_json)
# Print the response
print(response.json())
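Please pay attention to the following points:
Remove the spaces from the JSON (separators=(',', ':')).
Append the JSON to the string to sign.
Add the content type to the headers.
Send the request with that same JSON string as the body (data=data_json, not params=params).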
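The four points above can be gathered into one helper, as a minimal sketch that builds the headers and body without sending anything. The function names sign and build_request are hypothetical, and the key, secret, passphrase and timestamp are placeholder values chosen only so the example runs offline.

```python
import base64
import hashlib
import hmac
import json


def sign(secret: str, message: str) -> str:
    """Return the base64-encoded HMAC-SHA256 of message, keyed with secret."""
    digest = hmac.new(secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def build_request(api_key, api_secret, api_passphrase, endpoint, payload, now_ms):
    """Return (headers, body) for a signed POST; no network call is made."""
    # Compact JSON: the exact string that is signed must also be the body sent.
    body = json.dumps(payload, separators=(",", ":"))
    str_to_sign = str(now_ms) + "POST" + endpoint + body
    headers = {
        "KC-API-SIGN": sign(api_secret, str_to_sign),
        "KC-API-TIMESTAMP": str(now_ms),
        "KC-API-KEY": api_key,
        "KC-API-PASSPHRASE": sign(api_secret, api_passphrase),
        "KC-API-KEY-VERSION": "2",
        "Content-Type": "application/json",
    }
    return headers, body


# Placeholder credentials and a fixed timestamp, for illustration only.
headers, body = build_request(
    "key", "secret", "pass", "/api/v1/orders",
    {"side": "buy", "size": "3"}, 1700000000000,
)
print(body)
```

The returned body would then be passed as data=body to requests, so the bytes on the wire are the same string that went into the signature.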
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import ast
start_time = time.time()
s = requests.Session()
#Get URL and extract content
page=1
traits = []
accessories, backgrounds, shoes = [], [], []
while page != 100:
    params = {
        ('arg', f"Qmer3VzaeFhb7c5uiwuHJbRuVCaUu72DcnSoUKb1EvnB2x/{page}"),
    }
    content = s.get('https://ipfs.infura.io:5001/api/v0/cat', params=params, auth=('', ''))
    soup = BeautifulSoup(content.text, 'html.parser')
    page = page + 1
    traits = ast.literal_eval(soup.text)['attributes']
    df = pd.DataFrame(traits)
    df1 = df[df['trait_type']=='ACCESSORIES']
    accessories.append(df1['value'].values[0])
When I run the above code I get the following error:
IndexError: index 0 is out of bounds for axis 0 with size 0
This happens because not every item has an "ACCESSORIES" trait data point. So how would I go about adding/filling in an ACCESSORIES trait for those items that don't have one with an empty, nan, or 0 value?
The following code solves this issue:
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import ast
start_time = time.time()
s = requests.Session()
#Get URL and extract content
page=1
traits = []
accessories, backgrounds, shoes = [], [], []
while page != 100:
    params = {
        ('arg', f"Qmer3VzaeFhb7c5uiwuHJbRuVCaUu72DcnSoUKb1EvnB2x/{page}"),
    }
    content = s.get('https://ipfs.infura.io:5001/api/v0/cat', params=params, auth=('', ''))
    soup = BeautifulSoup(content.text, 'html.parser')
    page = page + 1
    traits = ast.literal_eval(soup.text)['attributes']
    df = pd.DataFrame(traits)
    df1 = df[df['trait_type']=='ACCESSORIES']
    try:
        accessories.append(df1['value'].values[0])
    except IndexError:
        # No ACCESSORIES trait for this item: store a placeholder so the list stays aligned
        accessories.append('NONE')
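An alternative that avoids the exception altogether, shown as a sketch: look up the trait directly in the list of attribute dicts and fall back to a default when it is absent. The helper name trait_value and the sample items are made up for illustration; they only mirror the trait_type/value shape used above.

```python
def trait_value(attributes, trait_type, default=None):
    """Return the value of the first attribute with the given trait_type, or default."""
    return next(
        (a["value"] for a in attributes if a.get("trait_type") == trait_type),
        default,
    )


# Made-up items: the second one has no ACCESSORIES trait.
items = [
    [{"trait_type": "ACCESSORIES", "value": "Hat"}, {"trait_type": "SHOES", "value": "Boots"}],
    [{"trait_type": "SHOES", "value": "Sneakers"}],
]

accessories = [trait_value(attrs, "ACCESSORIES") for attrs in items]
shoes = [trait_value(attrs, "SHOES") for attrs in items]
print(accessories)
print(shoes)
```

Because every item contributes exactly one entry per list, the lists keep the same length and can later be combined into one DataFrame, where None shows up as a missing value.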
I'm trying to write Python objects to a JSON file using data fetched from a site's API, but when I run the code nothing is written to the JSON file. The API works well, and when I print the data with json.load I get the expected output, but I have no idea why dump doesn't work.
Here is my code:
from django.shortcuts import render
import requests
import json
import datetime
import re

def index(request):
    now = datetime.datetime.now()
    format = "{}-{}-{}".format(now.year, now.month, now.day)
    source = []
    author = []
    title = []
    date = []
    url = "http://newsapi.org/v2/everything"
    params = {
        'q': 'bitcoin',
        'from': format,
        'sortBy': 'publishedAt',
        'apiKey': '1186d3b0ccf24e6a91ab9816de603b90'
    }
    response = requests.request("GET", url, params=params)
    for news in response.json()['articles']:
        matching = re.match("\d+-\d+-\d+", news['publishedAt'])
        if format == matching.group():
            source.append(news['source'])
            author.append(news['author'])
            title.append(news['title'])
            date.append(news['publishedAt'])
    data = \
        {
            'source': source,
            'author': author,
            'title': title,
            'date': date
        }
    with open('data.json', "a+") as fp:
        x = json.dump(data, fp, indent=4)
    return render(request, 'news/news.html', {'response': response})
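One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: "{}-{}-{}".format(now.year, now.month, now.day) does not zero-pad the month and day, so on many dates it cannot equal the date prefix matched from publishedAt, and the four lists stay empty. The sketch below uses a fixed, hypothetical date and an assumed ISO-style publishedAt value to show the difference.

```python
import datetime
import re

# Hypothetical date, chosen to have a one-digit month and day.
now = datetime.datetime(2020, 5, 3)

unpadded = "{}-{}-{}".format(now.year, now.month, now.day)
padded = now.strftime("%Y-%m-%d")

# Assumed shape of a publishedAt value (ISO 8601 with zero padding).
published_at = "2020-05-03T10:15:00Z"
matched = re.match(r"\d+-\d+-\d+", published_at).group()

print(unpadded, padded, matched)
print(unpadded == matched)
print(padded == matched)
```

If this turns out to be the cause, building the comparison string with strftime("%Y-%m-%d") would make the two sides comparable. It is also worth confirming which directory the relative path data.json resolves to when the view runs.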
I am trying to port some code from Python 2 to Python 3 and ran into a problem. "pad * chr(pad)" looks like a string, but when I print it, it shows only unprintable characters. I don't know what it really is.
<ipython-input-26-6c9679723473> in aesEncrypt(text, secKey)
43 def aesEncrypt(text, secKey):
44 pad = 16 - len(text) % 16
---> 45 text =text + pad * chr(pad)
46 encryptor = AES.new(secKey, 2, '0102030405060708')
47 ciphertext = encryptor.encrypt(text)
TypeError: can't concat str to bytes
I then tried encode(), but it didn't work. I am wondering how to concatenate two strings in Python 3.
<ipython-input-53-e9f33b00348a> in aesEncrypt(text, secKey)
43 def aesEncrypt(text, secKey):
44 pad = 16 - len(text) % 16
---> 45 text = text.encode("utf-8") + (pad * chr(pad)).encode("utf-8")
46 encryptor = AES.new(secKey, 2, '0102030405060708')
47 ciphertext = encryptor.encrypt(text)
AttributeError:'bytes' object has no attribute 'encode'
For reference, the original code is:
Author: 路人甲
Link: https://www.zhihu.com/question/31677442/answer/119959112
#encoding=utf8
import requests
from bs4 import BeautifulSoup
import re,time
import os,json
import base64
from Crypto.Cipher import AES
from pprint import pprint

Default_Header = {
    'Referer':'http://music.163.com/',
    'Host':'music.163.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate'
}
BASE_URL = 'http://music.163.com'
_session = requests.session()
_session.headers.update(Default_Header)

def getPage(pageIndex):
    pageUrl = 'http://music.163.com/discover/playlist/?order=hot&cat=全部&limit=35&offset='+pageIndex
    soup = BeautifulSoup(_session.get(pageUrl).content)
    songList = soup.findAll('a',attrs = {'class':'tit f-thide s-fc0'})
    for i in songList:
        print i['href']
        getPlayList(i['href'])

def getPlayList(playListId):
    playListUrl = BASE_URL + playListId
    soup = BeautifulSoup(_session.get(playListUrl).content)
    songList = soup.find('ul',attrs = {'class':'f-hide'})
    for i in songList.findAll('li'):
        startIndex = (i.find('a'))['href']
        songId = startIndex.split('=')[1]
        readEver(songId)

def getSongInfo(songId):
    pass

def aesEncrypt(text, secKey):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(secKey, 2, '0102030405060708')
    ciphertext = encryptor.encrypt(text)
    ciphertext = base64.b64encode(ciphertext)
    return ciphertext

def rsaEncrypt(text, pubKey, modulus):
    text = text[::-1]
    rs = int(text.encode('hex'), 16)**int(pubKey, 16) % int(modulus, 16)
    return format(rs, 'x').zfill(256)

def createSecretKey(size):
    return (''.join(map(lambda xx: (hex(ord(xx))[2:]), os.urandom(size))))[0:16]

def readEver(songId):
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_'+str(songId)+'/?csrf_token='
    headers = { 'Cookie': 'appver=1.5.0.75771;', 'Referer': 'http://music.163.com/' }
    text = { 'username': '', 'password': '', 'rememberLogin': 'true' }
    modulus = '00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7'
    nonce = '0CoJUm6Qyw8W8jud'
    pubKey = '010001'
    text = json.dumps(text)
    secKey = createSecretKey(16)
    encText = aesEncrypt(aesEncrypt(text, nonce), secKey)
    encSecKey = rsaEncrypt(secKey, pubKey, modulus)
    data = { 'params': encText, 'encSecKey': encSecKey }
    req = requests.post(url, headers=headers, data=data)
    total = req.json()['total']
    if int(total) > 10000:
        print songId,total
    else:
        pass

if __name__=='__main__':
    for i in range(1,43):
        getPage(str(i*35))
Around line 85: encText = aesEncrypt(aesEncrypt(text, nonce), secKey)
aesEncrypt() is called the first time with the data shown with the suffix '_1' and the second time with the data shown with the suffix '_2'. Notice that the type of secKey changes from a string to a list of strings, and text changes from a string to a bytes object.
>>> secKey_1
'0CoJUm6Qyw8W8jud'
>>> secKey_2
['e4', '1a', '61', '7c', '1e', '62', '76', '5', '94', '62', '5a', '92', '9', 'fd', '2f', '4a']
>>>
>>> text_1
'{"username": "", "password": "", "rememberLogin": "true"}'
>>> text_2
b'qjTTWCVgh3v45StLOhGtNtY3zzoImIeGkRry1Vq0LzNSgr9hDHkkh19ujd+iqbvXnzjmHDhOIA5H0z/Wf3uU5Q=='
>>>
Thanks to @Todd, who found the issue. aesEncrypt() is called twice, and it returns bytes while it expects str, which is acceptable in Python 2 but not in Python 3.
In the end, I changed return ciphertext to return str(ciphertext).
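A minimal Python 3 sketch of the two type-sensitive steps, padding and returning text, with the cipher call itself left out so the example runs without the Crypto package. The names pad_pkcs7 and to_text are hypothetical. One point worth noting: in Python 3, str() applied to a bytes object returns its repr (including the b'...' wrapper), so decoding is usually what is wanted when the result is fed back in as text.

```python
import base64


def pad_pkcs7(data: bytes, block_size: int = 16) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad = block_size - len(data) % block_size
    return data + bytes([pad]) * pad


def to_text(raw: bytes) -> str:
    """Base64-encode raw bytes and return a str instead of bytes."""
    return base64.b64encode(raw).decode("ascii")


# Encode the text once, then keep working with bytes until the final step.
padded = pad_pkcs7('{"username": ""}'.encode("utf-8"))
encoded = to_text(padded)

print(len(padded) % 16)
print(type(encoded).__name__)
print(str(b"abc"))
print(b"abc".decode("ascii"))
```

With this shape, an aesEncrypt that accepts str, encodes it, pads the bytes, encrypts, and returns to_text(ciphertext) could be called twice in a row, because its output type matches its input type.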
I am trying to use the sttp core library along with the sttp json4s library to receive a JSON response and convert it to a case class.
The source code on GitHub is here and the documentation for the example that I am trying to replicate is here.
The response to the GET request to the URL http://httpbin.org/get?foo=bar looks like:
{
  "args": {
    "foo": "bar"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "Connection": "close",
    "Cookie": "_gauges_unique_day=1; _gauges_unique_month=1; _gauges_unique_year=1; _gauges_unique=1; stale_after=never",
    "Forwarded": "for=49.255.235.138",
    "Host": "httpbin.org",
    "Save-Data": "on",
    "Scheme": "http",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
  },
  "origin": "49.255.235.138, 66.249.82.79",
  "url": "http://httpbin.org/get?foo=bar"
}
The case class which attempts to read the json response above looks like:
case class HttpBinResponse(
  args: Map[String, String],
  origin: String,
  headers: Map[String,String]
)
The case class has been defined at the top of the test file and is accessible to the test.
This test passes:
it should "send a GET request parse response as JSON" in {
  implicit val backend = HttpURLConnectionBackend()
  val queryParams = Map("foo" -> "bar", "bugs" -> "life")
  val endpoint:Uri = uri"http://httpbin.org/get?foo=bar"
  val request = sttp
    .get(endpoint)
    .response(asJson[HttpBinResponse])
  val response = request.send()
  // response.body is an Either
  response.code should be(200)
  val res = response.body.fold(_ => { "Error" }, a => { a })
  res shouldBe a[HttpBinResponse]
}
The code that produces the error looks like this:
it should "send a GET request parse response as JSON" in {
  implicit val backend = HttpURLConnectionBackend()
  val queryParams = Map("foo" -> "bar", "bugs" -> "life")
  val endpoint:Uri = uri"http://httpbin.org/get?foo=bar"
  val request = sttp
    .get(endpoint)
    .response(asJson[HttpBinResponse])
  val response = request.send()
  // response.body is an Either
  response.code should be(200)
  val res = response.body.fold(_ => { "Error" }, a => { a })
  res shouldBe a[HttpBinResponse]
  println(res.origin)
}
However, when I try to access res.origin, compilation fails with the error "value origin is not a member of java.io.Serializable":
28. Waiting for source changes... (press enter to interrupt)
[info] Compiling 1 Scala source to /Users/localuser/Do/scalaexercises/target/scala-2.12/test-classes...
[error] /Users/localuser/Do/scalaexercises/src/test/scala/example/SttpSpec.scala:58: value origin is not a member of java.io.Serializable
[error] println(res.origin)
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
The question was answered by Rob Norris on the gitter scala channel as:
When you fold the Either the result is Serializable. Try pattern matching on it, or fold to sys.error instead of String.
Pattern matching on the result:
val res = response.body.fold(_ => { "Error" }, a => { a })
res match {
  // The case class has three fields (args, origin, headers), so the pattern needs three positions
  case HttpBinResponse(_, origin, _) => {
    println("-----------------------------")
    print("The origin for the request is ")
    print(origin)
    println("-----------------------------")
  }
  case _ => "Error"
}
Fold the Left to sys.error instead of String; no pattern matching is needed in this case:
val res = response.body.fold(
  _ => { sys.error("Error") },
  a => { a }
)
println(res.origin)
res.args should contain("foo" -> "bar")
res.args should contain key("bugs")
res.args should contain value("life")
res.origin.length should be >10
Either of these two approaches would solve the issue.