Scrapy returns the same JSON content for every page

I developed this Scrapy crawler with a loop that scrapes 10 pages from one site.
The loop works well and the log shows the correct list of URLs:
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=8&ajax=true>
2018-10-08 07:59:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/trang-diem/?page=9&ajax=true>
But the result is always the same: every response contains the content of page 1.
When I test the same URLs in the Scrapy shell they work correctly, and from the browser too. The problem occurs only when running the crawler.
I tried both start_urls and a method that builds the URLs; the problem is always the same.
Any ideas?
import scrapy
import json
import urllib
import time
import datetime
import re
from re import sub
from decimal import Decimal
#from prod.items import ProdItem
from staging.items import StagingItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
ts = time.time()
timestamp = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d')
class QuotesSpider(scrapy.Spider):
    name = "lazada2"
    def start_requests(self):
        # range(1, 10) requests pages 1 through 9
        for i in range(1, 10):
            urls = 'https://www.lazada.vn/trang-diem/?page=%s&ajax=true' % i
            yield scrapy.Request(url=urls, callback=self.parse)
    def parse(self, response):
        data = json.loads(response.body)
        next_page = data['mainInfo']['page']
        for product in data['mods']['listItems']:
            item = StagingItem()
            item['collector_sku'] = product['name']
            # No trailing commas on these assignments: a trailing comma would store a 1-tuple
            if 'originalPrice' in product:
                item['collector_price_promo'] = product['originalPrice']
            else:
                item['collector_price_promo'] = ''
            item['collector_retailer'] = 'Lazada'
            item['collector_url'] = product['productUrl']
            item['collector_photo_url'] = product['image']
            item['collector_brand'] = product['brandName']
            item['collector_quantity'] = 'NA'
            item['collector_category'] = 'Makeup'
            item['collector_price'] = product['price']
            item['collector_timestamp'] = timestamp
            item['collector_local_id'] = ''
            item['collector_location_id'] = ''
            item['collector_location_name'] = ''
            item['collector_vendor_id'] = ''
            item['collector_vendor_name'] = ''
            yield item

With the cookies and headers:
headers = {
"content-type": "application/json",
"authority": "www.lazada.vn",
"scheme": "https",
"Accept-Language": "en-SG,en;q=0.9,en-US;q=0.8,zh-CN;q=0.7,zh;q=0.6,vi;q=0.5,fr;q=0.4",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
"Accept": "*/*",
"Path": "/trang-diem/?page=%s" % i,
"Referer": "https://www.lazada.vn/trang-diem/?page=%s&ajax=true" % i,
"accept-encoding": "gzip, deflate, br"
}
cookies = {
"cookie": "_uab_collina=153864259681792402093714; _bl_uid=qpj7jm4CuXhcUk26er9n7hnhyRqd; t_fv=1538642596635; t_uid=mbei2vPUviVx0oPB6KjX1uVgASJvw7dA; lzd_cid=07e3d81c-bb96-4608-be5d-542d35d39dff; lzd_sid=1d8bf18519bb7fd8fb661ac558726c4d; _tb_token_=58e7f715a30eb; cna=O5A8FGGivzcCAXNPwzeoH+5y; hng=VN|vi|VND|704; userLanguageML=vi; cto_lwid=c9ad6486-acac-465f-ab05-6e0b3744d1dc; _ga=GA1.2.1435138343.1538642600; _gid=GA1.2.19901051.1538642600; cto_axid=zGni0uxNaRyv441RxQNq7EZ_LS8xiGmL; JSESSIONID=85306FF3F7612F91677FC6ED978B42E1; isg=BJ6eL8eUSXz4CZ0YqjCefDlu7zTqVCYsGgm5Z0gmm-DyaztFsOyk6OZNZi9CoFrx"
}
body ="?ajax=true&page=%s" % i
urls = "https://www.lazada.vn/trang-diem/?ajax=true&page=%s" % i
yield scrapy.Request(url=urls, body=body, cookies=cookies, headers=headers, callback=self.parse)

How to create a valid signature message to POST an order to the Kucoin Future API?

I am trying to place an order, but I get this error:
{"code":"400005","msg":"Invalid KC-API-SIGN"}
I would be very thankful if someone could check my code and tell me what the problem is.
import requests
import time
import base64
import hashlib
import hmac
import json
import uuid
api_key = 'XXXXXXXXXXXXXXXXXXXXXXX'
api_secret = 'XXXXXXXXXXXXXXXXXXXXXX'
api_passphrase = 'XXXXXXXXXXXXXXX'
future_base_url = "https://api-futures.kucoin.com"
clientOid = uuid.uuid4().hex
params = {
    "clientOid": str(clientOid),
    "side": str(side),
    "symbol": str(symbol),
    "type": "limit",
    "leverage": "5",
    "stop": "down",
    "stopPriceType": "TP",
    "price": str(price),
    "size": int(size),
    "stopPrice": str(stopprice)
}
json_params = json.dumps(params)
print(json_params)
now = int(time.time() * 1000)
str_to_sign = str(now) + 'POST' + '/api/v1/orders' + json_params
signature = base64.b64encode(hmac.new(api_secret.encode('utf-8'), str_to_sign.encode('utf-8'), hashlib.sha256).digest())
passphrase = base64.b64encode(hmac.new(api_secret.encode('utf-8'), api_passphrase.encode('utf-8'), hashlib.sha256).digest())
headers = {
    "KC-API-SIGN": signature,
    "KC-API-TIMESTAMP": str(now),
    "KC-API-KEY": api_key,
    "KC-API-PASSPHRASE": passphrase,
    "KC-API-KEY-VERSION": "2",
    "Content-Type": "application/json"
}
response = requests.request('POST', future_base_url + '/api/v1/orders', params=params, headers=headers)
print(response.text)
This worked for me:
tickerK = "AVAXUSDTM"
# Take the timestamp first: it is used in both clientOid and the signature
now = int(time.time() * 1000)
clientOid = tickerK + '_' + str(now)
side = "buy"
typee = "market"
leverage = "2"
stop = "up"
stopPriceType = "TP"
stopPrice = "12"
size = "3"
# Set the request body
data = {
    "clientOid": clientOid,
    "side": side,
    "symbol": tickerK,
    "type": typee,
    "leverage": leverage,
    "stop": stop,
    "stopPriceType": stopPriceType,
    "stopPrice": stopPrice,
    "size": size
}
# Serialize without spaces; this exact string is both signed and sent
data_json = json.dumps(data, separators=(',', ':'))
url = 'https://api-futures.kucoin.com/api/v1/orders'
str_to_sign = str(now) + 'POST' + '/api/v1/orders' + data_json
signature = base64.b64encode(
    hmac.new(api_secret.encode('utf-8'), str_to_sign.encode('utf-8'), hashlib.sha256).digest())
passphrase = base64.b64encode(hmac.new(api_secret.encode('utf-8'), api_passphrase.encode('utf-8'), hashlib.sha256).digest())
headers = {
    "KC-API-SIGN": signature,
    "KC-API-TIMESTAMP": str(now),
    "KC-API-KEY": api_key,
    "KC-API-PASSPHRASE": passphrase,
    "KC-API-KEY-VERSION": "2",
    "Content-Type": "application/json"
}
# Send the POST request with the signed JSON string as the body
response = requests.request('post', url, headers=headers, data=data_json)
# Print the response
print(response.json())
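Please take care of the following points:
Remove the spaces from the JSON (json.dumps with separators=(',', ':')).
Append that JSON string to the string to sign.
Add the Content-Type header.
Send the request with data=data_json (the signed string as the body), not with params=params.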
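The points above can be condensed into one helper. This is only a sketch of the signing scheme used in the answer: the function name sign_request and the demo secret, passphrase and timestamp are made up for illustration, and no request is sent. What it shows is that the JSON string that gets signed is the same string that should be sent as the body.

```python
import base64
import hashlib
import hmac
import json


def sign_request(api_secret, api_passphrase, timestamp_ms, method, endpoint, body):
    """Return (body_json, signature, passphrase) for one signed request.

    Hypothetical helper: it mirrors the steps from the answer above.
    """
    # Compact JSON; this exact string must also be sent as the request body.
    body_json = json.dumps(body, separators=(",", ":"))
    str_to_sign = str(timestamp_ms) + method.upper() + endpoint + body_json
    secret = api_secret.encode("utf-8")
    signature = base64.b64encode(
        hmac.new(secret, str_to_sign.encode("utf-8"), hashlib.sha256).digest()
    ).decode("ascii")
    passphrase = base64.b64encode(
        hmac.new(secret, api_passphrase.encode("utf-8"), hashlib.sha256).digest()
    ).decode("ascii")
    return body_json, signature, passphrase


# Demo values only; real credentials come from the exchange account.
body_json, signature, passphrase = sign_request(
    "demo-secret", "demo-pass", 1700000000000, "post", "/api/v1/orders",
    {"side": "buy", "size": "3"},
)
print(body_json)
```

The returned body_json would then be passed as data=body_json, so the signed bytes and the transmitted bytes cannot drift apart.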

How to fill in missing column value?

# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import ast
start_time = time.time()
s = requests.Session()
#Get URL and extract content
page=1
traits = []
accessories, backgrounds, shoes = [], [], []
while page != 100:
    params = {
        ('arg', f"Qmer3VzaeFhb7c5uiwuHJbRuVCaUu72DcnSoUKb1EvnB2x/{page}"),
    }
    content = s.get('https://ipfs.infura.io:5001/api/v0/cat', params=params, auth=('', ''))
    soup = BeautifulSoup(content.text, 'html.parser')
    page = page + 1
    traits = ast.literal_eval(soup.text)['attributes']
    df = pd.DataFrame(traits)
    df1 = df[df['trait_type']=='ACCESSORIES']
    accessories.append(df1['value'].values[0])
When I run the above code I get the following error:
IndexError: index 0 is out of bounds for axis 0 with size 0
This happens because not every item has an "ACCESSORIES" trait data point. So how would I go about adding/filling in an ACCESSORIES trait for those items that don't have one with an empty, nan, or 0 value?
The following code solves this issue:
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time
import ast
start_time = time.time()
s = requests.Session()
#Get URL and extract content
page=1
traits = []
accessories, backgrounds, shoes = [], [], []
while page != 100:
    params = {
        ('arg', f"Qmer3VzaeFhb7c5uiwuHJbRuVCaUu72DcnSoUKb1EvnB2x/{page}"),
    }
    content = s.get('https://ipfs.infura.io:5001/api/v0/cat', params=params, auth=('', ''))
    soup = BeautifulSoup(content.text, 'html.parser')
    page = page + 1
    traits = ast.literal_eval(soup.text)['attributes']
    df = pd.DataFrame(traits)
    df1 = df[df['trait_type']=='ACCESSORIES']
    try:
        accessories.append(df1['value'].values[0])
    except IndexError:
        # No ACCESSORIES trait for this item: append a placeholder so the list stays aligned
        accessories.append('NONE')
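An alternative that avoids the exception altogether is to look the trait up directly in the list of attribute dictionaries and fall back to a default. This is a minimal sketch: the helper name trait_value and the sample items are invented for illustration, and it assumes each attribute is a dict with "trait_type" and "value" keys, as the DataFrame columns in the question suggest.

```python
def trait_value(attributes, trait_type, default=None):
    """Return the value of the first attribute with the given trait_type, else default."""
    for attribute in attributes:
        if attribute.get("trait_type") == trait_type:
            return attribute.get("value", default)
    return default


# Hypothetical sample data: the second item has no ACCESSORIES trait.
items = [
    [{"trait_type": "ACCESSORIES", "value": "Hat"}, {"trait_type": "SHOES", "value": "Boots"}],
    [{"trait_type": "SHOES", "value": "Sneakers"}],
]
accessories = [trait_value(attrs, "ACCESSORIES", default="NONE") for attrs in items]
print(accessories)  # ['Hat', 'NONE']
```

Because every item contributes exactly one entry, the accessories list stays the same length as the number of pages fetched, which matters if it later becomes a DataFrame column.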

Why doesn't json.dump work in my code?

I'm trying to write Python objects to a JSON file using data fetched from a site's API, but when I run the code nothing is written to the file. The API works, and when I print the data that I parsed from the response I get the expected output, so I have no idea why dump doesn't work.
Here is my code:
from django.shortcuts import render
import requests
import json
import datetime
import re
def index(request):
    now = datetime.datetime.now()
    format = "{}-{}-{}".format(now.year, now.month, now.day)
    source = []
    author = []
    title = []
    date = []
    url = "http://newsapi.org/v2/everything"
    params = {
        'q': 'bitcoin',
        'from': format,
        'sortBy': 'publishedAt',
        'apiKey': '1186d3b0ccf24e6a91ab9816de603b90'
    }
    response = requests.request("GET", url, params=params)
    for news in response.json()['articles']:
        # Raw string so that \d is passed to the regex engine unchanged
        matching = re.match(r"\d+-\d+-\d+", news['publishedAt'])
        if format == matching.group():
            source.append(news['source'])
            author.append(news['author'])
            title.append(news['title'])
            date.append(news['publishedAt'])
    data = {
        'source': source,
        'author': author,
        'title': title,
        'date': date
    }
    with open('data.json', "a+") as fp:
        x = json.dump(data, fp, indent=4)
    return render(request, 'news/news.html', {'response': response})
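Two things may be worth checking here, offered as assumptions rather than a confirmed diagnosis. A relative name such as 'data.json' is resolved against the working directory of the running process, which may not be the folder being inspected, and opening with "a+" appends a new JSON document on every call, so the file stops being valid JSON after the second write. The sketch below uses hypothetical helper names (save_json, load_json) and a temporary directory; it writes to an explicit path in "w" mode and reads the file back to confirm the round trip.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Overwrite path with data as JSON and return the absolute path written."""
    path = Path(path).resolve()
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, indent=4)
    return path


def load_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# A temporary directory stands in for the project folder in this sketch.
output_dir = Path(tempfile.mkdtemp())
saved_path = save_json({"title": ["example"]}, output_dir / "data.json")
print(saved_path)             # absolute location of the file that was written
print(load_json(saved_path))
```

Printing the absolute path shows exactly where the file ended up, which makes it easy to tell "nothing was written" apart from "it was written somewhere else".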

TypeError: can't concat str to bytes when converting Python 2 to 3 with Encryption Function

I am trying to port some code from Python 2 to Python 3 and ran into a problem. pad * chr(pad) looks like a string, but when I print it out I cannot tell what it really is.
<ipython-input-26-6c9679723473> in aesEncrypt(text, secKey)
43 def aesEncrypt(text, secKey):
44 pad = 16 - len(text) % 16
---> 45 text =text + pad * chr(pad)
46 encryptor = AES.new(secKey, 2, '0102030405060708')
47 ciphertext = encryptor.encrypt(text)
TypeError: can't concat str to bytes
I then tried encode(), but it didn't work. I wonder how to concatenate these two values in Python 3.
<ipython-input-53-e9f33b00348a> in aesEncrypt(text, secKey)
43 def aesEncrypt(text, secKey):
44 pad = 16 - len(text) % 16
---> 45 text = text.encode("utf-8") + (pad * chr(pad)).encode("utf-8")
46 encryptor = AES.new(secKey, 2, '0102030405060708')
47 ciphertext = encryptor.encrypt(text)
AttributeError:'bytes' object has no attribute 'encode'
For reference, the original code is:
Author: 路人甲
Link: https://www.zhihu.com/question/31677442/answer/119959112
#encoding=utf8
import requests
from bs4 import BeautifulSoup
import re,time
import os,json
import base64
from Crypto.Cipher import AES
from pprint import pprint
Default_Header = {
    'Referer':'http://music.163.com/',
    'Host':'music.163.com',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate'
}
BASE_URL = 'http://music.163.com'
_session = requests.session()
_session.headers.update(Default_Header)
def getPage(pageIndex):
    pageUrl = 'http://music.163.com/discover/playlist/?order=hot&cat=全部&limit=35&offset='+pageIndex
    soup = BeautifulSoup(_session.get(pageUrl).content)
    songList = soup.findAll('a',attrs = {'class':'tit f-thide s-fc0'})
    for i in songList:
        print i['href']
        getPlayList(i['href'])
def getPlayList(playListId):
    playListUrl = BASE_URL + playListId
    soup = BeautifulSoup(_session.get(playListUrl).content)
    songList = soup.find('ul',attrs = {'class':'f-hide'})
    for i in songList.findAll('li'):
        startIndex = (i.find('a'))['href']
        songId = startIndex.split('=')[1]
        readEver(songId)
def getSongInfo(songId):
    pass
def aesEncrypt(text, secKey):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    encryptor = AES.new(secKey, 2, '0102030405060708')
    ciphertext = encryptor.encrypt(text)
    ciphertext = base64.b64encode(ciphertext)
    return ciphertext
def rsaEncrypt(text, pubKey, modulus):
    text = text[::-1]
    rs = int(text.encode('hex'), 16)**int(pubKey, 16) % int(modulus, 16)
    return format(rs, 'x').zfill(256)
def createSecretKey(size):
    return (''.join(map(lambda xx: (hex(ord(xx))[2:]), os.urandom(size))))[0:16]
def readEver(songId):
    url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_'+str(songId)+'/?csrf_token='
    headers = { 'Cookie': 'appver=1.5.0.75771;', 'Referer': 'http://music.163.com/' }
    text = { 'username': '', 'password': '', 'rememberLogin': 'true' }
    modulus = '00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7'
    nonce = '0CoJUm6Qyw8W8jud'
    pubKey = '010001'
    text = json.dumps(text)
    secKey = createSecretKey(16)
    encText = aesEncrypt(aesEncrypt(text, nonce), secKey)
    encSecKey = rsaEncrypt(secKey, pubKey, modulus)
    data = { 'params': encText, 'encSecKey': encSecKey }
    req = requests.post(url, headers=headers, data=data)
    total = req.json()['total']
    if int(total) > 10000:
        print songId,total
    else:
        pass
if __name__=='__main__':
    for i in range(1,43):
        getPage(str(i*35))
around line 85: encText = aesEncrypt(aesEncrypt(text, nonce), secKey)
aesEncrypt() is called first time with data as shown with subscript '_1' and 2nd time with data shown with subscript '_2'. Notice that the type of secKey changes from a string to a list of strings, and text from a string to a bytes object.
>>> secKey_1
'0CoJUm6Qyw8W8jud'
>>> secKey_2
['e4', '1a', '61', '7c', '1e', '62', '76', '5', '94', '62', '5a', '92', '9', 'fd', '2f', '4a']
>>>
>>> text_1
'{"username": "", "password": "", "rememberLogin": "true"}'
>>> text_2
b'qjTTWCVgh3v45StLOhGtNtY3zzoImIeGkRry1Vq0LzNSgr9hDHkkh19ujd+iqbvXnzjmHDhOIA5H0z/Wf3uU5Q=='
>>>
Thanks to @Todd, who found the issue: aesEncrypt() is called twice, and it returns bytes while it expects a str, which is acceptable in Python 2 but not in Python 3.
In the end, I changed return ciphertext to return str(ciphertext).
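One detail to keep in mind with that change: in Python 3, str() applied to a bytes object returns its printable representation, including the b'...' wrapper, whereas decode() returns only the characters. The sketch below illustrates the difference and shows one way to do the padding step on bytes so that it accepts either type. The helper name pkcs7_pad and the sample values are made up for illustration, and the AES call itself is left out so the example runs with the standard library alone.

```python
def pkcs7_pad(text, block_size=16):
    """Pad text (str or bytes) to a multiple of block_size and return bytes."""
    data = text.encode("utf-8") if isinstance(text, str) else bytes(text)
    pad = block_size - len(data) % block_size
    # bytes([pad]) is the Python 3 counterpart of chr(pad) from the Python 2 code
    return data + bytes([pad]) * pad


padded_from_str = pkcs7_pad('{"username": ""}')  # first call receives a str
padded_from_bytes = pkcs7_pad(b"qjTTWCVg")       # second call receives bytes
print(len(padded_from_str), len(padded_from_bytes))

sample = b"qjTTWCVg"
print(str(sample))             # b'qjTTWCVg'  (the wrapper becomes part of the text)
print(sample.decode("ascii"))  # qjTTWCVg
```

If the second aesEncrypt call is meant to encrypt the same characters as the Python 2 version did, decoding the base64 result (or padding the bytes directly, as above) keeps the b'...' wrapper out of the plaintext.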

Cannot access value from case class

I am trying to use the sttp core library along with the sttp json4s library to receive a JSON response and convert it to a case class.
The source code is on GitHub, and the example I am trying to replicate comes from the sttp documentation.
The response to the GET request to the URL http://httpbin.org/get?foo=bar looks like:
{
"args": {
"foo": "bar"
},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip,deflate",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Connection": "close",
"Cookie": "_gauges_unique_day=1; _gauges_unique_month=1; _gauges_unique_year=1; _gauges_unique=1; stale_after=never",
"Forwarded": "for=49.255.235.138",
"Host": "httpbin.org",
"Save-Data": "on",
"Scheme": "http",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
},
"origin": "49.255.235.138, 66.249.82.79",
"url": "http://httpbin.org/get?foo=bar"
}
The case class which attempts to read the json response above looks like:
case class HttpBinResponse(
  args: Map[String, String],
  origin: String,
  headers: Map[String, String]
)
The case class has been defined at the top of the test file and is accessible to the test.
This test passes:
it should "send a GET request parse response as JSON" in {
  implicit val backend = HttpURLConnectionBackend()
  val queryParams = Map("foo" -> "bar", "bugs" -> "life")
  val endpoint: Uri = uri"http://httpbin.org/get?foo=bar"
  val request = sttp
    .get(endpoint)
    .response(asJson[HttpBinResponse])
  val response = request.send()
  // response.body is an Either
  response.code should be(200)
  val res = response.body.fold(_ => { "Error" }, a => { a })
  res shouldBe a[HttpBinResponse]
}
The code that produces the error looks like this:
it should "send a GET request parse response as JSON" in {
  implicit val backend = HttpURLConnectionBackend()
  val queryParams = Map("foo" -> "bar", "bugs" -> "life")
  val endpoint: Uri = uri"http://httpbin.org/get?foo=bar"
  val request = sttp
    .get(endpoint)
    .response(asJson[HttpBinResponse])
  val response = request.send()
  // response.body is an Either
  response.code should be(200)
  val res = response.body.fold(_ => { "Error" }, a => { a })
  res shouldBe a[HttpBinResponse]
  println(res.origin)
}
However, when I try to access res.origin, compilation fails with the error "value origin is not a member of java.io.Serializable":
28. Waiting for source changes... (press enter to interrupt)
[info] Compiling 1 Scala source to /Users/localuser/Do/scalaexercises/target/scala-2.12/test-classes...
[error] /Users/localuser/Do/scalaexercises/src/test/scala/example/SttpSpec.scala:58: value origin is not a member of java.io.Serializable
[error] println(res.origin)
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
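The question was answered by Rob Norris on the Scala Gitter channel:
When you fold the Either the result is Serializable. Try pattern matching on it, or fold to sys.error instead of String.
In other words, the two branches of the fold return a String and an HttpBinResponse, so the compiler infers their common supertype, java.io.Serializable, which has no origin member.
Pattern matching on the result: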
val res = response.body.fold(_ => { "Error" }, a => { a })
res match {
  // The case class has three fields (args, origin, headers), so the pattern needs three positions
  case HttpBinResponse(_, origin, _) => {
    println("-----------------------------")
    print("The origin for the request is ")
    print(origin)
    println("-----------------------------")
  }
  case _ => "Error"
}
Fold the Left to sys.error instead of a String; no pattern matching is needed in this case:
val res = response.body.fold(
  _ => { sys.error("Error") },
  a => { a }
)
println(res.origin)
res.args should contain("foo" -> "bar")
res.args should contain key("bugs")
res.args should contain value("life")
res.origin.length should be > 10
Either of these two approaches would solve the issue.