XPath not getting any data - HTML

I am trying to retrieve data from this rates website, but I just can't get anything. I have tried this same format for a different website and it works fine, so I'm not sure what is going on here.
import requests
from requests_html import HTMLSession
#get usd to gbp rate
session = HTMLSession()
response = session.get('https://www.xe.com/currencyconverter/convert/?Amount=1&From=USD&To=GBP')
rate = response.html.xpath('//span[@class="converterresult-toAmount"]/*')
print('USD to GBP Rate:',rate)
#(output): USD to GBP Rate: []

The HTML content returned from that URL is minimal, and does not contain the content you are attempting to target with XPath.
It appears that the content you are trying to scrape is rendered client-side by JavaScript (React). In the HTML that is returned, there is a comment:
<!-- WARNING: Automated extraction of rates is prohibited under the Terms of Use. -->
So, I don't think they want you extracting data from the pages with scripts and automation.
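As an aside: for pages whose terms do allow scraping, requests_html can execute a page's JavaScript itself before you query it. A minimal sketch, with a placeholder URL and class name:
from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://example.com/js-rendered-page')  # placeholder URL
# render() runs the page's JavaScript in a headless Chromium
# (downloaded automatically on first use); sleep gives scripts time to finish
response.html.render(sleep=1)
rate = response.html.xpath('//span[@class="some-class"]/text()')  # placeholder class
print(rate)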


How can I increase work item payload limit for Forge Design Automation for Inventor?

I have updated the code I'm using to queue work items in Design Automation to use the Direct to S3 approach that is required starting next month. Due to this change, the URLs that are being sent in for outputs (where Design Automation will upload the outputs) are MUCH larger. As a result, I am getting a response with a 413 status code and an error message saying that the payload length exceeds the limit of 16384 bytes when I attempt to submit work items.
Is there a way around this?
Using the Signed URL endpoint with the query parameter useCdn=true guarantees the payload goes through Direct to S3, and those URLs are a lot shorter. What I'm doing is the following.
Upload to OSS (new approach):
POST presigned URL to S3 (you can control how long until it expires, up to 60 minutes; the default is 2 minutes)
Upload to the presigned URL (no authorization is required, since it is a presigned URL)
Complete the upload by calling POST complete
Then we need one URL for download (the input to Design Automation) and one for upload (the output from Design Automation).
This is where we can use the Signed URL endpoint.
For the input it will look something like this:
POST https://developer.api.autodesk.com/oss/v2/buckets/:bucketKey/objects/:objectKey/signed?access=read&useCdn=true
For the output it will look like this:
POST https://developer.api.autodesk.com/oss/v2/buckets/:bucketKey/objects/:objectKey/signed?access=write&useCdn=true
Both URLs have an expiration time of 60 minutes. I'm working on updating the Design Automation tutorials, and I also have a collection that I will be blogging about soon.
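For illustration, a minimal Python sketch of requesting the short signed URL for an output, assuming the response carries the URL in a signedUrl field; the bucket key, object key, and token below are placeholders, and the OAuth flow is omitted:
import requests

bucket_key = 'my-bucket'   # placeholder
object_key = 'result.zip'  # placeholder
token = '<access token>'   # obtained via the usual OAuth flow

resp = requests.post(
    f'https://developer.api.autodesk.com/oss/v2/buckets/{bucket_key}/objects/{object_key}/signed',
    params={'access': 'write', 'useCdn': 'true'},
    headers={'Authorization': f'Bearer {token}'},
    json={'minutesExpiration': 60},  # expiration, up to 60 minutes
)
signed_url = resp.json()['signedUrl']  # short URL, safe for the work item payload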

Sequence of events for fuzzy search on HTML page

I have a page html let's call it abc.html
There are AngularJS fields embedded in it.
I am now writing a GET and POST in scala which routes the fuzzy search arguments to the proper page on the server.
I am trying to understand the sequence in which things occur, so that I can implement the GET/POST requests (written in Scala) that fire when someone uses the search bar on the abc.html page and that return matching elements from the database.
Is it abc.html (search) --> HTTP GET request --> backend (AngularJS) --> database?
In that case, my HTTP POST or GET request would pass in the HTML data model elements, which would hit the backend AngularJS controller page, which in turn would query the database, and the return trip would send the database results back to the page in an HTTP response?
Do I need to explicitly define my GET in terms of the Angular fields and the database model?
Thanks.
HTTP uses request-response pairs. This means you don't have to make another request to return anything to the client; you just need to write the proper response. Other than that, your idea is fundamentally right. The process would look something like this:
Type something into the search form on your HTML page
Submit the search form to your backend. This creates a GET or POST request depending on your form element's method attribute.
(At this point the browser is awaiting a response)
As the request reaches the server, your backend code can capture its data and make a query to your database.
(At this point the server is awaiting data from the database)
The database returns its results, and your backend code is free to format them into a response to the client's original request.
The client receives the response and you can use your frontend code to display it to the user.
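For illustration, here is a minimal sketch of that flow in Python with Flask (your backend is Scala, but the request-response cycle is identical; the route, table, and field names here are all hypothetical):
from flask import Flask, jsonify, request
import sqlite3

app = Flask(__name__)

@app.route('/search')  # the URL the search form submits to (GET by default)
def search():
    term = request.args.get('q', '')  # data captured from the incoming request
    conn = sqlite3.connect('abc.db')  # query the database while the client waits
    rows = conn.execute(
        'SELECT name FROM items WHERE name LIKE ?', (f'%{term}%',)
    ).fetchall()
    conn.close()
    # format the results into the response to the client's original request
    return jsonify([row[0] for row in rows])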

Failing to retrieve text from HTML using lxml and XPath

I'm working on a second-hand house pricing project, so I need to scrape information from one of the largest second-hand housing trading platforms in China. Here's my problem: the info on the page and the corresponding element, found via Chrome's 'inspect' function, are as follows (screenshot omitted):
my code:
>>> from lxml import etree
>>> import requests
>>> url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>> r = requests.get(url)
>>> tree = etree.HTML(r.text)  # parse the response into an element tree
>>> xiaoqu_avg_price = tree.xpath('//*[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>> xiaoqu_avg_price
[]
The returned empty list is not what I want (ideally it should be 73648). Furthermore, I viewed the page's HTML source code (snippet omitted).
So what should I do to get what I want? And what does resblockCard mean? Thanks.
This site, like many others, uses AJAX to populate content. If you make a similar request, you can get the desired value in JSON format.
import requests
url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648
Note the two groups of numbers in the request URL. The first group comes from the original page URL; the second you can find under a script tag in the original page source: resblockId:'1111027378082'.
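To automate the whole lookup, a small sketch; the regex below just matches the resblockId snippet quoted above:
import re
import requests

hid = '101101498110'  # first group: taken from the original page URL
page = requests.get(f'http://bj.lianjia.com/chengjiao/{hid}.html').text

# second group: resblockId:'1111027378082' sits under a script tag in the source
rid = re.search(r"resblockId:'(\d+)'", page).group(1)

url = f'http://bj.lianjia.com/chengjiao/resblock?hid={hid}&rid={rid}'
print(requests.get(url).json()['data']['resblock']['unitPrice'])  # 73648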
That XPath query is not working as expected because you are running it against the source code of the page as it is served by the server, not as it looks on a rendered browser page.
One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.
from selenium import webdriver
from lxml import html
driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close() # or quit() if there are no more pages to scrape
tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()
The above returns 73648 元/㎡.

Get price of item in Steam Community Market with JSON

I tried to find the price of items on the Steam Community Market through the Steam website by selecting View Page Info (Google Chrome) to locate the JSON page, but with no luck.
NOTE: I am trying to understand how to get the URL of the JSON page for items on the Steam Community Market, not to scrape the JSON page itself.
I have searched through Stack Overflow, and others seem to be able to get this URL with the JSON, which I am unable to find.
//way number 1
http://steamcommunity.com/market/priceoverview/?currency=3&appid=730&market_hash_name=StatTrak%E2%84%A2%20P250%20%7C%20Steel%20Disruption%20%28Factory%20New%29
// way number 2
http://steamcommunity.com/market/pricehistory/?country=DE&currency=3&appid=440&market_hash_name=Specialized%20Killstreak%20Brass%20Beast
You can use the priceoverview endpoint, just make sure you find the correct appid and market_hash_name.
You can easily locate the needed values for these parameters by navigating to any Steam Market item listing.
For example, in this screenshot (omitted) the appid is 730 and the market_hash_name is AK-47%20%7C%20Redline%20%28Field-Tested%29.
The final parameter for the request is currency; you can use the value 1 for USD.
Finally, the request URL will look like this: http://steamcommunity.com/market/priceoverview/?appid=730&market_hash_name=AK-47%20%7C%20Redline%20%28Field-Tested%29&currency=1
And querying it will yield the following JSON response:
{
  "success": true,
  "lowest_price": "$7.90",
  "volume": "1,113",
  "median_price": "$7.77"
}
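The same request as a small Python sketch, letting requests URL-encode the parameters for you:
import requests

params = {
    'appid': 730,
    'market_hash_name': 'AK-47 | Redline (Field-Tested)',
    'currency': 1,  # 1 = USD
}
data = requests.get('http://steamcommunity.com/market/priceoverview/', params=params).json()
print(data['lowest_price'], data['median_price'])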
You can also read more about the whole Steam Market parsing topic here.

Unable to capture HTTP log for some part of page with Charles Proxy or Fiddler?

I am trying to check the data loading into the search listing on the page linked below:
http://www.tigerdirectwireless.com/ecommerce/phones/?r=tigerdirect&filterbycarrier=68
We could not find the product details (name, price, etc.) in the page source. I have inspected the traffic in both Charles and Fiddler, but I am not able to see a log of any HTTP request or response for this data.
Even saving the complete web page does not download the listed products' details, nor did HTTrack Website Copier help us identify this.
We actually want to know the URL from which this response data is generated, in text/markup format.
Thanks, guys.
The data is sent back as JSON, and the page is built up dynamically using JavaScript. For instance, http://www.tigerdirectwireless.com/eCommerce/Service/CoreServicesProxy.asmx/GetPartnerPhoneList is the JSON response that contains the list of phones and their details, while http://www.tigerdirectwireless.com/eCommerce/Service/CoreServicesProxy.asmx/GetPartnerPricedPhoneList contains the phones and prices.
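Once Charles or Fiddler shows you such a request, you can usually replay it from a script. A rough sketch; the payload fields below are guesses based on the page's query string, so copy the real body from the captured request:
import requests

url = ('http://www.tigerdirectwireless.com/eCommerce/Service/'
       'CoreServicesProxy.asmx/GetPartnerPhoneList')

# ASMX JSON services usually expect a POST with a JSON body; these
# fields are assumptions, take the actual ones from the captured request
payload = {'partner': 'tigerdirect', 'carrierFilter': '68'}

phones = requests.post(url, json=payload).json()  # json= sets the Content-Type header
print(phones)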