I am scraping Amazon.de reviews and questions using Puppeteer. I am connected via a German IP. The Accept-Language header is set:
await page.setExtraHTTPHeaders({
'Accept-Language': 'de-DE,de;q=0.9',
});
But when I use Puppeteer, all content is displayed in English and the country is even set to UAE, so I see the wrong content. Is there a proper way to display Amazon in German and set Germany as the delivery country?
I've got a Windows Store app that's a WinRT Phone/Desktop app (i.e. not a UWP app), targeting Windows 8.1 and up.
It's been on the store for several years now, but recently it stopped being able to connect with various web APIs and websites (YouTube, as well as my own site) using HTTPS.
I have a WPF version of this app as well; the same thing happened there recently, and to fix it I used System.Net.ServicePointManager. Unfortunately, in my WinRT environment, System.Net doesn't include ServicePointManager. In my WPF app, I did this, and it worked just fine:
ServicePointManager.ServerCertificateValidationCallback = delegate
{
    Debug.WriteLine("returning true (the ssl is valid)");
    return true;
};
// our server is using TLS 1.2
ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
From doing some research around the internet, it seems that .NET 4.6 should include ServicePointManager, but I don't see any way to change (or even see) my version of .NET in the WinRT development environment.
I looked some more and found that a StreamSocket could be used to connect with TLS 1.2... but that seems primarily designed to enable Bluetooth communications, or communications to a web endpoint, but only by hostname... which is insufficient for me. I need to connect to an actual website, not just the base-level domain.
Trying this, I did the following:
StreamSocket socket = new StreamSocket();
string serverServiceName = "https";
socket.Control.KeepAlive = false;
string url = "inadaydevelopment.com";
HostName serverHost = new HostName(url);
await socket.ConnectAsync(serverHost, serverServiceName, SocketProtectionLevel.Tls12);
string text = await ReadDataFromSocket(socket);
I can include the code for ReadDataFromSocket() if necessary, but it seems to work, reading the data from the socket as expected when I point it at https://google.com. However, I can't seem to figure out how to point the socket at anything useful. The homepage of inadaydevelopment.com isn't what I want; I'm looking to consume a web API hosted on that server, but can't seem to find a way to do that.
Since the first parameter to the ConnectAsync() method is just a HostName, the second parameter (remoteServiceName) must be the way to connect to the actual API or webpage I'm trying to reach. According to the docs, that is "The service name or TCP port number of the remote network destination". I haven't seen any example values for this parameter other than https and various numeric values, neither of which is going to get me to the API endpoint or webpage I'm trying to connect to.
So, with that super-long preamble out of the way, my question boils down to this:
Is there a way for me to use System.Net.ServicePointManager in my WinRT app like I do in my WPF app? If so, how?
If not, how can I use StreamSocket to connect to the exact web service or webpage I want to connect to, rather than just the top-level host?
If that's not possible, by what other means can I consume web content using TLS 1.2?
Thanks in advance for any help or advice.
Use the Windows.Web.Http API instead of the System.Net.Http API.
System.Net.Http does not support TLS 1.2 in WinRT apps, but Windows.Web.Http does.
I'm trying to scrape council tax band data from this website and I'm unable to find the API:
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm?action=pageNavigation&p=0
I've gone through previous answers on Stack Overflow and articles including:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
and
https://medium.com/@williamyeny/finding-ratemyprofessors-private-api-with-chrome-developer-tools-bd6747dd228d
I've gone into the Network tab - XHR/All - Headers/ Preview/ Response and the only thing I can find is:
/**/jQuery11130005376436794664263_1594893863666({ "html" : "<li class='navbar-text myprofile_salutation'>Welcome Guest!</li><li role='presentation' class=''><a href='https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/citizenportal/login.htm?redirect_url=https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/dashboard.htm'> Sign In / Register <span class='icon-arrow-right17 pull-right visible-xs-block'></span></a></li>" });
As a test I used AB24 4DE to search and couldn't find it anywhere within a JSON response.
As far as I can tell the data isn't hidden behind a web socket.
I ran a GET request for the sake of it and got:
JSONDecodeError: Expecting value: line 10 column 1 (char 19)
What am I missing?
You're doing the right thing by looking at the network tools. I find it's best to zoom in on the overview you're given in the Network tab: you can select any portion of the activity in the browser and see what is happening and which requests are made. So you could focus on the requests and responses fired off when you click Search. That shows two requests being made: one that posts information to the server, and one that grabs information from a separate URL.
Suggestion
Having had a look at the website, my suggestion is to use Selenium, a package that mimics browser activity. Below you'll see my study of the requests. Essentially, the form generates a unique token every time you do a search, and you have to replicate it in order to get the correct response, which is hard to know in advance.
That being said, you can mimic browser activity using Selenium by automatically inputting the postcode and automating the clicking of the search button. You can then grab the page source HTML and use BeautifulSoup to parse it. Here is a minimal reproducible example showing this.
Coding Example
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe')
url = 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm'
driver.get(url)
driver.find_element_by_id('postcode').send_keys('AB24 4DE')
driver.find_element_by_xpath('//input[@class="btn btn-primary cap-submit-inline"]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
There is also scope to make the browser headless, so it doesn't pop up and all you'll get back is the parsed HTML.
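A minimal sketch of the headless variant, assuming the same chromedriver path as above (only the setup changes; the remaining steps stay the same):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe', options=options)
# ...then continue with driver.get(url), the postcode input and the click exactly as above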
Explanation of Code
We are importing webdriver from selenium; this provides the module necessary to load a browser. We then create an instance of that webdriver. In this case I'm using Chrome, but you can use Firefox or other browsers.
You'll need to download chromedriver from https://chromedriver.chromium.org/. Webdriver uses this to open a browser.
We use the webdriver's get method to make chromedriver go to the specific page we want.
Webdriver has a list of find_element_by_... methods you can use. The simplest here is find_element_by_id: we can find the id of the postcode input box in the HTML, which I've done here. send_keys will send whatever text we want, in this case AB24 4DE.
find_element_by_xpath takes an XPath selector. '//' goes through all of the DOM, we select input, and the [@class="..."] part selects the specific input tag by class. We want the submit button. The click() method will click that element in the browser.
We then grab the page source once this click is complete; this is necessary as we then feed it into BeautifulSoup, which will give us the parsed HTML of the results for the postcode we desire.
Reverse Engineering the HTTP requests
The below is really for education, unless someone can work out how to get the unique token before sending requests to the server. Here's how the website works in terms of the search form.
Essentially, looking at the process, it's sending cookies, headers, params and data to the server. The cookies contain a session ID, which doesn't seem to change in my tests. The data variable is where you can change the postcode, but importantly the ABCToken also changes every single time you do a search, and the param is a check on the server to make sure it's not a bot.
As an example of the HTTP POST request, we send this:
cookies = {
    'JSESSIONID': '1DBAC40138879EB418C14AD83C05AD86',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://ecitizen.aberdeencity.gov.uk',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm',
    'Accept-Language': 'en-US,en;q=0.9',
}
params = (
    ('action', 'validateData'),
)
data = {
    'postcode': 'AB24 2RX',
    'address': '',
    'startSearch': 'Search',
    'ABCToken': '35fbfd26-cb4b-4688-ac01-1e35bcb8f63d'
}
To
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm
Then it does an HTTP GET request, with the same JSESSIONID and the unique ABCToken, to grab the data you want from bandsearchresults.htm:
'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm'
So it creates a JSESSIONID, which seems to be the same for any postcode in my testing. Then, when you reuse that JSESSIONID together with the ABCToken it supplies against the search results URL, you'll get the correct data back.
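If someone can pull a fresh ABCToken out of the form HTML, the same flow would look roughly like this with requests (a sketch only; the token below is a placeholder and would have to be extracted from the form for every search):
import requests

base = 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax'

with requests.Session() as s:
    # Loading the form first gives the session its JSESSIONID cookie;
    # the ABCToken is embedded in the returned HTML and changes per search.
    s.get(f'{base}/bandsearch.htm')

    data = {
        'postcode': 'AB24 2RX',
        'address': '',
        'startSearch': 'Search',
        'ABCToken': 'TOKEN_EXTRACTED_FROM_FORM_HTML',
    }
    s.post(f'{base}/bandsearch.htm', params={'action': 'validateData'}, data=data)

    # The results are then served from the band search results page on the same session
    results = s.get(f'{base}/bandsearchresults.htm')
    print(results.text)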
I have been reading around here and I am seeing multiple messages about the /pages endpoint not working as expected.
It seems that the OneNote APIs (MS Graph or Office 365) are not returning all the pages that the user can see. In particular, recent pages are not shown as available.
This message is for those of you who work for Microsoft and who keep an eye on this forum. Please if you have any explanation or workaround for this we would like to hear about it.
If this is work in progress, we would also like to know when the APIs can be considered stable and reliable enough for production use.
Update:
Permissions or scopes
scopes = [
    "Notes.Read",
    "Notes.Read.All",
    "Notes.ReadWrite",
]
This is for a device authorization flow; the device is acting as a Microsoft online account. The app is registered in Azure as a personal app, but the enterprise one behaves the same.
The authorization process is described here:
What type of app/authentication flow should I select to read my cloud OneNote content using a Python script and a personal Microsoft account?
After that I am using this endpoint to get the notebooks
https://graph.microsoft.com/v1.0/users/user-id/onenote/notebooks
From the returned JSON I pick the notebook I want to read and access the link stored in notebook['sectionsUrl']. This call returns a sections JSON.
From this I pick the section I want and access the link stored in section['pagesUrl'].
Each call returns the expected info except the last one, where I get an arbitrarily low number of pages for the section I want to explore. There is nothing wrong with the format of the info; it is just incomplete or not up to date.
Not sure if this is related, but when I try to access the pages in a section from MS Graph Explorer I am seeing the same behavior (not all the pages are reported). This is a shared notebook and I am using the owner account for all the above, so it should not be a permission problem.
from msal import PublicClientApplication
import requests
endpoint= "https://graph.microsoft.com/v1.0/me/onenote"
authority = "https://login.microsoftonline.com/consumers"
app=PublicClientApplication(client_id=client_id, authority=authority)
flow = app.initiate_device_flow(scopes=scopes)
# there is an interactive part here that I automated using selenium; you
# are supposed to use a link to enter a code and then authorize the
# device; code not shown
result = app.acquire_token_by_device_flow(flow)
token= result['access_token']
headers={'Authorization': 'Bearer ' + token}
endpoint= "https://graph.microsoft.com/v1.0/users/c5af8759-4785-4abf-9434-xxxxxxxxx/onenote/notebooks"
notebooks = requests.get(endpoint,headers=headers).json()
for notebook in notebooks['value']:
    print(notebook['displayName'])
    print(notebook['sectionsUrl'])
    print(notebook['sectionGroupsUrl'])
# I pick a certain notebook and fetch its sections from notebook['sectionsUrl']
sections = requests.get(notebook['sectionsUrl'], headers=headers).json()['value']
section = [section for section in sections if section['displayName'] == "Test"][0]
endpoint = section['pagesUrl']
pages = requests.get(endpoint, headers=headers).json()
for page in pages['value']:
    print(page['title'])
Update 2:
If I use this endpoint
https://graph.microsoft.com/v1.0/users/user-id/onenote/sections/section-id/pages
I would expect to get the complete list of pages for that section, but that is not working.
After reading the docs again and again, my understanding is that the approach is to
call https://graph.microsoft.com/v1.0/users/user-id/onenote/pages with $filter or $search, etc.
Is this correct?
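If it helps, this is a minimal sketch of how I would call the pages endpoint, assuming the usual Graph paging conventions ($top and @odata.nextLink) also apply to OneNote; user-id and section-id are placeholders, and the token comes from the device-flow code above:
import requests

user_id = "user-id"        # placeholder
section_id = "section-id"  # placeholder, taken from the sections JSON
headers = {'Authorization': 'Bearer ' + token}  # same token as above

endpoint = (f"https://graph.microsoft.com/v1.0/users/{user_id}"
            f"/onenote/sections/{section_id}/pages?$top=100")
pages = []
while endpoint:
    resp = requests.get(endpoint, headers=headers).json()
    pages.extend(resp.get('value', []))
    # Graph returns a continuation link while there are more results
    endpoint = resp.get('@odata.nextLink')

for page in pages:
    print(page['title'])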
Also, I vaguely remember there is a way to search for a section and have it expanded so that the search returns the children too.
Am I close to understanding this?
Thank you
MM
I'm developing an app for myself which is using Google Directions to display some directions from my place to a certain location on the map.
I'm in Bulgaria, so the street names are in Bulgarian. But I want my app to be English only.
Therefore I'm deleting the Cyrillic (Bulgarian) letters and only showing the turn-by-turn directions in English.
That's what I was doing. Today, though, I tried to do the same and my app showed me something like:
Head \u003cb\u003ewest\u003c/b\u003e on \u003cb\u003eул. „Княз Ал. Дондуков-Корсаков“\u003c/b\u003e toward \u003cb\u003eул. „Академик Петко Стайнов“\u003c/b\u003e".
It should be like "Head west on [place] toward [place]".
Now, I'm not receiving the English names of the streets in the JSON file I'm receiving from the Google Directions API.
It's strange to me why this is happening, and I don't know how I can tell Google that I only want the English names of the streets.
I set the language to English in my request, but that still didn't help.
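For reference, this is roughly what my request looks like, sketched in Python for brevity; the key and the coordinates are placeholders:
import requests

params = {
    "origin": "42.6977,23.3219",        # placeholder start point
    "destination": "42.1354,24.7453",   # placeholder end point
    "language": "en",
    "key": "YOUR_API_KEY",
}
resp = requests.get("https://maps.googleapis.com/maps/api/directions/json", params=params)
steps = resp.json()["routes"][0]["legs"][0]["steps"]
for step in steps:
    print(step["html_instructions"])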
Use this code if it is Android.
If it is not Android, specify the locale in your application; it does matter, as the application reads locale information from the device and location settings when it is not specified, and that depends on the application.
String uri = String.format(Locale.ENGLISH, "http://maps.google.com/maps?saddr=%f,%f(%s)&daddr=%f,%f (%s)", sourceLatitude, sourceLongitude, "Home Sweet Home", destinationLatitude, destinationLongitude, "Where the party is at");
Intent intent = new Intent(Intent.ACTION_VIEW, Uri.parse(uri));
intent.setClassName("com.google.android.apps.maps", "com.google.android.maps.MapsActivity");
startActivity(intent);
I noticed that iTunes preview allows you to crawl and scrape pages via the http:// protocol. However, many of the links try to open in iTunes rather than the browser. For example, when you go to the iBooks page, it immediately tries opening a URL with the itms:// protocol.
Are there any other methods of crawling the App Store or is this the only way?
Can the itms:// protocol links themselves be crawled somehow?
I would have a decent look at the iTunes Search API and the iTunes Enterprise Partner API:
Search API -
http://www.apple.com/itunes/affiliates/resources/blog/introduction---search-api.html
Enterprise Partner API -
http://www.apple.com/itunes/affiliates/resources/documentation/itunes-enterprise-partner-feed.html
You might get most/all of the information you need in a nice JSON file format.
If you can't get the information you need with the API, I would be interested to know what it is :)
As phillipp mentioned, the iTunes search API is an easy way to retrieve data about your App Store listings in JSON format.
Simply query for this with your app id (you can find the app id by viewing the web listing for your app at itunes.apple.com), ex:
http://itunes.apple.com/lookup?id=INSERT_YOUR_APP_ID_HERE
Then parse the resulting JSON to your heart's content.
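For example, a quick sketch in Python (the app id here is just a placeholder for your own):
import requests

app_id = "YOUR_APP_ID"  # take this from your app's itunes.apple.com listing URL
resp = requests.get("https://itunes.apple.com/lookup", params={"id": app_id})
for app in resp.json().get("results", []):
    print(app.get("trackName"), app.get("averageUserRating"))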
The only difference between http:// links and itms:// links is that you need to set your User-Agent to an iTunes user-agent, and depending on the version you may also have to include a verification code based on some not-so-secret algorithm.
For example this is the code for iTunes 9:
# Some magic. Generates a seed we use for X-Apple-Validation. Adapted from LWP::UserAgent::iTMS_Client.
function comp_seed($url, $user_agent) {
    $random = sprintf( "%04X%04X", rand(0,0x10000), rand(0,0x10000) );
    $static = base64_decode("ROkjAaKid4EUF5kGtTNn3Q==");
    $url_end = ( preg_match("|.*/.*/.*(/.+)$|",$url,$matches)) ? $matches[1] : '?';
    $digest = md5(join("",array($url_end, $user_agent, $static, $random)) );
    return $random . '-' . strtoupper($digest);
}
However, if you are only scraping, iTunes preview should work for your purposes; the link you gave us to the iBooks page had more than enough information to scrape.
We tried scraping ourselves about a year ago too, and it just became too much of a headache. Philipp's comment is a good one, as the enterprise feed from Apple (you need to apply for it with a legitimate use) does have a good amount of the useful info that you might be after by scraping.
There are a few companies that offer data as a service too - abto and AppMonsta are two I heard of when I was looking. I can't seem to find abto anymore, but http://appmonsta.com still seems to be around. The Search API looks OK (I never experimented with it) but limited.
Good luck!