Google Places Search Dynamic Data Loading - google-maps

Hi, I am trying to search Google Places and the basic structure is fine:
https://plus.google.com/local/Leeds,%20United%20Kingdom/s/car%20rental
But additional data is loaded as you scroll down the page. I have used Fiddler to check the request, but I cannot identify where the data being posted comes from.
Does anyone know how to simply load or start on page 2/3, or even load more than 10 results at once?

This will load results from 30 to 40. It responds in a format that I can't recognize, but it does what you want; you will need to be more specific if you want further assistance. Offset is the offset of the query, so 30 gets results from 30 to 40, 100 gets from 100 to 110, etc.
import requests
url = "https://plus.google.com/_/local/searchresultsonly"
offset = "30"  # 30 returns results starting at 30, 100 starts at 100, etc.
data = {
    "f.req": '[[[],[],"car rental","Leeds, United Kingdom",[],["Leeds, West Yorkshire, UK",[0,0,0,0]],null,0],[{}]]'.format(offset),
}
print(requests.post(url, data=data).text)

Related

Why is my xpath parser only scraping the first dozen elements in a table?

I'm learning how to interact with websites via Python and I've been following this tutorial.
I wanted to know how XPath worked, which led me here.
I've made some adjustments to the code from the original site to select elements only if multiple conditions are met.
However, in both the original code from the tutorial and my own update I seem to be grabbing only a small section of the body. I was under the impression that xpath('//tbody/tr')[:10] would grab the entire body and let me work with 10 elements, in this case columns.
At the time of writing, running the code returns 2 hits, while doing the same thing with a manual copy and paste into Excel gives 94 hits.
Why am I only able to parse a few lines from the table and not all lines?
import requests
from lxml.html import fromstring
def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies
proxies = get_proxies()
print(proxies)
I figured it out: [:10] refers not to columns but to rows. Changing 10 to 20, 50 or 100 checks 10, 20, 50 or 100 rows.
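To check every row instead of only the first ten, the slice can simply be dropped; the loop inside get_proxies() then becomes (just a sketch, with the same column positions assumed):
    # iterate over every <tr> in the table body instead of only the first 10
    for i in parser.xpath('//tbody/tr'):
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)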

Downloading data from multiple-pages website with requests method in Python

I have the API documentation of the website http://json-homework.task-sss.krasilnikov.spb.ru/docs/9f66a575a6cfaaf7e43177317461d057 (which is only in Russian, unfortunately, but I'll try to explain), and I need to import the data about the group members from there. The issue is that the page parameter returns only 5 members, and when you increase the page number, it only returns the next 5 members instead of adding them to the previous five. Here is my code:
import pandas as pd
import requests as rq
import json
from pandas.io.json import json_normalize
url='http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page=1'
data=rq.get(url)
data1=json.loads(data.text)
data1=json_normalize(json.loads(data.text)["response"])
data1
and here is what my output looks like:
By entering bigger and bigger numbers, I also found out that the last part of the data is on page 41, i.e. I need to get the data from pages 1 to 41. How can I include all the pages in my code? Maybe it is possible with some loop or something like that, I don't know...
According to the API documentation, there is no parameter to specify how many users to fetch per page, so you will have to get them 5 at a time, and since there are 41 pages you can just loop through the URLs.
import requests as rq
import json
all_users = []
for page in range(1, 42):
    url = f'http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={page}'
    data = rq.get(url)
    all_users.append(json.loads(data.text)["response"])
The above implementation will, of course, not check for any API throttling, i.e. the API may give unexpected data if too many requests are made in a very short duration, which you can mitigate using some well-placed delays.
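If you also want everything in a single DataFrame, as in your original json_normalize call, you could flatten the collected pages afterwards. This is just a sketch and assumes each "response" entry is a list of member records; a time.sleep inside the loop above is one way to add the delays mentioned:
import pandas as pd
# flatten the 41 per-page lists collected above into one list of member records
flat_users = [user for page_users in all_users for user in page_users]
members = pd.json_normalize(flat_users)  # one row per group member
print(members.shape)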

Squarespace Change pageSize limit on getting blog posts

I'm working on a Squarespace website. I have a blog with several posts (36 at the moment), and using an AJAX call I parse all the posts with the following URL (url to parse). The problem is that SQS returns only 20 items, and the remaining items have to be fetched again with the offset returned:
"pagination": {
"nextPage": true,
"nextPageOffset": 1518167880210,
"nextPageUrl": "/timeline-list-v7/?offset=1518167880210",
"pageSize": 20
},
So if I have 100 or 500 posts, I have to make one AJAX call per 20 posts (5 or 25 calls)? The SQS forums don't give a solution for that. Is there any param that I can give to the URL that might return more items than 20?
Thanks.
I know of no parameter that can return more results than what the collection's pagesize property is set to.
However, there are ways to get more than 20 results, both of which require developer mode to be enabled.
The first option is to set the collection's pagesize property in the .conf file to a number higher than 20. That should cause your requests to return that number of items.
"pageSize" : 999,
"forcePageSize" : true
Keep in mind that increasing the pagesize in this way may increase page load times within that collection.
The second option is to use a custom query tag (<squarespace:query>) and embed a <script> within its scope. Within the query, you could set the limit to up to 100. The script would then have access to the collection data and could store it on the global window object for use by another script outside that context (for example). But this will only help you up to 100 results, not 500.
If neither of those work (both require dev. mode), then I think you are left with a recursive AJAX request as your only option...one that continues to pull item data 20 at a time until all items are gathered.
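For that last, paginated approach, the same idea can also be followed outside the browser. Here is a rough Python sketch that walks the pagination block shown in the question; the site URL is hypothetical, and the ?format=json parameter and the "items" key are assumptions about what your collection returns, so check them against the actual response:
import requests
BASE = "https://example-site.squarespace.com"  # hypothetical site root
path = "/timeline-list-v7/?format=json"        # assumed to return the same JSON as the AJAX call
posts = []
while path:
    data = requests.get(BASE + path).json()
    posts.extend(data.get("items", []))        # "items" is assumed to hold the posts for this page
    pagination = data.get("pagination", {})
    if pagination.get("nextPage"):
        # keep following nextPageUrl, 20 posts at a time, until nextPage is false
        path = pagination["nextPageUrl"] + "&format=json"
    else:
        path = None
print(len(posts), "posts collected")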
Hope those ideas help.

How to change AJAX code on a webpage that would remain after refreshing?

I wanted to crawl metal-archives.com and put the info in a database about metal bands. After looking at the code for a good 20 minutes I figured they keep the data in a JSON file that can be accessed with this URL. The only problem is that the AJAX code is set to show only 200 entries per page:
$(document).ready(function() {
    createGrid(
        "#searchResults", 200,
At the top of the file I can see there are more than 11,000 bands, but only 200 showing. Also, when I click the different pages AJAX takes care of fetching the data dynamically, without changing the URL in the address bar, so I couldn't see the rest of the bands.
Then I tried changing the code above to "#searchResults", 1000 hoping it would remain after refreshing, but, alas, no luck. Any idea how I could do that, essentially make it possible to parse the entire JSON to a Python dictionary and create a DB?
As the URL always returns 200 records, you can call this URL in a loop until you get all the records.
Step 1:
Using the below url, pass iDisplayStart=0 and get first 200 records,
http://www.metal-archives.com/search/ajax-band-search/?iDisplayStart=0&iDisplayLength=200
Step 2:
Parse the JSON, get the value of iTotalRecords, and call the URL again and again in a loop until you get all the records.
Increase iDisplayStart by 200 on each iteration (iDisplayStart += 200) to fetch the next 200 records, as below,
http://www.metal-archives.com/search/ajax-band-search/?iDisplayStart=200&iDisplayLength=200
and then,
http://www.metal-archives.com/search/ajax-band-search/?iDisplayStart=400&iDisplayLength=200
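A rough sketch of that loop in Python; iDisplayStart, iDisplayLength and iTotalRecords come from the URLs above, while the "aaData" key for the rows is only my assumption (the usual DataTables convention), so verify it against the actual response:
import requests
url = "http://www.metal-archives.com/search/ajax-band-search/"
bands = []
start = 0
total = None
while total is None or start < total:
    resp = requests.get(url, params={"iDisplayStart": start, "iDisplayLength": 200}).json()
    total = resp["iTotalRecords"]         # total number of bands reported by the endpoint
    bands.extend(resp.get("aaData", []))  # "aaData" is assumed to hold the 200 rows per response
    start += 200                          # move on to the next block of 200 records
print(len(bands), "of", total, "records collected")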
Hope it helps you.

How to get all resource record sets from Route 53 with boto?

I'm writing a script that makes sure that all of our EC2 instances have DNS entries, and all of the DNS entries point to valid EC2 instances.
My approach is to try and get all of the resource records for our domain so that I can iterate through the list when checking for an instance's name.
However, getting all of the resource records doesn't seem to be a very straightforward thing to do! The documentation for GET ListResourceRecordSets seems to suggest it might do what I want and the boto equivalent seems to be get_all_rrsets ... but it doesn't seem to work as I would expect.
For example, if I go:
r53 = boto.connect_route53()
zones = r53.get_zones()
fooA = r53.get_all_rrsets(zones[0].id, name="a")
then I get 100 results. If I then go:
fooB = r53.get_all_rrsets(zones[0].id, name="b")
I get the same 100 results. Have I misunderstood and get_all_rrsets does not map onto ListResourceRecordSets?
Any suggestions on how I can get all of the records out of Route 53?
Update: cli53 (https://github.com/barnybug/cli53/blob/master/cli53/client.py) is able to do this through its feature to export a Route 53 zone in BIND format (cmd_export). However, my Python skills aren't strong enough to allow me to understand how that code works!
Thanks.
get_all_rrsets returns a ResourceRecordSets which derives from Python's list class. By default, 100 records are returned. So if you use the result as a list it will have 100 records. What you instead want to do is something like this:
r53records = r53.get_all_rrsets(zones[0].id)
for record in r53records:
    # do something with each record here
Alternatively if you want all of the records in a list:
records = [r for r in r53.get_all_rrsets(zones[0].id)]
When iterating, either with a for loop or a list comprehension, boto will fetch additional records 100 at a time, as needed.
This blog post from 2018 has a script which allows exporting in BIND format:
#!/usr/bin/env python3
# https://blog.jverkamp.com/2018/03/12/generating-zone-files-from-route53/
import boto3
import sys
route53 = boto3.client('route53')
paginate_hosted_zones = route53.get_paginator('list_hosted_zones')
paginate_resource_record_sets = route53.get_paginator('list_resource_record_sets')
domains = [domain.lower().rstrip('.') for domain in sys.argv[1:]]
for zone_page in paginate_hosted_zones.paginate():
    for zone in zone_page['HostedZones']:
        if domains and not zone['Name'].lower().rstrip('.') in domains:
            continue
        for record_page in paginate_resource_record_sets.paginate(HostedZoneId=zone['Id']):
            for record in record_page['ResourceRecordSets']:
                if record.get('ResourceRecords'):
                    for target in record['ResourceRecords']:
                        print(record['Name'], record['TTL'], 'IN', record['Type'], target['Value'], sep='\t')
                elif record.get('AliasTarget'):
                    print(record['Name'], 300, 'IN', record['Type'], record['AliasTarget']['DNSName'], '; ALIAS', sep='\t')
                else:
                    raise Exception('Unknown record type: {}'.format(record))
usage example:
./export-zone.py mydnszone.aws
mydnszone.aws. 300 IN A server.mydnszone.aws. ; ALIAS
mydnszone.aws. 86400 IN CAA 0 iodef "mailto:hostmaster@mydnszone.aws"
mydnszone.aws. 86400 IN CAA 128 issue "letsencrypt.org"
mydnszone.aws. 86400 IN MX 10 server.mydnszone.aws.
The output can be saved to a file and/or copied to the clipboard, and the Import zone file page then allows you to paste the data.
At the time of this writing the script was working fine using Python 3.9.