Why is my xpath parser only scraping the first dozen elements in a table? - html

I'm learning how to interact with websites via python and I've been following this tutorial.
I wanted to know how xpath worked which led me here.
I've made some adjustments to the code from the original site to select elements only if multiple conditions are met.
However, with both the original code from the tutorial and my own update, I seem to be grabbing only a small section of the body. I was under the impression that xpath('//tbody/tr')[:10] would grab the entire body and let me interact with, in this case, 10 of its elements.
At the time of writing, the code returns 2 hits, while doing the same thing with a manual copy and paste into Excel gives 94 hits.
Why am I only able to parse a few lines from the table and not all of them?
import requests
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

proxies = get_proxies()
print(proxies)

I figured it out: [:10] refers to rows, not columns. Changing 10 to 20, 50 or 100 checks the first 10, 20, 50 or 100 rows respectively, and within those rows only the ones matching both td conditions are returned.
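The row-slicing behaviour is easy to verify offline with a small inline table (the rows below are fabricated), assuming lxml is installed:

```python
from lxml.html import fromstring

# A minimal stand-in for the proxy table (fabricated rows).
html = """
<table>
  <tbody>
    <tr><td>10.0.0.1</td><td>8080</td></tr>
    <tr><td>10.0.0.2</td><td>3128</td></tr>
    <tr><td>10.0.0.3</td><td>80</td></tr>
  </tbody>
</table>
"""

parser = fromstring(html)
rows = parser.xpath('//tbody/tr')           # every <tr> in the body: 3 rows
first_two = parser.xpath('//tbody/tr')[:2]  # the slice limits ROWS, not columns

print(len(rows), len(first_two))  # 3 2
```

Each `tr` element still exposes all of its `td` columns; the slice only limits how many rows are visited.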

Related

Downloading data from multiple-pages website with requests method in Python

I have the API documentation of the website http://json-homework.task-sss.krasilnikov.spb.ru/docs/9f66a575a6cfaaf7e43177317461d057 (which is only in Russian, unfortunately, but I'll try to explain), and I need to import the data about the group members from there. The issue is that the page parameter returns only 5 members, and increasing the page number returns just the next 5 members, without adding them to the previous five. Here is my code:
import pandas as pd
import requests as rq
import json
from pandas.io.json import json_normalize
url='http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page=1'
data=rq.get(url)
data1=json.loads(data.text)
data1=json_normalize(json.loads(data.text)["response"])
data1
My output contains only the five members from the single requested page. By entering bigger and bigger page numbers, I also found out that the last part of the data is on page 41, i.e. I need the data from pages 1 to 41. How can I include all the pages in my code? Maybe it is possible with some loop or something like that.
According to the API documentation, there is no parameter to specify how many users to fetch in one page, so you will have to get them 5 at a time, and since there are 41 pages you can just loop through the URLs.
import requests as rq
import json

all_users = []
for page in range(1, 42):
    url = f'http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={page}'
    data = rq.get(url)
    # extend() flattens each page's 5 members into one list
    all_users.extend(json.loads(data.text)["response"])
The above implementation will, of course, not check for any API throttling, i.e. the API may return unexpected data if too many requests are made in a very short duration, which you can mitigate using some well-placed delays.
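The loop-plus-delay pattern can be sketched with a stubbed fetch function standing in for the real API call (the function name and the user values below are made up for illustration):

```python
import time

def fetch_page(page):
    # Stand-in for rq.get(url).json()["response"]: returns 5 fake users per page.
    return [f"user_{(page - 1) * 5 + i}" for i in range(5)]

all_users = []
for page in range(1, 42):          # pages 1..41, as in the real API
    all_users.extend(fetch_page(page))
    time.sleep(0)                  # raise this (e.g. 0.5) to avoid throttling

print(len(all_users))  # 41 pages * 5 users = 205
```

The sleep sits inside the loop so the delay applies between every pair of requests, not once at the end.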

How do I have beautiful soup read in the html fully? Possibly selenium issue?

I am trying to get some practice with Beautiful Soup, web scraping and Python, but I am struggling to get data from certain tags. I am trying to go through multiple pages of data on cars.com.
When I read in the HTML, the tags I need are
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
because the page number sits between them, and in order to loop through the website's pages I need to know the maximum number of pages.
from bs4 import BeautifulSoup
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042'
source = requests.get(url).content
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())
link = soup.find('cars-shop-srp-pagination')  # find the custom pagination tag
# linkNext = link.find('a')
print(link)
When I go through the output, the only thing I see for "cars-shop-srp-pagination" is
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
when I need to see all of the code inside of them; specifically, I want to get to:
*"<li ng-if="showLast"> <a class="js-last-page" ng-click="goToPage($event, numberOfPages)">50</a> </li>"*
Remember that BeautifulSoup only parses the HTML/XML code you give it. If the page number isn't in your captured HTML in the first place, then that's a problem with capturing the code properly, not with BeautifulSoup. Unfortunately, I think this data is dynamically generated.
I found a work-around, though. Notice that at the top of the search results, the page says "(some number of cars) matches near you". For example:
<div class="matchcount">
<span class="filter-count">1,711</span>
<span class="filter-text"> matches near you</span>
You could capture this number, then divide by the number of results per page being displayed. In fact, this latter number can be passed into the URL. Note that you have to round up to the nearest integer to catch the search results that show up on the final page. Also, any commas in numbers over 999 have to be removed from the string before you can convert it with int().
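The count-and-round-up arithmetic looks like this (the sample count string is made up; on the real page it would come from the filter-count span):

```python
import math

count_text = "1,711"   # as scraped from the <span class="filter-count"> element
perpage = 100          # results per page, also passed in the URL

count = int(count_text.replace(',', ''))  # strip commas before int()
pages = math.ceil(count / perpage)        # round UP to catch the final partial page

print(count, pages)  # 1711 18
```

Plain integer division (`1711 // 100 == 17`) would silently drop the 11 results on the last page, which is why the ceiling is needed.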
from bs4 import BeautifulSoup
import urllib2
import math
perpage = 100
url = 'https://www.cars.com/for-sale/searchresults.action/'
url += '?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=%d' % perpage
url += '&prMx=25000&searchSource=PAGINATION&sort=relevance&zc=21042'
response = urllib2.urlopen(url)
source = response.read()
soup = BeautifulSoup(source, 'lxml')
count_tag = soup.find('span', {'class' : 'filter-count'})
count = int(count_tag.text.replace(',',''))
pages = int(math.ceil(1.0* count / perpage))
print(pages)
One catch to this however is that if the search isn't refined enough, the website will say something like "Over 30 thousand matches", which is not an integer.
Also, I was getting a 503 response from requests.get(), so I switched to using urllib2 to get the HTML.
All that info (number of results, number of pages, results per page) is stored in a JavaScript dictionary within the returned content. You can simply regex out the object and parse it with json. Note that the URL is a query string and you can alter the results-per-page count in it. So, after doing an initial request to determine how many results there are, you can perform calculations to make any other changes. Note that you may also be able to use json throughout and not BeautifulSoup. There would probably be a limit (perhaps the 20) when grabbing results as shown below from each page, so it is better to go with 100 results per page: make the initial request, regex out the info, and if there are more than 100 results, loop while altering the URL to collect the rest.
I don't think, regardless of the number of pages indicated/calculated, that you can actually go beyond page 50.
import requests
import re
import json
p = re.compile(r'digitalData = (.*?);')
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042')
data = json.loads(p.findall(r.text)[0])
num_results_returned = data['page']['search']['numResultsReturned']
total_num_pages = data['page']['search']['totalNumPages']
num_results_on_page = data['page']['search']['numResultsOnPage']
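The regex-plus-json step can be checked against a minimal stand-in for the page source (the embedded dictionary and its numbers below are fabricated):

```python
import re
import json

# Fake page source containing an embedded javascript dictionary,
# shaped like the real digitalData object.
page_source = 'var x = 1; digitalData = {"page": {"search": {"totalNumPages": 50}}}; var y = 2;'

p = re.compile(r'digitalData = (.*?);')
data = json.loads(p.findall(page_source)[0])

print(data['page']['search']['totalNumPages'])  # 50
```

The non-greedy `(.*?);` stops at the first semicolon after the object, so this works as long as the JSON itself contains no semicolons; for a more robust extraction you would need a proper brace-matching parse.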

How to get all resource record sets from Route 53 with boto?

I'm writing a script that makes sure that all of our EC2 instances have DNS entries, and all of the DNS entries point to valid EC2 instances.
My approach is to try and get all of the resource records for our domain so that I can iterate through the list when checking for an instance's name.
However, getting all of the resource records doesn't seem to be a very straightforward thing to do! The documentation for GET ListResourceRecordSets seems to suggest it might do what I want and the boto equivalent seems to be get_all_rrsets ... but it doesn't seem to work as I would expect.
For example, if I go:
r53 = boto.connect_route53()
zones = r53.get_zones()
fooA = r53.get_all_rrsets(zones[0].id, name="a")
then I get 100 results. If I then go:
fooB = r53.get_all_rrsets(zones[0].id, name="b")
I get the same 100 results. Have I misunderstood, and does get_all_rrsets not map onto ListResourceRecordSets?
Any suggestions on how I can get all of the records out of Route 53?
Update: cli53 (https://github.com/barnybug/cli53/blob/master/cli53/client.py) is able to do this through its feature to export a Route 53 zone in BIND format (cmd_export). However, my Python skills aren't strong enough to allow me to understand how that code works!
Thanks.
get_all_rrsets returns a ResourceRecordSets object, which derives from Python's list class. By default, 100 records are returned, so if you use the result as a plain list it will have 100 records. What you instead want to do is something like this:
r53records = r53.get_all_rrsets(zones[0].id)
for record in r53records:
    # do something with each record here
Alternatively, if you want all of the records in a list:
records = [r for r in r53.get_all_rrsets(zones[0].id)]
When iterating, either with a for loop or a list comprehension, boto will fetch additional records (up to) 100 at a time as needed.
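For the EC2 cross-check described in the question, the iterated records can be folded into a lookup set; here with stubbed record objects standing in for real boto results (the class and hostnames are fabricated for illustration):

```python
# Stand-ins for boto ResourceRecordSet entries (only .name and .type matter here).
class FakeRecord:
    def __init__(self, name, type):
        self.name = name
        self.type = type

r53records = [
    FakeRecord('web1.example.com.', 'A'),
    FakeRecord('web2.example.com.', 'A'),
    FakeRecord('example.com.', 'MX'),
]

# Build a set of A-record names, normalised without the trailing dot,
# so instance hostnames can be checked for membership.
dns_names = {r.name.rstrip('.') for r in r53records if r.type == 'A'}

print('web1.example.com' in dns_names)  # True
print('db1.example.com' in dns_names)   # False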
This blog post from 2018 has a script that exports zones in BIND format:
#!/usr/bin/env python3
# https://blog.jverkamp.com/2018/03/12/generating-zone-files-from-route53/
import boto3
import sys

route53 = boto3.client('route53')
paginate_hosted_zones = route53.get_paginator('list_hosted_zones')
paginate_resource_record_sets = route53.get_paginator('list_resource_record_sets')

domains = [domain.lower().rstrip('.') for domain in sys.argv[1:]]

for zone_page in paginate_hosted_zones.paginate():
    for zone in zone_page['HostedZones']:
        if domains and not zone['Name'].lower().rstrip('.') in domains:
            continue
        for record_page in paginate_resource_record_sets.paginate(HostedZoneId=zone['Id']):
            for record in record_page['ResourceRecordSets']:
                if record.get('ResourceRecords'):
                    for target in record['ResourceRecords']:
                        print(record['Name'], record['TTL'], 'IN', record['Type'], target['Value'], sep='\t')
                elif record.get('AliasTarget'):
                    print(record['Name'], 300, 'IN', record['Type'], record['AliasTarget']['DNSName'], '; ALIAS', sep='\t')
                else:
                    raise Exception('Unknown record type: {}'.format(record))
usage example:
./export-zone.py mydnszone.aws
mydnszone.aws. 300 IN A server.mydnszone.aws. ; ALIAS
mydnszone.aws. 86400 IN CAA 0 iodef "mailto:hostmaster@mydnszone.aws"
mydnszone.aws. 86400 IN CAA 128 issue "letsencrypt.org"
mydnszone.aws. 86400 IN MX 10 server.mydnszone.aws.
The output can be saved to a file and/or copied to the clipboard, and the Import zone file page in the Route 53 console lets you paste the data in.
At the time of writing, the script works fine with Python 3.9.

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope someone can help me with my requirement, if possible with pseudo code.
My idea:
Write a Python script to upload all the JSON files to CouchDB.
Each JSON file must become its own document, and the data present in
the JSON must be inserted into CouchDB unchanged
(the specified format with values in a file).
Note:
These JSON files are transactional; one file is generated every second,
so I need to read each file, upload it in the same format into CouchDB, and on
successful upload archive the file into a different folder on the local system.
Python program to parse the JSON files and insert them into CouchDB:
import errno
import glob
import json
from pprint import pprint

import couchdb

# couch = couchdb.Server()  # assumes localhost:5984
# couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']

path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
for name in glob.glob(path):  # 'file' is a builtin type; 'name' is a less-ambiguous variable name.
    try:
        with open(name) as f:  # No need to specify 'r': this is the default.
            data = json.load(f)
        db.save(data)  # each JSON file becomes its own document
        pprint(data)
    except IOError as exc:
        if exc.errno != errno.EISDIR:  # Do not fail if a directory is found, just ignore it.
            raise  # Propagate other kinds of IOError.
I would use the CouchDB bulk API, even though you have specified that you need to send the docs to the db one by one. For example, implementing a simple queue that gets sent out every, say, 5-10 seconds via a bulk doc call will greatly increase the performance of your application.
There is one quirk: for reads you need to know the IDs of the docs you want to get from the DB. But for the PUTs it is perfect. (And even that is not entirely true: you can get ranges of docs using a bulk operation if the IDs you use for your docs sort nicely.)
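The queue-and-flush idea might look like the sketch below, with a stub standing in for the bulk call (in the couchdb Python module the real call would be db.update(docs), which POSTs to _bulk_docs; the names and batch size here are illustrative):

```python
sent_batches = []

def bulk_send(docs):
    # Stand-in for db.update(docs) / a POST to /_bulk_docs.
    sent_batches.append(list(docs))

queue = []
BATCH_SIZE = 5  # or flush on a timer every 5-10 seconds

def flush():
    if queue:
        bulk_send(queue)
        queue.clear()

def enqueue(doc):
    queue.append(doc)
    if len(queue) >= BATCH_SIZE:
        flush()

for i in range(12):                  # 12 incoming transactional docs
    enqueue({'_id': f'doc-{i}', 'n': i})
flush()                              # flush whatever is left over

print([len(b) for b in sent_batches])  # [5, 5, 2]
```

The final flush matters: without it, docs that arrive after the last full batch would sit in the queue forever.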
From my experience working with CouchDB, I have a hunch that you are dealing with transactional documents in order to compile them into some sort of sum result and act on that data accordingly (maybe creating the next transactional doc in the series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get a reduce function working properly, and it is highly dependent on what you actually want to achieve and what data you are prepared to emit from the view, so I can't really provide more detail on that.
So in the end the app logic would go something like this:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got the first part wrong, about why you are using transactional docs, all that would really change in my app logic is the part where you get those transactional docs.
Also, before writing your own 'reduce' function, have a look at the built-in ones (they are a lot faster than anything outside the db engine can do):
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
EDIT:
Since you are starting, I strongly recommend to have a look at CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden pitfall (well, maybe not so hidden, but not an obvious thing for a newcomer to look out for in any case). When you write a reduce function, make sure it does not produce too much output for an unbounded query. That will extremely slow down the entire view, even when you supply reduce=false when querying it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
get new JSON doc from server
send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
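That loop could be fleshed out with stand-ins for the two network calls (get_new_doc and send_to_couchdb are hypothetical names; in the real script they would wrap the source server's HTTP API and a requests.post to CouchDB):

```python
incoming = [{'id': 1}, {'id': 2}, {'id': 3}]  # stand-in for docs arriving from the server
stored = []

def get_new_doc():
    # Stand-in for fetching the next JSON doc from the source server's HTTP API;
    # returns None when there are no more docs.
    return incoming.pop(0) if incoming else None

def send_to_couchdb(doc):
    # Stand-in for requests.post('http://localhost:5984/mydb', json=doc).
    stored.append(doc)

while True:
    doc = get_new_doc()
    if doc is None:
        break
    send_to_couchdb(doc)

print(len(stored))  # 3
```

Swapping the stubs for real HTTP calls keeps the loop structure of the pseudo-code intact.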
You might want to check out the pycouchdb module for Python 3. I've used it myself to upload lots of JSON objects into a CouchDB instance. My project does pretty much the same as you describe, so you can take a look at my project Pyro on Github for details.
My class looks like this:
import sys

import pycouchdb

class MyCouch:
    """ COMMUNICATES WITH COUCHDB SERVER """

    def __init__(self, server, port, user, password, database):
        # ESTABLISHING CONNECTION
        self.server = pycouchdb.Server("http://" + user + ":" + password + "@" + server + ":" + port + "/")
        self.db = self.server.database(database)

    def check_doc_rev(self, doc_id):
        # CHECKS REVISION OF SUPPLIED DOCUMENT
        try:
            rev = self.db.get(doc_id)
            return rev["_rev"]
        except Exception:
            return -1

    def update(self, all_computers):
        # UPDATES DATABASE WITH JSON STRING
        try:
            result = self.db.save_bulk(all_computers, transaction=False)
            sys.stdout.write(" Updating database")
            sys.stdout.flush()
            return result
        except Exception as ex:
            sys.stdout.write("Updating database")
            sys.stdout.write("Exception: ")
            print(ex)
            sys.stdout.flush()
            return None
Let me know in case of any questions - I will be more than glad to help if you will find some of my code usable.

Trouble getting a dynamodb scan to work with boto

I'm using boto to access a dynamodb table. Everything was going well until I tried to perform a scan operation.
I've tried a couple of syntaxes I've found after repeated searches of The Internet, but no luck:
def scanAssets(self, asset):
    results = self.table.scan({('asset', 'EQ', asset)})
-or-
    results = self.table.scan(scan_filter={'asset': boto.dynamodb.condition.EQ(asset)})
The attribute I'm scanning for is called 'asset', and asset is a string.
The odd thing is the table.scan call always ends up going through this function:
def dynamize_scan_filter(self, scan_filter):
    """
    Convert a layer2 scan_filter parameter into the
    structure required by Layer1.
    """
    d = None
    if scan_filter:
        d = {}
        for attr_name in scan_filter:
            condition = scan_filter[attr_name]
            d[attr_name] = condition.to_dict()
    return d
I'm not a python expert, but I don't see how this would work. I.e. what kind of structure would scan_filter have to be to get through this code?
Again, maybe I'm just calling it wrong. Any suggestions?
OK, it looks like I had an import problem. Simply using:
import boto
and specifying boto.dynamodb.condition doesn't cut it, because importing a package does not automatically import its subpackages. I had to add:
import boto.dynamodb.condition
to get the condition type picked up. My now-working code is:
results = self.table.scan(scan_filter={'asset': boto.dynamodb.condition.EQ(asset)})
It's working for me now. :-)
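This is standard Python behaviour, not anything boto-specific: importing a package binds only the package itself, and a submodule becomes an attribute of the package only once it is imported explicitly. A self-contained demonstration with a throwaway package (names are made up):

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package on disk: demo_pkg/__init__.py and demo_pkg/sub.py.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'demo_pkg'))
open(os.path.join(root, 'demo_pkg', '__init__.py'), 'w').close()
with open(os.path.join(root, 'demo_pkg', 'sub.py'), 'w') as f:
    f.write('VALUE = 42\n')
sys.path.insert(0, root)

import demo_pkg
print(hasattr(demo_pkg, 'sub'))          # False: the submodule was not loaded

importlib.import_module('demo_pkg.sub')  # like 'import boto.dynamodb.condition'
print(demo_pkg.sub.VALUE)                # 42: now it is an attribute of the package
```

The same rule explains why `import boto` alone left `boto.dynamodb.condition` undefined.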
Or you can do this
exclusive_start_key = None
while True:
    result_set = self.table.scan(
        asset__eq=asset,        # The scan filter is given explicitly here
        max_page_size=100,      # Number of entries per page
        limit=100,
        # You can divide the table into n segments so that processing
        # can be done in parallel and quickly.
        total_segments=number_of_segments,
        segment=segment,        # Specify which segment you want to process
        exclusive_start_key=exclusive_start_key  # Resume from the last key seen
    )
    dynamodb_items = list(result_set)  # materialise this page of results
    # Do something with your items, e.g. add them to a list for
    # later processing when you come out of the while loop
    exclusive_start_key = result_set._last_key_seen
    if not exclusive_start_key:
        break
This is applicable for any field.
Segmentation: suppose you have the above script in test.py.
You can then run the segments in parallel like this:
python test.py --segment=0 --total_segments=4
python test.py --segment=1 --total_segments=4
python test.py --segment=2 --total_segments=4
python test.py --segment=3 --total_segments=4
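The --segment/--total_segments flags shown above could be wired up with argparse (a sketch; the flag names mirror the commands above):

```python
import argparse

parser = argparse.ArgumentParser(description='Scan one segment of a DynamoDB table.')
parser.add_argument('--segment', type=int, required=True,
                    help='which segment this process should scan (0-based)')
parser.add_argument('--total_segments', type=int, default=4,
                    help='how many segments the table is divided into')

# In the real script this would be parser.parse_args(); a fixed argument
# list is used here so the sketch is self-contained.
args = parser.parse_args(['--segment', '2', '--total_segments', '4'])
print(args.segment, args.total_segments)  # 2 4
```

The parsed values would then be passed straight into the scan call as segment=args.segment and total_segments=args.total_segments.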
in different screens