Select Next node in Python with XPath - html

I am trying to scrape population information from Wikipedia country pages. The trouble I am having is that the node I am trying to scrape contains no information referring to population; instead, population is only referenced in the node before it. So, using XPath, I am trying to get the expression to move to the next node, but I can't find the correct command.
For example for the following page:
https://en.wikipedia.org/wiki/Afghanistan
Below is an xpath expression that gets me to the node before the population number I want to scrape:
//table[@class='infobox geography vcard']//tr[@class='mergedtoprow']//a[contains(@href,"Demographics")]/../..
It searches for an href in the table that contains "Demographics", then goes up two levels to the parent's parent. But the problem is that the title is in a different node from the number I want to extract, so I need something that can go to the next node.
I have seen the expression /following-sibling::div[1] but it doesn't seem to work for my expression and I don't know why.
If anyone can think of a more direct way of finding the node in the above web page that would be good too.
Thanks
Edit:
Below is the Python code I am using
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib.parse import urljoin


class CountryinfoSpider(scrapy.Spider):
    name = 'CountryInfo'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']

    def parse(self, response):
        ## Extract all country names
        countries = response.xpath('//table//b//@title').extract()
        for country in countries:
            url = response.xpath('//table//a[@title="' + country + '"]/@href').extract_first()
            capital = response.xpath('//table//a[@title="' + country + '"]/../..//i/a/@title').extract()
            absolute_url = urljoin('https://en.wikipedia.org/', url)
            yield Request(absolute_url, callback=self.parse_country)

    def parse_country(self, response):
        test = response.xpath('//table[@class="infobox geography vcard"]//tr[@class="mergedtoprow"]//a[contains(@href, "Demographics")]/../..').extract()
        yield {'Test': test}
It's a little more complicated than I explained, but: I go to the website "List of sovereign states in the 2020s" and copy the country names, capitals and URLs. Then I go into each URL, after joining it to Wikipedia, and try to use the XPath expression I am working on to pull the population.
Thanks

I think the general answer to your question is: "predicates can be nested".
//table[
  @class='infobox geography vcard'
]//tr[
  @class='mergedtoprow' and .//a[contains(@href, "Demographics")]
]/following-sibling::tr[1]/td/text()[1]
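Plugged into the spider from the question, a minimal sketch of parse_country using that expression might look like the following (the Population key is my own naming, not something taken from the page):

def parse_country(self, response):
    # Find the 'mergedtoprow' <tr> whose link points at the Demographics
    # article, then take the first text node of the <td> in the row
    # immediately following it.
    population = response.xpath(
        '//table[@class="infobox geography vcard"]'
        '//tr[@class="mergedtoprow" and .//a[contains(@href, "Demographics")]]'
        '/following-sibling::tr[1]/td/text()[1]'
    ).extract_first()
    yield {'Population': population}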

Related

Web scraping an "onclick" object table on a website with python

I am trying to scrape the data for this link: page.
If you click the up arrow, you will notice the highlighted days in the month sections. Clicking on a highlighted day, a table with initiated tenders for that day appears. All I need to do is get the data in each table for each highlighted day in the calendar. There might be one or more tenders (up to a max of 7) per day.
Table appears on click
I have done some web scraping with bs4; however, I think this is a job for Selenium (please correct me if I am wrong), with which I am not very familiar.
So far, I have managed to find the arrow element by XPath to navigate around the calendar and show me more months. After that I tried clicking on a random day (in the code below I clicked on 30.03.2020), upon which an HTML object called "tenders-table cloned" appears in the HTML on inspect. The object name stays the same no matter what day you click on.
I am pretty stuck now; I have tried to select, iterate over, and/or print what is inside that object table, and it either says that the object is not iterable or is None.
from selenium import webdriver

chrome_path = r"C:\Users\<name>\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.ibex.bg/bg/данни-за-пазара/централизиран-пазар-за-двустранни-договори/търговски-календар/")
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[1]/div/i""").click()
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[2]/div[1]/div[3]/table/tbody/tr[6]/td[1]""").click()
Please advice how I can proceed to extract the data from the table pop-up.
Please try the solution below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.maximize_window()
wait = WebDriverWait(driver, 20)
elemnt = wait.until(EC.presence_of_element_located((By.XPATH, "//body/div[@id='wrapper']/div[@id='content']/div[@class='tenders']/div[@class='form-group']/div[1]/div[1]//i")))
elemnt.click()
elemnt1 = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='form-group']//div[1]//div[3]//table[1]//tbody[1]//tr[6]//td[1]")))
elemnt1.click()
lists = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table[@class='tenders-table cloned']")))
for element in lists:
    print(element.text)
Well, I see there's no reason to use Selenium for such a case, as it will slow down your task.
The website is loaded with a JavaScript event which renders its data dynamically once the page loads.
The requests library will not be able to render JavaScript on the fly, so you can use selenium or requests_html; indeed, there are a lot of modules that can do that.
Now, we do have another option on the table: to track from where the data is rendered. I was able to locate the XHR request which is used to retrieve the data from the back-end API and render it to the user's side.
You can find the XHR request by opening Developer Tools, checking the Network tab, and looking at the XHR/JS requests made, depending on the type of call, such as fetch.
import requests
import json

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    print(json.dumps(r, indent=4))  # to see it in nice format.
    print(r.keys())

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Because I am just a lazy coder, I will do it this way:
import requests
import re
import pandas as pd
import ast
from datetime import datetime

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    matches = set(re.findall(r"tender_date': '([^']*)'", str(r)))
    sort = sorted(matches, key=lambda k: datetime.strptime(k, '%d.%m.%Y'))
    print(f"Available Dates: {sort}")
    opa = re.findall(r"({\'id.*?})", str(r))
    convert = [ast.literal_eval(x) for x in opa]
    df = pd.DataFrame(convert)
    print(df)
    df.to_csv("data.csv", index=False)

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Output: view-online

How do I have beautiful soup read in the html fully? Possibly selenium issue?

I am trying to get some practice with Beautiful Soup, web scraping and Python, but I am struggling to get the data from certain tags. I am trying to go through multiple pages of data on cars.com.
When I read in the HTML, the tags I need are
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
because the page number is in between them, and in order for me to loop through the website's pages, I need to know the max page count.
from bs4 import BeautifulSoup
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042'

source = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&zc=21042').text
source = requests.get(url).content
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())

link = soup.find(word_="cars-shop-srp-pagination")
# linkNext = link.find('a')
print(link)
When I go through the output, the only thing I see for "cars-shop-srp-pagination" is
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
when I need to see all of the code inside of them; specifically, I want to get to:
*"<li ng-if="showLast"> <a class="js-last-page" ng-click="goToPage($event, numberOfPages)">50</a> </li>"*
Remember that BeautifulSoup only parses the HTML/XML code that you give it. If the page number isn't in your captured HTML code in the first place, then that's a problem with capturing the code properly, not with BeautifulSoup. Unfortunately, I think this data is dynamically generated.
I found a work-around, though. Notice that at the top of the search results, the page says "(some number of cars) matches near you". For example:
<div class="matchcount">
  <span class="filter-count">1,711</span>
  <span class="filter-text"> matches near you</span>
</div>
You could capture this number, then divide by the number of results per page being displayed. In fact, this latter number can be passed into the URL. Note that you have to round up to the nearest integer to catch the search results that show up on the final page, and any commas in numbers over 999 have to be removed from the string before you can int() it.
from bs4 import BeautifulSoup
import urllib2
import math
perpage = 100
url = 'https://www.cars.com/for-sale/searchresults.action/'
url += '?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=%d' % perpage
url += '&prMx=25000&searchSource=PAGINATION&sort=relevance&zc=21042'
response = urllib2.urlopen(url)
source = response.read()
soup = BeautifulSoup(source, 'lxml')
count_tag = soup.find('span', {'class' : 'filter-count'})
count = int(count_tag.text.replace(',',''))
pages = int(math.ceil(1.0* count / perpage))
print(pages)
One catch to this, however, is that if the search isn't refined enough, the website will say something like "Over 30 thousand matches", which is not an integer.
Also, I was getting a 503 response from requests.get(), so I switched to using urllib2 to get the HTML.
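If you need to guard against that non-integer case, a minimal sketch of a fallback (reusing count_tag from the snippet above; treating the count as unknown is my choice):

raw = count_tag.text.replace(',', '')
try:
    count = int(raw)
except ValueError:
    # e.g. "Over 30 thousand" -- not a plain integer; refine the search
    # with narrower filters until an exact count is shown.
    count = None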
All that info (number of results, number of pages, results per page) is stored in a JavaScript dictionary within the returned content. You can simply regex out the object and parse it with json. Note that the URL is a query string and you can alter the results-per-page count in it. So, after doing an initial request to determine how many results there are, you can perform calculations and make any other changes. Note that you may also be able to use json throughout and skip BeautifulSoup. Though I think there would be a limit (perhaps the 20) to grabbing results as shown below from each page, so it is probably better to request 100 results per page, make the initial request, regex out the info, and, if there are more than 100 results, loop over pages, altering the URL, to collect the rest of the results.
I don't think, regardless of the number of pages indicated/calculated, that you can actually go beyond page 50.
import requests
import re
import json
p = re.compile(r'digitalData = (.*?);')
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042')
data = json.loads(p.findall(r.text)[0])
num_results_returned = data['page']['search']['numResultsReturned']
total_num_pages = data['page']['search']['totalNumPages']
num_results_on_page = data['page']['search']['numResultsOnPage']
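Building on that, a minimal sketch of the follow-up loop might look like this (the perPage=100 value and the page-50 cap come from the notes above; the helper name fetch_page is my own):

import requests
import re
import json

p = re.compile(r'digitalData = (.*?);')
base = ('https://www.cars.com/for-sale/searchresults.action/'
        '?dealerType=all&mkId=20089&perPage=100&prMx=25000&rd=99999'
        '&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042&page={}')

def fetch_page(page):
    # Regex the digitalData object out of the page's inline JavaScript.
    r = requests.get(base.format(page))
    return json.loads(p.findall(r.text)[0])

first = fetch_page(1)
total_pages = first['page']['search']['totalNumPages']

# The site does not appear to serve results past page 50, regardless of
# the calculated total, so cap the loop there.
for page in range(2, min(total_pages, 50) + 1):
    data = fetch_page(page)
    # ... extract what you need from data ...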

How to add w:altChunk and its relationship with python-docx

I have a use case that makes use of the <w:altChunk/> element in a Word document, injecting (fragments of) HTML files as alternate chunks and letting Word do its work when the file gets opened. The current implementation uses XML/XSL to compose the WordprocessingML, modify relationships, and do all the packaging manually, which is a real pain.
I wanted to move to python-docx, but the API doesn't support this directly. Currently I have found a way to add the <w:altChunk/> to the document XML, but I am still struggling to find a way to add the relationship and the related file to the package.
I think I should make a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement, qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, chunk_part):
    ''' TODO: figuring how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much, Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT

def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)

doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened using MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like: "altChunk%s" for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; should be easy to spot.
The content-type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see what they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common), you need to create each one separately. There would only be one relationship from a particular part; it's just that some parts are related to more than one other part. Check the relationships in an existing .docx to see, but it's a pretty good guess that there's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.

How to make a complex list into a dataframe in R?

I have a complex list which I got from a JSON file.
The JSON file was fetched from a map service API in China.
I searched the website to solve the problem, but I couldn't find a proper solution to my question, so I am putting it in this question and hope it can be solved.
If I am missing something that I didn't find on the website, I apologize for that.
The code to get the list is as follows:
library(rjson)
library(RCurl)
key<-"fd5a14632c36aecd2e759a0cc91a3b4a"
origin<-"大润发东环店"
urlorigin <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",origin,"&city=苏州",sep = "")
dataorigin<-readLines(urlorigin,encoding="UTF-8")
origininfo<-fromJSON(dataorigin)
originpoi<-origininfo$geocodes[[1]]$location
destination<-"苏州大学本部北门"
urldest <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",destination,"&city=苏州",sep = "")
datadest<-readLines(urldest,encoding="UTF-8")
destinfo<-fromJSON(datadest)
destpoi<-destinfo$geocodes[[1]]$location
urlpath <- paste("http://restapi.amap.com/v3/direction/driving?key=",key,"&origin=",originpoi,"&destination=",destpoi, "&originid=&destinationid=&extensions=all&strategy=0&waypoints=&avoidpolygons=&avoidroad=",sep = "")
pathjson<-paste(readLines(urlpath,encoding = "UTF-8"),collapse = "")
pathinfo<-fromJSON(pathjson)
The pathinfo is the list I get at the end, and I want to convert it into a dataframe that I can work with.
Thank you for your time.
I'm from China and my English is not that good; I apologize for that.
My Chinese is very limited as well, but your code to get the data is working (with some warnings).
pathinfo_df <- as.data.frame(lapply(pathinfo,rbind))
pathinfo_df is now a data frame.
summary(pathinfo_df)
 status   info   infocode   count
 1:1      OK:1   10000:1    1:1

 route.origin.Length       route.origin.Class       route.origin.Mode
 1                         -none-                   character

 route.destination.Length  route.destination.Class  route.destination.Mode
 1                         -none-                   character

 route.taxi_cost.Length    route.taxi_cost.Class    route.taxi_cost.Mode
 1                         -none-                   character

 route.paths.Length        route.paths.Class        route.paths.Mode
 1                         -none-                   list
So, there's plenty to select and play with. Read up on selecting from lists. See also:
str(pathinfo_df)
Then map it on Google Earth. Looks like the taxi might be costly. Have a good trip!

What is the difference between EmbeddedDocumentField and ReferenceField in mongoengine

Internally, what are the differences between these two fields? What kind of schema do these fields map to in mongo? Also, how should documents with relations be added to these fields? For example, if I use
from mongoengine import *

class User(Document):
    name = StringField()

class Comment(EmbeddedDocument):
    text = StringField()
    tag = StringField()

class Post(Document):
    title = StringField()
    author = ReferenceField(User)
    comments = ListField(EmbeddedDocumentField(Comment))
and call
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> comment.save()
>>>
should I use post.comments.append(comment) or post.comments += comment to append this document? My original question stems from this confusion as to how I should handle it.
An EmbeddedDocumentField is just part of the parent document, like a DictField, and is stored in one record with the parent document in mongo.
To save an EmbeddedDocument, just save the parent document.
>>> some_author = User.objects.get(name="ExampleUserName")
>>> post = Post.objects.get(author=some_author)
>>> post.comments
[]
>>> comment = Comment(text="cool post", tag="django")
>>> post.comments.append(comment)
>>> post.save()
>>> post.comments
[<Comment object __unicode__>]
>>> Post.objects.get(author=some_author).comments
[<Comment object __unicode__>]
See documentation: http://docs.mongoengine.org/guide/defining-documents.html#embedded-documents.
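To see the schema difference concretely, here is a minimal sketch using the models above (the database name example_db is hypothetical, and the printed output is abbreviated):

from mongoengine import connect

connect('example_db')  # hypothetical database name

user = User(name="ExampleUserName").save()             # stored in its own collection
post = Post(title="Example post", author=user).save()

post.comments.append(Comment(text="cool post", tag="django"))
post.save()  # embedded comments are saved with the parent document

# The raw document shows the difference: `author` holds only an
# ObjectId pointing into the `user` collection, while `comments`
# holds the full comment data inline.
print(post.to_mongo().to_dict())
# {'_id': ObjectId(...), 'title': 'Example post',
#  'author': ObjectId(...), 'comments': [{'text': 'cool post', 'tag': 'django'}]}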
This is just a sample case where we can use embedded docs.
Let's say, for example, you are going to create an app that takes in requirements as they come in and saves them in the db.
Now your requirement is to assign this requirement to a bunch of people, each at a later stage, after some processing of the requirement.
You also need to track the changes and log the activity pertaining to the processing that has taken place with regard to the requirement.
I know, I know, you might say we can use an rdbms-style relationship with a reference field. But that involves taking care of deleting obsolete records in either collection, which is nothing but extra code to handle the maintenance of the child collection in case the parent doc is deleted. (There are other extra efforts that come into play too.)
Instead, embedded documents are stored as part of the parent doc, so maintaining the parent covers the embedded docs too.
And it is easy to create complex JSON-structured data using embedded docs, rather than using application logic to manipulate and process the data into a complex structure.
Here the relation is one requirement to many handlers (which is nothing but an activity log by the handlers for the one requirement).