Modify HTML content during Web Scraping - html

I'm trying to do some web scraping.
The objective is to collect all the approved repairers for a given postal code. The problem is that when I run my code, my list is empty, because the URL doesn't change according to the postal code. This is why I want to change the HTML value during the scrape.
I'm not sure how to do this. I tried using Selenium and XPath, but I wasn't able to find anything.
Here's the HTML code (the part in red is what I need to change):
EDIT: The goal is indeed to collect, across the pagination, the name and type of each repairer for the postal code; this is why I want to change the HTML content during the scrape.
This is the best I can do for the moment; I hope you will spot the error.

This input is in a form, which is good because Selenium has special functionalities to handle forms.
from selenium import webdriver
url = "https://www.maif.fr/services-en-ligne/consultationreparateurs/geolocaliserReparateur.action?view"
query = "whatever you want to put into the search box"
driver = webdriver.Chrome()
driver.get(url)
webform_input = driver.find_element_by_xpath("//input[@id='adresseInternaute']")
webform_input.send_keys(query)
webform_input.submit()
The key here is submit(). It submits the form that contains the input element, so you don't have to write an extra two lines just to find and click the search button.
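If you also need to pull the results out of the page after submitting, here is a rough sketch (not part of the original answer) that continues from the snippet above. It assumes the results are rendered into the same page after submit(); the wait and the li selector are placeholders you would have to adapt to the site's actual markup.
# Continues from the snippet above. A minimal sketch, assuming the results
# render into the page after submit(); the sleep and the "li" selector are
# assumptions, not taken from the real page.
import time
from bs4 import BeautifulSoup

time.sleep(5)  # crude wait; an explicit WebDriverWait condition would be more robust

soup = BeautifulSoup(driver.page_source, "html.parser")
for item in soup.find_all("li"):          # placeholder selector for one result entry
    print(item.get_text(strip=True))      # e.g. name and type of each repairer

driver.quit()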

Related

Python Web Scrape Index

I am VERY new to web scraping in any shape or form. I've been trying to get into Python, and I heard that web scraping was a good way to expose myself to it. So, after many Google searches, I finally settled on two highly recommended modules: Requests and BeautifulSoup. I've read up a fair amount on both and have a basic understanding of how to use them.
I found a very basic website (basic in that there isn't much content or javascript and the like, making parsing the HTML a lot easier) and I have the following code:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://www.basicwebs.co.uk/contact.htm').text)
for row in soup('div',{'id': 'Layer1'})[0].h2('font'):
    tds = row.text
    print tds
This code works. It produces the following result:
BASIC
WEBS
Contact details
Contact details
Which, if you spend a few minutes inspecting the code on this page, is the correct result (I assume). Now, the thing is, while this code works, what if I wanted to get a different part of the page? Like the little paragraph on the page that states "If you are interested in having a website designed and hosted by us, please contact us either by e-mail or telephone." - my understanding would be to simply change the index number to the corresponding header that this text is found under, but when I change it I get a message that the list index is out of range.
Can anybody help? (as simple as you can make it, if possible)
I'm using Python 2.7.8
The text you require is surrounded by a font tag with the attribute size=3, so one way to get it is to select the first occurrence, like this:
font_elements = soup('font', {'size': '3'})
if font_elements:
    print font_elements[0].text
RESULT:
If you are interested in having a website designed
and hosted by us, please contact us either by e-mail or telephone.
You can also do this directly:
soup('font',{'size': '3'})[0].text
However, I want to draw your attention to the mistake you made earlier:
soup('div', {'id': 'Layer1'})
This returns a list of all div tags with id='Layer1', of which there can be more than one. The HTML you were parsing happens to contain only one such element, so indexing past the first entry goes out of range.
You can use an interactive Python interpreter such as bpython or ipython to check what an object actually contains. Happy hacking!
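For example, here is a tiny sketch (not from the original answer) using the same page and filter as the question, just to see how many elements the filter really returns before indexing into it:
# Quick sanity check (Python 2, to match the question).
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.basicwebs.co.uk/contact.htm').text)
layers = soup('div', {'id': 'Layer1'})
print len(layers)                    # if this prints 1, layers[1] raises IndexError
print [tag.name for tag in layers]   # inspect what actually came back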
from urllib.request import urlopen
from bs4 import BeautifulSoup
web_address = 'http://www.basicwebs.co.uk/contact.htm'
html = urlopen(web_address)
bs = BeautifulSoup(html.read(), 'html.parser')
contact_info = bs.findAll('h2', {'align':'left'})[0]
for info in contact_info:
    print(info.get_text())

Passing info in a URL to Sharepoint

I'm new to SharePoint but I was wondering if there was a way to pass variables from an external website to a SharePoint Web Part via GET.
e.g.
http://mysharepointpage.com/sites?name=Jay&age=23 does not populate my name or age input fields in the SharePoint Web Part.
Note: I had to remove the <form method="GET"> tags because I kept getting an error advising:
<FORM> tags are not supported in the HTML specified in either the Content property or the Content Link property. You can
remove the <FORM> tag, or use the Page Viewer Web Part, which supports the HTML <FORM> tag. The Content property can
be modified in the Rich Text Editor or Source Editor. More about the Page Viewer Web Part
When I click on the link it tells me:
Cannot display help.
Technical details: HC not found.
I'm guessing this is why I can't retrieve the data via the URL.
A little more information on how to do this would be much appreciated.
Your query string looks correct:
Connect a Query String (URL) Filter Web Part to another Web Part
You will then need to implement JavaScript on the page to loop through all the keys and populate the form as required.
This link should help you with the JavaScript, or post what you are currently using: Getting query string values in JavaScript

Extract POST URL using GET

I am trying to simulate the functionality of a form on this website, but I don't know exactly what the POST URL looks like.
The link to the website is here:
selfservice.mypurdue.purdue.edu/prod/bwckschd.p_disp_dyn_sched, then click on Spring 2013. The request I am trying to replicate is the one sent when the user clicks Course Search and selects CS from the subject list.
You can look at the HTML to see the values they use in their POST request. How do I see what the values look like once the button is clicked? I am trying to replicate this and set the variables to the same values. What I need is a URL with all of the variables set to their respective values; I understand this can be done with a GET request. Can someone tell me how to extract this URL so I can proceed?
I edited the page using the Chrome inspector and changed the form action to GET; this is the URL that was displayed:
https://selfservice.mypurdue.purdue.edu/prod/bwckschd.p_get_crse_unsec?term_in=201320&sel_subj=dummy&sel_day=dummy&sel_schd=dummy&sel_insm=dummy&sel_camp=dummy&sel_levl=dummy&sel_sess=dummy&sel_instr=dummy&sel_ptrm=dummy&sel_attr=dummy&sel_subj=CS&sel_crse=&sel_title=&sel_schd=%25&sel_from_cred=&sel_to_cred=&sel_camp=%25&sel_ptrm=%25&sel_instr=%25&sel_sess=%25&sel_attr=%25&begin_hh=0&begin_mi=0&begin_ap=a&end_hh=0&end_mi=0&end_ap=a
However, this URL doesn't resolve, as the script is obviously expecting POST data.
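If the goal is to replicate the form from code rather than through the browser, one option (a sketch, not part of the original answer) is to send the same fields as POST data with Python's requests library. The field names and values below are taken straight from the query string above, with %25 decoded back to %:
# A rough sketch (untested against the live site): replicate the form
# submission as a POST instead of pasting the GET URL into the browser.
import requests

url = "https://selfservice.mypurdue.purdue.edu/prod/bwckschd.p_get_crse_unsec"

# A list of tuples is used because several keys (sel_subj, sel_schd, ...) repeat.
payload = [
    ("term_in", "201320"),
    ("sel_subj", "dummy"), ("sel_day", "dummy"), ("sel_schd", "dummy"),
    ("sel_insm", "dummy"), ("sel_camp", "dummy"), ("sel_levl", "dummy"),
    ("sel_sess", "dummy"), ("sel_instr", "dummy"), ("sel_ptrm", "dummy"),
    ("sel_attr", "dummy"),
    ("sel_subj", "CS"), ("sel_crse", ""), ("sel_title", ""),
    ("sel_schd", "%"), ("sel_from_cred", ""), ("sel_to_cred", ""),
    ("sel_camp", "%"), ("sel_ptrm", "%"), ("sel_instr", "%"),
    ("sel_sess", "%"), ("sel_attr", "%"),
    ("begin_hh", "0"), ("begin_mi", "0"), ("begin_ap", "a"),
    ("end_hh", "0"), ("end_mi", "0"), ("end_ap", "a"),
]

response = requests.post(url, data=payload)
print(response.status_code)
print(response.text[:500])  # first part of the returned schedule page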

Parsing a website and getting the info I need

Hi, so I need to retrieve the URL of the first article for a term I search on nytimes.com.
So if I search for Apple, this link would return the results:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link you will see that NYTimes asks you if you mean Apple Inc.
I want to get the url for this link, and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
So I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated and I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the urllib2 module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from them.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably the lesser of the two, is figuring out the mechanics of how to connect to a web page and get hold of its text inside your program. The second, and probably bigger, challenge is figuring out exactly how to extract the information you want from that text.

I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures, the company logo and headlines, plus menus, advertisements and all sorts of other stuff. I sincerely doubt that the NY Times, or almost any other commercial web site, is going to return a search page that contains nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe online" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, and so on, until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but the intuition you use to screen out the clutter can be difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
.First(ul => ul.Attributes["class"] != null
&& ul.Attributes["class"].Value == "results")
.Descendants("a")
.First()
.Attributes["href"].Value);
Note that if their website changes, this code might stop working.
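If you would rather stay in Python, here is a rough equivalent of the C# logic above (not from the original answers), using the urllib2 + BeautifulSoup combination shown earlier. It assumes the results list is a ul with class="results", as the C# code targets, and it will break if the markup differs:
# A rough Python 2 sketch of the same idea (untested): grab the first link
# inside the <ul class="results"> element that the C# code above targets.
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse")
soup = BeautifulSoup(page.read())

results = soup.find("ul", {"class": "results"})   # same element the C# code targets
if results is not None:
    first_link = results.find("a")                # first anchor inside the results list
    print first_link["href"]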

WPF, Frame Control, HTML DOM Document access

I've used WindowsFormsHost to host a WebBrowser control, and that has allowed me to access the WebBrowser's Document/DOM directly, to read HTML content via mouse clicks on HTML document elements and also to invoke submits on HTML forms. I never found a way, even in .NET 3.5, to do this when I was searching at the time. I've found this post http://rhizohm.net/irhetoric/blog/72/default.aspx and it looks like through some magic casting you can expose the DOM. But my question is: has anyone done this, and once you get the DOM, is it possible to invoke submits on HTML forms and also get HTML elements via mouse click events?
Has anyone tried this and been able to do both?
Thanks
I'm using WPF.
add a reference to:
Microsoft.mshtml
then:
var doc = (mshtml.HTMLDocument)_wbOne.Document;
and this gives you the raw string:
doc.documentElement.innerHTML
In return, if you know how to get information out of the HTML document, I'd appreciate it.
For example, getting all the <a>s and the <meta>s and whatever else might be gettable, so I can pull the information out of them? I don't want to dink around with the HTML, just get the info from it... :-)