Error while Extracting Link from webpage using Python 3 - html

Let us consider the following:
<div class="more reviewdata">
<a onclick="bindreviewcontent('1660651',this,false,'I found this review of Star Health Insurance pretty useful',925075287,'.jpg','I found this review of Star Health Insurance pretty useful %23WriteShareWin','http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm','Star Health Insurance',' 2/5');" style="cursor:pointer">Read More</a>
</div>
From markup like the above, I want to extract only the HTTP link:
http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm
To do this, I wrote some code using BeautifulSoup and a regular expression in Python. The code is as follows:
import urllib.request
import re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://www.mouthshut.com/product-reviews/Star-Health-Insurance-reviews-925075287').read()
soup = BeautifulSoup(page, "html.parser")
required = soup.find_all("div", {"class": "more reviewdata"})
for link in re.findall('http://www.mouthshut.com/review/Star-Health-Insurance-review-[a-z]*', required):
    print(link)
On execution, the program threw an error as follows:
Traceback (most recent call last):
  File "E:/beautifulSoup20April2.py", line 11, in <module>
    for link in re.findall('http://www.mouthshut.com/review/Star-Health-Insurance-review-[a-z]*', required):
  File "C:\Program Files (x86)\Python35-32\lib\re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
Can someone suggest what should be done to extract the url alone without any error?

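First, find_all() returns a ResultSet (a list of tags), so you need to loop over required. Second, re.findall() only accepts a string or bytes, but you are passing it bs4 objects (the ResultSet, whose items are <class 'bs4.element.Tag'>); this is what Python was complaining about. So you need to get the HTML out of each bs4 element as a string, which can be done with prettify().
Here's a working version: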
import urllib.request
import re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://www.mouthshut.com/product-reviews/Star-Health-Insurance-reviews-925075287').read()
soup = BeautifulSoup(page, "html.parser")
required = soup.find_all("div", {"class": "more reviewdata"})
for div in required:
    for link in re.findall(r'http://www\.mouthshut\.com/review/Star-Health-Insurance-review-[a-z]*', div.prettify()):
        print(link)
Output:
http://www.mouthshut.com/review/Star-Health-Insurance-review-ommmnmpmqtm
http://www.mouthshut.com/review/Star-Health-Insurance-review-rmqulrolqtm
http://www.mouthshut.com/review/Star-Health-Insurance-review-ooqrupoootm
http://www.mouthshut.com/review/Star-Health-Insurance-review-rlrnnuslotm
http://www.mouthshut.com/review/Star-Health-Insurance-review-umqsquttntm
...
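The error comes down to what re.findall() is given: it only accepts a string or bytes. Here is a minimal, self-contained sketch of that point, using the snippet from the question as a plain string (in a real run this would be str(div) or div.prettify() for each tag). The name URL_IN_QUOTES and the choice to match any quoted URL instead of the hard-coded product name are assumptions for illustration, not part of the original answer.

```python
import re

# Markup copied from the question, held as a plain string.
snippet = """<div class="more reviewdata">
<a onclick="bindreviewcontent('1660651',this,false,'I found this review of Star Health Insurance pretty useful',925075287,'.jpg','I found this review of Star Health Insurance pretty useful %23WriteShareWin','http://www.mouthshut.com/review/Star-Health-Insurance-review-toqnmqrlrrm','Star Health Insurance',' 2/5');" style="cursor:pointer">Read More</a>
</div>"""

# Hypothetical pattern: capture any single-quoted http(s) URL argument.
URL_IN_QUOTES = re.compile(r"'(https?://[^']+)'")

# A string works: findall() returns the captured URL(s).
links = URL_IN_QUOTES.findall(snippet)
print(links)

# A non-string (here a list, standing in for a ResultSet) raises TypeError,
# which is the failure reported in the question.
error_seen = False
try:
    URL_IN_QUOTES.findall([snippet])
except TypeError as exc:
    error_seen = True
    print("TypeError:", exc)
```

Matching on the quotes avoids baking the product name into the pattern, at the cost of also picking up any other quoted URL that might appear in the same attribute.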

Related

Web-scraping a link from web-page

New to web scraping here. I basically want to extract a link from a web page into my Jupyter notebook.
Following is the code that I tried out:
from flask import Flask, render_template, request, jsonify
from flask_cors import CORS, cross_origin
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq
flipkart_url = "https://www.flipkart.com/search?q=" + 'acer-aspire-7-core-i5'
uClient = uReq(flipkart_url)
flipkartPage = uClient.read()
flipkart_html = bs(flipkartPage, "html.parser")
#Since I am only interested in the class "_1AtVbE col-12-12"
bigboxes = flipkart_html.findAll("div", {"class": "_1AtVbE col-12-12"})
Now here's the thing, I don't exactly understand what bigboxes is storing. The type of bigboxes is bs4.element.ResultSet, the length is 16.
Now if I run:
box = bigboxes[0]
productlink = "https://www.flipkart.com" + box.div.div.div.a['href']
I am getting an error. However when I run:
box = bigboxes[2]
productlink = "https://www.flipkart.com" + box.div.div.div.a['href']
I am successfully able to extract the link. Can someone please explain to me why the third element was able to read the link? I have a basic knowledge of HTML (at least I thought so) and I don't understand the layers to it. What exactly is bigboxes storing? Clearly, the HTML script shows no layers as such.
Your class filter is not specific enough.
The first and second elements of bigboxes are HTML nodes that do not contain the link, so somewhere along box.div.div.div.a a lookup returns None and the expression raises an error.
A more specific class to filter on could be: _13oc-S
bigboxes = flipkart_html.findAll("div", {"class": "_13oc-S"})
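To make the "some containers have no link" point concrete, here is a rough sketch that walks three containers and records None for the ones without an anchor. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs. The markup, the class name box, the href value and the class FirstLinkPerBox are all made up for illustration; they are not Flipkart's real page structure.

```python
from html.parser import HTMLParser


class FirstLinkPerBox(HTMLParser):
    """Record the first <a href> found inside each <div> with a given class."""

    def __init__(self, box_class):
        super().__init__()
        self.box_class = box_class
        self.div_depth = 0    # 0 means we are outside any matching box
        self.current = None   # first href seen in the current box
        self.results = []     # one entry per box; None if it held no link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1          # nested div inside a box
            elif self.box_class in (attrs.get("class") or "").split():
                self.div_depth = 1           # entering a new box
                self.current = None
        elif tag == "a" and self.div_depth and self.current is None:
            self.current = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
            if self.div_depth == 0:          # leaving the box
                self.results.append(self.current)


# Hypothetical page: the first two boxes hold no product link.
page = """
<div class="box"><div><span>Filters</span></div></div>
<div class="box"><div><p>Showing 1 of 24 results</p></div></div>
<div class="box"><div><div><div><a href="/acer-aspire-7/p/itm123">Acer Aspire 7</a></div></div></div></div>
"""

parser = FirstLinkPerBox("box")
parser.feed(page)
print(parser.results)

# Skip the boxes that had no link instead of indexing into them blindly.
links = ["https://www.flipkart.com" + href for href in parser.results if href]
print(links)
```

With BeautifulSoup itself, the equivalent guard would be to call box.find("a", href=True) on each element and skip the element when the result is None.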

I want to extract the html from a particular div class named "se_component_wrap sect_dsc __se_component_area"

I already have a working Python script, but I want to automate fetching the URL from a page. I just need all the HTML inside the div with class se_component_wrap sect_dsc __se_component_area, but currently I'm getting the HTML of the whole page.
from lxml import html
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
from fetch import *
my_url=('https://m.post.naver.com/viewer/postView.nhn?volumeNo=20163796&memberNo=29747755')
#Opening Client
uClient = uReq(my_url)
#Opening the client
page_html = uClient.read()
#Closing connection
uClient.close()
page_soup = soup(page_html, "html.parser")
clear_file=page_soup.prettify()
with open("test.txt","w", encoding="utf-8") as outp:
    outp.write(clear_file)
print (page_soup)
fetcher()
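I expect the output to be the HTML contained in that div instead of the complete page.
A straightforward way would be:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_html, "html.parser")
# The multi-class string must match the div's class attribute exactly
for div in soup.find_all('div', {'class': 'se_component_wrap sect_dsc __se_component_area'}):
    print(div)  # print() call syntax, since the question uses Python 3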

parse a json object in a div while scraping using beautifulsoup python

I am learning scraping. I need to access a JSON string that I encounter inside a div. I am using BeautifulSoup.
This is the JSON string I get in the div. I need the value (51.65) of the key "lastPrice". Please help. The JSON object is in json_d.
import pip
import requests
import json
from bs4 import BeautifulSoup
print ('hi')
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=NBCC&illiquid=0&smeFlag=0&itpFlag=0')
soup = BeautifulSoup(page.text, 'html.parser')
json_d = soup.find(id='responseDiv')
print ('bye')
import bs4
import json
r= '''
<div id="responseDiv" style="display:none">{"tradedDate":"07DEC2018","data":[{"pricebandupper":"58.35","symbol":"NBCC","applicableMargin":"15.35","bcEndDate":"14-SEP-18","totalSellQuantity":"40,722","adhocMargin":"-","companyName":"NBCC (India) Limited","marketType":"N","exDate":"06-SEP-18","bcStartDate":"10-SEP-18","css_status_desc":"Listed","dayHigh":"53.55","basePrice":"53.05","securityVar":"10.35","pricebandlower":"47.75","sellQuantity5":"-","sellQuantity4":"-","sellQuantity3":"-","cm_adj_high_dt":"08-DEC-17","sellQuantity2":"-","dayLow":"51.55","sellQuantity1":"40,722","quantityTraded":"71,35,742","pChange":"-2.64","totalTradedValue":"3,714.15","deliveryToTradedQuantity":"40.23","totalBuyQuantity":"-","averagePrice":"52.05","indexVar":"-","cm_ffm":"2,424.24","purpose":"ANNUAL GENERAL MEETING\/DIVIDEND RE 0.56 PER SHARE","buyPrice2":"-","secDate":"7DEC2018","buyPrice1":"-","high52":"266.00","previousClose":"53.05","ndEndDate":"-","low52":"50.80","buyPrice4":"-","buyPrice3":"-","recordDate":"-","deliveryQuantity":"28,70,753","buyPrice5":"-","priceBand":"No Band","extremeLossMargin":"5.00","cm_adj_low_dt":"26-OCT-18","varMargin":"10.35","sellPrice1":"51.80","sellPrice2":"-","totalTradedVolume":"71,35,742","sellPrice3":"-","sellPrice4":"-","sellPrice5":"-","change":"-1.40","surv_indicator":"-","ndStartDate":"-","buyQuantity4":"-","isExDateFlag":false,"buyQuantity3":"-","buyQuantity2":"-","buyQuantity1":"-","series":"EQ","faceValue":"1.00","buyQuantity5":"-","closePrice":"51.80","open":"53.15","isinCode":"INE095N01031","lastPrice":"51.65"}],"optLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NBCC&instrument=-&date=-&segmentLink=17&symbolCount=2","otherSeries":["EQ"],"futLink":"\/live_market\/dynaContent\/live_watch\/get_quote\/GetQuoteFO.jsp?underlying=NBCC&instrument=FUTSTK&expiry=27DEC2018&type=-&strike=-","lastUpdateTime":"07-DEC-2018 15:59:59"}</div>'''
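html = bs4.BeautifulSoup(r, "html.parser")  # name the parser explicitly to avoid a bs4 warning
div_text = html.find('div', {'id': 'responseDiv'}).text  # raw JSON string inside the div
data = json.loads(div_text)
last_price = data['data'][0]['lastPrice']  # the string '51.65'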
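EDIT:
In your code, instead of
json_d = soup.find(id='responseDiv')
try changing to
json_d = soup.find('div', {'id': 'responseDiv'})
Then you should be able to do the following (json.loads() needs a string, so pass the tag's text rather than the tag itself):
data = json.loads(json_d.text)
last_price = data['data'][0]['lastPrice']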
See if that helps. I’m currently away from my computer until Tuesday so typing this up on my iPhone, so can’t test/play with it.
The other thing is that the page might only fill in this data with JavaScript after it has loaded. In that case, I think you'd need to look into the selenium or requests-html packages.
Again, I can’t look until Tuesday when I get back home to my laptop.
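One detail worth noting once the JSON is parsed: every value in this payload is a string, including numbers with thousands separators and "-" placeholders. Here is a small sketch of turning them into numbers, assuming a trimmed copy of the payload from the question. The helper name to_number is hypothetical, and treating "-" as missing is an assumption about what the site means by it.

```python
import json

# Trimmed copy of the JSON text found inside the responseDiv element.
payload = (
    '{"tradedDate":"07DEC2018","data":[{"symbol":"NBCC",'
    '"quantityTraded":"71,35,742","buyPrice1":"-","lastPrice":"51.65"}]}'
)


def to_number(value):
    """Convert '51.65' or '71,35,742' to float; '' or '-' becomes None."""
    cleaned = value.replace(",", "").strip()
    if cleaned in ("", "-"):
        return None  # assumed to mean "no value available"
    return float(cleaned)


# "data" is a list holding one dict per quote; take the first one.
quote = json.loads(payload)["data"][0]

last_price = to_number(quote["lastPrice"])
quantity = to_number(quote["quantityTraded"])
best_bid = to_number(quote["buyPrice1"])
print(last_price, quantity, best_bid)
```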

BeautifulSoup and prettify() function

To parse the HTML of a website, I decided to use the BeautifulSoup class and its prettify() method. I wrote the code below.
import requests
import bs4
response = requests.get("https://www.doviz.com")
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())
When I run this code in the Mac terminal, the output is not indented. On the other hand, if I run it in Windows cmd or PyCharm, everything is indented properly.
Do you know the reason for this?
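Try this code, which passes the decoded text (response.text) to BeautifulSoup instead of the raw bytes (response.content):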
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.doviz.com")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

How to isolate a part of HTML page in Python 3

I made a simple script to retrieve the source code of a page, but I'd like to "isolate" the part containing the IPs so that I can save them to a proxy.txt file. Any suggestions?
import urllib.request
sourcecode = urllib.request.urlopen("https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/")
sourcecode = str(sourcecode.read())
out_file = open("proxy.txt","w")
out_file.write(sourcecode)
out_file.close()
I've added a couple of lines to your code. The only problem is that a UI version number in the page (check the page source) also matches the pattern and is picked up as an IP address.
import urllib.request
import re
sourcecode = urllib.request.urlopen("https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/")
sourcecode = str(sourcecode.read())
out_file = open("proxy.txt","w")
out_file.write(sourcecode)
out_file.close()
with open('proxy.txt') as fp:
    for line in fp:
        # Raw string so that \d and \. reach the regex engine unchanged
        ip = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', line)
        for addr in ip:
            print(addr)
UPDATE:
This is what you are looking for. BeautifulSoup can extract only the data we need from the page using CSS classes; however, it needs to be installed with pip. You don't need to save the page to a file.
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen('https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/').read()
soup = BeautifulSoup(url, "html.parser")
# Searching the CSS class name
msg_content = soup.find_all("div", class_="messageContent")
ips = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', str(msg_content))
for addr in ips:
    print(addr)
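The pattern above accepts any four groups of one to three digits, so it also matches things that cannot be IPv4 addresses. Here is a hedged sketch of a stricter variant using only the standard library. The names IPV4_CANDIDATE and extract_ipv4 and the sample text are invented for illustration. Note that a four-part version number such as 1.5.1.0 is still a syntactically valid address, so restricting the search to the message content, as done above, remains the way to keep it out.

```python
import ipaddress
import re

# Four dot-separated digit groups, not glued to further digits or dots
# (so "1.2.3.4.5" is rejected as a whole). A dot right after the address,
# such as a sentence-ending period, also blocks the match.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def extract_ipv4(text):
    """Return the substrings of text that parse as IPv4 addresses."""
    found = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects octets above 255
        except ValueError:
            continue
        found.append(candidate)
    return found


# Made-up sample line in the style of a proxy list post.
sample = "socks 66.85.131.18:1080 | 999.300.1.2 | build 1.2.3.4.5 | 192.168.0.7"
print(extract_ipv4(sample))
```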
Why not use re?
I would need to see the page's source code to say exactly how.