How to extract html links with a matching word - html

I am trying to build a crawler that reads URLs from a text file into a list, then scrapes each page for links containing "wp-". Unfortunately, I am stuck at the part where I scrape the page to check whether any of its links contain "wp-". I've tried a number of approaches, but nothing works. I've tried various forms of
//a[contains(@href, 'wp-')]
but it does not work. Any suggestions on how to get the matching for wp- working?
Here is my code so far
'''
#!/usr/bin/python
import urllib.request
import urlopen
# import urls into readable python file
f = open("url-list.txt", "r")
text = f.read()
# turn urls in file into a list by spliting it into lines
text_list = text.splitlines()
f.close()
#print(text_list) #dont need to show the links as list
#make list into variables
count = 0
for breakaway in text_list: #made iterate of a list to set their value
    count = count + 1
    print(count + 0, " Sending url-list to scraper...")
for url in //a[contains(@href, 'wp-')].extract():
    print(url)
'''
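An XPath expression cannot be written directly as Python syntax; it has to be passed as a string to a library that evaluates it against parsed HTML. As a rough sketch of the same idea without XPath, the block below uses only the standard library's html.parser to collect every href that contains "wp-". The names LinkCollector and find_links and the sample markup are made up for illustration, and fetching each URL from url-list.txt is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose href contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.needle in href:
            self.matches.append(href)


def find_links(html, needle="wp-"):
    """Return all link targets in `html` that contain `needle`."""
    parser = LinkCollector(needle)
    parser.feed(html)
    return parser.matches


# Hypothetical sample page, not taken from the question.
sample = '<a href="/wp-content/a.png">x</a><a href="/about">y</a>'
print(find_links(sample))  # ['/wp-content/a.png']
```

To use this in the crawler, the HTML of each URL in text_list would be downloaded (for example with urllib.request.urlopen) and passed to find_links.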

Related

JSONDecodeError: Expecting value: line 1 column 1 (char 0) while getting data from Pokemon API

I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd
def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    #GATHERING THE DATA FROM API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
    #MAKING A DATAFRAME FROM GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
The code works fine for most pokemon. However, when I try to run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it gives me the JSONDecodeError from the title.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue previously occurred with another ID; otherwise I could just manually enter the stats for the pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, the JSON data for some pokemon only loads when the URL ends with a '/' (for example, https://pokeapi.co/api/v2/pokemon/229/ works, while https://pokeapi.co/api/v2/pokemon/229 returns not found). However, other URLs respond with an error because of the added '/', so I fixed the issue with a few if statements right after the start of the for loop at the beginning of the function.
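The answer does not show its if statements, so the following is only a sketch of one way to handle both URL forms, not the author's actual fix. It tries the URL with a trailing '/' first and falls back to the one without it whenever the response cannot be decoded as JSON. It uses urllib from the standard library instead of requests, and the names fetch_pokemon, http_get_text and the get_text parameter are hypothetical.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

BASE_URL = "https://pokeapi.co/api/v2/pokemon/"


def http_get_text(url):
    """Download a URL and return the body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def fetch_pokemon(id_, get_text=http_get_text):
    """Return the decoded JSON for one pokemon, or None if both URL forms fail."""
    for url in (f"{BASE_URL}{id_}/", f"{BASE_URL}{id_}"):
        try:
            # json.JSONDecodeError is a subclass of ValueError;
            # HTTP errors such as 404 are raised as URLError subclasses.
            return json.loads(get_text(url))
        except (ValueError, URLError):
            continue
    return None
```

Inside poke_scrape, the loop body could then call fetch_pokemon(id_) and skip the ID when it returns None.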

Python web-scraping output

I want to make a script that prints the links to results in bing search to the console. The problem is that when I run the script there is no output. I believe the website thinks I am a bot?
from bs4 import BeautifulSoup
import requests
search = input("search for:")
params = {"q": "search"}
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("Li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
You need to use the search variable instead of "search". You also have a typo in your script: li is lower case.
Change these lines:
params = {"q": "search"}
.......
links = results.find_all("Li", {"class": "b_algo"})
To this:
params = {"q": search}
........
links = results.find_all("li", {"class": "b_algo"})
Note that some queries don't return anything. "crossword" has results, but "peanut" does not. The result page structure may be different based on the query.
There are 2 issues in this code:
search is a variable name, so it should not be used with quotes. Change it as below:
params = {"q": search}
When you put the variable name inside quotes, the query becomes the static string "search". With the variable passed through params, requests builds the query string dynamically, so the URL itself can stay as it is:
r = requests.get("http://www.bing.com/search", params=params)
After making these changes, if you still do not get any output, check that you are using the correct tag in the results variable.
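The difference between quoting the name and passing the variable can be seen by building the query string by hand. This is a small sketch using urllib.parse.urlencode from the standard library; requests encodes the params dictionary in a similar way, but the exact URL it sends is not verified here. The helper name build_search_url is made up.

```python
from urllib.parse import urlencode

BING_SEARCH_URL = "http://www.bing.com/search"


def build_search_url(query):
    """Append the query as the q parameter of the search URL."""
    return BING_SEARCH_URL + "?" + urlencode({"q": query})


search = "crossword"
# Quoted name: the literal word "search" is what gets sent.
print(build_search_url("search"))  # http://www.bing.com/search?q=search
# Variable: the user's input is what gets sent.
print(build_search_url(search))    # http://www.bing.com/search?q=crossword
```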

Scrape table from webpage when in <div> format - using Beautiful Soup

So I'm aiming to scrape 2 tables (in different formats) from a website, https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate, after using the search bar, and to iterate this over a list of license codes. I haven't fully included the loop yet, but I added it at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in 2 different formats, so I have to scrape them separately. As the Product Data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of it. The harder bit is extracting Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text, using the class function; however, I need to have it in tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV, so if you have any suggestions, they would be much appreciated, thank you!! Any other general tips to improve my code would be great too, as I am new to web-scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below
driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print certificate
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this is now working for Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However currently it is only printing 5 copies of the product data for the final license code which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]
def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about it in context of scraping.
import requests
import csv
from bs4 import BeautifulSoup
# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.
#Function to parse single data page based on single input code.
def get_data_by_code(code):
    # Parameters to build POST-request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST-request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable. Contains first's found [LABEL tag, with text="Status"] following sibling DIV text. Which is status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Returning found data as list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]
# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        #Writing list of values to file as single row.
        writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in Chrome dev tools "network" tab to have a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run chrome to click the "search" button, you need to forge request generated by this click. Same for any form and ajax.
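To make "forging the request" a bit more concrete, here is a small sketch that builds the same form POST with the standard library and inspects it without sending anything. The field names are the ones used in the answer's code; the helper name build_search_request is made up, and whether the site still accepts this form is not checked here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(code):
    """Build (but do not send) the form POST that the search button would trigger."""
    fields = [("type", "certificate"), ("code", code), ("submit", "Search")]
    body = urlencode(fields).encode("ascii")
    # Supplying data makes urllib treat the request as a POST.
    return Request("https://info.fsc.org/certificate.php", data=body)


req = build_search_request("C001777")
print(req.get_method())  # POST
print(req.data)          # b'type=certificate&code=C001777&submit=Search'
```

Comparing this body with what the browser's network tab shows for a real search is a quick way to check that no form field has been missed.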
well... you should sharpen your skills (:
df3=pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        ### HERE'S THE PROBLEM:
        # "soup" variable is declared inside of "get_data_by_code" function.
        # So you can't use it in outer context.
        table = soup.find_all('table')[0] # <--- you should move this line to
        #definition of "get_data_by_code" function and return it's value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    # "row" stands for the list of values the function returned before
    table = soup.find_all('table')[0]
    return dict(row=row, table=table)
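The calling loop then has to use the value returned for each code rather than a leftover soup variable. The sketch below shows only that loop structure, under a few assumptions: the scraper is passed in as a function, it returns a dictionary with a "row" list and a hypothetical "table_rows" list of already extracted product rows (instead of the raw table tag), and the fake scraper with its placeholder values exists purely so the example runs without network access.

```python
import csv
import io


def write_results(codes, get_data_by_code, cert_out, product_out):
    """Write one certificate row per code and all product rows for that code."""
    cert_writer = csv.writer(cert_out)
    product_writer = csv.writer(product_out)
    for code in codes:
        result = get_data_by_code(code)  # fresh data for THIS code
        cert_writer.writerow(result["row"])
        for product_row in result["table_rows"]:
            # Prefix each product row with its code so rows stay traceable.
            product_writer.writerow([code] + list(product_row))


def fake_get_data_by_code(code):
    """Stand-in for the real scraper; the values are placeholders."""
    return {
        "row": [code, "Valid"],
        "table_rows": [["product A", "species A"], ["product B", "species B"]],
    }


cert_buf, product_buf = io.StringIO(), io.StringIO()
write_results(["C001777", "C000002"], fake_get_data_by_code, cert_buf, product_buf)
print(cert_buf.getvalue())
print(product_buf.getvalue())
```

With real files, cert_out and product_out would be the two files opened for writing, and each code contributes its own product rows instead of repeating the last page parsed.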

string search returns none or []

I have several URLs that I want to open, go to a specific place in each page, and search for a specific name, but I'm only getting None or [] returned.
I have searched but cannot see an answer that is pertinent to my code.
from bs4 import BeautifulSoup
from urllib import request
webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")
Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/"+line.get('href'))
n = 0
e = len(Links)
while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    #search string
    print(soup.body.findAll(text='Ashburton'))
    n=n+1
I know it's in the last link found on the page.
Thanks in advance for any ideas or comments
If your output is only "[]", it means the result is a list (an empty one here, meaning nothing matched). To get a single item you have to index it: variable[index].
Try this:
print(soup.body.findAll(text='Ashburton')[0])
...though storing it in a variable first would be easier:
search = soup.body.findAll(text='Ashburton')
print(search[0])
This gives you the first item found (and raises an IndexError if the list is empty).
To print all found items you could use
search = soup.body.findAll(text='Ashburton')
for entry in search:
    print(entry)
Note that this is closer to pseudo-code than a tested example; I don't really know BeautifulSoup.
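One possible reason for the empty list, offered as a guess rather than a confirmed diagnosis: in BeautifulSoup, passing a plain string as text= only matches text nodes that equal that string exactly, while passing a compiled pattern such as re.compile('Ashburton') also matches when the name is part of a longer sentence. The sketch below illustrates the exact-versus-substring difference with the standard library's html.parser (not BeautifulSoup itself); the TextCollector class and the sample markup are made up.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text node in the document."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def text_nodes(html):
    parser = TextCollector()
    parser.feed(html)
    return parser.texts


# Hypothetical page fragment, not the real incident page.
sample = "<div id='IncidentDetailContainer'><p>Crews from Ashburton attended.</p></div>"
nodes = text_nodes(sample)

exact = [t for t in nodes if t == "Ashburton"]    # whole-string comparison
partial = [t for t in nodes if "Ashburton" in t]  # substring comparison
print(exact)    # []
print(partial)  # ['Crews from Ashburton attended.']
```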

Issue in scraping data from a html page using beautiful soup

I am scraping some data from a website and I am able to do so using the below referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","OEM","Device Name","Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"},text=True)
    items = soup.findAll('div',{"class":"title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(oem.string).encode('utf8').strip(),textcontent,unicode(price.string).encode('utf8').strip()])
Now, the issue is that 2 of the price values I am scraping have a different HTML structure than the rest. My output CSV shows "None" for those because of this. The normal HTML structure for a price on the webpage is
<span class="handset">
FREE to £79.99</span>
For those 2 values the structure is
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
The output I am getting right now displays None for the second HTML structure instead of Free on all tariffs. Also, the price value Free on all tariffs is enclosed in double quotes in the second structure, while it is outside any quotes in the first structure.
Please help me solve this issue. Pardon my ignorance, as I am new to programming.
Just detect those 2 items with an additional if statement:
if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.