I have a project that requires us to download and read a table from Wikipedia and use that information for calculations.
The Wikipedia page is https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate#Criticism_of_ranking_crime_data
We are required to take the Total violent crime figure for each city/state listed (states are repeated). However, all the cells in that column only have a plain tag on them, and it is all under one table. The question is: how would I use BeautifulSoup to read that specific column, which is under the Violent Crime header?
I have scoured the internet and landed on many suggestions from here and other websites, but they are not really helping in this specific case. Here is the code I currently have, which can take all the values from the table. Most variables are placeholders while I test, because I've been at this for a few days:
state = soup.find_all('th', limit=7)
for row in state:
    row_data = row.get_text(strip=True, separator='|').split('|')[0:1]
    outfile.write(str(row_data) + "\n")

number = soup.find_all('td')
for column in number:
    column_data = column.get_text(strip=True, separator='|').split('|')[0:1]
    outfile.write(str(column_data) + "\n")
I basically want to store this information in a list of sorts for later use, then use the links to each city to get their coordinates and compare them against a few cities in Texas to find the closest to the border.
We are only allowed to use BeautifulSoup and csv; no Pandas or NumPy.
Edit:
The write calls are only for testing as well; they're just to see whether it's grabbing the information from the table correctly. My IDE console can't display all of it, so writing it out was the next best thing I could think of.
Looks like it's just an issue of creating your lists. You can do this by initializing your list and then concatenating each result onto it, by appending each of the items inside your for loop, or, more concisely, with a list comprehension.
The reason you get nothing back is that you keep overwriting row_data and column_data inside the loop. The code will still write to file; however, it puts a newline after each cell, when I'm assuming you'd want to write the whole row and then a newline, so I'd also put your write after the list is created/complete.
Concatenating list to list:
row_data = []
for row in state:
    row_data = row_data + row.get_text(strip=True, separator='|').split('|')[0:1]
outfile.write(str(row_data) + "\n")

number = soup.find_all('td')
column_data = []
for column in number:
    column_data = column_data + column.get_text(strip=True, separator='|').split('|')[0:1]
outfile.write(str(column_data) + "\n")
Appending an item/element to a list:
# Initialize and then append to a list
row_data = []
for row in state:
    row_data.append(row.text)
outfile.write(str(row_data) + "\n")

number = soup.find_all('td')
column_data = []
for column in number:
    column_data.append(column.text)
outfile.write(str(column_data) + "\n")
List Comprehension:
# List comprehension
row_data = [row.text for row in state]
outfile.write(str(row_data) + "\n")
column_data = [column.text for column in number]
outfile.write(str(column_data) + "\n")
As far as getting those sub-columns: it's tricky because they aren't child tags. They are, however, in the next <tr> tag after the <th> tags that you pull, so we can use that.
import bs4
import requests
import csv

url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate#Criticism_of_ranking_crime_data'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Only want State and City, so limit=2
headers = soup.find_all('th', limit=2)
sub_headers = headers[0].findNext('tr')

# Initialize and then append to a list
header_data = []
for data in headers:
    header_data.append(data.text.strip())

sub_header_data = []
for data in sub_headers.find_all('th'):
    sub_header_data.append(data.text.strip())

# Only want to append the first Total column from the sub-headers
header_data.append(sub_header_data[0])

with open('C:/test.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header_data)

    table_body = soup.find_all('tbody')[1]
    rows = table_body.find_all('tr')
    for row in rows:
        tds = row.find_all('td', limit=4)
        # Skip the blank rows of data
        if tds == []:
            continue
        tds_data = []
        for data in tds:
            tds_data.append(data.text.strip())
        # Remove the Population number/data
        del tds_data[2]
        writer.writerow(tds_data)
Related
I want to make a script that prints the links to Bing search results to the console. The problem is that when I run the script there is no output. I believe the website thinks I am a bot?
from bs4 import BeautifulSoup
import requests

search = input("search for:")
params = {"q": "search"}
r = requests.get("http://www.bing.com/search", params=params)

soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("Li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
You need to use the search variable instead of the string "search". You also have a typo in your script: li should be lower case.
Change these lines:
params = {"q": "search"}
.......
links = results.find_all("Li", {"class": "b_algo"})
To this:
params = {"q": search}
........
links = results.find_all("li", {"class": "b_algo"})
Note that some queries don't return anything: "crossword" has results, but "peanut" does not. The result page structure may differ depending on the query.
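As an aside, since the result page structure can vary, it's worth guarding against soup.find returning None before calling find_all on it. A minimal sketch of that guard, run here against an inline HTML snippet rather than a live Bing response (the b_results/b_algo structure is taken from the question's code; real Bing markup may differ):

```python
from bs4 import BeautifulSoup

# Inline stand-in for a response page with the structure the question targets.
html = """
<ol id="b_results">
  <li class="b_algo"><a href="https://example.com">Example result</a></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find("ol", {"id": "b_results"})
if results is None:
    # The page didn't contain the expected list; fall back to empty instead of crashing.
    links = []
else:
    links = results.find_all("li", {"class": "b_algo"})

for item in links:
    anchor = item.find("a")
    if anchor is not None:
        print(anchor.text, anchor.attrs.get("href"))
```

The same `is None` check applies to any find call whose target might be missing from the page.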
There are two issues in this code:
search is a variable name, so it should not be used with quotes. Change it to:
params = {"q": search}
Note that the "search" in the request URL ("http://www.bing.com/search") is the endpoint path, not your variable, so that part can stay as it is.
After making this change, if you still do not get any output, check whether you are using the correct tag in the results variable.
I scraped a website using the code below.
The website is structured in a way that requires using 4 different classes to scrape all the data, which causes some data to be duplicated.
For converting my variables into lists, I tried using the split(' ') method, but it only created a list for each scraped string, with \n at the beginning.
I also tried to create the variables as empty lists, api_name = [] for instance, but it did not work.
For removing duplicates, I thought of using set, but I think it only works on lists.
I want to remove all the duplicated data from my variables before I write them to the CSV file. Do I have to convert them into lists first, or is there a way to remove the duplicates directly from the variables?
Any assistance, or even feedback on the code, would be appreciated.
Thanks.
import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.programmableweb.com/apis/directory"
api_no = 0
urlnumber = 0

response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")

csv_file = open('api_scraper.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['api_no', 'API Name', 'Description', 'api_url', 'Category', 'Submitted'])

# This is the place where I parse and combine all the classes, which causes the duplicate data
directories1 = soup.find_all('tr', {'class': 'odd'})
directories2 = soup.find_all('tr', {'class': 'even'})
directories3 = soup.find_all('tr', {'class': 'odd views-row-first'})
directories4 = soup.find_all('tr', {'class': 'odd views-row-last'})
directories = directories1 + directories2 + directories3 + directories4

while urlnumber <= 765:
    for directory in directories:
        api_NameTag = directory.find('td', {'class': 'views-field views-field-title col-md-3'})
        api_name = api_NameTag.text if api_NameTag else "N/A"
        description_nametag = directory.find('td', {'class': 'col-md-8'})
        description = description_nametag.text if description_nametag else 'N/A'
        api_url = 'https://www.programmableweb.com' + api_NameTag.a.get('href')
        category_nametage = directory.find('td', {'class': 'views-field views-field-field-article-primary-category'})
        category = category_nametage.text if category_nametage else 'N/A'
        submitted_nametag = directory.find('td', {'class': 'views-field views-field-created'})
        submitted = submitted_nametag.text if submitted_nametag else 'N/A'
        # These are the variables I want to remove the duplicates from
        csv_writer.writerow([api_no, api_name, description, api_url, category, submitted])
        api_no += 1
    urlnumber += 1
    url = "https://www.programmableweb.com/apis/directory?page=" + str(urlnumber)

csv_file.close()
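On the deduplication part of the question: once each scraped row is a plain list, you don't need pandas to drop duplicates while keeping order; you can key each row on a tuple and track what you've already seen in a set. A minimal sketch with made-up rows standing in for the scraped data:

```python
# Hypothetical rows as they might come out of the scraping loop:
# [api_no, name, description, url, category, submitted]
rows = [
    [0, 'Google Maps', 'Mapping API', 'https://example.com/google-maps', 'Mapping', '2005'],
    [1, 'Twilio', 'SMS API', 'https://example.com/twilio', 'Messaging', '2008'],
    [2, 'Google Maps', 'Mapping API', 'https://example.com/google-maps', 'Mapping', '2005'],
]

seen = set()
unique_rows = []
for row in rows:
    # Ignore the running api_no (index 0) when deciding whether a row is a duplicate.
    key = tuple(row[1:])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(len(unique_rows))  # 2
```

Collecting rows into a list first, deduplicating, and only then writing with csv.writer avoids having to undo duplicates in the output file.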
If it wasn't for the API links, I would have said to just use pandas read_html and take index 2. As you want the URLs as well, I suggest you change your selectors: restrict yourself to the table to avoid duplicates, and choose the class name that identifies each column.
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.programmableweb.com/apis/directory')
soup = bs(r.content, 'lxml')
api_names, api_links = zip(*[(item.text, 'https://www.programmableweb.com' + item['href']) for item in soup.select('.table .views-field-title a')])
descriptions = [item.text for item in soup.select('td.views-field-search-api-excerpt')]
categories = [item.text for item in soup.select('td.views-field-field-article-primary-category a')]
submitted = [item.text for item in soup.select('td.views-field-created')]
df = pd.DataFrame(list(zip(api_names, api_links, descriptions, categories, submitted)), columns = ['API name','API Link', 'Description', 'Category', 'Submitted'])
print(df)
Though you could just do
pd.read_html(url)[2]
and then add in the extra column for api_links from bs4 using selectors shown above.
I am having a problem with web scraping. I am trying to learn how to do it, but I can't seem to get past some of the basics. The error I'm getting is "TypeError: 'ResultSet' object is not callable".
I've tried a number of different things. I was originally trying to use find instead of find_all, but I was having an issue with BeautifulSoup pulling in a NoneType. I was unable to write an if check that could handle that exception, so I tried using find_all instead.
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text, 'html.parser')
all_company_list = soup.find_all(class_='sortable-table')
#all_company_list = soup.find(class_='sortable-table')
company_name_list_items = all_company_list('td')
for company_name in company_name_list_items:
    #print(company_name.prettify())
    companies = company_name.content[0]
I'd like this to pull in all the companies in Orange County, California that are on this list, in a clean manner. As you can see, I've already managed to pull them in, but I want the list to be clean.
You've got the right idea. I think instead of immediately finding all the <td> tags (which is going to return one <td> for each row (140 rows) and each column in the row (4 columns)), if you want only the company names, it might be easier to find all the rows (<tr> tags) then append however many columns you want by iterating the <td>s in each row.
This will get the first column, the company names:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all('tr')
company_list = [c.find('td').text for c in all_company_list[1::]]
Now company_list contains all 140 company names:
>>> print(company_list)
['Advanced Behavioral Health', 'Advanced Management Company & R³ Construction Services, Inc.',
...
, 'Wes-Tec, Inc', 'Western Resources Title Company', 'Wunderman', 'Ytel, Inc.', 'Zillow Group']
Change c.find('td') to c.find_all('td') and iterate that list to get all the columns for each company.
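That last step might look like the sketch below, run here against a small inline two-column table rather than the live page (the real page has four columns per row, but the pattern is the same):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page's table: a header row plus two data rows.
html = """
<table>
  <tr><th>Company</th><th>Sector</th></tr>
  <tr><td>Acme Corp</td><td>Manufacturing</td></tr>
  <tr><td>Globex</td><td>Energy</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
all_company_list = soup.find_all('tr')

# Skip the header row, then collect every column of every remaining row.
company_rows = [[td.text for td in c.find_all('td')] for c in all_company_list[1:]]
print(company_rows)  # [['Acme Corp', 'Manufacturing'], ['Globex', 'Energy']]
```

Each inner list is one row, so picking out a single column later is just an index into it.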
Pandas:
Pandas is often useful here. The page supports multiple sorts, including company size and rank; I show the rank sort.
import pandas as pd
table = pd.read_html('https://topworkplaces.com/publication/ocregister/')[0]
table.columns = table.iloc[0]
table = table[1:]
table.Rank = pd.to_numeric(table.Rank)
rank_sort_table = table.sort_values(by='Rank', axis=0, ascending = True)
rank_sort_table.reset_index(inplace=True, drop=True)
rank_sort_table.columns.names = ['Index']
print(rank_sort_table)
Depending on your sort, companies in order:
print(rank_sort_table.Company)
Requests:
Incidentally, you can use nth-of-type to select just the first column (company names), and use the table's id rather than a class name to identify it, which is faster:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)
Note the default sorting is alphabetical on name column rather than rank.
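The id-plus-nth-of-type approach can be sketched against a small inline table (the #twpRegionalList id comes from the snippet above; the table contents here are made up):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the ranked table on the real page.
html = """
<table id="twpRegionalList">
  <tr><td>Acme Corp</td><td>51</td></tr>
  <tr><td>Globex</td><td>212</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
# "#twpRegionalList td:nth-of-type(1)" matches the first <td> of each row
# inside the table with that id.
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)  # ['Acme Corp', 'Globex']
```

Changing nth-of-type(1) to another index selects a different column without touching the rest of the selector.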
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
So I'm aiming to scrape two tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate - after using the search bar to iterate over a list of license codes. I haven't included the loop fully yet, but I added it at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in two different formats, so I have to scrape them separately. As the Product Data is in the normal "tr" format on the webpage, that bit is easy and I've managed to extract a CSV file of it. The harder bit is extracting the Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text, using the class, however I need to have it in tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to CSV, so if you have any suggestions, they would be much appreciated, thank you! Also, any other general tips to improve my code would be great too, as I am new to web scraping.
#namelist = open('example.csv', newline='', delimiter='example')
#for name in namelist:
#include all of the below

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")

url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)

new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content, 'lxml')

table = soup.find_all('table')[0]
df, = pd.read_html(str(table))

certificate = soup.find(class_='certificatecl').text
##certificate1 = pd.read_html(str(certificate))

driver.quit()

df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this now works for the Certificate Data, I also want to scrape the Product Data and output it into another .csv file. However, it currently only prints five copies of the product data for the final license code, which is not what I want.
New code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']

df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about them in the context of scraping.
import requests
import csv
from bs4 import BeautifulSoup

# This is all you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

# Function to parse a single data page based on a single input code.
def get_data_by_code(code):
    # Parameters to build the POST request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse the html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable: the text of the <div> sibling following the first
    # <label> tag found with text="Status". Which is the status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Return the found data as a list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here the output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        # Write the list of values to the file as a single row.
        writer.writerow(get_data_by_code(code))
Everything is really straightforward here. I'd suggest you spend some time in the Chrome dev tools "Network" tab to get a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run Chrome to click the "Search" button; you need to forge the request generated by that click. The same goes for any form and ajax.
well... you should sharpen your skills (:
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))
        ### HERE'S THE PROBLEM:
        # The "soup" variable is declared inside the "get_data_by_code" function,
        # so you can't use it in the outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line to the
        # definition of "get_data_by_code" and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:

def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    return dict(row=row, table=table)
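A sketch of how the caller could then consume that dictionary, with a stub standing in for the real network call (the 'row'/'table' keys follow the suggestion above; the stub's field values and the two-code list are made up for illustration):

```python
import csv
import io

def get_data_by_code(code):
    # Stub for the real POST + parse. The real function would build 'row'
    # from the certificate labels and 'table' from the product <table>.
    return dict(row=[code, 'Valid', '2009-04-01'], table=[[code, 'Some product']])

codes = ['C001777', 'C001778']
cert_out = io.StringIO()   # stands in for open('Certificate_Data.csv', 'w')
product_rows = []

writer = csv.writer(cert_out)
for code in codes:
    result = get_data_by_code(code)
    writer.writerow(result['row'])        # one certificate row per code
    product_rows.extend(result['table'])  # accumulate product rows per code

print(product_rows)
```

Because each iteration uses the dictionary returned for its own code, the product data no longer repeats the final code's table.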
I have several URLs that I want to open to a specific place and search for a specific name, but I'm only getting None or [] returned.
I have searched but cannot see an answer that is pertinent to my code.
from bs4 import BeautifulSoup
from urllib import request

webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")

Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/" + line.get('href'))

n = 0
e = len(Links)
while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    # search string
    print(soup.body.findAll(text='Ashburton'))
    n = n + 1
I know it's in the last link found on the page.
Thanks in advance for any ideas or comments.
If your output is only "[]", it means findAll returned a list (an empty one in that case). When it does find matches, you take one out with an index: variable[index].
Try this one:

print(soup.body.findAll(text='Ashburton')[0])

...where storing it into a variable first would be easier:

search = soup.body.findAll(text='Ashburton')
print(search[0])

This will give you the first found item. For printing all found items you could go:

search = soup.body.findAll(text='Ashburton')
for entry in search:
    print(entry)

(I really don't know BeautifulSoup, so treat this as a sketch.)