How to set column order when saving to CSV using Scrapy

So basically I am scraping data off the web, and I have an items file which is imported into my main spider file. When I scrape data, store it in containers, and save it as CSV, the links column always ends up being the first column in the CSV. How do I set a custom order for the columns?
pName = response.css('#search .a-size-medium').css('::text').extract()
pPrice = response.css('#search .a-price-whole').css('::text').extract()
imgs = response.css('.sbv-product-img , .s-image-fixed-height .s-image').css('::attr(src)').extract()
for prod in zip(pName, pPrice, imgs):
    items['prodName'] = prod[0]
    items['price'] = prod[1]
    items['imgLink'] = prod[2]
    yield items

Use the FEED_EXPORT_FIELDS setting, either in your settings.py file or in your spider's custom_settings attribute. The columns will appear in the order you list them in the setting's value.
For example:
class MySpider(scrapy.Spider):
    custom_settings = {
        "FEED_EXPORT_FIELDS": ["prodName", "price", "imgLink"]
    }
or in settings.py:
FEED_EXPORT_FIELDS=["prodName", "price", "imgLink"]
See the Scrapy documentation on feed exports and the FEED_EXPORT_FIELDS setting.
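For completeness, here is a minimal sketch of how the question's loop and the setting fit together in one spider (the item class, spider name, and start URL are placeholders, not taken from the original post):

import scrapy

class ProductItem(scrapy.Item):
    prodName = scrapy.Field()
    price = scrapy.Field()
    imgLink = scrapy.Field()

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/search"]  # placeholder URL
    custom_settings = {
        # The exported CSV columns appear in exactly this order.
        "FEED_EXPORT_FIELDS": ["prodName", "price", "imgLink"],
    }

    def parse(self, response):
        pName = response.css('#search .a-size-medium').css('::text').extract()
        pPrice = response.css('#search .a-price-whole').css('::text').extract()
        imgs = response.css('.sbv-product-img , .s-image-fixed-height .s-image').css('::attr(src)').extract()
        for name, price, img in zip(pName, pPrice, imgs):
            item = ProductItem()
            item['prodName'] = name
            item['price'] = price
            item['imgLink'] = img
            yield item

Running it with, for example, scrapy runspider products_spider.py -o products.csv should then produce prodName, price, imgLink as the column order.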

Related

azure-devops export work items as csv and include ALL comments/discussions

If I make a query in Azure DevOps for my work items, the only related columns I can add to the display are Comments Count and Discussions. Is there any way to include all the comments from a work item in my saved CSV output? I need to archive the comments for every work item in a CSV.
Can I do this through the Azure DevOps web UI somehow, or do I need to write my own script using the Azure API to read the comments and add them to each work item?
For your first question, the answer is no: since you want all the comments for a work item, the built-in UI feature cannot meet your requirement.
For your second question, the answer is yes.
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
import requests
import csv
import os

#get all the comments of a work item
def get_work_items_comments(wi_id):
    #get a connection to Azure DevOps
    organization_url = 'https://dev.azure.com/xxx'
    personal_access_token = 'xxx'
    credentials = BasicAuthentication('', personal_access_token)
    connection = Connection(base_url=organization_url, creds=credentials)
    work_item_tracking_client = connection.clients.get_work_item_tracking_client()
    #get the work item
    work_item = work_item_tracking_client.get_work_item(wi_id)
    #get the comments link of the work item
    comments_ref = work_item._links.additional_properties['workItemComments']['href']
    #send a request to get the comments
    response = requests.get(comments_ref, auth=('', personal_access_token))
    #get the comments
    comments = response.json()['comments']
    return comments

#return the work item id, work item title and related work item comments
def get_work_items_results(wi_id):
    #get a connection to Azure DevOps
    organization_url = 'https://dev.azure.com/xxx'
    personal_access_token = 'xxx'
    credentials = BasicAuthentication('', personal_access_token)
    connection = Connection(base_url=organization_url, creds=credentials)
    work_item_tracking_client = connection.clients.get_work_item_tracking_client()
    #get the work item
    work_item = work_item_tracking_client.get_work_item(wi_id)
    #get the title of the work item
    title = work_item.fields['System.Title']
    #get the work item id
    id = work_item.id
    #get the comments of the work item
    items = get_work_items_comments(wi_id)
    array_string = []
    for item in items:
        text = item['text']
        array_string.append(text)
        print(item['text'])
    return id, title, array_string

#save the work item id, work item title and related work item comments to a csv file
#create the folder workitemresults if it does not exist
if not os.path.exists('workitemresults'):
    os.makedirs('workitemresults')

with open('workitemresults/comments_results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Workitem ID', 'Workitem Title', 'Workitem Comments'])
    #=== if you want multiple work items, use a for loop in this place; 120 is the work item id on my side ===
    writer.writerow(get_work_items_results(120))
The above code captures the comments and information for one work item, and it works fine on my side. (The place where the for loop should go is already marked in the code; with a loop there you can capture multiple work items, as sketched below.)
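A minimal sketch of that loop, assuming you have a list of work item ids to export (the ids below are placeholders):

#hypothetical list of work item ids to export
work_item_ids = [120, 121, 122]

with open('workitemresults/comments_results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Workitem ID', 'Workitem Title', 'Workitem Comments'])
    #one row per work item, replacing the single writer.writerow(get_work_items_results(120)) call
    for wi_id in work_item_ids:
        writer.writerow(get_work_items_results(wi_id))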
In my situation, since I only want the text content, I remove the markup:
#remove <div> and </div> from the text
text = text.replace('<div>','')
text = text.replace('</div>','')
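If the comments contain markup other than <div>, a more general option (a sketch, not part of the original answer) is to strip all HTML tags, for example with a regular expression:

import re

def strip_html(text):
    #remove any HTML tags, not just <div> and </div>
    return re.sub(r'<[^>]+>', '', text)

text = strip_html(item['text'])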

Python web-scraping output

I want to make a script that prints the links from Bing search results to the console. The problem is that when I run the script there is no output. I believe the website thinks I am a bot?
from bs4 import BeautifulSoup
import requests

search = input("search for:")
params = {"q": "search"}
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("Li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
You need to use the search variable instead of "search". You also have a typo in your script: li is lower case.
Change these lines:
params = {"q": "search"}
.......
links = results.find_all("Li", {"class": "b_algo"})
To this:
params = {"q": search}
........
links = results.find_all("li", {"class": "b_algo"})
Note that some queries don't return anything. "crossword" has results, but "peanut" does not. The result page structure may be different based on the query.
There are 2 issues in this code -
search is a variable name, so it should not be wrapped in quotes. Change it as below:
params = {"q": search}
When you include a variable name inside quotes while building the link, it becomes a static link. For a dynamic link you should do it as below:
r = requests.get("http://www.bing.com/"+search, params=params)
After making these two changes, if you still do not get any output, check whether you are using the correct tag in the results variable. A corrected version of the full script is sketched below.
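For reference, a minimal corrected version of the question's script with both fixes applied (a sketch only; Bing may still block automated requests or change its markup, so the selectors are not guaranteed):

from bs4 import BeautifulSoup
import requests

search = input("search for: ")
params = {"q": search}  #use the variable, not the literal string "search"
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
if results is not None:
    links = results.find_all("li", {"class": "b_algo"})  #lower-case tag name
    for item in links:
        item_text = item.find("a").text
        item_href = item.find("a").attrs["href"]
        if item_text and item_href:
            print(item_text)
            print(item_href)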

AWS Glue Job - CSV to Parquet. How to ignore header?

I need to convert a bunch (23) of CSV files (source: S3) into Parquet format. All of the input CSV files contain headers. When I generated code for this using Glue, the output also contained 22 header rows as separate data rows, which means it only skipped the first header. I need help ignoring all the headers during this transformation.
Since I'm using the from_catalog function for my input, I don't have any format_options to ignore the header rows.
Also, can I set an option on the Glue table indicating that a header is present in the files? Will that automatically skip the header when my job runs?
Part of my current approach is below. I'm new to Glue. This code was actually auto-generated by Glue.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")
I faced the exact same issue while working on an ETL job that used AWS Glue.
The documentation for from_catalog says:
additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
I tried using the snippet below, and several permutations of it, with from_catalog, but nothing worked for me.
additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},
One way to go about fixing this is by using from_options instead of from_catalog and pointing directly to the S3 bucket or folder. This is what it should look like:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        'paths': ['s3://bucket_name/folder_name'],
        "recurse": True,
        'groupFiles': 'inPartition'
    },
    format="csv",
    format_options={
        "withHeader": True
    },
    transformation_ctx="datasource0"
)
But if you can't do this for any reason and want to stick with from_catalog, using a filter worked for me.
Assuming that one of your column headers is name, this is what the snippet can look like:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")
I'm not entirely sure how Spark's DataFrames or Glue's DynamicFrames deal with CSV headers, or why data read from the catalog had the headers in the rows as well as in the schema, but this solved my issue by removing the header values from the rows.

Type Error: Result Set Is Not Callable - BeautifulSoup

I am having a problem with web scraping. I am trying to learn how to do it, but I can't seem to get past some of the basics. I am getting the error "TypeError: 'ResultSet' object is not callable".
I've tried a number of different things. I was originally trying to use find instead of find_all, but I was having an issue with BeautifulSoup returning a NoneType. I was unable to write an if check that could handle that case, so I tried find_all instead.
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text, 'html.parser')
all_company_list = soup.find_all(class_='sortable-table')
#all_company_list = soup.find(class_='sortable-table')
company_name_list_items = all_company_list('td')
for company_name in company_name_list_items:
    #print(company_name.prettify())
    companies = company_name.content[0]
I'd like this to pull in all the companies in Orange County California that are on this list in a clean manner. As you can see, I've already accomplished pulling them in, but I want the list to be clean.
You've got the right idea. I think instead of immediately finding all the <td> tags (which is going to return one <td> for each row (140 rows) and each column in the row (4 columns)), if you want only the company names, it might be easier to find all the rows (<tr> tags) then append however many columns you want by iterating the <td>s in each row.
This will get the first column, the company names:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all('tr')
company_list = [c.find('td').text for c in all_company_list[1::]]
Now company_list contains all 140 company names:
>>> print(len(company_list))
140
>>> print(company_list)
['Advanced Behavioral Health', 'Advanced Management Company & R³ Construction Services, Inc.',
...
, 'Wes-Tec, Inc', 'Western Resources Title Company', 'Wunderman', 'Ytel, Inc.', 'Zillow Group']
Change c.find('td') to c.find_all('td') and iterate that list to get all the columns for each company.
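For example, a small sketch of that variation (it assumes the same four-column table structure described above):

all_company_list = soup.find_all('tr')
companies = []
for row in all_company_list[1:]:  #skip the header row
    #keep the text of every <td> column for this company
    companies.append([cell.text for cell in row.find_all('td')])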
Pandas:
Pandas is often useful here. The page supports multiple sorts, including company size and rank; I show the rank sort.
import pandas as pd
table = pd.read_html('https://topworkplaces.com/publication/ocregister/')[0]
table.columns = table.iloc[0]
table = table[1:]
table.Rank = pd.to_numeric(table.Rank)
rank_sort_table = table.sort_values(by='Rank', axis=0, ascending = True)
rank_sort_table.reset_index(inplace=True, drop=True)
rank_sort_table.columns.names = ['Index']
print(rank_sort_table)
Depending on your sort, to list the companies in order:
print(rank_sort_table.Company)
Requests:
Incidentally, you can use nth-of-type to select just the first column (company names) and use the table's id, rather than a class name, to identify it faster:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)
Note the default sorting is alphabetical on name column rather than rank.
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

web2py: Grid CSV export shows ids, not values, for reference fields

The table structure is like this:
db.define_table('parent',
    Field('name'),
    format='%(name)s')

db.define_table('children',
    Field('name'),
    Field('mother', 'reference parent'),
    Field('father', 'reference parent'))

db.children.mother.requires = IS_IN_DB(db, db.parent.id, '%(name)s')
db.children.father.requires = IS_IN_DB(db, db.parent.id, '%(name)s')
Controller:
grid = SQLFORM.grid(db.children, orderby=[db.children.id],
                    csv=True,
                    fields=[db.children.id, db.children.name,
                            db.children.mother, db.children.father])
return dict(grid=grid)
Here the grid shows the proper values, i.e., the names of the mother and father from the parent table. But when I export it via the CSV link, the resulting spreadsheet shows the ids, not the names, of mother and father.
Please help!
The CSV download just gives you the raw database values without first applying each field's represent attribute. If you want the "represented" values of each field, you have two options. First, you can choose the TSV (tab-separated-values) download instead of CSV. Second, you can define a custom export class:
import cStringIO

class CSVExporter(object):
    file_ext = "csv"
    content_type = "text/csv"

    def __init__(self, rows):
        self.rows = rows

    def export(self):
        if self.rows:
            s = cStringIO.StringIO()
            self.rows.export_to_csv_file(s, represent=True)
            return s.getvalue()
        else:
            return ''

grid = SQLFORM.grid(db.mytable, exportclasses=dict(csv=(CSVExporter, 'CSV')))
The exportclasses argument is a dictionary of custom download types that can be used to override existing types or add new ones. Each item is a tuple including the exporter class and the label to be used for the download link in the UI.
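If you are running web2py on Python 3, cStringIO no longer exists; an equivalent sketch of the same exporter (assuming a Python 3 environment, which the original answer does not specify) would use io.StringIO:

import io

class CSVExporter(object):
    file_ext = "csv"
    content_type = "text/csv"

    def __init__(self, rows):
        self.rows = rows

    def export(self):
        if not self.rows:
            return ''
        s = io.StringIO()
        #represent=True applies each field's represent attribute, so reference
        #fields export their display names instead of raw ids
        self.rows.export_to_csv_file(s, represent=True)
        return s.getvalue()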
We should probably add this as an option.