NameError: name 'status_code' is not defined while parsing access.log - json

Good afternoon, while testing the code for parsing access.log, the following error occurred:
Traceback (most recent call last):
  File "logsscript_3.py", line 31, in <module>
    dict_ip[ip][status_code] += 1
NameError: name 'status_code' is not defined
I need to output the top 10 requests with code 400 to a JSON file.
The code is like this:
import argparse
import json
import re
from collections import defaultdict

parser = argparse.ArgumentParser(description='log parser')
parser.add_argument('-f', dest='logfile', action='store', default='access.log')
args = parser.parse_args()

regul_ip = (r"^(?P<ips>.*?)")
regul_statuscode = (r"\s(?P<status_code>400)\s")

dict_ip = defaultdict(lambda: {"400": 0})

with open(args.logfile) as file:
    for index, line in enumerate(file.readlines()):
        try:
            ip = re.search(regul_ip, line).group()
            status_code = re.search(regul_statuscode, line).groups()[0]
        except AttributeError:
            pass
        dict_ip[ip][status_code] += 1

print(json.dumps(dict_ip, indent=4))

with open("final_log.json", "w") as jsonfile:
    json.dump(dict_ip, jsonfile, indent=5)
An example of a line from access.log:
213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717

Following up on my comment (included below for completeness) explaining why you see the error, here are some ways to fix the code.
Expanding on @khelwood's point: the example line (and likely many others in your log) is not a 400 line. Your status-code regex requires 400, so it does not match, and the entire status_code = ... line fails with AttributeError: 'NoneType' object has no attribute 'groups' for every non-400 line. Because the exception is silently ignored, the dict_ip[ip][status_code] += 1 line then raises a NameError, since status_code was never assigned a value.
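To see this concretely, here is a quick check of your status-code regex against the 200 example line from the question; the search returns None, so .groups() raises and status_code is never bound:
>>> import re
>>> line = '213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717'
>>> re.search(r"\s(?P<status_code>400)\s", line) is None
True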
First, you can use one regex to parse the access logs.
>>> import re
>>>
>>> line = '213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717'
>>> p = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S+)" "(.*?)" (\S+)'
>>> pat = re.compile(p)
>>> m = pat.match(line)
>>> m.groups()
('213.137.244.2', '-', '-', '13/Dec/2015:17:30:13 +0100', 'GET', '/administrator/', 'HTTP/1.1', '200', '4263', '-', 'Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0', '7717')
>>> m.group(1)
'213.137.244.2'
>>> m.group(2)
'-'
...
The snippet above shows you how to fetch and access the various fields from your log, since I observed in your other recent questions that you need this.
You can slightly modify the above as shown below (since you care only about log lines with a 400 and need only the IP address). Note that this is not the only way to write the regex, it's simply one way that can be easily derived from the one above. Note also that for illustration purposes I changed 200 to 400.
>>> line = '213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 400 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717'
>>> p = r'(\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" 400 \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> m = pat.match(line)
>>> m.group(1)
'213.137.244.2'
So, reading your log, counting 400s per IP address, and saving the 10 IP addresses with the most 400s to a JSON file:
>>> from collections import Counter
>>> import json
>>> import re
>>>
>>> p = r'(\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" 400 \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> dict_ips_400 = Counter()
>>>
>>> with open("input_log.text") as f:
...     for line in f:  # see Note 1
...         m = pat.match(line)
...         if m:  # check if there is a match
...             ip = m.group(1)
...             dict_ips_400[ip] += 1
...
>>> with open("final_log.json", "w") as jsonfile:
...     json.dump(dict_ips_400.most_common(10), jsonfile, indent=5)
...
Notes:
1. You may want to check the differences between using f.readlines() and processing the file line by line as above (especially if you are working with large files).
2. You could modify the above to (a) use named groups (see re's docs), as you attempted to do in your code, and/or (b) capture and store more fields, say, counts of (IP address, status code) pairs. A named-group variant is sketched below.
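For illustration of note 2(a), here is one way (certainly not the only one) to rewrite the 400-only pattern with a named group; this is just a sketch, reusing the 400 example line from earlier:
>>> p = r'(?P<ip>\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" 400 \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> m = pat.match(line)  # line is the 400 example line shown above
>>> m.group('ip')
'213.137.244.2'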

Related

How to efficiently crawl a website using Scrapy

I am attempting to scrape a real estate website using Scrapy and PyCharm, and failing miserably.
Desired Results:
Scrape 1 base URL (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/), but 5 different internal URLs (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/{i}-r/), where {i} = 1,2,3,4,5
Crawl all pages in each internal URL or using the base URL
Get all href links, crawl each one, and get the span tag data from inside each link.
Scrape around 5,000-7,000 unique listings as efficiently and fast as possible.
Output data into a CSV file while keeping Cyrillic characters.
Note: I have attempted web-scraping using BeautifulSoup, but it took me around 1-2 minutes per listing and around 2-3 hours to scrape all listings using a for loop. I was referred to Scrapy as a faster option by a community member. I'm unsure whether that's because of the data pipelines or whether I can do multi-threading.
All and any help is greatly appreciated.^^
Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.
Current Scrapy Code: This is what I have so far. When I run scrapy crawl unegui_apts I cannot seem to get the results I want. I'm so lost.
# -*- coding: utf-8 -*-
# Import library
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request

# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
    ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]
            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request
        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()
Your code has a number of issues:
The start_urls list contains invalid links.
You defined your user_agent string in the headers dictionary, but you are not using it when yielding requests.
Your XPath selectors are incorrect.
The next_url expression is incorrect, hence it does not yield new requests to the next pages.
I have updated your code to fix the issues above as follows:
import scrapy
from scrapy.crawler import CrawlerProcess

# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()
            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}
        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()
You run the above spider by executing the command python <filename.py>, since you are running a standalone script and not a full-blown project.
You will still need to clean up the data using pipelines and the Scrapy item class. See the docs for more details.
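As a rough illustration of that last point (a sketch only; the ApartmentItem and CleanStringsPipeline names are made up, and the fields simply mirror what the spider yields above), an item class plus a small cleaning pipeline could look like this:
import scrapy

# Hypothetical item declaring the fields the spider yields above
class ApartmentItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()

# Hypothetical pipeline that strips stray whitespace from string fields;
# enable it via the ITEM_PIPELINES setting (e.g. in custom_settings)
class CleanStringsPipeline:
    def process_item(self, item, spider):
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = value.strip()
        return item
The spider would then yield ApartmentItem(name=name, price=price, city=city, date=date) instead of a plain dict.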

Why does error 'NoneType' object has no attribute 'contents' occur with only one of two similar commands?

I'm extracting content from this url.
import requests
from bs4 import BeautifulSoup

url = 'https://www.collinsdictionary.com/dictionary/french-english/aimer'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for script in soup.select('script, .hcdcrt, #ad_contentslot_1, #ad_contentslot_2'):
    script.extract()

entry_name = soup.h2.text
content1 = ''.join(map(str, soup.select_one('.cB cB-def dictionary biling').contents))
Then I got an error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-84-e9cb11cd6b5d> in <module>
10
11 entry_name = soup.h2.text
---> 12 content1 = ''.join(map(str, soup.select_one('.cB cB-def dictionary biling').contents))
AttributeError: 'NoneType' object has no attribute 'contents'
On the other hand, if I replace cB cB-def dictionary biling with hom, i.e. content1 = ''.join(map(str, soup.select_one('.hom').contents)), then the code runs well. From the structure of the HTML, I think that cB cB-def dictionary biling and hom are very similar.
Could you please elaborate on how such problem arises and how to solve it?
try this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.collinsdictionary.com/dictionary/french-english/aimer'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for script in soup.select('script, .hcdcrt, #ad_contentslot_1, #ad_contentslot_2'):
    script.extract()

entry_name = soup.h2.text
content1 = ''.join(map(str, soup.select_one('.cB.cB-def.dictionary.biling').contents))
When you select by class and the class attribute contains blank spaces, you replace each space with a dot (.).
cB, cB-def, dictionary and biling are four different classes on the same tag. If you leave the spaces in, the selector is read as a descendant chain (an element with class cB containing a <cB-def> tag, containing a <dictionary> tag, containing a <biling> tag), which matches nothing, so select_one returns None.
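To illustrate the difference on a tiny, made-up HTML snippet (not taken from the Collins page):
from bs4 import BeautifulSoup

html = '<div class="cB cB-def dictionary biling"><p class="hom">bonjour</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Dots: one element carrying all four classes -> matches the div
print(soup.select_one('.cB.cB-def.dictionary.biling'))

# Space: descendant combinator -> an element with class "hom" inside an element with class "cB"
print(soup.select_one('.cB .hom'))

# The original selector looks for <cB-def>, <dictionary>, <biling> tags nested inside .cB -> None
print(soup.select_one('.cB cB-def dictionary biling'))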

Save Json file contents to CSV file in python/pandas

How do I get the "data" information into a CSV table as shown at the end (and also set the right 'headers' so that the source server doesn't throw me off, thinking I am scraping)? The code I wrote so far is below.
import requests, json
headers = {'User-Agent': 'Mozilla/5.0'}
data_json = requests.get('https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/foSecStockWatch.json', headers=headers)
print(data_json)
file = open('make_csv', 'w')
file.write(str(data_json))
file.close()
But the output I receive is as follows:
<Response [200]>
and even the exported/saved file shows the same.
Here is the expected output table that I am trying to achieve:
Symbol,Open,High,Low,Last Traded Price,Change,%Change,Traded Volume(lacs),Traded Value(crs),52 Week High,52 Week Low,365 Days % Change,30 Days % Change
"LUPIN","582.45","665.90","578.00","662.00","82.95","14.33","64.93","411.13","884.00","504.75","-14.88","5.11"
"APOLLOHOSP","1,094.20","1,239.45","1,088.05","1,195.00","106.15","9.75","23.97","280.36","1,813.55","1,047.05","-4.80","-30.87"
"SUNPHARMA","343.95","389.80","340.00","376.45","32.90","9.58","285.51","1,055.40","483.90","312.00","-19.85","1.88"
"CIPLA","425.00","454.70","416.25","448.00","34.25","8.28","179.07","793.22","586.00","355.30","-14.28","11.46"
"CESC","393.00","429.80","386.25","420.00","26.85","6.83","9.30","38.63","851.70","365.25","-42.19","-34.53"
"TORNTPHARM","1,979.00","2,113.00","1,950.00","2,090.00","131.00","6.69","10.13","208.87","2,287.25","1,452.00","10.56","-1.75"
"ITC","167.90","182.75","167.00","177.50","11.10","6.67","628.68","1,100.88","310.00","134.60","-40.42","-9.11"
"OIL","82.25","85.60","80.25","84.50","5.25","6.62","27.05","22.39","189.70","63.50","-53.95","-16.91"
..........
..........
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

def main(url):
    r = requests.get(url, headers=headers).json()
    x = []
    for item in r['data']:
        df = pd.DataFrame.from_dict([item])
        x.append(df)
    new = pd.concat(x, ignore_index=True)
    print(new)
    new.to_csv("Data.csv")

main("https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/foSecStockWatch.json")
Output: the script prints the combined DataFrame and writes it to Data.csv.
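As a side note (a sketch, assuming r['data'] is a flat list of dicts, which the expected CSV above suggests), the per-row concat can be collapsed into a single DataFrame construction:
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
url = "https://www1.nseindia.com/live_market/dynaContent/live_watch/stock_watch/foSecStockWatch.json"

r = requests.get(url, headers=headers).json()
df = pd.DataFrame(r['data'])        # build the table directly from the list of dicts
df.to_csv("Data.csv", index=False)  # index=False drops the numeric row index column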

Dealing with hexadecimal values in BeautifulSoup response?

I am using Beautiful Soup to scrape some data with:
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.transfermarkt.co.uk/jorge-molina/profil/spieler/94447"
heads = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url, headers=heads)
soup = BeautifulSoup(response.text, "lxml")
Then I extract a particular piece of information with:
height = soup.find_all("th", string=re.compile("Height:"))[0].findNext("td").text
print(height)
which works as intended, printing
1,74 m
but when I try to evaluate that string with this function:
def format_height(height_string):
    return int(height_string.split(" ")[0].replace(',',''))
I get the following error:
format_height(height)
Traceback (most recent call last):
File "get_player_info.py", line 73, in <module>
player_info = get_player_info(url)
File "get_player_info.py", line 39, in get_player_info
format_height(height)
File "/Users/kompella/Documents/la-segunda/util.py", line 49, in format_height
return int(height_string.split(" ")[0].replace(',',''))
ValueError: invalid literal for int() with base 10: '174\xa0m'
I am wondering how I should evaluate the hexadecimal values I am getting?
Use an attribute=value CSS selector to target the height instead, then use your function as is:
import requests
from bs4 import BeautifulSoup as bs

def format_height(height_string):
    return int(height_string.split(" ")[0].replace(',',''))

r = requests.get('https://www.transfermarkt.co.uk/jorge-molina/profil/spieler/94447', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
height_string = soup.select_one('[itemprop=height]').text
print(format_height(height_string))
Everything is perfectly fine; just deconstruct the string and you can do whatever you want with it afterwards.
import requests
import re
from bs4 import BeautifulSoup
url = "https://www.transfermarkt.co.uk/jorge-molina/profil/spieler/94447"
heads = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url,headers=heads)
soup = BeautifulSoup(response.text, "lxml")
height = soup.find_all("th", string=re.compile("Height:"))[0].findNext("td").text
numerals = [int(s) for s in re.findall(r'\b\d+\b', height)]
print (numerals)
#output: [1, 74]
print ("Height is: " + str(numerals[0]) +"."+ str(numerals[1]) +"m")
#output: Height is: 1.75m
print ("Height is: " + str(numerals[0]) + str(numerals[1]) +"cm")
#output: Height is: 175cm
Anyway, the same question was discussed in this thread, which you may want to take a look at:
ValueError: invalid literal for int() with base 10: ''
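As a further note on the original question: '\xa0' is not hexadecimal data but a non-breaking space (U+00A0) in the scraped text, so another option is to normalize the string first and keep format_height unchanged (a minimal sketch, using the value from the traceback):
import unicodedata

def format_height(height_string):
    return int(height_string.split(" ")[0].replace(',', ''))

height = '1,74\xa0m'                             # the scraped text, per the traceback above
cleaned = unicodedata.normalize('NFKC', height)  # NFKC folds U+00A0 into a regular space
# (height.replace('\xa0', ' ') would work just as well)
print(format_height(cleaned))                    # 174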

simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 3 (char 2)

I am trying to send an HTTP request to a URL and get the response using the requests library. The following is the code that I have used:
>>> import requests
>>> r = requests.get("http://www.youtube.com/results?bad+blood")
>>> r.status_code
200
when I try to do this I get following error.
>>> r.json()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/requests/models.py", line 808, in json
return complexjson.loads(self.text, **kwargs)
File "/Library/Python/2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/Library/Python/2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/Library/Python/2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 3 (char 2)
Can someone tell me what's wrong with the code?
PS: I am using python 2.7.10
The response isn't JSON, it is 'text/html; charset=utf-8'. If you want to parse it, use something like BeautifulSoup.
>>> import requests, bs4
>>> rsp = requests.get('http://www.youtube.com/results?bad+blood')
>>> rsp.headers['Content-Type']
'text/html; charset=utf-8'
>>> soup = bs4.BeautifulSoup(rsp.content, 'html.parser')
I'd recommend using the YouTube Search API instead. Log in to the Google Developers Console, set up an API key following the API Key Setup instructions, and then you can make the request using the YouTube Search API:
>>> from urllib import parse
>>> import requests
>>> query = parse.urlencode({'q': 'bad blood',
... 'part': 'snippet',
... 'key': 'OKdE7HRNPP_CzHiuuv8FqkaJhPI2MlO8Nns9vuM'})
>>> url = parse.urlunsplit(('https', 'www.googleapis.com',
... '/youtube/v3/search', query, None))
>>> rsp = requests.get(url, headers={'Accept': 'application/json'})
>>> rsp.raise_for_status()
>>> response = rsp.json()
>>> response.keys()
dict_keys(['pageInfo', 'nextPageToken', 'regionCode', 'etag', 'items', 'kind'])
Note that the example is using Python 3. If you want to use Python 2, then you will have to import urlencode from urllib and urlunsplit from urlparse.
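For reference, a rough Python 2 equivalent of the URL-building part (an untested sketch; the API key placeholder is made up):
# Python 2 variant of the imports and URL construction used above
from urllib import urlencode      # Python 3: urllib.parse.urlencode
from urlparse import urlunsplit   # Python 3: urllib.parse.urlunsplit

query = urlencode({'q': 'bad blood', 'part': 'snippet', 'key': 'YOUR_API_KEY'})
url = urlunsplit(('https', 'www.googleapis.com', '/youtube/v3/search', query, None))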
That URL returns HTML, not JSON, so there's no point calling .json() on the response.
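If you want to guard against this in code (a minimal sketch), check the Content-Type header before calling .json():
import requests

rsp = requests.get("http://www.youtube.com/results?bad+blood")
if 'application/json' in rsp.headers.get('Content-Type', ''):
    data = rsp.json()
else:
    # here the header is 'text/html; charset=utf-8', so parse it as HTML instead
    print("Not JSON:", rsp.headers.get('Content-Type'))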