How to get the country code for multiple IP addresses at a time (5 or more) using the geoip2 database?

This is how I've been getting the country name for one IP address at a time, but I need to be able to do multiple, sometimes over 50 at a time.
>>> import geoip2.database
>>> reader = geoip2.database.Reader('/path/to/GeoLite2-City.mmdb')
>>> response = reader.city('128.101.101.101')
>>> response.country.iso_code
>>> response.country.name

Put all the IPs in a list and iterate through it:
import geoip2.database

reader = geoip2.database.Reader('/path/to/GeoLite2-City.mmdb')
ip_list = ['128.101.101.101', '198.101.101.101', '208.101.101.101', '120.101.101.101', '129.101.101.101', '138.101.101.101', '148.101.101.101']
for ip in ip_list:
    response = reader.city(ip)
    print(response.country.iso_code)
    print(response.country.name)
Alternatively, put the IPs in an Excel sheet and use pandas or xlrd to read them into a list, then iterate over them as above; a sketch follows below.
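A minimal sketch of that Excel route, assuming the addresses sit in a column named ip in a file called ips.xlsx (both names are hypothetical):
import geoip2.database
import pandas as pd

reader = geoip2.database.Reader('/path/to/GeoLite2-City.mmdb')
# Read the ip column into a plain Python list.
ip_list = pd.read_excel('ips.xlsx')['ip'].tolist()
for ip in ip_list:
    response = reader.city(ip)
    print(response.country.iso_code, response.country.name)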

You can also print other fields from the same response, for example print(response.city.name) and print(response.traits.network).

Related

Why is my xpath parser only scraping the first dozen elements in a table?

I'm learning how to interact with websites via Python and I've been following this tutorial.
I wanted to know how XPath worked, which led me here.
I've made some adjustments to the code from the original site to select elements only if multiple conditions are met.
However, with both the original code from the tutorial and my own update, I seem to grab only a small section of the body. I was under the impression that xpath('//tbody/tr')[:10] would grab the entire body and let me interact with 10 items, in this case columns.
At the time of writing, running the code returns 2 hits, while doing the same thing with a manual copy and paste into Excel gives 94 hits.
Why am I only able to parse a few lines from the table and not all lines?
import requests
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

proxies = get_proxies()
print(proxies)
I figured it out: [:10] refers not to columns but to rows. Changing 10 to 20, 50 or 100 checks 10, 20, 50 or 100 rows (see the sketch below).
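A quick sketch of that behaviour, reusing the parser object from the function above; xpath() returns a list of row elements, and the slice keeps the first n rows, not columns:
rows = parser.xpath('//tbody/tr')  # one element per <tr> row
first_ten = rows[:10]              # only the first 10 rows are inspected
all_rows = rows                    # drop the slice to scan the whole table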

Select Next node in Python with XPath

I am trying to scrape population information from Wikipedia country pages. The trouble I am having is that the node I am trying to scrape contains no information referring to population; population is only referenced in the node before it. So I am trying to get the XPath expression to move to the next node, but I can't find the correct command.
For example for the following page:
https://en.wikipedia.org/wiki/Afghanistan
Below is an xpath expression that gets me to the node before the population number I want to scrape:
//table[@class='infobox geography vcard']//tr[@class='mergedtoprow']//a[contains(@href, "Demographics")]/../..
It searches for an href in the table that contains "Demographics", then goes up two levels to the parent's parent. The problem is that the title is in a different node from the number I want to extract, so I need something that can move to the next node.
I have seen the expression /following-sibling::div[1], but it doesn't seem to work with my expression and I don't know why.
If anyone can think of a more direct way of finding the node in the above web page that would be good too.
Thanks
Edit:
Below is the Python code I am using
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib.parse import urljoin

class CountryinfoSpider(scrapy.Spider):
    name = 'CountryInfo'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/List_of_sovereign_states_in_the_2020s']

    def parse(self, response):
        ## Extract all country names
        countries = response.xpath('//table//b//@title').extract()
        for country in countries:
            url = response.xpath('//table//a[@title="' + country + '"]/@href').extract_first()
            capital = response.xpath('//table//a[@title="' + country + '"]/../..//i/a/@title').extract()
            absolute_url = urljoin('https://en.wikipedia.org/', url)
            yield Request(absolute_url, callback=self.parse_country)

    def parse_country(self, response):
        test = response.xpath("//table[@class='infobox geography vcard']//tr[@class='mergedtoprow']//a[contains(@href, 'Demographics')]/../..").extract()
        yield {'Test': test}
It's a little more complicated than I explained, but essentially I go to the "List of sovereign states in the 2020s" page and copy the country names, capitals and URLs. Then I follow each URL, after joining it with the Wikipedia base, and try to use the XPath expression I am working on to pull the population.
Thanks
I think the general answer to your question is: "predicates can be nested".
//table[
    @class='infobox geography vcard'
]//tr[
    @class='mergedtoprow' and .//a[contains(@href, "Demographics")]
]/following-sibling::tr[1]/td/text()[1]
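Plugged back into the spider's parse_country, that expression might look like the sketch below; the 'Population' key is my own naming, not from the original code:
def parse_country(self, response):
    population = response.xpath(
        "//table[@class='infobox geography vcard']"
        "//tr[@class='mergedtoprow' and .//a[contains(@href, 'Demographics')]]"
        "/following-sibling::tr[1]/td/text()[1]"
    ).extract_first()
    yield {'Population': population}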

How to download IMF Data Through JSON in R

I recently took an interest in retrieving data in R through JSON. Specifically, I want to be able to access data through the IMF. I know virtually nothing about JSON so I will share what I [think I] know so far, and what I have accomplished.
I browsed their web page about JSON, which helped a little; it gave me the starting URL. Here is the page: http://datahelp.imf.org/knowledgebase/articles/667681-using-json-restful-web-service
I managed to download some lists (using the GET() and fromJSON() functions), which are really bulky. I know enough about the lists to tell that the call was successful, but I cannot for the life of me get actual data. So far I have been trying to use the rawToChar() function on the "content" data, but I am stuck there.
If anything, I managed to create data frames that contain the codes, which I presume would be used somewhere in the JSON link. Here is what I have.
all.imf.data = fromJSON("http://dataservices.imf.org/REST/SDMX_JSON.svc/Dataflow/")
str(all.imf.data)
#all.imf.data$Structure$Dataflows$Dataflow$Name[[2]] #for the catalogue of sources
catalogue1 = cbind(all.imf.data$Structure$Dataflows$Dataflow$KeyFamilyRef,
all.imf.data$Structure$Dataflows$Dataflow$Name[[2]])
catalogue1 = catalogue1[,-2] # catalogue of all the countries
data.structure = fromJSON("http://dataservices.imf.org/REST/SDMX_JSON.svc/DataStructure/IFS")
info1 = data.frame(data.structure$Structure$Concepts$ConceptScheme$Concept[,c(1,4)])
View(data.structure$Structure$CodeLists$CodeList$Description)
str(data.structure$Structure$CodeLists$CodeList$Code)
#Units
units = data.structure$Structure$CodeLists$CodeList$Code[[1]]
#Countries
countries = data.frame(data.structure$Structure$CodeLists$CodeList$Code[[3]])
countries = countries[,-length(countries)]
#Series Codes
codes = data.frame(data.structure$Structure$CodeLists$CodeList$Code[[4]])
codes = codes[,-length(codes)]
# all.imf.data # JSON from the starting point, provided on the website
# catalogue1 # data frame of all the data bases, International Financial Statistics, Government Financial Statistics, etc.
# codes # codes for the specific data sets (GDP, Current Account, etc).
# countries # data frame of all the countries and their ISO codes
# data.structure # large list, with starting URL and endpoint "IFS". Ideally, I want to find some data set somewhere within this data base.
"info1" # looks like parameters for retrieving the data (for instance, dates, units, etc).
# units # data frame that indicates the options for units
I would just like some advice on how to retrieve any data at all, something as simple as GDP (PPP) for a fixed year. I have been following an R-bloggers article (which retrieved data from the EU's database), but I cannot replicate the procedure for the IMF. I feel like I am close. Given that I have data frames containing the names of the databases, the series and the series codes, I think it is just a matter of figuring out how to construct the appropriate URL for getting the data, but I could be wrong.
The data frame codes holds, I presume, the codes for the data sets. Is there a way to make a call for, say, the US for BK_DB_BP6_USD, which is "Balance of Payments, Capital Account, Total, Debit, etc."? How should I go about doing this in the context of R?
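Based on the CompactData endpoint described in the IMF help article linked above, the request URL appears to be assembled from a database ID, a dotted key of the form {frequency}.{country}.{series}, and start/end dates. A sketch of what I would try; the frequency, date range and returned structure here are assumptions, not verified:
library(jsonlite)
# Hypothetical query: annual BK_DB_BP6_USD series for the US from IFS.
url <- paste0("http://dataservices.imf.org/REST/SDMX_JSON.svc/CompactData/",
              "IFS/A.US.BK_DB_BP6_USD?startPeriod=2010&endPeriod=2015")
result <- fromJSON(url)
# If the key is valid, observations should sit under CompactData$DataSet$Series.
obs <- result$CompactData$DataSet$Series$Obs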

How to get all resource record sets from Route 53 with boto?

I'm writing a script that makes sure that all of our EC2 instances have DNS entries, and all of the DNS entries point to valid EC2 instances.
My approach is to try and get all of the resource records for our domain so that I can iterate through the list when checking for an instance's name.
However, getting all of the resource records doesn't seem to be a very straightforward thing to do! The documentation for GET ListResourceRecordSets seems to suggest it might do what I want and the boto equivalent seems to be get_all_rrsets ... but it doesn't seem to work as I would expect.
For example, if I go:
import boto

r53 = boto.connect_route53()
zones = r53.get_zones()
fooA = r53.get_all_rrsets(zones[0].id, name="a")
then I get 100 results. If I then go:
fooB = r53.get_all_rrsets(zones[0].id, name="b")
I get the same 100 results. Have I misunderstood and get_all_rrsets does not map onto ListResourceRecordSets?
Any suggestions on how I can get all of the records out of Route 53?
Update: cli53 (https://github.com/barnybug/cli53/blob/master/cli53/client.py) can do this through its export feature, which dumps a Route 53 zone in BIND format (cmd_export). However, my Python skills aren't strong enough for me to understand how that code works!
Thanks.
get_all_rrsets returns a ResourceRecordSets object, which derives from Python's list class. By default 100 records are fetched per request, so if you treat the result as a plain list it will only contain 100 records. What you instead want to do is something like this:
r53records = r53.get_all_rrsets(zones[0].id)
for record in r53records:
    # do something with each record here
Alternatively, if you want all of the records in a list:
records = [r for r in r53.get_all_rrsets(zones[0].id)]
When iterating, either with a for loop or a list comprehension, boto will fetch additional records as needed, up to 100 at a time.
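For the instance-checking use case in the question, a minimal sketch of collecting every A record name for later lookups (record.name and record.type are boto Record attributes; the rest is hypothetical):
r53records = r53.get_all_rrsets(zones[0].id)
# Iterating the set makes boto page through the whole zone transparently.
a_names = {record.name.rstrip('.') for record in r53records if record.type == 'A'}
# ...then compare each EC2 instance's expected DNS name against a_names...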
This blog post from 2018 has a script which allows exporting in BIND format:
#!/usr/bin/env python3
# https://blog.jverkamp.com/2018/03/12/generating-zone-files-from-route53/
import boto3
import sys

route53 = boto3.client('route53')
paginate_hosted_zones = route53.get_paginator('list_hosted_zones')
paginate_resource_record_sets = route53.get_paginator('list_resource_record_sets')
domains = [domain.lower().rstrip('.') for domain in sys.argv[1:]]

for zone_page in paginate_hosted_zones.paginate():
    for zone in zone_page['HostedZones']:
        if domains and not zone['Name'].lower().rstrip('.') in domains:
            continue
        for record_page in paginate_resource_record_sets.paginate(HostedZoneId=zone['Id']):
            for record in record_page['ResourceRecordSets']:
                if record.get('ResourceRecords'):
                    for target in record['ResourceRecords']:
                        print(record['Name'], record['TTL'], 'IN', record['Type'], target['Value'], sep='\t')
                elif record.get('AliasTarget'):
                    print(record['Name'], 300, 'IN', record['Type'], record['AliasTarget']['DNSName'], '; ALIAS', sep='\t')
                else:
                    raise Exception('Unknown record type: {}'.format(record))
Usage example:
./export-zone.py mydnszone.aws
mydnszone.aws. 300 IN A server.mydnszone.aws. ; ALIAS
mydnszone.aws. 86400 IN CAA 0 iodef "mailto:hostmaster@mydnszone.aws"
mydnszone.aws. 86400 IN CAA 128 issue "letsencrypt.org"
mydnszone.aws. 86400 IN MX 10 server.mydnszone.aws.
The output can be saved to a file and/or copied to the clipboard; the Route 53 "Import zone file" page then lets you paste the data.
At the time of writing, the script works fine with Python 3.9.

Trouble getting a dynamodb scan to work with boto

I'm using boto to access a dynamodb table. Everything was going well until I tried to perform a scan operation.
I've tried a couple of syntaxes I found after repeated searches of the Internet, but no luck:
def scanAssets(self, asset):
    results = self.table.scan({('asset', 'EQ', asset)})
-or-
    results = self.table.scan(scan_filter={'asset': boto.dynamodb.condition.EQ(asset)})
The attribute I'm scanning for is called 'asset', and asset is a string.
The odd thing is the table.scan call always ends up going through this function:
def dynamize_scan_filter(self, scan_filter):
    """
    Convert a layer2 scan_filter parameter into the
    structure required by Layer1.
    """
    d = None
    if scan_filter:
        d = {}
        for attr_name in scan_filter:
            condition = scan_filter[attr_name]
            d[attr_name] = condition.to_dict()
    return d
I'm not a Python expert, but I don't see how this would work. That is, what kind of structure would scan_filter have to be to get through this code?
Again, maybe I'm just calling it wrong. Any suggestions?
OK, looks like I had an import problem. Simply using:
import boto
and referring to boto.dynamodb.condition doesn't cut it, because importing boto alone does not import its submodules. I had to add:
import boto.dynamodb.condition
to get the condition type picked up. My now working code is:
results = self.table.scan(scan_filter={'asset': boto.dynamodb.condition.EQ(asset)})
Not that I completely understand why, but it's working for me now. :-)
Or you can do this:
exclusive_start_key = None
while True:
    result_set = self.table.scan(
        asset__eq=asset,  # the scan filter is given here
        max_page_size=100,  # number of entries per page
        limit=100,
        # You can divide the table into n segments so processing can be done in parallel, quickly.
        total_segments=number_of_segments,
        segment=segment,  # specify which segment you want to process
        exclusive_start_key=exclusive_start_key  # to resume from the last key seen
    )
    dynamodb_items = list(result_set)  # materialise this page of items
    # Do something with your items here, or collect them for processing after the loop.
    exclusive_start_key = result_set._last_key_seen
    if not exclusive_start_key:
        break
This is applicable to any field.
Segmentation: suppose you have the above script in test.py. You can then run the segments in parallel, in different screens:
python test.py --segment=0 --total_segments=4
python test.py --segment=1 --total_segments=4
python test.py --segment=2 --total_segments=4
python test.py --segment=3 --total_segments=4
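For that to work, the script has to read those flags itself; a minimal sketch of the wiring using argparse (the flag names match the commands above, the variable names match the scan loop, the rest is hypothetical):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--segment', type=int, required=True,
                    help='index of the segment this process scans')
parser.add_argument('--total_segments', type=int, required=True,
                    help='total number of parallel segments')
args = parser.parse_args()

segment = args.segment
number_of_segments = args.total_segments
# ...then run the scan loop above with these two values...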