Getting auto value for some Infobox attributes - mediawiki

I am using the pywikibot API to fetch Wikipedia Infobox attributes. A few things I want to extract are population density, population, elevation, etc. For some cities, e.g. https://en.wikipedia.org/wiki/Beijing, the API returns "auto" as the value for keys like population_density_km2. For a few other cities I get the actual density instead of auto. Does anyone have any idea of the reasoning behind this, and how can I get the actual value?
import re
import pywikibot

en_wiki = pywikibot.Site('en', 'wikipedia')

def get_page(city: dict):
    """
    Returns the parsed Wikipedia page.
    """
    page = pywikibot.Page(en_wiki, re.search(r'wiki/(.*)', city['article']['value']).group(1))
    if page.pageid == 0:
        raise Exception('page does not exist')
    return page
def get_info_box_details(templates):
    """
    Get infobox details.
    """
    infobox_template = []
    for tmpl, params in templates:  # each entry is a (template name, params dict) pair
        if 'Infobox' in tmpl:
            infobox_template.append(params)
    population = {k: v for my_dict in infobox_template for k, v in my_dict.items() if 'population' in k}
    print(population)
wiki_page = get_page(city)
templates = wiki_page.raw_extracted_templates
info_box = get_info_box_details(templates)

From Template:Infobox_settlement's documentation:
To calculate density with respect to the total area automatically, type auto in place of any density value.
So, auto means the infobox template (MediaWiki) will calculate the value from the area and population values. As for the reasoning, I guess it reduces redundancy, which should make template maintenance less burdensome for human editors.
You can do the same in your Python program (calculate density by dividing population by area, similar to how the template does it), or you may want to scrape the data directly from the HTML output of the page (which will have its own challenges).
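For example, here is a minimal sketch of the first option in Python. The parameter names population_total and area_total_km2 are common Infobox settlement fields, but which keys an article actually uses (and how the numbers are formatted) varies, so treat them as assumptions:
def density_km2(params: dict):
    """Mimic the template's auto behaviour: population divided by area.

    Assumes the infobox provides 'population_total' and 'area_total_km2';
    real articles may use other parameter names or formatted values.
    """
    density = params.get('population_density_km2')
    if density and density != 'auto':
        return float(density.replace(',', ''))
    population = float(params['population_total'].replace(',', ''))
    area = float(params['area_total_km2'].replace(',', ''))
    return population / area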

Related

Get value of object in json file if var is equal to object name - python

I have a function that sees what card is in a player's hand and will add to their score depending on the card in their hand. I have all the card values stored in a JSON file. I have this code so far:
import json

with open("values.json") as values:
    value = json.load(values)
for i in range(0, len(hand)):
    card = hand[i]
values.json
{
"3Hearts": 3
}
If the card is 3Hearts, how could I get the 3 to be returned?
Or is there a better way to store the data?
I will admit I am not very familiar with JSON files. However, if the JSON file is not a necessity, you could just store the data in another .py file (Cards.py for example).
Also, because you are using Python, you would be better off making a Card class and creating Card objects.
This is what it would look like:
# Make the Card class
class Card:
    def __init__(self, name, number):
        self.name = name
        self.number = number

# Make Card objects
threehearts = Card("3Hearts", "3")
Here I used threehearts instead of 3Hearts because a variable name cannot start with a digit in Python. To compensate, I added an attribute Card.name where you can "name" the card "3Hearts" as you did in the question.
So assuming you are going to use that .py file to store your data, this is what I would propose:
# Import data here
from Cards import *

# Make the player's hand
hand = [threehearts]

# Display the number corresponding to the player's hand
for i in range(0, len(hand)):
    card = hand[i]
    print(card.number)
The output of this code will be:
3
You can also store hand = [threehearts] in the Cards.py file as well if you need to.
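That said, if the JSON file does stay, the lookup the question asks about is a plain dictionary access, because json.load turns the file into a dict. A minimal sketch, assuming the hand stores the same strings used as keys in values.json:
import json

with open("values.json") as values:
    value = json.load(values)  # e.g. {"3Hearts": 3}

hand = ["3Hearts"]
for card in hand:
    print(value[card])  # prints 3; value.get(card, 0) avoids a KeyError for unknown cards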

Getting alignment/attention during translation in OpenNMT-py

Does anyone know how to get the alignment weights when translating in OpenNMT-py? Usually the only output is the resulting sentences, and I have tried to find a debugging flag or similar for the attention weights. So far, I have been unsuccessful.
I'm not sure if this is a new feature, since I did not come across this when looking for alignments a few months back, but onmt seems to have added a flag -report_align to output word alignments along with the translation.
https://opennmt.net/OpenNMT-py/FAQ.html#raw-alignments-from-averaging-transformer-attention-heads
Excerpt from opennmt.net:
Currently, we support producing word alignment while translating for Transformer based models. Using -report_align when calling translate.py will output the inferred alignments in Pharaoh format. Those alignments are computed from an argmax on the average of the attention heads of the second to last decoder layer.
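For reference, a translate.py invocation with that flag might look like the following (the model and file names here are placeholders; the rest of your options will depend on your setup):
python translate.py -model model.pt -src src.txt -output pred.txt -report_align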
You can get the attention matrices. Note that it is not the same as alignment which is a term from statistical (not neural) machine translation.
There is a thread on GitHub discussing it. Here is a snippet from the discussion. When you get the translations from the model, the attentions are in the attn field.
import onmt
import onmt.io
import onmt.translate
import onmt.ModelConstructor
from collections import namedtuple

# Load the model.
Opt = namedtuple('Opt', ['model', 'data_type', 'reuse_copy_attn', 'gpu'])
opt = Opt("PATH_TO_SAVED_MODEL", "text", False, 0)
fields, model, model_opt = onmt.ModelConstructor.load_test_model(
    opt, {"reuse_copy_attn": False})

# Test data
data = onmt.io.build_dataset(
    fields, "text", "PATH_TO_DATA", None, use_filter_pred=False)
data_iter = onmt.io.OrderedIterator(
    dataset=data, device=0,
    batch_size=1, train=False, sort=False,
    sort_within_batch=True, shuffle=False)

# Translator
translator = onmt.translate.Translator(
    model, fields, beam_size=5, n_best=1,
    global_scorer=None, cuda=True)
builder = onmt.translate.TranslationBuilder(
    data, translator.fields, 1, False, None)

batch = next(data_iter)
batch_data = translator.translate_batch(batch, data)
translations = builder.from_batch(batch_data)
translations[0].attn  # <--- here are the attentions

Removing characters from column in pandas data frame

My goal is to (1) import Twitter JSON, (2) extract the data of interest, and (3) create a pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd

tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue
# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
# Create a data frame satisfying conditions:
df2 = df[(df['Lang']==('en')) & (df['Geo'].dropna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinates inside the squared brackets I tried using:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
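Note that if any rows have a missing Geo value (None), the comprehension above will raise a TypeError, which is presumably why the question filters with dropna; a guard such as this sketch would skip those rows:
df2["geo_coord0"] = [x['coordinates'][0] if x else None for x in df2['Geo']]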
See the Python documentation on data structures for more details on the data structures discussed above.

Groovy csv to string

I am using Dell Boomi to map data from one system to another. I can use Groovy in the maps but have no experience with it. I tried to do this with the other Boomi tools, but have been told that I'll need to use Groovy in a script. My inbound data is:
132265,Brown
132265,Gold
132265,Gray
132265,Green
I would like to output:
132265,"Brown,Gold,Gray,Green"
Hopefully this makes sense! Any ideas on the groovy code to make this work?
It can be elegantly solved with groupBy and the spread operator:
@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)
import org.apache.commons.csv.*

def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''

def parsed = CSVParser.parse(csv, CSVFormat.DEFAULT.withHeader('code', 'color'))
parsed.records.groupBy({ it.code }).each { k, v -> println "$k,\"${v*.color.join(',')}\"" }
The above prints:
132265,"Brown,Gold,Gray,Green"
Well, I don't know how you are getting your data, but here is a general way to achieve your goal. You can use a library, such as the one below, to parse the CSV.
https://github.com/xlson/groovycsv
The example for your data would be:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''
132265,Brown
132265,Gold
132265,Gray
132265,Green
'''
def data = parseCsv(csv)
I believe you want to associate the number with the various color values. So for each line you can create a map of the number and the colors associated with that number, splitting the line by ",":
map = [:]
for(line in data) {
    number = line.split(',')[0]
    colour = line.split(',')[1]
    if(!map[number])
        map[number] = []
    map[number].add(colour)
}
println map
So map should contain:
[132265:["Brown","Gold","Gray","Green"]]
Well, if it is not what you want, you can extract the general idea.
Assuming your data is coming in as a comma separated string of data like this:
"132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
The following Groovy script code should do the trick.
def csvString = "132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"

LinkedHashMap.metaClass.multiPut << { key, value ->
    delegate[key] = delegate[key] ?: []
    delegate[key] += value
}

def map = [:]
def csv = csvString.split().collect{ entry -> entry.split(",") }
csv.each{ entry -> map.multiPut(entry[0], entry[1]) }
def result = map.collect{ k, v -> k + ',"' + v.join(",") + '"' }.join("\n")
println result
Would print:
132265,"Brown,Gold,Gray,Green"
122222,"Red,White"
Do you HAVE to use scripting for some reason? This can be easily accomplished with out-of-the-box Boomi functionality.
Create a map function that prepends the ID field to a string of your choice (i.e. 222_concat_fields). Then use that value to set a dynamic process prop with that value.
The value of the process prop will contain the result of concatenating the name fields. Simply adding this function to your map should take care of it. Then use the final value to populate your result.
Well, it depends on how the data is coming in.
If the data which you have posted in the question is coming in a single document, then you can easily handle this in a map with groovy scripting.
If the data which you have posted in the question is coming in as multiple documents, i.e.
doc1: 132265,Brown
doc2: 132265,Gold
doc3: 132265,Gray
doc4: 132265,Green
In that case it cannot be handled in a map. You will need to use a Data Process step with custom scripting.
The Groovy code you are asking for depends on the input profile in which you are getting the data. Please provide more information, i.e. the input profile, fields, etc.

access leaves of json tree

I have a JSON file of the form:
{"id":442500000116137984, "reply":0, "children":[{"id":442502378957201408, "reply":0, "children":[]}]}
{"id":442500001084612608, "reply":0, "children":[{"id":442500145871990784, "reply":1, "children":[{"id":442500258421952512, "reply":1, "children":[]}]}]}
{"id":442500000258342912, "reply":0, "children":[{"id":442500636668489728, "reply":0, "children":[]}]}
In this format each line refers to a separate tree. Now I want to go to the leaves of every tree and do something; basically:
import json

f = open("file", 'r')
for line in f:
    tree = json.loads(line)
    # somehow walk through the tree and find leaves
    if isLeaf(child):
        print("Reached Leaf")
How do I walk through this tree object to detect all leaves?
This should work.
import json

f = open("file", 'r')
leafArray = []

def parseTree(obj):
    # A node with an empty "children" list is a leaf; otherwise recurse into each child.
    if len(obj["children"]) == 0:
        leafArray.append(obj)
    else:
        for child in obj["children"]:
            parseTree(child)

for line in f:
    leafArray = []  # reset the collected leaves for each tree
    tree = json.loads(line.strip())
    parseTree(tree)
    print("")
    for each in leafArray:
        print(each)
You know, I once had to deal with a lot of hypermedia objects out of JSON, so I wrote this library. The problem was that I didn't know the depths of the trees beforehand, so I needed to be able to search around and get what I called the "paths" (the set of keys/indices you would use to reach a leaf) and values.
Anyway, you can mine it for ideas (I wrote it only for Python3.3+, but here's the method inside a class that would do what you want).
The basic idea is that you walk down the tree checking each object you encounter, and if you get more dictionaries (even inside of lists), you keep plunging deeper. (I found it easiest to write this as a recursive generator, mostly by subclassing collections.MutableMapping and creating a class with a custom enumerate.)
You keep track of the path you've taken along the way and once you get a value that doesn't merit further exploration (it's not a dict or a list), then you yield your path and the value:
def enumerate(self, path=None):
    """Iterate through the PelicanJson object yielding 1) the full path to
    each value and 2) the value itself at that path.
    """
    if path is None:
        path = []
    for k, v in self.store.items():
        current_path = path[:]
        current_path.append(k)
        if isinstance(v, PelicanJson):
            yield from v.enumerate(path=current_path)
        elif isinstance(v, list):
            for idx, list_item in enumerate(v):
                list_path = current_path[:]
                list_path.append(idx)
                if isinstance(list_item, PelicanJson):
                    yield from list_item.enumerate(path=list_path)
                else:
                    yield list_path, list_item
        else:
            yield current_path, v
Because this is exclusively for Python3, it takes advantage of things like yield from, so it won't work out of the box for you (and I certainly don't mean to offer my solution as the only one). Personally, I just got frustrated with reusing a lot of this logic in various functions, so writing this library saved me a lot of work and I could go back to doing weird things with the Hypermedia APIs I had to deal with.
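The same recursive-generator idea works directly on the question's tree shape, with no library involved. A minimal sketch, assuming every node carries a "children" list as in the sample data (iter_leaves is a name invented here):
def iter_leaves(node):
    """Yield every leaf, i.e. every node whose "children" list is empty."""
    if not node["children"]:
        yield node
    else:
        for child in node["children"]:
            yield from iter_leaves(child)

# Usage:
# for leaf in iter_leaves(tree):
#     print(leaf["id"])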
You can do something like this (I don't know the syntax of Python):
temp = tree  # your JSON object from each line
while temp["children"]:
    temp = temp["children"][0]  # note: this follows only the first child at each level
Your temp will now be a leaf (of the leftmost branch only).