MediaWiki API: How do I list all pages of a category and, for each page, show all of its categories? - mediawiki

I am using the following MediaWiki API module to list all pages in a certain category: https://www.mediawiki.org/wiki/API:Categorymembers
E.g. https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Physics
This gives me a list of pages with title and id, but I would also like to see all categories for each page. However, it seems that there is no cmprop value for that.
cmprop: Which properties to get. (Default: ids|title)
ids: Page ID
title: Page title
sortkey: The sortkey used for sorting in the category (hexadecimal string)
sortkeyprefix: The sortkey prefix used for sorting in the category (human-readable part of the sortkey) 1.17+
type: Type that the page has been categorised as (page, subcat or file) 1.17+
timestamp: Time and date the article was added to the category
I have considered querying each page with prop=categories to get its categories, but that would mean a very large number of queries. Is there a better way of doing this?

You can use categorymembers as a generator. If you do that, you can then apply prop=categories:
https://en.wikipedia.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Physics&prop=categories&cllimit=max&gcmlimit=max
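A minimal Python sketch of the same generator query, in case it helps. It only builds the request URL and merges a response into a title-to-categories mapping; it makes no network call. The helper names (build_query_url, merge_categories) and the sample response are hypothetical and hand-written for illustration, with made-up page ids and continuation values, and format=json is added as an assumption so the result is machine-readable.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(category, continuation=None):
    """Build the generator=categorymembers URL (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "json",  # assumption: JSON output is wanted
        "generator": "categorymembers",
        "gcmtitle": category,
        "gcmlimit": "max",
        "prop": "categories",
        "cllimit": "max",
    }
    # To fetch the next batch, send back every key of the "continue" object.
    params.update(continuation or {})
    return API_URL + "?" + urlencode(params)


def merge_categories(response, result):
    """Merge one API response into a {page title: [category titles]} dict."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        titles = [cat["title"] for cat in page.get("categories", [])]
        result.setdefault(page["title"], []).extend(titles)
    return result


# Hand-written sample in the shape of a JSON response (values are made up).
sample = {
    "continue": {"clcontinue": "123|Foo", "continue": "||"},
    "query": {"pages": {
        "1": {"pageid": 1, "ns": 0, "title": "Physics",
              "categories": [{"ns": 14, "title": "Category:Physics"}]},
        "2": {"pageid": 2, "ns": 0, "title": "Outline of physics"},
    }},
}

result = merge_categories(sample, {})
next_url = build_query_url("Category:Physics", sample.get("continue"))
print(result)
print(next_url)
```

A page's categories can be split across several batches, which is why the sketch extends an existing list instead of overwriting it. Repeat the request with the returned continue values until the response no longer contains a "continue" key.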

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
It sounds like you could use an attribute = value CSS selector with the $= (ends with) operator.
If there can be only one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This assumes that typeid=19 or typeid=15 occurs only at the end of the href values of interest. The comma between the two selectors means that either one may match.
You could additionally handle the possibility of the link not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
# run the selector once and fall back to a default when nothing matches
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values:
# select() returns every match; select_one() returns only the first
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
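For reference, here is a self-contained sketch of the multiple-values case. The snippet quoted in the question no longer contains its <a> tags, so the markup below is hypothetical: it assumes each gender label sits inside a link whose href ends in typeid=19 or typeid=15, as the question describes. The third link is an invented example of one that must not match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these anchors are an assumption about the page.
html = '''
<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19"><em>[女孩]</em></a>
<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15"><em>[男孩]</em></a>
<a href="forum.php?mod=viewthread&amp;tid=1">some other link</a>
'''

# Match links whose href ends with either typeid value.
GENDER_SELECTOR = "a[href$='typeid=19'], a[href$='typeid=15']"

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of tags, so get_text() is called on each item;
# calling it on the list itself raises the ResultSet error from the question.
genders = [a.get_text(strip=True).strip("[]") for a in soup.select(GENDER_SELECTOR)]
print(genders)
```

The same applies to find_all(): it returns a ResultSet, so iterate over it instead of calling get_text() on the result directly.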
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Cesium - Modify infobox contents

I have n polygons with ids "test-1-1", "test-1-2", ..., "test-1-n" which together represent a single logical entity. The id format can be generalized as <entity_name>-<entity_id>-<i>, where i distinguishes the ids of the individual polygons.
I want to display only "test" when any of these polygons is clicked. Currently, the id of the selected polygon is displayed in the InfoBox.
Is there a Cesium way to do this? I would prefer not to manipulate the strings at runtime.
A Cesium Entity has three fields of interest to the InfoBox (the thing that pops up when an Entity is selected).
entity.id - Each entity in a dataSource is required to have a unique id (a GUID will be auto-generated if no ID is supplied at creation). It is an arbitrary string and does not need to be human-friendly.
entity.name - This is the human-friendly name of the Entity. It does not need to be unique; you may have as many duplicate names as you like. It is half a line or less of plain text (not HTML).
entity.description - This is a sandboxed HTML description of the entity, and can span multiple paragraphs or include tables and other styling.
The InfoBox will attempt to show entity.name on its title bar by default, and will only fall back to show entity.id in the title bar if name is missing (because name is optional, id is not).
The body of the InfoBox only appears below the title bar if entity.description is set (otherwise only the bar is shown). The description is rendered with a sandboxed iframe (to offer some resistance to cross-site scripting for apps that display user-supplied entity descriptions).
I have n polygons with ids "test-1-1", "test-1-2" .... "test-1-n" ...
For this case, I would keep the existing ids, and set name to be the string you wish to see in the InfoBox popup. Multiple entities can have the same name but not the same id.

Semantic mediawiki #ask query: Displaying nested properties on the same query

I would like to display, in the same query, properties of a page that is related to the pages I'm querying for.
Let's say I would like to query all the pages in the City category, which are located in Germany, and I want to display the title of the page, but also I want to display the surface data of Germany, for example.
Something like this: {{#ask: [[Category:City]] [[location::Germany]] |?mainlabel |?Location.surface }}
I know this won't work, but you can see what I want to achieve.
I'm not sure if there's a way to nest queries directly inside other queries. The normal method of doing it is using a template. So you might define a template (or a subpage of the template, if this is going into a template) called {{tablerow}} that consists of:
<includeonly>
|- valign="top"
| [[{{{1|}}}]]
| {{#show: {{{1|}}} | ?surface }}</includeonly>
The <includeonly> tags are important for reasons I don't really understand; leaving them out sometimes produces errors. Then you just run an #ask query with format = template. (You can build the header into the query, but I find it simpler to just put it outside.)
{| class="wikitable smwtable sortable"
|- valign="bottom"
! [[City]]
! [[Surface]]
{{#ask: [[Category:City]] [[location::Germany]]
| format = template
| template = tablerow
| link = none
}}
|}
That will punch each result returned by the query through the template as {{{1}}} and generate a row based on it. If you have other data to return from the main query, additional properties that you ask for will come out as consecutive unnamed parameters (so if you include | ?population, that will go into the template as {{{2}}} and will need to be added to the row structure or it will be dropped).

How to control column headings in the NEW/EDIT views based on the CATEGORY selected from a drop-down list. Ruby on Rails w/ MYSQL

Two models:
category has_many: components
component belongs_to: category
The CATEGORY table defines variable names for different component types:
TYPE, VAR1, VAR2, VAR3, ...
Insulator, Voltage, Height, Material, ...
Current Transformer, Voltage, Ratio, Indoor, ...
In the NEW/EDIT views for the COMPONENT model, the user will first select the CATEGORY from a drop-down list. Based on the CATEGORY selected, the column headings and field labels in the form(s) need to update dynamically to show the variable names associated with the selected CATEGORY.
i.e. IF the user selects CATEGORY = Insulator THEN the field labels for VAR1 ... VAR3 are Voltage, Height, Material, etc.
I assume this will be controlled in the _form.html.erb of a typical scaffold. I am looking for a recommended technique.
Thanks in advance for any information.
Changing a form in response to a user selecting a different option in a select tag is probably best done with JavaScript. This allows for execution on the client side, which will be faster than a trip back to the server.
I would recommend placing the different form fields inside div tags that are hidden when the page loads. The fields for each category can then be toggled into view by binding to the JavaScript onchange event on the select tag.
Here is more information on the onchange event: http://www.w3schools.com/jsref/event_onchange.asp

Handling Multiple Images with ColdFusion and MySQL

This is an architecture question, but its solution lies in ColdFusion and MySQL structure--or at least I believe so.
I have a products table in my database, and each product can have any number of screen-shots. My current method to display product screen-shots is the following:
I have a single folder where all screen-shots associated with all products are contained. All screen-shots are named with their productID from the database, plus a suffix.
For example: Screen-shots of a product whose productID is 15 are found in the folder images, with the names 15_screen1.jpg, 15_screen2.jpg, etc...
In my ColdFusion page I have hard-coded the image path into the HTML (images/); the image name is broken into two parts: part one is dynamically generated using the productID from the query, and part two is a suffix, which is hard-coded. For example:
<img src="/images/#QueryName.productID#_screen1.jpg">
<img src="/images/#QueryName.productID#_screen2.jpg"> etc...
This method works, but it has several limitations, the biggest of which are listed below:
I have to hard-code the exact number of screen-shots in my HTML template. This means the number of screen-shots I can display will always be the same. This does not work if one product has 10 screen-shots and another has 5.
I have to hard-code image suffixes into my HTML. For example, I can have up to five types of screen-shots associated with one product: productID=15 may have 15_screen1.jpg, 15_screen2.jpg, and 15_FrontCover.jpg, 15_BackCover.jpg, and 15_Backthumb.jpg, etc...
I thought about creating a paths column in my products table, but that meant creating several hundreds of folders for each product, something that also does not seem efficient.
Does anyone have any suggestions or ideas on the correct method to approach this problem?
Many thanks!
How about...
use an Image table, one product to many images (with optional sortOrder column?), and use imageID as the jpeg file name?
update:
Have an ImageClass table, many Image to one ImageClass.
Image
-----
ID
productID
imageClassID (FK to ImageClass)
Use back-end business logic to enforce that some classes can have only one image.
Or... if you really want the database to enforce that some classes can have only one image, then you can go for a more complex design:
Product
------
ID
name
...
frontCoverImageID
backCoverImageID
frontThumbImageID
backThumbImageID
Image
-----
ID
productID
isScreenShot (bit) // optional, but easier to query later...
However, I like the first one better, since you can add as many classes as you see fit later without refactoring the DB.
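As a rough illustration of the first design (Image plus ImageClass), here is a small Python sketch. It uses an in-memory SQLite database in place of MySQL so that it runs on its own. The name column on ImageClass, the sortOrder column, the sample rows, the /images/ path, and the image_paths helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
CREATE TABLE ImageClass (
    ID   INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- assumed column, e.g. 'screen'
);
CREATE TABLE Image (
    ID           INTEGER PRIMARY KEY,
    productID    INTEGER NOT NULL,
    imageClassID INTEGER NOT NULL REFERENCES ImageClass(ID),
    sortOrder    INTEGER NOT NULL DEFAULT 0   -- optional display order
);
""")

# Made-up sample data: product 15 has two screen-shots and a front cover.
conn.executemany("INSERT INTO ImageClass (ID, name) VALUES (?, ?)",
                 [(1, "screen"), (2, "frontCover")])
conn.executemany(
    "INSERT INTO Image (ID, productID, imageClassID, sortOrder) VALUES (?, ?, ?, ?)",
    [(101, 15, 1, 1), (102, 15, 1, 2), (103, 15, 2, 0), (104, 16, 1, 1)],
)


def image_paths(conn, product_id, class_name):
    """Return image paths for one product and class, using imageID as file name."""
    rows = conn.execute(
        "SELECT Image.ID FROM Image "
        "JOIN ImageClass ON ImageClass.ID = Image.imageClassID "
        "WHERE Image.productID = ? AND ImageClass.name = ? "
        "ORDER BY Image.sortOrder, Image.ID",
        (product_id, class_name),
    ).fetchall()
    return [f"/images/{image_id}.jpg" for (image_id,) in rows]


screens = image_paths(conn, 15, "screen")
print(screens)
```

With this layout the template loops over whatever rows come back, so a product with 10 screen-shots and one with 5 need no changes to the HTML.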
Keeping the information about how many images there are, and which ones, in the database is definitely the way to go.
Barring that, if you want to use naming conventions to associate images with products, and the number of images is arbitrary, then it's probably a better idea to create one folder per product:
/images/products/{SKU1}/frontview.jpg
/images/products/{SKU1}/sideview.jpg
/images/products/{SKU2}/frontview.jpg
and so forth. Then use <cfdirectory> to collect the images for a given product. You might also want to name your images 00_frontview.jpg, 01_sideview.jpg and such so that you can sort and control what order they'll display on the page.
Use the cfdirectory tag to inspect the filesystem:
<!--- get a query resultset of images in filesystem; expandPath() turns the web path into an absolute filesystem path --->
<cfdirectory action="list" name="images" directory="#expandPath('/images')#">
<!--- get images for specific product --->
<cfquery name="productImages" dbtype="query">
select *
from images
where name like '#productid#%'
</cfquery>
<!--- build src from the web path; the directory column holds the filesystem path --->
<cfoutput query="productImages">
<img src="/images/#productImages.name#" />
</cfoutput>
You could even try using the filter attribute of cfdirectory to avoid the query of queries (QoQ) altogether.