I am using a MediaWiki site as a personal Zettelkasten. A Zettelkasten is basically a collection of notes that should be linked to one another, which makes a wiki a good place to store one. The linking between the notes is the key feature of the Zettelkasten. So for each "note" (i.e., page on my wiki), I need a list of 1) how to get to that page and 2) where you can go from that page. The first part is easy, since I can use the built-in {{Special:Whatlinkshere/{{PAGENAME}}}}. However, I can't figure out how to create a similar list of forward links from each page. Is there a way to do this within MediaWiki, or an extension that can do this? What is the best way to gather a list of all (internal) links on a given wiki page?
If you install DynamicPageList3, you can use {{#dpl: linksfrom = {{FULLPAGENAME}} }}.
With Scribunto, you can define Module:Links with a function named inner:
local p = {}
function p.inner (frame)
    -- Expand templates in the page's wikitext so the links they produce are included:
    local wikitext = frame:preprocess (mw.title.new (frame.args [1]):getContent ())
    local link_set = {}
    -- Find all occurrences of [[...]]:
    for title in mw.ustring.gmatch (wikitext, '%[%[([^%]]+)%]%]') do
        -- Remove #... or |... (the extra parentheses drop gsub's second return value,
        -- which mw.text.trim would otherwise take as its charset argument):
        title = mw.text.trim ((mw.ustring.gsub (title, '[#|][^%]]*', '', 1)))
        if title ~= '' then
            link_set [title] = true
        end
    end
    local links = {}
    for link, _ in pairs (link_set) do
        links [#links + 1] = '[[' .. link .. ']]'
    end
    table.sort (links)
    return table.concat (links, ', ')
end
return p
and call it like this: {{#invoke:Links|inner|{{FULLPAGENAME}}}}. But this is expensive, and you may need to filter titles better if you have Semantic MediaWiki installed. There will also be synchronisation issues: the list of links will be one version behind the page it is in until the page is purged.
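For comparison, the link-extraction step of the module can be tried outside MediaWiki. This is only a rough Python sketch and not part of the answer: forward_links and the sample text are made-up names, and it looks at raw wikitext only, so links produced by templates are not seen.

```python
import re

# Matches the inside of [[...]]; the target may carry a #section or a |label.
WIKILINK = re.compile(r"\[\[([^\]]+)\]\]")

def forward_links(wikitext):
    """Return the sorted, de-duplicated link targets found in wikitext."""
    targets = set()
    for inner in WIKILINK.findall(wikitext):
        # Keep only the page title: drop everything from the first '#' or '|'.
        title = re.split(r"[#|]", inner, maxsplit=1)[0].strip()
        if title:  # skips same-page links such as [[#Top]]
            targets.add(title)
    return sorted(targets)

sample = "See [[Zettelkasten]], [[Note taking|notes]] and [[Zettelkasten#History]]. [[#Top]]"
print(forward_links(sample))  # prints ['Note taking', 'Zettelkasten']
```

Using a set mirrors the link_set table in the Lua module: a page that is linked several times is listed once.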
I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use a CSS attribute = value selector with the $ (ends with) operator.
If there can only be one occurrence per page
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
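If BeautifulSoup is not at hand, the same "href ends with" test can be sketched with the standard library's html.parser instead. This swaps the CSS selector for a plain str.endswith check. The markup in sample_html is an assumption (the question's snippet lost its <a> tags when it was pasted), and GenderLinkParser is a hypothetical helper name.

```python
from html.parser import HTMLParser

# The two link endings named in the question.
GENDER_SUFFIXES = ("typeid=19", "typeid=15")

class GenderLinkParser(HTMLParser):
    """Collect the text of every <a> whose href ends with a gender typeid."""

    def __init__(self):
        super().__init__()
        self.genders = []
        self._in_gender_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(GENDER_SUFFIXES):
            self._in_gender_link = True
            self.genders.append("")

    def handle_data(self, data):
        if self._in_gender_link:
            self.genders[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_gender_link = False

# Assumed markup: the gender word is the text of the typeid link.
sample_html = (
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> '
    '<a href="thread-1.html">first post title</a> '
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> '
    '<a href="thread-2.html">second post title</a>'
)

parser = GenderLinkParser()
parser.feed(sample_html)
print(parser.genders)  # prints ['女孩', '男孩']
```

Like the selector, this assumes typeid=19 or typeid=15 only ever appears at the end of the links of interest.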
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]
I'm using MediaWiki at work and creating a Knowledge Base. We've got everything set up but one requirement is to have a unique identifier on each page, then it can be referenced in official documentation. I've done this by using the magic word {{PAGEID}} so it's added to the bottom right of each page.
Another requirement is to be able to find the page based on this unique number but when using the built in search function the page can't be found.
For example, the main page has the text "Page ID:1" in the bottom right corner. When doing a search for "Page ID:1" nothing can be found and the Wiki only gives me the option to create the page.
Does anyone know how you can either search on, or have the search include the Page ID?
Any help would be appreciated.
global $wgHooks;
$wgHooks['SearchGetNearMatchBefore'][] = function ( array $allSearchTerms, &$titleResult ) {
    $searchTerm = $allSearchTerms[0];
    // Only handle terms of the exact form id:<digits>
    if ( preg_match( '/^id:\d+$/', $searchTerm ) ) {
        $pageId = (int)substr( $searchTerm, 3 );
        $titleResult = Title::newFromID( $pageId );
        return false;
    }
};
This will jump to the page with ID 123 when you enter id:123 in the search box. Seems like a silly way to use search though.
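To see which search terms the hook's pattern accepts, here is the same check sketched in Python. page_id_from_search is a hypothetical helper, not a MediaWiki function. Note that the pattern is anchored and lowercase, so "Page ID:1" from the question would not match; only "id:1" would.

```python
import re

def page_id_from_search(term):
    """Return the numeric page ID for terms like 'id:123', else None."""
    # Whole-string match, like the ^...$ anchors in the PHP pattern.
    if re.fullmatch(r"id:[0-9]+", term):
        return int(term[3:])  # drop the leading "id:"
    return None

print(page_id_from_search("id:123"))     # prints 123
print(page_id_from_search("Page ID:1"))  # prints None
```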
I have the following mini basic spider I use to get all links from a website.
from scrapy.item import Field, Item
# scrapy.contrib.* was removed in later Scrapy releases; these are the current import paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SampleItem(Item):
    link = Field()

class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
I was wondering whether it would be possible to have this same spider also scrape some HTML (like the snippet below) from these same links, and list the link and the info in a CSV in two separate columns?
<span class="price">50,00 €</span>
Yes, that's possible of course. First of all you need to use a feed export. This can be set in the settings.py with the options:
FEED_FORMAT = 'csv'
FEED_URI = 'file:///absolute/path/to/the/output.csv'
Then you will have to adjust your items to allow more elements. Currently, you only use the link. You will want to add a price field.
class SampleItem(Item):
link = Field()
price = Field()
One sidenote: Usually we define items in the items.py file, because generally multiple spiders should scrape the same type of item from several pages. You would then import them into your spider using from scrapername.items import SampleItem. An example application for this would be a price scraper which scrapes both Amazon and some smaller shops.
Finally, you will have to adjust the parse_page method of your spider. Currently you only save the URL into your item. You want to find the price and also save it. Finding numbers or texts on a page is a key element of scraping. For this purpose we have selectors. Scrapy supports XPath, CSS and regular expression selectors. The first two are especially useful, because they can be nested. Regular expressions would generally be used once you have found the correct HTML element, but there is too much information within that one element.
A problem you might encounter is that a page might have multiple .price elements. Have you made sure there is only one? Otherwise the selector will give you all of them and you might have to refine your selector using additional tags.
So, let's assume there is only this one .price element and construct our selector. We use CSS selector here, because it's more intuitive in this case. You can call the selectors directly on the response using css and xpath methods. Both of them always return elements on which you might use css() and xpath() again. To get the textual representation you need to call extract() on them. This might be annoying at the beginning, but nesting selectors is very convenient. Note that the selectors give you the full HTML element including the tag. To only get the text content, you need to make this explicit. For CSS selectors via ::text, for XPath via /text().
def parse_page(self, response):
    item = SampleItem()
    item['link'] = response.url
    try:
        item['price'] = response.css('.price::text')[0].extract()
    except IndexError:
        # do whatever is best if price cannot be found
        item['price'] = None
    return item
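The answer mentions regular expressions for the case where the element holds more than you need. As a rough sketch of that step, assuming the extracted text looks like the question's "50,00 €" (comma as decimal mark, optional "." thousands separators), the number could be pulled out like this. parse_price is a hypothetical helper, not part of Scrapy.

```python
import re
from decimal import Decimal

# Assumed format: digits, optional '.' thousands groups, ',' and two decimals.
PRICE = re.compile(r"(\d+(?:\.\d{3})*),(\d{2})")

def parse_price(text):
    """Return the price in text as a Decimal, or None if none is found."""
    match = PRICE.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(".", "")  # strip thousands separators
    return Decimal(f"{whole}.{match.group(2)}")

print(parse_price("50,00 €"))     # prints 50.00
print(parse_price("1.250,50 €"))  # prints 1250.50
```

In the spider this would be applied to the string returned by extract(), before storing it in item['price'].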
Is it possible to create a wiki page, where you mark a single piece of text as a placeholder which can be put anywhere else on the wiki?
Let's say I have a wiki page containing a simple list. The first item in list must be always shown in the Main Page but the editing user should not edit two pages for that, just one page.
The list page:
Pineapples
{{SaveThisText|TodaysMeal|Dumplings}}
Beans
Oranges
Main Page:
Today, we'll have {{GetSavedText|TodaysMeal}}
...Main Page will result to "Today, we'll have Dumplings"
I know that it is possible to do this using templates but I want to avoid them, I want to edit the template like it's a part of page.
You can do this without writing any custom PHP, see:
http://www.mediawiki.org/wiki/Extension:Variables
This is definitely possible if you write a MediaWiki extension for it. This means that you could place a hook on GetSavedText and SaveThisText so their behaviour can be customized.
If you have a small wiki, you could just cycle through every page on each occurrence of GetSavedText and search for {{SaveThisText|TodaysMeal|. Getting every page is easy:
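// get existing pages
global $wgDBprefix;
$db = wfGetDB ( DB_MASTER );
$results = $db->resultObject ( $db->query(
    "select distinct page_title from {$wgDBprefix}page " )
);
$existing_pages = array();
// the braces are needed so that every statement runs once per row
while ( $r = $results->next() ) {
    $title = Title::newFromText( $r->page_title );
    $article = new Article ( $title );
    $content = $article->getContent();
    // keep the wikitext so it can be searched for SaveThisText afterwards
    $existing_pages[$r->page_title] = $content;
}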
A more efficient approach would be to place a hook on the update of a page. If SaveThisText is present, you could update a line in a database table.
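To make the idea concrete, here is a small Python sketch of the lookup itself, independent of MediaWiki. It assumes the page texts are already in memory as a dict (pages is made-up sample data built from the question's example), and collect_saved_texts and render are hypothetical names. In the hook-based variant, the saved dict would be the database table that is updated whenever a page is saved.

```python
import re

# {{SaveThisText|key|value}} and {{GetSavedText|key}} as used in the question.
SAVE = re.compile(r"\{\{SaveThisText\|([^|{}]+)\|([^{}]*)\}\}")
GET = re.compile(r"\{\{GetSavedText\|([^|{}]+)\}\}")

def collect_saved_texts(pages):
    """Scan every page and map each saved key to its text."""
    saved = {}
    for text in pages.values():
        for key, value in SAVE.findall(text):
            saved[key.strip()] = value.strip()
    return saved

def render(text, saved):
    """Show saved text in place and fill in every GetSavedText placeholder."""
    text = SAVE.sub(lambda m: m.group(2).strip(), text)
    return GET.sub(lambda m: saved.get(m.group(1).strip(), ""), text)

pages = {
    "Meals": "Pineapples\n{{SaveThisText|TodaysMeal|Dumplings}}\nBeans\nOranges",
    "Main Page": "Today, we'll have {{GetSavedText|TodaysMeal}}",
}
saved = collect_saved_texts(pages)
print(render(pages["Main Page"], saved))  # prints Today, we'll have Dumplings
```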
Anyone know of any good classes or functions that will do this? I've found some regexes but what I need is to pass the string to a method and have it return the same string, but with urls turned blue and turned into hyperlinks. Seems like a fairly common task, but I can't find anything.
EDIT - the following works for any link starting with http:
var myPattern:RegExp = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
var str = text.replace(myPattern, "<font color='#04717D'><a target='_blank' href=\"$&\">$&</a></font>");
field.htmlText = str;
But it doesn't work for links that start with "www", because the href ends up looking like this:
www.google.com
Would love to know how to fix that.
I'm wary of making the existing regular expression/replacement call any more complicated. With that in mind, the most straightforward way of doing this is probably to write a second regular expression to correct any bad tags in the output from the first. I'd also add a 'g' flag to the end of your main regular expression so that it captures multiple URLs in the text.
So, your main regular expression would now look like this:
var mainPattern:RegExp = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/ig;
Your secondary regular expression will look something like this:
var secondaryPattern:RegExp = /\"www/g;
It should capture any links whose href starts with "www" rather than "http:".
You then run both these expressions over your input string replacing as necessary:
var someText:String = "This is some text with a link in it www.stackoverflow.com and also another link http://www.stackoverflow.com/questions/5239966/as3-detect-urls-in-dynamic-text-and-make-them-links";
someText = someText.replace(mainPattern, "<a target='_blank' href=\"$&\">$&</a>");
someText = someText.replace(secondaryPattern, "\"http://www");
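As an alternative to the second pass, the prefix can be added while replacing, by using a function as the replacement. The sketch below shows the idea in Python rather than ActionScript, with a deliberately simplified URL pattern that is an assumption and not the regular expression from the question. linkify is a made-up name.

```python
import re

# Simplified for illustration: anything starting with http(s):// or www.
# Trailing punctuation is not stripped and the text is not HTML-escaped.
URL = re.compile(r"\b(?:https?://|www\.)[^\s<>\"]+", re.IGNORECASE)

def linkify(text):
    """Wrap each URL in an anchor; www links get http:// in the href only."""
    def to_anchor(match):
        url = match.group(0)
        has_scheme = url.lower().startswith(("http://", "https://"))
        href = url if has_scheme else "http://" + url
        return f"<a target='_blank' href=\"{href}\">{url}</a>"
    return URL.sub(to_anchor, text)

print(linkify("see www.google.com now"))
```

The visible link text stays as typed, and only the href gets the scheme, so no clean-up pass over the generated HTML is needed.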