Match pages corresponding to Wikipedia content types from the Wikipedia SQL dumps - MediaWiki

Referring to the description:
Wikipedia:Contents categorises types of articles in Wikipedia.
https://en.wikipedia.org/wiki/Wikipedia:Contents
I want to extract all of the articles belonging to particular content types, such as Outlines and Lists. These articles are in the same namespace as normal articles, so filtering pages by namespace did not work.
I looked at info at:
https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download
and at the content and content models tables in:
https://www.mediawiki.org/wiki/Manual:Database_layout
https://www.mediawiki.org/wiki/Manual:Content_models_table
but I could not find a way to solve the problem.
How could I extract the page IDs, or titles, of the pages that belong to content types such as Outline and List, as mentioned in https://en.wikipedia.org/wiki/Wikipedia:Contents ?
Which dumps contain such info?

I've used this query at Quarry which I think does what you're asking:
SELECT page_title, page_id
FROM page
WHERE (page_title LIKE 'List_%'
OR page_title LIKE 'Outline_%')
AND page_is_redirect = 0
AND page_namespace = 0
You can see the results here: https://quarry.wmcloud.org/query/71439
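One caveat about the LIKE patterns in that query: in SQL, `_` is itself a wildcard matching any single character, so 'List_%' also matches titles such as Listeria. The following is a minimal sketch of the difference, using Python's built-in sqlite3 and a made-up stand-in for the page table (the sample titles and the explicit ESCAPE clause are assumptions for illustration; in MySQL/MariaDB the backslash is the default LIKE escape character, so writing the underscore as \_ should have the same effect).

```python
import sqlite3

# Hypothetical stand-in for the MediaWiki page table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER, page_title TEXT, "
    "page_namespace INTEGER, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?)",
    [
        (1, "List_of_rivers", 0, 0),
        (2, "Listeria", 0, 0),
        (3, "Outline_of_physics", 0, 0),
        (4, "Outlines", 0, 0),
        (5, "List_of_lakes", 0, 1),  # a redirect, excluded by the query
    ],
)


def titles(list_pattern, outline_pattern):
    """Run the query from the answer with the given LIKE patterns."""
    sql = """
        SELECT page_title FROM page
        WHERE (page_title LIKE ? ESCAPE '\\'
               OR page_title LIKE ? ESCAPE '\\')
          AND page_is_redirect = 0
          AND page_namespace = 0
        ORDER BY page_id
    """
    return [row[0] for row in conn.execute(sql, (list_pattern, outline_pattern))]


# Unescaped underscore: any character may follow "List" / "Outline".
loose = titles("List_%", "Outline_%")
# Escaped underscore: a literal "_" must follow.
strict = titles("List\\_%", "Outline\\_%")

print(loose)
print(strict)
```

Whether the extra matches matter depends on how many unrelated titles happen to start with "List" or "Outline" in the real table.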

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are simplified from a Pixiv user page. Each /div/div[2]/a element has a title string, so they are artwork titles.
I want to use a single expression to fetch all of the above a elements in a WebExtension called PageProbe. Although I've tried a number of approaches, none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here, you can simply use this XPath:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes, such as class names.
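In a conforming XPath engine, a step such as li without an index selects every matching li, so the expression should return one a element per list item. A minimal sketch that checks this with Python's xml.etree.ElementTree, which supports a subset of XPath (the markup is a made-up stand-in for the simplified structure in the question, and PageProbe's own behaviour may differ):

```python
import xml.etree.ElementTree as ET

# Made-up markup mirroring the simplified structure from the question.
markup = """
<div>
  <ul>
    <li><div><div>thumb</div><div><a title="First">First</a></div></div></li>
    <li><div><div>thumb</div><div><a title="Second">Second</a></div></div></li>
    <li><div><div>thumb</div><div><a title="Third">Third</a></div></div></li>
  </ul>
  <a title="Unwanted">elsewhere</a>
</div>
"""

root = ET.fromstring(markup)

# ElementTree paths are relative to the element they are called on,
# so "/div/ul/..." becomes "./ul/..." when starting from the root div.
links = root.findall("./ul/li/div/div[2]/a")
titles = [a.get("title") for a in links]

print(titles)
```

If a tool returns only the first match for the same expression, it may be evaluating the XPath in "first node" mode rather than as a node set.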

Any conventional standards for storing OCR data/metadata in JPEG images?

I want to organize a collection of scanned documents (receipts, bank statements, etc.) by adding their metadata and text content (OCR'ed) into the same jpeg files. Is there any more or less commonly accepted way of storing such data? Any commonly used schemas?
For metadata, for example, I found the Dublin Core scheme, but most of the fields I want are not there, and I'm not sure what a good way to add custom fields is. Can I just use them as if they existed in the DC or XMP scheme (i.e. <dc:myfield>myvalue</dc:myfield> or <xmp:myfield>myvalue</xmp:myfield>), or do I have to define my own scheme by adding xmlns:myScheme="http://myScheme.uri" and then use it as <myScheme:myfield>myvalue</myScheme:myfield> ?
Also, in all the examples I found, this data is stored inside <rdf:Description> which is inside <rdf:RDF> which is inside <x:xmpmeta> - is it a standard requirement? I don't see it in the XMP specification for storage in files...
For now, based on the examples, I plan to embed something like this:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='MyTool v 0.0.1'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:myDoc='http://some.custom.uri/'>
<dc:format>image/jpeg</dc:format>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:originalfilename>20190519121225_003.jpg</myDoc:originalfilename>
<myDoc:originalimagewidth>1684</myDoc:originalimagewidth>
<myDoc:originalimageheight>2788</myDoc:originalimageheight>
<myDoc:langOCR>EN-US</myDoc:langOCR>
<myDoc:acquisitiondatetime>2019-05-19T12:12:25Z</myDoc:acquisitiondatetime>
<myDoc:documentdate>2019-01-02</myDoc:documentdate>
<myDoc:pagesindocument>6</myDoc:pagesindocument>
<myDoc:page>2</myDoc:page>
<myDoc:textcontent>
Bank
statement
02/01/2019
Page 2 of 6
( Here goes raw OCR content
as multiline text )
</myDoc:textcontent>
<dc:subject>
<rdf:Bag>
<rdf:li>bank</rdf:li>
<rdf:li>statement</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Does it make sense at all? I'm sure many people have already worked on similar tasks, and I don't want to reinvent the wheel...
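As a quick sanity check of the custom-namespace variant, a packet like the one above can be parsed with any namespace-aware XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the packet; it only shows that the structure is well-formed and that the custom fields can be read back, and says nothing about what the XMP specification requires or how other tools will treat the myDoc namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the packet from the question (same namespaces, fewer fields).
packet = """<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/'>
 <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <rdf:Description rdf:about=''
    xmlns:dc='http://purl.org/dc/elements/1.1/'
    xmlns:myDoc='http://some.custom.uri/'>
   <myDoc:doctype>scan</myDoc:doctype>
   <myDoc:page>2</myDoc:page>
   <dc:subject>
    <rdf:Bag>
     <rdf:li>bank</rdf:li>
     <rdf:li>statement</rdf:li>
    </rdf:Bag>
   </dc:subject>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>"""

# Prefix-to-URI map used only for lookups; the URIs must match the packet.
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "myDoc": "http://some.custom.uri/",
}

root = ET.fromstring(packet)
page = root.find(".//myDoc:page", ns).text
doctype = root.find(".//myDoc:doctype", ns).text
subjects = [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", ns)]

print(doctype, page, subjects)
```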

SQL query to find columns with more than one string "a href" in them

I'm organizing articles in a big database and I face a problem - I need to find all articles with two or more links in them.
Every link is an HTML link and has the form .... How do I SELECT from the article database all articles that have at least two a href in them?
I was taught how to select one a href, but two?
SELECT * FROM `Articles5` WHERE
content LIKE "%a href%"
How to double this?
Have you tried using your own pattern, but twice?
SELECT * FROM `Articles5` WHERE
content LIKE "%a href%a href%"
First you can get all articles WHERE content LIKE "%a href%" and put them in a temporary table. Then replace that value and look for "a href" once more.
P.S. For this purpose I advise you to use full-text search.
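The "replace and look again" idea can also be turned into a count: remove every occurrence of the substring and compare string lengths before and after. A minimal sketch using Python's built-in sqlite3 with made-up rows (the table contents are hypothetical; LENGTH and REPLACE also exist in MySQL, though MySQL's LENGTH counts bytes, so CHAR_LENGTH may be the safer choice there):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles5 (id INTEGER, content TEXT)")
# Made-up sample rows with zero, one, two and three links.
conn.executemany(
    "INSERT INTO Articles5 VALUES (?, ?)",
    [
        (1, "No links here."),
        (2, 'One <a href="/x">link</a>.'),
        (3, '<a href="/x">Two</a> and <a href="/y">links</a>.'),
        (4, '<a href="/x">1</a> <a href="/y">2</a> <a href="/z">3</a>'),
    ],
)

needle = "a href"
# Occurrences = (characters removed by REPLACE) / (length of the substring).
sql = """
    SELECT id,
           (LENGTH(content) - LENGTH(REPLACE(content, ?, ''))) / LENGTH(?) AS n_links
    FROM Articles5
    WHERE (LENGTH(content) - LENGTH(REPLACE(content, ?, ''))) / LENGTH(?) >= 2
    ORDER BY id
"""
result = conn.execute(sql, (needle, needle, needle, needle)).fetchall()

print(result)
```

Unlike LIKE "%a href%a href%", this form also gives the number of links, so the threshold can be changed without rewriting the pattern.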

Maintaining K2 module formatting in IntroText

So basically I've installed K2 on a Joomla! based website and I have IntroText turned on for category views. There is a Master category, Master, of which all other categories are subcategories, which is how you have to apply IntroText to multiple categories because K2 is finicky - set all the options you want in the Master category and call them from the subcategories.
My particular problem is with losing HTML formatting in IntroText, so all blog posts would look like a brick of text (red lines denote where a paragraph should start, i.e. the <p> tag):
It should look like the following:
I have tried changing pretty much every setting in the Category View options of the Master category, to no avail, since the page the IntroText is on is a K2 Category page. I have also tried turning ON HTML formatting (it is off by default) and then excluding the p and br tags from being removed, also to no avail.
After some Googling, I couldn't find an answer, but I will admit there were one or two I didn't try (mainly modding the PHP files, I would have tried if the answer actually had some feedback). So, if anyone has any suggestions or ideas, let me know please, I would appreciate it. I can provide more details if needed, but for the time being the site is offline. If you would like some more clarification on the website set up I can oblige, but I believe I've added all relevant K2 details that I could.
EDIT: I have also tried commenting out these lines in /~siteDir~/public_html/modules/mod_articles_category/helper.php to fix the issue, which also didn't work.
$introtext = str_replace('<p>', ' ', $introtext);
$introtext = str_replace('</p>', ' ', $introtext);
$introtext = strip_tags($introtext, '<a><em><strong>');
EDIT2: I tried to just remove the whole _cleanIntrotext function and the call for it, $item->introtext = self::_cleanIntrotext($item->introtext);, but this also didn't work... which means something somewhere else is stripping the IntroText too?
In order to keep the initial formatting of the introtext inside the K2 content module, set your item creation options to use two editors: one for the introtext and one for the full text. You also need to write the introtext at the length you want displayed. This is imperative because you cannot limit the introtext word count afterwards in the module settings: the word limit strips out all HTML tags to avoid breaking the page code. As the Introtext word limit field states: "Leave blank to diasble. If you enable this option, all html tags from the text will be cleaned up to make sure the html structure of the site does not brake." IMHO it is a good thing to leave that as is, since you don't always know beforehand which <p> or other tag would be left open.
I did this and it worked for me; the module kept all the initial formatting. So, in conclusion, to keep the formatting intact, leave the module's Introtext word limit blank and store the introtext and the full text separately when creating articles. If you need more info, tell me.
EDIT
The same applies to category listings. As I see on your website, you have set an introtext limit in your category settings, so you should check which category (parent or child) has the limit set and remove it. Go to Components>K2>Categories and click to edit the category (or categories) that have the limit. It should be under Item view options in category listings>Introtext word limit. If you hover over the field title, you'll see the same warning tooltip about tag stripping that I mentioned before.
To set your item creation to use two editors, go to Components>K2>Parameters>Advanced>Use one editor window for introtext & fulltext and set this to No. That way you'll have two separate editors for your item. However, this is not obligatory; if you don't need it, you can simply add a Read more break anywhere in your text by clicking the button under the editor.

How to show related content using like in mysql?

I currently have a table for products with its own set of tags and a table for news with its own set of tags. I want to add related news to the products page, so I was thinking of using LIKE, but the tags columns look something like this:
(Products) tags- manutd, man utd, football
(news) tags - manutd, blah, bruha [this one is related]
(news) tags - man, utd, bruha [this one is not related]
I want to use a MySQL query to show all news containing any of the comma-separated tags from the product. How should I go about constructing such a query? If there is a better way of doing this, a little explanation would be helpful too. Thanks
Do you have the product tags at hand or do you want to join the two tables based on their tag similarity? In the first case, I would try something like this:
select ...
from News n
where n.tags REGEXP 'manutd|man utd|football'
Note that I used the product tag string you provided above, replaced the commas by | and removed the whitespace to the left and right of the commas.
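The transformation described above (commas replaced by | and surrounding whitespace removed) can be sketched in a few lines of Python. The function name tags_to_pattern is hypothetical, and the sketch assumes the tags contain no regular-expression metacharacters; Python's re module stands in for MySQL's REGEXP here, so the matching is only an approximation of what the database would do.

```python
import re


def tags_to_pattern(tag_string):
    """Turn 'a, b c, d' into the alternation 'a|b c|d' (hypothetical helper)."""
    tags = [tag.strip() for tag in tag_string.split(",")]
    # Skip empty entries such as those produced by a trailing comma.
    return "|".join(tag for tag in tags if tag)


pattern = tags_to_pattern("manutd, man utd, football")

# The two news tag strings from the question.
news = ["manutd, blah, bruha", "man, utd, bruha"]
related = [tags for tags in news if re.search(pattern, tags)]

print(pattern)
print(related)
```

Note that an alternation like this matches substrings, so a product tag football would also match a news tag such as footballer; storing tags in a separate table with one row per tag avoids that kind of false positive.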