I'm trying to build a MediaWiki-based website for a very specific purpose. Namely, I would like to create a field guide for a specific group of animals (reptiles and amphibians). Since the people I'd want generating content on the website aren't necessarily techies, I'd like to make things as easy and painless as possible for contributors.
Now, in most groups of animals, taxonomic designations are fluid, and change all the time. As an example, consider the following:
A species used to be called Genus1 species1. It was then called Genus2 species1. As of now, this species has been split into several species, say Genus2 species1, Genus2 species2, Genus2 species3, etc. In the worst case, anything about the nomenclature and classification of the species could change, including, but not limited to, the species being moved, split or merged with any other species.
For users, these changes should be transparent. That is, on typing in http://url_of_wiki/wiki/Genus1_species1, they should automatically be redirected to the lowest taxonomic group (in this case Genus2) that is unambiguous. Essentially, if a page is redesignated (moved, split or merged), I would like all the required new pages and redirects to be created automatically.
I should be able to implement this as an extension quite easily. However, having read the MediaWiki documentation on extensions, I haven't been able to figure out which part of MediaWiki it would be best to target.
So the question is: is this type of extension best implemented as a parser extension (by adding new tags), as a user-interface extension, or as a combination of the two (a user-interface extension backed by a parser extension)?
Nice challenging problem! If it were up to me I would solve it in a different way:
- use the page level for genera and
- the subpage level for species.
This will automatically take care of renaming, since MediaWiki leaves a redirect behind when a page is moved.
Alternatively:
- use page level for species and
- categories for genera.
Then use an "if pagename" template (see the Wikipedia example) to set the category based on the page name.
Or possibly combine these methods.
(See also Wikis and Wikipedia)
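If splits and merges still leave you needing redirects beyond the one MediaWiki leaves behind on a plain move, a small bot against the Action API may be enough before writing a custom extension. A minimal sketch, assuming Python with requests; the API URL, bot credentials and page names are placeholders:

import requests

API = "http://url_of_wiki/w/api.php"   # placeholder; point this at your wiki's api.php
session = requests.Session()

# 1. Fetch a login token and log in with a (placeholder) bot account.
login_token = session.get(API, params={
    "action": "query", "meta": "tokens", "type": "login", "format": "json",
}).json()["query"]["tokens"]["logintoken"]
session.post(API, data={
    "action": "login", "lgname": "RedirectBot", "lgpassword": "secret",
    "lgtoken": login_token, "format": "json",
})

# 2. Fetch a CSRF token for editing.
csrf_token = session.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# 3. Turn the obsolete name into a redirect to the lowest unambiguous taxon.
session.post(API, data={
    "action": "edit", "title": "Genus1 species1",
    "text": "#REDIRECT [[Genus2]]",
    "summary": "Species split; redirect old name to genus",
    "token": csrf_token, "format": "json",
})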
I am trying to set up a multilingual encyclopedia (4 languages), where I can have both:
Articles that are translations of other languages, and
Articles that are in a specific language only.
As the wiki grows, I understand that the content of each language can vary.
However, I want to be able to work as fluently as possible between languages.
I checked this article, dating back to 2012, which has a comment from Tgr that basically condemns both solutions.
I also checked this MediaWiki help article, but it gives no explanation of the differences between the two systems.
My questions are:
1. What is the preferred option now for a multilingual wiki environment that gives the most capabilities and the best user experience, given that some of the languages I want are right-to-left and some are left-to-right?
I want internationalized category names, I need to link categories to their corresponding translations, and I want users to see the interface in the language the article is written in.
Basically, it should be as if I have four encyclopedias, but with the articles linked to their corresponding translations.
2. Which system would give me a main page per language, so that English readers see an English homepage, French readers see a French homepage, and so on?
EDIT:
I have a dedicated server, so the limitations of shared hosting do not apply.
Thank you very much.
The Translate extension is meant for maintaining identical translations and tracking up-to-date status while other solutions (interwiki links, Wikibase, homegrown language templates) typically just link equivalent pages together. Translate is useful for things like documentation, but comes with lots of drawbacks (for example, WYSIWYG editing becomes pretty much impossible and even source editing requires very arcane syntax). It's best used for content which is created once and then almost never changes.
You cannot get internationalized category names in a single wiki as far as I know. (Maybe if you wait a year or so... there is ongoing work to fix that, by more powerful Wikibase integration.) Large multi-language wikis like Wikimedia Commons just do that manually (create a separate category page for each category in each language).
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and/or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking of approaching the emails is just to search for all mailto a tags and filter from there.
The obvious downside to this is that there is no guarantee the email will be for a partner and not some other employee.
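Roughly what I have in mind, sketched here in Python with BeautifulSoup rather than Oga, purely to illustrate the idea:

from bs4 import BeautifulSoup

def mailto_addresses(html):
    """Collect every address that appears in a mailto: link on one page."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select('a[href^="mailto:"]')
    # Strip the scheme and any ?subject=... query string.
    return [a["href"][len("mailto:"):].split("?")[0] for a in links]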
Another issue is detecting the partners' names just from the markup. I was initially thinking I could just pull all the header tags and the text in them, but I have stumbled across a few sites that put the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this or where to ask it. Is there another Stack Exchange site that this question would be more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners at all, so in my opinion make sure you actually stand to gain from this; if not, you'd better look for some other venture.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email individuals, but you are allowed to email the company, so in a way it is the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but the only solution I can see is building a rule set that will probably get quite complex over time. Maybe you could apply some Bayesian filtering - it works very well for email.
But, to be a little more productive here: one thing I know is important is to start by creating the crawler environment and building the dataset. Keep a database of URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% local copy of the data. This will save you enormous time compared with live scraping while you tweak.
I built my own search engine some years ago, scraping all .no domains, although I only needed each site's index file at the time. It took over a week just to scrape it all down, I think it was 8 GB of data for those index files alone, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems that needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently on the parsing later.
Good luck, and do post if you find a solution. I do not think it is possible without a clever algorithm or AI, though - people design websites however they like and pull templates from wherever, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler: you could crawl each site and build a profile for each one, then employ someone inexpensive to go through the parsed data manually and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale or available from a web service so it can be scraped.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole page for names (there are free databases of first names and last names). This may also work if you are doing this for other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this approach won't work for Chinese sites.
It is easy to find emails on a web page, since they follow the fixed format of (username)@(domain name) with no spaces in between. Again, I would not treat this as an HTML-tag problem but just search the page as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (a rough code sketch follows the decision tree below):
- Only one email on the page?
  - Yes -> it's a catch-all email.
  - No -> Is a name found on that page as well?
    - No -> catch-all email (there can be more than one catch-all email, perhaps for different purposes such as info + employment).
    - Yes -> the email should be attached to the name found right before it; it is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
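A rough sketch of that decision logic, assuming Python, a plain-text dump of the page, and a set of known first/last names loaded from one of those free name databases (all of it illustrative rather than a finished crawler):

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def classify_emails(page_text, known_names):
    """Apply the decision tree above to the plain text of one page."""
    emails = list(EMAIL_RE.finditer(page_text))
    names_on_page = [n for n in known_names if n in page_text]
    results = []
    for match in emails:
        if len(emails) == 1 or not names_on_page:
            # Single address, or no recognisable name on the page: treat as catch-all.
            results.append({"email": match.group(), "owner": None, "kind": "catch-all"})
            continue
        # Otherwise attach the address to the closest name appearing before it.
        before = page_text[:match.start()]
        owner = max(names_on_page, key=before.rfind)
        if before.rfind(owner) == -1:
            owner = None
        results.append({"email": match.group(),
                        "owner": owner,
                        "kind": "personal" if owner else "catch-all"})
    return results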
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that sort of auto-finds the information, it will be difficult. At a high level, though, it looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes that easily let you grab information. Perhaps you'll find that many divs share a particular class name; check for this first.
It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, for example, look for information that stands out by type (is a link) or by regex (is an email) to find formatted text. Names and occupations are harder to find this way, but on many pages they are positionally related to other well-formatted items.
Names are often prefixed or suffixed with honorifics and credentials (Mrs., Mr., Dr., JD, MD, etc.). You could build a bank of those and check against it on any page you end up on.
Finally, if you really wanted to make this process general purpose, you could apply heuristics based on expected information; names, for example, most often come from a fairly limited list. If it were worth your time, you could check candidate text against a list of common names.
What you described in your initial question suggests you would benefit most from a general-purpose regular-expression crawler, which you could improve as you learn more about the sites you interact with.
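For what it's worth, a minimal sketch of that regex-plus-honorifics pass, assuming Python; the honorific bank and the name pattern are illustrative guesses, not something taken from the sites above:

import re

# Assumed, deliberately incomplete bank of prefixes/suffixes that tend to mark names.
HONORIFICS = r"(?:Mr|Mrs|Ms|Dr)\."
CREDENTIALS = r"(?:CPA|JD|MD|Esq\.?)"

NAME_RE = re.compile(
    rf"(?:{HONORIFICS}\s+)?"             # optional honorific prefix
    r"([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"  # two or more capitalised words
    rf"(?:,\s*{CREDENTIALS})?"           # optional credential suffix
)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def candidates(page_text):
    """Pull name-like strings and email addresses out of raw page text."""
    names = [m.group(1) for m in NAME_RE.finditer(page_text)]
    emails = EMAIL_RE.findall(page_text)
    return names, emails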
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications examined there also run on macOS.
I'm considering using MediaWiki as my company's internal knowledge base and am trying to understand how to build out effective team sections. Unfortunately, I'm not finding much information on this.
Ideally we'd have separate knowledge base sections for devs, product, design, and HR, all in the same system with the ability to cross-link. Each of these sections would have its own landing page, and we could search for content specifically within that section.
It looks like using categories might work, but initially this feels clunky and I'm not sure if it provides the level of hierarchy I'm looking for. I would love to get your ideas and any links to examples that have done this well.
Thank you!
If by segregation you mean limited visibility (i.e. team members generally shouldn't be able to see other members' content), then MediaWiki is probably not the right choice for you, as it does not have granular read access control.
If you are simply looking for content organization, namespaces provide an ugly but easy way of partitioning (almost everything supports filtering by namespace). Categories are more elegant but not so well integrated - you can filter search results by category but you can't do it for most other things like recent changes or user contributions.
I am working on a user-moderated database and settled on MediaWiki with Semantic MediaWiki as the engine. I installed Semantic Forms to force end users to conform to a certain standard when creating or editing entries. The problem is that, since a user can add semantic notation to any form text input, they can throw off the proper structure of the system; i.e., if it were an IMDb clone, a user could add [[Directed by::Forrest Gump]], which would then result in the movie "Forrest Gump" showing up in a list of directors.
I doubt that there's any setting that can simply turn this off or on, but I've had one or two ideas as to how to get it working.
One: perhaps there's a way to disable semantic notation in specific namespaces and put the forms in those namespaces. I have a feeling this would simply break the forms.
Another idea is to modify the code. This is clearly the less ideal approach. To get started, I believe I would need to create some sort of filter on SFTextAreaInput that disables semantic notation in the user-entered text, but alas I'm unsure how to get started on that.
Well, Semantic MediaWiki is still a Wiki. In your classical enterprise database, you restrict the users' input options as a means of ensuring data integrity. That isn't what wikis do; the thinking with a wiki is, yes, the user can enter incorrect information, but another user will amend it and let the first user know what was wrong.
I wouldn't try to coerce SMW into rigid data acquisition. I mean, you do have options such as removing the standard input fields in forms:
'''Free text:'''
{{{standard input|free text|rows=10}}}
If users are selecting a movie page when they should be selecting a director page, then you probably want to encourage correct selection by populating the form control from the Directors category, like:
{{{field|Director|input type=combobox|values from category=Directors}}}
Yes, they can still go very far out of their way to select "Forrest Gump", but if that happens then the fact that someone wilfully circumvented the preselected correct options is a more pressing concern than the fact that the system permits it.
Wikis work best when the system encourages rather than enforces valid knowledge.
My name is Wolfgang Fahl, and I am behind the smartMediaWiki approach. You might want to go the smartMediaWiki route; see
http://semantic-mediawiki.org/wiki/SMWCon_Spring_2015/smartMediaWiki
For a start, don't go just by the property values but also, e.g., by a category:
{{#ask: [[Category:Movie]] [[Directed by::+]]
|?Directed by
}}
will only show pages that have both the property set and are in the correct category.
In the smartMediaWiki approach you'd create a topic "Movie", and the entry of movies would be done via forms. This is an elaboration of the Semantic Forms and Page Schemas ideas that has evolved recently. You can find out more about it at SMWCon Barcelona 2015 this fall.
This is a common situation, but here is the latest example:
Companies have various contact data (addresses, phone numbers, e-mails...). When they create a job ad, they have checkboxes where they choose how they want to be contacted. It is basically descriptive data. A user reading an ad sees something like "You can apply by mail, in person...", except when the option is "through web portal" or "by e-mail", because then the appropriate buttons should appear instead. These options are stored in the database, and the client (the owner of the site, not the company placing the ad) can change them (e.g. they can add "by telepathy" or whatever), yet if they tamper with the "e-mail" and "web portal" options, they break their site.
So how should I handle data where everything behaves the same way except "this thing", which behaves one way, and "that thing", which behaves some other way, while the data itself is live and should be editable by the client?
You've tagged your question as "language-agnostic", and not all languages cleanly support polymorphism, but that's the way I would approach this.
Each option has some type, and different types require different properties to be set. However, every type supports some sort of "render" method that can display the contact method as needed. Since the properties (phone number, web address, etc.) are type-specific, you can validate the administrator's input when creating these "objects" to make sure that the necessary data is provided and valid. And since you implement the render method, rather than spitting out HTML provided by a user, you can ensure that the rendered page is correct. It's less flexible, but safer and more user-friendly.
In the database, you can have one sparsely populated table that holds data for all types of contacts, or a "parent" table with common properties and sub-tables with type-specific properties. It depends on how many types you have and how different they are. In either case, you would have some sort of type indicator, so that you know the type of object to which the data should be bound.
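A minimal sketch of that shape, assuming Python; the option names, fields, and HTML are invented for illustration:

from abc import ABC, abstractmethod
import re

class ContactOption(ABC):
    """One way a company can be contacted; subclasses validate and render."""
    @abstractmethod
    def render(self) -> str: ...

class DescriptiveOption(ContactOption):
    """The common case: client-editable text such as 'by mail' or 'in person'."""
    def __init__(self, label: str):
        self.label = label
    def render(self) -> str:
        return f"You can apply {self.label}."

class EmailOption(ContactOption):
    """Special case: validated address, rendered as a button."""
    def __init__(self, address: str):
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", address):
            raise ValueError(f"invalid e-mail address: {address!r}")
        self.address = address
    def render(self) -> str:
        return f'<a class="btn" href="mailto:{self.address}">Apply by e-mail</a>'

class WebPortalOption(ContactOption):
    """Special case: validated URL, rendered as a button."""
    def __init__(self, url: str):
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"invalid portal URL: {url!r}")
        self.url = url
    def render(self) -> str:
        return f'<a class="btn" href="{self.url}">Apply via web portal</a>'

# Rendering an ad's contact block is then just:
options = [DescriptiveOption("in person"), EmailOption("jobs@example.com")]
print("\n".join(o.render() for o in options))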
First of all, think twice about whether you really need it. The reason is simple: you are supposed to serve a specific need, and input data is a means of providing that service. If the data does not fit the existing service, then what is its value, and who is the consumer of that specific information?
There are two possible answers: you are expanding your client base, or you need to change the existing service because demand has changed. In both cases you need to start from the business model. If you describe what service you need and what information it should provide, you will avoid much of the one-off data and arrive at clear requirements that are easy to implement in software.
I'd recommend the resolution pattern for this, based on the mention of a database. The link above describes it, but it's actually a lot simpler than it sounds. You write a database query that returns all the possible options (for example, you read the standard options and the customized options together, using perhaps a UNION or a JOIN depending on your schema); the COALESCE SQL keyword is then useful to find the first 'resolution' of the option value that isn't NULL (a rough sketch follows).
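A rough sketch of that query shape, assuming Python with sqlite3 and an invented two-table schema (built-in defaults plus per-client overrides):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE default_options (name TEXT PRIMARY KEY, behaviour TEXT);
CREATE TABLE client_options  (name TEXT PRIMARY KEY, behaviour TEXT);
INSERT INTO default_options VALUES ('e-mail', 'button'), ('web portal', 'button');
INSERT INTO client_options  VALUES ('by telepathy', 'text'), ('e-mail', 'button');
""")

# COALESCE picks the first non-NULL 'resolution': the client's value if it
# exists, otherwise the built-in default.
rows = conn.execute("""
SELECT COALESCE(c.name, d.name)           AS name,
       COALESCE(c.behaviour, d.behaviour) AS behaviour
FROM default_options d
LEFT JOIN client_options c ON c.name = d.name
UNION
SELECT name, behaviour FROM client_options
""").fetchall()
print(rows)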
Well, if all it comes down to is that you have two special options and anything else is handled the same way, then store your options as strings, and if either of the two special ones appears in that list, show the appropriate UI for that special item.
Just check your list of items for the two special ones. Nothing fancy.
By writing a very simple rules engine. You can use an out-of-the-box implementation, or you can roll your own. Since your case seems so simple, I tend to roll my own, because it means fewer dependencies (YMMV).
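A sketch of what "rolling your own" can amount to, assuming Python; a rule is just a predicate plus an action, checked in order, with a default for everything else (option names and HTML are illustrative):

def make_engine(rules, default_action):
    """Return a function that applies the first matching rule to an option."""
    def run(option):
        for predicate, action in rules:
            if predicate(option):
                return action(option)
        return default_action(option)
    return run

render = make_engine(
    rules=[
        (lambda o: o == "e-mail", lambda o: "<button>Apply by e-mail</button>"),
        (lambda o: o == "web portal", lambda o: "<button>Apply via web portal</button>"),
    ],
    default_action=lambda o: f"You can apply {o}.",
)

for option in ["by mail", "e-mail", "by telepathy"]:
    print(render(option))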