How do you grab an article including the links in a usable format? - mediawiki

I have an internal deployment of MediaWiki. Some articles contain external links. I have another page that makes API calls to the wiki to pull articles into another website. When I pull those articles in, the links do not come through properly. Here is an example.
Wiki article:
Use [http://example.com THIS LINK] to contact the vendor.
API URL:
https://mysite.com/mediawiki/api.php?action=query&format=json&prop=extracts&titles=Vendor
API results:
Use THIS LINK to contact the vendor.
Notice the link is completely stripped away. I've also tried adding my own HTML for links into MediaWiki, but MediaWiki escapes the < and > symbols, so the API sees '&lt;' and '&gt;', and the page then displays the escaped HTML as text rather than an actual link.
How do I make mediawiki API calls and keep link information?

The extracts property returns plain text, so markup like links is stripped. You can use action=parse instead, which returns the rendered HTML. The query would look like this:
https://mysite.com/mediawiki/api.php?action=parse&format=json&page=Vendor&prop=text
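If you then need to drop that HTML into the other website, a minimal client-side sketch might look like this (the host and the target element id are placeholders; origin=* assumes the wiki allows anonymous cross-domain requests):

// Fetch the parsed article and inject it, links intact.
fetch('https://mysite.com/mediawiki/api.php?action=parse&format=json&page=Vendor&prop=text&origin=*')
  .then(function (res) { return res.json(); })
  .then(function (json) {
    // json.parse.text['*'] holds the rendered HTML, including
    // <a href="http://example.com">THIS LINK</a>
    document.getElementById('article').innerHTML = json.parse.text['*'];
  });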

Related

WikiMedia API - How to determine which portal(s) a Page belongs to?

I wish to determine whether a given Wikipedia page belongs to a certain Wikipedia Portal using the MediaWiki API. So far, I have been experimenting with the page properties of the API but I cannot seem to find a way to derive what Portal a given page belongs to.
As an example, on the Wikipedia page for Cake, at the very bottom of the page I can press Show on the section Cakes, and a bunch of links to different cake pages show up. There I can also see that all of these belong to the Food portal. It is that information that I wish to extract from a given page using the MediaWiki API.
As far as I know, there is actually no formal definition of "belonging to a portal" in Wikipedia. Unlike categories, which are part of the MediaWiki software, portals are custom pages for Wikipedia that aim to make it easier to explore a topic.
Instead of a formal definition, though, you can use a heuristic and determine the connection between the page and some portal based on one of them linking to the other. There are API endpoints for both directions:
(Note: 100 is the id of the 'Portal' namespace)
Which portal pages are linked from the page "Cake" or "Pizza"
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links&titles=Cake%7CPizza&plnamespace=100
Which portal pages link to the page "Cake" or "Pizza"
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=linkshere&titles=Cake%7CPizza&lhnamespace=100
(though as you can see, many unrelated portals link to "Cake" and none link to "Pizza")
A combined query for both directions
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=links%7Clinkshere&titles=Cake%7CPizza&plnamespace=100&lhnamespace=100
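As a rough sketch of how you might consume the combined query from JavaScript (origin=* enables anonymous CORS; note that links results can be paginated for link-heavy pages):

// Collect portal titles linked from or linking to each page.
var url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
        + '&prop=links%7Clinkshere&titles=Cake%7CPizza'
        + '&plnamespace=100&lhnamespace=100&origin=*';

fetch(url)
  .then(function (res) { return res.json(); })
  .then(function (json) {
    // query.pages is keyed by page id; links/linkshere are absent when empty
    Object.values(json.query.pages).forEach(function (page) {
      var portals = (page.links || []).concat(page.linkshere || [])
        .map(function (l) { return l.title; });
      console.log(page.title, '->', portals);
    });
  });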
So through some more investigation I found the answer:
I ended up using the revisions property in the API. This allows me to give a series of page titles that I want to investigate and have the wikitext of each page returned to me in JSON format. Then I can just search for lines containing Portal and figure out what portal (if any) the page belongs to.
If anyone is in a similar situation, here is an example query to the API:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Bread|Bubble_tea|Pizza&format=json&redirects&rvprop=content&rvslots=main
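A minimal sketch of that approach (the regex is only a heuristic, since portal markup varies between pages):

// Pull the latest wikitext of each page and scan it for portal references.
fetch('https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
    + '&titles=Bread%7CBubble_tea%7CPizza&format=json&redirects'
    + '&rvprop=content&rvslots=main&origin=*')
  .then(function (res) { return res.json(); })
  .then(function (json) {
    Object.values(json.query.pages).forEach(function (page) {
      var wikitext = page.revisions[0].slots.main['*'];
      // heuristic: match "Portal|Food" or "Portal:Food" style references
      var portals = wikitext.match(/Portal[|:][^}\]|]+/g) || [];
      console.log(page.title, portals);
    });
  });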

Modifying storefront HTML using Shopify app

I have been reviewing the REST Admin API to try to figure out the answer to this question, and I may simply be looking at the wrong documentation.
We're trying to develop an application that will add custom data-driven pages to the site that will take product(s) from multiple selected categories and display them all on a single page, with checkout forms for each. This is done already by other apps, but we have to do a custom implementation so we can match the client's specific functionality needs. An example of an app that does something similar is the Bundle Builder app, which appears to modify the output of {{ content_for_layout }} in the theme.liquid file. It outputs some JSON gathered from the Shopify database (which can be done with the Shopify REST API) and an empty div. Getting the data isn't my concern, but I can't find anywhere in the docs I've looked at where it describes how to modify storefront HTML output.
I suspect it may do this by adding a template (but it has not added that template to the theme files) and associating it with the page URL, or by modifying the output of an existing template, or by adding a section and somehow integrating it with a page, or otherwise, but I have been unable to find documentation for any of those tasks in the docs I've looked at. Other apps appear to add HTML to the storefront as well, such as Privy (which adds pop-ups), Easy Contact Form, and User Photos.
What am I missing?
If you want to fill in an empty element with content, one easy way is to use an App Proxy. Shopify will make a secure callback to your endpoint of choice, and you can return data. You could also return Liquid, and Shopify will render it alongside the rest of the page chrome, ensuring your Liquid becomes the page.
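A minimal sketch of such a proxy endpoint, assuming an Express app (the route and markup are placeholders; the key detail is the application/liquid content type, which tells Shopify to render the response as Liquid inside the theme layout):

// Hypothetical endpoint behind a Shopify App Proxy.
var express = require('express');
var app = express();

app.get('/proxy/bundles', function (req, res) {
  // A real app must verify the signature Shopify adds to proxy requests.
  res.type('application/liquid');
  // Shopify renders this Liquid inside the normal page chrome.
  res.send('<h1>{{ shop.name }} bundles</h1><div id="bundle-root"></div>');
});

app.listen(3000);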

Any way to get json of the facebook page in 2019?

I'm trying to make a parser from Facebook Page posts to Squarespace blog posts. I already did it for Instagram, but I need a JSON feed of the Facebook Page.
I need something similar to what I have for Instagram (https://api.instagram.com/v1/users/[USER-ID]/media/recent?access_token=[TOKEN]).
I found a few articles on how to do it, but all of them were written before Facebook's massive data leak; since then, Facebook has made many changes to its apps' permissions and imposed many other rules.
https://developers.facebook.com/docs/graph-api/reference/page/feed/
You need a Page Token of the Page if you own the Page
You need Page Public Content Access approved by Facebook for Pages you do not own
Since you tagged the question with the JS SDK: make sure not to use hardcoded Access Tokens on the client! They are always meant to be kept secret.
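A minimal server-side sketch under those constraints, assuming Node 18+ (which has a global fetch) and placeholder values:

// Hypothetical server-side call to the Page feed edge.
var PAGE_ID = 'YOUR_PAGE_ID';              // placeholder
var PAGE_TOKEN = process.env.PAGE_TOKEN;   // keep the token on the server

fetch('https://graph.facebook.com/v3.2/' + PAGE_ID + '/feed?access_token=' + PAGE_TOKEN)
  .then(function (res) { return res.json(); })
  .then(function (json) {
    // json.data is an array of post objects (id, message, created_time, ...)
    console.log(json.data);
  });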

Google bot crawling on AngularJS site with HTML5 Mode routes

We have an AngularJS site using HTML5 routes. I just did some test "Fetch as Google" runs. The results are a bit confusing:
On the fetching tab, I see our site as it looks on view source, with all the front end bindings {{ }}, and not all the HTML rendered
On the rendering tab, our site looks perfectly fine, no {{ }} variables, it seems like Google bot fetched and rendered the site fine, which is maybe in line with this, http://googlewebmastercentral.blogspot.ae/2014/05/rendering-pages-with-fetch-as-google.html.
However, we are already prepared for Google not being able to crawl our site, so we have already added the <meta name="fragment" content="!"> tag, so that Googlebot revisits our page with ?_escaped_fragment_=. We followed this, https://developers.google.com/webmasters/ajax-crawling/docs/getting-started (section "3. Handle pages without hash fragments"). In our Nginx config we have something like this:
if ($args ~ "_escaped_fragment_=") {
    # serve the pre-rendered HTML snapshots (path is a placeholder)
    rewrite ^(.*)$ /snapshots$1 break;
}
This works fine if we pass _escaped_fragment_= ourselves. However, Googlebot never tried to crawl our site with this param, so it never crawled the snapshot. Are we missing something? Should we also add user-agent detection for Googlebot in our Nginx conf? Something like this?
if ($http_user_agent ~* "googlebot|yahoo|bingbot|baiduspider|yandex|yeti|yodaobot|gigabot|ia_archiver|facebookexternalhit|twitterbot|developers\.google\.com") {
    # serve from the snapshots (path is a placeholder)
    rewrite ^(.*)$ /snapshots$1 break;
}
It would be great if we can understand this better, thank you so much in advance!
UPDATE:
I just read this, http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io?_escaped_fragment_=tag#caveats. So, it seems that when using the manual tools (Fetch as Google), we should pass ourselves either #! or ?_escaped_fragment_= in the right place. Indeed, if I pass ?_escaped_fragment_= in our case, I do see the HTML snapshot that we have created.
Is that true? Is this how it works indeed?
UPDATE 2
On the bottom of this thread, a Google employee verifies that for Google Webmasters "Fetch as Google", you need to manually pass the _escaped_fragment_= param yourself, https://productforums.google.com/forum/#!msg/webmasters/fZjdyjq0n98/PZ-nlq_2RjcJ
Cheers,
Iraklis
I will try to answer your questions based on our experiences in the last month of developing a SPA with HTML5 mode.
How do I get Googlebot to use ?_escaped_fragment_= instead of the direct links?
This is actually quite simple but easy to overlook. In fact, there are two different ways to get Googlebot to try the escaped_fragment. The first method is to run your site in non-html5 mode. This means that your URLs will be of the form:
http://my.domain.com/base/#!some/path/on/website
Googlebot recognizes the #! and makes a second call to your server with an altered URL:
http://my.domain.com/base/?_escaped_fragment_=some/path/on/website
Which you can then handle as you wish. The second way to get Googlebot to try _escaped_fragment_ mode is to include the following meta tag on the index page you supply to the bot:
<meta name="fragment" content="!">
This will make Googlebot check the other version of the webpage every time it sees the tag. Interestingly, you can use both these techniques together, or you can do what we ended up doing, which is running in html5 mode with the meta tag. This means that your URLs will be escaped as follows:
http://my.domain.com/base/some/path/on/website?_escaped_fragment_=
Interestingly, the bot will not put anything at the end of the fragment. But depending on what webserver you are running, you can easily map requests matching the "_escaped_fragment_" text to your alternate bot page. For more information on the escaped fragment, see Google's AJAX crawling documentation (linked in the question above).
"Fetch as Googlebot" returns two different versions of my page, the source with {{}} and the rendered page looking correct. What does that mean?
Google's bots have actually been able to interpret JavaScript to a limited extent since early 2014. For more information, read the official blog entry on the Google Webmaster Central blog (linked in the question above). However, as is made clear in the blog entry, this comes with a lot of caveats. For instance:
Googlebot does not guarantee to execute all javascript code.
Googlebot will attempt to find links in the javascript to follow and use them to help find more pages.
Googlebot will render the preview in webmasters tools by executing as much of the javascript as it can (thus the lack of {{}} in the rendered version).
Googlebot will not necessarily use the rendered version in order to build the meta information about your site for its index.
As of 18/12/2014, we are still unsure if Googlebot can actually extract any information from an SPA in rendered mode for its index beyond finding links to follow in the javascript. In our experience, Googlebot will include {{}} in its index listing so that when you try to use {{}} to fill meta information (description, keywords, title, etc...) your site looks like this in Google Search results:
{{meta.siteTitle}}
http://my.domain.com/base/some/path/on/website
{{meta.description}}
rather than what you expect which might look like this:
Domain
http://my.domain.com/base/some/path/on/website
This is a random page on my domain. An excellent example page to be sure!
Googlebot uses _escaped_fragment_ for Search indexing, but we cannot be sure about other services
Google recommends serving an HTML snapshot of an AJAX website by using the hashbang (#!) and the _escaped_fragment_ param.
But, as is often the case with new Google features, not all Google services support it from the beginning.
For now, from experience, we are sure that Googlebot uses the HTML snapshot and _escaped_fragment_ when indexing webpages. You can check your server access logs to verify that Google did this on your application.
For other services, like PageSpeed Insights, the Webmaster Tools parser, rich-snippet testing tools, etc. (from experience; nothing official from Google): the hashbang (#!) is not supported, so you have to use _escaped_fragment_.
Should you use User Agent detection to serve HTML snapshot?
No. Just don't, for several reasons:
You just do not know which services/bots on the web will want to parse your content, and you cannot be exhaustive (for instance, think of all the social networks on the web that use bots to create a snippet of your content: you cannot handle them one by one).
This can be considered cloaking: serving a different version depending on the type of user at the same URL, which is basically wrong for SEO.
Google looks for #! in our site URLs, takes everything after the #!, and passes it in the _escaped_fragment_ query parameter. Some developers create basic HTML pages with real data and serve these pages from the server side at crawl time. So why not render the same pages with PhantomJS on the server side when the request has _escaped_fragment_? A sketch of that idea follows.
For more detail, please read this blog.
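A minimal PhantomJS sketch (run as phantomjs render.js <url>; the fixed timeout is a simplification, a real setup would poll until the app has settled):

// render.js — load the page, let the JavaScript run, print the rendered HTML.
var page = require('webpage').create();
var url = require('system').args[1];

page.open(url, function (status) {
  if (status !== 'success') { phantom.exit(1); }
  setTimeout(function () {
    console.log(page.content); // the fully rendered DOM as HTML
    phantom.exit();
  }, 500);
});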
Maybe a bit outdated, but for completeness:
According to the statement from May 23, 2014 Google bot is now able to "see your content more like modern Web browsers".
According to their statement from October 14, 2015 Google deprecated the AJAX crawling scheme.
So using the HTML5 History API (html5mode in angular) should be no problem to Google.

How to embed/integrate WordPress blog into my own web site?

I have a WordPress blog account already (abc.wordpress.com). And I have my own web site: www.xyz.com
I would like to integrate my WordPress blog content into my own site. Hopefully something like blog.xyz.com or just replace the home page of xyz.com with abc.wordpress.com
I know that I can download WordPress's code from wordpress.org, run my own WordPress installation, and have my own MySQL database, but WordPress is always releasing new code. I don't have the time to keep updating the source on my end to match it.
I'm running my own site as a hobby, so I prefer to let WordPress.com manage the content for me and continue to use my own blog at abc.wordpress.com, but make the content show up on my own site: xyz.com
I hope I was clear when explaining this.
Does anyone know a way to do this?
Thanks.
If your main worry is about the updates, I would say don't be. A simple click of the 'Updates' button in the WordPress admin is all you need to do in order to apply WordPress updates. A notification will pop up alerting you of any updates.
And as Calle has already mentioned, you can retrieve your content via RSS, or you could just export your current content from WordPress.com, import the content into your own site, and manage it there. Everything would be in one spot.
Good Luck.
I don't know how good you are with programming, but there's a PHP library called SimplePie which would help you retrieve your content via RSS (which WordPress automatically generates for you). The address is http://simplepie.org/
If you are not very good with programming, perhaps you can get someone to do it for you or find a script that is already written somewhere. I do think RSS is definitely the best way to go.
I also think you exaggerate the problems of hosting WordPress yourself. It's not something you constantly have to keep up with; all you have to do is log in from time to time, perhaps once a month (how often are you writing articles?), and click "update", and WordPress will do everything for you, both for your plugins and your WP version.
Using your own domain (xyz.com) and having WordPress redirect users from abc.wordpress.com (your WordPress blog) to your domain requires a premium account.
If you have a premium account, then you can just log in to wordpress.com, click 'Upgrades' and select 'Domains'. From there you will see the option "Map an Existing Domain", where you will want to enter your domain. Now your wordpress.com blog is what will show when users enter your domain's URL (xyz.com).
Alternatively, if you need a workaround with a free wordpress.com account, then you want to just embed your blog, and for that you will need to use an RSS feed. Note: this method will not maintain your WordPress styles; it will merely transport the content. Also, by default not all browsers support RSS feeds.
You can view your blog's current feed by adding 'feed' to the end of your wordpress.com URL, i.e. abc.wordpress.com/feed. You can read more about feeds here (http://en.support.wordpress.com/feeds/). Now you are just left with the task of embedding the feed into your page, as in the sketch below.
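As a rough sketch, a small piece of client-side JavaScript can fetch the feed and render the post titles (it assumes an element with id "blog" on your page; a small server-side proxy may be needed if cross-origin requests to the feed are blocked):

// Fetch the blog feed and list the post titles as links.
fetch('https://abc.wordpress.com/feed/')
  .then(function (res) { return res.text(); })
  .then(function (xml) {
    var doc = new DOMParser().parseFromString(xml, 'text/xml');
    var list = document.getElementById('blog'); // e.g. <ul id="blog"></ul>
    doc.querySelectorAll('item').forEach(function (item) {
      var a = document.createElement('a');
      a.href = item.querySelector('link').textContent;
      a.textContent = item.querySelector('title').textContent;
      var li = document.createElement('li');
      li.appendChild(a);
      list.appendChild(li);
    });
  });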
One final hail-mary you might attempt is just redirecting your domain to your blog. A reference on the different ways to do this is here: (http://css-tricks.com/redirect-web-page/). For example, place this tag in the <head> section of your domain's pages:
<meta http-equiv="refresh" content="0; URL='http://google.com'" />
(this will redirect after 0 seconds to the specified URL)