Web-scraping only a specific domain - html

I am trying to make a web scraper that, for this example, scrapes news articles from Reuters.com. I want to get the title and date. I know I will ultimately have to pull the source code from each address and then parse the HTML using something like jsoup.
My question is: How do I ensure I do this for each news article on Reuters.com? How do I know I have hit all the reuters.com addresses? Are there any APIs that can help me with this?

What you are describing is web crawling combined with web scraping. You have to visit every link that matches some criteria (crawling) and then extract the content from each page (scraping). I've never used them, but here are two Java frameworks for the job:
http://wiki.apache.org/nutch/NutchTutorial
https://code.google.com/p/crawler4j/
Of course, you will have to use jsoup (or similar) to parse the content after you've collected the URLs.
Update
For a better list of crawlers, check out the question "Sending cookies in request with crawler4j?". Nutch is pretty good, but very complicated if the only thing you want is to crawl one site. crawler4j is very simple, but I don't know if it supports cookies (and if that matters to you, it's a deal breaker).
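To illustrate the crawl-then-scrape idea, here is a minimal sketch in Python (not one of the Java frameworks above). It keeps a queue of pages to visit and a set of URLs already seen, and only follows links whose host is the target domain. The names `same_domain_links` and `crawl` are made up for this example, and `fetch` is assumed to be any function you supply that returns the HTML for a URL. Note that a crawler only finds pages reachable through links, so this cannot prove you have hit every article on the site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, page_url, domain):
    """Return the absolute links in html that stay on the given domain."""
    collector = LinkCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop #fragments.
        absolute = urljoin(page_url, href).split("#")[0]
        host = urlparse(absolute).netloc.lower()
        if host == domain or host.endswith("." + domain):
            links.add(absolute)
    return sorted(links)


def crawl(start_url, domain, fetch, max_pages=100):
    """Breadth-first crawl. fetch is any callable mapping a URL to its HTML."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # this is where you would scrape title and date
        for link in same_domain_links(fetch(url), url, domain):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In real use, `fetch` would download the page over HTTP. Passing it in as a parameter keeps the crawl logic easy to try out against a small dictionary of fake pages first.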

Try this website http://scrape4me.com/
I was able to generate this URL for the headline: http://scrape4me.com/api?url=http%3A%2F%2Fwww.reuters.com%2F&head=head&elm=&item[][DIV.topStory]=0&ch=ch

Related

Using Wikipedia API on custom wikis like Bulbapedia

Does anyone have experience in using the Wiki API Sandbox with making REST calls on custom wikis? By custom wiki I mean something like http://bulbapedia.bulbagarden.net/wiki/.
I particularly want to get access to some of the Pokemon content found on Bulbapedia, but I'm not sure where to start or if it's even possible to use REST on custom wikis.
My current solution is to just use a standard Wikipedia page with calls like:
To Get All Pokemon:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=List_of_Pok%C3%A9mon
To Get Bulbasaur:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Bulbasaur
I get some JSON that I can work with, but would love to be able to explore the content of a Bulbapedia page AND have access to all of Ken Sugimori's artwork.
Yes, MediaWiki comes with an API bundled. Furthermore, since 1.27 it includes a rewritten ApiSandbox that I originally wrote as an extension. So, as Bulbapedia is running 1.27.1, it has the sandbox too.
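Since the API ships with MediaWiki itself, the same `action=parse` call used against Wikipedia above should work against another wiki once you point it at that wiki's `api.php`. A minimal sketch in Python follows. The helper name `parse_url` is made up for this example, and the Bulbapedia endpoint path shown in the comment is an assumption that you should verify on the site.

```python
from urllib.parse import urlencode


def parse_url(api_endpoint, page):
    """Build an action=parse request URL for any MediaWiki api.php endpoint."""
    params = {"action": "parse", "format": "json", "page": page}
    return api_endpoint + "?" + urlencode(params)


# Wikipedia endpoint, taken from the question.
wikipedia = parse_url("https://en.wikipedia.org/w/api.php", "Bulbasaur")

# Assumed location of Bulbapedia's api.php; check the real path before use.
bulbapedia = parse_url("https://bulbapedia.bulbagarden.net/w/api.php", "Bulbasaur")
```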

Wikipedia API - How to get all links from multiple pages?

I'm new to the wiki API. I have read how to get all links from a specific page, and managed to do so, but in my case I need a list of links from many pages, and sending a request for each page is inefficient. This is the kind of request I use -
http://en.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=links&gpllimit=500&redirects=true&pageids=1234
I must admit that I don't fully understand what each argument means. So -
How do you chain multiple page IDs in the 'pageids' argument? I guess that's a silly question, but I didn't find any reference :\
Can the response point out which page owns each link?
Thanks!
You can just join page IDs (or names, if you use the titles parameter) with |, which is how you make lists in the MediaWiki API in general. I don't think there is a way to find out which link comes from which page, though.
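As a small illustration of joining IDs with `|`, here is a Python sketch that builds the same kind of request as in the question for several pages at once. The function name `links_query` and the example IDs are made up, and the sketch uses `format=json` rather than the question's `jsonfm` because it is meant for a program rather than a browser.

```python
from urllib.parse import urlencode


def links_query(page_ids, limit=500):
    """Build one generator=links request covering several page IDs."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "links",
        "gpllimit": limit,
        "redirects": "true",
        # Lists in the MediaWiki API are pipe-separated.
        "pageids": "|".join(str(page_id) for page_id in page_ids),
    }
    # urlencode percent-encodes the pipe as %7C, which the API accepts.
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


url = links_query([1234, 5678, 91011])
```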

Where is the Data stored on Website

I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
If you see the company name - Agilent Technologies Inc.
It's neither in the page source nor in any JSON response that I could find.
But it does show up in the DOM in Chrome Developer Tools.
I have looked at and analysed almost every request the page sends, but I still couldn't find where this data comes from.
By "where the data is saved", I mean: where can I scrape that data from,
ideally using python-requests and BeautifulSoup?
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
I am still learning Python, and it would be very useful if someone could help me with this.
Thanks in advance.
After the HTML is loaded, JavaScript requests the data through an XMLHttpRequest, and the result is inserted into the page as soon as the response arrives on your client. That's why you can see the DOM element in the element inspector.
You didn't mention what goal you want to achieve or what tool you are using. Please be specific in your question. If you are not familiar with this kind of pattern, search for AngularJS and look at some examples.
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
It means that JavaScript embedded in the page is sending an extra HTTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and the JavaScript in the page is then injecting the text into the DOM in the appropriate place.
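If that extra response turns out to be JSON, scraping it usually means requesting the same URL the browser requested (visible in the developer tools' network tab) and reading the fields directly, rather than parsing HTML. A rough Python sketch of the parsing step follows. The payload shape, the `results` and `companyName` keys, and the function name `company_names` are all hypothetical, since the real response format of this site is not known here.

```python
import json


def company_names(payload):
    """Extract company names from a JSON response body.

    Assumes a hypothetical shape: {"results": [{"companyName": "..."}]}.
    """
    data = json.loads(payload)
    return [row["companyName"] for row in data.get("results", [])]


# Stand-in for the body of the XMLHttpRequest response.
sample_body = '{"results": [{"companyName": "Agilent Technologies Inc."}]}'
names = company_names(sample_body)
```

With python-requests, the body would come from the `text` of the response to that same URL. BeautifulSoup is only needed if the response is HTML rather than JSON.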
Where is the Data stored on Website
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.

How to include the result of an api request in a template?

I'm creating a wiki using MediaWiki for the first time. I would like to automatically include all backlinks of the current page in a template (like a "See also" section). I have successfully experimented with the API, but I still haven't succeeded in including the useful part of the result in my template.
I have been querying Google and Stackoverflow for days (maybe in the wrong way) but I'm still stuck.
Can somebody help me?
As far as I know, there is no reasonable way to do that. Probably the closest you could get is to write JavaScript code that reacts to the presence of a specific HTML element in the page, makes the API request, and then updates the HTML to include the result.
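For reference, the API request such a script would make is a `list=backlinks` query. The sketch below shows the request URL and how the titles are read from the response. It is written in Python only to show the shape of the call, since on a wiki page the equivalent logic would have to run as JavaScript. The function names and the example endpoint are made up.

```python
import json
from urllib.parse import urlencode


def backlinks_url(api_endpoint, title, limit=50):
    """Build a list=backlinks query for the given page title."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "bllimit": limit,
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


def backlink_titles(response_text):
    """Pull the page titles out of a backlinks response body."""
    data = json.loads(response_text)
    return [item["title"] for item in data["query"]["backlinks"]]


# example.org stands in for your own wiki's api.php location.
url = backlinks_url("https://example.org/w/api.php", "Main Page")
```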
It’s not possible to execute JavaScript from wikitext, or even to use less common HTML. As such, you won’t be able to call the MediaWiki API that way.
There are multiple different options you have to achieve something like this though:
You could use the API by including custom JavaScript code on MediaWiki:Common.js. The code there will be included automatically and can be used to enhance the wiki experience. This obviously requires JavaScript on the client so it might not be the best option; but at least you could use the API directly. You would have to add something to figure out where to place the results correctly though.
A better option would be to use an extension that gives you this output. You can either try to find an extension that already provides this functionality, or write your own that uses the internal MediaWiki API (not the JS one) to access that content.
One extension I can personally recommend that does this (and many other things) is DynamicPageList (full disclosure: I’m somewhat affiliated with that project). It allows you to perform complex page selections.
For example what you are trying to do is to find all pages which link to your page. This can be easily done by DPL like this:
{{ #dpl: linksto = {{FULLPAGENAME}} }}
I recently wrote a blog post showing how to call the API to get the job queue size and display it inside a wiki page. You can read about it in "Display MediaWiki job queue size inside your wiki". This solution does require the External Data extension, however. The code looks like:
{{#get_web_data: url={{SERVER}}{{SCRIPTPATH}}/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
| format=JSON
| data=jobs=jobs}}
{{#external_value:jobs}}
You could easily swap in a different API call to get other data. For the specific item you're looking for, #poke's answer above is probably better.

Why doesn't Wikipedia have extensions?

Looking at a random Wikipedia article like http://en.wikipedia.org/wiki/Impostor_syndrome, I see that there's no .html attached to the end of the address. In fact, if I try to put a .html after it, Wikipedia tells me "Wikipedia does not have an article with this exact name." How come it doesn't need any file extensions?
More a superuser question?
There is no law saying that an HTML file has to end in .html or .htm, and since the wiki generates pages from a database, there is really no file there anyway (except in a cache).
Not having .htm or .php is more sensible: why should you care what technology they use when you ask for a URL? It would be like having to put the operating system of the recipient at the end of their email address.
When you make a request to a website, the URL probably looks like
www.example.com/siteA/index.html
This request just tells the web server that you want a resource called index.html in siteA.
The website running on this server has to determine what you want to see and how the data is loaded.
index.html could be a file in the siteA directory
or
it could be a row with the key "index.html" in the siteA table in your database.
So the part siteA/index.html is just a resource identifier. The grammar of this resource identifier is completely free and is determined per website.
URL rewriting is also commonly used to make URLs easier to read and remember.
For example, there could be a rewrite rule to accomplish the following:
If the user enters something like
www.example.com/download/demo.zip
rewrite it so your website sees it as:
www.example.com/download.php?file=demo.zip
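The rewrite rule above can be sketched in a few lines of Python. This only illustrates the idea with a regular expression. It is not how Apache's mod_rewrite is configured, and the names `REWRITE_RULE` and `rewrite` are made up for the example.

```python
import re

# Matches /download/<name>, where <name> has no further slashes or query string.
REWRITE_RULE = re.compile(r"^/download/(?P<name>[^/?]+)$")


def rewrite(path):
    """Map the friendly path to the script that really serves it."""
    match = REWRITE_RULE.match(path)
    if match:
        return "/download.php?file=" + match.group("name")
    return path  # anything else is passed through unchanged


internal = rewrite("/download/demo.zip")
```

The visitor only ever sees the first form. The second form is what the application behind the server receives.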
Wikipedia's servers map the URL to the page you want. .html is just a naming convention that today is mostly historical, from the period of static pages when URLs actually were names of files on the server. In fact, there may be no file at all: the server queries the database and a web framework sends out the HTML on the fly.
Wikipedia is most likely using the Apache module mod_rewrite in order to not have to link paths directly to a file system path.
See: http://en.wikipedia.org/wiki/Rewrite_engine#Web_frameworks
However, programming languages can also take control of the incoming URLs and return data depending on the structure of the link according to some set of rules. For example, the Django web framework employs a URL dispatcher.
That's because Wikipedia uses MediaWiki's feature of URL shortening.
Actually, when you request a page, it really loads a PHP file. Try searching for a word that doesn't exist, for example "Pazaz". The URL is http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=pazaz . Notice index.php in the URL.
To tell the truth it's not a MediaWiki feature, it's Apache. For further info http://www.mediawiki.org/wiki/Manual:Short_URL .
URL routing is your answer. For example, for ASP.NET, read the following excerpt:
The ASP.NET MVC framework includes a flexible URL routing system that enables you to define URL mapping rules within your applications. The routing system has two main purposes:
Map incoming URLs to the application and route them so that the right Controller and Action method executes to process them
Construct outgoing URLs that can be used to call back to Controllers/Actions (for example: form posts, links, and AJAX calls)
I would suggest that sites like this use some sort of Model View Controller framework similar to Ruby on Rails where the url 'directories' form a part of a request/url route...
In frameworks that are MVC based, the url 'directories' can dictate what View/Controller to utilise as well as what action should be taken with the data.
e.g. shop.com/product/carrots
Where product is a view/controller and carrots is the data. The framework then analyses which action/route to take. Default could be viewing the product information and price of the carrot.
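A toy version of this kind of routing, sketched in Python, might look like the following. It is not the dispatcher of Rails, ASP.NET MVC, or any other real framework, and `show_product`, `ROUTES`, and `dispatch` are names made up for the example.

```python
def show_product(name):
    """Default action for the product controller: view one product."""
    return f"Showing product page for {name}"


# The first URL segment selects the controller.
ROUTES = {"product": show_product}


def dispatch(path):
    """Split /controller/data and hand the data to the matching controller."""
    parts = [part for part in path.strip("/").split("/") if part]
    if len(parts) == 2 and parts[0] in ROUTES:
        controller, data = parts
        return ROUTES[controller](data)
    return "404 Not Found"


page = dispatch("/product/carrots")
```

No file named carrots exists anywhere. The path segments are just arguments to code.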