How can I use Google to find all unknown webpages that point to one specific known web page? - html

It's easy to find all child pages of one webpage, but it is not trivial to get all parent pages of one webpage. How can I do that using Google?

You can't, and Google cannot help you, as it doesn't index all of the web.
At best it follows links on other pages, or indexes something because someone explicitly asked to have it indexed.
Create a server. Put an HTML page on it that no other page on your server has a link to. Name the page with some non-guessable UUID in the name.
Google will not find this unless they start to randomly change parts of URLs to test for existing pages (a lengthy process).
Within that page you can have links pointing to other pages. It is a parent page of those pages, it is a web page, and yet it will not be found via Google.

Related

Search All Files in a Website

I am working on a project in which I need to find all files in a website. For example, my webpage contains index.html and a PDF file.
How can others find out that there is a PDF file in my website's domain?
You need a scraper of some sort, for example http://scrapy.org/.
You can traverse a website by following its links: if you think of each page as a node and its links as children, you can easily cover all files of that website.
If the website links to every one of its pages from somewhere, this method works; if not, you have to use other means, such as a search engine, to look through its indexed pages.
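A breadth-first traversal over those links is all this takes for a small site. Below is a minimal sketch, assuming Node 18+ (for the built-in fetch) and a hypothetical start URL with a same-site restriction; a real crawler (or Scrapy) would use a proper HTML parser and respect robots.txt.

// Minimal breadth-first link crawler sketch (Node 18+ for global fetch).
// startUrl is a placeholder; only URLs underneath it are followed.
const startUrl = "https://example.com/";
const seen = new Set([startUrl]);
const queue = [startUrl];

async function crawl() {
  while (queue.length > 0) {
    const url = queue.shift();
    let html;
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // skip pages that fail to load
    }
    console.log(url); // every reachable page/file is reported here
    // Naive href extraction with a regex; a real crawler would parse the HTML.
    for (const match of html.matchAll(/href="([^"#]+)"/g)) {
      const next = new URL(match[1], url).toString();
      if (next.startsWith(startUrl) && !seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
}

crawl();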

An HTML tag for displaying HTML received from a specific URL

I created an API of sorts that, when you navigate to it, returns information as HTML.
On my website, I would like the web page to reach out to the API and display the information as part of the page (much like a page reaches out for an img). What HTML tag would be best suited to achieving this? I came across a couple of tags but I'm not really sure which would be best.
I am building this myself, so I have full control over how the content is delivered back to the page. Is there a specific pattern used for such "modular" sourcing of information? I could rewrite my website so that, prior to serving the web page, it reaches out to the API, pulls the info itself, and includes the results in the HTML, but a) this would be more complex and require changes in several places, and b) it will become really complex as the number of such API call results I want to include increases.
You can use an iframe for this purpose: when you receive the HTML you want to display, you can simply set the iframe's content by its ID:
document.getElementById('myIframe').contentWindow.document.write("<html><body>Here is your html</body></html>");
Hope this helps.
As far as I know, using iframes is rather deprecated. I always use div tags for such tasks.
document.getElementById("targetdiv").innerHTML = "New HTML-Content";
More info on divs: http://www.w3schools.com/tags/tag_div.asp
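Whichever container you pick, the page still has to fetch the HTML from your API first. A minimal sketch of the div approach, assuming a hypothetical endpoint /api/widget that returns an HTML fragment and a placeholder <div id="api-content"></div> already present in the page (run the script after that div, e.g. at the end of the body):

// Fetch the HTML fragment from the (hypothetical) API and inject it into the placeholder div.
fetch("/api/widget")
  .then(response => response.text())
  .then(html => {
    document.getElementById("api-content").innerHTML = html;
  })
  .catch(() => {
    document.getElementById("api-content").textContent = "Could not load the content.";
  });

Only inject HTML you control this way; setting innerHTML from a third-party source is an XSS risk, which is one reason a sandboxed iframe is still sometimes preferred.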

How to find the parent page of a webpage

I have a webpage that cannot be accessed through my website.
Say my website is www.google.com and the webpage I cannot access through the website is something like www.google.com/iamaskingthis/asdasd. This webpage appears in the Google results when I search for its content, but there is nothing on my website that links to that page.
I've already tried analyzing the page source to find its parent location but I can't seem to find it. I want to delete that page, but since I cannot find it, I can't destroy it either.
Thank you
You can use a robots.txt file to prevent search engine bots from visiting a page, so that it does not show up in search results.
For example, you can create a robots.txt file in the root of your website and add the following content to it:
User-agent: *
Disallow: /mysecretpage.html
More details at: http://www.robotstxt.org/robotstxt.html
There is no such concept as a 'parent page'. If you mean the link by which Google found the page, please keep in mind that it need not be under your control: if I put a link to www.google.com/iamaskingthis/asdasd on a page on my website and the Googlebot crawls it, Google will know about it.
To make it short: there is no reliable way of hiding a page on a website. Use authentication if you want to restrict access.
Google will crawl the page even if the link or button to it is gone, as it already has the page stored in its records. The only way to stop Google from crawling it is either robots.txt or simply deleting it off the server (via FTP or your hosting's control panel).

Dynamic sitemap: XML or HTML

I've created a website with dynamic content, and I want Google to know all my pages, so I've submitted a file "mysitemap.xml" via Webmaster Tools.
Basically, my links look like mysite.com/one-id/one-name, with one-id an ID between 1 and 2000 (but it will grow over time...).
I'm wondering whether I need to create a page on my website (a kind of HTML sitemap) that lists all these links to help Google's bots find my web pages, or whether the XML sitemap is enough for Google.
The problem is that the HTML sitemap would be very ugly and only a "for Google" page, so I want to avoid this...
No, Google only requires a sitemap.
A sitemap is for search engines; navigation is for humans.
A sitemap includes each page's URL, update frequency, last-modified date, etc.
Navigation may include dropdown menus, hyperlinks, etc.
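For reference, each sitemap entry is just a URL plus optional metadata, so the XML file can be generated dynamically from your IDs; a minimal sketch of mysitemap.xml with placeholder values:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/1/one-name</loc>
    <lastmod>2013-01-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <!-- one <url> element per page, emitted by the same code that builds the dynamic pages -->
</urlset>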

How should I handle autolinking in wiki page content?

What I mean by autolinking is the process by which wiki links inlined in page content are generated into either a hyperlink to the page (if it does exist) or a create link (if the page doesn't exist).
With the parser I am using, this is a two step process - first, the page content is parsed and all of the links to wiki pages from the source markup are extracted. Then, I feed an array of the existing pages back to the parser, before the final HTML markup is generated.
What is the best way to handle this process? It seems as if I need to keep a cached list of every single page on the site, rather than having to extract the index of page titles each time. Or is it better to check each link separately to see if it exists? This might result in a lot of database lookups if the list wasn't cached. Would this still be viable for a larger wiki site with thousands of pages?
In my own wiki I check all the links (without caching), but my wiki is only used by a few people internally. You should benchmark stuff like this.
In my own wiki system the caching is pretty simple: when a page is updated, it checks the links to make sure they are valid and applies the correct formatting/location for those that aren't. The cached page is saved as an HTML page in my cache root.
Pages that are marked as 'not created' during the page update are inserted into a database table that holds the page name together with a CSV of the pages that link to it.
When someone creates that page, it initiates a scan through each linking page and re-caches it with the correct link and formatting.
If you weren't interested in highlighting non-created pages, however, you could just check whether the page exists when someone attempts to access it, and if not, redirect to the creation page. Then just link to pages as normal in other articles.
I tried to do this once and it was a nightmare! My solution was a nasty loop in a SQL procedure, and I don't recommend it.
One thing that gave me trouble was deciding what link to use on a multi-word phrase. Say you had some text saying "I am using Stack Overflow" and your wiki had 3 pages called "stack", "overflow" and "stack overflow"....which part of your phrase gets linked to where? It will happen!
My idea would be to query the titles with something like SELECT title FROM articles and simply check whether each wiki link is in that array of strings. If it is, you link to the page; if not, you link to the create page.
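A minimal sketch of that approach in JavaScript, assuming a hypothetical getAllTitles() that returns the result of something like SELECT title FROM articles, [[Page Title]] wiki-link markup, and an illustrative /wiki/ URL scheme:

// Load every title once, then turn [[Page Title]] markup into view or create links.
async function autolink(markup, getAllTitles) {
  const existing = new Set((await getAllTitles()).map(t => t.toLowerCase()));
  return markup.replace(/\[\[([^\]]+)\]\]/g, (wholeMatch, title) => {
    const slug = encodeURIComponent(title.trim());
    return existing.has(title.trim().toLowerCase())
      ? `<a href="/wiki/${slug}">${title}</a>`
      : `<a class="new" href="/wiki/${slug}?action=create">${title}</a>`;
  });
}

One title query per render (or per cache refresh) replaces one lookup per link, which is exactly the trade-off the question is asking about.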
In a personal project I made with Sinatra, after I run the content through Markdown, I do a gsub to replace wiki words and other things (like [[Here is my link]] and whatnot) with proper links, checking for each one whether the page exists and linking to the create or view page accordingly.
It's not the best, but I didn't build this app with caching/speed in mind. It's a simple, low-resource wiki.
If speed were more important, you could wrap the app in something to cache it. For example, Sinatra can be wrapped with Rack caching.
Based on my experience developing Juli, which is an offline personal wiki with autolinking, a static-HTML-generation approach may fix your issue.
As you suspect, it takes a long time to generate an autolinked wiki page. However, when generating static HTML, an autolinked wiki page only needs to be regenerated when a wiki page is newly added or deleted (in other words, not when a page is merely updated), and the regeneration can be done in the background, so it usually doesn't matter how long it takes. Users only ever see the generated static HTML.