How to parse website and get information - html

I am trying to parse a website. This is what I'm doing: I download the source, traverse the data using Nokogiri, and get the information I need, such as links, content, etc. I already have the script for getting the data. But I ran into a problem: a link only works when you click on it on the live site.
This is the example source I'm trying to traverse.
<div class="story-item-content group">
<div class="story-item-details">
<h3 class="story-item-title">
How NOT to fix your computer, part 2.
<span class="external-link-icon"></span>
</h3>
<p class="story-item-description">
zug.com <a href="/story/r/how_not_to_fix_your_computer_part_2" class="story-item-teaser">— After you read this you should understand what not to do.
<span class="timestamp">21 hr 59 min ago</span></a>
<a class="crawl4link" href="http://crawl4.digg.internal/permalink/view/how_not_to_fix_your_computer_part_2">View in Crawl 4</a>
</p>
</div>
So the teaser link, href="/story/r/how_not_to_fix_your_computer_part_2",
only works on the live site. When I download the source and click the link, it doesn't work. I'm guessing the link is saved on the server. Any idea how I can get the full link? I was thinking of having a script that clicks the link so that I can get the working link. Any idea how to do this? Thanks.

That URL is a relative URL,
so if the website you're at is:
http://mywebsite.com/index.html
then your full link is:
http://mywebsite.com/story/r/how_not_to_fix_your_computer_part_2

It's a relative link, relative to the root directory of the website. Just prepend the scheme and domain (e.g. http://example.com/story/r/how_not_to_fix_your_computer_part_2).
The reason clicking the link won't work is that the href value is a relative one: it is resolved against the location the page was loaded from. Once you download the page to your local computer, it is no longer relative to the original domain, so the browser will assume it is looking for a file at http://localhost/story/r/how_not_to_fix_your_computer_part_2 (or at a file:// path if you opened the page straight from disk). Since there isn't a file or a resource at that URL, it fails.
What you want to do is change the href value to an absolute URL by prepending the original scheme and domain (e.g. http://digg.com/story/r/how_not_to_fix_your_computer_part_2). Then it will work when you click it from your local drive.
You won't need to worry about the numbers added on to the URL when it finally resolves; that will be handled by the resource at the digg.com/story/r/how_not_to_fix_your_computer_part_2 URL.
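To illustrate the "prepend the domain" advice, here is a minimal JavaScript sketch that resolves a relative href against the URL the page was downloaded from. The `pageUrl` value and the `toAbsolute` helper name are assumptions made for this example; the question itself uses Ruby and Nokogiri, where the standard library's `URI.join` serves the same purpose.

```javascript
// Resolve an href taken from a downloaded page against the URL the page
// originally came from. `toAbsolute` is a hypothetical helper name.
function toAbsolute(href, pageUrl) {
  // The URL constructor applies the standard relative-URL resolution rules.
  return new URL(href, pageUrl).href;
}

// Assumed address of the live page; it is not given in the question.
const pageUrl = "http://digg.com/news/technology";

const relativeHref = "/story/r/how_not_to_fix_your_computer_part_2";
const absoluteHref = toAbsolute(relativeHref, pageUrl);

console.log(absoluteHref);
```

An href that is already absolute passes through unchanged, so the same helper can be applied to every link found in the page.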

Related

Why <a> tag doesn't navigate properly to url if using www.something.com

I was working on an <a> tag, and let's say my domain was www.abc.com. If I use href="http://example.com", it navigates to the intended URL. However, if I use href="www.example.com", it doesn't navigate to the intended URL.
<a href="www.example.com">not properly navigate</a>
<a href="http://example.com">properly navigate</a>
I was reading the anchor specs at https://html.spec.whatwg.org but unfortunately could not find this specific case.
The browser must know whether you want to link to another website or to a different file/page of your own website. The browser always assumes that you want to link to a file on your own server if you do not specify the protocol.
In fact, the only reason you can leave the protocol out when typing a URL into the address bar of your browser is that the browser just assumes you want to use the HTTP protocol. This is not possible with URLs inside the <a> tag.
If you don't specify an absolute url it will think it's a route inside your site.
Possible values using href attributes:
An absolute URL - points to another web site (like href="http://www.example.com/default.htm")
A relative URL - points to a file within a web site (like href="default.htm")
Link to an element with a specified id within the page (like href="#top")
Other protocols (like https://, ftp://, mailto:, file:, etc..)
A script (like href="javascript:alert('Hello');")
Because when you click on it, your browser assumes it needs to find this link in the same location as the current file.
Example: if your HTML file is at e://tst.html,
then when you click the <a> tag, the browser will go to e:// and look for a file named "www.google.com", and it won't find it.
Use a full URL that includes the protocol (for example href="http://www.google.com") to tell the browser that you need to navigate to another website.
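The behaviour described in these answers can be reproduced outside the browser with the standard `URL` constructor, which follows the same resolution rules. This is only a sketch: the `base` page address is assumed from the "www.abc.com" domain in the question, and `hasScheme` is a hypothetical helper.

```javascript
// Assumed address of the page that contains the links.
const base = "http://www.abc.com/index.html";

// Hypothetical helper: true when the href starts with a URL scheme
// such as "http:", "https:" or "mailto:".
function hasScheme(href) {
  return /^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(href);
}

// Without a scheme, the href is treated as a path relative to the page.
const withoutScheme = new URL("www.example.com", base).href;

// With a scheme, the href is absolute and the base is ignored.
const withScheme = new URL("http://example.com", base).href;

console.log(hasScheme("www.example.com"), withoutScheme);
console.log(hasScheme("http://example.com"), withScheme);
```

The first result stays on www.abc.com, which is why the link appears not to navigate properly.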

image link not working if i don't open the image's link first

My images are hosted as links, and I put each link in the src attribute of an img tag (src="https://i.imgur.com/ABCDEF..."). My issue is that the images don't load unless I open each link in the browser first.
I don't really know what to try. The code works as written, except that I have to open each link first.
The links are stored in a JSON structure on GitHub, and I insert them into my HTML via JavaScript.
<ul id="galleryUl">
<h1 class="tracking-in-expand-fwd" id="h1Name">ANGELA & VALENTINA</h1>
<img src="https://i.imgur.com/ZOHGX1Z.jpg">
<img src="https://i.imgur.com/AWOW84K.jpg">
<img src="https://i.imgur.com/xXZYJjF.jpg">
<img src="https://i.imgur.com/mQhqGIG.jpg">
<img src="https://i.imgur.com/PfzJb37.jpg">
</ul>
It works here, but not in my page
screenshot of the problem
Errors 401 and 403 are authorization errors, and they usually mean that the content is stored in an area that the user needs to be logged in to or has no access to. From the path shown in the screenshot, however, it looks as though the images are being requested incorrectly from localhost. If the https:// part of the URL is missing in the code that creates the image URL, correct that first and try again; without it, the code expects the images to be stored on your local machine.
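As a rough sketch of that check, the snippet below makes sure every link has an https:// prefix before it is placed in an img tag. The helper names `toImageSrc` and `buildGalleryHtml`, and the shape of the `links` array, are assumptions for illustration; the asker's actual JSON structure is not shown in the question.

```javascript
// Hypothetical helper: return an absolute https URL for an image link.
function toImageSrc(link) {
  if (/^https?:\/\//i.test(link)) return link;        // already absolute
  if (link.startsWith("//")) return "https:" + link;  // protocol-relative
  return "https://" + link;                           // bare host and path
}

// Hypothetical helper: build the img tags for a list of links.
function buildGalleryHtml(links) {
  return links.map((link) => `<img src="${toImageSrc(link)}">`).join("\n");
}

// Assumed input: one link with the protocol and one without it.
const links = ["https://i.imgur.com/ZOHGX1Z.jpg", "i.imgur.com/AWOW84K.jpg"];

console.log(buildGalleryHtml(links));
```

If the images still fail to load once the URLs are absolute, the cause lies elsewhere and this sketch will not help.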
Hope this helps

HTML Link doesn't work properly sometimes

I am using a local server for my applications, and sometimes when I create a button or a link that opens another page in a new tab, it doesn't work properly. It doesn't always happen, only sometimes, which might sound silly. I give an example below.
Let's say my application is **programmingworld**, which lives in the www folder. In the index.html file, I create a link for a button like this:
<a href="www.google.co.uk">
<div class="button" id="button=popup">Download Codes</div>
</a>
When I open it in a browser and click the button, sometimes it goes to http://localhost/programmingworld/www.google.co.uk, where nothing is displayed on the page. It is supposed to open www.google.co.uk in a new tab, where I can see the Google homepage.
Can you please tell me why?
You should write:
<a href="http://www.google.co.uk">
<div class="button" id="button=popup">Download Codes</div>
</a>
If you don't write http:// at the beginning of the hyperlink, the browser will search for it among your local directories or files.
To make sure that the link goes to where you intend and not where it goes try adding // or http://.
Example:
<a href="//www.google.co.uk">Google</a>
or
<a href="http://www.google.co.uk">Google</a>
With //, the link uses the same protocol (http or https) as the page it is on.
You're missing https:// before www.google.co.uk.
So your markup should look like this:
<a href="https://www.google.co.uk">
<div class="button" id="button=popup">Download Codes</div>
</a>
You can also do it like this (no https):
<a href="//google.co.uk">
<div class="button" id="button=popup">Download Codes</div>
</a>
Because you haven't included the protocol in your URL. It must start with either http:// or https://.
Also, remove the div from inside the anchor tag.
Your question suggests that you need to do a little bit more testing on basic html.
I would most definitely suggest using https://
I've had similar problems, and adding https:// fixed them.
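To see how the three forms of href discussed in these answers resolve, here is a small JavaScript sketch using the standard `URL` constructor. The page address under http://localhost/programmingworld/ is taken from the question; the file name index.html and the https variant of the page are assumptions for the example.

```javascript
// Page addresses used as the base for resolution.
const httpPage = "http://localhost/programmingworld/index.html";
const httpsPage = "https://localhost/programmingworld/index.html"; // assumed

// No protocol: treated as a file name next to the current page.
const broken = new URL("www.google.co.uk", httpPage).href;

// Protocol-relative: inherits the protocol of the current page.
const fromHttp = new URL("//google.co.uk", httpPage).href;
const fromHttps = new URL("//google.co.uk", httpsPage).href;

// Explicit protocol: independent of the current page.
const explicit = new URL("https://www.google.co.uk", httpPage).href;

console.log(broken);
console.log(fromHttp);
console.log(fromHttps);
console.log(explicit);
```

The first result matches the address reported in the question, and the two protocol-relative results show that // does not try both protocols but simply copies the one the page was loaded with.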

Is it possible to link to a div without changing the URL (in HTML)?

So, basically, I want people to be able to navigate my website, through links to Divs, but PREVENT the browser from changing the current URL (it adds #divname at the end of the .html file).
I have something like this:
<div id="modalLogin" class="modalLogin">
<!-- random stuff here -->
</div>
And somewhere else I have a link to that Div:
<a href="#modalLogin">
<img class="btnLogin" src="../images/btnLogin.png" alt="Log in!"/>
</a>
But, as I mentioned before, whenever they click those kind of links, the URL changes. I'd like to be able to navigate the website WITHOUT that happening. If at all possible, using just HTML (no JavaScript, no jQuery, no AJAX).
While we're at it, I've seen entire websites not changing their URL at all (even when I've traced the requests and am clearly navigating through different files), and some don't even show you the 'expected address' (the URL on the bottom left of the browser). How do I do that?
Thanks in advance!
P.S.: I've searched this website, and apparently all 'similar' questions ask just about the opposite: how TO change the URL.
I think you can use JavaScript for this.
//Grab your current Url
var url = window.location.toString();
//Remove anchor from url using the split
url = url.split("#")[0];
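The lines above only compute the address without the anchor; they do not change what the address bar shows. One possible way to go further, offered as a sketch rather than a tested solution for this page, is to rewrite the current history entry whenever the hash changes. The `stripFragment` helper is a hypothetical name, and how the browser treats the already-targeted element afterwards may vary.

```javascript
// Hypothetical helper: return the URL without its fragment ("#...").
function stripFragment(url) {
  const parsed = new URL(url);
  parsed.hash = ""; // an empty hash removes the fragment entirely
  return parsed.href;
}

// In a browser: after a "#..." link has been followed, replace the
// current history entry with the same address minus the fragment.
// The guard lets the snippet also run outside a browser (e.g. Node.js).
if (typeof window !== "undefined" && typeof history !== "undefined") {
  window.addEventListener("hashchange", () => {
    history.replaceState(null, "", stripFragment(window.location.href));
  });
}

// Example with an assumed address.
console.log(stripFragment("http://example.com/page.html#modalLogin"));
```

This still relies on JavaScript, so it does not satisfy the HTML-only wish in the question.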

Anchor with hash in href attribute is opening a new page

I have never seen this behavior. I have a simple hash link on a website. The link looks like this:
<a href='#view_123'>Click</a>
On my test server, when I click, it simply changes the url to
http://www.myserver.com/mypage.aspx#view_123
And the page does not redirect anywhere. However, when I push this same link to my live server, it causes the browser to redirect to:
http://www.myserver.com/www.myserver.com#view_123
This makes no sense to me. The only way around this is to put the full url of the page in the href with the hash appended to the end, but this is causing me other problems and is not what I want to do.
The only clue I've come across is the MIME type, but I'm pretty sure mine is correct as "text/html".
There is no javascript causing this. I can hover over the link, and the url hint in Chrome shows the incorrect url.
Have you tried changing the target attribute?
<a href='#view_123' target='_self'>Click</a>
or
<a href='#view_123' target='_top'>Click</a>
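It may also help to check what the hash link is being resolved against on the live server. The sketch below is speculative: it assumes, purely for illustration, that the live page contains a `<base href="www.myserver.com">` element written without a protocol. Nothing in the question confirms this, but under that assumption standard URL resolution produces exactly the address reported.

```javascript
// Address of the page, taken from the question.
const pageUrl = "http://www.myserver.com/mypage.aspx";

// Normal case: a hash-only href is resolved against the page itself.
const expected = new URL("#view_123", pageUrl).href;

// Hypothetical case: a <base href="www.myserver.com"> without a protocol.
// The base value is itself relative, so it resolves to a path on the server.
const baseUrl = new URL("www.myserver.com", pageUrl).href;

// Every link on the page, including "#view_123", then resolves against it.
const observed = new URL("#view_123", baseUrl).href;

console.log(expected);
console.log(observed);
```

If the rendered HTML on the live server has no such base element, this explanation does not apply.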