Scraping raw javascript and css files with Scrapy - html

I'd like to scrape all the linked javascript and css files on a given domain with Scrapy. The issue is that I don't quite understand how to extract the links from the link elements.
Assume I'm scraping example.com. There are links to js and css of the form:
<link rel="stylesheet" href="/path_to_css/example.css"/>
<script src="/path_to_js/example.js"></script>
These links start from the root domain, so no problem. But if the links are like the ones below, it starts to get confusing:
<link rel="stylesheet" href="path_to_css/example.css"/>
<script src="path_to_js/example.js"></script>
These relative URLs are supposed to work such that if I'm on example.com/some_page/ the link paths are appended to that, like: example.com/some_page/path_to_js/example.js. That's not always how it works on actual web pages, however. On some web sites with language selection, e.g. example.com/en/some_page, the relative paths start from example.com/en instead of the full path of that page.
So, while expecting to find the files at example.com/en/some_page/path_to_js/example.js, you find them at example.com/en/path_to_js/example.js
Is there any way to tell where the relative paths start from?

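While scraping, Scrapy lets you build an absolute URL from a relative one with response.urljoin, which resolves the path against the URL of the page being parsed.
You could do something like this:
# Select only elements that actually carry the attribute (inline scripts have no src)
for href in response.css("link::attr(href)").getall():
    css_url = response.urljoin(href)  # absolute URL of the linked file
for src in response.css("script::attr(src)").getall():
    js_url = response.urljoin(src)  # absolute URL of the script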
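To see where relative paths start from without running a spider, here is a minimal sketch that uses only the Python standard library. It resolves each href/src the way a browser does: against the "directory" of the page URL (so a trailing slash matters), unless the page declares a <base href>, which takes over. The names AssetCollector, collect_assets and SAMPLE_HTML are made up for this illustration, and the sketch only keeps link tags with rel="stylesheet".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect stylesheet and script URLs from one HTML page (illustrative helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # replaced by the first <base href>, if any
        self.base_seen = False
        self.raw = []  # href/src values exactly as written in the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and not self.base_seen:
            # Only the first <base> counts; it may itself be relative.
            self.base_url = urljoin(self.base_url, attrs["href"])
            self.base_seen = True
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.raw.append(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            # Inline scripts have no src and are skipped.
            self.raw.append(attrs["src"])

    def assets(self):
        # Resolve at the end so a <base> tag applies wherever it appears.
        return [urljoin(self.base_url, url) for url in self.raw]


def collect_assets(page_url, html):
    parser = AssetCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.assets()


SAMPLE_HTML = (
    '<link rel="stylesheet" href="path_to_css/example.css"/>'
    '<script src="path_to_js/example.js"></script>'
    "<script>var inline = 1;</script>"
)

# No trailing slash: "some_page" is treated as a file, so paths start at /en/
print(collect_assets("http://example.com/en/some_page", SAMPLE_HTML))
# Trailing slash: "some_page/" is treated as a directory
print(collect_assets("http://example.com/en/some_page/", SAMPLE_HTML))
```

This is why the files in the question turn up under example.com/en/ rather than example.com/en/some_page/: the page URL has no trailing slash, so its last segment is dropped before the relative path is appended.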

Related

why does my css file when refreshed changes to index.html page in bracket editor

I am new to front-end languages and I'm having difficulty with how my CSS is displayed. Whenever I write CSS in the Brackets editor and link it from one of my HTML files, for example with <link rel="stylesheet" href="css/mine.css>, and then refresh the mine.css page, it automatically displays my index.html page. That is not what I expect, since I linked the stylesheet from the other HTML pages I'm working with.
Is there any way to overcome this?
Thanks
Hopefully I'm understanding you correctly. Browsers render the HTML files in your project folder; CSS files are only used to style an HTML page. So you have to link the CSS file from the HTML file you want to render, for example index.css from index.html. In the head of index.html you put the link tag
<link rel="stylesheet" href="index.css">
as an example. Then you save and reload (or just save, if your tool reloads automatically). CSS files are only displayed in the linked html files, so this means

Jekyll doesn't apply CSS to posts on GitHub Pages

Yes, I know that this question has already been asked multiple times, and I've applied every solution that I could find, but I still have this issue:
If you visit my Jekyll site (yasath.github.io), the homepage and the tags page in the navbar render the CSS perfectly and look beautiful. However, when I click on a post (like this one), the CSS completely fails to render and I end up with an old-looking white background, Times New Roman text page!
My config.yml file has the right URLs (as far as I know!), and when I view the source of both pages with Chrome's developer tools, they both import the same CSS file correctly.
Hopefully, someone can give me advice specific to my site! This is the GitHub repo for the site, and again, here is the actual site.
The links to your stylesheets in your <head> section are written as relative links. You have, for example:
<link rel="stylesheet" href="assets/css/app.min.css">
When a URL starts with a directory name like "assets", the browser resolves it relative to the directory of the page it's displaying right now. So when you're at yasath.github.io, it goes to yasath.github.io/assets/css/app.min.css... but when you're at https://yasath.github.io/2017/09/04/hello-jekyll.html, it looks for a stylesheet at https://yasath.github.io/2017/09/04/assets/css/app.min.css, which of course doesn't exist.
You want to start your URL with a /. That tells the browser to look relative not to the page it's displaying, but to the root of the website. So in your head template use:
<link rel="stylesheet" href="/assets/css/app.min.css">
...and similarly with all your other stylesheet URLs.
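If you want to check this without a browser, Python's urllib.parse.urljoin applies the same URL resolution rules, so a quick sketch like the one below shows where each form of the href ends up. The variable names are only for illustration.

```python
from urllib.parse import urljoin

home = "https://yasath.github.io/"
post = "https://yasath.github.io/2017/09/04/hello-jekyll.html"

relative = "assets/css/app.min.css"   # no leading slash
rooted = "/assets/css/app.min.css"    # leading slash: relative to the site root

from_home = urljoin(home, relative)
from_post = urljoin(post, relative)
from_post_rooted = urljoin(post, rooted)

print(from_home)         # https://yasath.github.io/assets/css/app.min.css
print(from_post)         # https://yasath.github.io/2017/09/04/assets/css/app.min.css
print(from_post_rooted)  # https://yasath.github.io/assets/css/app.min.css
```

Only the rooted form gives the same stylesheet URL from both pages.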

How to format URL for GET Request in Webmatrix

I'm working on a simple CRUD application to keep track of phones using Webmatrix 3. My Default.cshtml file displays the table.
When clicking on a row it goes to the EditPN.cshtml page for the user to edit the information for that record.
Now, following this tutorial I look into the value of UrlData[0] in my EditPN page.
Everything works fine with just one problem: since the URL ends up being something like this:
http://localhost:64053/EditPN/2223334444 my paths for CSS and JS files are off. My brute force solution has been to have both:
<link href="_css/myStyles.css" rel="stylesheet">
<link href="../_css/myStyles.css" rel="stylesheet">
in my _Layout.cshtml.
That way both http://localhost:64053/Default.cshtml and http://localhost:64053/EditPN/222333444 will have the CSS styles.
Since I don't like that, I tried to format the URL string to be this: http://localhost:64053/EditPN?pn=2223334444. Didn't work.
Tried this too: http://localhost:64053/EditPN.cshtml?pn=2223334444. Didn't work either. It doesn't even go to the EditPN.cshtml page.
How can I solve this issue? Oh, and BTW, I don't want to use the Webmatrix helpers. I want to keep things under JS, jQuery, etc.
You should prefix the URLs of your css and js files with ~/, which is resolved on the server to the root of your site, so the same path works from any page:
<link href="~/_css/myStyles.css" rel="stylesheet" />

how can i connect my css to my JSP files stored in the WEB-INF folder? Websphere/JSP

I am using IBM WebSphere and creating a Dynamic Web Project. All of my JSP files are in my WEB-INF folder and I use servlet mapping in my web.xml file to make them accessible. This has worked fine so far. However, I have a problem with my CSS. As always, my CSS file is located in WebContent in a folder named css. Here's the link in my JSP:
<link rel="stylesheet" href = "css/styles.css">
I'm having no luck getting my CSS to show...
What am I missing?
Relative URLs in the generated HTML output are interpreted by the browser relative to the request URL (the one you see in the browser's address bar), not to their physical location in the server's disk file system. After all, it is the browser that has to download them with an HTTP request; the web server does not include them from disk.
One of the ways is to use a domain-relative path for those resources, i.e. start with /. You can use ${pageContext.request.contextPath} to dynamically inline the current webapp's context path.
<link rel="stylesheet" href="${pageContext.request.contextPath}/css/styles.css">
This will end up in the generated HTML output as follows:
<link rel="stylesheet" href="/yourContextPath/css/styles.css">
This way the browser will be able to download them properly.
See also:
Browser can't access/find relative resources like CSS, images and links when calling a Servlet which forwards to a JSP
I think you need to look at it from the browser's perspective: what the URL of the page is, what the context path is, and what the current path is.
If your app context path is for example "myApp" then you can do something like this to make it work:
<link rel="stylesheet" href = "/myApp/css/styles.css">
If you want to make it relative so that it does not depend on the context path, and your URL looks like http://localhost:8080/myApp/myservlet/file.jsp, then your link tag would be
<link rel="stylesheet" href = "../css/styles.css">
Firebug or the Chrome console can be really helpful for seeing what the browser is trying to fetch.
Hope this helps!

link stylesheet from included header in PHP

I'm currently working on updating a "legacy" website to xhtml/css, so that I can go ahead and proceed with a re-design. All of the pages have the header included via PHP. The issue is that if I reference the style sheet from the header as "style.css", it looks in the current directory for the style sheet, where of course there is no style sheet. Do I need to use an absolute path, or is there a better way to do this?
The line below should work in any HTML/PHP file in any directory, included/required or not, as long as the directory "assets" is in your site's root directory. I think I'm right in saying this is true for all "href" attributes (e.g. in anchors).
<link href="/assets/css/style.css" rel="stylesheet" type="text/css" />
If you're including a CSS file with a PHP include, you must know the relative path from every file in which you are running the include function - no absolute URLs are allowed.
The path to the CSS file is relative to the URL which you used to request the main PHP page (the one in the browser address bar), not to the local disk file system path where the PHP page is located on the server machine. That is because CSS files are loaded by the web browser, not by the web server.
So to figure out the relative style sheet path which you'd like to use in <link href> in the HTML head, you need to know the absolute URL of both the PHP page and the CSS file so that you can extract the relative CSS path from it.
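That last step, working out the relative path from the two absolute URLs, can be sketched in Python. This is only an illustration under assumptions: relative_href is a made-up helper, the example.com URLs are hypothetical, and query strings and fragments are ignored.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def relative_href(page_url, asset_url):
    """Return the relative path that leads from page_url to asset_url."""
    page = urlsplit(page_url)
    asset = urlsplit(asset_url)
    if (page.scheme, page.netloc) != (asset.scheme, asset.netloc):
        # Different site: a relative path cannot express this.
        return asset_url
    # Relative URLs start from the "directory" part of the page URL.
    page_dir = posixpath.dirname(page.path) or "/"
    return posixpath.relpath(asset.path, start=page_dir)


# Hypothetical URLs, for illustration only
page_url = "http://example.com/admin/tools/index.php"
css_url = "http://example.com/style.css"

href = relative_href(page_url, css_url)
print(href)  # ../../style.css

# Sanity check: the browser-style resolution leads back to the CSS file
print(urljoin(page_url, href) == css_url)
```

Since the result changes with the depth of every page that includes the header, a path starting with / is usually the simpler choice for a shared header.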