I have thousands of unique URLs that look like this:
<url>
<loc>http://my_site_url/view_profile=1577</loc>
<changefreq>daily</changefreq>
</url>
Are these URLs that Google or any other search engine will crawl?
Yes, those links will definitely be crawled, but you should make sure they are also properly linked from your pages, not just listed in your sitemap. Could the URLs be improved? Definitely, in the following ways:
Separate words with dashes, not underscores
Make use of URL rewriting to change that = to a /
You may also want to consider using a username rather than a profile number. You might find Google's article on URL structure interesting as well.
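The two suggestions above can be sketched in a few lines of Python. This is only an illustration of the target URL shape, not a server configuration: the function name prettify_profile_url and the resulting /view-profile/<id> layout are assumptions, and the real rewrite would normally live in the web server's rewrite rules.

```python
import re


def prettify_profile_url(url: str) -> str:
    """Turn .../view_profile=1577 into .../view-profile/1577 (hypothetical layout)."""
    match = re.search(r"/view_profile=(\d+)$", url)
    if match is None:
        # Leave anything that does not look like a profile URL untouched.
        return url
    # Dash instead of underscore, slash instead of the equals sign.
    return url[: match.start()] + "/view-profile/" + match.group(1)


print(prettify_profile_url("http://my_site_url/view_profile=1577"))
```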
I want to extract the pages mentioned in the infobox and templates of pages.
E.g. From this page:
https://en.wikipedia.org/wiki/DNA
I want to extract all of the links in the infobox, like: "Genetics", "Introduction to Genetics" etc.
I want to do this using the SQL dumps, if possible without parsing the XML of whole pages, and I don't want to use the APIs.
I could not find a way.
While Pagelinks also includes the links from infoboxes, I cannot find a way to separate them from the other links.
I thought Templatelinks might have that information, but it does not: I could not find the page IDs of the links that appear in infoboxes.
Where is this information stored?
Or which kind of tables should I look at?
I consulted previous questions:
where can I find the infobox templates used in wiki?
and the MediaWiki reference:
https://www.mediawiki.org/wiki/Manual:Templatelinks_table#Schema_summary
but could not find a solution.
That is a sidebar rather than an infobox: https://en.wikipedia.org/wiki/Template:Genetics_sidebar
I don't think there's a way of doing it other than parsing the content of the template to extract the links or using the API: e.g. https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Template:Genetics%20sidebar&pllimit=100&plnamespace=0
Something like this should also work but it's not returning any results for me:
SELECT * from pagelinks
where pl_title = 'Genetics_sidebar'
and pl_namespace = 0
and pl_from_namespace = 10
https://quarry.wmcloud.org/query/71442
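If you do end up parsing the content of the template, a minimal sketch of pulling link targets out of wikitext could look like the following. It assumes you already have the template's wikitext as a string (the sample text below is made up), and it crudely skips any target containing a colon to drop File:, Category: and similar links, which will also drop the few article titles that contain a colon.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_link_targets(wikitext: str) -> list:
    """Return the distinct link targets in order of first appearance."""
    targets = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if ":" in title:
            # Assumption: a colon means a non-article namespace (File:, Category:, ...).
            continue
        if title not in targets:
            targets.append(title)
    return targets


# Made-up sample, not the real content of Template:Genetics sidebar.
sample = (
    "* [[Genetics]]\n"
    "* [[Introduction to genetics|Introduction]]\n"
    "* [[File:DNA icon.svg|20px]]\n"
    "* [[DNA#Properties]]\n"
)
print(extract_link_targets(sample))
```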
I'm trying to find a specific link on a web page using the Windows command line and tools. I think Xidel can do what I want to do.
In the page, the link is used like this:
file: 'http://link.link/index.txt'
Note: there's only one line like this. Now if I can set something like
file: '{%link}'
then I'll be able to extract the link. Also if I want to change the word index.txt to something like root.txt and then use aria2 to download the link as http://link.link/root.txt , what do I need to do?
(I don't have any experience with any of these tools or command-line scripts. I just want to make something that does this and only this; some alternatives are already available, but I want to do it myself. I did search for it and have an idea of how to do it, but extracting the exact URL seems to be the hardest part, since I couldn't find anything that might help me in Xidel's docs.)
Xidel is meant to extract data from HTML/XML/JSON files, but it can also extract from CSV and TXT files if you know how to use the $raw variable and Xidel/XQuery functions like extract(), tokenize() and replace().
Post the URL or the source (or part thereof) of the webpage and I'll see how I can help you.
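In the meantime, the extraction logic itself can be illustrated outside Xidel. The following is a Python sketch rather than a Xidel command, and it assumes the page source contains exactly one line of the form file: '...' as described in the question; the function names and the sample source are made up.

```python
import re
from typing import Optional


def extract_file_link(page_source: str) -> Optional[str]:
    """Return the URL inside  file: '...'  or None if the line is missing."""
    match = re.search(r"file:\s*'([^']+)'", page_source)
    return match.group(1) if match else None


def swap_filename(url: str, new_name: str) -> str:
    """Replace the last path segment (e.g. index.txt) with new_name."""
    return url.rsplit("/", 1)[0] + "/" + new_name


# Made-up page fragment for illustration.
source = "setup({\n    file: 'http://link.link/index.txt'\n});"
link = extract_file_link(source)
print(link)
print(swap_filename(link, "root.txt"))
```

The resulting URL can then be handed to aria2 on the command line (aria2c followed by the URL).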
Going through another crazy website migration!
I have HTML img src urls that look like this
http://blog.example.com/imagename.jpg
Image formats can also be jpg, png, or gif
We need a regex that finds every url that has the domain then "/imagename.jpg" immediately after.
Very new to regex, what would the expression be?
Better Alternative for WordPress Migrations
If you are moving your website and want to replace all references to the old site with the new domain, I suggest you use David Coveney's Serialized Search & Replace DB v2.1.0. Run it on a new copy of the database, and always keep a backup handy. Import the database on the destination server, then run the tool. You don't even have to upload the server files.
When I do this coming from a development server to live domain, I usually do two search & replaces:
One for URLs, very basic:
Search: mywebsite.devserver.com
Replace: my-new-website.com
And one for file paths:
Search: /vhosts/devserver.com/mywebsite
Replace: /vhosts/my-new-website.com/httpdocs
(Note: This is assuming the majority of the file path is the same for both servers. Your search & replace paths may need to be more accurate)
The reason you want serialized search and replace is that some data is stored in PHP-serialized format, and if you change the value with a text editor or in MySQL directly, it may not be able to unserialize afterwards.
Regex Answer
Select images hosted by blog.example.com with the following regex pattern:
((http|https)://blog\.example\.com/[^ \r\n]+\.(jpg|jpeg|png|gif))
Which basically searches for this: http(s)://blog.example.com/*.(jpg/png/etc)
Matches the URLs in the following examples:
http://blog.example.com/imagename.jpg
http://blog.example.com/favicon.png
http://blog.example.com/uploads/2013/05/kitten.gif
https://blog.example.com/ssl-secure.png
This is my favorite gif https://blog.example.com/some-hilarious-image.gif hahaha
DOES NOT match any of these:
blog.example.com/google.png
https://blog.google.com/google.png
our website is http://blog.example.com and has an image named /imagename.png
http://blog.example.com/
WHY it doesn't match those (by line):
Does not include http(s)://
Hosted by google
Paragraph text, where the URL is split into two parts
Not an image
$1 returns the full URL of the image.
I tested this on RegexTester.com. You can copy the pattern in the top field, and all of the examples in the box below. The red highlights are matches.
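If you would rather check the pattern locally, here is a small Python sketch that runs it against some of the examples above. It uses the pattern with both dots in the host name escaped, and the helper name find_image_urls is just for illustration.

```python
import re

# Same pattern as above, with both dots in the host name escaped.
IMAGE_URL = re.compile(
    r"((http|https)://blog\.example\.com/[^ \r\n]+\.(jpg|jpeg|png|gif))"
)


def find_image_urls(text: str) -> list:
    """Return the full URL (group 1) of every image hosted on blog.example.com."""
    return [match.group(1) for match in IMAGE_URL.finditer(text)]


samples = [
    "http://blog.example.com/uploads/2013/05/kitten.gif",
    "This is my favorite gif https://blog.example.com/some-hilarious-image.gif hahaha",
    "blog.example.com/google.png",
    "https://blog.google.com/google.png",
    "http://blog.example.com/",
]
for line in samples:
    print(line, "->", find_image_urls(line))
```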
There are many good suggestions already, and one might ask why a WordPress site would hardcode the domain name into links, but that's not our problem right now. If you need a regex, then try this:
(?<=<img).+(?<=src=["'])(.+(?:jpe?g|gif|png))
EXPLAINED:
(?<=<img).+(?<=src=["']) - be sure we're inside an <img> tag up to src attribute
(.+(?:jpe?g|gif|png)) capture everything up to required extension
I am scraping a website and am trying to pull out certain elements from the HTML. In the sites I am scraping, there are script tags with a bunch of info in them; however, there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
With some stuff above and below it. Now, this is different for each page source except for obviously the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need to use regular expressions, as the URLs are dynamic.
The "gsub" method does something similar to what I want, with its ability to use /regex/. But I don't want to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, I guess this is what you're looking for:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
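Note that /http.+/ matches everything from http to the end of the line, including the trailing quote and comma. A tighter variant, sketched here in Python rather than in the asker's language, anchors on the 'image' key and stops at the closing quote. It assumes the line always has the 'image':'...' shape shown in the question; the function name and sample source are made up.

```python
import re
from typing import Optional

# Capture only what sits between the quotes after the 'image' key.
IMAGE_FIELD = re.compile(r"'image'\s*:\s*'(https?://[^']+)'")


def extract_image_url(page_source: str) -> Optional[str]:
    """Return the URL stored under 'image', or None if it is not present."""
    match = IMAGE_FIELD.search(page_source)
    return match.group(1) if match else None


# Made-up script fragment built around the line from the question.
source = (
    "var item = {\n"
    "  'title':'something',\n"
    "  'image':'http://ut5.example.com/t/231/3_b_643435.jpg',\n"
    "  'other':'stuff'\n"
    "};"
)
print(extract_image_url(source))
```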
I'm using HTML5's History API (through History.js) to dynamically rewrite URLs. I would like them to be the following format:
http://www.example.com/example/article/page
where both 'article' and 'page' are set by the History API.
However, this doesn't quite work as expected, as the pushState or replaceState provided by History.js only seem to work on the part of the URL which is after the last slash.
A quick example: if I'm at http://www.example.com/example/ and do pushState('Article-Title/1'), the url becomes http://www.example.com/example/Article-Title/1. Now I can change the page number with pushState('2'), but I have no way of changing the Article-Title part, which is what I'm after. window.location.href, which is used in Ben Lupton's example, can change the URL, but it also causes a "hard" redirect.
I suppose that this constraint is in place in order to prevent XSS; however, it bothers me greatly. Is there a reasonable way around it?
Found it: the URL I'm pushing has to start with a forward slash. That's all it takes.
You can also use relative URLs: ../../Another-Section/1
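The behaviour described above is ordinary relative URL resolution, which can be illustrated with Python's urljoin. This is only a sketch of how the pushed value is resolved against the current address, not of History.js itself, and the Other-Article path is a made-up example.

```python
from urllib.parse import urljoin

# The address bar after the first pushState in the question.
current = "http://www.example.com/example/Article-Title/1"


def resolve(pushed: str, base: str = current) -> str:
    """Resolve a pushed URL against the current one, as a relative reference."""
    return urljoin(base, pushed)


# No leading slash: only the part after the last slash is replaced.
print(resolve("2"))
# Leading slash: the whole path is replaced.
print(resolve("/example/Other-Article/1"))
# Relative URL with ../ segments: climbs up before descending again.
print(resolve("../../Another-Section/1"))
```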