I was wondering (since a quick Google search didn't turn anything up) whether it is possible, and if so how, to search directly in an HTML file and choose whether or not to ignore the tags.
To explain a bit further: we wrote a crawler, and the crawler returns the HTML of each page. If I want to search the crawled content, do I need two separate fields, one with HTML and one without, or can I keep a single field with HTML and search it with or without the tags?
Thanks in advance.
If I understand you correctly, all you need is to build the search index without HTML tags?
We solved that problem this way:
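from haystack import indexes
class PostIndex(indexes.SearchIndex, indexes.Indexable):
    # document=True marks the primary search field; use_template=True builds it from a template
    text = indexes.CharField(model_attr='text', use_template=True, document=True)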
and in the template (search/indexes/blogs/post_test.html) we just used the striptags filter:
{{ object.content|striptags }}
After that you need to run build_schema and rebuild_index. Search then works correctly, without the tags.
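If you would rather store a second, tag-free copy of the crawled page than strip tags at indexing time, here is a minimal sketch using only Python's standard library. The names html_to_text and _TextExtractor are made up for illustration, and skipping script and style content is my assumption about what you would want to search. Django also ships a strip_tags helper in django.utils.html that does a similar job.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html_text):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

You could then keep the raw HTML in one field and the output of html_to_text in another, and search whichever one you need.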
I'm currently working on a blog using Django and SQLite for the back end. In my setup, I stored my articles in the database in this sort of form:
<p> <strong>The Time/Money Tradeoff</strong> </p> <p> As we flesh out High Life, Low Price, you will notice that sometimes we will suggest deals and solutions that may cost slightly more than their alternatives. We won’t always suggest the cheapest laptop...
On the page itself, I have this code for where I use the session data:
<p>{{request.session.article.0.blog_article}}</p>
I had assumed that the web browser would be able to read the HTML tags. However, the article is printed on the page in that form, with the <p> tags and the like visible. I think this is because it's stored as a Unicode string in the database and is put onto the page between two quotation marks. If I paste the HTML code onto the page, the format looks the way I want, but I want it to be an automated process (tell Django which article ID I want, it plugs the elements of the page into the template, and everything looks great).
How can I get the stored article in a form where the page can see the HTML tags?
By default, Django autoescapes all strings in a template, so HTML stored in a variable shows up as literal markup when rendered. You can use the safe filter to turn this off:
<p>{{request.session.article.0.blog_article|safe}}</p>
If I want to show only certain tags (say, in a forum post) using Django template variables, how would I do that?
Say the content of my post is:
<div><b>Hell</div>o <i>everyone</i></b>
I don't want to show the div tags, but the b and i tags are fine. I know you can use |safe and autoescape, but those seem to apply to all HTML. Is there a better way to do this?
You could use a Custom Django Filter with a Regular Expression that does this.
Have a look here: http://djangosnippets.org/snippets/60/ and replace the regular expression with one that removes the HTML tags you don't want.
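A rough sketch of that regex idea in plain Python follows; it is not the code from the linked snippet. The function name keep_allowed_tags and the default whitelist are hypothetical. Allowed tags are rebuilt without attributes so that things like onclick do not slip through, and every other tag is dropped. A regex is not a full HTML parser, so for untrusted input a dedicated sanitizing library is the safer choice.

```python
import re

# Matches an opening or closing tag and captures the optional slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def keep_allowed_tags(text, allowed=("b", "i")):
    """Drop every tag except the allowed ones; allowed tags lose their attributes."""
    allowed_names = {name.lower() for name in allowed}

    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        if name in allowed_names:
            return "<%s%s>" % (slash, name)  # rebuild a bare tag
        return ""  # remove disallowed tags, keep their inner text

    return TAG_RE.sub(replace, text)
```

With the post from the question, keep_allowed_tags('<div><b>Hell</div>o <i>everyone</i></b>') gives '<b>Hello <i>everyone</i></b>'. To use it in templates you would register it as a filter on a django.template.Library instance and mark the result safe only if you trust the filtering.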
I have a text field in my MySQL table. I want to convert all newlines to <br> or some similar formatting for the template. Is there anything built into the framework for this? I tried reading the following page:
https://docs.djangoproject.com/en/dev/ref/templates/builtins/?from=olddocs
But that page doesn't seem to cover this. Is there other documentation I can refer to? Thanks!
It sounds like you need the linebreaksbr template filter.
Normally, you would use it in the template:
{{ instance.fieldname|linebreaksbr }}
However, it's possible to import it and use it in your view as well:
from django.template.defaultfilters import linebreaksbr
text_with_br = linebreaksbr(instance.fieldname)
The advantage of using linebreaksbr instead of writing your own snippet is that linebreaksbr takes care of autoescaping for you.
I decided to do it the following way: "<br />".join(word.split("\n")). Not sure if that's the best way...still digging into it. It certainly works though!
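One caveat with the plain join: it does not escape the text, so any < or & already in the field ends up in the page as markup. Here is a small sketch of a hand-rolled version that escapes each line first, using only the standard library. The name newlines_to_br is made up, and this only approximates what linebreaksbr does for you.

```python
from html import escape


def newlines_to_br(text):
    """Escape the text, then join its lines with <br> tags."""
    # Normalize Windows and old Mac line endings before splitting.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return "<br>".join(escape(line) for line in normalized.split("\n"))
```

The result is already escaped, so in a template it would need to be marked safe to avoid being escaped a second time.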
It may be overkill depending on your use case, but I use django-tinymce in my admin area to add rich text editing to CharFields that will be used in templates. This saves an HTML string in the database, and in your template you can just use:
{{ model.field|safe }}
to output it without losing the html formatting. It's quite easy to set up.
I am scraping a website and am trying to pull certain elements out of the HTML. The sites I am scraping have script tags with a bunch of info in them, but there is only one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
There is other content above and below it. The value is different for each page source, except for the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need regular expressions, as the URLs are dynamic.
The "gsub" method does something similar to what I want, with its ability to use a /regex/. But I don't want to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
Going by your comments, I guess this is what you're looking for:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
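For comparison, here is the same idea sketched in Python. Note that /http.+/ is greedy and runs to the end of the line, so it would also pick up the trailing quote and comma. The pattern below stops at the closing quote instead. The names IMAGE_URL_RE and find_image_url are hypothetical, and the pattern assumes the key is literally image with a quoted http or https URL after it, as in the question.

```python
import re

# 'image' key, a colon, then a quoted URL; the URL itself is captured in group 1.
IMAGE_URL_RE = re.compile(r"""['"]image['"]\s*:\s*['"](https?://[^'"]+)['"]""")


def find_image_url(source):
    """Return the first image URL found in the page source, or None."""
    match = IMAGE_URL_RE.search(source)
    return match.group(1) if match else None
```

Since only the captured group is returned, nothing in the source is replaced or modified.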
I have a file with one URL per line. I need to extract the "keywords" from each page's meta tags, i.e. if there is a meta tag for "keywords", I want to get its "content" value.
Example: if the web-page has this meta-tag:
<meta name="keywords" content="wikipedia,encyclopedia">
then for that URL I want "wikipedia,encyclopedia" to be extracted.
One approach is to download the web page using "wget" and then parse it using some standard HTML parser.
I was wondering whether there is a better way to do this without downloading the entire web page.
No -- you have to download the whole page, or interrupt the download after receiving some amount of data. The latter is even worse and much more complicated, since AFAIK it cannot be done with wget and you would have to code your own downloader.
If you're comfortable with some PHP, you should be able to put something together pretty easily by wrapping a loop around QueryPath.
Swiping an example from the docs, this:
require 'QueryPath/QueryPath.php';
$url = 'http://example.com';
print qp($url, 'title')->text();
...will go out and get the document at example.com, extract the text of the title tag and output it.
It'd only take a little more work to make that look for meta keywords tags and extract the content attribute, especially if you're already familiar with jQuery. (It's a bit of a simplification, but a large chunk of QueryPath is more or less implementing a "server-side jQuery.")
If you pursue this programmatic method and have further questions, they should probably go on the main Stack Overflow site where there's also an active querypath tag.
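If PHP is not an option, the parsing step can also be sketched with Python's standard library. This is only an illustration: MetaKeywordsParser and extract_keywords are made-up names, and the function takes HTML that has already been downloaded (with wget or anything else) rather than fetching the URL itself.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Remembers the content attribute of the first <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.keywords is not None:
            return
        attr_map = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr_map.get("name") or "").lower() == "keywords":
            self.keywords = attr_map.get("content")


def extract_keywords(html_text):
    """Return the keywords string from the HTML, or None if there is none."""
    parser = MetaKeywordsParser()
    parser.feed(html_text)
    return parser.keywords
```

Wrapping this in a loop over the lines of the URL file, with a download step of your choice, would give one keywords string per page.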
Here is another option:
http://simplehtmldom.sourceforge.net
I haven't tried it yet!