Should I create a slug on the fly or store in DB?

A slug is the part of a URL that describes or titles a page. It is usually keyword-rich for that page, which improves SEO. For example, in the URL of the question "PHP/JS - Create thumbnails on the fly or store as files", the last section, "php-js-create-thumbnails-on-the-fly-or-store-as-files", is the slug.
Currently I am storing the slug for each page with the page's record in the DB. The slug is generated from the Title field when the page is generated and stored with the page. However, I'm considering generating the slug on the fly in case I want to change it. I'm trying to work out which is better and what others have done.
So far I've come up with these pro points for each one:
Store slug:
- "Faster": the processor doesn't need to generate it each time (it is generated once)
Generate on the fly:
- Flexible (can adjust slug algorithm and don't need to regen for whole table).
- Uses less space in DB
- Less data transferred from DB to App
What else have I missed and how do/would you do it?
EDIT:
I'd just like to clarify what looks like a misunderstanding in the answers. The slug has no effect on landing on the correct page. To understand this just chop off or mangle any part of the slug on this site. e.g.:
PHP/JS - Create thumbnails on the fly or store as files (three links to the same question, each with a differently truncated or mangled slug)
will all take you to the same page. The slug is never indexed.
You wouldn't need to save the old slugs. If a visitor lands on a page with an "old slug", you can detect that and do a 301 redirect to the correctly "slugged" URL. In the examples above, if Stack Overflow implemented it, then when you landed on any of the links with truncated slugs, it would compare the slug in the URL to the one generated by the current slug algorithm and, if they differ, do a 301 redirect to the same page with the new slug.
Remember that all internally generated links would immediately be using the new algorithm and only links from outside pointing in would be using the old slug.
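A minimal sketch of the compare-and-redirect idea described above, in Python. The slug algorithm, the `/questions/` route, and the function names are assumptions made for illustration; they are not Stack Overflow's actual implementation.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Hypothetical slug algorithm: fold to ASCII, lowercase, hyphenate."""
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def resolve(page_id: int, requested_slug: str, title: str):
    """Return (status, path): 200 if the slug is current, otherwise 301."""
    current = slugify(title)
    path = f"/questions/{page_id}/{current}"  # assumed route layout
    if requested_slug == current:
        return 200, path
    # Old, truncated, or mangled slug: redirect to the current one.
    return 301, path
```

The page is always looked up by `page_id`; the slug only decides whether a redirect is issued, so no old slugs need to be stored for this to work.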

Wouldn't changing the slugs for existing pages be a really bad idea? It would break all your inlinks for a start.
Edit, following Guy's clarification in the question: You still need to take old slugs into account. For instance: if you change your slug algorithm Google could start to see multiple versions of each page, and you could suffer a duplicate content penalty, or at best end up sharing PR and SERPs between multiple versions of the same page. To avoid that, you'd need a canonical version of the page that any non-canonical slugs redirected to - and hence you'd need the canonical slug in the database anyway.

You might need to take another thing into consideration: what if you want users (or yourself) to be able to define their own slugs? Maybe the algorithm isn't always sufficient.
If so, you more or less need to store the slug in the database anyhow.
If not, I don't think it matters much. You could generate slugs on the fly, but if you are unsure whether you will want to change them, keep them in the database. In my eyes there is no real performance issue with either method (unless the on-the-fly generation is very slow or something like that).
Choose the one which is the most flexible.

For slug generation I don't think that generation time should be an issue, unless your slug algorithm is insanely complicated! Similarly, storage space won't be an issue.
I would store the slug in the database for the simple reason that slugs usually form part of a permalink and once a permalink is out in the wild it should be considered immutable. Having the ability to change a slug for published data seems like a bad idea.

The best way to handle slugs is to store only the speaking part of the slug in the database and keep the routing part, with the unique identifier, for dynamic generation. Otherwise (if you store the whole URL or URI in the database), it might become a massive task to rewrite all the slugs in the database if you change your mind about how to name them.
Let's take this question's SO URL as an example:
/questions/807195/should-i-create-a-slug-on-the-fly-or-store-in-db
it's:
/route/unique-ID/the-speaking-part-thats-not-so-important
The dynamic part is obviously:
/route/unique-ID/
And the one I would store in the database is the speaking part:
the-speaking-part-thats-not-so-important
This allows you to change your mind about the route's name at any time and do the proper redirects without having to look inside the database first, and you're not forced to make DB changes. The unique ID is always the unique ID of your database record, so you can identify it correctly, and you of course know what your routes are.
And don't forget to set the canonical tag. If you take a look at this page's source, it's there:
<link rel="canonical" href="http://stackoverflow.com/questions/807195/should-i-create-a-slug-on-the-fly-or-store-in-db" />
This allows search engines to identify the correct page link and ignore others in case you have duplicate content.
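As a rough Python sketch of this split: the route name lives in code, the speaking part comes from the database, and lookups use only the unique ID. The names `ROUTE`, `build_url`, and `parse_id` are hypothetical.

```python
ROUTE = "questions"  # assumed route name; rename it here, not in the database


def build_url(record_id: int, stored_slug: str) -> str:
    """Combine the dynamic part (/route/unique-ID/) with the stored slug."""
    return f"/{ROUTE}/{record_id}/{stored_slug}"


def parse_id(path: str) -> int:
    """Extract the unique ID; the speaking part is ignored for the lookup."""
    parts = path.strip("/").split("/")
    return int(parts[1])
```

Because only the speaking part is stored, renaming the route means changing `ROUTE` and adding a redirect, with no rewrite of database rows.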

Related

Best practices when storing multimedia posts SQL DB

I have searched StackOverflow for an answer to this question, and I've been surprised to find very little information on what seems to be a very common task.
Let's say I have an app that allows users to make posts. These posts can contain text, of course, but I also want the users to be able to insert images, and possibly videos.
So here's the dilemma. The first idea that comes to mind for storing these posts would be making a table like this:
CREATE TABLE posts(id INTEGER PRIMARY KEY AUTO_INCREMENT, owner VARCHAR(36) NOT NULL, message TEXT, _timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
id is an identifier for the post itself.
owner is an identifier for the person who created the post.
message contains the message, as text.
_timestamp represents the time created.
However, since SQL wasn't really made for storing images and other files, the images are being stored off-database. For sake of example, let's say they're stored using a product similar to Google Cloud Storage.
So, the question is, how should the message be formatted in such a way that they contain data (for example, a link) that would point to the images, without having to do too much work on the frontend code? (And without letting the user know that they're doing anything other than inserting an image).
From experience with GitHub and StackOverflow, Markdown is obviously nice, but not as user-friendly as I'd want, and doesn't work with images exactly the way I want.
I've thought about using HTML to format the message, but that brings up two main problems:
- How should I store HTML in such a way that prevents XSS (cross-site scripting)? Should I just escape everything in such a way that it can still be read as HTML on the frontend?
- Let's say this app is a mobile app. This means I would either have to make my own HTML parser or find an existing library for it.
So what is the best practice for this?
I see this type of functionality all the time, so what are those people (such as Facebook, Google, etc.) using?
Not only have I encountered this problem, but I feel like there should be a good answer for this on StackOverflow for others who encounter this problem.
Specifically, I want to know whether HTML is a good option, or if I should consider something else. As of right now, I'm planning to use plain HTML and make public URIs for Cloud Storage objects.
Without going into a specific implementation, I would say you never want to insert the image/video data into the post itself.
These should always be either an attachment or a link.
So either you let the user insert links into the post, or you let them add attachments, which are then uploaded to the server, and a link to each one is placed into the post.
Let's say you have a situation where a user drops image/video/audio/whatever data into the post. In that case you would fire an event that uploads the data to your storage and places the link into the post when it's done. That's what happens when you copy and paste (Ctrl-C, Ctrl-V) an image into a GitHub message, for example.
Regarding XSS, you should strip any JavaScript and anything else you don't want from the inserted data, and you should be fine. There are many libraries that can do this for you.

Asp.NET Core 2 Images

I have a couple of questions about images, since I don't know what is better for my purposes. This might also be helpful for other people, because I couldn't find this info in other questions.
Although this is an ASP.NET Core 2.0 application, the first question is a general question about images.
QUESTION 1
When I have images that I want to load every time, I usually add a query string so that browsers like Chrome or IE don't use the cached image they have. In my case I add the time ticks to the URL of the image; this way the image is loaded every time, since the query string is always different:
filePath += "?" + DateTime.Now.Ticks;
But in my case I have a panel where the administrators of the page can change a lot of images. The problem is that when they change those images, if there is no query string, the users are going to see an old image stored in their browser cache.
The question is: if I add the query string to many images, isn't that bad for performance? Is there any other solution for this?
QUESTION 2
I also have photos of the users and other images stored in the site. When I show an image, all the visitors of the site can see the path (for example: www.site.com/user_files/user_001/photo001.jpg).
Is there a way to hide those paths or transform them into something else in ASP.NET Core 2.0?
Thanks a lot.
Using something like ticks will get the job done, but in a very naive way. You're going to put more stress both on your server and the clients, since effectively the image will have to be refetched every single time, regardless of whether it has changed or not. If you will have any mobile users, the situation is far worse for them, as they'll be forced to redownload all these resources over and over, usually over limited (and costly) data plans.
A far better approach is to use a cryptographic digest, often called a "hash". Essentially, the same data hashed with the same algorithm will always return the same hash. It's usually used to detect tampering with transmitted data, but since each message will (generally) have a unique hash and that hash will be the same each time for the same piece of data, you can also use this to generate a cache-busting query string that only changes when the image data itself changes.
Now, to be thorough, there's technically no guarantee that two messages won't result in the same hash. Instances where that occurs are called "collisions" and they can happen. However, if you use a sufficiently complex algorithm like SHA256, the likelihood of collisions is greatly reduced. Regardless, it should not be a real issue for concern for this particular use case of cache-busting images.
Simplistically, to create the hash, you simply do something like:
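using System;
using System.Security.Cryptography;

// imageBytes holds the raw contents of the image file
string hash;
using (var sha256 = SHA256.Create())
{
    // Base64-encode the 32-byte SHA-256 digest
    hash = Convert.ToBase64String(sha256.ComputeHash(imageBytes));
}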
The value of hash then will be something like z1JZs/EwmDGW97RuXtRDjlt277kH+11EEBHtkbVsUhE=.
However, ASP.NET Core has an ImageTagHelper built-in that will handle this for you. Essentially, you just need to do:
<img src="/path/to/image.jpg" asp-append-version="true" />
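For readers outside ASP.NET, the content-hash idea can be sketched in Python as follows. The function name and the `v` query parameter are assumptions made for illustration. URL-safe Base64 is used here so the token needs no further escaping in a query string.

```python
import base64
import hashlib


def versioned_url(path: str, image_bytes: bytes) -> str:
    """Append a token that changes only when the image content changes."""
    digest = hashlib.sha256(image_bytes).digest()
    # URL-safe alphabet avoids '+' and '/'; trailing '=' padding is dropped.
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{path}?v={token}"
```

Unlike the ticks approach, the URL stays the same between requests, so browsers keep using their cached copy until an administrator actually replaces the image.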
As for your second question, about hiding or obfuscating the image path, that's not strictly possible, but can be worked around. The URL you use to reference the image uniquely identifies that resource. If you change it in any way, it's effectively not the same resource any more, and thus, would not locate the actual image you wanted to display. So, in a strict sense, no, you cannot change the URL. However, you can proxy the request through a different URL, effectively obfuscating the URL for the original image.
Simply, you'd just have an action on some controller that takes an image path (as part of the query string), loads that from the filesystem and returns it as a response. Care should be taken to limit the scope of files that can be returned like this, both by directory (only allow your image directory, for example, not C:\Windows\, etc.) and by file type (only allow images to be returned, not random text files, config files, etc.). That portion is straightforward enough, and you can find many examples online if you need them.
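The directory and file-type checks might look like the following Python sketch. It is not ASP.NET code; `IMAGE_ROOT`, `ALLOWED_SUFFIXES`, and `safe_image_path` are hypothetical names, and the root directory is an assumed location.

```python
from pathlib import Path
from typing import Optional

IMAGE_ROOT = Path("/var/www/user_files")  # assumed image directory
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def safe_image_path(requested: str, root: Path = IMAGE_ROOT) -> Optional[Path]:
    """Return the resolved file path, or None if the request is not allowed."""
    root = root.resolve()
    # Resolving collapses ".." segments before the containment check.
    candidate = (root / requested).resolve()
    if root not in candidate.parents:
        return None  # escapes the image directory
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        return None  # not an allowed image type
    return candidate
```

Resolving the path first matters: a check on the raw string would let a request such as `../secret.jpg` slip outside the image directory.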
Ultimately, this doesn't really solve anything, though, because now your image path is simply in the query string instead. However, now that you've set this part up, you can encrypt that part of the query string using the Data Protection API. There's some basic getting started information available in the docs. Essentially, you're just going to encrypt the image path when creating the URL, and then in your action that returns the image, you decrypt the path first before running the rest of the code. For the encryption part, you can create a tag helper to do this for you without having to have a ton of logic in your views.

Random Article button

I'd like to create a button on a menu bar that can generate a link to a random article from my blog posts (much like Wikipedia has). It's for a client, and they'd like to have this functionality on the site. I'm not familiar with PHP so I'd like to find a way around that, especially since I don't have access to the root user on my server host's mySQL installation (if this is relevant).
I had a theoretical solution: have a .txt or .xml file containing a list of all the URLs to each of the posts, with a "key" assigned to each of them. Then, when the user clicks the random article button, the current time (ex. 1:45) is hashed and mapped to a specific URL. I am fairly new to Drupal, however, I was wondering if there was some way to have the random article button use a .c file to execute these steps. The site is being hosted on a server that uses Apache 2, and I looked through some modules that were implemented in C code. I'm pretty new to all of this (although proficient in C), and spent many fruitless hours searching for solutions.
In a pure Drupal fashion (I don't know if you are interested in this kind of solution), you could create a view (as a block) that retrieves blog posts, uses a random sort criterion and limits the results to 1 item. Then configure this view to display fields, add only one field, the post title, and check "link to content" in this field's settings window. You'll get one random blog post title, rendered as a link to that blog post.
Finally, in Structure->Block, assign your new block to a region to see it.
It's a pure Drupal / Views / no-code-just-clicks :) way, but it will be far more maintainable and easier to set up than introducing C for such a simple feature.
Views module
Let me know if you try this and have problems configuring your view or anything else.
Good luck

Eliminating extra DBQueries in django by storing the absolute_url

Using Django blogging, I have a template that looks like:
<a href="{{ post.get_absolute_url }}">{{ post.title }}</a>
This looks innocuous enough, but it ends up generating yet another lookup of the user of the blog post, something that I (in most cases) already know. However that isn't my point.
The URL looks like:
http://localhost:8000/blog/post/mark/2010/08/Aspect-Oriented-Prog/
and part of the reason it looks like that is so that the URL is somewhat self explanatory, and won't change with time.
What I'm very curious about is what are the possible problems with storing this URL in the database along with the blog post? If it isn't supposed to change, and I store it there, then fetching the blog, gives me the absolute_url without having to fetch the user and rebuild the URL.
I'm thinking the part I store does not include /blog/post, but includes the post specific info so that I can do:
{% url blog-post blog %} and have it paste the pieces together.
Just for the record, yes, I could do select_related, except in my case, I'm actually coming at this backward from an activity log where I'm getting the object like so:
def get_edited_object(self):
    """Returns the edited object represented by this log entry"""
    return self.content_type.get_object_for_this_type(pk=self.object_id)
and I haven't figured out how to add the select related to this, but wonder if I need to, given the fact I can add the absolute_url to the object itself.
I realize this is somewhat subjective, but what I really need is someone to play devil's advocate for why I shouldn't do this, because it seems so simple and straightforward that I don't see any reason not to.
I believe this is a case of normalizing versus denormalizing. The normalizing school would argue that if you have the necessary information to create the URL available in the database then you should be computing it rather than storing and retrieving. Denormalizing would let you get away without computing it each time.
I'll take a stab at playing devil's advocate. I have two arguments.
- If you decide to change the scheme of your URLs for any reason, say migrating to another top-level domain or changing any element in the path (say "/b" instead of "/blog"), then you'd have an unnecessary data migration on your hands.
- Users are allowed to edit the blog posts. If a user changes the title of her blog post, then the slug will have to be generated again, which in turn means that the URL would have to be computed and stored again.
If the user handle can change (I know this is unlikely, but I have seen sites that let you do this), then you'd have to compute and store the URL again.

Markdown or HTML

I have a requirement for users to create, modify and delete their own articles. I plan on using the WMD editor that SO uses to create the articles.
From what I can gather SO stores the markdown and the HTML. Why does it do this - what is the benefit?
I can't decide whether to store the markdown, HTML or both. If I store both which one do I retrieve and convert to display to the user.
UPDATE:
OK, I think from the answers so far, I should be storing both the markdown and HTML. That seems cool. I have also been reading a blog post from Jeff regarding XSS exploits. Because the WMD editor allows you to input any HTML, this could cause me some headaches.
The blog post in question is here. I am guessing that I will have to follow the same approach as SO - and sanitize the input on the server side.
Is the sanitize code that SO uses available as Open Source or will I have to start this from scratch?
Any help would be much appreciated.
Thanks
Storing both is extremely helpful in terms of performance and compatibility (and eventually also social control).
If you store only Markdown (or whatever non-HTML markup), then there's a performance cost in parsing it into HTML every time. This cost is not always negligible.
If you store only HTML, you risk bugs silently creeping into the generated HTML, which leads to a lot of maintenance and bug-fixing headaches. You also lose social control, because you no longer know what the user actually entered. As an admin, for example, you would want to know which users are trying to perform XSS with <script> and so on. Also, the end user won't be able to edit the data in Markdown format; you would need to convert it back from HTML.
To update the HTML whenever the Markdown version changes, add one extra field recording the Markdown version used to generate the HTML output. If, when you retrieve the row, that version differs from the one currently in use on the server, re-parse the data using the new version and update the row in the DB. This is only a one-time extra cost.
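A minimal Python sketch of the version-field approach. The `Article` fields, `RENDERER_VERSION`, and the stand-in `render` function are all assumptions for illustration; a real application would call its actual Markdown library and persist the row.

```python
import html
from dataclasses import dataclass

RENDERER_VERSION = 2  # assumed: bumped whenever the Markdown converter changes


def render(markdown_source: str) -> str:
    """Stand-in for a real Markdown renderer: escapes and wraps the text."""
    return f"<p>{html.escape(markdown_source)}</p>"


@dataclass
class Article:
    markdown: str            # what the user typed; used for the edit view
    rendered_html: str = ""  # cached output; used for the display view
    rendered_with: int = 0   # renderer version that produced rendered_html


def get_html(article: Article) -> str:
    """Return cached HTML, re-rendering once if the renderer version changed."""
    if article.rendered_with != RENDERER_VERSION:
        article.rendered_html = render(article.markdown)
        article.rendered_with = RENDERER_VERSION
        # A real application would save the updated row to the DB here.
    return article.rendered_html
```

After the first retrieval under a new version, later reads return the cached HTML without parsing again.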
By storing both you only have to process the markdown once (when it is posted). You would then retrieve the HTML so that you can load your pages faster.
If you only stored one, you'd forever have to recreate the other for either the display view or the edit view.