I have searched StackOverflow for an answer to this question, and I've been surprised to find very little information on what seems to be a very common task.
Let's say I have an app that allows users to make posts. These posts can contain text, of course, but I also want the users to be able to insert images, and possibly videos.
So here's the dilemma. The first idea that comes to mind for storing these posts would be making a table like this:
CREATE TABLE posts (id INTEGER PRIMARY KEY AUTO_INCREMENT, owner VARCHAR(36) NOT NULL, message TEXT, _timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
id is an identifier for the post itself.
owner is an identifier for the person who created the post.
message contains the message, as text.
_timestamp represents the time created.
However, since SQL wasn't really made for storing images and other files, the images are being stored off-database. For sake of example, let's say they're stored using a product similar to Google Cloud Storage.
So, the question is, how should the message be formatted so that it contains data (for example, a link) that points to the images, without having to do too much work in the frontend code? (And without letting the user know that they're doing anything other than inserting an image.)
From experience with GitHub and StackOverflow, Markdown is obviously nice, but not as user-friendly as I'd want, and doesn't work with images exactly the way I want.
I've thought about using HTML to format the message, but that brings up two main problems:
How should I store HTML in such a way that prevents XSS (Cross-site Scripting)? Should I just escape everything in such a way that it can still be read as HTML on the frontend?
Let's say this app is a mobile app. This means I would either have to make my own HTML parser or find an existing library for it.
So what is the best practice for this?
I see this type of functionality all the time, so what are those people (such as Facebook, Google, etc.) using?
Not only have I encountered this problem, but I feel like there should be a good answer for this on StackOverflow for others who encounter this problem.
Specifically, I want to know whether HTML is a good option, or if I should consider something else. As of right now, I'm planning to use plain HTML and make public URIs for Cloud Storage objects.
Without getting into a specific implementation, I would say you never want to insert the image/video data itself into the post.
These should always be either an attachment or a link.
So either you let the user insert links into the post, or you let them add attachments, which are then uploaded to the server, and a link to them is placed into the post.
Let's say you have a situation where a user drops the image/video/audio/whatever data into the post. In that case you would fire an event that uploads the data to your storage and places the link into the post when it's done. That's what happens when you copy and paste (Ctrl+C, Ctrl+V) an image into a GitHub message, for example.
Regarding XSS, you should strip any JavaScript and anything else you don't want from the inserted data, and you should be fine. There are many libraries that can do this for you.
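To make the "strip what you don't like" step concrete, here is a minimal sketch of an allowlist-based cleaner using only the Python standard library. The tag list, the attribute list, and the host storage.example.com are assumptions made up for illustration, and this is not a vetted sanitizer; for real traffic an established sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical allowlist: tag name -> attributes that may be kept.
ALLOWED_TAGS = {"p": set(), "b": set(), "i": set(), "br": set(), "img": {"src", "alt"}}
VOID_TAGS = {"br", "img"}
# Assumed host of the storage bucket; replace with your own.
ALLOWED_IMAGE_HOSTS = {"storage.example.com"}


class PostSanitizer(HTMLParser):
    """Rebuilds a post, keeping only allowlisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue  # drops onclick, style, and anything else not listed
            if name == "src":
                url = urlparse(value)
                if url.scheme != "https" or url.hostname not in ALLOWED_IMAGE_HOSTS:
                    return  # drop the whole <img> if its source is not trusted
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(post_html: str) -> str:
    parser = PostSanitizer()
    parser.feed(post_html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string is what would be stored in the message column; the image itself stays in external storage and only its link appears in the post.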
I am doing some volunteer work for a charity that is using a couple online systems that store their donors and related data. I would like to find a way to store a URL as a custom field in such a way that they can put corresponding links between donors in one of the systems in order to quickly find the same donor in another system. The only built-in method in the products being used is to store a single value in a field labeled "website" which is originally intended to store a value for any website associated with the donor. I would like to avoid using this field if possible and instead create a custom field.
However, the rub is that the custom fields only have a handful of options (clear text, date, currency, etc.). There is no option to store a URL or something like rich text. I've thought of a couple of less optimal ways to make the values stored in those fields clickable (a browser plugin or a proxy); however, both of those have obvious drawbacks that I would like to avoid.
What I am wondering, and hoping someone has a possible answer for, is whether there are any ways of storing a value in a clear text field that might disrupt or escape the underlying HTML encoding such that the displayed link is clickable. I already control the values being put into these fields (users cannot enter their own values, they are essentially read-only), so security isn't much of a concern.
I have very limited access or influence to have any system-level changes made; however, I would like to make this possible as it would help them a great deal (their users are all volunteers with limited time and education). I've tried a few tricks but haven't found anything that doesn't get converted to Unicode or escaped (it could be that it's completely controlled for at output, I simply don't know).
My current attempts have been limited to using the built-in form submission. I may explore their import and/or API methods on the theory that they might allow better low-level access to storing the actual values in the system; however, I'm still not certain what to try other than adding .
I have also tried an inline script to add the corresponding tab; however, that seems to break the form submission method (perhaps it'll work via CSV import or via the API).
Does anyone have suggestions for other things I could try before I go any further? I'm a bit of a novice and feel like there may be something else obvious I haven't tried.
Hello Stack Overflow Community!
I am making a directory of many thousands of custom mods for a game using HTML tables. When I started this project, I thought one HTML page would be slow, but adequate for the ~4k files I was expecting. As I progressed, I realized there are tens of thousands of files I need to have in these tables, and I need to let the user search through them to find what they are missing to load up a new scenario. Each entry has about 20 text entries and a small image (~3KB). I only need to be able to search through one column.
I'm thinking of dividing the tables across several pages on my website to help loading speeds and improve overall organization. But then a user would have to navigate to each page, and perform a search there. This could take a while and be very cumbersome.
I'm not great at website programming. Can someone advise a way to allow the user to search through several web pages and tables from one location? Ideally this would jump to the location in the table on the new webpage, or maybe highlight the entry like the browser's search function does.
You can see my current setup here : https://www.loco-dat-directory.site/
Hopefully someone can point me in the right direction, as I'm quite confused now :-)
These would be my steps:
Copy all the info into an Excel spreadsheet, convert that to JSON, and load that as a JavaScript array (myarray). Then make an input field, and on click run an if statement such as if input == myarray[0].propertyName, looping over the entries of the array.
If you want something more than an exact match, you'd need a library such as https://lodash.com/ in your project.
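The steps above boil down to "load the exported records, then filter on one column". Here is a small sketch of that logic, written in Python for brevity rather than in the JavaScript the page would actually use; the records and the column names (name, pack) are invented for illustration.

```python
import json

# Hypothetical data: in practice this JSON would be exported from the spreadsheet.
MODS_JSON = """[
  {"name": "Steam Loco A", "pack": "Pack 1"},
  {"name": "Diesel Loco B", "pack": "Pack 2"},
  {"name": "Steam Wagon C", "pack": "Pack 2"}
]"""

mods = json.loads(MODS_JSON)


def exact_match(records, column, term):
    """Return the records whose column equals the search term exactly."""
    return [r for r in records if r[column] == term]


def contains_match(records, column, term):
    """Return the records whose column contains the term, ignoring case."""
    needle = term.casefold()
    return [r for r in records if needle in r[column].casefold()]
```

A case-insensitive substring match like contains_match is often enough for a single searchable column; a helper library only becomes interesting for fuzzier matching.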
Hacky Solution
There is a browser tool, called TableCapture, to capture data from html tables and load into excel/spreadsheets - where you are basically deferring to spreadsheet software to manage the searching.
You would have to see if:
This type of tool would solve your problem - maybe you can pull each HTML page's contents manually, then merge these pages into a document with multiple "sheets", and then let people download the "spreadsheet" from your website.
If you do not take on the labor above and just tell other people to do it, then you'd have to see if you can teach the people how to perform the search and do this method on their own. eg. "download this plugin, use it on these pages, search"
Why your question is difficult to answer
The reason why it will be hard for people to answer you in stackoverflow.com (usually code solutions) is that you need a more complicated solution (in my opinion) than hard coded tables and html/css/javascript.
This type of situation is exactly why people use databases and APIs to accept requests ("term": "something") for information and deliver responses ( "results": [...] ).
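As a rough illustration of that request/response shape, here is a sketch that puts the entries in an in-memory SQLite database and answers a {"term": ...} request with a {"results": [...]} response. The table layout and sample rows are assumptions; a real site would put this behind an HTTP endpoint.

```python
import sqlite3


def build_database(rows):
    """Create an in-memory database holding (name, pack) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mods (name TEXT, pack TEXT)")
    conn.executemany("INSERT INTO mods (name, pack) VALUES (?, ?)", rows)
    return conn


def search(conn, request):
    """Answer a request like {"term": "steam"} with {"results": [...]}."""
    term = request["term"]
    # The term is passed as a bound parameter, never pasted into the SQL text.
    cur = conn.execute(
        "SELECT name, pack FROM mods WHERE name LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return {"results": [{"name": n, "pack": p} for n, p in cur.fetchall()]}
```

With this shape the browser only ever downloads the matching rows, not tens of thousands of table entries.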
Thank you everyone for your great advice. I wasn't aware most of these potential solutions existed, and it was good to see how other people were tackling problems of similar scope.
I've decided to go with DataTables for their built-in sorting and filtering : https://datatables.net/
I'm also going to use a javascript array with an input field on the main page to allow users to search for which pack their mod is in. This will lead them to separate pages on my site, each with a unique datatable for a mod pack. Separate pages will load up much quicker than one gigantic page trying to show everything.
I have a couple of questions about images, since I don't know what is better for my purposes. This might also be helpful for other people, because I couldn't find this info in other questions.
Although this is an ASP.NET Core 2.0 application, the first question is a general one about images.
QUESTION 1
When I have images that I want to reload every time, I usually add a query string so that browsers like Chrome or IE don't use the cached image they have. In my case I add the time ticks to the URL of the image; this way the image is loaded every time, since the query string is always different:
filePath += "?" + DateTime.Now.Ticks;
But in my case I have a panel where the administrators of the page can change a lot of images. The problem is that when they change those images, if there is no query string, the users are going to see the old image stored in their browser cache.
The question is: if I add the query string to many images, isn't that bad for performance? Is there any other solution for this?
QUESTION 2
I also have photos of the users and other images stored in the site. When an image is displayed, all the visitors of the site can see its path (for example: www.site.com/user_files/user_001/photo001.jpg).
Is there a way to hide those paths or transform them into something else in ASP.NET Core 2.0?
Thanks a lot.
Using something like ticks will get the job done, but in a very naive way. You're going to put more stress both on your server and the clients, since effectively the image will have to be refetched every single time, regardless of whether it has changed or not. If you will have any mobile users, the situation is far worse for them, as they'll be forced to redownload all these resources over and over, usually over limited (and costly) data plans.
A far better approach is to use a cryptographic digest, often called a "hash". Essentially, the same data hashed with the same algorithm will always return the same hash (note that this is hashing, not encryption: a hash cannot be reversed). It's usually used to detect tampering with transmitted data, but since each message will (generally) have a unique hash and that hash will be the same each time for the same piece of data, you can also use this to generate a cache-busting query string that only changes when the image data itself changes.
Now, to be thorough, there's technically no guarantee that two messages won't result in the same hash. Instances where that occurs are called "collisions" and they can happen. However, if you use a sufficiently complex algorithm like SHA256, the likelihood of collisions is greatly reduced. Regardless, it should not be a real issue for concern for this particular use case of cache-busting images.
Simplistically, to create the hash, you simply do something like:
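using System;
using System.Security.Cryptography;

// imageBytes holds the raw bytes of the image file
string hash;
using (var sha256 = SHA256.Create())
{
    hash = Convert.ToBase64String(sha256.ComputeHash(imageBytes));
}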
The value of hash will then be something like z1JZs/EwmDGW97RuXtRDjlt277kH+11EEBHtkbVsUhE=. Since Base64 output can contain +, / and =, URL-encode it before appending it to a query string.
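For readers outside .NET, the same idea can be sketched in a few lines of Python. The function name versioned_url, the v parameter, and the choice to truncate the hex digest to 16 characters are all assumptions made for illustration; hex output is used here simply because it needs no URL-encoding.

```python
import hashlib


def versioned_url(path: str, image_bytes: bytes) -> str:
    """Append a content-derived version so the URL changes only when the bytes change."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    # A shortened digest keeps the URL compact; collisions remain very unlikely
    # for the purpose of cache-busting.
    return f"{path}?v={digest[:16]}"
```

Unlike a timestamp, this value stays stable across requests, so browsers keep their cached copy until an administrator actually replaces the image.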
However, ASP.NET Core has an ImageTagHelper built-in that will handle this for you. Essentially, you just need to do:
<img src="/path/to/image.jpg" asp-append-version="true" />
As for your second question, about hiding or obfuscating the image path, that's not strictly possible, but can be worked around. The URL you use to reference the image uniquely identifies that resource. If you change it in any way, it's effectively not the same resource any more, and thus, would not locate the actual image you wanted to display. So, in a strict sense, no, you cannot change the URL. However, you can proxy the request through a different URL, effectively obfuscating the URL for the original image.
Simply put, you'd just have an action on some controller that takes an image path (as part of the query string), loads that file from the filesystem and returns it as a response. Care should be taken to limit the scope of files that can be returned like this, both based on directory (only allow your image directory, for example, not C:\Windows\, etc.) and file type (only allow images to be returned, not random text files, config files, etc.). That portion is straightforward enough, and you can find many examples online if you need them.
Ultimately, this doesn't really solve anything, though, because now your image path is simply in the query string instead. However, now that you've set this part up, you can encrypt that part of the query string using the Data Protection API. There's some basic getting started information available in the docs. Essentially, you're just going to encrypt the image path when creating the URL, and then in your action that returns the image, you decrypt the path first before running the rest of the code. For the encryption part, you can create a tag helper to do this for you without having to have a ton of logic in your views.
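The directory and file-type restrictions described above for the proxy action are easy to get wrong, so here is a language-neutral sketch of the check in Python. The function name, the extension list, and the use of PermissionError are assumptions for illustration; an ASP.NET Core action would apply the same logic before returning the file.

```python
from pathlib import Path

# Assumed set of extensions the proxy is willing to serve.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def resolve_image(base_dir: str, requested: str) -> Path:
    """Map a requested relative path to a file inside base_dir, or refuse."""
    base = Path(base_dir).resolve()
    # resolve() collapses "..", so traversal attempts end up outside base.
    candidate = (base / requested).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise PermissionError("path escapes the image directory") from None
    if candidate.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise PermissionError("file type not allowed")
    return candidate
```

Running this check after decrypting the path, rather than trusting the decrypted value, keeps the proxy safe even if a protected token is ever produced for the wrong file.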
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email, is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners listed at all, so in my opinion you should make sure you really stand to gain from this; if not, you'd be better off looking for some other approach.
I have done similar data scraping from time to time, and in Norway we have laws (or should I say "laws") saying that you are not allowed to email people, although you are allowed to email the company. So in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering; it works very well for email.
But, to be a little more productive here: one thing I know is important is that you could start by creating the crawler environment and building the dataset. Have a database for the URLs so you can add more at any time, and start crawling what you have already, so that you do your testing by querying your own data as a 100% copy. This will save you an enormous amount of time compared with live scraping while tweaking.
I built my own search engine some years ago, scraping all .no domains, although I needed only the index file that time. It took over a week just to scrape it down, and I think it was 8GB of data just for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed to be taken care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down if you want to work efficiently with the parsing later.
Good luck, and do post if you get a solution. I do not think it is possible without an algorithm or AI though. People design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler. Then you could just crawl each site and make a profile for each site. You could employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale or available from a web service so it can be scraped.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing from HTML tags, I would just search the whole webpage for names. (There are free databases of first names and last names.) This may also work if you are doing this for companies elsewhere in Europe, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find emails in a webpage, as there is a fixed format of (username)@(domain name) with no space in between. Again, I wouldn't treat it as HTML tags but just as a normal string, so that the email can be found no matter whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is:
Only one email in page?
  Yes -> catch-all email.
  No -> Is name found in that page as well?
    No -> catch-all email (can have more than one catch-all email, maybe for different purpose like info + employment)
    Yes -> Email should be attached to the name found right before it. It is normal that the name should appear before the email.
Then, it should be safe to assume that the name that appears first belongs to a more important member, e.g. the chairman or a partner.
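The decision tree above can be sketched roughly as follows. The email pattern is a deliberately simple approximation, and known_names stands in for whatever a first/last-name database lookup would produce for the page; both are assumptions for illustration.

```python
import re

# Simplified email pattern; real-world addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify_emails(page_text, known_names):
    """Return (email, owner) pairs; owner is None for a catch-all address."""
    emails = [(m.start(), m.group()) for m in EMAIL_RE.finditer(page_text)]
    names = sorted(
        (m.start(), name)
        for name in known_names
        for m in re.finditer(re.escape(name), page_text)
    )
    # Only one email on the page, or no names at all: treat as catch-all.
    if len(emails) <= 1 or not names:
        return [(email, None) for _, email in emails]
    result = []
    for pos, email in emails:
        # Attach the email to the closest name that appears before it.
        preceding = [name for start, name in names if start < pos]
        result.append((email, preceding[-1] if preceding else None))
    return result
```

Because the page is treated as plain text, the same function works whether the address sat in a mailto link or in a paragraph, as long as the text was extracted in document order.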
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler to sort of auto find the information, it will be difficult. However, the high level looks something like this.
For each site you check, look for element patterns. Divs will often have labels, ID's, and classes which will easily let you grab information. Perhaps you find that many divs will have a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics or credentials (Mrs., Mr., Dr., JD, MD, etc.). You could come up with a bank of those, and check against them for any page you end up on.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
What you mentioned in your initial question seems that you would have a lot of benefit with a general purpose Regular Expressions crawler, and you could make improvements on it as you know more about the sites which you interact with.
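As one possible starting point for such a regular-expression pass, here is a sketch of the honorific check mentioned above. The prefix and suffix lists (including CPA, which is my addition for accountant sites) and the simple "Capitalized Capitalized" name pattern are assumptions and will miss many real names.

```python
import re

# Assumed banks of honorifics and credentials; extend as sites require.
HONORIFIC_PREFIXES = ["Mr.", "Mrs.", "Ms.", "Dr."]
CREDENTIAL_SUFFIXES = ["JD", "MD", "CPA"]

# Very rough name shape: two or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"

PREFIX_RE = re.compile(
    r"(?:%s) (%s)" % ("|".join(map(re.escape, HONORIFIC_PREFIXES)), NAME)
)
SUFFIX_RE = re.compile(r"(%s), (?:%s)\b" % (NAME, "|".join(CREDENTIAL_SUFFIXES)))


def candidate_names(text):
    """Collect names that follow an honorific or precede a credential."""
    found = [m.group(1) for m in PREFIX_RE.finditer(text)]
    found += [m.group(1) for m in SUFFIX_RE.finditer(text)]
    return found
```

The output is only a list of candidates; checking them against a list of common names, as suggested above, would be the next filtering step.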
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications reviewed there also work on macOS.
I've recently inherited an ASP.NET MVC 4 code base. One problem I noted was the use of some database ids (ints) in the URLs as well as in HTML form submissions. The code in its present state is exploitable through both URL tinkering and creating custom HTML posts with different numbers.
Now, while I can easily fix the URL problems by using session state or additional auth checks, I'm less sure about the database ids that get embedded into the HTML that the site spits out (i.e. I give them a drop-down to fill). When the ids come back in a post, how can I be sure I put them there as valid options?
What is considered "best practice" in terms of addressing this problem?
While I appreciate I could just "GUID it up" I'm hesitant to do so because I find them a pain in the ass to work with when debugging databases.
Do I have a choice here? Must I GUID to prevent easy guessing of ids or is there some kind of DRY mechanism I can use to validate the usage of ids as they come back into the site?
UPDATE: A commenter asked about the exploits I'm expecting. Let's say I spit out an HTML form with a drop-down list of all the locations one can import "treasure" from. The ids of the locations that the user owns are 1, 2 and 3; these are provided in the HTML. But the user examines the HTML, fiddles with it and decides to put together a POST with the id of 4 selected. 4 is not his location; it's someone else's.
Validate the ID passed against the IDs the user can modify.
It may seem tedious, but this is really the only way to make sure the user has access to what they're trying to modify. Using GUIDs without validation is security by obscurity: sure guessing them is hard, but you can potentially guess them given enough resources.
You can do this at the top of the controller before you do anything else with the posted data. If there's a violation, just throw an exception and have your global exception handler deal with it; you don't need to handle it in a pretty way since you can safely assume that the user is tampering with data in an unsupported way.
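The shape of that check is the same in any framework, so here is a minimal sketch in Python rather than ASP.NET MVC. The ownership table, the function names, and the TamperingError type are all hypothetical; in a real application the allowed ids would come from a database query for the authenticated user.

```python
class TamperingError(Exception):
    """Raised when a posted id is not one the current user may use."""


# Hypothetical ownership data; in practice this is a database lookup.
LOCATIONS_BY_OWNER = {"alice": {1, 2, 3}, "bob": {4}}


def allowed_location_ids(user):
    return LOCATIONS_BY_OWNER.get(user, set())


def import_treasure(user, posted_location_id):
    # int() raises ValueError on non-numeric input, which is also tampering.
    location_id = int(posted_location_id)
    # Validate first, before doing anything else with the posted data.
    if location_id not in allowed_location_ids(user):
        raise TamperingError(f"user {user!r} may not use location {location_id}")
    return f"imported from location {location_id}"
```

The check uses the server's own view of what the user owns, so it does not matter what the drop-down in the HTML was edited to contain.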
The issue you describe is known as "insecure direct object references," and the OWASP group recommends two policies for dealing with this issue:
using session-based indirect object references, and
validating all accesses to object references.
An example of Suggestion #1 would be that instead of having dropdown options 1, 2, and 3, you assign each option a GUID that is associated with the original ID in a map in the user's session. When you get a POST from that user, you check to see what object the given ID was supposed to be tied to. OWASP's ESAPI has some libraries to help with this in various languages.
But in many cases Suggestion #1 is actually counterproductive. For example, in many cases you want to have URLs that can be copy/pasted from one user to another. Suggestion #2 is generally seen as the most foolproof way to address this issue.
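For completeness, Suggestion #1 (a per-session map from GUIDs to real ids) might be sketched like this. The session is modelled as a plain dict and the key name location_refs is made up; this is not the ESAPI API, just an illustration of the idea.

```python
import uuid


def issue_references(session, real_ids):
    """Store a GUID -> real id map in the session and return the GUIDs to render."""
    ref_map = {str(uuid.uuid4()): real_id for real_id in real_ids}
    session["location_refs"] = ref_map
    return list(ref_map)


def resolve_reference(session, token):
    """Translate a posted GUID back to the real id, refusing unknown values."""
    try:
        return session["location_refs"][token]
    except KeyError:
        raise PermissionError("unknown or foreign reference") from None
```

Because the map lives in one user's session, a GUID copied into another user's request simply fails to resolve, which is also why such URLs cannot be shared between users.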
You are describing Broken Access Control with Insecure Ids. Once you've identified the threat and decided which Ids are owned by certain users, ensure checks are in place for this server side.