I am doing some volunteer work for a charity that uses a couple of online systems to store donors and related data. I would like to find a way to store a URL in a custom field so that they can put corresponding links between donors in one system in order to quickly find the same donor in the other system. The only built-in method in the products being used is to store a single value in a field labeled "website", which is originally intended to hold any website associated with the donor. I would like to avoid using this field if possible and instead create a custom field.
However, the rub is that the custom fields only have a handful of types (clear text, date, currency, etc.); there is no option to store a URL or anything like rich text. I've thought of a couple of less-than-optimal ways to make the values stored in those fields clickable (a browser plugin or a proxy), but both of those have obvious drawbacks that I would like to avoid.
What I am wondering, and hoping someone has a possible answer for, is whether there are any ways of storing a value in a clear-text field that might disrupt or escape the underlying HTML encoding so that the displayed link is clickable. I already control the values being put into these fields (users cannot enter their own values; they are essentially read-only), so security isn't much of a concern.
I have very limited access or influence to make any system-level changes, but I would like to make this possible, as it would help them a great deal (their users are all volunteers with limited time and training). I've tried a few tricks but haven't found anything that doesn't get converted to Unicode or escaped (it could be that this is completely controlled at output; I simply don't know).
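For example, the kind of trick I mean (the URL below is a made-up stand-in) is storing a raw anchor tag as the field value; the trouble is that output encoding neutralizes it, roughly like this:

import html

# A raw anchor tag stored as the field value; the URL is hypothetical.
raw_value = '<a href="https://other-system.example.org/donor/123">Donor 123</a>'

# If the product HTML-escapes stored text at output, the link is
# neutralized and displays as literal text instead of a clickable link.
print(html.escape(raw_value))
# &lt;a href=&quot;https://other-system.example.org/donor/123&quot;&gt;Donor 123&lt;/a&gt;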
My current attempts have been limited to using the built-in form submission. I may explore their import and/or API methods, on the theory that those might allow better low-level access to storing the actual values in the system; however, I'm still not certain what to try other than adding an <a> tag directly.
I have also tried an inline script to add the corresponding tag, but that seems to break the form submission method (perhaps it will work via CSV import or via the API).
Does anyone have suggestions for other things I could try before I go any further? I'm a bit of a novice and feel like there may be something obvious I haven't tried.
I have searched Stack Overflow for an answer to this question, and I've been surprised to find very little information on what seems to be a very common task.
Let's say I have an app that allows users to make posts. These posts can contain text, of course, but I also want the users to be able to insert images, and possibly videos.
So here's the dilemma. The first idea that comes to mind for storing these posts would be making a table like this:
CREATE TABLE posts (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    owner VARCHAR(36) NOT NULL,
    message TEXT,
    _timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
id is an identifier for the post itself.
owner is an identifier for the person who created the post.
message contains the message, as text.
_timestamp represents the time created.
However, since SQL wasn't really made for storing images and other files, the images are stored off-database. For the sake of example, let's say they're stored using a product similar to Google Cloud Storage.
So, the question is: how should the message be formatted so that it contains data (for example, a link) pointing to the images, without requiring too much work in the frontend code? (And without letting the user know that they're doing anything other than inserting an image.)
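For instance, one format I could imagine (the placeholder syntax here is just an assumption for illustration) keeps the text with lightweight tokens and resolves them to storage URLs at render time:

import re

def render_message(message, base_url):
    # Turn '{{img:OBJECT_NAME}}' placeholders into <img> tags pointing
    # at the off-database storage.
    return re.sub(
        r"\{\{img:([\w.-]+)\}\}",
        lambda m: '<img src="%s/%s">' % (base_url, m.group(1)),
        message,
    )

print(render_message("Look at this: {{img:cat42.png}}",
                     "https://storage.example.com/my-bucket"))
# Look at this: <img src="https://storage.example.com/my-bucket/cat42.png">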
From experience with GitHub and StackOverflow, Markdown is obviously nice, but not as user-friendly as I'd want, and doesn't work with images exactly the way I want.
I've thought about using HTML to format the message, but that brings up two main problems:
How should I store HTML in such a way that prevents XSS (Cross-site Scripting)? Should I just escape everything in such a way that it can still be read as HTML on the frontend?
Let's say this app is a mobile app. This means I would either have to make my own HTML parser or find an existing library for it.
So what is the best practice for this?
I see this type of functionality all the time, so what are those people (such as Facebook, Google, etc.) using?
Not only have I encountered this problem, but I feel like there should be a good answer for this on StackOverflow for others who encounter this problem.
Specifically, I want to know whether HTML is a good option, or if I should consider something else. As of right now, I'm planning to use plain HTML and make public URIs for the Cloud Storage objects.
Not speaking about any specific implementation, I would say you never want to insert the image/video data into the post itself.
These should always be either an attachment or a link.
So either you let the user insert links into the post, or you let them add attachments, which are then uploaded to the server and a link to them is placed into the post.
Let's say you have a situation where a user drops image/video/audio/whatever data into the post. In that case you would fire an event that uploads the data to your storage and places the link into the post when it's done. That's what happens when you Ctrl-C Ctrl-V an image into a GitHub comment, for example.
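A minimal sketch of that event handler, assuming a hypothetical storage client (the upload and public_url methods are stand-ins, not a real SDK):

import uuid

def handle_image_drop(image_bytes, storage):
    # Upload the dropped image under a unique name, then return the
    # markup to splice into the post once the upload finishes.
    object_name = "post-images/%s.png" % uuid.uuid4()
    storage.upload(object_name, image_bytes)      # hypothetical API
    url = storage.public_url(object_name)         # hypothetical API
    return '<img src="%s" alt="user image">' % url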
Regarding XSS, you should strip the inserted data of any JavaScript and other markup you don't want, and you should be fine. There are many libraries that can do this for you.
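For instance, a sketch using Python's bleach library (one of many such sanitizers; the allow-lists here are illustrative):

import bleach

ALLOWED_TAGS = ["p", "b", "i", "a", "img", "br"]
ALLOWED_ATTRS = {"a": ["href"], "img": ["src", "alt"]}

def sanitize(message_html):
    # Drop every tag and attribute not on the allow-lists, including
    # <script> elements and inline event handlers.
    return bleach.clean(message_html, tags=ALLOWED_TAGS,
                        attributes=ALLOWED_ATTRS, strip=True)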
Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each one and get the names & emails of the partners. How should I approach this problem at a high level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and/or a partner) can be found in any of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking of approaching the email part is just to search for all mailto anchor tags and filter from there.
The obvious downside is that there is no guarantee the email will be for a partner and not some other employee.
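For illustration, here's roughly what that first pass looks like (sketched in Python; I'm actually using Oga in Ruby, but the idea is the same):

import re

# Pull addresses out of mailto links; a crude first pass that ignores
# the surrounding markup entirely.
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)

def mailto_addresses(page_html):
    return MAILTO_RE.findall(page_html)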
Another, more obvious issue is detecting the partners' names just from the markup. I was initially thinking I could just pull all the header tags and the text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this or where to ask it. Is there another Stack Exchange site where this question would be more appropriate?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners at all, so in my opinion you should first make sure you actually stand to gain from this; if not, you'd better look for some other approach.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email people, although you are allowed to email the company - so in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering - it works very well for email.
But - to be a little more productive here - one thing I know is important: you could start by creating the crawler environment and building the dataset. Have a database for URLs so you can add more at any time, and start crawling what you already have, so that you do your testing by querying your own 100% local copy. This will save you enormous amounts of time compared to live scraping while tweaking.
I did my own search engine some years ago, scraping all .no domains, though I needed only the index file at the time. It took over a week just to scrape it all down, and I think it was 8 GB of data just for that single file, and I had to use several proxy servers as well to make it work, due to problems with too much DNS traffic. Lots of problems needed taking care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down now if you want to work efficiently on the parsing later.
Good luck, and do post if you find a solution. I do not think it is possible without an algorithm or AI, though - people design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler. Then you could just crawl each site and make a profile for each site. You could employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale / available as a web service so it can be queried.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole webpage for names. (There are free databases of first names and last names.) This may also work if you are doing this for companies elsewhere in Europe, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email in a webpage, as there is a fixed format of (username)@(domain name) with no space in between. Again, I would treat the page not as HTML tags but just as a normal string, so that the email can be found no matter whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (a sketch in code follows the list below):
Only one email on the page?
Yes -> catch-all email.
No -> Is a name found on that page as well?
  No -> catch-all email (a site can have more than one catch-all email, maybe for different purposes like info + employment).
  Yes -> The email should be attached to the name found right before it. It is normal for the name to appear before the email.
Then it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
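Here is that decision list sketched in Python; the known-names list and the nearest-preceding-name rule are assumptions to make it concrete:

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def classify_emails(page_text, known_names):
    # Scan the raw text so addresses are found in mailto links and in
    # plain prose alike.
    emails = EMAIL_RE.findall(page_text)
    names_found = [n for n in known_names if n in page_text]
    if len(emails) == 1:
        return {emails[0]: "catch-all"}
    if not names_found:
        return {e: "catch-all" for e in emails}
    result = {}
    for email in emails:
        before = page_text[:page_text.find(email)]
        # Attach the email to the nearest name appearing before it.
        pos, name = max((before.rfind(n), n) for n in names_found)
        result[email] = name if pos != -1 else "catch-all"
    return result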
I have done similar scraping on these types of pages, and it varies wildly from site to site. If you are trying to make one crawler that automatically finds the information, it will be difficult. However, at a high level it looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes that will easily let you grab information. Perhaps you find that many divs share a particular class name. Check for this first.
It is often better to grab too much data from a particular page and boil it down on your side afterwards. You could, perhaps, look for information that shows up on screen by checking its type (is it a link?) or a regex (is it an email?) to find formatted text. Names and occupations will be harder to find this way, but on many pages they sit positionally close to other well-formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could build up a bank of those and check any page you end up on against it.
Finally, if you really wanted to make this process general-purpose, you could apply some heuristics to improve your methods based on expected information; names, for example, most often appear within a particular kind of list. If it was worth your time, you could check text against a list of common names.
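For example, a small honorific check of that sort, sketched in Python (the honorific and credential sets are illustrative only):

import re

# Flag text fragments that carry a title or credential as candidate
# partner names.
HONORIFIC_RE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+"
    r"|\b[A-Z][a-z]+\s+[A-Z][a-z]+,\s*(?:CPA|JD|MD)\b"
)

def candidate_names(page_text):
    return HONORIFIC_RE.findall(page_text)

# candidate_names("Contact Dr. Jane Smith or John Doe, CPA for details.")
# -> ['Dr. Jane Smith', 'John Doe, CPA']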
From what you mentioned in your initial question, it seems you would get a lot of benefit from a general-purpose regular-expression crawler, and you could improve it as you learn more about the sites you interact with.
There are excellent posts on this topic, with a lot of useful links, on these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the examined applications also work on macOS.
I am working on a user-moderated database and settled on MediaWiki with Semantic MediaWiki as the engine. I installed Semantic Forms to force end users to conform to a certain standard when creating or editing entries. The problem is that, since a user can add a semantic annotation to any form text input, they can throw off the proper structure of the system; i.e. if it were an IMDb clone, a user could add [[Directed by::Forrest Gump]], which would then result in the movie "Forrest Gump" showing up in a list of directors.
I doubt that there's any setting that can simply turn this off or on, but I've had one or two ideas as to how to get it working.
One: perhaps there's a way to disable semantic annotation in specific namespaces and put the forms in those namespaces. I have a feeling this would merely cause the forms to break.
The other idea is to modify the code, which is clearly the less ideal approach. To get started, I believe I would need to create some sort of filter on SFTextAreaInput that disables semantic annotations in the user-inserted text, but alas, I'm unsure how to get started on that.
Well, Semantic MediaWiki is still a wiki. In your classical enterprise database, you restrict the users' input options as a means of ensuring data integrity. That isn't what wikis do; the thinking with a wiki is: yes, the user can enter incorrect information, but another user will amend it and let the first user know what was wrong.
I wouldn't try to coerce SMW into rigid data acquisition. I mean, you do have options such as removing the standard input fields in forms:
'''Free text:'''
{{{standard input|free text|rows=10}}}
If users are selecting a movie page when they should be selecting a director page, then you probably want to encourage correct selection by populating the form control from the Directors category, like:
{{{field|Director|input type=combobox|values from category=Directors}}}
Yes, they can still go very far out of their way to select "Forrest Gump", but if that happens then the fact that someone wilfully circumvented the preselected correct options is a more pressing concern than the fact that the system permits it.
Wikis work best when the system encourages rather than enforces valid knowledge.
My name is Wolfgang Fahl, and I am behind the smartMediaWiki approach. You might want to go the smartMediaWiki route; see
http://semantic-mediawiki.org/wiki/SMWCon_Spring_2015/smartMediaWiki
For a start, don't go just by the property values but, e.g., also by a category:
{{#ask: [[Category:Movie]] [[Directed by::+]]
|?Directed by
}}
will only show pages that both have the property set and are in the correct category.
In the smartMediaWiki approach you'd create a topic "Movie", and the entry of movies would be done via forms. This is an elaboration of the Semantic Forms and semantic PageSchemas ideas that has recently evolved. You can find out more about it at SMWCon Barcelona 2015 this fall.
I have a pretty good project task management system going in Microsoft Access, but one feature I'm still missing is some type of 'quick entry' facility, of the kind often found in good productivity applications.
This is how it would work:
Scenario 1:
You're in another application, working on a few things, and you suddenly remember something that needs to get done. You hit your predefined shortcut, CTRL + ALT + T (again, from outside Microsoft Access), and it brings up a small Access form with a text box into which you can type what needs to get done, e.g.
Inform key stakeholders of concerns regarding timeline
You hit return, and that gets saved as a record in Microsoft Access instantly.
An alternative, and slightly more complex scenario...
Scenario 2:
As above, but you want to add further details besides the task name, such as the person you need to speak to and a due date. The input into the text box could look like this:
Inform #Sally of concerns regarding timeline >+3
Where '#' tells Access to populate a field called 'Contact' with 'Sally' (creating it unless it already exists), and '>+3' is interpreted by Access to mean a due date 3 days from today.
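To make the parsing side concrete, this is the logic I have in mind, sketched in Python (in Access it would of course be VBA; the token syntax is just the one described above):

import re
from datetime import date, timedelta

def parse_quick_entry(text):
    # '#Name' names the contact; '>+N' sets the due date N days out.
    contact = None
    due = None
    m = re.search(r"#(\w+)", text)
    if m:
        contact = m.group(1)
        text = text.replace(m.group(0), "")
    m = re.search(r">\+(\d+)", text)
    if m:
        due = date.today() + timedelta(days=int(m.group(1)))
        text = text.replace(m.group(0), "")
    return {"task": re.sub(r"\s+", " ", text).strip(),
            "contact": contact, "due": due}

# parse_quick_entry("Inform #Sally of concerns regarding timeline >+3")
# -> {'task': 'Inform of concerns regarding timeline',
#     'contact': 'Sally', 'due': <date 3 days from today>}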
How difficult are Scenario 1 and Scenario 2 to perform? What level of VBA/programming knowledge would be required?
Thanks,
I would say it requires a fair amount of confidence in VBA.
You need to register a global hot-key; that is, a keyboard combination that can be captured from outside the Access application. It requires Win-API calls. Here is some code.
You need to know where to place these calls. I believe you have to put them in a standard module, not in the form's class module. (I haven't double-checked this, it's late.)
You need to have a little understanding of what this code is doing. NEVER attempt to type in this API code - copy it from a reliable source, exactly as it is! You don't need to fully understand the code, but you need to know how (and when) to call each function.
Once you've registered the hot-key, your VBA needs to bring your application to the front, display your form, and focus it. Reliably bringing the application to the front may also require API calls.
Once your form is open (and focused), you can have a button on it that parses the information in its textbox. However, if you are designing the form anyway, I would add checkboxes, comboboxes, etc., rather than trying to parse a complex sentence/statement.
This is a common situation, but here is the latest example:
Companies have various contact data (addresses, phone numbers, e-mails...). When they create a job ad, they get checkboxes where they choose how they want to be contacted. It is basically descriptive data: a user reading an ad sees something like "You can apply by mail, in person...", except if an option is "through web portal" or "by e-mail", because then the appropriate buttons should appear. These options are stored in the database, and the client (the owner of the site, not the company making the ad) can change them (e.g. they can add "by telepathy" or whatever), yet if they tamper with the "e-mail" and "web portal" options, they break their own site.
So how should I handle data where everything behaves the same way except "this thing", which behaves one way, and "that thing", which behaves some other way, while the data itself is live and should be editable by the client?
You've tagged your question as "language-agnostic", and not all languages cleanly support polymorphism, but that's the way I would approach this.
Each option has some type, and different types require different properties to be set. However, every type supports some sort of "render" method that can display the contact method as needed. Since the properties (phone number, or web address, etc.) are type-specific, you can validate the administrator's input when creating these "objects", to make sure that the necessary data is provided and valid. Since you implement the render method, rather than spitting out HTML provided by a user, you can ensure that the rendered page is correct. It's less flexible, but safer and more user friendly.
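A minimal sketch of that shape (the class names and fields are illustrative, not from the question):

import html

class ContactOption:
    # Generic descriptive option: rendered as plain, escaped text.
    def __init__(self, label):
        self.label = label
    def render(self):
        return html.escape(self.label)

class EmailOption(ContactOption):
    # Special option: the address is validated on input, and render()
    # emits the button-like link itself, so no user HTML is echoed.
    def __init__(self, address):
        super().__init__("by e-mail")
        self.address = address
    def render(self):
        return '<a href="mailto:%s">Apply by e-mail</a>' % html.escape(self.address)

class WebPortalOption(ContactOption):
    def __init__(self, url):
        super().__init__("through web portal")
        self.url = url
    def render(self):
        return '<a href="%s">Apply online</a>' % html.escape(self.url)

ad_options = [ContactOption("in person"), EmailOption("jobs@example.com")]
print(" | ".join(option.render() for option in ad_options))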
In the database, you can have one sparsely populated table that holds data for all types of contacts, or a "parent" table with common properties and sub-tables with type-specific properties. It depends on how many types you have and how different they are. In either case, you would have some sort of type indicator, so that you know the type of object to which the data should be bound.
First of all, think twice about whether you really need it. The reason is simple: you are supposed to serve a specific need, and input data is a means of providing that service. If the data does not fit the existing service, then what is its value, and who is the consumer of that specific information?
There are two possible answers: you are expanding your client base, or you need to change the existing service because demand has changed. In both cases you need to start from the development of a business model. If you describe what service you need and what information it should provide, you will avoid much of the special-case data and arrive at clear requirements that are easy to implement in software.
I'd recommend the resolution pattern for this, based on the mention of a database. It's actually a lot simpler than it sounds: you write a database query that returns all the possible options (for example, you read the standard options and the customized options together, using perhaps a UNION or a JOIN depending on your schema) - the COALESCE SQL keyword is then useful to find the first 'resolution' of the option value that isn't NULL.
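The same resolution idea sketched outside SQL (names are illustrative): overlay the customized options on the standard ones and take the first non-null value, which is what COALESCE does in the query.

def resolve_options(standard, customized):
    # For each option, prefer the customized value when one exists,
    # falling back to the standard value - a COALESCE in miniature.
    return {key: customized.get(key, value) for key, value in standard.items()}

print(resolve_options(
    {"e-mail": "jobs@example.com", "web portal": "https://example.com/apply"},
    {"e-mail": "hr@example.com"},
))
# {'e-mail': 'hr@example.com', 'web portal': 'https://example.com/apply'}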
Well, if all it is is that you have two special options, and anything else is dealt with in the same way, then store your options as strings, and if either of the two special ones appears in the list, show the appropriate UI for that special item.
Just check your list of items for the two special ones. Nothing fancy.
By writing a very simple rules engine. You can use an out-of-the-box implementation, or you can roll your own. Since your case seems so simple, I tend to roll my own, because it means fewer dependencies (YMMV).
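A toy version of the roll-your-own route (entirely illustrative): rules are (predicate, action) pairs evaluated in order, with a default for everything that behaves the same way.

def run_rules(option, rules, default):
    # First matching rule wins; anything unmatched is handled uniformly.
    for matches, action in rules:
        if matches(option):
            return action(option)
    return default(option)

rules = [
    (lambda o: o == "e-mail", lambda o: "render the apply-by-e-mail button"),
    (lambda o: o == "web portal", lambda o: "render the apply-online button"),
]
print(run_rules("by telepathy", rules, default=lambda o: "show text: " + o))
# -> show text: by telepathy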