I'd love to do this:
UPDATE table SET blobCol = HTTPGET(urlCol) WHERE whatever LIMIT n;
Is there code available to do this? I known this should be possible as the MySQL Docs include an example of adding a function that does a DNS lookup.
MySQL / windows / Preferably without having to compile stuff, but I can.
(If you haven't heard of anything like this but you would expect that you would have if it did exist, A "proly not" would be nice.)
EDIT: I known this would open a whole can-o-worms re security, however in my cases, the only access to the DB is via the mysql console app. Its is not a world accessible system. It is not a web back end. It is only a local data logging system
No, thank goodness — it would be a security horror. Every SQL injection hole in an application could be leveraged to start spamming connections to attack other sites.
You could, I suppose, write it in C and compile it as a UDF. But I don't think it really gets you anything in comparison to just SELECTing in your application layer and looping over the results doing HTTP GETs and UPDATEing. If we're talking about making HTTP connections, the extra efficiency of doing it in the database layer will be completely dwarfed by the network delays anyway.
I don't know of any function like that as part of MySQL.
Are you just trying to retreive HTML data from many URLs?
An alternative solution might be to use Google spreadsheet's importHtml function.
Google Spreadsheets Lets You Import Online Data
Proly not. Best practises in a web-enviroment is to have database-servers isolated from the outside, both ways, meaning that the db-server wouldn't be allowed to fetch stuff from the internet.
Proly not.
If you're absolutely determined to get web content from within an SQL environ, there are as far as I know two possibilities:
Write a custom MySQL UDF in C (as bobince mentioned). The could potentially be a huge job, depending on your experience of C, how much security you want, how complete you want the UDF to be: eg. Just GET requests? How about POST? HEAD? etc.
Use a different database which can do this. If you're happy with SQL you could probably do this with PostgreSQL and one of the snap-in languages such as Python or PHP.
If you're not too fussed about sticking with SQL you could use something like eXist. You can do this type of thing relatively easily with XQuery, and would benefit from being able to easily modify the results to fit your schema (rather than just lumping it into a blob field) or store the page "as is" as an xhtml doc in the DB.
Then you can run queries very quickly across all documents to, for instance, get all the links or quotes or whatever. You could even apply XSL to such a result with very little extra work. Great if you're storing the pages for reference and want to adapt the results into a personal "intranet"-style app.
Also since eXist is document-centric it has lots of great methods for fuzzy-text searching, near-word searching, and has a great full-text index (much better than MySQL's). Perfect if you're after doing some data-mining on the content, eg: find me all documents where a word like "burger" within 50 words of "hotdog" where the word isn't in a UL list. Try doing that native in MySQL!
As an aside, and with no malice intended; I often wonder why eXist is over-looked when people build CMSs. Its a database that can store content in its native format (XML, or its subset (x)HTML), query it with ease in its native format, and can translate it from its native format with a powerful templating language which looks and acts like its native format. Sometimes SQL is just plain wrong for the job!
Sorry. Didn't mean to waffle! :-$
Related
I'm designing a website where users can upload comments on pages, and other users should see those comments. I reached the stage where I have the comments stored in a database, and I know the place they're supposed to go in the html, and I need to connect those two things somehow.
I'm using express and Node.js on the server side, and postgres on the db side.
As of when I'm asking this, it seems to me it's very bad practice to have the user access the database. So I think the server needs to access the database based on the user's request, modify the generalized html's showing of comments to now have the information of the specific comments, save that to a file, and send it to the user. To do this I was thinking of creating an "html generator function" on the server-side that takes in specific comment information and puts it in the generalized html, but that seems like it doesn't scale well and I'm concerned that storing the intermediate file would be inefficient.
Is that the correct approach? Can you tell me known ways of doing this that aren't so hacky?
If you suggest using php, isn't there a problem where php connects to a server and disconnects every time we use it? I would prefer if the server connected once when it booted and did all the fetching when needed instead of connecting every time. It seems to me like that would involve far less overhead (correct me if I'm wrong...)
See the comment of Amadan for the full solution. It's called a "template engine"
Edit:
I highly recommend learning React. I learned EJS and it's difficult to scale. React is infinitely easier to program with for just a little more investment. The old web is much less declarative (& EJS is much less too).
Is there any good way to migrate existing database from Domino Server to Relational database like MySQL without using any tool.
I've explored a bit about this and got to know that its possible using XML but don't know how and what'll be the procedure.
Any help would be appreciated.
Without using any tool: NO.
There are two big difficulties in exporting data:
First is the Notes Richtext, which is a proprietary format that has to be "transcoded" somehow. This is not an easy thing to do "manually" and needs either a lot of coding or some kind of tool.
Second is the fact, that there is no "forced" structure in Notes documents. There can be several forms that "define" how the documents look and there can be different versions of these forms that have been used over the past. A document may or may not contain any number of fields in any thinkable type (the field may even be number in one document and text in the other).
You have to KNOW the structure of your documents to get them out. Of course you can simply export them as "Structured Text" or as "Comma separated values", to get -most- of it, but then you need views that show the documents in the order you need them. Exporting them as XML is another "standard" way to get the data, but then you need to understand the xml to get it into your relational database.
Short: Without (at least very little) coding knowledge OR a tool (that costs money) there is no chance for getting the data out.
Ah yes, there is an "ODBC driver" for Lotus Notes / Domino, but that will not help you much, if you do not know the structure of your documents and how Notes- Databases work, it will also not work.
As Torsten said above, you can't do it without a tool, either you buy one or write one yourself.
I wrote a tool like that several years ago to export Notes databases as XML. There is a bit of work, especially with the rich text fields. You also may want to export/detach attachments and embedded images.
You can read more about my export tool here: http://www.texasswede.com/websites/texasswede.nsf/Page/Notes%20XML%20Exporter
Background
I have a (glorified) CRUD application that I'd like to enable HTML5 offline support with. The cache-manifest system looks simple yet powerful, but I'm curious about how I can allow users to access data while offline.
For example, suppose I have these pages for the entity "Case" (i.e. this is CRM case-management software):
http://myapplication.com/Case
http://myapplication.com/Case/{id}
http://myapplication.com/Case/Create
The first URI contains a paged listing of all cases, using the querystring parameters pageIndex and pageSize, e.g. /Case?pageIndex=2&pageSize=20.
The second URI is the template for editing individual cases, e.g. /Case/1 or /Case/56.
Finally, /Case/Create is the form used to create cases.
The Problem
I would like all three to be available offline.
/Case
The simple way would be to add /Case to the cache-manifest, however that would break paging (as the links wouldn't work).
I think I could instead add something like /Case/AllData which is an XML resource, which is cached and if offline then a script on /Case would use this XML data to populate the list and provide for pagination.
If I go for the latter, how can I have this XML data stored in the in-browser SQL database instead of as a cached resource? I think using the SQL database would be more resilient.
/Case/{id}
This is more complicated. There is the simple solution of manually adding /Case/1, /Case/2, /Case/3 etc... to /Case/1234, but there can be hundreds or even thousands of cases so this isn't very practical.
I think the system should provide access to the 30 most recent cases, for example. As above, how can I store this data in the database?
Also, how would this work? If I don't explicitly add /Case/34 to the manifest and the user clicks on to /Case/34 how can I get the browser to load a page that my JavaScript will populate based on the browser's SQL database data and not display the offline message?
/Case/Create
This one is more simple - as it's just an empty page and on the <form>'s submit action my script would detect if it's offline, and if it is offline then it would add it to the browser's SQL database. Does this sound okay?
Thanks!
I think you need to be looking at a LocalStorage database (though it does have some downsides), but there are other alternatives such as WebSQL and IndexedDB.
Also I don't think you should be using numeric Id's if you are allowing people to create as you will get Primary Key conflicts, it is probably best to use something like a GUID.
Another thing you need is the ability to push those new cases onto the server. there could be multiple...
Can they be edited? If they can I think you really need to be thinking about synchronization and conflict resolution hard very hard if that is the case.
Shameless self promotion, I have a project that is designed to handle these very issues, though it's not done, it's close. You can see it (with an ugly but very functional) demo at https://github.com/forbesmyester/SyncIt
I'm writing a crawler which needs to get data from many websites. The problem is that every website has different structure. How can I easily write a crawler which downloads (correctly) data from (many) different websites? If the structure of a website will change will I need to rewrite the crawler, or are there other methods?
What logical and implemented tools can be used to improve the quality of data mined by an automatic web-crawler (many websites are involved with different structure)?
Thank You!
I presume you want to query it is some way, in which case you should store the data in a flexible data store. A relational database would not be fit for purpose as it has a strict schema, but something like mongodb which lets you store semi structured data without having to define a schema up front, but still provides a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes where the structure is defined up front, but use a flexible data structures that can change at runtime. If you are using Java then de-serialise the data into HashMaps. In other languages this might be called Dictionaries or Hashes.
If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.
If the websites provide no APIs, then you're out cold and you have to write separate extraction module for each data format you're encountering. If the website changes the format, then you have to update your format module. A standard thing to do is to have plugins for every website you're crawling and have a testing framework which does regression testing with data you've already collected. When a test fails you know something went wrong and you can investigate whether you have to update your format plugin or if there is another issue.
Without knowing what kind of data you're collecting it will be very difficult to try to hypothesize about ways to improve the "quality" of the data that was mined.
Maybe you could find out whether the website allows you to access the data like API, if so, you could use this kind of structured data to your website directly. If not, you may need plugins for that. Or you could turn to other web crawlers with API access like Octoparse, to find the way to access their API to your own web crawler.
I have lots of stuff in an app.config, and when changes are necessary, an app restart is required. Bad for my 24x7 web server system (it really is 24x7, not even 23x7). I would like to use a good strategy for keeping the config information in a DB table and query/use it as needed. I googled around a bit and am coming up dry. Does anyone have any suggestions before I re-invent the wheel?
Thanks.
I needed exactly this for my recent application, and couldn't use any application server specific techniques as I needed some console apps run on cronjobs to access them too.
I basically made a couple of small tables to create a registry-style configuration database. I have a table of keys (which all have parent-keys so they can be arranged in a tree structure) and a table of values which are attached to keys. All keys and values are named, so my access functions look like this:
openKey("/my_app");
createKey("basic_settings");
openKey("basic_settings");
createValue("log_directory","c:\logs");
getValue("/my_app/basic_settings","log_directory");
The tree structure allows you to logically separate similar data (e.g. you can have a "log_directory" value under several different keys) and avoids having the overly verbose names you find in properties files.
All the values are just strings (varchar2 in the db), so there's some overhead in converting booleans and numbers: but it's only config data, so who cares?
I also create a "settings_changed" value that has a datetime string in it: so any app can quickly tell if it needs to refresh it's configuration (you obviously need to remember to set it when you change anything though).
There may be tools out there to do this kind of thing already: but this was only a days worth of coding and works a treat. I added command line tools to edit and upload/download parts or all of the tree, then made a quick graphical editor in Java Swing.