Looking to use Sphinx for site search, but not all of my site is in MySQL. Rather than reinvent the wheel, I'm just wondering if there's an open-source spider that easily tosses its findings into a MySQL database so that Sphinx can then index it.
Thanks for any advice.
There's also the XML pipe datasource that can feed documents to Sphinx. Not sure if it'd be any easier to set something up to output your site's content as XML than it would be to insert it into the DB, but it's an option.
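If you go the xmlpipe2 route, the feed is just XML written to stdout. Here's a minimal sketch in Python; the walk_pages helper and the field names are invented for the example, and only the sphinx:* envelope is the part Sphinx actually dictates:

    #!/usr/bin/env python
    # Minimal xmlpipe2 generator sketch: Sphinx runs this command and
    # reads the document stream from stdout (source type = xmlpipe2).
    import sys
    from xml.sax.saxutils import escape

    def walk_pages():
        # Hypothetical: yield (id, title, body) tuples from wherever
        # your non-MySQL content lives (flat files, a CMS API, etc.).
        yield 1, "Home", "Welcome to the site."
        yield 2, "About", "Some static page content."

    sys.stdout.write('<?xml version="1.0" encoding="utf-8"?>\n')
    sys.stdout.write('<sphinx:docset>\n'
                     '<sphinx:schema>\n'
                     '<sphinx:field name="title"/>\n'
                     '<sphinx:field name="content"/>\n'
                     '</sphinx:schema>\n')
    for doc_id, title, body in walk_pages():
        sys.stdout.write('<sphinx:document id="%d">\n' % doc_id)
        sys.stdout.write('<title>%s</title>\n' % escape(title))
        sys.stdout.write('<content>%s</content>\n' % escape(body))
        sys.stdout.write('</sphinx:document>\n')
    sys.stdout.write('</sphinx:docset>\n')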
If you're not 100% stuck on using Sphinx, you could consider Lucene, like this site does. This should work regardless of the underlying technology (database-driven or static pages).
I am also currently looking to implement a site search. This question may also help.
I have made a very basic website using HTML. It is basically a template, the thing you start with. I want to use SQL to make a database. I would also like to display all of the data on that page (index.html). Can you help me achieve that?
Short answer: No, not only with SQL.
SQL is a language used to perform queries in a database (inserting data, deleting, searching, etc.). To use it to display data on your website, you would also need to learn another programming language, so you can write code that serves as an interface between your website and the database. The two most popular choices are Python (with Flask or Django) and NodeJS; I recommend Python, since it's known to be somewhat beginner-friendly. I suggest finding a tutorial online to get you started.
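To make that concrete, here is a minimal sketch of such an interface layer using Flask; the items table and its columns are invented for the example, and SQLite stands in for whichever database you end up using:

    # Minimal Flask app that reads rows from a database and renders them.
    import sqlite3
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        conn = sqlite3.connect("site.db")
        conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
        rows = conn.execute("SELECT name, price FROM items").fetchall()
        conn.close()
        # A real app would use a template (render_template) instead of
        # building HTML by hand.
        body = "".join("<li>%s: %s</li>" % (n, p) for n, p in rows)
        return "<ul>%s</ul>" % body

    if __name__ == "__main__":
        app.run(debug=True)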
Happy coding!
I think you should use SQLAlchemy for the database layer. You can look at the SQLAlchemy documentation for more info.
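If you do go with SQLAlchemy, the basic pattern looks something like this (just a sketch; the Item model is invented for the example):

    # Minimal SQLAlchemy sketch: define a table, insert a row, query it.
    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import declarative_base, Session

    Base = declarative_base()

    class Item(Base):  # hypothetical model
        __tablename__ = "items"
        id = Column(Integer, primary_key=True)
        name = Column(String(100))

    engine = create_engine("sqlite:///site.db")  # swap in your MySQL URL
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(Item(name="example"))
        session.commit()
        for item in session.query(Item).all():
            print(item.id, item.name)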
As far as I know, the Sphinx search engine can index HTML, but it doesn't have any built-in drivers for it like it does for SQL data. That means we have to parse and prepare the HTML content ourselves.
Does anyone know of any drivers or third-party add-ons that make Sphinx index HTML automatically?
Can anyone help? Thanks in advance.
Well, if you have a database of the .html filenames, you can use
http://sphinxsearch.com/docs/current.html#conf-sql-file-field
to index them; Sphinx will load each individual file in turn and index its contents.
Combine with
http://sphinxsearch.com/docs/current.html#conf-html-strip
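If you don't already have that table of filenames, something like this sketch could build it; the docs table, the docroot path, and the mysql.connector connection details are all assumptions on my part:

    # Walk a document root and record every .html file's path in MySQL,
    # so a Sphinx source can apply sql_file_field to the "path" column.
    import os
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="sphinx",
                                   password="secret", database="search")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS docs (
                     id INT AUTO_INCREMENT PRIMARY KEY,
                     path VARCHAR(512) NOT NULL)""")

    for root, _dirs, files in os.walk("/var/www/html"):  # assumed docroot
        for name in files:
            if name.endswith(".html"):
                cur.execute("INSERT INTO docs (path) VALUES (%s)",
                            (os.path.join(root, name),))

    conn.commit()
    conn.close()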
I am interested in the underlying data structure of the database and the way Stack Overflow manages tags. I am about to build an application that will rely entirely on tag-based filters, and I am looking for the right approach. What is the best way to design the database so that a minimum of queries will have to run in the future when working with sets of tags to filter my data? I did use the search, but couldn't find what I am looking for.
Stack Overflow does not rely entirely on an SQL database to work with tags. They cache, pre-sort, and pre-aggregate them aggressively.
Read this interesting story of one optimization.
From there you can get some insights into how Stack Overflow works.
I don't know if they do it well, but you may want to look at Drupal's taxonomy module for ideas (http://drupal.org/documentation/modules/taxonomy). If you run the installation, you can look at how they handle this in the generated DB.
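For what it's worth, the relational baseline that both of the above build on is the standard three-table many-to-many layout. A minimal sketch (table and column names are just illustrative), including the classic GROUP BY / HAVING query for "items carrying ALL of these tags":

    # Classic tag schema: items, tags, and a junction table.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items    (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE tags     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE item_tags(item_id INTEGER REFERENCES items(id),
                               tag_id  INTEGER REFERENCES tags(id),
                               PRIMARY KEY (item_id, tag_id));
    """)
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(1, "post about sql"), (2, "post about python")])
    conn.executemany("INSERT INTO tags VALUES (?, ?)",
                     [(1, "sql"), (2, "python"), (3, "mysql")])
    conn.executemany("INSERT INTO item_tags VALUES (?, ?)",
                     [(1, 1), (1, 3), (2, 2)])

    wanted = ["sql", "mysql"]
    placeholders = ",".join("?" * len(wanted))
    rows = conn.execute("""
        SELECT i.title
        FROM items i
        JOIN item_tags it ON it.item_id = i.id
        JOIN tags t       ON t.id = it.tag_id
        WHERE t.name IN (%s)
        GROUP BY i.id
        HAVING COUNT(DISTINCT t.name) = ?
    """ % placeholders, wanted + [len(wanted)]).fetchall()
    print(rows)  # -> [('post about sql',)]

The HAVING clause is what turns "match all of these tags" into a single query instead of one query per tag; it's also exactly the sort of query the caching described above would sit in front of.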
Could anyone advise me, or direct me to a site that explains the best way to go about this? I'm sure I could figure it out with a lot of time invested, but I'm just looking for a jump start. I don't want to use the migration tool either, as I just want to put FMP XML files on the server and have it create new MySQL databases based on the FMPXML results provided.
Thanks.
Technically you can write an XSLT to transform the XML files into SQL. It's pretty straightforward for data (except data in container fields), but with some effort you can even transfer the schema from DDR reports (though I doubt it's worth it for a single project).
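If XSLT isn't your thing, the same transformation can be sketched in plain Python instead. This assumes the standard FMPXMLRESULT layout (METADATA/FIELD for the column names, RESULTSET/ROW/COL/DATA for the values), prints naive INSERT statements, and ignores container fields entirely:

    # Sketch: turn a FileMaker FMPXMLRESULT export into INSERT statements.
    # Assumes one DATA element per COL; container fields are not handled.
    import sys
    import xml.etree.ElementTree as ET

    NS = {"fmp": "http://www.filemaker.com/fmpxmlresult"}

    root = ET.parse(sys.argv[1]).getroot()   # e.g. export.xml
    fields = [f.get("NAME")
              for f in root.findall("fmp:METADATA/fmp:FIELD", NS)]
    table = "imported"                       # hypothetical target table

    def quote(value):
        # Naive quoting, good enough for a sketch; use parameterized
        # inserts through a driver for real data.
        return "'%s'" % (value or "").replace("'", "''")

    for row in root.findall("fmp:RESULTSET/fmp:ROW", NS):
        values = [quote(col.findtext("fmp:DATA", "", NS))
                  for col in row.findall("fmp:COL", NS)]
        print("INSERT INTO %s (%s) VALUES (%s);"
              % (table, ", ".join(fields), ", ".join(values)))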
Which version of MySQL? v6 has LOAD XML, which will make things easy for you.
If not v6, then you are dealing with stored procedures, which can be a pain. If you need v5, it might make sense to install MySQL 6, get the data in there using LOAD XML, and then do a mysqldump, which you can import into v5.
Here is a good link:
http://dev.mysql.com/tech-resources/articles/xml-in-mysql5.1-6.0.html
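If you go the v6 route, the statement itself is short; here it is wrapped in a small Python sketch. The connection details and the items table are assumptions, and note that LOAD XML only understands a few specific row layouts, so a raw FileMaker export may first need the kind of transformation described in the previous answer:

    # Sketch: run LOAD XML from Python. Assumes local-infile is allowed
    # and rows.xml uses a layout LOAD XML understands, e.g.
    # <row><name>...</name><price>...</price></row>.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="root",
                                   password="secret", database="imported",
                                   allow_local_infile=True)
    cur = conn.cursor()
    cur.execute("""
        LOAD XML LOCAL INFILE 'rows.xml'
        INTO TABLE items
        ROWS IDENTIFIED BY '<row>'
    """)
    conn.commit()
    conn.close()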
I'd love to do this:
UPDATE table SET blobCol = HTTPGET(urlCol) WHERE whatever LIMIT n;
Is there code available to do this? I know this should be possible, as the MySQL docs include an example of adding a function that does a DNS lookup.
MySQL / Windows / preferably without having to compile stuff, but I can.
(If you haven't heard of anything like this but would expect to have if it did exist, a "proly not" would be nice.)
EDIT: I know this would open a whole can o' worms re security; however, in my case, the only access to the DB is via the MySQL console app. It is not a world-accessible system. It is not a web back end. It is only a local data-logging system.
No, thank goodness — it would be a security horror. Every SQL injection hole in an application could be leveraged to start spamming connections to attack other sites.
You could, I suppose, write it in C and compile it as a UDF. But I don't think it really gets you anything in comparison to just SELECTing in your application layer and looping over the results doing HTTP GETs and UPDATEing. If we're talking about making HTTP connections, the extra efficiency of doing it in the database layer will be completely dwarfed by the network delays anyway.
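To make the app-layer version concrete, here is a sketch; the connection details and table name are assumptions, and the column names mirror the blobCol/urlCol from the question:

    # App-layer equivalent of the hypothetical
    # UPDATE table SET blobCol = HTTPGET(urlCol): select, fetch, update.
    import urllib.request
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="logger",
                                   password="secret", database="logdb")
    cur = conn.cursor()
    cur.execute("SELECT id, urlCol FROM myTable "
                "WHERE blobCol IS NULL LIMIT 100")

    for row_id, url in cur.fetchall():
        try:
            body = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue  # skip unreachable URLs; a real system would log
        upd = conn.cursor()
        upd.execute("UPDATE myTable SET blobCol = %s WHERE id = %s",
                    (body, row_id))
        upd.close()

    conn.commit()
    conn.close()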
I don't know of any function like that as part of MySQL.
Are you just trying to retrieve HTML data from many URLs?
An alternative solution might be to use Google Spreadsheets' importHtml function (see "Google Spreadsheets Lets You Import Online Data").
Proly not. Best practice in a web environment is to have database servers isolated from the outside in both directions, meaning the DB server wouldn't be allowed to fetch stuff from the internet.
Proly not.
If you're absolutely determined to get web content from within an SQL environment, there are, as far as I know, two possibilities:
Write a custom MySQL UDF in C (as bobince mentioned). This could potentially be a huge job, depending on your experience with C, how much security you want, and how complete you want the UDF to be: e.g. just GET requests? How about POST? HEAD? etc.
Use a different database which can do this. If you're happy with SQL, you could probably do this with PostgreSQL and one of the snap-in languages such as Python or PHP.
If you're not too fussed about sticking with SQL, you could use something like eXist. You can do this type of thing relatively easily with XQuery, and you'd benefit from being able to easily modify the results to fit your schema (rather than just lumping them into a blob field) or store the page as-is as an XHTML doc in the DB.
Then you can run queries very quickly across all documents to, for instance, get all the links or quotes or whatever. You could even apply XSL to such a result with very little extra work. Great if you're storing the pages for reference and want to adapt the results into a personal "intranet"-style app.
Also, since eXist is document-centric, it has lots of great methods for fuzzy-text searching and near-word searching, and it has a great full-text index (much better than MySQL's). Perfect if you're after doing some data mining on the content, e.g.: find me all documents where a word like "burger" appears within 50 words of "hotdog" and isn't in a UL list. Try doing that natively in MySQL!
As an aside, and with no malice intended: I often wonder why eXist is overlooked when people build CMSs. It's a database that can store content in its native format (XML, or its subset, (X)HTML), query it with ease in its native format, and translate it from its native format with a powerful templating language that looks and acts like its native format. Sometimes SQL is just plain wrong for the job!
Sorry. Didn't mean to waffle! :-$