I am interested in XML. I know it from Google's CSE.
It is often a pain for me to manipulate 3000-rows XML files.
This raises a question.
Why does Google use XML, not MySQL, such that I need to manipulate large XML -files?
XML has at least these advantages over SQL for data interchange purposes:
It's self-describing, you don't need to have any additional information to parse it.
It's a true standard, universally interoperable.
You aren't limited to tabular-oriented data: you can also use it to model hierarchies, for instance.
Probably the best you can do with SQL is ship tables in source code form, ie, as CREATE TABLE statements followed by a lot of INSERT statements. This is fine if you have a compatible database, but since SQL never really crystallized as a standard, interoperability at this level is very poor, and Google would have to offer multiple dialects (perhaps even for incompatible versions of the same DBMS).
XML is mostly human readable and cross platform. How would google send you data from just MYSql? Would you expect them to send you a binary blob that assumes you have the proper database to insert it into? How would you use that blob if MYSql wasn't installed, or a different version of MYSql was installed on your machine than on google?
XML is often uses as a transport format between systems. In CSE I would guess that google is transferring a lot of data from them to you in a format that many systems can use. If they used MySQL it would be no use to me as I don't know anything about it. However, pretty much most modern software frameworks can work with XML.
ADDITIONAL
Also, CSE (Customised Search Engine) probably expects that you don't need to do a lot of manipulation to the XML, just transform if for rendering to a web page. You can very easily perform an XSLT (Extensible Stylesheet Language Transformation) to an XML file to transform it in to an HTML fragment to use on your website.
MySQL is a specific SQL database engine. One not very suitable for providing the backend for the very very large dataset and special special needs that a search engine like google have.
I'm sure you can dig up info on how google's infrastructore, e.g. starting here
Relying on and exposing something specific like MySQL is not something you want to do when exchanging data over the internet.
XML on the other hand, being a general and textual markup language is ideal when you need
to interface and exchange data between systems. Thus it provides an ideal way to interface services such as Google CSE. You don't need to care about the specific implementation google have to provide the data, and Google don't need to care about the specific technology you use to manipulate the data
In addition to #Jared, there are XML databases. If the data is stored in XML, then it can be queried, transformed into html on the fly, or used in applications without the need for wrapping the data.
Why does Google use XML, not MySQL, such that I need to manipulate large XML -files?
access time, because there is no security check routine in DOM level on the accesed/open port /-:
Related
I have implemented RDFa on a shopping website.
Now, how to create triple store using those structured data?
There are thousands of products in the website. So, manually visiting each and every page and extracting RDF is not a good solution. Is there any automatic tools for this?
The answer depends on how you "implemented RDFa". It is unlikely that the majority of your content is expressed as static information, so it is also unlikely that the majority of your content requires scraping.
There are tools, such as D2R Server, that give you facilities for exposing your underlying datastore as a read-only SPARQL endpoint. The only trick will be if you do have static content and wish to expose that as automatically generated RDF as well. That would require some finessing.
The data which is in RDFa format on your website probably comes from a database, where it is in relational form, since you probably didn't add the RDF triples to the HTML manually. So the easiest way to get the data into the triple store would not be from the HTML, but by some kind of transformation of the original data in the database. In the end, RDF triples can be seen as a ternary relation that can well be stored in any relational database.
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a way of using XSLT to extract the RDF triples from the HTML, in case you do not have access to a relational database that stores the data. Hope this helps.
I'm writing a crawler which needs to get data from many websites. The problem is that every website has different structure. How can I easily write a crawler which downloads (correctly) data from (many) different websites? If the structure of a website will change will I need to rewrite the crawler, or are there other methods?
What logical and implemented tools can be used to improve the quality of data mined by an automatic web-crawler (many websites are involved with different structure)?
Thank You!
I presume you want to query it is some way, in which case you should store the data in a flexible data store. A relational database would not be fit for purpose as it has a strict schema, but something like mongodb which lets you store semi structured data without having to define a schema up front, but still provides a powerful query language.
The same goes for how you represent the data in the crawler code. Don't map the data to classes where the structure is defined up front, but use a flexible data structures that can change at runtime. If you are using Java then de-serialise the data into HashMaps. In other languages this might be called Dictionaries or Hashes.
If you're scraping data from websites that actually want to allow you to do that, chances are they will provide some sort of webservice to allow you to query their data in a structured way.
Otherwise, you're on your own, and you might even be violating their terms of use.
If the websites provide no APIs, then you're out cold and you have to write separate extraction module for each data format you're encountering. If the website changes the format, then you have to update your format module. A standard thing to do is to have plugins for every website you're crawling and have a testing framework which does regression testing with data you've already collected. When a test fails you know something went wrong and you can investigate whether you have to update your format plugin or if there is another issue.
Without knowing what kind of data you're collecting it will be very difficult to try to hypothesize about ways to improve the "quality" of the data that was mined.
Maybe you could find out whether the website allows you to access the data like API, if so, you could use this kind of structured data to your website directly. If not, you may need plugins for that. Or you could turn to other web crawlers with API access like Octoparse, to find the way to access their API to your own web crawler.
So I'm starting to learn XML. It seems like a simple flat file data system of which you can view output by using a server side language of your choice and some parsing. I don't really see the benefit to using XML over storing values in a database and doing the same kind of parsing. I mean it would seem that databases would be faster.
So what can you really do with XML that you can't/shouldn't do with a database? Is XML really that useful?
So what can you really do with XML that you can't/shouldn't do with a database? Is XML really that useful?
XML is an interchange format first and foremost. It allows you to transport structured data between programs, servers, or people, and retain a common parser and schema system.
XML of course can be horribly misused or overused.
This question is to broad (i.e. there are too many aspects in which they differ), yet main reason for XML is not even about data storage. It was designed as ultimate common platform for data exchange with defined rules how data is organised. Thus you can read/write valid XML on almost every platfrom and language.
XML is designed to be more human readable. XML can be opened easily in a text editor and read. Some XML readers can support folding, which also helps with getting a hierarchical organization to your data.
If you're processing files that's a different story. I think databases often have the option of exporting to XML.
You can carry your datas from one type database to another (example from MS-SQL to MySQL) by using XML.
Or sending datas from an application to another, which is used on many web applications.
I think it can be very useful for this.
I think it is comparison of apples to oranges...
There are a lot of usages of XML but it is not primarily used for storing data. It is very loosely coupled data structure when compared to databases.
One of the many usages of XML, which I encounter with very frequently is exchanging data from one program to another. Because it is very simple format one can create an XML file in Java program and other can parse(read) the xml file in VB/C#/Python/Cocoa or any other language.
One such use of XML is Webservices where client programs can call(Execute) code residing on servers, where requests and response both are in XML.
So one can say that strong feature of XML is interoperability.
On the other had databases are mainly used for storing and retrieving data, databases are extremely powerful to do fast retrieval/insertion of values in tables where XML will immensely fail because most of the time XMLs have to be read serially as oppose to tables residing in databases.
XML can contain highly complex tree data structures that cannot be easily represented in relational databases.
XML is also useful for representing documents (Word docs for example or HTML).
The thing that's so appealing about XML is that it is quite simple to create.
Python is a great language for converting text files into XML for example.
XML vs databases is a false dichotomy, because you can store XML in databases. Though it's true that a simple XML document can sometimes be used for an application that would otherwise have needed a database.
If you're dealing with documents (like articles in technical journals) then your only real choice is between XML and some proprietary equivalent. This of course is the problem that XML was originally invented to solve.
XML is also used extensively for data messaging. It supplanted EDI and ASN.1 in this role because it can handle all the complex data that EDI and ASN.1 can handle, but is itself much simpler. More recently we've seen JSON taking over some of this role, especially for "private" (as distinct from standardised) protocols, because JSON is simpler still, and works better with general-purpose programming languages.
XML, like any successful technology, has also been used extensively for problems where it isn't really needed. That's not a misuse, any more than it is a misuse of this forum to send a plain text message in a field that is capable of holding richly formatted text, or to ride my bicycle on a road that's engineered to take 40ton lorries: once the technology is in place, you might as well use it.
I am developing a web application right now, where the user interacts with desktop software that exports files as XML.
I was curious if there was a way to take the data from the XML file and insert that data into a mySQL database using ColdFusion?
Of course you can, ColdFusion provides powerful tools for handling XML.
Typically you'll need to parse XML file into the XML document object with XmlParse and search through it using XPath language with XmlSearch. Fetched data you can easily use for inserting into the database or any other manipulations.
Please note that there are more useful XML functions present, for example you may be interested in validation XML before parsing it.
If you'll need help for specific situations -- please extend your question or ask another one.
If you are working with XML documents that fit into memory when parsed, #Sergii's answer is the right way to go. On the other hand, XML being verbose as it is, and ColdFusion's using a DOM XML parser, can easily lead to Out of Memory errors.
In that situation, given MySQL and ColdFusion, I see two alternative paths. One is exporting the data from the desktop application as CSV, if possible. Then use MySQL's LOAD DATA INFILE, which you can call from ColdFusion to import the data. This is probably the fastest performance.
If you cannot change the desktop application's export format, consider using a Java StAX parser instead. See my answer from another question for an example of how to do this with ColdFusion. This has the advantage of only pulling in part of the XML document into memory at any given time, but is somewhat more difficult to work with than a DOM parser. As such you will not get OOM errors.
Note, there is a third type of parser available as well from Java - SAX - that has the same quality as a StAX parser of not loading the whole document into memory. However, it's a more difficult approach IMO to work with, thus the StAX recommendation.
I'd love to do this:
UPDATE table SET blobCol = HTTPGET(urlCol) WHERE whatever LIMIT n;
Is there code available to do this? I known this should be possible as the MySQL Docs include an example of adding a function that does a DNS lookup.
MySQL / windows / Preferably without having to compile stuff, but I can.
(If you haven't heard of anything like this but you would expect that you would have if it did exist, A "proly not" would be nice.)
EDIT: I known this would open a whole can-o-worms re security, however in my cases, the only access to the DB is via the mysql console app. Its is not a world accessible system. It is not a web back end. It is only a local data logging system
No, thank goodness — it would be a security horror. Every SQL injection hole in an application could be leveraged to start spamming connections to attack other sites.
You could, I suppose, write it in C and compile it as a UDF. But I don't think it really gets you anything in comparison to just SELECTing in your application layer and looping over the results doing HTTP GETs and UPDATEing. If we're talking about making HTTP connections, the extra efficiency of doing it in the database layer will be completely dwarfed by the network delays anyway.
I don't know of any function like that as part of MySQL.
Are you just trying to retreive HTML data from many URLs?
An alternative solution might be to use Google spreadsheet's importHtml function.
Google Spreadsheets Lets You Import Online Data
Proly not. Best practises in a web-enviroment is to have database-servers isolated from the outside, both ways, meaning that the db-server wouldn't be allowed to fetch stuff from the internet.
Proly not.
If you're absolutely determined to get web content from within an SQL environ, there are as far as I know two possibilities:
Write a custom MySQL UDF in C (as bobince mentioned). The could potentially be a huge job, depending on your experience of C, how much security you want, how complete you want the UDF to be: eg. Just GET requests? How about POST? HEAD? etc.
Use a different database which can do this. If you're happy with SQL you could probably do this with PostgreSQL and one of the snap-in languages such as Python or PHP.
If you're not too fussed about sticking with SQL you could use something like eXist. You can do this type of thing relatively easily with XQuery, and would benefit from being able to easily modify the results to fit your schema (rather than just lumping it into a blob field) or store the page "as is" as an xhtml doc in the DB.
Then you can run queries very quickly across all documents to, for instance, get all the links or quotes or whatever. You could even apply XSL to such a result with very little extra work. Great if you're storing the pages for reference and want to adapt the results into a personal "intranet"-style app.
Also since eXist is document-centric it has lots of great methods for fuzzy-text searching, near-word searching, and has a great full-text index (much better than MySQL's). Perfect if you're after doing some data-mining on the content, eg: find me all documents where a word like "burger" within 50 words of "hotdog" where the word isn't in a UL list. Try doing that native in MySQL!
As an aside, and with no malice intended; I often wonder why eXist is over-looked when people build CMSs. Its a database that can store content in its native format (XML, or its subset (x)HTML), query it with ease in its native format, and can translate it from its native format with a powerful templating language which looks and acts like its native format. Sometimes SQL is just plain wrong for the job!
Sorry. Didn't mean to waffle! :-$