How do I index HTML files into Apache Solr?

By default Solr accepts XML files, but I want to perform search on millions of crawled URLs (HTML).

As a first step, I would usually recommend rolling your own application using SolrJ or similar to handle the indexing, rather than doing it directly with the DataImportHandler.
Just write your application and have it output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to get at some of the data inside that page, such as <title>, and index it to a different field. An alternative is to use HTMLStripTransformer on one of your fields to make sure it strips HTML out of anything that you send to that field.
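Purely as an illustration of that flow (strip the HTML yourself, pull <title> into its own field, index one document per URL), here is a minimal sketch. It uses Python with pysolr and BeautifulSoup rather than SolrJ, and the Solr URL, core name and field names are assumptions that would need to match your setup:

# Rough sketch of the "roll your own indexer" idea, using pysolr and
# BeautifulSoup instead of SolrJ; URL, core and field names are assumptions.
import pysolr
from bs4 import BeautifulSoup

solr = pysolr.Solr('http://localhost:8983/solr/mycore', always_commit=True)

def index_page(url, raw_html):
    soup = BeautifulSoup(raw_html, 'html.parser')
    doc = {
        'id': url,                                            # use the URL as the unique key
        'title': soup.title.string if soup.title else '',     # index <title> in its own field
        'content': soup.get_text(separator=' ', strip=True),  # strip HTML yourself for full control
    }
    solr.add([doc])

index_page('http://example.com/', '<html><head><title>Example</title></head><body>Hello</body></html>')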
How are you crawling all this data? If you're using something like Apache Nutch, it should already take care of most of this for you, allowing you to just plug in the connection details of your Solr server.

Solr Cell can accept HTML files and index them for full-text search: http://wiki.apache.org/solr/ExtractingRequestHandler
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"

You can index downloaded HTML files with Solr very well.
This is the fastest way I found to do my indexing:
curl "http://localhost:8080/solr/update/extract?stream.file=/home/index.html&literal.id=www.google.com"
Here stream.file is the local path of your HTML file and literal.id is the URL that index.html was downloaded from.

Related

Setting up StormCrawler and Elasticsearch to crawl our website's HTML files and PDF documents

We are using StormCrawler and Elasticsearch to crawl our website. We followed the documentation for using Elasticsearch with StormCrawler. When we search in Kibana, we do get back HTML file results, but not the PDF files' content or links. How do we set up StormCrawler to crawl our website's HTML and PDF file content and store it in Elasticsearch? What configuration changes do we need to make? Does this have something to do with outlink settings? Is there documentation that tells us how to set up StormCrawler and Elasticsearch to crawl HTML and PDF documents?
You are probably looking at the 'content' index in Kibana but should also look at the 'status' index, the latter should contain PDF docs. A quick look at the logs would have also told you that the PDFs are getting fetched but that the parser is skipping them. The status index contains an ERROR status and a message mentioning 'content-type checking'.
So, how do you fix it? Just add the Tika module as a Maven dependency and follow the steps in its README; this way the PDF docs will get redirected to the Tika Parsing Bolt, which is able to extract text and metadata from them. They should then be indexed correctly into the 'content' index.

db.json file is created and added to .gitignore using hexo.io

I have been trying to find out what a db.json file is and why it is being automatically generated. All the hexo.io documentation says is:
$ hexo clean
Cleans the cache file (db.json) and generated files (public).
What is this exactly? Since these are all static pages, is this some sort of makeshift database?
Most commonly, db.json is used when you're running a server with hexo server. I believe it's there for performance improvements. It doesn't affect generation (hexo generate) or deployment (hexo deploy).
The db.json file stores all the data needed to generate your site: all the posts, tags, categories, etc. The data is stored as a JSON-formatted string so it's easier and faster to parse the data and generate the site.

Indexing flat XML files in Elasticsearch

I'm working on a project where data provided by external providers is to be indexed in our Elasticsearch engine.
The data is provided as XML flat files.
The idea here is to script something that reads each file, parses it, and issues as many HTTP POSTs as needed for each one of them.
Is there a simpler way to do this? Something like uploading the XML file so it gets indexed automatically, without any script?
You can use Logstash with an xml filter to do this. It takes a bit of work to get set up the first time, but it's the most straightforward way to do it.
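If you do end up scripting it yourself instead, a minimal sketch of the "parse each file and POST" approach the question describes could look like the following; the index name, the field mapping and the Elasticsearch URL are all assumptions, and the _doc endpoint assumes Elasticsearch 7 or later:

# Sketch: walk a folder of flat XML files, parse each one, and POST one JSON
# document per file to Elasticsearch. Paths, index and field names are assumptions.
import glob
import xml.etree.ElementTree as ET
import requests

ES_URL = 'http://localhost:9200/provider-data/_doc'   # _doc assumes Elasticsearch 7+

for path in glob.glob('/data/incoming/*.xml'):         # hypothetical drop folder
    root = ET.parse(path).getroot()
    doc = {child.tag: child.text for child in root}    # flat XML: one field per child element
    requests.post(ES_URL, json=doc).raise_for_status()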

How to handle uploading HTML content to an App Engine application?

I would like to allow my users to upload HTML content to my App Engine web app. However, if I am using the Blobstore to upload all the files (HTML files, CSS files, images, etc.), this causes a problem, as all the links to other files (pages, resources) will not work.
I see two possibilities, but both of them are not very pretty and I would like to avoid using them:
Go over all the links in the html files and change them to the relevant blob key.
Save a mapping between a file and a blob key, catch all the redirections and serve the blobs (could cause problems with same name files).
How can I solve this elegantly without having to go over and change my user's files?
Because App Engine runs your content on multiple servers, you are not able to write to the filesystem. What you could do is ask them to upload a zip file containing their HTML, CSS, JS, images, etc. The zipfile module from Python is available on App Engine, so you can unzip these files and store them individually. This way, you know the directory structure of the zip, which allows you to create a mapping of relative paths to the content in the blobstore. I don't have enough experience with zipfile to write a full example here; I hope someone more experienced can edit my answer, or create a new one with an example.
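Since the answer asks for one, here is a rough sketch of that unzip-and-store idea on the old Python 2 App Engine SDK. The SiteFile model and the site_id grouping are assumptions for illustration, and note that datastore entities are capped at about 1 MB, so very large files would still belong in the Blobstore:

# Rough sketch (Python 2 App Engine SDK): unpack an uploaded zip and store each
# entry keyed by its relative path. Model name and grouping are assumptions.
import mimetypes
import zipfile
from StringIO import StringIO
from google.appengine.ext import db

class SiteFile(db.Model):
    content = db.BlobProperty()
    content_type = db.StringProperty()

def store_zip(site_id, zip_bytes):
    archive = zipfile.ZipFile(StringIO(zip_bytes))
    for name in archive.namelist():
        if name.endswith('/'):
            continue  # skip directory entries
        SiteFile(
            key_name='%s/%s' % (site_id, name),  # key name preserves the zip's directory structure
            content=archive.read(name),
            content_type=mimetypes.guess_type(name)[0] or 'application/octet-stream',
        ).put()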
Saving a mapping is the best option here. You'll need to identify a group of files in some way, since multiple users may upload a file with the same name, then associate unique pathnames with each file in that group. You can use key names to make it a simple datastore get to find the blob associated with a given path. No redirects are required - just use the standard Blobstore serving approach of setting the blobstore header to have App Engine serve the blob to the user.
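A sketch of what that mapping-and-serving approach could look like on the old Python 2 runtime; the model, the key-name scheme and the URL layout are assumptions, the fixed parts being the key-name lookup and send_blob:

# Sketch: look up the blob key for a requested path with a single datastore get
# and let App Engine serve the blob. Model and URL scheme are assumptions.
from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp import blobstore_handlers

class SiteBlob(db.Model):
    blob_key = blobstore.BlobReferenceProperty()

class ServeHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, site_id, path):
        mapping = SiteBlob.get_by_key_name('%s/%s' % (site_id, path))  # simple get, no redirect
        if mapping is None:
            self.error(404)
            return
        self.send_blob(mapping.blob_key)  # App Engine streams the blob to the user

app = webapp.WSGIApplication([(r'/sites/([^/]+)/(.*)', ServeHandler)])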
Another option is to upload a zip, as Frederik suggests. There's no need to unpack and store the files individually, though - you can serve them directly out of the zip in blobstore, as this demo app does.

Query a MySQL database while indexing Solr documents

I need to update my Solr documents with detailed information I can grab from a MySQL database.
Example:
solr field "city" --> "London" (read from an xml source with post.jar tool)
on update time (requestHandler /update already configured with custom plugin to do other stuff) solr should query mysql for more information about "London" (or whatever just read)
solr updates the fields of that document with the query result
I've been trying with a JDBC plugin and with a DIH handler (which I can only use by calling /dataimport/full-import... and I can't in my specific case), and so far no success :(
Have any of you had the same problem? How did you solve it? Thanks!
Edit: I forgot, for the DIH configuration I tried following this guide: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Please do include the full output of /dataimport/full-import when you access it in your browser. Solr error messages can get cryptic.
Have you considered uploading documents by XML? http://wiki.apache.org/solr/UpdateXmlMessages . It's more powerful, allowing you to use your own logic when uploading documents.
Read each row from SQL and compose an XML document (string), with each document under <doc> tags.
Post the entire XML string to /update. Don't forget to set the MIME type header to text/xml. And make sure to check your servlet container's (Tomcat, Jetty) upload limit on POSTs (Tomcat has a 2 MB limit, if I recall right).
Don't forget the commit and optimize commands.
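As a rough sketch of those steps: read the rows from MySQL, build the <add> message, POST it with a text/xml content type, then send the commit. It uses pymysql and requests, which the answer doesn't mention, and the connection details, table and field names are assumptions:

# Rough sketch: MySQL rows -> Solr <add> XML -> POST to /update, then commit.
# Connection details, table and field names are assumptions.
from xml.sax.saxutils import escape
import pymysql
import requests

SOLR_UPDATE = 'http://localhost:8983/solr/update'
HEADERS = {'Content-Type': 'text/xml'}

conn = pymysql.connect(host='localhost', user='solr', password='secret', database='geo')
with conn.cursor() as cur:
    cur.execute('SELECT name, country, population FROM city_details')
    docs = []
    for name, country, population in cur.fetchall():
        docs.append(
            '<doc>'
            '<field name="city">%s</field>'
            '<field name="country">%s</field>'
            '<field name="population">%s</field>'
            '</doc>' % (escape(name), escape(country), population)
        )

payload = '<add>%s</add>' % ''.join(docs)
requests.post(SOLR_UPDATE, data=payload.encode('utf-8'), headers=HEADERS).raise_for_status()
requests.post(SOLR_UPDATE, data='<commit/>', headers=HEADERS).raise_for_status()  # then <optimize/> if needed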