Wikipedia Extract of Aliases

I know it's possible to extract Wikipedia content via a dump. However, is it also possible to extract the search aliases as well?
For instance, that "obama" is an alias of "Barack Obama"?

You can find the data you're looking for (in RDF format) in the redirects datasets that were extracted from Wikipedia by DBpedia.
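As a rough illustration, here is a minimal sketch in Python (using `requests`) that asks DBpedia's public SPARQL endpoint for the pages redirecting to Barack Obama via the `dbo:wikiPageRedirects` property; those redirect titles ("Obama", etc.) are the aliases. For bulk work you would download the redirects dump instead of hitting the live endpoint.

```python
# Query DBpedia for all pages that redirect to Barack_Obama.
# The redirect pages act as the "aliases" of the target article.
import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?alias WHERE {
  ?alias dbo:wikiPageRedirects dbr:Barack_Obama .
}
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()

for binding in resp.json()["results"]["bindings"]:
    # Each result is the URI of a redirect page, e.g. .../resource/Obama
    print(binding["alias"]["value"])
```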

Related

Kibana Searches Using CSV file

I want to search Kibana against the contents of a CSV file.
I have a large set of logs that I want to search for a specific parameter (destination IP), and I want to check that parameter against a CSV file containing all the IPs.
Can anybody point me to relevant documentation that might help me get the required output?
How many IPs do you have in your CSV file? In general it sounds like a terms query, which is limited to about 65K terms by default.
If it's a long list, it may be simpler to do this programmatically against Elasticsearch directly than to paste around large queries with many elements.
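As a sketch of the programmatic route, something like the following Python script reads the IPs from the CSV and issues a single terms query over HTTP. The index name `logs`, the field name `destination_ip`, and the CSV layout are assumptions you would replace with your own.

```python
# Read IPs from a CSV and run one Elasticsearch terms query against them.
import csv
import requests

# Assumption: one IP per row in the first column of ips.csv.
with open("ips.csv", newline="") as f:
    ips = [row[0].strip() for row in csv.reader(f) if row]

# Assumption: the index is called "logs" and the field is "destination_ip".
query = {
    "query": {
        "terms": {
            "destination_ip": ips  # terms queries are capped (~65K terms by default)
        }
    },
    "size": 100,
}

resp = requests.post("http://localhost:9200/logs/_search", json=query, timeout=30)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```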

How can I publish CSV data as Linked Data on the Web?

My work is mainly focused on converting CSV data to the RDF data format. After getting the RDF data, I need to publish it as Linked Data on the web. I want to convert the CSV data to RDF myself using Java, and then publish that RDF data as Linked Data on the web using any available tools. Can anyone help me find a way to do this, or give me any suggestions or references? Which tools should I use for this work? Thanks
You can publish your RDF in a variety of ways. Here is a common reference that explains the steps, software tools, and examples: http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
In a nutshell, once you have your RDF data, you should think about the following:
1) Which tool/set of tools do I want to use to store my RDF data? For instance, I commonly use Virtuoso because I can use it for free and it facilitates the creation of the endpoint. But you can use Jena TDB, Allegro Graph, or many other triple stores.
2) Which tool do I use to make my data dereferenceable? For example, I use Pubby because I can configure it easily. But you can use Jena TDB (for the previous step) + Fuseki + Snorql for the same purpose. See the reference above for more information on the links and features of each tool.
3) Which datasets should I link to? (i.e., which data from other datasets do I reference, in order to make my dataset part of the Linked Data cloud?)
4) How should I link to these datasets? For example, the SILK framework can be used to analyze which of the URIs of your dataset are owl:sameAs other URIs in the target dataset of your choice.
Many people just publish their RDF in their endpoints, without linking it to other datasets. Although this follows the Linked Data principles (http://www.w3.org/DesignIssues/LinkedData.html), it is always better to link to other existing URIs when possible.
This is a short summary, assuming you already have the RDF data created. I hope it helps.
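For the CSV-to-RDF step that the asker plans to write in Java (e.g. with Apache Jena), here is the same idea as a minimal Python/rdflib sketch, just to show the shape of the conversion. The column names, the namespace, and the output file are made up for illustration.

```python
# Convert rows of a CSV into RDF triples and serialize them as Turtle.
import csv
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")  # assumption: your own namespace
g = Graph()
g.bind("ex", EX)

# Assumption: people.csv has columns id,name,birth_year
with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        person = URIRef(EX + "person/" + row["id"])
        g.add((person, RDF.type, EX.Person))
        g.add((person, EX.name, Literal(row["name"])))
        g.add((person, EX.birthYear, Literal(int(row["birth_year"]))))

g.serialize(destination="people.ttl", format="turtle")
```

The resulting Turtle file is what you would then load into the triple store chosen in step 1 above.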
You can use Tarql (https://tarql.github.io/) or, if you want to do more advanced mapping, SparqlMap (http://aksw.org/Projects/SparqlMap).
In both cases you will end up with a SPARQL endpoint that you can make available online so people can query your data.
Making each data item available under its own URL is a very good idea, following the Linked Data principles mentioned by @daniel-garijo in the other answer: http://www.w3.org/DesignIssues/LinkedData.html.
So you can also publish the data items, with all their properties, in individual files.

Best method to scrape large number of Wikipedia tables to MySQL database

What would be the best programmatic way to grab all the HTML tables of Wikipedia main article pages where the pages' titles match certain keywords? Then I would like to take the column names and table data and put them into a database.
I would also grab the URL and page name for attribution.
I don't need specifics, just some recommended methods or perhaps links to some tutorials.
The easy approach to this is not to scrape the Wikipedia website at all. All of the data, metadata, and associated media that form Wikipedia are available in structured formats, which precludes any need to scrape their web pages.
To get the data from Wikipedia into your database (which you may then search, slice, and dice to your heart's content):
Download the data files.
Run the SQLize tool of your choice
Run mysqlimport
Drink a coffee.
The URL of the original article can be reconstructed from the page title pretty easily.
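For example, a minimal sketch of that reconstruction (MediaWiki article URLs replace spaces with underscores, with the rest percent-encoded):

```python
from urllib.parse import quote

def wikipedia_url(title: str) -> str:
    # Wikipedia article URLs use underscores instead of spaces,
    # with any remaining special characters percent-encoded.
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

print(wikipedia_url("Barack Obama"))
# https://en.wikipedia.org/wiki/Barack_Obama
```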

Get data from Wiki to CSV file / database

What is the easiest way to get some data from Wikipedia? I would like to get it as a CSV file.
Basically, the data I would like to get is just a list of names. For example, all the British actors' names from this page: http://en.wikipedia.org/wiki/List_of_British_actors_and_actresses
(Everything from A-Z; the names alone would be enough.)
Is this possible? Also, this would only be done once, so there's no need for caching or anything like that. Just a simple one-off data grab, but I have no clue how to do it.
PHP, JS, jQuery, or JSON would be nice. No Java or anything like that!
Have a look at DBpedia and Google Refine. IIRC, Google Refine had an example of extracting and cleaning data from Wikipedia (see the video tutorial), and DBpedia is already a database copy of Wikipedia.
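The asker prefers PHP/JS, but to make the DBpedia route concrete, here is a short Python sketch that pulls the pages linked from that list article via DBpedia's SPARQL endpoint and writes their English labels to a CSV; the same query can be sent from PHP or jQuery as a plain HTTP GET. Treat the exact resource name and the amount of non-actor noise among the page links as assumptions to check.

```python
# Fetch everything the list page links to from DBpedia and save the labels as CSV.
import csv
import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name WHERE {
  dbr:List_of_British_actors_and_actresses dbo:wikiPageWikiLink ?person .
  ?person rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=60,
)
resp.raise_for_status()

with open("british_actors.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name"])
    for b in resp.json()["results"]["bindings"]:
        writer.writerow([b["name"]["value"]])
```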

How do I add HTML and text files to a Sphinx index?

From Sphinx reference manual: «The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on»
But I can't find how to add text files and HTML files to the index. The quick Sphinx usage tour shows the setup for a MySQL database only.
How can I do this?
You should look at the xmlpipe2 data source.
From the manual:
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx in yet another custom XML format. It also allows to specify the schema (ie. the set of fields and attributes) either in the XML stream itself, or in the source settings.
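As a rough sketch of how that can look in practice, here is a small Python script that walks a folder of .txt/.html files and prints an xmlpipe2 stream on stdout; you would point an xmlpipe2 source at it in sphinx.conf. The field names, the folder path, and the config snippet in the comments are only examples, not the one true layout.

```python
#!/usr/bin/env python3
# Emit a Sphinx xmlpipe2 stream for every .txt/.html file under ./docs.
# Point an xmlpipe2 source at this script in sphinx.conf, roughly:
#   source files {
#       type            = xmlpipe2
#       xmlpipe_command = python3 /path/to/emit_xmlpipe2.py
#   }
import pathlib
from xml.sax.saxutils import escape

print('<?xml version="1.0" encoding="utf-8"?>')
print("<sphinx:docset>")
print("<sphinx:schema>")
print('<sphinx:field name="title"/>')    # field names are just examples
print('<sphinx:field name="content"/>')
print("</sphinx:schema>")

for doc_id, path in enumerate(sorted(pathlib.Path("docs").rglob("*")), start=1):
    if path.suffix.lower() not in (".txt", ".html"):
        continue
    text = path.read_text(encoding="utf-8", errors="replace")
    print(f'<sphinx:document id="{doc_id}">')
    print(f"<title>{escape(path.stem)}</title>")
    print(f"<content>{escape(text)}</content>")
    print("</sphinx:document>")

print("</sphinx:docset>")
```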
I would suggest that you insert the texts into a database. That way you can retrieve them, and probably highlight your search results, much more easily and quickly.