Quick implementation for very large indexed text search? - mysql

I have a single text file that is about 500GB (i.e. a very large log file) and would like to build an implementation to search it quickly.
So far I have created my own inverted index with a SQLite database, but this doesn't scale well enough.
Can anyone suggest a fairly simple implementation that would allow quick searching of this massive document?
I have looked at Solr and Lucene, but these look too complicated for a quick solution. I'm thinking a database with built-in full-text indexing (MySQL, Raven, Mongo, etc.) may be the simplest solution, but I have no experience with this.

Since you are looking at text processing for log files, I'd take a close look at the Elasticsearch, Logstash and Kibana (ELK) stack. Elasticsearch provides the Lucene-based text search, Logstash parses and loads the log file into Elasticsearch, and Kibana provides a visualization and query tool for searching and analyzing the data.
This is a good webinar on the ELK stack by one of their trainers: http://www.elasticsearch.org/webinars/elk-stack-devops-environment/
As an experienced MongoDB, Solr and Elasticsearch user, I was impressed by how easy it was to get all three components up and running to analyze log data. It also has a robust user community, both here on Stack Overflow and elsewhere.
You can download it here: http://www.elasticsearch.org/overview/elkdownloads/
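To give a feel for what you end up with once Logstash has loaded the log lines into Elasticsearch, here is a minimal sketch of indexing a document and running a full-text query by hand with curl. The index name "logs", type "line" and field "message" are made-up placeholders, and this assumes an Elasticsearch of that era running locally on the default port 9200; in a real ELK setup Logstash does the indexing for you.
# Index one log line.
curl -XPUT 'http://localhost:9200/logs/line/1' -d '{"message": "ERROR disk full on /dev/sda1"}'
# Full-text search across the indexed lines for the term "disk".
curl -XGET 'http://localhost:9200/logs/_search?q=message:disk'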

Convert the log file to CSV, then import the CSV into MySQL, MongoDB, etc.
MongoDB:
For help:
mongoimport --help
JSON file:
mongoimport --db db --collection collection --file collection.json
CSV file:
mongoimport --db db --collection collection --type csv --headerline --file collection.csv
Use the --ignoreBlanks option to ignore blank fields. For CSV and TSV imports, this option provides the desired behavior in most cases: it avoids inserting blank fields into MongoDB documents.
Link guides: mongoimport, mongoimport v2.2
Then define an index on the collection and enjoy :-)
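For that last step, a minimal sketch from the command line (the database name "db", collection name "collection" and field name "timestamp" are the same placeholders as above, not real names):
# Create an ascending index on a frequently queried field.
mongo db --eval 'db.collection.ensureIndex({ timestamp: 1 })'
On recent MongoDB versions, createIndex is the preferred spelling of ensureIndex.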

Related

Backup core data, one entity only

My application requires some kind of data backup and some kind of data exchange between users, so what I want to achieve is the ability to export an entity but not the entire database.
I have found some help, but only for the full database, like this post:
Backup core data locally, and restore from backup - Swift
This applies to the entire database.
I tried to export a JSON file; this might work, except that the entity I'm trying to export contains images as binary data.
So I'm stuck.
Any help with exporting just one entity rather than the full database, or with writing JSON that includes binary data, would be appreciated.
Take a look at protobuf. Apple has an official Swift library for it:
https://github.com/apple/swift-protobuf
Protobuf is an alternate encoding to JSON that has direct support for serializing binary data. There are client libraries for any language you might need to read the data in, or command-line tools if you want to examine the files manually.
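As a rough sketch of what that looks like, the schema below defines a message with a bytes field for the raw image data, and protoc generates the Swift types from it. The message and field names are invented, and this assumes protoc plus the swift-protobuf plugin are installed:
cat > PhotoRecord.proto <<'EOF'
syntax = "proto3";
message PhotoRecord {
  string title = 1;
  bytes image_data = 2;  // raw image bytes, no base64 needed
}
EOF
# Generate Swift structs (PhotoRecord.pb.swift) that can serialize to/from binary.
protoc --swift_out=. PhotoRecord.proto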

import large JSON file into mongo

Due to some migrations and a change of server, I have to move my Mongo database from the old server to the new one. The data also needs to be transferred, but there is a lot of it now: almost ~4GB per file, and I have almost 20 files in total.
My problem is that when I upload to the new collections I get a "tostring" error. I read that there is a 16MB limit in MongoDB when importing a file.
How can I import a JSON file into MongoDB? Thank you in advance.
If you read the documentation for mongoexport it says:
Avoid using mongoimport and mongoexport for full instance production backups. They do not reliably preserve all rich BSON data types, because JSON can only represent a subset of the types supported by BSON. Use mongodump and mongorestore as described in MongoDB Backup Methods for this kind of functionality.
Rather than using mongoexport to create a json file and then mongoimport to reimport it, you should use mongodump and mongorestore.
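A minimal sketch of that, assuming you can reach both servers from one machine (the host names, database name and dump directory are placeholders):
# Dump the database from the old server as BSON...
mongodump --host old-server:27017 --db mydb --out /backup/dump
# ...and restore it into the new server.
mongorestore --host new-server:27017 --db mydb /backup/dump/mydb
Because mongodump writes BSON rather than JSON, this also sidesteps the type-fidelity problem quoted above.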

How to import from sql dump to MongoDB?

I am trying to import data from a MySQL dump (.sql file) into MongoDB, but I could not find any mechanism for RDBMS-to-NoSQL data migration.
I have tried converting the data into JSON and CSV, but it is not giving me the desired output in MongoDB.
I thought of trying Apache Sqoop, but it is mostly for SQL or NoSQL to Hadoop.
I could not work out how data can be migrated from MySQL to MongoDB.
Is there any approach apart from what I have tried so far?
Hoping to hear a better and faster solution for this type of migration.
I suggest you dump the MySQL data to a CSV file. You can also try other file formats, but make sure the format is one that can be imported into MongoDB easily; both MongoDB and MySQL support the CSV format very well.
You can use mysqldump or SELECT ... INTO OUTFILE to dump MySQL databases for backup. Using mysqldump may take a long time, so have a look at How can I optimize a mysqldump of a large database?.
Then use the mongoimport tool to import the data.
As far as I know, there are three ways to optimize this import (a combined sketch follows the list):
mongoimport --numInsertionWorkers N starts several insertion workers; N can be the number of cores.
mongod --nojournal: most of the continuous disk usage comes from the journal, so disabling journaling can be a worthwhile optimization during the import.
Split up your file and start parallel import jobs.
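Putting the first two steps together, a minimal sketch (database, table, column and file names are invented; INTO OUTFILE writes on the MySQL server and is subject to the FILE privilege and secure_file_priv, and --numInsertionWorkers requires a reasonably recent mongoimport):
# 1. Dump one table to CSV on the MySQL server.
mysql -u root -p mydb -e "SELECT id, name, created_at FROM users INTO OUTFILE '/var/lib/mysql-files/users.csv' FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n'"
# 2. Import the CSV into MongoDB with several insertion workers.
#    INTO OUTFILE does not write a header row, so the field names are given explicitly.
mongoimport --db mydb --collection users --type csv --fields id,name,created_at --numInsertionWorkers 4 --file /var/lib/mysql-files/users.csv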
Actually, in my opinion, importing and exporting the data is not the hard part. It seems that your dataset is large, so if you don't design your document structure, the database will still be slow; a purely automatic migration from a relational database to MongoDB is not recommended, because the resulting performance might not be good.
So it's worth designing your data structure; you can check out Data models.
Hope this helps.
You can use Mongify, which helps you move/migrate data from SQL-based systems to MongoDB. It supports MySQL, PostgreSQL, SQLite, Oracle, SQL Server and DB2.
It requires Ruby and RubyGems as prerequisites. Refer to the Mongify documentation to install and configure it.
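The usual Mongify workflow looks roughly like the sketch below; the file names are placeholders and the exact subcommands may differ between versions, so treat this as an outline and check the Mongify documentation for the current syntax:
gem install mongify
# database.config describes the SQL source and the MongoDB target.
mongify check database.config                           # verify connectivity to both databases
mongify translation database.config > translation.rb    # generate a draft translation file to edit
mongify process database.config translation.rb          # run the migration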

Ingesting MySQL data to GeoMesa analytics

I am new to GeoMesa; I have only just started using the geomesa command. After following the command-line tools tutorial on the GeoMesa website, I found some information on ingesting data into GeoMesa through a .csv file.
So, for my research:
I have a MySQL database storing all the information sent from an Android application.
And I want to perform some geospatial analytics on it.
Right now I am converting my MySQL table to a .csv file and then ingesting it into GeoMesa, as advised on the GeoMesa website.
But my questions are:
Is there any better option, given that the data is in the GB range and is streaming data, so I would have to regenerate the .csv file regularly?
Is there any API through which I can connect my MySQL database to GeoMesa?
Is there any way to ingest using a .sql dump file, since that would be easier than a .csv file?
Since you are dealing with streaming data, I'd point to a few GeoMesa integrations and options:
First, you might want to check out NiFi for managing data flows. If that fits into your architecture, then you can use GeoMesa with NiFi.
Second, Storm is quite popular for working with streaming data. GeoMesa has a brief tutorial for Storm here.
Third, to ingest sql dumps directly, one option would be to extend the GeoMesa converter library to support them. So far, we haven't had that as a feature request from a customer or a contribution to the project. It'd definitely be a sensible and welcome extension!
I'd also point out the GeoMesa gitter channel. It can be useful for quicker responses.
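In the meantime, a rough sketch of the interim CSV route from the question: periodically export recent rows from MySQL and hand the file to the geomesa ingest command. The table, column and file names are invented, the tab-to-comma translation is naive (it breaks if values contain commas or tabs), and the ingest options are left out because they depend on your catalog, schema and converter (see the GeoMesa command-line tutorial):
# Export the last five minutes of rows and convert MySQL's tab-separated output to CSV.
mysql -u app -p tracking --batch --skip-column-names -e "SELECT device_id, lat, lon, recorded_at FROM positions WHERE recorded_at > NOW() - INTERVAL 5 MINUTE" | tr '\t' ',' > positions.csv
geomesa ingest <catalog, schema and converter options from the tutorial> positions.csv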

Logstash INPUT MySQL

I can't find any input plugin for relational databases in the Logstash documentation.
What is the best approach to import data from a relational database table with Logstash? Or should I connect Elasticsearch directly to the database using JDBC?
You'll need to use the JDBC River (https://github.com/jprante/elasticsearch-river-jdbc) to load JDBC data into Elasticsearch (or write your own code to do it).
It looks like there are several JIRAs open requesting JDBC loading in Logstash, but they haven't been worked on: https://logstash.jira.com/browse/LOGSTASH-1764
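For orientation, registering the JDBC River looked roughly like the sketch below in the Elasticsearch versions current at the time (the river name, connection string, credentials and SQL are placeholders; check the project README for the exact settings your version expects):
curl -XPUT 'http://localhost:9200/_river/my_jdbc_river/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/test",
    "user" : "",
    "password" : "",
    "sql" : "select * from orders"
  }
}'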
There's this
WIP: Under Development, NOT FOR PRODUCTION
This is a plugin for Logstash.
It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
Logstash provides infrastructure to automatically generate documentation for this plugin. We use the asciidoc format to write documentation so any comments in the source code will be first converted into asciidoc and then into html. All plugin documentation are placed under one central location.
So far, there is no Logstash input for reading from SQL databases.
My recommendation: write a program or script, for example in Java or Python, that reads the rows from the SQL database and writes them to a file, then use the Logstash file input to read from that file (a sketch follows below). The Logstash website has a getting-started tutorial; it is easy to learn.
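A minimal sketch of that file-based approach, with invented database, table and path names: a cron job appends recent rows to a flat file, and a Logstash file input tails that file.
# Run every minute from cron; appends the last minute of rows to a log file.
mysql -u app -p'secret' logs_db --batch --skip-column-names -e "SELECT id, level, message, created_at FROM app_logs WHERE created_at > NOW() - INTERVAL 1 MINUTE" >> /var/log/app/mysql-export.log
Point a Logstash file input at /var/log/app/mysql-export.log and it will pick up the new lines as they are appended.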
Good Luck