Solr - Best approach to import 20 million documents from a CSV file

My current task is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script post these smaller ones using curl.
I have noticed that if you post a large amount of data, the request times out most of the time.
I am looking into the Data Import Handler and it seems like a good option.
Any other ideas are highly appreciated.
Thanks

Unless a database is already part of your solution, I wouldn't add that additional complexity. Quoting the Solr FAQ, it's your servlet container that is issuing the session timeout.
As I see it, you have a couple of options (in my order of preference):
Increase container timeout
Increase the container timeout (the "maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the timeout temporarily might just be the simplest option.
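For illustration, here is a hedged sketch of what that setting looks like in the example jetty.xml that shipped with older Solr releases (the connector class name varies between Jetty versions, so treat this as an outline, not a drop-in snippet):
<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="port">8983</Set>
      <!-- idle timeout in milliseconds; raise it for long-running bulk posts -->
      <Set name="maxIdleTime">50000</Set>
    </New>
  </Arg>
</Call>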
Split the file
Here's a simple Unix script that will do the job (splitting the file into 500,000-line chunks):
split -d -l 500000 data.csv split_files.
for file in split_files.*
do
  curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary "@$file"
done
Parse the file and load in chunks
The following Groovy script uses opencsv and SolrJ to parse the CSV file and commit changes to Solr every 500,000 lines.
@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0')
])
import au.com.bytecode.opencsv.CSVReader
import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000

    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", result[0])
        doc.addField("name_s", result[1])
        doc.addField("category_s", result[2])
        server.add(doc)
        if (count.mod(chunkSize) == 0) {
            // commit every chunkSize documents
            server.commit()
        }
        count++
    }

    // final commit for the remaining documents
    server.commit()
}
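For readers who would rather stay off the JVM, here is a rough Python equivalent of the same chunked-commit idea using the pysolr client (pysolr is an assumption; any Solr HTTP client works, and the field names mirror the Groovy script above):
import csv
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/', timeout=120)
chunk_size = 500000
buffer = []

with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        buffer.append({'id': row[0], 'name_s': row[1], 'category_s': row[2]})
        if len(buffer) == chunk_size:
            solr.add(buffer)   # send one chunk
            solr.commit()      # then commit it, mirroring server.commit() above
            buffer = []
    if buffer:                 # flush the remainder
        solr.add(buffer)
        solr.commit()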

In Solr 4.0 (currently in beta), CSVs from a local directory can be imported directly using the UpdateHandler. Modifying the example from the Solr wiki:
curl 'http://localhost:8983/solr/update?stream.file=exampledocs/books.csv&stream.contentType=text/csv;charset=utf-8'
This streams the file from the local location, so there is no need to chunk it up and POST it via HTTP.

The answers above explain the single-machine ingestion strategies really well.
Here are a few more options if you have big-data infrastructure in place and want to implement a distributed data-ingestion pipeline:
Use Sqoop to bring the data into Hadoop, or place your CSV file in Hadoop manually.
Use one of the connectors below to ingest the data:
the Hive-Solr connector or the Spark-Solr connector.
PS:
Make sure no firewall blocks connectivity between the client nodes and the Solr/SolrCloud nodes.
Choose the right directory factory for data ingestion; if near-real-time search is not required, use StandardDirectoryFactory.
If you get the exception below in the client logs during ingestion, tune the autoCommit and autoSoftCommit configuration in the solrconfig.xml file (a sketch follows after the exception):
SolrServerException: No live SolrServers available to handle this request
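For reference, a sketch of those two settings in solrconfig.xml (the values here are placeholders to tune for your hardware and ingestion rate, not recommendations):
<autoCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>15000</maxTime>          <!-- ms; hard commit, flushes to stable storage -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>60000</maxTime>          <!-- ms; soft commit, controls search visibility -->
</autoSoftCommit>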

Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, Postgres's COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
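As a sketch of that first step in Python with psycopg2 (the table and column names are assumptions matching the fields used earlier in this thread):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()
with open('data.csv') as f:
    # COPY bulk-loads the whole file in one round trip, far faster than row-by-row INSERTs
    cur.copy_expert("COPY docs (id, name, category) FROM STDIN WITH (FORMAT csv)", f)
conn.commit()
conn.close()
Once the rows are in the database, a Data Import Handler configuration pointed at that table does the indexing.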

The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates.
The Javadocs are somewhat incorrect (v3.6.2, v4.7.0):
ConcurrentUpdateSolrServer buffers all added documents and writes them into open HTTP connections.
It doesn't buffer indefinitely, but only up to queueSize (an int), which is a constructor parameter.

Concurrent MySQL writing with GORM leads to an error

I have implemented a complex CSV import script in Golang.
I use a worker-pool implementation for it. Inside that worker pool, workers run through thousands of small CSV files, categorizing, tagging, and branding the products.
They all write to the same database table. So far so good.
The problem I'm facing is that if I choose more than two workers, the process crashes randomly with an error message.
The workflow is:
foreach (csv) {
    workerPool.submit(csv)
}

func worker(csv) {
    foreach (line) {
        import(line)
    }
}

import(line) {
    product = get(line)
    product.category = determine_category(product)
    product.brand = determine_brand(product)
    save(product.brand)
    product.tags = determine_tags(product)
    // and after all
    save(product)
}
I tried to wrap the save() calls in transactions, but it didn't help.
Now I have the following questions:
Is MySQL suited to writing concurrently to one table?
If transactions are needed to accomplish this, where should they be set?
Is the Go SQL driver (where the error ALWAYS happens, in packets.go:1102) suited to do this?
Could anyone help me (maybe on a paid basis, for a few hours)?
I'm completely stuck. I can also share the source code if that helps. But first I wanted to know whether you'd guess it's my code or a general issue.
Open a new DB connection in each goroutine (or thread, for languages that use threads).
MySQL's protocol is stateful, which means that if multiple goroutines attempt to use the same connection, the requests and responses get very confused.
You would have the same problem trying to share any other kind of stateful protocol connection between goroutines.
For example, FTP is also a stateful protocol, and it may be easier to understand: a client goroutine might send a message like "get file x", and the response should be a series of messages containing the content of that file. If another goroutine tries to use the same connection while that request/response is in progress, both clients will be confused. The second goroutine will read packets that belong to a file it didn't request, and the first goroutine, which requested the file, will find that some packets it was expecting have already been read.
Similarly, MySQL's protocol does not support multiple client goroutines sharing a single connection.
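Since the same rule applies to any language with threads, here is a minimal sketch of the one-connection-per-worker pattern in Python with pymysql (the table, columns, and credentials are made up for illustration):
import threading

import pymysql

def worker(rows):
    # Each worker opens its OWN connection; a raw connection must never be shared across workers.
    conn = pymysql.connect(host='localhost', user='app', password='secret', database='shop')
    try:
        with conn.cursor() as cur:
            for name, category in rows:
                cur.execute("INSERT INTO products (name, category) VALUES (%s, %s)", (name, category))
        conn.commit()
    finally:
        conn.close()

chunks = [[('widget', 'tools')], [('gadget', 'toys')]]  # stand-ins for parsed CSV chunks
threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
In Go specifically, the standard database/sql package already pools connections, so sharing the *sql.DB handle between goroutines is safe; the trouble arises when a single raw connection is shared.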

Django - Insert data into database via PHP

Until now I have always created the structure and logic of my backend with Django. But when I inserted data into the database, I always did that directly via an HTTP request to a PHP script.
As the project grows, this is getting pretty messy. There were also always complications with timestamps between the database and the backend.
I want to eliminate all these flaws but could not find any good example. Could I just make a request to a certain view in Django, containing all the information I want to put into the database? Django and the database are on the same machine, but the data that is being inserted comes from different devices.
Maybe you could give me a hint on how to search further for this.
You can just create a Python script and run that.
Assuming you have a virtualenv, ensure you have it activated, and put this in the root of your Django project.
#!/usr/bin/env python
import os

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myapp.settings")
django.setup()

# Import after setup, to ensure the models are initialized properly.
from myapp.models import MyModel, OtherModel

if __name__ == "__main__":
    # Create your objects.
    obj = MyModel.objects.create(some_value="foo")
    other_obj = OtherModel.objects.create(title="Bar", ref=obj)
You can also use a transaction to ensure it commits all or nothing:
from django.db import transaction

with transaction.atomic():
    obj = MyModel.objects.create(some_value="foo")
    other_obj = OtherModel.objects.create(title="Bar", ref=obj)
Should one of the creations fail, everything is rolled back. This prevents you from ending up with a half-filled or corrupt database.
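Since the question mentions sending data from other devices, here is a hedged sketch of the same insert exposed as a Django view that accepts a JSON POST (the app, model, and field names are the assumed ones from the script above; add real authentication before using anything like this):
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

from myapp.models import MyModel

@csrf_exempt  # for brevity only; protect this endpoint properly in production
def ingest(request):
    payload = json.loads(request.body)
    obj = MyModel.objects.create(some_value=payload["some_value"])
    return JsonResponse({"id": obj.pk})
Wire it up in urls.py and any device can POST JSON straight to Django, with no PHP middleman.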

How do I create a lot of sample data for Firestore?

Let's say I need to create a lot of different documents/collections in Firestore, and I need to add them quickly, like copy-pasting JSON. I can't do that with the standard Firebase console, because adding 100 documents would take me forever. Are there any solutions for bulk-creating mock data with a given structure in a Firestore DB?
If you switch to the Cloud Console (rather than the Firebase console) for your project, you can use Cloud Shell as a starting point.
In the Cloud Shell environment you'll find tools like node and python installed and available. Using whichever one you prefer, you can write a script using the server client libraries.
For example in Python:
from google.cloud import firestore
import random

MAX_DOCUMENTS = 100
SAMPLE_COLLECTION_ID = u'users'
SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()
collection_ref = db.collection(SAMPLE_COLLECTION_ID)

for x in range(MAX_DOCUMENTS):
    collection_ref.add({
        u'primary': random.choice(SAMPLE_COLORS),
        u'secondary': random.choice(SAMPLE_COLORS),
        u'trim': random.choice(SAMPLE_COLORS),
        u'accent': random.choice(SAMPLE_COLORS)
    })
While this is the easiest way to get up and running with a static dataset, it leaves a little to be desired. Namely, with Firestore, live dynamic data is needed to exercise its functionality, such as real-time queries. For this task, using Cloud Scheduler and Cloud Functions is a relatively easy way to regularly update sample data.
In addition to the sample-generation code, you specify the update frequency in Cloud Scheduler. For instance, */10 * * * * defines a frequency of every 10 minutes using the standard unix-cron format.
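A hedged sketch of the generation side as an HTTP-triggered Cloud Function that the Scheduler job can call (the function and field names are illustrative, not from an official sample):
import random

from google.cloud import firestore

SAMPLE_COLORS = [u'Blue', u'Red', u'Green', u'Yellow', u'White', u'Black']

def refresh_sample_data(request):
    # Each invocation adds one fresh document, keeping the dataset dynamic.
    db = firestore.Client()
    db.collection(u'users').add({
        u'primary': random.choice(SAMPLE_COLORS),
        u'timestamp': firestore.SERVER_TIMESTAMP,
    })
    return 'ok'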
For non-static data, a timestamp is often useful. Firestore provides a way to have a timestamp from the database server added at write time as one of the fields:
u'timestamp': firestore.SERVER_TIMESTAMP
It is worth noting that timestamps like this will hotspot in production systems if not sharded correctly. Typically 500 writes/second to the same collection is the maximum you will want, so that the index doesn't hotspot. Sharding can be as simple as each user having their own collection (500 writes/second per user). For this example, however, writing 100 documents every minute via a scheduled Cloud Function is definitely not an issue.
FireKit is a good resource to use for this purpose. It even allows sub-collections.
https://retroportalstudio.gumroad.com/l/firekit_free

503 Service Unavailable: No registered leader was found after waiting for 4000 ms

I've recently started using Solr. I'm using the latest version, Solr 6.1.0. I followed the quick-start tutorial to get a feel for it. Being a Windows user, I had to resort to the alternative way of importing my .csv data, using the Post tool for Windows.
I am primarily interested in seeing how Solr can handle and search large data sets like the one I have. It is a 522 MB my_db.csv file which is properly formatted (I ran various Python scripts to check that).
I started SolrCloud by the usual procedure. Then I imported a part of this dataset (to be specific, 29 lines of my_db.csv) to see if it works.
Shell:
C:\Users\MAC\Downloads\solr-6.1.0\solr-6.1.0>java -Dc=gettingstarted -Ddata=files -Dauto=yes -jar example\exampledocs\post.jar example\exampledocs\29lines.csv
Result was:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 29lines.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:01:28.106
Fortunately, it worked perfectly and I was able to use the default Velocity search wrapper that they provide, by going to http://localhost:8983/solr/gettingstarted_shard2_replica1/browse . It had all my data stored so far; 29 rows, to be precise.
Now I wanted to see if the whole 522 MB of data would be imported, for which I used the same command (just replaced the .csv file, of course) and ran it. I did expect it to take a while, but after nearly 10 minutes it had inserted around 32,674 documents out of 1,300,000, and then it threw out this error.
Result was:
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file omdbFull.csv (text/csv) to [base]
SimplePostTool: WARNING: Solr returned an error #503 (Service Unavailable) for url: http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">503</int><int name="QTime">128191</int></lst><lst name="error"><lst name="metadata"><str name="error-class">org.apache.solr.common.SolrException</str><str name="root-error-class">org.apache.solr.common.SolrException</str></lst><str name="msg">No registered leader was found after waiting for 4000ms , collection: gettingstarted slice: shard2</str><int name="code">503</int></lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 503 for URL: http://localhost:8983/solr/gettingstarted/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:08:36.342
Summary
This was strange. I wasn't exactly sure why this had happened. Do I perhaps have to change some kind of "timeout" parameter for it to commit? Unfortunately, I wasn't able to see any such option for the Windows Post tool.
I found the solution to my problem. The problem wasn't that the file was huge; mine was only around 500 MB of CSV, and I'm sure it would go through for even larger files.
The thing is, Solr seems to do some kind of auto-recognition of the kinds of values fed to an index. For instance, my CSV had a column with years, like "2015", "2014", "1970", and so on, but this column also contained improper values which I didn't know about, like "2014-2015" and "1980-1988".
Solr would stop and throw an exception because these were not years but year ranges; it wasn't expecting values of this sort.
Summary
To fix the problem, I simply filtered out the faulty year rows and voilà! It processed my 500 MB CSV in around 15 minutes. After that, I had a nice database ready to be searched!
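For anyone hitting the same thing, a sketch of that filtering step in Python (it assumes the year lives in the third column; adjust the index and the accepted pattern for your schema):
import csv
import re

with open('my_db.csv', newline='', encoding='utf-8') as src, \
     open('my_db_clean.csv', 'w', newline='', encoding='utf-8') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))           # keep the header row
    for row in reader:
        if re.fullmatch(r'\d{4}', row[2]):  # keep plain years; drop ranges like "2014-2015"
            writer.writerow(row)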

How to send a really, really large JSON object as a response - Node.js with Express

I have been getting the error FATAL ERROR: JS Allocation failed - process out of memory, and I have pinpointed the problem: I am sending a really, really large JSON object to res.json (or JSON.stringify).
To give you some context, I am basically sending around 30,000 config files (each config file has around 10,000 lines) as one JSON object.
My question is, is there a way to send such a huge JSON object, or is there a better way to stream it (like using socket.io)?
I am using node v0.10.33 and express@4.10.2.
UPDATE: Sample code
var app = express();

app.route('/events')
.get(function (req, res, next) {
    var configdata = [{config:<10,000 lines of config>}, ... 10,000 configs];
    res.json(configdata); // The out-of-memory error occurs here
});
After a lot of trying, I finally decided to go with socket.io to send each config file one at a time rather than all config files at once. This solved the out-of-memory problem that was crashing my server. Thanks for all your help.
Try to use streams. What you need is a readable stream that produces data on demand. Here's simplified code:
var Readable = require('stream').Readable;

var rs = new Readable();
rs._read = function () {
    // assuming 10,000 lines of config fits in memory
    rs.push(JSON.stringify({config:<10,000 lines of config>}));
    rs.push(null); // signal the end of the stream
};
rs.pipe(res);
You can try increasing the memory Node has available with the --max_old_space_size flag on the command line (for example, node --max_old_space_size=4096 app.js gives V8 a 4 GB old-space heap).
There may be a more elegant solution. My first reaction was to suggest using res.json() with a Buffer object rather than trying to send the entire object in one shot, but then I realized that whatever converts the object to JSON will probably want the entire object all at once anyway, so you will run out of memory even if you switch to a stream. Or at least that's what I would expect.