QueryRequest req = new QueryRequest(solrQuery);
NoOpResponseParser responseParser = new NoOpResponseParser();
responseParser.setWriterType("csv");
searcherServer.setParser(responseParser);
NamedList<Object> resp = searcherServer.request(req);
QueryResponse res = searcherServer.query(solrQuery);
responseString = (String) resp.get("response");
I use the above code to get the output in CSV format. The data I am trying to fetch is huge (in the billions), so I want to use Solr's deep pagination and get the CSV output in chunks. Is there a way to do this? Also, with my current version of Solr (I cannot upgrade), I have to use the above code to get CSV output.
I tried the following approach to fetch the results:
searcherServer = new HttpSolrServer(url);
SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery(query);
solrQuery.set("fl","field1");
solrQuery.setParam("wt", "csv");
solrQuery.setStart(0);
solrQuery.setRows(1000);
solrQuery.setSort(SolrQuery.SortClause.asc("field2"));
The output from the above code has wt as javabin, so I cannot get the CSV output.
Any suggestions?
You have two options.
Use the Solr export request handler (or add it) with the wt=csv parameter. Just to be clear, this is an implicit request handler, usually available even in older Solr versions and specifically designed to handle scenarios that involve exporting millions of records.
Implement deep paging correctly. I suggest Yonik's post on paging and deep paging; it is easier than you think. But once you have implemented it correctly, you also need to build the CSV output yourself; a sketch of this approach follows below.
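For option 2, here is a minimal SolrJ sketch using cursorMark (available from Solr 4.7 onwards); the sort field, the uniqueKey, and the hand-built CSV are placeholders and assumptions, not a drop-in implementation:
// uses org.apache.solr.common.params.CursorMarkParams and org.apache.solr.common.SolrDocument
HttpSolrServer server = new HttpSolrServer(url);
SolrQuery q = new SolrQuery(query);
q.setFields("field1");
q.setRows(1000);
q.setSort(SolrQuery.SortClause.asc("id")); // cursorMark requires a sort that includes the uniqueKey (assumed here to be "id")

String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
StringBuilder csv = new StringBuilder("field1\n"); // build the CSV yourself, as noted above
while (!done) {
    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = server.query(q);
    for (SolrDocument doc : rsp.getResults()) {
        csv.append(doc.getFirstValue("field1")).append('\n');
    }
    String nextCursorMark = rsp.getNextCursorMark();
    done = cursorMark.equals(nextCursorMark); // no more results once the cursor stops moving
    cursorMark = nextCursorMark;
}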
The solution I found was:
SolrQuery solrQuery = new SolrQuery();
solrQuery.setQuery(query); //what you want to fetch
QueryResponse res = searcherServer.query(solrQuery);
int numFound = (int)res.getResults().getNumFound();
int rowsToBeFetched = (numFound > 1000 ? (int)(numFound/6) : numFound);
for (int i = 0; i < numFound; i = i + rowsToBeFetched) {
    solrQuery.set("fl", "fieldToBeFetched");
    solrQuery.setParam("wt", "csv");
    solrQuery.setStart(i);
    solrQuery.setRows(rowsToBeFetched);
    QueryRequest req = new QueryRequest(solrQuery);
    NoOpResponseParser responseParser = new NoOpResponseParser();
    responseParser.setWriterType("csv");
    searcherServer.setParser(responseParser);
    NamedList<Object> resp = searcherServer.request(req);
    responseString = (String) resp.get("response"); // This is in CSV format
}
Pros:
Since I don't fetch the entire result set at once, it was faster.
The output was CSV.
Hitting Solr multiple times isn't costly.
Cons:
The result is not unique, meaning there can be repeated data based on what you are fetching.
To get unique results, you can use facets.
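A hedged SolrJ sketch of that facet idea (the field name is the placeholder used above; be aware that faceting a very high-cardinality field can be expensive):
SolrQuery facetQuery = new SolrQuery(query);
facetQuery.setRows(0);                        // no documents needed, only the facet values
facetQuery.setFacet(true);
facetQuery.addFacetField("fieldToBeFetched");
facetQuery.setFacetLimit(-1);                 // -1 = return all distinct values

QueryResponse rsp = searcherServer.query(facetQuery);
StringBuilder csv = new StringBuilder("fieldToBeFetched\n");
for (FacetField.Count c : rsp.getFacetField("fieldToBeFetched").getValues()) {
    csv.append(c.getName()).append('\n');     // each distinct value appears exactly once
}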
Thanks!
I am working on a Discord bot written in Node.js; the bot uses a MySQL database server to store information. The problem I have run into is that I cannot seem to retrieve the data from the database in a neat way; every single thing I try seems to run into some issue or another.
The select query returns an object called RowDataPacket. When googling, every single result references this solution: Object.values(JSON.parse(JSON.stringify(rows)))
It suggests that I should get the values back, but I don't; I get an array back that is as hard to work with as the RowDataPacket object.
This is a snippet of my code:
const kenneledMemberRolesTableName = 'kenneled_member_roles'
const kenneledMemberKey = 'kenneled_member'
const kenneledMemberRoleKey = 'kenneled_member_role_id'
const kenneledStaffMemberKey = 'kenneled_staff_member'
const kenneledDateKey = 'kenneled_date'
const kenneledReturnableRoleKey = 'kenneled_role_can_be_returned'
async function findKenneledMemberRoles(kenneledMemberId) {
    let sql = `SELECT CAST(${kenneledMemberRoleKey} AS Char) FROM ${kenneledMemberRolesTableName} WHERE ${kenneledMemberKey} = ${kenneledMemberId}`
    let rows = await databaseAccessor.runQuery(sql)
    let result = JSON.parse(JSON.stringify(rows)).map(row => {
        return row.kenneled_member_role_id
    })
    return result
}
This seemed to work until I had to do a type conversion on the value; now dot notation requires me to reference row.CAST(kenneled_member_role_id AS Char), which cannot work, and I have found no other way to retrieve the data than through dot notation. I swear there must be a better way to work with MySQL RowDataPackets, but the solution eludes me.
I figured out something that works; however, I still feel like this is an inelegant solution. I would love to hear from others if I am misunderstanding how to work with MySQL in Node.js, or if this is just a consequence of the library:
let result = JSON.parse(JSON.stringify(rows)).map(row => {
    return row[`CAST(${kenneledMemberRoleKey} AS CHAR)`];
})
So what I did is access the value through bracket notation instead of dot notation. This seems to work, and at least lets me store part of (or the whole) expression in a constant, hiding the ugliness.
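For what it's worth, another option is to alias the CAST in SQL so the key stays predictable; this is only a sketch and assumes your databaseAccessor.runQuery resolves to the usual array of row objects:
async function findKenneledMemberRoles(kenneledMemberId) {
    // `AS role_id` is an illustrative alias; whatever name you pick becomes the property key
    let sql = `SELECT CAST(${kenneledMemberRoleKey} AS Char) AS role_id FROM ${kenneledMemberRolesTableName} WHERE ${kenneledMemberKey} = ${kenneledMemberId}`
    let rows = await databaseAccessor.runQuery(sql)
    return rows.map(row => row.role_id) // plain property access, no JSON round-trip needed
}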
Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source so it is possible to have duplicates for a single ID for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times)
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
dataset = ds.dataset(history, filesystem, partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(final, filesystem, partition_cols=["date_id"], use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionalities, though this one isn't here yet. My approach now would be:
def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
//end edit
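For reference, a minimal usage sketch of the drop_duplicates() above on a toy table (it keeps the first row encountered for each value of the given column):
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

t = pa.table({"id": [1, 1, 2, 3, 3], "updated": [1, 2, 1, 1, 5]})
deduped = drop_duplicates(t, "id")
print(deduped.to_pydict())  # {'id': [1, 2, 3], 'updated': [1, 1, 1]}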
I saw your question because I had a similar one, and I solved it for my work (due to IP issues I can't post the whole code, but I'll try to answer as well as I can; I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    # do stuff
I used value_counts to find the unique values and how many of them there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or last one by using NumPy boolean arrays as a mask again.
After iterating, it was just as simple as pa.concat_tables(tables).
Warning: this is a slow process. If you need something quick and dirty, try the "unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory-intensive by keeping a NumPy boolean mask while iterating over the dictionary; then at the end you return table.filter(mask=boolean_mask).
I don't know how to calculate the speed, though...
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't do combine_chunks(): there is a bug; Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# Lengths of each chunk of the ID column
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
# Per-chunk local row numbers
c = np.array([np.arange(x) for x in a])
# Cumulative offsets turn the local row numbers into a global row index
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
# Attach the global row index as a new column, then keep only the first index per ID
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found DuckDB can give better performance on the group by. Changing the last two lines above into the following gives a 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
I am trying to move data from a SPARQL endpoint to a JSONObject. Using RDF4J.
The RDF4J documentation does not address this directly (there is some info about using endpoints, less about converting to JSON, and nothing where these two cases meet up).
So far I have:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
Map<String, String> headers = new HashMap<String, String>();
headers.put("Accept", "SPARQL/JSON");
repo.setAdditionalHttpHeaders(headers);
try (RepositoryConnection conn = repo.getConnection())
{
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    GraphQuery query = conn.prepareGraphQuery(queryString);
    debug("Mark 2");
    try (GraphQueryResult result = query.evaluate())
this fails because "Server responded with an unsupported file format: application/sparql-results+json"
I figured a SPARQLGraphQuery should take the place of GraphQuery, but RepositoryConnection does not have a relevant prepare statement.
If I exchange
try (RepositoryConnection conn = repo.getConnection())
with
try (SPARQLConnection conn = (SPARQLConnection)repo.getConnection())
I run into the problem that SPARQLConnection does not generate a SPARQLGraphQuery. The closest I can get is:
SPARQLGraphQuery query = (SPARQLGraphQuery)conn.prepareQuery(QueryLanguage.SPARQL, queryString);
which gives a runtime error, as these types cannot be cast to each other.
I do not know how to proceed from here. Any help or advice is much appreciated. Thank you.
this fails because "Server responded with an unsupported file format: application/sparql-results+json"
In RDF4J, SPARQL SELECT queries are tuple queries, so named because each result is a set of bindings, which are tuples of the form (name, value). In contrast, CONSTRUCT (and DESCRIBE) queries are graph queries, so called because their result is a graph, that is, a collection of RDF statements.
Furthermore, setting additional headers for the response format, as you have done here, is not necessary (except in rare circumstances); the RDF4J client handles this for you automatically, based on the registered set of parsers.
So, in short, simplify your code as follows:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
try (RepositoryConnection conn = repo.getConnection()) {
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    TupleQuery query = conn.prepareTupleQuery(queryString);
    debug("Mark 2");
    try (TupleQueryResult result = query.evaluate()) {
        ...
    }
}
If you want to write the result of the query in JSON format, you could use a TupleQueryResultHandler, for example the SPARQLResultsJSONWriter, as follows:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
try (RepositoryConnection conn = repo.getConnection()) {
    String queryString = "SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}";
    TupleQuery query = conn.prepareTupleQuery(queryString);
    query.evaluate(new SPARQLResultsJSONWriter(System.out));
}
This will write the result of the query (in this example to standard output) using the SPARQL Query Results JSON format. If you have a non-standard format in mind, you could of course also create your own TupleQueryResultHandler implementation.
For more details on the various ways in which you can process the result (including iterating, streaming, adding to a List, or just directly sending to a result handler), see the documentation on querying a repository. As an aside, the javadoc on the RDF4J APIs is pretty extensive too, so if your Java editing environment has support for displaying that, I'd advise you to make use of it.
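Since you mentioned a JSONObject specifically: if you want the JSON in memory rather than on standard output, you can write it to a buffer first. A rough sketch, assuming org.json's JSONObject is what you're after:
SPARQLRepository repo = new SPARQLRepository(<My Endpoint>);
try (RepositoryConnection conn = repo.getConnection()) {
    TupleQuery query = conn.prepareTupleQuery("SELECT * WHERE {GRAPH <urn:x-evn-master:mwadata> {?s ?p ?o}}");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    query.evaluate(new SPARQLResultsJSONWriter(out)); // org.eclipse.rdf4j.query.resultio.sparqljson
    JSONObject json = new JSONObject(new String(out.toByteArray(), StandardCharsets.UTF_8));
    // json.getJSONObject("results").getJSONArray("bindings") then holds the individual solutions
}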
I am building a Railo app which deals with a lot of JSON data sent back and forth via Ajax. I've identified an opportunity to optimize its performance, but I'd like to hear some advice from the community before I tackle it.
Here is a good example of the situation.
I have an action on the server that queries a set of bid responses, serializes them to JSON, then returns them to my javascript on the front end, which then parses and renders some HTML. The format in which Railo returns the JSON is the familiar two-node object:
{"COLUMNS":["one","two","three",...],"DATA":["value","value","value",...]}
I wrote a function that utilizes underscore's map() function to convert this format into an array of objects with named nodes:
function toArgsObject(data, columns) {
    return _.map(data, function(w) {
        var q = {};
        for (var i = 0; i < w.length; i++) { eval("q." + columns[i] + " = w[i]"); };
        return q;
    });
};
This gets the job done nicely; however, the performance isn't very good! Even with swift JS interpreters like those in WebKit and Firefox, this function often accounts for 75% of processing time in the functions that call it, especially when the data sets are large. I'd like to see how much improvement I would get by offloading this processing to the server, but I don't quite have the CFML/CFScript chops to write an efficient version of this.
What I need to come back from the server, then, would look like this:
[
{"one":"value","two":"value","three":"value"},
{"one":"value","two":"value","three":"value"}.
...
]
I understand that the format used by SerializeJSON creates responses that are far smaller and therefore use less bandwidth. This is where the experimentation comes in: I'd like to see how it impacts my application to do things differently!
Has anyone written a JSON serializer that can return data in this format?
If you need an array of structures in JS, you can easily convert the query into this type of dataset on the server side and apply SerializeJSON().
Quick example 1
<cfset dataset = [
{"one":"value","two":"value","three":"value"},
{"one":"value","two":"value","three":"value"}
] />
<cfoutput><p>#SerializeJSON(dataset)#</p></cfoutput>
(I love the freedom of structure definition syntax in Railo)
Quick example 2
<cfquery datasource="xxx" name="qGetRecords">
    select userId, login, email from users limit 0,3
</cfquery>
<cfset dataset = [] />
<cfloop query="qGetRecords">
    <cfset record = {} />
    <cfset record["one"] = qGetRecords.userId />
    <cfset record["two"] = qGetRecords.login />
    <cfset record["three"] = qGetRecords.email />
    <cfset ArrayAppend(dataset, record) />
</cfloop>
<cfoutput>
    <p>#SerializeJSON(qGetRecords)#</p>
    <p>#SerializeJSON(dataset)#</p>
</cfoutput>
Should work for you.
eval should only be used in a few very rare cases and this certainly isn't one of them. Stop using eval and do this instead:
q[columns[i]] = w[i];
In JavaScript, foo['bar'] is the equivalent of foo.bar.
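Applied to the toArgsObject() function from the question, the eval-free version would look something like this:
function toArgsObject(data, columns) {
    return _.map(data, function (w) {
        var q = {};
        for (var i = 0; i < w.length; i++) {
            q[columns[i]] = w[i]; // bracket notation instead of eval
        }
        return q;
    });
}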
In Coldfusion 10 or Railo 4, you could use the toArray() function of the Underscore.cfc library to convert your query into the format you want before calling serializeJSON(). Example in cfscript:
exampleQuery = queryNew("one,two,three", "Varchar,Varchar,Varchar",
    [
        ["value","value","value"],
        ["value","value","value"]
    ]);
arrayOfStructs = _.toArray(exampleQuery);
serializeJSON(arrayOfStructs);
Result:
[{ONE:"value", TWO:"value", THREE:"value"}, {ONE:"value", TWO:"value", THREE:"value"}]
The toArray() function returns an array of structs that matches your query, which is why the resulting JSON is formatted in the way that you want.
[Disclaimer: I wrote Underscore.cfc]
I am trying to get the records from the 'many' table of a one-to-many relationship and add them as a list to the relevant record from the 'one' table.
I am also trying to do this in a single database request.
Code derived from Linq to Sql - Populate JOIN result into a List almost achieves the intended result, but it makes one database request per entry in the 'one' table, which is unacceptable. That failing code is here:
var res = from variable in _dc.GetTable<VARIABLE>()
select new { x = variable, y = variable.VARIABLE_VALUEs };
However if I do a similar query but loop through all the results, then only a single database request is made. This code achieves all goals:
var res = from variable in _dc.GetTable<VARIABLE>()
          select variable;
List<GDO.Variable> output = new List<GDO.Variable>();
foreach (var v2 in res)
{
    List<GDO.VariableValue> values = new List<GDO.VariableValue>();
    foreach (var vv in v2.VARIABLE_VALUEs)
    {
        values.Add(VariableValue.EntityToGDO(vv));
    }
    output.Add(EntityToGDO(v2));
    output[output.Count - 1].VariableValues = values;
}
However, the latter code is ugly as hell, and it really feels like something that should be doable in a single LINQ query.
So, how can this be done in a single LINQ query that makes only a single database request?
In both cases the table is set to preload using the following code:
_dc = _db.CreateLinqDataContext();
var loadOptions = new DataLoadOptions();
loadOptions.LoadWith<VARIABLE>(v => v.VARIABLE_VALUEs);
_dc.LoadOptions = loadOptions;
I am using .NET 3.5, and the database back-end was generated using SqlMetal.
This link may help:
http://msdn.microsoft.com/en-us/vcsharp/aa336746.aspx
Look under the join operators. You'll probably have to change from extension (method) syntax to query syntax too. Like this:
var result = from obj in dc.Table
             from obj2 in dc.Table2
             where condition
             select new { obj, obj2 }; // projection shape is illustrative
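For illustration, a group join in query syntax looks roughly like this; VARIABLE_VALUE, the key names, and the projection below are placeholders rather than your actual schema, so treat it as a sketch:
var res = from variable in _dc.GetTable<VARIABLE>()
          join val in _dc.GetTable<VARIABLE_VALUE>()
              on variable.VariableId equals val.VariableId into vals   // hypothetical key names
          select new
          {
              Variable = variable,
              Values = vals.ToList()   // the 'many' side materialized per 'one' row
          };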