Couchbase cannot access views while adding new documents - couchbase

I'm using Couchbase Community Edition 6.0.0 build 1693 on a single node with the following server config.
Total RAM - 24 GB
Data - 7608 MB
Index - 3255 MB
Search - 576 MB
I query the view with stale set to OK and to UPDATE_AFTER:
ViewResult result = bucket.query(ViewQuery.from(design, view)
        .key(id)
        .skip(skip)
        .limit(limit)
        .stale(Stale.UPDATE_AFTER));

List<T> ourDocs = new ArrayList<>();
result.iterator()
      .forEachRemaining(doc ->
          ourDocs.add(fromJsonObject(doc.document().content(), type)));
While I create new documents from a C# TPL task, I cannot access the views; I hit a similar issue with CouchDB.
Is there any other config I'm missing here?

Related

Read list of strings from MySQL Stored Proc in .NET 6

I have a MySQL (not SQL Server) database with a Stored Procedure that returns a tabular result with one (1) column of strings.
I would like to get that result into my .NET application as some sort of IEnumerable<string> or List<string> etc.
What do?
I've tried playing with MySql.EntityFrameworkCore but get stuck quickly. Entity Framework Core either wants to generate tables based on models or models based on tables. I want neither. I just want my strings, plain and simple.
I've tried making a POCO with a single property and the [Keyless] attribute, but no dice. If I define a DbSet<Poco> then the table doesn't exist, and if I try to do context.Set<Poco>().FromSql("call my_stored_proc();") then EF Core complains the DbSet doesn't exist.
I'm using .NET 6 and the latest version of the above-mentioned MySql.EntityFrameworkCore NuGet package. Searching for answers is made harder because a lot of answers either assume SQL Server or use older versions of EF Core with methods that my EF Core doesn't seem to have. And some results claim that EF Core 6 doesn't work with .NET 6?
I'm also happy bypassing EF entirely if that's easier.
What you are asking for will eventually be available in EF Core 7.0 - raw SQL queries for unmapped types.
Until then, the minimum you need to do is define a simple POCO class with a single property, register it as a keyless entity, and use ToView(null) to prevent EF Core from associating a database table/view with it.
e.g.
POCO:
public class StringValue
{
    public string Value { get; set; }
}
Your DbContext subclass:
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<StringValue>(builder =>
    {
        builder.HasNoKey();   // keyless
        builder.ToView(null); // no table/view
        builder.Property(e => e.Value)
            .HasColumnName("column_alias_in_the_sp_result");
    });
}
Usage:
var query = context.Set<StringValue>()
    .FromSqlRaw(...)
    .AsEnumerable() // needed when the raw SQL (e.g. a stored procedure call) is not composable, as with SQL Server
    .Select(r => r.Value);
EF is an ORM, not a data access library. The actual data access is performed by ADO.NET using db-specific providers. You don't need an ORM to run a query and receive results:
await using DbConnection connection = new MySqlConnection("Server=myserver;User ID=mylogin;Password=mypass;Database=mydatabase");
await connection.OpenAsync();

await using DbCommand command = new MySqlCommand("SELECT field FROM table;", connection);
await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    Console.WriteLine(reader.GetString(0));
}
ADO.NET provides the interfaces and base implementations like DbConnection. Individual providers supply the database-specific implementations. This means the samples you see for SQL Server work with minimal modifications for any other database. To execute a stored procedure you need to set the CommandType to System.Data.CommandType.StoredProcedure:
using var command = new MySqlCommand("my_stored_proc", connection);
command.CommandType = CommandType.StoredProcedure;
In this case I used the open source MySqlConnector provider, which offers true asynchronous commands and fixes a lot of the bugs found in Oracle's official Connector/NET provider, aka MySql.Data. The official MySql.EntityFrameworkCore uses the official provider and inherits its problems.
ORMs like EF and micro-ORMs like Dapper work on top of ADO.NET to generate SQL queries and map results to objects. To work with EF Core use Pomelo.EntityFrameworkCore.MySql. With 25M downloads it's also far more popular than MySql.EntityFrameworkCore (1.4M).
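For reference, a minimal sketch of swapping in the Pomelo provider; MyDbContext is a hypothetical context name and the connection string is a placeholder:
// Sketch only: configure the Pomelo provider inside a hypothetical MyDbContext.
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    var connectionString = "Server=myserver;User ID=mylogin;Password=mypass;Database=mydatabase";
    // UseMySql comes from Pomelo.EntityFrameworkCore.MySql; AutoDetect queries the server version once.
    optionsBuilder.UseMySql(connectionString, ServerVersion.AutoDetect(connectionString));
}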
If you only want to map raw results to objects, try Dapper. It constructs the necessary commands based on the query and parameters provided as anonymous objects, opens and closes connections as needed, and maps results to objects using reflection. Until recently it was a lot faster than EF Core at raw queries, but EF caught up in EF Core 7:
IEnumerable<User> users = cnn.Query<User>("get_user",
    new { RoleId = 1 },
    commandType: CommandType.StoredProcedure);
No other configuration is needed. Dapper will map columns to User properties by name and return an IEnumerable<T> of the desired class.
The equivalent functionality will be added in EF Core 7's raw SQL queries for unmapped types.
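Applied to the original question (a single column of strings), a hedged sketch with Dapper and MySqlConnector could be as small as this; my_stored_proc stands in for the asker's procedure name:
// Sketch only: Dapper maps a single-column result straight to strings, no POCO needed.
using var connection = new MySqlConnection("Server=myserver;User ID=mylogin;Password=mypass;Database=mydatabase");
List<string> strings = connection
    .Query<string>("my_stored_proc", commandType: CommandType.StoredProcedure)
    .ToList();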

Not all executors are being used when reading JSON (gzipped .gz) in GCP from a Google Dataproc Spark cluster using spark-submit

I just got introduced to this wonderful world of Big Data and Cloud technology, using GCP (Dataproc) and PySpark. I have a ~5 GB JSON file (gzipped, .gz) containing ~5 million records. I need to read each row and process only those rows which satisfy a certain condition. I have working code, and I issued a spark-submit with --num-executors 5, but still only one worker is used to carry out the action.
This is the spark-submit command I am using:
spark-submit --num-executors 5 --py-files /home/user/code/dist/package-0.1-py3.6.egg job.py
job.py:
path = "gs://dataproc-bucket/json-files/data_5M.json.gz"
mi = spark.read.json(path)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Dataproc config:
(I am using the free account for now, once I get working solution will add more cores and executors)
(Debian 9, Hadoop 2.9, Spark 2.4)
Master node: 2 vCPU, 7.50 GB memory
Primary disk size: 32 GB
5 worker nodes: 1 vCPU, 3.75 GB memory
Primary disk size: 32 GB
After spark-submit I can see in the web UI that 5 executors were added, but then only 1 executor remains active and performs all the tasks while the remaining 4 are released.
I did my research and most of the questions talk about accessing data via JDBC.
Please suggest what I am missing here.
P.S. Eventually I will read 64 JSON files of 5 GB each, so I might use 8 cores * 100 workers.
Your best bet is to preprocess the input. Given a single input file, spark.read.json(...) will create a single task to read and parse the JSON data, as Spark cannot know ahead of time how to parallelize it. If your data is in line-delimited JSON format (http://jsonlines.org/), the best course of action would be to split it into manageable chunks beforehand:
path = "gs://dataproc-bucket/json-files/data_5M.json"
# read monolithic JSON as text to avoid parsing, repartition and *then* parse JSON
mi = spark.read.json(spark.read.text(path).repartition(1000).rdd)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Your initial step here (spark.read.text(...)) will still bottleneck on a single task. If your data isn't line-delimited or (especially!) you anticipate that you will need to work with this data more than once, you should figure out a way to turn your 5 GB JSON file into 1000 5 MB JSON files before getting Spark involved.
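For example, a rough preprocessing sketch in plain Python (file names and the chunk size are illustrative assumptions) that splits a line-delimited .json.gz into many small uncompressed chunks you can then upload back to GCS:
import gzip
import itertools

# Split data_5M.json.gz into many small line-delimited chunks so Spark can
# read and parse them in parallel; re-upload the chunks to the bucket afterwards.
lines_per_chunk = 5000  # tune so each chunk ends up around a few MB
with gzip.open("data_5M.json.gz", "rt") as src:
    chunks = iter(lambda: list(itertools.islice(src, lines_per_chunk)), [])
    for i, chunk in enumerate(chunks):
        with open("chunk_{:05d}.json".format(i), "w") as out:
            out.writelines(chunk)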
.gz files are not splittable, so they're read by one core and placed onto a single partition.
see Dealing with a large gzipped file in Spark for reference.

Azure Cosmos DB Number of Columns Limit

The Azure Table Service documentation states that entities (rows) must have at most 255 properties, which I understand to mean these tables can have at most 255 columns, which seems highly restrictive.
Two questions: first, do the same limits apply to Cosmos DB Table Storage? I can't seem to find any documentation that says one way or another, though the language of "entities" is still used. And second--if the same limit applies in Cosmos DB--is there any useful way around this limit for storage and querying, along the lines of JSON in SQL Server?
EDIT: here is some example code that attempts to write entities with 260 properties to Cosmos DB Table Storage and the error that is thrown. Account names, keys, and the like are redacted.
# Libraries
from azure.cosmosdb.table.tableservice import TableService
from azure.cosmosdb.table.models import Entity
import csv
import os

# Connect
## Table Storage
"""
access_key = 'access_key'
table_service = TableService(account_name='account_name', account_key=access_key)
"""
## Cosmos DB Table Storage
connection_string = "connection_string"
table_service = TableService(connection_string=connection_string)

# Create Table
if not table_service.exists('testTable'):
    table_service.create_table('testTable')

length = 260
letters = [chr(i) for i in range(ord('a'), ord('z') + 1)]
keys = [a + b + c for a in letters for b in letters for c in letters][:length]
values = ['0' * (8 - len(str(i))) + str(i) for i in range(length)]
entity = dict(zip(keys, values))
entity['PartitionKey'] = 'TestKey'
entity['RowKey'] = '1'

table_service.insert_entity('testTable', entity)
This raises "ValueError: The entity contains more properties than allowed."
first, do the same limits apply to Cosmos DB Table Storage?
Based on the Azure Table storage limits, as you said, the max number of properties in a table entity is 255. However, I just found the statement below in the Azure Cosmos DB limits documentation.
Azure Cosmos DB is a global scale database in which throughput and storage can be scaled to handle whatever your application requires. If you have any questions about the scale Azure Cosmos DB provides, please send email to askcosmosdb@microsoft.com.
According to my test (I tried to add 260 properties to an entity), the Azure Cosmos DB Table API accepts entities whose property count exceeds 255.
If you want an official reply, you could send an email to the address above.
is there any useful way around this limit for storage and querying, along the lines of JSON in SQL Server?
If you want to store and query JSON-format data, I suggest using the Cosmos DB SQL API. It is versatile and flexible. You could refer to the docs.
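As an illustration, a minimal sketch with the azure-cosmos (SQL API) Python SDK; the account URL, key, database, and container names are placeholders, and the container is assumed to be partitioned on /PartitionKey:
from azure.cosmos import CosmosClient

# The SQL (Core) API stores arbitrary JSON documents, so a 260-property
# document like the one built above is not a problem.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<access_key>")
container = client.get_database_client("testdb").get_container_client("testContainer")

doc = {"id": "1", "PartitionKey": "TestKey"}
doc.update(dict(zip(keys, values)))  # keys/values as defined in the question's snippet
container.upsert_item(doc)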
Besides, if your data is stored in a SQL Server database now, you could use the Data Migration Tool to import it into Cosmos DB, or use Azure Data Factory for more custom transfers.
Hope it helps you.
Since this pops pretty high on Google searches: As of now, it's 255 (-2 if you encrypt)
I just did a quick test using pytest:
from azure.cosmosdb.table import TableService

field_number = 250
entity = get_dummy_dict_entry_with_many_col(field_number)
for x in range(field_number, 1000):
    print("Adding entity with {} elements.".format(len(entity)))
    table_service.insert_entity(my_test_table_name, entity)
    field_number += 1
    entity["Field_nb_{}".format(field_number)] = field_number
    entity["RowKey"] += str(field_number)
and got an exception in "def _validate_entity(entity, encrypt=None):"
# Two properties are added during encryption. Validate sufficient space
max_properties = 255
if encrypt:
    max_properties = max_properties - 2

# Validate there are not more than 255 properties including Timestamp
if (len(entity) > max_properties) or (len(entity) == max_properties and 'Timestamp' not in entity):
>       raise ValueError(_ERROR_TOO_MANY_PROPERTIES)
E       ValueError: The entity contains more properties than allowed.

NodeJS - Process out of memory for 100+ concurrent connections

I am working on an IoT application where the clients send bio-potential information every 2 seconds to the server. The client sends a CSV file containing 400 rows of data every 2 seconds. I have a Socket.IO websocket server running on my server which captures this information from each client. Once this information is captured, the server must push these 400 records into a mysql database every 2 seconds for each client. While this worked perfectly well as long as the number of clients were small, as the number of clients grew the server started throwing the "Process out of memory exception."
Following is the exception received :
<--- Last few GCs --->
98522 ms: Mark-sweep 1397.1 (1457.9) -> 1397.1 (1457.9) MB, 1522.7 / 0 ms [allocation failure] [GC in old space requested].
100059 ms: Mark-sweep 1397.1 (1457.9) -> 1397.0 (1457.9) MB, 1536.9 / 0 ms [allocation failure] [GC in old space requested].
101579 ms: Mark-sweep 1397.0 (1457.9) -> 1397.0 (1457.9) MB, 1519.9 / 0 ms [last resort gc].
103097 ms: Mark-sweep 1397.0 (1457.9) -> 1397.0 (1457.9) MB, 1517.9 / 0 ms [last resort gc].
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x35cc9bbb4629 <JS Object>
2: format [/xxxx/node_modules/mysql/node_modules/sqlstring/lib/SqlString.js:~73] [pc=0x6991adfdf6f] (this=0x349863632099 <an Object with map 0x209c9c99fbd1>,sql=0x2dca2e10a4c9 <String[84]: Insert into rent_66 (sample_id,sample_time, data_1,data_2,data_3) values ? >,values=0x356da3596b9 <JS Array[1]>,stringifyObjects=0x35cc9bb04251 <false>,timeZone=0x303eff...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
Following is the code for my server:
var app = require('express')();
var http = require('http').Server(app);
var io = require('socket.io')(http);
var mysql = require('mysql');

var conn = mysql.createConnection({
    host: '<host>',
    user: '<user>',
    password: '<password>',
    database: '<db>',
    debug: false,
});
conn.connect();

io.on('connection', function (socket) {
    console.log('connection');
    var finalArray = [];
    socket.on('data_to_save', function (from, msg) {
        var str_arr = msg.split("\n");
        var id = str_arr[1];
        var timestamp = str_arr[0];
        var data = str_arr.splice(2);
        finalArray = [];
        var dataPoint = [];
        data.forEach(function (value) {
            dataPoint = value.split(",");
            if (dataPoint[0] != '') {
                finalArray.push([dataPoint[0], 1, dataPoint[1], dataPoint[2], dataPoint[3]]);
                finalArray.push([dataPoint[0], 1, dataPoint[4], dataPoint[5], dataPoint[5]]);
            }
        });
        var sql = "Insert into rent_" + id + " (sample_id,sample_time, channel_1,channel_2,channel_3) values ? ";
        var query = conn.query(sql, [finalArray], function (err, result) {
            if (err)
                console.log(err);
            else
                console.log(result);
        });
        conn.commit();
        console.log('MSG from ' + str_arr[1] + ' ' + str_arr[0]);
    });
});

http.listen(9000, function () {
    console.log('listening on *:9000');
});
I was able to get the server to handle 100 concurrent connections after which I started receiving process out of memory exceptions. Before the database inserts were introduced, the server would simply store the csv as a file on disk. With that set up the server was able to handle 1200+ concurrent connections.
Based on the information available on the internet, it looks like the database insert query (which is asynchronous) holds the 400-row array in memory until the insert goes through. As a result, as the number of clients grows, the memory footprint of the server increases, eventually running out of memory.
I did go through many suggestions made on the internet regarding --max_old_space_size, but I am not sure that this is a long-term solution. Also, I am not sure on what basis I should decide the value to set here.
Also, I have gone through suggestions which talk about async utility module. However, inserting data serially may introduce a huge delay between the time when client inserts data and when the server saves this data to the database.
I have gone in circles around this problem many times. Is there a way the server can handle information coming from 1000+ concurrent clients and save that data into a MySQL database with minimum latency? I have hit a roadblock here, and any help in this direction is highly appreciated.
I'll summarize my comments since they sent you on the correct path to address your issue.
First, you have to establish whether the issue is caused by your database or not. The simplest way to do that is to comment out the database portion and see how high you can scale. If you get into the thousands without a memory or CPU issue, then your focus can shift to figuring out why adding the database code into the mix causes the problem.
Assuming the issue is caused by your database, you need to start understanding how it handles things when there are lots of active database requests. Oftentimes, the first thing to use with a busy database is connection pooling; see the sketch after this list. This gives you three main things that can help with scale.
It gives you fast reuse of previously opened connections so you don't have every single operation creating its own connection and then closing it.
It lets you specify the max number of simultaneous database connections in the pool you want at the same time (controlling the max load you throw at the database and also probably limiting the max amount of memory it will use). Connections beyond that limit will be queued (which is usually what you want in high load situations so you don't overwhelm the resources you have).
It makes it easier to see if you have a connection leak problem: rather than just leaking connections until you run out of some resource, the pool will quickly be exhausted in testing and your server will not be able to process any more transactions (so you are much more likely to catch the problem in testing).
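As a rough sketch with the same mysql package used above (the connectionLimit of 20 is just an illustrative starting point, not a recommendation):
var mysql = require('mysql');

// A pool reuses connections and queues requests beyond connectionLimit
// instead of letting work pile up in memory.
var pool = mysql.createPool({
    connectionLimit: 20,
    host: '<host>',
    user: '<user>',
    password: '<password>',
    database: '<db>'
});

// pool.query() checks a connection out, runs the insert and returns the connection to the pool.
function saveSamples(id, finalArray, callback) {
    var sql = "Insert into rent_" + id + " (sample_id,sample_time, channel_1,channel_2,channel_3) values ? ";
    pool.query(sql, [finalArray], callback);
}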
Then, you probably also want to look at the transaction times for your database connections to see how fast they can handle any given transaction. You know how many transactions/sec you are trying to process so you need to see if your database and the way it's configured and resourced (memory, CPU, speed of disk, etc...) is capable of keeping up with the load you want to throw at it.
You should increase the default memory limit (512 MB) by using the command below:
node --max-old-space-size=1024 index.js
This increases the limit to 1 GB. You can use the same flag to increase the memory limit further.

Weblogic clustering configuration

I am developing an application with JDeveloper 11.1.1.6.0. I have a problem with my client application when I try to connect to a WebLogic server in a cluster from within my application. A certain service that I would like to call runs on this server.
The situation is as follows:
There is a WebLogic instance whose configuration I cannot change at the moment. The WebLogic instance has the following servers and clusters:
Admin server AS - (runs on Machine M1) URL: A, port: 1 - URL for connection t3://A:1
Cluster C containing:
Server S1 - (runs on Machine M1) URL: A, port: 2 - uses Database D1 - URL for connection t3://A:2
Server S2 - (runs on Machine M2) URL: B, port: 1 - uses Database D2 - URL for connection t3://B:1
Server S3 - (runs on Machine M2) URL: B, port: 2 - uses Database D2 - URL for connection t3://B:2
I am trying to connect to t3://A:2 and not to the cluster or any of the other two servers. However, it works only every third time, maybe because of the three servers within the cluster. The cluster uses unicast for messaging and round-robin-affinity for load balancing.
I am trying to find out what causes this. Can I change something in the configuration of the WebLogic instance where my client application runs (integrated or standalone)? Or must the configuration of the instance with the server cluster be changed?
Thank you in advance!
Best Regards
(23.05.2013)
EDIT:
We use a plain JNDI-Lookup to access an EJB on the remote server in the described scenario.
Context ctx = new InitialContext();
Object o = ctx.lookup(...)
...
jndi.properties:
java.naming.provider.url=t3://A:2
java.naming.factory.initial=weblogic.jndi.WLInitialContextFactory
It seems to be possible to send the JNDI request to the right server by setting the property PIN_TO_PRIMARY_SERVER. Yet subsequent EJB requests are still routed to the whole cluster using round robin...
Can we do something on the client side to change this behavior so it always addresses the specific server with the URL t3://A:2?
I had a similar problem, and after trying to change the InitialContext environment properties, I found that I had very little luck. Instead I had to alter the weblogic-ejb-jar.xml for my stateless session bean.
String destination = "t3://node-alpha:2010";
Hashtable<String, String> env = new Hashtable<String, String>();
env.put( Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
env.put( Context.PROVIDER_URL, destination );
// env.put( weblogic.jndi.WLContext.ENABLE_SERVER_AFFINITY, "true" );
// env.put( weblogic.jndi.WLContext.PIN_TO_PRIMARY_SERVER, "true" );
InitialContext ctx = new InitialContext( env );
EJBHome home = (EJBHome) ctx.lookup( JNDI_REMOTE_SYSTEM_SF );
sf = SomeSf.class.cast( home.getClass().getMethod( "create" ).invoke( home ) );
// Check that we are hitting the right server node.
System.out.println( destination + " => " + sf );
Once you start a transaction, you shouldn't change servers, so I would create a stateless bean to receive the targeted calls and from there begin the work you intend to do. You can set a stateless bean as not clusterable in the weblogic-ejb-jar.xml. You actually need to set both items listed below.
<home-is-clusterable>False</home-is-clusterable>
<stateless-bean-is-clusterable>False</stateless-bean-is-clusterable>
What this means is that when you get a reference through the initial context, the targeted server will give you an instance of the stateless bean on that particular cluster node.
Using the servers
node-alpha:2010
node-alpha:2011
node-beta:3010
node-beta:3011
With home-is-clusterable & stateless-bean-is-clusterable set to true
Here the first entry is the server it is targeting, then the rest are for fail-over and/or the load balancing (e.g. round robin).
ClusterableRemoteRef(
3980825488277365621S:node-alpha:[2010,2010,-1,-1,-1,-1,-1]:MyDomain:node-alpha
[
3980825488277365621S:node-alpha:[2010,2010,-1,-1,-1,-1,-1]:MyDomain:node-alpha/338,
4236365235325235233S:node-alpha:[2011,2011,-1,-1,-1,-1,-1]:MyDomain:node-alpha/341,
1321244352376322432S:node-beta:[3010,3010,-1,-1,-1,-1,-1]:MyDomain:node-beta/342,
4317823667154133654S:node-beta:[3011,3011,-1,-1,-1,-1,-1]:MyDomain:node-beta/345
]
)/338
With home-is-clusterable & stateless-bean-is-clusterable set to false
weblogic.rmi.internal.BasicRemoteRef - hostID: '-3980825488277365621S:node-alpha:[2010,2010,-1,-1,-1,-1,-1]:MyDomain:node-alpha', oid: '336', channel: 'null'
weblogic-ejb-jar.xml example below.
<weblogic-ejb-jar>
    <weblogic-enterprise-bean>
        <ejb-name>SomeSf</ejb-name>
        <stateless-session-descriptor>
            <pool>
                <max-beans-in-free-pool>42</max-beans-in-free-pool>
            </pool>
            <stateless-clustering>
                <home-is-clusterable>false</home-is-clusterable>
                <stateless-bean-is-clusterable>false</stateless-bean-is-clusterable>
                <stateless-bean-methods-are-idempotent>true</stateless-bean-methods-are-idempotent>
            </stateless-clustering>
        </stateless-session-descriptor>
        <transaction-descriptor>
            <trans-timeout-seconds>20</trans-timeout-seconds>
        </transaction-descriptor>
        <enable-call-by-reference>true</enable-call-by-reference>
        <jndi-name>SomeSf</jndi-name>
    </weblogic-enterprise-bean>
</weblogic-ejb-jar>