Import millions of lines with EF Core - ef-core-2.1

Many similar questions exist, but I cannot find the answer I want. I am trying to import millions of lines, and each line has information about several EF Core entities (a main one with 50 fields, and 3 related entities with a few fields each).
I have tons of duplicates, which means I always need to check whether the data is already there before inserting, or look it up to get the FK/navigation properties.
It's quite slow: I am processing about 1,000 lines per minute. I have done the few things below, but is there something else obvious I can do?
db.ChangeTracker.QueryTrackingBehavior = QueryTrackingBehavior.NoTracking;
I have a cache Dictionary built from a small table listing all 50 of my users, to avoid querying the database on every line just to get the User navigation property.
Running the release executable outside of Visual Studio improved performance marginally.
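To illustrate, here is a simplified sketch of this kind of import loop (ImportContext, User, MainEntity, MapLine and lines are placeholder names, the duplicate checks are omitted, and the batched SaveChanges is just one possible structure, not necessarily what I have today):

using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

// Load the ~50 users once, instead of querying the database on every line.
Dictionary<string, int> userIds;
using (var db = new ImportContext())
{
    userIds = db.Users.AsNoTracking().ToDictionary(u => u.Name, u => u.Id);
}

var batch = new List<MainEntity>();
foreach (var line in lines)                     // lines: rows already parsed from the file
{
    var entity = MapLine(line);                 // placeholder mapping helper
    entity.UserId = userIds[line.UserName];     // set the FK instead of the navigation property
    batch.Add(entity);
    if (batch.Count >= 1000)
    {
        SaveBatch(batch);
        batch.Clear();
    }
}
SaveBatch(batch);                               // flush the remainder

void SaveBatch(List<MainEntity> entities)
{
    // A fresh, short-lived context per batch keeps the change tracker small.
    using (var db = new ImportContext())
    {
        db.ChangeTracker.AutoDetectChangesEnabled = false;
        db.AddRange(entities);
        db.SaveChanges();
    }
}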

Related

How to increase the speed of vtigercrm Opportunity Module

I am currently developing with vtigercrm 7.1.0 open source. Vtiger uses MySQL for its database, and the CRM is split into multiple modules. The Opportunity module is the heart of the CRM and contains most of the system's fields. As I have been working on the system and adding more fields to the Opportunity module, it has been getting progressively slower. I now have over 500 fields in the Opportunity module. Each time I add a new field, it creates a new column in the MySQL table vtiger_potentialscf.
If I run select * from vtiger_potentialscf it takes around 10 seconds to finish the query, which has a detrimental effect on the end user, who has to wait around 13 seconds for the webpage to load. I have read up on MySQL, and it doesn't like tables with too many columns.
I have been working on this system for months now, but I feel that if I can't find a way to improve the speed I will have to look for an alternative CRM system. Does anyone have any helpful suggestions to improve the speed?
There are multiple things you should check:
Do you really need 500 fields for a module? Can they be moved to other modules or merged in some way to reduce the field count?
Run an EXPLAIN on the query and see why it is taking that long.
Add indexes on the cf table - usually the potentialid column should be the index, but it is possible that it is fragmented. You can try defragmenting the table.
Try moving some of the fields to the main vtiger_potential table or even a third vtiger_potentialcf1 table to split the data further.
Try altering the MySQL configuration for optimal performance. There are multiple guides available on the internet (even Stack Overflow has some).
Let me know how it goes for you.

Performance Issues with Include in Entity Framework

I am working on a large application being developed using the Repository pattern, Web APIs, and AngularJS. In one scenario, I am trying to retrieve data for a single lead which has relations with approximately 20 tables. Lazy loading is disabled, so I am using Include to get the data from all 20 tables. Here is where the performance issue comes in: retrieving a single record takes approximately 15 seconds, which is a huge problem. I am returning JSON, and my entities are decorated with DataContract(IsReference = true) / DataMember attributes.
Any suggestions will be highly appreciated.
Include is really nasty for performance because of how it joins.
See more info in my blog post here http://mikee.se/Archive.aspx/Details/entity_framework_pitfalls,_include_20140101
To summarize the problem a bit: it's because EF handles Include by joining. This creates a result set where every row includes every column of every joined entity (some containing null values).
This is even nastier if the root entity contains large fields (like a long text or a binary), because those get repeated on every row.
15s is way too much though. I suspect something more is at play like missing indexes.
To summarize the solutions: my suggestion is normally that you load every relation separately or in a multi-query. A simple query like that should take 5-30 ms per entity depending on your setup. In this case it would still be quite slow (~1 s if you are querying on indexes). Maybe you need to look at some way to store this data in a better format if this query is run often (cache, document, JSON in the db). I can't help you with that, though; I would need far more information, as the update paths affect the possibilities a lot.
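For example, loading the relations separately looks something like this (made-up Lead/Contacts/Notes/Activities names, to be adjusted to the real model; each query then becomes a narrow, indexed lookup on the foreign key):

// Instead of one query with 20 Includes, issue one small query per relation.
var lead = context.Leads.AsNoTracking().Single(l => l.Id == leadId);

lead.Contacts   = context.Contacts.AsNoTracking().Where(c => c.LeadId == leadId).ToList();
lead.Notes      = context.Notes.AsNoTracking().Where(n => n.LeadId == leadId).ToList();
lead.Activities = context.Activities.AsNoTracking().Where(a => a.LeadId == leadId).ToList();
// ...and so on for the remaining relations, or batch them with a library that
// supports "future"/multi-queries so they go to the database in one round trip.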
The performance was improved by enabling lazy loading.

Efficient way to find 200,000 product names in 20 million articles?

We have two (MySQL) databases, one with about 200,000 products (like "Samsung Galaxy S4", db size 200 MB) and one with about 10 million articles (plain text, db size 20 GB) which can contain zero, one, or many of the product names from the product database. Now we want to find the product names in the article texts and store them as facets of the articles while indexing them in Elasticsearch. Using regular expressions to find the products is pretty slow. We looked at Apache OpenNLP and the Stanford Named Entity Recognizer; for both we would have to train our own models, and there are some projects on GitHub for integrating those NER tools into Elasticsearch, but they don't seem to be ready for production use.
Products and articles are added every day, so we have to run a complete recognition every day. Is NER the way to go? Or any other ideas? We don't have to understand the grammar etc. of the text; we only have to find the product name strings as fast as possible. We can't do the calculation in real time because that's way too slow, so we have to pre-calculate the connections between articles and products and store them as facets, so we can query them pretty fast in our application.
So what's your recommendation to find so many product names in so many articles?
One of the issues you'll run into the most is consistency... new articles and new product names are always coming in and you'll have an "eventual consistency" problem. So there are three approaches that come to mind that I have used to tackle this kind of problem.
As suggested, use full-text search in MySQL: basically create a loop over your products table, and for each product name do a MATCH AGAINST query and insert the product key and article key into a tie table (a rough sketch of this is below). This is fast; I used to run a system in SQL Server with over 90,000 items being searched against 1B sentences. If you had a multithreaded Java program that chunked up the categories and executed the full-text query, you might be surprised how fast this can be. Also, this can hammer your DB server.
Use regex. Put all the products in a collection in memory, and do a regex find with that list against every document. This CAN be fast if you have your docs in something like Hadoop, where it can be parallelized. You could run the job at night and have it populate a MySQL table. This approach means you will have to start storing your docs in HDFS or some NoSQL solution, or import from MySQL to Hadoop daily, and so on.
You can try doing it "at index time", so that when a record is indexed in Elasticsearch the extraction happens then and your facets are built. I have only used Solr for stuff like this. The problem here is that when you add new products you will have to process everything in batch again anyway, because the previously indexed docs will not have had the new products extracted from them.
So there may be better options, but the one that scales the furthest (if you can afford the machines) is option 2, the Hadoop job, though this means a big change.
These are just my thoughts, so I hope others come up with more clever ideas.
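To make option 1 concrete, here is a rough sketch of the loop (shown in C#, but the same shape works from any language; the table and column names are made up, products and connectionString are assumed to be loaded/configured elsewhere, and it assumes a FULLTEXT index on articles.body):

using MySql.Data.MySqlClient;

// For each product name, find matching articles via full-text search and
// record the (product, article) pairs in a tie table.
using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    foreach (var product in products)   // e.g. loaded from the products table
    {
        const string sql = @"
            INSERT INTO product_articles (product_id, article_id)
            SELECT @productId, a.id
            FROM articles a
            WHERE MATCH(a.body) AGAINST(@name IN NATURAL LANGUAGE MODE)";

        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@productId", product.Id);
            cmd.Parameters.AddWithValue("@name", product.Name);
            cmd.ExecuteNonQuery();
        }
    }
}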
EDIT:
As for using NER: I have used NER extensively, mainly OpenNLP, and the problem is that what it extracts will not be normalized. To put it another way, it may extract pieces and parts of a product name, and you will be left dealing with things like fuzzy string matching to align the NER results to the table of products. OpenNLP 1.6 trunk has a component called the EntityLinker, which is designed for this type of thing (linking NER results to authoritative databases). Also, NER/NLP will not solve the consistency problem, because every time you change your NER model you will have to reprocess everything.
I'd suggest a preprocessing step: tokenization. If you do it for both the product list and the incoming articles, then you won't need a per-product search: the product list becomes an automaton where each transition is a given token.
That gives us a trie that you'll use to match products against texts; the search will look like this:
products = []
availableNodes = [dictionary.root]            # start at the root of the product trie
for token in text:
    # a new product name may start at every position, so always keep the root
    nextAvailableNodes = [dictionary.root]
    for node in availableNodes:
        childNode = node.getChildren(token)
        if childNode:
            nextAvailableNodes.append(childNode)
            if childNode.productName:         # this node completes a product name
                products.append(childNode.productName)
    availableNodes = nextAvailableNodes
As far as I can tell, this algorithm is quite efficient, and it allows you to fine-tune the node.getChildren() function (e.g. to address capitalization or diacritics issues). Loading the product list as a trie may take some time; in that case you could cache it as a binary file.
This simple method can easily be distributed using Hadoop or another MapReduce approach, either over the texts or over the product list; see for instance this article (but you'll probably need more recent / accurate ones).

Apache Camel problems aggregating large (1mil record) CSV files

My questions are: (1) is there a better strategy for solving my problem; (2) is it possible to tweak/improve my solution so that it works and reliably stops splitting the aggregation; and (3), less importantly, how can I debug it more intelligently? Figuring out what the aggregator is doing is difficult, because it only fails on giant batches that are hard to debug precisely because of their size. Answers to any of these would be very useful, most importantly the first two.
I think the problem is that I'm not expressing to Camel correctly that it needs to treat the incoming CSV file as a single lump, and that I don't want the aggregator to stop until all the records have been aggregated.
I'm writing a route to digest a million-line CSV file, split it, aggregate the data on some key primary fields, then write the aggregated records to a table.
Unfortunately the primary key constraints of the table (which also correspond to the aggregation keys) are being violated, implying that the aggregator is not waiting for the whole input to finish.
It works fine for small files of a few thousand records, but at the sizes it will actually face in production (1,000,000 records) it fails.
Firstly it fails with a Java heap memory error on the split after the CSV unmarshal. I fixed this with .streaming(), but that impacts the aggregator, which now 'completes' too early.
to illustrate:
A 1
A 2
B 2
--- aggregator split ---
B 1
A 2
--> A(3),B(2) ... A(2),B(1) = constraint violation because 2 lots of A's etc.
when what I want is A(5),B(3)
With examples of 100, 1,000, etc. records it works correctly, but when it processes 1,000,000 records, which is the real size it needs to handle, the split() first fails with a Java heap space error (OutOfMemoryError).
I felt that simply increasing the heap size would be a short-term fix that just pushes the problem back until the next upper limit of records comes through, so I got around it by using .streaming() on the split.
Unfortunately the aggregator is now being drip-fed the records rather than getting them in one big clump, and it seems to be completing early and starting another aggregation, which violates my primary key constraint.
from("file://inbox")
    .unmarshal().bindy( /* CSV bindy config */ )
    .split().body().streaming()
    .setHeader("X", /* Expression building a string of the primary-key fields */ )
    .aggregate(header("X"), /* ... aggregation strategy ... */ ).completionTimeout(15000)
    // etc.
I think part of the problem is that I'm depending on the streaming split never stalling for longer than a fixed amount of time, which just isn't foolproof - e.g. a system task might reasonably cause a longer pause. Also, every time I increase this timeout, it takes longer and longer to debug and test this stuff.
Probably a better solution would be to read the size of the incoming CSV file and not allow the aggregator to complete until every record has been processed. I have no idea how I'd express this in Camel, however.
Very possibly I just have a fundamental strategy misunderstanding of how I should be approaching / describing this problem. There may be a much better (simpler) approach that I don't know about.
There is also such a large number of records going in that I can't realistically debug them by hand to get an idea of what's happening (and I suspect I'm also breaking the timeout on the aggregator when I do).
You can split the file first on a line-by-line basis, and then convert each line to CSV. Then you can run the splitter in streaming mode, and therefore have low memory consumption and be able to read a file with a million records.
There are some blog links on this page http://camel.apache.org/articles about splitting big files in Camel. They cover XML, but they are relevant to splitting big CSV files as well.

MySQL: What's better for speed, one table with millions of rows or managing multiple tables?

I'm reworking an existing PHP/MySQL/JS/Ajax web app that processes a LARGE number of table rows for users. Here's how the page works currently.
A user uploads a LARGE CSV file. The test one I'm working with has 400,000 rows (each row has 5 columns).
PHP creates a brand-new table for this data and inserts the hundreds of thousands of rows.
The page then sorts / processes / displays this data back to the user in a useful way. Processing includes searching, sorting by date and other fields, and redisplaying them without a huge load time (that's where the JS/Ajax comes in).
My question is: should this app be placing the data into a new table for each upload, or into one large table with an id for each file? I think the original developer was adding separate tables for speed purposes. Speed is very important for this.
Is there a faster way? Is there a better mousetrap? Has anyone ever dealt with this?
Remember, every .csv can contain hundreds of thousands of rows, and hundreds of .csv files can be uploaded daily, though they can be deleted about 24 hours after they were last used (I'm thinking a cron job - any opinions?).
Thank you all!
A few notes based on comments:
All data is unique to each user and changes, so the user won't be re-accessing this data after a couple of hours. Only if they accidentally close the window and then come right back would they really revisit the same .csv.
No foreign keys are required; all CSVs are private to each user and don't need to be cross-referenced.
I would shy away from putting all the data into a single table for the simple reason that you cannot change the data structure.
Since the data is being deleted anyway and you don't have a requirement to combine data from different loads, there isn't an obvious reason for putting the data into a single table. The other argument is that the application now works. Do you really want to discover some requirement down the road that implies separate tables after you've done the work?
If you do decide on a single table, then use table partitioning. Since each user is using their own data, you can use partitions to separate each user's load into a separate partition. Although there are limits on partitions (such as no foreign keys), this will make accessing the data in a single table as fast as accessing the original separate tables.
Given 10^5 rows per CSV and 10^2 CSVs per day, you're looking at around 10^7 (10 million) rows per day (and you say you'll clear that data down regularly). That doesn't look like a scary figure for a decent db (especially given that you can index within tables, and not across multiple tables).
Obviously the most regularly used CSVs could very easily be held in memory for speed of access - perhaps even all of them (a very simple calculation based on next to no data gives me a figure of about 1 GB if you flush everything more than 24 hours old; 1 GB is not an unreasonable amount of memory these days).
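For what it's worth, the back-of-the-envelope version of that estimate (the ~100 bytes per 5-column row is an assumption, not a measured figure):

\[
10^{5}\ \text{rows/CSV} \times 10^{2}\ \text{CSVs/day} = 10^{7}\ \text{rows/day}
\]
\[
10^{7}\ \text{rows} \times 100\ \text{bytes/row} \approx 10^{9}\ \text{bytes} \approx 1\ \text{GB}
\]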