I'm trying to extract various landmarks from the OSM planet-latest PBF file.
For testing purposes I'm currently trying to extract schools, but the command is taking way too long; it's already been running for 20 minutes. The following is my osmosis command on my Linux machine:
osmosis/bin/osmosis --rbf planet-latest.osm.pbf --nkv keyValueList="amenity.college" --wx ssxschools.osm
Can someone please tell me if I'm doing anything wrong, or does it usually take this long? If so, what can I do to speed up extracting the data?
By the way, I'm using an AWS r4.2xlarge machine running Linux with 8 vCPUs and 61 GiB of memory. I thought renting a machine with a good amount of memory would help.
The planet file is huge and compressed; I wouldn't expect this command to finish within 20 minutes. However, you can give osmium a try, as it is said to be faster than osmosis.
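For example, with osmium-tool (assuming a reasonably recent version is installed), a tag filter along these lines should extract schools from the planet file considerably faster; treat it as a sketch and adjust the tag (e.g. amenity=college) to whatever you actually need:

    osmium tags-filter planet-latest.osm.pbf nwr/amenity=school -o schools.osm.pbf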
I want to run a machine learning algorithm as my end goal: research code that is thus far unproven and unpublished, for text mining purposes. The text is already obtained, but was scraped from WARC files obtained from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the analysis tasks that's desirable is IDF (inverse document frequency) analysis of the corpus prior to launching into the ML application proper.
It's my understanding that for IDF to work, each file should represent one speaker or one idea, generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, such that each comment from the social media in question falls into an element of an array (and in a more strongly typed language would be of type string).
From here I'm experiencing performance issues. I've let my script run all day and it's only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker, of HTML-stripped text produced with Perl's HTML::Strip module. As I look at my system, I see that disk utilization on my local data drive is very high: there's a tremendous number of ASCII text writes, much smaller than 1 KB, each of which is being crammed into a 1 KB sector of my local NTFS-formatted HDD.
Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system with a text field that is perhaps 500-1000 characters in maximum length, then rerun the Perl script so that it slurps an input HTML file, splits it, HTML-strips it, then prepares and executes a string insert against a database table?
In general, will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive, and faster to write out in the long run, thanks to some caching or RAM/disk-space utilization magic in the DBMS?
A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending also on the OS and file system you are using. In particular, different file systems differ significantly in how access times scale with the number of files within one directory. See, e.g., “NTFS performance and large volumes of files and directories” and “How do you deal with lots of small files?”: “NTFS performance severely degrades after 10,000 files in a directory.”
You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
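As a rough illustration of what that buys you, here is a minimal sketch (in Go rather than Perl, purely to show the shape of the approach; the file name corpus.db, the comments table, and the sample strings are all made up) of batching many small records into a single SQLite file inside one transaction:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "corpus.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // One table instead of millions of tiny files; names here are invented.
        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS comments (
            id INTEGER PRIMARY KEY,
            source_file TEXT,
            speaker_text TEXT)`); err != nil {
            log.Fatal(err)
        }

        // One transaction per batch of input files: the disk sees a few large,
        // sequential writes instead of one small file per speaker.
        tx, err := db.Begin()
        if err != nil {
            log.Fatal(err)
        }
        stmt, err := tx.Prepare("INSERT INTO comments(source_file, speaker_text) VALUES (?, ?)")
        if err != nil {
            log.Fatal(err)
        }
        defer stmt.Close()

        // In the real pipeline these strings would come from splitting and
        // HTML-stripping each scraped input file.
        for _, text := range []string{"first stripped comment", "second stripped comment"} {
            if _, err := stmt.Exec("example_input.html", text); err != nil {
                log.Fatal(err)
            }
        }
        if err := tx.Commit(); err != nil {
            log.Fatal(err)
        }
    }

The same pattern applies with Perl's DBI and DBD::SQLite: prepare once, execute per record, and commit per batch.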
On the other hand, 2 million records are not that much, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and use a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may slash the processing time quite noticeably.
Wow!
A few years ago we had massive performance problems with a popular CMS. Plain pages mostly performed fine, but things went downhill once additional inline content came into play as well.
So I wrote some ugly code to find the fastest approach. Note that your available resources set different limits!
1st) I invested the time up front to establish a directly addressable layout: everything gets its own set of flat files.
2nd) I created a RAM disk. Make sure you have enough of it for your project!
3rd) For backups I used rsync, and for redundancy I compressed/extracted the RAM disk contents to/from a tar.gz.
In practice this approach is the fastest. Converting timestamps and generating recursive folder structures is very simple, and so are read, write, replace, and delete operations.
The final release ended up with processing times of:
PHP/MySQL: > 5 sec
Perl/HDD: ~ 1.2 sec
Perl/RAM disk: ~ 0.001 sec
Looking at what you are doing there, this construct may be usable for you. I don't know the internals of your project.
The hard disk will live much longer, and your workflow can be optimized through direct addressing. It is accessible from other stages; that is to say, you can work on that base from other scripts too, for example data processing in R, a notifier from the shell, or anything else...
Buffering layers such as MySQL are no longer needed, and your CPU no longer loops on no-ops.
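For what it's worth, the RAM disk part of this on a Linux box looks roughly like the following (paths and sizes are placeholders; on the asker's Windows 7 setup a third-party RAM disk driver would play the same role):

    # create and mount a 2 GB RAM disk (tmpfs)
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk

    # periodically back up the volatile RAM disk contents to persistent storage
    tar czf /backup/ramdisk-$(date +%F).tar.gz -C /mnt/ramdisk .

    # restore a backup after a reboot
    tar xzf /backup/ramdisk-<date>.tar.gz -C /mnt/ramdisk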
I was hoping for some guidance in this respect. I have the same set of ASP VBScript code running on two separate machines. The new machine has a better CPU (and more cores), 10 times the amount of RAM and an SSD hard drive (whereas the original was a standard Western Digital drive). Both machines have the same OS running IIS and MySQL.
However, where the initial machine (which is 5 years old) will complete the processing of 200 files, with multiple file reads, multiple database deletes, inserts and selects, in under 2 hours, the second (far gutsier) machine takes 5 hours. Both machines are running identical MySQL, IIS, Python and ASP code.
The CPU on the new machine is not burdened (sits idle at 2%), the RAM is not over utilized in any way (under 10% utilization). The code runs in series and doesn't run in parallel threads.
I was hoping for some guidance as to where to investigate the cause, without having to rewrite multiple lines of code for better efficiency just to squeeze out a second here and a second there. The resources on the new machine are not being utilized at all, but I am feeling rather lost in the dark as to how to navigate this, and hours of Google searches haven't yielded many positive solutions.
So after 3 months of hard work developing and switching the company API from PHP to Go, I found out that our Go server can't handle more than 20 requests/second.
So basically how our API works:
takes in a request
validates the request
fetches the data from the DB using MySQL
puts the data in a map
sends it back to the client in JSON format
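For illustration only, a handler following that flow might look roughly like the sketch below; the route, DSN, table, and column names are placeholders, not our actual code:

    package main

    import (
        "database/sql"
        "encoding/json"
        "log"
        "net/http"

        _ "github.com/go-sql-driver/mysql"
        "github.com/gorilla/mux"
    )

    var db *sql.DB

    // salesReport mirrors the flow above: validate -> query MySQL -> build maps -> respond with JSON.
    func salesReport(w http.ResponseWriter, r *http.Request) {
        from, to := r.URL.Query().Get("from"), r.URL.Query().Get("to")
        if from == "" || to == "" { // validate the request
            http.Error(w, "missing from/to parameters", http.StatusBadRequest)
            return
        }

        rows, err := db.Query("SELECT id, amount FROM sales WHERE day BETWEEN ? AND ?", from, to)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        defer rows.Close()

        var report []map[string]interface{}
        for rows.Next() {
            var id int64
            var amount float64
            if err := rows.Scan(&id, &amount); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            report = append(report, map[string]interface{}{"id": id, "amount": amount})
        }

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(report)
    }

    func main() {
        var err error
        // unix-socket DSN; user, password, socket path and schema are placeholders
        db, err = sql.Open("mysql", "user:password@unix(/var/run/mysqld/mysqld.sock)/sales_db")
        if err != nil {
            log.Fatal(err)
        }
        db.SetMaxIdleConns(1000)
        db.SetMaxOpenConns(1000)

        m := mux.NewRouter()
        m.HandleFunc("/sales/report", salesReport)
        log.Fatal(http.ListenAndServe(":8000", m))
    }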
So after writing about 30 APIs I decided to take it for a spin and see how it performs under a load test.
Test 1: ab -n 1 -c 1 http://localhost:8000/sales/report. The result is "Time per request: 72.623 [ms] (mean)".
Test 2: ab -n 100 -c 100 http://localhost:8000/sales/report. The result is "Time per request: 4548.155 [ms] (mean)" (no MySQL errors).
How did the number suddenly spike from 72.623 ms to 4548 ms in the second test? We expect thousands of requests per day, so I need to solve this issue before we finally release it. I was surprised when I saw the numbers; I couldn't believe it. I know Go can do much better.
So basic info about the server and settings:
Using Go 1.5
16GB RAM
GOMAXPROCS is using all 8 Cores
db.SetMaxIdleConns(1000)
db.SetMaxOpenConns(1000) (also made sure we are using a pool of connections)
Connecting to MySQL through a Unix socket
System is running under Ubuntu
External libraries that we are using:
github.com/go-sql-driver/mysql
github.com/gorilla/mux
github.com/elgs/gosqljson
Any ideas what might be causing this? I took a look at this post, but it didn't help; as I mentioned above, I never got any MySQL error. Thank you in advance for any help you can provide.
Your post doesn't have enough information to address why your program is not performing how you expect, but I think this question alone is worth an answer:
How did the number suddenly spike from 72.623 to 4548 ms in the second test?
In your first test, you did one single request (-n 1). In your second test, you did 100 requests in flight simultaneously (-c 100 -n 100).
You mention that your program communicates with an external database; your program has to wait for that resource to respond. Do you understand how your database performs when you send it 1,000 requests simultaneously? You made no mention of this. Go can certainly handle many hundreds of concurrent requests a second without breaking a sweat, but it depends on what you're doing and how you're doing it. If your program can't complete requests as fast as they are coming in, they pile up, leading to high latency.
Neither of the tests you told us about is useful for understanding how your server performs under "normal" circumstances, which you said would be "thousands of requests per day" (that isn't very specific, but I'll take it to mean "a few per second"). It would be much more interesting to look at -c 4 -n 1000, or something else that exercises the server over a longer period of time, with a number of concurrent requests closer to what you actually expect.
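For example, something along these lines keeps a small, steady level of concurrency going for long enough to be meaningful (the numbers are only illustrative):

    ab -n 1000 -c 4 http://localhost:8000/sales/report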
I'm not familiar with the gosqljson package, but you say your "query by itself is not really complicated" and you're running it against "a well structured DB table," so I'd suggest dropping gosqljson, binding your query results to structs, and then marshalling those structs to JSON. That should be faster and incur less memory thrashing than using a map[string]interface{} for everything.
But I don't think gosqljson could possibly be that slow, so it's likely not the main culprit.
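As a sketch of that suggestion (the SaleRow fields, table, and column names are invented here, since the real schema isn't shown): scan each row straight into a struct and marshal the slice once at the end:

    package report

    import (
        "database/sql"
        "encoding/json"
        "net/http"
    )

    // SaleRow mirrors the columns of the (hypothetical) sales table.
    type SaleRow struct {
        ID     int64   `json:"id"`
        Amount float64 `json:"amount"`
        Region string  `json:"region"`
    }

    // salesReportJSON scans rows directly into structs and marshals the slice once,
    // instead of building a map[string]interface{} per row.
    func salesReportJSON(w http.ResponseWriter, db *sql.DB, from, to string) error {
        rows, err := db.Query("SELECT id, amount, region FROM sales WHERE day BETWEEN ? AND ?", from, to)
        if err != nil {
            return err
        }
        defer rows.Close()

        var report []SaleRow
        for rows.Next() {
            var s SaleRow
            if err := rows.Scan(&s.ID, &s.Amount, &s.Region); err != nil {
                return err
            }
            report = append(report, s)
        }
        if err := rows.Err(); err != nil {
            return err
        }

        w.Header().Set("Content-Type", "application/json")
        return json.NewEncoder(w).Encode(report)
    }

Compared to building a map per row, this avoids a pile of small interface allocations and gives the JSON encoder concrete types to work with.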
The way you're doing your second benchmark is not helpful.
Test 2: ab -n 100 -c 100 http://localhost:8000/sales/report
That's not testing how fast you can handle concurrent requests so much as it's testing how fast you can make connections to MySQL. You're only doing 100 queries and using 100 requests, which means Go is probably spending most of its time making up to 100 connections to MySQL. Go probably doesn't even have time to reuse any of the db connections, considering all the other stuff it's doing to satisfy each query, and then, boom, the test is over. You would need to set the max connections to something like 50 and run 10,000 queries to see how long concurrent requests take once a pool of db connections is already established; right now you're basically testing how long it takes Go to build up a pool of db connections.
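As a sketch of that setup (the numbers are only illustrative), cap the pool well below 1000 when you open the database:

    // instead of SetMaxIdleConns(1000) / SetMaxOpenConns(1000)
    db.SetMaxOpenConns(50)
    db.SetMaxIdleConns(50)

and then drive it with something like ab -n 10000 -c 50 http://localhost:8000/sales/report, so the connection pool is warm for most of the run.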
I've got a query that is running 5x slower on my staging server as opposed to my local dev machine.
Stackoverflow doesn't want to play nicely with the formatting; the query, describes, and explains are located here
Looking at the describe statements, I can't see any difference between the local and remote schemas.
The record counts for the 2 machines are in the same order of magnitude (500k vs 600k)
Edit In Response to Comments
It was my highly unscientific approach of throwing the queries into MySQL Workbench and looking at the query time. The local query time was on the order of 1.3 seconds and the remote query time was on the order of 5.2 seconds (so it's 4x as slow). I'm sure there's a better way to test this query time.
The machines are different. My dev machine is a MacBook Pro with 8 GB of RAM. The staging server is a Linode VPS with 512 MB of RAM. There shouldn't be much load on the staging server (I'm the only one who uses it). I've noticed most queries run in approximately the same time frame on the local machine and the staging server, so I was confused as to why this one had such a drastically different time frame.
RAM Issue
Since a temporary table isn't being used (no mention in the EXPLAINS), is the amount of RAM still an issue?
Output from free
                         total       used       free     shared    buffers     cached
    Mem:                508576     453880      54696          0       4428     254200
    -/+ buffers/cache:             195252     313324
    Swap:               262140      19500     242640
Profiling Added to Gist
It looks like the remote is taking 2.5 seconds 'sending data' whereas the local is only taking 0.5 seconds. Is this an I/O issue? (Complete profiling info in gist)
Your staging server has one sixteenth of the RAM that your MacBook Pro has.
Without knowing how much RAM is available to your two instances of MySQL, it's hard to be definitive, but that's the first place I'd look.
Also, if you run these queries from the MySQL command line, locally, how do the times compare?
It could be that the increase in time is in network transfer and not query processing.
Actually... network transfer time is the first place I'd look... then MySQL memory usage.
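If it helps, one way to time the query from the mysql command line itself, which takes the network and application layer out of the picture, is the profiling feature (a sketch, assuming it's available in your MySQL version):

    SET profiling = 1;
    -- run the query under test here, directly in the mysql client
    SHOW PROFILES;               -- total execution time per query
    SHOW PROFILE FOR QUERY 1;    -- per-stage breakdown ('Sending data', 'Copying to tmp table', ...)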
EDIT following question updates
The 'Sending data' phase is the phase where the server is sending data to the client. I don't know exactly how large your dataset is, but 2.5 s seems pretty high for what's probably 50 kB of data or so.
Having looked at the profiling data, nearly all the time is spent sending data, so I'd strongly suspect the network here.
EDIT 2
Some research led me to this page, which indicates that 'Sending data' is misleading and that it is actually the time spent executing your query.
Thus, I really think you need to be looking at CPU and memory usage on your server since it's specced at a level so much lower than your MacBook.
We have 500+ remote locations. Each location has a linux router which checks in to our management system (homemade using RoR3) every 15 minutes.
We need to log and calculate the mean uptime of each box's Internet connectivity.
Each router posts a request every 15 minutes to a script on the server. (Currently this just records the last checkin time and the uptime.)
If we want to plot the historical uptime of each box, what is the most efficient way to do this without clogging up our DB?
500 boxes checking in every 15 minutes would (according to my calculations: 500 boxes × 96 check-ins/day × 365 days) result in 17,520,000 inserts per year. That's quite a hefty amount of data that I don't think we need.
Could anyone help solve this riddle for us?
Why not take a look at RRDtool (see the Wikipedia entry)? It's just the tool for this kind of situation.
It works as a sort of round-robin, self-averaging database, and it's used in many logging applications for purposes similar to yours.
As an example, take a look at Cacti, which is a data-logging / network-monitoring and graphing front-end app built around RRDtool (implemented in PHP).
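As a hedged sketch of how that could look for this use case (the file name, DS name, and retention windows below are placeholders): one RRD per router, a 15-minute step, about a year of raw samples plus a longer daily-average archive, updated on every check-in:

    # one RRD per router: 15-minute step, ~1 year of raw samples, ~10 years of daily averages
    rrdtool create router042.rrd --step 900 \
        DS:online:GAUGE:1800:0:1 \
        RRA:AVERAGE:0.5:1:35040 \
        RRA:AVERAGE:0.5:96:3650

    # on each 15-minute check-in, record 1 (reachable); missed heartbeats become UNKNOWN
    rrdtool update router042.rrd N:1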