Universal async framework #36

Open · mmomtchev opened this issue Nov 12, 2020 · 24 comments
@mmomtchev (Contributor) commented Nov 12, 2020

I have started working on a universal async call framework so that all the bindings can easily be converted to async with minimal code modifications.

Here is the current version:

The NAN_METHOD define of the function is changed like this:

from NAN_METHOD(Driver::open) it becomes GDAL_ASYNCABLE_DEFINE(Driver::open)
This will automatically create two methods: Driver::open, the sync version, and Driver::openAsync, the async version, plus a third hidden method, Driver::open_do, which will contain all the code and will be called with async=true|false
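
For illustration, such a define could expand to something along these lines - a hypothetical sketch with assumed names, not the actual framework header:

// Hypothetical expansion, for illustration only; the ## token pasting
// turns Driver::open into Driver::openAsync and Driver::open_do
#define GDAL_ASYNCABLE_DEFINE(method)                                          \
  NAN_METHOD(method) { method##_do(info, false); }        /* sync version */   \
  NAN_METHOD(method##Async) { method##_do(info, true); }  /* async version */  \
  void method##_do(const Nan::FunctionCallbackInfo<v8::Value> &info, bool async)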

Then only the end part of the function has to be modified, like this:

Calling GDAL is encapsulated in an asyncable lambda - one that captures everything by value and holds no references to automatic variables - for example:

GDAL_ASYNCABLE_MAIN(GDALDataset *) = [raw, filename, x_size, y_size, n_bands, type, options]() {
  GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
  delete options;
  if (ds == nullptr)
    throw "Error creating dataset";
  return ds;
};

The return value generation is encapsulated too

GDAL_ASYNCABLE_RVAL(GDALDataset *) = [](GDALDataset *ds, GDAL_ASYNCABLE_OBJS) { return Dataset::New(ds); };

And finally everything is automagically executed, either synchronously or asynchronously (3 is the position of the callback argument):

GDAL_ASYNCABLE_EXECUTE(3, GDALDataset*);

If the function needs to protect some objects from the GC, a persist interface is provided

GDAL_ASYNCABLE_PERSIST(passed_array, band->handle());

The RVAL lambda can access the persisted objects - sync/async transformations are automatic

GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };
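
Under the hood this presumably boils down to Nan::Persistent handles. A minimal sketch of such a persist container - assumed, illustrative names, not the actual framework internals:

#include <nan.h>

// Pins V8 objects in Nan::Persistent handles so the GC can neither free
// nor move them while the worker thread runs
struct PersistedObjects {
  static const unsigned kMax = 8;            // arbitrary illustrative limit
  Nan::Persistent<v8::Object> handles[kMax];
  unsigned count = 0;

  // Called on the main thread by GDAL_ASYNCABLE_PERSIST
  void persist(v8::Local<v8::Object> obj) { handles[count++].Reset(obj); }

  // Called back on the main thread to rebuild the GDAL_ASYNCABLE_OBJS array
  // for the RVAL lambda; in sync mode the Locals can be passed through as-is
  v8::Local<v8::Object> get(unsigned i) { return Nan::New(handles[i]); }

  ~PersistedObjects() {
    for (unsigned i = 0; i < count; i++) handles[i].Reset();  // release
  }
};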

Here is the full example for the bottom of Driver::create

Before (only sync)

GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
delete options;
if (ds == nullptr) {
  Nan::ThrowError("Error creating dataset");
  return;
}
info.GetReturnValue().Set(Dataset::New(ds));

After (sync and async)

// Main execution block
GDAL_ASYNCABLE_MAIN(GDALDataset *) = [raw, filename, x_size, y_size, n_bands, type, options]() {
  GDALDataset *ds = raw->Create(filename.c_str(), x_size, y_size, n_bands, type, options->get());
  delete options;
  if (ds == nullptr) throw "Error creating dataset";
  return ds;
};

// Generate return value
GDAL_ASYNCABLE_RVAL(GDALDataset *) = [](GDALDataset *ds, GDAL_ASYNCABLE_OBJS) { return Dataset::New(ds); };

// Go
GDAL_ASYNCABLE_EXECUTE(6, GDALDataset *);

Here is the full example for the bottom of RasterBandPixels::read

Before (only sync)

CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
if (err != CE_None) {
  Nan::ThrowError(CPLGetLastErrorMsg());
  return;
}
info.GetReturnValue().Set(obj);

After (sync and async)

// Copy these to local pointers for the asyncable lambda
uv_mutex_t *async_lock = band->async_lock;
GDALRasterBand *gdal_band = band->get();

// These objects must be protected from the GC while the thread runs
GDAL_ASYNCABLE_PERSIST(obj, band->handle());

// This is the main execution block
GDAL_ASYNCABLE_MAIN(CPLErr) =
  [gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space]() {
    uv_mutex_lock(async_lock);
    CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
    uv_mutex_unlock(async_lock);
    if (err != CE_None)
      throw CPLGetLastErrorMsg();
    return err;
  };

// Generate the return value from the persistent objects array -> o[0] is obj from GDAL_ASYNCABLE_PERSIST
// The persistent objects array is automatically transformed from sync to async mode (through Nan::Persistent)
GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };

// Go
GDAL_ASYNCABLE_EXECUTE(9, CPLErr);

Any comments, suggestions or volunteers?

@mmomtchev (Contributor Author)

Just one remark: I don't see any way we could reduce everything to a single lambda, because the lambdas need to run in different execution contexts - there are three separate execution contexts here (see the sketch after this list):

  • in the V8 main thread with the JS world stopped (this is the only context in sync mode)
  • in the aux thread with the JS world running (concurrent with the GC) - V8 objects are not accessible here and can move or be freed
  • back in the V8 main thread with the JS world stopped, for calling the JS callback
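
For reference, a minimal sketch of how these three contexts could map onto Nan::AsyncWorker, assuming the framework is built on it - the class and names are illustrative, not the actual implementation:

#include <nan.h>
#include <functional>

// The MAIN lambda runs in Execute() on the aux thread, the RVAL lambda in
// HandleOKCallback() back on the V8 main thread
template <class T> class GDALAsyncWorker : public Nan::AsyncWorker {
  std::function<T()> doit;                      // GDAL_ASYNCABLE_MAIN
  std::function<v8::Local<v8::Value>(T)> rval;  // GDAL_ASYNCABLE_RVAL
  T result;

public:
  GDALAsyncWorker(Nan::Callback *cb, std::function<T()> doit,
                  std::function<v8::Local<v8::Value>(T)> rval)
    : Nan::AsyncWorker(cb), doit(doit), rval(rval) {}

  // Context 2: aux thread, JS world running - no V8 access allowed here
  void Execute() override {
    try {
      result = doit();
    } catch (const char *msg) { SetErrorMessage(msg); }
  }

  // Context 3: V8 main thread, JS world stopped - safe to create JS objects
  void HandleOKCallback() override {
    Nan::HandleScope scope;
    v8::Local<v8::Value> argv[] = {Nan::Null(), rval(result)};
    callback->Call(2, argv, async_resource);
  }
};

Queueing such a worker with Nan::AsyncQueueWorker() is what hands the second context to the libuv thread pool and leaves the main thread free.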

@mmomtchev (Contributor Author)

UPDATE: I reduced the lambdas; if anyone has any ideas on how to further simplify this, feel free to comment

@mmomtchev (Contributor Author) commented Nov 12, 2020

I am actually hesitating between these two:
(current one)

GDAL_ASYNCABLE_MAIN(CPLErr) =
  [gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space]() {
    uv_mutex_lock(async_lock);
    CPLErr err = gdal_band->RasterIO(GF_Write, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
    uv_mutex_unlock(async_lock);
    if (err != CE_None) throw CPLGetLastErrorMsg();
    return err;
  };
GDAL_ASYNCABLE_RVAL(CPLErr) = [](CPLErr err, GDAL_ASYNCABLE_OBJS o) { return o[0]; };

(another one)

GDAL_ASYNCABLE_DO(gdal_band, async_lock, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space) {
  uv_mutex_lock(async_lock);
  CPLErr err = gdal_band->RasterIO(GF_Read, x, y, w, h, data, buffer_w, buffer_h, type, pixel_space, line_space);
  uv_mutex_unlock(async_lock);
  if (err != CE_None) throw CPLGetLastErrorMsg();
  ret(persistent[0]);
};

I think the second one is better?

UPDATE: this doesn't work in all cases

@mmomtchev (Contributor Author)

No, I am unable to make all the cases work with only one lambda; if anyone has any ideas, I am listening.
It is very important that this is perfect before starting to transform the code.

@yocontra (Owner)

@mmomtchev I like it and think the code ultimately comes out to be pretty clean without reducing to a single lambda.

@mmomtchev (Contributor Author)

Yes, I think I will leave it like this, unless someone comes up with a brilliant idea.
In fact, all the solutions revolve around some form of a second lambda, because there is that final part with JSObject::New that must be done at the end, in the context of the main V8 thread with the JS world stopped.
I will start slowly transforming the code - the vector API is absolutely huge - and unlike the raster API, where the bottleneck is purely I/O, the vector API also has some CPU-intensive functions which will benefit from multi-threading.

@yocontra (Owner)

@mmomtchev My primary use case is https://www.npmjs.com/package/verrazzano, so I'm excited to start benchmarking and see how much of a difference this makes.

@mmomtchev (Contributor Author)

I tried asyncing gdal.LayerFeatures.get for test purposes (a supposedly difficult function); it turned out quite clean:

diff --git a/src/collections/layer_features.cpp b/src/collections/layer_features.cpp
index 852f61d3..83b805a1 100644
--- a/src/collections/layer_features.cpp
+++ b/src/collections/layer_features.cpp
@@ -18,6 +18,7 @@ void LayerFeatures::Initialize(Local<Object> target) {
   Nan::SetPrototypeMethod(lcons, "count", count);
   Nan::SetPrototypeMethod(lcons, "add", add);
   Nan::SetPrototypeMethod(lcons, "get", get);
+  Nan::SetPrototypeMethod(lcons, "getAsync", getAsync);
   Nan::SetPrototypeMethod(lcons, "set", set);
   Nan::SetPrototypeMethod(lcons, "first", first);
   Nan::SetPrototypeMethod(lcons, "next", next);
@@ -91,7 +92,7 @@ NAN_METHOD(LayerFeatures::toString) {
  * @param {Integer} id The feature ID of the feature to read.
  * @return {gdal.Feature}
  */
-NAN_METHOD(LayerFeatures::get) {
+GDAL_ASYNCABLE_DEFINE(LayerFeatures::get) {
   Nan::HandleScope scope;
 
   Local<Object> parent =
@@ -104,9 +105,17 @@ NAN_METHOD(LayerFeatures::get) {
 
   int feature_id;
   NODE_ARG_INT(0, "feature id", feature_id);
-  OGRFeature *feature = layer->get()->GetFeature(feature_id);
-
-  info.GetReturnValue().Set(Feature::New(feature));
+  OGRLayer *gdal_layer = layer->get();
+  uv_mutex_t *async_lock = layer->async_lock;
+  GDAL_ASYNCABLE_PERSIST(parent);
+  GDAL_ASYNCABLE_MAIN(OGRFeature*) = [async_lock, gdal_layer, feature_id]() {
+    uv_mutex_lock(async_lock);
+    OGRFeature *feature = gdal_layer->GetFeature(feature_id);
+    uv_mutex_unlock(async_lock);
+    return feature;
+  };
+  GDAL_ASYNCABLE_RVAL(OGRFeature*) = [](OGRFeature *feature, GDAL_ASYNCABLE_OBJS) { return Feature::New(feature); };
+  GDAL_ASYNCABLE_EXECUTE(1, OGRFeature*);
 }
 
 /**

@mmomtchev (Contributor Author)

Then I tried these 2 simple benchmarks (they are really simple - 90% instrumentation code):
https://gist.github.com/mmomtchev/14c3428255fd9c88ed20517572d8efdc

sync case

open: 2.359s
count: 0.083ms
get: 5.369s
eventLoop didn't start (100% sync code)
gets 662 per-get 8.0672ms

async case

open: 2.261s
count: 0.256ms
get: 5.471s
eventLoopUtilization 0.022852257300014747
gets 662 per-get 4047.8268ms

So, no direct performance gain - this is impossible as long as the operations on the same dataset run sequentially. Hopefully a future version of GDAL will allow this; they have been talking about it and they even have an RFC about removing the big dataset lock.
But while the sync case runs at 100% CPU and doesn't even start the event loop, the async case keeps the event loop only 2.3% utilized 😄
If you were to read another dataset in parallel, it would basically be a free lunch.
How is this possible? It is because GDAL runs on a secondary thread / secondary core of the CPU.

@mmomtchev (Contributor Author)

See the per-get duration - it really explodes in the async case, because the GDAL calls are running in parallel, but they are waiting in line for the async_lock.

@mmomtchev (Contributor Author)

The dataset is a 25MB GeoJSON with all the European administrative borders (from my weather site https://www.meteo.guru)

@yocontra (Owner)

@mmomtchev Yeah, having a second thread is great for the case of having this run on a webserver (we do). If I'm understanding you correctly though, multiple datasets still share a single secondary thread, so parsing multiple files at the same time will still make them block each other - or does each dataset receive its own thread in GDAL?

I think the library I linked will see a major speedup if we're able to thread-pool within a dataset (probably the RFC you're referring to?), if the coordinate transformation (transformTo) and the reading of features/feature properties are async. The goal is to parse the file and stream features out of it; right now it's all happening synchronously (the stream needs 16 items to fill backpressure, so go loop the dataset and emit them), and only one feature is being parsed/transformed at any given time. Theoretically we should be able to kick off all 16 items (or however many are needed to fill backpressure) at the same time, if we have a proper pool and the file format supports reads like that. Do you have a link to that RFC?

@mmomtchev (Contributor Author)

I added a modified benchmark to the same gist that opens 4 datasets on the same file to be able to read with 4 threads. One must pay 4x the open cost to get a marginal improvement in speed - on the order of 20%:

open-0: 2.771s
count-0: 0.079ms
open-1: 2.848s
count-1: 0.056ms
open-2: 2.883s
count-2: 0.052ms
open-3: 2.888s
count-3: 0.037ms
get-0: 4.688s
get-1: 4.600s
get-2: 4.565s
get-3: 4.560s
eventLoopUtilization 0.03760332186211492
gets 662 per-get 3719.6174ms

There is also one severe problem with this approach that is general to Node and cannot be easily solved: the maximum number of threads is limited by UV_THREADPOOL_SIZE, 4 by default. This means that when launching a test that creates 1000 async contexts, Node/libuv will randomly choose 4 of those async contexts to run simultaneously. If they happen to be independent, that's good. If they are waiting on each other, they will just run sequentially, starving out the others and leaving you with no option other than to increase UV_THREADPOOL_SIZE.

I will start pushing the first async vector functions over the weekend to node-gdal-async if someone is interested in playing with them.

@yocontra (Owner)

@mmomtchev I think we should definitely document and recommend raising UV_THREADPOOL_SIZE in the README where we document the async functions.

@mmomtchev (Contributor Author)

@contra multiple datasets should be completely independent, aside from the UV_THREADPOOL_SIZE issue.
The problem is that if you launch 16 contexts on 4 datasets and Node/libuv picks 4 of them, all on the same dataset, to run, then you have 100% sequential execution.
UV_THREADPOOL_SIZE goes up to 1024, the price being about 1MB per thread. The default of 4 is quite low; these days most CPUs have more cores.

@mmomtchev (Contributor Author)

Another solution is to always await, so that you don't launch more than one operation per dataset - knowing that a second one would eat a thread pool slot without doing anything until the first one finishes.

@mmomtchev (Contributor Author)

I think it is stalled (it is quite an undertaking), but I think they are still considering options.
Some drivers individually support internal multi-threading when calling them - something that works independently of Node/libuv async - and it can be controlled via creation/opening options. For example, GeoTIFF supports multi-threaded compression/decompression, as sketched below.
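
A minimal sketch using the plain GDAL C++ API (outside of this binding, with a hypothetical file name) of how such a driver option can be passed when opening:

#include "gdal_priv.h"

int main() {
  GDALAllRegister();
  // GTiff driver open option: use all CPU cores for compression/decompression
  const char *openOptions[] = {"NUM_THREADS=ALL_CPUS", nullptr};
  GDALDataset *ds = static_cast<GDALDataset *>(
    GDALOpenEx("example.tif",                     // hypothetical file
               GDAL_OF_RASTER | GDAL_OF_READONLY,
               nullptr,                           // allow any driver
               openOptions,                       // driver-specific open options
               nullptr));
  if (ds != nullptr) GDALClose(ds);
  return 0;
}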

@mmomtchev (Contributor Author)

For everyone who is interested: a big chunk of the async vector API has reached a usable state and is available at https://github.com/mmomtchev/node-gdal-async
It is compatible with Express-like Node frameworks and can also be used for multi-threading purposes, as it contains lots of CPU-intensive primitives.

@mmomtchev (Contributor Author)

@contra, this will be a huge PR (>5000 lines), with lots of new code for the synchronous API too.
I see that lots of people have checked out my branch; feel free to report any problems with it.
I wonder how this should be handled - maybe release a beta version before landing everything?

@yocontra (Owner) commented Dec 2, 2020

@mmomtchev Yeah, I think we can push/prebuild it as a beta release, then move it to a major bump once it's been tested in the ecosystem for a week. I'll use it in our production ETL system and put it through its paces.

@mmomtchev (Contributor Author)

@contra If this is going to be a 3.0, then I should probably also rewrite the Geometry classes and convert to node-addon-api - these are the two most pressing changes that will also require a major version bump.

I have been reassured by the node-addon-api team that the conversion will be straightforward.

N-API should also come before the switch to GitHub releases, as it will impact the release process - after the switch there will be a single binary per platform that will be compatible with all Node versions.

@yocontra (Owner) commented Dec 3, 2020

@mmomtchev I'm fine doing multiple major releases instead of putting it all into a 3.0 - up to you how you think it should be ordered and released, though.

@mmomtchev (Contributor Author)

I ran the node-addon-api conversion tool and there are a few days of work left before it works again.
It is not a very complex change - it is mostly regexping and a little bit of refactoring - but it is very invasive: it modifies 1 line out of 4.
I don't want to merge it into the async branch, which is already huge by itself, so if you are OK with it, you should probably create a 2.x branch; then I will submit the PR to master so that I can base the N-API migration on it.
Then maybe the releases can be switched over to GitHub releases.
But this means that there won't be any releases on the master branch for at least 2 weeks, until it is stable enough again.
