Proper `breakpoint()` hooking #351

goodboy · 2023-03-07T22:14:02Z

We were being super sloppy previously and leaking our override into Python's runtime 😂
This would result in a `NoRuntime` being raised on bp usage after runtime exit..

Again, much of this work needs final polish and refinement before we propose something more formal for python-trio/trio#1155

This repairs that and will eventually come with at least one test to ensure everything works as expected outside trio/tractor, when the actor stack tears down.

Testing todo:

- ensure std `breakpoint()` calls work outside `tractor.open_nursery()`
  - we might want to ensure something something with threads and/or maybe numba?
- add a test suite for use inside an infected asyncio actor:
  - asyncio task uses `breakpoint()` directly, expect normal UX
  - asyncio task crashes and trio side enters debug? (or should the aio task get caught?)
  - cancel actor from parent and ensure that if some asyncio-task has debugger lock, normal teardown blocking is adhered?

These will verify new changes to the runtime/messaging core which allow us to adopt an "ignore cancel if requested by us" style handling of `ContextCancelled`, more like how `trio` does with `trio.Nursery.cancel_scope.cancel()`. We now expect a `ContextCancelled.canceller: tuple` which is set to the uid of the actor which requested the cancellation that eventually resulted in the remote error-msg.

Also adds some experimental tweaks to the "backpressure" test which it turns out is very problematic in coordination with context cancellation, since blocking on the feed mem chan to some task will block the ipc msg loop and thus handling of cancellation.. More to come to both the test and core to address this hopefully since right now this test is failing.

To handle remote cancellation this adds `ContextCancelled.canceller: tuple`, the uid of the cancel-requesting actor, which is expected to be set by the runtime when servicing any remote cancel request. This makes it possible for `ContextCancelled` receivers to know whether "their actor runtime" is the source of the cancellation. Also add an explicit `RemoteActorError.src_actor_uid` which better formalizes the notion of "which remote actor" the error originated from. Both of these new attrs are expected to be packed in the `.msgdata` when the errors are loaded locally.
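As a rough illustration of the two attrs described above, here is a sketch built on stand-in classes, not tractor's real ones. Only `.canceller`, `.src_actor_uid` and `.msgdata` come from the description; the wire-message layout (`type_str`, `message` keys), the `(name, uuid)` shape of a uid, and the helpers `unpack_error()` and `is_self_requested()` are assumptions made up for the example.

```python
from typing import Optional, Tuple

Uid = Tuple[str, str]  # assumption: an actor uid is a (name, uuid) pair


class RemoteActorError(Exception):
    """Stand-in for an error relayed from another actor."""

    def __init__(self, message: str, **msgdata) -> None:
        super().__init__(message)
        # extra fields shipped alongside the error msg
        self.msgdata = msgdata

    @property
    def src_actor_uid(self) -> Optional[Uid]:
        value = self.msgdata.get('src_actor_uid')
        # JSON and msgpack decode arrays as lists by default, so re-tuplize
        return tuple(value) if value is not None else None


class ContextCancelled(RemoteActorError):
    """Stand-in for the "remote task was cancelled" error."""

    @property
    def canceller(self) -> Optional[Uid]:
        value = self.msgdata.get('canceller')
        # only tuplize when a canceller was actually recorded
        return tuple(value) if value is not None else None


def unpack_error(msg: dict) -> RemoteActorError:
    """Rebuild a local exception from a (hypothetical) wire msg."""
    fields = dict(msg['error'])
    type_str = fields.pop('type_str', 'RemoteActorError')
    message = fields.pop('message', 'remote error')
    cls = ContextCancelled if type_str == 'ContextCancelled' else RemoteActorError
    return cls(message, **fields)


def is_self_requested(err: Exception, our_uid: Uid) -> bool:
    """True when `err` is a cancel that our own actor asked for."""
    return isinstance(err, ContextCancelled) and err.canceller == our_uid


wire_msg = {'error': {
    'type_str': 'ContextCancelled',
    'message': 'ctx cancelled',
    'canceller': ['root', 'abc123'],        # arrives as a list
    'src_actor_uid': ['worker', 'def456'],
}}
err = unpack_error(wire_msg)
print(err.canceller, is_self_requested(err, ('root', 'abc123')))
```

A receiver could use a check like `is_self_requested()` to decide between swallowing the error (we asked for the cancel) and re-raising it (someone else did).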

This adds remote cancellation semantics to our `tractor.Context` machinery to more closely match that of `trio.CancelScope` but with operational differences to handle the nature of parallel tasks interoperating across multiple memory boundaries:

- if an actor task cancels some context it has opened via `Context.cancel()`, the remote (scope linked) task will be cancelled using the normal `CancelScope` semantics of `trio`, meaning the remote cancel scope surrounding the far side task is cancelled and `trio.Cancelled`s are expected to be raised in that scope as per normal `trio` operation; in the case where no error is raised in that remote scope, a `ContextCancelled` error is raised inside the runtime machinery and relayed back to the opener/caller side of the context.
- if any actor task cancels a full remote actor runtime using `Portal.cancel_actor()`, the same semantics as above apply except every other remote actor task which also has an open context with the actor which was cancelled will also be sent a `ContextCancelled` **but** with the `.canceller` field set to the uid of the original cancel requesting actor.

This changeset also includes a more "proper" solution to the issue of "allowing overruns" during streaming without attempting to implement any form of IPC streaming backpressure. Implementing task-granularity backpressure cross-process turns out to be more or less impossible without augmenting our streaming protocol (likely at the cost of performance). Further, allowing overruns requires special care since any blocking of the runtime RPC msg loop task can effectively block control msgs such as cancels and stream terminations.

The implementation details per abstraction layer are as follows.

._streaming.Context:

- add a new constructor factory func `mk_context()` which provides a strictly private init-er whilst allowing us to not have to define an `.__init__()` on the type def.
- add public `.cancel_called` and `.cancel_called_remote` properties.
- general rename of what was the internal `._backpressure` var to `._allow_overruns: bool`.
- move the old contents of `Actor._push_result()` into a new `._deliver_msg()` allowing for better encapsulation of per-ctx msg handling.
- always check for received 'error' msgs and process them with the new `_maybe_cancel_and_set_remote_error()` **before** any msg delivery to the local task, thus guaranteeing error and cancellation handling despite any overflow handling.
- add a new `._drain_overflows()` task-method for use with new `._allow_overruns: bool = True` mode.
- add back a `._scope_nursery: trio.Nursery` (allocated in `Portal.open_context()`) whose sole purpose is to spawn a single task which runs the above method; anything else is an error.
- augment `._deliver_msg()` to start a task and run the above method when operating in no overrun mode; the task queues overflow msgs and attempts to send them to the underlying mem chan using a blocking `.send()` call.
- on context exit, any existing "drainer task" will be cancelled and remaining overflow queued msgs are discarded with a warning.
- rename `._error` -> `_remote_error` and set it in a new method `_maybe_cancel_and_set_remote_error()` which is called before processing
- adjust `.result()` to always call `._maybe_raise_remote_err()` at its start such that whenever a `ContextCancelled` arrives we do logic for whether or not to immediately raise that error or ignore it due to the current actor being the one who requested the cancel, by checking the error's `.canceller` field.
- set the default value of `._result` to be `id(Context())` thus avoiding conflict with any `.result()` actually being `False`..

._runtime.Actor:

- augment `.cancel()` and `._cancel_task()` and `.cancel_rpc_tasks()` to take a `requesting_uid: tuple` indicating the source actor of every cancellation request.
- pass the new `Context._allow_overruns` through `.get_context()`
- call the new `Context._deliver_msg()` from `._push_result()` (since the factoring out of that method's contents).

._runtime._invoke:

- `TaskStatus.started()` back a `Context` (unless an error is raised) instead of the cancel scope to make it easy to set/get state on that context for the purposes of cancellation and remote error relay.
- always raise any remote error via `Context._maybe_raise_remote_err()` before doing any `ContextCancelled` logic.
- assign any `Context._cancel_called_remote` set by the `requesting_uid` cancel methods (mentioned above) to the `ContextCancelled.canceller`.

._runtime.process_messages:

- always pass a `requesting_uid: tuple` to `Actor.cancel()` and `._cancel_task` so that any corresponding `ContextCancelled.canceller` can be set inside `._invoke()`.
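The "drainer task" idea described above can be sketched roughly as follows. This uses `asyncio` from the standard library purely as a stand-in for `trio` memory channels so the example runs without extra dependencies; `OverrunBuffer`, `deliver()`, `_drain()` and `close()` are hypothetical names, and this is not the code behind `Context._deliver_msg()` or `._drain_overflows()`. The point it tries to show: the msg-loop side never blocks, and only the spawned drainer does the blocking send.

```python
import asyncio
from collections import deque
from typing import Any, Deque, Optional


class OverrunBuffer:
    """Toy per-context msg feeder with an "allow overruns" mode."""

    def __init__(self, maxsize: int, allow_overruns: bool) -> None:
        self.chan: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.allow_overruns = allow_overruns
        self.overflow: Deque[Any] = deque()
        self._drainer: Optional[asyncio.Task] = None

    def deliver(self, msg: Any) -> None:
        """Called from the msg loop; must never block."""
        # preserve ordering: once anything overflowed, queue behind it
        if not self.overflow:
            try:
                self.chan.put_nowait(msg)
                return
            except asyncio.QueueFull:
                if not self.allow_overruns:
                    raise RuntimeError('stream overrun') from None
        self.overflow.append(msg)
        if self._drainer is None or self._drainer.done():
            loop = asyncio.get_running_loop()
            self._drainer = loop.create_task(self._drain())

    async def _drain(self) -> None:
        """Single background task doing the blocking sends."""
        while self.overflow:
            await self.chan.put(self.overflow[0])
            self.overflow.popleft()

    def close(self) -> int:
        """Cancel the drainer; return how many msgs were discarded."""
        if self._drainer is not None:
            self._drainer.cancel()
        dropped = len(self.overflow)
        self.overflow.clear()
        return dropped


async def demo():
    buf = OverrunBuffer(maxsize=2, allow_overruns=True)
    for i in range(5):          # 3 msgs more than the chan can hold
        buf.deliver(i)
    received = [await buf.chan.get() for _ in range(5)]
    return received, buf.close()


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

With `allow_overruns=False` the same `deliver()` call raises instead of queueing, which is the "overrun is an error" behaviour; a real implementation would presumably also log a warning when `close()` drops msgs.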

This actually caught further runtime bugs so it's gud i tried.. Add overrun-ignore enabled / disabled cases and error catching for all of them. More or less this should cover every possible outcome when it comes to setting `allow_overruns: bool` i hope XD

Previously we were leaking our (pdb++) override into the Python runtime which would always result in a runtime error whenever `breakpoint()` was called outside our runtime, i.e. after exit of the root actor. This explicitly restores any previous hook override (detected during startup) or deletes the hook and restores the environment if none existed prior. Also adds a new WIP debugging example script to ensure breakpointing works as normal after runtime close; this will be added to the test suite.
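A minimal sketch of the save-and-restore pattern described above, using only the stdlib. `scoped_breakpoint_hook()` is a hypothetical helper, not tractor's actual code, and treating the `PYTHONBREAKPOINT` env var as "the environment" being restored is an assumption on my part.

```python
import os
import sys
from contextlib import contextmanager

_ENV_KEY = 'PYTHONBREAKPOINT'


@contextmanager
def scoped_breakpoint_hook(hook, env_value=None):
    """Install `hook` as `sys.breakpointhook` only inside the block."""
    # remember whatever was configured before we touched anything
    prev_hook = sys.breakpointhook
    prev_env = os.environ.get(_ENV_KEY)  # None means "was not set"

    sys.breakpointhook = hook
    if env_value is not None:
        os.environ[_ENV_KEY] = env_value
    try:
        yield
    finally:
        # always undo the override, even on error or cancellation
        sys.breakpointhook = prev_hook
        if prev_env is None:
            os.environ.pop(_ENV_KEY, None)
        else:
            os.environ[_ENV_KEY] = prev_env


calls = []


def recording_hook(*args, **kwargs):
    """Stand-in for a runtime-aware debugger entrypoint."""
    calls.append('hooked')


outer_hook = sys.breakpointhook
with scoped_breakpoint_hook(recording_hook):
    breakpoint()  # dispatches to recording_hook, no REPL is opened

# after the block the interpreter behaves as it did before
assert sys.breakpointhook is outer_hook
```

Note that `breakpoint()` always calls `sys.breakpointhook`, and the env var is only consulted by the default hook (`sys.__breakpointhook__`), so both need to be put back for later `breakpoint()` calls to behave as they did before.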

Only found this by luck more or less (while working on something in a client project) and it turns out we can actually get to (yet another) hang state where SIGINT will be ignored by the root actor on teardown.. I've added all the necessary logic flags to reproduce. We obviously need a follow up bug issue and a test suite to replicate!

It appears as though the following are required based on very light tinkering:

- infected asyncio mode active
- debug mode active
- the `trio` context must breakpoint *before* `.started()`-ing
- the `asyncio` task must **not** error

Turns out you can get a case where you might be opening multiple ctx-streams concurrently and during the context opening phase you block for all contexts to open, but then when you eventually start opening streams some slow-to-start context has caused the others to end up in an overrun state.. so we need to let the caller control whether that's an error ;) This also needs a test!

goodboy · 2023-05-15T13:27:16Z

Replaced by #362

goodboy requested a review from guilledk March 7, 2023 22:14

goodboy changed the title ~~Proper breakpoint hooking~~ Proper breakpoint() hooking Mar 7, 2023

goodboy added the bug, testing, examples, and debugger labels Mar 7, 2023

goodboy mentioned this pull request Mar 28, 2023

infected_asyncio task hang (SIGINT ignored!) on debug mode breakpointing? #354

Open

2 tasks

Just call trio.Process.aclose() directly for now?

24a0623

goodboy mentioned this pull request Apr 16, 2023

Switch to pdbp 🏄🏼 #358

Merged

4 tasks

goodboy force-pushed the proper_breakpoint_hooking branch from 2c51c09 to 915664d Compare April 16, 2023 00:23

goodboy force-pushed the proper_breakpoint_hooking branch from 915664d to 0fe4ce5 Compare May 5, 2023 03:43

goodboy added 19 commits May 14, 2023 19:31

Copy the now deprecated trio.Process.aclose()

f667d16

Move it into our `_spawn.do_hard_kill()` since we do indeed rely on the particular process killing sequence on "soft kill" failure cases.

Switch to pdbp since no one is maintaining pdbpp

d1e9b32

Change over debugger tests to use PROMPT var..

125b96b

Use multiline import for debug mod

104edc0

First try: switch debug machinery over to pdbp B)

43ecaf0

Hide actor nursery exit frame

45b61ff

Yeahh.. maybe sticky off by default is a little better for us XD

18bc9f9

pdbp: turn off line truncating by default, fixes terminal resizing stuff

47ab1dd

TOSQUASH 4759e30: turn it ON i guess? XD

6c78bd7

pdbp: adding typing to config settings vars

79aeaee

Enable Context backpressure by default; avoid startup race-crashes?

38330db

More single doc-strs in discovery mod

2d3ccd5

Tweak context doc str

e418513

Add some log-level method doc-strings

f9d72f2

Change a bunch of log levels to cancel, including any `ContextCancelled` handling

c3a6163

Assign RemoteActorError boxed error type for context cancelleds

6457bac

Log waiter task cancelling msg as cancel-level

9923926

goodboy added 20 commits May 14, 2023 19:37

Augment test cases for callee-returns-result early

a26d1de

Turns out stuff was totally broken in these cases because we're either closing the underlying mem chan too early or not handling the "allow_overruns" mode's cancellation correctly..

Move NoRuntime import inside current_actor() to avoid cycle

46a7d42

Only tuplize .canceller if non-None

bed65b1

Drop backpressure usage from fan out tests

6bb9b08

Flip allocate log msgs to debug

dbd007c

Fix cluster test to use allow_overruns

19357fc

Adjust aio test for silent cancellation by parent

f60ea22

Set Context._scope_nursery on callee side too

1338cb3

Because obviously we probably want to support `allow_overruns` on the remote callee side as well XD Only found the bugs fixed in this patch thanks to writing a much more exhaustive test set for overrun cases B)

Ignore drainer-task nursery RTE during context exit

5d65a15

Drop caller cancels overrun test; covered in new tests

95c12c3

Move context code into new ._context mod

669360a

Tweak doc string

a8eb962

Add (first-draft) infected-asyncio actor task uses debugger example

c95e440

Some more 3.10+ optional type sigs

bdaad74

Oof, fix remaining Actor.cancel() in Actor._from_parent()

1c1f9e0

goodboy force-pushed the proper_breakpoint_hooking branch from f6dd473 to 021bb38 Compare May 14, 2023 23:37

Add news file

9ca5fdf

goodboy mentioned this pull request May 15, 2023

asyncio (mode) debugger support #362

Draft

8 tasks

goodboy closed this May 15, 2023
