Framework-independent entrypoint detection (JackEE-style finders) #27

@rahlk

Is your feature request related to a problem? Please describe.

codeanalyzer-python produces a symbol table and call graph but has no
notion of which classes/functions are framework entrypoints — the
methods a web/RPC/CLI/task framework calls into from outside the
application's own call graph. Without this, every downstream consumer
must re-derive it ad hoc.

Concretely, the CLDK Java backend already emits this
(JType.is_entrypoint_class, JCallable.is_entrypoint, populated by
the javaee/ finders), so JavaAnalysis.get_entry_point_classes() /
get_entry_point_methods() are real. The Python façade
(cldk.analysis.python.PythonAnalysis) mirrors the same API surface
but get_entry_point_classes, get_entry_point_methods,
get_service_entry_point_classes, and get_service_entry_point_methods
all raise NotImplementedError purely because the backend doesn't
supply the data. This blocks Java/Python feature parity in CLDK.

The analytical capability is also genuinely missing: entrypoints are the
roots for reachability, dead-code, attack-surface, and call-graph
pruning analyses. A call graph with no identified roots is far less
useful.

Describe the solution you'd like

Port the JackEE (Antoniadis et al., PLDI '20) architecture that the
Java backend already uses at the AST level: abstract framework-
independent concepts (EntrypointClass, EntrypointMethod), plus
per-framework finders that map concrete idioms (decorators, base
classes, external route tables) onto those concepts. CRUD detection
is explicitly out of scope for this issue; it covers entrypoints only.

1. Schema additions in codeanalyzer/schema/py_schema.py (all
defaulted, so existing analysis.json stays loadable):

  • PyClass.is_entrypoint: bool = False
  • PyClass.entrypoint_framework: Optional[str] = None
  • PyCallable.is_entrypoint: bool = False
  • PyCallable.entrypoint_framework: Optional[str] = None
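
A minimal sketch of why defaulted fields keep older files loadable. It assumes, purely for illustration, that the schema classes behave like plain dataclasses; the real models in py_schema.py may be defined differently. The `name` field and the `load_callable` helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class PyCallable:
    # `name` stands in for the existing fields of the real model (assumption).
    name: str
    # The two fields proposed in this issue; both carry defaults.
    is_entrypoint: bool = False
    entrypoint_framework: Optional[str] = None


def load_callable(raw: Dict[str, Any]) -> PyCallable:
    """Hypothetical loader: build a PyCallable from a JSON-decoded dict."""
    known = {f.name for f in fields(PyCallable)}
    return PyCallable(**{k: v for k, v in raw.items() if k in known})


# An entry serialized before this change has no entrypoint keys at all.
old_entry = {"name": "helper"}
new_entry = {"name": "index", "is_entrypoint": True, "entrypoint_framework": "flask"}

old = load_callable(old_entry)
new = load_callable(new_entry)
```

The old entry loads with `is_entrypoint=False` and `entrypoint_framework=None`, so consumers can read the new fields without checking which analyzer version wrote the file.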

2. Abstract layer in codeanalyzer/frameworks/_base.py:

  • ModuleContext — carries per-project routing facts (resolved
    urls.py entries, FastAPI router mounts, Flask blueprint
    registrations) so a finder can answer truthfully for handlers that
    are not decorated at their definition site. This is the piece
    with no Java analog and the main new work.
  • AbstractEntrypointFinder:
    • is_entrypoint_class(class_node, module_ctx) -> bool
    • is_entrypoint_function(func_node, module_ctx) -> bool
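
One possible shape for this layer, as a sketch only. The `routes` field, the `is_routed` helper, the `framework` attribute, and the toy `RoutedFunctionFinder` are assumptions made for illustration; here a "node" is simply a qualified-name string rather than a real AST node.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ModuleContext:
    # qualified handler name -> route metadata from the routing pre-pass.
    # The field name `routes` is an assumption.
    routes: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def is_routed(self, qualified_name: str) -> bool:
        return qualified_name in self.routes


class AbstractEntrypointFinder(ABC):
    # Hypothetical label that would populate entrypoint_framework.
    framework: str = "unknown"

    @abstractmethod
    def is_entrypoint_class(self, class_node: Any, module_ctx: ModuleContext) -> bool:
        ...

    @abstractmethod
    def is_entrypoint_function(self, func_node: Any, module_ctx: ModuleContext) -> bool:
        ...


class RoutedFunctionFinder(AbstractEntrypointFinder):
    """Toy finder: a function is an entrypoint if the pre-pass routed it."""

    framework = "django"

    def is_entrypoint_class(self, class_node: Any, module_ctx: ModuleContext) -> bool:
        return False

    def is_entrypoint_function(self, func_node: Any, module_ctx: ModuleContext) -> bool:
        # In this sketch the "node" is just the function's qualified name.
        return module_ctx.is_routed(func_node)


ctx = ModuleContext(routes={"shop.views.index": {"route": ""}})
finder = RoutedFunctionFinder()
```

The point of passing `module_ctx` into both predicates is that a finder can answer for an undecorated handler without re-reading any routing file itself.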

3. Concrete finders + factory in codeanalyzer/frameworks/
(entrypoint_factory.py runs every finder and ORs results, mirroring
Java's EntrypointsFinderFactory):

Finder          Detection signals
flask.py        @app.route/@bp.get…
fastapi.py      @app…
django.py       CBV bases (View, APIView, ViewSet, …), DRF @api_view/@action, urls.py resolution
tornado.py      RequestHandler subclass
celery.py       @app.task, @shared_task, @periodic_task
aws_lambda.py   def handler(event, context) convention + SAM/serverless template binding
cli.py          Click/Typer @click.command, @app.command
grpc.py         *Servicer subclass
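
A rough sketch of decorator-based finders combined by a factory, covering only the Celery and CLI rows. It uses Python's standard `ast` module as a stand-in for tree-sitter, and matching on the last dotted segment of the decorator name is a deliberate simplification (a real finder would also check what the receiver refers to). `SuffixFinder`, `matching_frameworks`, and `classify` are hypothetical names, and the predicates omit the `module_ctx` argument for brevity.

```python
import ast
from typing import Dict, List, Set


def decorator_suffix(node: ast.expr) -> str:
    """Last dotted segment of a decorator, ignoring any call arguments."""
    if isinstance(node, ast.Call):
        node = node.func
    return ast.unparse(node).rsplit(".", 1)[-1]


class SuffixFinder:
    """Flags a function if any decorator ends in one of `suffixes`."""

    def __init__(self, framework: str, suffixes: Set[str]) -> None:
        self.framework = framework
        self.suffixes = suffixes

    def is_entrypoint_function(self, func: ast.FunctionDef) -> bool:
        return any(decorator_suffix(d) in self.suffixes for d in func.decorator_list)


FINDERS = [
    SuffixFinder("celery", {"task", "shared_task", "periodic_task"}),
    SuffixFinder("cli", {"command"}),
]


def matching_frameworks(func: ast.FunctionDef) -> List[str]:
    # Run every finder; the function is an entrypoint if the list is non-empty.
    return [f.framework for f in FINDERS if f.is_entrypoint_function(func)]


def classify(source: str) -> Dict[str, List[str]]:
    tree = ast.parse(source)
    return {
        node.name: matching_frameworks(node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


SOURCE = '''
@shared_task
def send_email(to): ...

@click.command()
def main(): ...

def helper(): ...
'''

result = classify(SOURCE)
```

Here `result` maps `send_email` to `["celery"]`, `main` to `["cli"]`, and `helper` to an empty list, which corresponds to `is_entrypoint=False`.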

4. Routing pre-pass in codeanalyzer/frameworks/routing/. One pass
per project, consumed by ModuleContext. Emits
{qualified_name → route_metadata}:

  • django_url_resolver.py: walk every urls.py; evaluate
    path()/re_path()/url()/include() chains and .as_view().
  • fastapi_router_resolver.py: app.include_router(router, prefix=...),
    app.mount().
  • flask_blueprint_resolver.py: Blueprint + register_blueprint.
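
To make the Django case concrete, here is a small sketch of a resolver for a single urls.py. It uses the standard `ast` module in place of tree-sitter, records view names exactly as written in the file (resolving `views` to a full module path is left out), and skips `include()` rather than following it. `resolve_urls` and the `"route"` key are hypothetical.

```python
import ast
from typing import Dict

ROUTE_FUNCS = {"path", "re_path", "url"}


def resolve_urls(source: str) -> Dict[str, Dict[str, str]]:
    """Map each view referenced by path()/re_path()/url() to its pattern."""
    routes: Dict[str, Dict[str, str]] = {}
    for node in ast.walk(ast.parse(source)):
        is_route_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in ROUTE_FUNCS
        )
        if not is_route_call or len(node.args) < 2:
            continue
        pattern, target = node.args[0], node.args[1]
        if not isinstance(pattern, ast.Constant):
            continue
        # Unwrap SomeView.as_view() down to SomeView.
        if (
            isinstance(target, ast.Call)
            and isinstance(target.func, ast.Attribute)
            and target.func.attr == "as_view"
        ):
            target = target.func.value
        # include(...) and other calls are not handled in this sketch.
        if isinstance(target, (ast.Name, ast.Attribute)):
            routes[ast.unparse(target)] = {"route": str(pattern.value)}
    return routes


URLS = '''
from django.urls import include, path
from . import views

urlpatterns = [
    path("", views.index),
    path("items/<int:pk>/", views.ItemDetail.as_view()),
    path("api/", include("api.urls")),
]
'''

routes = resolve_urls(URLS)
```

Neither `views.index` nor `views.ItemDetail` carries a decorator at its definition site, so a table like `routes` is the only evidence a finder would have that they are entrypoints.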

5. Wiring in codeanalyzer/syntactic_analysis/symbol_table_builder.py:

  • Run the routing pre-pass before per-file symbol building; thread
    ModuleContext through.
  • During class/function construction, call the entrypoint factory and
    set is_entrypoint / entrypoint_framework.
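
The ordering matters because an undecorated view can only be flagged if the routing facts already exist when its file is processed. A toy two-phase sketch over an in-memory project, using the standard `ast` module; `routing_pre_pass` and `build_symbols` are hypothetical, and matching a view by its short module name (`views.index`) is a crude assumption that a real implementation would replace with import resolution.

```python
import ast
from typing import Dict, Set


def routing_pre_pass(project: Dict[str, str]) -> Set[str]:
    """Phase 1: collect view names referenced by path(...) in any urls.py."""
    routed: Set[str] = set()
    for filename, source in project.items():
        if not filename.endswith("urls.py"):
            continue
        for node in ast.walk(ast.parse(source)):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "path"
                and len(node.args) >= 2
                and isinstance(node.args[1], ast.Attribute)
            ):
                routed.add(ast.unparse(node.args[1]))  # e.g. "views.index"
    return routed


def build_symbols(project: Dict[str, str], routed: Set[str]) -> Dict[str, bool]:
    """Phase 2: per-file symbol building; returns qualified name -> is_entrypoint."""
    table: Dict[str, bool] = {}
    for filename, source in project.items():
        module = filename[:-3].replace("/", ".")  # "shop/views.py" -> "shop.views"
        short = module.rsplit(".", 1)[-1]         # "views"
        for node in ast.parse(source).body:
            if isinstance(node, ast.FunctionDef):
                table[f"{module}.{node.name}"] = f"{short}.{node.name}" in routed
    return table


project = {
    "shop/urls.py": (
        "from django.urls import path\n"
        "from . import views\n"
        "urlpatterns = [path('', views.index)]\n"
    ),
    "shop/views.py": (
        "def index(request):\n"
        "    return render_home(request)\n"
        "\n"
        "def render_home(request):\n"
        "    return None\n"
    ),
}

table = build_symbols(project, routing_pre_pass(project))
```

`shop.views.index` ends up flagged while `shop.views.render_home`, called only internally, does not.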

Acceptance criteria:

  • A Flask @app.route view, a FastAPI @router.get handler, a
    Django CBV referenced only from urls.py, a Celery @shared_task,
    and a Click @cli.command are each flagged is_entrypoint=True
    with the correct entrypoint_framework.
  • A Django function view with no decorator, reachable only via
    a path('...', view) entry in urls.py (including one level of
    include()), is flagged — proving the routing pre-pass works.
  • A plain helper function called only internally is not flagged.
  • Existing serialized analysis.json files load unchanged
    (defaulted fields).
  • is_entrypoint_class is scoped to inheritance-based entrypoints
    (Django CBVs, Tornado, gRPC servicers); it is not used as a
    coarse "is this class worth analyzing" filter the way the Java
    version uses it — in Python the function-level predicate does the
    real work.

Implementation sketch:

  • frameworks/_base.py: dataclasses + ABCs above.
  • Tree-sitter is sufficient for decorator and base-class detection;
    reach for Jedi only to resolve a base class across modules
    (class V(BaseView) where BaseView is imported and itself
    extends APIView).
  • entrypoint_factory.collect(project) -> None mutates the schema
    objects in place during symbol-table construction.
  • Routing resolvers are pure builders (no schema mutation) feeding
    ModuleContext.
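
The cross-module base-class case can be sketched as a transitive walk over resolved bases. The `bases_of` mapping stands for whatever the Jedi-backed resolution would produce and is an assumption, as are the function name and the abbreviated `ENTRYPOINT_BASES` set.

```python
from typing import Dict, List, Optional, Set

# Partial list taken from the Django row of the finder table.
ENTRYPOINT_BASES = {"View", "APIView", "ViewSet"}


def extends_entrypoint_base(
    cls: str,
    bases_of: Dict[str, List[str]],
    seen: Optional[Set[str]] = None,
) -> bool:
    """True if `cls` reaches a known entrypoint base through its ancestors."""
    seen = set() if seen is None else seen
    if cls in seen:  # guard against cyclic or repeated bases
        return False
    seen.add(cls)
    for base in bases_of.get(cls, []):
        if base in ENTRYPOINT_BASES:
            return True
        if extends_entrypoint_base(base, bases_of, seen):
            return True
    return False


# class V(BaseView), where BaseView is imported and itself extends APIView.
bases_of = {
    "V": ["BaseView"],
    "BaseView": ["APIView"],
    "Helper": ["object"],
}

v_is_entrypoint = extends_entrypoint_base("V", bases_of)
helper_is_entrypoint = extends_entrypoint_base("Helper", bases_of)
```

Only the lookup that fills `bases_of` needs cross-module resolution; the walk itself is cheap and works on names alone.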

Describe alternatives you've considered

  • Decorator-only detection (no routing pre-pass). Simple, covers
    Flask/FastAPI/Celery/Click well, but misses Django entirely:
    Django function/class views are bound in urls.py, not at the
    definition site. Rejected as primary; Django is too common to skip.
  • Datalog/Doop-style fact ingestion (literal JackEE). Maximally
    principled and what JackEE does, but codeanalyzer-python is
    AST/tree-sitter based, not a Datalog engine. The Java backend
    already chose the AST-level port of the same architecture; matching
    it keeps the two backends conceptually aligned. Rejected as
    over-engineering for this codebase.
  • Resolve entrypoints in CLDK instead of the backend. CLDK only
    sees the serialized schema; it cannot re-run framework-aware AST
    passes without duplicating the analyzer. Detection belongs in the
    backend that owns the AST. Rejected.
  • Mark every public module-level function an entrypoint. Trivially
    cheap but useless — destroys the signal (the whole point is
    separating framework-invoked roots from internal helpers).
    Rejected.

Additional context

  • Architecture reference: the Java backend's
    src/main/java/com/ibm/cldk/javaee/
    EntrypointsFinderFactory + AbstractEntrypointFinder +
    per-framework finders (spring, jakarta, jax, struts,
    camel). This issue is the Python mirror of only the entrypoint
    half of that package.
  • JackEE: Antoniadis, Filippakis, Krishnan, Ramesh, Allen, Smaragdakis,
    "Static Analysis of Java Enterprise Applications: Frameworks and
    Caches, the Elephants in the Room", PLDI 2020 — the framework-
    independent-concepts + per-framework-mapping design being ported.
  • Downstream unblocker: once PyClass.is_entrypoint and
    PyCallable.is_entrypoint are populated, CLDK's
    PythonAnalysis.get_entry_point_classes, get_entry_point_methods,
    and the get_service_entry_point_* variants become thin readers
    over these fields, identical to how JavaAnalysis already reads
    them, closing a Java/Python parity gap.
  • CRUD detection is intentionally not part of this issue and will
    be specced separately.
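
What "thin reader" could mean in practice, as a sketch over a plain dict rather than the real CLDK objects. `entry_point_names` and the shape of `callables` are hypothetical and do not reflect the actual PythonAnalysis signatures.

```python
from typing import Any, Dict, List


def entry_point_names(symbols: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names whose serialized record has is_entrypoint set."""
    return sorted(
        name for name, sym in symbols.items() if sym.get("is_entrypoint", False)
    )


callables = {
    "shop.views.index": {"is_entrypoint": True, "entrypoint_framework": "django"},
    "shop.views.render_home": {"is_entrypoint": False},
    "shop.legacy.old": {},  # serialized before the fields existed
}

entry_points = entry_point_names(callables)
```

No framework knowledge lives in the reader; it only filters on the flag the backend already computed.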
