Some notes and tools for reverse engineering / deobfuscating / unminifying obfuscated web app code

Deobfuscating / Unminifying Obfuscated Web App Code

Table of Contents

Tools

Unsorted

  • https://eslint.org/docs/
    • https://eslint.org/docs/latest/extend/custom-rules#the-context-object
      • The context object is the only argument of the create method in a rule.

      • As the name implies, the context object contains information that is relevant to the context of the rule.

    • https://eslint.org/docs/latest/extend/custom-rules#applying-fixes
      • Applying Fixes: If you’d like ESLint to attempt to fix the problem you’re reporting, you can do so by specifying the fix function when using context.report(). The fix function receives a single argument, a fixer object, that you can use to apply a fix.

      • Important: The meta.fixable property is mandatory for fixable rules. ESLint will throw an error if a rule that implements fix functions does not export the meta.fixable property.

      • The fixer object has the following methods:

        • insertTextAfter(nodeOrToken, text): Insert text after the given node or token.
        • insertTextAfterRange(range, text): Insert text after the given range.
        • insertTextBefore(nodeOrToken, text): Insert text before the given node or token.
        • insertTextBeforeRange(range, text): Insert text before the given range.
        • remove(nodeOrToken): Remove the given node or token.
        • removeRange(range): Remove text in the given range.
        • replaceText(nodeOrToken, text): Replace the text in the given node or token.
        • replaceTextRange(range, text): Replace the text in the given range.

        A range is a two-item array containing character indices inside the source code. The first item is the start of the range (inclusive) and the second item is the end of the range (exclusive). Every node and token has a range property to identify the source code range they represent.

        The above methods return a fixing object. The fix() function can return the following values:

        • A fixing object.
        • An array which includes fixing objects.
        • An iterable object which enumerates fixing objects. Especially, the fix() function can be a generator.

        If you make a fix() function which returns multiple fixing objects, those fixing objects must not overlap.
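
As a rough illustration of the pieces quoted above, here is a minimal sketch of a fixable rule. The rule itself (rewriting the minifier idioms !0 and !1 to true and false), the name noMinifiedBooleans, and the stand-in context and fixer objects are made up for this example and are not part of ESLint. The stand-in fixer also assumes a fixing object looks like { range, text }. Only meta.fixable, create(), context.report() with a fix function, and fixer.replaceText() come from the quoted docs.

```javascript
// Hypothetical rule: rewrite the minifier idioms `!0` / `!1` to `true` / `false`.
// In a real plugin this object would be the rule module's export.
const noMinifiedBooleans = {
  meta: {
    type: "suggestion",
    fixable: "code", // mandatory because the rule passes a fix function to context.report()
    schema: [],
  },
  create(context) {
    return {
      UnaryExpression(node) {
        const arg = node.argument;
        const isMinifiedBoolean =
          node.operator === "!" &&
          arg.type === "Literal" &&
          (arg.value === 0 || arg.value === 1);
        if (!isMinifiedBoolean) return;

        context.report({
          node,
          message: "Minified boolean; use a boolean literal instead.",
          fix(fixer) {
            // !0 evaluates to true, !1 evaluates to false
            return fixer.replaceText(node, arg.value === 0 ? "true" : "false");
          },
        });
      },
    };
  },
};

// --- Stand-in harness (NOT ESLint): just enough to exercise the rule by hand ---
const source = "var ok = !0;";

// Hand-built node for `!0`; a range is [start (inclusive), end (exclusive)].
const node = {
  type: "UnaryExpression",
  operator: "!",
  range: [9, 11],
  argument: { type: "Literal", value: 0, range: [10, 11] },
};

// Assumption: a fixing object carries the range to replace and the new text.
const fakeFixer = {
  replaceText(nodeOrToken, text) {
    return { range: nodeOrToken.range, text };
  },
};

const reports = [];
const fakeContext = { report: (descriptor) => reports.push(descriptor) };
noMinifiedBooleans.create(fakeContext).UnaryExpression(node);

// Apply the single fix by splicing the replacement text into the source.
const fix = reports[0].fix(fakeFixer);
const fixedSource =
  source.slice(0, fix.range[0]) + fix.text + source.slice(fix.range[1]);
console.log(fixedSource); // var ok = true;
```

In practice ESLint builds the real context and fixer and applies the fixes itself (for example when run with --fix), so only the rule object would be needed.
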

    • https://eslint.org/docs/latest/extend/code-path-analysis
      • Code Path Analysis Details

      • ESLint’s rules can use code paths. A code path is an execution route through a program. It forks and joins at constructs such as if statements.

      • A program is expressed as several code paths. A code path is represented by two kinds of objects: CodePath and CodePathSegment.

      • CodePath represents one whole code path. One such object exists for each function and for the global scope. It holds references to both the initial segment and the final segments of the code path.

      • CodePathSegment is a part of a code path. A code path is made up of multiple CodePathSegment objects, much like a doubly linked list. The difference from a doubly linked list is that segments can fork and merge (each segment can have several next and previous segments).

      • There are seven events related to code paths, and you can define event handlers by adding them alongside node visitors in the object exported from the create() method of your rule.
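
To make the "doubly linked list with forks and merges" picture concrete, here is a small sketch that models the segments of a single if/else by hand and enumerates the execution routes through them. The objects are hand-built stand-ins, not real ESLint objects: the field names (initialSegment, finalSegments, nextSegments, prevSegments) are modelled on the ones described in the ESLint docs, while the ids, the helper functions, and the acyclic-graph assumption are purely illustrative.

```javascript
// Hand-built stand-ins for CodePathSegment objects (illustrative only).
function makeSegment(id) {
  return { id, nextSegments: [], prevSegments: [] };
}

// Link two segments in both directions, like next/prev in a doubly linked list,
// except that each side can hold several links.
function link(from, to) {
  from.nextSegments.push(to);
  to.prevSegments.push(from);
}

// Segments for: function f(x) { if (x) { a(); } else { b(); } c(); }
const entry = makeSegment("entry"); // function start and the `if` test
const thenBranch = makeSegment("then"); // a();
const elseBranch = makeSegment("else"); // b();
const join = makeSegment("join"); // c(); where both branches merge

link(entry, thenBranch); // fork
link(entry, elseBranch);
link(thenBranch, join); // merge
link(elseBranch, join);

// Stand-in for a CodePath: one per function, pointing at its first and last segments.
const codePath = { id: "f", initialSegment: entry, finalSegments: [join] };

// Enumerate every route from the initial segment to a final segment.
// Segments already on the current route are skipped, so loops are not followed.
function enumerateRoutes(path) {
  const finals = new Set(path.finalSegments);
  const routes = [];
  function walk(segment, trail) {
    if (trail.includes(segment.id)) return;
    const route = [...trail, segment.id];
    if (finals.has(segment)) {
      routes.push(route);
      return;
    }
    for (const next of segment.nextSegments) walk(next, route);
  }
  walk(path.initialSegment, []);
  return routes;
}

const routes = enumerateRoutes(codePath);
console.log(routes); // [ [ 'entry', 'then', 'join' ], [ 'entry', 'else', 'join' ] ]
```
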

  • https://prettier.io/
  • https://github.com/beautify-web/js-beautify
    • Beautifier for javascript

    • This little beautifier will reformat and re-indent bookmarklets, ugly JavaScript, unpack scripts packed by Dean Edward’s popular packer, as well as partly deobfuscate scripts processed by the npm package javascript-obfuscator.

    • https://beautifier.io/
  • https://github.com/shapesecurity/unminify
  • https://github.com/PerimeterX/restringer
  • https://github.com/lelinhtinh/de4js
  • http://www.jsnice.org/
    • Statistical renaming, type inference and deobfuscation

    • https://www.sri.inf.ethz.ch/research/plml
      • Machine Learning for Code: This project combines programming languages and machine learning for building statistical programming engines -- systems built on top of machine learning models of large codebases. These are new kinds of engines which can provide statistically likely solutions to problems that are difficult or impossible to solve with traditional techniques.

      • JSNice: JSNice de-obfuscates JavaScript programs. JSNice is a popular system in the JavaScript community used by tens of thousands of programmers worldwide.

  • https://github.com/spaceraccoon/webpack-exploder/
  • https://github.com/goto-bus-stop/webpack-unpack
  • https://github.com/goto-bus-stop/amd-unpack
    • extract modules from a bundled AMD project using define/require functions

  • https://github.com/gchq/CyberChef
  • https://github.com/ast-grep/ast-grep
  • https://github.com/dandavison/delta
  • https://github.com/Wilfred/difftastic
    • Difftastic is a structural diff tool that compares files based on their syntax.

    • https://difftastic.wilfred.me.uk/introduction.html
      • Difftastic is a structural diff tool that understands syntax. It supports over 30 programming languages and when it works, it's fantastic.

  • https://github.com/prettydiff/prettydiff
    • Beautifier and language aware code comparison tool for many languages. It also minifies and a few other things

    • https://prettydiff.com/#projects-prettydiff
      • When I first became a developer at Travelocity I would sometimes need to compare code in different environments where some code existed in its original condition and in other cases was minified. Existing diff tools could not solve for that sort of comparison, and at that time existing JavaScript beautifiers had trouble with complex data structures. So I integrated a web-based diff tool with an existing beautifier and minifier. As the features, capabilities, and requests upon the application grew I eventually wrote my own diff algorithm and beautifiers for the various supported languages.

  • https://github.com/Vunovati/astii
    • A JavaScript AST-aware diff and patch toolset

    • When comparing two JavaScript files, standard diff tools compare the two files line-by-line and output the lines on which the files differ. This tool does not compare the characters of the source files directly but their abstract representation - their abstract syntax trees.

    • This enables you to have more meaningful diffs between files which may be very similar but have different source code formatting.

    • When patching, astii patch will regenerate (original --> AST --> generate) the source file and patch it with the provided diff.
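
The idea of comparing syntax trees rather than characters can be sketched as follows. This is not astii's implementation, and it only covers the "are these the same program?" half of a diff. The two trees are hand-built in an ESTree-like shape so the example needs no parser, and the list of location keys (start, end, loc, range) is an assumption about what a parser would attach.

```javascript
// Keys assumed to carry source positions rather than program structure.
const LOCATION_KEYS = new Set(["start", "end", "loc", "range"]);

// Return a copy of the tree without location info, with keys in sorted order
// so that serialising it gives a canonical string.
function stripLocations(node) {
  if (Array.isArray(node)) return node.map(stripLocations);
  if (node === null || typeof node !== "object") return node;
  const out = {};
  for (const key of Object.keys(node).sort()) {
    if (!LOCATION_KEYS.has(key)) out[key] = stripLocations(node[key]);
  }
  return out;
}

// Two trees are "the same" if they match once positions are ignored.
function sameStructure(a, b) {
  return JSON.stringify(stripLocations(a)) === JSON.stringify(stripLocations(b));
}

// Hand-built tree for the given source text `<name><gap>=<gap><value>`.
function assignment(name, value, positions) {
  const [leftEnd, rightStart, end] = positions;
  return {
    type: "ExpressionStatement",
    start: 0,
    end,
    expression: {
      type: "AssignmentExpression",
      operator: "=",
      start: 0,
      end,
      left: { type: "Identifier", name, start: 0, end: leftEnd },
      right: { type: "Literal", value, start: rightStart, end },
    },
  };
}

const minified = assignment("a", 1, [1, 2, 3]); // a=1
const formatted = assignment("a", 1, [1, 4, 5]); // a = 1
const changed = assignment("a", 2, [1, 4, 5]); // a = 2

const formattingOnly = sameStructure(minified, formatted);
const realChange = sameStructure(minified, changed);
console.log(formattingOnly); // true
console.log(realChange); // false
```

A line-based diff would flag the first pair as different because of the whitespace, while the structural comparison only flags the second pair.
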

  • https://joern.io/
    • The Bug Hunter's Workbench

    • Query: Uncover attack surface, sloppy coding practices, and variants of known vulnerabilities using an interactive code analysis shell. Joern supports C, C++, LLVM bitcode, x86 binaries (via Ghidra), JVM bytecode (via Soot), and Javascript. Python, Java source code, Kotlin, and PHP support coming soon.

    • Automate: Wrap your queries into custom code scanners and share them with the community or run existing Joern-based scanners in your CI.
    • Integrate: Use Joern as a library to power your own code analysis tools or as a component via the REST API.
    • https://github.com/joernio/joern
      • Open-source code analysis platform for C/C++/Java/Binary/Javascript/Python/Kotlin based on code property graphs.

      • Joern is a platform for analyzing source code, bytecode, and binary executables. It generates code property graphs (CPGs), a graph representation of code for cross-language code analysis. Code property graphs are stored in a custom graph database. This allows code to be mined using search queries formulated in a Scala-based domain-specific query language. Joern is developed with the goal of providing a useful tool for vulnerability discovery and research in static program analysis.

    • https://docs.joern.io/
      • Joern is a platform for robust analysis of source code, bytecode, and binary code. It generates code property graphs, a graph representation of code for cross-language code analysis. Code property graphs are stored in a custom graph database. This allows code to be mined using search queries formulated in a Scala-based domain-specific query language. Joern is developed with the goal of providing a useful tool for vulnerability discovery and research in static program analysis.

      • The core features of Joern are:

        • Robust parsing. Joern allows importing code even if a working build environment cannot be supplied or parts of the code are missing.
        • Code Property Graphs. Joern creates semantic code property graphs from the fuzzy parser output and stores them in an in-memory graph database. SCPGs are a language-agnostic intermediate representation of code designed for query-based code analysis.
        • Taint Analysis. Joern provides a taint-analysis engine that allows the propagation of attacker-controlled data in the code to be analyzed statically.
        • Search Queries. Joern offers a strongly-typed Scala-based extensible query language for code analysis based on Gremlin-Scala. This language can be used to manually formulate search queries for vulnerabilities as well as automatically infer them using machine learning techniques.
        • Extendable via CPG passes. Code property graphs are multi-layered, offering information about code on different levels of abstraction. Joern comes with many default passes, but also allows users to add passes to include additional information in the graph, and extend the query language accordingly.
      • https://docs.joern.io/code-property-graph/
      • https://docs.joern.io/cpgql/data-flow-steps/
      • https://docs.joern.io/export/
        • Joern can create the following graph representations for C/C++ code:

          • Abstract Syntax Trees (AST)
          • Control Flow Graphs (CFG)
          • Control Dependence Graphs (CDG)
          • Data Dependence Graphs (DDG)
          • Program Dependence graphs (PDG)
          • Code Property Graphs (CPG14)
          • Entire graph, i.e. convert to a different graph format (ALL)
  • https://github.com/julianjensen/ast-flow-graph
    • ast-flow-graph

    • Creates a CFG from JavaScript source code.

    • This module will read one or more JavaScript source files and produce CFGs (Control Flow Graphs) of the code.

    • Uses espree, escope, estraverse, etc
    • https://github.com/isaacs/yallist
      • Yet Another Linked List

        There are many doubly-linked list implementations like it, but this one is mine.

        For when an array would be too big, and a Map can't be iterated in reverse order.

    • https://github.com/julianjensen/traversals
      • Small module for graph traversals, supporting DFS and BFS with niceties added for pre- and post-order, including their reverses.

      • Some notes from ChatGPT:
        • Provides a small module designed for performing graph traversal operations, specifically Depth-First Search (DFS) and Breadth-First Search (BFS). It includes additional features such as pre-order and post-order traversals, as well as their reverse versions, to enhance the functionality of these standard graph traversal techniques.
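
For reference, the traversal orders mentioned above can be sketched in a few lines. This is a generic illustration and not the API of the traversals module: the function names, the adjacency-list graph format, and the example graph are all assumptions.

```javascript
// Graph as an adjacency list: node name -> list of successor names.
const graph = {
  entry: ["a"],
  a: ["b", "c"],
  b: ["d"],
  c: ["d"],
  d: ["exit"],
  exit: [],
};

// Depth-first search that records pre-order (on first visit) and
// post-order (after all successors are done), plus reverse post-order.
function dfsOrders(g, start) {
  const visited = new Set();
  const pre = [];
  const post = [];
  function visit(n) {
    visited.add(n);
    pre.push(n);
    for (const next of g[n]) {
      if (!visited.has(next)) visit(next);
    }
    post.push(n);
  }
  visit(start);
  return { pre, post, reversePost: [...post].reverse() };
}

// Breadth-first search: visit nodes level by level using a queue.
function bfsOrder(g, start) {
  const visited = new Set([start]);
  const queue = [start];
  const order = [];
  while (queue.length > 0) {
    const n = queue.shift();
    order.push(n);
    for (const next of g[n]) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

const orders = dfsOrders(graph, "entry");
const bfs = bfsOrder(graph, "entry");
console.log(orders.pre); // [ 'entry', 'a', 'b', 'd', 'exit', 'c' ]
console.log(orders.post); // [ 'exit', 'd', 'b', 'c', 'a', 'entry' ]
console.log(orders.reversePost); // [ 'entry', 'a', 'c', 'b', 'd', 'exit' ]
console.log(bfs); // [ 'entry', 'a', 'b', 'c', 'd', 'exit' ]
```

Reverse post-order is the one that tends to matter for control flow graphs, since (ignoring back edges) it visits a node only after all of its predecessors.
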

    • https://github.com/julianjensen/dominators
      • Various dominator tree algorithms

      • It implements two different methods for finding the immediate dominators of a graph.

      • Some notes from ChatGPT:
        • A dominator tree is a concept used in computer science, particularly in the field of compiler design and program analysis. To understand a dominator tree, let's first look at the concept of dominators in a control flow graph (CFG).

          In a CFG, which represents the flow of control in a program, a node A is said to dominate another node B if every path from the start node of the graph to B must go through A. In other words, A dominates B if A is always encountered before B when traversing the graph from the start node.

          The concept becomes more nuanced with the idea of immediate dominators. The immediate dominator of a node B is the strict dominator closest to B: the unique node that strictly dominates B (dominates B and is not B itself) but does not strictly dominate any other strict dominator of B.

          Now, a dominator tree is a tree structure that represents these dominance relationships within a CFG. In this tree:

          • Each node corresponds to a node in the original CFG.
          • There is a directed edge from node A to node B if A is the immediate dominator of B in the CFG.

          A dominator tree generator, therefore, is a tool or an algorithm that constructs the dominator tree from a given control flow graph. This tool is essential in optimizing compilers and in various program analysis tasks, where understanding the dominance relationships helps in transformations like loop optimization, dead code elimination, and more sophisticated analyses like static single assignment (SSA) form conversion.

          This concept comes from compiler construction and code optimization, and it is useful for understanding code structure and flow, especially when analyzing or optimizing complex software systems.
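
The definitions above can be turned into a small worked example. The sketch below uses the simple iterative set-based formulation (a node's dominators are itself plus whatever dominates all of its predecessors), which is not necessarily either of the methods the dominators module implements. The function names and the example CFG are assumptions, and every node is assumed to be reachable from the start node.

```javascript
// CFG as an adjacency list: node name -> list of successor names.
const cfg = {
  entry: ["a"],
  a: ["b", "c"],
  b: ["d"],
  c: ["d"],
  d: ["exit"],
  exit: [],
};

// Compute, for every node, the set of nodes that dominate it.
function computeDominators(successors, start) {
  const nodes = Object.keys(successors);
  const predecessors = Object.fromEntries(nodes.map((n) => [n, []]));
  for (const from of nodes) {
    for (const to of successors[from]) predecessors[to].push(from);
  }

  // Start from "everything dominates everything" and shrink to a fixed point.
  const dom = {};
  for (const n of nodes) dom[n] = new Set(n === start ? [start] : nodes);

  let changed = true;
  while (changed) {
    changed = false;
    for (const n of nodes) {
      if (n === start) continue;
      // Intersect the dominator sets of all predecessors, then add n itself.
      let next = null;
      for (const p of predecessors[n]) {
        next =
          next === null
            ? new Set(dom[p])
            : new Set([...next].filter((x) => dom[p].has(x)));
      }
      next = next ?? new Set();
      next.add(n);
      // Sets only ever shrink here, so comparing sizes is enough to detect a change.
      if (next.size !== dom[n].size) {
        dom[n] = next;
        changed = true;
      }
    }
  }
  return dom;
}

// The immediate dominator is the strict dominator closest to the node. It is
// dominated by all the other strict dominators, so it has the largest set.
function immediateDominators(dom, start) {
  const idom = {};
  for (const [n, set] of Object.entries(dom)) {
    if (n === start) continue;
    const strict = [...set].filter((d) => d !== n);
    idom[n] = strict.reduce((best, d) => (dom[d].size > dom[best].size ? d : best));
  }
  return idom;
}

const dom = computeDominators(cfg, "entry");
const idom = immediateDominators(dom, "entry");
console.log(idom); // { a: 'entry', b: 'a', c: 'a', d: 'a', exit: 'd' }
```

Each entry of idom is one edge of the dominator tree (from the immediate dominator to the node). Note that d is dominated by a rather than by b or c, because either branch can reach it.
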

wakaru

webcrack

debundle + related

Blogs / Articles / etc

Libraries / Helpers

Unsorted

Recast + related

  • https://github.com/benjamn/recast
  • https://github.com/facebook/jscodeshift
    • A JavaScript codemod toolkit

    • jscodeshift is a toolkit for running codemods over multiple JavaScript or TypeScript files. It provides:

      • A runner, which executes the provided transform for each file passed to it. It also outputs a summary of how many files have (not) been transformed.
      • A wrapper around recast, providing a different API. Recast is an AST-to-AST transform tool and also tries to preserve the style of original code as much as possible.
    • facebook/jscodeshift#500
      • Bringing jscodeshift up to date

      • The biggest issue is with recast. This library hasn't really had a lot of maintenance for the last couple of years, and there's something like 150+ issues and 40+ pull requests waiting to be merged. It seems like 80% of the issues that are logged against jscodeshift are actually recast issues. In order to fix the jscodeshift's outstanding issues, either recast itself needs to fix them or jscodeshift will need to adopt/create its own fork of recast to solve them. For the past year and a half or so putout's main developer has been maintaining a fork of recast and adding a lot of fixes to it. It might be worthwhile to look at switching to @putout/recast as opposed to the recast upstream. I've also been working on a fork of @putout/recast for evcodeshift that adds a few other things to make evcodeshift transforms more debuggable in vscode.

      • https://github.com/putoutjs/recast
        • https://github.com/putoutjs/printer
        • Prints Babel AST to readable JavaScript. For ESTree use estree-to-babel.

          • Similar to Recast, but twice as fast, and also simpler and easier to maintain, since it supports only Babel.
          • As opinionated as Prettier, but has more user-friendly output and works directly with AST.
          • Like ESLint but works directly with Babel AST.
          • Easily extendable with help of Overrides.
    • What can be said about recast can probably also be said to a lesser degree about ast-types

  • https://github.com/codemod-js/codemod
    • codemod rewrites JavaScript and TypeScript using babel plugins

estools + related

Babel

semantic / tree-sitter + related

Shift AST

swc

esbuild

  • https://github.com/evanw/esbuild
    • An extremely fast bundler for the web

    • Written in Golang
    • https://esbuild.github.io/
      • https://esbuild.github.io/faq/#upcoming-roadmap
        • I am not planning to include these features in esbuild's core itself:

          • ..snip..
          • An API for custom AST manipulation
          • ..snip..

          I hope that the extensibility points I'm adding to esbuild (plugins and the API) will make esbuild useful to include as part of more customized build workflows, but I'm not intending or expecting these extensibility points to cover all use cases.

          • https://esbuild.github.io/plugins/
          • https://esbuild.github.io/api/
          • https://news.ycombinator.com/item?id=29004200
            • ESBuild does not support any AST transforms directly

              You can add it via plugins, but it's a serious limitation for a project like Next.js, which requires these types of transforms.

              You also end up with diminishing returns the more plugins you add to esbuild, and I imagine it's worse with JS plugins than it is with Go-based ones. Nonetheless, you have zero access to it directly.

            • It is trivial to write extensions for esbuild. We've written extensive plugins to perform ast transformations that all run, collectively, in under 0.5 seconds. Make a plugin, add acorn and escodegen.

              • This implies that the plugins are doing the AST transformation outside of esbuild itself (likely still running in JS), so wouldn't really benefit from the fact that esbuild is written in golang like I was hoping.
          • evanw/esbuild#2172
            • Forking esbuild to build an AST plugin tool

            • The internal AST is not designed for this use case at all, and it’s not a use case that I’m going to spend time supporting (so I’m not going to spend time documenting exactly how to do it). I recommend using some other tool if you want to do AST-level stuff, especially because keeping a hack like this working over time as esbuild changes might be a big pain for you.

            • If you really want to do this with esbuild, know that the AST is not cleanly abstracted and is only intended for use with esbuild (e.g. uses a lot of internal data structures, has implicit invariants regarding symbols and tree shaking, does some weird things for performance reasons).

Source Maps

Visualisation/etc

Browser Based Code Editors / IDEs

In addition to the links directly below, also make sure to check out the various online REPL/playground tools linked under various other parts of this page too (eg. babel, swc, etc):

  • https://github.com/microsoft/TypeScript-Website/tree/v2/packages/playground
    • This is the JS tooling which powers the https://www.typescriptlang.org/play/

    • It is more or less vanilla DOM-oriented JavaScript with as few dependencies as possible. Originally based on the work by Artem Tyurin but now it's diverged far from that fork.

    • https://github.com/microsoft/TypeScript-Website/tree/v2/packages/sandbox
      • The TypeScript Sandbox is the editor part of the TypeScript Playground. It's effectively an opinionated fork of monaco-typescript with extra extension points so that projects like the TypeScript Playground can exist.

    • https://github.com/microsoft/TypeScript-Playground-Samples
      • Examples of TypeScript Playground Plugins for you to work from

      • This is a series of example plugins, which are extremely well documented and aim to give you samples to build from depending on what you want to build.

        • TS Compiler API: Uses @typescript/vfs to set up a TypeScript project in the browser, and then displays all of the top-level functions as AST nodes in the sidebar.
        • TS Transformers Demo: Uses a custom TypeScript transformer when emitting JavaScript from the current file in the Playground.
        • Using a Web-ish npm Dependency: Uses a dependency which isn't entirely optimised for running in a web page, but doesn't have too big of a dependency tree that this becomes an issue either.
        • Presenting Information Inline: Using a fraction of the extensive Monaco API (monaco is the text editor at the core of the Playground) to showcase what parts of a TypeScript file would be removed by a transpiler to make it a JS file.

CodeMirror

  • https://codemirror.net/
    • CodeMirror is a code editor component for the web. It can be used in websites to implement a text input field with support for many editing features, and has a rich programming interface to allow further extension.

    • CodeMirror is open source under a permissive license (MIT).

    • A full parser package, often with language-specific integration and extension code, exists for the following languages

    • There is also a collection of CodeMirror 5 modes that can be used, and a list of community-maintained language packages. If your language is not listed above, you may still find a solution there.

    • https://codemirror.net/docs/community/
      • Community Packages

      • This page lists CodeMirror-related packages maintained by the wider community.

  • https://github.com/codemirror/dev
    • Development repository for the CodeMirror editor project

    • This is the central repository for CodeMirror. It holds the bug tracker and development scripts.

      If you want to use CodeMirror, install the separate packages from npm, and ignore the contents of this repository. If you want to develop on CodeMirror, this repository provides scripts to install and work with the various packages.

  • https://github.com/uiwjs/react-codemirror

monaco-editor

Obfuscation / Deobfuscation

Variable Name Mangling

Symbolic / Concolic Execution

  • https://en.wikipedia.org/wiki/Symbolic_execution
    • In computer science, symbolic execution (also symbolic evaluation or symbex) is a means of analyzing a program to determine what inputs cause each part of a program to execute. An interpreter follows the program, assuming symbolic values for inputs rather than obtaining actual inputs as normal execution of the program would. It thus arrives at expressions in terms of those symbols for expressions and variables in the program, and constraints in terms of those symbols for the possible outcomes of each conditional branch. Finally, the possible inputs that trigger a branch can be determined by solving the constraints.

    • https://en.wikipedia.org/wiki/Symbolic_execution#Tools
    • https://en.wikipedia.org/wiki/Symbolic_execution#See_also
      • Abstract interpretation

      • Symbolic simulation

      • Symbolic computation

      • Concolic testing

      • Control-flow graph

      • Dynamic recompilation

  • https://en.wikipedia.org/wiki/Concolic_testing
    • Concolic testing (a portmanteau of concrete and symbolic, also known as dynamic symbolic execution) is a hybrid software verification technique that performs symbolic execution, a classical technique that treats program variables as symbolic variables, along a concrete execution (testing on particular inputs) path. Symbolic execution is used in conjunction with an automated theorem prover or constraint solver based on constraint logic programming to generate new concrete inputs (test cases) with the aim of maximizing code coverage. Its main focus is finding bugs in real-world software, rather than demonstrating program correctness.

    • Implementation of traditional symbolic execution based testing requires the implementation of a full-fledged symbolic interpreter for a programming language. Concolic testing implementors noticed that implementation of full-fledged symbolic execution can be avoided if symbolic execution can be piggy-backed with the normal execution of a program through instrumentation. This idea of simplifying implementation of symbolic execution gave birth to concolic testing.

    • An important reason for the rise of concolic testing (and more generally, symbolic-execution based analysis of programs) in the decade since it was introduced in 2005 is the dramatic improvement in the efficiency and expressive power of SMT Solvers. The key technical developments that led to the rapid development of SMT solvers include combination of theories, lazy solving, DPLL(T) and the huge improvements in the speed of SAT solvers. SMT solvers that are particularly tuned for concolic testing include Z3, STP, Z3str2, and Boolector.

      • https://en.wikipedia.org/wiki/Satisfiability_modulo_theories
        • In computer science and mathematical logic, satisfiability modulo theories (SMT) is the problem of determining whether a mathematical formula is satisfiable. It generalizes the Boolean satisfiability problem (SAT) to more complex formulas involving real numbers, integers, and/or various data structures such as lists, arrays, bit vectors, and strings. The name is derived from the fact that these expressions are interpreted within ("modulo") a certain formal theory in first-order logic with equality (often disallowing quantifiers). SMT solvers are tools that aim to solve the SMT problem for a practical subset of inputs. SMT solvers such as Z3 and cvc5 have been used as a building block for a wide range of applications across computer science, including in automated theorem proving, program analysis, program verification, and software testing.

      • https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT
    • https://en.wikipedia.org/wiki/Concolic_testing#Algorithm
      • Essentially, a concolic testing algorithm operates as follows:

        1. Classify a particular set of variables as input variables. These variables will be treated as symbolic variables during symbolic execution. All other variables will be treated as concrete values.
        2. Instrument the program so that each operation which may affect a symbolic variable value or a path condition is logged to a trace file, as well as any error that occurs.
        3. Choose an arbitrary input to begin with.
        4. Execute the program.
        5. Symbolically re-execute the program on the trace, generating a set of symbolic constraints (including path conditions).
        6. Negate the last path condition not already negated in order to visit a new execution path. If there is no such path condition, the algorithm terminates.
        7. Invoke an automated satisfiability solver on the new set of path conditions to generate a new input. If there is no input satisfying the constraints, return to step 6 to try the next execution path.
        8. Return to step 4.

        There are a few complications to the above procedure:

        • The algorithm performs a depth-first search over an implicit tree of possible execution paths. In practice programs may have very large or infinite path trees – a common example is testing data structures that have an unbounded size or length. To prevent spending too much time on one small area of the program, the search may be depth-limited (bounded).
        • Symbolic execution and automated theorem provers have limitations on the classes of constraints they can represent and solve. For example, a theorem prover based on linear arithmetic will be unable to cope with the nonlinear path condition xy = 6. Any time that such constraints arise, the symbolic execution may substitute the current concrete value of one of the variables to simplify the problem. An important part of the design of a concolic testing system is selecting a symbolic representation precise enough to represent the constraints of interest.
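
The loop described above can be sketched as a toy, with several simplifications that should be read as assumptions rather than as how real concolic tools work: the program under test is instrumented by hand, path conditions are recorded directly as predicate functions instead of being rebuilt from a trace file, the "solver" is a brute-force search over a small integer range rather than an SMT solver, and a per-input bound stands in for "not already negated". All names here are hypothetical.

```javascript
// Program under test, instrumented by hand: every branch goes through `branch`,
// which logs the condition and the direction that was actually taken.
function programUnderTest(x, branch) {
  if (branch("x > 10", (v) => v > 10, x)) {
    if (branch("x % 7 === 0", (v) => v % 7 === 0, x)) {
      return "bug";
    }
    return "big";
  }
  return "small";
}

// Execute concretely and collect the path conditions along the way.
function run(input) {
  const trace = [];
  const branch = (label, predicate, value) => {
    const taken = predicate(value);
    trace.push({ label, predicate, taken });
    return taken;
  };
  const result = programUnderTest(input, branch);
  return { result, trace };
}

// Toy stand-in for a constraint solver: try every integer in a small range.
function solve(constraints) {
  for (let x = -50; x <= 50; x++) {
    if (constraints.every((c) => c.predicate(x) === c.expected)) return x;
  }
  return null; // no input in range satisfies the constraints
}

function concolicTest(initialInput) {
  const worklist = [{ input: initialInput, bound: 0 }];
  const explored = [];
  while (worklist.length > 0) {
    const { input, bound } = worklist.pop();
    const { result, trace } = run(input);
    explored.push({ input, result });
    // Keep the path prefix as executed, negate one later condition, and ask
    // the solver for an input that drives execution down the new path.
    for (let i = bound; i < trace.length; i++) {
      const constraints = trace
        .slice(0, i)
        .map((t) => ({ predicate: t.predicate, expected: t.taken }));
      constraints.push({ predicate: trace[i].predicate, expected: !trace[i].taken });
      const next = solve(constraints);
      // bound = i + 1 prevents re-negating conditions this input was derived from.
      if (next !== null) worklist.push({ input: next, bound: i + 1 });
    }
  }
  return explored;
}

const explored = concolicTest(0);
console.log(explored);
// [ { input: 0, result: 'small' }, { input: 11, result: 'big' }, { input: 14, result: 'bug' } ]
```

Starting from the arbitrary input 0, the loop reaches all three paths in three executions, including the nested "bug" branch that random inputs would rarely hit.
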
    • https://en.wikipedia.org/wiki/Concolic_testing#Tools
      • Jalangi is an open-source concolic testing and symbolic execution tool for JavaScript. Jalangi supports integers and strings.
  • https://github.com/Z3Prover/z3
    • The Z3 Theorem Prover

    • https://github.com/Z3Prover/z3/wiki
      • Z3 is an SMT solver and supports the SMTLIB format.

        • https://smtlib.cs.uiowa.edu/
          • SMT-LIB is an international initiative aimed at facilitating research and development in Satisfiability Modulo Theories (SMT).

          • Documents describing the SMT-LIB input/output language for SMT solvers and its semantics;

          • etc
    • https://microsoft.github.io/z3guide/
      • Online Z3 Guide

      • https://github.com/microsoft/z3guide
        • Tutorials and courses for Z3

        • https://microsoft.github.io/z3guide/docs/logic/intro/
          • Introduction: Z3 is a state-of-the-art theorem prover from Microsoft Research. It can be used to check the satisfiability of logical formulas over one or more theories. Z3 offers a compelling match for software analysis and verification tools, since several common software constructs map directly into supported theories.

            The main objective of the tutorial is to introduce the reader on how to use Z3 effectively for logical modeling and solving. The tutorial provides some general background on logical modeling, but we have to defer a full introduction to first-order logic and decision procedures to text-books in order to develop an in depth understanding of the underlying concepts. To clarify: a deep understanding of logical modeling is not necessarily required to understand this tutorial and modeling with Z3, but it is necessary to understand for writing complex models.

        • https://microsoft.github.io/z3guide/programming/Z3%20JavaScript%20Examples/
          • Z3 JavaScript: The Z3 distribution comes with TypeScript (and therefore JavaScript) bindings for Z3. In the following we give a few examples of using Z3 through these bindings. You can run and modify the examples locally in your browser.

  • https://github.com/Samsung/jalangi2
    • Dynamic analysis framework for JavaScript

    • Jalangi2 is a framework for writing dynamic analyses for JavaScript. Jalangi1 is still available at https://github.com/SRA-SiliconValley/jalangi, but we no longer plan to develop it. Jalangi2 does not support the record/replay feature of Jalangi1. In the Jalangi2 distribution you will find several analyses:

      • an analysis to track NaNs.
      • an analysis to check if an undefined is concatenated to a string.
      • Memory analysis: a memory-profiler for JavaScript and HTML5.
      • DLint: a dynamic checker for JavaScript bad coding practices.
      • JITProf: a dynamic JIT-unfriendly code snippet detection tool.
      • analysisCallbackTemplate.js: a template for writing a dynamic analysis.
      • and more ...

      See our tutorial slides for a detailed overview of Jalangi and some client analyses.

    • https://github.com/Samsung/jalangi2#usage
      • Usage

      • Analysis in node.js with on-the-fly instrumentation

      • Analysis in node.js with explicit one-file-at-a-time offline instrumentation

      • Analysis in a browser using a proxy and on-the-fly instrumentation

  • https://github.com/SRA-SiliconValley/jalangi
    • This repository has been archived by the owner on Dec 9, 2017. It is now read-only.

    • We encourage you to switch to Jalangi2 available at https://github.com/Samsung/jalangi2. Jalangi2 is a framework for writing dynamic analyses for JavaScript. Jalangi2 does not support the record/replay feature of Jalangi1. Jalangi1 is still available from this website, but we no longer plan to develop it.

    • Jalangi is a framework for writing heavy-weight dynamic analyses for JavaScript. Jalangi provides two modes for dynamic program analysis: an online mode (a.k.a. direct or in-browser analysis mode) and an offline mode (a.k.a. record-replay analysis mode). In both modes, Jalangi instruments the program-under-analysis to insert callbacks to methods defined in Jalangi. An analysis writer implements these methods to perform custom dynamic program analysis. In the online mode, Jalangi performs analysis during the execution of the program. An analysis in online mode can use shadow memory to attach meta information to every memory location. The offline mode of Jalangi incorporates two key techniques: 1) selective record-replay, a technique which makes it possible to record and faithfully replay a user-selected part of the program, and 2) shadow values and shadow execution, which enable easy implementation of heavy-weight dynamic analyses. Shadow values allow an analysis to attach meta information to every value. In the distribution you will find several analyses:

      • concolic testing,
      • an analysis to track origins of nulls and undefined,
      • an analysis to infer likely types of object fields and functions,
      • an analysis to profile object allocation and usage,
      • a simple form of taint analysis,
      • an experimental pure symbolic execution engine (currently undocumented)

Profiling

Unsorted

  • https://github.com/bytecodealliance/ComponentizeJS
    • ESM -> WebAssembly Component creator, via a SpiderMonkey JS engine embedding

    • Provides a Mozilla SpiderMonkey embedding that takes as input a JavaScript source file and a WebAssembly Component WIT World, and outputs a WebAssembly Component binary with the same interface.

    • https://bytecodealliance.org/articles/making-javascript-run-fast-on-webassembly
      • Making JavaScript run fast on WebAssembly

      • We should be clear here—if you’re running JavaScript in the browser, it still makes the most sense to simply deploy JS. The JS engines within the browsers are highly tuned to run the JS that gets shipped to them.

    • https://github.com/bytecodealliance/wizer
      • The WebAssembly Pre-Initializer

        Don't wait for your Wasm module to initialize itself, pre-initialize it! Wizer instantiates your WebAssembly module, executes its initialization function, and then snapshots the initialized state out into a new WebAssembly module. Now you can use this new, pre-initialized WebAssembly module to hit the ground running, without making your users wait for that first-time set up code to complete.

        The improvements to start up latency you can expect will depend on how much initialization work your WebAssembly module needs to do before it's ready. Some initial benchmarking shows between 1.35 and 6.00 times faster instantiation and initialization with Wizer, depending on the workload.

  • https://wingolog.org/archives/2022/08/18/just-in-time-code-generation-within-webassembly
    • just-in-time code generation within webassembly

My ChatGPT Research / Conversations

These are private chat links, so won't work for others, and are included here only for my reference:

See Also

My Other Related Deepdive Gists and Projects

Fingerprinting Minified JavaScript Libraries

See Also:

Initial ChatGPT Conversation / Notes

Source: https://chat.openai.com/c/d9b7b64f-aa93-474e-939f-79e376e6d375

Me:

Are there any tools that allow me to identify potentially minified javascript/etc libraries within a webpack or similar bundle?

ChatGPT:

Identifying and fingerprinting minified JavaScript modules within a webpack bundle, regardless of their minification method, can be a challenging task due to the nature of minification and bundling. However, there are approaches and tools you can use to achieve this:

  1. Custom Fingerprinting Script: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles. This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns.

  2. Entropy Analysis: Minified code typically has higher entropy than non-minified code. By analyzing the entropy of different sections of your bundle, you might be able to identify boundaries between different libraries or modules.

  3. AST Analysis: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures.

  4. Binary Analysis Tools: Tools like Binwalk, which are traditionally used for analyzing binary files, can sometimes be repurposed for script analysis. They work by scanning files for embedded files and executable code.

  5. Regular Expression Matching: This involves creating regular expressions that match known patterns from popular libraries. While this approach can be time-consuming and may require regular updates, it can be effective for well-known and widely used libraries.

  6. Machine Learning Approaches: With advancements in machine learning, you can train a model to recognize patterns in minified code. This requires a substantial dataset of minified and unminified code to train the model.

  7. Community Databases: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle.

  8. Source Map Reversal: If source maps are available, you can reverse-engineer them to get the original source code, making it easier to identify the libraries.

Remember, the effectiveness of these methods depends on various factors, including the degree of minification, the presence of source maps, and the specific bundling process used. Additionally, given the potential complexity, this might be a time-consuming task, but your background in software engineering and ethical hacking should provide a solid foundation for tackling this challenge.
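
As a rough illustration of the entropy idea (item 2 in the list above), the following Python sketch computes Shannon entropy over sliding windows of a bundle's text. The function names, window size, and step size are assumptions chosen for illustration. Whether entropy changes actually line up with library boundaries is the claim being explored, not something this sketch establishes.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def windowed_entropy(source, window=512, step=256):
    """Yield (offset, entropy) pairs for overlapping windows of `source`.

    A source shorter than `window` produces a single window covering all of it.
    """
    last_start = max(len(source) - window, 0)
    for offset in range(0, last_start + 1, step):
        yield offset, shannon_entropy(source[offset:offset + window])
```

Plotting the yielded pairs for a real bundle is one way to check by eye whether any offsets stand out before investing in a more elaborate approach.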

Me:

Let's work with the following ideas:

Custom Fingerprinting Script: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles. This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns.

AST Analysis: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures.

Community Databases: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle.

--

Let's assume that there isn't currently a community database, but that once we figure out an effective way of creating and matching these fingerprints, that we could open source one.

Since we want this to work regardless of whether the library code has been minified or not, we need to take that into account.

How should we approach this?

ChatGPT:

Creating a system to fingerprint JavaScript libraries within a webpack bundle, especially considering the challenges posed by minification, is a multi-step process. The goal is to create a system that can analyze JavaScript code, extract identifiable patterns, and match these against a database of known library fingerprints. Here's a proposed approach:

Establishing a Fingerprinting Methodology

  1. Identify Unique Patterns: Begin by identifying patterns that remain consistent in a library, even after minification. These could include:

    • Unique function signatures.
    • Specific algorithm implementations.
    • Characteristic variable naming schemes that survive minification.
    • Constant values or strings used by the library.
  2. Heuristics for Detection: Develop heuristics to detect these patterns, such as specific function name and parameter combinations, unique method chaining styles, or specific API calls.
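
One of the patterns listed above, constant strings used by the library, can be sketched with the Python standard library alone. This is a minimal sketch, not a robust extractor: the regular expression only handles single- and double-quoted literals, it will also match quoted text inside comments or regex literals, and it ignores template literals. `string_features`, `min_length`, and the two sample snippets are hypothetical names and data made up for illustration.

```python
import re

# Double- or single-quoted JavaScript string literals with backslash escapes.
# Assumption: no template literals, and comments/regex literals are not skipped.
_DOUBLE = r'"((?:[^"\\\n]|\\.)*)"'
_SINGLE = r"'((?:[^'\\\n]|\\.)*)'"
STRING_LITERAL = re.compile(_DOUBLE + "|" + _SINGLE)


def string_features(source, min_length=4):
    """Return the set of string literal bodies of at least `min_length` chars.

    The surrounding quotes are dropped so that a minifier switching quote
    style does not change the feature set.
    """
    features = set()
    for match in STRING_LITERAL.finditer(source):
        body = match.group(1) if match.group(1) is not None else match.group(2)
        if len(body) >= min_length:
            features.add(body)
    return features


# Hypothetical before/after pair: identifiers and quote style differ,
# but the error message string is the same.
ORIGINAL = 'function warn(message) {\n  console.error("Invariant Violation: " + message);\n}'
MINIFIED = "function a(b){console.error('Invariant Violation: '+b)}"
```

For this pair, `string_features(ORIGINAL)` and `string_features(MINIFIED)` produce the same set. Real minifiers may also merge adjacent constant strings or rewrite escape sequences, which this sketch does not normalize.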

Abstract Syntax Tree (AST) Analysis

  1. Parse JavaScript Bundle: Utilize tools like Esprima, Acorn, or Babel to convert JavaScript code into AST, enabling effective analysis of code structure and content.

  2. Pattern Recognition in AST: Implement algorithms for matching parsed code against defined fingerprint criteria, focusing on structural and syntactical patterns and heuristics established earlier. This involves looking for specific node types (like function declarations, variable declarations, etc.) that match the identified patterns.

  3. Handling Minification: Design heuristics and pattern recognition to be resilient to common minification techniques like variable renaming and whitespace removal.
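
To make the "resilient to variable renaming" point concrete, here is a minimal Python sketch that hashes the shape of an AST while ignoring identifier names and source positions. It assumes the AST was produced by an ESTree-compatible parser (such as Acorn or Esprima, mentioned above) and serialized to JSON, then loaded as nested dicts and lists. `node_shape`, `structural_fingerprint`, and `example_function` are hypothetical helpers, and the hand-built tree stands in for real parser output.

```python
import hashlib

# Keys that carry names or source positions rather than structure.
IGNORED_KEYS = {"type", "start", "end", "loc", "range", "name", "raw"}


def node_shape(node):
    """Flatten an ESTree-style node into a list of structural tokens."""
    tokens = []
    if isinstance(node, list):
        for child in node:
            tokens.extend(node_shape(child))
    elif isinstance(node, dict) and "type" in node:
        tokens.append(node["type"])
        if node["type"] == "Literal":
            tokens.append(repr(node.get("value")))  # constants tend to survive renaming
        if "operator" in node:
            tokens.append(str(node["operator"]))  # keep `+` distinct from `-`
        for key in sorted(node):  # sorted for a deterministic traversal order
            if key not in IGNORED_KEYS:
                tokens.extend(node_shape(node[key]))
    return tokens


def structural_fingerprint(node):
    """Return a SHA-256 hex digest of the node's structural tokens."""
    joined = "\u0000".join(node_shape(node))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def example_function(name, left, right, operator="+"):
    """Hand-built AST for `function name(left, right) { return left <op> right; }`."""
    def ident(n):
        return {"type": "Identifier", "name": n}
    return {
        "type": "FunctionDeclaration",
        "id": ident(name),
        "params": [ident(left), ident(right)],
        "body": {"type": "BlockStatement", "body": [{
            "type": "ReturnStatement",
            "argument": {"type": "BinaryExpression", "operator": operator,
                         "left": ident(left), "right": ident(right)},
        }]},
    }
```

With this sketch, `example_function("add", "a", "b")` and `example_function("n", "t", "r")` hash to the same fingerprint, while changing the operator to `-` changes it. Dropping every identifier name is a deliberate simplification: property names such as `obj.render` often survive minification and could be kept as extra signal. Minifiers also restructure code (for example rewriting `true` as `!0` or inlining functions), which a purely structural hash like this does not account for.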

Building and Maintaining the Fingerprint Database

  1. Collecting Fingerprints: Start with popular libraries, analyzing unminified sources to establish baseline fingerprints.

  2. Database Structure and Contributions: Organize the database with library entries, fingerprints, and metadata, and set up a system for validating and integrating community contributions.
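
One possible shape for a database entry, sketched in Python with JSON as the storage format. The schema is an assumption made for illustration: `FingerprintEntry` and its field names (`library`, `version`, `features`, `metadata`) are hypothetical, not an existing format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FingerprintEntry:
    """One library version and the fingerprint features collected for it."""
    library: str
    version: str
    features: list = field(default_factory=list)   # e.g. hashes or string constants
    metadata: dict = field(default_factory=dict)   # e.g. where the source came from


def dump_database(entries):
    """Serialize entries to a JSON string that diffs cleanly in version control."""
    return json.dumps([asdict(entry) for entry in entries], indent=2, sort_keys=True)


def load_database(text):
    """Parse a JSON string produced by `dump_database` back into entries."""
    return [FingerprintEntry(**raw) for raw in json.loads(text)]
```

A plain-text format like this keeps community contributions reviewable as ordinary pull requests, at the cost of size once many versions are tracked.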

Matching Mechanism and Script Integration

  1. Matching Algorithm: Develop an algorithm to compare AST nodes against database fingerprints.

  2. Handling Ambiguities: Design the system to handle cases where a definitive match isn't possible. This could involve ranking potential matches or providing a probability score.

  3. Script Functionality: Ensure the script can process a JavaScript bundle, parse it, and match it against fingerprints. Test the matching algorithm with both minified and unminified versions of known libraries to ensure accuracy.
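
The ranking idea in "Handling Ambiguities" can be sketched as a set comparison. The Python example below scores each library by containment: the fraction of its known features that also appear in the bundle. Containment is a technique chosen here for illustration, on the assumption that a bundle holds many libraries and so a symmetric measure would unfairly penalize each one. `containment`, `rank_matches`, the threshold value, and the library names in the usage note are all hypothetical.

```python
def containment(library_features, bundle_features):
    """Return the fraction of `library_features` present in `bundle_features`."""
    if not library_features:
        return 0.0
    return len(library_features & bundle_features) / len(library_features)


def rank_matches(bundle_features, database, threshold=0.5):
    """Return (library, score) pairs at or above `threshold`, best match first.

    `database` maps a library name to its set of fingerprint features.
    Ties are broken alphabetically so the output is deterministic.
    """
    scored = (
        (name, containment(features, bundle_features))
        for name, features in database.items()
    )
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

For a bundle with features `{"a", "b", "c", "x"}` and a database entry `"example-lib-a"` with features `{"a", "b", "c", "d"}`, the score is 0.75, since three of the four known features were found. Returning scores instead of a yes/no answer leaves the decision about what counts as a match to the caller.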

Handling Minification Variations

  1. Minification Resilience: Ensure your fingerprints are resilient to common minification techniques, like variable renaming and whitespace removal.

  2. Testing Across Minifiers: Test your system with code minified by different tools (e.g., UglifyJS, Terser) to ensure reliability.

User Interface and Accessibility

  1. Ease of Use: Develop a user-friendly interface for both contributing to the database and using the script to identify libraries.

  2. Integration with Development Tools: Consider how your tool could integrate with existing development workflows, such as bundlers or IDEs.

Open Sourcing and Community Engagement

  1. Open Source the Tool: Release the tool and database as open source, encouraging community contributions of new fingerprints and algorithm improvements.

  2. Documentation and Support Channels: Provide thorough documentation and establish channels for community support, feedback, and contributions.

Continuous Improvement and Adaptation

  1. Regular Updates and Algorithm Refinement: Keep the database up-to-date with new library versions and continuously refine matching algorithms based on user feedback and new JavaScript development trends.

  2. Performance Optimization: Optimize for performance, especially when analyzing large bundles, using efficient AST traversal and optimized pattern matching algorithms.

Legal and Ethical Considerations

  1. Respecting Copyright: Ensure that the use of library fingerprints complies with legal and ethical standards, respecting the intellectual property rights of library authors.

This project aligns well with your background in IT and software engineering. The main challenge lies in the nuances of JavaScript minification and the diverse coding patterns across different libraries, but a focused and iterative approach can lead to a robust and useful tool for the community.
