# Fingerprinting Minified JavaScript Libraries

## Table of Contents

- [See Also](#see-also)
- [Initial ChatGPT Conversation / Notes](#initial-chatgpt-conversation--notes)
- [Thoughts / comments as I've articulated them elsewhere](#thoughts--comments-as-ive-articulated-them-elsewhere)
  - [On `0xdevalias/chatgpt-source-watch`](#on-0xdevaliaschatgpt-source-watch)
    - [Issue 15: Explore using embeddings/similar to identify/track similar chunks/modules even when renamed](#issue-15-explore-using-embeddingssimilar-to-identifytrack-similar-chunksmodules-even-when-renamed)
  - [On `j4k0xb/webcrack`](#on-j4k0xbwebcrack)
    - [Issue 21: rename short identifiers](#issue-21-rename-short-identifiers)
    - [Issue 62: add smart-rename rule from wakaru](#issue-62-add-smart-rename-rule-from-wakaru)
    - [Issue 143: `[plugin]` Add support for `data-sentry-component` / `data-sentry-element` / `data-sentry-source-file` (from `@sentry/babel-plugin-component-annotate`)](#issue-143-plugin-add-support-for-data-sentry-component--data-sentry-element--data-sentry-source-file-from-sentrybabel-plugin-component-annotate)
    - [Issue 151: `[plugin]` plugin to support WordPress Gutenberg specific blocks features (including how it injects `window.React`, `window.wp.element`, etc) within JSX decompilation](#issue-151-plugin-plugin-to-support-wordpress-gutenberg-specific-blocks-features-including-how-it-injects-windowreact-windowwpelement-etc-within-jsx-decompilation)
    - [Issue 152: `[plugin]` plugin to support unminifying `goober` CSS-in-JS library patterns + related JSX decompilation](#issue-152-plugin-plugin-to-support-unminifying-goober-css-in-js-library-patterns--related-jsx-decompilation)
    - [Issue 154: "stable" identifier demangling](#issue-154-stable-identifier-demangling)
  - [On `pionxzh/wakaru`](#on-pionxzhwakaru)
    - [Issue 34: support `un-mangle` identifiers](#issue-34-support-un-mangle-identifiers)
    - [Issue 41: Module detection](#issue-41-module-detection)
    - [Issue 73: add a 'module graph'](#issue-73-add-a-module-graph)
    - [Issue 74: explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc)](#issue-74-explore-ast-fingerprinting-for-modulefunction-identification-eg-to-assist-smart--stable-renames-etc)
    - [Issue 121: Explore creating a 'reverse engineered' records.json / stats.json file from a webpack build](#issue-121-explore-creating-a-reverse-engineered-recordsjson--statsjson-file-from-a-webpack-build)
    - [Issue 140: `[smart-rename]` Add support for `data-sentry-component` / `data-sentry-element` / `data-sentry-source-file` (from `@sentry/babel-plugin-component-annotate`)](#issue-140-smart-rename-add-support-for-data-sentry-component--data-sentry-element--data-sentry-source-file-from-sentrybabel-plugin-component-annotate)
  - [On `jehna/humanify`](#on-jehnahumanify)
    - [Issue 97: More deterministic renames across different versions of the same code](#issue-97-more-deterministic-renames-across-different-versions-of-the-same-code)

## See Also

- [Fingerprinting Minified JavaScript Libraries / AST Fingerprinting / Source Code Similarity / Etc (0xdevalias' gist)](https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc)
- [JavaScript Web App Reverse Engineering - Module Identification (0xdevalias' gist)](https://gist.github.com/0xdevalias/28c18edfc17606f09cf413f97e404a60#javascript-web-app-reverse-engineering---module-identification)
- [Reverse Engineered Webpack Tailwind-Styled-Component (0xdevalias' gist)](https://gist.github.com/0xdevalias/916e4ababd3cb5e3470b07a024cf3125#reverse-engineered-webpack-tailwind-styled-component)
- [Deobfuscating / Unminifying Obfuscated Web App / JavaScript Code (0xdevalias' gist)](https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#deobfuscating--unminifying-obfuscated-web-app--javascript-code)
  - [Obfuscation / Deobfuscation](https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#obfuscation--deobfuscation)
  - [Variable Name Mangling](https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#variable-name-mangling)
  - [Stack Graphs / Scope Graphs](https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#stack-graphs--scope-graphs)
  - [My ChatGPT Research / Conversations](https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#my-chatgpt-research--conversations)
- https://github.com/j4k0xb/webcrack
  - https://github.com/j4k0xb/webcrack/issues/21
    - > rename short identifiers
- https://github.com/pionxzh/wakaru
  - https://github.com/pionxzh/wakaru/issues/34
    - > support `un-mangle` identifiers
  - https://github.com/pionxzh/wakaru/issues/41
    - > Module detection
  - https://github.com/pionxzh/wakaru/issues/73
    - > add a 'module graph'
  - https://github.com/pionxzh/wakaru/issues/74
    - > explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc)
  - https://github.com/pionxzh/wakaru/issues/121
    - > Explore creating a 'reverse engineered' records.json / stats.json file from a webpack build
- https://github.com/jehna/humanify
  - https://github.com/jehna/humanify/issues/97
    - > More deterministic renames across different versions of the same code

## Initial ChatGPT Conversation / Notes

Source: https://chat.openai.com/c/d9b7b64f-aa93-474e-939f-79e376e6d375

**Me:**

> Are there any tools that allow me to identify potentially minified javascript/etc libraries within a webpack or similar bundle?
**ChatGPT:** > Identifying and fingerprinting minified JavaScript modules within a webpack bundle, regardless of their minification method, can be a challenging task due to the nature of minification and bundling. However, there are approaches and tools you can use to achieve this: > > 1. **Custom Fingerprinting Script**: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles. This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns. > > 2. **Entropy Analysis**: Minified code typically has higher entropy than non-minified code. By analyzing the entropy of different sections of your bundle, you might be able to identify boundaries between different libraries or modules. > > 3. **AST Analysis**: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures. > > 4. **Binary Analysis Tools**: Tools like Binwalk, which are traditionally used for analyzing binary files, can sometimes be repurposed for script analysis. They work by scanning files for embedded files and executable code. > > 5. **Regular Expression Matching**: This involves creating regular expressions that match known patterns from popular libraries. While this approach can be time-consuming and may require regular updates, it can be effective for well-known and widely used libraries. > > 6. **Machine Learning Approaches**: With advancements in machine learning, you can train a model to recognize patterns in minified code. This requires a substantial dataset of minified and unminified code to train the model. > > 7. **Community Databases**: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle. > > 8. **Source Map Reversal**: If source maps are available, you can reverse-engineer them to get the original source code, making it easier to identify the libraries. > > Remember, the effectiveness of these methods depends on various factors, including the degree of minification, the presence of source maps, and the specific bundling process used. Additionally, given the potential complexity, this might be a time-consuming task, but your background in software engineering and ethical hacking should provide a solid foundation for tackling this challenge. **Me:** > Let's work with the following ideas: > > Custom Fingerprinting Script: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles. This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns. > > AST Analysis: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures. > > Community Databases: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle. 
> > -- > > Let's assume that there isn't currently a community database, but that once we figure out an effective way of creating and matching these fingerprints, that we could open source one. > > Since we want this to work regardless of whether the library code has been minified or not, we need to take that into account. > > How should we approach this? **ChatGPT:** > Creating a system to fingerprint JavaScript libraries within a webpack bundle, especially considering the challenges posed by minification, is a multi-step process. The goal is to create a system that can analyze JavaScript code, extract identifiable patterns, and match these against a database of known library fingerprints. Here's a proposed approach: > > ### Establishing a Fingerprinting Methodology > > 1. **Identify Unique Patterns**: Begin by identifying patterns that remain consistent in a library, even after minification. These could include: > - Unique function signatures. > - Specific algorithm implementations. > - Characteristic variable naming schemes that survive minification. > - Constant values or strings used by the library. > > 2. **Heuristics for Detection**: Develop heuristics to detect these patterns, such as specific function name and parameter combinations, unique method chaining styles, or specific API calls. > > ### Abstract Syntax Tree (AST) Analysis > > 1. **Parse JavaScript Bundle**: Utilize tools like Esprima, Acorn, or Babel to convert JavaScript code into AST, enabling effective analysis of code structure and content. > > 2. **Pattern Recognition in AST**: Implement algorithms for matching parsed code against defined fingerprint criteria, focusing on structural and syntactical patterns and heuristics established earlier. This involves looking for specific node types (like function declarations, variable declarations, etc.) that match the identified patterns. > > 3. **Handling Minification**: Design heuristics and pattern recognition to be resilient to common minification techniques like variable renaming and whitespace removal. > > ### Building and Maintaining the Fingerprint Database > > 1. **Collecting Fingerprints**: Start with popular libraries, analyzing unminified sources to establish baseline fingerprints. > > 2. **Database Structure and Contributions**: Organize the database with library entries, fingerprints, and metadata, and set up a system for validating and integrating community contributions. > > ### Matching Mechanism and Script Integration > > 1. **Matching Algorithm**: Develop an algorithm to compare AST nodes against database fingerprints. > > 2. **Handling Ambiguities**: Design the system to handle cases where a definitive match isn't possible. This could involve ranking potential matches or providing a probability score. > > 3. **Script Functionality**: Ensure the script can process a JavaScript bundle, parse it, and match it against fingerprints. Test the matching algorithm with both minified and unminified versions of known libraries to ensure accuracy. > > ### Handling Minification Variations > > 1. **Minification Resilience**: Ensure your fingerprints are resilient to common minification techniques, like variable renaming and whitespace removal. > > 2. **Testing Across Minifiers**: Test your system with code minified by different tools (e.g., UglifyJS, Terser) to ensure reliability. > > ### User Interface and Accessibility > > 1. **Ease of Use**: Develop a user-friendly interface for both contributing to the database and using the script to identify libraries. > > 2. 
**Integration with Development Tools**: Consider how your tool could integrate with existing development workflows, such as bundlers or IDEs. > > ### Open Sourcing and Community Engagement > > 1. **Open Source the Tool**: Release the tool and database as open source, encouraging community contributions of new fingerprints and algorithm improvements. > > 2. **Documentation and Support Channels**: Provide thorough documentation and establish channels for community support, feedback, and contributions. > > ### Continuous Improvement and Adaptation > > 1. **Regular Updates and Algorithm Refinement**: Keep the database up-to-date with new library versions and continuously refine matching algorithms based on user feedback and new JavaScript development trends. > > 2. **Performance Optimization**: Optimize for performance, especially when analyzing large bundles, using efficient AST traversal and optimized pattern matching algorithms. > > ### Legal and Ethical Considerations > > 1. **Respecting Copyright**: Ensure that the use of library fingerprints complies with legal and ethical standards, respecting the intellectual property rights of library authors. > > This project aligns well with your background in IT and software engineering. The main challenge lies in the nuances of JavaScript minification and the diverse coding patterns across different libraries, but a focused and iterative approach can lead to a robust and useful tool for the community. ## Thoughts / comments as I've articulated them elsewhere This is mostly making references to issues I have opened, or comments I have written in them.. but sometimes it also just includes an interesting issue within the same theme as the types of things I am interested about here as well. ### On `0xdevalias/chatgpt-source-watch` #### Issue 15: Explore using embeddings/similar to identify/track similar chunks/modules even when renamed > I did some initial exploratory work for this in a script ages back; can't remember if it was in `chatgpt-source-watch` or [`udio-source-watch`](https://github.com/0xdevalias/udio-source-watch) repo, and not sure if it ever got to being committed or if it's just somewhere locally still. > > The general gist of this issue is that between webpack/similar builds, sometimes the chunk identifiers are renamed, which can mess up our diffing. Often times it's relatively easy to see/guess the renames based on looking at the diffs themselves (eg. in the [`_buildManifest.js`](https://github.com/0xdevalias/chatgpt-source-watch/commits/main/unpacked/_next/static/%5BbuildHash%5D/_buildManifest.js) / [`webpack.js`](https://github.com/0xdevalias/chatgpt-source-watch/blob/e78982472adbc9c5d8fd525ab2aba270f49c1006/unpacked/_next/static/chunks/webpack.js#L122-L266) files; but then it's a semi-manual process of renaming these to align so that the diffs look correctly (I believe I wrote some scripts to assist with this at some point also, probably alongside the one mentioned earlier, but similarly may not have been committed anywhere yet). > > Similarly, sometimes the chunk identifiers themselves may not have changed, but the module identifiers and/or which chunk they are in may have moved around; causing similar issues with diffing/identifying what is actually new code vs just being moved around, etc. 
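
As a rough, hypothetical illustration of the chunk-matching idea described in the rest of this quoted comment (the Python script further below does the same thing more thoroughly with TF-IDF), the following sketch scores two chunk files by plain token-frequency cosine similarity; the `termFrequencies` / `cosineSimilarity` helpers are made up for this example, and a simple frequency vector is only a crude stand-in for real embeddings:

```js
// Hand-rolled token-frequency cosine similarity between two chunk files.
const fs = require('fs');

function termFrequencies(text) {
  const counts = new Map();
  for (const token of text.match(/[A-Za-z0-9_$]+/g) ?? []) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [token, countA] of a) {
    normA += countA * countA;
    const countB = b.get(token);
    if (countB) dot += countA * countB;
  }
  for (const countB of b.values()) normB += countB * countB;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Usage: node chunk-similarity.js <chunkA.js> <chunkB.js>
const [fileA, fileB] = process.argv.slice(2);
const similarity = cosineSimilarity(
  termFrequencies(fs.readFileSync(fileA, 'utf8')),
  termFrequencies(fs.readFileSync(fileB, 'utf8'))
);
console.log(`${fileA} vs ${fileB}: ${similarity.toFixed(4)}`);
```
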
> > The idea here is basically to use embeddings / similarity search / etc to compare the chunk files (which is what my initial script does), or the modules within them (which is a more recent idea I had for further enhancements to this) to find the closest match; which then allows us to infer in a programmatic/automated way whether its likely to have been renamed; after which we can handle it appropriately. > > _Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/15#issue-2834205626_ > I'll try and find my older scripts later, but for now, a couple of initial references that may be useful, initially found from this issue: > > - https://github.com/All-Hands-AI/openhands-aci/pull/34 > - https://github.com/UKPLab/sentence-transformers > - > Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co. > - > State-of-the-Art Text Embeddings > - https://sbert.net/ > - > Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models ([quickstart](https://sbert.net/docs/quickstart.html#sentence-transformer)) or to calculate similarity scores using Cross-Encoder models ([quickstart](https://sbert.net/docs/quickstart.html#cross-encoder)). This unlocks a wide range of applications, including [semantic search](https://sbert.net/examples/applications/semantic-search/README.html), [semantic textual similarity](https://sbert.net/docs/usage/semantic_textual_similarity.html), and [paraphrase mining](https://sbert.net/examples/applications/paraphrase-mining/README.html). > - https://sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html > - > Semantic Textual Similarity > > For Semantic Textual Similarity (STS), we want to produce embeddings for all texts involved and calculate the similarities between them. The text pairs with the highest similarity score are most semantically similar. See also the [Computing Embeddings](https://sbert.net/examples/applications/computing-embeddings/README.html) documentation for more advanced details on getting embedding scores. > - https://sbert.net/docs/quickstart.html#cross-encoder > - > Cross Encoder > > Characteristics of Cross Encoder (a.k.a reranker) models: > > - Calculates a similarity score given pairs of texts. > > - Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model. > > - Often slower than a Sentence Transformer model, as it requires computation for each pair rather than each text. > > - Due to the previous 2 characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformer model. > - https://sbert.net/examples/applications/computing-embeddings/README.html > - > Computing Embeddings > - https://sbert.net/docs/sentence_transformer/usage/efficiency.html > - > Speeding up Inference > > Sentence Transformers supports 3 backends for computing embeddings, each with its own optimizations for speeding up inference > - https://huggingface.co/spaces/mteb/leaderboard > - > MMTEB: Massive Multilingual Text Embedding Benchmark > > The MMTEB leaderboard compares text embedding models on 1000+ languages. 
>
> _Originally posted by @0xdevalias in [#15](https://github.com/0xdevalias/chatgpt-source-watch/issues/15#issuecomment-2638317388)_

> > I did some initial exploratory work for this in a script ages back; can't remember if it was in `chatgpt-source-watch` or [`udio-source-watch`](https://github.com/0xdevalias/udio-source-watch) repo, and not sure if it ever got to being committed or if it's just somewhere locally still.
>
> Looks like it was in `udio-source-watch`, and that it is still only local, not committed/pushed anywhere. Here are the seemingly relevant scripts/bits.
>
> ## `requirements.txt`:
>
> ```
> numpy==1.26.4
> scikit-learn==1.4.2
> ```
>
> ## `scripts/text_similarity_checker.py`
>
> ```python
> #!/usr/bin/env python
>
> # TODO: It would be interesting to see how this TfidfVectorizer + cosine_similarity method compares with using difflib's SequenceMatcher + ratio methods:
> #   https://docs.python.org/3/library/difflib.html#sequencematcher-objects
> # See also:
> #   https://docs.python.org/3/library/difflib.html#difflib.get_close_matches
> #     Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
>
> # TODO: This ChatGPT chat has some examples of how to calculate this sort of thing in JavaScript:
> #   https://chatgpt.com/c/7fef26fd-0531-4079-b508-43904ff3e089
> # See also:
> #   https://github.com/NaturalNode/natural/
> #   https://naturalnode.github.io/natural/
> #   https://blog.logrocket.com/natural-language-processing-node-js/
> #   https://winkjs.org/
> #   https://winkjs.org/wink-nlp/bm25-vectorizer.html
> #     BM25 is a major improvement over the classical TF-IDF based algorithms. The weights for a specific term (i.e. token) is computed using the BM25 algorithm.
> #   https://github.com/winkjs/wink-nlp
> #   https://github.com/winkjs/wink-nlp-utils
> #   https://winkjs.org/wink-nlp-utils/
> #   https://github.com/winkjs/wink-distance
>
> import argparse
> from sklearn.feature_extraction.text import TfidfVectorizer
> from sklearn.metrics.pairwise import cosine_similarity
> import os
>
> def read_file(file_path):
>     with open(file_path, 'r', encoding='utf-8') as file:
>         return file.read()
>
> def calculate_similarities(main_file, other_files):
>     documents = [read_file(main_file)] + [read_file(f) for f in other_files]
>     tfidf_vectorizer = TfidfVectorizer()
>     tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
>     main_doc_matrix = tfidf_matrix[0:1]
>     similarities = cosine_similarity(main_doc_matrix, tfidf_matrix[1:])
>     return list(zip(other_files, similarities.flatten()))
>
> def main():
>     parser = argparse.ArgumentParser(description="Calculate cosine similarity between a main file and a list of other files.")
>     parser.add_argument("main_file", type=str, help="The main file to compare.")
>     parser.add_argument("other_files", nargs='+', type=str, help="A list of other files to compare against the main file.")
>     args = parser.parse_args()
>
>     # Filter out the main file early if it's accidentally included in other_files
>     filtered_files = [f for f in args.other_files if f != args.main_file]
>
>     if not os.path.isfile(args.main_file):
>         print(f"Error: '{args.main_file}' does not exist or is not a file.")
>         return
>
>     for file_path in filtered_files:
>         if not os.path.isfile(file_path):
>             print(f"Error: '{file_path}' does not exist or is not a file.")
>             return
>
>     results = calculate_similarities(args.main_file, filtered_files)
>     sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
>
>     # for other_file, similarity in sorted_results:
>     #     print(f"Similarity between {args.main_file} and {other_file}: {similarity:.4f}")
>
>     print(f"Comparing against: {args.main_file}")
>     for other_file, similarity in sorted_results:
>         print(f"{other_file}: {similarity:.4f}")
>
> if __name__ == "__main__":
>     main()
> ```
>
> ## `scripts/rename-chunk.sh`
>
> ```shell
> #!/usr/bin/env zsh
>
> # Check if the correct number of arguments is provided
> if [ "$#" -ne 2 ]; then
>     echo "Usage: $0 <old_file> <new_file>"
>     exit 1
> fi
>
> old_file=$1
> new_file=$2
>
> # Check if both arguments are regular files
> if [ ! -f "$old_file" ]; then
>     echo "Error: $old_file is not a file."
>     exit 1
> fi
>
> if [ ! -f "$new_file" ]; then
>     echo "Error: $new_file is not a file."
>     exit 1
> fi
>
> # Check if the old file is tracked by Git
> if ! git ls-files --error-unmatch "$old_file" &> /dev/null; then
>     echo "Error: $old_file is not tracked by Git."
>     exit 1
> fi
>
> echo "Starting the file renaming process..."
>
> # Temporarily rename the new file to preserve it
> mv $new_file $new_file.new
>
> # Use git mv to rename the old file to the new file's original name
> git mv $old_file $new_file
>
> # Restore the originally new file from its temporary name
> mv $new_file.new $new_file
>
> echo "File renaming complete. $old_file has been renamed to $new_file."
> ```
>
> ## `useful-commands.md`
>
> This isn't the full file, just some relevant looking snippets from it:
>
> ````markdown
> See how similar a chunk file is to other chunk files (to find potential chunkID churn):
>
> ```bash
> npm run chunk:check 1793
>
> # chunk-check () { ./scripts/text_similarity_checker.py unpacked/_next/static/chunks/${1}.js unpacked/_next/static/chunks/*.js | head -n 5; }; chunk-check 1793
> ```
>
> Rename a chunk file that changed due to chunkID churn:
>
> ```bash
> npm run chunk:rename 7073 1793
>
> # chunk-rename () { ./scripts/rename-chunk.sh unpacked/_next/static/chunks/${1}.js unpacked/_next/static/chunks/${2}.js; }; chunk-rename 7073 1793
> ```
> ````
>
> _Originally posted by @0xdevalias in [#15](https://github.com/0xdevalias/chatgpt-source-watch/issues/15#issuecomment-2638327124)_

### On `j4k0xb/webcrack`

#### Issue 21: rename short identifiers

> When I was exploring this concept in my own deobfuscation PoC project, I was looking at making the variable names unique + having them encode some semantic information about their source/scope.
>
> Eg. if it was an arg to a function, it might be `arg_1`. Or potentially if the function is `foo`, it might end up as `foo_arg_1`
>
> It looks like most of the PoC code I was playing with was local/in a pretty messy/hacky state, but I did find a link in it to an online REPL I was playing around with some of it in. Not sure how outdated that code is, but it might be useful:
>
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples
>
> There were a number of different AST parsers I was playing around with, but I think that this babel code may have been the latest (not sure which one):
>
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3.js
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3_clean.js
>
> Within those files, I believe the functions `getNameFromPath`, `getPrefix` (and older commented out functions `getTypePrefix`, `getPrefix`)
>
> ---
>
> **Edit:** Came across this in another issue here:
>
> > I published my decompiler that I used in the above example. I think it might be a good reference for adding this feature.
> > https://github.com/e9x/krunker-decompiler
> >
> > _Originally posted by @e9x in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-1546764062_
>
> And looking at its `libRenameVars` code, it seems to be taking a vaguely similar approach to how I was looking at doing things in my original PoC that I described above:
>
> - https://github.com/e9x/krunker-decompiler/blob/master/src/libRenameVars.ts
>   - `getVarPrefix` will set a prefix based on the type (eg. `func`, `arg`, `Class`, `imported`, `var`)
>   - `getName` generates a new variable name that does not conflict with existing names or reserved keywords
>   - `generateName` generates a new name for a variable considering its scope, type, and the context in which it is used (e.g., whether it's a class, a function variable, etc.). It employs various AST manipulations to ensure the generated name is appropriate and does not conflict with existing names.
>
> A more generalised summary/overview (via ChatGPT):
>
> > Certainly, the code implements a sophisticated algorithm for renaming variables in a JavaScript program, adhering to several high-level rules and strategies:
> >
> > 1.
**Type-Specific Prefixing**: > > - The `getVarPrefix` function assigns specific prefixes to variable names based on their type (e.g., "func" for function names, "arg" for parameters). This approach helps in identifying the role of a variable just by its name. > > > > 2. **Avoiding Reserved Keywords**: > > - The script includes a comprehensive list of reserved JavaScript keywords. If a variable's name matches a reserved keyword, it is prefixed with an underscore to prevent syntax errors. > > > > 3. **Unique Naming with Context Consideration**: > > - The `generateName` function ensures that each variable gets a unique name that doesn't conflict with other variables in its scope. It also considers the context in which a variable is used. For example, if a variable is part of a class, it may receive a name that reflects this context, using `pascalCase` or `camelCase` as appropriate. > > > > 4. **Handling Special Cases**: > > - The script contains logic to handle special cases, such as variables that are function expressions (`isFuncVar`) or class instances (`isClass`). This affects the naming convention applied to these variables. > > > > 5. **Randomness with Mersenne Twister**: > > - A Mersenne Twister is used to generate random elements for variable names, ensuring that the names are not only unique within the scope of the program but also less predictable. > > > > 6. **AST-Based Renaming**: > > - The script analyzes the Abstract Syntax Tree (AST) of the program to understand the structure and scope of variables. This analysis guides the renaming process, ensuring that the new names are consistent with the variable's usage and position in the code. > > > > 7. **Scope Analysis with ESLint Scope**: > > - By leveraging `eslint-scope`, the script can accurately determine the scope of each variable. This is crucial in avoiding name collisions and ensuring that the renaming respects lexical scoping rules in JavaScript. > > > > 8. **Consideration for Exported and Assigned Variables**: > > - The script pays special attention to variables that are exported or assigned in specific ways (e.g., through `Object.defineProperty`). It ensures that these variables receive names that are appropriate for their roles. > > > > In summary, the script uses a combination of type-based naming conventions, context consideration, randomness, AST analysis, and scope analysis to systematically rename variables in a JavaScript program. This approach aims to enhance readability, avoid conflicts, and maintain the logical structure of the program. > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807290502_ --- > And for an even cooler/more extreme version of improving variable naming; I just came across this blog post / project from @jehna that makes use of `webcrack` + ChatGPT for variable renaming: > > - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification.html > - > Using LLMs to reverse JavaScript variable name minification > > This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at [Github project Humanify](https://github.com/jehna/humanify). > - https://github.com/jehna/humanify > - > Un-minify Javascript code using ChatGPT > - > This tool uses large language modeles (like ChatGPT & llama2) and other tools to un-minify Javascript code. 
Note that LLMs don't perform any structural changes – they only provide hints to rename variables and functions. The heavy lifting is done by Babel on AST level to ensure code stays 1-1 equivalent. > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807342986_ --- > I came across another tool today that seemed to have a start on implementing some 'smart rename' features: > > - https://github.com/pionxzh/wakaru > - https://wakaru.vercel.app/ > > Digging through the code lead me to this: > > - https://github.com/pionxzh/wakaru/tree/main/packages/unminify#smart-rename > - > Rename minified identifiers with heuristic rules. > - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/src/transformations/smart-rename.ts > - `handleDestructuringRename`, `handleFunctionParamsRename`, `handlePropertyRename`, `handleReactRename`, `getElementName` > - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/src/utils/identifier.ts#L28-L75 > - `generateName`, `getUniqueName` > - https://github.com/pionxzh/wakaru/blob/master/packages/unminify/src/transformations/__tests__/smart-rename.spec.ts > > There's also an issue there that seems to be exploring how to improve 'unmangling variable names' as well: > > - https://github.com/pionxzh/wakaru/issues/34 > > Which I wrote the following extra thoughts on: > > > I just finished up writing some thoughts/references for variable renaming on the `webcrack` repo, that could also be a useful idea for here. (see quotes below) > > > > When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first. > > > > That could then be letter enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them. > > > > It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code. > > > > My personal ideal end goal for a feature like that would then allow me to use it within an IDE-like environment, where I can rename variables 'as I explore', knowing that the mappings/etc will be kept up to date. > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_ > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807398430_ --- > Another link from my reference notes that I forgot to include earlier; my thoughts on how to rename otherwise unknown variables are based on similar concepts that are used in reverse engineering tools such as IDA: > > - https://hex-rays.com/blog/igors-tip-of-the-week-34-dummy-names/ > - > In IDA’s disassembly, you may have often observed names that may look strange and cryptic on first sight: `sub_73906D75`, `loc_40721B`, `off_40A27C` and more. In IDA’s terminology, they’re called dummy names. 
They are used when a name is required by the assembly syntax but there is nothing suitable available > - https://www.hex-rays.com/products/ida/support/idadoc/609.shtml > - > IDA Help: Names Representation > - > Dummy names are automatically generated by IDA. They are used to denote subroutines, program locations and data. Dummy names have various prefixes depending on the item type and value > > --- > > > And a few more I was looking at recently as well (that is sort of basically `smart-rename`: > > > > - https://binary.ninja/2023/09/15/3.5-expanded-universe.html#automatic-variable-naming > > - > Automatic Variable Naming > > > One easy way to improve decompilation output is to come up with [better default names](https://github.com/Vector35/binaryninja-api/issues/2558) for variables. There’s a lot of possible defaults you could choose and a number of different strategies are seen throughout different reverse engineering tools. Prior to 3.5, Binary Ninja left variables named based on their origin. Stack variables were var_OFFSET, register-based variables were reg_COUNTER, and global data variables were (data_). While this scheme isn’t changing, we’re being much more intelligent about situations where additional information is available. > > > > > > For example, if a variable is passed to a function and a variable name is available, we can now make a much better guess for the variable name. This is most obvious in binaries with type libraries. > > - > This isn’t the only style of default names. Binary Ninja also will name loop counters with simpler names like `i`, or `j`, `k`, etc (in the case of nested loops) > > - https://github.com/Vector35/binaryninja-api/issues/2558 > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1822263687_ > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1822262649_ #### Issue 62: add smart-rename rule from wakaru - https://github.com/j4k0xb/webcrack/issues/62 - > add smart-rename rule from wakaru - https://github.com/j4k0xb/webcrack/pull/63 - > feat: rename destructuring - https://github.com/j4k0xb/webcrack/pull/100 - > feat: configurable smart rename - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/README.md#smart-rename - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/src/transformations/smart-rename.ts - `handleDestructuringRename` - `handleReactRename` #### Issue 143: `[plugin]` Add support for `data-sentry-component` / `data-sentry-element` / `data-sentry-source-file` (from `@sentry/babel-plugin-component-annotate`) See: - [On `pionxzh/wakaru`](#on-pionxzhwakaru) - [Issue 140: `[smart-rename]` Add support for `data-sentry-component` / `data-sentry-element` / `data-sentry-source-file` (from `@sentry/babel-plugin-component-annotate`)](#issue-140-smart-rename-add-support-for-data-sentry-component--data-sentry-element--data-sentry-source-file-from-sentrybabel-plugin-component-annotate) Note: This was crossposted to the following issues: - https://github.com/j4k0xb/webcrack/issues/143 - https://github.com/pionxzh/wakaru/issues/140 - https://github.com/jehna/humanify/issues/350 #### Issue 151: `[plugin]` plugin to support WordPress Gutenberg specific blocks features (including how it injects `window.React`, `window.wp.element`, etc) within JSX decompilation > Mostly creating this based on the exploration I did in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-2693645060 before I realised it was likely unrelated to more core 
React / JSX handling. > > I suspect the bulk of this is niche enough that it wouldn't make sense to include in core, but would be a good candidate for a plugin as per https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692345330 / https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692517232 > > This is also aligned to `wakaru`'s proposed module-detection feature: > > - https://github.com/pionxzh/wakaru/issues/41 > > @j4k0xb I also don't expect this to be something you create; but figured since I already did the deeper exploration in this repo, I may as well create a standalone reference point for it, even if this issue ends up getting closed. > > --- > > From my prior exploration: > > > **Edit 2:** Looking a bit deeper, I think `window.wp.element` relates more specifically to how the Wordpress Gutenberg editor may inject things: > > > > - https://github.com/search?type=code&q=window.wp.element > > - https://wordpress.org/gutenberg/ > > - https://github.com/WordPress/gutenberg > > - https://developer.wordpress.org/block-editor/reference-guides/packages/packages-create-block/ > > - https://github.com/WordPress/gutenberg/tree/trunk/packages/create-block > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/packages/create-block/lib/templates/es5/index.js.mustache#L9-L14 > > - We can see `wp.element.createElement` / etc usage here > > - And where it's reading from `window.wp` here > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/packages/create-block/lib/templates/es5/index.js.mustache#L71 > > > > Specifically in the 'plain JS' usage: > > > > - https://github.com/hrsetyono/gutenberg-tutorial/blob/476f19ee0413ebf719df8981dc982b6aa5b64348/README.md?plain=1#L19-L31 > > - https://developer.wordpress.org/block-editor/getting-started/fundamentals/javascript-in-the-block-editor/#javascript-without-a-build-process > > - > When you opt out of a build process, you interact directly with WordPress’s [JavaScript APIs](https://developer.wordpress.org/block-editor/reference-guides/packages/) through the global `wp` object. > > - https://developer.wordpress.org/block-editor/reference-guides/packages/#using-the-packages-via-wordpress-global > > - > JavaScript packages are available as a registered script in WordPress and can be accessed using the `wp` global variable. 
> > - https://www.npmjs.com/org/wordpress > > > > So using `window.wp.element` would map to a version of `@wordpress/element`, provided by the backend through the `window.wp` global: > > > > - https://github.com/WordPress/gutenberg/tree/trunk/packages/element > > - https://developer.wordpress.org/block-editor/reference-guides/packages/packages-element/ > > > > Whereas in the non-static version, we can see that `registerBlockType` directly refers to the imported `Edit` / `Save`, which seem to handle their own imports, and/or use a JSX transform defined elsewhere in the build chain: > > > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/packages/create-block/lib/templates/block/index.js.mustache#L26-L43 > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/packages/create-block/lib/templates/block/edit.js.mustache > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/packages/create-block/lib/templates/block/save.js.mustache > > - https://developer.wordpress.org/block-editor/getting-started/fundamentals/javascript-in-the-block-editor/#javascript-with-a-build-process > > > > We can also see that the `window.React` global might come from Wordpress Gutenberg as well, as we can see from this example code that injects it: > > > > - https://github.com/search?q=repo%3AWordPress%2Fgutenberg%20window.React&type=code > > - https://github.com/WordPress/gutenberg/tree/2103d5021066593f25f2baae9038b0cf23372b7f/packages/editor#blockcontrols > > - https://github.com/WordPress/gutenberg/tree/2103d5021066593f25f2baae9038b0cf23372b7f/packages/editor#richtext > > > > We also get another clue here, where again `window.React` is injected into the function, and then a followup note to that: > > > > - https://github.com/WordPress/gutenberg/blob/2103d5021066593f25f2baae9038b0cf23372b7f/docs/how-to-guides/plugin-sidebar-0.md#step-1-get-a-sidebar-up-and-running > > - > For this code to work, those utilities need to be available in the browser, so you must specify `wp-plugins`, `wp-editor`, and `react` as dependencies of your script. > > - > Here is the PHP code to register your script and specify the dependencies: > > > > > > ```php > > > function sidebar_plugin_register() { > > > wp_register_script( > > > 'plugin-sidebar-js', > > > plugins_url( 'plugin-sidebar.js', __FILE__ ), > > > array( 'wp-plugins', 'wp-editor', 'react' ) > > > ); > > > } > > > add_action( 'init', 'sidebar_plugin_register' ); > > > ``` > > > > So I guess, similar to the comment made in https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692345330, the deeper specifics of this may belong in a [separate plugin](https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692517232) instead of `webcrack` core. > > > > Though.. I do wonder if the `window.React` (assigned to a variable) usage is generic enough that it might make sense to include in core? 
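
To make the idea of tracing a `window.React` alias through to its `createElement` usages a bit more concrete, here's a small hypothetical sketch (using `@babel/parser` / `@babel/traverse` directly; this is not webcrack's actual implementation) that collects variables initialised from `window.React` and flags their `.createElement(...)` calls as JSX 'pragma' candidates:

```js
const parser = require('@babel/parser');
const traverse = require('@babel/traverse').default;

const code = `
  var t = window.React;
  var el = t.createElement("div", null, "hello");
`;

const ast = parser.parse(code);
const reactAliases = new Set();

// Pass 1: collect identifiers initialised from `window.React`
traverse(ast, {
  VariableDeclarator(path) {
    const { id, init } = path.node;
    if (
      id.type === 'Identifier' &&
      init &&
      init.type === 'MemberExpression' &&
      init.object.type === 'Identifier' &&
      init.object.name === 'window' &&
      init.property.type === 'Identifier' &&
      init.property.name === 'React'
    ) {
      reactAliases.add(id.name);
    }
  },
});

// Pass 2: flag `<alias>.createElement(...)` calls as JSX pragma candidates
traverse(ast, {
  CallExpression(path) {
    const callee = path.node.callee;
    if (
      callee.type === 'MemberExpression' &&
      callee.object.type === 'Identifier' &&
      reactAliases.has(callee.object.name) &&
      callee.property.type === 'Identifier' &&
      callee.property.name === 'createElement'
    ) {
      console.log(`JSX pragma candidate at line ${path.node.loc.start.line}`);
    }
  },
});
```
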
> > > > - https://github.com/search?type=code&q=window.React > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-2693645060_ > > ## See Also > > - https://github.com/j4k0xb/webcrack/issues/152 > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/151#issue-2890824882_ #### Issue 152: `[plugin]` plugin to support unminifying `goober` CSS-in-JS library patterns + related JSX decompilation > Mostly creating this based on the exploration I did in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-2693645060 in case there is no generic way to solve that in core, and it needs to be a more library specific plugin solution as per https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692345330 / https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692517232 > > This is also aligned to `wakaru`'s proposed module-detection feature: > > - https://github.com/pionxzh/wakaru/issues/41 > > @j4k0xb I also don't expect this to be something you create; but figured since I already did the deeper exploration in this repo, I may as well create a standalone reference point for it, even if this issue ends up getting closed. > > --- > > From my prior exploration: > > > **Edit 3:** Looking at the code from https://github.com/j4k0xb/webcrack/issues/10#issuecomment-2692599211 again, I think there is another case where JSX-like things may not be currently getting decompiled properly, which is syntax like this: > > > > ```js > > /* ..snip.. */ > > /* 541 */ var Z = h("div")` > > /* 542 */ display: flex; > > /* 543 */ justify-content: center; > > /* 544 */ margin: 4px 10px; > > /* 545 */ color: inherit; > > /* 546 */ flex: 1 1 auto; > > /* 547 */ white-space: pre-line; > > /* 548 */ `; > > /* ..snip.. */ > > /* 567 */ let c = t.createElement(Z, { > > /* 568 */ ...e.ariaProps > > /* 569 */ }, g(e.message, e)); > > /* ..snip.. */ > > ``` > > > > Looking higher up in the file, we see the definition for `h`: > > > > ```js > > /* ..snip.. */ > > /* 106 */ function h(e, t) { > > /* 107 */ let l = this || {}; > > /* 108 */ return function () { > > /* 109 */ let i = arguments; > > /* 110 */ function n(a, o) { > > /* 111 */ let c = Object.assign({}, a); > > /* 112 */ let s = c.className || n.className; > > /* 113 */ l.p = Object.assign({ > > /* 114 */ theme: p && p() > > /* 115 */ }, c); > > /* 116 */ l.o = / *go\d+/.test(s); > > /* 117 */ c.className = m.apply(l, i) + (s ? " " + s : ""); > > /* 118 */ if (t) { > > /* 119 */ c.ref = o; > > /* 120 */ } > > /* 121 */ let r = e; > > /* 122 */ if (e[0]) { > > /* 123 */ r = c.as || e; > > /* 124 */ delete c.as; > > /* 125 */ } > > /* 126 */ if (w && r[0]) { > > /* 127 */ w(c); > > /* 128 */ } > > /* 129 */ return y(r, c); > > /* 131 */ } > > /* 132 */ if (t) { > > /* 133 */ return t(n); > > /* 134 */ } else { > > /* 135 */ return n; > > /* 136 */ } > > /* 137 */ }; > > /* 138 */ } > > /* ..snip.. 
*/ > > ``` > > > > And searching GitHub code for `/ *go\d+/.test` leads us to the > > - https://github.com/search?type=code&q=%22%2F+*go%5Cd%2B%2F.test%22 > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L20-L71 > > - https://github.com/cristianbote/goober > > - > goober, a less than 1KB css-in-js solution > > - https://goober.rocks/ > > > > Which we can then also see additional confirmation for in earlier code as well: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/core/get-sheet.js#L11-L25 > > > > ```js > > /* ..snip.. */ > > /* 6 */ let i = e => typeof window == "object" ? ((e ? e.querySelector("#_goober") : window._goober) || Object.assign((e || document.head).appendChild(document.createElement("style")), { > > /* 7 */ innerHTML: " ", > > /* 8 */ id: "_goober" > > /* 9 */ })).firstChild : e || l; > > /* ..snip.. */ > > ``` > > > > Which seems to be used across a number of libs/projects: > > > > - https://github.com/search?type=code&q=%22%23_goober%22+OR+%22window._goober%22 > > > > Sometimes inlined directly: > > > > - https://github.com/KevinVandy/tanstack-query/blob/69476f0ce5778afad4520ed42485b4110993afed/packages/query-devtools/src/utils.tsx#L305-L323 > > > > This may end up being another case where, similar to the comment made in https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692345330, the deeper specifics of this may belong in a [separate plugin](https://github.com/j4k0xb/webcrack/issues/143#issuecomment-2692517232) instead of `webcrack` core; but it makes me wonder if there is some kind of generic way we can identify a pattern of these sort of React component generator libraries so that the JSX decompilation can work effectively with them? > > > > Similar'ish prior art from `wakaru`: > > > > - https://github.com/pionxzh/wakaru/issues/40 > > - https://github.com/pionxzh/wakaru/issues/40#issuecomment-1809704264 > > - https://github.com/pionxzh/wakaru/issues/40#issuecomment-1809962543 > > > > Looking back at the main format of the `styled` function (which was `Z` in the above code): > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L15-L20 > > - > `styled(tag, forwardRef) > > > > This returns an inner wrapper function, which seems to use tagged template literal syntax to provide the CSS, and then it reads that from the `arguments` into `_args`: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L23-L24 > > - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates > > > > It then uses the `_args` to create the CSS class name: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L41-L43 > > > > And then processes the `tag` (eg. 
`"div"`) passed to the original function: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L50-L59 > > > > Eventually 'rendering' that through the 'pragma' `h`: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L66 > > > > Which was assigned during `setup` earlier: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L4-L13 > > > > Tracing through the code in our bundle to find that 'pragma' function binding, we find `t.createElement` ends up being assigned to `h` (or `y` as it's called in our minified code): > > > > ```js > > /* ..snip.. */ > > /* 582 */ (function (e, t, l, i) { > > /* 583 */ c.p = undefined; > > /* 584 */ y = e; > > /* 585 */ p = undefined; > > /* 586 */ w = undefined; > > /* 587 */ })(t.createElement); > > /* ..snip.. */ > > ``` > > > > And of course, we know that `t` relates to our React global: > > > > ```js > > /* ..snip.. */ > > /* 2 */ var t = window.React; > > /* ..snip.. */ > > ``` > > > > This obviously ends up going through a few extra steps of more library specific indirection that probably doesn't make sense to be in `webcrack` core.. but I wonder if we're able to trace/follow the React global / `createElement` 'pragma' / `h` through so that JSX decompilation can work correctly? > > > > In the case of this library it also inserts the additional wrapping component [`Styled`](https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L26-L67) in the middle.. but I think if the `createElement` 'pragma' flowed through properly.. that might end up being properly figured out as nested JSX anyway; as the `Styled` just ends up wrapping our provided `tag` component: > > > > - https://github.com/cristianbote/goober/blob/5f0b43976fac214262c2c8921b1691fc4729ec98/src/styled.js#L69 > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-2693645060_ > > ## See Also > > - https://github.com/j4k0xb/webcrack/issues/151 > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/152#issue-2890987070_ #### Issue 154: "stable" identifier demangling > When diffing deobfuscating minified code (not obfuscated) often most changes are simple identifier renames. > So my feature request is "stable" identifier demangling, kinda like the current "All Names" option but stabler. > > Currently, it just counts up: > > ```js > // input > var a = 100, b = 500, c = 1000; > // output > var v = 100; > var v2 = 500; > var v3 = 1000; > ``` > > Now if we add a new variable at the top all variables are changed which causes a huge diff: > > ```js > // input > var a = 1, b = 100, c = 500, d = 1000; > // output > var v = 1; > var v2 = 100; > var v3 = 500; > var v4 = 1000; > ``` > > So instead my suggestion is somehow making the chosen name stable. > An idea I had was hashing various attributes of a variable like: > > - the initialization value > - count usages > - general location (which function it's in) > > With the example from above: > > ```js > // input > var a = 100, b = 500, c = 1000; > // output > var v100_0_g = 100; > var v500_0_g = 500; > var v1000_0_g = 1000; > > // input > var a = 1, b = 100, c = 500, d = 1000; > // output > var v1_0_g = 1; // only changed line! 
> var v100_0_g = 100; > var v500_0_g = 500; > var v1000_0_g = 1000; > ``` > > Where the format is `v${initialValue}_${usages}_${scope}` (scope = "g"lobal). Of course this is a very naive example, real world would probably involve a hash. > > _Originally posted by @Le0Developer in https://github.com/j4k0xb/webcrack/issues/154#issue-2895194646_ > I'm currently testing a possible implementation here on my branch: [Le0Developer/webcrack@`feat`/stable-02](https://github.com/Le0Developer/webcrack/tree/feat/stable-02?rgh-link-date=2025-03-08T13%3A12%3A30.000Z) > > Output is huge but a LOT more stable. Tested on [Cloudflare-Mining/Cloudflare-Datamining@`4f4e67f`](https://github.com/Cloudflare-Mining/Cloudflare-Datamining/commit/4f4e67fb6e0c91d800ca81b5086fa55428ab5310) and it reduced the (further decompiled diff using webcrack) from over 2000 lines to just under 400 lines with only the actual changes (and https://github.com/j4k0xb/webcrack/issues/156). > > _Originally posted by @Le0Developer in https://github.com/j4k0xb/webcrack/issues/154#issuecomment-2708276924_ ### On `pionxzh/wakaru` #### Issue 34: support `un-mangle` identifiers > I just finished up writing some thoughts/references for variable renaming on the `webcrack` repo, that could also be a useful idea for here. (see quotes below) > > When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first. > > That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them. > > It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code. > > My personal ideal end goal for a feature like that would then allow me to use it within an IDE-like environment, where I can rename variables 'as I explore', knowing that the mappings/etc will be kept up to date. > > --- > > > When I was exploring this concept in my own deobfuscation PoC project, I was exploring to make the variable names unique + have them add sort of semantic information about their source/scope. > > > > Eg. if it was an arg to a function, it might be `arg_1`. Or potentially if the function is `foo`, it might end up as `foo_arg_1` > > > > It looks like most of the PoC code I was playing with was local/in a pretty messy/hacky state, but I did find a link in it to an online REPL I was playing around with some of it in. 
Not sure how outdated that code is, but it might be useful: > > > > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples > > > > There were a number of different AST parsers I was playing around with, but I think that this babel code may have been the latest (not sure which one): > > > > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3.js > > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3_clean.js > > > > Within those files, I believe the functions `getNameFromPath`, `getPrefix` (and older commented out functions `getTypePrefix`, `getPrefix` > > > > --- > > > > **Edit:** Came across this in another issue here: > > > > > I published my decompiler that I used in the above example. I think it might be a good reference for adding this feature. > > > https://github.com/e9x/krunker-decompiler > > > > > > _Originally posted by @e9x in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-1546764062_ > > > > And looking at it's `libRenameVars` code seems to be taking a vaguely similar approach to how I was looking at doing things in my original PoC that I described above: > > > > - https://github.com/e9x/krunker-decompiler/blob/master/src/libRenameVars.ts > > - `getVarPrefix` will set a prefix based on the type (eg. `func`, `arg`, `Class`, `imported`, `var`) > > - `getName` generates a new variable name that does not conflict with existing names or reserved keywords > > - `generateName` generates a new name for a variable considering its scope, type, and the context in which it is used (e.g., whether it's a class, a function variable, etc.). > > It employs various AST manipulations to ensure the generated name is appropriate and does not conflict with existing names. > > > > A more generalised summary/overview (via ChatGPT): > > > > > Certainly, the code implements a sophisticated algorithm for renaming variables in a JavaScript program, adhering to several high-level rules and strategies: > > > > > > 1. **Type-Specific Prefixing**: > > > - The `getVarPrefix` function assigns specific prefixes to variable names based on their type (e.g., "func" for function names, "arg" for parameters). This approach helps in identifying the role of a variable just by its name. > > > > > > 2. **Avoiding Reserved Keywords**: > > > - The script includes a comprehensive list of reserved JavaScript keywords. If a variable's name matches a reserved keyword, it is prefixed with an underscore to prevent syntax errors. > > > > > > 3. **Unique Naming with Context Consideration**: > > > - The `generateName` function ensures that each variable gets a unique name that doesn't conflict with other variables in its scope. It also considers the context in which a variable is used. For example, if a variable is part of a class, it may receive a name that reflects this context, using `pascalCase` or `camelCase` as appropriate. > > > > > > 4. **Handling Special Cases**: > > > - The script contains logic to handle special cases, such as variables that are function expressions (`isFuncVar`) or class instances (`isClass`). This affects the naming convention applied to these variables. > > > > > > 5. **Randomness with Mersenne Twister**: > > > - A Mersenne Twister is used to generate random elements for variable names, ensuring that the names are not only unique within the scope of the program but also less predictable. > > > > > > 6. 
**AST-Based Renaming**: > > > - The script analyzes the Abstract Syntax Tree (AST) of the program to understand the structure and scope of variables. This analysis guides the renaming process, ensuring that the new names are consistent with the variable's usage and position in the code. > > > > > > 7. **Scope Analysis with ESLint Scope**: > > > - By leveraging `eslint-scope`, the script can accurately determine the scope of each variable. This is crucial in avoiding name collisions and ensuring that the renaming respects lexical scoping rules in JavaScript. > > > > > > 8. **Consideration for Exported and Assigned Variables**: > > > - The script pays special attention to variables that are exported or assigned in specific ways (e.g., through `Object.defineProperty`). It ensures that these variables receive names that are appropriate for their roles. > > > > > > In summary, the script uses a combination of type-based naming conventions, context consideration, randomness, AST analysis, and scope analysis to systematically rename variables in a JavaScript program. This approach aims to enhance readability, avoid conflicts, and maintain the logical structure of the program. > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807290502_ > > --- > > > And for an even cooler/more extreme version of improving variable naming; I just came across this blog post / project from @jehna that makes use of `webcrack` + ChatGPT for variable renaming: > > > > - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification.html > > - > Using LLMs to reverse JavaScript variable name minification > > > This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at [Github project Humanify](https://github.com/jehna/humanify). > > - https://github.com/jehna/humanify > > - > Un-minify Javascript code using ChatGPT > > - > This tool uses large language modeles (like ChatGPT & llama2) and other tools to un-minify Javascript code. Note that LLMs don't perform any structural changes – they only provide hints to rename variables and functions. The heavy lifting is done by Babel on AST level to ensure code stays 1-1 equivalent. > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807342986_ > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_ > > For now, we have `smart-rename` that can guess the variable name based on the context. I would like to expand it to cover some other generic cases. > > Linking to my `smart-rename` related issues to keep the contextual link here: > > - https://github.com/pionxzh/wakaru/issues/40 > - https://github.com/pionxzh/wakaru/issues/48 > - https://github.com/pionxzh/wakaru/issues/49 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1818578969_ --- > > Another link from my reference notes that I forgot to include earlier; my thoughts on how to rename otherwise unknown variables are based on similar concepts that are used in reverse engineering tools such as IDA: > > > > - https://hex-rays.com/blog/igors-tip-of-the-week-34-dummy-names/ > > - > In IDA’s disassembly, you may have often observed names that may look strange and cryptic on first sight: `sub_73906D75`, `loc_40721B`, `off_40A27C` and more. 
In IDA’s terminology, they’re called dummy names. They are used when a name is required by the assembly syntax but there is nothing suitable available > > - https://www.hex-rays.com/products/ida/support/idadoc/609.shtml > > - > IDA Help: Names Representation > > - > Dummy names are automatically generated by IDA. They are used to denote subroutines, program locations and data. Dummy names have various prefixes depending on the item type and value > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1822262649_ > > --- > > And a few more I was looking at recently as well (that is sort of basically `smart-rename`: > > - https://binary.ninja/2023/09/15/3.5-expanded-universe.html#automatic-variable-naming > - > Automatic Variable Naming > > One easy way to improve decompilation output is to come up with [better default names](https://github.com/Vector35/binaryninja-api/issues/2558) for variables. There’s a lot of possible defaults you could choose and a number of different strategies are seen throughout different reverse engineering tools. Prior to 3.5, Binary Ninja left variables named based on their origin. Stack variables were var_OFFSET, register-based variables were reg_COUNTER, and global data variables were (data_). While this scheme isn’t changing, we’re being much more intelligent about situations where additional information is available. > > > > For example, if a variable is passed to a function and a variable name is available, we can now make a much better guess for the variable name. This is most obvious in binaries with type libraries. > - > This isn’t the only style of default names. Binary Ninja also will name loop counters with simpler names like `i`, or `j`, `k`, etc (in the case of nested loops) > - https://github.com/Vector35/binaryninja-api/issues/2558 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1822263687_ --- > Was looking closer at the `sourcemap` spec today, and the `names` field jumped out at me as potentially useful: > > - https://tc39.es/source-map-spec/#names > - > `names`: a list of symbol names used by the `mappings` entry > - https://tc39.es/source-map-spec/#mappings > - > `mappings`: a string with the encoded mapping data (see _4.1 Mappings Structure_) > - https://tc39.es/source-map-spec/#mappings-structure > - > The `mappings` data is broken down as follows: > > > > - each group representing a line in the generated file is separated by a semicolon (`;`) > > - each segment is separated by a comma (`,`) > > - each segment is made up of `1`, `4`, or `5` variable length fields. > - It then goes on to describe the segment's in greater detail, but the specific part I was thinking could be relevant here would be this: > - > If present, the zero-based index into the `names` list associated with this segment. This field is a base 64 VLQ relative to the previous occurrence of this field unless this is the first occurrence of this field, in which case the whole value is represented. > > Obviously if there is a full `sourcemap` for the webapp, then `wakaru` isn't really needed anyway.. but what I was thinking of here is that in combination with module detection (see - https://github.com/pionxzh/wakaru/issues/41), if there are `sourcemaps`s available for that original module, then we could potentially extract the original function/variable/etc names from the `names` field of the `sourcemap`, and use them in a sort of 'smart-rename with sourcemap' type way. 
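As a rough illustration of that 'smart-rename with sourcemap' idea, the sketch below (assuming the [mozilla/source-map](https://github.com/mozilla/source-map) package, with a purely hypothetical sourcemap file name) pulls out every mapping segment that carries an entry from the `names` list; those original identifiers, keyed by generated position, are roughly the raw material such a rename pass would need:

```js
// Sketch: extract original identifier names (and the generated positions they map to)
// from a library's published sourcemap, via the `names` field / mapping segments.
// Assumes the mozilla `source-map` package; the file name below is purely illustrative.
const fs = require('fs');
const { SourceMapConsumer } = require('source-map');

async function extractNameMappings(sourceMapPath) {
  const rawMap = JSON.parse(fs.readFileSync(sourceMapPath, 'utf8'));
  return SourceMapConsumer.with(rawMap, null, (consumer) => {
    const nameMappings = [];
    consumer.eachMapping((m) => {
      // Only keep segments that actually reference an entry in the `names` list
      if (m.name) {
        nameMappings.push({
          originalName: m.name,
          source: m.source,
          generated: { line: m.generatedLine, column: m.generatedColumn },
        });
      }
    });
    return nameMappings;
  });
}

// Hypothetical usage:
// extractNameMappings('./react-dom.production.min.js.map').then(console.log);
```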
> > --- > > Another `sourcemap` related idea I had (which probably deserves its own issue) is that it would be cool to be able to 'retroactively generate a `sourcemap`' for a webapp, based on the unminified output from `wakaru`; such that we could then take that `sourcemap`, and apply it to the original minified web app source for debugging the live app. > > **Edit:** Created a new issue to track this: > > - https://github.com/pionxzh/wakaru/issues/66 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1837858619_ --- > > It isn't very meaningful to support such a feature when you can access all the source code. > > @pionxzh I was specifically talking about it in terms of bundled modules (eg. React, etc), and not the unique web app code of the app itself. > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1839791102_ --- > > You mean like, for popular open-source projects, we can put some sourcemap in our project / read from the chunk, and then reverse map the minified variable and function name back to normal? > > @pionxzh Similar to that, but probably not "put the sourcemap in our project" directly; but more process the sourcemaps from popular open-source projects and extract those details to an 'intermediary form'. That 'intermediary form' would be similar to the 'module map' file, as I described earlier in this thread: > > > When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first. > > > > That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them. > > > > It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code. > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_ > > --- > > > A configuration table/profile can be provided to allow users to manually write correspondences. wakaru can simply include the rules of the better known packages. > > @StringKe _nods_, sounds like we are thinking about similar things here :) > > --- > > > Can you specify the content that you would expect to have? and the corresponding behavior > > @pionxzh For me personally, I haven't deeply thought through all the use cases in depth, but at a high level I basically want to be able to take a web app that is going to be re-built multiple times, and be able to have a 'config file' similar to the 'module mapping' that wakaru has/had; but that also allows me to specify the variable/function names ('symbols') that are used within it. > > The slightly more challenging part is that because the app will be re-built multiple times, the minified variables will change (sometimes every build), so we can't easily use those as the 'key' of the mapping. One idea I had for solving that is potentially by first renaming all of the variables based on a 'stable naming pattern' (eg.
`func_*`, `arg_*`, `const_*`, etc; and then could just use a counter/similar based on the 'scope' it's being defined in) that would be generated based on the scope/type of the 'symbol', and would therefore be resilient to the minified variable names changing each build. Those 'stable intermediary names' could then potentially be used for the keys in the variable mapping. > > Though then we also need to figure out what level of 'granularity' makes sense to generate those 'stable intermediary names' at; as having a 1:1 mapping of those 'stable name scopes' to JS scopes could potentially end up being really noisy in the mapping file. So maybe using a 'higher abstracted scope' would make more sense (eg. at the module level or similar) > > My original hacky implementation of this in my own PoC code was using JS objects/JSON to map an explicit minified variable name to it's 'proper' name; but that broke because the minified names changed between builds. Even by implementing the 'stable naming pattern', if those 'stable names' included a 'counter' in them (eg. `func_1`, `const_8`, etc) we still probably wouldn't want to use those stable names directly as the key of an object, as if a new variable was added 'in between' in a later build, that would flow on to 'shifting' the 'counter' for every variable of a matching type afterwards, which would be a lot of effort to manually update in a mapping file. While I haven't thought too deeply about it, I think that by using an array in the mapping file, it should simplify things so that we only need to make a small change to 'fix the mappings' when a new variable is added that 'shifts' everything. > > Even by using the array concept in the mappings file, there is still some manual pain/effort involved in trying to keep the mapping 'up to date' in newer builds. That's what lead me into some of the deeper/more esoteric ideas/thinking around 'fingerprinting' that I expand on below. > > -- > > Another area I started looking into (but haven't deeply explored yet) for both figuring out how to map variable names to sections of code in a 'smart' way, and potentially also for module identification (see #41); is in the space of 'structural AST fingerprinting' or 'code similarity' algorithms and similar. (I realise that this is a rather deep/esoteric angle to be looking at this from, and that there are likely going to be far simpler/easier ways to implement the variable mapping/module identification in a 'good enough' way without going to this level of depth; but I'm curious to explore it regardless, to see if any good ideas come out of it) > > I haven't gotten too far in my reading yet (got distracted on other things), but the high level of my idea was that maybe we could generate an 'AST fingerprint' that isn't impacted by the variable/function/etc names ('symbols') changing during minification; and then use that as the basis for the 'key' in the 'mappings file'; as that fingerprint could theoretically still identify a 'scope' (which might be a literal JS scope, or might be a higher level abstraction that we decide makes sense; the most abstract being probably at the bundled module level) even if the bundler decides to move some functions around to a different module/etc. Then obviously if we were able to generate those 'resilient fingerprints' to identify code even when it's been minified, that would make perfect sense to apply to module detection/etc (see #41) as well. 
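As a very naive sketch of that 'AST fingerprint' idea (assuming `@babel/parser`; the normalisation here is deliberately simplistic and only illustrative), one option is to hash just the *shape* of the tree, skipping identifier names and literal values entirely, so that rename-only minification changes don't affect the resulting key:

```js
// Sketch: a (very) naive 'structural AST fingerprint' that ignores identifier names
// and literal values, so the same module minified with different symbol names should
// hash to the same key. Assumes @babel/parser; the normalisation rules are illustrative only.
const crypto = require('crypto');
const { parse } = require('@babel/parser');

function structuralTokens(node, out = []) {
  if (!node || typeof node.type !== 'string') return out;
  // Record only the node type; names/values are skipped so renaming doesn't change the shape
  out.push(node.type);
  for (const key of Object.keys(node)) {
    if (key === 'loc' || key === 'start' || key === 'end') continue;
    const value = node[key];
    if (Array.isArray(value)) {
      value.forEach((child) => structuralTokens(child, out));
    } else if (value && typeof value.type === 'string') {
      structuralTokens(value, out);
    }
  }
  return out;
}

function astFingerprint(moduleSource) {
  const ast = parse(moduleSource, { sourceType: 'unambiguous' });
  const shape = structuralTokens(ast.program).join('|');
  return crypto.createHash('sha256').update(shape).digest('hex');
}
```

Two rename-only-different builds of the same module should (ideally) collide on that hash, which is what would let it act as the 'key' in a mappings file; the hard part is deciding which nodes to canonicalise or ignore (bundler wrappers, injected helpers, reordered properties, etc.), which is where the link dump that follows comes in.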
> > Some of the high level ideas / search terms that I was using to start my research in that area was things like: > > - AST fingerprinting > - Source code similarity fingerprinting > - Control flow graphs > - Call flow graphs > - Program dependence graph > - etc > > Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc: > >
> Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity > > - https://openreview.net/forum?id=BJxWx0NYPr > - > On the Effectiveness of Program Analysis in AI and Vice Versa > - https://dl.acm.org/doi/10.1145/3486860 > - > A Unified Framework for Code Representation Learning > - https://inria.hal.science/hal-01648996/document > - > A Survey of Program Slicing Techniques > - https://hal.science/hal-00627811/document > - > From Program Slicing to Abstract Interpretation > - https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf > - > Syntax Tree Fingerprinting for Source Code Similarity Detection > - https://ieeexplore.ieee.org/document/9960266 > - > Exploring the Landscape of Deep Learning for Source Code Analysis > - https://en.wikipedia.org/wiki/Program_dependence_graph > - > Program Dependence Graph - Wikipedia > - https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses > - > Enhancing Software Reliability Using Program Analysis Techniques > - https://dl.acm.org/doi/pdf/10.1145/1286821.1286826 > - > Static Analysis of Executables to Detect Malicious Patterns > - https://patents.google.com/patent/US9459861B1/en > - > US Patent US9459861B1: Techniques for Code Analysis and Optimization > - https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf > - > Software Engineering Research and Practice: NSF Workshop Report > - https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W > - > Advances in Automated Program Repair: IEEE Transactions > - https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint > - > Stack Overflow: Assembly-Level Function Fingerprinting > - https://en.wikipedia.org/wiki/Control-flow_graph > - > Control-Flow Graph - Wikipedia > - https://www.kapravelos.com/publications/fptechniques-www23.pdf > - > Novel Techniques for Malware and Vulnerability Analysis in Web Applications > - https://pubmed.ncbi.nlm.nih.gov/12197666/ > - > PubMed: Program Analysis Techniques for Software Debugging > - https://stackoverflow.com/questions/15087195/data-flow-graph-construction > - > Stack Overflow: Data Flow Graph Construction in Programming > - https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf > - > Basic Program Analysis Techniques for Secure Software Development > - https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/ > - > CodeQL Guide: Understanding Data Flow Analysis for Secure Software Development > - https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree > - > Code Review Stack Exchange: Call Flow Graph from Python AST > - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html > - > LLVM Clang Documentation: Introduction to Data Flow Analysis > - https://www.researchgate.net/figure/AST-based-program-graph-with-additional-control-and-data-flow-edges_fig2_370980383 > - > ResearchGate: AST-Based Program Graphs with Control and Data Flow Edges > >
> > -- > > Another idea I've had, but only lightly explored so far, is looking into how various projects like Terser, Webpack, etc choose their minified variable names in general; but also how they handle 'stable minified variables' between builds (which is something that I know at least Webpack has some concept of). My thought there is that by understanding how they implement their own 'stable minified variables between builds', that we might be able to leverage to either a) do similar, or b) be able to reverse engineer that in a way that might be able to be 'retroactively applied' on top of an existing minified project that didn't use 'stable minified variables', to 'stabilise' them. > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057_ #### Issue 41: Module detection > > that can help us transform the code and give the extracted module a better name other than `module-xxxx.js` > > This could then also tie in well with some of the ideas for 'unmangling identifiers' that I laid out here: > > - https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509 > > Theoretically if we can identify a common open source module, we could also have pre-processed that module to extract variable/function names, that we could then potentially apply back to the identified module. > > I kind of think of this like '[debug symbols](https://en.wikipedia.org/wiki/Debug_symbol)' used in compiled binaries. > > Though technically, if you know the module and can get the original source; and you know the webpacked version of that code; you could also generate a [sourcemap](https://github.com/mozilla/source-map) that lets the user map between the 2 versions of the code. > > --- > > When I was manually attempting to reverse and identify the modules in #40, a couple of techniques I found useful: > > - searching for `Symbol()`s > - searching for React `.displayName` and similar > - searching for other arrays of static strings/similar > - once interesting candidates had been found, searching for them on GitHub code search to try and identify the library/narrow things down > > --- > > **Edit:** This might not be useful right now, but just added a new section to one of my gists with some higher level notes/thoughts on fingerprinting modules; that I might expand either directly, or based on how this issue pans out: > > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#file-fingerprinting-minified-javascript-libraries-md > > While it might be more effort than it's worth, it may also be possible to extract the patterns that wappalyzer was using to identify various libraries; which I made some basic notes on in this revision to the above gist: > > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581/revisions#diff-3a443fd1fa55e0ff7e0206791ae1fb8bbe16f7f1ae5505998cbafb38318f8e69R381-R441 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1810097408_ --- > With regards to module detection/similar for React, these might be interesting/useful: > > - https://github.com/facebook/react/issues/20186 > - https://github.com/facebook/react/pull/26446 > - https://github.com/facebook/react/issues/27774 > - https://github.com/markerikson/react-prod-sourcemaps > - > A tool to update app sourcemaps with the original code of ReactDOM's production builds > - > This package includes: > > > > - the actual sourcemaps > > - logic to search an input sourcemap for specific ReactDOM prod artifacts by 
content hash and replace them with the "original" pre-minified bundle source via the sourcemaps > > - a CLI tool that will load a given input sourcemap file and rewrite it > > - a build tool plugin that will automatically replace `react-dom` sourcemaps > - https://github.com/facebook/react/issues/27515#issuecomment-1824920920 > - > I don't know how W3Techs counts but the HTTP Archive Almanac 2022 [uses Wappalyzer v6.10.26](https://almanac.httparchive.org/en/2022/javascript#library-usage), whose [React version detection logic seems to look](https://github.com/tunetheweb/wappalyzer/blob/d04e1c5d4ea5e1dc8368e47215dd7bb80718a12f/src/technologies/r.json#L395-L421) for (a) a global React.version property or (b) a version number in a script filename that clearly indicates React. Both of these are very uncommon ways to deploy React these days. Even for detecting React as a whole, it uses data attributes no longer used by React or a _reactRootContainer property that is not added when using modern React APIs such as React 18 createRoot (and only looks for that property on divs that are direct children of `<body>`). > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1836835324_ --- > This specific implementation is more related to detecting and injecting into webpack modules at runtime, but it might have some useful ideas/concepts that are applicable at the AST level too: > > - https://github.com/GooseMod/GooseMod/blob/master/src/util/discord/webpackModules.js > > ```js > // ..snip.. > > export const common = { // Common modules > React: findByProps('createElement'), > ReactDOM: findByProps('render', 'hydrate'), > > Flux: findByProps('Store', 'connectStores'), > FluxDispatcher: findByProps('register', 'wait'), > > i18n: findByProps('Messages', '_requestedLocale'), > > channels: findByProps('getChannelId', 'getVoiceChannelId'), > constants: findByProps('API_HOST') > }; > ``` > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1890296652_ #### Issue 73: add a 'module graph' > > Introducing module graph: Like Webpack and other bundlers, a module graph can help us unminify/rename identifiers and exports from bottom to top. > > @pionxzh This sounds like an awesome idea! > > --- > > > Based on 1, the steps gonna be like `[unpacked] -> [???] -> [unminify]`. This new step will build the module graph, do module scanning, rename the file smartly, and provide this information to `unminify`. > > @pionxzh I've only thought about this a little bit, and it depends on how 'all encompassing' you want the module graph to be, but I think it might even make sense for it (or some other metadata/graph) to capture the mapping from original files -> unmapped as well.
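To make that 'metadata/graph' idea a little more concrete, here is a purely hypothetical sketch of what a single build's entry might look like (every field name is invented for illustration; this is not an existing wakaru or other tool's format):

```js
// Purely hypothetical sketch of a per-build 'metadata/graph' entry capturing the
// original file -> unpacked/renamed module mapping discussed above. Every field name
// here is invented for illustration; this is not an existing wakaru (or other) format.
const buildGraphEntry = {
  buildHash: 'a1-b1-c1-ha-sh',
  chunkFiles: [
    {
      originalFilename: 'filefoo-abc123.js', // as served by the site
      unpackedFilename: 'filefoo.js',        // after stripping the content hash
      chunkIds: [1337],
    },
  ],
  chunks: [
    { chunkId: 1337, moduleIds: [1, 3, 7, 24] },
  ],
  modules: [
    {
      moduleId: 3,
      chunkId: 1337,
      renamedTo: 'aUsefulName.js', // name assigned during unpack/unminify
      importedModuleIds: [1],
      exports: ['default'],
    },
  ],
};
```

The workflow background and the more detailed 'first pass' list of types that follow cover the same ground in prose.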
> > -- > > For some background context (to help understand some of the things I describe for the graph later on below), the workflow I've been thinking about/following for my own needs would probably be as follows: > > - My original workflow: > - identify when a new build has been published + the manifest/chunk/etc URLs from that ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/scripts/add-new-build.sh#L34-L92)) > - download all of the raw script files from the website and save them 'as is' in `raw/` ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/tree/main/orig)) > - do a 'first stage' 'light unpack' of the relevant manifest/chunks/etc for this build from `raw/` by stripping the hashes from the filenames/etc, run prettier on them, and save in `unpacked-stage1`; I also manually figure out if any chunks have changed their identifier, and remove any chunks from the old build that no longer exist in the new build (Ref: [1](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/scripts/add-new-build.sh#L94-L110), [2](https://github.com/0xdevalias/chatgpt-source-watch/tree/main/unpacked)) > - Additional steps now that I have `wakaru`: > - do a 'wakaru unpack' of all of the relevant manifest/chunks/etc in `unpacked-stage1/`, and save them into `unpacked-stage2/` > - do a 'wakaru unminify' of all the modules in `unpacked-stage2/`, and save them in `unminified` > > While that workflow might be overkill for a lot of people, I like that it allows me to keep the outputs of each of the 'intermediary steps' available, and can cross reference between them if/as needed. I might find that as I start to use this more, that I don't find it useful to keep some of those intermediate steps; but at least for now, that is my workflow. > > -- > > Now with that background context, going back to my thoughts about the graph/etc; I think it would be useful to be able to have a graph/similar that shows: > > - `a1-b1-c1-ha-sh/_buildManifest.js` contains chunk files `["filefoo-abc123.js", "etc.js"]` ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/%5BbuildHash%5D/_buildManifest.js)) > - `a1-b1-c1-ha-sh/_ssgManifest.js` contains chunk files `["ssgbar-abc123.js", "ssg-etc.js"]` ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/%5BbuildHash%5D/_ssgManifest.js)) > - `webpack-a2b2c2hash.js` contains chunk files `["aaaa-bbbb.js", "etc.js"]` ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/chunks/webpack.js)) > - `filefoo-abc123.js` contains chunk `[1337, ...]` > - chunk `1337` > - contains modules `[1, 3, 7, 24]` > - which were renamed to `["module1.js", "aUsefulName.js", "a/path/and/a/reallyUsefulName.js", "module24.js"]` > > And then the actual 'internal module mapping' stuff of what imports/exports what, etc. > > I'm not sure exactly how to map the data, but I would probably start with identifying the main 'types' involved, and what makes sense to know/store about each of them. 
The following might not be complete, but it's what I came up with from a 'first pass': > > - a 'build' > - all of the original file names > - (some of the below may make sense to be nested under this, not sure) > - build manifest ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/%5BbuildHash%5D/_buildManifest.js)) > - original filename > - build hash > - renamed to filename > - chunks (and I think the URL paths that map to them; at least for those related to pages (possibly a next.js thing) ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/%5BbuildHash%5D/_buildManifest.js#L4-L91))) > - ssg manifest ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/blob/main/unpacked/_next/static/%5BbuildHash%5D/_ssgManifest.js)) > - original filename > - build hash > - renamed to filename > - etc? (I haven't actually looked at one of these with real data in it yet) > - chunk files (of which the `webpack.js` chunk seems a bit special I think?) ([Ref](https://github.com/0xdevalias/chatgpt-source-watch/tree/main/unpacked/_next/static/chunks)) > - original filename > - chunk hash > - renamed to filename > - chunk IDs that were included in it > - chunks/modules > - original chunk filename/etc? > - (probably will be the same as the 'chunk files' section above; might be a better way to layout this data, but I thought it probably didn't make sense to nest it under the chunk files structure) > - chunkID in the bundle > - moduleIDs in the chunk > - modules > - chunkID that originally contained it > - moduleID from the bundle/chunk > - filename the module was renamed into > - imported moduleIDs > - exports > > This 'metadata file' / graph / etc could then potentially also include the stuff I've talked about before ([Ref](https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057)) for being able to 'guide' the variable/function/etc names used during unminification. > > -- > > I haven't thought deeply through the above yet; it might turn out that some of the things I described there might make sense being split into 2 different things; but I wanted to capture it all while it was in my head. > > --- > > > In the module graph, we can have a map for all exported names and top-level variables/functions, which also allows the user to guide the tool to improve the mapping. > > > Module graph also brings the possibility of cross-module renaming. For example, un-indirect-call shall detect some pattern and rename the minified export name back to the real name. > > @pionxzh 👌🏻🎉 > > --- > > > I like the idea of "AST fingerprinting". This can also be used in module scanning to replace the current regex implementation. > > @pionxzh Definitely. Though I (or you, or someone) need to dig into the concepts a bit more and figure out a practical way to implement it; as currently it's sort of a theory in my mind, but not sure how practical it will be in reality. 
> > Created a new issue for that exploration: > > - https://github.com/pionxzh/wakaru/issues/74 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/73#issuecomment-1853047271_ --- > I was wanting to visualize the dependencies between my unminified modules, and stumbled across this project: > > - https://github.com/pahen/madge > - > Create graphs from your CommonJS, AMD or ES6 module dependencies > - https://github.com/pahen/madge#cli > - https://github.com/pahen/madge#api > - https://github.com/pahen/madge#configuration > - https://github.com/pahen/madge#using-mixed-import-syntax-in-the-same-file > > It mentioned two of it's dependencies, which sound like they could potentially be useful here: > > - https://github.com/dependents/node-dependency-tree > - > Get the dependency tree of a module > - https://github.com/dependents/node-filing-cabinet > - > Get the file location associated with a dependency/partial's path > - > The object form is a mapping of the dependency tree to the filesystem – where every key is an absolute filepath and the value is another object/subtree. > > --- > > Off the top of my head, I think the 'high level' module-graph within `wakaru` would probably make the most sense to be linked based on the module ID's, rather than the actual `import`/`export`s / module filenames. That way it would be more robust/not need to change as things are renamed/moved around/etc. So these libraries may not be super useful 'as is' for this. > > --- > > Some useful commands for visualising module dependencies: > > - https://github.com/pahen/madge#cli > > ```bash > # Get the module dependencies as a static .svg image > madge --image graph.svg path/src/app.js > > # Get the module dependencies as a graphviz DOT file > madge --dot path/src/app.js > graph.gv > > # Get the module dependencies as json > madge --json path/src/app.js > dependencies.json > ``` > > The graphviz dot output can then be further explored through an interactive tool such as: > > - https://github.com/tintinweb/vscode-interactive-graphviz > - > Interactive Graphviz Dot Preview for Visual Studio Code > - https://marketplace.visualstudio.com/items?itemName=tintinweb.graphviz-interactive-preview > > If there are missing dependencies, these are worth noting for how to see/improve it: > > - https://github.com/pahen/madge#missing-dependencies > - https://github.com/pahen/madge#using-both-javascript-and-typescript-in-your-project > - https://github.com/pahen/madge#using-mixed-import-syntax-in-the-same-file > > --- > > In addition to the above, a couple of other 'dependency graph' viewers I came across when I was looking for tools for this today: > > - https://www.jetbrains.com/help/webstorm/module-dependency-diagram.html > - While this created a super in depth/detailed graph that is theoretically zoomable/etc, it also was basically unusably slow when run against a large chunk/module. 
> - https://marketplace.visualstudio.com/items?itemName=sz-p.dependencygraph > - > vscode-dependencyGraph > > A plugin for vscode to view your project's dependency graph > - I haven't tried this yet, but the screenshots look alright > - https://github.com/juanallo/vscode-dependency-cruiser > - I haven't tried this yet, looks more basic/less interactive than some of the other options > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/73#issuecomment-1865631024_ --- > Not 100% sure, but Webpack's `stats.json` file sounds like it might be relevant here (if not directly, then maybe as a source of inspiration): > > > > Even more tangentially related to this, I've pondered how much we could 're-construct' the files necessary to use tools like bundle analyzer, without having access to the original source (or if there would even be any benefit to trying to do so): > > > > > > - https://github.com/webpack-contrib/webpack-bundle-analyzer > > > - > Webpack plugin and CLI utility that represents bundle content as convenient interactive zoomable treemap > > > - https://github.com/webpack-contrib/webpack-bundle-analyzer#usage-as-a-cli-utility > > > - > You can analyze an existing bundle if you have a webpack stats JSON file. > > > > > > > > You can generate it using `BundleAnalyzerPlugin` with `generateStatsFile` option set to `true` or with this simple command: `webpack --profile --json > stats.json` > > > - https://webpack.js.org/api/stats/ > > > - > Stats Data > > > > When compiling source code with webpack, users can generate a JSON file containing statistics about modules. These statistics can be used to analyze an application's dependency graph as well as to optimize compilation speed. > > > - https://nextjs.org/docs/pages/building-your-application/optimizing/bundle-analyzer > > > - https://www.npmjs.com/package/@next/bundle-analyzer > > > > > > My gut feel is that we probably can figure out most of what we need for it; we probably just can't give accurate sizes for the original pre-minified code, etc; and the module names/etc might not be mappable to their originals unless we have module identification type features (see https://github.com/pionxzh/wakaru/issues/41) > > > > > > _Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/9#issuecomment-1974432157_ > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/121#issuecomment-1974433150_ > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/73#issuecomment-1974474289_ --- > The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well. > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/73#issuecomment-2035875282_ #### Issue 74: explore 'AST fingerprinting' for module/function identification (eg. to assist smart / stable renames, etc) > Have been spending some more time in binary reverse engineering land lately, and (re-)stumbled across this tool (Diaphora). While it's focus is on binary reverse engineering, some of the features it mentioned sounded like they would be interesting/useful to look deeper into for this 'AST Fingerprinting' sort of idea, eg. > > - Porting symbol names and comments > - Similarity ratio calculation > - Call graph matching calculation > - Dozens of heuristics based on graph theory, assembler, bytes, functions' features, etc... 
> - Pseudo-code based heuristics > > There might be some ideas/patterns/algorithms/similar that we could use from there for implementing AST fingerprinting on JS code. > > --- > > - http://diaphora.re/ > - > Diaphora > > A Free and Open Source Program Diffing Tool > - > Diaphora (διαφορά, Greek for 'difference') version 3.0 is the most advanced program diffing tool (working as an IDA plugin) available as of today (2023). It was released first during SyScan 2015 and has been actively maintained since this year: it has been ported to every single minor version of IDA since 6.8 to 8.3. > > > > Diaphora supports versions of IDA >= 7.4 because the code only runs in Python 3.X (Python 3.11 was the last version being tested). > - https://github.com/joxeankoret/diaphora > - > Diaphora, the most advanced Free and Open Source program diffing tool. > - > Diaphora has many of the most common program diffing (bindiffing) features you might expect, like: > > > > - Diffing assembler. > > - Diffing control flow graphs. > > - Porting symbol names and comments. > > - Adding manual matches. > > - Similarity ratio calculation. > > - Batch automation. > > - Call graph matching calculation. > > - Dozens of heuristics based on graph theory, assembler, bytes, functions' features, etc... > > > > However, Diaphora has also many features that are unique, not available in any other public tool. The following is a non extensive list of unique features: > > > > - Ability to port structs, enums, unions and typedefs. > > - Potentially fixed vulnerabilities detection for patch diffing sessions. > > - Support for compilation units (finding and diffing compilation units). > > - Microcode support. > > - Parallel diffing. > > - Pseudo-code based heuristics. > > - Pseudo-code patches generation. > > - Diffing pseudo-codes (with syntax highlighting!). > > - Scripting support (for both the exporting and diffing processes). > - https://github.com/FernandoDoming/r2diaphora > - > `r2diaphora` > > r2diaphora is a port of [Diaphora](https://github.com/joxeankoret/diaphora) to [radare2](https://github.com/radareorg/radare2) and MariaDB. It also uses [r2ghidra](https://github.com/radareorg/r2ghidra) as decompiler by default, with support for other decompilers such as `pdc`. > - > Port of the binary diffing library, diaphora, for radare2 and mariadb > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-1868216682_ --- > The Stack Graph / Scope Graph links/references I shared in https://github.com/pionxzh/wakaru/issues/34#issuecomment-2035859278 may be relevant to this issue as well. > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2035874076_ --- > Some more 'prior art' from the binary reverse engineering world: > > - https://hex-rays.com/products/ida/tech/flirt/in_depth/ > - > IDA F.L.I.R.T. Technology: In-Depth > - > One major stumbling block in the disassembly of programs written in modern high > level languages is the time required to isolate library functions. > - > To assist IDA users we attempted to create an algorithm to recognize the > standard library functions. > - > The idea > > To address those issues, we created a database of all the functions from all libraries we wanted to recognize. IDA now checks, at each byte of the program being disassembled, whether this byte can mark the start of a standard library function. > > > > The information required by the recognition algorithm is kept in a signature file. 
Each function is represented by a pattern. Patterns are first 32 bytes of a function where all variant bytes are marked. > > > > ..snip.. > - > Sequences of bytes are kept in the nodes of the tree. > - > When two functions have the same first 32 bytes, they are stored in the same leaf of the tree. To resolve that situation, we calculate the CRC16 of the bytes starting from position 33 until till the first variant byte. The CRC is stored in the signature file. The number of bytes used to calculate that CRC also needs to be saved, as it differs from function to function. > - etc > > While the exact specifics of that method won't be relevant here (since we're operating on JS, and not raw bytes); some of the more general concepts might be. > > Interestingly, that ends up being a more refined version of some binary offset finding code I wrote for another project: > > - https://github.com/0xdevalias/poc-re-binsearch > - > Proof of Concept (PoC) code/notes exploring reverse engineering techniques for macOS fat binaries, focusing on binary searching and automatic offset identification > - https://github.com/0xdevalias/poc-re-binsearch/blob/main/find_bytes.py > - https://github.com/0xdevalias/poc-re-binsearch/blob/main/find_fat_binary_offsets.py > - https://github.com/0xdevalias/poc-re-binsearch/blob/main/diff_hex.py > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2042100894_ --- > > I've been thinking about this topic and found this repo when searching for if someone had done it before and/or for a debundler to build it on top of. > > @anka-213 Curious (if you're open to/able to share), what your use case for this sort of thing would be? > > --- > > > The basic idea I was imagining was: > > > > 1. First perform some basic normalization of the code > > 2. Rename all local variables according to some deterministic scheme > > > This approach doesn't do any kind of fuzzy matching, but as long as the normalization works well enough and the output doesn't vary in too many ways that are difficult to normalize away depending on e.g. bundler config, it should be fairly reliable. > > @anka-213 This basically aligns with one of the ways of how I was thinking it would probably work at a high level as well; though I think the key/crux of it would be figuring out the normalisation (including stabilising or not including variable/function identifiers that churn) in a way that is resilient to all the 'optimisations' a bundler/minifier might choose to make. > > That may mean that it would need to run on 'partially unminified' code, though in the ideal case, it should be able to work with as little 'pre-processing' of the minified code as possible; as this module identification would be used as part of the unminification process (for certain aspects). > > --- > > > The approach is kind of similar to how the content addressed programming language [Unison](https://www.unison-lang.org/docs/the-big-idea/) does their hashing. > > @anka-213 Just had a read through that blog, and it sounds like a really interesting approach! > > --- > > > If we want to allow more fine-grained fingerprinting we could use some kind of De Bruijn index instead for the local variables, so local snippets would have the same variable names regardless of their context. This wouldn't produce valid JS code, but that doesn't matter since the result is only used for hashing, not for output. 
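Not a faithful De Bruijn encoding, but a quick sketch of the same 'normalise local identifiers purely for hashing' idea using Babel's scope tracking (assumes `@babel/parser`, `@babel/traverse` and `@babel/generator`; the `_v${n}` naming scheme is arbitrary, and the output is only meant to be hashed, not kept as code):

```js
// Sketch: normalise local identifiers to position-based placeholder names before hashing,
// loosely in the spirit of the De Bruijn index idea quoted above (a simplified stand-in,
// not a faithful De Bruijn encoding). Assumes @babel/parser, @babel/traverse, @babel/generator.
// The output is only intended as input to a hash, not as code to keep.
const { parse } = require('@babel/parser');
const traverse = require('@babel/traverse').default;
const generate = require('@babel/generator').default;

function normaliseIdentifiers(code) {
  const ast = parse(code, { sourceType: 'unambiguous' });
  const seenScopes = new Set();
  let counter = 0;

  traverse(ast, {
    Scopable(path) {
      // Several paths can share one Scope (eg. a function and its body block), so dedupe
      if (seenScopes.has(path.scope)) return;
      seenScopes.add(path.scope);
      for (const name of Object.keys(path.scope.bindings)) {
        path.scope.rename(name, `_v${counter++}`);
      }
    },
  });

  return generate(ast).code;
}

// Two copies of the same function that differ only in their minified names should now
// normalise to identical text, which can then be hashed / compared for fingerprinting.
```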
> > @anka-213 I only quickly skimmed the wiki pages for [De Bruijn index](https://en.wikipedia.org/wiki/De_Bruijn_index) / [De Bruijn notation](https://en.wikipedia.org/wiki/De_Bruijn_notation), so I might not be grasping it fully, but from what I saw, it seems like you could probably model it in a way that would fit the semantics to produce valid JS variable names/code still. > > --- > > Another method (that I can't remember if I've ever written out in full here) is somewhat based on the more manual approach I was taking at one point: > > > > that can help us transform the code and give the extracted module a better name other than `module-xxxx.js` > > > > This could then also tie in well with some of the ideas for 'unmangling identifiers' that I laid out here: > > > > - https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509 > > > > Theoretically if we can identify a common open source module, we could also have pre-processed that module to extract variable/function names, that we could then potentially apply back to the identified module. > > > > I kind of think of this like '[debug symbols](https://en.wikipedia.org/wiki/Debug_symbol)' used in compiled binaries. > > > > Though technically, if you know the module and can get the original source; and you know the webpacked version of that code; you could also generate a [sourcemap](https://github.com/mozilla/source-map) that lets the user map between the 2 versions of the code. > > > > --- > > > > When I was manually attempting to reverse and identify the modules in #40, a couple of techniques I found useful: > > > > - searching for `Symbol()`s > > - searching for React `.displayName` and similar > > - searching for other arrays of static strings/similar > > - once interesting candidates had been found, searching for them on GitHub code search to try and identify the library/narrow things down > > > > --- > > > > **Edit:** This might not be useful right now, but just added a new section to one of my gists with some higher level notes/thoughts on fingerprinting modules; that I might expand either directly, or based on how this issue pans out: > > > > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#file-fingerprinting-minified-javascript-libraries-md > > > > While it might be more effort than it's worth, it may also be possible to extract the patterns that wappalyzer was using to identify various libraries; which I made some basic notes on in this revision to the above gist: > > > > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581/revisions#diff-3a443fd1fa55e0ff7e0206791ae1fb8bbe16f7f1ae5505998cbafb38318f8e69R381-R441 > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1810097408_ > > Specifically, identifying the types of things that are usually not minified/mangled by a bundler/minifier (`Symbol()`s, React's `.displayName`, static strings, etc; and using those parts as the 'source' for the fingerprint/similar. In a way, I guess this could also be considered a type of normalisation. > > One benefit of this approach, is that those same 'key identifiers' can be used with [GitHub Code search](https://github.com/search?type=code) or similar tools to help narrow down and identify an otherwise unknown module/library. 
This could even probably be partially automated using the [GitHub API](https://docs.github.com/en/rest/search/search); and then provide an easy way for users to contribute the relevant details/hash/etc for an identified module back to the 'core database' (in a similar way to how `nmap` allows users to submit service fingerprints (Ref: [1](https://nmap.org/submit/), [2](https://nmap.org/book/osdetect-fingerprint-format.html)) > > Here is some further 'prior art' from a tool that seems to use this sort of method to target the functions it wants to interact with: > > > This specific implementation is more related to detecting and injecting into webpack modules at runtime, but it might have some useful ideas/concepts that are applicable at the AST level too: > > > > - https://github.com/GooseMod/GooseMod/blob/master/src/util/discord/webpackModules.js > > > > ```js > > // ..snip.. > > > > export const common = { // Common modules > > React: findByProps('createElement'), > > ReactDOM: findByProps('render', 'hydrate'), > > > > Flux: findByProps('Store', 'connectStores'), > > FluxDispatcher: findByProps('register', 'wait'), > > > > i18n: findByProps('Messages', '_requestedLocale'), > > > > channels: findByProps('getChannelId', 'getVoiceChannelId'), > > constants: findByProps('API_HOST') > > }; > > ``` > > > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/41#issuecomment-1890296652_ > > --- > > This is potentially more of a generalised/'naive' approach to the problem, but it would also be interesting to see if/how well an embedding model tuned for code would do at solving this sort of problem space: > > - https://openai.com/blog/introducing-text-and-code-embeddings > - https://platform.openai.com/docs/guides/embeddings > - > An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness. > - https://platform.openai.com/docs/api-reference/embeddings > - https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/ > - > Faiss: A library for efficient similarity search > > --- > > Also, here's the latest version of my open tabs 'reading list' in this space of things, in case any of it is relevant/interesting/useful here: > >
> Unsorted/Unreviewed Link Dump RE: 'AST fingerprinting' / Code Similarity (v2) > > - https://en.wikipedia.org/wiki/Content_similarity_detection > - > Content similarity detection > - https://arxiv.org/abs/2306.16171 > - > A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges (2023) > - > Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance. > - https://link.springer.com/article/10.1007/s10664-017-9564-7 > - > A comparison of code similarity analysers (2017) > - > Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as it is and it may be modified for various purposes; e.g. refactoring, bug fixing, or even software plagiarism. These code modifications could affect the performance of code similarity analysers including code clone and plagiarism detectors to some certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code. These are (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using ranked-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer similar performance to specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique. Its use reduced false classifications to zero for three of the tools. 
Moreover, we demonstrate that optimal configurations are very sensitive to a specific data set. After directly applying optimal configurations derived from one data set to another, the tools perform poorly on the new data set. The code similarity analysers are thoroughly evaluated not only based on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code. > - https://www.researchgate.net/publication/2840981_Winnowing_Local_Algorithms_for_Document_Fingerprinting > - > Winnowing: Local Algorithms for Document Fingerprinting (2003) > - > Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing's performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with Moss, a widely-used plagiarism detection service. > - https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf > - https://www.researchgate.net/publication/375651686_Source_Code_Plagiarism_Detection_with_Pre-Trained_Model_Embeddings_and_Automated_Machine_Learning > - > Source Code Plagiarism Detection with Pre-Trained Model Embeddings and Automated Machine Learning (2023) > - https://aclanthology.org/2023.ranlp-1.34.pdf > - https://www.researchgate.net/publication/262322336_A_Source_Code_Similarity_System_for_Plagiarism_Detection > - > A Source Code Similarity System for Plagiarism Detection (2013) > - > Source code plagiarism is an easy to do task, but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, by some structural modifications (e.g. modification of control structures, modification of data structures or structural redesign of source code) the source code can be changed in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. To be considered effective, a source code similarity detection system must address these issues. To address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared with the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. 
As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cut-off threshold values of 35–70%. > - https://www.mdpi.com/2076-3417/10/21/7519 > - > A Source Code Similarity Based on Siamese Neural Network (2020) > - > Finding similar code snippets is a fundamental task in the field of software engineering. > > Several approaches have been proposed for this task by using statistical language model which focuses on syntax and structure of codes rather than deep semantic information underlying codes. In this paper, a Siamese Neural Network is proposed that maps codes into continuous space vectors and try to capture their semantic meaning. Firstly, an unsupervised pre-trained method that models code snippets as a weighted series of word vectors. The weights of the series are fitted by the Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese > > Neural Network trained model is constructed to learn semantic vector representation of code snippets. Finally, the cosine similarity is provided to measure the similarity score between pairs of code snippets. Moreover, we have implemented our approach on a dataset of functionally similar code. The experimental results show that our method improves some performance over single word embedding method. > - https://www.researchgate.net/publication/337196468_Detecting_Source_Code_Similarity_Using_Compression > - > Detecting Source Code Similarity Using Compression (2019) > - > Different forms of plagiarism make a fair assessment of student assignments more difficult. Source code plagiarisms pose a significant challenge especially for automated assessment systems aimed for students' programming solutions. Different automated assessment systems employ different text or source code similarity detection tools, and all of these tools have their advantages and disadvantages. In this paper, we revitalize the idea of similarity detection based on string complexity and compression. We slightly adapt an existing, third-party, approach, implement it and evaluate its potential on synthetically generated cases and on a small set of real student solutions. On synthetic cases, we showed that average deviation (in absolute values) from the expected similarity is less than 1% (0.94%). On the real-life examples of student programming solutions we compare our results with those of two established tools. The average difference is around 18.1% and 11.6%, while the average difference between those two tools is 10.8%. However, the results of all three tools follow the same trend. Finally, a deviation to some extent is expected as observed tools apply different approaches that are sensitive to other factors of similarities. Gained results additionally demonstrate open challenges in the field. > - https://ceur-ws.org/Vol-2508/paper-pri.pdf > - https://www.nature.com/articles/s41598-023-42769-9 > - > Binary code similarity analysis based on naming function and common vector space (2023) > - > Binary code similarity analysis is widely used in the field of vulnerability search where source code may not be available to detect whether two binary functions are similar or not. Based on deep learning and natural processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. 
However, existing schemes suffer from the shortcomings of large differences in instruction syntaxes across different target platforms, inability to align control flow graph nodes, and less introduction of high-level semantics of stability, which pose challenges for identifying similar computations between binary functions of different platforms generated from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and a cross-platform binary function similarity comparison model N_Match is proposed. The model elevates different platform instructions to the same semantic space to shield their underlying platform instruction differences, uses graph embedding technology to learn the stability semantics of neighbors, extracts high-level knowledge of naming function to alleviate the differences brought about by cross-platform and cross-optimization levels, and combines the stable graph structure as well as the stable, platform-independent API knowledge of naming function to represent the final semantics of functions. The experimental results show that the model accuracy of N_Match outperforms the baseline model in terms of cross-platform, cross-optimization level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, the mAP exceeds the current graph embedding model by 66%. In addition, we also give several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match > - https://arxiv.org/abs/2305.03843 > - > REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models (2023) > - > This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7\%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements demonstrating both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine tuning, even when enhancing the largest available LLMs highlighting the importance for open-sourced models. To ensure the reproducibility and extensibility of our research, we present an open-sourced implementation of our tool and training procedures called REINFOREST. 
> - https://www.usenix.org/conference/usenixsecurity21/presentation/ahmadi > - > Finding Bugs Using Your Own Code: Detecting Functionally-similar yet Inconsistent Code (2021) > - > Probabilistic classification has shown success in detecting known types of software bugs. However, the works following this approach tend to require a large amount of specimens to train their models. We present a new machine learning-based bug detection technique that does not require any external code or samples for training. Instead, our technique learns from the very codebase on which the bug detection is performed, and therefore, obviates the need for the cumbersome task of gathering and cleansing training samples (e.g., buggy code of certain kinds). The key idea behind our technique is a novel two-step clustering process applied on a given codebase. This clustering process identifies code snippets in a project that are functionally-similar yet appear in inconsistent forms. Such inconsistencies are found to cause a wide range of bugs, anything from missing checks to unsafe type conversions. Unlike previous works, our technique is generic and not specific to one type of inconsistency or bug. We prototyped our technique and evaluated it using 5 popular open source software, including QEMU and OpenSSL. With a minimal amount of manual analysis on the inconsistencies detected by our tool, we discovered 22 new unique bugs, despite the fact that many of these programs are constantly undergoing bug scans and new bugs in them are believed to be rare. > - https://www.usenix.org/system/files/sec21summer_ahmadi.pdf > - https://theory.stanford.edu/~aiken/moss/ > - > MOSS: A System for Detecting Software Similarity > - https://github.com/fanghon/antiplag > - > antiplag similarity checking software for program code, documents, and pictures > > The software mainly checks and compares the similarities between electronic assignments submitted by students. It can analyze source code in multiple programming languages (such as Java, C/C++, Python, etc.), text documents in multiple formats (txt, doc, docx, pdf, etc.), including English and both simplified and traditional Chinese, and images in multiple formats (png, jpg, gif, bmp, etc.), and it outputs the code, text, and images with high similarity, thereby helping to detect plagiarism between students. > - https://github.com/dodona-edu/dolos > - > Dolos > > Dolos is a source code plagiarism detection tool for programming exercises. Dolos helps teachers in discovering students sharing solutions, even if they are modified. By providing interactive visualizations, Dolos can also be used to sensitize students to prevent plagiarism. > - https://dolos.ugent.be/ > - https://dolos.ugent.be/about/algorithm.html > - > How Dolos works > > Conceptually, the plagiarism detection pipeline of Dolos can be split into four successive steps: > > > > - Tokenization > > - Fingerprinting > > - Indexing > > - Reporting > - > Tokenization > > To be immune against masking plagiarism by techniques such as renaming variables and functions, Dolos doesn't directly process the source code under investigation. It starts by performing a tokenization step using [Tree-sitter](http://tree-sitter.github.io/tree-sitter/).
Tree-sitter can generate syntax trees for many programming languages, converts source code to a more structured form, and masks specific naming of variables and functions. > - > Fingerprinting > > To measure similarities between (converted) files, Dolos tries to find common sequences of tokens. More specifically, it uses subsequences of fixed length called k-grams. To efficiently make these comparisons and reduce the memory usage, all k-grams are hashed using a rolling hash function (the one used by Rabin-Karp in their string matching algorithm). The length k of the k-grams can be altered with the -k option. > > > > To further reduce the memory usage, only a subset of all hashes are stored. The selection of hashes is done by the Winnowing algorithm as described by [(Schleimer, Wilkerson and Aiken)](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf). In short: only the hash with the smallest numerical value is kept for each window. The window length (in k-grams) can be altered with the -w option. > > > > The remaining hashes are the fingerprints of the analyzed files. Internally, these are stored as simple integers. _[see the rough k-gram + winnowing JS sketch after this list]_ > - > Indexing > > Because Dolos needs to compare all files with each other, it is more efficient to first create an index containing the fingerprints of all files. For each of the fingerprints encountered in any of the files, we store the file and the corresponding line number where we encountered that fingerprint. > > > > As soon as a fingerprint is stored in the index twice, this is recorded as a match between the two files because they share at least one k-gram. > - > Reporting > > Dolos finally collects all fingerprints that occur in more than one file and aggregates the results into a report. > > > > This report contains all file pairs that have at least one common fingerprint, together with some metrics: > > > > - similarity: the fraction of shared fingerprints between the two files > > - total overlap: the absolute value of shared fingerprints, useful for larger projects > > - longest fragment: the length (in fingerprints) of the longest subsequence of fingerprints matching between the two files, useful when not the whole source code is copied > - https://dolos.ugent.be/about/languages.html > - https://dolos.ugent.be/about/publications.html > - > Publications > > Dolos is developed by [Team Dodona](https://dodona.be/en/about/) at Ghent University in Belgium. Our research is published in the following journals and conferences. > - https://github.com/danielplohmann/mcrit > - > MinHash-based Code Relationship & Investigation Toolkit (MCRIT) > > MCRIT is a framework created to simplify the application of the MinHash algorithm in the context of code similarity. It can be used to rapidly implement "shinglers", i.e. methods which encode properties of disassembled functions, to then be used for similarity estimation via the MinHash algorithm. It is tailored to work with disassembly reports emitted by [SMDA](https://github.com/danielplohmann/smda). > - https://github.com/BK-SCOSS/scoss > - > scoss > > A Source Code Similarity System - SCOSS > - https://github.com/island255/source2binary_dataset_construction > - > Source2binary Dataset Construction > > This is the repository for the paper "One to One or One to many? What function inline brings to binary similarity analysis". > - https://github.com/JackHCC/Pcode-Similarity > - > Pcode-Similarity > > Algorithm for calculating similarity between function and library function.
> - https://github.com/JackHCC/Awesome-Binary-Code-Similarity-Detection-2021 > - > Awesome Binary code similarity detection 2021 > > Awesome list for Binary Code Similarity Detection in 2021 > - https://github.com/Jaso1024/Semantic-Code-Embeddings > - > SCALE: Semantic Code Analysis via Learned Embeddings (2023) > > 3rd best paper on Artificial Intelligence track | presented at the 2023 International Conference on AI, Blockchain, Cloud Computing and Data Analytics > > This repository holds the code and supplementary materials for [SCALE: Semantic Code Analysis via Learned Embeddings](https://ieeexplore.ieee.org/document/10390981). This research explores the efficacy of contrastive learning alongside large language models as a paradigm for developing a model capable of creating code embeddings indicative of code on a functional level. > > Existing pre-trained models in NLP have demonstrated impressive success, surpassing previous benchmarks in various language-related tasks. However, when it comes to the field of code understanding, these models still face notable limitations. Code isomorphism, which deals with determining functional similarity between pieces of code, presents a challenging problem for NLP models. In this paper, we explore two approaches to code isomorphism. Our first approach, dubbed SCALE-FT, formulates the problem as a binary classification task, where we feed pairs of code snippets to a Large Language Model (LLM), using the embeddings to predict whether the given code segments are equivalent. The second approach, SCALE-CLR, adopts the SimCLR framework to generate embeddings for individual code snippets. By processing code samples with an LLM and observing the corresponding embeddings, we assess the similarity of two code snippets. These approaches enable us to leverage function-based code embeddings for various downstream tasks, such as code-optimization, code-comment alignment, and code classification. Our experiments on the CodeNet Python800 benchmark demonstrate promising results for both approaches. Notably, our SCALE-FT using Babbage-001 (GPT-3) achieves state-of-the-art performance, surpassing various benchmark models such as GPT-3.5 Turbo and GPT-4. Additionally, Salesforce's 350-million parameter CodeGen, when trained with the SCALE-FT framework, surpasses GPT-3.5 and GPT-4. > - https://github.com/Aida-yy/binary-sim > - > binary similarity using Deep learning > - > Features: Function semantic information + control flow graph > > > > Semantic feature extraction: extract the byte data, assembly instruction data, and integer data of the function respectively, use independent encoders (DPCNN, TextCNN) to encode the text representation, and obtain its Embedding representation. > > > > Structural feature extraction: based on the CFG and the assembly instructions in each block, an ACFG is generated and a graph neural network is used to encode it, obtaining an Embedding representation; in addition, considering that the node order of the control flow graphs of similar functions is also similar, the adjacency matrix of the CFG is taken as input and a CNN is used to obtain its Embedding representation.
> > > > Contrastive learning model structure: InfoNCE loss + In-batch negatives > > > > - [Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection](https://keenlab.tencent.com/en/whitepapers/Ordermatters.pdf) > > - [Investigating Graph Embedding Methods for Cross-Platform Binary Code Similarity Detection](https://www.mhumbert.com/publications/eurosp22_2.pdf) > > - [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/pdf/2104.08821.pdf) > - https://github.com/jorge-martinez-gil/crosslingual-clone-detection > - > Transcending Language Barriers in Software Engineering with Crosslingual Code Clone Detection (2024) > > Systematic study to determine the best methods to assess the similarity between code snippets in different programming languages > >
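To make the Dolos/MOSS-style pipeline described above a bit more concrete, here is a rough, illustrative JS sketch of the k-gram + winnowing idea (referenced from the Fingerprinting note above). This is not Dolos' actual implementation: the "tokenisation" step here is just whitespace stripping (Dolos uses Tree-sitter so that renamed identifiers still match), and the sha1-based hashing and the default `k`/`w` values are arbitrary choices for the sketch.

```js
// Naive sketch of k-gram hashing + winnowing for code fingerprinting.
// NOT Dolos' implementation: real tools tokenise via a parser (e.g. Tree-sitter)
// so that renamed identifiers and reformatted code still produce matching hashes.
const crypto = require("crypto");

// Stand-in "tokenisation": just strip whitespace. (Illustrative only.)
const normalise = (src) => src.replace(/\s+/g, "");

// Hash every k-length substring (k-gram) of the normalised text to a 32-bit int.
function kgramHashes(text, k) {
  const hashes = [];
  for (let i = 0; i + k <= text.length; i++) {
    const digest = crypto.createHash("sha1").update(text.slice(i, i + k)).digest();
    hashes.push(digest.readUInt32BE(0));
  }
  return hashes;
}

// Winnowing: slide a window of w consecutive hashes and keep the minimum of
// each window (rightmost minimum on ties), per the SIGMOD'03 paper linked above.
function winnow(hashes, w) {
  const fingerprints = new Set();
  for (let start = 0; start + w <= hashes.length; start++) {
    let min = start;
    for (let i = start + 1; i < start + w; i++) {
      if (hashes[i] <= hashes[min]) min = i;
    }
    fingerprints.add(hashes[min]);
  }
  return fingerprints;
}

// Similarity = fraction of shared fingerprints (cf. Dolos' "similarity" metric).
function similarity(aSrc, bSrc, k = 10, w = 4) {
  const a = winnow(kgramHashes(normalise(aSrc), k), w);
  const b = winnow(kgramHashes(normalise(bSrc), k), w);
  const shared = [...a].filter((h) => b.has(h)).length;
  return shared / Math.max(a.size, b.size, 1);
}

// Same code, different formatting => identical fingerprints:
console.log(similarity(
  "function add(a,b){return a+b}",
  "function add(a, b) {\n  return a + b;\n}"
)); // => 1
```

Real tools additionally keep the fingerprint positions so they can report the matching fragments themselves, not just an overall similarity score.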
> > You can also find the first link dump of content in the collapsible in the [first post on this issue](https://github.com/pionxzh/wakaru/issues/74#issue-2038720195). > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2084114246_ > Further 'prior art', an example of an 'obfuscation detector' based on AST structure: > > > > Here are projects that try to support many different ones: [PerimeterX/restringer](https://github.com/PerimeterX/restringer), [ben-sb/javascript-deobfuscator](https://github.com/ben-sb/javascript-deobfuscator) > > > > > Instead I'd rather add more interactive actions that make manually working on unknown obfuscators faster and let the user decide if its safe > > > > Linked from that `restringer` repo, I came across this project: > > > > - https://github.com/PerimeterX/obfuscation-detector > > - > Detect different types of JS obfuscation by their AST structure > > - https://github.com/PerimeterX/obfuscation-detector#supported-obfuscation-types > > - https://github.com/PerimeterX/obfuscation-detector/tree/main/src/detectors > > > > It could be cool to have a similar sort of 'obfuscation detector' feature within `webcrack`, particularly if it was paired with the 'interactive actions'. The 'detector' rules could suggest which obfuscations seem to be in place, and could then potentially recommend corresponding rules, etc. > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/76#issuecomment-2116401646_ > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2116415993_ > RE: https://github.com/pionxzh/wakaru/issues/74#issuecomment-2042100894 > > > Some more 'prior art' from the binary reverse engineering world > > **TL;DR:** Some info on various other technologies, with Binary Ninja's new WARP being of particular interest, as well as some more general background knowledge around debug symbol servers / signature servers (eg. with a thought to how similar concepts could be used alongside module identifiers / etc for JS libs) > > Similar to IDA's [FLIRT](https://hex-rays.com/products/ida/tech/flirt/in_depth/) / [FLAIR](https://cloud.google.com/blog/u/1/topics/threat-intelligence/flare-ida-pro-script/) signatures, and what I was doing in [0xdevalias/poc-re-binsearch](https://github.com/0xdevalias/poc-re-binsearch) > > - https://binary.ninja/2024/11/20/4.2-frogstar.html#warp-advanced-function-matching-algorithm-alpha > - > WARP: Advanced Function Matching Algorithm Alpha > > This release features a new way to transfer function information between binaries. Unlike our existing SigKit tool, WARP is meant for whole function matching. This means fewer false positives and more opportunities to match on smaller functions, thanks to WARP’s function constraints. > - > For more information about WARP, visit the documentation [here](https://docs.binary.ninja/dev/annotation.html?h=warp#warp-signature-libraries)! > - https://www.seandeaton.com/binary-ninja-warp-signatures/ > - > Trying Out Binary Ninja's new WARP Signatures with IPSW Diff'ing > > Binary diff'ing is pretty complex, but being able to apply markup from one binary to another is quite powerful. Binary Ninja's new WARP extends previous efforts, using SigKit, to quickly identify library functions. 
> - https://docs.binary.ninja/dev/annotation.html > - > Applying Annotations > - > - [Symbols](https://docs.binary.ninja/dev/annotation.html?h=sigkit#symbols) covers how to work with Symbols in a binary > > - [Types](https://docs.binary.ninja/dev/annotation.html?h=sigkit#types) documents creating and interacting with types through the API > > - [Tags](https://docs.binary.ninja/dev/annotation.html?h=sigkit#tags) describes how to create tags and bookmarks > > - [Type Libraries](https://docs.binary.ninja/dev/typelibraries.html) explains how to work with Type Libraries, including multiple sources of information from which Binary Ninja can automatically source for type information from and how you can add to them > > - [Signature Libraries](https://docs.binary.ninja/dev/annotation.html?h=sigkit#signature-libraries) explains how to work with the signature library which match statically compiled functions which are then matched with type libraries > - https://docs.binary.ninja/dev/annotation.html?h=sigkit#signature-libraries > - > Signature Libraries > > There are now two different signature library systems: SigKit, and WARP. SigKit will be deprecated in the near future as WARP represents a superset of its features. > - https://docs.binary.ninja/dev/annotation.html?h=sigkit#sigkit-signature-libraries > - > SigKit Signature Libraries > - https://github.com/Vector35/sigkit > - > Signature Kit Plugin > - > Function signature matching and signature generation plugin for Binary Ninja > - > This plugin provides Python tools for generating, manipulating, viewing, loading, and saving signature libraries (`.sig`) for the Signature System. > - https://docs.binary.ninja/dev/annotation.html?h=sigkit#warp-signature-libraries > - > WARP Signature Libraries > > WARP integration is included with Binary Ninja but turned off by default, for more information about WARP itself visit the open source repository here! > > > > The benefit to using WARP over SigKit is that WARP signatures are more comprehensive and as such will have fewer false positives. Alongside fewer false positives WARP will match more functions with less information due to the matching algorithm taking into account function locality (i.e. functions next to each other). After matching has completed WARP functions will be tagged and the types for those functions will be transferred, this means less work for those looking to transfer analysis information from one version of a binary to another version. > - https://github.com/Vector35/warp > - > WARP > > WARP provides a common format for transferring and applying function information across binary analysis tools. > - https://github.com/Vector35/warp#function-identification > - > Function Identification > > Function identification is the main way to interact with WARP, allowing tooling to utilize WARP's dataset to identify common functions within any binary efficiently and accurately. > - https://github.com/Vector35/warp#comparison-of-function-recognition-tools > - > Comparison of Function Recognition Tools > > WARP vs FLIRT > > The main difference between WARP and FLIRT is the approach to identification. > - > Function Identification > > > > - WARP the function identification is described [here](https://github.com/Vector35/warp#function-identification). > > - FLIRT uses incomplete function byte sequence with a mask where there is a single function entry (see: [IDA FLIRT Documentation](https://docs.hex-rays.com/user-guide/signatures/flirt/ida-f.l.i.r.t.-technology-in-depth) for a full description). 
> > > > What this means in practice is WARP will have less false positives based solely off the initial function identification. When the returned set of functions is greater than one, we can use the list of [Function Constraints](https://github.com/Vector35/warp#function-constraints) to select the best possible match. However, that comes at the cost of requiring a computed GUID to be created whenever the lookup is requested and that the function GUID is *always* the same. > - https://docs.binary.ninja/dev/typelibraries.html > - > Type Libraries > > Type Libraries are collections of type information (structs, enums, function types, etc.) stored in a file with the extension `.bntl`. > - https://binary.ninja/2024/10/01/plugin-spotlight-coolsigmaker.html > - > A common desire in reverse engineering is to match re-used code across multiple binaries. Whether you're doing malware lineage tracking, identifying a statically compiled library, or any other use case about identifying similar code, there are multiple technologies that attempt to solve parts of this problem. Other tools for related problems include [SigKit](https://github.com/Vector35/sigkit) (Binary Ninja's [static library detection](https://docs.binary.ninja/dev/annotation.html?h=sigkit#signature-library)), IDA's [FLIRT/FLAIR](https://docs.hex-rays.com/user-guide/signatures/flirt) and [Lumina](https://docs.hex-rays.com/user-guide/lumina) features, or even more advanced systems like [Diaphora](http://diaphora.re/) or [BinDiff](https://www.zynamics.com/bindiff.html). > > > > Related to those, you might already be familiar with the "SigMaker" style of plugins for various platforms[[1]](https://github.com/ajkhoury/SigMaker-x64) [[2]](https://github.com/apekros/binja_sigmaker) [[3]](https://github.com/Alex3434/Binja-SigMaker). These plugins generate patterns from code that can be used to find said code across different binaries or find the same function reliably between application updates. This is useful for malware classification and static-library identification among other purposes. > > > > [binja_coolsigmaker](https://github.com/unknowntrojan/binja_coolsigmaker) is just that: a fast and reliable "SigMaker" plugin for Binary Ninja. > - http://diaphora.re/ > - > Diaphora has many of the most common program diffing (bindiffing) features you might expect, like: > > > > - Diffing assembler. > > - Diffing control flow graphs. > > - Porting symbol names and comments. > > - Adding manual matches. > > - Similarity ratio calculation. > > - Batch automation. > > - Call graph matching calculation. > > - Dozens of heuristics based on graph theory, assembler, bytes, functions’ features, etc… > - https://github.com/joxeankoret/diaphora > - > Diaphora, the most advanced Free and Open Source program diffing tool. > - https://www.zynamics.com/bindiff.html > - > BinDiff uses a unique graph-theoretical approach to compare executables > by identifying identical and similar functions > - > Identify identical and similar functions in different binaries > - > Port function names, anterior and posterior comment lines, standard comments and local names from one disassembly to the other > - https://github.com/google/bindiff > - > Quickly find differences and similarities in disassembled code > - https://github.com/google/bindiff#further-reading--similar-tools > - > Further reading / Similar tools > > The original papers outlining the general ideas behind BinDiff: > > > > - Thomas Dullien and Rolf Rolles. Graph-Based Comparison of Executable Objects. 
[bindiffsstic05-1.pdf](https://github.com/google/bindiff/blob/main/docs/papers/bindiffsstic05-1.pdf). SSTIC ’05, Symposium sur la Sécurité des Technologies de l’Information et des Communications. 2005. > > - Halvar Flake. Structural Comparison of Executable Objects. [dimva_paper2.pdf](https://github.com/google/bindiff/blob/main/docs/papers/dimva_paper2.pdf). pp 161-173. Detection of Intrusions and Malware & Vulnerability Assessment. 2004.3-88579-375-X. > > Then in the space of debug symbol servers / similar: > > - https://hex-rays.com/lumina > - > What is a Lumina server? > > A Lumina server keeps track of metadata about some widely-recognizable functions, like their names, prototypes, or operand types. Additionally, Lumina allows you to "export" work that was previously done on another file to other projects. > - > How does Lumina work? > > Your IDA instance exchanges function hash values and metadata with the Hex-Rays Lumina server, instead of entire byte patterns. When hash values provided by IDA match the Lumina knowledge base, your IDA instance downloads the function and applies it to the current IDA binary file database (IDB). > > > > Lumina is implemented as a hash-based lookup table, mapping byte patterns to metadata. For increased resilience, relocatable bits are masked out before hashing. The Lumina server performs lookups purely based on cryptographic digests, so (potentially sensitive) byte patterns are never transferred over the network. > - https://github.com/tc39/ecma426/blob/main/proposals/debug-id.md#appendix-b-symbol-server-support > - > Source Map Debug ID Proposal > > This document presents a proposal to add globally unique build or debug IDs to source maps and generated code, making build artifacts self-identifying and facilitating bidirectional references between Source Maps and generated code. > - > Appendix B: Symbol Server Support > > With debug IDs it becomes possible to resolve source maps and generated code from the server. That way a tool such as a browser or a crash reporter could be pointed to a S3, GCS bucket or an HTTP server that can serve up source maps and build artifacts keyed by debug id. > - https://github.com/getsentry/javascript-debug-ids > - > `javascript-debug-ids` > > JavaScript polyfills, bundler plugins and utils for the [TC39 Debug ID proposal](https://github.com/tc39/source-map/blob/main/proposals/debug-id.md). > - https://github.com/rollup/rollup/blob/master/CHANGELOG.md#4250 > - > Add `output.sourcemapDebugIds` option to add matching debug ids to sourcemaps and code for tools like Sentry or Rollbar > - And some more random info/resources related to debug symbol servers > - https://docs.sentry.io/platforms/native/data-management/debug-files/symbol-servers/ > - > Symbol Servers > - > Sentry can download debug information files from external repositories. This allows you to stop uploading debug files and instead configure a public symbol server or run your own. It is also possible to configure external repositories and upload debug files at the same time. > - https://docs.sentry.io/platforms/native/data-management/debug-files/symbol-servers/#custom-repositories > - > Independent of the internal format, Sentry supports three kinds of custom repositories: > > > > - HTTP Symbol Server: An HTTP server that serves debug files at a configurable path. Lookups in the server should generally be case-insensitive, although an explicit casing can be configured in the settings. 
Note that Sentry requires a minimum download speed of 4Mb/s to fetch DIFs from custom HTTP symbol servers. > > - Amazon S3 Bucket: Either an entire S3 bucket or a subdirectory. This requires `s3:GetObject`, and optionally `s3:ListBucket` permissions for the configured Access Key. Lookups in the bucket are case-sensitive, which is why we recommend storing all files lower-cased and using a lowercased path casing configuration. > > - Google Cloud Storage Bucket: Either an entire GCS bucket or a subdirectory. This requires `storage.objects.get` and `storage.objects.list` permissions for the configured service account. Lookups in the bucket are case sensitive, which is why we recommend storing all files lower-cased. > - https://docs.sentry.io/platforms/native/data-management/debug-files/symbol-servers/#directory-layouts > - > Directory Layouts > - > The following table contains a mapping from the supported layouts to file path schemas applied for specific files > - https://www.jetbrains.com/help/clion/using-symbol-servers-when-debugging-on-windows.html > - > Use symbol servers when debugging on Windows > - https://learn.microsoft.com/en-us/windows/win32/dxtecharts/debugging-with-symbols#symbol-servers > - > Debugging with Symbols > - > Symbol Servers > - https://learn.microsoft.com/en-us/windows/win32/debug/symbol-servers-and-symbol-stores > - > Symbol Server and Symbol Stores > - https://en.wikipedia.org/wiki/Microsoft_Symbol_Server > - > Microsoft Symbol Server is a Windows technology used to obtain symbol debugging information. > - https://blog.inedo.com/nuget/source-server-debugging/ > - > When you build a NuGet package with Source Link enabled a Git Repository URL and Commit ID will be embedded in the package metadata. This allows Visual Studio to locate the required code files for debug time. > - https://github.com/dotnet/sourcelink > - > Source Link > > Source Link is a language- and source-control agnostic system for providing first-class source debugging experiences for binaries. > - > Source Link [specification](https://github.com/dotnet/designs/blob/main/accepted/2020/diagnostics/source-link.md#source-link-file-specification) describes source control metadata that can be embedded in symbols, binaries and packages to link them to their original sources. > - https://github.com/dotnet/designs/blob/main/accepted/2020/diagnostics/source-link.md#source-link-file-specification > - https://sourceware.org/elfutils/Debuginfod.html > - > [elfutils](https://sourceware.org/elfutils/index.html) debuginfod is a client/server in elfutils 0.178+ that automatically distributes elf/dwarf/source-code from servers to clients such as debuggers across HTTP. > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/74#issuecomment-2568536103_ #### Issue 121: Explore creating a 'reverse engineered' records.json / stats.json file from a webpack build > This is an idea I've had in passing a few times, but keep forgetting to document it: > > - https://medium.com/@songawee/long-term-caching-using-webpack-records-9ed9737d96f2 > - > there are many factors that go into getting consistent filenames. Using Webpack records helps generate longer lasting filenames (cacheable for a longer period of time) by reusing metadata, including module/chunk information, between successive builds. This means that as each build runs, modules won’t be re-ordered and moved to another chunk as often which leads to less cache busting. 
> - > The first step is achieved by a Webpack configuration setting: `recordsPath: path.resolve(__dirname, ‘./records.json’)` > > This configuration setting instructs Webpack to write out a file containing build metadata to a specified location after a build is completed. > - > It keeps track of a variety of metadata including module and chunk ids which are useful to ensure modules do not move between chunks on successive builds when the content has not changed. > - > With the configuration in place, we can now enjoy consistent file hashes across builds! > - > In the following example, we are adding a dependency (superagent) to the vendor-two chunk. > > > > We can see that all of the chunks change. This is due to the module ids changing. This is not ideal as it forces users to re-download content that has not changed. > > > > The following example adds the same dependency, but uses Webpack records to keep module ids consistent across the builds. We can see that only the vendor-two chunk and the runtime changes. The runtime is expected to change because it has a map of all the chunk ids. Changing only these two files is ideal. > - https://webpack.js.org/configuration/other-options/#recordspath > - > `recordsPath`: Use this option to generate a JSON file containing webpack "records" – pieces of data used to store module identifiers across multiple builds. You can use this file to track how modules change between builds. > - https://github.com/search?q=path%3A%22webpack.records.json%22&type=code > - https://github.com/GooTechnologies/goojs/blob/master/webpack.records.json > > I'm not 100% sure if this would be useful, or partially useful, but I think I am thinking of it tangentially in relation to things like: > > - https://github.com/0xdevalias/chatgpt-source-watch/issues/9 > - https://github.com/pionxzh/wakaru/issues/34 > - https://github.com/pionxzh/wakaru/issues/41 > - https://github.com/pionxzh/wakaru/issues/73 > - https://github.com/pionxzh/wakaru/issues/74 > - etc > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/121#issue-2164642094_ > > Even more tangentially related to this, I've pondered how much we could 're-construct' the files necessary to use tools like bundle analyzer, without having access to the original source (or if there would even be any benefit to trying to do so): > > > > - https://github.com/webpack-contrib/webpack-bundle-analyzer > > - > Webpack plugin and CLI utility that represents bundle content as convenient interactive zoomable treemap > > - https://github.com/webpack-contrib/webpack-bundle-analyzer#usage-as-a-cli-utility > > - > You can analyze an existing bundle if you have a webpack stats JSON file. > > > > > > You can generate it using `BundleAnalyzerPlugin` with `generateStatsFile` option set to `true` or with this simple command: `webpack --profile --json > stats.json` > > - https://webpack.js.org/api/stats/ > > - > Stats Data > > > When compiling source code with webpack, users can generate a JSON file containing statistics about modules. These statistics can be used to analyze an application's dependency graph as well as to optimize compilation speed. 
> > - https://nextjs.org/docs/pages/building-your-application/optimizing/bundle-analyzer > > - https://www.npmjs.com/package/@next/bundle-analyzer > > > > My gut feel is that we probably can figure out most of what we need for it; we probably just can't give accurate sizes for the original pre-minified code, etc; and the module names/etc might not be mappable to their originals unless we have module identification type features (see https://github.com/pionxzh/wakaru/issues/41) > > > > _Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/9#issuecomment-1974432157_ > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/121#issuecomment-1974433150_ #### Issue 140: `[smart-rename]` Add support for `data-sentry-component` / `data-sentry-element` / `data-sentry-source-file` (from `@sentry/babel-plugin-component-annotate`) Note: This was crossposted to the following issues: - https://github.com/pionxzh/wakaru/issues/140 - https://github.com/j4k0xb/webcrack/issues/143 - https://github.com/jehna/humanify/issues/350 > Sentry has a feature that allows it to annotate built React components with the component name and source filename it was built from, to help provide better error logs. If these are present in the built output, this could be leveraged to extract those details and assist in restoring the original component name and/or source file name: > > - https://docs.sentry.io/platforms/javascript/guides/react/features/component-names/ > - > Sentry helps you capture your React components and unlock additional insights in your application. You can set it up to use React component names instead of selectors. > - > You can capture the names of React components in your application via a [Babel plugin](https://www.npmjs.com/package/@sentry/babel-plugin-component-annotate), which can unlock powerful workflows and decrease ambiguity. > - > Please note that your Sentry browser SDK must be at version `7.91.0` or higher before you can use these features. Only React components in `.jsx` or `.tsx` files can be tracked. > - > The Babel plugin parses your application's JSX source code at build time, and applies additional data attributes onto it. These attributes then appear on the DOM nodes of your application's built HTML, > - > For example, if you had a component named `MyAwesomeComponent` in the file `myAwesomeComponent.jsx`: > > > > ```js > > function MyAwesomeComponent() { > > return
<div>This is a really cool and awesome component!</div>; > > } > > ``` > > > > After your bundler applied the plugin and built your project, the resulting DOM node would look like this: > > > > ```html > > <div > > data-sentry-component="MyAwesomeComponent" > > data-sentry-source-file="myAwesomeComponent.jsx" > > > > > This is a really cool and awesome component! > > </div>
> > ``` > - https://github.com/getsentry/sentry-javascript-bundler-plugins/tree/main/packages/babel-plugin-component-annotate > - `@sentry/babel-plugin-component-annotate` > - https://github.com/getsentry/sentry-javascript-bundler-plugins/blob/ee73414589a3341c4a4a8ec8efa3116d838e33f8/packages/babel-plugin-component-annotate/src/index.ts#L40-L46 > - ```js > const webComponentName = "data-sentry-component"; > const webElementName = "data-sentry-element"; > const webSourceFileName = "data-sentry-source-file"; > > const nativeComponentName = "dataSentryComponent"; > const nativeElementName = "dataSentryElement"; > const nativeSourceFileName = "dataSentrySourceFile"; > ``` > - https://github.com/search?type=code&q=%22data-sentry-source-file%22+OR+%22data-sentry-component%22 > > ## See Also > > - https://github.com/pionxzh/wakaru/issues/140 > - https://github.com/j4k0xb/webcrack/issues/143 > > _Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/350#issue-2888684046_ ### On `jehna/humanify` #### Issue 97: More deterministic renames across different versions of the same code > Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc. > > My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives, updated continuously. This would serve as a lookup table for the LLM, helping maintain consistency and reducing variations in the final output. Admittedly, I haven't fully explored the feasibility of this concept, but I believe it would strengthen reverse-engineering processes. > > _Originally posted by @neoOpus in https://github.com/jehna/humanify/issues/97 --- > > My suggestion is to create a mapping file that lists generated variable names alongside their LLM-generated alternatives > > @neoOpus This is similar to an area I have spent a fair bit of time thinking about/prototyping tooling around in the past. One of the bigger issues that you're likely to find here is that with bundlers like webpack/etc, when they minimise the variable names, they won't necessarily choose the same minified variable name for the same code each time. So to make a 'lookup table' type concept work, you first need to be able to stabilise the 'reference key' for each of those variables, even if the bundler chose something different to represent it. 
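As a very rough sketch of what "stabilising the reference key" could look like in practice: rather than keying a lookup table on the minified name itself, each binding could be keyed by its position within the code (function index plus binding index). The `fn<i>:binding<j>` key format and the use of `@babel/parser` / `@babel/traverse` below are illustrative assumptions only, not how the prototypes linked below actually work.

```js
// Rough sketch: derive a position-based "reference key" for every binding in
// each function, so the same variable can be matched across builds even when
// the minifier picks different short names for it.
// Assumes @babel/parser and @babel/traverse; the "fn<i>:binding<j>" key format
// is a made-up convention purely for illustration.
const parser = require("@babel/parser");
const traverse = require("@babel/traverse").default;

function referenceKeys(code) {
  const ast = parser.parse(code, { sourceType: "unambiguous", plugins: ["jsx"] });
  const keys = {}; // stable key -> current (minified) name
  let fnIndex = 0;

  traverse(ast, {
    Function(path) {
      const f = fnIndex++; // functions are visited in source order, so this is stable
      Object.keys(path.scope.bindings).forEach((name, i) => {
        keys[`fn${f}:binding${i}`] = name;
      });
    },
  });

  return keys;
}

// Two "builds" of the same function, minified with different names:
console.log(referenceKeys("function add(a, b) { return a + b; }"));
// => { 'fn0:binding0': 'a', 'fn0:binding1': 'b' }
console.log(referenceKeys("function add(x, y) { return x + y; }"));
// => { 'fn0:binding0': 'x', 'fn0:binding1': 'y' }
```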
> > You can find some of my initial hacky prototypes scattered in this repo: > > - https://github.com/0xdevalias/poc-ast-tools > > My thoughts/notes on this are scattered around a few places, but these may be some useful/interesting places to start: > > - https://github.com/0xdevalias/chatgpt-source-watch/issues/3 > - https://github.com/Wilfred/difftastic/issues/631 > - https://github.com/afnanenayet/diffsitter/issues/819 > - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js > - https://github.com/0xdevalias/chatgpt-source-watch/issues/10 > - https://github.com/pionxzh/wakaru/issues/34 > - https://github.com/pionxzh/wakaru/issues/74 > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#variable-name-mangling > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#my-chatgpt-research--conversations > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#fingerprinting-minified-javascript-libraries > - https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc > - https://github.com/pionxzh/wakaru/issues/73 > - https://github.com/pionxzh/wakaru/issues/41 > - https://github.com/j4k0xb/webcrack/issues/21 > > You can see an example of a larger scale project where I was trying to stabilise the minified variable names to reduce the 'noise' in large scale source diffing here: > > - https://github.com/0xdevalias/chatgpt-source-watch > > _Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/97#issuecomment-2347878686_ --- > > Currently, LLMs often guess variable names differently across various versions of the same JavaScript code. This inconsistency complicates versioning, tracking changes, and merging code for anyone regularly analyzing or modifying applications, extensions, etc. > > Just to clarify that I'm on the same page here, is the issue that: > * You have multiple versions of a webapp/website that change over time > * You un-minify all of them > * You need to compare their differencies, and it's proving difficult as Humanify does not generate same names for same minified code > > This is an interesting problem. I'd love to research some ways to implement this. Especially AST fingerprinting seems promising, thank you @0xdevalias for your links. > > _Originally posted by @jehna in https://github.com/jehna/humanify/issues/97#issuecomment-2356724015_ --- > One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source. > > In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs. > > _Originally posted by @jehna in https://github.com/jehna/humanify/issues/97#issuecomment-2356732911_ --- > I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful! 
> > My suggestion is to break the code down into smaller, modular functions, which seems to be a practice your script might already be implementing. One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) in order of appearance. (I honestly don't know how this can be done, but maybe via RegEx or just asking the LLM to do it). > > Anyway, this would allow for a standardized, minified version of the code. After creating this stripped down and abstracted version, we could calculate a hash of the code as a string. This hash would serve as a unique identifier to track changed portions of the code across different versions of the project and prevent duplicate entries, as well as a reference for where to store the future generated variable names. The resulting data could be stored in an appropriate format, such as CSV, NoSQL, or JSON, based on your requirements for speed, scalability, and ease of access. > > Next, we could analyze this stored data from a designated project location or maybe a specified subfolder (into .humanifjs). Here, we could leverage language models (LLMs) to generate meaningful variable names based on the context of the functions. This would create a "reference" that can assist in future analyses of the code. > > When new versions of the obfuscated code are generated (which will have different variable names), we can apply a similar process to compare them with previously processed versions. By using diff techniques, we can identify changes and maintain a collection of these sub-chunks of code, which would help reduce discrepancies. In most cases, we should see a high degree of similarity unless a particular function’s logic has changed. We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices to the LLM or assigning them directly programmatically, to reduce the need to consume more tokens for the same chunks. > > Additionally, to enhance this process, we could explore various optimizations in how the LLM generates and assigns these variable names, as well as how we handle the storage and retrieval of the chunks. > > I look forward to your thoughts on this approach and any suggestions you may have for improving it further! > > What would make this work better is to take advantage of diff (compare) techniques to make some sort of sub-chunks, then keep them available to reduce the discrepancy, and maybe also optimize the generation... I hope this makes sense. > > And as you stated here > > > One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source. > > > > In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs. > > This would be optimal indeed, as it would allow us to leverage the collective work to get the best results. > > PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly.
> > _Originally posted by @neoOpus in https://github.com/jehna/humanify/issues/97#issuecomment-2359434509_ --- > > One issue related to fingerprinting is that most of the stuff in a modern webapp bundle is dependencies. And most of the dependencies probably have public source code. So in theory it would be possible to build a huge database of open source code fingerprints that would match a specific version of a specific code, and to have a tool that deterministically reverses the code to its actual original source. > > @jehna Agreed. This was one of the ideas that first led me down the 'fingerprinting' path. Though instead of 'deterministically reversing the code to the original source' in its entirety (which may also be useful), my plan was first to be able to detect dependencies and mark them as such (as most of the time I don't care to look too deeply at them), and then secondly to just be able to extract the 'canonical variable/function names' from that original source and be able to apply them to my unminified version (similar to how `humanify` currently uses AI for this step); as that way I know that even if there is some little difference in the actual included code, I won't lose that by replacing it with the original source. These issues on `wakaru` are largely based on this area of things: > > - https://github.com/pionxzh/wakaru/issues/41 > - https://github.com/pionxzh/wakaru/issues/73 > - https://github.com/pionxzh/wakaru/issues/74 > > While it's a very minimal/naive attempt, and definitely not the most robust way to approach things, a while back I implemented a really basic 'file fingerprint' method, mostly to assist in figuring out when a chunk had been renamed (but was otherwise largely the same chunk as before), that I just pushed to `poc-ast-tools` (https://github.com/0xdevalias/poc-ast-tools/commit/b0ef60f8608385c40de2644b3346b1834eb477a0): > > - https://github.com/0xdevalias/poc-ast-tools/blob/main/text_similarity_checker.py > - https://github.com/0xdevalias/poc-ast-tools/blob/main/rename-chunk.sh > > When I was implementing it, I was thinking about embeddings, but didn't want to have to send large files to the OpenAI embeddings API; and wanted a quick/simple local approximation of it. > > Expanding on this concept to the more general code fingerprinting problem; I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed. I would also probably be normalising the code to remove any function/variable identifiers first; and to remove the impact of whitespace differences/etc. > > While it's not applied to generating a fingerprint, you can see how I've used some of these techniques in my approach to creating a 'diff minimiser' for identifying newly changed code between builds, while ignoring the 'minification noise / churn': > > - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js > > --- > > > In theory we could use a similar method to build a local database of already-humanified code, which would make the reverse process more deterministic on subsequent runs. > > @jehna Oh true.. yeah, that definitely makes sense. Kind of like a local cache. 
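A minimal sketch of that "normalise identifiers/whitespace, then hash at the module level" idea follows. This is not one of the linked PoC scripts; it assumes `@babel/parser`, `@babel/traverse` and `@babel/generator` are available, and the `_v<n>` placeholder scheme plus SHA-256 are arbitrary choices.

```js
// Minimal sketch of a module-level fingerprint: rename every binding to a
// deterministic placeholder, re-print the code without formatting, then hash.
// Assumes @babel/parser, @babel/traverse and @babel/generator; the "_v<n>"
// placeholder scheme and the use of SHA-256 are arbitrary choices.
const crypto = require("crypto");
const parser = require("@babel/parser");
const traverse = require("@babel/traverse").default;
const generate = require("@babel/generator").default;

function moduleFingerprint(code) {
  const ast = parser.parse(code, { sourceType: "unambiguous", plugins: ["jsx"] });
  let counter = 0;

  // Rename every declared binding (in every scope) to a positional placeholder.
  traverse(ast, {
    Scopable(path) {
      for (const name of Object.keys(path.scope.bindings)) {
        // Naive: skips anything already using the placeholder prefix.
        if (!name.startsWith("_v")) path.scope.rename(name, `_v${counter++}`);
      }
    },
  });

  // Re-print compactly (no whitespace/comments) and hash the result.
  const { code: normalised } = generate(ast, { compact: true, comments: false });
  return crypto.createHash("sha256").update(normalised).digest("hex");
}

// The same module, "minified" two different ways, yields the same fingerprint:
console.log(moduleFingerprint("function add(a,b){return a+b}"));
console.log(moduleFingerprint("function add(x, y) {\n  return x + y;\n}"));
```

Because all bindings are renamed to deterministic placeholders and the output is re-printed compactly, the hash survives identifier mangling and formatting churn; any real structural change to the module still produces a different fingerprint.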
> > --- > > > One approach to enhance this is to replace all variable names with generic placeholders (like a, b, c, d) or numerical identifiers (such as 0001, 0002, 0003) by order of apparency. (I honestly don't know how this can be done but maybe via RegEx or just asking LLM to do it). > > @neoOpus This would be handled by parsing the code into an AST, and then manipulating that AST to rename the variables. > > You can see various hacky PoC versions of this with various parsers in my `poc-ast-tools` repo (I don't remember which is the best/most canonical as I haven't looked at it all for ages), eg: > > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_0_old_combined.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_1.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_2.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3_clean.js > - https://github.com/0xdevalias/poc-ast-tools/blob/main/babel_v1_3_cli.js > - etc: https://github.com/0xdevalias/poc-ast-tools > > Which you can see some of the early hacky mapping attempts I was making in these files: > > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-121de668c4456907-HEAD.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD-rewritten.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-HEAD%5E1.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-f9af0280d3150ee2-HEAD.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.167-test.json > - https://github.com/0xdevalias/poc-ast-tools/blob/main/variableMapping.json > > That was the point where I realised I really needed something more robust (such as a proper fingerprint that would survive code minification) to use as the key. > > --- > > > We can then reassign the previously generated variable names (instead of the original variable names or having to generate different ones) to the new code chunks by feeding them as choices for the LLM or assigning them directly programmatically to reduce the need to consume more tokens for the same chunks. > > @neoOpus Re-applying the old variable names to the new code wouldn't need an LLM at all, as that part is handled in the AST processing code within `humanify`: > > - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification#:~:text=Don%27t%20let%20AI%20touch%20the%20code > - > Don't let AI touch the code > > Now while LLMs are very good at rephrasing and summarizing, they are not very good at coding (yet). They have inherent randomness, which makes them unsuitable for performing the actual renaming and modification of the code. > > > > Fortunately renaming a Javascript variable within its scope is a solved problem with traditional tools like Babel. Babel first parses the code into an abstract syntax tree (AST, a machine representation of the code), which is easy to modify using well behaving algorithms. > > > > This is much better than letting the LLM modify the code on a text level; it ensures that only very specific transformations are carried out so the code's functionality does not change after the renaming. 
The code is guaranteed to have the original functionality and to be runnable by the computer. > > --- > > > I would like to share an idea I’ve been considering, even though I’m still in the process of researching this topic. I hope it proves to be useful! > > @neoOpus At a high level, it seems that the thinking/aspects you've outlined here are more or less in line with what I've discussed previously in the resources I linked to [in my first comment above](https://github.com/jehna/humanify/issues/97#issuecomment-2347878686). > > --- > > > PS: I don't have a good machine right now to do some testing myself, nor an API key that allows me to do them properly. > > @neoOpus IMO, the bulk of the 'harder parts' of implementing this aren't really LLM related, and shouldn't require a powerful machine. The areas I would suggest most looking into around this are how AST parsing/manipulation works; and then how to create a robust/stable fingerprinting method. > > IMO, figuring the ideal method of fingerprinting is probably the largest / potentially hardest 'unknown' in all of this currently (at least to me, since while I started to gather resources for it, I haven't had the time to deep dive into reading/analysing them all): > > - https://gist.github.com/0xdevalias/31c6574891db3e36f15069b859065267#fingerprinting-minified-javascript-libraries--ast-fingerprinting--source-code-similarity--etc > - https://gist.github.com/0xdevalias/d8b743efb82c0e9406fc69da0d6c6581#fingerprinting-minified-javascript-libraries > > Off the top of my head, I would probably look at breaking things down to at least an individual module level, as I believe usually modules tend to coincide with original source files; and maybe even break things down even further to a function level if needed; and then generate fingerprints for them. > > I would also potentially consider looking at the module/function 'entry/exit' points (eg. imports/exports); or maybe even the entire 'shape' of the module import graph itself. > > I would also probably be normalising the code to remove any function/variable identifiers and to remove the impact of whitespace differences/etc; before generating any fingerprints on it. > > Another potential method I considered for the fingerprints is identifying the types of elements that tend to remain stable even when minified, and using those as part of the fingerprint. As that is one of the manual methods I used to be able to identify a number of the modules listed here: > > - https://github.com/pionxzh/wakaru/issues/41 > - https://github.com/pionxzh/wakaru/issues/40 > - https://github.com/pionxzh/wakaru/issues/79 > - https://github.com/pionxzh/wakaru/issues/88 > - https://github.com/pionxzh/wakaru/issues/89 > - https://github.com/pionxzh/wakaru/issues/87 > - etc: https://github.com/pionxzh/wakaru/issues?q=%22%5Bmodule-detection%5D%22 > > _Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/97#issuecomment-2372638981_
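To round this out, here is a naive sketch of the "elements that tend to remain stable even when minified" idea mentioned above: fingerprint a module from its string literals, non-computed property names, and regex patterns, all of which minifiers generally leave intact. It assumes `@babel/parser` / `@babel/traverse`, and the particular feature selection is purely illustrative rather than a proven scheme.

```js
// Naive sketch: fingerprint a module from features that minifiers generally
// leave untouched (string literals, non-computed property names, regex patterns).
// Assumes @babel/parser and @babel/traverse; the feature selection is illustrative.
const crypto = require("crypto");
const parser = require("@babel/parser");
const traverse = require("@babel/traverse").default;

function stableFeatureFingerprint(code) {
  const ast = parser.parse(code, { sourceType: "unambiguous", plugins: ["jsx"] });
  const features = [];

  traverse(ast, {
    StringLiteral(path) {
      features.push(`str:${path.node.value}`);
    },
    // Property accesses like `x.localeCompare` keep their property name when minified.
    MemberExpression(path) {
      if (!path.node.computed && path.node.property.type === "Identifier") {
        features.push(`prop:${path.node.property.name}`);
      }
    },
    RegExpLiteral(path) {
      features.push(`re:${path.node.pattern}`);
    },
  });

  // Sort so the fingerprint is insensitive to minor re-ordering, then hash.
  return crypto.createHash("sha256").update(features.sort().join("\n")).digest("hex");
}

console.log(stableFeatureFingerprint('export function greet(n){ return "Hello, " + n.trim(); }'));
```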