# Fingerprinting Minified JavaScript Libraries

## Table of Contents

- [See Also](#see-also)
- [Initial ChatGPT Conversation / Notes](#initial-chatgpt-conversation--notes)
- [Thoughts / comments as I've articulated them elsewhere](#thoughts--comments-as-ive-articulated-them-elsewhere)
  - [On `j4k0xb/webcrack`](#on-j4k0xbwebcrack)
    - [Issue 21: rename short identifiers](#issue-21-rename-short-identifiers)
  - [On `pionxzh/wakaru`](#on-pionxzhwakaru)
    - [Issue 34: support `un-mangle` identifiers](#issue-34-support-un-mangle-identifiers)

## See Also

- https://github.com/pionxzh/wakaru/issues/41
  - > Module detection
- https://github.com/pionxzh/wakaru/issues/34
  - > support `un-mangle` identifiers

## Initial ChatGPT Conversation / Notes

Source: https://chat.openai.com/c/d9b7b64f-aa93-474e-939f-79e376e6d375

**Me:**

> Are there any tools that allow me to identify potentially minified javascript/etc libraries within a webpack or similar bundle?

**ChatGPT:**

> Identifying and fingerprinting minified JavaScript modules within a webpack bundle, regardless of their minification method, can be a challenging task due to the nature of minification and bundling. However, there are approaches and tools you can use to achieve this:
>
> 1. **Custom Fingerprinting Script**: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles. This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns.
>
> 2. **Entropy Analysis**: Minified code typically has higher entropy than non-minified code. By analyzing the entropy of different sections of your bundle, you might be able to identify boundaries between different libraries or modules.
>
> 3. **AST Analysis**: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures.
>
> 4. **Binary Analysis Tools**: Tools like Binwalk, which are traditionally used for analyzing binary files, can sometimes be repurposed for script analysis. They work by scanning files for embedded files and executable code.
>
> 5. **Regular Expression Matching**: This involves creating regular expressions that match known patterns from popular libraries. While this approach can be time-consuming and may require regular updates, it can be effective for well-known and widely used libraries.
>
> 6. **Machine Learning Approaches**: With advancements in machine learning, you can train a model to recognize patterns in minified code. This requires a substantial dataset of minified and unminified code to train the model.
>
> 7. **Community Databases**: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle.
>
> 8. **Source Map Reversal**: If source maps are available, you can reverse-engineer them to get the original source code, making it easier to identify the libraries.
>
> Remember, the effectiveness of these methods depends on various factors, including the degree of minification, the presence of source maps, and the specific bundling process used. Additionally, given the potential complexity, this might be a time-consuming task, but your background in software engineering and ethical hacking should provide a solid foundation for tackling this challenge.

**Me:**

> Let's work with the following ideas:
>
> Custom Fingerprinting Script: Given your skills in IT and software engineering, you might consider writing a custom script to analyze the bundles.
> This script can use heuristics like function and variable name patterns, known code snippets, or specific algorithm implementations to identify libraries. The script can parse the bundle, apply these heuristics, and attempt to match segments of the code with known library patterns.
>
> AST Analysis: Abstract Syntax Tree (AST) analysis involves parsing the JavaScript code into a tree structure, which can then be analyzed. You could use AST to look for specific patterns that match known library code structures.
>
> Community Databases: Sometimes, the community might maintain databases of fingerprints of popular libraries. These can be used to match against your bundle.
>
> --
>
> Let's assume that there isn't currently a community database, but that once we figure out an effective way of creating and matching these fingerprints, that we could open source one.
>
> Since we want this to work regardless of whether the library code has been minified or not, we need to take that into account.
>
> How should we approach this?

**ChatGPT:**

> Creating a system to fingerprint JavaScript libraries within a webpack bundle, especially considering the challenges posed by minification, is a multi-step process. The goal is to create a system that can analyze JavaScript code, extract identifiable patterns, and match these against a database of known library fingerprints. Here's a proposed approach:
>
> ### Establishing a Fingerprinting Methodology
>
> 1. **Identify Unique Patterns**: Begin by identifying patterns that remain consistent in a library, even after minification. These could include:
>    - Unique function signatures.
>    - Specific algorithm implementations.
>    - Characteristic variable naming schemes that survive minification.
>    - Constant values or strings used by the library.
>
> 2. **Heuristics for Detection**: Develop heuristics to detect these patterns, such as specific function name and parameter combinations, unique method chaining styles, or specific API calls.
>
> ### Abstract Syntax Tree (AST) Analysis
>
> 1. **Parse JavaScript Bundle**: Utilize tools like Esprima, Acorn, or Babel to convert JavaScript code into AST, enabling effective analysis of code structure and content.
>
> 2. **Pattern Recognition in AST**: Implement algorithms for matching parsed code against defined fingerprint criteria, focusing on structural and syntactical patterns and heuristics established earlier. This involves looking for specific node types (like function declarations, variable declarations, etc.) that match the identified patterns.
>
> 3. **Handling Minification**: Design heuristics and pattern recognition to be resilient to common minification techniques like variable renaming and whitespace removal.
>
> ### Building and Maintaining the Fingerprint Database
>
> 1. **Collecting Fingerprints**: Start with popular libraries, analyzing unminified sources to establish baseline fingerprints.
>
> 2. **Database Structure and Contributions**: Organize the database with library entries, fingerprints, and metadata, and set up a system for validating and integrating community contributions.
>
> ### Matching Mechanism and Script Integration
>
> 1. **Matching Algorithm**: Develop an algorithm to compare AST nodes against database fingerprints.
>
> 2. **Handling Ambiguities**: Design the system to handle cases where a definitive match isn't possible. This could involve ranking potential matches or providing a probability score.
>
> 3. **Script Functionality**: Ensure the script can process a JavaScript bundle, parse it, and match it against fingerprints. Test the matching algorithm with both minified and unminified versions of known libraries to ensure accuracy.
>
> ### Handling Minification Variations
>
> 1. **Minification Resilience**: Ensure your fingerprints are resilient to common minification techniques, like variable renaming and whitespace removal.
>
> 2. **Testing Across Minifiers**: Test your system with code minified by different tools (e.g., UglifyJS, Terser) to ensure reliability.
>
> ### User Interface and Accessibility
>
> 1. **Ease of Use**: Develop a user-friendly interface for both contributing to the database and using the script to identify libraries.
>
> 2. **Integration with Development Tools**: Consider how your tool could integrate with existing development workflows, such as bundlers or IDEs.
>
> ### Open Sourcing and Community Engagement
>
> 1. **Open Source the Tool**: Release the tool and database as open source, encouraging community contributions of new fingerprints and algorithm improvements.
>
> 2. **Documentation and Support Channels**: Provide thorough documentation and establish channels for community support, feedback, and contributions.
>
> ### Continuous Improvement and Adaptation
>
> 1. **Regular Updates and Algorithm Refinement**: Keep the database up-to-date with new library versions and continuously refine matching algorithms based on user feedback and new JavaScript development trends.
>
> 2. **Performance Optimization**: Optimize for performance, especially when analyzing large bundles, using efficient AST traversal and optimized pattern matching algorithms.
>
> ### Legal and Ethical Considerations
>
> 1. **Respecting Copyright**: Ensure that the use of library fingerprints complies with legal and ethical standards, respecting the intellectual property rights of library authors.
>
> This project aligns well with your background in IT and software engineering. The main challenge lies in the nuances of JavaScript minification and the diverse coding patterns across different libraries, but a focused and iterative approach can lead to a robust and useful tool for the community.
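
To make the 'resilient to variable renaming and whitespace removal' idea from the conversation above a bit more concrete, here is a rough, dependency-free sketch. It is not an AST-based implementation: it swaps in a naive regex tokenizer as a stand-in for a real parser like Acorn or Babel, and the names `normalizeTokens` and `fingerprint` are made up for illustration. It assumes the input contains no comments, regex literals, template strings, or spread syntax. Identifiers are replaced by their order of first appearance, while string contents and property names (which usually survive minification) are kept.

```javascript
const { createHash } = require("node:crypto");

// Reserved words are kept verbatim; any other identifier is treated as renameable.
const KEYWORDS = new Set([
  "break", "case", "catch", "class", "const", "continue", "default", "delete",
  "do", "else", "false", "finally", "for", "function", "if", "in", "instanceof",
  "let", "new", "null", "return", "switch", "this", "throw", "true", "try",
  "typeof", "var", "void", "while",
]);

// Naive tokenizer: string literals, numbers, identifiers/keywords, single punctuation chars.
// Assumption: no comments, regex literals, template strings, or spread syntax in the input.
const TOKEN_RE =
  /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|\d+(?:\.\d+)?|[A-Za-z_$][\w$]*|[^\s\w$]/g;

function normalizeTokens(source) {
  const tokens = source.match(TOKEN_RE) ?? [];
  const seen = new Map(); // identifier -> index of first appearance
  return tokens.map((token, index) => {
    // Keep string contents, but drop the quote style (minifiers may change it).
    if (token[0] === '"' || token[0] === "'") return `S:${token.slice(1, -1)}`;
    if (/^\d/.test(token)) return `N:${Number(token)}`;
    if (/^[A-Za-z_$]/.test(token)) {
      if (KEYWORDS.has(token)) return token;
      // Property names after "." usually survive minification, so keep them.
      if (tokens[index - 1] === ".") return `P:${token}`;
      // Everything else is numbered by first appearance, so renaming does not matter.
      if (!seen.has(token)) seen.set(token, seen.size);
      return `ID${seen.get(token)}`;
    }
    return token; // punctuation / operators
  });
}

function fingerprint(source) {
  const normalized = normalizeTokens(source).join(" ");
  return createHash("sha256").update(normalized).digest("hex").slice(0, 16);
}

const original = `
function clamp(value, min, max) {
  if (value < min) { return min; }
  return Math.min(value, max);
}`;
const minified = "function a(b,c,d){if(b<c){return c;}return Math.min(b,d);}";

console.log(fingerprint(original) === fingerprint(minified)); // true
```

Real minifiers also restructure code (dropping braces, inlining, reordering), so an exact hash like this would only be a baseline; fuzzier matching over AST node shapes would likely be needed in practice.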
## Thoughts / comments as I've articulated them elsewhere

### On `j4k0xb/webcrack`

#### Issue 21: rename short identifiers

> When I was exploring this concept in my own deobfuscation PoC project, I was exploring making the variable names unique + having them add some semantic information about their source/scope.
>
> Eg. if it was an arg to a function, it might be `arg_1`. Or potentially if the function is `foo`, it might end up as `foo_arg_1`
>
> It looks like most of the PoC code I was playing with was local/in a pretty messy/hacky state, but I did find a link in it to an online REPL I was playing around with some of it in. Not sure how outdated that code is, but it might be useful:
>
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples
>
> There were a number of different AST parsers I was playing around with, but I think that this babel code may have been the latest (not sure which one):
>
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3.js
> - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3_clean.js
>
> Within those files, I believe the functions `getNameFromPath`, `getPrefix` (and older commented out functions `getTypePrefix`, `getPrefix`
>
> ---
>
> **Edit:** Came across this in another issue here:
>
> > I published my decompiler that I used in the above example. I think it might be a good reference for adding this feature.
> > https://github.com/e9x/krunker-decompiler
> >
> > _Originally posted by @e9x in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-1546764062_
>
> And looking at its `libRenameVars` code, it seems to be taking a vaguely similar approach to how I was looking at doing things in my original PoC that I described above:
>
> - https://github.com/e9x/krunker-decompiler/blob/master/src/libRenameVars.ts
>   - `getVarPrefix` will set a prefix based on the type (eg. `func`, `arg`, `Class`, `imported`, `var`)
>   - `getName` generates a new variable name that does not conflict with existing names or reserved keywords
>   - `generateName` generates a new name for a variable considering its scope, type, and the context in which it is used (e.g., whether it's a class, a function variable, etc.).
>     It employs various AST manipulations to ensure the generated name is appropriate and does not conflict with existing names.
>
> A more generalised summary/overview (via ChatGPT):
>
> > Certainly, the code implements a sophisticated algorithm for renaming variables in a JavaScript program, adhering to several high-level rules and strategies:
> >
> > 1. **Type-Specific Prefixing**:
> >    - The `getVarPrefix` function assigns specific prefixes to variable names based on their type (e.g., "func" for function names, "arg" for parameters). This approach helps in identifying the role of a variable just by its name.
> >
> > 2. **Avoiding Reserved Keywords**:
> >    - The script includes a comprehensive list of reserved JavaScript keywords. If a variable's name matches a reserved keyword, it is prefixed with an underscore to prevent syntax errors.
> >
> > 3. **Unique Naming with Context Consideration**:
> >    - The `generateName` function ensures that each variable gets a unique name that doesn't conflict with other variables in its scope. It also considers the context in which a variable is used. For example, if a variable is part of a class, it may receive a name that reflects this context, using `pascalCase` or `camelCase` as appropriate.
> >
> > 4. **Handling Special Cases**:
> >    - The script contains logic to handle special cases, such as variables that are function expressions (`isFuncVar`) or class instances (`isClass`). This affects the naming convention applied to these variables.
> >
> > 5. **Randomness with Mersenne Twister**:
> >    - A Mersenne Twister is used to generate random elements for variable names, ensuring that the names are not only unique within the scope of the program but also less predictable.
> >
> > 6. **AST-Based Renaming**:
> >    - The script analyzes the Abstract Syntax Tree (AST) of the program to understand the structure and scope of variables. This analysis guides the renaming process, ensuring that the new names are consistent with the variable's usage and position in the code.
> >
> > 7. **Scope Analysis with ESLint Scope**:
> >    - By leveraging `eslint-scope`, the script can accurately determine the scope of each variable. This is crucial in avoiding name collisions and ensuring that the renaming respects lexical scoping rules in JavaScript.
> >
> > 8. **Consideration for Exported and Assigned Variables**:
> >    - The script pays special attention to variables that are exported or assigned in specific ways (e.g., through `Object.defineProperty`). It ensures that these variables receive names that are appropriate for their roles.
> >
> > In summary, the script uses a combination of type-based naming conventions, context consideration, randomness, AST analysis, and scope analysis to systematically rename variables in a JavaScript program. This approach aims to enhance readability, avoid conflicts, and maintain the logical structure of the program.
>
> _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807290502_

---

> And for an even cooler/more extreme version of improving variable naming; I just came across this blog post / project from @jehna that makes use of `webcrack` + ChatGPT for variable renaming:
>
> - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification.html
>   - > Using LLMs to reverse JavaScript variable name minification
>     > This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at [Github project Humanify](https://github.com/jehna/humanify).
> - https://github.com/jehna/humanify
>   - > Un-minify Javascript code using ChatGPT
>   - > This tool uses large language models (like ChatGPT & llama2) and other tools to un-minify Javascript code. Note that LLMs don't perform any structural changes – they only provide hints to rename variables and functions. The heavy lifting is done by Babel on AST level to ensure code stays 1-1 equivalent.
>
> _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807342986_

---

> I came across another tool today that seemed to have a start on implementing some 'smart rename' features:
>
> - https://github.com/pionxzh/wakaru
> - https://wakaru.vercel.app/
>
> Digging through the code led me to this:
>
> - https://github.com/pionxzh/wakaru/tree/main/packages/unminify#smart-rename
>   - > Rename minified identifiers with heuristic rules.
> - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/src/transformations/smart-rename.ts
>   - `handleDestructuringRename`, `handleFunctionParamsRename`, `handlePropertyRename`, `handleReactRename`, `getElementName`
> - https://github.com/pionxzh/wakaru/blob/main/packages/unminify/src/utils/identifier.ts#L28-L75
>   - `generateName`, `getUniqueName`
> - https://github.com/pionxzh/wakaru/blob/master/packages/unminify/src/transformations/__tests__/smart-rename.spec.ts
>
> There's also an issue there that seems to be exploring how to improve 'unmangling variable names' as well:
>
> - https://github.com/pionxzh/wakaru/issues/34
>
> Which I wrote the following extra thoughts on:
>
> > I just finished up writing some thoughts/references for variable renaming on the `webcrack` repo, that could also be a useful idea for here. (see quotes below)
> >
> > When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first.
> >
> > That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them.
> >
> > It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code.
> >
> > My personal ideal end goal for a feature like that would then allow me to use it within an IDE-like environment, where I can rename variables 'as I explore', knowing that the mappings/etc will be kept up to date.
> >
> > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_
>
> _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807398430_

---

> Another link from my reference notes that I forgot to include earlier; my thoughts on how to rename otherwise unknown variables are based on similar concepts that are used in reverse engineering tools such as IDA:
>
> - https://hex-rays.com/blog/igors-tip-of-the-week-34-dummy-names/
>   - > In IDA’s disassembly, you may have often observed names that may look strange and cryptic on first sight: `sub_73906D75`, `loc_40721B`, `off_40A27C` and more. In IDA’s terminology, they’re called dummy names. They are used when a name is required by the assembly syntax but there is nothing suitable available
> - https://www.hex-rays.com/products/ida/support/idadoc/609.shtml
>   - > IDA Help: Names Representation
>   - > Dummy names are automatically generated by IDA. They are used to denote subroutines, program locations and data. Dummy names have various prefixes depending on the item type and value
>
> ---
>
> > And a few more I was looking at recently as well (that is sort of basically `smart-rename`):
> >
> > - https://binary.ninja/2023/09/15/3.5-expanded-universe.html#automatic-variable-naming
> >   - > Automatic Variable Naming
> >     > One easy way to improve decompilation output is to come up with [better default names](https://github.com/Vector35/binaryninja-api/issues/2558) for variables. There’s a lot of possible defaults you could choose and a number of different strategies are seen throughout different reverse engineering tools. Prior to 3.5, Binary Ninja left variables named based on their origin. Stack variables were var_OFFSET, register-based variables were reg_COUNTER, and global data variables were (data_). While this scheme isn’t changing, we’re being much more intelligent about situations where additional information is available.
> >     >
> >     > For example, if a variable is passed to a function and a variable name is available, we can now make a much better guess for the variable name. This is most obvious in binaries with type libraries.
> >   - > This isn’t the only style of default names. Binary Ninja also will name loop counters with simpler names like `i`, or `j`, `k`, etc (in the case of nested loops)
> > - https://github.com/Vector35/binaryninja-api/issues/2558
> >
> > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1822263687_
>
> _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1822262649_

### On `pionxzh/wakaru`

#### Issue 34: support `un-mangle` identifiers

> I just finished up writing some thoughts/references for variable renaming on the `webcrack` repo, that could also be a useful idea for here. (see quotes below)
>
> When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first.
>
> That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them.
>
> It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code.
>
> My personal ideal end goal for a feature like that would then allow me to use it within an IDE-like environment, where I can rename variables 'as I explore', knowing that the mappings/etc will be kept up to date.
>
> ---
>
> > When I was exploring this concept in my own deobfuscation PoC project, I was exploring making the variable names unique + having them add some semantic information about their source/scope.
> >
> > Eg. if it was an arg to a function, it might be `arg_1`. Or potentially if the function is `foo`, it might end up as `foo_arg_1`
> >
> > It looks like most of the PoC code I was playing with was local/in a pretty messy/hacky state, but I did find a link in it to an online REPL I was playing around with some of it in. Not sure how outdated that code is, but it might be useful:
> >
> > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples
> >
> > There were a number of different AST parsers I was playing around with, but I think that this babel code may have been the latest (not sure which one):
> >
> > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3.js
> > - https://replit.com/@0xdevalias/Rewriting-JavaScript-Variables-via-AST-Examples#babel_v1_3_clean.js
> >
> > Within those files, I believe the functions `getNameFromPath`, `getPrefix` (and older commented out functions `getTypePrefix`, `getPrefix`
> >
> > ---
> >
> > **Edit:** Came across this in another issue here:
> >
> > > I published my decompiler that I used in the above example. I think it might be a good reference for adding this feature.
> > > https://github.com/e9x/krunker-decompiler
> > >
> > > _Originally posted by @e9x in https://github.com/j4k0xb/webcrack/issues/10#issuecomment-1546764062_
> >
> > And looking at its `libRenameVars` code, it seems to be taking a vaguely similar approach to how I was looking at doing things in my original PoC that I described above:
> >
> > - https://github.com/e9x/krunker-decompiler/blob/master/src/libRenameVars.ts
> >   - `getVarPrefix` will set a prefix based on the type (eg. `func`, `arg`, `Class`, `imported`, `var`)
> >   - `getName` generates a new variable name that does not conflict with existing names or reserved keywords
> >   - `generateName` generates a new name for a variable considering its scope, type, and the context in which it is used (e.g., whether it's a class, a function variable, etc.).
> >     It employs various AST manipulations to ensure the generated name is appropriate and does not conflict with existing names.
> >
> > A more generalised summary/overview (via ChatGPT):
> >
> > > Certainly, the code implements a sophisticated algorithm for renaming variables in a JavaScript program, adhering to several high-level rules and strategies:
> > >
> > > 1. **Type-Specific Prefixing**:
> > >    - The `getVarPrefix` function assigns specific prefixes to variable names based on their type (e.g., "func" for function names, "arg" for parameters). This approach helps in identifying the role of a variable just by its name.
> > >
> > > 2. **Avoiding Reserved Keywords**:
> > >    - The script includes a comprehensive list of reserved JavaScript keywords. If a variable's name matches a reserved keyword, it is prefixed with an underscore to prevent syntax errors.
> > >
> > > 3. **Unique Naming with Context Consideration**:
> > >    - The `generateName` function ensures that each variable gets a unique name that doesn't conflict with other variables in its scope. It also considers the context in which a variable is used. For example, if a variable is part of a class, it may receive a name that reflects this context, using `pascalCase` or `camelCase` as appropriate.
> > >
> > > 4. **Handling Special Cases**:
> > >    - The script contains logic to handle special cases, such as variables that are function expressions (`isFuncVar`) or class instances (`isClass`). This affects the naming convention applied to these variables.
> > >
> > > 5. **Randomness with Mersenne Twister**:
> > >    - A Mersenne Twister is used to generate random elements for variable names, ensuring that the names are not only unique within the scope of the program but also less predictable.
> > >
> > > 6. **AST-Based Renaming**:
> > >    - The script analyzes the Abstract Syntax Tree (AST) of the program to understand the structure and scope of variables. This analysis guides the renaming process, ensuring that the new names are consistent with the variable's usage and position in the code.
> > >
> > > 7. **Scope Analysis with ESLint Scope**:
> > >    - By leveraging `eslint-scope`, the script can accurately determine the scope of each variable. This is crucial in avoiding name collisions and ensuring that the renaming respects lexical scoping rules in JavaScript.
> > >
> > > 8. **Consideration for Exported and Assigned Variables**:
> > >    - The script pays special attention to variables that are exported or assigned in specific ways (e.g., through `Object.defineProperty`). It ensures that these variables receive names that are appropriate for their roles.
> > >
> > > In summary, the script uses a combination of type-based naming conventions, context consideration, randomness, AST analysis, and scope analysis to systematically rename variables in a JavaScript program. This approach aims to enhance readability, avoid conflicts, and maintain the logical structure of the program.
> >
> > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807290502_
>
> ---
>
> > And for an even cooler/more extreme version of improving variable naming; I just came across this blog post / project from @jehna that makes use of `webcrack` + ChatGPT for variable renaming:
> >
> > - https://thejunkland.com/blog/using-llms-to-reverse-javascript-minification.html
> >   - > Using LLMs to reverse JavaScript variable name minification
> >     > This blog introduces a novel way to reverse minified Javascript using large language models (LLMs) like ChatGPT and llama2 while keeping the code semantically intact. The code is open source and available at [Github project Humanify](https://github.com/jehna/humanify).
> > - https://github.com/jehna/humanify
> >   - > Un-minify Javascript code using ChatGPT
> >   - > This tool uses large language models (like ChatGPT & llama2) and other tools to un-minify Javascript code. Note that LLMs don't perform any structural changes – they only provide hints to rename variables and functions. The heavy lifting is done by Babel on AST level to ensure code stays 1-1 equivalent.
> >
> > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1807342986_
>
> _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_

---

> > For now, we have `smart-rename` that can guess the variable name based on the context. I would like to expand it to cover some other generic cases.
> > Linking to my `smart-rename` related issues to keep the contextual link here: > > - https://github.com/pionxzh/wakaru/issues/40 > - https://github.com/pionxzh/wakaru/issues/48 > - https://github.com/pionxzh/wakaru/issues/49 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1818578969_ --- > > Another link from my reference notes that I forgot to include earlier; my thoughts on how to rename otherwise unknown variables are based on similar concepts that are used in reverse engineering tools such as IDA: > > > > - https://hex-rays.com/blog/igors-tip-of-the-week-34-dummy-names/ > > - > In IDA’s disassembly, you may have often observed names that may look strange and cryptic on first sight: `sub_73906D75`, `loc_40721B`, `off_40A27C` and more. In IDA’s terminology, they’re called dummy names. They are used when a name is required by the assembly syntax but there is nothing suitable available > > - https://www.hex-rays.com/products/ida/support/idadoc/609.shtml > > - > IDA Help: Names Representation > > - > Dummy names are automatically generated by IDA. They are used to denote subroutines, program locations and data. Dummy names have various prefixes depending on the item type and value > > > > _Originally posted by @0xdevalias in https://github.com/j4k0xb/webcrack/issues/21#issuecomment-1822262649_ > > --- > > And a few more I was looking at recently as well (that is sort of basically `smart-rename`: > > - https://binary.ninja/2023/09/15/3.5-expanded-universe.html#automatic-variable-naming > - > Automatic Variable Naming > > One easy way to improve decompilation output is to come up with [better default names](https://github.com/Vector35/binaryninja-api/issues/2558) for variables. There’s a lot of possible defaults you could choose and a number of different strategies are seen throughout different reverse engineering tools. Prior to 3.5, Binary Ninja left variables named based on their origin. 
Stack variables were var_OFFSET, register-based variables were reg_COUNTER, and global data variables were (data_). While this scheme isn’t changing, we’re being much more intelligent about situations where additional information is available. > > > > For example, if a variable is passed to a function and a variable name is available, we can now make a much better guess for the variable name. This is most obvious in binaries with type libraries. > - > This isn’t the only style of default names. Binary Ninja also will name loop counters with simpler names like `i`, or `j`, `k`, etc (in the case of nested loops) > - https://github.com/Vector35/binaryninja-api/issues/2558 > > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1822263687_ --- > Was looking closer at the `sourcemap` spec today, and the `names` field jumped out at me as potentially useful: > > - https://tc39.es/source-map-spec/#names > - > `names`: a list of symbol names used by the `mappings` entry > - https://tc39.es/source-map-spec/#mappings > - > `mappings`: a string with the encoded mapping data (see _4.1 Mappings Structure_) > - https://tc39.es/source-map-spec/#mappings-structure > - > The `mappings` data is broken down as follows: > > > > - each group representing a line in the generated file is separated by a semicolon (`;`) > > - each segment is separated by a comma (`,`) > > - each segment is made up of `1`, `4`, or `5` variable length fields. > - It then goes on to describe the segment's in greater detail, but the specific part I was thinking could be relevant here would be this: > - > If present, the zero-based index into the `names` list associated with this segment. This field is a base 64 VLQ relative to the previous occurrence of this field unless this is the first occurrence of this field, in which case the whole value is represented. > > Obviously if there is a full `sourcemap` for the webapp, then `wakaru` isn't really needed anyway.. 
but what I was thinking of here is that in combination with module detection (see https://github.com/pionxzh/wakaru/issues/41), if there are `sourcemap`s available for that original module, then we could potentially extract the original function/variable/etc names from the `names` field of the `sourcemap`, and use them in a sort of 'smart-rename with sourcemap' type way.
>
> ---
>
> Another `sourcemap` related idea I had (which probably deserves its own issue) is that it would be cool to be able to 'retroactively generate a `sourcemap`' for a webapp, based on the unminified output from `wakaru`; such that we could then take that `sourcemap`, and apply it to the original minified web app source for debugging the live app.
>
> **Edit:** Created a new issue to track this:
>
> - https://github.com/pionxzh/wakaru/issues/66
>
> _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1837858619_

---

> > It isn't very meaningful to support such a feature when you can access all the source code.
>
> @pionxzh I was specifically talking about it in terms of bundled modules (eg. React, etc), and not the unique web app code of the app itself.
>
> _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1839791102_

---

> > You mean like, for popular open-source projects, we can put some sourcemap in our project / read from the chunk, and then reverse map the minified variable and function name back to normal?
>
> @pionxzh Similar to that, but probably not "put the sourcemap in our project" directly; but more process the sourcemaps from popular open-source projects and extract those details to an 'intermediary form'.
That 'intermediary form' would be similar to the 'module map' file, as I described earlier in this thread:
>
> > When I was exploring PoC ideas for my own project previously, I was looking to generate a file similar to the 'module map' that this project is using; but instead of just for the names of modules, I wanted to be able to use it to provide a 'variable name map'. Though because the specific variables used in webpack/etc can change between builds, my thought was that first 'normalising' them to a 'known format' based on their context would make sense to do first.
> >
> > That could then be later enhanced/expanded by being able to pre-process these 'variable name mappings' for various open source projects in a way that could then be applied 'automagically' without the end user needing to first create them.
> >
> > It could also be enhanced by similar techniques such as what the `humanify` project does, by using LLMs/similar to generate suggested variable name mappings based on the code.
> >
> > _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1807393509_
>
> ---
>
> > A configuration table/profile can be provided to allow users to manually write correspondences. wakaru can simply include the rules of the better known packages.
>
> @StringKe _nods_, sounds like we are thinking about similar things here :)
>
> ---
>
> > Can you specify the content that you would expect to have? and the corresponding behavior
>
> @pionxzh For me personally, I haven't deeply thought through all the use cases in depth, but at a high level I basically want to be able to take a web app that is going to be re-built multiple times, and be able to have a 'config file' similar to the 'module mapping' that wakaru has/had; but that also allows me to specify the variable/function names ('symbols') that are used within it.
>
> The slightly more challenging part is that because the app will be re-built multiple times, the minified variables will change (sometimes every build), so we can't easily use those as the 'key' of the mapping. One idea I had for solving that is potentially by first renaming all of the variables based on a 'stable naming pattern' (eg. `func_*`, `arg_*`, `const_*`, etc; and then could just use a counter/similar based on the 'scope' it's being defined in) that would be generated based on the scope/type of the 'symbol', and would therefore be resilient to the minified variable names changing each build. Those 'stable intermediary names' could then potentially be used for the keys in the variable mapping.
>
> Though then we also need to figure out what level of 'granularity' makes sense to generate those 'stable intermediary names' at; as having a 1:1 mapping of those 'stable name scopes' to JS scopes could potentially end up being really noisy in the mapping file. So maybe using a 'higher abstracted scope' would make more sense (eg. at the module level or similar)
>
> My original hacky implementation of this in my own PoC code was using JS objects/JSON to map an explicit minified variable name to its 'proper' name; but that broke because the minified names changed between builds. Even by implementing the 'stable naming pattern', if those 'stable names' included a 'counter' in them (eg. `func_1`, `const_8`, etc) we still probably wouldn't want to use those stable names directly as the key of an object, as if a new variable was added 'in between' in a later build, that would flow on to 'shifting' the 'counter' for every variable of a matching type afterwards, which would be a lot of effort to manually update in a mapping file. While I haven't thought too deeply about it, I think that by using an array in the mapping file, it should simplify things so that we only need to make a small change to 'fix the mappings' when a new variable is added that 'shifts' everything.
>
> Even by using the array concept in the mappings file, there is still some manual pain/effort involved in trying to keep the mapping 'up to date' in newer builds. That's what led me into some of the deeper/more esoteric ideas/thinking around 'fingerprinting' that I expand on below.
>
> --
>
> Another area I started looking into (but haven't deeply explored yet) for both figuring out how to map variable names to sections of code in a 'smart' way, and potentially also for module identification (see #41); is in the space of 'structural AST fingerprinting' or 'code similarity' algorithms and similar. (I realise that this is a rather deep/esoteric angle to be looking at this from, and that there are likely going to be far simpler/easier ways to implement the variable mapping/module identification in a 'good enough' way without going to this level of depth; but I'm curious to explore it regardless, to see if any good ideas come out of it)
>
> I haven't gotten too far in my reading yet (got distracted on other things), but the high level of my idea was that maybe we could generate an 'AST fingerprint' that isn't impacted by the variable/function/etc names ('symbols') changing during minification; and then use that as the basis for the 'key' in the 'mappings file'; as that fingerprint could theoretically still identify a 'scope' (which might be a literal JS scope, or might be a higher level abstraction that we decide makes sense; the most abstract being probably at the bundled module level) even if the bundler decides to move some functions around to a different module/etc. Then obviously if we were able to generate those 'resilient fingerprints' to identify code even when it's been minified, that would make perfect sense to apply to module detection/etc (see #41) as well.
>
> Some of the high level ideas / search terms that I was using to start my research in that area were things like:
>
> - AST fingerprinting
> - Source code similarity fingerprinting
> - Control flow graphs
> - Call flow graphs
> - Program dependence graph
> - etc
>
> Here is a link dump of a bunch of the tabs I have open but haven't got around to reviewing in depth yet, RE: 'AST fingerprinting' / Code Similarity / etc:
>
> Unsorted/Unreviewed Initial Link Dump RE: 'AST fingerprinting' / Code Similarity
>
> - https://openreview.net/forum?id=BJxWx0NYPr
>   - > On the Effectiveness of Program Analysis in AI and Vice Versa
> - https://dl.acm.org/doi/10.1145/3486860
>   - > A Unified Framework for Code Representation Learning
> - https://inria.hal.science/hal-01648996/document
>   - > A Survey of Program Slicing Techniques
> - https://hal.science/hal-00627811/document
>   - > From Program Slicing to Abstract Interpretation
> - https://igm.univ-mlv.fr/~chilowi/research/syntax_tree_fingerprinting/syntax_tree_fingerprinting_ICPC09.pdf
>   - > Syntax Tree Fingerprinting for Source Code Similarity Detection
> - https://ieeexplore.ieee.org/document/9960266
>   - > Exploring the Landscape of Deep Learning for Source Code Analysis
> - https://en.wikipedia.org/wiki/Program_dependence_graph
>   - > Program Dependence Graph - Wikipedia
> - https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=3437&context=theses
>   - > Enhancing Software Reliability Using Program Analysis Techniques
> - https://dl.acm.org/doi/pdf/10.1145/1286821.1286826
>   - > Static Analysis of Executables to Detect Malicious Patterns
> - https://patents.google.com/patent/US9459861B1/en
>   - > US Patent US9459861B1: Techniques for Code Analysis and Optimization
> - https://www.unomaha.edu/college-of-information-science-and-technology/research-labs/_files/software-nsf.pdf
>   - > Software Engineering Research and Practice: NSF Workshop Report
> - https://www.computer.org/csdl/journal/ts/2023/08/10125077/1Nc4Vd4vb7W
>   - > Advances in Automated Program Repair: IEEE Transactions
> - https://stackoverflow.com/questions/7283702/assembly-level-function-fingerprint
>   - > Stack Overflow: Assembly-Level Function Fingerprinting
> - https://en.wikipedia.org/wiki/Control-flow_graph
>   - > Control-Flow Graph - Wikipedia
> - https://www.kapravelos.com/publications/fptechniques-www23.pdf
>   - > Novel Techniques for Malware and Vulnerability Analysis in Web Applications
> - https://pubmed.ncbi.nlm.nih.gov/12197666/
>   - > PubMed: Program Analysis Techniques for Software Debugging
> - https://stackoverflow.com/questions/15087195/data-flow-graph-construction
>   - > Stack Overflow: Data Flow Graph Construction in Programming
> - https://www.cs.columbia.edu/~suman/secure_sw_devel/Basic_Program_Analysis_CF.pdf
>   - > Basic Program Analysis Techniques for Secure Software Development
> - https://codeql.github.com/docs/writing-codeql-queries/about-data-flow-analysis/
>   - > CodeQL Guide: Understanding Data Flow Analysis for Secure Software Development
> - https://codereview.stackexchange.com/questions/276387/call-flow-graph-from-python-abstract-syntax-tree
>   - > Code Review Stack Exchange: Call Flow Graph from Python AST
> - https://clang.llvm.org/docs/DataFlowAnalysisIntro.html
>   - > LLVM Clang Documentation: Introduction to Data Flow Analysis
> - https://www.researchgate.net/figure/AST-based-program-graph-with-additional-control-and-data-flow-edges_fig2_370980383
>   - > ResearchGate: AST-Based Program Graphs with Control and Data Flow Edges
>
>
> --
>
> Another idea I've had, but only lightly explored so far, is looking into how various projects like Terser, Webpack, etc choose their minified variable names in general; but also how they handle 'stable minified variables' between builds (which is something that I know at least Webpack has some concept of). My thought there is that by understanding how they implement their own 'stable minified variables between builds', we might be able to leverage that to either a) do similar, or b) be able to reverse engineer that in a way that might be able to be 'retroactively applied' on top of an existing minified project that didn't use 'stable minified variables', to 'stabilise' them.
>
> _Originally posted by @0xdevalias in https://github.com/pionxzh/wakaru/issues/34#issuecomment-1843850057_
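
---

As a rough illustration of the 'stable naming pattern' and 'AST fingerprint' ideas from the comments above, here is a minimal sketch. It is not how `wakaru` or `webcrack` work, just one possible reading of the idea. It assumes an ESTree-style AST; the small hand-written trees stand in for real parser output, and `canonicalise`, `fingerprint`, `binaryFunction`, and the `id_N` placeholder scheme are all hypothetical names made up for this example. Each identifier is replaced by a placeholder based on the order it is first seen, so the resulting hash should not change when a minifier renames symbols.

```javascript
const { createHash } = require("node:crypto");

// Position-related keys that differ between builds and carry no structure.
const IGNORED_KEYS = new Set(["loc", "start", "end", "range"]);

// Serialise an ESTree-style node into a canonical string. Identifier names
// are swapped for stable placeholders (id_0, id_1, ...) assigned in order of
// first appearance during this (deterministic) traversal.
function canonicalise(node, names = new Map()) {
  if (Array.isArray(node)) {
    return "[" + node.map((child) => canonicalise(child, names)).join(",") + "]";
  }
  if (node === null || typeof node !== "object") {
    return JSON.stringify(node); // literal values, operators, flags
  }
  if (node.type === "Identifier") {
    if (!names.has(node.name)) names.set(node.name, `id_${names.size}`);
    return names.get(node.name);
  }
  const fields = Object.keys(node)
    .filter((key) => !IGNORED_KEYS.has(key))
    .sort() // fixed key order keeps the traversal deterministic
    .map((key) => `${key}:${canonicalise(node[key], names)}`);
  return "{" + fields.join(",") + "}";
}

// Hash the canonical form to get a compact, comparable fingerprint.
function fingerprint(ast) {
  return createHash("sha256").update(canonicalise(ast)).digest("hex");
}

// Hypothetical helpers that build the AST for:
//   function <fnName>(<left>, <right>) { return <left> <operator> <right>; }
const id = (name) => ({ type: "Identifier", name });
function binaryFunction(fnName, left, right, operator) {
  return {
    type: "FunctionDeclaration",
    id: id(fnName),
    params: [id(left), id(right)],
    body: {
      type: "BlockStatement",
      body: [
        {
          type: "ReturnStatement",
          argument: { type: "BinaryExpression", operator, left: id(left), right: id(right) },
        },
      ],
    },
  };
}

const original = binaryFunction("add", "a", "b", "+");
const minified = binaryFunction("n", "t", "e", "+");
const different = binaryFunction("n", "t", "e", "-");

console.log(fingerprint(original) === fingerprint(minified)); // true
console.log(fingerprint(original) === fingerprint(different)); // false
```

This only handles renaming. Any structural rewrite a minifier performs (inlining, `!0` in place of `true`, reordered statements) would change the hash, and the sketch also replaces property names, which minifiers usually leave alone and which could be useful signal to keep. A fingerprint computed per function or per module in this way could serve as the 'key' in a mappings file instead of a counter-based name.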