GSByeon · July 21, 2019 01:56 · Feb 17, 2019 · Feb 17, 2019
diff --git a/.md → ...e_library_functions_in_executable_code.md b/.md → ...e_library_functions_in_executable_code.md
diff --git a/.md b/.md
@@ -0,0 +1,762 @@
+# Searching statically-linked vulnerable library functions in executable code 
+
+번역 [Project Zero: Searching statically-linked vulnerable library functions in executable code
+](https://googleprojectzero.blogspot.com/2018/12/searching-statically-linked-vulnerable.html)
+
+## Summary
+
+취약한 라이브러리가 **static link** (정적 링크)된 유사한 파일들을 binary 레벨에서 탐지하기가 어렵다
+
+이 글에선 바이너리 파일을 정적으로 분석하여 **취약한 오픈 소스 라이브러리** 를 탐지하는 결과를 나타내고 있다.
+
+## Technique
+
+오픈소스 라이브러리의 취약점이 자주 발견된다. 이를 사용할 때 위험하므로 주의해야한다.
+
+dynamic linked 라이브러리를 패치할 때에는 라이브러리만 교체하면 되지만 static linked 라이브러리를 패치하기 위해서는 링킹된 바이너리를 업데이트 해야한다.
+
+굳이 새로운 취약점을 찾으려고 하지 않아도 타겟에서 취약점을 쉽게 찾을수 있는 **기회**이다.
+
+이와 관련된 도구가 부족하여 **도전**이기도 하다.
+
+##### static link library를 탐지하는 가장 일반적인 방법
+
+* 고유한 문자열 검색
+* 추측
+* `BinDiff` 툴 사용
+
+
+
+# 1. Tech Stack and Algorithms for Implementation
+
+## Tech Problem
+
+비교적 큰 크기로 효율적인 `fuzzy search`  
+
+`Fuzzy Search`가 필요한 이유 : compiler의 차이(다양성), 최적화(optimization) 변화, 코드 변화에 따라 noise가 발생하기 때문이다.
+
+## Algorithm 1 - Fuzzy Search
+
+맞춤법 검사나 맞춤법 오류 교정에 사용되는 알고리즘.
+
+정확한 키워드로 검색하지 않아도 유의어 검색시  **관련이 깊은/유사한** 결과를 탐색해준다.
+
+Example : Misissippi -> Mississippi
+
+**유사한** 결과와 관련된 유사 검색 알고리즘이므로 이 연구 진행에 적합하다.
+
+정적이고 정확한 검색보다 효율적이다.
+
+## Fuzzy Matching
+
+컴파일러와 컴파일 옵션, 최적화에 따라 CFG가 다르게 변화한다.
+
+**code duplication**, **instruction movement** and **scheduling**으로 인해 disassembly들이 다양해진다.
+
+함수의 변화를 식별해야하기 때문에 필요하다.
+
+## Algorithm 2 - SimHash
+
+검색엔진에서 중복된 문서를 제거할 때 사용하는 알고리즘.
+
+`Locality Sensitive Hashing`알고리즘 과정의 일부분이다.
+
+유사한 코드를 탐지할때 필요하다.
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e77vivej30ka0cyt95.jpg)
+
+* feature의 n 비트가 0이면 n번째 float point에서 1을 빼고 1이면 더한다
+* float point vector -> 128bit vector로 변환 : +는 1로, -는 0으로 변경 
+
+계산한 fingerprint(hash)를 XOR 후 POPCNT 를 수행하면 `hamming distance`를 구할수 있다.
+
+### feature(특징)로 사용 할만한 요소
+
+유사하다고 탐지할 **특징**이 필요하다.
+* cfg의 sub graph
+
+* asm mnemonic의 ngram(연속된 집합)
+
+* ~~함수 prologue~~
+
+함수의 prologue는 유사함수 판별에 사용할 수 없다.
+
+## Approximate Nearest-Neighbor Search
+
+가장 근접한 해시를 간단히 근사하는 방법
+
+`Locality Sensitive Hashing` - 함수에 대한 similarity-preserving hash를 계산하여 크기가 정해지지 않은 **가장 유사한 해시**를 찾기위해 사용
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e93ger4j30eh05ktc0.jpg)
+
+![image-20190112042431254](https://ws3.sinaimg.cn/large/006tKfTcgy1g09e78pgogj308c05q0t6.jpg)
+
+
+
+랜덤한 선을 그어 파티션을 나눈후 파티션을 bucket에 넣고 분류한다.
+
+
+
+### Algorithm 3 - LSH(Locality Sensitive Hashing)
+
+`ANNS(Approximate Nearest-Neighbor Search)`의 종류이다. 
+
+벡터 공간을 낮은 차원의 **공간으로 분할**하여 hash를 구하고 data bucket에 담아 유사도를 판별한다.
+
+차원이 높은 data set들을 낮은 차원으로 변환 한다.
+
+일정 범위내의 점들을 한 버킷으로 분류한다.
+
+#### Random bits for LSH
+
+bit vector를 입력하므로 이 비트들을 **subsample**한다. -> 비트의 **순열**을 구한다.
+
+k개의 서로 다른 해시를 구하기 위해 k개의 순열을 구한다. 128bit 순열은 코스트가 크지 않아 적당하다.
+
+
+
+## Data Structure
+
+```
+<PermutationIndex, k-th-permutation-of-input-hash, result-id>
+<k, perm_k(input) & (0xFFL << 56), 0>
+```
+binary search를 k번 수행하여 얻은 hash bucket들을 후보에 넣으면 이들과 입력된 hash간의 hamming distance를 구할수 있다. 
+
+메모리와 캐시의 가장 효율적인 버전은 정렬된 flat array / vector를 사용한다. 
+
+추가를 위해선  `std::set`가 효율적이고 읽기가 많을 경우 정렬된 벡터가 효율적이다.
+
+
+
+## Problem in SimHash
+
+입력 셋의 모든 feature의 중요도가 같게 처리되는 문제가 있었다.
+
+다행히 float point vector에 +1 -1하는 대신에 weight(가중치)를 더하거나 빼는 방법으로 해결할 수 있다.
+
+
+
+## Cheap Gradient Principle
+
+ML은  `Automatic Differentiation`(미분 자동화)가 핵심이다.  ->  `Cheap Gradient Principle`에 적용시 도움이 된다. 
+
+오차함수인  `Loss Function`의 값을 **최소화** 해야한다. -> 가중치가 포함된 함수를 구해야한다.
+
+`같다` 혹은 `같지 않다` 로 분류된 데이터로 효율적인 loss function을 지정할 수 있다.
+
+
+
+## Loss Function for SimHash Distance
+
+거리는 두 비트 벡터 사이의 해밍 거리이므로 기울기가 0일 확률이 높다.
+
+간단한 해결법은 해시를 계산할때 마지막 과정을 제외하는 것이다.
+
+Hash 비교 대신 유클리드 거리 측정으로 대체한다. -> 비효율적
+
+가장 간단한 해결은 유사한 두 함수의 가중치를 0으로 줄이는 것.
+
+### Penalize
+
+거리가 멀때 함수가 유사하다고 판별하거나 거리가 가까울 때 유사하지 않다고 판별하면 `penalize`해야한다. 
+
+### 함수의 조건
+
+두 값이 **같은** 부호 -> - 함수 or 0 함수
+
+두 값이 **다른** 부호 -> + 함수. 
+
+기울기 / 인센티브 : **같은** 부호의 방향으로 입력을 이동
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e945o6aj305g01emwz.jpg) <- 유도된 함수
+
+
+
+### g(x,y) = -xy/sqrt((x^2)*(y^2)+1) + 1
+
+![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e951odvj305l03zgm0.jpg)
+
+x와 y의 부호가 다를 때 손실값이 높고  같으면 손실이 0이된다.
+
+**문제점** : 그래프가 **평평**하여 값이 일정하다.
+
+
+
+### g(x,y) * d(x,y) // d(x,y) = sqrt(x-y+0.01)
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e95zkkhj305n03x3yv.jpg)
+
+조건에 맞는 함수를 구하였다.
+
+파라미터를 조정할수 있게 되었다.
+
+
+
+# 2. Get Data For Training
+
+## Generating Training Data
+
+* 데이터 생성이 간단해야함.
+* **compiler의 다양한 version과 다양한 option**을 설정하여 오픈 소스 코드 컴파일
+  * 탐색할 라이브러리가 어떤 컴파일 옵션과 버전으로 컴파일되어 있는지 모르기 때문에 가능한한 비교군을 많이 만들어야한다.
+* **symbol info parsing** - 변화한/다양한 함수들의 그룹을 만들때 필요
+* 컴파일 옵션에 따른 다양한 함수를 **그룹화**
+* 유사하지 않은 쌍은 다른 기호의 다른 함수로 정의할 수 있음.
+
+하지만  **symbol parsing** 과 **CFG 재구성**에 많은 구현 문제가 존재
+
+
+
+### Issue 1 - Symbols
+
+PDB 파일을 parsing하는 cross platform 도구가 없다.
+
+GCC, CLANG 대체한다면 해결할 수 있지만 Visual Studio와 함께 빌드하지 않는 프로젝트는 더 적다. -> 대체가 불가능하다.
+
+VS와 GCC로 같은 codebase를 안정적으로 만드는 것을 **포기**하였다.
+
+**C++ mangling convention**이 컴파일러마다 다르다. -> 같은 함수의 이름이 다르다.
+
+Type info를 제거하고 hackish한 툴로 세 컴파일러들의 표기법을 통일하여 해결할 수 있다.
+
+
+
+### Issue 2 - CFG, Polluted Data Sets
+
+디스어셈블러에서 switch문을 처리할떄 오류가 발생 : 함수 잘림, basic block 할당 문제
+`-fPIE`와 `-fPIC`로 컴파일된 gcc binary는 더 문제가 많다. 요즘 linux 시스템은 기본값으로 설정되어있다.
+
+**Stack Cookie Check** 할때 return을 하지 않아 디스어셈블러의 CFG에서 문제가 발생한다.
+더 안정적인 disassembly로 해결할수 있지만 `-fno-PIE`  `-fno-PIC` 옵션으로 해결할 수 밖에 없다.
+
+
+
+## Real Data Generation
+
+`testdata/generate_training_data.py` 스크립트로 생성할수 있다.
+
+`testdata/ELF` 와 `testdata/PE` 폴더에 있는 모든 binary file을 분석한다. 
+
+
+
+### Extract Function Symbols
+
+ELF : DWARF debug info가 있을 경우 `objdump`로 추출
+
+PE : Linux에서 PDB 분석 방법을 찾을수 없었음. `DIA2Dump` 툴 사용 
+
+
+
+### Result File
+
+EXE ID : SHA256
+```
+./extracted_symbols_<EXEID>.txt
+[exe ID] [exe path] [function address] [base64 encoded symbol] false
+```
+```
+./functions_<EXEID>.txt
+[exe ID]:[function address] [sequence of 128-bit hashes per feature]
+```
+```
+./[training|validation]_data_[seen|unseen]/attract.txt
+./repulse.txt
+[exe ID]:[function address] [exe ID]:[function address]
+```
+
+## Split Training/Validation Data
+
+Training/Validation을 따로 나누어 생성한 이유 : 데이터를 먼저 학습한 다음 **분리** 시킨후 나머지를 탐지 해야하기 때문이다.
+
+![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e96grivj3085082q3h.jpg)
+
+
+
+```
+(1) 학습한 함수들을 기반으로 변화를 탐지
+```
+* 함수의 변화를 분리
+
+* 나머지를 학습 
+
+* 분리한 함수 validation
+
+
+
+```
+(2) 학습동안 함수를 사용할수 있는 버전이 없더라도 함수의 변화를 탐지
+```
+* 가장 유용하지만 현실적으로 불가능.
+* training/validation data를 Function group(같은 함수 코드가 변화한 집합)에 따라 나누어야 한다. 
+* Function group을 분리 후 다른 function group을 학습
+  분리한 그룹을 validation해야 한다.
+* 
+
+# 3. Training
+
+## Traning Issue
+
+#### 병렬 자동화, GPU offload
+
+* `TensorFlow` 
+* `Julia` + `AutoDiff`
+
+하지만 dependency를 줄이기위해 C++을 사용하여 loss function 설정.  
+#### SPII
+
+[SPII](https://github.com/PetterS/spii) C++라이브러리 사용하여 최소화
+
+* 장점 : 깔끔하고 좋은 프로그래밍 model
+* 단점 : CPU만 사용. GPU로 대체 필요.
+
+
+
+## Real Training
+
+```
+thomasdullien@machine-learning-training:~/sources/functionsimsearch/bin$ ./trainsimhashweights -data=/mnt/training_data/training_data_seen/ —weights=weights_seen.txt
+```
+`L-BFGS`가 500번 반복 됐고 학습을 20번할떄마다 snapshot 생성됨.
+
+### L-BFGS
+
+[Limited-memory BFGS - Wikipedia](https://en.wikipedia.org/wiki/Limited-memory_BFGS)
+
+`Quasi-Newton` 방법중 하나로 **최적화** 알고리즘이다.
+
+### Quasi-Newton
+
+[Quasi-Newton method - Wikipedia](https://en.wikipedia.org/wiki/Quasi-Newton_method)
+
+Hessian 행렬을 미분 없이 함수값으로 근사하는 방법이다.
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e96zl0bj308l05c747.jpg)
+
+그래프는 x축은 스텝 y축은 유사한 함수와 유사하지 않은 함수의 비트(거리) 차이를 나타낸다.
+420 스텝이후로 그래프가 감소.
+
+더이상 최적화의 필요가 없어짐.
+
+거리 차이가 10비트에서 25비트로 향상 되었다. -> 함수의 변화를 인식하는 학습이 적용되었다.
+
+
+
+# 4. Result Analysis
+
+## Training Results
+
+1. 고차원 시각화하여 분석 -  `t-SNE` or `MDS` 사용 
+2. **AUC** (Area-under-ROC-curve) 분석
+3. 구한 feature **가중치**를 분석하여 학습결과 분석
+
+## t-SNE VIsualisation
+
+T-distributed Stochastic Neighbor Embedding
+
+
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e97jpdnj30ay05qdgr.jpg)
+
+고차원을 저차원으로 줄이는 차원감소 시각화.
+
+T 분포를 이용하여 유사도가 비슷한것끼리 묶어 시각화하는 방법이다.
+
+```bash
+./plot_function_groups.py ../bin/symbols.txt ../bin/unit_index.txt /tmp/unit_features.html
+./plot_function_groups.py ../bin/symbols.txt ../bin/learnt_index.txt /tmp/learnt_features.html
+```
+
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e98ah01j306e06z0sx.jpg)![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e99kinij30670700sx.jpg)
+
+
+
+모든 feature가 학습효과를 보이진 않았다. 몇몇 feature는 학습효과가 떨어졌다.
+
+## TPR, FPR, IRR, ROC-curve
+
+* TP(True Positive) : True를 True라 하는것 
+* TPR(True Positive Rate) : 정밀도 
+* FP(False Positive) or Type 1 Error : False인것을 True라 하는것.
+* FPR(False Positive Rate) : 위양도
+* IRR(Irrelevant Result Rate): 부적합도
+* ROC-Curve : x축이 FPR, y축이 TPR인 그래프
+
+
+
+Hash bucket 얼마나 선택해야하는지 정할 떄와 근사율과 정확도를 계산하는데 도움이 된다.
+
+ROC 데이터를 생성하는 스크립트를 개발했다.
+
+```
+testdata/evaluate_ROC_curve.py --symbols=/media/thomasdullien/roc/symbols.txt --dbdump=/media/thomasdullien/roc/search.index.txt --index=/media/thomasdullien/roc/search.index
+```
+
+`gnuplot`을 이용해 그래프 생성
+
+```bash
+gnuplot -c ./testdata/plot_results_of_evaluate_ROC_curve.gnuplot ./untrained_roc.txt
+gnuplot -c ./testdata/tpr_fpr_curve.gnuplot ./untrained_roc.txt ./trained_roc.txt
+```
+
+
+
+### 그래프 분석 (학습 전)
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e9anj45j30ed0ecmz9.jpg)
+
+#### 1번째 그래프
+
+* TPR이 50%가 넘을떄 20%가 irrelevant하다. Cut-off distance는 25비트 정도이다.
+* TPR이 55%가 되는것을 보아 35비트 부터 irrelevant 하다. **가중치**를 더 학습하여 개선할수 있다.
+
+#### 2번째 그래프
+* TPR과 FPR이 평평하다.
+* 비트를 확장시켜 search space가 점점 감소하고 있다.
+* relevant를 개선해야한다.
+
+#### 3번째 그래프
+
+* Recall 향상과 정확도 저하 비교
+
+
+
+
+### 그래프 분석 (학습 후)
+![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e9bp1aoj30eb0ed774.jpg)
+
+#### 1번째 그래프
+
+* 10비트 쪽에서 IRR이 15%->5% 감소
+* TPR 감소
+* 33% 성공
+* IRR 효율은 좋았으나 결과는 떨어짐
+
+#### 2번째 그래프
+* 15%의 IRR을 채택하면 45% 성공률을 가질수음 (학습 전)
+* 5% IRR 채택시 40% 성공률이 감소 (학습 후)
+
+
+
+### 학습 결과
+
+* IRR을 낮추는데에는 학습이 효과적
+* 학습하지 않은 결과가 더 좋았음.
+
+
+
+## Generalizing out-of-sample function
+
+Out-of-sample : 학습하지 않은 표본을 예측
+
+질문(2)를 위해 질문(1)을 평균 거리 차이로 시각화한 그래프
+
+![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e9cnkd8j308q05daa2.jpg)
+
+80 학습 스텝 이후 11.42비트에서 12.81비트로 올랐다.
+
+
+
+# 5. Practical Results
+
+## Practical Searching
+
+ `IDA`, `Radare`, `Binary Ninja`, `Miasm`등 RE도구와 연동을 위해`FunctionSimSearch`는 Python Binding을 제공한다.
+
+
+
+```python
+ jsonstring = (... load the JSON ... )
+ fg = functionsimsearch.FlowgraphWithInstructions()
+ fg.from_json(jsonstring)
+ hasher = functionsimsearch.SimHasher("../testdata/weights.txt")
+ function_hash = hasher.calculate_hash(fg)
+```
+
+JSON기반의 Python API로 사용하여 Flow Graph를 나타낼수 있다.
+
+
+
+```json
+{
+ "edges": [ { "destination": 1518838580, "source": 1518838565 }, (...) ],
+ "name": "CFG",
+ "nodes": [
+   {
+     "address": 1518838565,
+     "instructions": [
+       { "mnemonic": "xor", "operands": [ "EAX", "EAX" ] },
+       { "mnemonic": "cmp", "operands": [ "[ECX + 4]", "EAX" ] },
+       { "mnemonic": "jnle", "operands": [ "5a87a334" ] } ]
+   }, (...)  ]
+}
+```
+
+이 json 데이터들을 input으로 넣어주면 Python Tuple로 function hash가 나온다.
+
+
+
+### Python Plugin Binding
+
+* IDA Plugin - [functionsimsearch/ida_example.py at master · googleprojectzero/functionsimsearch · GitHub](https://github.com/googleprojectzero/functionsimsearch/blob/master/pybindings/ida_example.py)
+
+* Binary Ninja - [functionsimsearch/pybindings/binary_ninja_plugin at master · googleprojectzero/functionsimsearch · GitHub](https://github.com/googleprojectzero/functionsimsearch/tree/master/pybindings/binary_ninja_plugin)
+
+
+
+## Searching For unrar Code In mpengine.dll
+
+`ida_example.py`를 사용하여 `mpengine.dll`에 존재하는 `unrar`Open Source Library 코드를 분석하도록 한다.
+
+
+
+```
+Result is 125.000000 - build.VS2015\unrar32\Release\UnRAR.exe 'memcpy_s' (1 in inf searches)
+Result is 125.000000 - build.VS2015\unrar32\MinSize\UnRAR.exe 'memcpy_s' (1 in inf searches)
+Result is 125.000000 - build.VS2015\unrar32\FullOpt\UnRAR.exe 'memcpy_s' (1 in inf searches)
+--------------------------------------
+Result is 108.000000 - build.VS2015\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 12105083908.189119 searches)
+Result is 107.000000 - build.VS2013\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 3026270977.047280 searches)
+Result is 107.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?RestartModelRare@ModelPPM@@AAEXXZ' (1 in 3026270977.047280 searches)
+--------------------------------------
+Result is 106.000000 - build.VS2010\unrar32\Release\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 784038800.726675 searches)
+Result is 106.000000 - build.VS2010\unrar32\FullOpt\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 784038800.726675 searches)
+Result is 105.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?Execute@RarVM@@QAEXPAUVM_PreparedProgram@@@Z' (1 in 209474446.235050 searches)
+--------------------------------------
+Result is 106.000000 - build.VS2010\unrar32\MinSize\UnRAR.exe '?ExecuteCode@RarVM@@AAE_NPAUVM_PreparedCommand@@I@Z' (1 in 784038800.726675 searches)
+--------------------------------------
+Result is 105.000000 - ar\build.VS2015\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
+Result is 105.000000 - ar\build.VS2013\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
+Result is 105.000000 - ar\build.VS2012\unrar64\Debug\UnRAR.exe 'strrchr' (1 in 209474446.235050 searches)
+--------------------------------------
+```
+
+
+
+### 결과 분석
+
+```
+Result is ??
+```
+
+?? 는 128bit 해시중 몇 비트가 유사한지 나타낸 것이다.
+
+
+
+### memcpy_s - 97.6%  (125bit/128bit)
+
+![](https://ws1.sinaimg.cn/large/006tKfTcgy1g09e79ytb8j31cv0u0q92.jpg)
+
+* 비트 유사도 : 97.6%
+* 유사 비트 수 : 125bit/128bit
+* 사소한 변화 외 거의 일치
+* CFG 일치
+
+
+
+###  ppmii :: ModelPPM :: RestartModelRare - 84.3% (108bit/128bit)
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e7bap2pj318g0r4dks.jpg)
+
+* 비트 유사도 : 84.3%
+* 유사 비트 수 : 108bit / 128bit
+* Structure Offset에 꽤 변화
+* CFG 많이 바뀌진 않음.
+* 첫 basic block에 차이가 보임 -> 그래도 유사함
+
+
+
+### RarVM :: ExecuteCode - 82.8% (106bit/128bit)
+
+![](https://ws2.sinaimg.cn/large/006tKfTcgy1g09e7cj5kuj318g0abadd.jpg)
+
+* 비트 유사도 : 82.8%
+* 유사 비트 수 : 106bit/128bit
+* 0x17D7840 상수가 같음.
+
+
+
+### [FP Case] RarTime :: operator == - 97.6% ( 125bit/128bit) 
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e7dih0lj318g0rq44h.jpg)
+
+* 비트 유사도 : 97.6%
+* 유사 비트수 : 125bit/128bit
+* 비트가 매우 유사하다고 판별하였지만 **False Positive** 임
+* 다른 함수이지만 비트가 유사하다고 판별.
+* `operater==` 에서 쉽게 발생 가능
+
+
+
+## libtiff in Adobe Reader
+
+예전 Adobe Reader가`libtiff`의 구버전을 사용하여 취약점이 발생한적이 있다.
+
+
+
+### Search libtiff Code in All of WIndows DLL
+
+* 다양한 버전의 Visual Studio와 다양한 Compile Setting으로 컴파일한 디렉토리
+* PDB 파일 디버깅을 위해 `DIA2Dump`를 사용하여 `.debugdump` 파일을 얻는다.
+
+
+
+```bash
+for i in $(find /media/thomasdullien/storage/libtiff/PE/ -name tiff.dll); do./addfunctionstoindex --input=$i --format=PE --index=/var/tmp/work/simhash.index; done 
+```
+
+`tiff.dll`의 SimHash index를 생성한다.
+
+```bash
+~/Desktop/sources/functionsimsearch/testdata/generate_training_data.py --work_directory=/var/tmp/work/ --executable_directory=/media/thomasdullien/storage/libtiff/ --generate_fingerprints=True --generate_json_data=False
+
+cat /var/tmp/work/extracted_symbols* > /var/tmp/work/simhash.index.meta
+```
+
+`generate_training_data` 함수의 symbol들을 만들고 training data를 생성한다.
+
+```bash
+for i in $(find /media/DLLs -iname ./*.dll); do echo $i; ./matchfunctionsindex --index=/var/tmp/work/simhash.index --input $i; done 
+```
+
+`matchfunctionsindex`를 사용하여 생성한 데이터와 인덱스를 기반으로 모든 dll을 탐색한다.
+
+
+
+#### 실행결과
+
+```
+/home/thomasdullien/Desktop/sources/adobe/binaries/AGM.dll
+(...)
+/home/thomasdullien/Desktop/sources/adobe/binaries/BIBUtils.dll
+(...)
+[!] (3191/3788 - 23 branching nodes) 0.851562: cf1cc98bead49abf.53135c10 matches 39dd1e8a79a9f2bc.1001d43d /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2015.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)
+
+/media/dlls/Windows/SysWOW64/WindowsCodecs.dll
+[!] Executable id is cf1cc98bead49abf
+[!] Loaded search index, starting disassembly.
+[!] Done disassembling, beginning search.
+[!] (3191/3788 - 23 branching nodes) 0.851562: cf1cc98bead49abf.53135c10 matches 39dd1e8a79a9f2bc.1001d43d /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2015.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)
+[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches 4614edc967480a0d.1002329a /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2013.32bits.O2/libtiff.dll
+[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches af5e68a627daeb0.1002355a /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2013.32bits.Ox/libtiff.dll
+[!] (3192/3788 - 23 branching nodes) 0.804688: cf1cc98bead49abf.53135c12 matches a5f4285c1a0af9d9.10017048 /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2017.32bits.O1/libtiff.dll PackBitsEncode( tiff*, unsigned char*, int, unsigned short)
+[!] (3277/3788 - 13 branching nodes) 0.828125: cf1cc98bead49abf.5313b08e matches a5f4285c1a0af9d9.10014477 /home/thomasdullien/Desktop/tiff-3.9.5-builds/PE/vs2017.32bits.O1/libtiff.dll 
+```
+
+`WindowsCodecs.dll`에 `libtiff.dll`이 매치 하는것을 찾았다.
+
+
+
+#### 결과 분석
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e7eh40cj318g0oaq54.jpg)
+
+CFG가 그리 유사한것처럼 보이지는 않다.
+
+![](https://ws4.sinaimg.cn/large/006tKfTcgy1g09e7feiryj318g0oc7ak.jpg)
+
+확대해보면 유사한 부분이 존재한다.
+
+PDB에서 얻은 함수 이름인 `PackBitsEncode`가 일치 하기 때문에 확신이 가능하다.
+
+조사결과 `WindowsCodecs.dll`에서 `libtiff` 3.9.5 version을 사용한 것을 알아냈다. - `libjpeg`를 link 하고있다.
+
+
+
+# 6. Conclusion
+
+## Lessons
+
+### Search Index vs Linear Sweep
+
+* 요즘 CPU는 큰 메모리로 Linear Sweep이 빠르다.
+
+* Hash Load - XOR - Calc Bit하고 수억개의 해시를 검사한다.
+
+* 얼마나 많은 해시를 비교해야하는지는 불분명하다.
+
+* Search index는 over-engineered(과도)할수 있다. 
+
+* 효율적인 스토리지 관리로 인해 간단한 Linear Sweep이 더 나을 수도 있다.
+
+
+
+### Search Simple String
+
+* Static Linked LIbrary를 찾을때 대부분(90%) known/magic string을 검색하는것이 가장 효과적일 것이다.
+
+* 특이한 문자열이 많을 경우 결과도 좋아진다.
+
+* 연구한 툴은 오픈소스 라이브러리 코드를 붙여넣어 컴파일 했을때 유용하다. (문자열이 없는 상황에 유용하다.)
+* `mpengine.dll`에 적용될수 있었다.
+
+
+
+### Still Hard Problem
+
+* 40%의 결과만 확신할수 있다.
+
+* IRR을 줄이고 TPR을 90%이상으로 개선해야한다.
+
+## Future Step
+
+#### TensorFlow나 Julia로 다시 코딩해야한다.
+* C++로 loss function을 코딩하였기 때문에 학습에 시간이 오래걸린다.
+* Single language codebase로는 좋다.
+* GPU로 학습을 병렬화 하면 더 쉬울 것이다.
+
+
+#### L-BFGS 최적화를 **SGD**로 변경해야한다.
+
+  * 큰 데이터는 L-BFGS에 부적합하다.
+
+
+#### Triplet and Quadruplet Training
+
+  * [Beyond triplet loss: a deep quadruplet network for person re-identification](https://arxiv.org/abs/1704.01719)
+  * 직관적인 관점에서 효과가 있을수 있다.
+
+
+#### 더 효과적인 Features 선택
+
+  * mnemonic tuple, CFG, 4의 배수가아닌 상수 등 현재 features는 좋지 않다.
+  * 중요한 정보가있는 **operand**, **structure** **offset**, **strings**은 고려되지 않았다.
+
+
+#### Graph-NN
+
+  * **그래프 학습**하는 많은 ML 연구가 진행되었다. 
+  * 관련 논문 - [CCS'17](https://arxiv.org/pdf/1708.06525.pdf)
+  * 문자열 매치로 탐지할수 없는 부분을 탐지할 수 있을 것이다.
+
+
+#### 함수 인접성 활용
+
+* 라이브러리를 가져와 사용한 함수는 바이너리 레벨에서 **인접하게 배치** 되어있다.
+* 이 정보를 활용하면 더 확실해질 것이다.
+
+
+### ANN tree data structure 를 flat array로 변경
+
+* ANN 은 데이터 구조가 **복잡**하기 때문에 Linear Sweep을 사용하는 경우 비효율적일수 있다.
+  * 저장시 오버헤드가 발생한다.
+* Linear Sweep 을 사용할때 LSH 같은 비트 순열보다 효과적이어야한다.
+
+
+
+
+
+# References
+
+* [functionsimsearch/01-motivation-and-overview.md at master · googleprojectzero/functionsimsearch · GitHub](https://github.com/googleprojectzero/functionsimsearch/blob/master/doc/01-motivation-and-overview.md)
+* [Near Duplicate Documents Detection · aragorn/home Wiki · GitHub](https://github.com/aragorn/home/wiki/Near-Duplicate-Documents-Detection)
+* [Random Projection and Locality Sensitive Hashing | LOVIT x DATA SCIENCE](https://lovit.github.io/machine%20learning/vector%20indexing/2018/03/28/lsh/)
+* [LSH.9 Locality-sensitive hashing: how it works - YouTube](https://www.youtube.com/watch?v=Arni-zkqMBA)
+* [Quasi-Newton method - Wikipedia](https://en.wikipedia.org/wiki/Quasi-Newton_method)
+* [Receiver operating characteristic - Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
+
+
+
+
+
No results found