@malithjkmt
Last active August 4, 2017 00:41

Revisions

  1. malithjkmt renamed this gist Aug 4, 2017. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  2. malithjkmt created this gist Aug 3, 2017.
    28 changes: 28 additions & 0 deletions corusShuffler.py
    @@ -0,0 +1,28 @@
    # To run: python corpusShuffler.py -src sourceCorpus.txt -tgt targetCorpus.txt

    import argparse
    import random

    parser = argparse.ArgumentParser(description='## CORPUS SHUFFLER ##')
    parser.add_argument(
        '-src', help='source language corpus to shuffle', required=True)
    parser.add_argument(
        '-tgt', help='target language corpus to shuffle', required=True)
    args = parser.parse_args()

    with open(args.src, 'r') as src, open(args.tgt, 'r') as tgt:
        srcData = src.readlines()
        tgtData = tgt.readlines()

    # Identical seeds yield identical permutations only for equal-length lists.
    if len(srcData) != len(tgtData):
        parser.error('source and target corpora differ in line count')
    random.seed(7)  # same seed for both files (to preserve the alignment)
    random.shuffle(srcData)
    random.seed(7)  # same seed for both files (to preserve the alignment)
    random.shuffle(tgtData)

    with open(args.src + '_shuffled', 'w') as srcOut:
        srcOut.writelines(srcData)
    with open(args.tgt + '_shuffled', 'w') as tgtOut:
        tgtOut.writelines(tgtData)
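
The script keeps the two corpora aligned by reseeding the global generator before each shuffle, which gives the same permutation only when both lists have the same length. A minimal sketch of an alternative is to shuffle (source, target) pairs once, so the alignment holds by construction. The helper name `shuffle_parallel` and its `seed` parameter are hypothetical and not part of the gist; the default seed of 7 simply mirrors the script.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=7):
    """Shuffle two aligned lists of lines with one shared permutation.

    Hypothetical helper for illustration; not part of the original script.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError('corpora must have the same number of lines')
    pairs = list(zip(src_lines, tgt_lines))
    # A local generator leaves the global random state untouched.
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    src_shuffled, tgt_shuffled = zip(*pairs)
    return list(src_shuffled), list(tgt_shuffled)


src = ['a\n', 'b\n', 'c\n', 'd\n']
tgt = ['A\n', 'B\n', 'C\n', 'D\n']
src_out, tgt_out = shuffle_parallel(src, tgt)
# Every source line still sits at the same index as its counterpart.
assert all(s.upper() == t for s, t in zip(src_out, tgt_out))
```

Because each pair moves as a unit, a mismatch in corpus length surfaces as an explicit error instead of a silently misaligned output. Note that if the last line of either input file lacks a trailing newline, it may need one appended before writing, or it would merge with the following line after shuffling.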