    HOWTO
    =====

    Ingest Harvard Library Bibliographic Dataset into Elasticsearch (as raw unmapped MARC21 fields)
    -----------------------------------------------------------------------------------------------

This HOWTO is for Linux systems (the steps on Windows are very similar).

    - install Java 8 into `/usr/java/jdk1.8.0`
    - install Elasticsearch 1.1.0
- assign 50% of RAM to the Elasticsearch heap and enable the G1 garbage collector by placing this file at `${ES_HOME}/bin/elasticsearch.in.sh`. This example is for 8 GB of RAM:

    #!/bin/sh
    ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-1.1.0.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*
    JAVA_OPTS="$JAVA_OPTS -Xms4g"
    JAVA_OPTS="$JAVA_OPTS -Xmx4g"
    JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"

- place this file in `${ES_HOME}/config/elasticsearch.yml`:

    cluster:
      name: xbib
    index:
      codec:
        bloom:
          load: false
      merge:
        scheduler:
          type: concurrent
          max_thread_count: 4
        policy:
          type: tiered
          max_merged_segment: 1gb
          segments_per_tier: 4
          max_merge_at_once: 4
          max_merge_at_once_explicit: 4
    indices:
      memory:
        index_buffer_size: 33%
      store:
        throttle:
          type: none
    threadpool:
      merge:
        type: fixed
        size: 4
        queue_size: 32
      bulk:
        type: fixed
        size: 8
        queue_size: 32
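
Once the node has been restarted with these settings, the nodes info API is a quick way to confirm that both the heap options and the YAML values were picked up:

    # "jvm" shows heap size and GC flags, "settings" shows the values from elasticsearch.yml
    curl 'localhost:9200/_nodes?pretty'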
    - export a TOOLS folder to your environment (e.g. `export TOOLS=$HOME/xbib`), and create a `$TOOLS` folder structure:

    mkdir -p $TOOLS/lib
    mkdir -p $TOOLS/bin
    mkdir -p $TOOLS/logs
    mkdir -p $TOOLS/import

- download MARC21 records from `http://openmetadata.lib.harvard.edu/bibdata` to `$TOOLS/import` and unpack the tar.gz file to `*.mrc` files. The result should look like this:

    $ cd $TOOLS/import
    $ find .
    .
    ./20140408
    ./20140408/data
    ./20140408/data/hlom
    ./20140408/data/hlom/ab.bib.00.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.12.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.06.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.09.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.10.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.11.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.02.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.01.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.13.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.07.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.08.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.05.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.03.20140404.full.mrc
    ./20140408/data/hlom/ab.bib.04.20140404.full.mrc
    ./20140408/harvard.tar.gz
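
A minimal fetch-and-unpack sketch for this step; the archive name matches the listing above, but the download URL below is hypothetical, so take the actual link for the current full dump from the bibdata page:

    mkdir -p $TOOLS/import/20140408
    cd $TOOLS/import/20140408
    # hypothetical URL -- copy the real one from http://openmetadata.lib.harvard.edu/bibdata
    wget http://openmetadata.lib.harvard.edu/bibdata/harvard.tar.gz
    tar xzf harvard.tar.gz    # unpacks to data/hlom/*.mrc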

- download `http://xbib.org/repository/org/xbib/tools/1.0.0.Beta2/tools-1.0.0.Beta2-feeder.jar` to `$TOOLS/lib`
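
For example, with `wget`:

    wget -P $TOOLS/lib http://xbib.org/repository/org/xbib/tools/1.0.0.Beta2/tools-1.0.0.Beta2-feeder.jar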

- create a logging configuration file in `$TOOLS/bin/log4j.properties`:

    log4j.rootLogger=INFO, file, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
    log4j.appender.file.append=false
    log4j.appender.file.file=logs/xbib.log
    log4j.logger.org.xbib.elasticsearch=DEBUG

- create the ingest script in `$TOOLS/bin/harvard2es` and make it executable (`chmod +x $TOOLS/bin/harvard2es`):

    #!/bin/sh
    java="/usr/java/jdk1.8.0/bin/java"
    # the cluster name must match cluster.name in elasticsearch.yml
    echo '
    {
        "path" : "'${TOOLS}'/import/",
        "pattern" : "*.mrc",
        "elements" : "marc/bib",
        "concurrency" : 8,
        "elasticsearch" : "es://localhost:9300?es.cluster.name=xbib&es.sniff=true",
        "index" : "harvard",
        "type" : "title",
        "shards" : 1,
        "replica" : 0,
        "maxbulkactions" : 3000,
        "maxconcurrentbulkrequests" : 10,
        "maxtimewait" : "180s",
        "mock" : false,
        "client" : "bulk",
        "direct" : true
    }
    ' | ${java} \
        -cp $(pwd)/bin:$(pwd)/lib/tools-1.0.0.Beta2-feeder.jar \
        org.xbib.tools.Runner org.xbib.tools.feed.elasticsearch.harvard.FromMARC
- change directory to `$TOOLS` (the script's classpath and the Log4j log file path `logs/xbib.log` are resolved relative to this directory)
    - run `$TOOLS/bin/harvard2es`
    - wait ~70 minutes (with a single Elasticsearch node on commodity hardware)
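
To follow the ingest and check the result, something like this works (index and type names as configured in the feed settings above):

    # follow the feeder log while the ingest runs
    tail -f $TOOLS/logs/xbib.log

    # afterwards: count the indexed MARC21 records ...
    curl 'localhost:9200/harvard/title/_count?pretty'

    # ... and inspect one raw document
    curl 'localhost:9200/harvard/title/_search?size=1&pretty'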