HOWTO
=====

Ingest Harvard Library Bibliographic Dataset into Elasticsearch (as raw unmapped MARC21 fields)
-----------------------------------------------------------------------------------------------

This HOWTO is for Linux systems (Windows is very similar).

- install Java 8 into `/usr/java/jdk1.8.0`

- install Elasticsearch 1.1.0

- assign 50% of RAM to the Elasticsearch heap and enable the G1 garbage collector by placing this file at `${ES_HOME}/bin/elasticsearch.in.sh`. This example is for 8 GB of RAM:

        #!/bin/sh

        ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-1.1.0.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*

        JAVA_OPTS="$JAVA_OPTS -Xms4g"
        JAVA_OPTS="$JAVA_OPTS -Xmx4g"
        JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"

- place this file in `${ES_HOME}/config/elasticsearch.yml`

        cluster:
          name: xbib

        index:
          codec:
            bloom:
              load: false
          merge:
            scheduler:
              type: concurrent
              max_thread_count: 4
            policy:
              type: tiered
              max_merged_segment: 1gb
              segments_per_tier: 4
              max_merge_at_once: 4
              max_merge_at_once_explicit: 4

        indices:
          memory:
            index_buffer_size: 33%
          store:
            throttle:
              type: none

        threadpool:
          merge:
            type: fixed
            size: 4
            queue_size: 32
          bulk:
            type: fixed
            size: 8
            queue_size: 32

- set a TOOLS variable in your environment (e.g. `export TOOLS=$HOME/xbib`) and create the `$TOOLS` folder structure:

        mkdir -p $TOOLS/lib
        mkdir -p $TOOLS/bin
        mkdir -p $TOOLS/logs
        mkdir -p $TOOLS/import

- download the MARC21 records from `http://openmetadata.lib.harvard.edu/bibdata` into `$TOOLS/import` and unpack the tar.gz file into `*.mrc` files (see the shell sketch after this listing). The result should look like this:

        $ cd $TOOLS/import
        $ find .
        .
        ./20140408
        ./20140408/data
        ./20140408/data/hlom
        ./20140408/data/hlom/ab.bib.00.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.12.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.06.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.09.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.10.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.11.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.02.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.01.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.13.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.07.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.08.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.05.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.03.20140404.full.mrc
        ./20140408/data/hlom/ab.bib.04.20140404.full.mrc
        ./20140408/harvard.tar.gz
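In shell terms, the download-and-unpack step might look like the sketch below. The tarball URL is a placeholder, not the real link: copy the actual download URL from the bibdata page, and note that the snapshot directory name (`20140408` above) changes with each release.

        cd $TOOLS/import
        mkdir -p 20140408 && cd 20140408
        # placeholder URL -- take the real download link from
        # http://openmetadata.lib.harvard.edu/bibdata
        wget -O harvard.tar.gz 'http://openmetadata.lib.harvard.edu/bibdata/harvard.tar.gz'
        # unpack; judging from the listing above, the archive
        # expands to data/hlom/*.mrc
        tar -xzf harvard.tar.gz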
- download `http://xbib.org/repository/org/xbib/tools/1.0.0.Beta2/tools-1.0.0.Beta2-feeder.jar` to `$TOOLS/lib`

- create a logging configuration file in `$TOOLS/bin/log4j.properties`

        log4j.rootLogger=INFO, file, console

        log4j.appender.out=org.apache.log4j.ConsoleAppender
        log4j.appender.out.layout=org.apache.log4j.PatternLayout
        log4j.appender.out.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n

        log4j.appender.file=org.apache.log4j.FileAppender
        log4j.appender.file.layout=org.apache.log4j.PatternLayout
        log4j.appender.file.layout.ConversionPattern=[%d{ABSOLUTE}][%-5p][%-25c][%t] %m%n
        log4j.appender.file.append=false
        log4j.appender.file.file=logs/xbib.log

        log4j.logger.org.xbib.elasticsearch=DEBUG

- create the ingest script in `$TOOLS/bin/harvard2es`; note that `es.cluster.name` in the `elasticsearch` URI must match the `cluster.name` set in `elasticsearch.yml` (`xbib`):

        #!/bin/sh

        java="/usr/java/jdk1.8.0/bin/java"

        echo '
        {
            "path" : "'${TOOLS}'/import/",
            "pattern" : "*.mrc",
            "elements" : "marc/bib",
            "concurrency" : 8,
            "elasticsearch" : "es://localhost:9300?es.cluster.name=xbib&es.sniff=true",
            "index" : "harvard",
            "type" : "title",
            "shards" : 1,
            "replica" : 0,
            "maxbulkactions" : 3000,
            "maxconcurrentbulkrequests" : 10,
            "maxtimewait" : "180s",
            "mock" : false,
            "client" : "bulk",
            "direct" : true
        }
        ' | ${java} \
            -cp $(pwd)/bin:$(pwd)/lib/tools-1.0.0.Beta2-feeder.jar \
            org.xbib.tools.Runner org.xbib.tools.feed.elasticsearch.harvard.FromMARC

- change directory to `$TOOLS`

- run `$TOOLS/bin/harvard2es`

- wait ~70 minutes (with a single Elasticsearch node on commodity hardware)
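When the feeder has finished, a quick sanity check against the Elasticsearch HTTP API (assuming the default HTTP port 9200; the index name `harvard` comes from the script above) is to confirm the cluster name and count the indexed records:

        # cluster should report the name "xbib" and be reachable
        curl 'localhost:9200/_cluster/health?pretty'

        # number of indexed bibliographic records
        curl 'localhost:9200/harvard/_count?pretty'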
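To eyeball one of the raw, unmapped MARC21 documents, fetch a single hit with the standard search API; what the fields inside `_source` look like is entirely up to what the feeder wrote:

        curl 'localhost:9200/harvard/title/_search?size=1&pretty'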