javagrails · August 18, 2021 18:43 · Jul 12, 2018 · Jul 5, 2018
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ This [Seed Streams](https://github.com/lucidworks/streams) guide illustrates how
 
 ### Start the Crawl
 1. Navigate to Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
-1. Navigate to on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
+1. Navigate to Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
 1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.
 
 ### Add Vision

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website whose URIs match a [regular expression](https://regexr.com/). Additionally, `img src` fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be utilized to extract additional fields from the images.
+This [Seed Streams](https://github.com/lucidworks/streams) guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website whose URIs match a [regular expression](https://regexr.com/). Additionally, `img src` fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be utilized to extract additional fields from the images.
 
 ### Start Fusion and Create a New Application
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -26,7 +26,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 1. Navigate to Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
 1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
 1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
-1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
+1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expression. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
 1. Navigate to Indexing..Datasources, click clear datasource then run..start to restart the crawl.
 1. Run a search and ensure the `gv_text_s` field is present.
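
The XPath mapping in the step above reads `fullTextAnnotation.text` out of each entry in `responses`. As a rough sketch of what that mapping selects, the snippet below walks the same path over a JSON response in plain JavaScript. The `sampleResponse` object and the `extractFullText` helper are hypothetical and heavily trimmed; this is not Fusion's implementation of the REST Query stage.

```javascript
// Hypothetical, trimmed-down Vision response; real responses carry more fields.
var sampleResponse = {
  responses: [{
    fullTextAnnotation: { text: "EXAMPLE COMIC TEXT\n" }
  }]
};

// Collects responses[*].fullTextAnnotation.text, the same path the
// XPath expression //responses/fullTextAnnotation/text points at.
function extractFullText(response) {
  var texts = [];
  (response.responses || []).forEach(function (r) {
    if (r.fullTextAnnotation && typeof r.fullTextAnnotation.text === "string") {
      texts.push(r.fullTextAnnotation.text);
    }
  });
  return texts;
}

// Appending to the target field, as the mapping option does.
var doc = { gv_text_s: extractFullText(sampleResponse) };
console.log(doc.gv_text_s);
```

An image with no detected text yields a response entry without `fullTextAnnotation`, in which case nothing is appended and `gv_text_s` would be absent for that document.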
 

diff --git a/README.md b/README.md
@@ -8,26 +8,26 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 
 ### Add a New Datasource and Limit the Documents
 1. Create a new datasource under Indexing..Datasources. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
-1. Goto Indexing..Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
+1. Navigate to Indexing..Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.
 
 ### Configure the Parsers
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
-1. Goto Indexing...Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
+1. Navigate to Indexing...Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
 
 ### Start the Crawl
-1. Back under Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
-1. Click on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
+1. Navigate to Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
+1. Navigate to on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
 1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.
 
 ### Add Vision
 1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
-1. Click on Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
+1. Navigate to Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
 1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
 1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
 1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
-1. Under Indexing..Datasources, click clear datasource then run..start to restart the crawl.
+1. Navigate to Indexing..Datasources, click clear datasource then run..start to restart the crawl.
 1. Run a search and ensure the `gv_text_s` field is present.
 
 #### Creating a Google Vision API Key

diff --git a/README.md b/README.md
@@ -23,7 +23,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 
 ### Add Vision
 1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
-1. Goto Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
+1. Click on Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
 1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
 1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
 1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.

diff --git a/README.md b/README.md
@@ -19,7 +19,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 ### Start the Crawl
 1. Back under Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
 1. Click on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
-1. Run a search and ensure the `img_url_s` and `img_url_t` fields are present.
+1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.
 
 ### Add Vision
 1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.

diff --git a/request_entity_indexing_pipeline.json b/request_entity_indexing_pipeline.json
@@ -3,7 +3,7 @@
        "image":{
            "source":{
                "imageUri":
-               "${src_url_s}"
+               "${image_url_s}"
            }
        },
        "features": [

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide image reference data. A vision network may be utilized to extract additional fields from the images.
+This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website whose URIs match a [regular expression](https://regexr.com/). Additionally, `img src` fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be utilized to extract additional fields from the images.
 
 ### Start Fusion and Create a New Application
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
+This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide image reference data. A vision network may be utilized to extract additional fields from the images.
 
 ### Start Fusion and Create a New Application
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -26,6 +26,9 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 1. Goto Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
 1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
 1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
+1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
+1. Under Indexing..Datasources, click clear datasource then run..start to restart the crawl.
+1. Run a search and ensure the `gv_text_s` field is present.
 
 #### Creating a Google Vision API Key
 1. Navigate to the [Credentials dashboard](https://console.cloud.google.com/apis/credentials). You may need to select the correct project.
@@ -34,7 +37,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 1. Click the API restrictions tab. Set the API restrictions to *Cloud Vision API*.
 1. Click save.
 
-#### Debugging
+### Debugging
 Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:
 
 ```

diff --git a/request_entity_indexing_pipeline.json b/request_entity_indexing_pipeline.json
@@ -7,7 +7,6 @@
            }
        },
        "features": [
-          { "type": "LABEL_DETECTION", "maxResults": 50 },
           { "type": "TEXT_DETECTION", "maxResults": 50 }
        ]
     }]
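
After this change the request entity asks for text detection only. A minimal sketch of the resulting body, built in plain JavaScript, is shown below. The `buildAnnotateRequest` helper and the image URL are hypothetical; in Fusion the `${image_url_s}` placeholder is presumably filled in per document by the REST Query stage rather than by code like this.

```javascript
// Builds a request body for images:annotate with TEXT_DETECTION only,
// mirroring request_entity_indexing_pipeline.json after LABEL_DETECTION was removed.
function buildAnnotateRequest(imageUri) {
  return {
    requests: [{
      image: { source: { imageUri: imageUri } },
      features: [{ type: "TEXT_DETECTION", maxResults: 50 }]
    }]
  };
}

// Hypothetical image URL standing in for the value of image_url_s.
var body = buildAnnotateRequest("https://imgs.xkcd.com/comics/example.png");
console.log(JSON.stringify(body, null, 2));
```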

diff --git a/README.md b/README.md
@@ -4,24 +4,35 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 ### Start Fusion and Create a New Application
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
 1. Create a new application. Call it `XKCD`.
+1. Click on the new application. 
 
 ### Add a New Datasource and Limit the Documents
-1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
-1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
+1. Create a new datasource under Indexing..Datasources. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
+1. Goto Indexing..Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.
 
 ### Configure the Parsers
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
-1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
+1. Goto Indexing...Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
 
-## Start the Crawl
-1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
-1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
+### Start the Crawl
+1. Back under Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
+1. Click on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
 1. Run a search and ensure the `img_url_s` and `img_url_t` fields are present.
 
-#### Add Vision
-Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.
+### Add Vision
+1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
+1. Goto Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
+1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
+1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
+
+#### Creating a Google Vision API Key
+1. Navigate to the [Credentials dashboard](https://console.cloud.google.com/apis/credentials). You may need to select the correct project.
+1. Click the create credentials button. Select API key. Copy the API key when it appears.
+1. Click restrict key. In the application restrictions tab, select `IP addresses`. Enter just the IP address of the Fusion instance from your browser, *without the port number or colon*.
+1. Click the API restrictions tab. Set the API restrictions to *Cloud Vision API*.
+1. Click save.
 
 #### Debugging
 Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:

diff --git a/request_entity_indexing_pipeline.json b/request_entity_indexing_pipeline.json
@@ -0,0 +1,14 @@
+{
+    "requests": [{
+       "image":{
+           "source":{
+               "imageUri":
+               "${src_url_s}"
+           }
+       },
+       "features": [
+          { "type": "LABEL_DETECTION", "maxResults": 50 },
+          { "type": "TEXT_DETECTION", "maxResults": 50 }
+       ]
+    }]
+ }
diff --git a/README.md b/README.md
@@ -1,20 +1,26 @@
 ## Overview
 This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
 
-### Instructions
+### Start Fusion and Create a New Application
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
 1. Create a new application. Call it `XKCD`.
+
+### Add a New Datasource and Limit the Documents
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.
+
+### Configure the Parsers
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
 1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
+
+## Start the Crawl
 1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
 1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
-1. Run a search.
+1. Run a search and ensure the `img_url_s` and `img_url_t` fields are present.
 
-#### Next Steps
+#### Add Vision
 Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.
 
 #### Debugging

diff --git a/javascript_indexing_pipeline_stage.js b/javascript_indexing_pipeline_stage.js
@@ -24,14 +24,14 @@ function(doc){
     while (iter.hasNext()) {
       div = iter.next();
       if (div.attr("id").equals("bottom")) {
-        // found the image
+        // found the containing div of img
         break; // break out to there
       }
     }
-    // break out to here to add field for transcript
+    // break out to here to add field for img src
     if (div != null) {
       img = div.child(0); // get the image element
-      logger.info("SRC: " + img.attr("src"));
+      logger.info("SRC: " + img.attr("src")); // log the image URL
       doc.addField("image_url", img.attr("src"));
     } else {
       logger.warn("div was null");
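
One thing to watch in the loop shown in this hunk: `div` is reassigned on every iteration, so if no div has the id `bottom` and nothing outside the visible lines resets it, `div` still holds the last div visited and the `div != null` check passes anyway. The sketch below shows one way to make the "not found" case explicit. It runs outside Fusion, using a hypothetical `fakeElement` stand-in that mimics only the `attr()` and `child()` calls used above; it is not the Jsoup API itself, and the image URL is made up.

```javascript
// Stand-in for a parsed HTML element: only attr() and child() are mimicked.
function fakeElement(attrs, children) {
  return {
    attr: function (name) { return attrs[name] || ""; },
    child: function (i) { return children[i]; }
  };
}

// Returns the src of the first child of the div whose id is "bottom",
// or null when no such div (or no child) exists.
function findImageSrc(divs) {
  var found = null; // stays null unless a matching div is seen
  for (var i = 0; i < divs.length; i++) {
    if (divs[i].attr("id") === "bottom") {
      found = divs[i];
      break;
    }
  }
  if (found === null) {
    return null;
  }
  var img = found.child(0);
  return img ? img.attr("src") : null;
}

// Hypothetical page structure and image URL.
var image = fakeElement({ src: "https://imgs.xkcd.com/comics/example.png" }, []);
var page = [
  fakeElement({ id: "top" }, []),
  fakeElement({ id: "bottom" }, [image])
];
console.log(findImageSrc(page));
console.log(findImageSrc([fakeElement({ id: "top" }, [])]));
```

Keeping the match in a separate `found` variable means the warning branch is reached whenever the expected div is missing, not only when the page has no divs at all.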

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 1. Create a new application. Call it `XKCD`.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
-1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`.
+1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
 1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
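
The Include Documents pattern can be tried out before crawling. The sketch below checks the README's pattern against a few URLs in plain JavaScript. It assumes the stage matches the pattern against the whole `id` value, which is why the pattern is wrapped in `^...$`; that anchoring, the `isComicPage` helper, and the two non-comic URLs are assumptions added for illustration.

```javascript
// The pattern from the README, anchored to mimic a full-string match (an assumption).
var pattern = new RegExp("^(?:.*/[0-9]{1,5}/*.)$");

// Hypothetical helper: true when the URL looks like a numbered comic page.
function isComicPage(url) {
  return pattern.test(url);
}

console.log(isComicPage("https://xkcd.com/501/"));   // true
console.log(isComicPage("https://xkcd.com/4/"));     // true
console.log(isComicPage("https://xkcd.com/about/")); // false
console.log(isComicPage("https://xkcd.com/"));       // false
```

Note that the trailing `/*.` reads as "zero or more slashes, then any one character", so the closing slash of `https://xkcd.com/501/` is consumed by the final `.` rather than by `/*`. The pattern still accepts the comic URLs, but `.*/[0-9]{1,5}/?` may express the intent more directly.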

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow
 1. Create a new application. Call it `XKCD`.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
-1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
+1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`.
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
 1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
+This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
 
 ### Instructions
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
+This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
 
 ### Instructions
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Overview
-This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extract fields from those pages are inserted into the index for searching or sending off to a vision network for processing.
+This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.
 
 ### Instructions
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.

diff --git a/README.md b/README.md
@@ -1,4 +1,7 @@
-## Instructions
+## Overview
+This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extract fields from those pages are inserted into the index for searching or sending off to a vision network for processing.
+
+### Instructions
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
 1. Create a new application. Call it `XKCD`.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
@@ -11,10 +14,10 @@
 1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
 1. Run a search.
 
-### Next Steps
+#### Next Steps
 Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.
 
-### Debugging
+#### Debugging
 Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:
 
 ```

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 ## Instructions
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
-1. Create a new application. Call it XKCD.
+1. Create a new application. Call it `XKCD`.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.

diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ## Instructions
-1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance).
+1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
 1. Create a new application. Call it XKCD.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.

diff --git a/README.md b/README.md
@@ -12,7 +12,7 @@
 1. Run a search.
 
 ### Next Steps
-Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the <div id="transcript"> tag on the comic page.
+Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.
 
 ### Debugging
 Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:

diff --git a/README.md b/README.md
@@ -9,6 +9,10 @@
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
 1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
 1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
+1. Run a search.
+
+### Next Steps
+Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the <div id="transcript"> tag on the comic page.
 
 ### Debugging
 Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:

diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
-1. Goto index workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
+1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
 1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
 1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.

diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance).
 1. Create a new application. Call it XKCD.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
-1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in `javascript_indexing_pipeline_stage.js` code into script body. Click save.
+1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
 1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
 1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
 1. Goto index workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.

diff --git a/README.md b/README.md
@@ -1,4 +1,5 @@
 ## Instructions
+1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance).
 1. Create a new application. Call it XKCD.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
 1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in `javascript_indexing_pipeline_stage.js` code into script body. Click save.

diff --git a/README.md b/README.md
@@ -1,8 +1,9 @@
 ## Instructions
 1. Create a new application. Call it XKCD.
 1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
-1. Goto index pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in `javascript_indexing_pipeline_stage.js` code into script body. Click save.
-1. Add a new include documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
+1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in `javascript_indexing_pipeline_stage.js` code into script body. Click save.
+1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
+1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
 1. Goto index workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
 1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
 1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
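
Note on the Include Documents pattern used in the hunks above: the README says the regex `.*/[0-9]{1,5}/*.` limits the crawl to comic pages. The sketch below is one way to check which `id` values the pattern would keep. It assumes the stage requires the whole value to match, which is why the pattern is anchored with `^` and `$`; how Fusion actually applies the pattern is not stated in the README. `isComicPage` and the sample URLs are hypothetical names chosen for illustration.

```javascript
// Sketch only: the pattern from the README, anchored on the ASSUMPTION that
// the Include Documents stage matches the full `id` value.
const COMIC_PATTERN = /^.*\/[0-9]{1,5}\/*.$/;

// Hypothetical helper: true when a URL would be kept by the pattern.
function isComicPage(url) {
  return COMIC_PATTERN.test(url);
}

// Illustrative URLs: one numbered comic page and two non-comic pages.
const samples = [
  "https://xkcd.com/1234/",
  "https://xkcd.com/about/",
  "https://xkcd.com/archive/",
];

for (const url of samples) {
  console.log(url, isComicPage(url));
}
```

Under this anchored reading, the trailing `/*.` requires at least one character after the digits, so a single-digit URL with no trailing slash (for example `https://xkcd.com/1`) would not match. If that is not intended, the tail of the pattern may need adjusting.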

diff --git a/javascript_indexing_pipeline_stage.js b/javascript_indexing_pipeline_stage.js
@@ -38,7 +38,7 @@ function(doc){
     }
   } catch ( e) {
     logger.warn("something went wrong");
- 		logger.error(e);
+    logger.error(e);
   }
   return doc;
 }
diff --git a/javascript_indexing_pipeline_stage.js b/javascript_indexing_pipeline_stage.js
@@ -15,7 +15,7 @@ function(doc){
   var iter = java.util.Iterator;
   var divs = org.jsoup.select.Elements;
 
-	try {
+  try {
     jdoc = Jsoup.parse(content);
     divs = jdoc.select("div");
     iter = divs.iterator();
@@ -36,7 +36,7 @@ function(doc){
     } else {
       logger.warn("div was null");
     }
-	} catch ( e) {
+  } catch ( e) {
     logger.warn("something went wrong");
  		logger.error(e);
   }