Revisions
kordless revised this gist
Jul 12, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -18,7 +18,7 @@ This [Seed Streams](https://github.com/lucidworks/streams) guide illustrates how

### Start the Crawl
1. Navigate to Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Navigate to Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.

### Add Vision
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This [Seed Streams](https://github.com/lucidworks/streams) guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website whose URIs match a [regular expression](https://regexr.com/). Additionally, `img src` fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be utilized to extract additional fields from the images.

### Start Fusion and Create a New Appliction
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -26,7 +26,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

1. Navigate to Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expression. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
1. Navigate to Indexing..Datasources, click clear datasource then run..start to restart the crawl.
1. Run a search and ensure the `gv_text_s` field is present.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 6 additions and 6 deletions.
@@ -8,26 +8,26 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

### Add a New Datasource and Limit the Documents
1. Create a new datasource under Indexing..Datasources. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Navigate to Indexing..Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.

### Configure the Parsers
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Navigate to Indexing...Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.

### Start the Crawl
1. Navigate to Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Navigate to on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.

### Add Vision
1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
1. Navigate to Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
1. Navigate to Indexing..Datasources, click clear datasource then run..start to restart the crawl.
1. Run a search and ensure the `gv_text_s` field is present.

#### Creating a Google Vision API Key
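
The Include Documents pattern from the steps above can be checked outside Fusion before starting a crawl. The following is a minimal JavaScript sketch, assuming the stage requires the whole `id` value to match (hence the added `^` and `$` anchors). The helper name `isComicPage` and the two non-comic sample URLs are illustrative and not part of Fusion or the guide.

```javascript
// Pattern copied verbatim from the Include Documents step.
const pattern = ".*/[0-9]{1,5}/*.";

// Assumption: the stage matches against the entire `id` value,
// so the pattern is anchored at both ends here.
const comicPage = new RegExp("^" + pattern + "$");

// Hypothetical helper: true when the document id looks like a comic page.
function isComicPage(id) {
  return comicPage.test(id);
}

const ids = [
  "https://xkcd.com/501/",     // comic page from the guide
  "https://xkcd.com/4/",       // comic page from the guide
  "https://xkcd.com/about/",   // illustrative non-comic page
  "https://xkcd.com/archive/", // illustrative non-comic page
];

for (const id of ids) {
  console.log(id, isComicPage(id));
}
```

Note that the trailing `/*.` means "zero or more slashes, then any single character", so the pattern is looser than it may look; tightening it is a choice for the reader rather than something the guide prescribes.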
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -23,7 +23,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

### Add Vision
1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
1. Click on Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -19,7 +19,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

### Start the Crawl
1. Back under Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Click on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search and ensure the `image_url_s` and `image_url_t` fields are present.

### Add Vision
1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -3,7 +3,7 @@

    "image":{
      "source":{
        "imageUri": "${image_url_s}"
      }
    },
    "features": [
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website whose URIs match a [regular expression](https://regexr.com/). Additionally, `img src` fields are extracted with a JavaScript parsing stage and inserted into the index for use in other indexing stages. A vision network may be utilized to extract additional fields from the images.

### Start Fusion and Create a New Appliction
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . No changes.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide image reference data. A vision network may be utilized to extract additional fields from the images.

### Start Fusion and Create a New Appliction
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . 2 changed files with 4 additions and 2 deletions.
@@ -26,6 +26,9 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

1. Goto Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.
1. Add a mapping of returned values XPath Expression. Use `//responses/fullTextAnnotation/text` for the first expressions. Set the target field to be `gv_text_s`. Click `Append To Existing Values In Target Field`. Click save at top.
1. Under Indexing..Datasources, click clear datasource then run..start to restart the crawl.
1. Run a search and ensure the `gv_text_s` field is present.

#### Creating a Google Vision API Key
1. Navigate to the [Credentials dashboard](https://console.cloud.google.com/apis/credentials). You may need to select the correct project.

@@ -34,7 +37,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

1. Click the API restrictions tab. Set the API restrictions to *Cloud Vision API*.
1. Click save.

### Debugging
Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug: ```

@@ -7,7 +7,6 @@

      }
    },
    "features": [
      {
        "type": "TEXT_DETECTION",
        "maxResults": 50
      }
    ]
  }]
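
The XPath mapping `//responses/fullTextAnnotation/text` added in this revision pulls the detected text out of the Vision API response. As a rough illustration of what that mapping selects, here is a minimal JavaScript sketch that walks the same path in a parsed JSON response. The `sampleResponse` object is a hypothetical, heavily trimmed response, and `extractTexts` is an illustrative helper, not something Fusion provides.

```javascript
// Hypothetical, trimmed-down images:annotate response. A real response
// carries many more fields; only the path used by the mapping is shown.
const sampleResponse = {
  responses: [
    { fullTextAnnotation: { text: "HELLO\nWORLD" } },
    {}, // an image with no detected text yields no fullTextAnnotation
  ],
};

// Collect responses[*].fullTextAnnotation.text, skipping entries without text.
function extractTexts(response) {
  const texts = [];
  for (const r of response.responses || []) {
    if (r.fullTextAnnotation && typeof r.fullTextAnnotation.text === "string") {
      texts.push(r.fullTextAnnotation.text);
    }
  }
  return texts;
}

console.log(extractTexts(sampleResponse));
```

In the guide, each collected value would correspond to what gets appended to the `gv_text_s` target field.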
kordless revised this gist
Jul 5, 2018 . 2 changed files with 33 additions and 8 deletions.
@@ -4,24 +4,35 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

### Start Fusion and Create a New Appliction
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
1. Create a new application. Call it `XKCD`.
1. Click on the new application.

### Add a New Datasource and Limit the Documents
1. Create a new datasource under Indexing..Datasources. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Indexing..Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.

### Configure the Parsers
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Goto Indexing...Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.

### Start the Crawl
1. Back under Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Click on Querying..Query Workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search and ensure the `img_url_s` and `img_url_t` fields are present.

### Add Vision
1. Note that the text of the comic is already available in the ```<div id="transcript">``` tag on the comic page. Google's Vision API returns other data about images, however.
1. Goto Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to `https://vision.googleapis.com/v1/images:annotate`. Change the call method to `post`.
1. Create a query parameter with a property name of `key`. Set the property value to your [Google API key](https://console.cloud.google.com/apis/credentials) for the Vision API.
1. Copy and paste in the `request_entity_indexing_pipeline.json` string below into the request entity field.

#### Creating a Google Vision API Key
1. Navigate to the [Credentials dashboard](https://console.cloud.google.com/apis/credentials). You may need to select the correct project.
1. Click the create credentials button. Select API key. Copy the API key when it appears.
1. Click restrict key. In the application restrictions tab, select `IP addresses`. Enter the just IP address of the Fusion instance from your browser, *without the port number or colon*.
1. Click the API restrictions tab. Set the API restrictions to *Cloud Vision API*.
1. Click save.

#### Debugging
Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:

@@ -0,0 +1,14 @@

{
  "requests": [{
    "image":{
      "source":{
        "imageUri": "${src_url_s}"
      }
    },
    "features": [
      {
        "type": "LABEL_DETECTION",
        "maxResults": 50
      },
      {
        "type": "TEXT_DETECTION",
        "maxResults": 50
      }
    ]
  }]
}
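
The request entity added in this revision has one request per image, with the image URL supplied through the `${src_url_s}` placeholder (presumably filled from the document field of that name by the pipeline stage). To make the shape easier to reason about, here is a minimal JavaScript sketch that builds the same body for a single URL. `buildAnnotateRequest` and the example URL are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: builds a request body shaped like
// request_entity_indexing_pipeline.json for one image URL.
function buildAnnotateRequest(imageUri) {
  return {
    requests: [
      {
        image: { source: { imageUri: imageUri } },
        features: [
          { type: "LABEL_DETECTION", maxResults: 50 },
          { type: "TEXT_DETECTION", maxResults: 50 },
        ],
      },
    ],
  };
}

// Example URL is a placeholder, not taken from the guide.
const body = buildAnnotateRequest("https://example.com/comic.png");
console.log(JSON.stringify(body, null, 2));
```

Printing the body this way makes it easy to compare against what the REST Query stage is configured to send.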
kordless revised this gist
Jul 5, 2018 . 1 changed file with 9 additions and 3 deletions.
@@ -1,20 +1,26 @@

## Overview
This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.

### Start Fusion and Create a New Appliction
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
1. Create a new application. Call it `XKCD`.

### Add a New Datasource and Limit the Documents
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.

### Configure the Parsers
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.

## Start the Crawl
1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search and ensure the `img_url_s` and `img_url_t` fields are present.

#### Add Vision
Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.

#### Debugging
kordless revised this gist
Jul 5, 2018 . 1 changed file with 3 additions and 3 deletions.
@@ -24,14 +24,14 @@ function(doc){

    while (iter.hasNext()) {
      div = iter.next();
      if (div.attr("id").equals("bottom")) { // found the containing div of img
        break; // break out to there
      }
    }

    // break out to here to add field for img src
    if (div != null) {
      img = div.child(0); // get the image element
      logger.info("SRC: " + img.attr("src")); // log the image URL
      doc.addField("image_url", img.attr("src"));
    } else {
      logger.warn("div was null");
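
One detail in the hunk above is worth noting: if the loop finishes without hitting `break`, `div` still holds the last element visited, so the `div != null` check does not by itself tell a match apart from a miss. The sketch below shows one way to keep the two cases separate. It runs on plain JavaScript objects rather than the parser's element objects, and `findImageSrc`, the `divs` array, and the image URL are all hypothetical stand-ins, not the author's stage code.

```javascript
// Plain-object stand-ins for parsed <div> elements (hypothetical data).
const divs = [
  { id: "top", children: [] },
  { id: "bottom", children: [{ tag: "img", src: "//imgs.example.com/comic.png" }] },
];

// Return the src of the first child of the div with the wanted id,
// or null when no such div exists or it has no children.
function findImageSrc(divs, wantedId) {
  let match = null; // stays null unless a div with the wanted id is seen
  for (const div of divs) {
    if (div.id === wantedId) {
      match = div;
      break;
    }
  }
  if (match === null || match.children.length === 0) {
    return null;
  }
  return match.children[0].src;
}

console.log(findImageSrc(divs, "bottom"));
console.log(findImageSrc(divs, "missing"));
```

Tracking the match in a separate variable means the "not found" branch is reached whenever no div carries the wanted id, instead of depending on what the iterator last returned.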
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -6,7 +6,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

1. Create a new application. Call it `XKCD`.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`, `https://xkcd.com/4/`, etc.
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -6,7 +6,7 @@ This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/dow

1. Create a new application. Call it `XKCD`.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages which appear in the format `https://xkcd.com/501/`.
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Goto Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a specific set of documents on a website which match a [regular expression](https://regexr.com/). Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.

### Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This guide illustrates how to use [Lucidworks Fusion](https://lucidworks.com/download/) to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.

### Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Overview
This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extracted fields are inserted into the index to provide search. A vision network may also be utilized for extracting additional fields.

### Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 6 additions and 3 deletions.
@@ -1,4 +1,7 @@

## Overview
This guide illustrates how to use Fusion to crawl a number of specific documents on a website. Extract fields from those pages are inserted into the index for searching or sending off to a vision network for processing.

### Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
1. Create a new application. Call it `XKCD`.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.

@@ -11,10 +14,10 @@

1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search.

#### Next Steps
Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.

#### Debugging
Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug: ```
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,6 +1,6 @@

## Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
1. Create a new application. Call it `XKCD`.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -1,5 +1,5 @@

## Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance). Click the link the script outputs to navigate to the Fusion instance page. Set a password. Login with `admin` and the new password.
1. Create a new application. Call it XKCD.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Goto Index Pipelines. Add a new Javascript pipeline stage (under advanced). Copy and paste in the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -12,7 +12,7 @@
1. Run a search.

### Next Steps
Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the ```<div id="transcript">``` tag on the comic page.

### Debugging
Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:
kordless revised this gist
Jul 5, 2018 . 1 changed file with 4 additions and 0 deletions.
@@ -9,6 +9,10 @@
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
1. Run a search.

### Next Steps
Implement Google Vision API to extract text and detected objects from comics. Note that the text of the comic is also available in the <div id="transcript"> tag on the comic page.

### Debugging
Tail the `connectors-classic.log` in the `./fusion/4.0.2/var/log/connectors/connectors-classic` directory to debug:
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -5,7 +5,7 @@
1. Go to Index Pipelines. Add a new JavaScript pipeline stage (under advanced). Copy and paste the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Go to Index Workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on a stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
1. Click on query workbench. Set the display fields to be `id` and `img_url_s`.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -2,7 +2,7 @@
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance).
1. Create a new application. Call it XKCD.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Go to Index Pipelines. Add a new JavaScript pipeline stage (under advanced). Copy and paste the `javascript_indexing_pipeline_stage.js` code below into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Go to index workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on a stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 0 deletions.
@@ -1,4 +1,5 @@
## Instructions
1. [Start a Fusion instance on Google](https://github.com/lucidworks/streams/blob/master/README.md#launching-a-fusion-4x-demo-instance).
1. Create a new application. Call it XKCD.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Go to Index Pipelines. Add a new JavaScript pipeline stage (under advanced). Copy and paste the `javascript_indexing_pipeline_stage.js` code into the script body. Click save.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 3 additions and 2 deletions.
@@ -1,8 +1,9 @@
## Instructions
1. Create a new application. Call it XKCD.
1. Create a new datasource. Add a web source. Add https://xkcd.com as a start link. Limit documents to max 200. Click save at top right.
1. Go to Index Pipelines. Add a new JavaScript pipeline stage (under advanced). Copy and paste the `javascript_indexing_pipeline_stage.js` code into the script body. Click save.
1. Add a new Include Documents stage. Add a new field, 'id' and set the regex pattern to `.*/[0-9]{1,5}/*.` and click save. This limits the documents to comics pages.
1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
1. Go to index workbench. Remove all but the Tika parser and fallback from the XKCD datasource by clicking on a stage, then remove stage below. Repeat until all but Tika and fallback remain. Click save.
1. Click on the Tika parser stage and check `Return parsed content as XML or HTML` and `Return original XML and HTML instead of Tika XML output`. Click apply below. Click save at top right.
1. Back under datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
kordless revised this gist
Jul 5, 2018 . 1 changed file with 1 addition and 1 deletion.
@@ -38,7 +38,7 @@ function(doc){
        }
    } catch ( e) {
        logger.warn("something went wrong");
        logger.error(e);
    }
    return doc;
}
kordless revised this gist
Jul 5, 2018 . 1 changed file with 2 additions and 2 deletions.
@@ -15,7 +15,7 @@ function(doc){
    var iter = java.util.Iterator;
    var divs = org.jsoup.select.Elements;
    try {
        jdoc = Jsoup.parse(content);
        divs = jdoc.select("div");
        iter = divs.iterator();
@@ -36,7 +36,7 @@ function(doc){
        } else {
            logger.warn("div was null");
        }
    } catch ( e) {
        logger.warn("something went wrong");
        logger.error(e);
    }
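The hunks above show the pipeline stage parsing page content with Jsoup and iterating over `div` elements. As a rough standalone illustration of the kind of value that ends up in a field such as `img_url_s`, here is a sketch in plain JavaScript. It deliberately swaps Jsoup for a regular expression so it runs outside Fusion; the helper name `extractImgSrcs` and the sample markup are hypothetical, and a regex is far less robust than a real HTML parser.

```javascript
// Hypothetical helper: returns the src of every <img> tag in an HTML string.
// A regex stands in for Jsoup here so the sketch runs without a JVM.
function extractImgSrcs(html) {
  const srcs = [];
  const imgTag = /<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["'][^>]*>/gi;
  let match;
  while ((match = imgTag.exec(html)) !== null) {
    srcs.push(match[1]); // capture group 1 holds the attribute value
  }
  return srcs;
}

// Made-up sample markup, not copied from the crawled site.
const sampleHtml =
  '<div id="comic"><img src="https://example.com/comics/sample.png" alt="sample"></div>' +
  '<div id="footer"><img src="https://example.com/footer.png"></div>';

const imgSrcs = extractImgSrcs(sampleHtml);
console.log(imgSrcs);
```

Inside the actual stage, the equivalent step would set the extracted URL on the document before `return doc;`, which this sketch does not attempt to reproduce.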