
 <body>
 <h3 id="fsadaptor">Deployment of File System Adaptor</h3>

 <p>A single instance of the File System Adaptor lets the
 GSA index a single UNC share.  DFS is supported.

 <h4>Requirements</h4>
 <ul>
   <li>GSA 7.2 or higher
   <li>Java JRE 1.7 update 6 or higher installed on the computer that runs the adaptor
   <li>File System Adaptor JAR executable
   <li>Microsoft Windows (the adaptor must run on Windows)
   <li>A Windows account with sufficient permissions for the adaptor
       (see the <b>Permissions needed by the Adaptor</b> section below)
 </ul>

 <h4>Permissions needed by the Adaptor</h4>
   <p>The Windows user account that the adaptor is running under must
   have the following permissions:

   <ul>
     <li>list the content of folders,</li>
     <li>read the content of documents,</li>
     <li>read attributes of files and folders.</li>
   </ul>

 <h5>Special permissions needed to read the ACLs</h5>
   Additionally, the GSA must be able to:
   <ul>
     <li>read ACLs for both files and folders,</li>
     <li>reset last access dates to the original value prior to the GSA access.</li>
   </ul>

   To obtain these permissions, the account must be a member of one of
   the following groups:
   <ul>
     <li>Administrator</li>
     <li>Power User</li>
     <li>Print Operator</li>
     <li>Server Operator</li>
   </ul>
   <p>Please note that it is not sufficient for the user to be a member
   of one of these groups at the Domain level: the user must be a member
   of one of these groups at the local level.</p>

   <p>For more information, see the Microsoft documentation on the
   <a href="http://msdn.microsoft.com/en-us/library/bb525388(VS.85).aspx">
   NetShareGetInfo function</a>.</p>

 <h4>Configure GSA for Adaptor</h4>
 <ol>
   <li>Add the IP address of the computer that hosts the adaptor to the <b>List
     of Trusted IP Addresses</b> on the GSA.
     <p>In the GSA's Admin Console, go to <b>Content Sources &gt; Feeds</b>,
     and scroll down to <b>List of Trusted IP Addresses</b>. Add the IP address
     for the adaptor to the list.

   <li>Add the URLs provided by the adaptor to the <b>Follow Patterns</b>
     on the GSA.
     <p>In the Admin console, go to <b>Content Sources &gt; Web Crawl
     &gt; Start and Block URLs</b>, and
     scroll down to <b>Follow Patterns</b>.
     Add an entry like <code>http://adaptor.example.com:5678/doc/
     </code> where <code>adaptor.example.com</code> is the hostname of the
     machine that hosts the adaptor. By default the adaptor runs on port 5678.
 </ol>

 <h4>Configure Adaptor</h4>
 <ol>
   <li>Create a file named <code>adaptor-config.properties</code> in the
   directory that contains the adaptor binary.
   <p>
   Here is an example configuration (bold items are example values to be
   replaced):
 <pre>
 gsa.hostname=<b>yourgsa.hostname.com</b>
 filesystemadaptor.src=<b>\\\\host\\share</b>
 </pre>
   <p> Note: Backslashes are entered as double backslashes. In order
       to represent a single '\' you need to enter '\\'.
   <p> Note: DFS links can be given as
       filesystemadaptor.src: <b>\\\\host\\dfsnamespace\\link</b>
   <p> Note: Unicode (non-ASCII) characters can be used in
       filesystemadaptor.src. If you include such characters,
       the <code>adaptor-config.properties</code> file must be saved
       using UTF-8 encoding.
   <br>

   <li> Create a file named <code>logging.properties</code> in the same directory
   that contains the adaptor binary:
   <pre>
 .level=INFO
 handlers=java.util.logging.FileHandler,java.util.logging.ConsoleHandler
 java.util.logging.FileHandler.formatter=com.google.enterprise.adaptor.CustomFormatter
 java.util.logging.FileHandler.pattern=logs/adaptor.%g.log
 java.util.logging.FileHandler.limit=10485760
 java.util.logging.FileHandler.count=20
 java.util.logging.ConsoleHandler.formatter=com.google.enterprise.adaptor.CustomFormatter
 </pre>

   <li><p>Create a directory named <code>logs</code> inside the same directory that contains
     the adaptor binary.

   <li><p>Run the adaptor using a command line like:
   <pre>java -Djava.util.logging.config.file=logging.properties -jar adaptor-fs-YYYYMMDD-withlib.jar</pre>
 </ol>
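 <p>The backslash-doubling rule in the notes above matches the escaping rules of
 Java properties files. The following is a minimal sketch, assuming the
 configuration is parsed with <code>java.util.Properties</code>; the class name
 <code>UncPropertiesDemo</code> and the helper <code>readSrc</code> are
 hypothetical and are not part of the adaptor.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch: how java.util.Properties unescapes backslashes in a UNC path. */
final class UncPropertiesDemo {

  /** Parses properties-file text and returns the value of filesystemadaptor.src. */
  static String readSrc(String fileText) {
    Properties props = new Properties();
    try {
      // Loading from a Reader leaves the choice of file encoding (for example
      // UTF-8) to whoever opens the file.
      props.load(new StringReader(fileText));
    } catch (IOException e) {
      // A StringReader does not fail in practice; wrap to keep callers simple.
      throw new IllegalStateException(e);
    }
    return props.getProperty("filesystemadaptor.src");
  }

  public static void main(String[] args) {
    // Java source doubles each backslash once more, so this literal holds the
    // file text: filesystemadaptor.src=\\\\host\\share
    String fileText = "filesystemadaptor.src=\\\\\\\\host\\\\share\n";
    // Prints the UNC path with single backslashes: \\host\share
    System.out.println(readSrc(fileText));
  }
}
```

 <p>Each pair of backslashes in the file collapses to one, which is why a UNC
 path that starts with two backslashes needs four in the configuration file.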

 <h4>Running as service on Windows</h4>
   <p>Example service creation on Windows with prunsrv:
   <pre>prunsrv install adaptor-fs --StartPath="%CD%" ^
   --Classpath=adaptor-fs-YYYYMMDD-withlib.jar ^
   --StartMode=jvm --StartClass=com.google.enterprise.adaptor.Daemon ^
   --StartMethod=serviceStart --StartParams=com.google.enterprise.adaptor.fs.FsAdaptor ^
   --StopMode=jvm --StopClass=com.google.enterprise.adaptor.Daemon ^
   --StopMethod=serviceStop --StdOutput=stdout.log --StdError=stderr.log ^
   ++JvmOptions=-Djava.util.logging.config.file=logging.properties</pre>

   <p> Note: By default, the File System adaptor service runs under the Windows Local System account.
       This is fine in most cases, but it can cause problems when access to documents is
       restricted through ACLs.
       If the service cannot crawl documents because of ACL restrictions, use the
       Service Control Manager to run the service as a user that has sufficient
       access to crawl the documents.

 <h4>Optional <code>adaptor-config.properties</code> fields</h4>
 <dl>
   <dt>
   <code>server.dashboardPort</code>
   </dt>
   <dd>
   Port on which to view web page showing information
   and diagnostics.  Defaults to "5679".
   </dd>
   <br>
   <dt>
   <code>filesystemadaptor.supportedAccounts</code>
   </dt>
   <dd>
   Accounts listed in supportedAccounts are
   included in ACLs regardless of whether they are builtin
   accounts.
   By default the value is:
   <pre>
   BUILTIN\\Administrators,\\Everyone,BUILTIN\\Users,
   BUILTIN\\Guest,NT AUTHORITY\\INTERACTIVE,
   NT AUTHORITY\\Authenticated Users
   </pre>
   </dd>
   <dt>
   <code>filesystemadaptor.builtinGroupPrefix</code>
   </dt>
   <dd>
   Builtin accounts are excluded from the ACLs
   that are pushed to the GSA. An account that starts with
   this prefix is considered a builtin account and will be
   excluded from the ACLs.
   By default the value is:
   <pre>
   BUILTIN\\
   </pre>
   </dd>
   <dt>
   <code>filesystemadaptor.crawlHiddenFiles</code>
   </dt>
   <dd>
   This boolean configuration property allows or disallows indexing
   of hidden files and folders. The definition of hidden files and
   folders is platform dependent. On Windows file systems a file or
   folder is considered hidden if the DOS <code>hidden</code>
   attribute is set.
   <p/>
   By default, hidden files are not indexed and the contents of
   hidden folders are not indexed. Setting
   <code>filesystemadaptor.crawlHiddenFiles</code> to <code>true</code>
   will allow hidden files and folders to be crawled by the Search
   Appliance. By default the value is:
   <pre>
   false
   </pre>
   </dd>
   <dt>
   <code>filesystemadaptor.lastAccessedDate</code>
   </dt>
   <dd>
   This configuration property can be used to disable crawling of files
   whose time of last access is earlier than a specific date.  The cut-off
   date is specified in <a href="http://www.w3.org/TR/NOTE-datetime">
   ISO8601</a> date format, <code>YYYY-MM-DD</code>.
   <p/>
   Setting <code>filesystemadaptor.lastAccessedDate</code> to
   <code>2010-01-01</code> would only crawl content that has been accessed
   since the beginning of 2010.
   <p/>
   By default, filtering content based upon last accessed time is disabled.
   <br>
   Only one of <code>filesystemadaptor.lastAccessedDate</code> or
   <code>filesystemadaptor.lastAccessedDays</code> may be specified.
   </dd>
   <dt>
   <code>filesystemadaptor.lastAccessedDays</code>
   </dt>
   <dd>
   This configuration property can be used to disable crawling of files
   that have not been accessed within the specified number of days. Unlike the
   absolute cut-off date used by <code>filesystemadaptor.lastAccessedDate</code>,
   this property can be used to expire previously indexed content if it
   has not been accessed in a while.
   <p/>
   The expiration window is specified as a positive integer number of days.
   <br>
   Setting <code>filesystemadaptor.lastAccessedDays</code> to
   <code>365</code> would only crawl content that has been accessed
   in the last year.
   <p/>
   By default, filtering content based upon last accessed time is disabled.
   <br>
   Only one of <code>filesystemadaptor.lastAccessedDate</code> or
   <code>filesystemadaptor.lastAccessedDays</code> may be specified.
   </dd>
   <dt>
   <code>filesystemadaptor.lastModifiedDate</code>
   </dt>
   <dd>
   This configuration property can be used to disable crawling of files
   whose time of last modification is earlier than a specific date.  The cut-off
   date is specified in <a href="http://www.w3.org/TR/NOTE-datetime">
   ISO8601</a> date format, <code>YYYY-MM-DD</code>.
   <p/>
   Setting <code>filesystemadaptor.lastModifiedDate</code> to
   <code>2010-01-01</code> would only crawl content that has been modified
   since the beginning of 2010.
   <p/>
   By default, filtering content based upon last modified time is disabled.
   <br>
   Only one of <code>filesystemadaptor.lastModifiedDate</code> or
   <code>filesystemadaptor.lastModifiedDays</code> may be specified.
   </dd>
   <dt>
   <code>filesystemadaptor.lastModifiedDays</code>
   </dt>
   <dd>
   This configuration property can be used to disable crawling of files
   that have not been modified within the specified number of days. Unlike the
   absolute cut-off date used by <code>filesystemadaptor.lastModifiedDate</code>,
   this property can be used to expire previously indexed content if it
   has not been modified in a while.
   <p/>
   The expiration window is specified as a positive integer number of days.
   <br>
   Setting <code>filesystemadaptor.lastModifiedDays</code> to
   <code>365</code> would only crawl content that has been modified
   in the last year.
   <p/>
   By default, filtering content based upon last modified time is disabled.
   <br>
   Only one of <code>filesystemadaptor.lastModifiedDate</code> or
   <code>filesystemadaptor.lastModifiedDays</code> may be specified.
   </dd>
   <dt>
   <code>adaptor.incrementalPollPeriodSecs</code>
   </dt>
   <dd>
   Time between incremental crawls. Default value is 300 seconds.
   </dd>
   <br>
   <dt>
   <code>adaptor.namespace</code>
   </dt>
   <dd>
   Namespace used for ACLs sent to GSA.  Defaults to "Default".
   </dd>
 </dl>
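 <p>The difference between the absolute <code>Date</code> properties and the
 rolling <code>Days</code> properties above can be sketched as follows. This is
 an illustration only, not the adaptor's code: the class
 <code>TimeFilterSketch</code> and its methods are hypothetical, the date is
 interpreted as midnight UTC, and a file whose timestamp equals the cut-off is
 treated as crawlable. Both choices are assumptions.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

/** Sketch of the cut-off arithmetic behind the Date and Days filters. */
final class TimeFilterSketch {

  /** Cut-off for an absolute date such as "2010-01-01" (assumed to be UTC). */
  static long cutoffFromDate(String isoDate) {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    format.setLenient(false); // reject impossible dates such as 2010-02-31
    try {
      return format.parse(isoDate).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Expected YYYY-MM-DD: " + isoDate, e);
    }
  }

  /** Cut-off for a rolling window; it moves forward as nowMillis advances. */
  static long cutoffFromDays(int days, long nowMillis) {
    if (days <= 0) {
      throw new IllegalArgumentException("days must be a positive integer");
    }
    return nowMillis - TimeUnit.DAYS.toMillis(days);
  }

  /** A file is crawled only if its timestamp is not earlier than the cut-off. */
  static boolean shouldCrawl(long fileTimeMillis, long cutoffMillis) {
    return fileTimeMillis >= cutoffMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long fixedCutoff = cutoffFromDate("2010-01-01");
    long rollingCutoff = cutoffFromDays(365, now);
    System.out.println(shouldCrawl(now, fixedCutoff));
    System.out.println(shouldCrawl(now, rollingCutoff));
  }
}
```

 <p>Because the rolling cut-off is recomputed from the current time, a file
 that passes today can fall outside the window later, which is how the
 <code>Days</code> properties can expire previously indexed content.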

 <br>
 <br>

 <h3> Advanced Topics </h3>

 <h4>Not changing 'last access' of the documents on the share</h4>
 <p>The adaptor attempts to restore the last access date for documents after
 it reads the document content during a crawl. In order for the last access
 date to be restored back to the original value before the content was read,
 the user account that the adaptor is running under needs to have write permission.
 If the account has read-only permission and not write permission for documents,
 then the last access date for documents will change as the adaptor reads
 document content during a crawl.
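 <p>The read-then-restore pattern described above can be sketched with the
 Java 7 NIO file attribute API. This is a minimal illustration under stated
 assumptions, not the adaptor's implementation: the class
 <code>PreserveLastAccessSketch</code> and the method
 <code>readPreservingLastAccess</code> are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.FileTime;

/** Sketch: read a file, then restore its last access time. */
final class PreserveLastAccessSketch {

  /** Returns the file's bytes and puts the last access time back afterwards. */
  static byte[] readPreservingLastAccess(Path file) throws IOException {
    BasicFileAttributeView view =
        Files.getFileAttributeView(file, BasicFileAttributeView.class);
    // Remember the access time before the content is touched.
    FileTime originalAccess = view.readAttributes().lastAccessTime();
    try {
      return Files.readAllBytes(file);
    } finally {
      // Passing null leaves the last-modified and creation times unchanged.
      // Updating the access time is a write to file attributes, so an account
      // with read-only access fails here.
      view.setTimes(null, originalAccess, null);
    }
  }
}
```

 <p>The <code>setTimes</code> call is the step that needs write permission; an
 account without it can still read the content, but the last access date then
 reflects the crawl.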

 <br>
 <br>


 <h3> Developer Topics </h3>

 <h4>File System Adaptor ACL Overview</h4>

 <p>ACLs for documents and folders are read, preserved and pushed to the Google
 Search Appliance by the File System Adaptor for UNC and DFS UNC paths.
 </p>
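 <p>Before ACLs are pushed, the
 <code>filesystemadaptor.supportedAccounts</code> and
 <code>filesystemadaptor.builtinGroupPrefix</code> properties decide which
 accounts are kept. The rule as documented can be sketched as below. The class
 <code>BuiltinAccountFilterSketch</code> is hypothetical, and the
 case-insensitive comparison is an assumption; the adaptor's actual matching
 rules are not shown in this document.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch: keep supported accounts, drop other accounts with the builtin prefix. */
final class BuiltinAccountFilterSketch {
  private final Set<String> supportedAccounts = new HashSet<String>();
  private final String builtinPrefix;

  /** Both arguments are the property values after backslash unescaping. */
  BuiltinAccountFilterSketch(String supportedAccountsCsv, String builtinPrefix) {
    for (String account : supportedAccountsCsv.split(",")) {
      String trimmed = account.trim();
      if (!trimmed.isEmpty()) {
        // Assumption: account names are compared without regard to case.
        supportedAccounts.add(trimmed.toLowerCase(Locale.ROOT));
      }
    }
    this.builtinPrefix = builtinPrefix.toLowerCase(Locale.ROOT);
  }

  /** Returns true if the account would stay in the ACL sent to the GSA. */
  boolean isIncluded(String account) {
    String key = account.toLowerCase(Locale.ROOT);
    if (supportedAccounts.contains(key)) {
      return true; // explicitly supported, builtin or not
    }
    return !key.startsWith(builtinPrefix); // other builtin accounts are dropped
  }

  public static void main(String[] args) {
    BuiltinAccountFilterSketch filter = new BuiltinAccountFilterSketch(
        "BUILTIN\\Administrators,\\Everyone,BUILTIN\\Users", "BUILTIN\\");
    System.out.println(filter.isIncluded("BUILTIN\\Administrators"));
    System.out.println(filter.isIncluded("BUILTIN\\Backup Operators"));
    System.out.println(filter.isIncluded("EXAMPLE\\alice"));
  }
}
```

 <p>In this sketch the supported list is checked first, so an account such as
 <code>BUILTIN\Administrators</code> stays in the ACL even though it carries
 the builtin prefix.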

 <p>The following images show the ACL inheritance used by the File System Adaptor.
 The green and pink arrows signify inheritance, while the dotted arrows show
 optional inheritance, which depends on whether the item inherits permissions from
 its parent or breaks inheritance and defines its own set of permissions.
 </p>

 <h4>non-DFS ACL inheritance</h4>
 <img src="non_dfs_acls.jpg" />

 <h4>DFS ACL inheritance</h4>
 <img src="dfs_acls.jpg" />

 </body>