Open-SearchEngine v1.7
Open-SearchEngine is a big step forward in providing DotNetNuke with a true enterprise search engine, capable of indexing and searching any site the administrator chooses.
Because it is based on Lucene and includes a true web spider, it can index any site, whether it is a DNN-based site or not. Administrators and developers therefore do not have to worry about whether a module implements the ISearchable interface in order to appear in the search results.
To obtain accurate search results, Open-SearchEngine provides a rich query syntax similar to the one offered by Google (see Query Syntax).
Open-SearchEngine stores its indexes in files, so it does not consume database space or add database overhead.
Demo
The live demo can be seen at www.opendnn.com under the Open-SearchEngine Demo tab.
What Is New in Version 1.7
- Indexing of PDF files. You can now configure Open-SearchEngine to index PDF files. Once indexed, the content of the PDF will be searchable.
Please see the "Setup of PDF indexing" section for further details, as this functionality requires an extra set of files to be downloaded and set up separately from this module.
Other Features
- Duplicates Reduction. One of the common complaints was the presence of duplicates in the search results. These duplicates were due to various factors, including some that are specific to the way DNN works. We have made several improvements in this area through a configurable xml file that lets you specify which parts of a url determine whether a page is a duplicate. (see Avoid Duplicates section)
- Indexing of Office Documents. You can configure Open-SearchEngine to index a number of Microsoft Office document types such as Word, Excel and PowerPoint. This functionality uses the IFilter interface (the same one used by the Windows Explorer search that comes preinstalled with Win2k, XP and Win2003). You can add IFilters for other document types by modifying the code that is provided.
- Search Input implemented as a Skin Object. You can include the search input box directly into your skin as a Skin Object, so that it will appear on every page on your site, with no need to add it as a module.
- Configurable Search Scope. Every search result module can be configured to limit the search results to only some of the sites that the spider has indexed. This is helpful in case you have different searches in different pages on your site, and only want to show results from specific sites.
- Configurable Search Impersonation. When spidering your own site, you can impersonate a role other than the anonymous user. This will allow you to index content that was not previously accessible to the spider. All the pages available to a user in the role you selected, will be indexed by the spider.
- Compliance with META directives. The spider now honors the "noindex" and "nofollow" directives found in the meta tags of some web pages. If these directives are present, the page will not be indexed, its links will not be followed, or both, depending on which directives appear.
- Press Enter and Search. After typing your search criteria in the Search Input box, simply press the Enter key to see the search results. We have added ClientAPI code that automatically gives focus to the execute search button/image (in both the Search Module and the Skin Object).
- Automatic exclusion of non-searchable items. Some file types, such as images and style sheets, should not be spidered even though they are referenced by a link. These file types are excluded automatically. You still have the option to add your own exclusions.
- Automatic exclusion of anonymous modules. This is a DNN-specific feature. Pages such as Privacy, Terms, Login and Register are in reality modules that load in whichever page the user references them from. This means that in DNN the same content can be reached from many different URLs. As a result, a search for a word such as "terms", which appears in the Terms page, used to return as many results as there were pages on your site. With this version, such content appears only once.
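The META directive check described in the feature list can be illustrated with a short sketch. This is not the module's own code (which is .NET); it is a minimal Python illustration, and the names RobotsMetaParser and robots_policy are hypothetical. It assumes the directives arrive in a standard <meta name="robots" content="..."> tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def robots_policy(html):
    """Return (may_index, may_follow) for one page of HTML (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


page = '<html><head><meta name="robots" content="NOINDEX, nofollow"></head></html>'
may_index, may_follow = robots_policy(page)
```

A spider built along these lines would skip indexing when may_index is false and skip queuing the page's links when may_follow is false.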
How It Does It
- Open-SearchEngine is installed like any other DNN module and requires no changes to the core.
- Once installed, there will be three modules available from the list of controls that can be added to any page, as well as a Scheduled Task.
- The first module is Open-SearchInput. This is where the user enters the query word or phrase. This module has the same characteristics as the standard DNN SearchInput module.
- The second module is Open-SearchResults. In this module the user will see the results of the query in the familiar format presented by Google.
- The third module is Open-SearchSpiderSettings, which allows the administrator to specify which sites need to be crawled (whether they are DNN sites or not) as well as other characteristics of the spider (see setup instructions).
- The Scheduled Task represents the actual spider. The spider is the application in charge of crawling the site(s) that the administrator chooses.
- The spider starts by crawling the default page of each site you indicate and stores all the links on that page that lead to other pages in your site. Each stored link represents a page that will be crawled in turn, and the process of parsing pages and storing links is repeated until there are no more links and no more pages. The content of every crawled page is stored in index files (on the server), which are then available for queries.
- Note: the spider will follow links in the form of: <a href="url"></a> that are on your pages. If you have documents that are accessible only through "click events", then you will need to provide a series of links to these documents in order to be found and indexed by the spider.
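The crawl loop described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the spider's actual .NET implementation: the crawl function, the fetch callback, and the in-memory SITE dictionary (standing in for real HTTP requests) are all hypothetical. It follows only <a href="..."> links within the starting site, which also shows why a document reachable only through a click event is never found.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl of one site; fetch(url) returns HTML or None."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # a real spider would hand this text to the indexer
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical three-page site; hidden.aspx is reachable only via a click event.
SITE = {
    "http://example.test/": '<a href="/about.aspx">About</a> <a href="http://other.test/">Other</a>',
    "http://example.test/about.aspx": '<a href="/">Home</a> <span onclick="go(1)">doc</span>',
    "http://example.test/hidden.aspx": "only reachable through a click event",
}
pages = crawl("http://example.test/", SITE.get)
```

In this sketch hidden.aspx is never crawled, because no <a href> link points to it. Adding a plain link to it on any crawled page would make it reachable.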
The following are visual examples of the modules.
Fig 1. This is a snapshot of Open-SearchEngine in action. The live demo can be seen at www.opendnn.com

Fig 2. shows the Open-SearchInput module and the various ways in which the administrator can alter its appearance. The module can also appear without a container and as a Skin Object.
Fig 3. shows the module settings specific to the SearchInput module.

Fig 4. shows the Open-SearchResults module, after a search has been executed. Notice that the look and feel of the search results are like the familiar Google interface. Of interest is the fact that the Open-SearchResults module has been placed in the same page as the SearchInput module, but it is invisible until a search is executed. This option can be turned on or off, and the SearchResults module can be placed on any page.

Fig 5. shows the settings specific to the SearchResults module. The administrator has ample control over what is shown and in what format. The "Results Scope" option allows the administrator to restrict the sites that will be searched. A list of all spidered sites appears, and only the ones that are selected will be searched. There is also a "Title Link Target" option to show the details of a search result on the same page or in a separate window.

Fig 6. shows the Scheduled Task (SearchSpider) that is created upon installing Open-SearchEngine, as well as the Open-SearchSpiderSettings module.
The SearchSpiderSettings module allows the administrator to:
- define the sites that need to be spidered
- define if the spider is to act as a particular role when spidering the local site (so as to get access to pages that would normally require a login)
- define if the spider is to index MSOffice documents
- exclude link types that do not need to be indexed (in this case .pdf)
- decide in which folder the resulting indexes will be placed
- define how many simultaneous spider threads should be spun.
You will notice that we have put the Open-SearchSpiderSettings module in the same page where Scheduled Tasks are administered. This allows us to have a central place to deal with the Spider.

Tips
Viewing index activity
You can view the activity of the spider as it is crawling a site in the Scheduler Status screens. To access the Scheduler Status screen, go to Host/Schedule and then select the menu option as indicated in the image below.

What you will see is shown in the image below: as the spider crawls, it reports on every page that it is crawling.
Forcing the scheduled task to start
On some occasions, you may want to force the spider's scheduled task to run. To do so, go to Host/Schedule and click on the edit icon of the scheduled task as shown in the image below.

Once you are in the Edit Schedule screen, click on Update twice, as in the image below. The scheduled task will automatically start.

Setup
- Open-SearchEngine is installed like any other DNN module and requires no changes to the core. There are various .zip files that come in the download package. One file will install the entire search engine, another one the Skin Object (optional). In order to install the search engine, you should go to the section Host/Module Settings on your portal and upload and install the file named XepientSolutions.XSSearchEngine.v1.5.PA.zip.
- Once installed, there will be three modules available from the list of controls that can be added to any page, as well as a Scheduled Task that is ready to be activated.
- Go to any page on your site and from the list of available modules select Open-SearchInput. This is where the user enters the query word or phrase (see figure 2).
- Go to any page on your site and from the list of available modules select Open-SearchResults. This is where the search results will be displayed (see figure 4). Unless you specify otherwise, the "Results Scope" will be all the sites that you spidered. You can limit the scope by selecting/deselecting from the list.
By default the module is visible to administrators only, unless a search is executed. You can change this default behaviour in the settings of the module (see figure 5).
- The page where we suggest you add the next module is the "Host/Schedule" page. This page is where the host goes to administer any scheduled task. By adding the Open-SearchSpiderSettings module to this page, you will be able to administer the XSSearchSpider scheduled task and its settings from a single page (see figure 6). You can, of course, place the module on any other page.
In the Open-SearchSpiderSettings module, you will see various fields to be filled:
Spider Base URI(s):
Enter the sites that you wish to index. If there is more than one site, press <return> to place each one on its own line. There is no limit to the number of sites that you can spider, and they need not be local or DNN sites.
Need more space to add web sites to crawl? Just increase the length of the SettingValue field in the ScheduleItemSettings DNN core table. Open-SearchEngine will do the rest.
Role Impersonation: This option allows you to specify which role the spider should use when spidering your local site. By default, the spider has access to public pages only. If you choose a role, the spider will also have access to all the pages that are restricted to that role, and will perform the equivalent of an automatic login in order to index the restricted pages. Make sure that the role you select has at least one registered user; otherwise this option will not work.
Spider MSOffice Documents: This option allows you to index the Microsoft Office document types that are listed. It uses the IFilter interface (the same one used by the Windows Explorer search, which comes preinstalled with Win2k, XP and Win2003). Once you select the types of documents that you want to index, you will be able to search their content.
Spider Excl. Extension(s):
Lucene can only index plain text. Therefore, there is no point in spidering links that end with an extension whose content will never be indexed. This field allows you to designate which extensions will not be crawled. If there is more than one extension, press <return> to place each one on its own line. Extensions such as .pdf are not excluded automatically, because they can become indexable once a suitable text parser is available: as of version 1.7, PDF files can be indexed after following the steps in the "Setup of PDF indexing" section.
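As an illustration of how a newline-separated exclusion list could be applied to candidate links, here is a minimal Python sketch. The function names parse_exclusions and should_crawl are hypothetical and do not come from the module; the sketch assumes that matching is done on the extension of the url path, ignoring case and any querystring.

```python
import posixpath
from urllib.parse import urlparse


def parse_exclusions(setting_text):
    """Turn the newline-separated setting into a set such as {'.pdf', '.zip'}."""
    extensions = set()
    for line in setting_text.splitlines():
        ext = line.strip().lower()
        if ext:
            # Accept entries written with or without the leading dot.
            extensions.add(ext if ext.startswith(".") else "." + ext)
    return extensions


def should_crawl(url, excluded):
    """Return False when the url path ends with an excluded extension."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext not in excluded


excluded = parse_exclusions(".pdf\nZIP\n\n")
skip_pdf = should_crawl("http://example.test/docs/manual.PDF?download=1", excluded)
keep_page = should_crawl("http://example.test/default.aspx", excluded)
```

Removing .pdf from the setting text, as the PDF setup requires, would make should_crawl return True for the first url.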
Spider Index Path:
This field allows you to designate where the indexed files will be stored. You can leave the default or designate a different folder.
The spider creates 2 sub-directories for each site you crawl. For example, if you were to spider only http://www.xepient.com, the spider would create a www.xepient.com directory, where it puts the index as it is being created, and a www.xepient.com.out directory, to which the contents of www.xepient.com are copied once the indexing process completes. This allows users to keep searching while the spider is running, and also provides a backup of every index.
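The two-directory scheme can be pictured with the following Python sketch. It is an illustration only: publish_index is a hypothetical name, the "segments" file is placeholder data, and the intermediate ".staging" folder is an assumption added here so that the ".out" folder is replaced in one quick step. The source does not say how the module performs the copy.

```python
import os
import shutil
import tempfile


def publish_index(index_root, site):
    """Copy the finished working index for `site` into its '.out' folder."""
    work_dir = os.path.join(index_root, site)
    out_dir = os.path.join(index_root, site + ".out")
    staging = out_dir + ".staging"  # assumption: not part of the documented layout
    shutil.rmtree(staging, ignore_errors=True)
    shutil.copytree(work_dir, staging)          # copy while searches still use out_dir
    shutil.rmtree(out_dir, ignore_errors=True)  # drop the previous published index
    os.rename(staging, out_dir)                 # publish the fresh copy
    return out_dir


# Demonstration in a temporary folder with a placeholder index file.
root = tempfile.mkdtemp()
work = os.path.join(root, "www.xepient.com")
os.makedirs(work)
with open(os.path.join(work, "segments"), "w") as handle:
    handle.write("placeholder index data")
out = publish_index(root, "www.xepient.com")
```

Searches would read only from the ".out" folder, so the working folder can be rebuilt at any time without interrupting them.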
Spider Threads:
The spider can run as a single-threaded or as a multi-threaded application. Since the threads are joined in any case, there is no real gain in having more than one thread for the spider.
- Once you have entered the appropriate settings for the spider, it is time to schedule the Spider Scheduled Task. The installation program will already have created the scheduled task for you with some default settings. All you need to do is activate it. Decide how often the spider should run and refresh the index, click on the scheduled task's pencil icon (see fig. 6), and check the "Activate" check box. Save your changes, and watch the spider working in the Schedule Status screens (Host/Schedule/(module action menu) status). The spider gives feedback on every page it is crawling and, upon completion, the indexes will be available in the folders mentioned above. You are done. Just go to the Open-SearchInput module and enter a search term.
- If you need to start the spider before its scheduled time, click on the scheduled task's pencil icon (see fig. 6) and, without changing any parameter, click on the Update link twice.
- Note: in order to run, the spider needs full trust to be set in the web.config.
Setup of PDF indexing
- Because Open-SearchEngine relies on a third party (PDFBox) implementation in order to provide its PDF indexing capabilities, you will need to install some additional .dlls in your DNN /bin directory in order to enable this functionality.
- The .dlls required by PDFBox are available as a separate download on the www.opendnn.com site.
- Full instructions are available inside the download package, but all that is required is to copy the three .dlls into the DNN /bin directory and to remove the .pdf extension from the Spider Excl. Extension(s) field in the Open-SearchSpiderSettings module.
Setup of Skin Object
- The Skin Object that comes with Open-SearchEngine is installed like any other DNN Skin Object module and requires you to change your skin in order to include it. There are various .zip files that come in the download package. The file that will install the Skin Object is named XepientSolutions.XSSearchEngineSkinObject.v1.5.PA.zip. In order to install the skin object, you should go to the section Host/Module Settings on your portal and upload and install it as any other module.
- You should be familiar with how to include a skin object in a skin in order to use it; there are various manuals that give detailed instructions on how to do this.
- For .ascx skins: Select the .ascx file that corresponds to the skin that you want to add the Skin Object to (usually in directory Portals/_default/Skins...) and add the following code to the very top of the source code of the skin:
<%@ Register TagPrefix="xs" TagName="OPENSEARCH" Src="~/DesktopModules/XSSearchInput/XSSearchInputSkinObject.ascx" %>
- The above code will tell the skin where the Skin Object is located in the file system.
- Then, add the following code, in order to display the Skin Object in the desired location on the page.
<xs:OPENSEARCH runat="server" id="xsOpenSearch" ShowGoImage="true"/>
- For .html skins: Select the .html file that corresponds to the skin that you want to add the Skin Object to (usually in directory Portals/_default/Skins...) and add the following code to the very top of the source code of the skin:
[OPENSEARCH]
- You are done. There is no need to recompile. The Search Input will now appear as part of your skin, and the results will be displayed in a Results module that will automatically be injected in the page where you are performing the search.
Avoid Duplicates
This section is an explanation of how duplicates are avoided. You do not need to make any changes to the configuration of the spider in order for it to work correctly in most scenarios.
By default, a spider will crawl every single page that it encounters. Once a page is crawled, its contents are stripped of html tags and placed in an index, ready to be searched.
In principle, the same page is not crawled twice. Open-SearchEngine determines whether a page has already been crawled by looking at its url. If the url is the same as one that was previously indexed, the page is bypassed.
The above method is not sufficient in DNN, because there are many different paths (urls) to the same page.
For example
http://www.site.com/tabid/16/default is the same page as http://www.site.com/resources/tabid/16/default.aspx.
Since the spider cannot know that url 1, in the example above, is exactly the same as url 2, a duplicate could occur.
To avoid this type of duplicate, the spider implements, by default, a configurable method of pre-emptive exclusion. Unless you specify otherwise, pages that have the same tabid will be indexed only once.
On its own, this method would be too restrictive, since there are many scenarios where the same page (tabid) yields different results depending on additional parameters appended to the url. An example is the dnn forum, which uses various arguments to display different threads, but always in the same page.
To make the spider as flexible as possible, the rules that tell the spider what to consider a duplicate and what not have been extracted to an xml file, and the following rules have already been implemented:
- Spider dnn forum content
- Spider dnn blog content
- Spider multi page content (a module by www.dotnetnuke.dk)
Each rule is defined as a regular expression that recognizes a series of parameters and captures their values.
Example: if your module uses itemid in order to display different content, you should create a regular expression that recognizes this pattern in the url, either itemid/n (friendly urls) or itemid=n (regular url with querystring), and captures the value n. The regular expression for this scenario is:
itemid=(?<id>\d*[^&])|itemid/(?<id>\d*[^/])
You can add your own rules in the following file:
../DesktopModules/XSSearchSpider/XSSearchSpiderDuplicatePatterns.xml
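To make the interplay between the tabid check and the captured parameter values concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the spider's actual logic: duplicate_key is a hypothetical function, and the patterns are simplified Python adaptations of the itemid rule above (Python's re module uses (?P<name>...) and does not allow the same group name twice, so the two alternatives are folded into itemid[=/]). The layout of the xml file itself is not reproduced here.

```python
import re

# Simplified Python adaptations; the shipped rules use .NET regex syntax.
TABID = re.compile(r"tabid[=/](?P<tabid>\d+)", re.IGNORECASE)
ITEMID_RULE = re.compile(r"itemid[=/](?P<id>\d+)", re.IGNORECASE)


def duplicate_key(url, rules=(ITEMID_RULE,)):
    """Return a hashable key; two urls with equal keys count as duplicates."""
    tab = TABID.search(url)
    if tab is None:
        # No tabid in the url: fall back to comparing the url itself.
        return ("url", url.lower())
    captured = []
    for rule in rules:
        match = rule.search(url)
        captured.append(match.group("id") if match else None)
    return ("tab", tab.group("tabid"), tuple(captured))


same_page = (duplicate_key("http://www.site.com/tabid/16/default.aspx")
             == duplicate_key("http://www.site.com/resources/tabid/16/default.aspx"))
different_items = (duplicate_key("http://www.site.com/tabid/16/itemid/3/default.aspx")
                   != duplicate_key("http://www.site.com/tabid/16/itemid/4/default.aspx"))
```

With this key, two paths to the same tabid collapse into one entry, while pages that differ only in a captured parameter such as itemid are kept apart.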
Background and Credits
We needed a search engine that:
- Was fast
- Was accurate
- Was flexible
- Allowed for multiple sites to be crawled
- Allowed for non DNN sites to be crawled
- Gave the administrator full control
- Was available as a DNN module
- Did not require a DNN module to implement the ISearchable interface in order to be included in the search results
There was nothing on the market that fit the requirements, so we built our own, but we did not start from scratch.
Lucene (http://lucene.apache.org/java/docs) has been gaining popularity as an excellent open-source indexing application. Because Lucene is written in Java, we looked for a .NET version. We found DotLucene (http://www.dotlucene.net), which was exactly what we were looking for.
The only issue with Lucene is that it is an indexing engine and does not come with an integrated spider. We came across Dan Bartels' blog (http://blog.danbartels.com), in which he had modified a spider made by Jeff Heaton (http://www.jeffheaton.com) to work with DotLucene. After extensively modifying Dan's proof of concept, we developed the spider we needed.
The next step was to develop the DNN modules needed to provide a user interface and the Scheduled Task. All this resulted in Open-SearchEngine.