Open-SearchEngine v1.7
Enterprise Search Engine for DNN Powered by Lucene.Net

Open-SearchEngine represents a big step forward in providing DotNetNuke with a true enterprise search engine capable of indexing and searching any site the administrator chooses.

Because it is based on Lucene and includes a true web spider, it can index any site, whether it is DNN-based or not. Administrators and developers therefore do not have to worry about whether a module implements the ISearchable interface in order for its content to appear in the search results.

To help obtain accurate search results, Open-SearchEngine provides a rich query syntax similar to the one offered by Google (see Query Syntax).

Open-SearchEngine stores its indexes in files, so it does not consume database space or add database overhead.

Demo

The live demo can be seen at www.opendnn.com under the Open-SearchEngine Demo tab.

What Is New in Version 1.7

Other Features

How It Does It

The following are visual examples of the modules.

Fig 1. This is a snapshot of Open-SearchEngine in action. The live demo can be seen at www.opendnn.com

Fig 2. shows the Open-SearchInput module and the various ways in which the administrator can alter its appearance. The module can also appear without a container and as a Skin Object.

 

Fig 3. shows the module settings specific to the SearchInput module.

Fig 4. shows the Open-SearchResults module, after a search has been executed. Notice that the look and feel of the search results are like the familiar Google interface. Of interest is the fact that the Open-SearchResults module has been placed in the same page as the SearchInput module, but it is invisible until a search is executed. This option can be turned on or off, and the SearchResults module can be placed on any page.

Fig 5. shows the settings specific to the SearchResults module. The administrator has ample control over what is shown and in what format. The "Results Scope" option allows the administrator to restrict the sites that will be searched: a list of all spidered sites appears, and only the selected ones are searched. The "Title Link Target" option determines whether the details of a search result open on the same page or in a separate window.

Fig 6. shows the Scheduled Task (SearchSpider) that is created when Open-SearchEngine is installed, as well as the Open-SearchSpiderSettings module.

The SearchSpiderSettings module allows the administrator to:

You will notice that we have put the Open-SearchSpiderSettings module on the same page where Scheduled Tasks are administered. This provides a central place to manage the spider.

 

Tips

Viewing Index activity

You can view the activity of the spider as it crawls a site in the Scheduler Status screens. To access the Scheduler Status screen, go to Host/Scheduler and then select the menu option as indicated in the image below.

What you will see is shown in the image below: as the spider crawls, it reports on every page that it visits.

 

Forcing the scheduled task to start

On some occasions, you may want to force the spider's scheduled task to run. To do so, go to Host/Schedule and click the edit icon of the scheduled task, as shown in the image below.

Once you are in the Edit Schedule screen, click on Update twice, as in the image below. The scheduled task will automatically start.

 

Setup


Setup of PDF indexing

Setup of Skin Object

<%@ Register TagPrefix="xs" TagName="OPENSEARCH" Src="~/DesktopModules/XSSearchInput/XSSearchInputSkinObject.ascx" %>
<xs:OPENSEARCH runat="server" id="xsOpenSearch" ShowGoImage="true"/>
[OPENSEARCH]


Avoid Duplicates

This section is an explanation of how duplicates are avoided. You do not need to make any changes to the configuration of the spider in order for it to work correctly in most scenarios.

By default, a spider will crawl every single page that it encounters. Once a page is crawled, its contents are stripped of HTML tags and placed in an index, ready to be searched.
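As a rough illustration of the tag-stripping step, here is a minimal Python sketch. It is not the spider's actual code, which is .NET and not shown in this document. The names TextExtractor and strip_html are made up for this example, and skipping script and style content is an assumption about what a crawler would typically want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of a page, ignoring tags (hypothetical helper)."""

    # Assumption: the contents of these elements are not worth indexing.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML document, ready to be indexed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling strip_html on a crawled page yields the plain text that would then be handed to the indexer.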

In principle, the same page is not crawled twice. Open-SearchEngine determines whether a page has already been crawled by looking at its URL: if the URL is the same as one that was previously indexed, the page is bypassed.

This check alone is not sufficient in DNN, because there are many different paths (URLs) to the same page.

For example, http://www.site.com/tabid/16/default is the same page as http://www.site.com/resources/tabid/16/default.aspx.

Since the spider cannot know that the first URL in the example above is exactly the same page as the second, a duplicate could occur.

To avoid this type of duplicate, the spider implements, by default, a configurable method of pre-emptive exclusion. Unless you specify otherwise, pages that have the same tabid are indexed only once.

On its own, the tabid rule would be too restrictive, since there are many scenarios where the same page (tabid) yields different results depending on additional parameters appended to the URL. An example is the DNN forum, which uses various arguments to display different threads, but always on the same page.
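The tabid rule and its limitation can be sketched in a few lines of Python. This is an illustration only, not the spider's implementation: the names page_key and SeenPages are hypothetical, and the threadid parameter in the usage example is an assumed stand-in for whatever arguments the forum actually uses.

```python
import re

# Matches tabid/16 (friendly URLs) as well as tabid=16 (query string).
TABID = re.compile(r"tabid[/=](\d+)", re.IGNORECASE)


def page_key(url):
    """Identify a page by its tabid, falling back to the URL itself."""
    match = TABID.search(url)
    if match:
        return ("tabid", match.group(1))
    return ("url", url.lower())


class SeenPages:
    """Remembers which pages were already indexed (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def should_index(self, url):
        key = page_key(url)
        if key in self._seen:
            return False  # same tabid seen before: treated as a duplicate
        self._seen.add(key)
        return True


seen = SeenPages()
print(seen.should_index("http://www.site.com/tabid/16/default"))                 # True
print(seen.should_index("http://www.site.com/resources/tabid/16/default.aspx"))  # False

# The limitation: two different forum threads on the same tab collapse into one.
print(seen.should_index("http://www.site.com/tabid/20/threadid/1/default.aspx"))  # True
print(seen.should_index("http://www.site.com/tabid/20/threadid/2/default.aspx"))  # False
```

The last call shows why a tabid-only key is too coarse: the second thread is skipped even though its content differs.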

To make the spider as flexible as possible, the rules that tell the spider what to consider a duplicate and what not have been extracted to an XML file, and the following rules have already been implemented:

Each rule is defined as a regular expression that recognizes a series of parameters and captures their values.

Example: if your module uses itemid to display different content, you should create a regular expression that recognizes this pattern in the URL, either itemid/n (friendly URLs) or itemid=n (regular URL with a query string), and captures the value n. The regular expression for this scenario is:

itemid=(?<id>\d*[^&])|itemid/(?<id>\d*[^/])

You can add your rules in the following file:
../DesktopModules/XSSearchSpider/XSSearchSpiderDuplicatePatterns.xml
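To show how such a rule changes what counts as a duplicate, here is a small Python sketch. It rests on several assumptions: the rule is rewritten for Python as itemid[/=](\d+), because Python does not accept the repeated group name used in the .NET pattern above; the rules are held in a plain list rather than read from the XML file, whose format is not described here; and the names RULES and duplicate_key are hypothetical.

```python
import re

TABID = re.compile(r"tabid[/=](\d+)", re.IGNORECASE)

# Hypothetical stand-in for the rules in the XML file. Each rule captures
# the value of one parameter that changes the content of a page.
RULES = [re.compile(r"itemid[/=](\d+)", re.IGNORECASE)]


def duplicate_key(url, rules=RULES):
    """Build the key under which a URL is considered 'the same page'.

    Two URLs are duplicates when they share the tabid and every value
    captured by the rules. URLs without a tabid are compared as-is.
    """
    tab = TABID.search(url)
    if tab is None:
        return ("url", url.lower())
    captured = []
    for rule in rules:
        captured.extend(m.group(1) for m in rule.finditer(url))
    return ("tabid", tab.group(1), tuple(captured))


friendly = "http://www.site.com/tabid/16/itemid/5/default.aspx"
classic = "http://www.site.com/resources/tabid/16/default.aspx?itemid=5"
other = "http://www.site.com/tabid/16/itemid/6/default.aspx"

print(duplicate_key(friendly) == duplicate_key(classic))  # True: same item
print(duplicate_key(friendly) == duplicate_key(other))    # False: different item
```

With the rule in place, the friendly and query-string forms of the same item still collapse into one entry, while a different itemid on the same tab is indexed separately.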


Background and Credits

We needed a search engine that would be:

There was nothing on the market that fit the requirements, so we built our own, but we did not start from scratch.

Lucene (http://lucene.apache.org/java/docs) has been gaining popularity as an excellent open-source indexing application. Since Lucene is written in Java, we looked for a .NET version. We found DotLucene (http://www.dotlucene.net), which was exactly what we were looking for.

The only issue with Lucene is that it is an indexing engine and does not come with an integrated spider. We came across Dan Bartels' blog (http://blog.danbartels.com), in which he had modified a spider made by Jeff Heaton (http://www.jeffheaton.com) to work with DotLucene. After extensively modifying Dan's proof of concept, we had the spider we needed.

The next step was to develop the DNN modules needed to provide a user interface and the Scheduled Task. All this resulted in Open-SearchEngine.