solr

Apache Nutch doesn't crawl website

久未见 posted on 2020-01-05 07:14:36
Question: I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt: User-Agent: * Disallow: / Is there any way to crawl this website with Apache Nutch?
Answer 1: In nutch-site.xml, set protocol.plugin.check.robots to false. Alternatively, you can comment out the code where the robots check is done. In Fetcher.java, lines 605-614 perform the check. Comment out that entire block: if (!rules.isAllowed(fit.u)) { // unblock fetchQueues.finishFetchItem(fit, true); if (LOG

SOLR mm and phrase queries not working after upgrading from SOLR 4 to SOLR 6

蹲街弑〆低调 posted on 2020-01-05 05:57:09
Question: I'm testing a new Solr 6 server (6.2.0); we have been running 4.3.1 for some time, and it was time for an upgrade. One thing I've noticed is that the mm (minimum-should-match) parameter does not seem to work the way it used to (or it's being ignored), and phrase searches are not working properly either. For example, searching for "tabletop scanning electron microscope" (including quotes) in our index should return two matching documents, but I get zero matches. The search is set to use edismax.

Profanity filtering in Solr

隐身守侯 posted on 2020-01-05 05:45:50
Question: I am serving some web data to my site using Apache Solr 4.10.3. I have to block profanity. How do I block profanity in search? I am also unsure where to deploy this filter: should I apply the profanity filter at document indexing time or at document search time?
Answer 1: You have two possibilities: (1) Don't send the document to Solr in the first place (filter it in your code). (2) Implement a custom UpdateRequestProcessor: https://cwiki.apache.org/confluence/display/solr/Update+Request
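The first option in the answer (filtering in your own code before anything reaches Solr) could be sketched roughly as follows. This is only an illustration: the blocklist contents, the `content` field name, and both function names are hypothetical, and a real deployment would need a maintained word list and language-aware tokenization.

```python
import re

# Hypothetical blocklist; a real deployment would load a maintained word list.
BLOCKLIST = {"darn", "heck"}


def contains_profanity(text, blocklist=BLOCKLIST):
    """Return True if any lowercase word token in `text` is on the blocklist."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(token in blocklist for token in tokens)


def filter_documents(docs, field="content"):
    """Keep only documents whose `field` (an assumed field name) is clean."""
    return [doc for doc in docs if not contains_profanity(doc.get(field, ""))]


docs = [
    {"id": "1", "content": "A perfectly polite page."},
    {"id": "2", "content": "Well, darn it!"},
]
clean_docs = filter_documents(docs)  # only these would be sent to Solr
```

Matching whole tokens rather than substrings avoids flagging harmless words that merely contain a blocked word.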

Solr - Aggregate Term Frequency by Group

断了今生、忘了曾经 posted on 2020-01-05 05:39:07
Question: Let's say I have the following set of grouped websites crawled and indexed in Solr (latest): { "id":"1", "domain": "http://www.category1website1.com", "domainGroup": "Group 1" },{ "id":"2", "domain": "http://www.category1website2.com", "domainGroup": "Group 1" },{ "id":"3", "domain": "http://www.category2website1.com", "domainGroup": "Group 2" } I'm looking for a result set that will give me the term frequency in each individual domain but also the aggregated term frequency of that search
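To make the desired result shape concrete, here is a small client-side sketch that sums a per-document term frequency into per-domain and per-group totals. The three documents come from the question; the `tf` values and the `aggregate_tf` helper are invented for illustration, and how the per-document frequency is obtained (for example via a function query such as `termfreq`) is left as an assumption.

```python
from collections import defaultdict

# Documents from the question, each with a hypothetical term frequency "tf"
# for a single search term. The tf numbers are made up for this sketch.
docs = [
    {"id": "1", "domain": "http://www.category1website1.com", "domainGroup": "Group 1", "tf": 4},
    {"id": "2", "domain": "http://www.category1website2.com", "domainGroup": "Group 1", "tf": 6},
    {"id": "3", "domain": "http://www.category2website1.com", "domainGroup": "Group 2", "tf": 3},
]


def aggregate_tf(docs):
    """Return (per-domain totals, per-group totals) of the 'tf' values."""
    per_domain = defaultdict(int)
    per_group = defaultdict(int)
    for doc in docs:
        per_domain[doc["domain"]] += doc["tf"]
        per_group[doc["domainGroup"]] += doc["tf"]
    return dict(per_domain), dict(per_group)


per_domain, per_group = aggregate_tf(docs)
```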

How to change defaults.last_index_time format in solr

我与影子孤独终老i posted on 2020-01-05 04:52:45
Question: I am using Apache Solr 6.2 and need a timestamp for the defaults.last_index_time field, or a separate field for a core config. The default value was defaults.last_index_time=2016-09-19 14:55:17. I need something like defaults.last_index_time=1474297085558
Answer 1: Use a property writer to change the last_index_time format in Solr. Add the element to the DIH configuration file, directly under the dataConfig element: <propertyWriter dateFormat="yyyy-MM-dd'T'HH:mm:ss.SSSXXX" type="SimplePropertiesWriter" /> In
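If the consumer of the properties file can do the conversion itself, a value in the default `yyyy-MM-dd HH:mm:ss` layout can be turned into epoch milliseconds on the client side. The sketch below is one way to do that; the `to_epoch_millis` name is hypothetical, and treating the stored value as UTC is an assumption (the file may well hold server-local time, in which case a different `tz` should be passed).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value, fmt="%Y-%m-%d %H:%M:%S", tz=timezone.utc):
    """Parse `value` with `fmt`, interpret it in `tz`, return epoch millis."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=tz)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


millis = to_epoch_millis("2016-09-19 14:55:17")
```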

Rails Sunspot filter / facet issue with URL? Deleting old “get params”

▼魔方 西西 posted on 2020-01-05 04:45:07
Question: I'm trying to understand and set up the Sunspot gem in my Rails 4.0 project. I want to implement a better search in my open-source project, BTC-Stores, but I'm a bit confused about how to do that with Sunspot. Currently, I have the following architecture (model): Item: # Relationship with Category belongs_to :category accepts_nested_attributes_for :category searchable do text :name, :description integer :category_id string :sort_name do # why I have this here? I dont understand this code name

SOLR - delete documents depending on index size

只愿长相守 posted on 2020-01-05 03:58:06
Question: I want to purge the Solr index whenever it occupies more than 10% of the total disk space. The purge should delete the oldest documents until the index is back below 10% of the total space. How can I find these oldest documents? I thought of finding the size of a single document and using that as the basis for determining how many docs to delete (sort by date asc and rows = N). Is there another way to go about it? Thanks.
Answer 1: When you are indexing
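The sizing idea described in the question (estimate an average document size, then work out how many of the oldest documents to remove) comes down to a little arithmetic. The sketch below shows only that calculation; the function name and parameters are hypothetical, the inputs would have to come from your own monitoring, and the average-size estimate is rough because document sizes vary and disk space is typically reclaimed only after segments are merged.

```python
def docs_to_delete(index_bytes, disk_bytes, num_docs, max_percent=10):
    """Estimate how many documents to drop so the index fits its disk budget.

    Assumes every document occupies roughly index_bytes / num_docs bytes.
    """
    if num_docs <= 0 or index_bytes <= 0:
        return 0
    budget = disk_bytes * max_percent // 100
    excess = index_bytes - budget
    if excess <= 0:
        return 0
    # Ceiling of excess / (index_bytes / num_docs), in integer arithmetic.
    needed = -(-excess * num_docs // index_bytes)
    return min(needed, num_docs)


n = docs_to_delete(index_bytes=150, disk_bytes=1000, num_docs=300)
```

The resulting `n` would then be used as the `rows` value of a query sorted by date ascending, as the question suggests.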

Solr faceting considering the availability of product at attribute combination level for e-commerce merchandise like Garment

风流意气都作罢 posted on 2020-01-05 03:48:10
Question: We are using Apache Solr to power search and faceting for our e-commerce website. Our faceting filter works fine except for products that have multiple attribute combinations (variant options), each of which is a different SKU: for example, a T-shirt that comes in multiple colors and sizes. Currently, we have a facet that filters by Color as well as by Size; however, it does not consider the availability of the product at the combination level due to the fact that it

Solr result grouping error: Unexpected docvalues type SORTED_SET for field 'vendor' (expected=SORTED)

只谈情不闲聊 posted on 2020-01-05 03:08:23
Question: I have a Solr schema like this: <fields> <field name="id" type="string" indexed="false" stored="true" required="true" /> <field name="product" type="string" indexed="true" stored="true" required="true" /> <field name="vendor" type="string" indexed="true" stored="true" required="true" /> <field name="language" type="string" indexed="true" stored="true" required="true" /> <field name="TotalInvoices" type="float" indexed="true" stored="true" required="true"/> </fields> I am querying the schema

Nested functional range query with OR

左心房为你撑大大i posted on 2020-01-05 02:54:26
Question: I'm trying to do a function query with OR/AND; however, there seem to be limitations in doing so. Here is the SQL equivalent of the query I'm trying to perform: ABS(col1-:val1)<1 OR (col1 IS NULL AND ABS(col2-:val1)<1) Here is my current working fq query to grab the documents with an ABS difference of <1: fq={!frange l=0 u=1}abs(sub(col1,val1)) Here is what I'm trying to execute but can't without error: fq={!frange l=0 u=1}abs(sub(col1,val1)) OR (-col1:[* TO *] AND {
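For reference, the SQL condition in the question can be written as a plain predicate, which makes the intended fallback from col1 to col2 explicit. This is only a restatement of the logic in Python, not a Solr query; the `matches` name is hypothetical and a missing field is modelled as `None`. As far as I know, frange includes both bounds unless told otherwise, so `l=0 u=1` would also admit a difference of exactly 1, unlike the strict `<1` here.

```python
def matches(col1, col2, val1):
    """ABS(col1 - val1) < 1 OR (col1 IS NULL AND ABS(col2 - val1) < 1)."""
    if col1 is not None:
        # col1 is present, so only the first branch can be true.
        return abs(col1 - val1) < 1
    # col1 is missing: fall back to col2, if it exists.
    return col2 is not None and abs(col2 - val1) < 1


rows = [(5.5, None), (7, 5), (None, 5.2), (None, None)]
selected = [row for row in rows if matches(row[0], row[1], 5)]
```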