query-optimization

Spark count vs take and length

安稳与你 submitted on 2019-11-28 14:14:55
I'm using com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 when running Zeppelin notebooks and don't understand the difference between two operations in Spark. One operation takes a lot of time to compute, the other executes immediately. Could someone explain the difference between the two operations:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import spark.implicits._

    case class SomeClass(val someField: String)

    val timelineItems = spark
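Judging by the title, the two operations being compared are count() and take(n).length. A minimal Scala sketch of why they behave so differently, assuming timelineItems is the Dataset from the truncated snippet above:

    // count() is an action that must scan every partition of the source
    // (here, the full Cassandra table) before it can return a total.
    val total = timelineItems.count()

    // take(10) stops as soon as 10 rows are collected, usually after
    // reading only the first partition(s), so it returns almost instantly.
    val first10 = timelineItems.take(10).length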

Linq2SQL “or/and” operators (ANDed / ORed conditions)

霸气de小男生 submitted on 2019-11-28 14:13:35
Let's say we need to apply several conditions (unknown in count and nature) to select from a table called "Things". If the conditions are known, we can write:

    db.Things.Where(t => foo1 && foo2 || foo3);

But if we have to build that Where condition programmatically, I can imagine how we can apply ANDed conditions:

    IQueryable<Thing> DesiredThings = db.Things.AsQueryable();
    foreach (Condition c in AndedConditions)
        DesiredThings = DesiredThings.Where(t => GenerateCondition(c, t));

What about ORed conditions? Note: we don't want to perform union, distinct, or any other costly operation; it's desired that a query is
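A common way to OR predicates together in LINQ to SQL (not from the original post; it is essentially a hand-rolled version of LINQKit's PredicateBuilder) is to combine expression trees with Expression.OrElse. A minimal C# sketch, assuming each condition can be rendered as an Expression<Func<Thing, bool>>; GenerateConditionExpression is a hypothetical helper:

    using System;
    using System.Linq;
    using System.Linq.Expressions;

    static class PredicateCombiner
    {
        // Builds "left OR right" as one expression tree, rebinding both
        // bodies to a single shared parameter 't'.
        public static Expression<Func<T, bool>> Or<T>(
            this Expression<Func<T, bool>> left,
            Expression<Func<T, bool>> right)
        {
            var param = Expression.Parameter(typeof(T), "t");
            var body = Expression.OrElse(
                Expression.Invoke(left, param),
                Expression.Invoke(right, param));
            return Expression.Lambda<Func<T, bool>>(body, param);
        }
    }

    // Usage: fold all ORed conditions into one predicate, then Where() once.
    Expression<Func<Thing, bool>> predicate = t => false;   // neutral element for OR
    foreach (Condition c in OredConditions)
        predicate = predicate.Or(GenerateConditionExpression(c)); // hypothetical helper
    var DesiredThings = db.Things.Where(predicate);

LINQ to SQL can usually translate the InvocationExpression this produces into a single WHERE clause; other providers may need LINQKit's AsExpandable/Expand to inline it first.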

Creating a flattened table/view of a hierarchically-defined set of data

天大地大妈咪最大 submitted on 2019-11-28 14:03:05
I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy. I really like the way the data is structured, but performance is dismal when I need to know whether a record at level 8 is a child of a record at level 1. I have PL/SQL stored functions which do these lookups for me, each containing a select * from tbl start with ... connect by ... statement. This works fine when I'm querying a handful of records, but I'm now in a situation where I need to query ~10k records at once and run this function for each of them. It's taking 2-3 minutes where I need it to run in just a
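One standard way to avoid running CONNECT BY once per row (a generic sketch, not the accepted answer; it assumes tbl has id and parent_id columns) is to flatten the hierarchy into an ancestor/descendant closure once, then answer each lookup with a plain indexed join:

    -- Every (ancestor, descendant) pair in the tree; without START WITH,
    -- each row acts as a root, so all transitive pairs are enumerated.
    CREATE MATERIALIZED VIEW tbl_closure AS
    SELECT CONNECT_BY_ROOT id AS ancestor_id,
           id                 AS descendant_id,
           LEVEL - 1          AS depth
    FROM   tbl
    CONNECT BY PRIOR id = parent_id;

    -- "Is :child under :root?" is now a single lookup instead of a tree walk.
    SELECT 1
    FROM   tbl_closure
    WHERE  ancestor_id   = :root
    AND    descendant_id = :child;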

Preventing N+1 queries in Rails

☆樱花仙子☆ submitted on 2019-11-28 12:50:01
I've seen a few examples of passing an :include hash value when calling one of ActiveRecord's find methods in Rails. However, I haven't seen any examples of whether this is possible via relationship methods. For example, let's say I have the following:

    class User < ActiveRecord::Base
      has_many :user_favorites
      has_many :favorites, :through => :user_favorites
    end

    class Favorite < ActiveRecord::Base
      has_many :user_favorites
      has_many :users, :through => :user_favorites
    end

    class UserFavorite < ActiveRecord::Base
      belongs_to :user
      belongs_to :favorite
    end

All the examples I see show code like this: User
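For reference, eager loading is what prevents the N+1 pattern here; a minimal Ruby sketch (the actual query from the post is cut off, so the find call below is an assumption based on the classic examples it mentions):

    # Classic find-style eager loading: one query per table instead of N+1.
    users = User.find(:all, :include => :favorites)

    # Modern ActiveRecord spelling of the same thing:
    users = User.includes(:favorites)

    # Iterating no longer issues one favorites query per user:
    users.each { |user| user.favorites.to_a }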

MySQL MyISAM table performance problem revisited

☆樱花仙子☆ submitted on 2019-11-28 12:03:36
This question is related to this one. I have a page table with the following structure:

    CREATE TABLE mydatabase.page (
      pageid int(10) unsigned NOT NULL auto_increment,
      sourceid int(10) unsigned default NULL,
      number int(10) unsigned default NULL,
      data mediumtext,
      processed int(10) unsigned default NULL,
      PRIMARY KEY (pageid),
      KEY sourceid (sourceid)
    ) ENGINE=MyISAM AUTO_INCREMENT=9768 DEFAULT CHARSET=latin1;

The data column contains text whose size is around 80KB-200KB per record. The total
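Since the body is truncated, the sketch below shows a common remedy for this kind of table rather than the accepted answer: when one mediumtext column dominates row size, splitting it off keeps scans of the small columns from dragging 80-200KB of text through I/O for every row:

    -- Narrow, frequently-scanned metadata stays fast to range-scan...
    CREATE TABLE page_meta (
      pageid int(10) unsigned NOT NULL auto_increment,
      sourceid int(10) unsigned default NULL,
      number int(10) unsigned default NULL,
      processed int(10) unsigned default NULL,
      PRIMARY KEY (pageid),
      KEY sourceid (sourceid)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    -- ...while the large payload is fetched only by primary key when needed.
    CREATE TABLE page_data (
      pageid int(10) unsigned NOT NULL,
      data mediumtext,
      PRIMARY KEY (pageid)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;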

Query last N related rows per row

拜拜、爱过 submitted on 2019-11-28 11:44:39
I have the following query, which fetches the id of the latest N observations for each station:

    SELECT id
    FROM (
      SELECT station_id, id, created_at,
             row_number() OVER (PARTITION BY station_id ORDER BY created_at DESC) AS rn
      FROM (
        SELECT station_id, id, created_at
        FROM observations
      ) s
    ) s
    WHERE rn <= #{n}
    ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, and created_at. This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81000 records). How can I speed up the query? Assuming at
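A common speed-up for top-N-per-group in PostgreSQL (a sketch of the usual approach, not the accepted answer; it assumes a separate stations table exists and N = 10) is a composite index matching the partition and sort order, plus a LATERAL join that descends that index once per station instead of numbering every row:

    -- One composite index serves both the grouping and the ordering:
    CREATE INDEX ON observations (station_id, created_at DESC);

    SELECT o.id
    FROM   stations s
    CROSS  JOIN LATERAL (
       SELECT id
       FROM   observations
       WHERE  station_id = s.id
       ORDER  BY created_at DESC
       LIMIT  10                 -- the N latest rows for this station
    ) o;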

Broadcast not happening while joining dataframes in Spark 1.6

前提是你 submitted on 2019-11-28 11:33:32
Below is the sample code that I am running. When this Spark job runs, the DataFrame join happens as a SortMergeJoin instead of a BroadcastJoin.

    def joinedDf(sqlContext: SQLContext,
                 txnTable: DataFrame,
                 countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
      txnTable.as("df1").join(
        (countriesDfBroadcast.value).withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries"),
        $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID",
        "inner")
    }

    joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")

The broadcast join is not happening even when I specify a broadcast() hint in the
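For context (this reflects how Spark's planner works, not necessarily the accepted answer): SparkContext.broadcast wraps the DataFrame as an opaque object that the SQL optimizer never sees, so it cannot trigger a broadcast join. The planner-level hint is org.apache.spark.sql.functions.broadcast, applied to the DataFrame itself; a minimal sketch:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{broadcast, col}

    // Mark the small side with the planner hint so the optimizer can
    // choose BroadcastHashJoin instead of SortMergeJoin.
    def joinedDf(txnTable: DataFrame, countriesDf: DataFrame): DataFrame =
      txnTable.as("df1").join(
        broadcast(countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")),
        col("df1.USER_CNTRY_ID") === col("countries.DW_CNTRY_ID"),
        "inner")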

Higher cardinality column first in an index when involving a range?

浪子不回头ぞ submitted on 2019-11-28 11:30:36
    CREATE TABLE `files` (
      `did` int(10) unsigned NOT NULL DEFAULT '0',
      `filename` varbinary(200) NOT NULL,
      `ext` varbinary(5) DEFAULT NULL,
      `fsize` double DEFAULT NULL,
      `filetime` datetime DEFAULT NULL,
      PRIMARY KEY (`did`,`filename`),
      KEY `fe` (`filetime`,`ext`),  -- This?
      KEY `ef` (`ext`,`filetime`)   -- or This?
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

There are a million rows in the table. The filetimes are mostly distinct. There are a finite number of ext values. So filetime has a high cardinality and ext has a much lower cardinality. The query involves both ext and filetime:

    WHERE ext = '...'
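For reference (general B-tree guidance, not the accepted answer; the query shape below is an assumption, since the original WHERE clause is truncated): with an equality test on ext and a range on filetime, the equality column should come first, regardless of cardinality, because an index can only be read as one contiguous slice:

    -- Assumed shape of the query:
    SELECT did, filename
    FROM   files
    WHERE  ext = 'jpg'
      AND  filetime >= '2019-01-01'
      AND  filetime <  '2019-02-01';

    -- KEY ef (ext, filetime): seek to the ext='jpg' block, then scan just
    -- the matching filetime range (one contiguous slice of the index).
    -- KEY fe (filetime, ext): scan the whole filetime range and test ext
    -- on every entry, so cardinality alone does not decide column order.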

SQL multiple rows as columns (optimizing)

折月煮酒 submitted on 2019-11-28 09:53:23
I have a SQL query which gives the correct result, but performs too slowly. The query operates on the following three tables:

customers contains lots of customer data like name, address, phone etc. To simplify the table I am only using the name.
customdatas contains certain custom (not customer) data. (The tables are created in software, which is why the plural form is wrong for this table.)
customercustomdatarels associates custom data with a customer.

customers

    Id  Name  (many more columns)
    ----
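The query itself is cut off, but the usual fast pattern for turning related rows into columns (a generic sketch; the key and column names below are assumptions about the schema) is a single pass with conditional aggregation instead of one join per custom field:

    SELECT c.Id,
           c.Name,
           MAX(CASE WHEN d.Name = 'Birthday' THEN d.Value END) AS Birthday,
           MAX(CASE WHEN d.Name = 'Size'     THEN d.Value END) AS Size
    FROM   customers c
    LEFT   JOIN customercustomdatarels r ON r.CustomerId = c.Id
    LEFT   JOIN customdatas d            ON d.Id = r.CustomDataId
    GROUP  BY c.Id, c.Name;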

Fetching RAND() rows without ORDER BY RAND() in just one query

余生长醉 submitted on 2019-11-28 09:33:59
Using RAND() in MySQL to get a single random row out of a huge table is very slow:

    SELECT quote FROM quotes ORDER BY RAND() LIMIT 1

Here is an article about this issue and why this is the case. Their solution is to use two queries:

    SELECT COUNT(*) AS cnt FROM quotes
    -- use the result to generate a number between 0 and COUNT(*)
    SELECT quote FROM quotes LIMIT $generated_number, 1

I was wondering whether this would be possible in just one query. So my approach was:

    SELECT * FROM quotes
    LIMIT (ROUND((SELECT COUNT(*) FROM quotes) * RAND())), 1

But it seems MySQL does not allow any logic within
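One well-known single-query workaround (not from the truncated post; it assumes an auto-increment id column with few gaps, and it is slightly biased toward rows that follow gaps) anchors on a random id instead of a computed offset, since LIMIT cannot take expressions:

    SELECT q.quote
    FROM   quotes q
    JOIN   (SELECT CEIL(RAND() * (SELECT MAX(id) FROM quotes)) AS rid) r
           ON q.id >= r.rid
    ORDER  BY q.id
    LIMIT  1;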