query-optimization

Spark count vs take and length

安稳与你 submitted on 2019-11-28 14:14:55
I'm using com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 when running Zeppelin notebooks and don't understand the difference between two operations in Spark. One operation takes a lot of time to compute, the other executes immediately. Could someone explain the difference between the two operations:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import spark.implicits._

    case class SomeClass(val someField: String)

    val timelineItems = spark
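Judging by the title, the two operations being compared are count() and take(n).length. A minimal Scala sketch of why they behave so differently, assuming timelineItems is the Dataset from the truncated snippet above:

    // count() is an action that must scan every partition of the source
    // (here, the full Cassandra table) before it can return a total.
    val total = timelineItems.count()

    // take(10) stops as soon as 10 rows are collected, usually after
    // reading only the first partition(s), so it returns almost instantly.
    val first10 = timelineItems.take(10).length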

Linq2SQL “or/and” operators (ANDed / ORed conditions)

霸气de小男生 submitted on 2019-11-28 14:13:35
Let's say we need to apply several conditions (unknown in count and nature) to select from a table called "Things". If the conditions are known, we can write:

    db.Things.Where(t => foo1 && foo2 || foo3);

But if we have to build that Where condition programmatically, I can imagine how we can apply ANDed conditions:

    IQueryable<Thing> DesiredThings = db.Things.AsQueryable();
    foreach (Condition c in AndedConditions)
        DesiredThings = DesiredThings.Where(t => GenerateCondition(c, t));

What about ORed conditions? Note: we don't want to perform union, distinct, or any other costly operation; it's desired that a query is
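A common way to OR predicates together in LINQ to SQL (not from the original post; it is essentially a hand-rolled version of LINQKit's PredicateBuilder) is to combine expression trees with Expression.OrElse. A minimal C# sketch, assuming each condition can be rendered as an Expression<Func<Thing, bool>>; GenerateConditionExpression is a hypothetical helper:

    using System;
    using System.Linq;
    using System.Linq.Expressions;

    static class PredicateCombiner
    {
        // Builds "left OR right" as one expression tree, rebinding both
        // bodies to a single shared parameter 't'.
        public static Expression<Func<T, bool>> Or<T>(
            this Expression<Func<T, bool>> left,
            Expression<Func<T, bool>> right)
        {
            var param = Expression.Parameter(typeof(T), "t");
            var body = Expression.OrElse(
                Expression.Invoke(left, param),
                Expression.Invoke(right, param));
            return Expression.Lambda<Func<T, bool>>(body, param);
        }
    }

    // Usage: fold all ORed conditions into one predicate, then Where() once.
    Expression<Func<Thing, bool>> predicate = t => false;   // neutral element for OR
    foreach (Condition c in OredConditions)
        predicate = predicate.Or(GenerateConditionExpression(c)); // hypothetical helper
    var DesiredThings = db.Things.Where(predicate);

LINQ to SQL can usually translate the InvocationExpression this produces into a single WHERE clause; other providers may need LINQKit's AsExpandable/Expand to inline it first.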

Creating a flattened table/view of a hierarchically-defined set of data

天大地大妈咪最大 submitted on 2019-11-28 14:03:05
I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy. I really like the way the data is structured, but performance is dismal when I need to know whether a record at level 8 is a child of a record at level 1. I have PL/SQL stored functions which do these lookups for me, each containing a select * from tbl start with ... connect by ... statement. This works fine when I'm querying a handful of records, but I'm now in a situation where I need to query ~10k records at once and run this function for each of them. It's taking 2-3 minutes where I need it to run in just a
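One standard way to avoid running CONNECT BY once per row (a generic sketch, not the accepted answer; it assumes tbl has id and parent_id columns) is to flatten the hierarchy into an ancestor/descendant closure once, then answer each lookup with a plain indexed join:

    -- Every (ancestor, descendant) pair in the tree; without START WITH,
    -- each row acts as a root, so all transitive pairs are enumerated.
    CREATE MATERIALIZED VIEW tbl_closure AS
    SELECT CONNECT_BY_ROOT id AS ancestor_id,
           id                 AS descendant_id,
           LEVEL - 1          AS depth
    FROM   tbl
    CONNECT BY PRIOR id = parent_id;

    -- "Is :child under :root?" is now a single lookup instead of a tree walk.
    SELECT 1
    FROM   tbl_closure
    WHERE  ancestor_id   = :root
    AND    descendant_id = :child;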

Preventing N+1 queries in Rails

☆樱花仙子☆ submitted on 2019-11-28 12:50:01
I've seen a few examples of passing an :include hash value when calling one of ActiveRecord's find methods in Rails. However, I haven't seen any examples of whether this is possible via relationship methods. For example, let's say I have the following:

    class User < ActiveRecord::Base
      has_many :user_favorites
      has_many :favorites, :through => :user_favorites
    end

    class Favorite < ActiveRecord::Base
      has_many :user_favorites
      has_many :users, :through => :user_favorites
    end

    class UserFavorite < ActiveRecord::Base
      belongs_to :user
      belongs_to :favorite
    end

All the examples I see show code like this: User
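For reference, eager loading is what prevents the N+1 pattern here; a minimal Ruby sketch (the actual query from the post is cut off, so the find call below is an assumption based on the classic examples it mentions):

    # Classic find-style eager loading: one query per table instead of N+1.
    users = User.find(:all, :include => :favorites)

    # Modern ActiveRecord spelling of the same thing:
    users = User.includes(:favorites)

    # Iterating no longer issues one favorites query per user:
    users.each { |user| user.favorites.to_a }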

MySQL MyISAM table performance problem revisited

☆樱花仙子☆ submitted on 2019-11-28 12:03:36
This question is related to this one. I have a page table with the following structure:

    CREATE TABLE mydatabase.page (
      pageid int(10) unsigned NOT NULL auto_increment,
      sourceid int(10) unsigned default NULL,
      number int(10) unsigned default NULL,
      data mediumtext,
      processed int(10) unsigned default NULL,
      PRIMARY KEY (pageid),
      KEY sourceid (sourceid)
    ) ENGINE=MyISAM AUTO_INCREMENT=9768 DEFAULT CHARSET=latin1;

The data column contains text whose size is around 80KB-200KB per record. The total
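Since the body is truncated, the sketch below shows a common remedy for this kind of table rather than the accepted answer: when one mediumtext column dominates row size, splitting it off keeps scans of the small columns from dragging 80-200KB of text through I/O for every row:

    -- Narrow, frequently-scanned metadata stays fast to range-scan...
    CREATE TABLE page_meta (
      pageid int(10) unsigned NOT NULL auto_increment,
      sourceid int(10) unsigned default NULL,
      number int(10) unsigned default NULL,
      processed int(10) unsigned default NULL,
      PRIMARY KEY (pageid),
      KEY sourceid (sourceid)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    -- ...while the large payload is fetched only by primary key when needed.
    CREATE TABLE page_data (
      pageid int(10) unsigned NOT NULL,
      data mediumtext,
      PRIMARY KEY (pageid)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;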

Query last N related rows per row

拜拜、爱过 submitted on 2019-11-28 11:44:39
I have the following query, which fetches the id of the latest N observations for each station:

    SELECT id
    FROM (
      SELECT station_id, id, created_at,
             row_number() OVER (PARTITION BY station_id ORDER BY created_at DESC) AS rn
      FROM (
        SELECT station_id, id, created_at
        FROM observations
      ) s
    ) s
    WHERE rn <= #{n}
    ORDER BY station_id, created_at DESC;

I have indexes on id, station_id, and created_at. This is the only solution I have come up with that can fetch more than a single record per station. However, it is quite slow (154.0 ms for a table of 81000 records). How can I speed up the query? Assuming at
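A common speed-up for top-N-per-group in PostgreSQL (a sketch of the usual approach, not the accepted answer; it assumes a separate stations table exists and N = 10) is a composite index matching the partition and sort order, plus a LATERAL join that descends that index once per station instead of numbering every row:

    -- One composite index serves both the grouping and the ordering:
    CREATE INDEX ON observations (station_id, created_at DESC);

    SELECT o.id
    FROM   stations s
    CROSS  JOIN LATERAL (
       SELECT id
       FROM   observations
       WHERE  station_id = s.id
       ORDER  BY created_at DESC
       LIMIT  10                 -- the N latest rows for this station
    ) o;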

Broadcast not happening while joining dataframes in Spark 1.6

前提是你 submitted on 2019-11-28 11:33:32
Below is the sample code that I am running. When this Spark job runs, the DataFrame join happens as a SortMergeJoin instead of a BroadcastJoin.

    def joinedDf(sqlContext: SQLContext,
                 txnTable: DataFrame,
                 countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
      txnTable.as("df1").join(
        (countriesDfBroadcast.value).withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries"),
        $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID",
        "inner")
    }

    joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")

The broadcast join is not happening even when I specify a broadcast() hint in the
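For context (this reflects how Spark's planner works, not necessarily the accepted answer): SparkContext.broadcast wraps the DataFrame as an opaque object that the SQL optimizer never sees, so it cannot trigger a broadcast join. The planner-level hint is org.apache.spark.sql.functions.broadcast, applied to the DataFrame itself; a minimal sketch:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{broadcast, col}

    // Mark the small side with the planner hint so the optimizer can
    // choose BroadcastHashJoin instead of SortMergeJoin.
    def joinedDf(txnTable: DataFrame, countriesDf: DataFrame): DataFrame =
      txnTable.as("df1").join(
        broadcast(countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")),
        col("df1.USER_CNTRY_ID") === col("countries.DW_CNTRY_ID"),
        "inner")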

Higher cardinality column first in an index when involving a range?

浪子不回头ぞ submitted on 2019-11-28 11:30:36
    CREATE TABLE `files` (
      `did` int(10) unsigned NOT NULL DEFAULT '0',
      `filename` varbinary(200) NOT NULL,
      `ext` varbinary(5) DEFAULT NULL,
      `fsize` double DEFAULT NULL,
      `filetime` datetime DEFAULT NULL,
      PRIMARY KEY (`did`,`filename`),
      KEY `fe` (`filetime`,`ext`),  -- This?
      KEY `ef` (`ext`,`filetime`)   -- or This?
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

There are a million rows in the table. The filetimes are mostly distinct. There are a finite number of ext values. So filetime has a high cardinality and ext has a much lower cardinality. The query involves both ext and filetime:

    WHERE ext = '...'
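For reference (general B-tree guidance, not the accepted answer; the query shape below is an assumption, since the original WHERE clause is truncated): with an equality test on ext and a range on filetime, the equality column should come first, regardless of cardinality, because an index can only be read as one contiguous slice:

    -- Assumed shape of the query:
    SELECT did, filename
    FROM   files
    WHERE  ext = 'jpg'
      AND  filetime >= '2019-01-01'
      AND  filetime <  '2019-02-01';

    -- KEY ef (ext, filetime): seek to the ext='jpg' block, then scan just
    -- the matching filetime range (one contiguous slice of the index).
    -- KEY fe (filetime, ext): scan the whole filetime range and test ext
    -- on every entry, so cardinality alone does not decide column order.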

SQL multiple rows as columns (optimizing)

折月煮酒 submitted on 2019-11-28 09:53:23
I have a SQL query which gives the correct result, but performs too slowly. The query operates on the following three tables:

customers contains lots of customer data like name, address, phone etc. To simplify the table I am only using the name.
customdatas contains certain custom (not customer) data. (The tables are created in software, which is why the plural form is wrong for this table.)
customercustomdatarels associates custom data with a customer.

customers

    Id  Name  (many more columns)
    ----
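The query itself is cut off, but the usual fast pattern for turning related rows into columns (a generic sketch; the key and column names below are assumptions about the schema) is a single pass with conditional aggregation instead of one join per custom field:

    SELECT c.Id,
           c.Name,
           MAX(CASE WHEN d.Name = 'Birthday' THEN d.Value END) AS Birthday,
           MAX(CASE WHEN d.Name = 'Size'     THEN d.Value END) AS Size
    FROM   customers c
    LEFT   JOIN customercustomdatarels r ON r.CustomerId = c.Id
    LEFT   JOIN customdatas d            ON d.Id = r.CustomDataId
    GROUP  BY c.Id, c.Name;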

Fetching RAND() rows without ORDER BY RAND() in just one query

余生长醉 submitted on 2019-11-28 09:33:59
Using RAND() in MySQL to get a single random row out of a huge table is very slow:

    SELECT quote FROM quotes ORDER BY RAND() LIMIT 1

Here is an article about this issue and why this is the case. Their solution is to use two queries:

    SELECT COUNT(*) AS cnt FROM quotes
    -- use the result to generate a number between 0 and COUNT(*)
    SELECT quote FROM quotes LIMIT $generated_number, 1

I was wondering whether this would be possible in just one query. So my approach was:

    SELECT * FROM quotes
    LIMIT (ROUND((SELECT COUNT(*) FROM quotes) * RAND())), 1

But it seems MySQL does not allow any logic within
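One well-known single-query workaround (not from the truncated post; it assumes an auto-increment id column with few gaps, and it is slightly biased toward rows that follow gaps) anchors on a random id instead of a computed offset, since LIMIT cannot take expressions:

    SELECT q.quote
    FROM   quotes q
    JOIN   (SELECT CEIL(RAND() * (SELECT MAX(id) FROM quotes)) AS rid) r
           ON q.id >= r.rid
    ORDER  BY q.id
    LIMIT  1;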