pentaho

Fast alternative to split in R

余生长醉 submitted on 2019-11-29 13:39:54
I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 columns. I'm splitting/partitioning by two columns, both of character type. There appear to be ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code is matched. The resulting number of partitions is ~250K. Here is the split() line: system.time(pop_part <- split(pop, list(pop$ID, pop$code))) The partitions will then be fed into parLapply() as follows: cl <- makeCluster(detectCores()) system.time(par_pop <-
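The usual fix for this pattern is to avoid materializing ~250K list elements at all and let a grouped operation drive the iteration. A minimal sketch with data.table (pop and the per-partition function f are the asker's objects; everything else is an assumption):

```r
library(data.table)

# Convert once; data.table groups by (ID, code) internally without
# building ~250K separate data frames the way split() does.
pop_dt <- as.data.table(pop)

# Apply f to each observed ID/code partition. Unlike
# split(pop, list(pop$ID, pop$code)), unobserved ID x code
# combinations are never created.
res <- pop_dt[, f(.SD), by = .(ID, code)]
```

If split() must be kept, `split(pop, list(pop$ID, pop$code), drop = TRUE)` at least drops the empty ID x code combinations, which is often where most of the time goes.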

Add a new data type to Pentaho Kettle

自作多情 submitted on 2019-11-29 07:45:29
I am trying to add a new data type (Geometry) to Kettle. I have added a new value type to org.pentaho.di.compatibility: I added a ValueGeometry class and made the necessary modifications to ValueInterface and Value. The code compiles, but the new data type does not show up in steps like Select values. What am I missing here? I'd also appreciate it if you could point me towards the source code for these steps. Thanks. Answer 1: As of Kettle 5.0, it is possible to create a plugin to provide new Value types: http://jira.pentaho.com/browse/PDI-191 I have a plugin to add a key/value type (like java.util.Map):
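For Kettle 5+, the answer's approach registers the new type as a value-meta plugin rather than patching org.pentaho.di.compatibility. A rough sketch of the shape such a plugin takes; the annotation and base-class names follow the key/value example the answer links, are from memory, and should be verified against that project:

```java
// Hypothetical sketch of a Kettle value-type plugin (names unverified).
// The built jar goes into Kettle's plugins directory; steps such as
// Select values then discover the type through the plugin registry.
@ValueMetaPlugin(id = "43", name = "Geometry")
public class ValueMetaGeometry extends ValueMetaBase {

  // Any type id not used by the built-in types.
  public static final int TYPE_GEOMETRY = 43;

  public ValueMetaGeometry() {
    this(null);
  }

  public ValueMetaGeometry(String name) {
    super(name, TYPE_GEOMETRY);
  }
}
```

This sketch cannot compile outside a Kettle build environment; it only illustrates that the registration is an annotation plus a subclass, not edits to the core Value classes.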

Pass DB connection parameters to a Kettle (a.k.a. PDI) Table input step dynamically from Excel

久未见 submitted on 2019-11-28 23:26:51
I have a requirement that whenever I run my Kettle job, the database connection parameters must be taken dynamically from an Excel source on each run. Say I have an Excel file with the column names HostName, Username, Database, and Password; I want to pass these connection parameters to my Table input step dynamically whenever the job runs. This is what I was trying to do. Answer 1: You can achieve this by reading the DB connection parameters from a source (e.g. Excel, or in my example a CSV file), storing the parameters in variables, and using the variables in your connection settings. Proceed as follows: Create
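The "using the variables in your connection settings" part relies on PDI variable substitution: any database-connection field can hold a ${...} reference instead of a literal value. A sketch of the connection dialog fields, assuming the variable names match the Excel/CSV column headers from the question:

```
Host Name:     ${HostName}
Database Name: ${Database}
User Name:     ${Username}
Password:      ${Password}
```

A first transformation reads the sheet and feeds a Set Variables step; a later transformation in the same job then resolves these references when it opens the connection (variables set in a transformation only become visible after that transformation finishes, which is why two transformations are needed).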

COMP-3 data unpacking in Java (Embedded in Pentaho)

做~自己de王妃 submitted on 2019-11-28 10:00:27
Question: We are facing a challenge reading COMP-3 data in Java embedded inside Pentaho ETL. A few float values are stored as packed decimals in a flat file along with other plain text. While the plain text is read properly, the packed fields are not: we tried Charset.forName("CP500"), but it never worked and we still get junk characters. Since Pentaho scripts don't support COMP-3, their forums suggested going with a User Defined Java Class. Could anyone help us if you have come across and
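As background for the User Defined Java Class route: COMP-3 packs two BCD digits per byte, with the final low nibble holding the sign (0xC or 0xF positive, 0xD negative), which is why no text Charset such as CP500 can ever decode it. A minimal standalone decoder sketch; the field boundaries and the decimal scale are assumptions you would take from the COBOL copybook:

```java
import java.math.BigDecimal;

public class Comp3 {

    // Decode one COMP-3 (packed decimal) field: two BCD digits per byte,
    // last low nibble is the sign (0xC/0xF positive, 0xD negative).
    public static BigDecimal unpack(byte[] field, int scale) {
        long digits = 0;
        for (int i = 0; i < field.length; i++) {
            int high = (field[i] & 0xF0) >>> 4;
            int low = field[i] & 0x0F;
            digits = digits * 10 + high;
            if (i < field.length - 1) {
                digits = digits * 10 + low;   // another digit
            } else if (low == 0x0D) {
                digits = -digits;             // sign nibble
            }
        }
        return BigDecimal.valueOf(digits, scale);
    }

    public static void main(String[] args) {
        // 0x12 0x34 0x5C -> digits 12345, positive, scale 2 -> 123.45
        System.out.println(unpack(new byte[]{0x12, 0x34, 0x5C}, 2));
        // 0x12 0x3D -> digits 123, negative -> -123
        System.out.println(unpack(new byte[]{0x12, 0x3D}, 0));
    }
}
```

Inside a User Defined Java Class step the same loop would run per row on the raw bytes of the field, with the rest of the record still read as text.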

Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?

柔情痞子 submitted on 2019-11-27 19:46:36
I need to load data from a single file with 100,000+ records into multiple tables in MySQL while maintaining the relationships defined in the file/tables; that is, the relationships already match. The solution should work on the latest version of MySQL and needs to use the InnoDB engine; MyISAM does not support foreign keys. I am completely new to Pentaho Data Integration (a.k.a. Kettle) and any pointers would be appreciated. I might add that it is a requirement that the foreign key constraints are NOT disabled, since it's my understanding that if there is something wrong with the database's
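With the constraints left enabled, the load order is forced: parent rows must be committed (or at least inserted in the same transaction) before any child rows that reference them. A sketch of what InnoDB requires, with hypothetical table names standing in for whatever the file defines:

```sql
-- Hypothetical parent/child pair; names are illustrative only.
-- With FK checks on, inserting the child first would fail with
-- a foreign key constraint error, so parents always go first.
START TRANSACTION;

INSERT INTO customers (id, name)
VALUES (1, 'Acme');                       -- parent row

INSERT INTO orders (id, customer_id, total)
VALUES (10, 1, 9.99);                     -- child row referencing id = 1

COMMIT;
```

In Kettle terms this usually means one Table output per target table, with the transformation (or job) sequenced so the parent table's output finishes before the child table's output starts.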

Problems connecting Pentaho Kettle/Spoon to Heroku PostgreSQL using SSL

你离开我真会死。 submitted on 2019-11-27 16:45:02
Question: I'm trying to connect Spoon to a Heroku PostgreSQL instance using the JDBC driver that came with Spoon. Heroku requires SSL for its standalone PostgreSQL instances, which I have enabled. I'm able to connect to the database over SSL with other client software, so this seems to be specific to Java/JDBC. I don't know enough about Java to troubleshoot this, so I'm hoping someone out there has been through this before. I get the following rather verbose error message, which mentions a
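The usual culprit in this setup is the PostgreSQL JDBC driver trying to validate Heroku's server certificate, and the usual workaround is to pass SSL options to the driver. In Spoon these can go in the connection's Options panel (parameter name / value pairs) or directly in a generic JDBC URL; a sketch, with the host and database left as placeholders:

```
jdbc:postgresql://<host>:5432/<database>?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory
```

NonValidatingFactory skips certificate validation entirely, so it trades security for connectivity; newer pgjdbc versions also accept `sslmode=require`, which is the better-documented spelling of the same intent.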

How to get the last 7 days of data from the current datetime in SQL Server

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 13:55:35
Hi, I am loading data from table A in SQL Server to MySQL using Pentaho, and while loading I need only the last 7 days of data from the SQL Server table. In SQL Server the createddate column has type datetime, and in MySQL the created_on column has type timestamp. I used the query below, but I am getting only 5 days of data. Please help me with this issue: select id, NewsHeadline as news_headline, NewsText as news_text, state, CreatedDate as created_on from News WHERE CreatedDate BETWEEN GETDATE()-7 AND GETDATE() order by createddate DESC Answer 1: Try something like: SELECT id, NewsHeadline as news
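A likely cause of the missing days is that GETDATE()-7 carries the current time of day, so rows from earlier in the day on the boundary date fall outside the BETWEEN range. A sketch that compares whole days instead, using the column and table names from the question:

```sql
SELECT id,
       NewsHeadline AS news_headline,
       NewsText     AS news_text,
       state,
       CreatedDate  AS created_on
FROM News
-- CAST(... AS DATE) drops the time-of-day, so the full first day
-- of the 7-day window is included, not just the part after "now".
WHERE CreatedDate >= CAST(DATEADD(DAY, -7, GETDATE()) AS DATE)
ORDER BY CreatedDate DESC;
```

If only 5 days still come back after this change, the remaining gap is in the data itself rather than in the date arithmetic.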