bigdata

How do I use multiple consumers in Kafka?

耗尽温柔 submitted on 2019-12-20 08:57:16
Question: I am a new student studying Kafka and I've run into some fundamental issues with understanding multiple consumers that articles, documentation, etc. have not been too helpful with so far. One thing I have tried to do is write my own high-level Kafka producer and consumer and run them simultaneously, publishing 100 simple messages to a topic and having my consumer retrieve them. I have managed to do this successfully, but when I try to introduce a second consumer to consume from the same
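
The excerpt is cut off, but the usual crux of this question is Kafka's consumer-group model: consumers that share a group id split the topic's partitions between them, while consumers in different groups each receive every message. Below is a minimal sketch using the kafka-python client; the broker address, topic name, and group id are assumptions, not taken from the question.

from kafka import KafkaConsumer   # pip install kafka-python

# Consumers that share the same group_id divide the topic's partitions among
# themselves, so each message is handled by exactly one of them. Starting this
# script twice with the same group_id therefore splits the work; changing the
# group_id instead gives the new consumer its own full copy of the stream.
consumer = KafkaConsumer(
    "test-topic",                        # hypothetical topic name
    bootstrap_servers="localhost:9092",  # hypothetical broker
    group_id="group-a",                  # hypothetical group id
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value)

Note that a topic with a single partition can keep only one member of a group busy; any additional consumers in that group sit idle, which is a common surprise when adding a second consumer for the first time.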

sqoop merge-key creating multiple part files instead of one, which defeats the purpose of using merge-key

微笑、不失礼 submitted on 2019-12-20 05:26:07
Question: Ideally, when we run an incremental import without merge-key it creates a new file with the appended data set, but if we use merge-key it creates the whole data set, including the previous data, in a single file. However, I am not getting one part file when I use incremental append in my sqoop job. Below are my steps:
1) Initial data:
mysql> select * from departments_per;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|
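
For reference, a sketch of the kind of import where the merge step actually applies: an --incremental lastmodified job with --merge-key. In plain --incremental append mode Sqoop only adds new part files and does not run the merge job that consolidates the output. The connection details, paths, and the timestamp check column below are hypothetical, and the command is wrapped in Python only to keep the sketch runnable (it assumes sqoop is on the PATH).

import subprocess

# Hypothetical connection details, paths, and check column; only the flag
# layout is the point of this sketch.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/retail_db",
    "--username", "retail_user",
    "--password-file", "/user/hive/.mysql_pwd",
    "--table", "departments_per",
    "--target-dir", "/user/hive/departments_per",
    "--incremental", "lastmodified",
    "--check-column", "updated_at",
    "--merge-key", "department_id",
    "--last-value", "2019-12-01 00:00:00",
]
subprocess.run(cmd, check=True)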

Spark::KMeans calls takeSample() twice?

Deadly submitted on 2019-12-20 03:45:09
Question: I have a lot of data and I have experimented with partitions of cardinality [20k, 200k+]. I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once. Then the takeSample() implementation doesn't seem to call itself or something like that, so I would
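
The excerpt is truncated, but two things are worth noting. First, each KMeans.train call with random initialisation samples its k initial centres once, so the two train calls above (one on first, one on second) already account for two takeSample invocations. Second, takeSample(), like every other pass KMeans makes over the input, recomputes the RDD unless it is persisted, which can make the sampling step look disproportionately expensive. A minimal sketch, assuming the input is a CSV of numeric features on HDFS (the path and format are hypothetical):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-sketch")

# Hypothetical input: comma-separated numeric vectors.
first = (sc.textFile("hdfs:///data/first.csv")
           .map(lambda line: [float(x) for x in line.split(",")]))

# Persist before training so the sampling pass for the random initialisation
# and the subsequent iterations reuse the cached partitions instead of
# re-reading and re-parsing the input each time.
first.cache()

C0 = KMeans.train(first, 8192, initializationMode='random',
                  maxIterations=10, seed=None)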

R: Expanding an R factor into dummy columns for every factor level

南笙酒味 submitted on 2019-12-20 03:08:13
Question: I have a quite large data frame in R with two columns. I am trying to turn the Code column (factor type with 858 levels) into dummy variables. The problem is that RStudio always crashes when I try to do that.
> str(d)
'data.frame': 649226 obs. of 2 variables:
 $ User: int 210 210 210 210 269 317 317 317 317 326 ...
 $ Code : Factor w/ 858 levels "AA02","AA03",..: 164 494 538 626 464 496 435 464 475 163 ...
The User column is not unique, meaning that there can be several rows
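
The crash is very likely memory: a dense 649226 x 858 dummy matrix of doubles is on the order of 4 GB. The question is about R, but the same idea expressed in Python/pandas shows the usual way out, which is to keep the one-hot expansion sparse (the tiny frame below is a hypothetical stand-in for d):

import pandas as pd

# Hypothetical stand-in for the R data frame d (two columns: User and Code).
d = pd.DataFrame({
    "User": [210, 210, 269, 317, 317],
    "Code": pd.Categorical(["AA02", "AA03", "AA02", "BB01", "AA03"]),
})

# sparse=True keeps the 858-level expansion from materialising a dense
# matrix in memory; only the non-zero entries are stored.
dummies = pd.get_dummies(d["Code"], prefix="Code", sparse=True)
result = pd.concat([d[["User"]], dummies], axis=1)
print(result.head())

In R itself the equivalent trick is a sparse model matrix (e.g. Matrix::sparse.model.matrix); the memory argument is the same.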

How to avoid reading old files from S3 when appending new data?

允我心安 submitted on 2019-12-19 12:06:15
Question: Every 2 hours, a Spark job runs to convert some tgz files to parquet. The job appends the new data to an existing parquet dataset in S3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is being spent on reading old parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'
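
If the time is going into footer reads triggered by schema merging or by rebuilding the parquet summary files on every append, one commonly suggested mitigation is to switch both off. This is a sketch under those assumptions (Spark 2.x SparkSession API, and a hypothetical source for the new batch), not necessarily the thread's accepted answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-no-footer-scan").getOrCreate()

# Don't merge schemas across the files already sitting in the table root.
spark.conf.set("spark.sql.parquet.mergeSchema", "false")

# Don't write or consult the _metadata/_common_metadata summary files, which
# otherwise have to be rebuilt from the old footers on every append.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.enable.summary-metadata", "false")

df = spark.read.json("s3://myBucket/incoming/")   # hypothetical new batch
df.write.mode("append").partitionBy("id", "day").parquet("s3://myBucket/foo.parquet")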

How to fix resource changed on src filesystem issue

六眼飞鱼酱① submitted on 2019-12-19 11:04:38
Question: I'm trying to use Hive on MR to execute SQL and it fails halfway through with the errors below:
Application application_1570514228864_0001 failed 2 times due to AM Container for appattempt_1570514228864_0001_000002 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2019-10-08 13:57:49.272]Failed to download resource { { s3a://tpcds/tmp/hadoop-yarn/staging/root/.staging/job_1570514228864_0001/libjars, 1570514262820, FILE, null },pending,[(container_1570514228864_0001_02_000001)]

Importing binary LabVIEW files with header information into MATLAB?

心不动则不痛 submitted on 2019-12-19 10:43:09
Question: I have large .bin files (10GB-60GB) that I want to import into MATLAB; each binary file represents the output of two sensors, thus there are two columns of data. Here is a more manageable sized example of my data. You will notice that there is a .txt version of the data; I need to import the .bin files directly into MATLAB, since the .txt version takes hours to convert for the larger files. The problem I have is that the .bin file has header information that I can't seem to
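
The question asks about MATLAB, but the general pattern (skip a fixed-size header, read the two interleaved sensor channels as typed binary, and memory-map rather than load tens of gigabytes at once) is the same in any language. A sketch in Python/NumPy, where the header length, sample type, and interleaving are all assumptions about the file layout:

import numpy as np

HEADER_BYTES = 512        # assumed header size; take it from the file spec or the .txt version
SAMPLE_DTYPE = "<f8"      # assumed: little-endian float64 samples

# Memory-map the payload so a 60GB file is not pulled into RAM;
# 'offset' skips past the header bytes.
raw = np.memmap("sensors.bin", dtype=SAMPLE_DTYPE, mode="r", offset=HEADER_BYTES)

# Assumes the two channels are interleaved sample-by-sample and the payload is
# a whole number of two-channel samples: reshape to two columns.
data = raw.reshape(-1, 2)
sensor_a, sensor_b = data[:, 0], data[:, 1]
print(data.shape, sensor_a[:5])

MATLAB's memmapfile with an Offset argument plays the same role as the offset here.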

Trouble with groupby on millions of keys on a chunked file in Python pandas

孤街浪徒 submitted on 2019-12-19 10:24:47
Question: I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike. Note that I have more than 10 million users. I first cut it into chunks and store them in an HDFStore named input.h5, then I use groupby on user_id following Jeff's way. Here is my data: about 200 million rows, 10 million unique user_ids.
user id | timestamp | category_clicked
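
The excerpt stops at the sample table, but the chunk-then-combine pattern it refers to can be sketched quickly: aggregate each chunk pulled from the HDFStore, then aggregate the partial results. The key name "df", the chunk size, and the count-based score below are assumptions, not taken from the question.

import pandas as pd

partials = []
# Assumes input.h5 holds a table-format frame under the key "df" with the
# columns user_id, time_stamp, category_clicked.
for chunk in pd.read_hdf("input.h5", "df", chunksize=1000000):
    # Per-chunk partial aggregate: click counts per (user, category).
    partials.append(chunk.groupby(["user_id", "category_clicked"]).size())

# Combine the partials: identical (user, category) pairs coming from different
# chunks are summed here, without ever holding all 200 million raw rows at once.
clicks = pd.concat(partials).groupby(level=["user_id", "category_clicked"]).sum()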