bigdata

Handling very large files with openpyxl python

Submitted by ↘锁芯ラ on 2020-01-30 10:53:15
Question: I have a spreadsheet with 11,000 rows and 10 columns. I am trying to copy each row with selected columns, add additional information per line, and output it to a txt file. Unfortunately, I am having really bad performance issues: things start to slow down badly after 100 rows and kill my processor. Is there a way to speed this up or use a better methodology? I am already using read_only=True and data_only=True. The most memory-intensive part is iterating through each cell: for i in range(probeStart, lastRow+1): dataRow
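A minimal sketch of the usual streaming approach for this kind of job, assuming the data sits on the active sheet; the file names, column picks, and per-line extra text are placeholders, not taken from the question. The idea is to open the workbook with read_only=True and walk rows with iter_rows(values_only=True), writing each line out immediately instead of indexing individual cells:

from openpyxl import load_workbook

# Placeholder paths and column choices; adjust to the real workbook.
wb = load_workbook("input.xlsx", read_only=True, data_only=True)
ws = wb.active

with open("output.txt", "w") as out:
    # In read-only mode iter_rows streams rows lazily, so memory stays flat
    # even for tens of thousands of rows.
    for row in ws.iter_rows(min_row=2, max_col=10, values_only=True):
        selected = (row[0], row[3], row[7])          # pick the columns you need
        out.write("\t".join("" if v is None else str(v) for v in selected))
        out.write("\textra-info-per-line\n")         # whatever per-line data you append

wb.close()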

Starting Hadoop Services using Command Line (CDH 5)

Submitted by 半城伤御伤魂 on 2020-01-25 22:05:08
Question: I know how to start services using the Cloudera Manager interface, but I would prefer to know what is really happening behind the scenes rather than rely on "magic". I read this page, but it does not give the desired information. I know there are some .sh files to be used, but they seem to vary from version to version, and I'm using the latest as of today (5.3). I would be grateful for a list of service-starting commands (specifically for HDFS). PS: it looks like Cloudera somehow ditched the classic Apache
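For reference, on a package-based (non-Cloudera-Manager-managed) CDH 5 install the HDFS roles ship as ordinary init scripts, so they can be driven from a shell or, as in this hedged sketch, from a small Python wrapper. The service names below are assumptions based on what CDH 5 packages typically install; verify them against /etc/init.d on the actual hosts, and note that a cluster managed by Cloudera Manager starts roles through the cloudera-scm-agent instead.

import subprocess

# Assumed CDH 5 package service names; confirm against /etc/init.d before use.
HDFS_SERVICES = [
    "hadoop-hdfs-namenode",
    "hadoop-hdfs-secondarynamenode",
    "hadoop-hdfs-datanode",
]

def start_hdfs():
    for svc in HDFS_SERVICES:
        # Equivalent to running: sudo service <name> start
        subprocess.run(["sudo", "service", svc, "start"], check=True)

if __name__ == "__main__":
    start_hdfs()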

How to ignore loading huge fields in django admin list_display?

Submitted by 时光怂恿深爱的人放手 on 2020-01-24 09:25:26
Question: I'm using Django 1.9 and django.contrib.gis with an Area model that has a huge GIS MultiPolygonField: # models.py from django.contrib.gis.db import models as gis_models class Area(gis_models.Model): area_color = gis_models.IntegerField() mpoly = gis_models.MultiPolygonField(srid=4326) class Meta: verbose_name = 'Area' verbose_name_plural = 'Areas' I have the associated AreaAdmin class to manage Areas inside the Django admin: # admin.py from django.contrib.gis import admin as gis_admin class
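One common fix, sketched here under the assumption that the changelist only needs the lightweight columns: override get_queryset() on the ModelAdmin and defer() the heavy geometry field so the admin list query never pulls mpoly from the database. The list_display columns below are placeholders; everything else mirrors the model shown above.

# admin.py (sketch)
from django.contrib.gis import admin as gis_admin
from .models import Area

class AreaAdmin(gis_admin.OSMGeoAdmin):
    list_display = ('id', 'area_color')   # only cheap fields in the changelist

    def get_queryset(self, request):
        # Defer the MultiPolygonField so the changelist query skips the huge column;
        # the change form still loads it when a single Area is opened.
        return super(AreaAdmin, self).get_queryset(request).defer('mpoly')

gis_admin.site.register(Area, AreaAdmin)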

Finding gaps in huge event streams?

Submitted by 烂漫一生 on 2020-01-22 18:37:27
Question: I have about 1 million events in a PostgreSQL database in this format: id | stream_id | timestamp ----------+-----------------+----------------- 1 | 7 | .... 2 | 8 | .... There are about 50,000 unique streams. I need to find all the events where the time between any two of the events exceeds a certain period. In other words, I need to find event pairs where there was no event within a certain period of time. For example: a b c d e f g h i j k | | | | | | | | | | | \____2 mins____
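A sketch of how this is commonly expressed in PostgreSQL itself, using the lag() window function to pair each event with the previous one in its stream and keep only pairs more than a threshold apart. The events table name, the 2-minute threshold, and the connection string are assumptions illustrating the description above.

import psycopg2

GAP_QUERY = """
    SELECT stream_id, gap_start, gap_end
    FROM (
        SELECT stream_id,
               lag(timestamp) OVER (PARTITION BY stream_id ORDER BY timestamp) AS gap_start,
               timestamp AS gap_end
        FROM events
    ) g
    WHERE gap_start IS NOT NULL
      AND gap_end - gap_start > interval '2 minutes'
"""

conn = psycopg2.connect("dbname=events")   # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(GAP_QUERY)
    for stream_id, gap_start, gap_end in cur:
        print(stream_id, gap_start, gap_end)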

Django + Postgres + Large Time Series

Submitted by 跟風遠走 on 2020-01-19 09:59:08
Question: I am scoping out a project with large, mostly-uncompressible time series data, and I am wondering whether Django + Postgres with raw SQL is the right call. The data arrives at about 2K objects per hour, every hour. That is about 2 million rows a year to store, and I would like to 1) be able to slice off data for analysis over a connection, and 2) be able to do elementary overview work on the web, served by Django. I think the best idea is to use Django for the objects themselves, but drop to raw SQL
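As an illustration of the "drop to raw SQL for the bulk rows" idea, here is a hedged sketch that uses Django's own database connection for the heavy time-series reads while ordinary models handle everything else. The table and column names are invented for the example.

from django.db import connection

def slice_series(series_id, start, end):
    """Fetch raw (timestamp, value) rows for one series between two instants.

    Bypasses the ORM on purpose: no model instances are built for the
    bulk time-series data, only plain tuples come back.
    """
    with connection.cursor() as cur:
        cur.execute(
            "SELECT ts, value FROM timeseries_point "
            "WHERE series_id = %s AND ts >= %s AND ts < %s "
            "ORDER BY ts",
            [series_id, start, end],
        )
        return cur.fetchall()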

how to read files from GetFilesProcessor in NiFi

Submitted by 时光毁灭记忆、已成空白 on 2020-01-16 09:01:04
Question: Below is my flow: GetFile > ExecuteSparkInteractive > PutFile. I want to read files from the GetFile processor in the ExecuteSparkInteractive processor, apply some transformations, and put the result in some location. I wrote the following Spark Scala code in the Code section of the Spark processor: val sc1 = sc.textFile("local_path") sc1.foreach(println) Nothing is happening in the flow, so how can I read files in the Spark processor via the GetFile processor? Second part: I tried the flow below just for practice:
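Looking at the Spark side in isolation, a hedged PySpark sketch of the workaround people usually reach for: the snippet above can only print whatever sits at a path the Spark cluster can actually read, so let the flow land the files in a shared staging directory first (for example via PutFile) and point Spark at that path. The staging path below is hypothetical.

from pyspark.sql import SparkSession

# Hypothetical staging directory that both NiFi (via PutFile) and the Spark
# cluster can reach; replace with a real shared or HDFS path.
STAGING = "/data/nifi/staging/*.txt"

spark = SparkSession.builder.appName("read-nifi-staging").getOrCreate()
lines = spark.sparkContext.textFile(STAGING)

# take() brings a small sample back to the driver; foreach(println) would print
# on the executors, where the output is easy to miss.
for line in lines.take(20):
    print(line)

spark.stop()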

Kafka, new storage

Submitted by 断了今生、忘了曾经 on 2020-01-16 06:46:10
Question: I'm trying to add new storage to Kafka. Here is what I have already done: added, prepared, and mounted the storage under the Linux OS; added the new directory to the broker's log.dirs: /data0/kafka-logs,/data1/kafka-logs; restarted the Kafka brokers. New directories under /data1/kafka-logs have been created, but their size is tiny: du -csh /data1/kafka-logs/ reports 156K /data1/kafka-logs/, and the size isn't growing; only the old /data0 is used. What am I missing? What else should I do to solve this? The storage is almost full, and
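What usually explains this behaviour: Kafka only places newly created partition replicas on a new log.dirs entry; existing partitions stay on the old disk until they are explicitly moved, for example with the partition reassignment tool. A hedged sketch of building the reassignment input for that follows; the topic name, partition, and broker id are placeholders, and the log_dirs field requires Kafka 1.1 or newer.

import json

# Sketch of a reassignment JSON that pins a partition replica to the new disk.
# "any" would leave a replica's directory unchanged, while an absolute path
# asks the broker to move it there (Kafka 1.1+).
reassignment = {
    "version": 1,
    "partitions": [
        {
            "topic": "my-topic",
            "partition": 0,
            "replicas": [1],
            "log_dirs": ["/data1/kafka-logs"],
        }
    ],
}

with open("reassign.json", "w") as f:
    json.dump(reassignment, f, indent=2)

# Then, outside Python, something along the lines of:
#   kafka-reassign-partitions.sh --zookeeper <zk> \
#       --reassignment-json-file reassign.json --execute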

Correct way of writing two floats into a regular txt

Submitted by 北慕城南 on 2020-01-15 05:58:05
Question: I am running a big job in cluster mode. However, I am only interested in two float numbers, which I want to read back somehow when the job succeeds. Here is what I am trying: from pyspark.context import SparkContext if __name__ == "__main__": sc = SparkContext(appName='foo') f = open('foo.txt', 'w') pi = 3.14 not_pi = 2.79 f.write(str(pi) + "\n") f.write(str(not_pi) + "\n") f.close() sc.stop() However, 'foo.txt' doesn't appear to be written anywhere (probably it gets written on an executor, or
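A hedged sketch of the usual fix: in cluster mode the driver, and therefore open('foo.txt', 'w'), runs on an arbitrary cluster node, so write the two values to a shared filesystem through Spark instead. The HDFS output path below is a placeholder.

from pyspark.context import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="foo")
    pi, not_pi = 3.14, 2.79
    # One partition -> one part-00000 file containing both lines, written to a
    # location reachable from outside whichever node ran the driver.
    sc.parallelize([str(pi), str(not_pi)], numSlices=1) \
      .saveAsTextFile("hdfs:///user/me/foo_result")   # placeholder path
    sc.stop()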

difference between ff and filehash package in R [closed]

Submitted by 人盡茶涼 on 2020-01-14 08:19:25
Question (closed 7 years ago as likely to solicit debate rather than answers supported by facts and references): I have a dataframe composed of 25 columns and ~1M rows, split into 12 files. Now I need to import them and then use some reshape package to
