bigdata

How do you ingest Spring Boot logs directly into Elasticsearch?

不羁岁月 submitted on 2019-11-29 15:50:26
Question: I'm investigating the feasibility of sending Spring Boot application logs directly into Elasticsearch, without using Filebeat or Logstash. I believe the Ingest plugin may help with this. My initial thought is to do this using Logback over TCP: https://github.com/logstash/logstash-logback-encoder

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="stash" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <destination>127.0.0.1:4560</destination>
    <encoder class=

Python generator to read large CSV file

前提是你 submitted on 2019-11-29 15:34:06
I need to write a Python generator that yields tuples (X, Y) coming from two different CSV files. It should receive a batch size on init, read line after line from the two CSVs, and yield a tuple (X, Y) for each line, where X and Y are arrays (the columns of the CSV files). I've looked at examples of lazy reading, but I'm finding it difficult to adapt them for CSVs: "Lazy Method for Reading Big File in Python?" and "Read large text files in Python, line by line without loading it in to memory". Also, unfortunately, pandas DataFrames are not an option in this case. Any snippet I can start from? Thanks
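No answer is included in this excerpt. As a starting point, here is a minimal sketch of such a generator, assuming the two files line up row for row, that the columns are numeric, and hypothetical file names:

import csv

def csv_pair_generator(x_path, y_path, batch_size):
    """Lazily yield (X, Y) batches, reading both CSVs one line at a time."""
    with open(x_path, newline="") as fx, open(y_path, newline="") as fy:
        x_reader, y_reader = csv.reader(fx), csv.reader(fy)
        x_batch, y_batch = [], []
        for x_row, y_row in zip(x_reader, y_reader):
            x_batch.append([float(v) for v in x_row])   # assumes numeric columns
            y_batch.append([float(v) for v in y_row])
            if len(x_batch) == batch_size:
                yield x_batch, y_batch
                x_batch, y_batch = [], []
        if x_batch:                                      # emit the final, partial batch
            yield x_batch, y_batch

# Hypothetical usage: nothing is read until the generator is iterated.
# for X, Y in csv_pair_generator("features.csv", "labels.csv", batch_size=32):
#     process(X, Y)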

How to expand one column in Pandas to many columns?

半城伤御伤魂 submitted on 2019-11-29 15:19:08
As the title says, I have one column (a Series) in pandas, and each row of it is a list like [0,1,2,3,4,5]. Each list has 6 numbers. I want to turn this column into 6 columns; for example, [0,1,2,3,4,5] would become 6 columns, with 0 in the first column, 1 in the second, 2 in the third, and so on. How can I do this? Not as fast as @jezrael's solution, but elegant :-) apply with pd.Series:

df.a.apply(pd.Series)

   0  1  2  3  4  5
0  0  1  2  3  4  5
1  0  1  2  3  4  5

or

df.a.apply(pd.Series, index=list('abcdef'))

   a  b  c  d  e  f
0  0  1  2  3  4  5
1  0  1  2  3  4  5

You can convert the lists to a NumPy array with values and then use
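That last answer is cut off; as a minimal sketch of the list-to-columns idea it appears to be heading toward, built on a small made-up frame:

import pandas as pd

# Made-up frame matching the question: a single column of 6-element lists.
df = pd.DataFrame({"a": [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]]})

# Building the new frame from the raw list values avoids a per-row Series
# construction, so it is usually much faster than df.a.apply(pd.Series).
expanded = pd.DataFrame(df["a"].tolist(), index=df.index, columns=list("abcdef"))
print(expanded)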

Convert PL/SQL to Hive QL

人盡茶涼 submitted on 2019-11-29 14:31:34
I want a tool through which I can get the corresponding Hive query by giving it a PL/SQL query. There are lots of tools available that convert SQL to HQL, e.g. Toad for Cloud Databases, but it does not show me the corresponding Hive query. Is there any such tool that converts a given SQL query to HQL? Please help me. Thanks and Regards, Ratan

Please take a look at the open-source project PL/HQL at http://www.plhql.org . It allows you to run existing SQL Server, Oracle, Teradata, MySQL etc. stored procedures in Hive.

shiva kumar s: Ratan, I did not know how to start responding. So, let's start like this. I think

Spark data type guesser UDAF

徘徊边缘 submitted on 2019-11-29 12:54:31
I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a data type guess. Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables. How do you normally determine data types in Spark? P.S. Frameworks like H2O automatically determine data types by scanning a sample of the data, or the whole dataset. So then one can decide e.g. if a variable should be
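Nothing in the excerpt answers this. As one illustration of the sampling idea, here is a rough PySpark sketch that reads everything as strings and guesses a type per column from a sample; the file path, sample fraction, and the three-type scheme are all assumptions, not a built-in Spark feature:

from pyspark.sql import SparkSession

def guess_type(values):
    """Guess int, double, or string from a sample of string values."""
    def castable(v, cast):
        try:
            cast(v)
            return True
        except (TypeError, ValueError):
            return False
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "string"
    if all(castable(v, int) for v in non_null):
        return "int"
    if all(castable(v, float) for v in non_null):
        return "double"
    return "string"

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("hdfs:///path/to/data.csv", header=True)   # placeholder path
sample = df.sample(fraction=0.01, seed=42).collect()           # scan a sample only
guesses = {c: guess_type([row[c] for row in sample]) for c in df.columns}
print(guesses)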

Removing duplicate units from a data frame

天大地大妈咪最大 submitted on 2019-11-29 12:52:06
I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.

set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
          "1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1
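The question is in R, where duplicated() on the subset of key columns is the usual route. Purely as an illustration of the same idea in pandas, with a made-up frame shaped like the one above:

import pandas as pd

# Made-up frame: only UNIT and DATE identify an observation;
# the remaining covariates are irrelevant for duplicate detection.
df = pd.DataFrame({
    "UNIT": [1, 1, 1, 2, 2],
    "DATE": ["1/1/2010", "1/1/2010", "1/2/2012", "1/2/2009", "1/2/2009"],
    "OTHER": ["a", "a", "b", "c", "c"],
})

# Keep the first row for each (UNIT, DATE) combination.
deduped = df.drop_duplicates(subset=["UNIT", "DATE"])
print(deduped)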

How to balance my data across the partitions?

折月煮酒 submitted on 2019-11-29 12:35:15
Question: Edit: The answer helps, but I described my solution in: memoryOverhead issue in Spark. I have an RDD with 202092 partitions, which reads a dataset created by others. I can manually see that the data is not balanced across the partitions; for example, some of them have 0 images and others have 4k, while the mean lies at 432. When processing the data, I got this error: Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn
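The answer referred to in the edit isn't shown here. A common first step for skew like this is to force a shuffle so records are spread roughly evenly; a minimal PySpark sketch, with the input path and target partition count as placeholders:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile("hdfs:///path/to/dataset")   # placeholder path and format

# repartition() performs a full shuffle, redistributing records roughly
# evenly across the requested number of partitions.
balanced = rdd.repartition(2048)               # target count is an assumption
print(balanced.getNumPartitions())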

How do I upsert into HDFS with Spark?

梦想与她 submitted on 2019-11-29 10:43:25
I have partitioned data in HDFS. At some point I decide to update it. The algorithm is:

1. Read the new data from a Kafka topic.
2. Find out the new data's partition names.
3. Load the data that is in HDFS from the partitions with these names.
4. Merge the HDFS data with the new data.
5. Overwrite the partitions that are already on disk.

The problem is: what if the new data has partitions that don't exist on disk yet? In that case they don't get written. https://stackoverflow.com/a/49691528/10681828 <- this solution doesn't write new partitions, for example. The above picture describes the situation. Let's
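The excerpt stops before any answer. One way to get this behavior in Spark 2.3+ is dynamic partition overwrite, sketched below; the DataFrame contents, partition column, and output path are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With "dynamic" overwrite mode, only the partitions present in the written
# DataFrame are replaced; partitions that exist only on disk are kept, and
# partitions that exist only in the new data are created.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Stand-in for the new Kafka data already merged with the existing HDFS data.
merged_df = spark.createDataFrame([("a", 1), ("b", 2)], ["partition_col", "value"])

merged_df.write \
    .mode("overwrite") \
    .partitionBy("partition_col") \
    .parquet("hdfs:///path/to/table")   # placeholder path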

DynamoDB query error - Query key condition not supported

寵の児 submitted on 2019-11-29 09:42:42
I am trying to query my DynamoDB table to get items by feed_guid and status_id = 1, but it returns a "Query key condition not supported" error. Please find my table schema and query below.

$result = $dynamodbClient->createTable(array(
    'TableName' => 'feed',
    'AttributeDefinitions' => array(
        array('AttributeName' => 'user_id', 'AttributeType' => 'S'),
        array('AttributeName' => 'feed_guid', 'AttributeType' => 'S'),
        array('AttributeName' => 'status_id', 'AttributeType' => 'N'),
    ),
    'KeySchema' => array(
        array('AttributeName' => 'feed_guid', 'KeyType' => 'HASH'),
    ),
    'GlobalSecondaryIndexes' => array(
        array(
            'IndexName'
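No answer is included in the excerpt. The error typically means the key condition uses attributes or operators that the queried table's (or index's) key schema doesn't support. The question uses the PHP SDK; purely as a sketch of one workaround in Python (boto3), with a placeholder GUID: query on the hash key and filter on the non-key attribute.

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("feed")

# KeyConditionExpression may only use key attributes (here the hash key
# feed_guid); non-key attributes such as status_id go in a FilterExpression.
response = table.query(
    KeyConditionExpression=Key("feed_guid").eq("some-feed-guid"),  # placeholder
    FilterExpression=Attr("status_id").eq(1),
)
print(response["Items"])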

jq --stream filter on multiple values of the same key

牧云@^-^@ submitted on 2019-11-29 08:44:28
I am processing a very large JSON file in which I need to filter the inner JSON objects using the value of a key. My JSON looks as follows:

{"userActivities":{"L3ATRosRdbDgSmX75Z":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-20"},"L3ATSFGrpAYRkIIKqrh":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-21"},"L3AVHvmReBBPNGluvHl":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-22"},"L3AVIcqaDpZxLf6ispK":{"deviceId":"60ee32c2fae8dcf0","dow":"Friday","localDate":"2018-01-19"}}}

I want to put a filter on localDate values such that localDate in
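The jq answer is not part of the excerpt. As a rough alternative sketch in Python, assuming the third-party ijson streaming parser is available, and with a hypothetical file name and filter dates:

import ijson  # assumption: streaming JSON parser (pip install ijson)

wanted_dates = {"2018-01-20", "2018-01-21"}   # hypothetical filter values
matches = {}

with open("activities.json", "rb") as f:      # hypothetical file name
    # Stream (key, value) pairs under "userActivities" without loading
    # the whole document into memory.
    for activity_id, activity in ijson.kvitems(f, "userActivities"):
        if activity.get("localDate") in wanted_dates:
            matches[activity_id] = activity

print(matches)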