bigdata

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?

Submitted by 女生的网名这么多〃 on 2019-11-28 07:52:15
Question: Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark? For example, I have thousands of pdf invoices and I want to read data from them and perform some analytics on it. What steps must I take to process unstructured data?

Answer 1: Yes, it is. Use sparkContext.binaryFiles to load the files in binary format and then use map to convert the values to some other format - for example, parse the binary content with Apache Tika or Apache POI. Pseudocode: val rawFile = sparkContext.binaryFiles(...
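
A slightly fuller sketch of that approach, assuming Apache Tika (core plus parsers) is on the classpath; the input path and the idea of extracting plain text per file are illustrative rather than taken from the truncated answer:

    import org.apache.spark.sql.SparkSession
    import org.apache.tika.Tika

    val spark = SparkSession.builder().appName("pdf-invoices").getOrCreate()
    val sc = spark.sparkContext

    // Each element is (filePath, PortableDataStream) wrapping the raw bytes.
    val rawFiles = sc.binaryFiles("hdfs:///invoices/*.pdf")   // placeholder path

    // Extract plain text from every PDF with Tika; the stream is only opened
    // on the executors, so file contents never pass through the driver.
    val texts = rawFiles.map { case (path, stream) =>
      val in = stream.open()
      try {
        (path, new Tika().parseToString(in))
      } finally {
        in.close()
      }
    }

    texts.take(5).foreach { case (path, text) =>
      println(s"$path -> ${text.take(200)}")
    }

From there the extracted text can be turned into a DataFrame and analysed like any other dataset.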

Cannot initialize cluster exception while running job on Hadoop 2

Submitted by 浪尽此生 on 2019-11-28 07:51:16
Question: This question is linked to my previous question. All the daemons are running; jps shows:

6663 JobHistoryServer
7213 ResourceManager
9235 Jps
6289 DataNode
6200 NameNode
7420 NodeManager

but the wordcount example keeps failing with the following exception:

ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
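
The excerpt stops before any answer. As an assumption rather than the accepted fix: a very common cause of this exception on Hadoop 2 is that mapreduce.framework.name is not set to yarn (or the MapReduce client jars are missing from the job's classpath), in which case a minimal mapred-site.xml would contain:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>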

Spark data type guesser UDAF

Submitted by 强颜欢笑 on 2019-11-28 06:47:23
Question: I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF - an aggregate function that returns a data-type guess. Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables. How do you normally determine data types in Spark? P.S. Frameworks like h2o automatically determine data
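
Spark does not ship such a guesser as an aggregate function; the closest built-in is the schema inference in the CSV/JSON readers (inferSchema). As one rough alternative to a UDAF - an assumption for illustration, not something from the question or an answer - the guess can be made per column by counting how many values survive a cast:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, count}

    // Rough per-column type guess: try casting every value and see what fraction
    // survives (failed casts become null, which count() skips). Not a UDAF -
    // just one aggregation pass per column - and the 0.95 threshold is arbitrary.
    def guessColumnTypes(df: DataFrame, threshold: Double = 0.95): Map[String, String] = {
      df.columns.map { name =>
        val c = col(name)
        val row = df.agg(
          count(c).as("total"),
          count(c.cast("long")).as("asLong"),
          count(c.cast("double")).as("asDouble"),
          count(c.cast("timestamp")).as("asTimestamp")
        ).head()
        val total = math.max(row.getLong(0), 1L).toDouble
        val guess =
          if (row.getLong(1) / total >= threshold) "long"
          else if (row.getLong(2) / total >= threshold) "double"
          else if (row.getLong(3) / total >= threshold) "timestamp"
          else "string"
        name -> guess
      }.toMap
    }

Called on a DataFrame loaded with all-string columns (e.g. a CSV read without inferSchema), it returns a map from column name to the guessed type name.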

removing duplicate units from data frame

Submitted by 核能气质少年 on 2019-11-28 06:27:54
Question: I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values of the x covariates to uniquely identify the observations and eliminate the duplicates.

set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","

iPad - Parsing an extremely huge JSON file (between 50 and 100 MB)

Submitted by 早过忘川 on 2019-11-28 06:02:24
I'm trying to parse an extremely big JSON file on an iPad. The file size will vary between 50 and 100 MB (there is an initial file, and there will be one new full set of data every month, which will be downloaded, parsed and saved into Core Data). I'm building this app for a company as an enterprise solution - the JSON file contains sensitive customer data and it needs to be saved locally on the iPad so it will work even offline. It worked when the file was below 20 MB, but now the set of data has become bigger and I really need to parse it. I'm receiving memory warnings during parsing and after the

How to quickly export data from R to SQL Server

Submitted by 99封情书 on 2019-11-28 04:45:22
The standard RODBC package's sqlSave function, even as a single INSERT statement (parameter fast = TRUE), is terribly slow for large amounts of data due to non-minimal logging. How would I write data to my SQL Server with minimal logging so it writes much more quickly?

Currently trying:

toSQL = data.frame(...);
sqlSave(channel, toSQL, tablename = "Table1", rownames = FALSE, colnames = FALSE, safer = FALSE, fast = TRUE);

By writing the data to a CSV locally and then using a BULK INSERT (not readily available as a prebuilt function akin to sqlSave), the data can be written to the MS SQL Server very quickly.

How do I upsert into HDFS with spark?

Submitted by 此生再无相见时 on 2019-11-28 03:54:24
Question: I have partitioned data in HDFS. At some point I decide to update it. The algorithm is:

1. Read the new data from a Kafka topic.
2. Find out the new data's partition names.
3. Load the data that is in HDFS from the partitions with these names.
4. Merge the HDFS data with the new data.
5. Overwrite the partitions that are already on disk.

The problem is: what if the new data has partitions that don't exist on disk yet? In that case they don't get written. https://stackoverflow.com/a/49691528/10681828 <- this
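
The excerpt cuts off before the linked answer. One common way to handle new, not-yet-existing partitions (an assumption here, not necessarily what that answer proposes) is Spark 2.3+'s dynamic partition overwrite: with mode("overwrite") it replaces only the partitions present in the batch being written and simply creates any partition that does not exist yet. A minimal sketch of the write step, where the paths, the dt partition column and the mergedBatch DataFrame are placeholders (the merge of steps 3-4 is assumed to have happened already):

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hdfs-upsert")
      // With "dynamic", mode("overwrite") replaces only the partitions that
      // appear in the DataFrame being written; untouched partitions stay as
      // they are, and brand-new partitions are simply created.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()

    // mergedBatch = the existing rows of the affected partitions merged with
    // the new records read from Kafka (produced earlier in the job).
    def writeMerged(mergedBatch: DataFrame): Unit = {
      mergedBatch.write
        .mode(SaveMode.Overwrite)
        .partitionBy("dt")                       // hypothetical partition column
        .parquet("hdfs:///data/events")          // placeholder output path
    }

Because untouched partitions are left alone, step 5's overwrite and the "missing partition" case are handled by the same write.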

Query Failed Error: Resources exceeded during query execution: The query could not be executed in the allotted memory

Submitted by 杀马特。学长 韩版系。学妹 on 2019-11-28 03:53:34
Question: I am using Standard SQL. Even though it's a basic query, it is still throwing errors. Any suggestions, please?

SELECT
  fullVisitorId,
  CONCAT(CAST(fullVisitorId AS string), CAST(visitId AS string)) AS session,
  date,
  visitStartTime,
  hits.time,
  hits.page.pagepath
FROM `XXXXXXXXXX.ga_sessions_*`, UNNEST(hits) AS hits
WHERE _TABLE_SUFFIX BETWEEN "20160801" AND "20170331"
ORDER BY fullVisitorId, date, visitStartTime

Answer 1: The only way for this query to work is by removing the ordering applied at the end:

Dynamodb query error - Query key condition not supported

Submitted by China☆狼群 on 2019-11-28 03:27:59
Question: I am trying to query my DynamoDB table to get feed_guid and status_id = 1, but it returns a "Query key condition not supported" error. Please find my table schema and query below.

$result = $dynamodbClient->createTable(array(
    'TableName' => 'feed',
    'AttributeDefinitions' => array(
        array('AttributeName' => 'user_id', 'AttributeType' => 'S'),
        array('AttributeName' => 'feed_guid', 'AttributeType' => 'S'),
        array('AttributeName' => 'status_id', 'AttributeType' => 'N'),
    ),
    'KeySchema' => array(
        array(

How can I tell when my dataset in R is going to be too large?

Submitted by 微笑、不失礼 on 2019-11-28 03:02:22
I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a key-value store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and columns my logfile will end up as and what data types the column entries ought to be (which presumably I need to check as it reads). How do I put this together into a go/nogo
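
The excerpt ends before any answer. As a back-of-the-envelope estimate (an assumption here, not taken from the excerpt): R stores numeric columns as 8-byte doubles (integers take 4 bytes), so the resident size is roughly rows x columns x 8 bytes, and it is prudent to allow two to three times that for the copies R makes during manipulation. For example:

    3,000,000 rows x 12 numeric columns x 8 bytes ≈ 288 MB resident,
    or roughly 0.6-0.9 GB with copy headroom - a comfortable "go" on a 3 GB machine.

If the estimate approaches the available RAM, that is the "no-go" signal to reach for a database-backed or chunked approach instead.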