Question
How do I get data that is being entered on a website into HDFS directly, as it arrives?
Answer 1:
If you need highly available reads and writes, then you can use HBase to store the data.
If you are ingesting through a REST API, you can store the data directly in HBase: it ships with a dedicated REST server whose API can write into HBase tables (a sketch follows the feature list below).
1) Linear and modular scalability.
2) Strictly consistent reads and writes.
3) Automatic and configurable sharding of tables.
For more about HBase, see https://hbase.apache.org/
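As an illustration, here is a minimal Python sketch of writing one cell through the HBase REST server. The host, table name (web_events), and column family (d) are assumptions; the REST API expects row keys, columns, and values base64-encoded.

```python
import base64
import json
import requests

HBASE_REST = "http://hbase-rest-host:8080"  # assumed HBase REST server address


def b64(s: str) -> str:
    """The HBase REST API expects keys, columns, and values base64-encoded."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def put_event(table: str, row_key: str, column: str, value: str) -> None:
    # One row with one cell; 'column' is 'family:qualifier'.
    payload = {"Row": [{"key": b64(row_key),
                        "Cell": [{"column": b64(column), "$": b64(value)}]}]}
    resp = requests.put(
        f"{HBASE_REST}/{table}/{row_key}",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    resp.raise_for_status()


put_event("web_events", "user42#2018-04-09T12:00:00", "d:click", '{"page": "/home"}')
```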
Alternatively, if you want to stream data into HDFS from some source, you can look into the Confluent Platform (which ships with Kafka) and use it to store the data in HDFS.
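As a rough sketch of the streaming route, producing website events into a Kafka topic (which an HDFS sink can then drain) might look like this with the confluent-kafka Python client; the broker address and topic name are assumptions.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker


def delivery_report(err, msg):
    # Invoked once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")


event = {"user": "42", "action": "signup", "ts": "2018-04-09T12:00:00Z"}
producer.produce("web_events", value=json.dumps(event).encode("utf-8"),
                 callback=delivery_report)
producer.flush()  # block until outstanding messages are delivered
```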
Answer 2:
This entirely depends on what data you have and how willing you are to maintain extra tools on top of Hadoop.
If you're just accepting events from a log file, then Flume, Fluentd, or Filebeat are your best options.
If you are accepting client-side events, such as clicks or mouse movements, then you need some backend server accepting those requests, for example a Flume TCP source. But you probably want some type of authentication endpoint in front of that service to prevent random external messages from reaching your event channel.
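As one possible shape for that authentication endpoint, here is a minimal Python/Flask sketch that checks a shared key and forwards accepted events onward. It targets Flume's HTTP source (default JSONHandler) rather than the TCP source mentioned above, since the HTTP source pairs naturally with a web gateway; the hosts, port, and header name are all assumptions.

```python
import json

import requests
from flask import Flask, abort, request

app = Flask(__name__)
FLUME_HTTP = "http://flume-host:44444"  # assumed Flume HTTPSource endpoint
API_KEY = "change-me"                   # placeholder; use real auth in practice


@app.route("/events", methods=["POST"])
def accept_event():
    # Reject requests lacking the shared key, so random external
    # messages never reach the event channel.
    if request.headers.get("X-Api-Key") != API_KEY:
        abort(401)
    # Flume's HTTPSource JSONHandler expects a JSON array of
    # {"headers": {...}, "body": "..."} events.
    flume_events = [{"headers": {}, "body": json.dumps(request.get_json())}]
    requests.post(FLUME_HTTP, json=flume_events, timeout=5)
    return "", 204


if __name__ == "__main__":
    app.run(port=8000)
```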
You can also use Kafka. The Kafka REST Proxy (by Confluent) can accept REST requests and produce them to a Kafka topic, and the Kafka Connect HDFS connector (also by Confluent) can consume from that topic and write the messages to HDFS in near real time, much like Flume.
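A hedged sketch of that pipeline's two REST calls, assuming a REST Proxy on port 8082 and a Kafka Connect worker on port 8083 (the topic, hosts, and HDFS address are placeholders):

```python
import requests

# 1) Produce a record through the Kafka REST Proxy.
requests.post(
    "http://rest-proxy:8082/topics/web_events",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": {"user": "42", "action": "click"}}]},
).raise_for_status()

# 2) Register the Confluent HDFS sink connector with Kafka Connect so the
#    topic is written to HDFS continuously.
connector = {
    "name": "hdfs-sink-web-events",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "1",
        "topics": "web_events",
        "hdfs.url": "hdfs://namenode:8020",  # assumed NameNode address
        "flush.size": "1000",                # records per HDFS file
    },
}
requests.post("http://connect-host:8083/connectors", json=connector).raise_for_status()
```

Once the connector is registered, records produced to the topic are batched into files under HDFS, so the web tier never has to talk to HDFS directly.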
Other options include Apache NiFi and StreamSets. Again, you would use a TCP or HTTP event-source listener with an HDFS destination processor.
Source: https://stackoverflow.com/questions/49726697/getting-data-directly-from-a-website-to-a-hdfs