Today I will talk about cloud computing, big data, and artificial intelligence. Why these three? Because they are very popular right now, and they seem to be related to each other: talk about cloud computing and big data comes up; talk about artificial intelligence and big data comes up; talk about artificial intelligence and cloud computing comes up... The three feel complementary and inseparable. But for a non-technical person, the relationship between the three can be hard to grasp, so it is worth explaining.
1. The original goal of cloud computing
Let's talk about cloud computing first. The original goal of cloud computing was the management of resources, mainly three kinds: computing resources, network resources, and storage resources.
1 A data center is like a computer
What are computing, network, and storage resources?
For example, when you buy a laptop, do you care what CPU it has? How much memory? These two are the computing resources.
To access the Internet, this computer needs a network port to plug a cable into, or a wireless card that can connect to your router. Your home also needs to subscribe with an operator such as China Unicom, China Mobile, or China Telecom for, say, 100M of bandwidth. Then a technician runs a cable to your home and may help configure your router to connect to the operator's network. All the computers, phones, and tablets in your home can then get online through the router. This is the network resource.
You may also ask how big the hard drive is. Hard drives used to be very small, say 10G; later even 500G, 1T, or 2T drives were nothing new (1T is 1000G). This is the storage resource.
What holds for one computer holds for a data center. Imagine a very, very large machine room piled with servers. These servers also have CPUs, memory, and hard disks, and they also reach the Internet through devices like routers. The question is: how do the people who run the data center manage all these devices in a unified way?
2 Flexibility means you have it whenever you want, whatever you want
The goal of management is flexibility in two aspects. Which two, specifically?
An example makes it clear: suppose someone needs a tiny computer with only one CPU, 1G of memory, a 10G hard disk, and 1M of bandwidth. Can you give it to him? Any laptop today beats that configuration, and home broadband is 100M. Yet on a cloud computing platform, whenever he wants this resource, it is just one click away.
In this case, it achieves flexibility in two aspects:
- Time flexibility: whenever you want it, you get it; it comes out the moment you need it;
- Space flexibility: however much you want. A tiny computer can be provided; so can a huge space such as a cloud disk. The cloud disk allocates a very large space to every user; there is always room to upload, and it seems it can never be filled up.
Space flexibility plus time flexibility is what we call the elasticity of cloud computing. Solving this flexibility problem took a long time.
3 Physical equipment is not flexible
The first stage was the era of physical equipment. In this period, when a customer needed a computer, we bought one and placed it in the data center.
Of course, physical equipment kept getting better. Servers, for instance, routinely carry 100G of memory; network devices offer ports of tens or even hundreds of gigabits; data-center storage starts at the PB level (one P is 1000 T, one T is 1000 G).
However, physical equipment cannot be very flexible:
- First, it lacks time flexibility. It cannot deliver a machine the moment you want it. Buying a server or a computer takes procurement time. If a user suddenly tells a cloud vendor he wants a physical server, it is hard to purchase one on the spot: with a good supplier relationship it may take a week, with an ordinary relationship a month. The user waits all that while before the machine is in place, and only then can he log in and slowly deploy his application. Time flexibility is very poor.
- Second, its space flexibility is poor. The user above needs a very small computer, but where do you find one that small nowadays? You cannot buy a machine with just 1G of memory and an 80G hard disk to match the need. If you buy a big one, you have to charge the user more because the machine is big, yet the user only needs that small amount, so paying extra feels unfair to him.
4 Virtualization is much more flexible
Someone figured out a solution: virtualization. Don't users just need small computers? The physical equipment in the data center is very powerful; I can carve a small virtual slice out of the physical CPU, memory, and hard disk for one customer, and carve another small slice for another customer. Each customer sees only his own small slice, while in fact each is using a small part of the whole large device.
Virtualization technology makes different customers' computers appear isolated: this disk looks like mine, that disk looks like yours, yet my 10G and your 10G may actually sit on the same large storage device. Moreover, if the physical equipment is prepared in advance, virtualization software can conjure up a computer very quickly, basically within a few minutes. That is why a new computer comes out in just a few minutes on any cloud.
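To make this concrete, here is a minimal sketch of booting a virtual machine through the Python bindings of libvirt, the open-source API in front of hypervisors such as KVM and Xen. The domain description is deliberately tiny; the name, disk path, and sizes are placeholders, not anyone's real setup.

```python
import libvirt  # pip install libvirt-python; needs a local hypervisor such as KVM

# A minimal, hypothetical domain description: 1 vCPU, 1G of memory,
# one small virtio disk. All names and paths are placeholders.
DOMAIN_XML = """
<domain type='kvm'>
  <name>tiny-vm</name>
  <memory unit='GiB'>1</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/tiny-vm.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")  # connect to the local hypervisor
dom = conn.createXML(DOMAIN_XML, 0)    # carve out and boot a transient VM
print("started:", dom.name())
conn.close()
```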
In this way, space flexibility and time flexibility are basically solved.
5 Money and idealism in the virtual world
In the virtualization stage, the leading company was VMware. It implemented virtualization technology early and could virtualize computing, network, and storage. The company did very well: strong business performance, excellent virtualization software, lots of money earned. It was later acquired by EMC (a Fortune Global 500 company and the top brand among storage vendors).
But there are still many idealists in this world, especially among programmers. What do idealists like to do? Open source.
Much of the world's software is either closed source or open source; the "source" is the source code. Closed source means: a piece of software is well made, everyone loves to use it, but I lock up its code so only my company knows it and nobody else does; whoever wants to use it has to pay me.
Yet there are always masters who cannot stand all the money going to one family. They figure: you know this technology, and so do I; if you can develop it, so can I. I develop it, I charge nothing, I publish the code and share it with everyone in the world, and everyone can enjoy the benefits. This is called open source.
Tim Berners-Lee, for example, is a man of great ideals. In 2017 he received the 2016 Turing Award for "inventing the World Wide Web, the first browser, and the fundamental protocols and algorithms allowing the Web to scale." The Turing Award is the Nobel Prize of the computer industry. His most admirable deed, though, was contributing the World Wide Web, our familiar WWW technology, to the world for free. For everything we do on the Web, we should thank him; had he charged money for this technology, he would be as rich as Bill Gates.
There are many examples of open source and closed source:
For example, in the closed-source world there is Windows, and everyone who uses Windows pays Microsoft; so in the open-source world Linux appeared. Bill Gates made a fortune on closed-source software such as Windows and Office and became the world's richest man, so other masters developed a different operating system, Linux. Many people may never have heard of Linux, yet a great deal of back-end server software runs on it. For instance, everyone enjoys Double Eleven; whether on Taobao, JD, or Kaola, the systems supporting the Double Eleven rush all run on Linux.
Where there is Apple, there is Android. Apple's market value is very high, but we cannot see the code of Apple's system, so masters wrote the Android mobile operating system. That is why almost every other phone manufacturer ships Android: Apple's system is not open source, while Android is available to everyone.
The same happened with virtualization software. VMware's software is very expensive, so masters wrote two open-source virtualization packages, one called Xen and the other called KVM. If you are not in technology you can ignore these two names, but they will come up again later.
6 Semi-automatic virtualization and fully automatic cloud computing
To say that virtualization software solved the flexibility problem is not entirely right, because virtualization software generally creates one virtual computer at a time, and a human must specify which physical machine each virtual computer is placed on, often with quite complicated manual configuration. That is why using VMware's virtualization software requires a demanding certification, and people who hold that certificate earn high salaries; the complexity is plain to see.
So the cluster of physical machines that virtualization software alone can manage is not particularly large, generally a dozen, a few dozen, or at most about a hundred machines.
This limits time flexibility: although virtualizing one computer takes little time, as the cluster grows, the manual configuration becomes ever more complicated and time-consuming. It also limits space flexibility: when there are many users, a cluster of this size comes nowhere near "as much as you want"; the resources may well be used up quickly, forcing new purchases.
So clusters grew larger and larger, basically starting at a thousand machines and reaching tens of thousands or more. Look at BAT, or NetEase, Google, and Amazon: the server counts are staggering. With so many machines, relying on people to pick a spot for each virtualized computer and configure it is nearly impossible; machines have to do this job too.
So people invented algorithms for the task, called schedulers (Scheduler). In plain terms, there is a dispatch center with thousands of machines in one pool. Whatever CPU, memory, and disk the user's virtual computer needs, the dispatch center automatically finds a spot in the big pool that satisfies the requirement, starts the virtual computer there, configures it, and the user can use it directly. This stage is called pooling or cloudification. Only at this stage can it be called cloud computing; before this, it was merely virtualization.
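The core idea of a scheduler fits in a few lines. Below is a minimal sketch assuming a naive first-fit policy; real schedulers weigh far more factors (load balancing, affinity, failure domains), so treat this as an illustration only.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpu: int  # free vCPU cores
    free_mem: int  # free memory in GB

def schedule(pool, need_cpu, need_mem):
    """First-fit scheduling: place the VM on the first host with enough room."""
    for host in pool:
        if host.free_cpu >= need_cpu and host.free_mem >= need_mem:
            host.free_cpu -= need_cpu  # reserve the slice for this VM
            host.free_mem -= need_mem
            return host.name
    return None  # pool exhausted: time to expand the data center

pool = [Host("node-1", 2, 4), Host("node-2", 16, 64), Host("node-3", 64, 256)]
print(schedule(pool, need_cpu=1, need_mem=1))     # -> node-1
print(schedule(pool, need_cpu=32, need_mem=128))  # -> node-3
```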
7 Private and public cloud computing
Cloud computing is roughly divided into two types: private cloud and public cloud. Some people also connect private cloud and public cloud into a hybrid cloud; I will not cover that here.
- Private cloud: the virtualization and cloudification software is deployed in the customer's own data center. Users of private clouds are often very wealthy; they buy land, build machine rooms, buy servers, and then have a cloud vendor deploy the software there. Besides virtualization, VMware later launched cloud computing products and made a lot of money in the private cloud market.
- Public cloud: the virtualization and cloudification software is deployed in the cloud vendor's own data center, so users need no big investment; register an account, and you can create a virtual computer on a web page with one click. AWS, for example, is Amazon's public cloud; domestic examples are Alibaba Cloud, Tencent Cloud, and NetEase Cloud.
Why did Amazon build a public cloud? Amazon started out as a large overseas e-commerce company, and e-commerce inevitably runs into scenes like Double Eleven, when everyone rushes to buy at the same moment. That is precisely when the cloud's time flexibility and space flexibility are needed most: Amazon cannot keep all the resources ready all year round, which would be far too wasteful, yet it cannot stand by unprepared and watch masses of shoppers fail to get in during the rush. So when the rush arrives, a large number of virtual computers are created to support the e-commerce application, and afterwards those resources are released for other work. This is why Amazon needed a cloud platform.
However, commercial virtualization software was too expensive; Amazon could hardly hand over all its e-commerce earnings to virtualization vendors. So Amazon built its own cloud software on open-source virtualization technology, the Xen or KVM mentioned above. Unexpectedly, Amazon's e-commerce grew stronger and stronger, and its cloud platform did too.
Because its cloud platform had to support its own e-commerce application, while traditional cloud computing vendors were mostly IT vendors with almost no applications of their own, Amazon's cloud platform was friendlier to applications and rapidly developed into the top brand in cloud computing, earning a great deal of money.
Before Amazon broke out its cloud platform's financials, people speculated: Amazon's e-commerce makes money, but does the cloud? Once the report was published, it turned out to be no ordinary money-maker: last year alone, Amazon AWS had annual revenue of 12.2 billion US dollars and an operating profit of 3.1 billion US dollars.
8 Money and idealism in cloud computing
Amazon, first in the public cloud, was having a great time; Rackspace, in second place, was doing only so-so. No way around it: such is the cruelty of the Internet industry, mostly a winner-takes-all market. If the second place is not even inside the cloud computing industry, many people may never have heard of it.
The second place thought: what do I do if I cannot beat the boss? Open source. As mentioned above, although Amazon used open-source virtualization technology, its cloudification code is closed source. Many companies that wanted to build a cloud platform but could not do so on their own could only watch Amazon rake in money. Once Rackspace released its source code, the whole industry could work together to make the platform better and better: brothers, let's charge together and take on the boss.
So Rackspace and NASA cooperated to create the open-source software OpenStack. (The original article shows OpenStack's architecture diagram here.) Those outside the cloud computing industry need not understand the diagram, but note three keywords in it: Compute (computing), Networking (network), and Storage (storage). OpenStack, too, is a cloud management platform for computing, network, and storage.
Of course, the second place's technology was excellent. Once OpenStack existed, just as Rackspace hoped, all the big companies that wanted to do cloud went crazy for it: every big IT company you can think of, IBM, HP, Dell, Huawei, Lenovo, and so on.
It turned out everyone wanted a cloud platform; watching Amazon and VMware make so much money, going it alone had seemed hopeless. Now, with the open-source cloud platform OpenStack, all the IT vendors joined the community, contributed to the platform, packaged it into their own products, and sold it together with their own hardware. Some built private clouds, some built public clouds, and OpenStack became the de facto standard for open-source cloud platforms.
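For a sense of what "one click" looks like underneath, here is a minimal sketch using openstacksdk, the official Python client for OpenStack. The cloud profile, image, flavor, and network names are placeholders for whatever a given deployment defines.

```python
import openstack  # pip install openstacksdk

# Credentials come from a clouds.yaml profile; "my-cloud" is a placeholder.
conn = openstack.connect(cloud="my-cloud")

# Look up a small machine type; all names here are hypothetical.
image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.tiny")      # e.g. 1 vCPU, 1G of memory
network = conn.network.find_network("private")

server = conn.compute.create_server(
    name="demo-vm",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)  # the scheduler picks a host for us
print(server.status)  # "ACTIVE" once the VM is up
```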
9 IaaS, flexibility at the resource level
As OpenStack's technology matured, the scale it could manage grew larger and larger, and multiple OpenStack clusters could be deployed as multiple sets, say one set in Beijing, two in Hangzhou, one in Guangzhou, all under unified management. The overall scale became bigger still.
At this scale, the ordinary user's perception is basically: whenever you want it, and as much as you want. Take the cloud disk as an example: every user is allocated 5T or more of space; with 100 million users, think how much space that would add up to.
In fact, the mechanism behind it is this: most of the space allocated to you is not real. You are allocated 5T, but that figure is only what you see, not what you really have. If you have only used 50G, only 50G is truly given to you; as you keep uploading files, more and more space is actually allocated to you.
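This kind of thin provisioning can be seen in miniature with a sparse file: the file claims a huge size, but the file system only materializes the blocks actually written. A small sketch (Linux or macOS; the 5T figure mirrors the cloud-disk example):

```python
import os

# Create a "5T" file without actually consuming 5T of disk.
with open("cloud_disk.img", "wb") as f:
    f.seek(5 * 1000**4 - 1)  # jump to the last byte of a 5T range
    f.write(b"\0")           # only this one block is really allocated

st = os.stat("cloud_disk.img")
print("apparent size:", st.st_size)       # ~5T: what the user "sees"
print("real usage:", st.st_blocks * 512)  # a few KB: what is truly allocated
```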
When everyone keeps uploading and the cloud platform finds itself nearly full (say, 70% used), it purchases more servers and expands the resources behind the scenes. This is transparent and invisible to users, so as far as they can feel, the elasticity of cloud computing is real. It is rather like a bank: depositors feel they can withdraw money at any time, and as long as they do not all run on the bank at once, the bank does not collapse.
10 Summary
At this stage, cloud computing basically achieved time flexibility and space flexibility, that is, elasticity of computing, network, and storage resources. Computing, network, and storage are often called infrastructure, so elasticity at this stage is resource-level elasticity. The cloud platform that manages resources is called infrastructure as a service, the IaaS (Infrastructure As A Service) we so often hear about.
2. Cloud computing cares not only about resources but also about applications
With IaaS, is resource-level flexibility enough? Obviously not; there is also flexibility at the application level.
Here is an example: an e-commerce application normally runs on ten machines, but Double Eleven needs a hundred. You might think that is easy to handle: with IaaS, just create 90 new machines. But those 90 machines come up empty, without the e-commerce application on them, and the company's operations staff had to install it machine by machine, which takes a long time.
So although flexibility is achieved at the resource level, without flexibility at the application layer it is still not enough. Is there a way to solve this?
People added a layer on top of the IaaS platform to manage application flexibility above the resources. This layer is usually called PaaS (Platform As A Service). It is often hard to pin down, but it roughly splits into two parts: one is "automatic installation of your own applications", the other is "general-purpose applications that need no installation".
- Automatic installation of your own applications: your e-commerce application is developed by you, and nobody else knows how to install it. During installation you must configure your Alipay or WeChat account, so that when someone buys something on your site, the money goes into your account; nobody knows those details but you. So the platform cannot install it for you, but it can help you automate the process: you do some work to fold your own configuration into an automated installation flow. In the example above, the 90 machines created on Double Eleven are empty; if a tool can automatically install the e-commerce application on all 90, real application-level flexibility is achieved. Tools such as Puppet, Chef, Ansible, and Cloud Foundry can do this, and the container technology Docker does it even better.
- General-purpose applications need no installation: general-purpose applications are the more complex applications everyone uses, such as databases. Almost every application uses a database, and database software is standard: installing and maintaining it is complicated, but it is the same no matter who does it. Such applications can be turned into standard PaaS-layer applications exposed on the cloud platform's interface: when a user needs a database, one click and it comes out, ready to use. Someone asks: since installation is the same for everyone, why not do it myself instead of paying the cloud platform? Not so fast: a database is a very hard thing to do well; Oracle alone makes a fortune on databases, and buying Oracle costs a lot of money.
Most cloud platforms, however, offer open-source databases such as MySQL, and being open source, they do not cost nearly as much. But to maintain such a database you would need to hire a sizable team, and optimizing it until it can withstand a Double Eleven is not something done in a year or two.
If you are, say, a bicycle company, there is no need to hire a large database team; the cost is too high. Hand it to the cloud platform, where professionals do the job: the platform dedicates hundreds of people to maintaining these systems, and you just concentrate on your bicycle application.
Either deployment is automatic, or deployment is unnecessary; in short, you worry less and less about the application layer. That is the important role of the PaaS layer.
Scripting can solve the deployment of your own application, but environments differ widely: a script that runs correctly in one environment often fails in another.
Containers solve this problem better.
Container also means shipping container; the idea of the container is to be a vessel for software delivery. Containers have two key characteristics: one is encapsulation, the other is standardization.
In the era before shipping containers, suppose goods went from A to B, passing through three ports and changing ships three times. Each time, the goods had to be unloaded from the ship in a jumble, then carried onto the next ship and stacked neatly again. So whenever the ship changed, the crew had to spend several days ashore before sailing on.
With the container, all the goods are packed together, and all containers are the same size, so at each change of ship one box is moved as a whole; that takes hours, and the crew no longer lingers ashore.
This is the application of the container's two major characteristics, encapsulation and standardization, in everyday life.
How does a container package an application? We still learn from the shipping container. First, there must be a closed environment that encapsulates the goods so they do not interfere with one another and remain isolated, making loading and unloading convenient. Fortunately, LXC technology in Linux has been able to do this for a long time.
The closed environment relies mainly on two technologies. One makes things look isolated, and is called Namespace: applications in each namespace see their own IP addresses, user space, and process numbers. The other makes resource usage genuinely isolated, and is called Cgroups: the whole machine has plenty of CPU and memory, but a given application may only use its assigned share.
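A tiny Linux-only demonstration of the "looks isolated" half, using the unshare tool from util-linux: a child process gets its own UTS (hostname) namespace, so a hostname set inside does not leak out to the host. This is only a sketch and needs root privileges:

```python
import subprocess

# Inside a fresh UTS namespace, set a private hostname and print it.
subprocess.run([
    "sudo", "unshare", "--uts",  # new hostname namespace for the child
    "sh", "-c", "hostname container-demo && hostname",
])  # prints: container-demo

# Back outside the namespace, the machine's real hostname is untouched.
subprocess.run(["hostname"])
```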
The so-called image saves the state of the container at the moment you weld the box shut. Like Sun Wukong's "freeze!" spell, the container is fixed at that instant, and the state of that instant is saved into a series of files. The format of these files is standard, so anyone holding them can restore that frozen moment. Restoring an image to runtime (reading the image files and reproducing that moment) is what it means to run a container.
With containers, automatic deployment of users' own applications at the PaaS layer becomes fast and elegant.
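As a taste, here is a minimal sketch using the Docker SDK for Python (the docker package). The nginx image stands in for "your own application"; deploying onto 90 machines is the same call aimed at 90 Docker hosts:

```python
import docker  # pip install docker; requires a running Docker daemon

client = docker.from_env()

# Pull a packaged application image and run it as a container.
container = client.containers.run(
    "nginx:latest",           # placeholder for your own e-commerce image
    detach=True,              # run in the background
    ports={"80/tcp": 8080},   # host port 8080 -> container port 80
    name="shop-demo",
)
print(container.status)
```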
3. Big data embraces cloud computing
A complex general-purpose application at the PaaS layer is the big data platform. How did big data come, step by step, to merge into cloud computing?
1 Little data also contains wisdom
At the beginning, this "big" data was not big at all. How much data was there? Now everyone reads e-books and news online, but for those of us born in the 1980s, the amount of information was nowhere near this; we read books and newspapers. How many words did a week's newspapers add up to? Outside the big cities, an ordinary school library did not fill many shelves. Later, with informatization, information grew and grew.
First, let's look at the data in big data. There are three types: structured data, unstructured data, and semi-structured data.
- Structured data: data with a fixed format and limited length. A filled-in form is structured data, e.g. nationality: People's Republic of China; ethnicity: Han; gender: male. These are all structured data.
- Unstructured data: increasingly plentiful, this is data of variable length with no fixed format. Web pages, for example, are sometimes very long and sometimes just a few words; voice and video are likewise unstructured data.
- Semi-structured data: data in formats such as XML or HTML. Those not in technology may not know these, and that is fine.
Data by itself is not useful; it must be processed. You wear a fitness band that collects data every day, and the countless web pages on the Internet are data too: we call this Data. Data itself is useless, but it contains something very important called information (Information).
Data is messy; only after sorting and cleaning can it be called information. Information contains many regularities, and the regularities summarized from information are called knowledge (Knowledge); knowledge changes destiny. Information is everywhere, but some people look at it and see nothing, while others see from it the future of e-commerce, or the future of live streaming, and so they prosper. If you do not extract knowledge from information, you can only scroll through your feed every day as a bystander.
With knowledge, those who then apply it in practice can excel; that is called wisdom (Intelligence). Having knowledge does not mean having wisdom: many scholars are very learned and can analyze what has already happened from every angle, but ask them to do the work and they cannot turn knowledge into wisdom, whereas many entrepreneurs are great precisely because they apply acquired knowledge to practice and end up building large businesses.
So the application of data divides into these four steps: data, information, knowledge, and wisdom.
The final step is what many businesses want. Look, I have collected so much data; can you use it to help me make the next decision and improve my product? For example, while a user watches a video, an ad pops up beside it showing exactly what he wants to buy; while a user listens to music, other music he really wants to hear is recommended to him.
The user's mouse clicks in my application or on my website, and the text he types, are data to me. I want to extract some of it, use it to guide practice, and form wisdom, so that users get hooked inside my application, never want to leave once they land on my site, and keep clicking and keep buying.
Many people say: on Double Eleven I want to unplug the Internet, because my wife keeps buying and buying on it. She buys A, B gets recommended, and the wife says, "Oh, B is just what I like, honey, buy it for me." How come this program is so impressive, so wise, that it knows my wife better than I do? How did that happen?
2 How data can be upgraded to wisdom
Data processing is divided into several steps, and only when they are all done does wisdom finally emerge.
The first step is data collection. First of all, there must be data, which is gathered in two ways:
- The first way is to take it, professionally called grabbing or crawling. Search engines do exactly this: they download all the information on the Internet into their data centers, and then you can search it. When you search, the result is a list; why does this list live at the search engine company? Because it grabbed all the data; but once you click a link, the website itself is no longer at the search engine. Say Sina has a news story and you search for it on Baidu: before you click, the page is in Baidu's data center; the page that opens after the click is in Sina's data center. (A minimal crawler sketch follows this list.)
- The second way is push: many terminals help collect data for me. For example, a Xiaomi band can upload your daily running data, heart-rate data, and sleep data to the data center.
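Here is the minimal crawler sketch promised above, assuming the common requests and BeautifulSoup libraries; example.com is a placeholder start page, and a real crawler would also need politeness delays, robots.txt handling, and persistent storage:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: download a page, keep it, then follow its links."""
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = html  # "download it into our data center"
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl("https://example.com")
print(len(pages), "pages fetched")
```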
The second step is data transmission. This is generally done with a queue, because the volume of data is huge and it has to be processed before it becomes useful, yet the processing systems cannot keep up, so the data lines up in a queue to be handled bit by bit.
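In practice the queue is usually a distributed message queue such as Kafka. A minimal sketch with the kafka-python client, assuming a broker on localhost:9092; the topic name and the event payload are made up:

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: terminals and crawlers push raw events into the queue.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("raw-events", b'{"user": 42, "steps": 8000}')  # hypothetical event
producer.flush()

# Consumer side: the processing system drains the queue at its own pace.
consumer = KafkaConsumer("raw-events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # show just one event in this sketch
```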
The third step is data storage. Data is now money: holding data is like holding money. How else would a website know what you want to buy? Precisely because it has your historical transaction data. This information cannot be handed to others and is very precious, so it must be stored.
The fourth step is data processing and analysis. What is stored above is raw data, which is mostly disordered and full of garbage, so it needs cleaning and filtering to yield high-quality data. On high-quality data you can then run analysis, classifying it or discovering relationships among items, and so obtain knowledge.
Take the oft-told (if possibly apocryphal) beer-and-diapers story from Walmart: by analyzing purchase data, the store found that men who buy diapers generally buy beer too, discovering a relationship between beer and diapers. Applying that knowledge in practice and placing the beer and diaper shelves close together is gaining wisdom.
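The heart of such an analysis is counting which items appear together across shopping baskets. A toy sketch with made-up baskets:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets: one set of items per checkout.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1  # count every co-purchased pair

# Pairs bought together most often hint at a relationship.
print(pair_counts.most_common(1))  # [(('beer', 'diapers'), 3)]
```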
The fifth step is data retrieval and mining. Retrieval is search; as the saying goes, for matters abroad that you cannot settle, ask Google; for matters at home, ask Baidu. Both external and internal search engines load the analyzed data into the engine, so when people want to find information, a search brings it up.
The other part is mining. Merely being able to search things out no longer satisfies people; relationships must also be dug out of the information. In a financial search, for instance, when someone searches a company's stock, shouldn't the company's executives be surfaced too? Suppose you only search the stock, see it has been rising nicely, and buy it, not knowing that an executive has just issued a statement very damaging to the stock, which falls the next day; wouldn't that hurt the mass of investors? So mining out the relationships in data through various algorithms and forming a knowledge base is very important.
3 In the era of big data, everyone gathers firewood and the flames rise high
When the amount of data is small, a few machines can handle it. Slowly, as the data grows so large that even the best server cannot cope, what then? The power of many machines must be pooled, everyone working together to get the job done.
For data collection: on the IoT side, thousands of sensing devices deployed in the field collect large volumes of temperature, humidity, surveillance, power, and other data; on the search-engine side, all the web pages of the entire Internet must be downloaded. Obviously one machine cannot do this; it takes many machines forming a web crawler system, each downloading a part, all working at once, to fetch the enormous number of pages within a limited time.
For data transmission: an in-memory queue would certainly be overwhelmed by such volumes, so disk-backed distributed queues were created, letting many machines transmit at the same time; however large your data, as long as my queue is big enough and the pipe thick enough, it can be held.
For data storage: one machine's file system is certainly not enough, so a large distributed file system is needed, combining the hard disks of many machines into one big file system.
For data analysis: a large amount of data may need to be decomposed, counted, and aggregated, and one machine would never finish, grinding away until who knows when. So distributed computing was invented: a large dataset is split into small portions, each machine processes one portion, many machines work in parallel, and the computation finishes quickly. The famous Terasort benchmark, for example, sorts 1 TB, i.e. 1000G, of data: on a single machine it takes hours, yet processed in parallel it finished in 209 seconds.
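The canonical pattern for this is MapReduce: split the data, map over the pieces in parallel, then reduce the partial results. A toy sketch that mimics it on one machine with a process pool (a word count standing in for the decompose-count-aggregate step):

```python
from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    """Map phase: each worker counts words in its own small portion."""
    return Counter(chunk.split())

if __name__ == "__main__":
    text = "beer diapers beer milk bread beer diapers " * 1000
    words = text.split()

    # Split the big dataset into 4 small portions.
    quarter = len(words) // 4
    chunks = [" ".join(words[i * quarter:(i + 1) * quarter]) for i in range(4)]

    with Pool(4) as pool:
        partials = pool.map(map_count, chunks)  # parallel map over the portions

    total = sum(partials, Counter())  # reduce: merge the partial counts
    print(total.most_common(2))
```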
So what is big data? To put it bluntly: one machine cannot finish the job, so many machines do it together. Yet as data keeps growing, many small companies also need to process large amounts of it; what can they do without that many machines?
4 Big data needs cloud computing, and cloud computing needs big data
By this point everyone has thought of cloud computing. When these jobs need doing, many machines must work together, which truly calls for "whenever you want it, as much as you want".
For example, a big data analysis of a company's financial situation may run only once a week. Keeping a hundred or a thousand machines sitting idle for a weekly job would be very wasteful. Could we take out those thousand machines when the computation is needed, and let them do other things the rest of the time?
Who can do this? Only cloud computing can give big data jobs resource-level elasticity. And cloud computing, in turn, deploys big data on its PaaS platform as a very, very important general-purpose application, because a platform that makes many machines do one thing together is not something ordinary people can develop, or even operate easily; how could most companies hire the dozens or hundreds of people needed to run it?
So, just like databases, this really takes a team of professionals. Nowadays public clouds basically all offer big data solutions: a small company that needs a big data platform does not have to buy a thousand machines; one visit to the public cloud, and the thousand machines all appear with the big data platform already deployed on top; just pour the data in.
Cloud computing needs big data, and big data needs cloud computing. The two are combined in this way.
4. Artificial intelligence embraces big data
1 When will the machine understand the human heart
Even with big data, people's desires are not satisfied. A big data platform includes search engines, and you can search up what you want; but there are also times when what I want cannot be searched for, or cannot even be expressed, and what is found is not what I want.
For example, a music app recommends a song I have never heard. Naturally I do not know its name and cannot search for it; but the app recommends it, and I really like it. That is something search cannot do. When people use this kind of application, they feel the machine knows what I want, rather than me having to search inside the machine when I want something. This machine really understands me like a friend does; that is a bit of artificial intelligence.
People thought about this for a long time. In the earliest days, people imagined a wall with a machine behind it: I speak to it, and it answers me. If I cannot tell whether the thing behind the wall is a human or a machine, then it is a genuinely artificially intelligent thing.
2 Let the machine learn to reason
How can this be done? People reasoned: I must first give the computer humanity's ability to reason. What is the essence of a human? What distinguishes humans from animals? The ability to reason. If I can hand my reasoning ability to the machine, the machine can infer the answer from your question.
In fact, people did gradually get machines to make inferences, such as proving mathematical theorems. That a machine could prove mathematical theorems was astonishing at first, but slowly it turned out to be less surprising, because everyone noticed a problem: mathematics is rigorous, the reasoning process is rigorous, mathematical formulas are easy to express to a machine, and the proving procedure is relatively easy to program.
Human language, however, is not so simple. Say you are meeting your girlfriend tonight, and she tells you: if you arrive and I am not there yet, wait for me; if I arrive and you are not there yet, just you wait! A machine has trouble understanding this, yet everyone else gets it; so if you date your girlfriend, you dare not be late.
3 Teach the machine knowledge
So it is not enough to give the machine strict reasoning; the machine must also be told knowledge. But telling a machine knowledge may be beyond ordinary people; perhaps experts can do it, experts in the language field, say, or in the financial field.
Can knowledge of language and finance be made as rigorous as mathematical formulas? Language experts, for example, might summarize grammar rules: subject, predicate, object, attributive, adverbial, complement; a subject must be followed by a predicate, a predicate by an object. Can rules like these be summarized and expressed strictly?
It turned out they cannot; they are too hard to enumerate, and the ways language is used vary endlessly. Take subject-predicate-object: in speech, the predicate is often omitted. Someone asks: who are you? I answer: Liu Chao. But you cannot demand that, for speech and semantic recognition, people address machines in standard written language; that would not be smart at all. As Luo Yonghao said in a speech, having to talk to your phone each time in stilted written language, "Please help me dial so-and-so," is a very embarrassing thing.
This stage of artificial intelligence was called the expert system. Expert systems do not succeed easily: on the one hand, knowledge is hard to summarize; on the other, the summarized knowledge is hard to teach to a computer. When you yourself are hazy, sensing there is a pattern but unable to state it, how could you teach it to a computer by programming?
4 Forget it; if you cannot be taught, learn by yourself
So people thought: machines are a species completely different from humans; let machines learn by themselves.
How does a machine learn? Machines have formidable statistical power; through statistical learning, they can find certain patterns in large quantities of numbers.
There is actually a good illustration from the entertainment world:
A netizen tallied the lyrics of 117 songs across 9 albums released by a well-known mainland singer, counting each word at most once per song, and tabulated the top ten adjectives, nouns, and verbs with their frequencies. (The original article includes the table here; the number after each word is its count.)
What if we write down a random string of digits, and for each digit take out a word from the adjectives, nouns, and verbs in turn?
For example, take pi, 3.1415926; the corresponding words are: strong, road, fly, freedom, rain, bury, lost. Strung together and lightly polished:
Strong child,
Still on the road,
Spread your wings and fly to freedom,
Let the rain bury his confusion.
Feels like something, doesn't it? Of course, real statistics-based learning algorithms are far more complex than this naive counting.
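A sketch of the digit-to-word trick itself, with tiny hypothetical word lists standing in for the netizen's real table:

```python
# Hypothetical top-word lists; the real ones come from the netizen's count.
adjectives = ["lost", "strong", "free", "lonely", "gentle",
              "broken", "silent", "distant", "burning", "blue"]
nouns      = ["heart", "road", "wind", "rain", "dream",
              "night", "wings", "city", "tears", "fire"]
verbs      = ["wait", "fly", "cry", "run", "bury",
              "love", "forget", "search", "fall", "sing"]

digits = "31415926"        # digits of pi as the "random" source
pools = [adjectives, nouns, verbs]

# Cycle through adjective -> noun -> verb, indexing each pool by the digit.
words = [pools[i % 3][int(d)] for i, d in enumerate(digits)]
print(" / ".join(words))   # raw material, still to be connected and polished
```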
Statistical learning, however, handles only simple correlations well, such as: one word and another always appear together, so the two must be related. It cannot express complex correlations. Moreover, statistical formulas are often very complicated, and to simplify computation, various independence assumptions are made to reduce the difficulty, whereas in real life, genuinely independent events are relatively rare.
5 Simulate how the brain works
So humans turned from the machine world to reflect on how the human world actually works.
The human brain neither stores masses of rules nor records masses of statistics; it works through the firing of neurons. Each neuron takes inputs from other neurons and, on receiving input, produces an output that stimulates other neurons in turn. Huge numbers of neurons interact this way, and the final outcome is all kinds of outputs.
For example, when people see a beautiful woman and their pupils dilate, it is hardly the brain making a rule-based judgment about body proportions, nor a tally of all the beauties ever seen in one's life; rather, neurons fire from the retina to the brain and back to the pupils. In this process it is hard to say what effect each individual neuron has on the final result; it simply works.
So people began to simulate neurons with a mathematical unit.
Such a neuron has inputs and an output, linked by a formula, and the inputs differ in importance (weights), which shape the output.
Then connect n of these neurons together, like a neural network. The number n can be very large; the neurons can be arranged in many layers, with many neurons in each layer. Each neuron can weight its inputs differently, so each neuron's formula is different too. When people feed something into the network, they hope it outputs a result that is correct by human standards.
For example, suppose we input a picture with a handwritten 2 and want the second number in the output list to be the largest. The machine itself knows neither that the input picture shows a 2 nor what the output numbers mean; that does not matter, since only the humans need the meaning. Just as a neuron neither knows the retina is looking at a beauty nor that the pupils dilate to see her more clearly; anyway, upon seeing a beauty, the pupils dilate.
For an arbitrary neural network, nobody can guarantee that inputting a 2 makes the second output the largest. Guaranteeing that takes training and learning; after all, pupils dilating at beauties is itself the result of many years of human evolution. The learning process is to feed in many, many pictures and, whenever the result is not the desired one, make an adjustment.
How to adjust? Every weight of every neuron is nudged slightly toward the target. Because there are so many neurons and weights, it is hard for the network as a whole to jump to an all-or-nothing answer; it can only inch toward the result, and eventually the target can be reached.
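A minimal sketch of this nudge-the-weights loop: a single sigmoid neuron trained by gradient descent to learn the OR function. Everything here is a toy, using only numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: the OR function (inputs and desired outputs).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

w = rng.normal(size=2)  # the neuron's input weights
b = 0.0                 # its bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    out = sigmoid(X @ w + b)      # forward pass: the neuron's formula
    err = out - y                 # how far from the desired result
    grad = err * out * (1 - out)  # gradient through the sigmoid
    w -= 0.5 * (X.T @ grad)       # nudge every weight a little...
    b -= 0.5 * grad.sum()         # ...toward the target

print(np.round(sigmoid(X @ w + b), 2))  # close to [0, 1, 1, 1]
```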
Of course, these adjustment strategies are highly skilled work and need careful design by algorithm experts. Just as with humans: if at first the pupils do not dilate enough to see the beauty clearly and she runs off with someone else, the lesson learned next time should be to dilate the pupils a bit more, not the nostrils.
6 Unreasonable, but it works
It does not sound very reasonable, yet it works; that is how capricious it is!
The universality theorem for neural networks says roughly this: suppose someone hands you some complicated, peculiar function f(x). Whatever the function is, there is guaranteed to exist a neural network that, for any possible input x, outputs f(x) (or some sufficiently accurate approximation of it).
If we take the function to stand for a law, this means that a law, however wondrous and however incomprehensible, can be expressed by a large number of neurons through a large number of weight adjustments.
7 Economic explanation of artificial intelligence
This reminds me of economics, which makes it easier to understand.
Think of each neuron as an individual engaged in economic activity in society; the neural network is then the whole economy. Each neuron adjusts how it responds to its inputs from society and produces its outputs: wages rise, vegetable prices rise, stocks fall; how should I spend my money? Is there no pattern in this? There surely is, but can anyone state it precisely? Hardly.
An economy built on expert systems is a planned economy. There, the expression of economic laws is not expected to emerge from each economic individual's independent decisions, but to be summarized through the expertise and foresight of experts. Yet experts can never know which street in which city lacks a seller of sweet tofu pudding.
So when experts decree how much steel and how many steamed buns to produce, a large gap often opens up against the real needs of people's lives; even a plan running several hundred pages cannot express the small laws hidden in everyday life.
Macro-regulation based on statistics is much more reliable. Every year the statistics bureau collects the whole society's employment rate, inflation rate, GDP, and other indicators. These indicators embody many underlying laws; they cannot express them precisely, but they are comparatively dependable.
Still, what statistics can summarize is relatively coarse. From such figures, economists can conclude whether house prices and stocks will rise or fall in the long run (if the economy broadly rises, both should rise), but the laws behind small fluctuations in stock and commodity prices cannot be summarized from statistical data.
A micro-economy in the style of a neural network is the most accurate expression of economic laws: everyone independently adjusts to society's inputs, and the adjustments feed back into society as inputs in turn. Picture the fine fluctuation curve of the stock market: it is the result of countless independent individuals trading continuously, with no unified law to follow.
And since everyone decides independently on the basis of society's overall inputs, when certain factors are trained many times over, macro-level statistical laws also take shape, which is exactly what macroeconomics can observe. For example, every time a large amount of currency is issued, house prices eventually rise; after many rounds of such training, people learn the pattern.
8 Artificial intelligence needs big data
A neural network, however, contains so many nodes, and each node so many parameters, that the total parameter count is enormous and the required computation far too large. But no matter: we have the big data platform, which can pool the computing power of many machines to produce the desired result within a limited time.
Artificial intelligence can do many things, such as identifying spam and identifying pornographic or violent text and images. This, too, went through three stages:
- The first stage relied on keyword blacklists and whitelists plus filtering techniques: deciding which words count as pornographic or violent. As language online multiplies and the words keep mutating, endlessly updating such a lexicon becomes overwhelming.
- The second stage used newer algorithms such as Bayesian filtering; never mind the details of the Bayesian algorithm, but you may have heard the name; it is a probability-based algorithm (a small sketch of it follows this list).
- The third stage is based on big data and artificial intelligence, performing far more precise user profiling, text understanding, and image understanding.
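To give a flavor of the second stage, here is a tiny naive Bayes spam filter, trained on a handful of made-up messages; real filters use large corpora, richer features, and tuned smoothing:

```python
import math
from collections import Counter

# Hypothetical training data: (message, is_spam).
train = [
    ("win money now", True),
    ("cheap pills win big", True),
    ("meeting at noon", False),
    ("lunch tomorrow with team", False),
]

spam_words, ham_words = Counter(), Counter()
n_spam = n_ham = 0
for text, is_spam in train:
    (spam_words if is_spam else ham_words).update(text.split())
    n_spam, n_ham = n_spam + is_spam, n_ham + (not is_spam)

def spam_score(text):
    """Log-probability ratio with add-one (Laplace) smoothing."""
    score = math.log(n_spam / n_ham)  # prior odds of spam
    s_total, h_total = sum(spam_words.values()), sum(ham_words.values())
    vocab = len(set(spam_words) | set(ham_words))
    for w in text.split():
        score += math.log((spam_words[w] + 1) / (s_total + vocab))
        score -= math.log((ham_words[w] + 1) / (h_total + vocab))
    return score  # > 0 means "more likely spam"

print(spam_score("win money"))   # positive: flagged as spam
print(spam_score("team lunch"))  # negative: looks legitimate
```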
Because artificial intelligence algorithms mostly depend on large amounts of data, and such data usually has to accumulate over a long time in a specific field (say e-commerce or email), having no data makes even the best AI algorithm useless. So AI programs are rarely delivered the way IaaS and PaaS above can be, installing a set for some customer to use on its own: a single set installed for a single customer has no relevant data to train on, and the results are usually poor.
Cloud computing vendors, however, have often accumulated large amounts of data, so they install a set for themselves and expose a service interface. For example, if you want to check whether a text involves pornography or violence, you simply call this online service. In cloud computing this kind of service is called Software as a Service, SaaS (Software As A Service).
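From the caller's side, such a SaaS is just one HTTPS request. A hedged sketch: the endpoint, field names, API key, and response format are all placeholders, since every vendor defines its own API:

```python
import requests

# Hypothetical moderation endpoint of some cloud vendor's SaaS.
ENDPOINT = "https://api.example-cloud.com/v1/text/moderation"
API_KEY = "YOUR_API_KEY"  # issued when you register an account

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "some user-generated text to check"},
    timeout=5,
)
# Response shape varies by vendor, e.g. {"porn": 0.01, "violence": 0.00}.
print(resp.json())
```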
Thus artificial intelligence programs, as SaaS platforms, entered cloud computing.
5. A good life based on the relationship between the three
At last the three brothers of cloud computing are assembled: IaaS, PaaS, and SaaS. On a cloud computing platform you can generally find all three of cloud, big data, and artificial intelligence. A big data company that has accumulated masses of data will use AI algorithms to provide services, and an AI company cannot do without the support of a big data platform.
So when cloud computing, big data, and artificial intelligence merge this way, the story of their meeting, acquaintance, and deep familiarity is complete.